Dahua Lin

Papers
122

Authored papers

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

arXiv 2026

2026

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

arXiv 2026

2026

ETCHR: Editing To Clarify and Harness Reasoning

arXiv 2026

2026

InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

arXiv 2026

2026

UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data

arXiv 2026

2026

AIDABench: AI Data Analytics Benchmark

arXiv 2026

2026

A Very Big Video Reasoning Suite

arXiv 2026

2026

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

arXiv 2026

2026

Demystifying Video Reasoning

arXiv 2026

2026

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

arXiv 2026

2026

Visual-ERM: Reward Modeling for Visual Equivalence

arXiv 2026

2026

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

arXiv 2025

2025

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

arXiv 2025

2025

Visual Agentic Reinforcement Fine-Tuning

arXiv 2025

2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

arXiv 2025

2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

arXiv 2025

2025

Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

arXiv 2025

2025

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

arXiv 2025

2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

CVPR 2025

2025

RelightVid: Temporal-Consistent Diffusion Model for Video Relighting

arXiv 2025

2025

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

arXiv 2025

2025

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

arXiv 2025

2025

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models

arXiv 2025

2025

ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

arXiv 2025

2025

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

arXiv 2025

2025

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

arXiv 2025

2025

OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

arXiv 2025

2025

ConsistCompose: Unified Multimodal Layout Control for Image Composition

arXiv 2025

2025

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

arXiv 2025

2025

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

arXiv 2025

2025

Scaling Spatial Intelligence with Multimodal Foundation Models

arXiv 2025

2025

Think Visually, Reason Textually: Vision-Language Synergy in ARC

arXiv 2025

2025

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

arXiv 2025

2025

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

arXiv 2025

2025

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

arXiv 2025

2025

ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

arXiv 2025

2025

SPARK: Synergistic Policy And Reward Co-Evolving Framework

arXiv 2025

2025

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

arXiv 2025

2025

Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

arXiv 2025

2025

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

ICCV 2025

2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

arXiv 2025

2025

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM

arXiv 2025

2025

PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model

arXiv 2025

2025

MM-IFEngine: Towards Multimodal Instruction Following

arXiv 2025

2025

LEGION: Learning to Ground and Explain for Synthetic Image Detection

ICCV 2025

2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

arXiv 2025

2025

WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

arXiv 2025

2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

arXiv 2025

2025

GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition

arXiv 2025

2025

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

arXiv 2025

2025

SS4D: Native 4D Generative Model via Structured Spacetime Latents

arXiv 2025

2025

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

arXiv 2025

2025

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

arXiv 2025

2025

GRUtopia: Dream General Robots in a City at Scale

arXiv 2024

2024

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

arXiv 2024

2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

ICCV 2025

2024

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

arXiv 2024

2024

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

arXiv 2024

2024

Imagine360: Immersive 360 Video Generation from Perspective Anchor

arXiv 2024

2024

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

arXiv 2024

2024

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

arXiv 2024

2024

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

arXiv 2024

2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

arXiv 2024

2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

arXiv 2024

2024

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

arXiv 2024

2024

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

arXiv 2024

2024

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

arXiv 2024

2024

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

arXiv 2024

2024

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

CVPR 2024

2024

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models

arXiv 2024

2024

InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems

arXiv 2024

2024

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

CVPR 2025

2024

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

arXiv 2024

2024

HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

arXiv 2024

2024

SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition

arXiv 2024

2024

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

arXiv 2024

2024

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

arXiv 2024

2024

Grounded 3D-LLM with Referent Tokens

arXiv 2024

2024

Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study

arXiv 2024

2024

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

arXiv 2024

2024

Case2Code: Learning Inductive Reasoning with Synthetic Data

arXiv 2024

2024

SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting

arXiv 2024

2024

InternLM-Law: An Open Source Chinese Legal Large Language Model

arXiv 2024

2024

CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

arXiv 2024

2024

Balanced Data Sampling for Language Model Training with Clustering

arXiv 2024

2024

F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

arXiv 2024

2024

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

arXiv 2024

2024

Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation

arXiv 2024

2024

CriticEval: Evaluating Large Language Model as Critic

arXiv 2024

2024

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

arXiv 2024

2024

What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

arXiv 2024

2024

LongWanjuan: Towards Systematic Measurement for Long Text Quality

arXiv 2024

2024

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

arXiv 2024

2024

ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs

arXiv 2024

2024

OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection

arXiv 2024

2024

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

arXiv 2024

2024

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

arXiv 2023

2023

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

CVPR 2024

2023

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

ICCV 2023

2023

Scene as Occupancy

ICCV 2023

2023

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

arXiv 2023

2023

PointLLM: Empowering Large Language Models to Understand Point Clouds

arXiv 2023

2023

BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues

arXiv 2023

2023

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

arXiv 2023

2023

DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering

ICCV 2023

2023

HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion

arXiv 2023

2023

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

CVPR 2024

2023

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

arXiv 2023

2023

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

arXiv 2023

2023

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

CVPR 2024

2023

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

arXiv 2023

2023

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

ICCV 2023

2023

InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint

arXiv 2023

2023

Scaling Laws of RoPE-based Extrapolation

arXiv 2023

2023

Flames: Benchmarking Value Alignment of LLMs in Chinese

arXiv 2023

2023

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

CVPR 2023

2023

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

arXiv 2023

2023

OneLLM: One Framework to Align All Modalities with Language

CVPR 2024

2023

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

arXiv 2023

2023

Novel Policy Seeking with Constrained Optimization

2020

Self-Supervised Learning via Conditional Motion Propagation

2019

Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination

arXiv 2018

2018

Affiliations

No known affiliations.
