Dahua Lin
- Papers: 122
Authored papers
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
arXiv 2026
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
arXiv 2026
ETCHR: Editing To Clarify and Harness Reasoning
arXiv 2026
InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery
arXiv 2026
UltraDexGrasp: Learning Universal Dexterous Grasping for Bimanual Robots with Synthetic Data
arXiv 2026
AIDABench: AI Data Analytics Benchmark
arXiv 2026
A Very Big Video Reasoning Suite
arXiv 2026
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
arXiv 2026
Demystifying Video Reasoning
arXiv 2026
Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs
arXiv 2026
Visual-ERM: Reward Modeling for Visual Equivalence
arXiv 2026
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
arXiv 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
Visual Agentic Reinforcement Fine-Tuning
arXiv 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
arXiv 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
arXiv 2025
Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
arXiv 2025
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models
arXiv 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
CVPR 2025
RelightVid: Temporal-Consistent Diffusion Model for Video Relighting
arXiv 2025
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models
arXiv 2025
ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
arXiv 2025
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
arXiv 2025
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
arXiv 2025
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
arXiv 2025
ConsistCompose: Unified Multimodal Layout Control for Image Composition
arXiv 2025
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
arXiv 2025
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
arXiv 2025
Scaling Spatial Intelligence with Multimodal Foundation Models
arXiv 2025
Think Visually, Reason Textually: Vision-Language Synergy in ARC
arXiv 2025
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
arXiv 2025
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
arXiv 2025
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
arXiv 2025
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
arXiv 2025
SPARK: Synergistic Policy And Reward Co-Evolving Framework
arXiv 2025
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
arXiv 2025
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
arXiv 2025
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
ICCV 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
arXiv 2025
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
arXiv 2025
PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model
arXiv 2025
MM-IFEngine: Towards Multimodal Instruction Following
arXiv 2025
LEGION: Learning to Ground and Explain for Synthetic Image Detection
ICCV 2025
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
arXiv 2025
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
arXiv 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
arXiv 2025
GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition
arXiv 2025
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
arXiv 2025
SS4D: Native 4D Generative Model via Structured Spacetime Latents
arXiv 2025
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
arXiv 2025
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
arXiv 2025
GRUtopia: Dream General Robots in a City at Scale
arXiv 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
arXiv 2024
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
ICCV 2025
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
arXiv 2024
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
arXiv 2024
Imagine360: Immersive 360 Video Generation from Perspective Anchor
arXiv 2024
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks
arXiv 2024
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations
arXiv 2024
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
arXiv 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
arXiv 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
arXiv 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
arXiv 2024
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
arXiv 2024
IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
arXiv 2024
3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors
arXiv 2024
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
CVPR 2024
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
arXiv 2024
InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems
arXiv 2024
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
CVPR 2025
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
arXiv 2024
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition
arXiv 2024
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
arXiv 2024
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
arXiv 2024
Grounded 3D-LLM with Referent Tokens
arXiv 2024
Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study
arXiv 2024
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
arXiv 2024
Case2Code: Learning Inductive Reasoning with Synthetic Data
arXiv 2024
SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting
arXiv 2024
InternLM-Law: An Open Source Chinese Legal Large Language Model
arXiv 2024
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
arXiv 2024
Balanced Data Sampling for Language Model Training with Clustering
arXiv 2024
F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods
arXiv 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
arXiv 2024
Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation
arXiv 2024
CriticEval: Evaluating Large Language Model as Critic
arXiv 2024
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
arXiv 2024
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
arXiv 2024
LongWanjuan: Towards Systematic Measurement for Long Text Quality
arXiv 2024
AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data
arXiv 2024
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
arXiv 2024
OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection
arXiv 2024
Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
arXiv 2024
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
arXiv 2023
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
CVPR 2024
Improving Pixel-based MIM by Reducing Wasted Modeling Capability
ICCV 2023
Scene as Occupancy
ICCV 2023
Unified Human-Scene Interaction via Prompted Chain-of-Contacts
arXiv 2023
PointLLM: Empowering Large Language Models to Understand Point Clouds
arXiv 2023
BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
arXiv 2023
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
arXiv 2023
DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering
ICCV 2023
HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
arXiv 2023
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
CVPR 2024
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
arXiv 2023
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
arXiv 2023
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
CVPR 2024
T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
arXiv 2023
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
ICCV 2023
InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint
arXiv 2023
Scaling Laws of RoPE-based Extrapolation
arXiv 2023
Flames: Benchmarking Value Alignment of LLMs in Chinese
arXiv 2023
MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training
CVPR 2023
Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation
arXiv 2023
OneLLM: One Framework to Align All Modalities with Language
CVPR 2024
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
arXiv 2023
Novel Policy Seeking with Constrained Optimization
Self-Supervised Learning via Conditional Motion Propagation
Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination
arXiv 2018