Yuhang Zang
- Papers: 46
Authored papers (46)
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
arXiv 2026
ETCHR: Editing To Clarify and Harness Reasoning
arXiv 2026
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
arXiv 2026
Unified Personalized Reward Model for Vision Generation
arXiv 2026
Visual-ERM: Reward Modeling for Visual Equivalence
arXiv 2026
Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
arXiv 2026
Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation
arXiv 2026
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
arXiv 2025
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
arXiv 2025
Visual Agentic Reinforcement Fine-Tuning
arXiv 2025
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
CVPR 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
arXiv 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
arXiv 2025
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
arXiv 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
CVPR 2025
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
arXiv 2025
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
arXiv 2025
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
arXiv 2025
Think Visually, Reason Textually: Vision-Language Synergy in ARC
arXiv 2025
G^2RPO: Granular GRPO for Precise Reward in Flow Models
arXiv 2025
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
arXiv 2025
SPARK: Synergistic Policy And Reward Co-Evolving Framework
arXiv 2025
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
arXiv 2025
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
arXiv 2025
MM-IFEngine: Towards Multimodal Instruction Following
arXiv 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
arXiv 2025
Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models
arXiv 2025
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
arXiv 2025
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
arXiv 2025
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
arXiv 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
arXiv 2025
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
arXiv 2025
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
arXiv 2025
Long-CLIP: Unlocking the Long-Text Capability of CLIP
arXiv 2024
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
ICCV 2025
Are We on the Right Way for Evaluating Large Vision-Language Models?
arXiv 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
arXiv 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
arXiv 2024
WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation
arXiv 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
arXiv 2024
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
arXiv 2024
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
arXiv 2024
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
arXiv 2024
Contextual Object Detection with Multimodal Large Language Models
arXiv 2023
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
CVPR 2024
Unified Vision and Language Prompt Learning
arXiv 2022
Frequent co-authors
10 from 46 papers