Yuhang Zang

Papers
46

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

46

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

arXiv 2026

2026

ETCHR: Editing To Clarify and Harness Reasoning

arXiv 2026

2026

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

arXiv 2026

2026

Unified Personalized Reward Model for Vision Generation

arXiv 2026

2026

Visual-ERM: Reward Modeling for Visual Equivalence

arXiv 2026

2026

Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

arXiv 2026

2026

Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

arXiv 2026

2026

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

arXiv 2025

2025

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

arXiv 2025

2025

Visual Agentic Reinforcement Fine-Tuning

arXiv 2025

2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

CVPR 2025 1

2025

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

arXiv 2025

2025

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

arXiv 2025

2025

Light-A-Video: Training-free Video Relighting via Progressive Light Fusion

arXiv 2025

2025

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

CVPR 2025 1

2025

ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

arXiv 2025

2025

TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

arXiv 2025

2025

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

arXiv 2025

2025

Think Visually, Reason Textually: Vision-Language Synergy in ARC

arXiv 2025

2025

G^2RPO: Granular GRPO for Precise Reward in Flow Models

arXiv 2025

2025

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

arXiv 2025

2025

SPARK: Synergistic Policy And Reward Co-Evolving Framework

arXiv 2025

2025

STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

arXiv 2025

2025

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

arXiv 2025

2025

MM-IFEngine: Towards Multimodal Instruction Following

arXiv 2025

2025

BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

arXiv 2025

2025

Beyond Fixed: Training-Free Variable-Length Denoising for Diffusion Large Language Models

arXiv 2025

2025

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

arXiv 2025

2025

SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

arXiv 2025

2025

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

arXiv 2025

2025

VideoRoPE: What Makes for Good Video Rotary Position Embedding?

arXiv 2025

2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

arXiv 2025

2025

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

arXiv 2025

2025

Long-CLIP: Unlocking the Long-Text Capability of CLIP

arXiv 2024

2024

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

ICCV 2025

2024

Are We on the Right Way for Evaluating Large Vision-Language Models?

arXiv 2024

2024

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

arXiv 2024

2024

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

arXiv 2024

2024

WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation

arXiv 2024

2024

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

arXiv 2024

2024

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

arXiv 2024

2024

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

arXiv 2024

2024

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

arXiv 2024

2024

Contextual Object Detection with Multimodal Large Language Models

arXiv 2023

2023

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

CVPR 2024 1

2023

Unified Vision and Language Prompt Learning

arXiv 2022

2022

Affiliations

No known affiliations.

Frequent co-authors

10

from 46 papers