Yuanxing Zhang
- Papers: 35
Authored papers
Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
arXiv 2026
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
arXiv 2026
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
arXiv 2026
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
arXiv 2026
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
arXiv 2026
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
arXiv 2026
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
arXiv 2026
CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
arXiv 2026
Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers
arXiv 2026
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
arXiv 2026
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
arXiv 2025
A Comprehensive Survey on Long Context Language Modeling
arXiv 2025
RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
arXiv 2025
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
arXiv 2025
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
arXiv 2025
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
arXiv 2025
Monet: Reasoning in Latent Visual Space Beyond Images and Language
arXiv 2025
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
arXiv 2025
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
arXiv 2025
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
arXiv 2025
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs
arXiv 2025
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
arXiv 2025
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
arXiv 2025
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
arXiv 2025
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
arXiv 2025
ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding
arXiv 2025
IF-VidCap: Can Video Caption Models Follow Instructions?
arXiv 2025
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
arXiv 2025
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
arXiv 2025
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
arXiv 2025
ViDiC: Video Difference Captioning
arXiv 2025
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation
arXiv 2025
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
arXiv 2025
TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment
arXiv 2025
MIO: A Foundation Model on Multimodal Tokens
arXiv 2024
Frequent co-authors
10 from 35 papers