Yuanxing Zhang

Authored papers

35

Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

arXiv 2026

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

arXiv 2026

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

arXiv 2026

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

arXiv 2026

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

arXiv 2026

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

arXiv 2026

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

arXiv 2026

CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

arXiv 2026

Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

arXiv 2026

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

arXiv 2026

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

arXiv 2025

A Comprehensive Survey on Long Context Language Modeling

arXiv 2025

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

arXiv 2025

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

arXiv 2025

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

arXiv 2025

Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

arXiv 2025

Monet: Reasoning in Latent Visual Space Beyond Images and Language

arXiv 2025

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

arXiv 2025

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

arXiv 2025

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

arXiv 2025

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

arXiv 2025

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

arXiv 2025

OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

arXiv 2025

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

arXiv 2025

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

arXiv 2025

ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

arXiv 2025

IF-VidCap: Can Video Caption Models Follow Instructions?

arXiv 2025

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

arXiv 2025

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

arXiv 2025

VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

arXiv 2025

ViDiC: Video Difference Captioning

arXiv 2025

VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

arXiv 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

arXiv 2025

TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment

arXiv 2025

MIO: A Foundation Model on Multimodal Tokens

arXiv 2024

Affiliations

No known affiliations.

Frequent co-authors

10

from 35 papers