Xing Sun

RISE-Video: Can Video Generators Decode Implicit World Rules?

arXiv 2026

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

arXiv 2026

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision

arXiv 2026

Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding

arXiv 2026

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

arXiv 2026

VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

arXiv 2025

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

arXiv 2025

Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

arXiv 2025

Streaming Video Instruction Tuning

arXiv 2025

TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference

arXiv 2025

Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization

arXiv 2025

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

arXiv 2025

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

arXiv 2025

Training-Free Group Relative Policy Optimization

arXiv 2025

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

arXiv 2025

RocketEval: Efficient Automated LLM Evaluation via Grading Checklist

arXiv 2025

SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

arXiv 2025

RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following

arXiv 2025

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

arXiv 2025

Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

arXiv 2024

Sinkhorn Distance Minimization for Knowledge Distillation

arXiv 2024

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

arXiv 2024

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence

arXiv 2024

Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models

arXiv 2024

Multimodal Label Relevance Ranking via Reinforcement Learning

arXiv 2024

FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema

arXiv 2024

Aligning and Prompting Everything All at Once for Universal Visual Perception

arXiv 2023

Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval

ICCV 2023

MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation

arXiv 2023

Co-Salient Object Detection with Co-Representation Purification

arXiv 2023

D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

ICCV 2023

MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

arXiv 2023

Woodpecker: Hallucination Correction for Multimodal Large Language Models

arXiv 2023