Xing Sun
- Papers
- 35
Authored papers (35)
Toward Native Multimodal Modeling: A Roadmap
arXiv 2026
RISE-Video: Can Video Generators Decode Implicit World Rules?
arXiv 2026
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
arXiv 2026
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
arXiv 2026
Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding
arXiv 2026
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
arXiv 2026
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
arXiv 2025
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
arXiv 2025
Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
arXiv 2025
Streaming Video Instruction Tuning
arXiv 2025
TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference
arXiv 2025
Youtu-Agent: Scaling Agent Productivity with Automated Generation and Hybrid Policy Optimization
arXiv 2025
SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
arXiv 2025
Training-Free Group Relative Policy Optimization
arXiv 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
arXiv 2025
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
arXiv 2025
RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
arXiv 2025
Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
arXiv 2025
SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
arXiv 2025
RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following
arXiv 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
arXiv 2024
Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence
arXiv 2024
Sinkhorn Distance Minimization for Knowledge Distillation
arXiv 2024
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
arXiv 2024
Leveraging Open Knowledge for Advancing Task Expertise in Large Language Models
arXiv 2024
Multimodal Label Relevance Ranking via Reinforcement Learning
arXiv 2024
FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema
arXiv 2024
Aligning and Prompting Everything All at Once for Universal Visual Perception
arXiv 2023
Co-Salient Object Detection with Co-Representation Purification
arXiv 2023
D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
ICCV 2023
MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples
arXiv 2023
Woodpecker: Hallucination Correction for Multimodal Large Language Models
arXiv 2023
Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval
ICCV 2023
MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation
arXiv 2023
Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion
arXiv 2020
Frequent co-authors
10 from 35 papers