Yuying Ge
- Papers: 26
Authored papers (26)
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
arXiv 2026
AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
arXiv 2025
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
arXiv 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
arXiv 2025
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
arXiv 2025
Aligning Latent Spaces with Flow Priors
arXiv 2025
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
arXiv 2025
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
arXiv 2025
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
arXiv 2025
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
ICCV 2025
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
ICCV 2025
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
arXiv 2025
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
arXiv 2025
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
arXiv 2025
SEED-Story: Multimodal Long Story Generation with Large Language Model
arXiv 2024
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
arXiv 2024
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
ICCV 2025
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
CVPR 2025
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing
arXiv 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
arXiv 2024
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
arXiv 2023
Making LLaMA SEE and Draw with SEED Tokenizer
arXiv 2023
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
arXiv 2023
GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields
arXiv 2023
All in One: Exploring Unified Video-Language Pre-training
CVPR 2023
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
CVPR 2023
Frequent co-authors
10 from 26 papers