Yixiao Ge
- Papers: 41
Authored papers
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
arXiv 2026
AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
arXiv 2025
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
arXiv 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
arXiv 2025
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
arXiv 2025
Aligning Latent Spaces with Flow Priors
arXiv 2025
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
arXiv 2025
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
arXiv 2025
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
arXiv 2025
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
arXiv 2025
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
ICCV 2025
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
ICCV 2025
YOLO-World: Real-Time Open-Vocabulary Object Detection
CVPR 2024
GrootVL: Tree Topology is All You Need in State Space Model
arXiv 2024
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing
arXiv 2024
LLaMA Pro: Progressive LLaMA with Block Expansion
arXiv 2024
Taming Scalable Visual Tokenizer for Autoregressive Image Generation
ICCV 2025
SEED-Story: Multimodal Long Story Generation with Large Language Model
arXiv 2024
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
arXiv 2024
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
arXiv 2024
VoCo-LLaMA: Towards Vision Compression with Large Language Models
CVPR 2025
ST-LLM: Large Language Models Are Effective Temporal Learners
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
ICCV 2025
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
arXiv 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
CVPR 2024
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
CVPR 2025
Supervised Fine-tuning in turn Improves Visual Foundation Models
arXiv 2024
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
arXiv 2023
DreamDiffusion: Generating High-Quality Images from Brain EEG Signals
arXiv 2023
Making LLaMA SEE and Draw with SEED Tokenizer
arXiv 2023
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
NeurIPS 2023
Exploring Model Transferability through the Lens of Potential Energy
ICCV 2023
ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
arXiv 2023
Vision-Language Instruction Tuning: A Review and Analysis
arXiv 2023
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
arXiv 2023
BoxSnake: Polygonal Instance Segmentation with Box Supervision
ICCV 2023
Binary Embedding-based Retrieval at Tencent
arXiv 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
ICCV 2023
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
ICCV 2023
All in One: Exploring Unified Video-Language Pre-training
CVPR 2023
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
CVPR 2023