Yixiao Ge

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

arXiv 2025

ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

arXiv 2025

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

arXiv 2025

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

arXiv 2025

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

ICCV 2025

AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

arXiv 2025

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

arXiv 2025

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

arXiv 2025

GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

arXiv 2025

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

ICCV 2025

YOLO-World: Real-Time Open-Vocabulary Object Detection

CVPR 2024

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

arXiv 2024

LLaMA Pro: Progressive LLaMA with Block Expansion

arXiv 2024

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

ICCV 2025

SEED-Story: Multimodal Long Story Generation with Large Language Model

arXiv 2024

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

arXiv 2024

GrootVL: Tree Topology is All You Need in State Space Model

arXiv 2024

VoCo-LLaMA: Towards Vision Compression with Large Language Models

CVPR 2025

ST-LLM: Large Language Models Are Effective Temporal Learners

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

ICCV 2025

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

arXiv 2024

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

CVPR 2024

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

CVPR 2025

Supervised Fine-tuning in turn Improves Visual Foundation Models

arXiv 2024

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

arXiv 2024

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

arXiv 2023

Making LLaMA SEE and Draw with SEED Tokenizer

arXiv 2023

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

NeurIPS 2023

Exploring Model Transferability through the Lens of Potential Energy

ICCV 2023

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

arXiv 2023

ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

arXiv 2023

Vision-Language Instruction Tuning: A Review and Analysis

arXiv 2023

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

arXiv 2023

BoxSnake: Polygonal Instance Segmentation with Box Supervision

ICCV 2023

Binary Embedding-based Retrieval at Tencent

arXiv 2023

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

ICCV 2023

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

ICCV 2023

All in One: Exploring Unified Video-Language Pre-training

CVPR 2023

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

CVPR 2023