Yuying Ge

Papers
26

Authored papers

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

arXiv 2026

2026

AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

arXiv 2025

2025

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

arXiv 2025

2025

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

arXiv 2025

2025

GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

arXiv 2025

2025

Aligning Latent Spaces with Flow Priors

arXiv 2025

2025

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

arXiv 2025

2025

From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

arXiv 2025

2025

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

arXiv 2025

2025

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

ICCV 2025

2025

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

ICCV 2025

2025

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

arXiv 2025

2025

ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

arXiv 2025

2025

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

arXiv 2025

2025

SEED-Story: Multimodal Long Story Generation with Large Language Model

arXiv 2024

2024

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

arXiv 2024

2024

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

ICCV 2025

2024

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

CVPR 2025

2024

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

arXiv 2024

2024

Supervised Fine-tuning in turn Improves Visual Foundation Models

arXiv 2024

2024

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

arXiv 2023

2023

Making LLaMA SEE and Draw with SEED Tokenizer

arXiv 2023

2023

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

arXiv 2023

2023

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

arXiv 2023

2023

All in One: Exploring Unified Video-Language Pre-training

CVPR 2023

2022

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

CVPR 2023

2022

Affiliations

No known affiliations.

Frequent co-authors

10

from 26 papers