Jaehong Yoon

Self-Refining Video Sampling

arXiv 2026

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

arXiv 2026

AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

arXiv 2026

Are Video Reasoning Models Ready to Go Outside?

arXiv 2026

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

arXiv 2026

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

arXiv 2025

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

arXiv 2025

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

arXiv 2025

Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

arXiv 2025

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

arXiv 2025

RSQ: Learning from Important Tokens Leads to Better Quantized LLMs

arXiv 2025

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning

arXiv 2025

Planning with Sketch-Guided Verification for Physics-Aware Video Generation

arXiv 2025

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation

arXiv 2025

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

arXiv 2024

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

arXiv 2024

RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives

arXiv 2024

Adapt-∞: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection

arXiv 2024

Glider: Global and Local Instruction-Driven Expert Router

arXiv 2024

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

arXiv 2024