Wenbo Hu

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

arXiv 2026

OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

arXiv 2026

MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

arXiv 2026

Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

arXiv 2026

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

arXiv 2026

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

ICCV 2025

G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

arXiv 2025

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

arXiv 2025

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

arXiv 2025

Interleaving Reasoning for Better Text-to-Image Generation

arXiv 2025

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

ICCV 2025

NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors

ICCV 2025

StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

arXiv 2024

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

arXiv 2024

Matryoshka Query Transformer for Large Vision-Language Models

arXiv 2024

NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

CVPR 2025

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

CVPR 2025

Verbalized Representation Learning for Interpretable Few-Shot Generalization

ICCV 2025

CV-VAE: A Compatible Video VAE for Latent Generative Video Models

arXiv 2024