Ziwei Liu
- Papers
- 171
Authored papers: 171
AI for Auto-Research: Roadmap & User Guide
arXiv 2026
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
arXiv 2026
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
arXiv 2026
PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
arXiv 2026
SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
arXiv 2026
MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
arXiv 2026
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
arXiv 2026
Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
arXiv 2026
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
arXiv 2026
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
arXiv 2026
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
arXiv 2026
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
arXiv 2026
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
arXiv 2026
ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
arXiv 2026
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence
arXiv 2026
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
arXiv 2026
Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
arXiv 2026
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
arXiv 2026
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
arXiv 2026
A Very Big Video Reasoning Suite
arXiv 2026
A Simple Baseline for Streaming Video Understanding
arXiv 2026
Demystifying Video Reasoning
arXiv 2026
HippoCamp: Benchmarking Contextual Agents on Personal Computers
arXiv 2026
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
arXiv 2026
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
arXiv 2026
The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation
arXiv 2025
FileGram: Grounding Agent Personalization in File-System Behavioral Traces
arXiv 2026
3D Scene Generation: A Survey
arXiv 2025
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
arXiv 2025
EgoLife: Towards Egocentric Life Assistant
CVPR 2025
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
ICCV 2025
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
arXiv 2025
Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
ICCV 2025
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
arXiv 2025
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
arXiv 2025
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
arXiv 2025
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
arXiv 2025
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
arXiv 2025
CFG-Zero*: Improved Classifier-Free Guidance for Flow Matching Models
arXiv 2025
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
ICCV 2025
Ola: Pushing the Frontiers of Omni-Modal Language Model
arXiv 2025
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
ICCV 2025
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
arXiv 2025
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
arXiv 2025
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
CVPR 2025
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
ICCV 2025
Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM
arXiv 2025
Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
arXiv 2025
Light-X: Generative 4D Video Rendering with Camera and Illumination Control
arXiv 2025
The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
arXiv 2025
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
arXiv 2025
Scaling Spatial Intelligence with Multimodal Foundation Models
arXiv 2025
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
arXiv 2025
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
arXiv 2025
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
arXiv 2025
Simulating the Visual World with Artificial Intelligence: A Roadmap
arXiv 2025
HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
arXiv 2025
IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
arXiv 2025
3D and 4D World Modeling: A Survey
arXiv 2025
PhysX: Physical-Grounded 3D Asset Generation
arXiv 2025
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
arXiv 2025
LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
arXiv 2025
Reconstructing 4D Spatial Intelligence: A Survey
arXiv 2025
4DNeX: Feed-Forward 4D Generative Modeling Made Easy
arXiv 2025
MMSearch-R1: Incentivizing LMMs to Search
arXiv 2025
DPoser-X: Diffusion Model as Robust 3D Whole-body Human Pose Prior
arXiv 2025
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
arXiv 2025
Cut2Next: Generating Next Shot via In-Context Tuning
arXiv 2025
The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
arXiv 2025
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
arXiv 2025
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
arXiv 2025
Visual Jigsaw Post-Training Improves MLLMs
arXiv 2025
FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
ICCV 2025
GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
arXiv 2025
LLaVA-OneVision: Easy Visual Task Transfer
arXiv 2024
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
ICCV 2025
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
arXiv 2024
A Comprehensive Survey on 3D Content Generation
arXiv 2024
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
arXiv 2024
Imagine360: Immersive 360 Video Generation from Perspective Anchor
arXiv 2024
DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes
arXiv 2024
Long Context Transfer from Language to Vision
arXiv 2024
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
CVPR 2025
FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
arXiv 2024
3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors
arXiv 2024
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
CVPR 2024
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion
CVPR 2025
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
ICCV 2025
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
CVPR 2025
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
CVPR 2024
AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation
CVPR 2024
GaussianCity: Generative Gaussian Splatting for Unbounded 3D City Generation
arXiv 2024
FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
arXiv 2024
ThemeStation: Generating Theme-Aware 3D Assets from Few Exemplars
arXiv 2024
WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation
arXiv 2024
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
arXiv 2024
AID: Attention Interpolation of Text-to-Image Diffusion
arXiv 2024
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey
arXiv 2024
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
arXiv 2024
MMInA: Benchmarking Multihop Multimodal Internet Agents
arXiv 2024
Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation
arXiv 2024
4D Contrastive Superflows are Dense 3D Representation Learners
arXiv 2024
Latte: Latent Diffusion Transformer for Video Generation
arXiv 2024
Vlogger: Make Your Dream A Vlog
CVPR 2024
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
CVPR 2025
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
arXiv 2024
High-Fidelity Virtual Try-on with Large-Scale Unpaired Learning
arXiv 2024
WHAC: World-grounded Humans and Cameras
arXiv 2024
OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection
arXiv 2023
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
arXiv 2023
Detecting and Grounding Multi-Modal Media Manipulation
CVPR 2023
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation
Panoptic Video Scene Graph Generation
FreeU: Free Lunch in Diffusion U-Net
CVPR 2024
DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering
ICCV 2023
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
arXiv 2023
StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces
ICCV 2023
ReVersion: Diffusion-Based Relation Inversion from Images
arXiv 2023
HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
arXiv 2023
OtterHD: A High-Resolution Multi-modality Model
arXiv 2023
F$^{2}$-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories
arXiv 2023
Segment Any Point Cloud Sequences by Distilling Vision Foundation Models
NeurIPS 2023
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
arXiv 2023
SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections
arXiv 2023
CityDreamer: Compositional Generative Model of Unbounded 3D Cities
CVPR 2024
DreamGaussian4D: Generative 4D Gaussian Splatting
arXiv 2023
Collaborative Diffusion for Multi-Modal Face Generation and Editing
CVPR 2023
GauHuman: Articulated Gaussian Splatting from Monocular Human Videos
CVPR 2024
Deep Geometrized Cartoon Line Inbetweening
ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model
ICCV 2023
SHERF: Generalizable Human NeRF from a Single Image
ICCV 2023
Text2Performer: Text-Driven Human Video Generation
ICCV 2023
SparseNeRF: Distilling Depth Ranking for Few-shot Novel View Synthesis
ICCV 2023
Octopus: Embodied Vision-Language Programmer from Environmental Feedback
arXiv 2023
Class-Incremental Learning: A Survey
arXiv 2023
PERF: Panoramic Neural Radiance Field from a Single Panorama
arXiv 2023
Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need
arXiv 2023
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
arXiv 2023
ReliTalk: Relightable Talking Portrait Generation from a Single Video
arXiv 2023
FunQA: Towards Surprising Video Comprehension
arXiv 2023
Link-Context Learning for Multimodal LLMs
CVPR 2024
UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation
ICCV 2023
BiBench: Benchmarking and Analyzing Network Binarization
arXiv 2023
DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields
arXiv 2023
Learning without Forgetting for Vision-Language Models
arXiv 2023
FreeInit: Bridging Initialization Gap in Video Diffusion Models
arXiv 2023
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
arXiv 2023
Conditional Prompt Learning for Vision-Language Models
CVPR 2022
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
arXiv 2022
Masked Frequency Modeling for Self-Supervised Visual Pre-Training
arXiv 2022
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy
arXiv 2022
VToonify: Controllable High-Resolution Portrait Video Style Transfer
arXiv 2022
Pastiche Master: Exemplar-Based High-Resolution Portrait Style Transfer
CVPR 2022
StyleGAN-Human: A Data-Centric Odyssey of Human Generation
arXiv 2022
Text2Human: Text-Driven Controllable Human Image Generation
arXiv 2022
EVA3D: Compositional 3D Human Generation from 2D Image Collections
arXiv 2022
Panoptic Scene Graph Generation
arXiv 2022
Neural Prompt Search
arXiv 2022
Sparse Mixture-of-Experts are Domain Generalizable Learners
arXiv 2022
AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies
arXiv 2022
BiBERT: Accurate Fully Binarized BERT
Benchmarking and Analyzing Point Cloud Classification under Corruptions
arXiv 2022
LaserMix for Semi-Supervised LiDAR Semantic Segmentation
CVPR 2023
Talk-to-Edit: Fine-Grained Facial Editing via Dialog
ICCV 2021
Unsupervised Object-Level Representation Learning from Scene Images
NeurIPS 2021
Delving into Inter-Image Invariance for Unsupervised Visual Representations
arXiv 2020
Long-tailed Recognition by Routing Diverse Distribution-Aware Experts
ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on
arXiv 2020
MaskGAN: Towards Diverse and Interactive Facial Image Manipulation
Self-Supervised Learning via Conditional Motion Propagation
Dynamic Graph CNN for Learning on Point Clouds
arXiv 2018
Affiliations
Frequent co-authors
10 from 171 papers