Zuxuan Wu
- Papers: 51
Authored papers (51)
Channel-wise Vector Quantization
arXiv 2026
CL-bench: A Benchmark for Context Learning
arXiv 2026
ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation
arXiv 2026
FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions
arXiv 2026
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
arXiv 2026
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
arXiv 2026
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
arXiv 2026
CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization
arXiv 2026
WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
arXiv 2026
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
arXiv 2026
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
arXiv 2026
Aligning Anime Video Generation with Human Feedback
arXiv 2025
Generalized Trajectory Scoring for End-to-end Multimodal Planning
arXiv 2025
SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL
arXiv 2025
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
arXiv 2025
Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning
arXiv 2025
StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
arXiv 2025
Multimodal Referring Segmentation: A Survey
arXiv 2025
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
arXiv 2025
Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning
arXiv 2025
A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models
arXiv 2025
Safety at Scale: A Comprehensive Survey of Large Model Safety
arXiv 2025
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
ICCV 2025
DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
arXiv 2025
FOCUS: Towards Universal Foreground Segmentation
arXiv 2025
RoboOmni: Proactive Robot Manipulation in Omni-modal Context
arXiv 2025
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
arXiv 2025
StableAnimator: High-Quality Identity-Preserving Human Image Animation
CVPR 2025
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
arXiv 2024
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
arXiv 2024
ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection
arXiv 2024
Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection
arXiv 2024
Secrets of RLHF in Large Language Models Part II: Reward Modeling
arXiv 2024
AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
arXiv 2024
Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
arXiv 2024
REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents
ICCV 2025
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
arXiv 2024
MouSi: Poly-Visual-Expert Vision-Language Models
arXiv 2024
OmniVid: A Generative Framework for Universal Video Understanding
CVPR 2024
A Survey on Video Diffusion Models
arXiv 2023
MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing
arXiv 2023
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
arXiv 2023
Implicit Temporal Modeling with Learnable Alignment for Video Recognition
ICCV 2023
SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation
arXiv 2023
MotionEditor: Editing Video Motion via Content-Aware Diffusion
CVPR 2024
To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
arXiv 2023
Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models
arXiv 2023
ResFormer: Scaling ViTs with Multi-Resolution Training
CVPR 2023
Rethinking Nearest Neighbors for Visual Classification
arXiv 2021
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection
arXiv 2021
M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers
arXiv 2021
Frequent co-authors
10 from 51 papers