Zuxuan Wu

Papers
51

Authored papers

51

Channel-wise Vector Quantization

arXiv 2026

2026

CL-bench: A Benchmark for Context Learning

arXiv 2026

2026

ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

arXiv 2026

2026

FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions

arXiv 2026

2026

FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance

arXiv 2026

2026

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

arXiv 2026

2026

VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

arXiv 2026

2026

CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization

arXiv 2026

2026

WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

arXiv 2026

2026

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

arXiv 2026

2026

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

arXiv 2026

2026

Aligning Anime Video Generation with Human Feedback

arXiv 2025

2025

Generalized Trajectory Scoring for End-to-end Multimodal Planning

arXiv 2025

2025

SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

arXiv 2025

2025

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

arXiv 2025

2025

Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning

arXiv 2025

2025

StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

arXiv 2025

2025

Multimodal Referring Segmentation: A Survey

arXiv 2025

2025

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

arXiv 2025

2025

Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

arXiv 2025

2025

A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

arXiv 2025

2025

Safety at Scale: A Comprehensive Survey of Large Model Safety

arXiv 2025

2025

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

ICCV 2025

2025

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

arXiv 2025

2025

FOCUS: Towards Universal Foreground Segmentation

arXiv 2025

2025

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

arXiv 2025

2025

FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction

arXiv 2025

2025

StableAnimator: High-Quality Identity-Preserving Human Image Animation

CVPR 2025 1

2024

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

arXiv 2024

2024

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

arXiv 2024

2024

ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection

arXiv 2024

2024

Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

arXiv 2024

2024

Secrets of RLHF in Large Language Models Part II: Reward Modeling

arXiv 2024

2024

AgentGym: Evolving Large Language Model-based Agents across Diverse Environments

arXiv 2024

2024

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

arXiv 2024

2024

REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents

ICCV 2025

2024

Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

arXiv 2024

2024

MouSi: Poly-Visual-Expert Vision-Language Models

arXiv 2024

2024

OmniVid: A Generative Framework for Universal Video Understanding

CVPR 2024 1

2024

A Survey on Video Diffusion Models

arXiv 2023

2023

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

arXiv 2023

2023

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

arXiv 2023

2023

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

ICCV 2023 1

2023

SEGIC: Unleashing the Emergent Correspondence for In-Context Segmentation

arXiv 2023

2023

MotionEditor: Editing Video Motion via Content-Aware Diffusion

CVPR 2024 1

2023

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

arXiv 2023

2023

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models

arXiv 2023

2023

ResFormer: Scaling ViTs with Multi-Resolution Training

CVPR 2023 1

2022

Rethinking Nearest Neighbors for Visual Classification

arXiv 2021

2021

M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

arXiv 2021

2021

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

arXiv 2021

2021

Affiliations

No known affiliations.

Frequent co-authors

10

from 51 papers