Mike Zheng Shou
- Papers: 88
Authored papers
Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration
arXiv 2026
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
arXiv 2026
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
arXiv 2026
World Action Models: The Next Frontier in Embodied AI
arXiv 2026
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
arXiv 2026
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
arXiv 2026
Olaf-World: Orienting Latent Actions for Video World Modeling
arXiv 2026
Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
arXiv 2026
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
arXiv 2026
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
arXiv 2026
ShowUI-Aloha: Human-Taught GUI Agent
arXiv 2026
Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance
arXiv 2026
Code2Video: A Code-centric Paradigm for Educational Video Generation
arXiv 2025
Paper2Video: Automatic Video Generation from Scientific Papers
arXiv 2025
Show-o2: Improved Native Unified Multimodal Models
arXiv 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
arXiv 2025
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data
arXiv 2025
D-AR: Diffusion via Autoregressive Models
arXiv 2025
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
arXiv 2025
macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
arXiv 2025
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
Edit Transfer: Learning Image Editing via Vision In-Context Relations
arXiv 2025
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
arXiv 2025
Balanced Image Stylization with Style Matching Score
ICCV 2025
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
arXiv 2025
MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
arXiv 2025
TPDiff: Temporal Pyramid Video Diffusion Model
arXiv 2025
X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale
arXiv 2025
The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
arXiv 2025
EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
arXiv 2025
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
arXiv 2025
Computer-Use Agents as Judges for Generative User Interface
arXiv 2025
H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
arXiv 2025
Reinforcement Learning in Vision: A Survey
arXiv 2025
Multi-human Interactive Talking Dataset
arXiv 2025
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
CVPR 2025
Factorized Learning for Temporally Grounded Video-Language Models
arXiv 2025
Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model
arXiv 2025
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
arXiv 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
CVPR 2025
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Automated Movie Generation via Multi-Agent CoT Planning
arXiv 2025
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
arXiv 2025
WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
arXiv 2025
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
ICCV 2025
Impossible Videos
arXiv 2025
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
arXiv 2024
FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data
arXiv 2024
Skinned Motion Retargeting with Dense Geometric Interaction Perception
arXiv 2024
ROICtrl: Boosting Instance Control for Visual Generation
CVPR 2025
Visual Perception by Large Language Model's Weights
arXiv 2024
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
arXiv 2024
Learning Video Context as Interleaved Multimodal Sequences
arXiv 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
arXiv 2024
Image Watermarks are Removable Using Controllable Regeneration from Clean Noise
arXiv 2024
DragAnything: Motion Control for Anything using Entity Representation
arXiv 2024
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
arXiv 2024
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
arXiv 2024
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
Hallucination of Multimodal Large Language Models: A Survey
arXiv 2024
Faster Diffusion via Temporal Attention Decomposition
arXiv 2024
LOVA3: Learning to Visual Question Answering, Asking and Assessment
arXiv 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
arXiv 2024
Learning Long-form Video Prior via Generative Pre-Training
arXiv 2024
Too Large; Data Reduction for Vision-Language Pre-Training
ICCV 2023
MLLMs-Augmented Visual-Language Representation Learning
arXiv 2023
Making Vision Transformers Efficient from A Token Sparsification View
CVPR 2023
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
arXiv 2023
Unsupervised Open-Vocabulary Object Localization in Videos
ICCV 2023
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
ICCV 2023
Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
arXiv 2023
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
arXiv 2023
UniVTG: Towards Unified Video-Language Temporal Grounding
ICCV 2023
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
ICCV 2023
ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
arXiv 2023
VisorGPT: Learning Visual Prior via Generative Pre-Training
arXiv 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
ICCV 2023
CVPR 2023 Text Guided Video Editing Competition
arXiv 2023
Bootstrapping SparseFormers from Vision Foundation Models
CVPR 2024
Parrot Captions Teach CLIP to Spot Text
arXiv 2023
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
arXiv 2022
Label-Efficient Online Continual Object Detection in Streaming Video
ICCV 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
ICCV 2023
All in One: Exploring Unified Video-Language Pre-training
CVPR 2023
Egocentric Video-Language Pretraining
arXiv 2022
Position-guided Text Prompt for Vision-Language Pre-training
CVPR 2023
Ego4D: Around the World in 3,000 Hours of Egocentric Video
CVPR 2022
AVA-AVD: Audio-Visual Speaker Diarization in the Wild
arXiv 2021