Ying Shan
- Papers: 110
Authored papers
Pixal3D: Pixel-Aligned 3D Generation from Images
arXiv 2026
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
arXiv 2026
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
arXiv 2026
Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels
arXiv 2026
CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
arXiv 2026
Semantic Generative Tuning for Unified Multimodal Models
arXiv 2026
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
arXiv 2026
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
ICCV 2025
VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control
arXiv 2025
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
arXiv 2025
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
arXiv 2025
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
arXiv 2025
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
arXiv 2025
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
arXiv 2025
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
arXiv 2025
AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
arXiv 2025
MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
arXiv 2025
Cobra: Efficient Line Art COlorization with BRoAder References
arXiv 2025
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
arXiv 2025
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
arXiv 2025
FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios
arXiv 2025
Aligning Latent Spaces with Flow Priors
arXiv 2025
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
arXiv 2025
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
arXiv 2025
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
arXiv 2025
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
arXiv 2025
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts
arXiv 2025
GenCompositor: Generative Video Compositing with Diffusion Transformer
arXiv 2025
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
arXiv 2025
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
ICCV 2025
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
ICCV 2025
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
ICCV 2025
Sci-Fi: Symmetric Constraint for Frame Inbetweening
arXiv 2025
YOLO-World: Real-Time Open-Vocabulary Object Detection
CVPR 2024
StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos
arXiv 2024
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
arXiv 2024
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
CVPR 2024
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation
CVPR 2025
CV-VAE: A Compatible Video VAE for Latent Generative Video Models
arXiv 2024
DOGE: Towards Versatile Visual Document Grounding and Referring
ICCV 2025
LLaMA Pro: Progressive LLaMA with Block Expansion
arXiv 2024
ST-LLM: Large Language Models Are Effective Temporal Learners
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
ICCV 2025
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
arXiv 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
CVPR 2024
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
CVPR 2025
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
arXiv 2024
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation
arXiv 2024
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
arXiv 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
arXiv 2024
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
CVPR 2024
ColorFlow: Retrieval-Augmented Image Sequence Colorization
arXiv 2024
GrootVL: Tree Topology is All You Need in State Space Model
arXiv 2024
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
ICCV 2025
BrushEdit: All-In-One Image Inpainting and Editing
arXiv 2024
NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images
CVPR 2025
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing
arXiv 2024
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models
arXiv 2024
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
CVPR 2025
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion
arXiv 2024
Taming Scalable Visual Tokenizer for Autoregressive Image Generation
ICCV 2025
SEED-Story: Multimodal Long Story Generation with Large Language Model
arXiv 2024
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model
arXiv 2024
Taming Rectified Flow for Inversion and Editing
arXiv 2024
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
arXiv 2024
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
CVPR 2025
VoCo-LLaMA: Towards Vision Compression with Large Language Models
CVPR 2025
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
arXiv 2024
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
arXiv 2024
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
arXiv 2024
GS-IR: 3D Gaussian Splatting for Inverse Rendering
CVPR 2024
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
ICCV 2023
StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter
arXiv 2023
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models
arXiv 2023
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
CVPR 2024
Inserting Anybody in Diffusion Models via Celeb Basis
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
CVPR 2024
ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
arXiv 2023
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
ICCV 2023
Improved Test-Time Adaptation for Domain Generalization
CVPR 2023
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
arXiv 2023
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
arXiv 2023
DreamDiffusion: Generating High-Quality Images from Brain EEG Signals
arXiv 2023
Making LLaMA SEE and Draw with SEED Tokenizer
arXiv 2023
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling
arXiv 2023
M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
arXiv 2023
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
arXiv 2023
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing
CVPR 2023
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
arXiv 2023
Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos
arXiv 2023
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
NeurIPS 2023
MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
ICCV 2023
AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
arXiv 2023
Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning
arXiv 2023
CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models
arXiv 2023
TaleCrafter: Interactive Story Visualization with Multiple Characters
arXiv 2023
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation
arXiv 2023
Vision-Language Instruction Tuning: A Review and Analysis
arXiv 2023
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
arXiv 2023
Binary Embedding-based Retrieval at Tencent
arXiv 2023
Exploring Model Transferability through the Lens of Potential Energy
ICCV 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
ICCV 2023
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
CVPR 2023
Latent Video Diffusion Models for High-Fidelity Long Video Generation
arXiv 2022
AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos
arXiv 2022
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
ICCV 2023
All in One: Exploring Unified Video-Language Pre-training
CVPR 2023
UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection
CVPR 2022
Towards Real-World Blind Face Restoration with Generative Facial Prior
CVPR 2021
Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data
arXiv 2021