Ying Shan

Papers
110

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

Pixal3D: Pixel-Aligned 3D Generation from Images

arXiv 2026

2026

VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

arXiv 2026

2026

CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

arXiv 2026

2026

Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels

arXiv 2026

2026

CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

arXiv 2026

2026

Semantic Generative Tuning for Unified Multimodal Models

arXiv 2026

2026

MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

arXiv 2026

2026

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

ICCV 2025

2025

VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control

arXiv 2025

2025

GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning

arXiv 2025

2025

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

arXiv 2025

2025

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

arXiv 2025

2025

From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

arXiv 2025

2025

AudioStory: Generating Long-Form Narrative Audio with Large Language Models

arXiv 2025

2025

BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing

arXiv 2025

2025

AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation

arXiv 2025

2025

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

arXiv 2025

2025

Cobra: Efficient Line Art COlorization with BRoAder References

arXiv 2025

2025

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

arXiv 2025

2025

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

arXiv 2025

2025

FlexiAct: Towards Flexible Action Control in Heterogeneous Scenarios

arXiv 2025

2025

Aligning Latent Spaces with Flow Priors

arXiv 2025

2025

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

arXiv 2025

2025

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

arXiv 2025

2025

ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

arXiv 2025

2025

ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

arXiv 2025

2025

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

arXiv 2025

2025

GenCompositor: Generative Video Compositing with Diffusion Transformer

arXiv 2025

2025

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

arXiv 2025

2025

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

ICCV 2025

2025

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

ICCV 2025

2025

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

ICCV 2025

2025

Sci-Fi: Symmetric Constraint for Frame Inbetweening

arXiv 2025

2025

YOLO-World: Real-Time Open-Vocabulary Object Detection

CVPR 2024 1

2024

StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

arXiv 2024

2024

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

arXiv 2024

2024

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

CVPR 2024 1

2024

DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

CVPR 2025 1

2024

CV-VAE: A Compatible Video VAE for Latent Generative Video Models

arXiv 2024

2024

DOGE: Towards Versatile Visual Document Grounding and Referring

ICCV 2025

2024

LLaMA Pro: Progressive LLaMA with Block Expansion

arXiv 2024

2024

ST-LLM: Large Language Models Are Effective Temporal Learners

2024

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

ICCV 2025

2024

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

arXiv 2024

2024

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

CVPR 2024 1

2024

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

CVPR 2025 1

2024

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

arXiv 2024

2024

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

arXiv 2024

2024

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

arXiv 2024

2024

Supervised Fine-tuning in turn Improves Visual Foundation Models

arXiv 2024

2024

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

CVPR 2024 1

2024

ColorFlow: Retrieval-Augmented Image Sequence Colorization

arXiv 2024

2024

GrootVL: Tree Topology is All You Need in State Space Model

arXiv 2024

2024

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

ICCV 2025

2024

BrushEdit: All-In-One Image Inpainting and Editing

arXiv 2024

2024

NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

CVPR 2025 1

2024

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

arXiv 2024

2024

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

arXiv 2024

2024

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

CVPR 2025 1

2024

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

arXiv 2024

2024

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

ICCV 2025

2024

SEED-Story: Multimodal Long Story Generation with Large Language Model

arXiv 2024

2024

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

arXiv 2024

2024

Taming Rectified Flow for Inversion and Editing

arXiv 2024

2024

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

arXiv 2024

2024

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

CVPR 2025 1

2024

VoCo-LLaMA: Towards Vision Compression with Large Language Models

CVPR 2025 1

2024

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

arXiv 2024

2024

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

arXiv 2024

2024

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

arXiv 2024

2024

GS-IR: 3D Gaussian Splatting for Inverse Rendering

CVPR 2024 1

2023

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

ICCV 2023 1

2023

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

arXiv 2023

2023

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

arXiv 2023

2023

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

CVPR 2024 1

2023

Inserting Anybody in Diffusion Models via Celeb Basis

2023

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

CVPR 2024 1

2023

ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

arXiv 2023

2023

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

ICCV 2023 1

2023

Improved Test-Time Adaptation for Domain Generalization

CVPR 2023 1

2023

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

arXiv 2023

2023

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

arXiv 2023

2023

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

arXiv 2023

2023

Making LLaMA SEE and Draw with SEED Tokenizer

arXiv 2023

2023

FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

arXiv 2023

2023

M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

arXiv 2023

2023

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

arXiv 2023

2023

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

CVPR 2023 1

2023

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

arXiv 2023

2023

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

arXiv 2023

2023

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

NeurIPS 2023 11

2023

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

ICCV 2023 1

2023

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

arXiv 2023

2023

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

arXiv 2023

2023

CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

arXiv 2023

2023

TaleCrafter: Interactive Story Visualization with Multiple Characters

arXiv 2023

2023

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

arXiv 2023

2023

Vision-Language Instruction Tuning: A Review and Analysis

arXiv 2023

2023

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

arXiv 2023

2023

Binary Embedding-based Retrieval at Tencent

arXiv 2023

2023

Exploring Model Transferability through the Lens of Potential Energy

ICCV 2023 1

2023

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

ICCV 2023 1

2022

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

CVPR 2023 1

2022

Latent Video Diffusion Models for High-Fidelity Long Video Generation

arXiv 2022

2022

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

arXiv 2022

2022

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

ICCV 2023 1

2022

All in One: Exploring Unified Video-Language Pre-training

CVPR 2023 1

2022

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

CVPR 2022 1

2022

Towards Real-World Blind Face Restoration with Generative Facial Prior

CVPR 2021 1

2021

Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

arXiv 2021

2021

Affiliations

No known affiliations.

Frequent co-authors

10

from 110 papers