Yali Wang

Papers: 22

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

arXiv 2026

2026

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

arXiv 2025

2025

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

arXiv 2025

2025

MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

arXiv 2025

2025

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

ICCV 2025

2025

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

ICCV 2025

2025

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

arXiv 2024

2024

VideoMamba: State Space Model for Efficient Video Understanding

arXiv 2024

2024

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model

arXiv 2024

2024

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

arXiv 2024

2024

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

arXiv 2024

2024

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

CVPR 2025

2024

MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

arXiv 2024

2024

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

arXiv 2024

2024

Vlogger: Make Your Dream A Vlog

CVPR 2024

2024

TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration

arXiv 2024

2024

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

CVPR 2023

2023

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

CVPR 2024

2023

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

ICCV 2023

2023

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

arXiv 2022

2022

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

arXiv 2022

2022

Self-slimmed Vision Transformer

arXiv 2021

2021

Affiliations

No known affiliations.

Frequent co-authors

from 22 papers

Yu Qiao

LiMin Wang

Kunchang Li

Yi Wang

Yinan He

Xinhao Li

Xiangyu Zeng

Chenting Wang

Jiashuo Yu

Zhengrong Yue