LiMin Wang
- Papers: 63
Authored papers (63)
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
arXiv 2026
RIVER: A Real-Time Interaction Benchmark for Video LLMs
arXiv 2026
Towards Pixel-Level VLM Perception via Simple Points Prediction
arXiv 2026
Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
arXiv 2026
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
arXiv 2025
DDT: Decoupled Diffusion Transformer
Make Your Training Flexible: Towards Deployment-Efficient Video Models
ICCV 2025
SORCE: Small Object Retrieval in Complex Environments
arXiv 2025
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
arXiv 2025
Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
arXiv 2025
PixNerd: Pixel Neural Field Diffusion
arXiv 2025
MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
arXiv 2025
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
ICCV 2025
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
arXiv 2025
Differentiable Solver Search for Fast Diffusion Sampling
arXiv 2025
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
arXiv 2025
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
arXiv 2025
DMM: Building a Versatile Image Generation Model via Distillation-Based Model Merging
arXiv 2025
History-Aware Transformation of ReID Features for Multiple Object Tracking
arXiv 2025
MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving
arXiv 2025
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
arXiv 2025
Multiple Object Tracking as ID Prediction
CVPR 2025 1
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
CVPR 2025 1
VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model
arXiv 2024
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
CVPR 2025 1
VFIMamba: Video Frame Interpolation with State Space Models
arXiv 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
arXiv 2024
VideoMamba: State Space Model for Efficient Video Understanding
arXiv 2024
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
arXiv 2024
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
arXiv 2024
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
arXiv 2024
Taming Scalable Visual Tokenizer for Autoregressive Image Generation
ICCV 2025
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
arXiv 2024
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
arXiv 2024
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
arXiv 2024
Accelerating Image Generation with Sub-path Linear Approximation Model
arXiv 2024
Stochastic Layer-Wise Shuffle: A Good Practice to Improve Vision Mamba Training
arXiv 2024
SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
CVPR 2024 1
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
CVPR 2024 1
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
CVPR 2023 1
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024 1
MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
ICCV 2023 1
SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos
ICCV 2023 1
Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation
CVPR 2023 1
Deep Equilibrium Object Detection
ICCV 2023 1
SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes
ICCV 2023 1
Memory-and-Anticipation Transformer for Online Action Understanding
ICCV 2023 1
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
ICCV 2023 1
Efficient Video Action Detection with Token Dropout and Context Refinement
ICCV 2023 1
StageInteractor: Query-based Object Detector with Cross-stage Interaction
ICCV 2023 1
MGMAE: Motion Guided Masking for Video Masked Autoencoding
ICCV 2023 1
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
arXiv 2023
Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation
arXiv 2023
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
arXiv 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
MixFormer: End-to-End Tracking with Iterative Mixed Attention
Recovering 3D Human Mesh from Monocular Images: A Survey
arXiv 2022
MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions
ICCV 2021 10
Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs
arXiv 2016
Affiliations
Frequent co-authors
10 from 63 papers