Xihan Wei
- Papers: 12
Authored papers (12)
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
arXiv 2026
LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
CVPR 2025
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
arXiv 2025
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
arXiv 2025
CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization
arXiv 2025
ViSpeak: Visual Instruction Feedback in Streaming Videos
ICCV 2025
IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation
arXiv 2025
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs
arXiv 2025
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
arXiv 2025
LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in Mechanism via Multi-Step Reasoning
arXiv 2025
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
arXiv 2025
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness
arXiv 2025
Frequent co-authors
10 from 12 papers