Qi Wu
Authored papers (24)
X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction
arXiv 2026
H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding
arXiv 2025
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
CVPR 2025 1
NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models
arXiv 2024
Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
arXiv 2024
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System
arXiv 2024
Evaluating and Advancing Multimodal Large Language Models in Ability Lens
arXiv 2024
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
ICCV 2025
A Survey of Medical Vision-and-Language Applications and Their Techniques
arXiv 2024
ModaVerse: Efficiently Transforming Modalities with LLMs
CVPR 2024 1
Streaming Video Diffusion: Online Video Editing with Diffusion Models
arXiv 2024
MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training
arXiv 2024
AerialVLN: Vision-and-Language Navigation for UAVs
ICCV 2023 1
NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
arXiv 2023
Scaling Data Generation in Vision-and-Language Navigation
ICCV 2023 1
VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
ICCV 2023 1
Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment
arXiv 2023
Identity-Consistent Aggregation for Video Object Detection
ICCV 2023 1
WebVLN: Vision-and-Language Navigation on Websites
arXiv 2023
Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors
CVPR 2024 1
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
ICCV 2023 1
March in Chat: Interactive Prompting for Remote Embodied Referring Expression
ICCV 2023 1
A Recurrent Vision-and-Language BERT for Navigation
arXiv 2020
Confidence-aware Non-repetitive Multimodal Transformers for TextCaps
arXiv 2020
Frequent co-authors
10 from 24 papers