Qi Wu

Papers
24

Authored papers

24

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

arXiv 2026

2026

H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding

arXiv 2025

2025

3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

CVPR 2025 1

2024

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

arXiv 2024

2024

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

arXiv 2024

2024

Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System

arXiv 2024

2024

Evaluating and Advancing Multimodal Large Language Models in Ability Lens

arXiv 2024

2024

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

ICCV 2025

2024

A Survey of Medical Vision-and-Language Applications and Their Techniques

arXiv 2024

2024

ModaVerse: Efficiently Transforming Modalities with LLMs

CVPR 2024 1

2024

Streaming Video Diffusion: Online Video Editing with Diffusion Models

arXiv 2024

2024

MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

arXiv 2024

2024

AerialVLN: Vision-and-Language Navigation for UAVs

ICCV 2023 1

2023

NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

arXiv 2023

2023

Scaling Data Generation in Vision-and-Language Navigation

ICCV 2023 1

2023

VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

ICCV 2023 1

2023

Likelihood-Based Text-to-Image Evaluation with Patch-Level Perceptual and Semantic Credit Assignment

arXiv 2023

2023

Identity-Consistent Aggregation for Video Object Detection

ICCV 2023 1

2023

WebVLN: Vision-and-Language Navigation on Websites

arXiv 2023

2023

Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors

CVPR 2024 1

2023

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

ICCV 2023 1

2023

March in Chat: Interactive Prompting for Remote Embodied Referring Expression

ICCV 2023 1

2023

A Recurrent Vision-and-Language BERT for Navigation

arXiv 2020

2020

Confidence-aware Non-repetitive Multimodal Transformers for TextCaps

arXiv 2020

2020

Affiliations

No known affiliations.

Frequent co-authors

10, from 24 papers