Weidi Xie
- Papers: 47
Authored papers
OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
arXiv 2026
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling
arXiv 2026
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
arXiv 2026
Multi-Agent System for Comprehensive Soccer Understanding
arXiv 2025
ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification
arXiv 2025
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding
arXiv 2025
Evolving Diagnostic Agents in a Virtual Clinical Environment
arXiv 2025
EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis
arXiv 2025
SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass
arXiv 2025
Rethinking Whole-Body CT Image Interpretation: An Abnormality-Centric Approach
arXiv 2025
End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
arXiv 2025
A Knowledge-enhanced Pathology Vision-language Foundation Model for Cancer Diagnosis
arXiv 2024
MatchTime: Towards Automatic Soccer Game Commentary Generation
arXiv 2024
RaTEScore: A Metric for Radiology Report Generation
arXiv 2024
Towards Universal Soccer Video Understanding
CVPR 2025 1
Moving Object Segmentation: All You Need Is SAM (and Flow)
arXiv 2024
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
CVPR 2025 1
Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
arXiv 2024
A Sanity Check for AI-generated Image Detection
arXiv 2024
Towards Evaluating and Building Versatile Large Language Models for Medicine
arXiv 2024
MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities
ICCV 2025
Towards Building Multilingual Language Model for Medicine
arXiv 2024
VISA: Reasoning Video Object Segmentation via Large Language Models
arXiv 2024
Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data
arXiv 2023
Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models
arXiv 2023
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
arXiv 2023
One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts
arXiv 2023
OvarNet: Towards Open-vocabulary Object Attribute Recognition
CVPR 2023 1
arXiVeri: Automatic table verification with GPT
arXiv 2023
PMC-LLaMA: Towards Building Open-source Language Models for Medicine
arXiv 2023
Towards Open-Vocabulary Video Instance Segmentation
ICCV 2023 1
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
arXiv 2023
AutoAD: Movie Description in Context
CVPR 2023 1
Grounded Question-Answering in Long Egocentric Videos
CVPR 2024 1
Zero-shot Composed Text-Image Retrieval
arXiv 2023
Joint-Relation Transformer for Multi-Person Motion Prediction
ICCV 2023 1
Boost Video Frame Interpolation via Motion Adaptation
arXiv 2023
ReCo: Retrieve and Co-segment for Zero-shot Transfer
PromptDet: Towards Open-vocabulary Detection using Uncurated Images
arXiv 2022
CounTR: Transformer-based Generalised Visual Counting
arXiv 2022
K-Space Transformer for Undersampled MRI Reconstruction
arXiv 2022
Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models
arXiv 2022
Label, Verify, Correct: A Simple Few Shot Object Detection Method
CVPR 2022 1
Prompting Visual-Language Models for Efficient Video Understanding
arXiv 2021
All you need are a few pixels: semantic segmentation with PixelPick
arXiv 2021
Self-supervised Co-training for Video Representation Learning
NeurIPS 2020 12
VGGSound: A Large-scale Audio-Visual Dataset
arXiv 2020
Frequent co-authors
10 from 47 papers