Wei Li
- Papers
- 82
Authored papers (82)
QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining
arXiv 2026
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
arXiv 2026
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
arXiv 2026
ArtHOI: Articulated Human-Object Interaction Synthesis by 4D Reconstruction from Video Priors
arXiv 2026
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
arXiv 2025
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
arXiv 2025
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
arXiv 2025
Real Garment Benchmark (RGBench): A Comprehensive Benchmark for Robotic Garment Manipulation featuring a High-Fidelity Scalable Simulator
arXiv 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
arXiv 2025
video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
arXiv 2025
Spatial Distillation based Distribution Alignment (SDDA) for Cross-Headset EEG Classification
arXiv 2025
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
arXiv 2025
MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing
arXiv 2025
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
arXiv 2025
MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation
arXiv 2025
TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
arXiv 2025
Light-X: Generative 4D Video Rendering with Camera and Illumination Control
arXiv 2025
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
arXiv 2025
UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis
arXiv 2025
Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets?
arXiv 2025
Artemis: Structured Visual Reasoning for Perception Policy Learning
arXiv 2025
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
arXiv 2025
MMSearch-R1: Incentivizing LMMs to Search
arXiv 2025
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
arXiv 2025
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
arXiv 2025
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
arXiv 2025
CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research
arXiv 2025
ACVUBench: Audio-Centric Video Understanding Benchmark
arXiv 2025
Exploring Federated Pruning for Large Language Models
arXiv 2025
Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
arXiv 2025
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
CVPR 2025 1
SORCE: Small Object Retrieval in Complex Environments
arXiv 2025
CoS: Chain-of-Shot Prompting for Long Video Understanding
arXiv 2025
Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning
arXiv 2025
Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model
arXiv 2025
Region-Constraint In-Context Generation for Instructional Video Editing
arXiv 2025
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
ICCV 2025
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
ICCV 2025
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
arXiv 2025
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
arXiv 2025
MVCNet: Multi-View Contrastive Network for Motor Imagery Classification
arXiv 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
arXiv 2025
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
arXiv 2024
A Survey on LLM-as-a-Judge
arXiv 2024
OMG-Seg: Is One Model Good Enough For All Segmentation?
CVPR 2024 1
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area
arXiv 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
Improving Natural Language Capability of Code Large Language Model
arXiv 2024
Can a large language model be a gaslighter?
arXiv 2024
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
ICCV 2025
A Preview of XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL
arXiv 2024
DeepInteraction++: Multi-Modality Interaction for Autonomous Driving
arXiv 2024
WildAvatar: Web-scale In-the-wild Video Dataset for 3D Avatar Creation
arXiv 2024
F-LMM: Grounding Frozen Large Multimodal Models
CVPR 2025 1
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
arXiv 2024
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
arXiv 2024
SALMONN: Towards Generic Hearing Abilities for Large Language Models
arXiv 2023
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
arXiv 2023
VIGC: Visual Instruction Generation and Correction
arXiv 2023
A Holistic Evaluation of Piano Sound Quality
arXiv 2023
MERTech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model With Multi-Task Finetuning
arXiv 2023
Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation
arXiv 2023
Contextual Object Detection with Multimodal Large Language Models
arXiv 2023
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
arXiv 2023
SoccerNet 2023 Challenges Results
arXiv 2023
Correlational Image Modeling for Self-Supervised Visual Pre-Training
CVPR 2023 1
Frame-Level Multi-Label Playing Technique Detection Using Multi-Scale Network and Self-Attention Mechanism
arXiv 2023
Balancing Logit Variation for Long-tailed Semantic Segmentation
TransHP: Image Classification with Hierarchical Prompting
Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models
arXiv 2023
EMelodyGen: Emotion-Conditioned Melody Generation in ABC Notation with the Musical Feature Template
arXiv 2023
Masked Frequency Modeling for Self-Supervised Visual Pre-Training
arXiv 2022
Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels
CVPR 2022 1
Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios
arXiv 2022
WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning
arXiv 2022
SoccerNet 2022 Challenges Results
arXiv 2022
Unified Vision and Language Prompt Learning
arXiv 2022
Learning from Future: A Novel Self-Training Framework for Semantic Segmentation
arXiv 2022
Do Transformer Modifications Transfer Across Implementations and Applications?
EMNLP 2021 11
A General Gaussian Heatmap Label Assignment for Arbitrary-Oriented Object Detection
arXiv 2021
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
ACL 2021 5
DVI: Depth Guided Video Inpainting for Autonomous Driving
ECCV 2020 8
Affiliations
Frequent co-authors
10 from 82 papers