Wenhai Wang
Authored papers (58)
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
arXiv 2026
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
arXiv 2025
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
arXiv 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
arXiv 2025
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
arXiv 2025
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
arXiv 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
arXiv 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
arXiv 2025
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
arXiv 2025
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
arXiv 2025
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
arXiv 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
arXiv 2025
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
arXiv 2025
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
arXiv 2025
GenExam: A Multidisciplinary Text-to-Image Exam
arXiv 2025
Sequential Diffusion Language Models
arXiv 2025
OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
arXiv 2025
Fair-PP: A Synthetic Dataset for Aligning LLM with Personalized Preferences of Social Equity
arXiv 2025
CoMemo: LVLMs Need Image Context with Image Memory
arXiv 2025
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
arXiv 2025
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
arXiv 2025
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
arXiv 2024
Needle In A Multimodal Haystack
arXiv 2024
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024 1
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
arXiv 2024
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models
arXiv 2024
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area
arXiv 2024
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
CVPR 2025 1
DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction
ICCV 2025
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
arXiv 2024
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
arXiv 2024
Low-light Image Enhancement via CLIP-Fourier Guided Wavelet Diffusion
arXiv 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
arXiv 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
arXiv 2024
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
arXiv 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments
arXiv 2024
FB-BEV: BEV Representation from Forward-Backward View Transformations
ICCV 2023 1
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NeurIPS 2023 11
A Survey of Reasoning with Foundation Models
arXiv 2023
Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
Prompting Frameworks for Large Language Models: A Survey
arXiv 2023
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models
arXiv 2023
ControlLLM: Augment Language Models with Tools by Searching on Graphs
arXiv 2023
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
arXiv 2023
Planning-oriented Autonomous Driving
CVPR 2023 1
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
CVPR 2023 1
Demystify Transformers & Convolutions in Modern Image Deep Networks
arXiv 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023 1
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
NeurIPS 2021 12
PVT v2: Improved Baselines with Pyramid Vision Transformer
arXiv 2021
Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers
CVPR 2022 1
FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation
arXiv 2021
Selective Kernel Networks
Frequent co-authors: 10, from 58 papers