Zhe Chen
- Papers: 36
Authored papers (36)
AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization
arXiv 2026
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
arXiv 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
arXiv 2025
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
arXiv 2025
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
arXiv 2025
MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
arXiv 2025
Sequential Diffusion Language Models
arXiv 2025
Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers
arXiv 2025
Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
arXiv 2025
RARE: Retrieval-Augmented Reasoning Modeling
arXiv 2025
MedS$^3$: Towards Medical Small Language Models with Self-Evolved Slow Thinking
arXiv 2025
Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding
arXiv 2024
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
arXiv 2024
Needle In A Multimodal Haystack
arXiv 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
arXiv 2024
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area
arXiv 2024
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
CVPR 2025 1
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
arXiv 2024
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
arXiv 2024
WHU-Synthetic: A Synthetic Perception Dataset for 3-D Multitask Model Research
arXiv 2024
MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation
arXiv 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
arXiv 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
arXiv 2024
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
arXiv 2024
Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments
arXiv 2024
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NeurIPS 2023 11
OCHID-Fi: Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision
ICCV 2023 1
DDP: Diffusion Model for Dense Visual Prediction
ICCV 2023 1
Traffic Flow Optimisation for Lifelong Multi-Agent Path Finding
arXiv 2023
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
CVPR 2023 1
CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose
CVPR 2023 1
FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation
arXiv 2021
Frequent co-authors
10 from 36 papers