Shanghang Zhang
- Papers: 37
Authored papers (37)
RoboBrain 2.5: Depth in Sight, Time in Mind
arXiv 2026
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
arXiv 2025
MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
CVPR 2025
AstraNav-World: World Model for Foresight Control and Consistency
arXiv 2025
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
arXiv 2025
GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control
arXiv 2025
EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?
arXiv 2025
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
arXiv 2025
Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation
arXiv 2025
LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information
arXiv 2025
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
arXiv 2024
Training-free Regional Prompting for Diffusion Transformers
arXiv 2024
Unveiling the Tapestry of Consistency in Large Vision-Language Models
arXiv 2024
MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
arXiv 2024
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
arXiv 2024
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models
arXiv 2024
DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing
arXiv 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
arXiv 2024
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
arXiv 2024
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
arXiv 2024
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
arXiv 2024
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
ICCV 2025
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
arXiv 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
arXiv 2024
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
arXiv 2023
Gradient-based Parameter Selection for Efficient Fine-Tuning
CVPR 2024
MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning
arXiv 2023
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
arXiv 2023
RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision
arXiv 2023
Q-Diffusion: Quantizing Diffusion Models
ICCV 2023
MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID
CVPR 2023
ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation
arXiv 2023
I-MedSAM: Implicit Medical Image Segmentation with Segment Anything
arXiv 2023
NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers
CVPR 2023
Domain-Adaptive Text Classification with Structured Knowledge from Unlabeled Data
arXiv 2022
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
ICCV 2023
Cross-Domain Sentiment Classification with Contrastive Learning and Mutual Information Maximization
arXiv 2020
Frequent co-authors
10 from 37 papers