Shanghang Zhang

Papers
37

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

37

RoboBrain 2.5: Depth in Sight, Time in Mind

arXiv 2026

2026

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

arXiv 2025

2025

MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

CVPR 2025 1

2025

AstraNav-World: World Model for Foresight Control and Consistency

arXiv 2025

2025

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

arXiv 2025

2025

GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control

arXiv 2025

2025

EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?

arXiv 2025

2025

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

arXiv 2025

2025

Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation

arXiv 2025

2025

LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information

arXiv 2025

2025

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

arXiv 2024

2024

Training-free Regional Prompting for Diffusion Transformers

arXiv 2024

2024

Unveiling the Tapestry of Consistency in Large Vision-Language Models

arXiv 2024

2024

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

arXiv 2024

2024

LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

arXiv 2024

2024

FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models

arXiv 2024

2024

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

arXiv 2024

2024

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

arXiv 2024

2024

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

arXiv 2024

2024

MC-LLaVA: Multi-Concept Personalized Vision-Language Model

arXiv 2024

2024

Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

arXiv 2024

2024

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

ICCV 2025

2024

Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models

arXiv 2024

2024

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

arXiv 2024

2024

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

arXiv 2023

2023

Gradient-based Parameter Selection for Efficient Fine-Tuning

CVPR 2024 1

2023

MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning

arXiv 2023

2023

Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

arXiv 2023

2023

RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision

arXiv 2023

2023

Q-Diffusion: Quantizing Diffusion Models

ICCV 2023 1

2023

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID

CVPR 2023 1

2023

ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation

arXiv 2023

2023

I-MedSAM: Implicit Medical Image Segmentation with Segment Anything

arXiv 2023

2023

NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers

CVPR 2023 1

2022

Domain-Adaptive Text Classification with Structured Knowledge from Unlabeled Data

arXiv 2022

2022

PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

ICCV 2023 1

2022

Cross-Domain Sentiment Classification with Contrastive Learning and Mutual Information Maximization

arXiv 2020

2020

Affiliations

No known affiliations.

Frequent co-authors

10

from 37 papers