Wengang Zhou

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

arXiv 2025

Robust Multimodal Large Language Models Against Modality Conflict

arXiv 2025

Uni-Sign: Toward Unified Sign Language Understanding at Scale

arXiv 2025

Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation

arXiv 2025

ROOT: VLM based System for Indoor Scene Understanding and Beyond

arXiv 2024

BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language?

arXiv 2024

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

arXiv 2024

DeepEraser: Deep Iterative Context Mining for Generic Text Eraser

arXiv 2024

Sinkhorn Distance Minimization for Knowledge Distillation

arXiv 2024

TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy

arXiv 2024

TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding

arXiv 2024

EG4D: Explicit Generation of 4D Object without Score Distillation

arXiv 2024

Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning

arXiv 2024

Masked Motion Predictors are Strong 3D Action Representation Learners

ICCV 2023

Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs

arXiv 2023

Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection

ICCV 2023

DIRE for Diffusion-Generated Image Detection

ICCV 2023

Hybrid and Collaborative Passage Reranking

arXiv 2023