Jieyu Zhang

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

arXiv 2026

WildDet3D: Scaling Promptable 3D Detection in the Wild

arXiv 2026

Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

arXiv 2026

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

arXiv 2025

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

arXiv 2025

Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

arXiv 2025

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

arXiv 2025

Spatial Mental Modeling from Limited Views

arXiv 2025

MolmoAct: Action Reasoning Models that can Reason in Space

arXiv 2025

Adaptive In-conversation Team Building for Language Model Agents

arXiv 2024

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

arXiv 2024

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

arXiv 2024

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

arXiv 2024

Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training

arXiv 2024

DataComp: In search of the next generation of multimodal datasets

NeurIPS 2023

SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

arXiv 2023

EcoAssistant: Using LLM Assistant More Affordably and Accurately

arXiv 2023

When to Learn What: Model-Adaptive Data Augmentation Curriculum

ICCV 2023

Subclass-balancing Contrastive Learning for Long-tailed Recognition

ICCV 2023