James Zou

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

arXiv 2026

Learning to Discover at Test Time

arXiv 2026

Introspective Diffusion Language Models

arXiv 2026

Recursive Multi-Agent Systems

arXiv 2026

Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey

arXiv 2026

Forecasting Scientific Progress with Artificial Intelligence

arXiv 2026

DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

arXiv 2026

OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning

arXiv 2025

Cartridges: Lightweight and general-purpose long context representations via self-study

arXiv 2025

More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

arXiv 2025

Solving Inequality Proofs with Large Language Models

arXiv 2025

MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

arXiv 2025

Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents

arXiv 2025

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

arXiv 2025

4KAgent: Agentic Any Image to 4K Super-Resolution

arXiv 2025

Where LLM Agents Fail and How They can Learn From Failures

arXiv 2025

S-Chain: Structured Visual Chain-of-Thought For Medicine

arXiv 2025

UQ: Assessing Language Models on Unsolved Questions

arXiv 2025

Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute

arXiv 2025

SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning

arXiv 2025

Cost-of-Pass: An Economic Framework for Evaluating Language Models

arXiv 2025

Learning a Canonical Basis of Human Preferences from Binary Ratings

arXiv 2025

Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

arXiv 2025

MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents

arXiv 2025

Optimizing Model Selection for Compound AI Systems

arXiv 2025

Generative Evaluation of Complex Reasoning in Large Language Models

arXiv 2025

SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning

arXiv 2025

Latent Collaboration in Multi-Agent Systems

arXiv 2025

TrustLLM: Trustworthiness in Large Language Models

arXiv 2024

Simple linear attention language models balance the recall-throughput tradeoff

arXiv 2024

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

arXiv 2024

TextGrad: Automatic "Differentiation" via Text

arXiv 2024

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

arXiv 2024

Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

arXiv 2024

Belief in the Machine: Investigating Epistemological Blind Spots of Language Models

arXiv 2024

Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

arXiv 2024

STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

arXiv 2024

AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning

arXiv 2024

Reducing Hallucinations in Vision-Language Models via Latent Space Steering

arXiv 2024

SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals

arXiv 2024

FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?

arXiv 2024

CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models

arXiv 2024

TFG: Unified Training-Free Guidance for Diffusion Models

arXiv 2024

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

arXiv 2024

ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence

arXiv 2024

Optimizing Calibration by Gaining Aware of Prediction Correctness

arXiv 2024

What's documented in AI? Systematic Analysis of 32K AI Model Cards

arXiv 2024

Can large language models provide useful feedback on research papers? A large-scale empirical analysis

arXiv 2023

In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering

arXiv 2023

DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models

arXiv 2023

GPT detectors are biased against non-native English writers

arXiv 2023

Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models

arXiv 2023

Accuracy on the Curve: On the Nonlinear Correlation of ML Performance Between Data Subpopulations

arXiv 2023

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks

arXiv 2023

How is ChatGPT's behavior changing over time?

arXiv 2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

arXiv 2023

Discover and Cure: Concept-aware Mitigation of Spurious Correlation

arXiv 2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TMLR

When and why vision-language models behave like bags-of-words, and what to do about it?

arXiv 2022

MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts

Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale

arXiv 2022

DataPerf: Benchmarks for Data-Centric AI Development

Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild

arXiv 2019

Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems

arXiv 2019

Data Shapley: Equitable Valuation of Data for Machine Learning

arXiv 2019

Towards Automatic Concept-based Explanations
