James Zou
Authored papers (67)
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
arXiv 2026
Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
arXiv 2026
Learning to Discover at Test Time
arXiv 2026
Introspective Diffusion Language Models
arXiv 2026
Recursive Multi-Agent Systems
arXiv 2026
Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey
arXiv 2026
Forecasting Scientific Progress with Artificial Intelligence
arXiv 2026
Latent Collaboration in Multi-Agent Systems
arXiv 2025
DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
arXiv 2026
OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
arXiv 2025
Cartridges: Lightweight and general-purpose long context representations via self-study
arXiv 2025
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
arXiv 2025
Solving Inequality Proofs with Large Language Models
arXiv 2025
MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports
arXiv 2025
UQ: Assessing Language Models on Unsolved Questions
arXiv 2025
Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
arXiv 2025
Optimizing Model Selection for Compound AI Systems
arXiv 2025
Learning a Canonical Basis of Human Preferences from Binary Ratings
arXiv 2025
Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents
arXiv 2025
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
arXiv 2025
4KAgent: Agentic Any Image to 4K Super-Resolution
arXiv 2025
Where LLM Agents Fail and How They can Learn From Failures
arXiv 2025
S-Chain: Structured Visual Chain-of-Thought For Medicine
arXiv 2025
Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
arXiv 2025
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
arXiv 2025
SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning
arXiv 2025
Generative Evaluation of Complex Reasoning in Large Language Models
arXiv 2025
Cost-of-Pass: An Economic Framework for Evaluating Language Models
arXiv 2025
SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning
arXiv 2025
TrustLLM: Trustworthiness in Large Language Models
arXiv 2024
Simple linear attention language models balance the recall-throughput tradeoff
arXiv 2024
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
arXiv 2024
TextGrad: Automatic "Differentiation" via Text
arXiv 2024
Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models
arXiv 2024
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
arXiv 2024
TFG: Unified Training-Free Guidance for Diffusion Models
arXiv 2024
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
arXiv 2024
ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence
arXiv 2024
Belief in the Machine: Investigating Epistemological Blind Spots of Language Models
arXiv 2024
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
arXiv 2024
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases
arXiv 2024
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
arXiv 2024
Reducing Hallucinations in Vision-Language Models via Latent Space Steering
arXiv 2024
SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals
arXiv 2024
FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?
arXiv 2024
Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face
arXiv 2024
Optimizing Calibration by Gaining Aware of Prediction Correctness
arXiv 2024
What's documented in AI? Systematic Analysis of 32K AI Model Cards
arXiv 2024
BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
arXiv 2023
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models
arXiv 2023
Discover and Cure: Concept-aware Mitigation of Spurious Correlation
arXiv 2023
GPT detectors are biased against non-native English writers
arXiv 2023
Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models
arXiv 2023
Can large language models provide useful feedback on research papers? A large-scale empirical analysis
arXiv 2023
How is ChatGPT's behavior changing over time?
arXiv 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
arXiv 2023
In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
arXiv 2023
Accuracy on the Curve: On the Nonlinear Correlation of ML Performance Between Data Subpopulations
arXiv 2023
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
TMLR
DataPerf: Benchmarks for Data-Centric AI Development
When and why vision-language models behave like bags-of-words, and what to do about it?
arXiv 2022
MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts
Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale
arXiv 2022
Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild
arXiv 2019
Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems
arXiv 2019
Data Shapley: Equitable Valuation of Data for Machine Learning
arXiv 2019
Towards Automatic Concept-based Explanations
Frequent co-authors
10 from 67 papers