Graham Neubig

Associate professor at CMU LTI and co-founder/chief scientist of All Hands AI; one of the leading academic voices on coding agents and SWE-Bench-style evaluation.

Role: professor
Currently at: All Hands AI
Papers: 80

80 papers · 2 eval contribs · 1 tool contrib

Authored papers

80

Effective Strategies for Asynchronous Software Engineering Agents

arXiv 2026

2026

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

arXiv 2026

2026

What do Language Models Learn and When? The Implicit Curriculum Hypothesis

arXiv 2026

2026

Modeling Distinct Human Interaction in Web Agents

arXiv 2026

2026

M-Prometheus: A Suite of Open Multilingual LLM Judges

arXiv 2025

2025

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

arXiv 2025

2025

Go-Browse: Training Web Agents with Structured Exploration

arXiv 2025

2025

On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

arXiv 2025

2025

AutoPresent: Designing Structured Visuals from Scratch

CVPR 2025

2025

Inducing Programmatic Skills for Agentic Tasks

arXiv 2025

2025

Demystifying Long Chain-of-Thought Reasoning in LLMs

arXiv 2025

2025

RefineBench: Evaluating Refinement Capability of Language Models via Checklists

arXiv 2025

2025

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

arXiv 2025

2025

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

arXiv 2025

2025

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

arXiv 2025

2025

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

arXiv 2025

2025

Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions

arXiv 2025

2025

The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks

arXiv 2025

2025

SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers

preprint

2024

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

ACL

2024

WebArena: A Realistic Web Environment for Building Autonomous Agents

ICLR

2024

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

arXiv 2024

2024

Repetition Improves Language Model Embeddings

arXiv 2024

2024

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

arXiv 2024

2024

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

arXiv 2024

2024

CodeRAG-Bench: Can Retrieval Augment Code Generation?

arXiv 2024

2024

TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks

arXiv 2024

2024

SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents

arXiv 2024

2024

Beyond Browsing: API-Based Web Agents

arXiv 2024

2024

RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems

arXiv 2024

2024

MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

arXiv 2024

2024

Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate

arXiv 2024

2024

Evaluating Language Models as Synthetic Data Generators

arXiv 2024

2024

Language Modeling with Editable External Knowledge

arXiv 2024

2024

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

arXiv 2024

2024

Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes

arXiv 2024

2024

Instruction-tuned Language Models are Better Knowledge Learners

arXiv 2024

2024

Better Synthetic Data by Retrieving and Transforming Existing Datasets

arXiv 2024

2024

The BrowserGym Ecosystem for Web Agent Research

arXiv 2024

2024

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

arXiv 2024

2024

Active Retrieval Augmented Generation

arXiv 2023

2023

Alignment for Honesty

arXiv 2023

2023

Learning Performance-Improving Code Edits

arXiv 2023

2023

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

arXiv 2023

2023

Learning to Filter Context for Retrieval-Augmented Generation

arXiv 2023

2023

An In-depth Look at Gemini's Language Abilities

arXiv 2023

2023

Why do Nearest Neighbor Language Models Work?

arXiv 2023

2023

DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions

arXiv 2023

2023

Unlimiformer: Long-Range Transformers with Unlimited Length Input

NeurIPS 2023

2023

Divergences between Language Models and Human Brains

arXiv 2023

2023

ChatGPT MT: Competitive for High- (but not Low-) Resource Languages

arXiv 2023

2023

Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

arXiv 2022

2022

BRIO: Bringing Order to Abstractive Summarization

ACL 2022

2022

Mega: Moving Average Equipped Gated Attention

arXiv 2022

2022

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

arXiv 2022

2022

DocPrompting: Generating Code by Retrieving the Docs

arXiv 2022

2022

MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

arXiv 2022

2022

Language Models of Code are Few-Shot Commonsense Learners

arXiv 2022

2022

Execution-Based Evaluation for Open-Domain Code Generation

arXiv 2022

2022

OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering

2022

Learning to Model Editing Processes

2022

Testing the Ability of Language Models to Interpret Figurative Language

NAACL 2022

2022

MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages

arXiv 2022

2022

Quality-Aware Decoding for Neural Machine Translation

NAACL 2022

2022

Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data

ACL 2022

2022

BARTScore: Evaluating Generated Text as Text Generation

NeurIPS 2021

2021

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

arXiv 2021

2021

Word Alignment by Fine-tuning Embeddings on Parallel Corpora

EACL 2021

2021

MasakhaNER: Named Entity Recognition for African Languages

arXiv 2021

2021

Towards a Unified View of Parameter-Efficient Transfer Learning

2021

XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation

EMNLP 2021

2021

AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages

ACL 2022

2021

Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

2020

Weight Poisoning Attacks on Pre-trained Models

arXiv 2020

2020

Detecting Hallucinated Content in Conditional Neural Sequence Generation

2020

WikiAsp: A Dataset for Multi-domain Aspect-based Summarization

arXiv 2020

2020

TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data

arXiv 2020

2020

Are Sixteen Heads Really Better than One?

2019

Learning to Deceive with Attention-Based Explanations

2019

Learning Character-level Compositionality with Visual Features

2017

Eval contributions

2

Tool contributions

1

Affiliations

Frequent co-authors

10

from 80 papers