Graham Neubig
Associate professor at CMU LTI and co-founder/chief scientist of All Hands AI; one of the leading academic voices on coding agents and SWE-Bench-style evaluation.
- Role: professor
- Currently at: All Hands AI
- Twitter: twitter.com/gneubig
- GitHub: github.com/neubig
- Scholar: scholar.google.com/citations
- Papers: 80
Authored papers (80)
Effective Strategies for Asynchronous Software Engineering Agents
arXiv 2026
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists
arXiv 2026
What do Language Models Learn and When? The Implicit Curriculum Hypothesis
arXiv 2026
Modeling Distinct Human Interaction in Web Agents
arXiv 2026
M-Prometheus: A Suite of Open Multilingual LLM Judges
arXiv 2025
SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills
arXiv 2025
Go-Browse: Training Web Agents with Structured Exploration
arXiv 2025
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
arXiv 2025
AutoPresent: Designing Structured Visuals from Scratch
CVPR 2025 1
Inducing Programmatic Skills for Agentic Tasks
arXiv 2025
Demystifying Long Chain-of-Thought Reasoning in LLMs
arXiv 2025
RefineBench: Evaluating Refinement Capability of Language Models via Checklists
arXiv 2025
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
arXiv 2025
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
arXiv 2025
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
arXiv 2025
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators
arXiv 2025
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
arXiv 2025
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
arXiv 2025
SWE-Gym: An Open Environment for Training Software Engineering Agents and Verifiers
preprint
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
ACL
WebArena: A Realistic Web Environment for Building Autonomous Agents
ICLR
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
arXiv 2024
Repetition Improves Language Model Embeddings
arXiv 2024
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
arXiv 2024
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
arXiv 2024
CodeRAG-Bench: Can Retrieval Augment Code Generation?
arXiv 2024
TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks
arXiv 2024
SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents
arXiv 2024
Beyond Browsing: API-Based Web Agents
arXiv 2024
RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems
arXiv 2024
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
arXiv 2024
Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
arXiv 2024
Evaluating Language Models as Synthetic Data Generators
arXiv 2024
Language Modeling with Editable External Knowledge
arXiv 2024
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning
arXiv 2024
Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes
arXiv 2024
Instruction-tuned Language Models are Better Knowledge Learners
arXiv 2024
Better Synthetic Data by Retrieving and Transforming Existing Datasets
arXiv 2024
The BrowserGym Ecosystem for Web Agent Research
arXiv 2024
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
arXiv 2024
Active Retrieval Augmented Generation
arXiv 2023
Alignment for Honesty
arXiv 2023
Learning Performance-Improving Code Edits
arXiv 2023
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
arXiv 2023
Learning to Filter Context for Retrieval-Augmented Generation
arXiv 2023
An In-depth Look at Gemini's Language Abilities
arXiv 2023
Why do Nearest Neighbor Language Models Work?
arXiv 2023
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions
arXiv 2023
Unlimiformer: Long-Range Transformers with Unlimited Length Input
NeurIPS 2023 11
Divergences between Language Models and Human Brains
arXiv 2023
ChatGPT MT: Competitive for High- (but not Low-) Resource Languages
arXiv 2023
Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval
arXiv 2022
BRIO: Bringing Order to Abstractive Summarization
ACL 2022 5
Mega: Moving Average Equipped Gated Attention
arXiv 2022
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
arXiv 2022
DocPrompting: Generating Code by Retrieving the Docs
arXiv 2022
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
arXiv 2022
Language Models of Code are Few-Shot Commonsense Learners
arXiv 2022
Execution-Based Evaluation for Open-Domain Code Generation
arXiv 2022
OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering
Learning to Model Editing Processes
Testing the Ability of Language Models to Interpret Figurative Language
NAACL 2022 7
MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
arXiv 2022
Quality-Aware Decoding for Neural Machine Translation
NAACL 2022 7
Show Me More Details: Discovering Hierarchies of Procedures from Semi-structured Web Data
ACL 2022 5
BARTScore: Evaluating Generated Text as Text Generation
NeurIPS 2021 12
Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing
arXiv 2021
Word Alignment by Fine-tuning Embeddings on Parallel Corpora
EACL 2021 2
MasakhaNER: Named Entity Recognition for African Languages
arXiv 2021
Towards a Unified View of Parameter-Efficient Transfer Learning
XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
EMNLP 2021 11
AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages
ACL 2022 5
Incorporating External Knowledge through Pre-training for Natural Language to Code Generation
Weight Poisoning Attacks on Pre-trained Models
arXiv 2020
Detecting Hallucinated Content in Conditional Neural Sequence Generation
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization
arXiv 2020
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
arXiv 2020
Are Sixteen Heads Really Better than One?
Learning to Deceive with Attention-Based Explanations
Learning Character-level Compositionality with Visual Features
Eval contributions (2)
Tool contributions (1)
Affiliations
Frequent co-authors (10, from 80 papers)