Papers

Trending research and the full catalog - each paper linked to the benchmarks, methods, and models it introduces.

Filtered by domain: Coding Agents

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Jiacheng Liu, Xiaohan Zhao, Xinyi Shang et al. · 14 Apr 2026

Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user.

Agents · Coding Agents

1.8k · 0.3/h

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Andy Konwinski, Etash Guha, John Yang et al. · 17 Jan 2026

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains.

Agents · Coding Agents · Language Modeling

2.4k

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

28 May 2026

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains.

Agents · Coding Agents · Language Modeling

174 · 0.1/h

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

10 Jun 2026

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring.

Coding Agents

33 · 0.1/h

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

12 Jun 2026

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using…

Agents · Coding Agents · Instruction Following · Language Modeling

1.5k · 0.2/h

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

16 Jun 2026

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets,…

Coding Agents

158 · 0.1/h

Tmax: A simple recipe for terminal agents

22 Jun 2026

Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined RL-based training of these models, likely due to difficult benchmarks, a lack of data, and a lack of…

Coding Agents · Language Modeling

130 · 0.5/h

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

23 Jun 2026

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems.

Coding Agents

50 · 0.2/h