Jiacheng Liu, Xiaohan Zhao, Xinyi Shang et al. · 14 Apr 2026
Claude Code is an agentic coding tool that can run shell commands, edit files, and call external services on behalf of the user.
Trending research and the full catalog - each paper linked to the benchmarks, methods, and models it introduces.
Andy Konwinski, Etash Guha, John Yang et al. · 17 Jan 2026
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains.
28 May 2026
AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains.
10 Jun 2026
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring.
12 Jun 2026
We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using…
16 Jun 2026
Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets,…
22 Jun 2026
Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little academic work has examined RL-based training of these models, likely due to difficult benchmarks, a lack of data, and a lack of…
23 Jun 2026
We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems.