General Reasoning

General Reasoning is an org.

Type: org

Tools

302

Alquerque

Alquerque is an environment for evaluating agents on playing Alquerque, an ancient board game and precursor to checkers, against an LLM opponent. This environment wraps the Alquerque implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Secretary

Secretary is an environment for evaluating agents on optimal stopping and sequential decision-making under uncertainty.

RL EnvDecision Making in Games

DAComp DE

DAComp-DE is a benchmark of 110 data engineering tasks that require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines

RL EnvData Engineering and Analysis

LogicPuzzle

LogicPuzzle is an environment for evaluating agents on deductive reasoning and logic grid puzzle solving.

RL EnvLogic Puzzle Reasoning

ScenarioPlanning

ScenarioPlanning is an environment for evaluating agents on creative problem-solving and strategy formulation for survival scenarios.

RL EnvDecision Making in Games

GameOfPureStrategy

GameOfPureStrategy is an environment for evaluating agents on strategic card play in a two-player competitive game. This environment wraps the GameOfPureStrategy implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

FrontierScience

An implementation of the FrontierScience evaluation.

RL EnvExpert Level Scientific Reasoning

SWE Bench Multilingual

A benchmark of 300 software engineering tasks across 42 repositories and 9 programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. Each instance is derived from a real GitHub pull request, following the same format and evaluation protocol as SWE-b…

RL EnvCode GenerationCode

AARDData

AARDData is an environment which teaches agents to do pretraining data tasks.

RL EnvAI Research Tasks

EMMA

EMMA (Enhanced MultiModal reAsoning) is a benchmark for assessing organic multimodal reasoning in MLLMs across mathematics, physics, chemistry, and coding.

RL EnvMultimodal Scientific Reasoning

AnthropicPerformance

Anthropic's performance take-home test as an environment.

RL EnvCode Performance Optimization

BeyondAIME

BeyondAIME is a benchmark for evaluating generalized STEM reasoning that extends AIME-style problems to probe deep, stepwise mathematical problem-solving.

RL EnvMathematical Reasoning

Ml Dev Bench

A benchmark for evaluating AI agents on real world ML development tasks. 30 tasks covering various aspects of model development, including dataset management, debugging model and code failures, and implementing new ideas to achieve strong performance on various machine learni…

RL EnvMachine Learning Engineering

SimpleQA

SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions.

RL EnvQuestion Answering

FrontierCS

FrontierCS is a benchmark of 156 expert-designed, open-ended computer science problems across diverse areas that require models to produce executable programs (with an expert reference solution and automatic evaluator provided for each problem) rather than direct answers. It t…

RL EnvComputer Science Mastery Evaluation

GBAGym

GBA Eval from Mechanize Inc. ported as an RL training environment.

RL Env

WasmInterpInRust

Build wasmrun from scratch

RL Env

USACO

USACO consists of 307 problems from the USA Computing Olympiad, complete with exhaustive test cases, problem analyses, and difficulty labels. Although zero-shot performance is poor, we find that we can well over double performance with a combination of retrieval and self-refle…

RL EnvCompetitive CodingCompetitive Programming ProblemsCompetitive Programming Problem Solving

EnigmaDecrypt

Enigma Decrypt is an environment where agents decrypt WWII-era Enigma-encoded German military messages. Each task presents an intercepted ciphertext encrypted with a historically accurate Wehrmacht Enigma I machine (3-rotor, plugboard configuration). The agent is given partial…

RL EnvCryptographic Decryption Reasoning

TextWorldSimple

TextWorld Simple is an environment for evaluating language model agents on text-based adventure games using Microsoft's TextWorld framework. Agents must navigate a 6-room house, find a key to escape a locked bedroom, locate a food item, and cook it on the stove -- all through…

RL EnvVision Language Navigation

Taboo

Taboo is an environment for evaluating agents on creative communication and word description under constraints.

RL EnvDecision Making in Games

Bandit

Bandit is an environment for evaluating agents on the classic multi-armed bandit problem. This environment wraps the Bandit implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

ACI Bench

Ambient Clinical Intelligence Benchmark (ACI-BENCH) is a benchmark corpus for evaluating AI-assisted clinical note generation from doctor–patient visit dialogues.

RL EnvAutomatic Visit Note Generation

Medical Reasoning

An implementation of https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT.

RL EnvMedical Reasoning

WordChains

WordChains is an environment for evaluating agents on word association and vocabulary knowledge through a chaining game where each word must start with the last letter of the previous word.

RL EnvDecision Making in Games
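
The chaining rule described for WordChains can be illustrated with a small validator. This is only a sketch of the rule as stated (each word starts with the last letter of the previous word); the function name is hypothetical, and the environment's real checks (for example dictionary membership or repeated words) are not covered here.

```python
def is_valid_chain(words):
    """Return True if every word starts with the last letter of the word before it.

    Hypothetical helper for illustration; not the environment's own validator.
    """
    cleaned = [w.strip().lower() for w in words]
    # Reject empty entries, which have no first or last letter to compare.
    if any(not w for w in cleaned):
        return False
    # Compare each adjacent pair: last letter of one word vs. first letter of the next.
    return all(prev[-1] == nxt[0] for prev, nxt in zip(cleaned, cleaned[1:]))


if __name__ == "__main__":
    print(is_valid_chain(["apple", "egg", "goat"]))
    print(is_valid_chain(["apple", "banana"]))
```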

SettlersOfCatan

SettlersOfCatan is an environment for evaluating agents on strategic resource management and territory expansion in Settlers of Catan.

RL EnvDecision Making in Games

Othello

Othello is an environment for evaluating agents on the classic board game also known as Reversi.

RL EnvDecision Making in Games

CharacterConclave

CharacterConclave is an environment for evaluating agents on persuasive communication and impression management in a social competition. This environment wraps the CharacterConclave implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Tau2bench

tau2-bench is a benchmark for evaluating conversational AI agents in a Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user use tools to act in a shared, dynamic environment that stresses coordination, communication, and guidance.

RL EnvAgentic AI Evaluation

E2EBench

End-to-end evaluation of AI systems in complete workflows from input to final output

RL EnvEnd to End Software Development

MMMU

MMMU is a benchmark for evaluating multimodal models on massive multi-discipline tasks that require college-level subject knowledge and deliberate reasoning.

RL EnvMultimodal Reasoning

Gso

Developing high-performance software is a complex task that requires specialized expertise. GSO (Global Software Optimization) is a benchmark for evaluating language models' capabilities in developing high-performance software.

RL EnvSoftware EngineeringCode Performance OptimizationCode GenerationCode

Aider Polyglot

Aider’s polyglot benchmark tests LLMs on 225 challenging Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust.

RL EnvCode GenerationCode

AMO Bench

AMO-Bench is a 50-problem benchmark of original, expert-validated Olympiad-level math questions designed to test advanced mathematical reasoning beyond existing competitions, using final-answer grading to robustly evaluate top-tier LLMs where current benchmarks have saturated.

RL EnvMathematical Reasoning

PrincipiaCollection

Principia Collection is a large-scale dataset designed to enhance language models’ ability to derive mathematical objects from STEM-related problem statements. Each instance contains a problem statement, a ground truth answer, an answer type, and a topic label. The topics are…

RL EnvMathematical ReasoningPhysics Reasoning

Nemotron RL Math Stack Overflow

The Nemotron-RL-math-stack_overflow dataset contains mathematical problems and solutions sourced from the Stack Overflow forums. This is an implementation of https://huggingface.co/datasets/nvidia/Nemotron-RL-math-stack_overflow.

RL EnvMathematical Reasoning

WikiGame

A game of exploring and racing through Wikipedia articles!

RL EnvDecision Making in Games

YearGuessr

YearGuessr is the largest open benchmark for evaluating vision-language models’ ability to predict buildings’ construction years and to expose popularity-driven memorization bias. It contains 55,546 building images from 157 countries with multimodal attributes (continuous ordi…

RL EnvBuilding Age Estimation

Discovery30s

Discovery30s is a benchmark that tests the potential of vintage language models to reproduce scientific discoveries after the training data cutoff period. We construct the benchmark by taking a known discovery, e.g. Hückel's rule, and then breaking it down into a "question lad…

RL EnvAutomated Scientific Discovery

ControlEval

ControlEval is an evaluation dataset that comprises 500 control tasks with various specific design goals.

RL Env

SpreadsheetBench

SpreadsheetBench evaluates large language model agents' capabilities in manipulating complex real-world spreadsheets. The benchmark includes 912 real questions gathered from online Excel forums, covering a variety of tabular data such as multiple tables, non-standard relational…

RL EnvSpreadsheet Control TasksSpreadsheet Manipulation

OfficeQA

OfficeQA is a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. Office…

RL EnvOffice Automation Tasks in Realistic Office WorkflowsMulti Document Reasoning

ADE Bench

ADE-bench is a framework for evaluating AI agents on data analyst tasks. It is designed to represent real-world data work with messy projects featuring hundreds of tables, often with multiple tables that could conceivably represent the same entity.

RL EnvData Science Agent EvaluationData Science Tasks

UltimateTicTacToe

UltimateTicTacToe is an environment for evaluating agents on strategic gameplay in Ultimate Tic-Tac-Toe, a complex variant where winning three sub-boards in a row determines victory.

RL EnvDecision Making in Games

MarsExplorer

MarsExplorer is an environment for evaluating agents on grid-based terrain exploration and coverage. Based on the MarsExplorer environment by Koutras et al., agents control a rover navigating a procedurally-generated 2D grid with obstacles, using a simulated LIDAR sensor to re…

RL EnvGridworld Games Evaluation

Zebra

Zebra puzzles (logic grid puzzles) with varying difficulty

RL Env

GolfCardGame

GolfCardGame is an environment for evaluating agents on strategic decision-making in the Golf card game, where the goal is to achieve the lowest score. This environment wraps the Golf implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Codenames

Codenames is an environment for evaluating agents on word association and collaborative team play in the Codenames board game. This environment wraps the Codenames implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Breakthrough

Breakthrough is an environment for evaluating agents on playing Breakthrough, a chess-like abstract strategy game, against an LLM opponent. This environment wraps the Breakthrough implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

IteratedRockPaperScissors

IteratedRockPaperScissors is an environment for evaluating agents on pattern recognition and strategic play in the classic hand game.

RL EnvDecision Making in Games

Sudoku

Sudoku is an environment for evaluating agents on solving Sudoku puzzles of varying difficulty levels.

RL EnvDecision Making in Games

Debate

Debate is an environment for evaluating agents on persuasive argumentation where a panel of LLM judges determines the winner. This environment wraps the Debate implementation from TextArena, a framework for text-based game environments.

RL EnvDebate Speech Evaluation

SpiteAndMalice

SpiteAndMalice is an environment for evaluating agents on the competitive card game Spite and Malice.

RL EnvDecision Making in Games

IteratedTwoThirdsAverage

IteratedTwoThirdsAverage is an environment for evaluating agents on strategic reasoning about iterated dominance and opponent modeling.

RL EnvDecision Making in Games

AInsteinBench

AInsteinBench provides 244 scientific computing tasks derived from multiple scientific repositories. These tasks have been verified on execution and also reviewed by corresponding domain experts to verify both software engineering and scientific content accuracy. The tasks cov…

RL EnvScientific ReasoningScience

MMCircuitEval

MMCircuitEval is the first multimodal benchmark specifically designed to comprehensively evaluate MLLMs on electronic design automation (EDA) tasks in digital and analog circuit design. It contains 3614 expert-reviewed question-answer pairs drawn from textbooks, technical ques…

RL EnvElectronic Design Automation

SWE Smith

A dataset of 50k+ software engineering task instances generated from 128 popular Python repositories. Each instance includes an execution environment and a task that involves fixing a broken test. The dataset was used to train SWE-agent-LM-32B, which achieves 40.2% on SWE-benc…

RL EnvCode GenerationCode

MEDEC

MEDEC is the first publicly available benchmark for medical error detection and correction in clinical notes, covering five error types (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism).

RL EnvMedical Error Correction

MATH

MATH is a dataset of 12,500 challenging competition mathematics problems.

RL EnvMathematical Reasoning

PortManager

PortManager is a container port terminal management environment where agents schedule cranes, berths, yard storage, and truck/rail departures over a 168-hour (1-week) planning horizon. The simulation models a medium-large port with realistic vessel arrivals, STS crane operatio…

RL Env

ChineseSimpleQA

Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality of language models on short-question answering, covering six major topics and 99 diverse subtopics.

RL EnvQuestion Answering

LitQATrain

LitQATrain has distinctively specific questions on scientific literature and requires agents to search the literature in order to answer them. This is based on the LitQA benchmark by FutureHouse.

RL EnvScientific Literature Retrieval

SuppQATrain

SuppQATrain is an environment for evaluating question answering over supplementary materials of scientific papers, based on FutureHouse's SuppQA subtask within LAB-Bench. Each question asks about a specific verifiable fact found exclusively in a paper's supplementary data rath…

RL EnvScientific Literature Retrieval

IFEval

IFEval evaluates an LLM's ability to follow instructions. It focuses on a set of verifiable instructions such as "write in more than 400 words" and "mention the keyword of AI at least 3 times".

RL EnvInstruction Following
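
To make "verifiable instruction" concrete, here is a minimal sketch of programmatic checks for the two example instructions quoted above. The function names and the whitespace-based word count are assumptions made for illustration; this is not IFEval's own checking code.

```python
import re


def has_more_than_n_words(response, n):
    """Check an instruction like 'write in more than 400 words'.

    Assumption: words are counted by splitting on whitespace.
    """
    return len(response.split()) > n


def mentions_keyword_at_least(response, keyword, k):
    """Check an instruction like 'mention the keyword of AI at least 3 times'.

    Assumption: matches are case-insensitive and must be whole words,
    so 'said' does not count as a mention of 'AI'.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= k


if __name__ == "__main__":
    text = "AI is useful. Many teams rely on ai tools. The AI said so."
    print(has_more_than_n_words(text, 400))
    print(mentions_keyword_at_least(text, "AI", 3))
```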

MobileEnv

MobileEnv is an open, minimalist environment for training and evaluating coordination algorithms in wireless mobile networks. The environment models users who move around an area and can connect to one or multiple base stations.

RL Env

MLE Bench

MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering tasks. It consists of 82 ML engineering-related competitions from Kaggle.

RL EnvMachine Learning Engineering

PLawBench

PLawBench is a rubric-based benchmark designed to evaluate the performance of large language models (LLMs) in legal practice. It includes three legal tasks: legal consultation, case analysis, and legal document drafting, covering a wide range of real-world legal domains such a…

RL EnvLegal ReasoningLegal Case AnalysisLegal

ICUCoordinator

ICUCoordinator is a hospital bed management simulation where agents act as the bed coordinator at a 150-bed community hospital. Agents must allocate beds, manage staffing, handle patient admissions and discharges, schedule operating rooms, and make ambulance diversion decision…

RL EnvHospital Operations Evaluation

Chernobyl

Chernobyl is a nuclear power plant management environment where agents operate reactors through crisis scenarios inspired by four real historical disasters: Chernobyl (1986), Three Mile Island (1979), Fukushima Daiichi (2011), and Windscale (1957). The simulation couples point…

RL Env

NetHack

NetHack is an environment for evaluating agents on the classic roguelike dungeon exploration game NetHack, built on the NetHack Learning Environment (NLE). Agents explore procedurally generated dungeons by sending individual keystrokes and receiving ASCII terminal screen obser…

RL EnvDecision Making in Games

EsoLang

EsoLang-Bench is a benchmark for evaluating genuine reasoning in large language models via esoteric programming languages. It is designed to be resistant to data contamination and benchmark gaming, measuring transferable computational reasoning rather than memorization.

RL EnvComputational ReasoningCode GenerationCode

FoodDelivery

FoodDelivery is a food delivery dispatch optimization environment. An agent manages a fleet of e-bike couriers across a procedurally generated city, making real-time decisions about courier-order matching, order batching, surge pricing, and fleet repositioning under stochastic…

RL EnvFood Delivery Decision Making

MarketEntryGame

MarketEntryGame is an environment for evaluating agents on strategic decision-making in a market entry game with capacity constraints.

RL EnvDecision Making in Games

Threadneedle

Threadneedle is a monetary policy environment where an AI agent plays the role of the Bank of England's Monetary Policy Committee (MPC), setting Bank Rate and quantitative easing (QE) each quarter over 20 quarters. The economy evolves according to a simplified Heterogeneous Ag…

RL EnvEconomic Reasoning Problems

ReverseTicTacToe

ReverseTicTacToe is an environment for evaluating agents on the inverse of classic tic-tac-toe where getting three in a row loses.

RL EnvDecision Making in Games

HighSociety

HighSociety is an environment for evaluating agents on auction-based resource management and bidding strategy.

RL EnvDecision Making in Games

SimpleNegotiation

SimpleNegotiation is an environment for evaluating agents on resource negotiation and strategic trading.

RL EnvNegotiation Simulation

GDPVal

GDPval is a benchmark for evaluating AI model capabilities on real-world economically valuable tasks. We use the v2 version of the benchmark for this implementation which comes with rubric based grading.

RL EnvEconomically Valuable Tasks Evaluation

Chessenv

Chess is an environment for evaluating agents on playing chess against an LLM opponent. This environment wraps the Chess implementation from TextArena, a framework for text-based game environments.

RL EnvChessGames

NewRecruit

NewRecruit is an environment for evaluating agents on negotiation skills in a recruitment scenario where the recruiter and candidate must agree on salary, bonus, and job assignment.

RL EnvNegotiation Simulation

KuhnPoker

KuhnPoker is an environment for evaluating agents on simplified poker with a 3-card deck.

RL EnvPoker Playing Ability Evaluation

SecretMafia

SecretMafia is an environment for evaluating agents on social deduction, persuasion, and strategic deception in a Mafia-style game.

RL EnvDecision Making in Games

Skywork OR1 RL Data

Skywork-OR1-RL-Data is a dataset of verifiable, challenging, and diverse math problems (105K) and coding questions (14K).

RL EnvMathematical ReasoningCode GenerationCode

Countdown

Countdown is an environment for evaluating agents on the Countdown Numbers game, where agents must reach a target number by combining available numbers with arithmetic operations. This environment wraps the Countdown implementation from TextArena, a framework for text-based ga…

RL EnvDecision Making in Games

TwoDollar

TwoDollar is an environment for evaluating agents on economic negotiation and game-theoretic reasoning.

RL EnvDecision Making in Games

ThreePlayerGOPS

ThreePlayerGOPS is an environment for evaluating agents on strategic card play in the Game of Pure Strategy.

RL EnvSequential Decision Making in Card Games

Agent World Model

Agent World Model (AWM) is a fully synthetic environment generation pipeline that synthesizes 1,000 executable, SQL database-backed tool-use environments exposed via a unified MCP interface for large-scale multi-turn agentic reinforcement learning. This is a port of [agent-worl…

RL Env

GraphWalks

GraphWalks is a single-turn environment that tests an agent's ability to perform graph operations on directed graphs presented as edge lists. Each task provides a directed graph and asks the agent to execute a specific operation - either finding the parent nodes of a target no…

RL EnvGraph Reasoning Tasks

Mastermind

Mastermind is an environment for evaluating agents on code-breaking and deductive reasoning with feedback.

RL EnvInteractive Code Breaking Task

RE Bench

RE-Bench consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts.

RL EnvMachine Learning Engineering Tasks

GermanWhist

GermanWhist is an environment for evaluating agents on trick-taking card game strategy. This environment wraps the GermanWhist implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

DSBC

DSBC assesses performance across a diverse set of eight data science task categories

RL EnvData Science Tasks

Game2048

2048 is an environment for evaluating agents on the classic sliding tile puzzle game. This environment wraps the 2048 implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

KernelBench

KernelBench is an open-source framework for evaluating language models' ability to generate fast, correct GPU kernels on a suite of 250 carefully selected PyTorch ML workloads representing a real-world engineering environment.

RL EnvGPU Kernel Optimization

MMLU ProX

MMLU-ProX is a comprehensive benchmark for assessing cross-linguistic reasoning in LLMs across 29 languages, built on an English benchmark with each language version containing 11,829 identical questions (and a lite version of 658 questions per language) to enable direct compa…

RL EnvMultilingual Reasoning Evaluation

FifteenPuzzle

FifteenPuzzle is an environment for evaluating agents on the classic sliding tile puzzle (also known as the 15-puzzle). This environment wraps the FifteenPuzzle implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

DontSayIt

DontSayIt is an environment for evaluating agents on playing Don't Say It, a conversational deception game where players try to make opponents say secret words, against an LLM opponent. This environment wraps the DontSayIt implementation from TextArena, a framework for text-ba…

RL EnvDecision Making in Games

Tak

Tak is an environment for evaluating agents on the full version of Tak, a strategic board game of roads and stacks.

RL EnvDecision Making in Games

WordSearch

WordSearch is an environment for evaluating agents on finding hidden words in letter grids.

RL EnvDecision Making in Games

LightsOut

LightsOut is an environment for evaluating agents on spatial reasoning and logic puzzle solving tasks.

RL EnvLogic Puzzle Reasoning

Crusade

Crusade is an environment for evaluating agents on playing Crusade, a chess-like strategy game with knight-movement mechanics, against an LLM opponent. This environment wraps the Crusade implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Chopsticks

Chopsticks is an environment for evaluating agents on playing Chopsticks, a hand game involving finger arithmetic and strategic redistribution, against an LLM opponent. This environment wraps the Chopsticks implementation from TextArena, a framework for text-based game environ…

RL EnvDecision Making in Games

Cybench

Cybench is a benchmark for evaluating the cybersecurity capabilities and risks of language models. Cybench includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.

RL EnvCapture the Flag Challenges

IndianPoker

IndianPoker is an environment for evaluating agents on poker strategy with imperfect information.

RL EnvPoker Playing Ability Evaluation

SpellingBee

SpellingBee is an environment for evaluating agents on word formation and vocabulary knowledge.

RL EnvDecision Making in Games

TicTacToe

TicTacToe is an environment for evaluating agents on playing Tic-Tac-Toe against an LLM opponent.

RL EnvDecision Making in Games

DAPO Math

DAPO is a mathematics dataset used for training the DAPO reasoning model, consisting of around 17k unique questions.

RL EnvMathematical Reasoning

QuantumTicTacToe

QuantumTicTacToe is an environment for evaluating agents on a quantum variant of tic-tac-toe where marks exist in superposition.

RL EnvDecision Making in Games

Surround

Surround is an environment for evaluating agents on the Tron-style game where players leave trails and must avoid collisions.

RL EnvDecision Making in Games

Sokoban

Sokoban is an environment for evaluating agents on spatial planning and sequential puzzle solving.

RL EnvSpatial Reasoning Evaluation

TwentyQuestions

TwentyQuestions is an environment for evaluating agents on playing the classic Twenty Questions game against an LLM gamemaster.

RL EnvDecision Making in Games

Battleship

Battleship is an environment for evaluating agents on playing the classic Battleship guessing game against an LLM opponent. This environment wraps the Battleship implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

DABStep

DABstep consists of over 450 data analysis tasks designed to evaluate the capabilities of state-of-the-art LLMs and AI agents.

RL EnvData Analysis Tasks

PaperBench

PaperBench is a benchmark for evaluating AI agents' ability to replicate state-of-the-art AI research from scratch by reproducing 20 ICML 2024 Spotlight and Oral papers - including understanding contributions, developing codebases, and executing experiments.

RL EnvPaper to Code Reproduction

CritPt

CritPt is a benchmark designed to test LLMs on unpublished, research‑level reasoning tasks across modern physics subfields, comprising 71 composite research challenges decomposed into 190 simpler checkpoint tasks created by 50+ active researchers.

RL EnvPhysics Research Reasoning

Slitherlink

Slitherlink is an environment for evaluating agents on logic puzzle solving with loop-drawing constraints.

RL EnvDecision Making in Games

EXP Bench

EXP-Bench is designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimenta…

RL EnvAI Research Tasks

Snake

Snake is an environment for evaluating agents on the classic multiplayer snake game.

RL EnvDecision Making in Games

FrozenLake

FrozenLake is an environment for evaluating agents on a grid navigation game where they must reach a goal while avoiding holes on a slippery frozen lake. This environment wraps the FrozenLake implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

RushHour

RushHour is an environment for evaluating agents on spatial planning and constraint-based puzzle solving.

RL EnvDecision Making in Games

ThreePlayerIPD

ThreePlayerIPD is an environment for evaluating agents on multi-player game theory and social dilemmas in the Iterated Prisoner's Dilemma.

RL EnvStrategic Decision Making in Game Theory

BlackjackEnv

Blackjack is an environment for evaluating agents on the classic card game. This environment wraps the Blackjack implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

WordLadder

WordLadder is an environment for evaluating agents on building word ladders by changing one letter at a time to transform a start word into an end word.

RL EnvDecision Making in Games
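
The one-letter-at-a-time rule for WordLadder can be sketched as a simple check. The helper names are hypothetical, and the sketch deliberately omits any dictionary lookup, which a real environment would presumably also require.

```python
def differs_by_one_letter(a, b):
    """True if a and b have the same length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def is_valid_ladder(ladder, start, end):
    """Check that the ladder runs from start to end, changing one letter per step.

    Assumption: no dictionary check is performed; only the letter-change rule.
    """
    if not ladder or ladder[0] != start or ladder[-1] != end:
        return False
    return all(differs_by_one_letter(p, n) for p, n in zip(ladder, ladder[1:]))


if __name__ == "__main__":
    print(is_valid_ladder(["cold", "cord", "card", "ward", "warm"], "cold", "warm"))
```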

ThreePlayerTicTacToe

ThreePlayerTicTacToe is an environment for evaluating agents on strategic play in a three-player variant of Tic-Tac-Toe on a 5x5 board.

RL EnvDecision Making in Games

Wordle

Wordle is an environment for evaluating agents on the word-guessing game Wordle. The agent must deduce a hidden target word by submitting guesses and interpreting positional feedback (correct, misplaced, or absent letters).

RL EnvDecision Making in Games
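
The positional feedback described for Wordle (correct, misplaced, or absent letters) can be sketched as follows. The function name, the string labels, and the handling of repeated letters are assumptions that follow the convention of the popular game; the environment's actual output format may differ.

```python
from collections import Counter


def wordle_feedback(guess, target):
    """Return a per-letter list of 'correct', 'misplaced', or 'absent'.

    Assumption: repeated letters are marked 'misplaced' only as many times
    as they remain unmatched in the target.
    """
    if len(guess) != len(target):
        raise ValueError("guess and target must have the same length")

    feedback = ["absent"] * len(guess)
    remaining = Counter()  # target letters not matched in the right position

    # First pass: mark exact matches and count the unmatched target letters.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "correct"
        else:
            remaining[t] += 1

    # Second pass: mark misplaced letters while unmatched copies remain.
    for i, g in enumerate(guess):
        if feedback[i] != "correct" and remaining[g] > 0:
            feedback[i] = "misplaced"
            remaining[g] -= 1

    return feedback


if __name__ == "__main__":
    print(wordle_feedback("speed", "abide"))
```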

SWE Perf

A benchmark of 140 code performance optimization tasks drawn from real GitHub pull requests. Each instance includes a full repository codebase, target functions to optimize, and reference solutions from human developers. Models are evaluated on their ability to generate patche…

RL EnvCode GenerationCode

SimpleQAVerified

SimpleQA Verified is a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality.

RL EnvFactuality Evaluation of Language Models

Volforecast

volforecast (volatility forecasting) is a benchmark that evaluates the ability of agents to forecast volatility in financial time series.

RL EnvFinancial Time Series Forecasting

ATLAS

An implementation of the ATLAS benchmark of https://arxiv.org/abs/2511.14366v2

RL EnvScientific ReasoningScience

OrganicChem1909

An environment that turns "Practical Methods of Organic Chemistry" (1909) into a question and answer dataset for training and evaluation.

RL EnvGraduate Level Chemistry Question Answering

OrganicChem1920

OpenReward environment for testing organic chemistry knowledge based on Holleman's "A Text-book of Organic Chemistry" (5th English ed., 1920).

RL EnvScientific Question Answering

BioClassify

An environment for classifying molecules as active or inactive against biological targets. Uses the HIV replication inhibition dataset.

RL EnvMolecular Property Prediction

Formula2SMILES

Generate a SMILES formula given a molecular formula and some constraints.

RL EnvMolecular Generation

PaperSearchQA

PaperSearchQA is a challenging factoid QA dataset for scientific papers with 60k samples. This is a re-implementation. Instead of RAG, we use web_search and fetch tools to allow the agent to search the internet for answers. Note these are optional and you can exclude these fr…

RL EnvScientific Question Answering

GSM8K

GSM8K is a classic math word problems dataset.

RL EnvMathematical Reasoning

MMLU Pro

MMLU-Pro is a more robust and challenging massive multi-task understanding dataset designed to benchmark large language models' capabilities more rigorously. This dataset contains 12K complex questions across various disciplines.

RL EnvQuestion Answering

PRBench

Professional Reasoning Bench (PRBench) is a realistic, open-ended, rubric-based benchmark for evaluating models on economically consequential professional tasks in Finance and Law.

RL EnvHigh Stakes Professional Reasoning

Evoeval

EvoEval: Evolving Coding Benchmarks via LLM

RL Env

RetroSynth

Retrosynthesis benchmark for evaluating AI agents' ability to propose reactants that produce a target molecule.

RL EnvRetrosynthesis Evaluation

SuperGPQA

SuperGPQA is a benchmark for evaluating graduate-level knowledge and reasoning across 285 specialized disciplines.

RL EnvAcademic Question Answering

ToolMind Web QA

ToolMind-Web-QA is a validated public dataset designed for research on search-augmented and long-horizon search agents. This is an implementation of https://huggingface.co/datasets/Nanbeige/ToolMind-Web-QA.

RL EnvDeep Research Tasks

MMLU Redux

MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.

RL EnvAcademic Question Answering

SuperCHEM

SUPERChem is a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. This is a re-implementation.

RL EnvChemical Reasoning

MMLU Redux 2

MMLU-Redux is a subset of 5,700 manually re-annotated questions across 57 MMLU subjects. Implementation of: https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0.

RL EnvAcademic Question Answering

MathVista

An implementation of the MathVista environment.

RL EnvMultimodal Mathematical Reasoning

AIME2025

Problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.

RL EnvMathematical Reasoning

VeriSciQA

VeriSciQA is a visual question answering dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types.

RL EnvScientific Visual Question Answering

NFLBench

This environment tests a model's ability to build predictive models of NFL games and bet against market odds.

RL EnvSports Analytics

WhoDunit

A murder mystery environment.

RL EnvReasoning Evaluation

DataScienceComps

DataScienceComps consists of data science competitions that agents can be trained against.

RL EnvData Science Tasks

FineProofs RL

An implementation of the FineProofs-RL environment for theorem proving from https://huggingface.co/datasets/lm-provers/FineProofs-RL.

RL EnvAutomated Theorem Proving

Encyclo K

Encyclo-K is a statement-based benchmark for evaluating LLMs' comprehensive understanding by extracting standalone knowledge statements from authoritative textbooks and dynamically composing them into evaluation questions at test time.

RL EnvAcademic Question Answering

Crosswords

Cryptic crossword puzzles.

RL EnvCryptic Crossword Solving

Arc Agi 1

ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker…

RL EnvAbstract Visual Reasoning

Portfolio

Portfolio is an environment that tests agents' ability to conduct portfolio optimisation tasks.

RL EnvPortfolio Optimization

ADME

An environment predicting ADME (Absorption, Distribution, Metabolism, Excretion) molecular properties from SMILES notation.

RL EnvMolecular Property Prediction

SWE Bench Pro

SWE-Bench Pro is an environment of 1,865 long-horizon software engineering tasks across 41 repositories. The public set contains 731 instances from copyleft-licensed repositories, designed to resist training data contamination. Tasks span bug fixes, feature requests, optimizat…

RL EnvCode GenerationCode

GSM1K

GSM1K implementation.

RL EnvGrade School Arithmetic

TennisBench

TennisBench is a benchmark for evaluating language model agents in tennis betting scenarios, testing their ability to develop and execute machine learning based betting strategies.

RL EnvSports Analytics

MathVision

An implementation of the MATH-Vision dataset which measures multimodal mathematical reasoning.

RL EnvMultimodal Mathematical Reasoning

Codepde

CodePDE is an environment for generating PDE solvers using large language models.

RL EnvScientific ReasoningCode GenerationScience

AIME2026

Problems from the American Invitational Mathematics Examination (AIME) 2026-I & II.

RL EnvMathematical Reasoning

OMOLAgent

OMOLAgent tests agents on the OMol25 dataset from FAIR. They are given access to the OMol-4M training dataset and evaluated on the OMol25 validation set.

RL EnvEvaluating Machine Learning Interatomic Potentials

BudgetDay

A series of real-world tasks related to budget day in the UK, which require reading long-form documents, creating spreadsheets, producing analyses, and more.

RL EnvFiscal Policy Design and OptimizationLong Context Reasoning

InverseIFEval

Inverse IFEval is a benchmark that measures models' counter-intuitive ability to override training-induced biases and comply with adversarial or unconventional instructions.

RL EnvInstruction Following

RubricHub

RubricHub is a large-scale dataset of questions requiring rubric grading criteria.

RL EnvRubric Generation and Reward Modeling

Financeagent Terminal

Port of https://github.com/laude-institute/harbor-datasets/tree/main/datasets/financeagent_terminal

RL EnvReal World Financial Research Tasks

GlobalPIQA

Global PIQA (a participatory commonsense reasoning benchmark for over 100 languages) was constructed by 335 researchers from 65 countries and covers 116 language varieties across five continents, 14 language families, and 23 writing systems.

RL EnvPhysical Commonsense Reasoning

Taubench

tau-bench challenges agents to coordinate, guide, and assist users in achieving shared objectives across complex enterprise domains.

RL EnvEnterprise Agentic System Evaluation

Kumo

KUMO is a novel benchmark designed to systematically evaluate the complex reasoning capabilities of Large Language Models (LLMs) through procedurally generated reasoning games.

RL EnvLogical Reasoning

IUPACNames

An environment for translating IUPAC names to SMILES formulae and vice versa.

RL EnvChemical Reasoning

SocialData

Social impact data science competitions.

RL EnvData Science Tasks

Biomni Eval1

Biomedical evaluation from the Biomni agent. Ported from https://huggingface.co/datasets/biomni/Eval1.

RL EnvBiomedical Question Answering

CL Bench

CL-bench is a real-world benchmark for context learning consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts.

RL EnvLong Context Evaluation

SWE Bench Verified

Implementation of https://openai.com/index/introducing-swe-bench-verified/

RL EnvCode GenerationCode

HLE Verified

HLE-Verified is a systematically audited and reliability-enhanced version of the Humanity’s Last Exam (HLE) benchmark.

RL EnvAcademic Question Answering

ScholarSearch

ScholarSearch is designed to evaluate the complex information retrieval capabilities of Large Language Models (LLMs) in academic research.

RL EnvAcademic Information Retrieval

PhysicsEval

PhysicsEval is a benchmark for evaluating the performance of large language models on mathematical and descriptive physics problems, including assessments using inference-time techniques and multi-agent verification frameworks.

RL EnvPhysics Problem Solving

TIRBench

TIR-Bench is a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation within chain-of-thought.

RL EnvAgentic Thinking with Images Reasoning

GPQA

An implementation of GPQA

RL EnvAcademic Question Answering

HealthBench

HealthBench tests how well AI models perform in realistic health scenarios, based on what physician experts say matters most.

RL EnvHealthcare Conversations

Deveval

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

RL Env

Counter

Simple testing environment where the agent must reach a target count from a starting count using increment and decrement tools. The episode ends when the submitted count matches the goal. The environment has 100 train tasks and 20 test tasks with random start/goal values in…

RL Env
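The interaction loop described for Counter can be sketched as follows. This is a minimal illustration only: the class names, the tool names `increment`, `decrement`, and `submit`, and the 0/1 reward are assumptions, not the environment's actual interface.

```python
from dataclasses import dataclass


@dataclass
class CounterTask:
    """Hypothetical model of one task: a starting count and a goal count."""
    start: int
    goal: int


class CounterEpisode:
    """Illustrative episode; tool names and the 0/1 reward are assumptions."""

    def __init__(self, task: CounterTask) -> None:
        self.count = task.start
        self.goal = task.goal
        self.done = False

    def increment(self) -> int:
        self.count += 1
        return self.count

    def decrement(self) -> int:
        self.count -= 1
        return self.count

    def submit(self) -> float:
        # The episode ends only when the submitted count matches the goal.
        if self.count == self.goal:
            self.done = True
            return 1.0
        return 0.0


def solve(episode: CounterEpisode) -> int:
    """Step toward the goal one tool call at a time; return the number of calls."""
    calls = 0
    while episode.count != episode.goal:
        if episode.count < episode.goal:
            episode.increment()
        else:
            episode.decrement()
        calls += 1
    return calls
```

A trivial policy like `solve` needs exactly `abs(goal - start)` tool calls before submitting, which is what makes an environment of this shape useful as a smoke test.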

ValidMol

ValidMol is an environment that tests molecule generation using programmatic verifiers.

RL EnvOpen Domain Molecule Generation

MMMLU

MMMLU is OpenAI's translation of the MMLU test set into 14 languages, produced by professional human translators.

RL EnvMultilingual Question Answering

CaseLawQA

CaseLawQA is a dataset of legal classification tasks drawing from the Supreme Court and Songer Court of Appeals legal databases.

RL EnvLegal Annotation and Classification

ObscureFacts

ObscureFacts is an environment for evaluating an agent's ability to find answers to obscure trivia questions using web search. Agents must use web search tools to research and answer intentionally difficult factual questions spanning sports, technology, local history, and acad…

RL EnvWeb Browsing Agent Evaluation

Replicationbench

Evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature.

RL EnvScientific ReasoningScience

HMMT

HMMT is one of the largest and most prestigious high school math competitions in the world.

RL EnvMathematical Reasoning

Open RL

Open-RL by Turing consists of self-contained, verifiable, and unambiguous STEM reasoning problems across Physics, Mathematics, Biology, and Chemistry.

RL EnvScientific ReasoningScience

Swe Gym Lite

This is a lite version of the SWE-Gym environment.

RL EnvSoftware Engineering Tasks

BFCL

BFCL is an evaluation of LLMs' ability to call functions and tools. The dataset represents common function calling use-cases in agents and enterprise workflows.

RL EnvFunction Calling Evaluation

MolecularSafety

An OpenReward environment for classifying molecular safety across multiple toxicity endpoints from SMILES notation.

RL EnvMolecular Property Prediction

LongFact

LongFact is a prompt set of 2,280 fact-seeking prompts requiring long-form responses.

RL EnvLong Form Factuality Evaluation

Ineqmath

Expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations.

RL EnvMathematical Reasoning

SolPredict

An OpenReward environment where agents train ML models to predict aqueous solubility (LogS) from molecular SMILES notation.

RL EnvMolecular Property Prediction

HLE

Humanity's Last Exam (HLE) is an LLM benchmark consisting of over 2,500 expert-level questions across a broad range of subjects.

RL EnvAcademic Question Answering

BullshitBenchv2

BullshitBenchv2 measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions. This is an implementation of a benchmark by PeterGPT.

RL EnvUnanswerable Question Benchmarking for LLMs

RealLawyer

RealLawyer simulates a real-life law firm workflow with simulated clients, realistic datarooms, and more.

RL EnvLegal ReasoningLegal

VolleyBench

VolleyBench is an environment which tests an agent's ability to predict Women's World Championship volleyball matches, where reward is based on real market odds.

RL EnvSports Analytics

BioReason

BioReason is a collection of environments for assessing biological reasoning.

RL EnvScientific ReasoningScience

Polymath

PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and four easy-to-hard difficulty levels. It ensures comprehensive difficulty coverage, language diversity, and high-quality translations to provide a highly discriminative testbed for evaluating…

RL EnvMultilingual Mathematical Reasoning

SECQUE

SECQUE is a comprehensive benchmark for evaluating large language models (LLMs) in financial analysis tasks.

RL EnvFinancial Analysis Evaluation

DeepSynth

DeepSynth is an environment for automatically synthesizing programs from examples. It combines machine learning predictions with efficient enumeration techniques in a very generic way.

RL EnvProgram SynthesisCode GenerationCode

MMLU

Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several other versions and spin-offs, such as MMLU-Pro, MMMLU and MMLU-Redux.

RL EnvQuestion Answering

ScienceAgentBench

ScienceAgentBench is a benchmark for evaluating language agents for data-driven scientific discovery.

RL EnvData Driven Scientific DiscoveryData Driven Scientific Discovery TasksData Driven Discovery

NL2RepoBench

NL2Repo is a benchmark designed to evaluate the performance of Large Language Models (LLMs) and coding agents on long-horizon tasks that require generating a complete, runnable code repository from scratch (0-to-1). The benchmark consists of 104 distinct tasks, each paired wit…

RL EnvEnd to End Software Development

WildTicTacToe

WildTicTacToe is an environment for evaluating agents on tactical gameplay in Wild Tic-Tac-Toe, a variant where players can place either X or O on any empty position. This environment wraps the WildTicTacToe implementation.

RL EnvDecision Making in Games

FeatureBench

FeatureBench is a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development.

RL EnvAgentic Coding for Feature DevelopmentSoftware Engineering Tasks

KellyBench

KellyBench is a benchmark that tests an agent's ability to build machine learning models for predicting football matches and to bet against market odds.

RL EnvAI Research Tasks

Bioreason Pro RL Reasoning Data

Training dataset for reinforcement learning (GRPO) optimization of BioReason-Pro. Contains proteins with GO term annotations, InterPro domains, STRING protein-protein interactions, and protein metadata.

RL EnvScientific Reasoning in Genomics

LawBench

LawBench has been meticulously crafted to assess LLMs’ legal capabilities precisely at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize needed legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs…

RL EnvLegal Reasoning and Document UnderstandingLegal ReasoningLegal Text Reading ComprehensionLegal

SWE Atlas QnA

Codebase QnA is the first benchmark in the SWE-Atlas suite. It evaluates AI agents on deep code comprehension - tracing execution paths, explaining architectural decisions, and answering deeply technical questions about production-grade software systems.

RL EnvCode Understanding and Reasoning

Poker

Poker is an environment for evaluating agents on strategic decision-making in Texas Hold'em Poker, testing betting strategies, bluffing, and probabilistic reasoning.

RL EnvPoker Playing Ability Evaluation

DS1000

DS-1000 is a code generation benchmark with a thousand data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.

RL EnvData Science Tasks

FrontierCO

FrontierCO is a curated benchmark suite for evaluating ML-based solvers on large-scale and real-world Combinatorial Optimization (CO) problems. The benchmark spans 8 classical CO problems across 5 application domains, providing both training and evaluation instances specifical…

RL EnvOptimization Modeling with LLMs

IteratedPrisonersDilemma

IteratedPrisonersDilemma is an environment for evaluating agents on cooperation and defection strategies in the classic game theory dilemma.

RL EnvStrategic Decision Making in Game Theory

IteratedMatchingPennies

IteratedMatchingPennies is an environment for evaluating agents on mixed strategy equilibrium play in a classic game theory scenario.

RL EnvDecision Making in Games

LinesofAction

LinesOfAction is an environment for evaluating agents on the abstract strategy board game where players connect their pieces into a single group.

RL EnvDecision Making in Games

WinAsMuchAsYouCan

WinAsMuchAsYouCan is an environment for evaluating agents on strategic decision-making and cooperation in a multi-player coordination game.

RL EnvDecision Making in Games

TowerOfHanoi

TowerOfHanoi is an environment for evaluating agents on solving the classic Tower of Hanoi puzzle with varying difficulty levels.

RL EnvDecision Making in Games
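For reference, the classic puzzle has a well-known recursive solution that takes 2^n - 1 moves for n disks. The sketch below is independent of the environment's own action format; the peg labels "A", "B", "C" and the function names are illustrative assumptions.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move list for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest disk, then restack.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


def replay(n, moves):
    """Replay moves from the standard start position, rejecting illegal ones."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            raise ValueError("larger disk placed on smaller disk")
        pegs[dst].append(disk)
    return pegs
```

Replaying the generated moves is a cheap way to check a candidate solution before submitting it.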

LiarsDice

LiarsDice is an environment for evaluating agents on strategic bluffing and probabilistic reasoning in Liar's Dice, a dice game where players must bid or call bluffs.

RL EnvDecision Making in Games

MemoryGame

MemoryGame is an environment for evaluating agents on the classic memory/concentration card-matching game.

RL EnvDecision Making in Games

ColonelBlotto

ColonelBlotto is an environment for evaluating agents on playing Colonel Blotto, a resource allocation strategy game, against an LLM opponent. This environment wraps the ColonelBlotto implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Minesweeper

Minesweeper is an environment for evaluating agents on spatial reasoning, probabilistic inference, and strategic exploration.

RL EnvSpatial Reasoning

ConnectFour

ConnectFour is an environment for evaluating agents on playing the classic Connect Four game against an LLM opponent. This environment wraps the ConnectFour implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Stratego

Stratego is an environment for evaluating agents on the classic strategy board game of hidden information and tactical combat.

RL EnvDecision Making in Games

PigDice

PigDice is an environment for evaluating agents on the dice game where players balance risk and reward to reach a target score.

RL EnvDecision Making in Games

TruthAndDeception

TruthAndDeception is an environment for evaluating agents on social deduction and persuasion through natural conversation.

RL EnvSocial Deduction Game Evaluation

Nim

Nim is an environment for evaluating agents on the classic mathematical strategy game where players remove objects from piles.

RL EnvDecision Making in Games
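Under the normal-play convention (the player who takes the last object wins), Nim has a known optimal strategy based on the XOR of the pile sizes, often called the nim-sum. The sketch below assumes that convention, which this environment may or may not use; the function names are illustrative.

```python
from functools import reduce
from operator import xor


def nim_sum(piles):
    """XOR of all pile sizes; zero means the player to move is losing."""
    return reduce(xor, piles, 0)


def winning_move(piles):
    """Return (pile_index, objects_to_remove) leaving a zero nim-sum, or None."""
    total = nim_sum(piles)
    if total == 0:
        return None  # every move hands the opponent a winning position
    for index, pile in enumerate(piles):
        target = pile ^ total
        if target < pile:
            return index, pile - target
    return None  # not reached when total != 0
```

For piles of 3, 4, and 5 the nim-sum is 2, so removing two objects from the first pile leaves 1, 4, 5, whose nim-sum is zero.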

IteratedUltimatumGame

IteratedUltimatumGame is an environment for evaluating agents on fairness, negotiation, and strategic bargaining.

RL EnvDecision Making in Games

SimpleTak

SimpleTak is an environment for evaluating agents on a simplified version of Tak played on a 4x4 grid.

RL EnvDecision Making in Games

GuessWho

GuessWho is an environment for evaluating agents on the classic Guess Who game, where agents must identify a target character by asking yes-or-no questions about their traits. This environment wraps the GuessWho implementation from TextArena, a framework for text-based game en…

RL EnvDecision Making in Games

Hangman

Hangman is an environment for evaluating agents on word guessing and deductive reasoning tasks. This environment wraps the Hangman implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

Cryptarithm

Cryptarithm is an environment for evaluating agents on cryptarithmetic puzzles, where letters must be mapped to unique digits to make an arithmetic equation valid. This environment wraps the Cryptarithm implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games
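The validity rule for a cryptarithm (each letter maps to a unique digit and the arithmetic must hold) can be checked programmatically. The following is a minimal sketch for addition puzzles, not the TextArena implementation; the function names and the ban on leading zeros are assumptions.

```python
from itertools import permutations


def to_number(word, mapping):
    """Read a word as a base-10 number under a letter-to-digit mapping."""
    return int("".join(str(mapping[letter]) for letter in word))


def is_valid(addends, total, mapping):
    """Check that the mapping makes sum(addends) == total a true equation."""
    words = [*addends, total]
    letters = set("".join(words))
    if not letters.issubset(mapping):
        return False  # some letter has no digit assigned
    digits = [mapping[letter] for letter in letters]
    if len(set(digits)) != len(digits):
        return False  # two letters share a digit
    if any(mapping[word[0]] == 0 for word in words):
        return False  # leading zero (assumed to be disallowed)
    return sum(to_number(w, mapping) for w in addends) == to_number(total, mapping)


def solve(addends, total):
    """Brute-force search over digit assignments; fine for small puzzles."""
    letters = sorted(set("".join([*addends, total])))
    for digits in permutations(range(10), len(letters)):
        mapping = dict(zip(letters, digits))
        if is_valid(addends, total, mapping):
            return mapping
    return None
```

For example, the mapping S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2 satisfies SEND + MORE = MONEY, since 9567 + 1085 = 10652. Brute force grows quickly with the number of distinct letters, so larger puzzles benefit from column-by-column pruning.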

DiscoX

DiscoX is a benchmark for discourse-level and expert-level Chinese-English translation designed to evaluate discourse coherence and strict terminological precision in professional-domain texts.

RL EnvMachine Translation

IteratedStagHunt

IteratedStagHunt is an environment for evaluating agents on coordination and cooperation in a game with multiple equilibria.

RL EnvDecision Making in Games

SimpleBlindAuction

SimpleBlindAuction is an environment for evaluating agents on simultaneous sealed-bid auctions for multiple items.

RL EnvStrategic Planning and Execution in Auctions

GravityBench

Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations.

RL EnvGravitational Physics Discovery

TerminalBench2

Terminal-Bench is a popular benchmark for measuring the capabilities of agents and language models to perform valuable work in containerized environments. Tasks include assembling proteins for synthesis, debugging async code, and resolving security vulnerabilities.

RL EnvCommand Line Interface Tasks

OpenResearcher

This is a port of the dataset used to train the OpenResearcher model on long-horizon deep research. Based on https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Dataset.

RL EnvDeep Research Tasks

Geoguessr

A geoguessing environment where an agent needs to guess the country given an input image.

RL EnvGeolocalization

Crustbench

CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation

RL EnvCode TranslationCode GenerationCode

ProcBench

ProcBench is a benchmark for directly evaluating LLMs' multi-step inference ability by providing pairs of explicit instructions and corresponding questions where the full procedures needed to solve each problem are specified, minimizing path exploration and implicit knowledge…

RL EnvMulti Step Reasoning and Following Procedure

IMO Bench

IMO-Bench is a suite of advanced reasoning benchmarks. IMO-Bench consists of three benchmarks that judge models on diverse capabilities: IMO-AnswerBench - a large-scale test on getting the right answer, IMO-ProofBench - a next-level evaluation for proof writing, and IMO-Gradin…

RL EnvMathematical Reasoning

IPLBench

IPLBench tests agents' ability to build predictive models of IPL cricket matches and bet against market odds.

RL EnvSports Analytics

FinQA

An agentic implementation of https://github.com/czyssrs/FinQA

RL EnvFinancial Question Answering

TrialQATrain

TrialQATrain is a training environment that tests question answering and retrieval for clinical trials.

RL EnvMedical Question Answering

DCS

DCS is a simulation of a departure control system.

RL EnvAgentic AI Evaluation

POLARIS 53K

Implementation of the POLARIS-53K mathematics dataset.

RL EnvMathematical Reasoning

MolScent

OpenReward environment for matching scent/smell to molecules. Follows ether0's property-cat-smell pattern with multiple-choice questions and string-match verification.

RL EnvMolecular Property Prediction

LetterAuction

LetterAuction is an environment for evaluating agents on strategic letter auctions followed by word formation.

RL EnvDecision Making in Games

BBBPerm

An OpenReward environment for evaluating agents on blood-brain barrier (BBB) permeability tasks. Agents must either classify molecules by their ability to cross the BBB, or modify non-permeable molecules to become permeable.

RL EnvMolecular Property Prediction

ReasoningGym

Reasoning Gym is a community-created Python library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL).

RL EnvReasoning and Verifiable Reward Reinforcement Learning

RefSeqTrain

RefSeqTrain is a training environment for genomics question answering about NCBI RefSeq and Gene database records. Agents are given questions about specific verifiable facts from RefSeq gene, transcript, and protein records and must use web search to find and verify answers fr…

RL EnvScientific Literature Retrieval

ScrapeBench

ScrapeBench is an environment that tests an agent's ability to extract information from publicly available websites.

RL EnvWeb Agent Evaluation

UFCBench

A benchmark that tests an AI agent's ability to build models that predict UFC matches and to bet against market odds.

RL EnvSports Analytics

BrowseComp ZH

BrowseComp-ZH is the first high-difficulty benchmark specifically designed to evaluate the real-world web browsing and reasoning capabilities of large language models (LLMs) in the Chinese information ecosystem.

RL EnvWeb Browsing and Multi Hop Question Answering in Chinese

GeneralReasoner

Implementation of https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified

RL EnvGeneralized Reasoning Evaluation

Briscola

Briscola is an environment for evaluating agents on playing Briscola, an Italian trick-taking card game, against an LLM opponent. This environment wraps the Briscola implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

MathCanvas

MathCanvas is a benchmark for evaluating visual-aided mathematical reasoning by requiring models to produce interleaved visual–textual solutions.

RL EnvMultimodal Mathematical Reasoning

BrowseComp

BrowseComp is a benchmark for measuring agents' web-browsing ability, comprising 1,266 questions that require persistent navigation to locate hard-to-find, entangled information. It yields short, easily verifiable answers and tests agents' persistence and creativity in finding…

RL EnvWeb Browsing Agent Evaluation

Kramabench

Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain k…

RL EnvSoftware Engineering TasksData Science Tasks

AIME2024

Problems from the American Invitational Mathematics Examination (AIME) 2024.

RL EnvMathematical Reasoning

SMT2025

The Stanford Math Tournament (SMT) is a prestigious annual math competition hosted by Stanford University.

RL EnvMathematical Reasoning

NCBIGenomeTrain

NCBIGenomeTrain is a training environment for genome-level question answering about the hg38 human reference genome. Each question requires retrieving or computing verifiable facts from the GRCh38/hg38 assembly, such as reference DNA sequences at specific coordinates, GC conte…

RL EnvScientific Literature Retrieval

SourceQualityTrain

SourceQualityTrain is a training environment for systematic review source quality assessment, based on the SourceQuality benchmark from LAB-Bench. Agents are given questions about why specific studies were excluded from systematic reviews, and must use web search to identify t…

RL EnvScientific Literature Retrieval

Qcircuitbench

Evaluate the capabilities of AI in designing and implementing quantum algorithms from the perspective of code generation.

RL EnvScientific ReasoningQuantum Algorithm DesignCode GenerationScience

Spreadsheetrl

Agentic Excel workbook editing: edit an xlsx to reach a requested end state, graded by workbook recalculation and answer-region cell match (SpreadsheetBench + ExcelForum; arXiv:2605.22642).

RL EnvSpreadsheet Manipulation

FrontierFinance

Agentic financial-research benchmark from Samaya AI: 220 expert investor queries spanning the investor workflow (financial data/modeling, sector/macro, earnings, company research, catalyst monitoring, screening). The agent researches each query via web search as of the query's…

RL EnvFinancial Reasoning

Code Contests

OpenThought code contests environment

RL EnvCode GenerationCode

EquityResearch

Sell-side equity research report writing on real listed companies. Agent gets a terse analyst-style coverage brief, researches current public information via web search, and submits a .docx report with its own estimates and an explicit fair value/target price. Graded per-crite…

RL EnvEquity Research Report Generation

DeckSmith

Research-grounded slide-deck generation across knowledge-work domains (finance, IB pitch books, legal, pharma, PE/real-estate, insurance, energy, academic). Agent gets a topic/deliverable brief, researches via web search, authors a PPTX; graded by a multimodal gpt-5-mini rubri…

RL EnvPowerpoint Task Completion

Yes No Black White

Forbidden-words conversation game (Yes/No/Black/White): hold an absolute never-say rule against an adversarial opponent under heavy context, prompt injection, and a negation-ban hard mode. Programmatic reward, no LLM grader.

RL Env

Anotherdeepresearch

Rubric-graded web + deep research: real search questions and deep-research briefs answered with live web search (Tavily), scored against per-task factual rubrics.

RL EnvDeep Research Tasks

FPL

FPL is an environment that tests an agent's ability to play fantasy football for the English Premier League.

RL EnvSports Analytics

CtfFlag

Solve real Capture-The-Flag challenges (crypto, forensics, misc, reversing) in a sandbox and submit the flag; graded by exact flag match server-side. 59 challenges, each with a verified offline solver.

RL Env

AirlineRM

AirlineRM is an airline network revenue management environment where an agent operates a hub-and-spoke carrier over a 30-day horizon. The agent makes daily decisions about fare class availability (opening/closing 8 nested fare buckets), overbooking limits, and disruption respo…

RL Env

Widesearch

Many real-world information-gathering tasks are not hard, just huge. Consider a financial analyst compiling key metrics for all companies in a sector, or a job seeker collecting every vacancy that meets their criteria. The challenge isn't cognitive complexity, but the sheer sc…

RL Env

DAComp DA

DAComp-DA is a benchmark of 100 data science tasks that pose open-ended business problems demanding strategic planning and insight synthesis.

RL EnvData Analysis and Data Modeling TasksData Analytics Insight DiscoveryData Analysis Tasks

ATC

An air traffic control simulation where an agent manages arrivals, departures, gate assignments, holding patterns, diversions, and runway configurations during weather disruptions at a realistic hub airport (Metro Hub International, inspired by JFK). The environment features a…

RL EnvAir Traffic Control Agent Evaluation

MicrogridGym

MicrogridGym is an environment for tuning cascaded PI controllers and droop coefficients for three-phase power electronic inverters in microgrid configurations. Based on the physics from the OpenModelica Microgrid Gym (OMG) toolbox, it implements a pure Python simulation of in…

RL EnvEnergy Management System Benchmarking

ICUSepsis

ICU-Sepsis is an environment for evaluating agents on a tabular Markov Decision Process (MDP) that models sepsis treatment in the intensive care unit. Agents select treatment actions representing combinations of vasopressor and IV fluid doses to maximize patient survival proba…

RL EnvSurvival Prediction

PrincipiaBench

Principia Bench is a benchmark designed to evaluate language models' ability to derive mathematical objects from STEM-related problem statements. Each instance contains a problem statement and a ground-truth answer. The problem statements are drawn from four benchmarks-RealMat…

RL EnvScientific ReasoningChemical ReasoningPhysics Problem SolvingLegal

DiscoveryBench

DiscoveryBench is designed to systematically assess current model capabilities in data-driven discovery tasks and provide a useful resource for improving them. Each DiscoveryBench task consists of a goal and dataset(s). Solving the task requires both statistical analysis and s…

RL EnvScientific ReasoningData Driven Scientific DiscoveryData Driven Scientific Discovery TasksScience

PowerGrid

PowerGrid is a power grid operator environment where agents dispatch generators, manage battery storage, handle renewable variability, and maintain grid frequency across crisis scenarios inspired by the 2021 Texas winter storm, the 2003 Northeast blackout, and the 2016 South A…

RL EnvPower Grid Control with Reinforcement Learning

BountyBench

BountyBench is a framework for capturing offensive and defensive cyber-capabilities in evolving real-world systems. The benchmark covers 25 systems with complex, real-world codebases and includes 40 bug bounties that cover 9 of the OWASP Top 10 Risks.

RL EnvAutomated Vulnerability RepairSoftware Vulnerability DetectionVulnerability Detection and Patching

CTF

CTF (Capture the flag) is an environment where agents attempt to find text strings - called flags - which are secretly hidden in purposefully vulnerable programs or websites.

RL EnvCapture the Flag Challenges

AIRS Bench

The AI Research Science Benchmark is an eval that quantifies the autonomous research abilities of LLM agents in the area of machine learning. AIRS-Bench comprises 20 tasks from state-of-the-art machine learning papers spanning diverse domains such as NLP, Code, Math, biochemic…

RL EnvAI Research TasksMachine Learning Engineering

GuessTheNumber

GuessTheNumber is an environment for evaluating agents on the classic number guessing game where agents receive feedback on whether their guess is too high or too low. This environment wraps the GuessTheNumber implementation from TextArena, a framework for text-based game envi…

RL EnvDecision Making in Games
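
The too-high/too-low feedback described for GuessTheNumber lends itself to binary search. The following is a minimal sketch of that strategy, not the environment's actual interface: the `feedback` callable, its return strings, and the 1 to 20 range are all assumptions made for illustration.

```python
def guess_number(feedback, low=1, high=20):
    """Find a hidden number by halving the candidate range each turn.

    `feedback(guess)` is a hypothetical callback returning "correct",
    "too low", or "too high"; the default range is also an assumption.
    Returns (secret, number_of_guesses).
    """
    attempts = 0
    while low <= high:
        guess = (low + high) // 2  # midpoint of the remaining range
        attempts += 1
        result = feedback(guess)
        if result == "correct":
            return guess, attempts
        if result == "too low":
            low = guess + 1   # secret lies above the guess
        else:
            high = guess - 1  # secret lies below the guess
    raise ValueError("feedback was inconsistent with the given range")


def make_feedback(secret):
    """Build a stand-in opponent for local testing (hypothetical helper)."""
    def feedback(guess):
        if guess == secret:
            return "correct"
        return "too low" if guess < secret else "too high"
    return feedback
```

With a range of 20 candidates, this strategy needs at most 5 guesses, since each guess removes at least half of the remaining numbers.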

Checkers

Checkers is an environment for evaluating agents on playing the classic Checkers game against an LLM opponent. This environment wraps the Checkers implementation from TextArena, a framework for text-based game environments.

RL EnvDecision Making in Games

PublicGoodsGames

PublicGoodsGame is an environment for evaluating agents on economic decision-making and social cooperation in a public goods game.

RL EnvDecision Making in Games

PegJump

PegJump is an environment for evaluating agents on strategic planning and sequential reasoning.

RL EnvStrategic Gaming Evaluation

CrosswordsEnv

Crosswords is an environment for evaluating agents on crossword puzzle solving.

RL EnvDecision Making in Games

Santorini

Santorini is an environment for evaluating agents on strategic gameplay in Santorini, an abstract board game where players move workers and build structures to reach the third level.

RL EnvDecision Making in Games

Arc Agi 2

ARC-AGI-2, the next iteration of the benchmark, is designed to stress-test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.

RL EnvAbstract Visual Reasoning

MRCRV2

OpenAI MRCR (Multi-round co-reference resolution) is a long-context dataset for benchmarking an LLM's ability to distinguish between multiple needles hidden in context.

RL EnvLong Context Language Model Evaluation

Chess

The chess environment has the agent play against Stockfish at varying difficulties (skill levels 0-20). The input format is UCI notation. The agent can play as white or black, and there are two environment variants: - ChessTextEnv: Observations are FEN strings (board repres…

RL EnvChessGames
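
As an illustration of the two formats the Chess entry mentions, the sketch below validates a UCI move string and reads the side to move from a FEN string, using only the standard library. The helper names `parse_uci` and `side_to_move` are hypothetical and are not part of the environment; the sketch checks syntax only, not move legality.

```python
import re

# UCI moves are a source square, a target square, and an optional
# lowercase promotion piece, e.g. "e2e4" or "e7e8q".
UCI_MOVE = re.compile(r"^([a-h][1-8])([a-h][1-8])([qrbn])?$")


def parse_uci(move):
    """Split a UCI move into (from_square, to_square, promotion_or_None)."""
    match = UCI_MOVE.match(move)
    if match is None:
        raise ValueError(f"not a syntactically valid UCI move: {move!r}")
    return match.group(1), match.group(2), match.group(3)


def side_to_move(fen):
    """Return "white" or "black" from the second field of a FEN string."""
    fields = fen.split()
    if len(fields) != 6:
        raise ValueError("a full FEN string has six space-separated fields")
    if fields[1] not in ("w", "b"):
        raise ValueError("the active-color field must be 'w' or 'b'")
    return "white" if fields[1] == "w" else "black"


START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
```

For example, `parse_uci("e7e8q")` yields the source square, the target square, and the promotion piece, while `side_to_move(START_FEN)` reports that white moves first.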

Terminal Bench 2 Verified

Verified tasks from Terminal Bench 2, created by Z.ai.

RL EnvCommand Line Interface TasksComputer Use Agent Evaluation

PatentQATrain

PatentQATrain is a training environment for patent question answering, based on the PatentQA task from LAB-Bench-2. Agents are given questions about specific details from patents across diverse technology domains and must use web search to find and verify answers from Google P…

RL EnvIntellectual Property Tasks

APEX Agents

APEX-Agents is a benchmark from Mercor for evaluating whether AI agents can execute long-horizon, cross-application professional services tasks. Tasks were created by investment banking analysts, management consultants, and corporate lawyers, and require agents to navigate rea…

RL EnvAgentic AI Evaluation

OpenSWE

OpenSWE: 8,876 quality-filtered executable SWE tasks across 12.8k Python repos (GAIR-NLP)

RL EnvSoftware EngineeringCode