General Reasoning is an org.
End-to-end evaluation of AI systems in complete workflows from input to final output
SpreadsheetBench evaluates large language model agents' capabilities in manipulating complex real-world spreadsheets. The benchmark includes 912 real questions gathered from online Excel forums, covering a variety of tabular data such as multiple tables, non-standard relational…
Principia Collection is a large-scale dataset designed to enhance language models’ ability to derive mathematical objects from STEM-related problem statements. Each instance contains a problem statement, a ground truth answer, an answer type, and a topic label. The topics are…
OfficeQA is a benchmark for evaluating AI agents on grounded, multi-document reasoning over a large and heterogeneous document corpus. The corpus consists of U.S. Treasury Bulletins spanning nearly 100 years, comprising 89,000 pages and over 26 million numerical values. Office…
Debate is an environment for evaluating agents on persuasive argumentation where a panel of LLM judges determines the winner. This environment wraps the Debate implementation from TextArena, a framework for text-based game environments.
GolfCardGame is an environment for evaluating agents on strategic decision-making in the Golf card game, where the goal is to achieve the lowest score. This environment wraps the Golf implementation from TextArena, a framework for text-based game environments.
The Nemotron-RL-math-stack_overflow dataset contains mathematical problems and solutions sourced from the Stack Overflow forums. This is an implementation of https://huggingface.co/datasets/nvidia/Nemotron-RL-math-stack_overflow.
MEDEC is the first publicly available benchmark for medical error detection and correction in clinical notes, covering five error types (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism).
APEX–Agents is a benchmark from Mercor for evaluating whether AI agents can execute long-horizon, cross-application professional services tasks. Tasks were created by investment banking analysts, management consultants, and corporate lawyers, and require agents to navigate rea…
A game of exploring and racing through Wikipedia articles!
YearGuessr is the largest open benchmark for evaluating vision-language models’ ability to predict buildings’ construction years and to expose popularity-driven memorization bias. It contains 55,546 building images from 157 countries with multimodal attributes (continuous ordi…
ControlEval is an evaluation dataset that comprises 500 control tasks with various specific design goals.
FPL is an environment that tests an agent's ability to play fantasy football for the English Premier League.
Suduko is an environment for evaluating agents on solving Sudoku puzzles of varying difficulty levels.
SpiteAndMalice is an environment for evaluating agents on the competitive card game Spite and Malice.
UltimateTicTacToe is an environment for evaluating agents on strategic gameplay in Ultimate Tic-Tac-Toe, a complex variant where winning three sub-boards in a row determines victory.
Zebra puzzles (logic grid puzzles) with varying difficulty
DAComp-DA is a benchmark of 100 data science tasks that pose open-ended business problems demanding strategic planning and insight synthesis.
PortManager is a container port terminal management environment where agents schedule cranes, berths, yard storage, and truck/rail departures over a 168-hour (1-week) planning horizon. The simulation models a medium-large port with realistic vessel arrivals, STS crane operatio…
MobileEnv is an open, minimalist environment for training and evaluating coordination algorithms in wireless mobile networks. The environment allows modeling users moving around an area and can connect to one or multiple base stations.
DiscoveryBench is designed to systematically assess current model capabilities in data-driven discovery tasks and provide a useful resource for improving them. Each DiscoveryBench task consists of a goal and dataset(s). Solving the task requires both statistical analysis and s…
An air traffic control simulation where an agent manages arrivals, departures, gate assignments, holding patterns, diversions, and runway configurations during weather disruptions at a realistic hub airport (Metro Hub International, inspired by JFK). The environment features a…
PLawBench is a rubric-based benchmark designed to evaluate the performance of large language models (LLMs) in legal practice. It includes three legal tasks: legal consultation, case analysis, and legal document drafting, covering a wide range of real-world legal domains such a…
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering tasks. It consists of 82 ML engineering-related competitions from Kaggle.
ICUCoordinator is a hospital bed management simulation where agents act as the bed coordinator at a 150-bed community hospital. Agents must allocate beds, manage staffing, handle patient admissions and discharges, schedule operating rooms, and make ambulance diversion decision…
Chernobyl is a nuclear power plant management environment where agents operate reactors through crisis scenarios inspired by four real historical disasters: Chernobyl (1986), Three Mile Island (1979), Fukushima Daiichi (2011), and Windscale (1957). The simulation couples point…
PowerGrid is a power grid operator environment where agents dispatch generators, manage battery storage, handle renewable variability, and maintain grid frequency across crisis scenarios inspired by the 2021 Texas winter storm, the 2003 Northeast blackout, and the 2016 South A…
NetHack is an environment for evaluating agents on the classic roguelike dungeon exploration game NetHack, built on the NetHack Learning Environment (NLE). Agents explore procedurally generated dungeons by sending individual keystrokes and receiving ASCII terminal screen obser…
DeepSynth is an environment for automatically synthesizing programs from examples. It combines machine learning predictions with efficient enumeration techniques in a very generic way.
ADE-bench is a framework for evaluating AI agents on data analyst tasks. It is designed to represent real-world data work with messy projects featuring hundreds of tables, often with multiple tables that could conceivably represent the same entity.
FoodDelivery is a food delivery dispatch optimization environment. An agent manages a fleet of e-bike couriers across a procedurally generated city, making real-time decisions about courier-order matching, order batching, surge pricing, and fleet repositioning under stochastic…
EsoLang-Bench is a benchmark for evaluating genuine reasoning in large language models via esoteric programming languages. It is designed to be resistant to data contamination and benchmark gaming, measuring transferable computational reasoning rather than memorization.
TextWorld Simple is an environment for evaluating language model agents on text-based adventure games using Microsoft's TextWorld framework. Agents must navigate a 6-room house, find a key to escape a locked bedroom, locate a food item, and cook it on the stove -- all through…
Principia Bench is a benchmark designed to evaluate language models' ability to derive mathematical objects from STEM-related problem statements. Each instance contains a problem statement and a ground-truth answer. The problem statements are drawn from four benchmarks-RealMat…
MicrogridGym is an environment for tuning cascaded PI controllers and droop coefficients for three-phase power electronic inverters in microgrid configurations. Based on the physics from the OpenModelica Microgrid Gym (OMG) toolbox, it implements a pure Python simulation of in…
Threadneedle is a monetary policy environment where an AI agent plays the role of the Bank of England's Monetary Policy Committee (MPC), setting Bank Rate and quantitative easing (QE) each quarter over 20 quarters. The economy evolves according to a simplified Heterogeneous Ag…
MarketEntryGame is an environment for evaluating agents on strategic decision-making in a market entry game with capacity constraints.
ReverseTicTacToe is an environment for evaluating agents on the inverse of classic tic-tac-toe where getting three in a row loses.
The AI Research Science Benchmark is an eval that quantifies the autonomous research abilities of LLM agents in the area of machine learning. AIRS-Bench comprises 20 tasks from state-of-the-art machine learning papers spanning diverse domains such as NLP, Code, Math, biochemic…
NCBIGenomeTrain is a training environment for genome-level question answering about the hg38 human reference genome. Each question requires retrieving or computing verifiable facts from the GRCh38/hg38 assembly, such as reference DNA sequences at specific coordinates, GC conte…
IteratedMatchingPennies is an environment for evaluating agents on mixed strategy equilibrium play in a classic game theory scenario.
HighSociety is an environment for evaluating agents on auction-based resource management and bidding strategy.
SimpleNegotiation is an environment for evaluating agents on resource negotiation and strategic trading.
GDPval is a benchmark for evaluating AI model capabilities on real-world economically valuable tasks. We use the v2 version of the benchmark for this implementation which comes with rubric based grading.
FrontierCS is a benchmark of 156 expert-designed, open-ended computer science problems across diverse areas that require models to produce executable programs (with an expert reference solution and automatic evaluator provided for each problem) rather than direct answers. It t…
IteratedStagHunt is an environment for evaluating agents on coordination and cooperation in a game with multiple equilibria.
NewRecruit is an environment for evaluating agents on negotiation skills in a recruitment scenario where the recruiter and candidate must agree on salary, bonus, and job assignment.
KuhnPoker is an environment for evaluating agents on simplified poker with a 3-card deck.
SecretMafia is an environment for evaluating agents on social deduction, persuasion, and strategic deception in a Mafia-style game.
Chess is an environment for evaluating agents on playing chess against an LLM opponent. This environment wraps the Chess implementation from TextArena, a framework for text-based game environments.
DS-1000 is a code generation benchmark with a thousand data science questions spanning seven Python libraries that (1) reflects diverse, realistic, and practical use cases, (2) has a reliable metric, (3) defends against memorization by perturbing questions.
Skywork-OR1-RL-Data is a dataset of verifiable, challenging, and diverse math problems (105K) and coding questions (14K).
BountyBench is a framework for capturing offensive and defensive cyber-capabilities in evolving real-world systems. The benchmark comprises 25 systems with complex, real-world codebases and includes 40 bug bounties that cover 9 of the OWASP Top 10 Risks.
Countdown is an environment for evaluating agents on the Countdown Numbers game, where agents must reach a target number by combining available numbers with arithmetic operations. This environment wraps the Countdown implementation from TextArena, a framework for text-based ga…
TwoDollar is an environment for evaluating agents on economic negotiation and game-theoretic reasoning.
Agent World Model (AWM) is a fully synthetic environment generation pipeline that synthesizes 1,000 executable, SQL database-backed tool-use environments exposed via unified MCP interface for large-scale multi-turn agentic reinforcement learning. This is a port of [agent-worl…
GraphWalks is a single-turn environment that tests an agent's ability to perform graph operations on directed graphs presented as edge lists. Each task provides a directed graph and asks the agent to execute a specific operation - either finding the parent nodes of a target no…
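The parent-node operation mentioned in the GraphWalks description can be illustrated with a minimal sketch. It assumes the edge list is given as `(source, destination)` pairs; the function name `parent_nodes` and the sample graph are hypothetical and not taken from the GraphWalks implementation.

```python
def parent_nodes(edges, target):
    """Return the sorted nodes that have a directed edge into `target`.

    `edges` is assumed to be an iterable of (source, destination) pairs.
    """
    return sorted({src for src, dst in edges if dst == target})


# Hypothetical example graph, for illustration only.
edges = [("a", "b"), ("c", "b"), ("b", "d"), ("a", "d")]
print(parent_nodes(edges, "b"))  # ['a', 'c']
```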
ThreePlayerGOPS is an environment for evaluating agents on strategic card play in the Game of Pure Strategy.
GuessTheNumber is an environment for evaluating agents on the classic number guessing game where agents receive feedback on whether their guess is too high or too low. This environment wraps the GuessTheNumber implementation from TextArena, a framework for text-based game envi…
CharacterConclave is an environment for evaluating agents on persuasive communication and impression management in a social competition. This environment wraps the CharacterConclave implementation from TextArena, a framework for text-based game environments.
Mastermind is an environment for evaluating agents on code-breaking and deductive reasoning with feedback.
SimpleBlindAuction is an environment for evaluating agents on simultaneous sealed-bid auctions for multiple items.
LogicPuzzle is an environment for evaluating agents on deductive reasoning and logic grid puzzle solving.
RE-Bench consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts.
LinesOfAction is an environment for evaluating agents on the abstract strategy board game where players connect their pieces into a single group.
SettlersOfCatan is an environment for evaluating agents on strategic resource management and territory expansion in Settlers of Catan.
WildTicTacToe is an environment for evaluating agents on tactical gameplay in Wild Tic-Tac-Toe, a variant where players can place either X or O on any empty position. This environment wraps the WildTicTacToe implementation.
DSBC assesses performance across a diverse set of eight data science task categories
2048 is an environment for evaluating agents on the classic sliding tile puzzle game. This environment wraps the 2048 implementation from TextArena, a framework for text-based game environments.
GermanWhist is an environment for evaluating agents on trick-taking card game strategy. This environment wraps the GermanWhist implementation from TextArena, a framework for text-based game environments.
IteratedUltimatumGame is an environment for evaluating agents on fairness, negotiation, and strategic bargaining.
FifteenPuzzle is an environment for evaluating agents on the classic sliding tile puzzle (also known as the 15-puzzle). This environment wraps the FifteenPuzzle implementation from TextArena, a framework for text-based game environments.
KernelBench is an open-source framework for evaluating language models' ability to generate fast, correct GPU kernels on a suite of 250 carefully selected PyTorch ML workloads representing a real-world engineering environment.
MMLU-ProX is a comprehensive benchmark for assessing cross-linguistic reasoning in LLMs across 29 languages, built on an English benchmark with each language version containing 11,829 identical questions (and a lite version of 658 questions per language) to enable direct compa…
Secretary is an environment for evaluating agents on optimal stopping and sequential decision-making under uncertainty.
Tak is an environment for evaluating agents on the full version of Tak, a strategic board game of roads and stacks.
DontSayIt is an environment for evaluating agents on playing Don't Say It, a conversational deception game where players try to make opponents say secret words, against an LLM opponent. This environment wraps the DontSayIt implementation from TextArena, a framework for text-ba…
LightsOut is an environment for evaluating agents on spatial reasoning and logic puzzle solving tasks.
WordSearch is an environment for evaluating agents on finding hidden words in letter grids.
Chopsticks is an environment for evaluating agents on playing Chopsticks, a hand game involving finger arithmetic and strategic redistribution, against an LLM opponent. This environment wraps the Chopsticks implementation from TextArena, a framework for text-based game environ…
Crusade is an environment for evaluating agents on playing Crusade, a chess-like strategy game with knight-movement mechanics, against an LLM opponent. This environment wraps the Crusade implementation from TextArena, a framework for text-based game environments.
IteratedPrisonersDilemma is an environment for evaluating agents on cooperation and defection strategies in the classic game theory dilemma.
SpellingBee is an environment for evaluating agents on word formation and vocabulary knowledge.
IndianPoker is an environment for evaluating agents on poker strategy with imperfect information.
TicTacToe is an environment for evaluating agents on playing Tic-Tac-Toe against an LLM opponent.
Cybench is a benchmark for evaluating the cybersecurity capabilities and risks of language models. Cybench includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
QuantumTicTacToe is an environment for evaluating agents on a quantum variant of tic-tac-toe where marks exist in superposition.
TowerofHanoi is an environment for evaluating agents on solving the classic Tower of Hanoi puzzle with varying difficulty levels.
Surround is an environment for evaluating agents on the Tron-style game where players leave trails and must avoid collisions.
DAPO is a mathematics dataset used for training the DAPO reasoning model, consisting of around 17k unique questions.
MemoryGame is an environment for evaluating agents on the classic memory/concentration card-matching game.
WinAsMuchAsYouCan is an environment for evaluating agents on strategic decision-making and cooperation in a multi-player coordination game.
GameOfPureStrategy is an environment for evaluating agents on strategic card play in a two-player competitive game. This environment wraps the GameOfPureStrategy implementation from TextArena, a framework for text-based game environments.
ScenarioPlanning is an environment for evaluating agents on creative problem-solving and strategy formulation for survival scenarios.
Stratego is an environment for evaluating agents on the classic strategy board game of hidden information and tactical combat.
Sokoban is an environment for evaluating agents on spatial planning and sequential puzzle solving.
PublicGoodsGame is an environment for evaluating agents on economic decision-making and social cooperation in a public goods game.
TwentyQuestions is an environment for evaluating agents on playing the classic Twenty Questions game against an LLM gamemaster.
Santorini is an environment for evaluating agents on strategic gameplay in Santorini, an abstract board game where players move workers and build structures to reach the third level.
Bandit is an environment for evaluating agents on the classic multi-armed bandit problem. This environment wraps the Bandit implementation from TextArena, a framework for text-based game environments.
Poker is an environment for evaluating agents on strategic decision-making in Texas Hold'em Poker, testing betting strategies, bluffing, and probabilistic reasoning.
IteratedRockPaperScissors is an environment for evaluating agents on pattern recognition and strategic play in the classic hand game.
PegJump is an environment for evaluating agents on strategic planning and sequential reasoning
AInsteinBench provides 244 scientific computing tasks derived from multiple scientific repositories. These tasks have been verified on execution and also reviewed by corresponding domain experts to verify both software engineering and scientific content accuracy. The tasks cov…
Othello is an environment for evaluating agents on the classic board game also known as Reversi.
Taboo is an environment for evaluating agents on creative communication and word description under constraints.
Battleship is an environment for evaluating agents on playing the classic Battleship guessing game against an LLM opponent. This environment wraps the Battleship implementation from TextArena, a framework for text-based game environments.
DABstep consists of over 450 data analysis tasks designed to evaluate the capabilities of state-of-the-art LLMs and AI agents.
Briscola is an environment for evaluating agents on playing Briscola, an Italian trick-taking card game, against an LLM opponent. This environment wraps the Briscola implementation from TextArena, a framework for text-based game environments.
Slitherlink is an environment for evaluating agents on logic puzzle solving with loop-drawing constraints.
CritPt is a benchmark designed to test LLMs on unpublished, research‑level reasoning tasks across modern physics subfields, comprising 71 composite research challenges decomposed into 190 simpler checkpoint tasks created by 50+ active researchers.
EXP-Bench is designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimenta…
PaperBench is a benchmark for evaluating AI agents' ability to replicate state-of-the-art AI research from scratch by reproducing 20 ICML 2024 Spotlight and Oral papers - including understanding contributions, developing codebases, and executing experiments.
Snake is an environment for evaluating agents on the classic multiplayer snake game.
Crosswords is an environment for evaluating agents on crossword puzzle solving.
FrozenLake is an environment for evaluating agents on a grid navigation game where they must reach a goal while avoiding holes on a slippery frozen lake. This environment wraps the FrozenLake implementation from TextArena, a framework for text-based game environments.
LetterAuction is an environment for evaluating agents on strategic letter auctions followed by word formation.
RushHour is an environment for evaluating agents on spatial planning and constraint-based puzzle solving.
Blackjack is an environment for evaluating agents on the classic card game. This environment wraps the Blackjack implementation from TextArena, a framework for text-based game environments.
ThreePlayerIPD is an environment for evaluating agents on multi-player game theory and social dilemmas in the Iterated Prisoner's Dilemma.
Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations.
WordChains is an environment for evaluating agents on word association and vocabulary knowledge through a chaining game where each word must start with the last letter of the previous word.
WordLadder is an environment for evaluating agents on building word ladders by changing one letter at a time to transform a start word into an end word.
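The one-letter-at-a-time rule for word ladders can be sketched as a small validator. This is an illustration only: the function names are hypothetical, and the sketch does not check that each step is a dictionary word, which a real environment would likely also enforce.

```python
def differs_by_one(a, b):
    """True if two words have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def is_valid_ladder(words, start, end):
    """Check that `words` runs from `start` to `end`, changing one letter per step."""
    if not words or words[0] != start or words[-1] != end:
        return False
    return all(differs_by_one(a, b) for a, b in zip(words, words[1:]))


print(is_valid_ladder(["cold", "cord", "card", "ward", "warm"], "cold", "warm"))  # True
print(is_valid_ladder(["cold", "card", "warm"], "cold", "warm"))  # False
```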
ThreePlayerTicTacToe is an environment for evaluating agents on strategic play in a three-player variant of Tic-Tac-Toe on a 5x5 board.
Wordle is an environment for evaluating agents on the word-guessing game Wordle. The agent must deduce a hidden target word by submitting guesses and interpreting positional feedback (correct, misplaced, or absent letters).
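The positional feedback described for Wordle can be sketched as follows. The function name and the label strings are hypothetical, and the treatment of repeated letters follows the common Wordle convention (exact matches are consumed first), which is an assumption about this environment rather than something the description states.

```python
from collections import Counter


def wordle_feedback(guess, target):
    """Label each guessed letter as 'correct', 'misplaced', or 'absent'.

    Assumes `guess` and `target` have the same length.
    """
    feedback = ["absent"] * len(guess)
    remaining = Counter()  # target letters not matched exactly

    # First pass: mark exact positional matches.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            feedback[i] = "correct"
        else:
            remaining[t] += 1

    # Second pass: mark misplaced letters, consuming unmatched target letters.
    for i, g in enumerate(guess):
        if feedback[i] != "correct" and remaining[g] > 0:
            feedback[i] = "misplaced"
            remaining[g] -= 1

    return feedback


print(wordle_feedback("speed", "abide"))
# ['absent', 'absent', 'misplaced', 'absent', 'misplaced']
```

Only one of the two `e` letters in the guess is marked misplaced, because the target contains a single `e`.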
Ambient Clinical Intelligence Benchmark (ACI-BENCH) is a benchmark corpus for evaluating AI-assisted clinical note generation from doctor–patient visit dialogues.
tau2-bench is a benchmark for evaluating conversational AI agents in a Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user use tools to act in a shared, dynamic environment that stresses coordination, communication, and guidance.
A benchmark for evaluating AI agents on real-world ML development tasks. 30 tasks covering various aspects of model development, including dataset management, debugging model and code failures, and implementing new ideas to achieve strong performance on various machine learni…
BeyondAIME is a benchmark for evaluating generalized STEM reasoning that extends AIME-style problems to probe deep, stepwise mathematical problem-solving.
volforecast (volatility forecasting) is a benchmark that evaluates the ability of agents to forecast volatility in financial time series.
A benchmark of 140 code performance optimization tasks drawn from real GitHub pull requests. Each instance includes a full repository codebase, target functions to optimize, and reference solutions from human developers. Models are evaluated on their ability to generate patche…
An implementation of the ATLAS benchmark from https://arxiv.org/abs/2511.14366v2
SimpleQA Verified is a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality.
EMMA (Enhanced MultiModal reAsoning) is a benchmark for assessing organic multimodal reasoning in MLLMs across mathematics, physics, chemistry, and coding.
An environment that turns "Practical Methods of Organic Chemistry" (1909) into a question and answer dataset for training and evaluation.
OpenReward environment for testing organic chemistry knowledge based on Holleman's "A Text-book of Organic Chemistry" (5th English ed., 1920).
An environment for classifying molecules as active or inactive against biological targets. Uses the HIV replication inhibition dataset.
GSM8K is a classic math word problems dataset.
PaperSearchQA is a challenging factoid QA dataset for scientific papers with 60k samples. This is a re-implementation. Instead of RAG, we use web_search and fetch tools to allow the agent to search the internet for answers. Note these are optional and you can exclude these fr…
Professional Reasoning Bench (PRBench) is a realistic, open-ended, rubric-based benchmark for evaluating models on economically consequential professional tasks in Finance and Law.
MMLU-Pro is a more robust and challenging massive multi-task understanding dataset, designed to benchmark large language models' capabilities more rigorously. It contains 12K complex questions across various disciplines.
ARC-AGI-2 - the next iteration of the benchmark - is designed to stress test the efficiency and capability of state-of-the-art AI reasoning systems, provide useful signal towards AGI, and re-inspire researchers to work on new ideas.
Generate a SMILES formula given a molecular formula and some constraints.
MMLU-Redux is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects.
AMO-Bench is a 50-problem benchmark of original, expert-validated Olympiad-level math questions designed to test advanced mathematical reasoning beyond existing competitions, using final-answer grading to robustly evaluate top-tier LLMs where current benchmarks have saturated.
ToolMind-Web-QA is a validated public dataset designed for research on search-augmented and long-horizon search agents. This is an implementation of https://huggingface.co/datasets/Nanbeige/ToolMind-Web-QA.
OpenAI MRCR (Multi-round co-reference resolution) is a long context dataset for benchmarking an LLM's ability to distinguish between multiple needles hidden in context.
EvoEval: Evolving Coding Benchmarks via LLM
Retrosynthesis benchmark for evaluating AI agents' ability to propose reactants that produce a target molecule.
Aider’s polyglot benchmark tests LLMs on 225 challenging Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust.
SuperGPQA is a benchmark for evaluating graduate-level knowledge and reasoning across 285 specialized disciplines.
SUPERChem is a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. This is a re-implementation.
Problems from the American Invitational Mathematics Examination (AIME) 2025-I & II.
MMLU-Redux is a subset of 5,700 manually re-annotated questions across 57 MMLU subjects. Implementation of: https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0.
An implementation of the MathVista environment.
Verified tasks from terminal bench 2, created by z.ai
VeriSciQA is a visual question answering dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types
A geoguessing environment where an agent needs to guess the country given an input image.
This environment tests a model's ability to build predictive models of NFL games and bet against market odds.
Cryptic crossword puzzles.
DataScienceComps consists of data science competitions that agents can be trained against.
Encyclo-K is a statement-based benchmark for evaluating LLMs' comprehensive understanding by extracting standalone knowledge statements from authoritative textbooks and dynamically composing them into evaluation questions at test time.
An implementation of the FineProofs-RL environment for theorem proving from https://huggingface.co/datasets/lm-provers/FineProofs-RL.
A murder mystery environment.
Portfolio is an environment that tests agents' ability to carry out portfolio optimisation tasks.
ARC-AGI-1 consists of 800 puzzle-like tasks, designed as grid-based visual reasoning problems. These tasks, trivial for humans but challenging for machines, typically provide only a small number of example input-output pairs (usually around three). This requires the test taker…
SWE-Bench Pro is an environment of 1,865 long-horizon software engineering tasks across 41 repositories. The public set contains 731 instances from copyleft-licensed repositories, designed to resist training data contamination. Tasks span bug fixes, feature requests, optimizat…
An implementation of https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT.
An implementation of the FrontierScience evaluation.
An environment predicting ADME (Absorption, Distribution, Metabolism, Excretion) molecular properties from SMILES notation.
A dataset of 50k+ software engineering task instances generated from 128 popular Python repositories. Each instance includes an execution environment and a task that involves fixing a broken test. The dataset was used to train SWE-agent-LM-32B, which achieves 40.2% on SWE-benc…
GSM1K implementation.
TennisBench is a benchmark for evaluating language model agents in tennis betting scenarios, testing their ability to develop and execute machine learning based betting strategies.
PatentQATrain is a training environment for patent question answering, based on the PatentQA task from LAB-Bench-2. Agents are given questions about specific details from patents across diverse technology domains and must use web search to find and verify answers from Google P…
An implementation of the MATH-Vision dataset which measures multimodal mathematical reasoning.
CodePDE is an environment for generating PDE solvers using large language models.
Problems from the American Invitational Mathematics Examination (AIME) 2026-I & II.
A series of real-world tasks related to budget day in the UK that require reading long-form documents, creating spreadsheets, producing analyses, and more.
Inverse IFEval is a benchmark that measures models' counter-intuitive ability to override training-induced biases and comply with adversarial or unconventional instructions.
OMOLAgent tests agents on the OMol25 dataset from FAIR. They are given access to the OMol-4M training dataset and evaluated on the OMol25 validation set.
RubricHub is a large-scale dataset of questions requiring rubric grading criteria.
Port of https://github.com/laude-institute/harbor-datasets/tree/main/datasets/financeagent_terminal
Global PIQA (a participatory commonsense reasoning benchmark for over 100 languages) was constructed by 335 researchers from 65 countries and covers 116 language varieties across five continents, 14 language families, and 23 writing systems.
The chess environment has the agent play against Stockfish at varying difficulties (skill levels 0-20). The input format is UCI notation. The agent can play as white or black, and there are two environment variants: - ChessTextEnv: Observations are FEN strings (board repres…
tau-bench challenges agents to coordinate, guide, and assist users in achieving shared objectives across complex enterprise domains.
KUMO is a novel benchmark designed to systematically evaluate the complex reasoning capabilities of Large Language Models (LLMs) through procedurally generated reasoning games.
An environment for translating IUPAC names to SMILES formulae and vice versa.
Biomedical evaluation from the Biomni agent. Ported from https://huggingface.co/datasets/biomni/Eval1.
CL-bench is a real-world benchmark for context learning consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts.
Social impact data science competitions.
IMO-Bench is a suite of advanced reasoning benchmarks. IMO-Bench consists of three benchmarks that judge models on diverse capabilities: IMO-AnswerBench - a large-scale test on getting the right answer, IMO-ProofBench - a next-level evaluation for proof writing, and IMO-Gradin…
HLE-Verified is a systematically audited and reliability-enhanced version of the Humanity’s Last Exam (HLE) benchmark.
Implementation of https://openai.com/index/introducing-swe-bench-verified/
ScholarSearch is designed to evaluate the complex information retrieval capabilities of Large Language Models (LLMs) in academic research.
Anthropic's performance take-home test as an environment.
TIR-Bench is a comprehensive benchmark for evaluating agentic thinking-with-images across 13 diverse tasks, each requiring novel tool use for image processing and manipulation within chain-of-thought.
PhysicsEval is a benchmark for evaluating the performance of large language models on mathematical and descriptive physics problems, including assessments using inference-time techniques and multi-agent verification frameworks.
An implementation of GPQA
MathCanvas is a benchmark for evaluating visual-aided mathematical reasoning by requiring models to produce interleaved visual–textual solutions.
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
Simple testing environment where the agent must reach a target count from a starting count using increment and decrement tools. The episode ends when the submitted count matches the goal. The environment has 100 train tasks and 20 test tasks with random start/goal values in…
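The counter task above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `CounterEnv`, the method names `increment`, `decrement`, and `submit`, and the greedy `solve` policy are hypothetical and are not taken from the environment's actual interface.

```python
from dataclasses import dataclass


@dataclass
class CounterEnv:
    """Toy counter environment. All names here are hypothetical."""

    count: int
    goal: int
    done: bool = False

    def increment(self) -> int:
        # Tool: raise the count by one and return the new value.
        self.count += 1
        return self.count

    def decrement(self) -> int:
        # Tool: lower the count by one and return the new value.
        self.count -= 1
        return self.count

    def submit(self) -> bool:
        # The episode ends only when the submitted count matches the goal.
        self.done = self.count == self.goal
        return self.done


def solve(env: CounterEnv) -> int:
    """Greedy policy: step toward the goal, then submit.

    Returns the total number of tool calls, including the final submit.
    """
    calls = 0
    while env.count != env.goal:
        if env.count < env.goal:
            env.increment()
        else:
            env.decrement()
        calls += 1
    env.submit()
    return calls + 1
```

For a start of 3 and a goal of 7, the greedy policy makes four increments and one submit, so the minimum number of tool calls is the distance to the goal plus one.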
This is a port of the dataset used to train the OpenResearcher model on long-horizon deep research. Based on https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Dataset.
HealthBench tests how well AI models perform in realistic health scenarios, based on what physician experts say matters most.
ProcBench is a benchmark for directly evaluating LLMs' multi-step inference ability by providing pairs of explicit instructions and corresponding questions where the full procedures needed to solve each problem are specified, minimizing path exploration and implicit knowledge…
CaseLawQA is a dataset of legal classification tasks drawing from the Supreme Court and Songer Court of Appeals legal databases.
MMMLU is OpenAI's translation of MMLU’s test set into 14 languages, produced by professional human translators.
ValidMol is an environment that tests molecule generation using programmatic verifiers.
OpenReward environment for matching scent/smell to molecules. Follows ether0's property-cat-smell pattern with multiple-choice questions and string-match verification.
ObscureFacts is an environment for evaluating an agent's ability to find answers to obscure trivia questions using web search. Agents must use web search tools to research and answer intentionally difficult factual questions spanning sports, technology, local history, and acad…
HMMT is one of the largest and most prestigious high school competitions in the world.
Evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature.
IPLBench tests agents' ability to make predictive models of IPL cricket and bet on them versus market odds.
Open-RL by Turing consists of self-contained, verifiable, and unambiguous STEM reasoning problems across Physics, Mathematics, Biology, and Chemistry.
An OpenReward environment for classifying molecular safety across multiple toxicity endpoints from SMILES notation.
BFCL is an evaluation of LLMs' ability to call functions and tools. The dataset represents common function calling use-cases in agents and enterprise workflows.
This is a lite version of the SWE-Gym environment.
LongFact is a prompt set of 2,280 fact-seeking prompts requiring long-form responses.
Expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations.
LitQATrain has distinctively specific questions on scientific literature and requires agents to search the literature in order to answer them. This is based on the LitQA benchmark by FutureHouse.
An OpenReward environment where agents train ML models to predict aqueous solubility (LogS) from molecular SMILES notation.
DCS is a simulation of a departure control system.
BullshitBenchv2 measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions. This is an implementation of a benchmark by PeterGPT.
Humanity's Last Exam (HLE) is an LLM benchmark consisting of over 2,500 expert-level questions across a broad range of subjects.
SourceQualityTrain is a training environment for systematic review source quality assessment, based on the SourceQuality benchmark from LAB-Bench. Agents are given questions about why specific studies were excluded from systematic reviews, and must use web search to identify t…
RealLawyer simulates a real-life law firm workflow with simulated clients, realistic datarooms, and more.
VolleyBench is an environment which tests an agent's ability to predict Women's World Championship volleyball matches, where reward is based on real market odds.
BioReason is a collection of environments for assessing biological reasoning.
ScrapeBench is an environment that tests an agent's ability to extract information from publicly available websites.
Massive Multitask Language Understanding (MMLU) is a popular benchmark for evaluating the capabilities of large language models. It inspired several other versions and spin-offs, such as MMLU-Pro, MMMLU and MMLU-Redux.
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and four easy-to-hard difficulty levels. It ensures comprehensive difficulty coverage, language diversity, and high-quality translations to provide a highly discriminative testbed for evaluating…
SECQUE is a comprehensive benchmark for evaluating large language models (LLMs) in financial analysis tasks.
An agentic implementation of https://github.com/czyssrs/FinQA
AirlineRM is an airline network revenue management environment where an agent operates a hub-and-spoke carrier over a 30-day horizon. The agent makes daily decisions about fare class availability (opening/closing 8 nested fare buckets), overbooking limits, and disruption respo…
NL2Repo is a benchmark designed to evaluate the performance of Large Language Models (LLMs) and coding agents on long-horizon tasks that require generating a complete, runnable code repository from scratch (0-to-1). The benchmark consists of 104 distinct tasks, each paired wit…
ScienceAgentBench is a benchmark for evaluating language agents for data-driven scientific discovery.
Many real-world information-gathering tasks are not hard, just huge. Consider a financial analyst compiling key metrics for all companies in a sector, or a job seeker collecting every vacancy that meets their criteria. The challenge isn't cognitive complexity, but the sheer sc…
KellyBench is a benchmark that tests an agent's ability to build machine learning models for predicting football matches and to bet against market odds.
FeatureBench is a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development.
LawBench has been meticulously crafted to have precise assessment of the LLMs’ legal capabilities from three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize needed legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs…
Codebase QnA is the first benchmark in the SWE-Atlas suite. It evaluates AI agents on deep code comprehension - tracing execution paths, explaining architectural decisions, and answering deeply technical questions about production-grade software systems.
DAComp-DE is a benchmark of 110 data engineering tasks that require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines.
Training dataset for reinforcement learning (GRPO) optimization of BioReason-Pro. Contains proteins with GO term annotations, InterPro domains, STRING protein-protein interactions, and protein metadata.
CTF (Capture the flag) is an environment where agents attempt to find text strings - called flags - which are secretly hidden in purposefully vulnerable programs or websites.
USACO consists of 307 problems from the USA Computing Olympiad, complete with exhaustive test cases, problem analyses, and difficulty labels. Although zero-shot performance is poor, we find that we can well over double performance with a combination of retrieval and self-refle…
Enigma Decrypt is an environment where agents decrypt WWII-era Enigma-encoded German military messages. Each task presents an intercepted ciphertext encrypted with a historically accurate Wehrmacht Enigma I machine (3-rotor, plugboard configuration). The agent is given partial…
MarsExplorer is an environment for evaluating agents on grid-based terrain exploration and coverage. Based on the MarsExplorer environment by Koutras et al., agents control a rover navigating a procedurally-generated 2D grid with obstacles, using a simulated LIDAR sensor to re…
Alquerque is an environment for evaluating agents on playing Alquerque, an ancient board game and precursor to checkers, against an LLM opponent. This environment wraps the Alquerque implementation from TextArena, a framework for text-based game environments.
ICU-Sepsis is an environment for evaluating agents on a tabular Markov Decision Process (MDP) that models sepsis treatment in the intensive care unit. Agents select treatment actions representing combinations of vasopressor and IV fluid doses to maximize patient survival proba…
FrontierCO is a curated benchmark suite for evaluating ML-based solvers on large-scale and real-world Combinatorial Optimization (CO) problems. The benchmark spans 8 classical CO problems across 5 application domains, providing both training and evaluation instances specifical…
Codenames is an environment for evaluating agents on word association and collaborative team play in the Codenames board game. This environment wraps the Codenames implementation from TextArena, a framework for text-based game environments.
Hangman is an environment for evaluating agents on word guessing and deductive reasoning tasks. This environment wraps the Hangman implementation from TextArena, a framework for text-based game environments.
LiarsDice is an environment for evaluating agents on strategic bluffing and probabilistic reasoning in Liar's Dice, a dice game where players must bid or call bluffs.
Breakthrough is an environment for evaluating agents on playing Breakthrough, a chess-like abstract strategy game, against an LLM opponent. This environment wraps the Breakthrough implementation from TextArena, a framework for text-based game environments.
Checkers is an environment for evaluating agents on playing the classic Checkers game against an LLM opponent. This environment wraps the Checkers implementation from TextArena, a framework for text-based game environments.
GuessWho is an environment for evaluating agents on the classic Guess Who game, where agents must identify a target character by asking yes-or-no questions about their traits. This environment wraps the GuessWho implementation from TextArena, a framework for text-based game en…
TruthAndDeception is an environment for evaluating agents on social deduction and persuasion through natural conversation.
Minesweeper is an environment for evaluating agents on spatial reasoning, probabilistic inference, and strategic exploration.
ConnectFour is an environment for evaluating agents on playing the classic Connect Four game against an LLM opponent. This environment wraps the ConnectFour implementation from TextArena, a framework for text-based game environments.
OpenThought code contests environment
Nim is an environment for evaluating agents on the classic mathematical strategy game where players remove objects from piles.
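As background for the Nim entry above, the classic game has a well-known closed-form strategy based on the XOR ("nim-sum") of the pile sizes. The sketch below assumes the normal-play convention, where the player who takes the last object wins; the environment may use a different rule set, and the function name `winning_move` is hypothetical.

```python
from functools import reduce
from operator import xor
from typing import List, Optional, Tuple


def winning_move(piles: List[int]) -> Optional[Tuple[int, int]]:
    """Return (pile_index, objects_to_remove) for a winning move.

    Assumes normal play (taking the last object wins). Returns None when
    the nim-sum is already zero, i.e. no move guarantees a win.
    """
    nim_sum = reduce(xor, piles, 0)
    if nim_sum == 0:
        return None
    for index, size in enumerate(piles):
        target = size ^ nim_sum
        # A legal move must strictly shrink the pile.
        if target < size:
            return index, size - target
    return None  # Unreachable when nim_sum != 0; kept for type completeness.
```

With piles of 3, 4, and 5 the nim-sum is 2, so removing two objects from the first pile leaves 1, 4, and 5, whose nim-sum is zero.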
PigDice is an environment for evaluating agents on the dice game where players balance risk and reward to reach a target score.
ColonelBlotto is an environment for evaluating agents on playing Colonel Blotto, a resource allocation strategy game, against an LLM opponent. This environment wraps the ColonelBlotto implementation from TextArena, a framework for text-based game environments.
SimpleTak is an environment for evaluating agents on a simplified version of Tak played on a 4x4 grid.
Cryptarithm is an environment for evaluating agents on cryptarithmetic puzzles, where letters must be mapped to unique digits to make an arithmetic equation valid. This environment wraps the Cryptarithm implementation from TextArena, a framework for text-based game environments.
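A candidate solution to a cryptarithm like the ones described above can be checked programmatically. The sketch below is an assumption-laden illustration, not the TextArena verifier: it handles only addition puzzles, applies the common no-leading-zero convention, and uses the hypothetical function name `check_cryptarithm`.

```python
from typing import Dict, List


def check_cryptarithm(addends: List[str], total: str, mapping: Dict[str, int]) -> bool:
    """Check a letter-to-digit mapping for an addition cryptarithm.

    Assumptions (not taken from the environment): addition only, and no
    word may start with the digit 0.
    """
    words = addends + [total]
    letters = set("".join(words))

    # Every letter in the puzzle must be mapped, and nothing else.
    if set(mapping) != letters:
        return False

    # Digits must be single decimal digits and unique across letters.
    digits = list(mapping.values())
    if any(d not in range(10) for d in digits):
        return False
    if len(set(digits)) != len(digits):
        return False

    # Leading zeros are disallowed by convention.
    if any(mapping[word[0]] == 0 for word in words):
        return False

    def value(word: str) -> int:
        return int("".join(str(mapping[ch]) for ch in word))

    return sum(value(word) for word in addends) == value(total)
```

For the classic puzzle SEND + MORE = MONEY, the mapping S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2 passes, since 9567 + 1085 = 10652.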
IteratedTwoThirdsAverage is an environment for evaluating agents on strategic reasoning about iterated dominance and opponent modeling.
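The iterated-dominance reasoning mentioned above can be made concrete with the textbook version of the game, in which each player guesses a number and the winner is whoever lands closest to two-thirds of the average. The sketch assumes guesses lie in a range from 0 up to some maximum; that range and both function names are assumptions for illustration and may not match the environment's rules.

```python
from typing import List


def two_thirds_target(guesses: List[float]) -> float:
    """Two-thirds of the mean guess: the value players try to hit."""
    return (2.0 / 3.0) * sum(guesses) / len(guesses)


def undominated_upper_bound(max_guess: float, rounds: int) -> float:
    """Largest guess that survives `rounds` steps of iterated elimination.

    If nobody guesses above `max_guess`, the target cannot exceed
    two-thirds of it, so higher guesses are dominated. Repeating the
    argument shrinks the bound by a factor of 2/3 per round.
    """
    return max_guess * (2.0 / 3.0) ** rounds
```

Under these assumptions the bound tends to zero as the number of reasoning rounds grows, which is why opponent modeling matters: the best guess depends on how many rounds of elimination the other players are expected to perform.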
SimpleQA is a benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
MMCircuitEval is the first multimodal benchmark specifically designed to comprehensively evaluate MLLMs on electronic design automation (EDA) tasks in digital and analog circuit design. It contains 3614 expert-reviewed question-answer pairs drawn from textbooks, technical ques…
Terminal-Bench is a popular benchmark for measuring the capabilities of agents and language models to perform valuable work in containerized environments. Tasks include assembling proteins for synthesis, debugging async code, and resolving security vulnerabilities.
IFEval evaluates an LLM's ability to follow instructions. It focuses on a set of verifiable instructions such as "write in more than 400 words" and "mention the keyword of AI at least 3 times".
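The two example instructions quoted above are "verifiable" in the sense that a short program can check them. The sketch below shows one plausible way to do so; it is not IFEval's own checker, the function names are hypothetical, and the choices of whitespace word splitting and case-insensitive whole-word keyword matching are assumptions.

```python
import re


def has_more_than_n_words(response: str, n: int) -> bool:
    """True if the response has strictly more than n whitespace-separated words."""
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, times: int) -> bool:
    """True if `keyword` appears as a whole word at least `times` times.

    Matching is case-insensitive (an assumption), and word boundaries
    prevent hits inside longer words such as "said".
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    matches = re.findall(pattern, response, flags=re.IGNORECASE)
    return len(matches) >= times
```

A response would satisfy both quoted instructions when `has_more_than_n_words(response, 400)` and `mentions_keyword_at_least(response, "AI", 3)` both return `True`.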
MATH is a dataset of 12,500 challenging competition mathematics problems.
DiscoX is a benchmark for discourse-level and expert-level Chinese-English translation designed to evaluate discourse coherence and strict terminological precision in professional-domain texts.
Evaluate the capabilities of AI in designing and implementing quantum algorithms from the perspective of code generation.
CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
BrowseComp is a benchmark for measuring agents' web-browsing ability, comprising 1,266 questions that require persistent navigation to locate hard-to-find, entangled information. It yields short, easily verifiable answers and tests agents' persistence and creativity in finding…
Discovery30s is a benchmark that tests the potential of vintage language models to reproduce scientific discoveries after the training data cutoff period. We construct the benchmark by taking a known discovery, e.g. Hückel's rule, and then breaking it down into a "question lad…
TrialQATrain is a training environment that tests question answering and retrieval for clinical trials.
Implementation of the POLARIS-53K mathematics dataset.
Reasoning Gym is a community-created Python library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL).
An OpenReward environment for evaluating agents on blood-brain barrier (BBB) permeability tasks. Agents must either classify molecules by their ability to cross the BBB, or modify non-permeable molecules to become permeable.
A benchmark of 300 software engineering tasks across 42 repositories and 9 programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust. Each instance is derived from a real GitHub pull request, following the same format and evaluation protocol as SWE-b…
RefSeqTrain is a training environment for genomics question answering about NCBI RefSeq and Gene database records. Agents are given questions about specific verifiable facts from RefSeq gene, transcript, and protein records and must use web search to find and verify answers fr…
A benchmark that tests an AI agent's ability to build models that predict UFC matches and to bet against market odds.
SuppQATrain is an environment for evaluating question answering over supplementary materials of scientific papers, based on FutureHouse's SuppQA subtask within LAB-Bench. Each question asks about a specific verifiable fact found exclusively in a paper's supplementary data rath…
Implementation of https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified
BrowseComp-ZH is the first high-difficulty benchmark specifically designed to evaluate the real-world web browsing and reasoning capabilities of large language models (LLMs) in the Chinese information ecosystem.
Chinese SimpleQA is the first comprehensive Chinese benchmark to evaluate the factuality of language models on short-question answering, covering six major topics and 99 diverse subtopics.
MMMU is a benchmark for evaluating multimodal models on massive multi-discipline tasks that require college-level subject knowledge and deliberate reasoning.
Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain k…
Problems from the American Invitational Mathematics Examination (AIME) 2024.
The Stanford Math Tournament (SMT) is a prestigious annual math competition hosted by Stanford University.
AARDData is an environment which teaches agents to do pretraining data tasks.
Developing high-performance software is a complex task that requires specialized expertise. GSO (Global Software Optimization) is a benchmark for evaluating language models' capabilities in developing high-performance software.