GeneralReasoner

Description

GeneralReasoner is an environment for evaluating general reasoning capabilities using the WebInstruct-verified dataset from the General-Reasoner project by TIGER-AI-Lab. It provides diverse reasoning tasks spanning multiple categories and difficulty levels, with LLM-based semantic grading for flexible answer evaluation.

Capabilities

General reasoning evaluation across multiple domains
Multi-category question answering
Semantic answer verification
Varied difficulty levels

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

Tasks

There are two splits in this environment:

train: 228,736 tasks
test: 1,000 tasks

Tasks span multiple categories with varying difficulty levels.

Reward Structure

This is a single-turn environment. The agent submits an answer via the answer tool. An LLM grader (gpt-5-mini) evaluates semantic correctness against the reference answer. Reward is binary: 1.0 if correct, 0.0 if incorrect.
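The binary reward rule can be sketched in a few lines. This is only an illustration: the function name `binary_reward` and the assumption that the grader replies with text beginning with "correct" or "incorrect" are hypothetical, since the grader's actual output format is not specified here.

```python
def binary_reward(grader_verdict: str) -> float:
    """Map a grader verdict string to a binary reward (1.0 or 0.0)."""
    # Assumption: the verdict text starts with "correct" or "incorrect".
    # The real grader output format may differ.
    normalized = grader_verdict.strip().lower()
    # "incorrect" does not start with "correct", so it maps to 0.0.
    return 1.0 if normalized.startswith("correct") else 0.0
```

Anything that is not recognized as a "correct" verdict, including an empty reply, falls through to 0.0 in this sketch, which keeps the reward strictly binary.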

Data

Data consists of Parquet files sourced from the WebInstruct-verified dataset. Each row contains a question, answer, answer type, category, and difficulty level. Data is stored on the OpenReward platform.
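As a rough picture of the row structure, the sketch below models one row as a dataclass and counts tasks per category. The field names (`question`, `answer`, `answer_type`, `category`, `difficulty`) are assumed from the description above and may not match the actual Parquet column names; the sample rows are invented placeholders, not taken from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Assumed field names; check the Parquet schema for the real columns.
    question: str
    answer: str
    answer_type: str
    category: str
    difficulty: str


def count_by_category(tasks):
    """Return a Counter mapping each category to its number of tasks."""
    return Counter(task.category for task in tasks)


# Invented placeholder rows for illustration only.
sample = [
    Task("What is 2 + 3?", "5", "integer", "math", "easy"),
    Task("What is 7 * 6?", "42", "integer", "math", "easy"),
    Task("What is the chemical symbol for sodium?", "Na", "string", "chemistry", "easy"),
]

counts = count_by_category(sample)
```

With real data, the same `Task` objects could be built from rows read out of the Parquet files by whichever Parquet reader is available.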

Tools

| Tool | Description |
| --- | --- |
| `answer` | Submit your answer for LLM grading. Ends the episode. |

Time Horizon

Single-turn. The agent reads the question and submits one answer.

Environment Difficulty

[Put environment difficulty statistics here]

Other Environment Requirements

An OpenAI API key is required for LLM-based grading. Pass it via `secrets={"openai_api_key": "..."}`.
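One way to build the secrets mapping without hard-coding the key is to read it from the process environment. The sketch below is a minimal example under assumptions: the helper name `build_secrets` and the `OPENAI_API_KEY` variable name are illustrative choices, and only the `{"openai_api_key": ...}` shape comes from the text above.

```python
import os


def build_secrets(env=os.environ):
    """Build the secrets mapping from an environment-like mapping."""
    # Assumption: the key is stored in an OPENAI_API_KEY environment variable.
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM-based grading needs it")
    return {"openai_api_key": key}
```

The returned dictionary can then be supplied as the `secrets` argument wherever the environment is created. Failing early on a missing key avoids discovering the problem only when the first answer is graded.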

Safety

Agents in GeneralReasoner answer reasoning questions in a standard environment. The environment does not present direct safety risks.

Citation

@inproceedings{ma2025generalreasoner,
  title={General-Reasoner: Advancing {LLM} Reasoning Across All Domains},
  author={Ma, Xueguang and Liu, Qian and Jiang, Dongfu and Zhang, Ge and Ma, Zejun and Chen, Wenhu},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}