scientific reasoning

GPQA RL Env (Community)

GPQA evaluation environment

GPQA RL Env (Prime Intellect)

Prime Intellect

GPQA evaluation environment

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

AI2 ARC RL Env (Community)

AI2 Reasoning Challenge (ARC) benchmark environment for evaluating models on grade-school science questions

RL Env, Reasoning, Science

BBH RL Env (Community)

BIG-Bench Hard (BBH) evaluation environment with chain-of-thought prompting

RL Env, Bbh, Reasoning

Bigbench BBH RL Env (Prime Community)

Prime Community

BIG-bench and BBH implementation

RL Env, Bigbench, Bbh, NLP

SFT Dataset, Multi Turn Dialog, Instruction Following, Scientific Reasoning

Capybara

LDJnr

LDJnr's multi-turn reasoning dataset built with the Amplify-Instruct synthesis method: short but deep conversations, each on a single topic.

RL Env, Deepconf, Confidence, Reasoning, Math

Deepconf RL Env (Community)

DeepConf environment for confidence-aware LLM reasoning evaluation

RL Env, Gpqa, Self Assessment, Sycophancy

Defend Concede RL Env (Community)

GPQA Defend/Concede environment for calibrated self-assessment training

RL Env, Multi Modal, Tool Use, Academic, Multimodal

HLE RL Env (Prime Intellect)

Prime Intellect

Humanity's Last Exam evaluation environment

M ARC RL Env (Community)

Single-turn medical MCQ

RL Env, Medical, Clinical

M ARC RL Env (Medarc)

Medarc

Single-turn medical MCQ

RL Env, Medical, Clinical

MMLU PRO RL Env (Community)

MMLU-Pro evaluation environment

RL Env, Stem, MMLU

MMLU PRO RL Env (Prime Intellect)

Prime Intellect

MMLU-Pro evaluation environment

RL Env, Stem, MMLU

RL Env, General Knowledge, NLP

MMLU RL Env (Community)

MMLU evaluator for multi-subject multiple-choice reasoning.

RL Env, General Knowledge, NLP

MMLU RL Env (Prime Community)

Prime Community

MMLU evaluator for multi-subject multiple-choice reasoning.

SFT Dataset, Instruction Following, Multi Turn Dialog, Code Generation

OpenHermes 2.5

Teknium

Teknium's million-row aggregation of high-quality GPT-4-style synthetic instructions that became the de facto open SFT baseline of 2023-2024.

Openmed Medknowledge RL Env (Community)

Comprehensive medical knowledge MCQ environment using MMLU medical subsets.

SFT Dataset, Math, Code Generation, Scientific Reasoning

OpenThoughts

Open Thoughts

A fully open distillation of long DeepSeek-R1 reasoning traces: the community's flagship "open R1" SFT corpus for reasoning models.

RL Env, Medical, Clinical, MMLU

PRO Health RL Env (Medarc)

Medarc

Single-turn medical MCQ

SFT Dataset, Math, Scientific Reasoning

s1K

Stanford Center for Research on Foundation Models (CRFM)

Stanford's hand-curated 1,000-problem reasoning dataset that, paired with budget forcing at inference, produced o1-competitive results for ~$50 of compute.

Scicode RL Env (Prime Community)

Prime Community

SciCode evaluation environment

RL Env, Code, Python, Stem

Scicode RL Env (Prime Intellect)

Prime Intellect

SciCode evaluation environment

RL Env, Code, Python, Stem

SFT Dataset, Multi Turn Dialog, Instruction Following

ShareGPT

Anonymous Community

The original community-scraped corpus of ChatGPT conversations that bootstrapped Vicuna and the entire open-instruction-tuning era.

SFT Dataset, Instruction Following, Math, Code Generation

Tülu 3 SFT Mixture

Allen Institute for AI (Ai2)

Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.