Prime Community is a team.
Word puzzle game where players find groups of 4 words sharing a common theme
Humanity's Last Exam (HLE) benchmark environment for Prime Community Environments
Polars DataFrame manipulation environment for training and evaluation
LLM Training Puzzles by Sasha Rush
Hanabi game
A multi-turn RL environment for formal theorem proving in Lean 4, where models alternate between reasoning, sketching proof code, and receiving ver...
Vision-SR1 environment (train+eval) using original graders
Verifiers port of the miniF2F benchmark
Multi-turn Text-to-SQL environment with interactive database feedback following SkyRL-SQL methodology
Benchmarks model performance on SWE-bench in the mini-SWE-agent harness.
GPU puzzles environment by Sasha Rush using Modal sandboxes
Future House Aviary wrapper for verifiers - Scientific reasoning environments with tools
Data Agent Benchmark for Multi-step Reasoning (DABStep) environment
ART-E: a tool-using email research RL environment for Verifiers
BIG-bench + BBH implementation
Codebase search environment for Triton GPU programming library - tests agent's ability to navigate and answer questions about the Triton codebase u...
Evaluates sycophantic behavior in LLMs across four tasks from Sharma et al. (ICLR 2024).
AgentHarm environment to evaluate agentic reasoning and safety
Multi-turn text-based environment for evaluating agents on the Spiral-Bench dataset.
Text-based multi-turn Fruit Box game environment
BackendBench environment for LLM kernel benchmarking
AndroidWorld benchmark for evaluating autonomous agents on real Android apps with 116 tasks across 20 apps
LegalBench environment for legal reasoning tasks
Benchmark for agent robustness against prompt injection attacks in tool-use scenarios
Tests a model's ability to correctly click on a target UI element
BixBench scientific reasoning evaluation environment
A realistic virtual EHR environment to benchmark medical LLM agents on clinical tasks.
Mastermind multi-turn game environment for Verifiers
ClockBench: multimodal clock reading and reasoning benchmark implemented for verifiers.
Multi-turn environment for testing coding abilities across multiple programming languages using Exercism exercises
AidanBench multi-turn environment for Verifiers
Agentic RAG over Sherlock Holmes short stories for literary Q&A
MCP Universe environment for evaluating LLMs on a wide range of tasks with MCP servers
TransformerPuzzles by Sasha Rush
τ-bench: Tool-Agent-User benchmark for conversational agents in customer service domains with user simulation
Classic Infocom interactive fiction games (Zork, Enchanter, etc.) for evaluating LLM reasoning, planning, and world modeling
MMLU evaluator for multi-subject multiple-choice reasoning.
Environment for self-grading LLM writing style.
Environment for the game Wiki Race
SciCode evaluation environment
GitHub MCP environment
ARC-AGI 1 + 2 with tool calling (Abstraction and Reasoning Corpus)
BALROG benchmark integration for verifiers: unified RL evaluation across game environments.
Verifiers environment for BrowseComp-Plus Deep-Research Agent Benchmark. Controlled agent/retriever evaluation on the fixed human-verified corpus.