Enigmata RL Env (Prime Intellect)

Enigmata environment for verifiers

  • Type: RL Env
  • Runtime: single-turn
  • License: unknown
  • Size: v0.1.4
  • Published: Oct 2025

Enigmata

Overview

  • Environment ID: enigmata
  • Short description: Synthetic, verifiable puzzle tasks with rule-based scoring across 36 tasks in 7 categories
  • Tags: enigmata, single-turn, reasoning, puzzles, verifiable, generator-verifier

This environment adapts the Enigmata suite as a self-contained evaluation and data-generation environment. Problems are programmatically generated and scored with task-specific, rule-based verifiers. It is designed for training and evaluating reasoning models without external LLM judges.

Adapted by Fido Wang - Github, X

Notes:

  • Python ≥ 3.11; dependencies are declared in pyproject.toml.
  • If the embedded Enigmata submodule is missing, the environment will automatically clone BytedTsinghua-SIA/Enigmata on first use.

Datasets

  • Primary dataset(s): Enigmata-Data (synthetic) and Enigmata-Eval (benchmark)
  • Source link:
  • Split sizes (Eval): 4,758 puzzle instances across Easy/Medium/Hard

Task

  • Type: single-turn
  • Parser: Identity parser (returns the raw completion)
  • Rubric overview: Single numeric score from a task-specific verifier (verify) with unit weight

Quickstart

Run an evaluation with defaults (no API keys required):

uv run vf-eval enigmata

Evaluate with a fixed number of examples and specific tasks:

uv run vf-eval enigmata \
  -a '{"num_train_examples": 200, "num_eval_examples": 200, "tasks": ["sudoku", "maze"]}'

Use the predefined benchmark split (downloads Enigmata-Eval from HuggingFace) and evaluate only sudoku:

uv run vf-eval enigmata \
  -a '{"use_predefined_eval_dataset": true, "tasks": "sudoku"}'

Notes:

  • Use -a / --env-args to pass environment configuration as JSON.
  • When use_predefined_eval_dataset is false, both train and eval sets are generated on the fly.
  • You can also generate and evaluate offline using the scripts in Enigmata/ (see that README for details).
  • Reproducibility: pass seed in the env args, or set the ENIGMATA_SEED environment variable, to seed the RNGs without editing code under Enigmata/. When a seed is provided, eval generation automatically uses seed + 1.

Seeding and Reproducibility

Minimal seeding is applied to stabilize generation without touching code under Enigmata/:

  • Python random
  • NumPy (if available)
  • PYTHONHASHSEED
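
The seeding steps listed above might look roughly like the following. This is a minimal sketch, not the environment's actual code: the helper names `resolve_seed`, `apply_seed`, and `eval_seed` are hypothetical, and giving the `seed` argument precedence over `ENIGMATA_SEED` is an assumption.

```python
import os
import random


def resolve_seed(seed=None):
    """Return the explicit seed, else ENIGMATA_SEED, else None (assumed precedence)."""
    if seed is not None:
        return int(seed)
    env_value = os.environ.get("ENIGMATA_SEED")
    return int(env_value) if env_value else None


def apply_seed(seed):
    """Seed Python's random, export PYTHONHASHSEED, and seed NumPy if installed."""
    random.seed(seed)
    # Setting this at runtime only affects child processes; the current
    # interpreter's hash seed was fixed when it started.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional
    np.random.seed(seed)


def eval_seed(seed):
    """Eval generation uses seed + 1 when a seed is provided."""
    return None if seed is None else seed + 1
```

Calling `apply_seed(resolve_seed(42))` before generating the train set and `apply_seed(eval_seed(42))` before generating the eval set would keep the two splits reproducible yet distinct.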

Examples

Deterministic generation with a fixed seed:

uv run vf-eval enigmata \
  -a '{"num_train_examples": 100, "num_eval_examples": 100, "seed": 42}'

Use environment variable instead of args:

ENIGMATA_SEED=123 uv run vf-eval enigmata \
  -a '{"num_train_examples": 100, "num_eval_examples": 100}'

Train and eval get different seeds automatically (here train uses seed 7 and eval uses seed 8):

uv run vf-eval enigmata \
  -a '{"num_train_examples": 100, "num_eval_examples": 100, "seed": 7}'

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| num_train_examples | int | -1 | Number of generated training examples per run (-1 means generator-chosen) |
| num_eval_examples | int | -1 | Number of generated evaluation examples per run (-1 means generator-chosen) |
| use_predefined_eval_dataset | bool | false | If true, loads BytedTsinghua-SIA/Enigmata-Eval from HF for eval |
| tasks | str or list | "all" | Filter to a task or list of tasks (e.g., "sudoku", ["sudoku","maze"]) |
| difficulties | Optional[list] | None | List of difficulties to include in the environment |
| system_prompt | str | "" | Optional system prompt propagated to the environment |
| seed | Optional[int] | None | Global seed for reproducible generation (Python, NumPy). Eval uses seed + 1 |
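
Because the arguments are passed as a JSON string, it can help to build that string programmatically instead of quoting it by hand. The sketch below is illustrative only: `build_env_args` is a hypothetical helper, not part of the environment, and any argument left out is assumed to fall back to the default listed above.

```python
import json
import shlex


def build_env_args(**overrides):
    """Serialize environment arguments to the JSON string passed via -a / --env-args."""
    return json.dumps(overrides)


env_args = build_env_args(num_eval_examples=50, tasks=["sudoku", "maze"], seed=42)

# shlex.quote keeps the JSON intact when the command is pasted into a shell.
command = "uv run vf-eval enigmata -a " + shlex.quote(env_args)
print(command)
```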

Metrics

| Metric | Meaning |
| --- | --- |
| reward | Score returned by the task verifier for each evaluated puzzle (typically 0 or 1; aggregated as the mean) |

Example Structure

Normalized examples produced by this environment follow this schema:

question: str
answer: str
info: 
  - task_name: str
  - difficulty: str
  - split: str
  - language: str
  - meta_json: Optional[str]  # JSON-encoded metadata
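
As an illustration of the schema, one record could be represented in Python as shown below. All field values here are placeholders chosen for the example (including the `grid_size` metadata key and the exact spelling of the difficulty, split, and language values); only the field names come from the schema above.

```python
import json

# Hypothetical task metadata; real keys depend on the task's generator.
meta = {"grid_size": 9}

example = {
    "question": "Solve the following puzzle ...",  # placeholder prompt text
    "answer": "...",                               # placeholder reference answer
    "info": {
        "task_name": "sudoku",
        "difficulty": "easy",
        "split": "eval",
        "language": "en",
        "meta_json": json.dumps(meta),  # stored as a JSON string, not a dict
    },
}

# meta_json is optional, so guard against None before decoding.
raw_meta = example["info"]["meta_json"]
decoded_meta = json.loads(raw_meta) if raw_meta else {}
```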

Verifier Integration

Per-example scoring dynamically imports verifiable_tasks.tasks.<task_name>.verifier and calls verify(solution: str, answer: str, meta: dict) -> float|int. If a verifier cannot be resolved, the reward defaults to 0.0 to fail closed.
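
The lookup described above can be sketched with `importlib`. This is a minimal illustration under stated assumptions, not the environment's actual implementation: `score_example` is a hypothetical name, and whether the real code also catches exceptions raised inside `verify` itself is not stated in this README.

```python
import importlib


def score_example(task_name, solution, answer, meta):
    """Resolve the task's verifier dynamically and return its score as a float."""
    module_name = f"verifiable_tasks.tasks.{task_name}.verifier"
    try:
        verify = importlib.import_module(module_name).verify
    except (ImportError, AttributeError):
        # Fail closed: a missing module or missing verify() earns no reward.
        return 0.0
    return float(verify(solution, answer, meta))
```

Failing closed means a misspelled or unsupported task name never yields a positive reward, so a broken lookup cannot inflate the mean.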