Enigmata
Overview
- Environment ID: enigmata
- Short description: Synthetic, verifiable puzzle tasks with rule-based scoring across 36 tasks in 7 categories
- Tags: enigmata, single-turn, reasoning, puzzles, verifiable, generator-verifier
This environment adapts the Enigmata suite as a self-contained evaluation and data-generation environment. Problems are programmatically generated and scored with task-specific, rule-based verifiers. It is designed for training and evaluating reasoning models without external LLM judges.
Adapted by Fido Wang - Github, X
Notes:
- Python ≥ 3.11; dependencies are declared in pyproject.toml.
- If the embedded Enigmata submodule is missing, the environment will automatically clone BytedTsinghua-SIA/Enigmata on first use.
Datasets
- Primary dataset(s): Enigmata-Data (synthetic) and Enigmata-Eval (benchmark)
- Source link:
  - Enigmata-Eval: HuggingFace Dataset
- Split sizes (Eval): 4,758 puzzle instances across Easy/Medium/Hard
Task
- Type: single-turn
- Parser: Identity parser (returns the raw completion)
- Rubric overview: Single numeric score from a task-specific verifier (verify) with unit weight
Quickstart
Run an evaluation with defaults (no API keys required):
uv run vf-eval enigmata
Evaluate with a fixed number of examples and specific tasks:
uv run vf-eval enigmata \
-a '{"num_train_examples": 200, "num_eval_examples": 200, "tasks": ["sudoku", "maze"]}'
Use the predefined benchmark split (downloads Enigmata-Eval from HuggingFace) and evaluate only sudoku:
uv run vf-eval enigmata \
-a '{"use_predefined_eval_dataset": true, "tasks": "sudoku"}'
Notes:
- Use -a/--env-args to pass environment configuration as JSON.
- When use_predefined_eval_dataset is false, both train and eval sets are generated on the fly.
- You can also generate and evaluate offline using the scripts in Enigmata/ (see that README for details).
- Reproducibility: pass seed in env args, or set env var ENIGMATA_SEED to seed RNGs without editing code under Enigmata/. Eval generation uses seed + 1 automatically when seed is provided.
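Because -a/--env-args expects JSON rather than Python literals, it can help to build the argument string programmatically. The following is a minimal sketch that only assembles and prints a command line; the chosen argument values are illustrative.

```python
import json
import shlex

# Environment arguments as a Python dict; json.dumps converts True to JSON true.
env_args = {"use_predefined_eval_dataset": True, "tasks": ["sudoku", "maze"]}

# Assemble the same kind of command shown in the Quickstart examples.
command = ["uv", "run", "vf-eval", "enigmata", "-a", json.dumps(env_args)]

# shlex.join quotes the JSON so it survives the shell as a single argument.
print(shlex.join(command))
```

The printed line can be pasted into a shell as-is; nothing is executed by the sketch itself.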
Seeding and Reproducibility
Minimal seeding is applied to stabilize generation without touching code under Enigmata/:
- Python random
- NumPy (if available)
- PYTHONHASHSEED
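As a rough illustration of what seeding these three sources involves, here is a minimal sketch. The helper name apply_minimal_seeding is hypothetical and this is not the environment's actual code; it only assumes the three sources listed above.

```python
import os
import random


def apply_minimal_seeding(seed: int) -> None:
    """Seed the RNG sources listed above (hypothetical helper)."""
    random.seed(seed)
    # PYTHONHASHSEED is read at interpreter start-up, so setting it here
    # only affects child processes launched afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional; skip it when not installed
    np.random.seed(seed)


# Re-seeding with the same value reproduces the same random draw.
apply_minimal_seeding(42)
first = random.random()
apply_minimal_seeding(42)
second = random.random()
assert first == second
```

Note that puzzle generators relying on other sources of randomness would not be covered by a helper like this.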
Examples
Deterministic generation with a fixed seed:
uv run vf-eval enigmata \
-a '{"num_train_examples": 100, "num_eval_examples": 100, "seed": 42}'
Use environment variable instead of args:
ENIGMATA_SEED=123 uv run vf-eval enigmata \
-a '{"num_train_examples": 100, "num_eval_examples": 100}'
Different seeds for train vs eval (with seed 7, eval generation automatically uses seed + 1 = 8):
uv run vf-eval enigmata \
-a '{"num_train_examples": 100, "num_eval_examples": 100, "seed": 7}'
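The seed handling described in this section can be sketched as follows. The function name resolve_seeds is hypothetical, and two details are assumptions rather than documented behavior: that an explicit seed argument takes precedence over ENIGMATA_SEED, and that the seed + 1 offset also applies when the seed comes from the environment variable.

```python
import os
from typing import Optional, Tuple


def resolve_seeds(seed: Optional[int] = None) -> Tuple[Optional[int], Optional[int]]:
    """Return (train_seed, eval_seed); hypothetical helper."""
    if seed is None:
        # Assumption: fall back to the env var only when no seed arg is given.
        env_value = os.environ.get("ENIGMATA_SEED")
        seed = int(env_value) if env_value else None
    if seed is None:
        return None, None  # unseeded run
    # Eval generation uses seed + 1 so train and eval draws differ.
    return seed, seed + 1


print(resolve_seeds(7))
```

Offsetting the eval seed keeps the two splits reproducible while avoiding identical generated instances in train and eval.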
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| num_train_examples | int | -1 | Number of generated training examples per run (-1 means generator-chosen) |
| num_eval_examples | int | -1 | Number of generated evaluation examples per run (-1 means generator-chosen) |
| use_predefined_eval_dataset | bool | false | If true, loads BytedTsinghua-SIA/Enigmata-Eval from HF for eval |
| tasks | str or list | "all" | Filter to a task or list of tasks (e.g., "sudoku", ["sudoku","maze"]) |
| difficulties | Optional[list] | None | List of difficulties to include in the environment. |
| system_prompt | str | "" | Optional system prompt propagated to the environment |
| seed | Optional[int] | None | Global seed for reproducible generation (Python, NumPy). Eval uses seed+1 |
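Since tasks accepts either a single string or a list, a consumer typically normalizes it to one shape first. A minimal sketch, assuming "all" means no filtering; normalize_tasks is a hypothetical name, not a function exported by the environment.

```python
from typing import List, Optional, Union


def normalize_tasks(tasks: Union[str, List[str]] = "all") -> Optional[List[str]]:
    """Return None for "all" (no filter), otherwise a list of task names."""
    if tasks == "all":
        return None
    if isinstance(tasks, str):
        return [tasks]  # single task name given as a bare string
    return list(tasks)


print(normalize_tasks("sudoku"))
print(normalize_tasks(["sudoku", "maze"]))
```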
Metrics
| Metric | Meaning |
|---|---|
| reward | Verifier score for each evaluated problem (typically 0 or 1; aggregated as mean) |
Example Structure
Normalized examples produced by this environment follow this schema:
question: str
answer: str
info:
- task_name: str
- difficulty: str
- split: str
- language: str
- meta_json: Optional[str] # JSON-encoded metadata
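To make the schema concrete, the sketch below expresses it as typed dictionaries and decodes meta_json. The class names, the decode_meta helper, and all field values in the sample are made-up placeholders for illustration, not real dataset content.

```python
import json
from typing import Optional, TypedDict


class ExampleInfo(TypedDict):
    task_name: str
    difficulty: str
    split: str
    language: str
    meta_json: Optional[str]  # JSON-encoded metadata, or None


class Example(TypedDict):
    question: str
    answer: str
    info: ExampleInfo


def decode_meta(example: Example) -> dict:
    """Decode meta_json into a dict; an absent value yields an empty dict."""
    raw = example["info"]["meta_json"]
    return json.loads(raw) if raw else {}


# Placeholder values only.
sample: Example = {
    "question": "<puzzle prompt>",
    "answer": "<reference answer>",
    "info": {
        "task_name": "sudoku",
        "difficulty": "easy",
        "split": "eval",
        "language": "en",
        "meta_json": json.dumps({"size": 9}),
    },
}
print(decode_meta(sample))
```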
Verifier Integration
Per-example scoring dynamically imports verifiable_tasks.tasks.<task_name>.verifier and calls verify(solution: str, answer: str, meta: dict) -> float|int. If a verifier cannot be resolved, the reward defaults to 0.0 to fail closed.
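The dynamic lookup with a fail-closed default might look roughly like the sketch below. The function name score_example is hypothetical, and this is an illustration of the pattern rather than the environment's implementation; it only handles resolution failures and lets errors raised inside a verifier propagate.

```python
import importlib
from typing import Any, Dict


def score_example(task_name: str, solution: str, answer: str, meta: Dict[str, Any]) -> float:
    """Resolve the task's verifier and score one completion (hypothetical helper)."""
    module_path = f"verifiable_tasks.tasks.{task_name}.verifier"
    try:
        module = importlib.import_module(module_path)
        verify = getattr(module, "verify")
    except (ImportError, AttributeError):
        # Fail closed: an unresolvable verifier yields zero reward.
        return 0.0
    # Verifiers may return int or float; normalize to float.
    return float(verify(solution, answer, meta))


# With no matching verifier module available, the score is 0.0.
print(score_example("no_such_task", "42", "42", {}))
```

Defaulting to 0.0 means a misspelled or missing task can never inflate the mean reward.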