meta-sparse-shaping

meta-sparse-shaping is a deterministic Verifiers environment for comparing sparse terminal reward with shaped intermediate reward on the same prompts.

Each example gives an integer list and a sequence of list operations. The model must return exactly one JSON object inside a <result>...</result> tag:

<result>{"trace": [[...], [...]], "answer": [...]}</result>

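trace holds the list state after each operation, in order, so it has one entry per operation. answer is the final list, which is also the last entry of a correct trace.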
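As an illustration of the format requirement, a response could be checked along these lines. This is a minimal sketch, not the environment's own parser: the helper names parse_result and _is_int_list are hypothetical, and the exact validation rules the environment applies may differ.

```python
import json
import re

# Hypothetical helper, not taken from the environment's source.
RESULT_RE = re.compile(r"<result>(.*?)</result>", re.DOTALL)


def _is_int_list(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, list) and all(
        isinstance(x, int) and not isinstance(x, bool) for x in value
    )


def parse_result(completion):
    """Return {"trace": ..., "answer": ...}, or None if format or schema fails."""
    blocks = RESULT_RE.findall(completion)
    if len(blocks) != 1:  # require exactly one <result> block
        return None
    try:
        payload = json.loads(blocks[0])
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    trace, answer = payload.get("trace"), payload.get("answer")
    if not isinstance(trace, list) or not all(_is_int_list(s) for s in trace):
        return None
    if not _is_int_list(answer):
        return None
    return {"trace": trace, "answer": answer}


good = '<result>{"trace": [[3, 2, 1], [1, 2, 3]], "answer": [1, 2, 3]}</result>'
parsed = parse_result(good)
```

A response with no <result> block, more than one block, or a payload that is not a JSON object with integer lists makes this sketch return None.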

The environment exposes two reward modes:

reward_mode="sparse": format/schema credit is the same as in shaped mode, and the task reward comes mostly from exact final-answer correctness.
reward_mode="shaped": the same format/schema credit applies, plus partial credit for final-list element accuracy and for each exactly correct intermediate trace step.

Switching only reward_mode therefore gives paired RL runs with identical data distributions and different reward surfaces.
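The difference between the two modes can be sketched as follows. Every weight here (0.1, 0.9, 0.5, 0.2) is an illustrative assumption, as are the function names and the rule that a failed parse scores zero; the environment's actual values and formulas are not stated in this README. Here parsed is a dict with trace and answer keys, or None when the format check fails.

```python
def element_accuracy(predicted, expected):
    """Fraction of positions that agree; a length mismatch lowers the score."""
    longest = max(len(predicted), len(expected))
    if longest == 0:
        return 1.0
    return sum(p == e for p, e in zip(predicted, expected)) / longest


def trace_step_accuracy(predicted_trace, expected_trace):
    """Fraction of steps reproduced exactly at the same index."""
    longest = max(len(predicted_trace), len(expected_trace))
    if longest == 0:
        return 1.0
    return sum(p == e for p, e in zip(predicted_trace, expected_trace)) / longest


def score(parsed, expected_trace, expected_answer, reward_mode="sparse"):
    # All weights below are assumptions chosen for illustration only.
    if parsed is None:
        return 0.0
    format_credit = 0.1  # identical in both modes
    exact = float(parsed["answer"] == expected_answer)
    if reward_mode == "sparse":
        return format_credit + 0.9 * exact
    if reward_mode == "shaped":
        return (
            format_credit
            + 0.5 * exact
            + 0.2 * element_accuracy(parsed["answer"], expected_answer)
            + 0.2 * trace_step_accuracy(parsed["trace"], expected_trace)
        )
    raise ValueError(f"unknown reward_mode: {reward_mode!r}")


# A partially correct response: first trace step right, final answer wrong.
partial = {"trace": [[1, 2, 3], [9, 9, 9]], "answer": [3, 2, 0]}
expected_trace = [[1, 2, 3], [3, 2, 1]]
expected_answer = [3, 2, 1]
sparse_score = score(partial, expected_trace, expected_answer, "sparse")
shaped_score = score(partial, expected_trace, expected_answer, "shaped")
```

Under these assumed weights the partially correct response earns only the format credit in sparse mode, while shaped mode adds credit for the matching elements and the one correct trace step. A fully correct response scores the same in both modes.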

Usage

from verifiers import load_environment

env = load_environment(
    "meta-sparse-shaping",
    seed=20260615,
    num_examples=128,
    min_length=4,
    max_length=7,
    min_ops=3,
    max_ops=5,
    operation_set="core",
    reward_mode="shaped",
)
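
For a paired comparison, one way to keep everything except the reward mode fixed is to build both configurations from a single shared dictionary. This is a sketch: paired_kwargs and load_pair are hypothetical helper names, and the keyword arguments are the ones shown in the usage example above.

```python
# Settings shared by both runs, copied from the usage example.
SHARED_KWARGS = {
    "seed": 20260615,
    "num_examples": 128,
    "min_length": 4,
    "max_length": 7,
    "min_ops": 3,
    "max_ops": 5,
    "operation_set": "core",
}


def paired_kwargs(shared=SHARED_KWARGS):
    """Return one kwargs dict per reward mode, differing only in reward_mode."""
    return {
        mode: {**shared, "reward_mode": mode} for mode in ("sparse", "shaped")
    }


def load_pair():
    """Load both environments; requires the verifiers package."""
    # Imported lazily so paired_kwargs works without verifiers installed.
    from verifiers import load_environment

    return {
        mode: load_environment("meta-sparse-shaping", **kwargs)
        for mode, kwargs in paired_kwargs().items()
    }


configs = paired_kwargs()
```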

Operation Sets

operation_set="core" uses:

add
subtract
multiply
reverse
rotate_left
sort_asc

operation_set="extended" also adds:

sort_desc
keep_even
keep_greater_than
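
To make the relationship between operations, trace, and answer concrete, here is a small reference sketch. The semantics are assumptions, since this README does not define them: add, subtract, and multiply are taken to apply one integer argument to every element, rotate_left to rotate by the given count (one if omitted), keep_even to keep even values, and keep_greater_than to keep values above a threshold. The (name, argument) encoding and the run_ops helper are hypothetical.

```python
def _rotate_left(values, count):
    if not values:
        return []
    k = (1 if count is None else count) % len(values)
    return values[k:] + values[:k]


# Assumed semantics for each operation name; see the note above.
OPERATIONS = {
    "add": lambda values, arg: [v + arg for v in values],
    "subtract": lambda values, arg: [v - arg for v in values],
    "multiply": lambda values, arg: [v * arg for v in values],
    "reverse": lambda values, arg: values[::-1],
    "rotate_left": _rotate_left,
    "sort_asc": lambda values, arg: sorted(values),
    "sort_desc": lambda values, arg: sorted(values, reverse=True),
    "keep_even": lambda values, arg: [v for v in values if v % 2 == 0],
    "keep_greater_than": lambda values, arg: [v for v in values if v > arg],
}


def run_ops(values, ops):
    """Apply (name, arg) pairs in order; return (trace, answer)."""
    current = list(values)
    trace = []
    for name, arg in ops:
        current = OPERATIONS[name](current, arg)
        trace.append(list(current))  # one trace entry per operation
    return trace, current


trace, answer = run_ops(
    [3, 1, 2],
    [("add", 1), ("sort_asc", None), ("rotate_left", 1)],
)
# trace  -> [[4, 2, 3], [2, 3, 4], [3, 4, 2]]
# answer -> [3, 4, 2]
```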

Recommended First Probes

Start by running both reward modes on the same seed with a small model (Qwen 2B or Llama 1B). The main comparison metrics are reward, final exact answer, answer element accuracy, trace step accuracy, schema validity, and exact-one-result (whether the response contains exactly one <result> block).