IN Haystack RL Env (Community)

Find the needle pattern in a haystack of similar-looking but structurally different lines.

Type: RL Env
License: apache-2.0
Published: Mar 2026

Patterned Needle in Haystack

A benchmark for abstract pattern recognition in which the model must find "needle" lines hidden among "haystack" lines that differ only in their word-repetition pattern.

Concept

Each line in the haystack follows a pattern that describes word repetition: each digit stands for a distinct word, and a repeated digit marks a repeated word. For example:

Pattern   Example Line
00122     bird bird bread book book
01234     cat dog fish tree lamp
01021     dog cat dog bird cat

The key insight: bird bird bread book book and book book banana bidet bidet both conform to pattern 00122. The benchmark tests whether a model can recognize abstract patterns rather than memorizing specific words.
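
The mapping from a line to its pattern can be sketched in a few lines of Python. This is an illustration only, not the environment's own code; the function name pattern_of is hypothetical, and joining indices into a digit string assumes at most 10 distinct words per pattern.

```python
def pattern_of(words):
    """Return the repetition pattern of a word sequence as a digit string.

    Each distinct word receives the next unused index the first time it
    appears; later occurrences reuse that index.
    """
    first_seen = {}
    digits = []
    for word in words:
        if word not in first_seen:
            first_seen[word] = len(first_seen)
        digits.append(str(first_seen[word]))
    # Assumption: at most 10 distinct words, so single digits are unambiguous.
    return "".join(digits)


line_a = "bird bird bread book book".split()
line_b = "book book banana bidet bidet".split()
print(pattern_of(line_a), pattern_of(line_b))  # 00122 00122
```

Two lines share a pattern exactly when this function returns the same string for both, regardless of which words they contain.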

Task

Given a block of text where:

  • Most lines follow one of several "haystack" patterns
  • One or more lines follow a different "needle" pattern

The model must identify the needle segment(s) and output them in \boxed{}.
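
For intuition about what a correct solution has to do, here is a rough reference solver. It is a sketch under several assumptions: spaces mode, one pattern per line, and a single needle whose pattern occurs exactly once while every haystack pattern occurs at least twice (the default for min_haystack_appearances). The helper names are hypothetical and not part of the environment.

```python
from collections import Counter


def pattern_of(words):
    """Map each word to the index of its first appearance."""
    first_seen = {}
    return tuple(first_seen.setdefault(w, len(first_seen)) for w in words)


def find_needle_lines(text):
    """Return lines whose repetition pattern occurs only once in the block."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    patterns = [pattern_of(ln.split()) for ln in lines]
    counts = Counter(patterns)
    return [ln for ln, p in zip(lines, patterns) if counts[p] == 1]


def format_answer(needles):
    """Wrap the needle segment(s) in \\boxed{}, joined by ' | '."""
    return "\\boxed{" + " | ".join(needles) + "}"


block = """
cat cat dog fish fish
sun moon sun star moon
bird bird bread book book
dog cat fish tree lamp
red blue red green blue
"""
print(format_answer(find_needle_lines(block)))  # \boxed{dog cat fish tree lamp}
```

The rarity heuristic stops working when several needle lines share one pattern, so it should not be read as a general solution to the task.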

Difficulty Tuning

Parameter                  Effect
num_haystack_patterns      More patterns → needle less distinctive
num_needles                Multiple needles → more complex task
num_lines                  More lines → harder to find needle
vocab_size                 More words → harder to track repetitions
min/max_pattern_length     Longer patterns → more complex structure
min/max_patterns_per_line  Multiple patterns per line → needle hidden among haystacks
min_haystack_appearances   Each haystack pattern appears at least this many times (default: 2)
mode                       spaces < no_spaces < alphanumeric (increasing difficulty)
hint_level                 Fewer hints → harder reasoning

Modes

  • spaces: Words separated by spaces (easiest)

    • Example: bird bird bread book book
  • no_spaces: Words concatenated without spaces

    • Example: birdbirdbreadbookbook
    • Model must first segment into words
  • alphanumeric: Random alphanumeric strings, no spaces (hardest)

    • Example: x7kmx7kmp2raaB3caB3c
    • Model must discover what the "words" even are
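
The three modes can be pictured as different renderings of the same pattern. The sketch below is illustrative and is not the environment's generator: render and random_tokens are hypothetical names, and the fixed token length of 4 is an assumption taken from the example above.

```python
import random
import string


def render(pattern, words, mode="spaces"):
    """Fill a digit pattern with words; digit i selects words[i]."""
    tokens = [words[int(d)] for d in pattern]
    # Only "spaces" mode keeps word boundaries visible.
    return " ".join(tokens) if mode == "spaces" else "".join(tokens)


def random_tokens(count, length=4, seed=0):
    """Generate distinct random alphanumeric 'words' (length is an assumption)."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    tokens = set()
    while len(tokens) < count:
        tokens.add("".join(rng.choice(alphabet) for _ in range(length)))
    return sorted(tokens)


vocab = ["bird", "bread", "book"]
print(render("00122", vocab, "spaces"))     # bird bird bread book book
print(render("00122", vocab, "no_spaces"))  # birdbirdbreadbookbook
print(render("00122", random_tokens(3), "alphanumeric"))
```

In no_spaces and alphanumeric modes the pattern is unchanged; only the cues for recovering the word boundaries are removed.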

Hint Levels

  • none: "Find the line that doesn't belong."
  • minimal: "Most lines follow a pattern. Find the one that doesn't."
  • moderate: "Lines have hidden word-order patterns. Find the outlier."
  • full: Detailed explanation with pattern examples

Multiple Patterns Per Line

When max_patterns_per_line > 1, each line can contain multiple patterns concatenated together. Needle lines will have the needle pattern placed at a random position among haystack patterns, preventing the model from learning positional shortcuts.

Example with 3 patterns per line:

cat cat dog | bird bread bird | fish tree lamp book chair
^^^^^^^^^^^   ^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^
 haystack        haystack                NEEDLE
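
To make the segment structure concrete, the following sketch splits a spaces-mode line on the default pattern_separator (" | ") and computes each segment's pattern. It is an illustration under that assumption; segment_patterns and pattern_of are hypothetical helpers, not functions exported by the environment.

```python
def pattern_of(words):
    """Return the repetition pattern of a word sequence as a digit string."""
    first_seen = {}
    return "".join(str(first_seen.setdefault(w, len(first_seen))) for w in words)


def segment_patterns(line, separator=" | "):
    """Split a line into segments and pair each with its pattern."""
    return [(seg, pattern_of(seg.split())) for seg in line.split(separator)]


line = "cat cat dog | bird bread bird | fish tree lamp book chair"
for segment, pattern in segment_patterns(line):
    print(pattern, segment)
# 001 cat cat dog
# 010 bird bread bird
# 01234 fish tree lamp book chair
```

The needle is then the segment whose pattern is not one of the haystack patterns, wherever it sits in the line.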

Usage

# Install
vf-install patterned-needle-in-haystack

# Quick test
vf-eval patterned-needle-in-haystack -n 5 -m gpt-4.1-mini

# With custom parameters
vf-eval patterned-needle-in-haystack \
    -n 100 \
    -m gpt-4.1-mini \
    --env-kwargs '{"num_lines": 100, "mode": "no_spaces", "hint_level": "minimal"}'
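
When launching many runs from a script, quoting the JSON passed to --env-kwargs by hand is error-prone. One possible approach, sketched here with only the flags shown above, is to build the argument list in Python and let json.dumps and shlex.join handle the quoting.

```python
import json
import shlex

# Environment arguments taken from the example above.
env_kwargs = {"num_lines": 100, "mode": "no_spaces", "hint_level": "minimal"}

command = [
    "vf-eval", "patterned-needle-in-haystack",
    "-n", "100",
    "-m", "gpt-4.1-mini",
    "--env-kwargs", json.dumps(env_kwargs),
]

# shlex.join produces a shell-safe string; pass `command` itself to
# subprocess.run if you want to execute it without a shell.
print(shlex.join(command))
```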

Configuration

from patterned_needle_in_haystack import load_environment

env = load_environment(
    # Pattern generation
    num_haystack_patterns=5,       # Distinct haystack patterns
    num_needles=1,                 # Needle lines per problem
    min_pattern_length=5,          # Minimum pattern length
    max_pattern_length=5,          # Maximum pattern length
    min_patterns_per_line=1,       # Min patterns per line
    max_patterns_per_line=1,       # Max patterns per line
    pattern_separator=" | ",       # Separator between patterns (spaces mode)
    min_haystack_appearances=2,    # Each haystack pattern appears at least 2x
    
    # Problem structure
    num_lines=50,                  # Total lines per problem
    vocab_size=30,                 # Unique words per problem
    
    # Difficulty
    mode="spaces",                 # "spaces", "no_spaces", "alphanumeric"
    hint_level="moderate",         # "none", "minimal", "moderate", "full"
    
    # Dataset
    num_samples=1000,              # Problems to generate
    seed=42,                       # For reproducibility (None = random)
)

Examples

Easy: Default settings

env = load_environment()
# 50 lines, 1 needle, pattern length 5, moderate hints

Medium: More lines, no spaces, minimal hints

env = load_environment(
    num_lines=100,
    mode="no_spaces",
    hint_level="minimal",
)

Hard: Multiple patterns per line, no hints

env = load_environment(
    num_lines=150,
    min_patterns_per_line=2,
    max_patterns_per_line=3,
    hint_level="none",
)

Expert: Alphanumeric, multiple needles, no hints

env = load_environment(
    num_needles=2,
    num_lines=200,
    mode="alphanumeric",
    hint_level="none",
    min_pattern_length=4,
    max_pattern_length=7,
)

Output Format

The model should output the exact word sequence for the needle segment(s) inside \boxed{}:

Single needle:

\boxed{fish tree lamp book chair}

Multiple needles (separated by |):

\boxed{fish tree lamp book chair | dog cat bird cat dog}

The needles must be in the order they appear in the text, separated by | (space, pipe, space).

Scoring

  • Single needle: Exact match on the needle's word sequence (whitespace normalized)
  • Multiple needles: Exact match on all needles joined with the | separator (same count, same order, whitespace normalized)
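
The scoring rules above could be approximated as follows. This is a sketch, not the environment's actual rubric: the function names are hypothetical, and taking the last \boxed{} in the completion is an assumption.

```python
import re

BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def normalize(segment):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(segment.split())


def extract_needles(completion):
    """Return the needle segments from the last \\boxed{...}, or None."""
    matches = BOXED_RE.findall(completion)
    if not matches:
        return None
    # Assumption: the final boxed answer is the one that counts.
    return [normalize(part) for part in matches[-1].split("|")]


def score(completion, expected_needles):
    """1.0 if count, order, and normalized text all match, else 0.0."""
    expected = [normalize(n) for n in expected_needles]
    return 1.0 if extract_needles(completion) == expected else 0.0


needles = ["fish tree lamp book chair", "dog cat bird cat dog"]
print(score(r"\boxed{fish tree lamp  book chair | dog cat bird cat dog}", needles))  # 1.0
print(score(r"\boxed{dog cat bird cat dog | fish tree lamp book chair}", needles))   # 0.0
```

The second call scores 0.0 because the needles are listed out of order, even though both segments are individually correct.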

Vocabulary

The environment uses NLTK's words corpus (~50k filtered English words) to ensure diverse, unique vocabulary for each problem. In alphanumeric mode, random alphanumeric strings are generated instead.

Running Ablations

The environment includes scripts for systematic ablation studies.

Quick Start

# Run scale ablation and aggregate results
python run_ablations.py -m gpt-5-mini --ablation scale --aggregate

# Run all ablations
python run_ablations.py -m gpt-5-mini --ablation all

# Dry run (show commands without executing)
python run_ablations.py -m gpt-5-mini --ablation all --dry-run

Available Ablations

Ablation      Description                           Configs
presentation  Mode × Hint Level                     12
scale         Problem Size × Num Needles (heatmap)  36
complexity    Pattern Length × Patterns Per Line    15
all           Run all ablations                     63

Ablation Details

Presentation (Mode × Hint Level):

  • Modes: spaces, no_spaces, alphanumeric
  • Hints: none, minimal, moderate, full
  • Fixed: 50 lines, 1 needle, pattern length 5

Scale (Problem Size × Num Needles):

  • Lines: 30, 50, 75, 100, 150, 200, 300, 400, 600
  • Needles: 1, 2, 3, 5
  • Fixed: spaces mode, moderate hints
  • Great for heatmap visualization

Complexity (Pattern Length × Patterns Per Line):

  • Pattern lengths: (4,4), (5,5), (6,6), (8,8), (10,10)
  • Patterns per line: (1,1), (2,2), (3,3)
  • Fixed: 50 lines, 1 needle, spaces mode
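
The config counts in the table follow directly from these grids. The sketch below enumerates them with itertools.product using the load_environment parameter names; it is not the code in run_ablations.py, and any parameter not listed as fixed above is assumed to stay at its default.

```python
from itertools import product

MODES = ["spaces", "no_spaces", "alphanumeric"]
HINTS = ["none", "minimal", "moderate", "full"]
LINES = [30, 50, 75, 100, 150, 200, 300, 400, 600]
NEEDLES = [1, 2, 3, 5]
PATTERN_LENGTHS = [4, 5, 6, 8, 10]
PATTERNS_PER_LINE = [1, 2, 3]

# Presentation: Mode × Hint Level, with 50 lines, 1 needle, pattern length 5.
presentation = [
    {"mode": m, "hint_level": h, "num_lines": 50, "num_needles": 1,
     "min_pattern_length": 5, "max_pattern_length": 5}
    for m, h in product(MODES, HINTS)
]

# Scale: Problem Size × Num Needles, with spaces mode and moderate hints.
scale = [
    {"num_lines": n, "num_needles": k, "mode": "spaces", "hint_level": "moderate"}
    for n, k in product(LINES, NEEDLES)
]

# Complexity: Pattern Length × Patterns Per Line, with 50 lines, 1 needle, spaces mode.
complexity = [
    {"min_pattern_length": p, "max_pattern_length": p,
     "min_patterns_per_line": q, "max_patterns_per_line": q,
     "num_lines": 50, "num_needles": 1, "mode": "spaces"}
    for p, q in product(PATTERN_LENGTHS, PATTERNS_PER_LINE)
]

print(len(presentation), len(scale), len(complexity))  # 12 36 15
print(len(presentation) + len(scale) + len(complexity))  # 63
```

Each dictionary has the shape expected by load_environment or by the --env-kwargs flag of vf-eval.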

CLI Options

python run_ablations.py --help

Options:
  -m, --model MODEL          Model to evaluate (required)
  --ablation ABLATION        Which ablation to run (default: presentation)
  -n, --num-samples N        Samples per config (default: 50)
  -r, --rollouts N           Rollouts per sample (default: 1)
  -c, --concurrency N        Concurrency (default: 50)
  -k, --api-key-var VAR      Environment variable for API key
  -b, --base-url URL         Base URL for API
  -a, --aggregate            Run aggregation after ablations
  --dry-run                  Print commands without executing

Aggregating Results

# Aggregate all results from outputs/
python aggregate_results.py

# Specify output file
python aggregate_results.py -o results_summary.csv

# Save raw individual results
python aggregate_results.py --raw-output raw_results.csv

Evaluation results are saved to outputs/evals/, and aggregated summaries are written to outputs/aggregate.csv.

Plotting Results

# Install analysis dependencies (pandas, matplotlib, seaborn)
pip install -e ".[analysis]"

# Show all plots interactively (default)
python plot_results.py

# Show specific ablation
python plot_results.py --ablation scale

# Save all plots to files (default: images/ directory)
python plot_results.py -s

# Save with custom output directory and DPI
python plot_results.py -s -o images/ --dpi 300

When saving (-s/--save), generates:

  • presentation_heatmap_{model}.png - Mode × Hint Level accuracy heatmap
  • scale_heatmap_{model}.png - Problem Size × Num Needles accuracy heatmap
  • complexity_heatmap_{model}.png - Pattern Length × Patterns/Line accuracy heatmap
  • overview.png - Combined multi-panel figure