Patterned Needle in Haystack

A benchmark for abstract pattern recognition where the model must find "needle" lines hidden among "haystack" lines that differ only in their word-order pattern.

Concept

Each line in the haystack follows a pattern that describes word repetition. For example:

Pattern	Example Line
`00122`	bird bird bread book book
`01234`	cat dog fish tree lamp
`01021`	dog cat dog bird cat

The key insight: bird bird bread book book and book book banana bidet bidet both conform to pattern 00122. The benchmark tests whether a model can recognize abstract patterns rather than memorizing specific words.

Task

Given a block of text where:

Most lines follow one of several "haystack" patterns
One or more lines follow a different "needle" pattern

The model must identify the needle segment(s) and output them in \boxed{}.

Difficulty Tuning

Parameter	Effect
`num_haystack_patterns`	More patterns → needle less distinctive
`num_needles`	Multiple needles → more complex task
`num_lines`	More lines → harder to find needle
`vocab_size`	More words → harder to track repetitions
`min/max_pattern_length`	Longer patterns → more complex structure
`min/max_patterns_per_line`	Multiple patterns per line → needle hidden among haystacks
`min_haystack_appearances`	Each haystack pattern appears at least this many times (default: 2)
`mode`	`spaces` → `no_spaces` → `alphanumeric` (increasing difficulty)
`hint_level`	Less hints → harder reasoning

Modes

spaces: Words separated by spaces (easiest)
- Example: bird bird bread book book
no_spaces: Words concatenated without spaces
- Example: birdbirdbreadbookbook
- Model must first segment into words
alphanumeric: Random alphanumeric strings, no spaces (hardest)
- Example: x7kmx7kmp2raaB3caB3c
- Model must discover what the "words" even are

Hint Levels

none: "Find the line that doesn't belong."
minimal: "Most lines follow a pattern. Find the one that doesn't."
moderate: "Lines have hidden word-order patterns. Find the outlier."
full: Detailed explanation with pattern examples

Multiple Patterns Per Line

When max_patterns_per_line > 1, each line can contain multiple patterns concatenated together. Needle lines will have the needle pattern placed at a random position among haystack patterns, preventing the model from learning positional shortcuts.

Example with 3 patterns per line:

cat cat dog | bird bread bird | fish tree lamp book chair
^^^^^^^^^^   ^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^
haystack       haystack              NEEDLE

Usage

# Install
vf-install patterned-needle-in-haystack

# Quick test
vf-eval patterned-needle-in-haystack -n 5 -m gpt-4.1-mini

# With custom parameters
vf-eval patterned-needle-in-haystack \
    -n 100 \
    -m gpt-4.1-mini \
    --env-kwargs '{"num_lines": 100, "mode": "no_spaces", "hint_level": "minimal"}'

Configuration

from patterned_needle_in_haystack import load_environment

env = load_environment(
    # Pattern generation
    num_haystack_patterns=5,       # Distinct haystack patterns
    num_needles=1,                 # Needle lines per problem
    min_pattern_length=5,          # Minimum pattern length
    max_pattern_length=5,          # Maximum pattern length
    min_patterns_per_line=1,       # Min patterns per line
    max_patterns_per_line=1,       # Max patterns per line
    pattern_separator=" | ",       # Separator between patterns (spaces mode)
    min_haystack_appearances=2,    # Each haystack pattern appears at least 2x
    
    # Problem structure
    num_lines=50,                  # Total lines per problem
    vocab_size=30,                 # Unique words per problem
    
    # Difficulty
    mode="spaces",                 # "spaces", "no_spaces", "alphanumeric"
    hint_level="moderate",         # "none", "minimal", "moderate", "full"
    
    # Dataset
    num_samples=1000,              # Problems to generate
    seed=42,                       # For reproducibility (None = random)
)

Examples

Easy: Default settings

env = load_environment()
# 50 lines, 1 needle, pattern length 5, moderate hints

Medium: More lines, no spaces, minimal hints

env = load_environment(
    num_lines=100,
    mode="no_spaces",
    hint_level="minimal",
)

Hard: Multiple patterns per line, no hints

env = load_environment(
    num_lines=150,
    min_patterns_per_line=2,
    max_patterns_per_line=3,
    hint_level="none",
)

Expert: Alphanumeric, multiple needles, no hints

env = load_environment(
    num_needles=2,
    num_lines=200,
    mode="alphanumeric",
    hint_level="none",
    min_pattern_length=4,
    max_pattern_length=7,
)

Output Format

The model should output the exact word sequence for the needle segment(s) inside \boxed{}:

Single needle:

\boxed{fish tree lamp book chair}

Multiple needles (separated by |):

\boxed{fish tree lamp book chair | dog cat bird cat dog}

The needles must be in the order they appear in the text, separated by | (space, pipe, space).

Scoring

Single needle: Exact match on the needle pattern (whitespace normalized)
Multiple needles: Exact match with | separator (same count, same order, whitespace normalized)

Vocabulary

The environment uses NLTK's words corpus (~50k filtered English words) to ensure diverse, unique vocabulary for each problem. In alphanumeric mode, random alphanumeric strings are generated instead.

Running Ablations

The environment includes scripts for systematic ablation studies.

Quick Start

# Run scale ablation and aggregate results
python run_ablations.py -m gpt-5-mini --ablation scale --aggregate

# Run all ablations
python run_ablations.py -m gpt-5-mini --ablation all

# Dry run (show commands without executing)
python run_ablations.py -m gpt-5-mini --ablation all --dry-run

Available Ablations

Ablation	Description	Configs
`presentation`	Mode × Hint Level	12
`scale`	Problem Size × Num Needles (heatmap)	36
`complexity`	Pattern Length × Patterns Per Line	15
`all`	Run all ablations	63

Ablation Details

Presentation (Mode × Hint Level):

Modes: spaces, no_spaces, alphanumeric
Hints: none, minimal, moderate, full
Fixed: 50 lines, 1 needle, pattern length 5

Scale (Problem Size × Num Needles):

Lines: 30, 50, 75, 100, 150, 200, 300, 400, 600
Needles: 1, 2, 3, 5
Fixed: spaces mode, moderate hints
Great for heatmap visualization

Complexity (Pattern Length × Patterns Per Line):

Pattern lengths: (4,4), (5,5), (6,6), (8,8), (10,10)
Patterns per line: (1,1), (2,2), (3,3)
Fixed: 50 lines, 1 needle, spaces mode

CLI Options

python run_ablations.py --help

Options:
  -m, --model MODEL          Model to evaluate (required)
  --ablation ABLATION        Which ablation to run (default: presentation)
  -n, --num-samples N        Samples per config (default: 50)
  -r, --rollouts N           Rollouts per sample (default: 1)
  -c, --concurrency N        Concurrency (default: 50)
  -k, --api-key-var VAR      Environment variable for API key
  -b, --base-url URL         Base URL for API
  -a, --aggregate            Run aggregation after ablations
  --dry-run                  Print commands without executing

Aggregating Results

# Aggregate all results from outputs/
python aggregate_results.py

# Specify output file
python aggregate_results.py -o results_summary.csv

# Save raw individual results
python aggregate_results.py --raw-output raw_results.csv

Results are saved to outputs/evals/ and aggregated summaries to outputs/aggregate.csv.

Plotting Results

# Install analysis dependencies (pandas, matplotlib, seaborn)
pip install -e ".[analysis]"

# Show all plots interactively (default)
python plot_results.py

# Show specific ablation
python plot_results.py --ablation scale

# Save all plots to files (default: images/ directory)
python plot_results.py -s

# Save with custom output directory and DPI
python plot_results.py -s -o images/ --dpi 300

When saving (-s/--save), generates:

presentation_heatmap_{model}.png - Mode × Hint Level accuracy heatmap
scale_heatmap_{model}.png - Problem Size × Num Needles accuracy heatmap
complexity_heatmap_{model}.png - Pattern Length × Patterns/Line accuracy heatmap
overview.png - Combined multi-panel figure