Patterned Needle in Haystack
A benchmark for abstract pattern recognition where the model must find "needle" lines hidden among "haystack" lines that differ only in their word-order pattern.
Concept
Each line in the haystack follows a pattern that describes word repetition. For example:
| Pattern | Example Line |
|---|---|
00122 | bird bird bread book book |
01234 | cat dog fish tree lamp |
01021 | dog cat dog bird cat |
The key insight: bird bird bread book book and book book banana bidet bidet both conform to pattern 00122. The benchmark tests whether a model can recognize abstract patterns rather than memorizing specific words.
Task
Given a block of text where:
- Most lines follow one of several "haystack" patterns
- One or more lines follow a different "needle" pattern
The model must identify the needle segment(s) and output them in \boxed{}.
Difficulty Tuning
| Parameter | Effect |
|---|---|
num_haystack_patterns | More patterns → needle less distinctive |
num_needles | Multiple needles → more complex task |
num_lines | More lines → harder to find needle |
vocab_size | More words → harder to track repetitions |
min/max_pattern_length | Longer patterns → more complex structure |
min/max_patterns_per_line | Multiple patterns per line → needle hidden among haystacks |
min_haystack_appearances | Each haystack pattern appears at least this many times (default: 2) |
mode | spaces → no_spaces → alphanumeric (increasing difficulty) |
hint_level | Less hints → harder reasoning |
Modes
-
spaces: Words separated by spaces (easiest)- Example:
bird bird bread book book
- Example:
-
no_spaces: Words concatenated without spaces- Example:
birdbirdbreadbookbook - Model must first segment into words
- Example:
-
alphanumeric: Random alphanumeric strings, no spaces (hardest)- Example:
x7kmx7kmp2raaB3caB3c - Model must discover what the "words" even are
- Example:
Hint Levels
none: "Find the line that doesn't belong."minimal: "Most lines follow a pattern. Find the one that doesn't."moderate: "Lines have hidden word-order patterns. Find the outlier."full: Detailed explanation with pattern examples
Multiple Patterns Per Line
When max_patterns_per_line > 1, each line can contain multiple patterns concatenated together. Needle lines will have the needle pattern placed at a random position among haystack patterns, preventing the model from learning positional shortcuts.
Example with 3 patterns per line:
cat cat dog | bird bread bird | fish tree lamp book chair
^^^^^^^^^^ ^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^
haystack haystack NEEDLE
Usage
# Install
vf-install patterned-needle-in-haystack
# Quick test
vf-eval patterned-needle-in-haystack -n 5 -m gpt-4.1-mini
# With custom parameters
vf-eval patterned-needle-in-haystack \
-n 100 \
-m gpt-4.1-mini \
--env-kwargs '{"num_lines": 100, "mode": "no_spaces", "hint_level": "minimal"}'
Configuration
from patterned_needle_in_haystack import load_environment
env = load_environment(
# Pattern generation
num_haystack_patterns=5, # Distinct haystack patterns
num_needles=1, # Needle lines per problem
min_pattern_length=5, # Minimum pattern length
max_pattern_length=5, # Maximum pattern length
min_patterns_per_line=1, # Min patterns per line
max_patterns_per_line=1, # Max patterns per line
pattern_separator=" | ", # Separator between patterns (spaces mode)
min_haystack_appearances=2, # Each haystack pattern appears at least 2x
# Problem structure
num_lines=50, # Total lines per problem
vocab_size=30, # Unique words per problem
# Difficulty
mode="spaces", # "spaces", "no_spaces", "alphanumeric"
hint_level="moderate", # "none", "minimal", "moderate", "full"
# Dataset
num_samples=1000, # Problems to generate
seed=42, # For reproducibility (None = random)
)
Examples
Easy: Default settings
env = load_environment()
# 50 lines, 1 needle, pattern length 5, moderate hints
Medium: More lines, no spaces, minimal hints
env = load_environment(
num_lines=100,
mode="no_spaces",
hint_level="minimal",
)
Hard: Multiple patterns per line, no hints
env = load_environment(
num_lines=150,
min_patterns_per_line=2,
max_patterns_per_line=3,
hint_level="none",
)
Expert: Alphanumeric, multiple needles, no hints
env = load_environment(
num_needles=2,
num_lines=200,
mode="alphanumeric",
hint_level="none",
min_pattern_length=4,
max_pattern_length=7,
)
Output Format
The model should output the exact word sequence for the needle segment(s) inside \boxed{}:
Single needle:
\boxed{fish tree lamp book chair}
Multiple needles (separated by |):
\boxed{fish tree lamp book chair | dog cat bird cat dog}
The needles must be in the order they appear in the text, separated by | (space, pipe, space).
Scoring
- Single needle: Exact match on the needle pattern (whitespace normalized)
- Multiple needles: Exact match with
|separator (same count, same order, whitespace normalized)
Vocabulary
The environment uses NLTK's words corpus (~50k filtered English words) to ensure diverse, unique vocabulary for each problem. In alphanumeric mode, random alphanumeric strings are generated instead.
Running Ablations
The environment includes scripts for systematic ablation studies.
Quick Start
# Run scale ablation and aggregate results
python run_ablations.py -m gpt-5-mini --ablation scale --aggregate
# Run all ablations
python run_ablations.py -m gpt-5-mini --ablation all
# Dry run (show commands without executing)
python run_ablations.py -m gpt-5-mini --ablation all --dry-run
Available Ablations
| Ablation | Description | Configs |
|---|---|---|
presentation | Mode × Hint Level | 12 |
scale | Problem Size × Num Needles (heatmap) | 36 |
complexity | Pattern Length × Patterns Per Line | 15 |
all | Run all ablations | 63 |
Ablation Details
Presentation (Mode × Hint Level):
- Modes:
spaces,no_spaces,alphanumeric - Hints:
none,minimal,moderate,full - Fixed: 50 lines, 1 needle, pattern length 5
Scale (Problem Size × Num Needles):
- Lines: 30, 50, 75, 100, 150, 200, 300, 400, 600
- Needles: 1, 2, 3, 5
- Fixed: spaces mode, moderate hints
- Great for heatmap visualization
Complexity (Pattern Length × Patterns Per Line):
- Pattern lengths: (4,4), (5,5), (6,6), (8,8), (10,10)
- Patterns per line: (1,1), (2,2), (3,3)
- Fixed: 50 lines, 1 needle, spaces mode
CLI Options
python run_ablations.py --help
Options:
-m, --model MODEL Model to evaluate (required)
--ablation ABLATION Which ablation to run (default: presentation)
-n, --num-samples N Samples per config (default: 50)
-r, --rollouts N Rollouts per sample (default: 1)
-c, --concurrency N Concurrency (default: 50)
-k, --api-key-var VAR Environment variable for API key
-b, --base-url URL Base URL for API
-a, --aggregate Run aggregation after ablations
--dry-run Print commands without executing
Aggregating Results
# Aggregate all results from outputs/
python aggregate_results.py
# Specify output file
python aggregate_results.py -o results_summary.csv
# Save raw individual results
python aggregate_results.py --raw-output raw_results.csv
Results are saved to outputs/evals/ and aggregated summaries to outputs/aggregate.csv.
Plotting Results
# Install analysis dependencies (pandas, matplotlib, seaborn)
pip install -e ".[analysis]"
# Show all plots interactively (default)
python plot_results.py
# Show specific ablation
python plot_results.py --ablation scale
# Save all plots to files (default: images/ directory)
python plot_results.py -s
# Save with custom output directory and DPI
python plot_results.py -s -o images/ --dpi 300
When saving (-s/--save), generates:
presentation_heatmap_{model}.png- Mode × Hint Level accuracy heatmapscale_heatmap_{model}.png- Problem Size × Num Needles accuracy heatmapcomplexity_heatmap_{model}.png- Pattern Length × Patterns/Line accuracy heatmapoverview.png- Combined multi-panel figure