color-codeword
Overview
- Environment ID:
color-codeword - Short description: Multi-turn VLM environment where colored squares must be decoded into letters and accumulated across turns.
- Tags: vlm, multi-turn, multimodal
Datasets
- Primary dataset(s): Synthetically generated at runtime
- Source links: N/A (procedural generation)
- Split sizes: Configurable via
num_examplesparameter (default: 1000)
Task
- Type: multi-turn
- Parser: Extracts letter sequences (A-I) from model response
- Rubric overview: Binary reward - 1.0 for exact match, 0.0 otherwise.
The model sees colored square images across multiple turns and must decode them using a fixed mapping (Red=A, Green=B, Blue=C, Yellow=D, Purple=E, Cyan=F, Orange=G, White=H, Black=I). Each turn adds new squares; the model must output the full accumulated codeword.
Quickstart
prime eval run color-codeword -m Qwen/Qwen3-VL-32B-Instruct -n 100
Harder configuration (10 images across 5 turns, 2 per turn):
prime eval run color-codeword \
-m Qwen/Qwen3-VL-32B-Instruct \
-n 100 -t 128 -T 0.7 \
-a '{"images_per_turn": 2, "max_turns": 5}'
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
num_examples | int | 1000 | Number of examples to generate |
images_per_turn | int | 2 | Number of colored squares shown each turn |
max_turns | int | 3 | Number of conversation turns |
seed | int | 42 | Random seed for reproducibility |
Total codeword length = images_per_turn × max_turns (default: 6 letters).
Metrics
| Metric | Meaning |
|---|---|
reward | 1.0 if extracted codeword matches expected, 0.0 otherwise |
partial_match_score | Fraction of letters correct at each position |
Changelog
v0.1.0
- Initial release