WhoDunItHolmes
Overview
Environment ID: WhoDunItHolmes
Short description: Culprit detection on kjgpta/WhoDunIt, rewarding both correct identification and adherence to a concise Victorian “Holmesian” reasoning style.
Tags: mystery, culprit-detection, single-turn, think, holmes
Datasets
Primary dataset(s): kjgpta/WhoDunIt (train/test via Hugging Face Hub)
Source links: datasets.load_dataset("kjgpta/WhoDunIt")
Split sizes: Defaults to full train/test (320 train, 80 test)
Fields: text, title, author, length, culprit_ids, and optional metadata
Task
Type: single-turn
Parser: XMLParser(fields=["ponder", "verdict"], answer_field="verdict")
Rubric overview:
- Correctness: strict match between the parsed <verdict> and the gold culprit.
- Format check: verifies adherence to the <ponder> and <verdict> tags.
- Style reward: evaluates “Holmesian” diction versus modern style using a zero-shot classifier (facebook/bart-large-mnli).
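The correctness and format checks can be pictured with a small regex-based sketch. This is a stand-in for illustration only, not the environment's XMLParser: the helper names (`extract_tag`, `format_ok`, `verdict_matches`) are hypothetical, and the gold culprit is assumed to be a single string compared exactly after trimming whitespace.

```python
import re
from typing import Optional

# Hypothetical stand-in for the XML parsing step; not the verifiers API.
TAG_PATTERN = r"<{tag}>(.*?)</{tag}>"


def extract_tag(completion: str, tag: str) -> Optional[str]:
    """Return the stripped contents of the first <tag>...</tag> pair, or None."""
    match = re.search(TAG_PATTERN.format(tag=tag), completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def format_ok(completion: str) -> bool:
    """True when both <ponder> and <verdict> are present and non-empty."""
    return all(extract_tag(completion, tag) for tag in ("ponder", "verdict"))


def verdict_matches(completion: str, gold: str) -> bool:
    """Exact string comparison after trimming whitespace (an assumption)."""
    verdict = extract_tag(completion, "verdict")
    return verdict is not None and verdict == gold.strip()


sample = (
    "<ponder>The mud upon his boots betrays him.</ponder>"
    "<verdict>Colonel Moran</verdict>"
)
print(format_ok(sample), verdict_matches(sample, "Colonel Moran"))
```

How the real environment normalises names (case, titles, multiple culprits) is not stated here, so any such handling would need to be checked against the environment's source.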
Scoring Weights:
| Metric | Weight | Description |
|---|---|---|
| correct_answer | 2.5 | 1.0 if culprit correctly identified |
| format_reward | 0.4 | XML tag correctness |
| holmes_style_reward | 0.6 | Classifier confidence that the response reads as Victorian “Holmesian” reasoning |
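To see how the weights interact, here is a minimal sketch that combines the three metrics as a plain weighted sum. The weighted-sum combination and the assumption that each metric lies in [0, 1] are mine, not confirmed by this document; `combined_reward` is a hypothetical helper.

```python
# Weights copied from the table above.
WEIGHTS = {
    "correct_answer": 2.5,
    "format_reward": 0.4,
    "holmes_style_reward": 0.6,
}


def combined_reward(scores: dict) -> float:
    """Weighted sum of per-metric scores; metrics missing from `scores` count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


# A correct, well-formatted answer with middling style confidence.
example = {"correct_answer": 1.0, "format_reward": 1.0, "holmes_style_reward": 0.5}
print(round(combined_reward(example), 3))
```

Under these assumptions the maximum total is 2.5 + 0.4 + 0.6 = 3.5, so a correct verdict dominates: perfect format and style without the right culprit reach at most 1.0.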
Quickstart
Run an evaluation with default settings:
uv run vf-eval WhoDunItHolmes
Configure model and sampling:
uv run vf-eval WhoDunItHolmes \
-m deepseek-ai/DeepSeek-V3.1 \
-n 5 -r 1 -t 160 -T 0.2
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| num_train_examples | int | -1 | Limit training set size (-1 for all) |
| num_eval_examples | int | -1 | Limit evaluation set size (-1 for all) |
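The -1 convention for both arguments can be sketched as follows. `limit_examples` is a hypothetical helper written for illustration; it assumes the limit simply keeps the first n rows, which this document does not confirm.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def limit_examples(rows: Sequence[T], n: int) -> List[T]:
    """Keep the first n rows; n == -1 keeps every row."""
    if n == -1:
        return list(rows)
    if n < 0:
        raise ValueError("n must be -1 (all rows) or a non-negative integer")
    # Slicing past the end is safe, so n larger than len(rows) keeps all rows.
    return list(rows[:n])


stories = ["story-a", "story-b", "story-c"]
print(limit_examples(stories, -1))
print(limit_examples(stories, 2))
```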
Metrics
| Metric | Meaning |
|---|---|
| correct_answer | 1.0 if predicted <verdict> equals gold culprit |
| format_reward | Adherence to required XML structure <ponder> + <verdict> |
| holmes_style_reward | Reward for “Holmesian” (Victorian detective) linguistic style, based on zero-shot classifier |
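The style metric could be computed along the following lines. This is a sketch under stated assumptions: the two candidate labels are hypothetical (the environment's actual label strings are not given here), and the score is taken to be the classifier's probability for the Holmesian label. The classifier is passed in as a callable so the scoring logic can be exercised without downloading the model.

```python
from typing import Callable, Sequence

# Hypothetical candidate labels; the environment's real labels may differ.
HOLMES_LABEL = "Victorian detective reasoning"
MODERN_LABEL = "modern casual writing"

Classifier = Callable[[str, Sequence[str]], dict]


def build_classifier() -> Classifier:
    """Create the Hugging Face zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: only needed for real scoring

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def holmes_style_score(text: str, classify: Classifier) -> float:
    """Return the classifier's score for the Holmesian label."""
    result = classify(text, [HOLMES_LABEL, MODERN_LABEL])
    # The pipeline returns parallel "labels" and "scores" lists, sorted by score.
    scores = dict(zip(result["labels"], result["scores"]))
    return float(scores.get(HOLMES_LABEL, 0.0))
```

With the real model, usage would look like `holmes_style_score(ponder_text, build_classifier())`. Building the classifier once and reusing it across rollouts avoids reloading the model for every response.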