Whodunit Holmes RL Env (Kunumi)

Culprit detection on kjgpta/WhoDunIt with a Holmesian style reward.

Type: RL Env
Publisher: Kunumi
Runtime: single-turn
License: unknown
Size: v0.1.1
Published: Oct 2025

WhoDunItHolmes

Overview

Environment ID: WhoDunItHolmes
Short description: Culprit detection on kjgpta/WhoDunIt, rewarding both correct identification and adherence to the concise Victorian “Holmesian” reasoning style.
Tags: mystery, culprit-detection, single-turn, think, holmes

Datasets

Primary dataset(s): kjgpta/WhoDunIt (train/test via Hugging Face Hub)
Source links: datasets.load_dataset("kjgpta/WhoDunIt")
Split sizes: Defaults to full train/test (320 train, 80 test)
Fields: text, title, author, length, culprit_ids, and optional metadata

Task

Type: single-turn
Parser: XMLParser(fields=["ponder", "verdict"], answer_field="verdict")
Rubric overview:

  • Correctness: strict match between parsed <verdict> and gold culprit.
  • Format check: verifies adherence to <ponder> and <verdict> tags.
  • Style reward: evaluates “Holmesian” diction vs modern style using a zero-shot classifier (facebook/bart-large-mnli).
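
The correctness and format checks above can be sketched in plain Python. This is a minimal illustration, not the environment's implementation: it swaps in a regex-based extractor for XMLParser, assumes the gold culprit is a single string, treats "strict match" as exact string equality after trimming whitespace, and assumes the format score is the fraction of required tag pairs present. The function names are hypothetical.

```python
import re

# Hypothetical stand-in for the environment's XMLParser: a regex-based
# extractor for the two fields named in the Task section.
FIELDS = ("ponder", "verdict")


def parse_fields(completion: str) -> dict:
    """Return the stripped content of each field, or None if its tag pair is missing."""
    parsed = {}
    for field in FIELDS:
        match = re.search(rf"<{field}>(.*?)</{field}>", completion, re.DOTALL)
        parsed[field] = match.group(1).strip() if match else None
    return parsed


def correct_answer(completion: str, gold_culprit: str) -> float:
    """1.0 on an exact match between <verdict> and the gold culprit (assumed rule)."""
    verdict = parse_fields(completion)["verdict"]
    return 1.0 if verdict is not None and verdict == gold_culprit else 0.0


def format_reward(completion: str) -> float:
    """Assumed scoring: fraction of the required tag pairs that are present."""
    parsed = parse_fields(completion)
    return sum(value is not None for value in parsed.values()) / len(FIELDS)


sample = (
    "<ponder>The mud upon his boots betrays him.</ponder>"
    "<verdict>Colonel Moran</verdict>"
)
print(correct_answer(sample, "Colonel Moran"))
print(format_reward(sample))
```

Under these assumptions, a completion that omits the <ponder> block still earns half of the format score, while a verdict that differs from the gold culprit only in letter case earns no correctness credit.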

Scoring Weights:

| Metric | Weight | Description |
| --- | --- | --- |
| correct_answer | 2.5 | 1.0 if culprit correctly identified |
| format_reward | 0.4 | XML tag correctness |
| holmes_style_reward | 0.6 | Confidence of Victorian reasoning style |
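
As a worked example of how these weights might combine, the sketch below assumes the total reward is a plain weighted sum of the three metric scores, each in the range 0 to 1. The README does not state the combination rule, so the `total_reward` helper is hypothetical.

```python
# Weights copied from the Scoring Weights table.
WEIGHTS = {
    "correct_answer": 2.5,
    "format_reward": 0.4,
    "holmes_style_reward": 0.6,
}


def total_reward(scores: dict) -> float:
    """Assumed combination rule: weighted sum of the per-metric scores."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Correct culprit, well-formed tags, classifier half-convinced of the style:
# 2.5 * 1.0 + 0.4 * 1.0 + 0.6 * 0.5
example = {"correct_answer": 1.0, "format_reward": 1.0, "holmes_style_reward": 0.5}
print(round(total_reward(example), 2))
```

Under this assumption the maximum total is 2.5 + 0.4 + 0.6 = 3.5, and correctness accounts for roughly 71% of it, so style and formatting alone cannot outweigh a wrong verdict.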

Quickstart

Run an evaluation with default settings:

uv run vf-eval WhoDunItHolmes

Configure model and sampling:

uv run vf-eval WhoDunItHolmes \
  -m deepseek-ai/DeepSeek-V3.1 \
  -n 5 -r 1 -t 160 -T 0.2

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| num_train_examples | int | -1 | Limit training set size (-1 for all) |
| num_eval_examples | int | -1 | Limit evaluation set size (-1 for all) |
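
The `-1` convention for both arguments can be illustrated with a small helper. This is a sketch under the assumption that a non-negative limit keeps the first N examples; `limit_examples` is a hypothetical name, and the environment may select examples differently.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def limit_examples(examples: Sequence[T], limit: int = -1) -> Sequence[T]:
    """Return all examples when limit is -1, otherwise the first `limit` of them."""
    if limit == -1:
        return examples
    if limit < 0:
        raise ValueError("limit must be -1 (all) or a non-negative integer")
    return examples[:limit]


train = list(range(320))  # placeholder rows standing in for the 320 training examples
print(len(limit_examples(train)))      # default of -1 keeps everything
print(len(limit_examples(train, 50)))  # keep only the first 50
```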

Metrics

| Metric | Meaning |
| --- | --- |
| correct_answer | 1.0 if predicted <verdict> equals gold culprit |
| format_reward | Adherence to required XML structure <ponder> + <verdict> |
| holmes_style_reward | Reward for “Holmesian” (Victorian detective) linguistic style, based on zero-shot classifier |
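
One way a zero-shot style score like holmes_style_reward could be computed is sketched below. The two candidate labels are hypothetical, since the README does not list the labels the environment uses, and the choice to return the classifier's score for the Holmesian label is an assumption. The classifier is passed in as an argument so that the scoring logic can be exercised without downloading the model.

```python
from typing import Callable

# Hypothetical candidate labels; the environment's real labels are not documented here.
HOLMES_LABEL = "Victorian detective reasoning"
MODERN_LABEL = "modern casual writing"


def holmes_style_reward(text: str, classifier: Callable) -> float:
    """Return the classifier's confidence that `text` matches the Holmesian label."""
    result = classifier(text, candidate_labels=[HOLMES_LABEL, MODERN_LABEL])
    # Zero-shot pipelines return labels sorted by score, so look the label up by name.
    scores = dict(zip(result["labels"], result["scores"]))
    return float(scores[HOLMES_LABEL])


def build_classifier():
    """Load the zero-shot model named in the rubric (downloads weights on first use)."""
    from transformers import pipeline  # lazy import; needs transformers and a backend such as PyTorch

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
```

With the model available, a call would look like `holmes_style_reward(ponder_text, build_classifier())`, where `ponder_text` is the content of the <ponder> block.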