openfarm-dog-pain-triage

Environment ID: openfarm-dog-pain-triage
Short description: A canine clinical text triage environment evaluating whether models can accurately deduce OpenFARM/AWW affective pain states from leakage-safe tabular veterinary metadata.
Tags: animal-welfare, veterinary, dog, pain, clinical-text, openfarm, llm-judge, single-turn, eval, train

Primary dataset(s): oliveirabruno01/openfarm-dog-pain
Source links:
- Hugging Face Datasets
- Dog Pain Database: A Multidimensional Dataset for Investigating Canine Pain, Zenodo record 15303646, DOI 10.5281/zenodo.15303646, CC-BY-4.0.
Split sizes:
- train (68 rows) / test (18 rows): Strictly class-balanced, dog-heldout splits to prevent majority-class memorization during RL training.
- train_raw (441 rows) / test_raw (116 rows): The natural, highly imbalanced dog-heldout distribution.

Note: Use test for baseline reasoning checks. Use test_raw with Macro-F1 scoring to evaluate real-world prevalence handling.

Type: single-turn
Target labels: Negative_Nociceptive vs Neutral_Resting.
Output format expectations: XML tags. Depending on require_explanation, models must output just <answer>...</answer> or both <explanation>...</explanation> and <answer>...</answer>.
Rubric overview:
- Exact-match classification accuracy (0.0 or 1.0).
- XML format adherence.
- Optional AdaptiveJudgeRubric to score the medical soundness of the model's <explanation>.

Run an evaluation with default settings (uses the balanced test split):

prime eval run openfarm-dog-pain-triage

Evaluate on the natural, imbalanced raw distribution to check for mode collapse:

prime eval run openfarm-dog-pain-triage \
  -m openai/gpt-4.1-mini \
  -n 100 -r 1 \
  -a '{"test_split": "test_raw", "require_explanation": true}'

Notes:

Use -a / --env-args to pass environment-specific configuration as a JSON object.
The public dataset rows may include rich clinical/source metadata for audits and ablations. The env prompt intentionally uses only a selected leakage-safe subset by default.

Arg	Type	Default	Description
`dataset_id`	str	`"oliveirabruno01/openfarm-dog-pain"`	HuggingFace dataset ID
`dataset_revision`	Optional[str]	`None`	HuggingFace dataset revision
`train_split`	str	`"train"`	Split used for training dataset (`train` or `train_raw`)
`test_split`	str	`"test"`	Split used for evaluation dataset (`test` or `test_raw`)
`max_examples`	int	`-1`	Limit on dataset size (use -1 for all)
`seed`	int	`42`	Random seed for dataset shuffling
`include_extended_surgery_fields`	bool	`False`	Include `time_since_surgery` and `recovered_anesthesia` if present. Disabled by default because missingness is label-correlated.
`require_explanation`	bool	`True`	If True, forces `<explanation>` tags for Chain-of-Thought reasoning.
`accuracy_reward_weight`	float	`1.0`	Weight for the exact-match category classification reward.
`judge_reward_weight`	float	`1.0`	Weight for the LLM-as-a-judge medical reasoning reward.
`format_reward_weight`	float	`0.0`	Weight for XML format adherence.
`judge_mode`	str	`"none"`	Judge strategy: `"external"`, `"self"`, or `"none"`.
`judge_model`	str	`"gpt-4o-mini"`	Model used for the LLM judge.
`judge_api_key_var`	str	`"PRIME_API_KEY"`	Environment variable containing the judge API key.
`judge_base_url`	str	`"https://api.pinference.ai/api/v1"`	Base URL for the judge model provider.
`system_prompt_override`	str	`None`	Optional custom system prompt string.
`user_prompt_override`	str	`None`	Optional custom user prompt template string.

Metric	Meaning
`reward`	Main scalar reward (weighted sum of accuracy, judge, and format rewards)
`accuracy_reward`	(0.0 or 1.0) Exact match on the correct Target Category
`format_reward`	(0.0 or 1.0) Adherence to the requested XML tag format
`hybrid_judge_reward`	(0.0 to 1.0) LLM judge score evaluating the biological soundness of the reasoning