openfarm-zoo-arousal-eval
Overview
OpenFARM Zoo Arousal is a tiny visual affect eval built from expert-labeled zoo-animal video stimuli. It tests whether multimodal models can classify expert-coded arousal and/or valence from visible behavior, posture, movement, and facial/body cues.
- Environment ID: `openfarm-zoo-arousal-eval`
- Type: single-turn classification / EnvGroup when multiple tasks are selected
- Default modality: `filmstrip` image
- Other modalities: `video`, `frames`, `text`
- Output format: XML answer, with optional explanation
- Primary metric: exact normalized answer reward
The headline benchmark is visual-only. The prepared dataset uses muted clips because the source study frames the task as visual recognition from mute video clips. Source audio was audited during prep and should not be treated as the scientific signal for this env.
Dataset
- Primary dataset: `oliveirabruno01/openfarm-zoo-valence-arousal`
- Source: Figshare collection `10.6084/m9.figshare.c.7807931`
- Article: Hiisivuori et al. (2025), Human recognition of emotional valence and arousal of zoo animals, `10.1038/s41598-025-28646-7`
- Split sizes: 15 `test` examples
- Species: Barbary macaque, Siberian tiger, Turkmenian markhor
Tasks
| Task | Label |
|---|---|
| `arousal` | low / high |
| `valence` | negative / neutral / positive |
| `valence_arousal` | negative_high / neutral_low / positive_low / positive_high |
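The label sets in the table can be written down as a small lookup, which is handy for validating predictions before scoring. The sketch below is illustrative only: `TASK_LABELS` and `split_joint_label` are hypothetical names, not part of the environment's API, and splitting a joint label assumes the `<valence>_<arousal>` naming shown in the table. Note that only four of the six possible valence/arousal combinations are listed.

```python
# Hypothetical lookup of the label sets listed in the Tasks table.
TASK_LABELS = {
    "arousal": ("low", "high"),
    "valence": ("negative", "neutral", "positive"),
    "valence_arousal": (
        "negative_high",
        "neutral_low",
        "positive_low",
        "positive_high",
    ),
}


def split_joint_label(label: str) -> tuple[str, str]:
    """Split a valence_arousal label into its (valence, arousal) parts."""
    if label not in TASK_LABELS["valence_arousal"]:
        raise ValueError(f"unknown valence_arousal label: {label!r}")
    valence, arousal = label.split("_", 1)
    return valence, arousal
```

Every joint label produced this way decomposes into one entry from the `valence` set and one from the `arousal` set.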
Quickstart
prime eval run openfarm-zoo-arousal-eval \
-a '{"task": "arousal", "modality": "filmstrip", "max_examples": 5}'
Run valence and arousal as an EnvGroup:
prime eval run openfarm-zoo-arousal-eval \
-a '{"task": ["arousal", "valence"], "modality": "filmstrip"}'
Give the model nine separate sampled image inputs instead of one montage:
prime eval run openfarm-zoo-arousal-eval \
-a '{"task": "valence_arousal", "modality": "frames"}'
Use the muted prepared video directly on endpoints that support video input:
prime eval run openfarm-zoo-arousal-eval \
-a '{"task": "valence_arousal", "modality": "video"}'
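When runs are scripted, hand-writing the JSON passed to `-a` is easy to get wrong. A minimal sketch, assuming only the `prime eval run <env> -a '<json>'` shape used in the commands above, builds the same command line from a Python dict. The helper name `build_eval_command` is hypothetical.

```python
import json
import shlex


def build_eval_command(env_id: str, env_args: dict) -> str:
    """Return a shell-safe `prime eval run` command line for the given args."""
    payload = json.dumps(env_args)
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return f"prime eval run {shlex.quote(env_id)} -a {shlex.quote(payload)}"


command = build_eval_command(
    "openfarm-zoo-arousal-eval",
    {"task": ["arousal", "valence"], "modality": "filmstrip"},
)
print(command)
```

The printed string can be pasted into a shell or passed through `shlex.split` to `subprocess.run`.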
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| `task` | str/list | `"arousal"` | `arousal`, `valence`, `valence_arousal`, a list, or `"all"`. |
| `dataset_id` | str | `"oliveirabruno01/openfarm-zoo-valence-arousal"` | Hugging Face dataset ID. |
| `dataset_revision` | str/null | `null` | Optional dataset revision. |
| `test_split` | str | `"test"` | Eval split. This dataset is eval-only. |
| `max_examples` | int | `-1` | Optional subsampling budget. |
| `seed` | int | `42` | Shuffle seed before subsampling. |
| `modality` | str | `"filmstrip"` | `filmstrip`, `video`, `frames`, or `text`. |
| `include_species_context` | bool | `false` | Adds species as prompt text. Kept off by default for a purer visual task. |
| `require_explanation` | bool | `false` | Requires an `<explanation>` field before `<answer>`. |
| `format_reward_weight` | float | `0.0` | Optional XML format reward weight. |
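The interaction between `seed` and `max_examples` can be pictured as a seeded shuffle followed by truncation. The sketch below is an assumption about that behavior, not the environment's actual code: it treats a negative `max_examples` as "no limit", and the function name `subsample` is hypothetical.

```python
import random


def subsample(rows: list, max_examples: int = -1, seed: int = 42) -> list:
    """Shuffle a copy of rows with a fixed seed, then keep at most max_examples."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the result reproducible and
    # leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    if max_examples < 0:  # assumed meaning of the -1 default: keep everything
        return shuffled
    return shuffled[:max_examples]
```

With the same seed, repeated calls return the same subset, so a 5-example smoke run is comparable across models.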
Dataset Notes
- The dataset is intentionally eval-only; there is no meaningful train split.
- The public rows use opaque media filenames and omit source filenames, clip codes, segment timestamps, expert notes, pre-rendered messages, task ids, and OpenFARM-specific row IDs.
- `video` mode sends the muted prepared MP4 clip from embedded HF `Video` bytes. It is endpoint-dependent and intentionally has no local file fallback.
- `filmstrip` mode sends the 3x3 montage as one image. It is the most portable vision path across current multimodal endpoints.
- `frames` mode splits that same 3x3 filmstrip into nine separate image inputs, ordered left-to-right and top-to-bottom. This is useful for models such as Gemma 4 that can attend to multiple input images.
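The left-to-right, top-to-bottom ordering used when a 3x3 filmstrip is split into frames can be sketched as a crop-box computation. This is an illustration under stated assumptions, not the environment's implementation: it assumes equally sized cells with no padding or borders, and `grid_boxes` is a hypothetical helper. The tuples follow the `(left, upper, right, lower)` convention that Pillow's `Image.crop` accepts.

```python
def grid_boxes(width: int, height: int, rows: int = 3, cols: int = 3) -> list:
    """Return (left, upper, right, lower) crop boxes in row-major order."""
    boxes = []
    for r in range(rows):          # top to bottom
        for c in range(cols):      # left to right within each row
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes
```

For a 900x600 montage this yields nine 300x200 boxes, starting at the top-left cell and ending at the bottom-right one.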
Metrics
| Metric | Meaning |
|---|---|
| `accuracy_reward` | 1.0 when the parsed answer matches the target label after normalization. |
| `format_reward` | Optional XML-format reward when `format_reward_weight` > 0. |
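To make the two metrics concrete, here is a minimal sketch of exact-match scoring over an XML answer. The normalization rule (lowercase, trim, spaces and hyphens to underscores), the choice of the last `<answer>` block, and the additive weighting are all assumptions for illustration; the environment's parser and reward combination may differ.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def normalize(label: str) -> str:
    """Lowercase, trim, and collapse spaces/hyphens into underscores."""
    return re.sub(r"[\s\-]+", "_", label.strip().lower())


def accuracy_reward(completion: str, target: str) -> float:
    """1.0 if the last <answer> block matches the target after normalization."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0
    return 1.0 if normalize(matches[-1]) == normalize(target) else 0.0


def total_reward(completion: str, target: str, format_reward_weight: float = 0.0) -> float:
    """Accuracy plus an optional weighted bonus for a well-formed answer tag."""
    format_ok = 1.0 if ANSWER_RE.search(completion) else 0.0
    return accuracy_reward(completion, target) + format_reward_weight * format_ok
```

Under these assumptions, a completion such as `<answer> Positive-High </answer>` scores 1.0 against the target `positive_high`, while a bare `high` with no tags scores 0.0.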