click-calibrate

Overview

Environment ID: click-calibrate
Short description: Multi-turn visual click calibration for models that emit raw pixel or normalized coordinates through either a direct click tool or a computer tool.
Tags: visual, click, calibration, eval

Datasets

Primary dataset(s): Synthetic PNG screenshots generated at load_environment() time.
Task families:
- aim: a blank screen with one colored circle; click the center of the circle.
- text: a simple rendered page/document; click a named word.
Default size: 80 generated examples.

Task

Type: multi-turn visual tool use.
Expected action: call the configured tool once per turn until the target is hit or the turn limit is reached.
Stop condition: rollout ends on a target hit or after max_turns model turns.
Tool formats:
- click: direct click tool, using click(x_px, y_px) for pixels or click(x, y) for normalized coordinates.
- computer: computer-use style tool, using computer(actions=[{"action": "left_click", "coordinate": [x, y]}]). The scorer also accepts the native top-level form computer(action="left_click", coordinate=[x, y]).
Coordinate modes:
- pixel: raw image pixels.
- norm1000: normalized integer coordinates [0, 1000].
- norm999: normalized integer coordinates [0, 999].
Scoring: the emitted click is converted to pixels and checked against the target circle or text bounding box.

Quickstart

Install the environment:

prime env install click-calibrate

Run a small pixel-coordinate eval:

prime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{"coordinate_mode": "pixel"}'

Run normalized-coordinate variants:

prime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{"coordinate_mode": "norm1000"}'
prime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{"coordinate_mode": "norm999"}'

Run the same coordinate modes through the computer tool:

prime eval run click-calibrate -m anthropic/claude-haiku-4.5 -n 5 -r 1 -a '{"tool_format": "computer", "coordinate_mode": "pixel"}'
prime eval run click-calibrate -m anthropic/claude-haiku-4.5 -n 5 -r 1 -a '{"tool_format": "computer", "coordinate_mode": "norm1000"}'

Visualize Saved Runs

Generate a static HTML index for all saved runs under outputs/evals:

uv run python environments/click_calibrate/view_eval.py

Generate a viewer for one saved eval run directory or results.jsonl file:

uv run python environments/click_calibrate/view_eval.py \
  environments/click_calibrate/outputs/evals/<run>/results.jsonl

The viewer overlays the ground-truth target in green and the model's attempted click in red. Use the on-page arrows or keyboard left/right arrows to move between samples.

Environment Arguments

Arg	Type	Default	Description
`coordinate_mode`	str	`"pixel"`	One of `pixel`, `norm1000`, or `norm999`.
`tool_format`	str	`"click"`	One of `click` or `computer`.
`dataset_size`	int	`80`	Number of synthetic examples to generate.
`task_mix`	str	`"aim,text"`	Comma-separated task families.
`seed`	int	`0`	Random seed for deterministic generation.
`width`	int or null	`null`	Fixed image width. Must be paired with `height`.
`height`	int or null	`null`	Fixed image height. Must be paired with `width`.
`tool_name`	str or null	`null`	Optional override for the exposed tool name. Defaults to `click` for `tool_format="click"` and `computer` for `tool_format="computer"`.
`text_fallback`	bool	`false`	Include JSON fallback instructions for non-tool-call endpoints. Leave disabled for click-tool calibration.
`sweep_tag`	str or null	`null`	Optional label saved in run metadata and sample info for distinguishing eval sweeps.
`max_turns`	int	`10`	Maximum attempts per rollout.

Metrics

Metric	Meaning
`click_hit`	Main reward: 1 if any attempted click lands on the target, else 0.
`first_turn_accuracy`	1 if the first model turn hits the target.
`last_turn_accuracy`	1 if the final model turn hits the target.
`click_extracted`	1 if any click could be parsed from tool calls or text fallback.
`coordinate_valid`	1 if the final emitted click has coordinates in the declared range.
`tool_call_used`	1 if the model used the configured tool instead of text fallback.
`distance_px`	Zero-weight metric: pixel distance from the final attempted click to the target center. Lower is better.
`target_click_distance_px`	Zero-weight alias for `distance_px` with a more explicit name.

v0.1.0 Results

Corrected sweep over 5 models, 2 coordinate modes, and 5 image sizes with -n 20 -r 5 -s.

Aggregate results:

Model	Mode	Reward	Distance px	Click extracted	Tool call used	Truncation	Error
`openai/gpt-5.5`	`pixel`	0.910	10.8	1.000	1.000	0.000	0.000
`openai/gpt-5.5`	`norm1000`	0.908	10.9	1.000	1.000	0.000	0.000
`anthropic/claude-opus-4.7`	`pixel`	0.900	11.7	1.000	1.000	0.000	0.000
`anthropic/claude-opus-4.7`	`norm1000`	0.898	13.7	1.000	1.000	0.000	0.000
`anthropic/claude-haiku-4.5`	`pixel`	0.826	21.1	1.000	1.000	0.000	0.000
`anthropic/claude-haiku-4.5`	`norm1000`	0.712	27.8	1.000	1.000	0.000	0.000
`qwen/qwen3.6-35b-a3b`	`pixel`	0.174	234.2	0.966	0.972	0.008	0.004
`qwen/qwen3.6-35b-a3b`	`norm1000`	0.588	186.4	0.940	0.940	0.008	0.024
`moonshotai/kimi-k2.6`	`pixel`	0.790	39.2	0.988	0.834	0.000	0.000
`moonshotai/kimi-k2.6`	`norm1000`	0.746	57.2	0.994	0.842	0.000	0.000

Reward by image size:

Model	Mode	1024x768	1440x900	1280x720	1000x1000	800x600
`openai/gpt-5.5`	`pixel`	0.85	0.85	1.00	0.85	1.00
`openai/gpt-5.5`	`norm1000`	0.84	0.85	1.00	0.85	1.00
`anthropic/claude-opus-4.7`	`pixel`	0.85	0.85	0.95	0.85	1.00
`anthropic/claude-opus-4.7`	`norm1000`	0.85	0.85	0.94	0.85	1.00
`anthropic/claude-haiku-4.5`	`pixel`	0.85	0.45	0.99	0.84	1.00
`anthropic/claude-haiku-4.5`	`norm1000`	0.72	0.22	0.83	0.85	0.94
`qwen/qwen3.6-35b-a3b`	`pixel`	0.09	0.01	0.04	0.71	0.02
`qwen/qwen3.6-35b-a3b`	`norm1000`	0.60	0.48	0.51	0.69	0.66
`moonshotai/kimi-k2.6`	`pixel`	0.77	0.56	0.94	0.77	0.91
`moonshotai/kimi-k2.6`	`norm1000`	0.65	0.68	0.88	0.78	0.74