click-calibrate
Overview
- Environment ID: `click-calibrate`
- Short description: Multi-turn visual click calibration for models that emit raw pixel or normalized coordinates through either a direct `click` tool or a `computer` tool.
- Tags: visual, click, calibration, eval
Datasets
- Primary dataset(s): Synthetic PNG screenshots generated at `load_environment()` time.
- Task families:
  - `aim`: a blank screen with one colored circle; click the center of the circle.
  - `text`: a simple rendered page/document; click a named word.
- Default size: 80 generated examples.
Task
- Type: multi-turn visual tool use.
- Expected action: call the configured tool once per turn until the target is hit or the turn limit is reached.
- Stop condition: rollout ends on a target hit or after `max_turns` model turns.
- Tool formats:
  - `click`: direct click tool, using `click(x_px, y_px)` for pixels or `click(x, y)` for normalized coordinates.
  - `computer`: computer-use style tool, using `computer(actions=[{"action": "left_click", "coordinate": [x, y]}])`. The scorer also accepts the native top-level form `computer(action="left_click", coordinate=[x, y])`.
- Coordinate modes:
  - `pixel`: raw image pixels.
  - `norm1000`: normalized integer coordinates in `[0, 1000]`.
  - `norm999`: normalized integer coordinates in `[0, 999]`.
- Scoring: the emitted click is converted to pixels and checked against the target circle or text bounding box.
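The parse, convert, and check steps described above can be sketched in Python. This is an illustration, not the environment's scorer: the function names are hypothetical, and the normalized-to-pixel mapping (divide by 1000 or 999, then multiply by the image dimension) is an assumption, since the exact formula is not stated here.

```python
from math import hypot

# Assumed divisors per mode; the environment's real mapping may differ.
SCALE = {"norm1000": 1000, "norm999": 999}


def extract_click(tool_format, args):
    """Return (x, y) from one tool call's arguments, or None if no click is found."""
    if tool_format == "click":
        # click(x_px, y_px) for pixels, click(x, y) for normalized coordinates.
        if "x_px" in args and "y_px" in args:
            return args["x_px"], args["y_px"]
        if "x" in args and "y" in args:
            return args["x"], args["y"]
        return None
    # computer(actions=[{...}]) or the top-level computer(action=..., coordinate=...).
    actions = args.get("actions") or [args]
    for action in actions:
        if action.get("action") == "left_click" and "coordinate" in action:
            x, y = action["coordinate"]
            return x, y
    return None


def to_pixels(x, y, mode, width, height):
    """Convert an emitted click to pixel coordinates (assumed linear scaling)."""
    if mode == "pixel":
        return float(x), float(y)
    scale = SCALE[mode]
    return x / scale * width, y / scale * height


def hits_circle(px, py, cx, cy, radius):
    """True if the pixel click lies inside the target circle (aim tasks)."""
    return hypot(px - cx, py - cy) <= radius


def hits_box(px, py, left, top, right, bottom):
    """True if the pixel click lies inside the target bounding box (text tasks)."""
    return left <= px <= right and top <= py <= bottom


# Example: a top-level computer call in norm1000 mode on a 1280x720 image.
click = extract_click("computer", {"action": "left_click", "coordinate": [500, 500]})
px, py = to_pixels(*click, "norm1000", 1280, 720)
print(px, py)  # 640.0 360.0
```

Whether boundary pixels count as hits, and whether `norm999` scales by `width - 1` instead of `width`, are details to confirm against the environment source.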
Quickstart
Install the environment:
prime env install click-calibrate
Run a small pixel-coordinate eval:
prime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{"coordinate_mode": "pixel"}'
Run normalized-coordinate variants:
prime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{"coordinate_mode": "norm1000"}'
prime eval run click-calibrate -m openai/gpt-4.1-mini -n 5 -r 1 -a '{"coordinate_mode": "norm999"}'
Run the same coordinate modes through the computer tool:
prime eval run click-calibrate -m anthropic/claude-haiku-4.5 -n 5 -r 1 -a '{"tool_format": "computer", "coordinate_mode": "pixel"}'
prime eval run click-calibrate -m anthropic/claude-haiku-4.5 -n 5 -r 1 -a '{"tool_format": "computer", "coordinate_mode": "norm1000"}'
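To run several variants in one go, the commands above can be generated from Python. The sketch below only builds and prints the command lines using the flags shown in this section (`-m`, `-n`, `-r`, `-a`); the helper name `eval_command` is hypothetical, and it assumes every tool format can be combined with every coordinate mode.

```python
import json
import shlex
from itertools import product


def eval_command(model, tool_format, coordinate_mode, n=5, r=1):
    """Build one `prime eval run` invocation as an argument list."""
    env_args = {"tool_format": tool_format, "coordinate_mode": coordinate_mode}
    return [
        "prime", "eval", "run", "click-calibrate",
        "-m", model,
        "-n", str(n),
        "-r", str(r),
        "-a", json.dumps(env_args),
    ]


# One command per (tool format, coordinate mode) pair.
commands = [
    eval_command("openai/gpt-4.1-mini", tool_format, mode)
    for tool_format, mode in product(
        ["click", "computer"], ["pixel", "norm1000", "norm999"]
    )
]

for command in commands:
    # shlex.join quotes the JSON payload so the line can be pasted into a shell.
    print(shlex.join(command))
```

Each list can also be passed directly to `subprocess.run(command, check=True)` to execute the evals in sequence.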
Visualize Saved Runs
Generate a static HTML index for all saved runs under outputs/evals:
uv run python environments/click_calibrate/view_eval.py
Generate a viewer for one saved eval run directory or results.jsonl file:
uv run python environments/click_calibrate/view_eval.py \
environments/click_calibrate/outputs/evals/<run>/results.jsonl
The viewer overlays the ground-truth target in green and the model's attempted click in red. Use the on-page arrows or keyboard left/right arrows to move between samples.
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| `coordinate_mode` | str | `"pixel"` | One of `pixel`, `norm1000`, or `norm999`. |
| `tool_format` | str | `"click"` | One of `click` or `computer`. |
| `dataset_size` | int | `80` | Number of synthetic examples to generate. |
| `task_mix` | str | `"aim,text"` | Comma-separated task families. |
| `seed` | int | `0` | Random seed for deterministic generation. |
| `width` | int or null | `null` | Fixed image width. Must be paired with `height`. |
| `height` | int or null | `null` | Fixed image height. Must be paired with `width`. |
| `tool_name` | str or null | `null` | Optional override for the exposed tool name. Defaults to `click` for `tool_format="click"` and `computer` for `tool_format="computer"`. |
| `text_fallback` | bool | `false` | Include JSON fallback instructions for non-tool-call endpoints. Leave disabled for click-tool calibration. |
| `sweep_tag` | str or null | `null` | Optional label saved in run metadata and sample info for distinguishing eval sweeps. |
| `max_turns` | int | `10` | Maximum attempts per rollout. |
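A small local pre-check can catch argument mistakes before a run starts. The sketch below mirrors the defaults and constraints from the table and produces the JSON string for `-a`; the helper `build_env_args` is hypothetical, and the environment may validate differently or more strictly.

```python
import json

# Defaults copied from the table above (null becomes None, false becomes False).
DEFAULTS = {
    "coordinate_mode": "pixel",
    "tool_format": "click",
    "dataset_size": 80,
    "task_mix": "aim,text",
    "seed": 0,
    "width": None,
    "height": None,
    "tool_name": None,
    "text_fallback": False,
    "sweep_tag": None,
    "max_turns": 10,
}


def build_env_args(**overrides):
    """Validate overrides against the documented constraints; return JSON for -a."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    args = {**DEFAULTS, **overrides}
    if args["coordinate_mode"] not in {"pixel", "norm1000", "norm999"}:
        raise ValueError("coordinate_mode must be pixel, norm1000, or norm999")
    if args["tool_format"] not in {"click", "computer"}:
        raise ValueError("tool_format must be click or computer")
    # width and height must be given together or not at all.
    if (args["width"] is None) != (args["height"] is None):
        raise ValueError("width and height must be set together")
    families = {name.strip() for name in args["task_mix"].split(",")}
    if not families <= {"aim", "text"}:
        raise ValueError("task_mix may only contain aim and text")
    # Only the overrides are sent; everything else falls back to the defaults.
    return json.dumps(overrides, sort_keys=True)


print(build_env_args(coordinate_mode="norm1000", width=1280, height=720))
# {"coordinate_mode": "norm1000", "height": 720, "width": 1280}
```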
Metrics
| Metric | Meaning |
|---|---|
| `click_hit` | Main reward: 1 if any attempted click lands on the target, else 0. |
| `first_turn_accuracy` | 1 if the first model turn hits the target. |
| `last_turn_accuracy` | 1 if the final model turn hits the target. |
| `click_extracted` | 1 if any click could be parsed from tool calls or text fallback. |
| `coordinate_valid` | 1 if the final emitted click has coordinates in the declared range. |
| `tool_call_used` | 1 if the model used the configured tool instead of text fallback. |
| `distance_px` | Zero-weight metric: pixel distance from the final attempted click to the target center. Lower is better. |
| `target_click_distance_px` | Zero-weight alias for `distance_px` with a more explicit name. |
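As an illustration of how the hit and distance metrics relate to a rollout, the sketch below derives them from a list of per-turn attempts. The `attempts` structure (one dict per model turn with pixel `x`, `y`, and a precomputed `hit` flag) and the function name are assumptions made for this example, not the environment's internal format.

```python
from math import hypot


def rollout_metrics(attempts, target_center):
    """Compute hit and distance metrics for one rollout.

    attempts: list of {"x": float, "y": float, "hit": bool}, one per model turn
              (hypothetical structure, already converted to pixels).
    target_center: (cx, cy) of the target in pixels.
    """
    if not attempts:
        # No click was ever emitted, so nothing can be scored.
        return {
            "click_hit": 0.0,
            "first_turn_accuracy": 0.0,
            "last_turn_accuracy": 0.0,
            "distance_px": None,
        }
    cx, cy = target_center
    last = attempts[-1]
    return {
        "click_hit": float(any(a["hit"] for a in attempts)),
        "first_turn_accuracy": float(attempts[0]["hit"]),
        "last_turn_accuracy": float(last["hit"]),
        # Distance is measured from the final attempted click only.
        "distance_px": hypot(last["x"] - cx, last["y"] - cy),
    }


# Example: a miss on turn one, then a hit 5 px from the target center.
attempts = [
    {"x": 100, "y": 100, "hit": False},
    {"x": 203, "y": 154, "hit": True},
]
print(rollout_metrics(attempts, (200, 150)))
# {'click_hit': 1.0, 'first_turn_accuracy': 0.0, 'last_turn_accuracy': 1.0, 'distance_px': 5.0}
```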
v0.1.0 Results
Corrected sweep over 5 models, 2 coordinate modes, and 5 image sizes with `-n 20 -r 5 -s`.
Aggregate results:
| Model | Mode | Reward | Distance px | Click extracted | Tool call used | Truncation | Error |
|---|---|---|---|---|---|---|---|
| openai/gpt-5.5 | pixel | 0.910 | 10.8 | 1.000 | 1.000 | 0.000 | 0.000 |
| openai/gpt-5.5 | norm1000 | 0.908 | 10.9 | 1.000 | 1.000 | 0.000 | 0.000 |
| anthropic/claude-opus-4.7 | pixel | 0.900 | 11.7 | 1.000 | 1.000 | 0.000 | 0.000 |
| anthropic/claude-opus-4.7 | norm1000 | 0.898 | 13.7 | 1.000 | 1.000 | 0.000 | 0.000 |
| anthropic/claude-haiku-4.5 | pixel | 0.826 | 21.1 | 1.000 | 1.000 | 0.000 | 0.000 |
| anthropic/claude-haiku-4.5 | norm1000 | 0.712 | 27.8 | 1.000 | 1.000 | 0.000 | 0.000 |
| qwen/qwen3.6-35b-a3b | pixel | 0.174 | 234.2 | 0.966 | 0.972 | 0.008 | 0.004 |
| qwen/qwen3.6-35b-a3b | norm1000 | 0.588 | 186.4 | 0.940 | 0.940 | 0.008 | 0.024 |
| moonshotai/kimi-k2.6 | pixel | 0.790 | 39.2 | 0.988 | 0.834 | 0.000 | 0.000 |
| moonshotai/kimi-k2.6 | norm1000 | 0.746 | 57.2 | 0.994 | 0.842 | 0.000 | 0.000 |
Reward by image size:
| Model | Mode | 1024x768 | 1440x900 | 1280x720 | 1000x1000 | 800x600 |
|---|---|---|---|---|---|---|
| openai/gpt-5.5 | pixel | 0.85 | 0.85 | 1.00 | 0.85 | 1.00 |
| openai/gpt-5.5 | norm1000 | 0.84 | 0.85 | 1.00 | 0.85 | 1.00 |
| anthropic/claude-opus-4.7 | pixel | 0.85 | 0.85 | 0.95 | 0.85 | 1.00 |
| anthropic/claude-opus-4.7 | norm1000 | 0.85 | 0.85 | 0.94 | 0.85 | 1.00 |
| anthropic/claude-haiku-4.5 | pixel | 0.85 | 0.45 | 0.99 | 0.84 | 1.00 |
| anthropic/claude-haiku-4.5 | norm1000 | 0.72 | 0.22 | 0.83 | 0.85 | 0.94 |
| qwen/qwen3.6-35b-a3b | pixel | 0.09 | 0.01 | 0.04 | 0.71 | 0.02 |
| qwen/qwen3.6-35b-a3b | norm1000 | 0.60 | 0.48 | 0.51 | 0.69 | 0.66 |
| moonshotai/kimi-k2.6 | pixel | 0.77 | 0.56 | 0.94 | 0.77 | 0.91 |
| moonshotai/kimi-k2.6 | norm1000 | 0.65 | 0.68 | 0.88 | 0.78 | 0.74 |