shell-agent-bench

Virtual terminal debugging environment for Laguna XS.2 agentic RL.

Overview

Environment ID: shell-agent-bench
Task type: multi-turn tool use in a virtual repository
Goal: improve terminal-style debugging, file inspection, minimal editing, and test-driven completion

Tooling

The model receives provider-neutral tool definitions:

run(command): a safe virtual shell restricted to ls, find, cat, sed -n, grep -R, and pytest
edit_file(path, old, new): patches a file by exact replacement of old with new
write_file(path, content): whole-file overwrite, as a fallback
finish(summary): signals final task completion
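
To illustrate what exact replacement patching could look like, here is a minimal sketch of an edit_file handler. It assumes the virtual repository is a plain dict mapping paths to file contents, that an edit is rejected unless old matches exactly once, and that results are returned as short strings. None of these details are confirmed by the environment; they are illustrative assumptions only.

```python
def edit_file(files: dict[str, str], path: str, old: str, new: str) -> str:
    """Hypothetical handler: replace one exact occurrence of `old` with `new`.

    `files` stands in for the virtual repository (assumption: path -> content).
    """
    if path not in files:
        return f"error: no such file: {path}"
    if not old:
        return "error: old text must not be empty"

    content = files[path]
    matches = content.count(old)
    if matches == 0:
        return "error: old text not found"
    if matches > 1:
        # Assumed policy: ambiguous edits are rejected rather than guessed.
        return f"error: old text matches {matches} times"

    files[path] = content.replace(old, new, 1)
    return "ok"
```

Rejecting ambiguous matches is one way to keep edits minimal and predictable; a real implementation might instead replace the first match or all matches.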

Reward

The optimization reward is binary: it depends only on whether the hidden virtual tests pass. Additional metrics (partial check fraction, test use, edits, finish calls, and tool errors) are logged for analysis and do not affect the reward.
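
The separation between reward and logged metrics can be sketched as follows. The EpisodeStats fields, the metric key names, and the 1.0/0.0 reward values are hypothetical choices made for illustration; the environment's actual names and values may differ.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    """Hypothetical per-episode record; field names are assumptions."""
    hidden_tests_passed: bool
    checks_passed: int
    checks_total: int
    test_runs: int
    edits: int
    finish_calls: int
    tool_errors: int


def reward(stats: EpisodeStats) -> float:
    # Binary reward: only hidden test success matters (1.0/0.0 is assumed).
    return 1.0 if stats.hidden_tests_passed else 0.0


def metrics(stats: EpisodeStats) -> dict[str, float]:
    # Logged for analysis only; none of these feed into reward().
    fraction = stats.checks_passed / stats.checks_total if stats.checks_total else 0.0
    return {
        "partial_check_fraction": fraction,
        "test_runs": float(stats.test_runs),
        "edits": float(stats.edits),
        "finish_calls": float(stats.finish_calls),
        "tool_errors": float(stats.tool_errors),
    }
```

Under this sketch, an episode that passes three of four partial checks but fails the hidden tests still receives a reward of 0.0, while its partial check fraction of 0.75 remains visible in the logs.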

Quickstart

prime eval run shell-agent-bench -m poolside/laguna-xs.2 -n 4 -r 2 -t 512 -T 0.7

Environment arguments

| Arg | Default | Meaning |
| --- | --- | --- |
| `split` | `train` | Training split loaded from `tasks.jsonl` |
| `eval_split` | `eval` | Evaluation split loaded from `tasks.jsonl` |
| `max_examples` | `-1` | Limit training examples |
| `max_eval_examples` | `-1` | Limit eval examples |
| `max_turns` | `8` | Maximum assistant turns |
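
As a rough illustration of how the split and limit arguments might interact, the sketch below filters JSON Lines records by split and applies an optional cap. It assumes each record in `tasks.jsonl` carries a `split` field and that a negative limit such as `-1` means no limit; both are assumptions, and `load_split` is a hypothetical helper rather than part of the environment.

```python
import json
from typing import Iterable


def load_split(lines: Iterable[str], split: str, max_examples: int = -1) -> list[dict]:
    """Hypothetical loader: select tasks for one split from JSON Lines input."""
    tasks = [json.loads(line) for line in lines if line.strip()]
    # Assumption: every record has a "split" field naming its split.
    selected = [task for task in tasks if task.get("split") == split]
    # Assumption: a negative limit (the default -1) keeps every example.
    if max_examples < 0:
        return selected
    return selected[:max_examples]
```

With a file on disk, this could be called as `load_split(open("tasks.jsonl"), "train")` for training tasks or `load_split(open("tasks.jsonl"), "eval", 50)` for a capped evaluation set.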