shell-agent-bench
Virtual terminal debugging environment for Laguna XS.2 agentic RL.
Overview
- Environment ID: `shell-agent-bench`
- Task type: multi-turn tool use in a virtual repository
- Goal: improve terminal-style debugging, file inspection, minimal editing, and test-driven completion
Tooling
The model receives provider-neutral tool definitions:
- `run(command)`: a safe virtual shell subset for `ls`, `find`, `cat`, `sed -n`, `grep -R`, and `pytest`
- `edit_file(path, old, new)`: exact replacement patching
- `write_file(path, content)`: whole-file overwrite fallback
- `finish(summary)`: final task completion signal
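To illustrate what exact replacement patching can look like, here is a minimal sketch of an `edit_file`-style helper over an in-memory virtual repository. The `files` dictionary, the returned status strings, and the rule that `old` must match exactly once are assumptions made for this example, not the environment's actual implementation.

```python
def edit_file(files: dict[str, str], path: str, old: str, new: str) -> str:
    """Replace one exact occurrence of `old` with `new` in a virtual file.

    `files` maps virtual paths to file contents (hypothetical representation).
    """
    if path not in files:
        return f"error: {path} not found"
    if not old:
        return "error: old text must be non-empty"
    count = files[path].count(old)
    if count != 1:
        # Assumption: missing or ambiguous matches are rejected rather than guessed.
        return f"error: expected exactly 1 match, found {count}"
    files[path] = files[path].replace(old, new, 1)
    return f"ok: edited {path}"


repo = {"calc.py": "def add(a, b):\n    return a - b\n"}
print(edit_file(repo, "calc.py", "return a - b", "return a + b"))
print(repo["calc.py"])
```

Rejecting ambiguous matches keeps edits minimal and predictable; a whole-file overwrite such as `write_file` then remains the fallback when an exact match is impractical.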
Reward
The optimization reward is binary: it depends only on whether the hidden virtual tests pass. Additional metrics (partial check fraction, test use, edits, finish calls, and tool errors) are logged for analysis and do not change the reward.
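The separation between the reward and the logged metrics can be sketched as follows. The `EpisodeStats` fields, the `score` function, the metric names, and the reading of "success" as "all hidden checks pass" are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    # Hypothetical per-episode counters; names are illustrative only.
    checks_passed: int
    checks_total: int
    pytest_calls: int
    edits: int
    finish_calls: int
    tool_errors: int


def score(stats: EpisodeStats) -> tuple[float, dict[str, float]]:
    """Return (binary reward, analysis-only metrics)."""
    # Assumption: success means every hidden check passes.
    passed_all = stats.checks_total > 0 and stats.checks_passed == stats.checks_total
    reward = 1.0 if passed_all else 0.0
    metrics = {
        "partial_check_fraction": stats.checks_passed / max(stats.checks_total, 1),
        "test_use": float(stats.pytest_calls > 0),
        "edits": float(stats.edits),
        "finish_calls": float(stats.finish_calls),
        "tool_errors": float(stats.tool_errors),
    }
    # Metrics are returned alongside the reward but never added to it.
    return reward, metrics
```

Under these assumptions, an episode that passes 3 of 4 hidden checks logs a partial check fraction of 0.75 but still receives a reward of 0.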
Quickstart
prime eval run shell-agent-bench -m poolside/laguna-xs.2 -n 4 -r 2 -t 512 -T 0.7
Environment arguments
| Arg | Default | Meaning |
|---|---|---|
| split | train | Training split loaded from tasks.jsonl |
| eval_split | eval | Evaluation split loaded from tasks.jsonl |
| max_examples | -1 | Limit training examples |
| max_eval_examples | -1 | Limit eval examples |
| max_turns | 8 | Maximum assistant turns |
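
As a rough sketch of how the split and limit arguments might interact, the helper below filters JSONL task records by split and applies an example cap. The `select_tasks` function, the per-record `"split"` field, and the reading of `-1` as "no limit" are assumptions for illustration; the actual loader may differ.

```python
import json
from typing import Iterable


def select_tasks(lines: Iterable[str], split: str, max_examples: int = -1) -> list[dict]:
    """Pick tasks for one split from JSONL lines, optionally capped."""
    tasks = [json.loads(line) for line in lines if line.strip()]
    # Assumption: each record carries a "split" field naming its split.
    chosen = [task for task in tasks if task.get("split") == split]
    # Assumption: a negative cap means "use every example".
    return chosen if max_examples < 0 else chosen[:max_examples]


sample = [
    '{"id": "t1", "split": "train"}',
    '{"id": "t2", "split": "eval"}',
    '{"id": "t3", "split": "train"}',
]
print(select_tasks(sample, "train", max_examples=1))
```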