A small transformer encoder is trained to map Terminal-Wrench (TW) trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC 0.9467 and TPR@5%FPR 0.8296. Under the same information condition, this roughly matches the AUC of the TW sanitized LLM-as-judge (0.9510 on the cleaned split) and exceeds its TPR@5%FPR (0.8296 vs. 0.7130), at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to 0.6213.
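The two headline metrics can be made concrete with a minimal sketch. It assumes the probe emits one real-valued score per trajectory (higher means more likely reward hacking) and that labels are 1 for hacked and 0 for clean. The function names `roc_auc` and `tpr_at_fpr` and the toy scores are illustrative assumptions, not taken from the paper or its code.

```python
from typing import List, Sequence, Tuple


def _split(scores: Sequence[float], labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Separate scores into positives (label 1) and negatives (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (O(n^2), fine for a sketch)."""
    pos, neg = _split(scores, labels)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int], max_fpr: float = 0.05) -> float:
    """Highest true-positive rate over thresholds whose false-positive
    rate does not exceed max_fpr. A trajectory is flagged if score >= t."""
    pos, neg = _split(scores, labels)
    best = 0.0  # a threshold above every score flags nothing: TPR 0, FPR 0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Hypothetical probe scores and labels, for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(roc_auc(toy_scores, toy_labels))           # 0.8125
print(tpr_at_fpr(toy_scores, toy_labels, 0.05))  # 0.5
```

With only four negatives in the toy data, a 5% false-positive budget allows no false positives at all, so the threshold must sit above the highest-scoring clean trajectory. This is why TPR@5%FPR can differ sharply between two detectors whose AUCs are nearly equal.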
Cheap Reward Hacking Detection
- Year: 2026
- Hosting: full text hosted (CC-BY-4.0)
Attribution
- Abstract & full text: arxiv.org/abs/2606.08893 (CC-BY-4.0)
- TL;DR: Semantic Scholar