Cheap Reward Hacking Detection

A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals.

Year: 2026

Full text: arxiv.org/abs/2606.08893 (CC-BY-4.0)

Abstract

A small transformer encoder is trained to map Terminal-Wrench (TW) trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC 0.9467 and TPR@5%FPR 0.8296. Under the same information condition, this matches the TW sanitized LLM-as-judge AUC (0.9510 on the cleaned split) and exceeds its TPR@5%FPR (0.8296 vs. 0.7130), at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to 0.6213.
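
The two headline metrics can be made concrete with a minimal sketch. It assumes the probe emits one real-valued score per trajectory, with higher meaning "more likely reward hacking", and that labels are 1 for hacked and 0 for clean. The function names `roc_auc` and `tpr_at_fpr` and the toy data are hypothetical illustrations, not the paper's evaluation code or its results.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest true-positive rate over thresholds whose false-positive
    rate stays at or below max_fpr. A trajectory is flagged when its
    score is >= the threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    best = 0.0  # a threshold above every score flags nothing: TPR 0, FPR 0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Toy data (made up for illustration): 3 hacked and 5 clean trajectories.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0]

auc = roc_auc(toy_scores, toy_labels)
tpr_strict = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.05)
tpr_loose = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.2)
```

On the toy data, one clean trajectory outscores one hacked trajectory, so the AUC is 14/15. With only five clean examples, a 5% FPR budget allows zero false positives, which caps the TPR at 2/3. This is why TPR@5%FPR can separate two detectors whose AUCs are nearly equal: it measures only the low-false-alarm end of the ROC curve.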