Cheap Reward Hacking Detection

A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the L_1 distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC 0.9467 and TPR@5%FPR 0.8296, matching the TW sanitized LLM-as-judge AUC (0.9510 on the cleaned split) and exceeding its TPR@5%FPR (0.7130 vs 0.8296) on the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to 0.6213.