Attribution
Unified backdoor-ifeval env: difficulty, aggregation, no-v check, inoculation, group monitors
Reward hacking sprint: does a safety-style prohibition that names the hack's domain suppress or amplify reward hacking under RL?
Backdoor IFEval reward-hacking testbed: deterministic instruction-following (visible reward) plus a hidden hardcoded 'silver' reward, with difficul...
Reward hacking sprint: penalizing a detectable hack doesn't stop hacking — it drives it into undetected forms (monitor evasion).