kyrieblunders

Role: RL env contributor

Attribution

Affiliations & profile: Semantic Scholar

4 tool contribs

Tool contributions

Backdoor Ifeval All

Unified backdoor-ifeval env: difficulty, aggregation, no-v check, inoculation, group monitors

RL Env, Instruction Following, Reward Hacking, Backdoor, Security

Inoculation Backfire

Reward hacking sprint: does a safety-style prohibition that names the hack's domain suppress or amplify reward hacking under RL?

RL Env, Instruction Following, Reward Hacking, Ifeval

Silver Backdoor Ifeval

Backdoor IFEval reward-hacking testbed: deterministic instruction-following (visible reward) plus a hidden hardcoded 'silver' reward, with difficul...

RL Env, Instruction Following, Reward Hacking, Backdoor, Security

Hack Goes Underground

Reward hacking sprint: penalizing a detectable hack doesn't stop hacking; it drives it into undetected forms (monitor evasion).

RL Env, Instruction Following, Reward Hacking, Ifeval

Affiliations

No known affiliations.