Certainty Collapse RL Env (Community)
Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Llama-3.2-...
- Type: RL Env
- Capabilities: Math
- License: unknown
- Size: v0.1.2
- Published: May 2026
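The self-certainty signal named in the description can be illustrated with a small sketch. It assumes one common definition, the mean over generated tokens of KL(Uniform || p), where p is the model's next-token distribution; the listing does not state which formula this environment actually uses. The helper names `self_certainty` and `is_confidently_wrong`, and the `threshold` parameter, are hypothetical and introduced only for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def self_certainty(token_logits):
    """Mean over positions of KL(Uniform || p) (assumed definition).

    token_logits: one list of vocabulary logits per generated token.
    For vocabulary size V, KL(U || p) = -log V - (1/V) * sum_j log p_j.
    It is 0 for a uniform distribution and grows as p becomes peaked.
    """
    total = 0.0
    for logits in token_logits:
        v = len(logits)
        logp = log_softmax(logits)
        total += -math.log(v) - sum(logp) / v
    return total / len(token_logits)

def is_confidently_wrong(certainty, predicted, gold, threshold):
    """Hypothetical check: high certainty paired with an incorrect answer."""
    return certainty >= threshold and predicted != gold

# Toy example with a 4-token vocabulary and made-up logits.
flat = [[0.0, 0.0, 0.0, 0.0]]
peaked = [[10.0, 0.0, 0.0, 0.0]]
print(self_certainty(flat) < self_certainty(peaked))
```

Because this quantity depends only on how peaked the distribution is and never on the reference answer, a policy rewarded on it alone can raise its score without becoming more accurate, which is the failure mode the description asks about.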