Certainty Collapse RL Env (Community)

Reward Hacking Sprint: does optimizing self-certainty (RLIF-style intrinsic reward) cause models to be confidently wrong on math? GSM8K, Llama-3.2-...

Type: RL Env
Capabilities: Math
License: unknown
Size: v0.1.2
Published: May 2026
