0

Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles

Reinforcement learning from human feedback (RLHF) emerges as a promising paradigm for aligning large language models (LLMs). However, a notable challenge in RLHF is overoptimization, where beyond a certain threshold, the pursuit of higher rewards leads to a decline in human…

Year
2023
Hosting
External sourcelicense unknown

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2401.00243v1
TL;DR
Semantic Scholar
Attribution policy →