0

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

DogeRM, a novel framework, integrates domain-specific knowledge to enhance reward models in reinforcement learning from human feedback, improving model alignment across benchmarks.

Year
2024
Venue
arXiv 2024
Authors
4
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2407.01470v2ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the \textbf{Do}main knowled\textbf{ge} merged \textbf{R}eward \textbf{M}odel (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. The experiments demonstrate that DogeRM enhances performance across different benchmarks and provide a detailed analysis showcasing the effects of model merging, showing the great potential of facilitating model alignment.

Authors

4