When reading RLHF papers, I usually separate the pipeline into three parts (preference data, reward modeling, and policy optimization) and run through a short checklist.
Checklist
- What preference data is collected?
- How is the reward model trained and validated? (A pairwise-loss sketch follows this list.)
- Which policy optimization method is used?
- How are safety, helpfulness, and over-optimization measured?
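For the reward-model question, most papers train on pairwise comparisons with a Bradley-Terry style loss that pushes the preferred response's reward above the rejected one's. A minimal sketch; the function name and the toy reward values are mine, not from any particular paper:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: reward the model for scoring
    the preferred response above the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scalar rewards for a batch of 3 pairs
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(r_chosen, r_rejected))
```

Validation then usually checks held-out preference accuracy, which is worth looking for explicitly since a reward model that fits the training pairs but generalizes poorly invites over-optimization.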
Implementation Detail
```python
loss = policy_loss + beta * kl_penalty
```
The coefficient beta controls the trade-off between maximizing reward and staying close to the reference policy: too low and the policy over-optimizes the reward model, too high and it barely moves from the reference. It is often central to the behavior of the final model.
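Concretely, the KL penalty is typically estimated from per-token log-probabilities of the policy and a frozen reference model. A minimal sketch, assuming a simple single-sample KL estimator; the function name, default beta, and toy inputs are my own illustrations rather than any paper's exact recipe:

```python
import torch

def kl_penalized_loss(policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      policy_loss: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Combine the policy objective with a KL penalty that keeps
    the policy close to a frozen reference model."""
    # Simple estimator on policy samples:
    # KL(policy || ref) is approximated by mean(logp_policy - logp_ref)
    kl_penalty = (policy_logprobs - ref_logprobs).mean()
    return policy_loss + beta * kl_penalty

# Toy usage with random log-probabilities for 4 tokens
lp_policy = torch.randn(4)
lp_ref = torch.randn(4)
print(kl_penalized_loss(lp_policy, lp_ref, policy_loss=torch.tensor(0.5)))
```

When a paper reports beta (or an adaptive KL controller targeting a fixed KL budget), that single number often explains more about the final model's behavior than the choice of optimizer.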