When reading RLHF papers, I usually separate the pipeline into three parts (preference data, reward modeling, and policy optimization) and run through a short checklist.
Checklist
- What preference data is collected?
- How is the reward model trained and validated? (A pairwise-loss sketch follows this list.)
- Which policy optimization method is used?
- How are safety, helpfulness, and over-optimization measured?
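For the reward-model question, most papers train on pairwise comparisons with a Bradley-Terry style loss that pushes the preferred response's reward above the rejected one's. A minimal sketch; the function name and the toy reward values are mine, not from any particular paper:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: reward the model for scoring
    the preferred response above the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scalar rewards for a batch of 3 pairs
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(r_chosen, r_rejected))
```

Validation then usually checks held-out preference accuracy, which is worth looking for explicitly since a reward model that fits the training pairs but generalizes poorly invites over-optimization.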
Implementation Detail
```python
loss = policy_loss + beta * kl_penalty
```
The coefficient beta controls the trade-off between maximizing reward and staying close to the reference policy: too low and the policy over-optimizes the reward model, too high and it barely moves from the reference. It is often central to the behavior of the final model.
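Concretely, the KL penalty is typically estimated from per-token log-probabilities of the policy and a frozen reference model. A minimal sketch, assuming a simple single-sample KL estimator; the function name, default beta, and toy inputs are my own illustrations rather than any paper's exact recipe:

```python
import torch

def kl_penalized_loss(policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      policy_loss: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Combine the policy objective with a KL penalty that keeps
    the policy close to a frozen reference model."""
    # Simple estimator on policy samples:
    # KL(policy || ref) is approximated by mean(logp_policy - logp_ref)
    kl_penalty = (policy_logprobs - ref_logprobs).mean()
    return policy_loss + beta * kl_penalty

# Toy usage with random log-probabilities for 4 tokens
lp_policy = torch.randn(4)
lp_ref = torch.randn(4)
print(kl_penalized_loss(lp_policy, lp_ref, policy_loss=torch.tensor(0.5)))
```

When a paper reports beta (or an adaptive KL controller targeting a fixed KL budget), that single number often explains more about the final model's behavior than the choice of optimizer.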