@jemoka / Jemoka Knowledge Base / wiki/concepts/reward_model.md
---
title: "reward model"
type: concept
source: https://www.jemoka.com/posts/kbhreward_model/
confidence: high
status: active
---

Feed both the accepted and the rejected response into your model, and get two scalars out, \(r_{\text{rejected}}\) and \(r_{\text{chosen}}\):

\begin{equation}
\mathcal{L}_{RM} = \log \qty(1 + e^{r_{\text{rejected}}-r_{\text{chosen}}})
\end{equation}

- train only for one epoch
- you should be getting low accuracy scores
- you may need to ensemble, margin loss
- ppo gets the best model
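
The loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the note's own implementation: the function names (`softplus`, `rm_loss`, `batch_loss`, `pairwise_accuracy`) are hypothetical, the rewards are assumed to already be scalars produced by the model, and the `margin` parameter is one assumed reading of "margin loss", namely adding a fixed margin inside the exponent.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rm_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Pairwise reward-model loss: log(1 + e^(r_rejected - r_chosen)).

    `margin` is an assumption, not from the note: with margin > 0 the
    chosen reward must exceed the rejected one by at least that much
    before the loss starts to flatten out.
    """
    return softplus(r_rejected - r_chosen + margin)


def batch_loss(pairs, margin: float = 0.0) -> float:
    """Mean loss over (r_chosen, r_rejected) pairs."""
    return sum(rm_loss(c, r, margin) for c, r in pairs) / len(pairs)


def pairwise_accuracy(pairs) -> float:
    """Fraction of pairs where the chosen response gets the higher reward."""
    return sum(1 for c, r in pairs if c > r) / len(pairs)
```

When the two rewards are equal the loss is \(\log 2\), and it shrinks toward zero as \(r_{\text{chosen}}\) pulls ahead of \(r_{\text{rejected}}\). The accuracy helper is the kind of score the note refers to: how often the model ranks the chosen response above the rejected one.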