RLHF (Reinforcement Learning from Human Feedback)
Reinforcement learning with human feedback
advanced
Principle
- Train an initial model (supervised fine-tuning)
- Collect human feedback (preference comparisons between outputs)
- Train a reward model on those preferences
- Fine-tune the model with RL to maximize the learned reward
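Step 3 (training a reward model on preferences) is usually done with a Bradley-Terry pairwise loss, L = -log sigmoid(r(chosen) - r(rejected)). A minimal sketch with numpy, using a toy linear reward model and synthetic preference data (all names, dimensions, and hyperparameters here are illustrative, not from any real RLHF codebase):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "response" is a feature vector, and the (hidden) true human
# preference is a linear function of the features.
dim = 5
true_w = rng.normal(size=dim)

def features(n):
    return rng.normal(size=(n, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Collect preference pairs: the "human" picks the response with higher true reward.
n_pairs = 2000
a, b = features(n_pairs), features(n_pairs)
prefer_a = (a @ true_w) > (b @ true_w)
chosen = np.where(prefer_a[:, None], a, b)
rejected = np.where(prefer_a[:, None], b, a)

# Fit a linear reward model by gradient descent on the Bradley-Terry loss
#   L = -log sigmoid(r(chosen) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    margin = (chosen - rejected) @ w
    grad = -((1.0 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

# The learned reward model should agree with held-out human preferences.
ta, tb = features(500), features(500)
acc = np.mean(((ta @ w) > (tb @ w)) == ((ta @ true_w) > (tb @ true_w)))
print(f"held-out preference accuracy: {acc:.2f}")
```

In step 4, the policy is then fine-tuned (typically with PPO plus a KL penalty toward the initial model) to maximize this learned reward, which is exactly where the proxy issues below arise.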
Used By
OpenAI (GPT-4, ChatGPT), Anthropic (Claude), Google (Bard)
Advantages
- Works (empirically) for improving surface-level behavior
- Relatively simple to implement
- Scalable (compared to direct supervision)
Critical Limitations
- Goodhart's law: Optimizes the proxy (the reward model), not true human preferences
- Not robust: Easily circumvented by jailbreaks
- Doesn't address inner alignment: A mesa-optimizer can behave well during training while pursuing other goals
- Doesn't address deceptive alignment: A model can produce aligned-looking responses without being aligned
- Scalable oversight problem: The reward model is only as good as the human feedback it is trained on
Conclusion
Useful for commercial products. Insufficient for AGI alignment.
Resources
- Learning to Summarize from Human Feedback (OpenAI)
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT, OpenAI)