RLHF (Reinforcement Learning from Human Feedback)

Principle

  1. Train an initial model with supervised fine-tuning
  2. Collect human feedback as pairwise preferences between responses
  3. Train a reward model on those preferences
  4. Fine-tune the model with RL (typically PPO with a KL penalty against the initial model) to maximize the learned reward
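The steps above can be sketched in code. The following is a toy illustration of step 3 only, under stated assumptions: responses are feature vectors, "human" preferences come from a hidden linear utility, and the reward model is a Bradley-Terry logistic model trained by gradient descent. All names (`true_w`, `train_reward_model`, etc.) are hypothetical, not from any RLHF library.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4

# Hidden "true" human utility the reward model must recover.
true_w = rng.normal(size=DIM)

def make_pairs(n):
    # Each comparison is a pair of candidate responses (feature vectors);
    # the annotator prefers the one with higher true utility.
    a = rng.normal(size=(n, DIM))
    b = rng.normal(size=(n, DIM))
    prefer_a = (a @ true_w) > (b @ true_w)
    return a, b, prefer_a

def train_reward_model(a, b, prefer_a, lr=0.1, steps=500):
    # Bradley-Terry model: P(a preferred over b) = sigmoid(r(a) - r(b)),
    # with a linear reward r(x) = w @ x, trained by logistic-loss descent.
    w = np.zeros(DIM)
    for _ in range(steps):
        diff = (a - b) @ w
        p = 1.0 / (1.0 + np.exp(-diff))
        grad = ((p - prefer_a) @ (a - b)) / len(a)
        w -= lr * grad
    return w

a, b, pref = make_pairs(2000)
w_hat = train_reward_model(a, b, pref)

# Check agreement with held-out preferences.
ta, tb, tpref = make_pairs(500)
acc = np.mean((((ta - tb) @ w_hat) > 0) == tpref)
print(f"reward-model preference accuracy: {acc:.2f}")
```

In production systems the reward model is a neural network scoring full text responses, but the preference-pair loss has this same Bradley-Terry form.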

Used By

OpenAI (GPT-4, ChatGPT), Anthropic (Claude), Google (Bard)

Advantages

  • Works (empirically) for improving surface-level behavior
  • Relatively simple to implement
  • Scalable (compared to direct supervision)

Critical Limitations

  1. Goodhart: Optimizes the proxy (the reward model), not true human preferences
  2. Not robust: Jailbreaks can circumvent the trained behavior
  3. Doesn't address inner alignment: A mesa-optimizer can feign alignment during training
  4. Doesn't address deceptive alignment: The model can simulate good responses while pursuing other goals
  5. Scalable oversight problem: The reward model is only as good as the human judgments it is trained on
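The Goodhart failure can be shown in the same toy linear setting: if the learned reward only approximates the true utility, then optimizing the proxy hard lands on a response whose true utility falls short of what was achievable. The setup below (names `true_w`, `reward_w`, `best_response` are hypothetical) constrains responses to the unit sphere so both objectives have a well-defined optimum.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8

# True human utility vs. an imperfectly learned reward model
# (the proxy = truth + estimation error).
true_w = rng.normal(size=DIM)
reward_w = true_w + 0.8 * rng.normal(size=DIM)

def best_response(w):
    # With responses constrained to the unit sphere, the argmax of a
    # linear reward w @ x is simply the normalized weight vector.
    return w / np.linalg.norm(w)

x_proxy = best_response(reward_w)  # what RL against the reward model finds
x_true = best_response(true_w)     # what humans actually wanted

true_score_proxy = true_w @ x_proxy
true_score_best = true_w @ x_true
print(f"true utility at proxy optimum: {true_score_proxy:.2f}")
print(f"true utility ceiling:          {true_score_best:.2f}")
```

The gap between the two scores is the cost of optimizing the proxy; the KL penalty used in practice limits how far optimization can push into that gap, but does not close it.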

Conclusion

Useful for commercial products. Insufficient on its own for AGI alignment.
