Current State (2024)
State of AI alignment research in 2024
beginner
Current State (2024)
What We've Solved: Almost Nothing
- RLHF: Superficial, easily circumventable
- Constitutional AI: Better than nothing, insufficient
- Interpretability: Progress but doesn't scale
- Formal verification: Theoretical only
What We HAVEN'T Solved (Critical)
- Inner alignment (mesa-optimization)
- Deceptive alignment (detection)
- Corrigibility (possibly impossible)
- Scalable oversight (supervising superintelligence)
- Value specification (defining our true values)
P(doom) Estimates (Researchers)
- Eliezer Yudkowsky: ~99%
- Paul Christiano: ~50-70%
- Nate Soares (MIRI): ~90%+
- Community median: ~60-80%
Resources
- 2023 AI Alignment Research Overview - Alignment Forum
- AI Safety State of the Field Report