Current State (2024)

State of AI alignment research in 2024

What We've Solved: Almost Nothing

  • RLHF: Superficial; the learned reward is only a proxy for rater intent, and trained policies can game it (see the sketch after this list)
  • Constitutional AI: Better than nothing, but insufficient on its own
  • Interpretability: Real progress, but current methods don't scale to frontier models
  • Formal verification: Theoretical only; nothing close to practical for large neural networks
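
To make the RLHF point concrete, here is a minimal toy sketch of the reward-modeling step. Everything in it is illustrative (the linear reward model, the synthetic preference pairs, the name true_w), not any lab's actual pipeline: a proxy reward is fit to preference pairs with the Bradley-Terry loss, and the policy is then optimized against that learned proxy, inheriting every gap between the proxy and what raters actually wanted.

    import numpy as np

    # Minimal toy sketch of RLHF's reward-modeling step (illustrative
    # setup, not any lab's actual pipeline). A reward model is fit to
    # human preference pairs with the Bradley-Terry loss
    #   L = -log sigmoid(r(chosen) - r(rejected))
    # and the policy is later optimized against this learned *proxy*.

    rng = np.random.default_rng(0)
    dim = 4

    # Hypothetical data: responses are feature vectors; "true_w" stands
    # in for whatever the human raters actually reward.
    true_w = rng.normal(size=dim)
    chosen = rng.normal(size=(256, dim))
    rejected = rng.normal(size=(256, dim))
    # Relabel pairs so "chosen" really is preferred under the rater weights.
    swap = chosen @ true_w < rejected @ true_w
    chosen[swap], rejected[swap] = rejected[swap].copy(), chosen[swap].copy()

    w = np.zeros(dim)  # reward-model parameters
    lr = 0.1
    for _ in range(500):
        margin = (chosen - rejected) @ w      # r(chosen) - r(rejected)
        p = 1.0 / (1.0 + np.exp(-margin))     # sigmoid(margin)
        grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
        w -= lr * grad                        # descend -log sigmoid(margin)

    acc = ((chosen - rejected) @ w > 0).mean()
    print(f"preference accuracy of learned proxy reward: {acc:.2%}")
    # High accuracy on preference pairs is not alignment: the policy is
    # optimized against w, and any gap between w and rater intent is
    # exactly where reward hacking lives.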

What We HAVEN'T Solved (Critical)

  • Inner alignment: trained models may acquire internal objectives of their own (mesa-optimization)
  • Deceptive alignment: no reliable way to detect a model that behaves well only while it is being evaluated
  • Corrigibility: getting agents to accept correction and shutdown (possibly impossible in its strong form)
  • Scalable oversight: supervising systems more capable than their supervisors
  • Value specification: pinning down what we actually value (see the Goodhart sketch after this list)
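
The value specification problem is the easiest to demonstrate. The toy sketch below uses purely illustrative assumptions (a concave "true" utility with diminishing returns, and a proxy that measures only one feature) to show Goodhart's law: the proxy-optimal allocation scores far higher on the proxy and far lower on true value than a naive one.

    import numpy as np

    # Toy Goodhart demonstration (all assumptions illustrative): true
    # value has diminishing returns over everything we care about, while
    # the written-down proxy measures only dimension 0. Under strong
    # optimization the two come apart.

    def true_value(alloc):
        return np.sqrt(alloc).sum()   # concave: diminishing returns

    def proxy_value(alloc):
        return alloc[0]               # the spec only measures one feature

    budget, dims = 100.0, 5
    uniform = np.full(dims, budget / dims)   # a sensible default allocation
    goodharted = np.zeros(dims)
    goodharted[0] = budget                   # the proxy-optimal allocation

    for name, alloc in [("uniform", uniform), ("proxy-optimal", goodharted)]:
        print(f"{name:13s}  proxy={proxy_value(alloc):6.1f}  "
              f"true={true_value(alloc):6.2f}")

The toy numbers don't matter; the shape does: any fixed proxy decouples from true value once optimization pressure on it is strong enough.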

P(doom) Estimates (Selected Researchers)

  • Eliezer Yudkowsky: >95%
  • Paul Christiano: ~22% for AI takeover, ~46% for humanity irreversibly losing control of its future (his 2023 public figures)
  • Nate Soares (MIRI): >90%
  • No single "community median": broad surveys of ML researchers put median extinction risk around 5%, while estimates inside the alignment community run far higher and vary widely

All of these are informal, definition-sensitive estimates, not outputs of any shared model.
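
For readers who want to combine such numbers, two standard pooling rules are the median and the geometric mean of odds. The estimates in the snippet below are placeholders for illustration, not survey data.

    import numpy as np

    # Two standard ways to pool divergent probability forecasts: the
    # median, and the geometric mean of odds. The numbers below are
    # placeholders for illustration, not survey results.

    estimates = np.array([0.95, 0.90, 0.22, 0.10, 0.05])

    median = np.median(estimates)
    odds = estimates / (1.0 - estimates)
    pooled_odds = np.exp(np.log(odds).mean())   # geometric mean of odds
    pooled_prob = pooled_odds / (1.0 + pooled_odds)

    print(f"median:                 {median:.2f}")
    print(f"geometric mean of odds: {pooled_prob:.2f}")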
