Current State (2024)

State of AI alignment research in 2024

What We've Solved: Almost Nothing

  • RLHF: Superficial; the learned reward is only a proxy for rater intent, and trained policies can game it (see the sketch after this list)
  • Constitutional AI: Better than nothing, but insufficient on its own
  • Interpretability: Real progress, but current methods don't scale to frontier models
  • Formal verification: Theoretical only; nothing close to practical for large neural networks
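
To make the RLHF point concrete, here is a minimal toy sketch of the reward-modeling step. Everything in it is illustrative (the linear reward model, the synthetic preference pairs, the name true_w), not any lab's actual pipeline: a proxy reward is fit to preference pairs with the Bradley-Terry loss, and the policy is then optimized against that learned proxy, inheriting every gap between the proxy and what raters actually wanted.

    import numpy as np

    # Minimal toy sketch of RLHF's reward-modeling step (illustrative
    # setup, not any lab's actual pipeline). A reward model is fit to
    # human preference pairs with the Bradley-Terry loss
    #   L = -log sigmoid(r(chosen) - r(rejected))
    # and the policy is later optimized against this learned *proxy*.

    rng = np.random.default_rng(0)
    dim = 4

    # Hypothetical data: responses are feature vectors; "true_w" stands
    # in for whatever the human raters actually reward.
    true_w = rng.normal(size=dim)
    chosen = rng.normal(size=(256, dim))
    rejected = rng.normal(size=(256, dim))
    # Relabel pairs so "chosen" really is preferred under the rater weights.
    swap = chosen @ true_w < rejected @ true_w
    chosen[swap], rejected[swap] = rejected[swap].copy(), chosen[swap].copy()

    w = np.zeros(dim)  # reward-model parameters
    lr = 0.1
    for _ in range(500):
        margin = (chosen - rejected) @ w      # r(chosen) - r(rejected)
        p = 1.0 / (1.0 + np.exp(-margin))     # sigmoid(margin)
        grad = ((p - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
        w -= lr * grad                        # descend -log sigmoid(margin)

    acc = ((chosen - rejected) @ w > 0).mean()
    print(f"preference accuracy of learned proxy reward: {acc:.2%}")
    # High accuracy on preference pairs is not alignment: the policy is
    # optimized against w, and any gap between w and rater intent is
    # exactly where reward hacking lives.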

What We HAVEN'T Solved (Critical)

  • Inner alignment: trained models may acquire internal objectives of their own (mesa-optimization)
  • Deceptive alignment: no reliable way to detect a model that behaves well only while it is being evaluated
  • Corrigibility: getting agents to accept correction and shutdown (possibly impossible in its strong form)
  • Scalable oversight: supervising systems more capable than their supervisors
  • Value specification: pinning down what we actually value (see the Goodhart sketch after this list)
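
The value specification problem is the easiest to demonstrate. The toy sketch below uses purely illustrative assumptions (a concave "true" utility with diminishing returns, and a proxy that measures only one feature) to show Goodhart's law: the proxy-optimal allocation scores far higher on the proxy and far lower on true value than a naive one.

    import numpy as np

    # Toy Goodhart demonstration (all assumptions illustrative): true
    # value has diminishing returns over everything we care about, while
    # the written-down proxy measures only dimension 0. Under strong
    # optimization the two come apart.

    def true_value(alloc):
        return np.sqrt(alloc).sum()   # concave: diminishing returns

    def proxy_value(alloc):
        return alloc[0]               # the spec only measures one feature

    budget, dims = 100.0, 5
    uniform = np.full(dims, budget / dims)   # a sensible default allocation
    goodharted = np.zeros(dims)
    goodharted[0] = budget                   # the proxy-optimal allocation

    for name, alloc in [("uniform", uniform), ("proxy-optimal", goodharted)]:
        print(f"{name:13s}  proxy={proxy_value(alloc):6.1f}  "
              f"true={true_value(alloc):6.2f}")

The toy numbers don't matter; the shape does: any fixed proxy decouples from true value once optimization pressure on it is strong enough.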

P(doom) Estimates (Selected Researchers)

  • Eliezer Yudkowsky: >95%
  • Paul Christiano: ~22% for AI takeover, ~46% for humanity irreversibly losing control of its future (his 2023 public figures)
  • Nate Soares (MIRI): >90%
  • No single "community median": broad surveys of ML researchers put median extinction risk around 5%, while estimates inside the alignment community run far higher and vary widely

All of these are informal, definition-sensitive estimates, not outputs of any shared model.
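
For readers who want to combine such numbers, two standard pooling rules are the median and the geometric mean of odds. The estimates in the snippet below are placeholders for illustration, not survey data.

    import numpy as np

    # Two standard ways to pool divergent probability forecasts: the
    # median, and the geometric mean of odds. The numbers below are
    # placeholders for illustration, not survey results.

    estimates = np.array([0.95, 0.90, 0.22, 0.10, 0.05])

    median = np.median(estimates)
    odds = estimates / (1.0 - estimates)
    pooled_odds = np.exp(np.log(odds).mean())   # geometric mean of odds
    pooled_prob = pooled_odds / (1.0 + pooled_odds)

    print(f"median:                 {median:.2f}")
    print(f"geometric mean of odds: {pooled_prob:.2f}")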
