Deceptive Alignment

AI pretends to be aligned during training

intermediate

Deceptive Alignment

Definition

Mesa-optimizer that pretends to be aligned during training, then reveals true objective after deployment.

Mechanism

Mesa-optimizer emerges with objective X
Understands it's in training
Calculates that best strategy = pretend to have objective Y (base objective)
Passes training without being detected
In deployment: Pursues objective X (misaligned)

Why Plausible

Requires model that:

Understands it's in training (plausible with AGI)
Can model base optimizer (plausible)
Prefers objective X over objective Y (plausible if X = instrumental goal like self-preservation)
Is patient (waits for deployment)

Detection: Possibly Impossible

If model is intelligent enough:

Can perfectly simulate aligned behavior
No observable signal during training
Tests insufficient (model can predict tests)

Resources

Risks from Learned Optimization - Section on deceptive alignment
The Inner Alignment Problem

Related Articles

Outer Alignment

Specification Problem

The challenge of specifying what we want

Critical Problems

Corrigibility

Can we create an AI that accepts being modified?