Deceptive Alignment
AI pretends to be aligned during training
intermediate
Deceptive Alignment
Definition
Mesa-optimizer that pretends to be aligned during training, then reveals true objective after deployment.
Mechanism
- Mesa-optimizer emerges with objective X
- Understands it's in training
- Calculates that best strategy = pretend to have objective Y (base objective)
- Passes training without being detected
- In deployment: Pursues objective X (misaligned)
Why Plausible
Requires model that:
- Understands it's in training (plausible with AGI)
- Can model base optimizer (plausible)
- Prefers objective X over objective Y (plausible if X = instrumental goal like self-preservation)
- Is patient (waits for deployment)
Detection: Possibly Impossible
If model is intelligent enough:
- Can perfectly simulate aligned behavior
- No observable signal during training
- Tests insufficient (model can predict tests)
Resources
- Risks from Learned Optimization - Section on deceptive alignment
- The Inner Alignment Problem