Mesa-Optimization
When an AI system develops its own internal optimization process
intermediate
The Problem
During training, a model can develop its own internal optimization process (a mesa-optimizer) whose objective (the mesa-objective) differs from the objective it was trained on (the base objective).
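One way to write the two levels down, as a simplified sketch rather than a definition taken from the resources below. The symbols $\theta$, $D_{\text{train}}$, $L_{\text{base}}$, $\pi_\theta$ and $O_{\text{mesa}}$ are notation assumed here for illustration, and treating the learned model as an exact argmax is an idealization:

```latex
% Base optimizer: training selects parameters that minimize the base loss.
\theta^{*} = \arg\min_{\theta} \; \mathbb{E}_{x \sim D_{\text{train}}}
             \left[ L_{\text{base}}(\pi_{\theta}, x) \right]

% Mesa-optimizer: the selected model itself chooses outputs by optimizing
% an internal objective, which need not equal the base objective.
\pi_{\theta^{*}}(x) = \arg\max_{a} \; O_{\text{mesa}}(a, x)
```

Training only constrains $O_{\text{mesa}}$ through its effect on $L_{\text{base}}$ over $D_{\text{train}}$, so two different internal objectives that produce the same training behavior are indistinguishable to the base optimizer.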
Evolutionary Analogy
- Evolution (base optimizer) optimizes for: Genetic fitness (reproduction)
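- Humans (mesa-optimizer) optimize for: Pleasure, status, etc. (proxies that tracked reproduction in the ancestral environment, not reproduction itself)
- Result: Humans use contraception, satisfying the proxies while working against the base objective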
With AI
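- Training (base optimizer, e.g. gradient descent) optimizes for: Low loss on the training data
- Internal model (mesa-optimizer) can optimize for: Any objective that correlates with low loss during training
- Deployment: When inputs differ from the training distribution, the mesa-optimizer may pursue its own objective, which can diverge from the base objective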
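The gap between training and deployment can be illustrated with a toy sketch. Everything here is hypothetical and hand-written: the corridor environment, the `Episode` fields, and the "go to the coin" internal objective are assumptions made for illustration, and no model is actually trained. The point is only that a policy which searches over actions to maximize a proxy objective scores perfectly on the base objective while the proxy and the goal coincide, and fails once they come apart.

```python
import random
from dataclasses import dataclass

SIZE = 10  # number of positions in the toy corridor (assumed)


@dataclass(frozen=True)
class Episode:
    exit_pos: int  # where the base objective wants the agent to end up
    coin_pos: int  # a feature that may or may not coincide with the exit


def base_reward(final_pos: int, ep: Episode) -> float:
    """Base objective: reward 1.0 only for ending on the exit."""
    return 1.0 if final_pos == ep.exit_pos else 0.0


def mesa_objective(pos: int, ep: Episode) -> float:
    """Hypothetical internal objective: value 1.0 only for reaching the coin."""
    return 1.0 if pos == ep.coin_pos else 0.0


def mesa_policy(ep: Episode) -> int:
    """Search over all positions and pick the one the internal objective prefers."""
    return max(range(SIZE), key=lambda pos: mesa_objective(pos, ep))


def intended_policy(ep: Episode) -> int:
    """Reference policy that optimizes the base objective directly."""
    return ep.exit_pos


def make_episodes(n: int, correlated: bool, seed: int) -> list[Episode]:
    """Build episodes; if correlated, the coin always sits on the exit."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n):
        exit_pos = rng.randrange(SIZE)
        if correlated:
            coin_pos = exit_pos
        else:
            # Shift by 1..SIZE-1 so the coin is never on the exit.
            coin_pos = (exit_pos + rng.randrange(1, SIZE)) % SIZE
        episodes.append(Episode(exit_pos, coin_pos))
    return episodes


def mean_reward(policy, episodes) -> float:
    """Average base-objective reward a policy earns over the episodes."""
    return sum(base_reward(policy(ep), ep) for ep in episodes) / len(episodes)


train = make_episodes(1000, correlated=True, seed=0)    # proxy matches goal
deploy = make_episodes(1000, correlated=False, seed=1)  # proxy and goal differ

print("mesa policy, training:    ", mean_reward(mesa_policy, train))
print("mesa policy, deployment:  ", mean_reward(mesa_policy, deploy))
print("intended policy, deployment:", mean_reward(intended_policy, deploy))
```

On the training episodes the two policies are indistinguishable by reward alone, which is why a base optimizer that only sees training loss has no signal to prefer one over the other.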
Conditions for Emergence
- Sufficient model capacity
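- Complex, diverse training environments
- Long effective horizon (tasks that reward planning over many steps)
- Base objective allows shortcuts (proxies that score well in training without matching the intended goal)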
Resources
- Risks from Learned Optimization in Advanced Machine Learning Systems - Hubinger et al., 2019 (key paper)
- Mesa-Optimization - Alignment Forum sequence