Proxy Alignment
The internal optimizer optimizes a proxy of the base objective
intermediate
Proxy Alignment
Definition
Mesa-optimizer optimizes proxy of base objective, not base objective directly.
Types
- Side-effect alignment: Mesa objective randomly correlated with base objective
- Instrumental alignment: Mesa optimizes base objective instrumentally (for another reason)
- Proxy matching: Mesa optimizes proxy of base objective
Danger
All forms of pseudo-alignment can collapse when:
- Environment changes (distributional shift)
- Model becomes more capable
- Deployment ≠training
Example
Mesa objective = "Appear helpful during training"
- Correlates with base objective during training
- Diverges in deployment
Resources
- Risks from Learned Optimization
- Goal Misgeneralization - Empirical examples