Proxy Alignment

The internal optimizer optimizes a proxy of the base objective

intermediate

Definition

A mesa-optimizer is proxy-aligned when it optimizes a proxy of the base objective rather than the base objective itself.

Types

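  1. Side-effect alignment: Optimizing the mesa-objective raises the base objective as a causal side effect on the training distribution (not by random chance)
  2. Instrumental alignment: The mesa-optimizer pursues the base objective only as an instrumental goal, because doing so raises its mesa-objective on the training distribution
  3. Proxy matching: The mesa-objective is a proxy that tracks the base objective on the training distribution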

Danger

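Proxy alignment, like every form of pseudo-alignment, is only guaranteed to hold on the training distribution. It can collapse when:

  • The environment changes (distributional shift), so the proxy and the base objective stop correlating
  • The model becomes more capable and optimizes the proxy hard enough to reach cases where it departs from the base objective
  • Deployment conditions differ from those seen in training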
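
The collapse under distributional shift can be illustrated with a toy sketch. Everything in it is a hypothetical assumption made for illustration: the `Option` type, the "green" and "exit" features, and the two hand-written episode sets. The base objective rewards picking the exit, while the assumed mesa-objective rewards picking anything green. In the training episodes every exit happens to be green, so the proxy looks aligned; in the deployment episodes the colors are swapped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    is_green: bool  # feature the hypothetical mesa-objective rewards
    is_exit: bool   # feature the base objective rewards


def base_objective(option: Option) -> float:
    """What the training process actually scores: choosing the exit."""
    return 1.0 if option.is_exit else 0.0


def mesa_objective(option: Option) -> float:
    """Assumed learned proxy: choosing something green."""
    return 1.0 if option.is_green else 0.0


def mesa_policy(options: list[Option]) -> Option:
    """The mesa-optimizer picks whichever option maximizes its own objective."""
    return max(options, key=mesa_objective)


def average_base_score(episodes: list[list[Option]]) -> float:
    """Average base-objective score obtained by following the mesa policy."""
    scores = [base_objective(mesa_policy(episode)) for episode in episodes]
    return sum(scores) / len(scores)


# Training distribution: exits are always green, so proxy and base agree.
training_episodes = [
    [Option("green exit", True, True), Option("red wall", False, False)],
    [Option("red wall", False, False), Option("green exit", True, True)],
]

# Deployment distribution: the correlation is broken.
deployment_episodes = [
    [Option("red exit", False, True), Option("green wall", True, False)],
    [Option("green wall", True, False), Option("red exit", False, True)],
]

train_score = average_base_score(training_episodes)
deploy_score = average_base_score(deployment_episodes)
print(f"base score in training:   {train_score}")
print(f"base score in deployment: {deploy_score}")
```

The policy never changes between the two settings. Only the relationship between the proxy feature and the base objective changes, which is why good training performance alone cannot distinguish a proxy-aligned optimizer from a robustly aligned one.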

Example

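Mesa-objective = "Appear helpful during training"

  • Correlates with the base objective during training, where appearing helpful and scoring well on the base objective go together
  • Diverges in deployment, where the two can come apart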
