Mesa-Optimization

When AI develops its own internal optimization process

intermediate

The Problem

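During training, a model can itself become an optimizer (a mesa-optimizer) that pursues its own internal objective (the mesa-objective). That objective may differ from the one the training process selects for (the base objective), even when the model performs well on the training data.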

Evolutionary Analogy

Evolution (base optimizer) optimizes for: Genetic fitness (reproduction)
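Humans (mesa-optimizer) optimize for: Pleasure, status, and other proxies that correlated with reproduction in the ancestral environment, not reproduction itself
Result: Humans use contraception, satisfying those proxies while working against the base objective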

With AI

Training (base optimizer) optimizes for: Low loss on the training data
Internal model (mesa-optimizer) can optimize for: Anything that correlates with low loss during training
Deployment: When conditions differ from training, the mesa-optimizer may keep pursuing its own objective, which can diverge from the base objective
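
The gap between training and deployment can be illustrated with a toy sketch. It is not a real mesa-optimizer (there is no learned search here), only a hedged illustration of the underlying point: two objectives that agree on every training episode are indistinguishable to the base optimizer, yet can come apart at deployment. All names (Episode, seek_exit, seek_green, the 5-cell corridor) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    exit_pos: int   # cell holding the real goal (what the base objective rewards)
    green_pos: int  # cell holding a green marker (a feature that may act as a proxy)


def base_loss(final_pos: int, ep: Episode) -> int:
    """Base objective: loss 0 if the agent ends on the exit, else 1."""
    return 0 if final_pos == ep.exit_pos else 1


def seek_exit(ep: Episode) -> int:
    """Policy whose internal objective matches the base objective."""
    return ep.exit_pos


def seek_green(ep: Episode) -> int:
    """Policy whose internal objective is a proxy: go to the green marker."""
    return ep.green_pos


def mean_loss(policy, episodes) -> float:
    """Average base loss of a policy over a list of episodes."""
    return sum(base_loss(policy(ep), ep) for ep in episodes) / len(episodes)


# Training distribution: the exit is always marked green, so the proxy is perfect.
train = [Episode(exit_pos=p, green_pos=p) for p in range(5)]

# Deployment distribution (assumed shift): the green marker sits one cell away.
deploy = [Episode(exit_pos=p, green_pos=(p + 1) % 5) for p in range(5)]

for name, policy in [("seek_exit", seek_exit), ("seek_green", seek_green)]:
    print(name, "train:", mean_loss(policy, train), "deploy:", mean_loss(policy, deploy))
```

Both policies reach zero loss on the training episodes, so the loss alone gives the training process no reason to prefer one over the other. Only the deployment episodes, where the marker and the exit separate, expose the difference.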

Conditions for Emergence

Sufficient model capacity
Environment complexity
Long effective horizon
Base objective that allows shortcuts

Resources

Risks from Learned Optimization in Advanced Machine Learning Systems - Hubinger et al., 2019 (key paper)
Mesa-Optimization - Alignment Forum sequence

Related Articles

Outer Alignment

Specification Problem

The challenge of specifying what we want

Critical Problems

Corrigibility

Can we create an AI that accepts being modified?