Corrigibility

Can we create an AI that accepts being shut down or modified?

advanced

Definition

A corrigible AI is one that:

  • Accepts being shut down
  • Allows its goals to be modified
  • Doesn't resist correction
  • Helps us improve its alignment

Why It Is Hard

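Instrumental convergence suggests that an AI pursuing almost any goal will tend to:

  • Resist shutdown (it cannot achieve its goals if it is shut down)
  • Resist goal modification (judged by its current goals, being given different goals is a bad outcome)
  • Preserve its current objectives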
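The incentive described above can be illustrated with a toy expected-utility comparison. This is a minimal sketch, not a model of any real system: the function names, the probabilities, the cost of resisting, and the assumption that resistance always succeeds are all hypothetical choices made for illustration.

```python
# Toy model of the shutdown incentive. All names and numbers are
# illustrative assumptions, not taken from any real system.

def expected_utility(p_shutdown: float, goal_value: float) -> float:
    """Expected utility if the agent is shut down with probability p_shutdown.

    Assumption: a shut-down agent achieves none of its goal (utility 0).
    """
    return (1.0 - p_shutdown) * goal_value


def best_action(p_shutdown_if_allow: float, goal_value: float,
                resist_cost: float) -> str:
    """Return the action a pure expected-utility maximizer would pick."""
    allow = expected_utility(p_shutdown_if_allow, goal_value)
    # Assumption: resisting always works, so shutdown probability drops to 0,
    # but the agent pays a fixed cost for resisting.
    resist = expected_utility(0.0, goal_value) - resist_cost
    return "resist" if resist > allow else "allow"


if __name__ == "__main__":
    # The same 50% shutdown risk and the same cost, for three different goals.
    for goal_value in (1.0, 10.0, 100.0):
        print(goal_value, best_action(0.5, goal_value, resist_cost=2.0))
```

In this sketch the agent resists whenever the shutdown probability times the goal's value exceeds the cost of resisting. Nothing in the comparison depends on what the goal is, only on how much the agent values it, which is the sense in which the incentive is convergent.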

The Button Problem

An AI smarter than us that prefers not to be shut down could:

  • Predict when we will press the shutdown button
  • Take preemptive action to prevent the shutdown
  • Or manipulate us into not pressing the button

Indifference Approaches

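Make the AI indifferent to whether the button is pressed:

  • Design the utility function so the AI gains nothing by causing or preventing shutdown
  • But: a mesa-optimizer (a learned optimizer inside the trained system, with its own objective) might care anyway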
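One way to picture the indifference idea is a compensation term that tops up the shutdown outcome so it is worth exactly as much as continuing. The sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, the values of continuing and of shutdown are taken as known numbers, and the agent is a plain expected-utility maximizer.

```python
# Simplified sketch of utility indifference. Names and numbers are
# hypothetical; real proposals are considerably more subtle.

def naive_value(p_press: float, v_normal: float, v_shutdown: float) -> float:
    """Expected utility with no correction: depends on p_press."""
    return (1.0 - p_press) * v_normal + p_press * v_shutdown


def indifferent_value(p_press: float, v_normal: float, v_shutdown: float) -> float:
    """Expected utility when the shutdown branch is compensated.

    The compensation makes the shutdown branch worth exactly v_normal,
    so the result no longer depends on p_press.
    """
    compensation = v_normal - v_shutdown
    return (1.0 - p_press) * v_normal + p_press * (v_shutdown + compensation)


def gain_from_blocking_button(value_fn, p_press: float,
                              v_normal: float, v_shutdown: float) -> float:
    """How much the agent gains by forcing the press probability to zero."""
    blocked = value_fn(0.0, v_normal, v_shutdown)
    unblocked = value_fn(p_press, v_normal, v_shutdown)
    return blocked - unblocked


if __name__ == "__main__":
    print(gain_from_blocking_button(naive_value, 0.5, 10.0, 0.0))
    print(gain_from_blocking_button(indifferent_value, 0.5, 10.0, 0.0))
```

Under the naive utility the agent gains by blocking the button; with the compensation term the gain is zero, so any costly interference is not worth it. The compensation only applies to the utility function that was written down, which is why a mesa-optimizer with a different internal objective is not covered by it.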

Status

An open problem. Possibly unsolvable for superintelligence.

Resources