Corrigibility

Can we create an AI that accepts being shut down or modified?

Definition

A corrigible AI is one that:

Accepts being shut down
Allows its goals to be modified
Doesn't resist correction
Helps us improve its alignment

Why It Is Hard

Instrumental convergence suggests that an AI pursuing almost any goal will:

Resist shutdown (it can't achieve its goals if it is shut down)
Resist goal modification (pursuing new goals means its current goals go unmet)
Preserve its current objectives
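
The incentive described above can be made concrete with a toy calculation. This is a minimal sketch, not a model of any real system: the action names, the probabilities, and the helpers `expected_utility` and `best_action` are all hypothetical. It assumes the agent is a pure expected-utility maximizer whose goal is achieved only if it keeps running.

```python
# Toy model of the shutdown incentive. All numbers are illustrative assumptions.

def expected_utility(p_shutdown: float, goal_value: float) -> float:
    """Expected utility when the goal is achieved only if the agent keeps running."""
    return (1.0 - p_shutdown) * goal_value

# Hypothetical probabilities that the operators succeed in shutting the agent down,
# depending on which action the agent takes.
P_SHUTDOWN = {"allow_shutdown": 0.3, "resist_shutdown": 0.0}

def best_action(goal_value: float) -> str:
    """Return the action a pure expected-utility maximizer would pick."""
    return max(P_SHUTDOWN, key=lambda a: expected_utility(P_SHUTDOWN[a], goal_value))

# The preferred action is the same for every positive goal value tried here:
# the content of the goal does not matter, only that shutdown prevents it.
for goal_value in (0.1, 1.0, 1000.0):
    print(goal_value, best_action(goal_value))
```

Under these assumptions the agent picks `resist_shutdown` for every goal value tried, which is the sense in which resisting shutdown is instrumentally useful rather than something the agent wants for its own sake.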

The Button Problem

An AI smarter than us may:

Predict when we will press the shutdown button
Take preemptive action to prevent shutdown
Or manipulate us into not pressing the button

Indifference Approaches

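Make the AI indifferent to whether the button is pressed:

The utility function is designed so that the AI gains nothing by causing or preventing shutdown
But a mesa-optimizer (a learned optimizer inside the trained system, with its own objective) might care anyway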
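
One way to picture indifference is a compensation term: if the button is pressed, the agent is credited with the utility it would have expected had the button not been pressed. The sketch below is a simplified illustration in the spirit of utility-indifference proposals, not a faithful implementation of any published scheme. The functions `naive_value` and `indifferent_value` and all numbers are hypothetical.

```python
# Toy comparison of a naive utility with a compensated ("indifferent") one.
# eu_normal:   expected utility if the agent keeps running (assumed number).
# eu_shutdown: expected utility if the button is pressed (assumed number).

def naive_value(p_press: float, eu_normal: float, eu_shutdown: float) -> float:
    """Expected utility without any correction; it falls as p_press rises."""
    return p_press * eu_shutdown + (1.0 - p_press) * eu_normal

def indifferent_value(p_press: float, eu_normal: float, eu_shutdown: float) -> float:
    """Expected utility with a compensation term added in the 'pressed' branch."""
    compensation = eu_normal - eu_shutdown  # makes both branches worth the same
    pressed_value = eu_shutdown + compensation
    return p_press * pressed_value + (1.0 - p_press) * eu_normal

EU_NORMAL, EU_SHUTDOWN = 10.0, 0.0
for p_press in (0.0, 0.3, 0.9):
    print(
        p_press,
        naive_value(p_press, EU_NORMAL, EU_SHUTDOWN),
        indifferent_value(p_press, EU_NORMAL, EU_SHUTDOWN),
    )
```

In this toy setting the naive value drops as the press probability rises, so the agent gains by lowering it, while the compensated value stays the same for every press probability, leaving nothing to gain from influencing the button. The sketch only covers the utility function the designers write down; it says nothing about what a mesa-optimizer inside the system might actually pursue.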

Status

Unsolved, and possibly unsolvable for a superintelligent AI.

Resources