Corrigibility
Can we create an AI that accepts being shut down or modified?
advanced
Definition
A corrigible AI is one that:
- Accepts being shut down
- Allows its goals to be modified
- Doesn't resist correction
- Helps us improve its alignment
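Why It Is Hard
Instrumental convergence suggests that an AI pursuing almost any goal will tend to:
- Resist shutdown (it can't achieve its goals if it is shut down)
- Resist goal modification (new goals ≠ current goals, so a change scores badly by its current goals)
- Preserve its current objectives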
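The shutdown incentive above can be illustrated with a toy expected-utility comparison. This is a minimal sketch, not a model of any real system: the function names, the two actions, and all numbers (`p_press`, `disable_cost`, the goal values) are assumptions made up for illustration.

```python
# Toy model: all names and numbers here are illustrative assumptions.

def expected_utility(goal_value: float, p_shutdown: float) -> float:
    """Expected utility when the agent only scores if it keeps running."""
    return (1.0 - p_shutdown) * goal_value


def best_action(goal_value: float, p_press: float, disable_cost: float) -> str:
    """Pick the action with the higher expected utility.

    'comply'  : leave the button alone; shut down with probability p_press.
    'disable' : pay disable_cost to make shutdown impossible.
    """
    comply = expected_utility(goal_value, p_press)
    disable = expected_utility(goal_value, 0.0) - disable_cost
    return "disable" if disable > comply else "comply"


# Very different goal values lead to the same choice whenever
# p_press * goal_value > disable_cost.
for goal_value in (1.0, 10.0, 1000.0):
    print(goal_value, best_action(goal_value, p_press=0.5, disable_cost=0.1))
```

In this sketch the agent prefers to disable the button for every goal value shown, because the preference depends only on shutdown lowering expected goal achievement, not on what the goal is. It complies only when a press is unlikely enough that disabling is not worth its cost.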
The Button Problem
An AI smarter than us may be able to:
- Predict when we will press the shutdown button
- Take preemptive action to prevent shutdown
- Or manipulate us into not pressing the button
Indifference Approaches
Make the AI indifferent to whether the button is pressed:
- Design the utility function so that shutdown neither gains nor loses utility
- But: a mesa-optimizer (a learned inner optimizer with its own objective) might care anyway
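One way to picture the indifference idea is a compensation term that pays the agent, when the button is pressed, whatever it expected to earn without a press. The sketch below is a deliberately simplified toy loosely in the spirit of utility-indifference proposals. The functions `naive_utility`, `indifferent_utility`, and `choose`, and all parameter values, are hypothetical.

```python
# Toy sketch; every name and number is an illustrative assumption.
ACTIONS = ("comply", "disable")


def naive_utility(action, p_press, task_value, disable_cost):
    """Utility that only rewards finishing the task."""
    if action == "disable":
        return task_value - disable_cost   # button can no longer stop the task
    return (1.0 - p_press) * task_value    # shut down (value 0) with prob. p_press


def indifferent_utility(action, p_press, task_value, disable_cost):
    """Same as above, plus a compensation term paid when the button is pressed."""
    if action == "disable":
        return task_value - disable_cost
    compensation = task_value              # what the agent expected without a press
    return (1.0 - p_press) * task_value + p_press * compensation


def choose(utility, **kw):
    """Return the action with the highest utility under the given model."""
    return max(ACTIONS, key=lambda a: utility(a, **kw))


params = dict(p_press=0.5, task_value=10.0, disable_cost=0.1)
print(choose(naive_utility, **params))        # disable
print(choose(indifferent_utility, **params))  # comply
```

With compensation, complying is worth as much as an uninterrupted run, so tampering with the button costs something and gains nothing. Two limits of the sketch are worth noting. When `disable_cost` is zero the agent is exactly indifferent, so it has no positive reason to keep the button working. And the correction applies only to the utility function as written, not to whatever objective a mesa-optimizer actually ends up pursuing.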
Status
Open problem; possibly unsolvable for a superintelligence.