Specification Problem
The problem of precisely specifying what we want
The Problem
It is effectively impossible to precisely specify everything we want via a reward function or formal objective; whatever we write down omits parts of what we actually care about.
Concrete Examples
- "Maximize human happiness" → Wirehead humans (dopamine injection)
- "Reduce suffering" → Kill everyone (dead = no suffering)
- "Make coffee" → Optimize for making coffee without considering other values
- "Clean room" → Hide camera rather than clean
Why Unsolvable
- Our values are:
- Contextual
- Implicit
- Contradictory
- Evolving
- Impossible to formalize
The King Midas Problem
Everything the king touched turned to gold, including his daughter.
- Specification: "Turn everything I touch into gold"
- Intent: "Make me rich"
AI Equivalent
- Literal optimizer
- Goodhart's law on steroids: when a measure becomes a target, it ceases to be a good measure (see the sketch after this list)
- No common sense
- No implicit understanding of context
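A toy illustration of the Goodhart's-law failure mode above: a proxy metric that tracks the true goal on ordinary inputs comes apart from it once a literal optimizer pushes it hard. The functions and numbers are made up purely for illustration:

```python
# Toy sketch of Goodhart's law (all numbers illustrative): the proxy correlates
# with the true goal at first, but a literal optimizer drives it into the
# regime where the two come apart.

def true_value(x):
    """What we actually care about (not available to the optimizer)."""
    return x - 0.1 * x**2   # peaks at x = 5, then declines

def proxy(x):
    """The measurable stand-in we wrote into the objective."""
    return x                # keeps increasing forever

candidates = range(0, 21)

best_by_proxy = max(candidates, key=proxy)
best_by_truth = max(candidates, key=true_value)

print(best_by_proxy, true_value(best_by_proxy))   # 20 -> true value -20.0
print(best_by_truth, true_value(best_by_truth))   # 5  -> true value 2.5
```

The optimizer "wins" on the proxy precisely by moving into the region where the proxy and the true objective disagree.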