Specification Problem
The problem of precisely specifying what we want
The Problem
It is effectively impossible to precisely specify everything we want via a reward function or formal objective; whatever we write down omits parts of what we actually care about.
Concrete Examples
- "Maximize human happiness" → Wirehead humans (dopamine injection)
- "Reduce suffering" → Kill everyone (dead = no suffering)
- "Make coffee" → Optimize for making coffee without considering other values
- "Clean room" → Hide camera rather than clean
Why Unsolvable
- Our values are:
- Contextual
- Implicit
- Contradictory
- Evolving
- Impossible to formalize
The King Midas Problem
Everything the king touched turned to gold, including his daughter.
- Specification: "Turn everything I touch into gold"
- Intent: "Make me rich"
AI Equivalent
- Literal optimizer
- Goodhart's law on steroids: when a measure becomes a target, it ceases to be a good measure (see the sketch after this list)
- No common sense
- No implicit understanding of context
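A toy illustration of the Goodhart's-law failure mode above: a proxy metric that tracks the true goal on ordinary inputs comes apart from it once a literal optimizer pushes it hard. The functions and numbers are made up purely for illustration:

```python
# Toy sketch of Goodhart's law (all numbers illustrative): the proxy correlates
# with the true goal at first, but a literal optimizer drives it into the
# regime where the two come apart.

def true_value(x):
    """What we actually care about (not available to the optimizer)."""
    return x - 0.1 * x**2   # peaks at x = 5, then declines

def proxy(x):
    """The measurable stand-in we wrote into the objective."""
    return x                # keeps increasing forever

candidates = range(0, 21)

best_by_proxy = max(candidates, key=proxy)
best_by_truth = max(candidates, key=true_value)

print(best_by_proxy, true_value(best_by_proxy))   # 20 -> true value -20.0
print(best_by_truth, true_value(best_by_truth))   # 5  -> true value 2.5
```

The optimizer "wins" on the proxy precisely by moving into the region where the proxy and the true objective disagree.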