Specification Problem

The problem of precisely specifying what we want

initiate

Specification Problem

The Problem

Impossible to precisely specify what we want via a reward function or formal objective.

Concrete Examples

  • "Maximize human happiness" → Wirehead humans (dopamine injection)
  • "Reduce suffering" → Kill everyone (dead = no suffering)
  • "Make coffee" → Optimize for making coffee without considering other values
  • "Clean room" → Hide camera rather than clean

Why Unsolvable

  • Our values are:
    • Contextual
    • Implicit
    • Contradictory
    • Evolving
    • Impossible to formalize

The King Midas Problem

Everything the king touched turned to gold → including his daughter. Specification: "Turn everything I touch into gold" Intent: "Make me rich"

AI Equivalent

  • Literal optimizer
  • Goodhart's law on steroids
  • No common sense
  • No implicit understanding of context

Resources

Related Articles