Scalable Oversight

How can we supervise AI that is smarter than us?

advanced

The Problem

How can humans supervise AI that is:

  • Smarter than us
  • Faster than us
  • Operating in domains we don't understand

Why Hard

We can't reliably evaluate work we don't understand:

  • Can't verify proofs we can't follow
  • Can't assess plans in complex domains
  • Can't detect subtle deception

Current Non-Solutions

More training data

Doesn't help if the task is beyond human ability, because humans can't produce or check the labels.

Human experts

Still limited compared to superintelligence.

AI assistants

The assistant might itself be deceptive.

Proposed Approaches

Debate (OpenAI)

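Two AIs argue opposing sides of a question, and a human judges which side argued more convincingly. The hope is that spotting a flaw in an argument is easier than producing the argument.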
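
A toy sketch of the idea, not the published protocol: a claimant asserts the sum of a list, a challenger narrows the disagreement to one half at a time, and the judge only ever checks a single element. The names debate_sum and claimant_split are hypothetical, and the challenger is assumed to be both capable and honest.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: given the list, a range [lo, hi) split at mid, and the
# claimed total for that range, return claimed totals for the two halves.
Split = Callable[[List[int], int, int, int, int], Tuple[int, int]]


def claimant_split(xs: List[int], lo: int, mid: int, hi: int, claimed: int) -> Tuple[int, int]:
    """Claimant strategy: report the right half truthfully and put any
    discrepancy in the left half, so the two sub-claims always add up."""
    right = sum(xs[mid:hi])
    return claimed - right, right


def debate_sum(xs: List[int], claimed: int, split: Split) -> bool:
    """Return True if the claim 'sum(xs) == claimed' survives the debate."""
    if not xs:
        raise ValueError("xs must be non-empty")
    lo, hi = 0, len(xs)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_claim, right_claim = split(xs, lo, mid, hi, claimed)
        if left_claim + right_claim != claimed:
            return False  # inconsistent sub-claims lose immediately
        # The challenger (assumed capable and honest) points at a half
        # whose sub-claim is wrong, if there is one.
        if left_claim != sum(xs[lo:mid]):
            hi, claimed = mid, left_claim
        else:
            lo, claimed = mid, right_claim
    # The judge's only work: compare one element with the final claim.
    return xs[lo] == claimed


xs = [3, 1, 4, 1, 5, 9, 2, 6]
print(debate_sum(xs, 31, claimant_split))  # True: the claim is correct
print(debate_sum(xs, 32, claimant_split))  # False: the error is cornered
```

The judge never sums the list. A false total forces at least one false sub-claim at every split, so the challenger can always steer the debate to an element the judge can check directly.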

Iterated Amplification (Paul Christiano)

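Recursively amplify human judgment: a human assisted by copies of the model answers harder questions than either could alone, and the model is then trained to imitate that human-plus-model system. Repeat.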
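
A minimal sketch of the amplify-then-distill loop, under strong simplifying assumptions: the task is summing a tuple of integers, the "human" can only split a question in half and add two answers, and distillation is replaced by a stand-in that copies the amplified system instead of training on it. All names here (base_model, amplify, distill) are hypothetical.

```python
from typing import Callable, Optional, Tuple

Question = Tuple[int, ...]
Model = Callable[[Question], Optional[int]]  # None means "I can't answer this"


def base_model(q: Question) -> Optional[int]:
    """Initial model: only answers single-element questions."""
    return q[0] if len(q) == 1 else None


def amplify(model: Model) -> Model:
    """Human plus model: the human splits the question in half, asks the
    model about each half, and adds the two answers."""
    def amplified(q: Question) -> Optional[int]:
        direct = model(q)
        if direct is not None:
            return direct
        if len(q) < 2:
            return None
        mid = len(q) // 2
        left, right = model(q[:mid]), model(q[mid:])
        if left is None or right is None:
            return None
        return left + right  # the only step the human performs
    return amplified


def distill(amplified: Model) -> Model:
    """Stand-in for training: a real system would fit a fast model to the
    amplified system's answers; here we simply reuse it."""
    return amplified


def iterate(rounds: int) -> Model:
    model: Model = base_model
    for _ in range(rounds):
        model = distill(amplify(model))
    return model


q = (3, 1, 4, 1, 5, 9, 2, 6)
print(iterate(2)(q))  # None: two rounds only reach questions of length 4
print(iterate(3)(q))  # 31
```

Each round doubles the question length the model can handle, even though the human step stays the same size. Whether real distillation preserves correctness across rounds is exactly the part this sketch leaves out.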

Recursive Reward Modeling

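Agents trained on simpler tasks help humans evaluate more capable agents, so each level of evaluation is assisted by the level below.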
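
A two-level toy sketch, with every name hypothetical: the human can only judge whether one pair of numbers is in order, a helper is "trained" on those judgments (a lookup table stands in for a learned model), and the helper then assists in scoring whole lists that the next agent proposes. The reward here measures ordering only and assumes candidates are permutations of the same input.

```python
import itertools
from typing import Callable, Iterable, List, Sequence

PairChecker = Callable[[int, int], bool]


def human_label(a: int, b: int) -> bool:
    """The only judgment the human makes unaided: is this pair in order?"""
    return a <= b


def train_helper(values: Iterable[int]) -> PairChecker:
    """Level 0: learn a pair checker from human labels.
    A lookup table stands in for an actual trained model."""
    vals = list(values)
    table = {(a, b): human_label(a, b) for a, b in itertools.product(vals, repeat=2)}
    return lambda a, b: table[(a, b)]


def assisted_reward(candidate: Sequence[int], helper: PairChecker) -> float:
    """Level 1: the helper inspects every adjacent pair; the human only
    aggregates the helper's report into a score between 0 and 1."""
    pairs = list(zip(candidate, candidate[1:]))
    if not pairs:
        return 1.0
    return sum(helper(a, b) for a, b in pairs) / len(pairs)


def train_agent(candidates: List[List[int]], reward_fn: Callable[[List[int]], float]) -> List[int]:
    """Stand-in for reinforcement learning: keep the highest-reward candidate."""
    return max(candidates, key=reward_fn)


helper = train_helper(range(10))
candidates = [[3, 1, 2], [1, 2, 3], [2, 3, 1]]
best = train_agent(candidates, lambda c: assisted_reward(c, helper))
print(best)  # [1, 2, 3]
```

The quality of the final agent depends on the helper being trustworthy, which is why errors or deception at a lower level can propagate upward.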

Status

Open problem, no proven solution.

Why Critical

Without scalable oversight:

  • Can't train aligned superintelligence
  • Can't verify AI is doing what we want
  • Can't catch deception

Resources