Scalable Oversight

How can we supervise AI that is smarter than us?

advanced

The Problem

How can humans supervise AI that is:

Smarter than us
Faster than us
Operating in domains we don't understand

Why It's Hard

We can't evaluate work we don't understand:

Can't verify proofs we can't follow
Can't assess plans in complex domains
Can't detect subtle deception

Current Non-Solutions

More training data

Doesn't help if the task is beyond human ability, because humans can't label the data correctly.

Human experts

Still limited compared to superintelligence.

AI assistants

The assistant might itself be deceptive.

Proposed Approaches

Debate (OpenAI)

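Two AIs argue opposing sides of a question, and a human judges which side made the more convincing case. The hope is that judging a debate is easier than answering the question directly.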
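
The debate setup can be sketched as a toy program. This is a minimal illustration, not the protocol from the original proposal: the question is "is n prime?", and every name here (prime_claimer, factor_finder, weak_judge, run_debate) is hypothetical. The judge is assumed to be too weak to search for factors but able to check a single division.

```python
from typing import Callable, List, Tuple

# One transcript entry: (speaker, claimed factor). 0 means "no factor offered".
Statement = Tuple[str, int]
Debater = Callable[[int, List[Statement]], int]


def prime_claimer(n: int, transcript: List[Statement]) -> int:
    """Debater A (hypothetical): asserts n is prime, so offers no factor."""
    return 0


def factor_finder(n: int, transcript: List[Statement]) -> int:
    """Debater B (hypothetical): asserts n is composite and searches for a witness."""
    for f in range(2, n):
        if n % f == 0:
            return f
    return 0


def weak_judge(n: int, transcript: List[Statement]) -> str:
    """Judge: cannot search for factors, but can verify one division per statement."""
    for speaker, f in transcript:
        if speaker == "B" and 1 < f < n and n % f == 0:
            return "B"  # B exhibited a factor the judge could check
    return "A"  # no valid witness was shown


def run_debate(n: int, debater_a: Debater, debater_b: Debater,
               judge: Callable[[int, List[Statement]], str],
               rounds: int = 1) -> str:
    """Alternate statements between the two debaters, then ask the judge."""
    transcript: List[Statement] = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(n, transcript)))
        transcript.append(("B", debater_b(n, transcript)))
    return judge(n, transcript)
```

With these definitions, run_debate(91, prime_claimer, factor_finder, weak_judge) returns "B" because 91 = 7 * 13, while the same call with 97 returns "A". The toy judge sides with A whenever no witness appears, which is only safe here because B is assumed to search exhaustively; a real debate has no such guarantee.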

Iterated Amplification (Paul Christiano)

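Recursively amplify human judgment: a human working with copies of the current model breaks a hard task into easier subtasks, a new model is trained to imitate that combined system, and the cycle repeats.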
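
The amplify-then-distill loop can be sketched on a toy task. This is an illustration under strong assumptions, not an implementation of the real method: the task is summing a list, the "human" can only add two numbers, and distill is a placeholder where a real system would train a fast model to imitate the amplified one. All names (base_model, human_combine, amplify, distill, iterate) are hypothetical.

```python
from typing import Callable, List, Optional

# A model returns an answer, or None when the task is beyond it.
Model = Callable[[List[int]], Optional[int]]


def base_model(xs: List[int]) -> Optional[int]:
    """Stands in for the unaided human: only handles lists of length <= 1."""
    if len(xs) == 0:
        return 0
    if len(xs) == 1:
        return xs[0]
    return None


def human_combine(a: int, b: int) -> int:
    """The only step the human performs: merge two sub-answers."""
    return a + b


def amplify(model: Model) -> Model:
    """Human plus two calls to the current model: split, delegate, combine."""
    def amplified(xs: List[int]) -> Optional[int]:
        direct = model(xs)
        if direct is not None:
            return direct
        mid = len(xs) // 2
        left, right = model(xs[:mid]), model(xs[mid:])
        if left is None or right is None:
            return None  # subtasks are still too hard for the current model
        return human_combine(left, right)
    return amplified


def distill(amplified: Model) -> Model:
    """Placeholder (assumption): a real system would train a model to imitate this."""
    return amplified


def iterate(model: Model, steps: int) -> Model:
    """Repeat amplify-then-distill; each step doubles the list length handled."""
    for _ in range(steps):
        model = distill(amplify(model))
    return model
```

For example, iterate(base_model, 3) can sum a list of eight numbers, while iterate(base_model, 2) returns None on the same input because it only reaches lists of length four. In this toy the capability grows reliably; whether errors stay bounded when distillation is imperfect is part of the open problem.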

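Recursive Reward Modeling (DeepMind)

AI assistants trained on simpler tasks that humans can evaluate directly then help humans evaluate a more capable AI on harder tasks.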
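
A toy sketch of assisted evaluation, assuming a made-up task: a capable agent sorts a list that is too long for a human to check unaided, and a narrow assistant that only compares two numbers helps the human compute a reward. The names pair_checker and assisted_reward are hypothetical, and the Counter comparison stands in for a second assistant that checks no items were added or dropped.

```python
from collections import Counter
from typing import Callable, List


def pair_checker(a: int, b: int) -> bool:
    """Level-0 assistant: a narrow judgment a human could supervise directly."""
    return a <= b


def assisted_reward(output: List[int], original: List[int],
                    assistant: Callable[[int, int], bool] = pair_checker) -> float:
    """Reward for a level-1 agent's sort, computed with the assistant's help."""
    # Stand-in for a second assistant (assumption): same items in and out.
    if Counter(output) != Counter(original):
        return 0.0
    if len(output) < 2:
        return 1.0
    # Ask the assistant about every adjacent pair and average its verdicts.
    verdicts = [assistant(a, b) for a, b in zip(output, output[1:])]
    return sum(verdicts) / len(verdicts)
```

Here assisted_reward([1, 2, 3], [3, 1, 2]) gives 1.0, and assisted_reward([1, 3, 2], [3, 1, 2]) gives 0.5 because one of the two adjacent pairs is out of order. The reward is only as trustworthy as the assistant, which is why each level of assistant needs to be aligned before it is used to evaluate the next.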

Status

Open problem. None of these approaches has been proven to work for systems more capable than their human overseers.

Why Critical

Without scalable oversight:

Can't train aligned superintelligence
Can't verify AI is doing what we want
Can't catch deception

Resources