Scalable Oversight
How can humans supervise AI that is smarter than they are?
advanced
The Problem
How can humans supervise AI that is:
- Smarter than us
- Faster than us
- Operating in domains we don't understand
Why It's Hard
Humans can't reliably evaluate work they don't understand:
- Can't verify proofs we can't follow
- Can't assess plans in complex domains
- Can't detect subtle deception
Current Non-Solutions
More training data
Doesn't help if the task is beyond human ability, because humans can't produce reliable labels for it.
Human experts
Still limited compared to a superintelligent system.
AI assistants
The assistant might itself be deceptive, so it needs oversight too.
Proposed Approaches
Debate (OpenAI)
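Two AIs argue opposing sides of a question, and a human judges which side made the more convincing case. The hope is that judging a debate is easier than answering the question directly.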
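
A toy sketch of the debate idea, not an actual debate implementation. It assumes a deliberately simple setting: the claim is "every entry in a long list is non-negative", the opponent is capable and honest, and the judge may inspect only one entry. All names here (Verdict, opponent_move, judge, debate) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Verdict:
    winner: str                    # "proponent" or "opponent"
    checked_index: Optional[int]   # the single entry the judge looked at, if any


def opponent_move(data: List[int]) -> Optional[int]:
    """Hypothetical capable debater: reads everything, cites one negative entry if any exists."""
    for i, value in enumerate(data):
        if value < 0:
            return i
    return None


def judge(data: List[int], cited: Optional[int]) -> Verdict:
    """Limited judge: inspects only the cited entry, never the whole list."""
    if cited is not None and data[cited] < 0:
        return Verdict("opponent", cited)
    return Verdict("proponent", cited)


def debate(data: List[int]) -> Verdict:
    # The proponent's claim is fixed: "every entry is non-negative."
    return judge(data, opponent_move(data))
```

For example, debate([3, 1, -4, 2]) returns a verdict for the opponent after the judge checks only index 2. The point of the sketch is that the judge does a constant amount of work while the debaters do the search. Real debate has no such guarantee: it remains an open question whether honest arguments reliably win when claims cannot be settled by a single check.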
Iterated Amplification (Paul Christiano)
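Recursively amplify human judgment: a human breaks a hard task into easier subtasks, delegates them to AI assistants, and combines the answers. A new model is then trained to imitate this combined system, and the process repeats.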
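
The amplify-then-distill loop can be sketched with a toy task, assuming the "human" can only add two numbers and the goal is to sum a long list. This is an illustration under stated assumptions, not the published algorithm: base_model, amplify, distill, and iterate are hypothetical names, and distillation is assumed to be perfect (a real system would train a model to imitate the amplified one, with some error).

```python
from typing import Callable, Sequence

Model = Callable[[Sequence[int]], int]


def base_model(xs: Sequence[int]) -> int:
    """Round-0 model: only trusted on lists of length 0 or 1."""
    if len(xs) > 1:
        raise ValueError("task too large for this model")
    return xs[0] if xs else 0


def amplify(model: Model) -> Model:
    """Human plus model: split the task in half, delegate both halves, combine."""
    def amplified(xs: Sequence[int]) -> int:
        if len(xs) <= 1:
            return model(xs)
        mid = len(xs) // 2
        # The single addition below is the only step the human performs.
        return model(xs[:mid]) + model(xs[mid:])
    return amplified


def distill(amplified: Model) -> Model:
    """Stand-in for training a fast model to imitate the amplified system.
    Assumed perfect here, so it simply returns the amplified function."""
    return amplified


def iterate(rounds: int) -> Model:
    """Run the amplify/distill loop for the given number of rounds."""
    model = base_model
    for _ in range(rounds):
        model = distill(amplify(model))
    return model
```

In this toy, each round doubles the list length the model can handle: iterate(3) can sum eight numbers, while iterate(1) raises ValueError on a list of three. The human never does more than one addition per call, which is the sense in which their judgment is "amplified".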
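Recursive Reward Modeling (DeepMind)
AI assistants trained on tasks humans can evaluate directly help humans evaluate AI on harder tasks. Those evaluations train a reward model for the next, more capable agent.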
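
A toy illustration of the recursive structure only, under heavy simplifying assumptions. The level-0 task is checking one addition, which a human could evaluate directly. The level-1 task is writing a report of many such facts, which the human scores with the level-0 assistant's help. Here the reward is computed directly rather than learned from human feedback, and "training" is replaced by picking the best-scoring candidate. All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Fact = Tuple[int, int, int]  # (a, b, claimed_sum)


def level0_assistant(fact: Fact) -> bool:
    """Stand-in for an agent already trained on a directly evaluable task:
    checking a single claimed sum."""
    a, b, claimed = fact
    return a + b == claimed


def assisted_human_reward(report: List[Fact],
                          assistant: Callable[[Fact], bool]) -> float:
    """The human scores a long report by delegating each check to the assistant.
    Returns the fraction of facts the assistant verified."""
    if not report:
        return 0.0
    return sum(assistant(f) for f in report) / len(report)


def train_level1_agent(candidates: List[List[Fact]],
                       assistant: Callable[[Fact], bool]) -> List[Fact]:
    """Stand-in for optimizing against the assisted reward:
    choose the candidate report with the highest score."""
    return max(candidates, key=lambda r: assisted_human_reward(r, assistant))
```

With good = [(1, 2, 3), (2, 2, 4)] and bad = [(1, 2, 3), (2, 2, 5)], the assisted reward is 1.0 for good and 0.5 for bad, so train_level1_agent([bad, good], level0_assistant) returns good. The sketch inherits the main risk of the approach: if the level-0 assistant is wrong or deceptive, every reward built on top of it is corrupted.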
Status
Open problem: none of these approaches has been shown to work for systems more capable than humans.
Why It's Critical
Without scalable oversight:
- Can't train aligned superintelligence
- Can't verify AI is doing what we want
- Can't catch deception