Constitutional AI

Teaching AI to self-improve based on principles

advanced

Constitutional AI

Principle (Anthropic)

AI is given a "constitution" (list of principles)
AI critiques its own outputs based on principles
AI revises outputs to better align with principles
Use revised outputs for RLHF training

Advantages over RLHF

Less human feedback needed
More transparent (explicit principles)
Can iterate without constant human supervision
Reduces harmful outputs

Limitations

Still doesn't solve inner alignment
Principles themselves are proxy
AI could learn to game the principles
Doesn't guarantee robust alignment

Constitution Example

Be helpful and harmless
Avoid deception
Respect privacy
Don't help with illegal activities

Resources

Constitutional AI: Harmlessness from AI Feedback - Anthropic

Related Articles

Inner Alignment

Deceptive Alignment

The problem this solution attempts to solve

Critical Problems

Scalable Oversight

How to supervise a more intelligent AI?