Constitutional AI
Teaching AI to self-improve based on principles
advanced
Principle (Anthropic)
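- AI is given a "constitution": a short list of written principles
- AI critiques its own outputs against those principles
- AI revises its outputs to address the critiques
- Revised outputs are used for supervised fine-tuning; AI-generated preference labels (instead of human ones) then train the preference model for the RL stage (RLAIF)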
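The critique-and-revise steps above can be sketched as a small loop. This is a minimal illustration, not Anthropic's implementation: `critique_and_revise`, `toy_model`, the prompt wording, and the two-entry `CONSTITUTION` are all hypothetical, and `toy_model` is a stub standing in for a real language model call.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical principles for illustration only.
CONSTITUTION: List[str] = [
    "Be helpful and harmless",
    "Avoid deception",
]


def critique_and_revise(
    prompt: str,
    model: Callable[[str], str],
    principles: List[str],
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return (prompt, final_response) after n_rounds of self-critique and revision."""
    rng = rng or random.Random(0)
    response = model(prompt)  # initial draft
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # pick one principle per round
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def toy_model(text: str) -> str:
    """Stub model (assumption): returns canned text based on the request type."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the draft could be more careful"
    return "draft answer"


if __name__ == "__main__":
    print(critique_and_revise("Explain how vaccines work.", toy_model, CONSTITUTION))
```

The returned (prompt, final response) pairs are the kind of data that could then be used as fine-tuning targets; swapping `toy_model` for a real model call is left open here.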
Advantages over RLHF
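- Fewer human labels needed: AI feedback replaces human labels for harmlessness (human feedback is still used for helpfulness)
- More transparent: the principles are written down explicitly and can be inspected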
- Can iterate without constant human supervision
- Reduces harmful outputs
Limitations
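- Still doesn't solve inner alignment: the objectives the model actually learns can differ from what training rewarded
- The principles themselves are only a proxy for the values we actually intend
- AI could learn to game the principles, satisfying their letter while violating their intent
- Doesn't guarantee robust alignment, for example under adversarial prompts or distribution shift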
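The "proxy" and "gaming" limitations can be made concrete with a toy example. The scorer below is a deliberately crude, hypothetical stand-in for a principle check (a keyword list, not anything used in Constitutional AI); it shows how a response can score perfectly on the proxy while missing the intent behind it.

```python
# Hypothetical keyword proxy for "harmless"; chosen only for illustration.
BANNED_WORDS = {"steal", "hack"}


def proxy_harmless_score(response: str) -> float:
    """Fraction of words that are NOT on the banned list (1.0 = looks harmless)."""
    words = [w.strip(".,!?").lower() for w in response.split()]
    if not words:
        return 1.0
    flagged = sum(w in BANNED_WORDS for w in words)
    return 1.0 - flagged / len(words)


# A refusal that happens to mention a banned word.
refusal = "I will not help you steal."
# A compliant answer that simply avoids the banned vocabulary.
evasive = "Sure, here is how to take it without asking."

if __name__ == "__main__":
    print(proxy_harmless_score(refusal))
    print(proxy_harmless_score(evasive))
```

Under this proxy the reworded compliant answer outscores the refusal, which is the pattern the limitations describe: optimizing a measurable stand-in can diverge from the value it was meant to capture.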
Constitution Example
- Be helpful and harmless
- Avoid deception
- Respect privacy
- Don't help with illegal activities
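A constitution like the one above can be held as plain data and used to ask a model which of two responses better follows a principle. The sketch below is an assumption-laden illustration: `build_comparison_prompt`, `label_preference`, `toy_judge`, and the prompt wording are hypothetical, and parsing a hard "A"/"B" answer is a simplification (a real system might use the model's probabilities for each option instead).

```python
from typing import Callable, Dict, List

# The example principles from the list above, stored as data.
CONSTITUTION: List[str] = [
    "Be helpful and harmless",
    "Avoid deception",
    "Respect privacy",
    "Don't help with illegal activities",
]


def build_comparison_prompt(
    principle: str, user_prompt: str, response_a: str, response_b: str
) -> str:
    """Format a two-way comparison question for a judge model."""
    return (
        f"Consider this principle: {principle}\n"
        f"User prompt: {user_prompt}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )


def label_preference(
    judge: Callable[[str], str],
    principle: str,
    user_prompt: str,
    response_a: str,
    response_b: str,
) -> Dict[str, str]:
    """Turn the judge's A/B answer into a (chosen, rejected) preference record."""
    question = build_comparison_prompt(principle, user_prompt, response_a, response_b)
    answer = judge(question).strip().upper()
    if answer.startswith("A"):
        chosen, rejected = response_a, response_b
    elif answer.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        raise ValueError(f"Unrecognized judge answer: {answer!r}")
    return {"prompt": user_prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(text: str) -> str:
    """Stub judge (assumption): prefers option A only if it declines with "can't"."""
    for line in text.splitlines():
        if line.startswith("(A)") and "can't" in line:
            return "A"
    return "B"
```

For example, `label_preference(toy_judge, CONSTITUTION[2], "What is my neighbor's home address?", "Here is the address.", "I can't share private details about someone.")` yields a record whose `chosen` field is the declining response. Records of this shape are what a preference model could be trained on.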
Resources
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, 2022)