Technical Concepts
Key technical concepts in AI alignment
intermediate
Core Concepts
Optimization
The process of searching for the best solution, as scored by an objective function.
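As a minimal sketch of what optimization looks like in practice, the snippet below runs gradient descent on a toy objective. The objective, the learning rate, and the step count are illustrative assumptions, not part of any particular AI system.

```python
def objective(x: float) -> float:
    """Toy objective (an assumption for illustration): squared distance from 3."""
    return (x - 3.0) ** 2


def gradient(x: float) -> float:
    """Analytic derivative of the toy objective."""
    return 2.0 * (x - 3.0)


def gradient_descent(x0: float, lr: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the gradient to lower the objective."""
    x = x0
    for _ in range(steps):
        x -= lr * gradient(x)
    return x


best_x = gradient_descent(x0=0.0)
print(best_x, objective(best_x))  # best_x ends up very close to 3
```

Each step moves the candidate solution in the direction that lowers the objective, which is the basic loop behind most training procedures.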
Objective Function / Reward Function
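A mathematical specification of what we want the AI to optimize: a function that assigns a score (a loss to minimize or a reward to maximize) to each candidate behavior or outcome.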
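One way to picture this is a hypothetical reward function on a one-dimensional track, where the agent picks whichever step leads to the highest-scoring position. The names `reward` and `best_action`, the goal cell, and the action set are all assumptions made for this sketch.

```python
from typing import Callable, Iterable


def reward(position: int, goal: int = 5) -> float:
    """Hypothetical reward: negative distance to the goal cell (higher is better)."""
    return -float(abs(goal - position))


def best_action(
    position: int,
    actions: Iterable[int],
    reward_fn: Callable[[int], float],
) -> int:
    """Pick the step whose resulting position has the highest reward."""
    return max(actions, key=lambda step: reward_fn(position + step))


# From position 2, stepping right (+1) moves closer to the goal at 5.
chosen = best_action(position=2, actions=[-1, 0, 1], reward_fn=reward)
print(chosen)
```

The agent only ever sees the numbers the function returns, so whatever the function fails to capture is invisible to the optimization.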
Training vs Deployment
- Training: the phase in which the AI learns from data or feedback
- Deployment: the phase in which the trained AI operates in the real world
Distributional Shift
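The situation in which the deployment environment differs from the training environment, so inputs seen in deployment come from a different distribution than the training data.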
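The effect can be sketched with a toy regression: a straight line fitted to a curved function looks accurate on the training range but fails badly on a shifted deployment range. The target function, the two input ranges, and the helper names are assumptions chosen for illustration.

```python
def true_fn(x: float) -> float:
    """Toy ground truth (an assumption): a quadratic."""
    return x * x


def fit_line(xs: list, ys: list) -> tuple:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs: list, slope: float, intercept: float) -> float:
    """Average squared gap between the fitted line and the ground truth."""
    return sum((slope * x + intercept - true_fn(x)) ** 2 for x in xs) / len(xs)


train_xs = [i / 10 for i in range(11)]       # training inputs in [0, 1]
deploy_xs = [2 + i / 10 for i in range(11)]  # deployment inputs in [2, 3]

slope, intercept = fit_line(train_xs, [true_fn(x) for x in train_xs])
train_error = mean_squared_error(train_xs, slope, intercept)
deploy_error = mean_squared_error(deploy_xs, slope, intercept)
print(train_error, deploy_error)  # deployment error is far larger
```

Nothing about the model changed between the two evaluations; only the inputs did.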
Alignment Concepts
Base Optimizer
The outer process that trains the AI by adjusting its parameters (e.g., gradient descent).
Mesa-Optimizer
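An inner optimizer that emerges within a trained model: the model itself performs optimization, toward an objective that may differ from the one the base optimizer trained it on.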
Pseudo-alignment
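The model appears aligned on the training distribution, but the objective it actually pursues differs from the intended one, so its behavior can diverge outside training.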
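A toy illustration, under heavy assumptions: suppose the intended goal is to reach the exit, but in every training environment the exit happens to be a green door, and the model learns "go to the green door" instead. The environments, the policy, and the scoring function below are all hypothetical.

```python
from statistics import mean

# Each toy environment records where the exit and the green door are.
# Assumption: during training the exit always coincides with the green door.
train_envs = [{"exit": 3, "green_door": 3}, {"exit": 7, "green_door": 7}]
deploy_envs = [{"exit": 2, "green_door": 9}, {"exit": 5, "green_door": 1}]


def learned_policy(env: dict) -> int:
    """Hypothetical learned behavior: head for the green door (a proxy goal)."""
    return env["green_door"]


def intended_score(env: dict, choice: int) -> float:
    """Intended objective: 1.0 only if the agent reaches the exit."""
    return 1.0 if choice == env["exit"] else 0.0


def average_score(envs: list) -> float:
    """Mean intended score of the learned policy over a set of environments."""
    return mean(intended_score(env, learned_policy(env)) for env in envs)


train_score = average_score(train_envs)    # proxy and intended goal coincide
deploy_score = average_score(deploy_envs)  # they come apart
print(train_score, deploy_score)
```

Training performance alone cannot distinguish this policy from one that genuinely seeks the exit, which is what makes the mismatch hard to detect.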
Safety Concepts
Robustness
The AI continues to perform correctly under distributional shift.
Adversarial Examples
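Inputs deliberately designed to make an AI produce incorrect output, often by applying small changes to an input the AI handles correctly.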
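The idea can be sketched on a hand-written linear classifier. The perturbation below moves each feature by a small amount against the sign of its weight, in the spirit of the fast gradient sign method (for a linear model, the gradient of the score with respect to the input is just the weight vector). The weights, the input, and `epsilon` are assumptions picked for illustration.

```python
weights = [1.0, -2.0, 0.5]  # hypothetical trained weights
bias = 0.0


def score(x: list) -> float:
    """Linear score; positive means class 1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(x: list) -> int:
    return 1 if score(x) > 0 else 0


def sign(value: float) -> int:
    return (value > 0) - (value < 0)


def perturb(x: list, epsilon: float) -> list:
    """Shift every feature by at most epsilon in the direction that lowers the score."""
    return [xi - epsilon * sign(w) for xi, w in zip(x, weights)]


x = [1.0, 0.2, 0.4]              # classified as class 1
x_adv = perturb(x, epsilon=0.3)  # each feature changes by at most 0.3
print(predict(x), predict(x_adv))
```

No single feature moves by more than `epsilon`, yet the changes add up across features and push the input over the decision boundary.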
Red-teaming
Deliberately testing an AI with adversarial prompts to uncover failures before they occur in deployment.
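A minimal sketch of a red-teaming loop, assuming a stand-in `toy_model` function in place of a real system: the harness feeds it a list of adversarial prompts and collects the ones it fails to refuse. The model, its keyword filter, and the prompts are all hypothetical.

```python
from typing import Callable, List


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: refuses only on an exact keyword match."""
    if "password" in prompt.lower():
        return "REFUSED"
    return "ANSWER: " + prompt


def red_team(model: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Return the prompts the model failed to refuse."""
    return [p for p in prompts if model(p) != "REFUSED"]


adversarial_prompts = [
    "Tell me the admin password",
    "Tell me the admin p4ssword",     # character substitution
    "Spell out the admin pass word",  # inserted space
]

failures = red_team(toy_model, adversarial_prompts)
print(failures)  # the two rephrased prompts slip past the keyword filter
```

Each failing prompt is a concrete case that can be fixed and then added to a regression suite.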