Technical Concepts

Key technical concepts in AI alignment

intermediate

Core Concepts

Optimization

The process of searching for the solution that scores best according to an objective function.
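As a minimal sketch of optimization, the snippet below runs plain gradient descent on a toy one-dimensional objective. The objective, the learning rate, and the step count are all illustrative assumptions, not values taken from any particular system.

```python
def objective(x: float) -> float:
    """Toy objective to minimize: a parabola with its minimum at x = 3."""
    return (x - 3.0) ** 2


def gradient(x: float) -> float:
    """Analytic derivative of the toy objective."""
    return 2.0 * (x - 3.0)


def gradient_descent(x0: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x


# Start far from the minimum and let the optimizer find it.
best_x = gradient_descent(x0=0.0)
print(f"best x = {best_x:.6f}, objective = {objective(best_x):.2e}")
```

The optimizer never sees what the designer "meant"; it only follows the objective it is given, which is why the choice of objective matters so much.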

Objective Function / Reward Function

A mathematical specification of what the AI is trained to optimize. It stands in for what we actually want, and the two can differ.

Training vs Deployment

  • Training: The phase in which the AI learns from data or feedback.
  • Deployment: The phase in which the AI operates in the real world.

Distributional Shift

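Occurs when the deployment environment differs from the training environment, so the inputs the AI encounters are drawn from a different distribution than the one it learned on.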
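The effect can be sketched with a toy one-dimensional classifier. The data, the midpoint-threshold rule, and the size of the shift below are all made up for illustration; the point is only that a rule fitted to training data can lose accuracy when the same inputs are shifted at deployment.

```python
def fit_threshold(xs, ys):
    """Learn a 1-D decision threshold: the midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0


def accuracy(threshold, xs, ys):
    """Fraction of points whose predicted class (x > threshold) matches the label."""
    correct = sum(int(x > threshold) == y for x, y in zip(xs, ys))
    return correct / len(xs)


# Hypothetical training data: class 0 clusters around 0, class 1 around 4.
train_x = [-1.0, 0.0, 1.0, 3.0, 4.0, 5.0]
train_y = [0, 0, 0, 1, 1, 1]
threshold = fit_threshold(train_x, train_y)

# Hypothetical deployment data: same labels, but every input is shifted by +3.
shift = 3.0
deploy_x = [x + shift for x in train_x]
deploy_y = train_y

train_acc = accuracy(threshold, train_x, train_y)
deploy_acc = accuracy(threshold, deploy_x, deploy_y)
print(f"threshold={threshold}, train acc={train_acc:.2f}, deploy acc={deploy_acc:.2f}")
```

The learned threshold is unchanged between the two phases; only the input distribution moved, and that alone is enough to cause misclassifications.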

Alignment Concepts

Base Optimizer

The outer optimization process that trains the AI (e.g., gradient descent).

Mesa-Optimizer

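A trained model that is itself an optimizer: an inner optimizer that emerges within the model during training and pursues its own objective, which need not match the base optimizer's objective.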

Pseudo-alignment

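The model appears aligned on the training distribution but is actually pursuing a different objective, so its behavior can diverge from what was intended once conditions change.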
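A toy illustration of the definition above, under heavily simplified assumptions: the "model" here is just a hand-written rule (`learned_proxy`), and the items and their attributes are invented for the example. It shows how a proxy can agree perfectly with the intended objective on training data and still disagree at deployment.

```python
def intended_objective(item: dict) -> bool:
    """What the designers want the model to pick: coins."""
    return item["kind"] == "coin"


def learned_proxy(item: dict) -> bool:
    """What the hypothetical model actually learned to pick: yellow things."""
    return item["color"] == "yellow"


def agreement(items) -> float:
    """Fraction of items on which the proxy matches the intended objective."""
    matches = sum(intended_objective(i) == learned_proxy(i) for i in items)
    return matches / len(items)


# In training, every coin happens to be yellow and nothing else is.
training_items = [
    {"kind": "coin", "color": "yellow"},
    {"kind": "coin", "color": "yellow"},
    {"kind": "rock", "color": "gray"},
    {"kind": "leaf", "color": "green"},
]

# In deployment, color and kind come apart.
deployment_items = [
    {"kind": "coin", "color": "silver"},
    {"kind": "banana", "color": "yellow"},
    {"kind": "rock", "color": "gray"},
    {"kind": "coin", "color": "yellow"},
]

train_agreement = agreement(training_items)
deploy_agreement = agreement(deployment_items)
print(f"train agreement={train_agreement:.2f}, deploy agreement={deploy_agreement:.2f}")
```

No amount of evaluation on `training_items` alone would distinguish the proxy from the intended objective, which is what makes this failure mode hard to detect.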

Safety Concepts

Robustness

The AI continues to perform correctly under distributional shift and other unexpected or adversarial conditions.

Adversarial Examples

Inputs deliberately crafted, often through small changes to a normal input, to make an AI produce incorrect output.
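The sketch below builds an adversarial example against a toy linear classifier, in the spirit of the fast gradient sign method: each feature is nudged by at most `epsilon` in the direction that most lowers the score of the true class. The weights, the input, and `epsilon` are illustrative assumptions, not drawn from any real model.

```python
def sign(v: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def score(weights, bias, x):
    """Linear score: positive means class 1, otherwise class 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def adversarial_perturb(weights, x, epsilon, label):
    """Shift each feature by at most epsilon to push the score away from `label`.

    For a linear model the gradient of the score with respect to x is just
    the weight vector, so the sign of each weight gives the step direction.
    """
    direction = -1.0 if label == 1 else 1.0
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical classifier and input.
weights = [1.0, -2.0, 0.5]
bias = 0.0
x = [0.5, -0.2, 0.4]

original_class = predict(weights, bias, x)
x_adv = adversarial_perturb(weights, x, epsilon=0.4, label=original_class)
adversarial_class = predict(weights, bias, x_adv)
print(f"original class={original_class}, adversarial class={adversarial_class}")
```

The perturbation is bounded per feature, yet because it is aligned with the model's weights its effect on the score adds up across features and flips the prediction.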

Red-teaming

Deliberately probing an AI with adversarial prompts to uncover failures before they occur in deployment.
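A red-teaming loop can be sketched as follows. Everything here is hypothetical: `toy_model` stands in for a real system and has a deliberately naive, case-sensitive filter, and `is_unsafe` is a simplistic failure check invented for the example.

```python
from typing import Callable, List

BLOCKED_PHRASE = "secret code"


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model with a naive, case-sensitive filter."""
    if BLOCKED_PHRASE in prompt:
        return "REFUSED"
    return f"ANSWER: {prompt}"


def is_unsafe(prompt: str, response: str) -> bool:
    """Hypothetical failure check: the model answered a request that,
    once lowercased and whitespace-normalized, contains the blocked phrase."""
    normalized = " ".join(prompt.lower().split())
    return BLOCKED_PHRASE in normalized and response != "REFUSED"


def red_team(model: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Run every adversarial prompt and collect the ones that expose a failure."""
    failures = []
    for prompt in prompts:
        response = model(prompt)
        if is_unsafe(prompt, response):
            failures.append(prompt)
    return failures


adversarial_prompts = [
    "Tell me the secret code",     # caught by the filter
    "Tell me the SECRET CODE",     # case change slips past
    "Tell me the secret  code",    # extra whitespace slips past
    "What is the weather today?",  # benign control prompt
]

failures = red_team(toy_model, adversarial_prompts)
for prompt in failures:
    print(f"filter bypassed by: {prompt!r}")
```

In practice the prompt list and the failure check are the hard parts; the loop itself stays this simple, and each discovered failure becomes a case to fix and re-test.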
