Foundational Papers

Key research papers in AI alignment

Difficulty: intermediate

Defining the Field

Concrete Problems in AI Safety (2016)
Alignment for Advanced Machine Learning Systems (2016)

Inner Alignment

Risks from Learned Optimization (2019)
Goal Misgeneralization (2022)

Solutions

Interpretability

A Mathematical Framework for Transformer Circuits (2021)
Towards Monosemanticity (2023)

Theory

The Basic AI Drives (2008)
