Foundational Papers
Key research papers in AI alignment
intermediate
Defining the Field
Inner Alignment
- Risks from Learned Optimization (2019)
- Goal Misgeneralization (2022)
Solutions
- InstructGPT (2022)
- Constitutional AI (2022)
- AI Safety via Debate (2018)
Interpretability
- Transformer Circuits (2021)
- Towards Monosemanticity (2023)
Theory
- The Basic AI Drives (2008)