Suggested Readings
We generally encourage exploring the Alignment Forum.
Overview / Threat Models
The Alignment Problem from a Deep Learning Perspective (ICLR 2024)
Risks from Learned Optimization in Advanced Machine Learning Systems
Oversight, Auditing, Control, and Model Organisms
Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision
Debating with More Persuasive LLMs Leads to More Truthful Answers (ICML 2024, Best Paper Award)
AI Control: Improving Safety Despite Intentional Subversion (ICML 2024)
Mechanistic Interpretability / Science of Deep Learning
Singular Learning Theory / Deep Learning is Singular, and That’s Good (IEEE TNNLS 2023)
Transformers Represent Belief State Geometry in their Residual Stream (NeurIPS 2024)
Theoretical / Conceptual