UMARS

Suggested Readings

We generally encourage exploring the Alignment Forum.

Overview / Threat Models

  • The Alignment Problem from a Deep Learning Perspective (ICLR 2024)

  • Risks from Learned Optimization in Advanced Machine Learning Systems

  • Giving AIs Safe Motivations

Oversight, Auditing, Control, and Model Organisms

  • Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision

  • Debating with More Persuasive LLMs Leads to More Truthful Answers (ICML 2024, Best Paper Award)

  • AI Control: Improving Safety Despite Intentional Subversion (ICML 2024)

  • Auditing Language Models for Hidden Objectives

  • Alignment Faking in Large Language Models

Mechanistic Interpretability / Science of Deep Learning

  • A Mathematical Framework for Transformer Circuits

  • Toy Models of Superposition

  • Singular Learning Theory / Deep Learning is Singular, and That’s Good (IEEE TNNLS 2023)

  • Transformers Represent Belief State Geometry in their Residual Stream (NeurIPS 2024)

Theoretical / Conceptual

  • Natural Abstraction

  • Brain-like AGI and Shard Theory

  • Coalitional Agency