Suggested Readings
We generally encourage exploring the Alignment Forum.
Overview / Threat Models
The Alignment Problem from a Deep Learning Perspective (ICLR 2024)
Risks from Learned Optimization in Advanced Machine Learning Systems
Oversight, Auditing, Control, and Model Organisms
Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision
Debating with More Persuasive LLMs Leads to More Truthful Answers (ICML 2024, Best Paper Award)
AI Control: Improving Safety Despite Intentional Subversion (ICML 2024)
Mechanistic Interpretability / Science of Deep Learning
Singular Learning Theory / Deep Learning is Singular, and That’s Good (IEEE TNNLS 2023)
Transformers Represent Belief State Geometry in their Residual Stream (NeurIPS 2024)
Theoretical / Conceptual