New ARENA material: 8 exercise sets on alignment science & interpretability
TLDR

This is a post announcing a lot of new ARENA material I've been working on for a while, which is now available for study here (currently on the alignment-science branch, but planned to be merged into main this Sunday).

There's a set of exercises (each one contains about 1-2 days of material) on the following topics:

- Linear Probes (replication of the "Geometry of Truth" paper, plus Apollo's "Probing for Deception" work)
- Activation Oracles (based around this demo notebook, with additional exercis