Simulating Simulators
Author’s I promised myself that when labs moved on to focusing on interpretability vector activations in place of reasoning traces for what invariably gets Goodharted, that it’d be a necessary disclosure as the risks in what might get trampled over outweighed the risks in what might end up targeted.And well… here we are.P.S. TL;DRs added where possible.Board Ga
By kromem