Steering Might Stop Working Soon
Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning for it failing now.This is particularly important for things like steering as a mitigation against eval-awareness. Steering HumansI have a strong intuition that we will not be able to steer a superintelligence very effectively, partially for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a h
By J Bostock