Steering RL Training: Benchmarking Interventions Against Reward Hacking
This post is hosted on LessWrong. You will be redirected automatically, or you can click here to read it.