Steering RL Training: Benchmarking Interventions Against Reward Hacking

This post is hosted on LessWrong. You will be redirected automatically, or you can click here to read it.