Was ChatGPT a good idea?
In this week’s ML & AI Safety Update, we hear Paul Christiano’s take on one of OpenAI’s main alignment strategies, dive into the second-round winners of the inverse scaling prize, and share the many fascinating projects from our mechanistic interpretability hackathon. And stay tuned until the end for some unique opportunities in AI safety!
Reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) is one of the most widely applied techniques to come out of alignment research. It dates back to 2015, when Paul Christiano introduced the concept in a blog post.
The idea is that we train models not just to imitate humans, but also to act in ways that humans would evaluate as preferable. This basic idea has resulted in years of research at OpenAI and is now one of the main principles behind ChatGPT.
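To make that idea concrete, here is a minimal sketch of the reward-modelling step at the heart of RLHF: humans compare pairs of model responses, and we fit a reward model to those comparisons with the Bradley-Terry preference loss. Everything in this sketch (the linear reward model, the simulated raters, all names) is illustrative, not OpenAI's actual implementation.

```python
import numpy as np

# Minimal sketch of RLHF's reward-modelling step (illustrative; real systems
# use large neural networks, not a linear model).
rng = np.random.default_rng(0)

# Hypothetical "true" preference direction the simulated human raters follow.
true_w = np.array([2.0, -1.0, 0.5])

# Simulated comparison data: each response is a feature vector; raters pick
# the one with the higher true score (label 1.0 means the first one wins).
pairs = []
for _ in range(500):
    a, b = rng.normal(size=3), rng.normal(size=3)
    pairs.append((a, b, 1.0 if true_w @ a > true_w @ b else 0.0))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Fit a linear reward model r(x) = w @ x with the Bradley-Terry loss:
# P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = np.zeros(3)
    for a, b, label in pairs:
        p = sigmoid(w @ (a - b))
        grad += (p - label) * (a - b)
    w -= lr * grad / len(pairs)

# The learned reward model should rank fresh response pairs like the raters.
test_pairs = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(200)]
agreement = np.mean([(w @ a > w @ b) == (true_w @ a > true_w @ b)
                     for a, b in test_pairs])
print(f"agreement with raters on held-out pairs: {agreement:.2f}")
```

In the full pipeline, this learned reward model would then serve as the training signal for a policy optimized with reinforcement learning.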
Two days ago, Christiano published a piece weighing RLHF’s contribution to speeding up AGI against its contribution to aligning said AGI. He thinks the project has been net positive: had RLHF not been developed, alternatives that work about as well in practice (e.g. imitation learning) would have been used to advance AI capabilities anyway.
Additionally, Christiano counters several arguments from the AI safety community, noting that RLHF is:
Inverse scaling prize
The inverse scaling prize has found its second-round winners in a challenge to find tasks where larger language models such as GPT-3 do worse than smaller ones like GPT-2. Such tasks are generally hard to find, and identifying them is important for figuring out which abilities will degrade more generally as models scale.
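Operationally, inverse scaling just means performance falls as model size grows. A toy check along those lines (the function name and the scores are hypothetical, not prize data):

```python
# Hedged sketch: flag a task as showing inverse scaling when its score
# strictly decreases from the smallest model to the largest.
def shows_inverse_scaling(accuracies):
    """accuracies: task scores ordered from smallest to largest model."""
    return all(later < earlier
               for earlier, later in zip(accuracies, accuracies[1:]))

# Hypothetical scores for four model sizes (e.g. GPT-2 small ... GPT-3 scale).
print(shows_inverse_scaling([0.71, 0.64, 0.58, 0.49]))  # downward trend -> True
print(shows_inverse_scaling([0.52, 0.60, 0.71, 0.83]))  # normal scaling -> False
```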
The seven winners of the second round all used quite novel methods to get there:
I recommend checking out the other four winners in their report on the round 2 projects.
Alignment Jam 4
The Fourth Alignment Jam ended this Sunday, with 15 amazing projects submitted! The topic was “mechanistic interpretability”, where we try to reverse-engineer how neural networks (NNs) process their input. Since NNs learn algorithms from their training data, we can actually try to locate the specific algorithm a network uses for a specific task.
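As a toy illustration of that mindset (my own construction, not one of the submitted projects), here is a tiny ReLU network whose weights spell out a readable algorithm: for non-negative a, it computes max(a, b) as a + relu(b - a). In real interpretability work the weights are learned, and the challenge is to recover such an algorithm from them.

```python
import numpy as np

# Hand-built two-layer ReLU network implementing max(a, b) for a >= 0;
# the point is that the *weights* make the algorithm legible.
def relu(x):
    return np.maximum(x, 0.0)

W1 = np.array([[1.0, 0.0],    # hidden unit 0 copies a
               [-1.0, 1.0]])  # hidden unit 1 computes b - a
W2 = np.array([1.0, 1.0])     # output sums the two hidden units

def net(a, b):
    h = relu(W1 @ np.array([a, b]))
    return W2 @ h

# Reading the weights: unit 1 fires only when b > a and adds the excess
# b - a on top of a, so (for non-negative a) the output is max(a, b).
print(net(3.0, 5.0))   # 5.0
print(net(4.0, -2.0))  # 4.0
```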
You can watch the ending ceremony, with presentations by three of the four winners (starts here), but here is a short summary of the winning projects:
Deciding on the winners together with Neel Nanda was tough, and you can see many more projects in the results section of the hackathon page. We recommend you check them out! There are methods from biology, compiled Transformers, interactive apps, and latent-knowledge identification methods.
Thank you for following along with this week’s ML & AI Safety Update, and we’ll see you next week!