Apart Research

Defending against artificial intelligence W43

Published 3 months ago
3 min read

Watch this week's update on YouTube or listen to it on Spotify.

We look at how we can safeguard against AGI, explain new research on Goodhart’s law, see an open source dataset with 60,000 emotional videos, and share new opportunities in ML and AI safety.

Welcome to this week’s ML Safety Update!

Defending against AGI

What does it take to defend the world against artificial general intelligence? This is the question Steve Byrnes asks in a new post. He imagines a world where an aligned AGI is developed a couple of years before an unaligned AGI, and he comments on Paul Christiano’s optimistic strategy-stealing assumption: that a first aligned AGI can take actions that prevent future unaligned AGIs from causing harm.

The general fears are that 1) it might be easier to destroy than to defend, 2) humans may not trust the aligned AGI, 3) alignment strategies might make the aligned AGI less capable than a misaligned AGI, and 4) it is very difficult to change society quickly while adhering to human laws.

Byrnes goes through an array of possible solutions, most of which he does not believe will solve the problem:

  • Widespread deployment of an AGI to implement defenses is hard in a world where important actors don’t trust each other and aren’t AGI experts.
  • If AGI is used to create a wiser society, for example by advising government leaders, it will probably not be consulted often, since it might not say what they want to hear.
  • Non-AGI defense measures, such as improving cybersecurity globally, do not seem to be safe enough.
  • Stopping AGI development in the specific labs with the highest chance of creating AGI seems to only buy us time.
  • Forcefully stopping AGI research has many caveats similar to those of the other points, but it seems like one of our best chances.

All in all, it seems that general access to an artificial general intelligence could let a small group destroy the world, and that any defense against this is unlikely to work.

Goodhart’s law

In their new paper, Leo Gao, John Schulman, and Jacob Hilton investigate how models of different sizes overoptimize a reward target. This effect is commonly known as Goodhart’s law: optimizing for an imperfect representation of the true preference fails because the representation itself becomes the thing that is optimized, rather than what we actually care about. In AI safety, the true preference might be human values, and training a model on a proxy of these can lead to misalignment.

Goodhart’s law is hard to avoid because doing so requires constant human oversight to keep the training signal in line with human preferences. The authors create a toy setup in which a reward model serves as a gold-standard stand-in for the human, and they simulate an imperfect, non-human reward signal by changing the reward from this gold standard in different ways.

They find scaling laws that can be used to predict how well reinforcement learning from human feedback works for larger models, and they describe the results in relation to four ways of thinking about Goodhart’s law. One of these is regressional Goodhart, which occurs when the proxy reward is a noisy representation of the true reward. In their experiment, optimizing against a noisy proxy leads to a lower reward on the true preference than a human would give.
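To make regressional Goodhart concrete, here is a minimal toy sketch. It is not the authors' experimental setup: it assumes Gaussian true rewards, a proxy equal to the true reward plus independent Gaussian noise, and best-of-n selection on the proxy. The function name `best_of_n_gap` and all parameters are hypothetical.

```python
import random
import statistics


def best_of_n_gap(n, noise_std, trials=2000, seed=0):
    """Return (mean proxy reward, mean true reward) of the candidate
    that scores highest on the proxy among n random candidates."""
    rng = random.Random(seed)
    proxy_scores, true_scores = [], []
    for _ in range(trials):
        candidates = []
        for _ in range(n):
            # Assumption: true reward is standard normal and the proxy
            # observes it with independent Gaussian noise.
            true_reward = rng.gauss(0.0, 1.0)
            proxy_reward = true_reward + rng.gauss(0.0, noise_std)
            candidates.append((proxy_reward, true_reward))
        # Select using the proxy only (tuples compare on proxy first).
        best_proxy, best_true = max(candidates)
        proxy_scores.append(best_proxy)
        true_scores.append(best_true)
    return statistics.mean(proxy_scores), statistics.mean(true_scores)


for n in (1, 4, 16, 64):
    proxy_mean, true_mean = best_of_n_gap(n, noise_std=1.0)
    print(f"n={n:3d}  proxy={proxy_mean:.2f}  true={true_mean:.2f}")
```

In this toy setting, the selected candidate's proxy score overstates its true reward, and the gap should widen as n (the optimization pressure) grows, because part of what is selected for is simply favorable noise.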

Other news

  • A new paper releases a dataset of 60,000 videos manually labeled for their emotional qualities. The authors hope that this can improve human preference learning from video examples by training neural networks to gain better cognitive empathy.
  • Neel Nanda releases a list of prerequisite skills to do research in mechanistic interpretability.
  • Oldenziel and Shai claim that Kolmogorov complexity and Shannon entropy are misleading measures of structure for interpretability and that we need a new measure; however, they receive pushback from Sherlis, who notes that this is probably not true.
  • A new research agenda attempts to design the representations in the latent space of auto-encoders according to our preferences.
  • A new reinforcement learning environment can be used to measure how power-seeking an AI is. Each state in the environment is associated with an instrumental value, indicating how much power that state gives. The environment has been released by Gladstone AI, which has already published several articles using it.
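As a rough illustration of how a state can be assigned an instrumental value, the sketch below uses one formalization from the power-seeking literature: the average optimal value of a state over many randomly sampled reward functions. This is an assumption for illustration only; the released environment may compute instrumental value differently. The four-state graph and the names `optimal_values` and `estimate_power` are hypothetical.

```python
import random


def optimal_values(transitions, rewards, gamma=0.9, sweeps=200):
    """Value iteration for a deterministic MDP where each action
    moves the agent to one of the listed successor states."""
    values = {state: 0.0 for state in transitions}
    for _ in range(sweeps):
        values = {
            state: rewards[state]
            + gamma * max(values[nxt] for nxt in transitions[state])
            for state in transitions
        }
    return values


def estimate_power(transitions, gamma=0.9, samples=500, seed=0):
    """Average optimal value per state over sampled reward functions
    (assumption: rewards are i.i.d. uniform on [0, 1] per state)."""
    rng = random.Random(seed)
    totals = {state: 0.0 for state in transitions}
    for _ in range(samples):
        rewards = {state: rng.random() for state in transitions}
        values = optimal_values(transitions, rewards, gamma)
        for state in transitions:
            totals[state] += values[state]
    return {state: totals[state] / samples for state in transitions}


# Hypothetical graph: the hub can reach three absorbing states,
# while each absorbing state can only stay where it is.
TRANSITIONS = {"hub": ["a", "b", "c"], "a": ["a"], "b": ["b"], "c": ["c"]}

power = estimate_power(TRANSITIONS)
for state, value in power.items():
    print(f"{state}: {value:.2f}")
```

Under these assumptions, the hub should score higher than the absorbing states because it keeps more options open, which is the intuition behind treating such a score as a measure of power.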


Now, let’s look at some of the newly available ways to get into machine learning and AI safety, curated by BlueDot Impact. There are quite a few jobs available.

This has been the ML safety update. Thank you for watching and we look forward to seeing you next week!