Compiling code to neural networks?
Welcome to this week’s ML & AI Safety Report, where we dive into overfitting and look at a compiler for Transformer architectures! This week’s report is a bit short because the mechanistic interpretability hackathon starts today – sign up on ais.pub/mechint and join the Discord.
Superposition & Transformers
In a recent Anthropic paper, the authors find that overfitting corresponds to a model’s neurons storing individual data points instead of general features. This happens mostly early in training and when training data is scarce.
In their experiment, they use a very simple model (a so-called toy model), which is useful for studying isolated phenomena in detail. In some of the visualizations, they train it on two-dimensional data with T training examples. As seen below, the feature activations (blue) look very messy, while the activations for the individual data points (red) look very clean.
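To make the memorization regime concrete, here is a minimal pure-Python sketch of a toy autoencoder of the kind used in this line of work, reconstructing inputs as ReLU(WᵀWx + b). All numbers and the choice of weights are illustrative assumptions, not taken from the paper: we hand-set each row of W to one training point, so the model "stores" data points rather than features and reproduces the training set exactly.

```python
# Hypothetical sketch: a tiny autoencoder x_hat = ReLU(W^T W x), where the
# rows of W have been set to the training data points themselves -- the
# "memorizing" solution the paper describes. Weights and data are made up.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    return [max(0.0, x) for x in v]

# Two orthogonal, unit-norm, nonnegative training points in a 4-dim space.
x1 = [0.6, 0.8, 0.0, 0.0]
x2 = [0.0, 0.0, 0.6, 0.8]

# Memorizing weights: each hidden direction stores one data point.
W = [x1, x2]                        # hidden dim 2, feature dim 4
W_T = list(map(list, zip(*W)))      # transpose

def reconstruct(x):
    h = matvec(W, x)                # project onto the stored data points
    return relu(matvec(W_T, h))     # bias omitted (zero) for simplicity

print(reconstruct(x1))              # recovers x1 (up to float error)
print(reconstruct(x2))              # recovers x2
```

Because each hidden direction is a stored training point, the training examples are reconstructed perfectly while nothing about the underlying feature structure has been learned – exactly the failure mode that shows up as overfitting.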
Going deeper in the paper, they find that this generalizes to much higher dimensions (10,000 features). The transition from memorizing the data points of a small dataset to generalizing to the actual data features also seems to explain the famous double descent phenomenon, where a model’s performance first dips and then improves again as training continues.
And on the topic of toy models, DeepMind releases Tracr, a compiler that turns human-readable code written in the RASP language into a Transformer architecture. This can be useful for studying how algorithms are represented in Transformer weights and for studying learned algorithms in depth.
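To give a feel for the level of abstraction, here is a hypothetical pure-Python emulation of the two core RASP primitives, select (an attention pattern) and aggregate (averaging attended values), used to express the classic token-reversal program. This is not the real Tracr API – Tracr compiles such programs into actual Transformer weights – just a sketch of what a RASP program expresses:

```python
# Hypothetical emulation of RASP-style select/aggregate primitives.
# Not the Tracr library itself -- an illustration of the abstraction level.

def select(keys, queries, predicate):
    # Attention pattern: row q marks which key positions match query q.
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    # Each position collects the value(s) it attends to.
    out = []
    for row in selector:
        picked = [v for v, attended in zip(values, row) if attended]
        out.append(picked[0] if len(picked) == 1 else picked)
    return out

def reverse(tokens):
    n = len(tokens)
    indices = list(range(n))
    opp_idx = [n - 1 - i for i in indices]   # position to read from
    sel = select(indices, opp_idx, lambda k, q: k == q)
    return aggregate(sel, tokens)

print(reverse(["h", "e", "l", "l", "o"]))   # ['o', 'l', 'l', 'e', 'h']
```

A compiler like Tracr takes a program built from primitives like these and emits Transformer weights that implement it, giving researchers a ground-truth algorithm to compare learned models against.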
Other research news
In other news…
For this week’s opportunities, the awesome new website aisafety.training will help you find the best events to join across the world:
Thank you for joining this week’s MLAISU and we’ll see you next week!