The interpretability researcher Neel Nanda has published a massive list of 200 open and concrete problems in mechanistic interpretability. They’re split into the following categories:
These are great projects to go for, and we’re collaborating with Neel Nanda to run a mechanistic interpretability hackathon on the 20th of January! As Lawrence Chan mentions in a new post, we need to touch reality as soon as possible, and these hackathons are a great way to get fast and concrete research results. You can join us, and you can also run a local hackathon site!
ML improving ML
Thomas Woodside summarizes a collaborative project to map cases where ML systems are self-improving. It already lists 11 major research projects in which machine learning systems are used to improve other systems. Since these are only the published papers, we assume that much more is happening behind the scenes.
Several of the projects use models to create data that another model is fine-tuned on, while a few relate to speed-ups in running and developing machine learning systems. These include using ML to better optimize GPUs, optimizing compilers, and helping humans spot flaws in a large language model (LLM) using another LLM.
A concrete example of data generation and fine-tuning is a paper from Microsoft and MIT showing that an LLM can be used to generate programming puzzles, and that a programming LLM improves substantially when fine-tuned on them.
With ML already reaching this level, we have to make sure that there are good introductions to ML safety so that academics and engineers can understand the prominent issues with AI development. Vael Gates and Collin Burns try to identify the best intro texts by asking 28 ML researchers which of eight texts they prefer. They find that the best resource is Jacob Steinhardt’s “More is Different for AI” blog posts.
In these posts, Jacob Steinhardt explores two ways of looking at ML safety: philosophy and engineering. He argues that the engineering approach preferred by ML academia is undervalued by the philosophical side and that the philosophical approach (represented by Superintelligence) is significantly undervalued by the engineering side.
An important point of these posts is how future AI systems will be qualitatively different from current AI systems and that this results in weird emergent behaviour.
Aligned AGI vs. unaligned AGI
In “The Case Against AI Alignment”, Andrew Sauer argues that the greatest risk of an unaligned artificial general intelligence is that humanity goes extinct, while an aligned system can lead to extreme suffering for a minority or for simulated beings. His argument rests on the outgroup hatred that he sees as inherent in human psychology.
This comes at a time when the field of alignment is growing rapidly in response to the systems that have been released in the past year. One of the most important tasks for the part of the field concerned with value alignment is to figure out whose values to align to, something that few have grappled with until now.
Responses to Sauer’s piece accept the importance of figuring out these questions but reject the conclusion that we should accept the death of all humans because there “might” be a highly risky outcome. Additionally, unlike extinction, human-inflicted suffering is not a stable state, which means it has much less relevance over the larger timescale than one might expect.
Deep learning research and other news
In other news…
We have a few interesting opportunities coming up. Thanks go to AGISF for once more sharing opportunities in ML & AI safety.
This has been the ML & AI safety update. See you next week!