Robustness is a crucial aspect of ensuring the safety of machine learning systems. A robust model is better able to adapt to new datasets and is less likely to be confused by unusual inputs. By ensuring robustness, we can prevent sudden misalignments caused by malfunction.
To test the robustness of models, we use adversarial attacks. These are inputs specially crafted to confuse the model, and studying them helps us build defenses. There are many libraries for generating adversarial examples in computer vision, but the new attack method TextGrad automatically creates adversarial examples for text as well. It works under two constraints: 1) text is discrete, so it is harder than with images to modify an input without the change being obvious, and 2) the result must still be fluent, i.e. the attack should be hard for a human to spot. You can find many more text attacks in the aptly named TextAttack library.
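To make the idea of a text attack concrete, here is a minimal sketch of a greedy word-substitution attack. It is a toy illustration only: it is not TextGrad and does not use the TextAttack API. The keyword-based classifier, the synonym table, and all function names are hypothetical stand-ins, and the sketch only tries to flip a "positive" prediction.

```python
# Toy illustration only: NOT TextGrad and NOT the TextAttack API.
# A hypothetical keyword-based sentiment scorer stands in for a real model.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "awful", "boring"}

# Hypothetical synonym table; a real attack would also check fluency.
SYNONYMS = {
    "great": ["fine", "decent"],
    "good": ["okay", "passable"],
    "excellent": ["solid"],
}


def score(words):
    """Number of positive keyword hits minus number of negative ones."""
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def predict(words):
    """Label the text 'positive' only if the keyword score is above zero."""
    return "positive" if score(words) > 0 else "negative"


def greedy_attack(sentence, max_swaps=2):
    """Swap one word per round, keeping the swap that lowers the score most.

    Stops as soon as the predicted label changes or the swap budget is spent.
    """
    words = sentence.lower().split()
    original = predict(words)
    swaps = 0
    while predict(words) == original and swaps < max_swaps:
        best = None
        for i, w in enumerate(words):
            for candidate in SYNONYMS.get(w, []):
                trial = words[:i] + [candidate] + words[i + 1:]
                if best is None or score(trial) < score(best):
                    best = trial
        if best is None:  # no substitutable word left
            break
        words = best
        swaps += 1
    return " ".join(words), predict(words)
```

Calling greedy_attack("a great movie") returns the perturbed sentence together with the label the toy classifier now assigns. The small swap budget mirrors the second constraint above: the fewer words change, the harder the attack is for a human to notice.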
In the paper “(Certified!!) Adversarial Robustness for Free!” (yes, that is its name), the authors present a new method for making image models more robust against different attacks. The defense does not train any model of its own but relies entirely on off-the-shelf models, something other papers have not achieved. It also reaches the highest average certified defense rate among competing methods.
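For readers who want a feel for what a certified defense built from off-the-shelf parts can look like, here is a rough sketch. It assumes the randomized-smoothing setup that this line of work builds on: add Gaussian noise to the input, denoise it, classify it, and take a majority vote. The 1-D classifier and the identity denoiser are hypothetical stand-ins, and the Hoeffding bound is a looser simplification of the confidence bounds used in the literature, so this is not the paper's implementation.

```python
import math
import random
from statistics import NormalDist


def classify(x):
    """Hypothetical stand-in for an off-the-shelf classifier on a 1-D input."""
    return 1 if x > 0.0 else 0


def denoise(x_noisy):
    """Hypothetical stand-in for an off-the-shelf denoiser (identity here)."""
    return x_noisy


def smoothed_predict(x, sigma=0.5, n=2000, alpha=0.001, seed=0):
    """Majority vote over noised copies of x, plus a certified radius.

    Returns (label, radius). The label is None when the vote is too close
    to certify. Simplifications: the same samples are used to pick and to
    certify the top class, and a Hoeffding bound replaces tighter bounds.
    """
    rng = random.Random(seed)
    votes = [0, 0]
    for _ in range(n):
        votes[classify(denoise(x + rng.gauss(0.0, sigma)))] += 1
    top = 0 if votes[0] >= votes[1] else 1
    # Lower confidence bound on the probability of the top class.
    p_lower = votes[top] / n - math.sqrt(math.log(1.0 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return None, 0.0
    p_lower = min(p_lower, 1.0 - 1e-9)  # keep inv_cdf inside (0, 1)
    return top, sigma * NormalDist().inv_cdf(p_lower)
```

For an input well inside one class, such as smoothed_predict(1.0), the function returns that class and a positive radius within which the smoothed prediction should not change. Note that nothing here is trained: both components are used as they are.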
Li, Li & Xie investigate how to defend against a simple attack: writing a weird sentence in front of the prompt, which can significantly confuse models in question-answering (QA) settings. They then extend this to the image-text domain, modifying an image prompt to confuse the model during QA.
Beyond these specific cases, is there a way to test more generally for examples that might confuse our models? The new OpenOOD (Open Out-Of-Distribution) library implements 33 different methods and represents a strong toolkit for detecting malicious or confusing examples. Their paper details more of their approach.
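As a flavor of what out-of-distribution detection does, here is a minimal sketch of one of the simplest baselines in this literature: flag an input when the model's maximum softmax probability is low. It does not use the OpenOOD API. The function names and the threshold of 0.7 are assumptions for illustration, and in practice the threshold would be tuned on held-out data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_score(logits):
    """Maximum softmax probability: high means the model is confident."""
    return max(softmax(logits))


def is_ood(logits, threshold=0.7):
    """Flag the input as out-of-distribution when confidence is below threshold.

    The threshold is a hypothetical value chosen for this example.
    """
    return msp_score(logits) < threshold
```

A confident prediction such as logits [5.0, 0.0, 0.0] passes, while nearly flat logits such as [0.1, 0.0, -0.1] get flagged. More sophisticated methods replace the score function but keep the same score-and-threshold shape.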
Another way we hope to detect these anomalies is by using interpretability methods to understand what happens inside the network and see when it breaks. Bilodeau et al. criticize traditional interpretability methods such as SHAP and Integrated Gradients by showing that, without significantly reducing model complexity, these methods do not outperform random guessing. Much of ML safety research instead works with mechanistic interpretability, which attempts to reverse-engineer neural networks and seems significantly more promising for anomaly detection.
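For context on what a method like Integrated Gradients computes, here is a small sketch: it averages the model's gradient along a straight path from a baseline input to the actual input and scales the result by the input difference. The toy model, the finite-difference gradient, and the function names are assumptions for illustration. A real setup would use automatic differentiation on an actual network.

```python
def model(x):
    """Hypothetical differentiable toy model with three input features."""
    return x[0] * x[1] + 3.0 * x[2]


def grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x."""
    g = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g


def integrated_gradients(f, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        t = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        for i, gi in enumerate(grad(f, point)):
            total[i] += gi
    # Scale the averaged gradient by the distance from the baseline.
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, total)]
```

A useful sanity check is that the attributions sum to the difference between the model output at the input and at the baseline. Passing that check says nothing about whether the attributions are useful for spotting anomalies, which is the point of the criticism above.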
Humans & AI
In December, Dan Hendrycks, director of the Center for AI Safety, published an article discussing the potential for artificial intelligence (AI) systems to have natural incentives that work against the interests of humans. He argues that to prevent this from happening, we must carefully design AI agents' intrinsic motivations, impose constraints on their actions, and establish institutions that promote cooperation over competition. These efforts will be crucial in ensuring that AI is a positive development for society.
The Center for AI Safety is just one example of a group doing research on machine learning safety. It also regularly publishes a newsletter on ML safety, which is highly recommended for readers interested in the topic. Another notable researcher in this field is David Krueger at the University of Cambridge, who recently gave a comprehensive interview on The Inside View. It is well worth a listen for those interested in AI alignment and the role of academia in addressing the challenges of AI safety.
And now to the great opportunities in ML safety!
Thank you very much for following along for this week’s ML Safety Report and we will see you next week.