Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs


Convolutional Neural Networks (CNNs) have been the de facto standard in computer vision applications for several years. Recently, however, new model architectures have been proposed that challenge this status quo. The Vision Transformer (ViT) relies solely on attention modules, while the MLP-Mixer architecture replaces self-attention with Multi-Layer Perceptrons (MLPs). Despite their great success, CNNs have been shown to be vulnerable to adversarial examples. This work investigates the adversarial vulnerability of the recently introduced ViT and MLP-Mixer architectures and compares their performance with CNNs. Our results on white-box and black-box attacks suggest that the ViT and MLP-Mixer architectures are more robust to adversarial examples than CNNs. Using a toy example, we also provide empirical evidence that the lower adversarial robustness of CNNs can be attributed to their shift-invariance. With a frequency study, we further analyze the distribution of frequencies learned by the different model architectures.
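To make the white-box attack setting concrete, the sketch below illustrates the Fast Gradient Sign Method (FGSM), a standard white-box attack, on a hypothetical toy logistic model rather than the CNN/ViT/Mixer models evaluated in the paper. The weights, input, and epsilon are illustrative assumptions, not values from the paper; the point is only that a single signed-gradient step on the input can flip a model's prediction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, eps):
    # Gradient of the binary cross-entropy loss w.r.t. the input of a
    # logistic model f(x) = sigmoid(w @ x): (sigmoid(w @ x) - y) * w
    grad = (sigmoid(w @ x) - y) * w
    # FGSM step: perturb each coordinate by eps in the gradient's sign direction
    return x + eps * np.sign(grad)

w = np.array([2.0, -1.0])  # toy model weights (illustrative)
x = np.array([0.2, 0.1])   # clean input with true label y = 1
y = 1.0

x_adv = fgsm(x, y, w, eps=0.25)

clean_logit = w @ x        # 0.3  -> predicted class 1 (correct)
adv_logit = w @ x_adv      # -0.45 -> predicted class 0 (fooled)
print(clean_logit, adv_logit)
```

A more robust model, in the sense studied in the paper, is one for which such bounded perturbations flip predictions less often across a test set.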

In Workshop on Adversarial Machine Learning in Real-World Computer Vision Systems and Online Challenges @ CVPR 2021 (AML-CV @ CVPR 2021) (Outstanding Paper Award)
Philipp Benz
Ph.D. Candidate @ Robotics and Computer Vision Lab, KAIST

My research interest is in Deep Learning with a focus on robustness and security.