Defenses against adversarial attacks, which in the context of AI refer to techniques that fool models through malicious input, are increasingly being broken by “defense-aware” attacks. In fact, most state-of-the-art methods claiming to detect adversarial attacks have been counteracted shortly after their publication. To break the cycle, researchers at the University of California, San Diego and Google Brain, including Turing Award winner Geoffrey Hinton, recently described in a preprint paper an approach that deflects attacks in the computer vision domain. Their framework either detects attacks accurately or, for undetected attacks, pressures the attackers to produce images that resemble the target class of images.
The proposed architecture comprises (1) a network that classifies input images from a data set and (2) a network that reconstructs the inputs conditioned on the parameters of the predicted capsule. Several years ago, Hinton and several students devised an architecture called CapsNet, a discriminatively trained, multilayer AI system. It and other capsule networks make sense of objects in images by interpreting sets of their parts geometrically. Sets of mathematical functions (capsules) responsible for analyzing various object properties (like position, size, and hue) are tacked onto a convolutional neural network, a type of AI model often used to analyze visuals. Several of the capsules’ predictions are reused to form representations of parts, and since these representations remain intact throughout analyses, capsule systems can leverage them to identify objects even when the positions of parts are swapped or transformed.
Another unique thing about capsule systems? They route with attention. As with all deep neural networks, capsules’ functions are arranged in interconnected layers that transmit “signals” from input data and slowly adjust the synaptic strength — weights — of each connection. (That’s how they extract features and learn to make predictions.) But where capsules are concerned, the weightings are calculated dynamically according to previous-layer functions’ ability to predict the next layer’s outputs.
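To make the idea of dynamically calculated weightings concrete, here is a minimal NumPy sketch of routing-by-agreement as described in the original CapsNet work. It is an illustration under stated assumptions, not the routing code used in the preprint: the function names (`squash`, `route`), the array shapes, and the three-iteration default are all choices made for this example.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each vector so its length lies in [0, 1) while keeping its direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(u_hat, num_iters=3):
    """Routing-by-agreement (illustrative sketch).

    u_hat: array of shape (num_in, num_out, dim) holding each lower-level
           capsule's prediction for each higher-level capsule.
    Returns the higher-level capsule outputs v (num_out, dim) and the
    coupling weights c (num_in, num_out) used to compute them.
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))             # routing logits start equal
    for _ in range(num_iters):
        c = softmax(b, axis=1)                  # each input spreads weight over outputs
        s = np.einsum("ij,ijd->jd", c, u_hat)   # weighted sum of predictions
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)  # reward predictions that agree with v
    return v, c

# Toy input (assumed for this example): four lower-level capsules agree on
# output 0 but send conflicting predictions to output 1.
u_hat = np.zeros((4, 2, 2))
u_hat[:, 0, :] = [1.0, 0.0]
u_hat[:, 1, :] = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
v, c = route(u_hat)
```

In this toy case the coupling weights drift toward output 0, the capsule whose incoming predictions agree, which is the behavior the paragraph above describes.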
The capsule network combines three reconstruction-based detection methods to catch standard adversarial attacks. The first, Global Threshold Detector, exploits the fact that when input images are adversarially perturbed, the classification given to the input may be incorrect and the reconstruction is often blurry, so the distance between the input and its reconstruction is larger than it would be for a clean image and can be flagged with a single threshold. The second, Local Best Detector, identifies “clean” images from their reconstruction error: when the input is a clean image, the reconstruction error from the winning capsule is smaller than that of the losing capsules. The last, Cycle-Consistency Detector, flags inputs as adversarial examples if they aren’t classified in the same class as the reconstruction of the winning capsule.
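The three checks can be sketched in a few lines of Python. This is a rough illustration rather than the researchers' implementation: the `classify` and `reconstruct` callables, the L2 error measure, the threshold value, and the two-class toy model (with a deliberately planted blind spot) are all assumptions made for the example.

```python
import numpy as np

def reconstruction_error(x, recon):
    """L2 distance between an input and a reconstruction (assumed metric)."""
    return float(np.linalg.norm(np.ravel(x) - np.ravel(recon)))

def detect(x, classify, reconstruct, num_classes, threshold):
    """Run three reconstruction-based checks; True means 'looks adversarial'.

    classify(x)       -> 1-D array of class scores (hypothetical interface)
    reconstruct(x, k) -> reconstruction of x from capsule k (hypothetical interface)
    """
    winner = int(np.argmax(classify(x)))
    errors = np.array([reconstruction_error(x, reconstruct(x, k))
                       for k in range(num_classes)])
    recon_class = int(np.argmax(classify(reconstruct(x, winner))))
    return {
        # Winning capsule's reconstruction is too far from the input.
        "global_threshold": bool(errors[winner] > threshold),
        # Some losing capsule reconstructs the input better than the winner.
        "local_best": bool(errors[winner] > errors.min()),
        # The winner's reconstruction is classified differently from the input.
        "cycle_consistency": recon_class != winner,
    }

# Toy stand-in model: each class reconstructs to a fixed prototype, and the
# classifier picks the nearest prototype except for a planted blind spot.
prototypes = np.array([[0.0, 0.0], [4.0, 0.0]])

def reconstruct(x, k):
    return prototypes[k]

def classify(x):
    scores = -np.linalg.norm(prototypes - x, axis=1)
    if x[1] > 3.0:          # planted weakness: forces class 1
        scores[1] += 10.0
    return scores

clean = detect(np.array([0.1, -0.1]), classify, reconstruct, 2, threshold=1.0)
attacked = detect(np.array([0.0, 3.5]), classify, reconstruct, 2, threshold=1.0)
```

With this toy model the clean input passes all three checks, while the input that exploits the blind spot is caught by the global threshold and local best checks but not by cycle consistency, which hints at why several detectors might be used together.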
The team reports that in experiments on the SVHN and CIFAR-10 data sets, they were able to detect standard adversarial attacks based on three different distance metrics with a low false positive rate. “A large percentage of the undetected attacks are deflected by our model to resemble the adversarial target class [and] stop being adversarial any more,” they wrote. “These attack images can no longer be called ‘adversarial’ because our network classifies them the same way as humans do.”