There’s strong evidence that humans rely on coordinate frames, or reference lines and curves, to suss out the position of points in space. That’s unlike widely used computer vision algorithms, which differentiate among objects by numerical representations of their characteristics.

In pursuit of a more humanlike approach, researchers at Google, Alphabet subsidiary DeepMind, and the University of Oxford propose what they call the Stacked Capsule Autoencoder (SCAE), which reasons about objects using the geometric relationships between their parts. Since these relationships don’t depend on the position from which the model views the objects, the model classifies objects with high accuracy even when the point of view changes.

Capsule systems

In 2017, Geoffrey Hinton, a foremost theorist of AI and a recipient of the Turing Award, proposed with students Sara Sabour and Nicholas Frosst a machine learning architecture called CapsNet, a discriminatively trained, multilayer approach that achieved state-of-the-art image classification performance on a popular benchmark. In something of a follow-up to that initial work, Hinton, Sabour, and researchers from the Oxford Robotics Institute earlier this year detailed the SCAE, which improves upon the original architecture in key ways.

The SCAE and other capsule systems make sense of objects by interpreting organized sets of their parts geometrically. Sets of mathematical functions (capsules) responsible for analyzing various object properties (like position, size, and hue) are tacked onto a convolutional neural network, a type of AI model often used to analyze visual imagery, and several of the capsules’ predictions are reused to form representations of parts. Since these representations remain intact throughout SCAE’s analyses, capsule systems can leverage them to identify objects even when the positions of parts are swapped or transformed.
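To see why geometric relationships between an object and its parts survive a change of viewpoint, consider a minimal sketch in Python. It is not code from the study: the helper names (`make_pose`, `compose`, `invert_rigid`, `relative_pose`) and all numeric values are hypothetical, and poses are simplified to 2D rigid transforms (a rotation plus a translation).

```python
import math

def make_pose(angle, tx, ty):
    """Build a 3x3 rigid transform: rotate by `angle`, then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def compose(a, b):
    """Matrix product a @ b, i.e. apply b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def invert_rigid(p):
    """Invert a rigid transform: transpose the rotation, negate the rotated translation."""
    c, s, tx, ty = p[0][0], p[1][0], p[0][2], p[1][2]
    return [[c, s, -(c * tx + s * ty)],
            [-s, c, -(-s * tx + c * ty)],
            [0.0, 0.0, 1.0]]

def relative_pose(object_pose, part_pose):
    """Pose of the part expressed in the object's own coordinate frame."""
    return compose(invert_rigid(object_pose), part_pose)

def close(a, b, tol=1e-9):
    """True if two 3x3 matrices agree entry by entry within `tol`."""
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(3) for j in range(3))

# Hypothetical values: where the object sits, and where a part sits inside it.
object_pose = make_pose(0.3, 2.0, 1.0)
part_in_object = make_pose(-0.5, 0.4, -0.2)
part_pose = compose(object_pose, part_in_object)

# A viewpoint change moves the object and its part together.
viewpoint = make_pose(1.1, -3.0, 5.0)
moved_object = compose(viewpoint, object_pose)
moved_part = compose(viewpoint, part_pose)

before = relative_pose(object_pose, part_pose)
after = relative_pose(moved_object, moved_part)
print(close(before, after))
```

The object and part poses both change when the viewpoint does, but the part's pose relative to the object does not, which is the property capsule systems lean on.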

Another unique thing about capsule systems? They route with attention. As with all deep neural networks, capsules’ functions are arranged in interconnected layers that transmit “signals” from input data and slowly adjust the synaptic strength — aka weights — of each connection. (That’s how they extract features and learn to make predictions.) But where capsules are concerned, the weightings are calculated dynamically according to previous-layer functions’ ability to predict the next layer’s outputs.
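The idea of weighting connections by how well one layer predicts the next can be sketched with a routing-by-agreement loop in the style of the 2017 CapsNet work. This is an illustrative, pure-Python toy and not the SCAE's own routing mechanism; the function names, the input layout of `predictions`, and the example vectors are assumptions made for the sketch.

```python
import math

def squash(v):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        return [0.0 for _ in v]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in v]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(predictions, iterations=3):
    """predictions[i][j] is lower capsule i's predicted vector for upper capsule j.

    Returns the coupling weights used in the final iteration and the
    upper-capsule outputs computed from them.
    """
    n_lower, n_upper = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * n_upper for _ in range(n_lower)]
    for _ in range(iterations):
        # Each lower capsule spreads its vote across the upper capsules.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(n_upper):
            total = [sum(coupling[i][j] * predictions[i][j][d] for i in range(n_lower))
                     for d in range(dim)]
            outputs.append(squash(total))
        # Raise the logit wherever a prediction agrees with the resulting output.
        for i in range(n_lower):
            for j in range(n_upper):
                logits[i][j] += sum(p * o for p, o in zip(predictions[i][j], outputs[j]))
    return coupling, outputs

# Hypothetical votes: both lower capsules agree about upper capsule 0
# and contradict each other about upper capsule 1.
predictions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, -1.0]],
]
coupling, outputs = route(predictions)
```

In this toy setup, the weights in `coupling` shift toward upper capsule 0, where the two lower capsules' predictions agree.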

SCAE

The SCAE comprises several stages. In the first, the pixels of images to be analyzed are abstracted away by the Constellation Capsule Autoencoder (CCAE). The second stage — the Part Capsule Autoencoder (PCAE) — segments an image into constituent parts and infers their poses before reconstructing the image. Lastly, the Object Capsule Autoencoder (OCAE) attempts to organize discovered parts and their poses into a smaller set of objects, which it then tries to reconstruct.
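As a rough intuition for the last stage, the toy below treats "organizing parts into objects" as asking which object template best explains a set of observed part positions. It is a heavily simplified sketch and not the study's model: it handles translation only, it assumes each observed part is already matched to a template slot, and `TEMPLATES`, `explain_parts`, and `best_object` are hypothetical names.

```python
# Hypothetical object templates: part offsets relative to the object's center.
TEMPLATES = {
    "square": [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)],
    "diamond": [(0.0, -1.5), (1.5, 0.0), (0.0, 1.5), (-1.5, 0.0)],
}

def explain_parts(observed, template):
    """Fit an object position to the observed parts; return (position, mean squared error)."""
    n = len(template)
    # Each part votes for the object position: its own position minus its template offset.
    votes = [(ox - tx, oy - ty) for (ox, oy), (tx, ty) in zip(observed, template)]
    cx = sum(v[0] for v in votes) / n
    cy = sum(v[1] for v in votes) / n
    # Reconstruct the parts from the fitted object and measure how far off they land.
    error = sum((cx + tx - ox) ** 2 + (cy + ty - oy) ** 2
                for (ox, oy), (tx, ty) in zip(observed, template)) / n
    return (cx, cy), error

def best_object(observed):
    """Pick the template whose reconstruction of the observed parts has the lowest error."""
    scores = {name: explain_parts(observed, t)[1] for name, t in TEMPLATES.items()}
    return min(scores, key=scores.get), scores

# Four observed parts: the square template shifted to the point (5, 3).
observed = [(4.0, 2.0), (6.0, 2.0), (6.0, 4.0), (4.0, 4.0)]
name, scores = best_object(observed)
position, _ = explain_parts(observed, TEMPLATES[name])
```

Here the square template reconstructs the observed parts exactly, so it wins over the diamond. Judging by the description above, the real system works with learned relationships and full poses and not with fixed offsets like these.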

It’s heady stuff, but the coauthors of the study say that the SCAE’s design enables it to register industry-leading results for unsupervised image classification on two open source data sets: SVHN (which contains images of small cropped digits) and MNIST (handwritten digits). Once the SCAE was fed images from each and the resulting clusters were assigned labels, it achieved 55% accuracy on SVHN and 98.7% accuracy on MNIST, which were further improved to 67% and 99%, respectively.
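For readers wondering how an unsupervised model gets an accuracy score at all, here is one simple way to assign labels to clusters and grade the result. It uses a majority vote per cluster, which is an assumption made for illustration and not necessarily the protocol the study used; `cluster_accuracy` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_accuracy(cluster_ids, true_labels):
    """Label each cluster with its most common true label, then score the result."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    # Majority vote: every cluster takes the label most of its members carry.
    mapping = {cluster: Counter(labels).most_common(1)[0][0]
               for cluster, labels in members.items()}
    correct = sum(mapping[c] == y for c, y in zip(cluster_ids, true_labels))
    return correct / len(true_labels), mapping

# Hypothetical data: eight images, three clusters, true digit labels.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
true_labels = [7, 7, 1, 1, 1, 3, 3, 7]
accuracy, mapping = cluster_accuracy(cluster_ids, true_labels)
print(accuracy, mapping)
```

The true labels are used only after clustering, to name the clusters and compute the score, which is why the classification itself still counts as unsupervised.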