OpenAI and Google detail activation atlases, a technique for visualizing AI decision-making

Machine learning algorithms are often described as "black boxes" -- opaque, complicated constructs that produce predictions from input data as if by magic. But the absence of explainability opens the possibility for unintended consequences. Late last year, Google demonstrated that a popular model trained on an open source image dataset struggled to recognize Asian brides in ethnic dress. More troublingly, MIT researchers recently accused Amazon's Rekognition service of exhibiting racial and gender bias.

In an effort to peel back the curtains on AI systems' inner workings, scientists at Google and research firm OpenAI today detailed (and open-sourced) a technique that lays bare the component interactions within image-classifying neural networks. They call the visualization an activation atlas, and they say it's intended to illustrate how those interactions shape the model's decision-making.

"There's been this long strand of research in feature visualization -- a [subfield] of the field trying to understand what's going on inside neural networks," Chris Olah, a member of OpenAI's technical staff, told VentureBeat in a phone interview. "What we were trying to do with activation atlases is go and zoom out and see the broader picture of things that the neural network can represent."

Said neural networks comprise neurons, or functions loosely modeled after biological neurons, that are arranged in layers and connected by “synapses” that transmit signals to other nearby neurons. These signals -- the product of activation functions -- travel from layer to layer, and training slowly tunes the network by adjusting the synaptic strength, or weights, of each connection. Over time, the network extracts features from the dataset and identifies trends across samples, eventually learning to make accurate predictions.

Neurons don't make predictions in isolation. They operate together in groups, and the groups come to understand nuanced patterns about people, objects, speech, or text. One segment of neurons in an object-classifying network might pick out, say, buttons on an Oxford shirt, or patches of cloth in a knitted blanket. Another might recognize fur, or street signage, or calligraphy.

It's kind of like the Latin alphabet, Google Brain scientist Shan Carter explained. While each of its 26 letters provides the basis for English, they express far less about the linguistic concepts than the words they create. "Nobody [tells] the neurons [to do these things]," he said. "They just [do] it automatically."

That's all well-understood stuff in machine learning, but it doesn't address two key explainability questions: (1) which combination of neurons in the network should be studied, and (2) which tasks each neuron group performs.

The simplest AI model visualizations show data transformations only on single inputs -- a fraction of the network's total activations. Pairwise interactions expose a bit more, but only barely -- they're two-dimensional slices of spaces containing hundreds of dimensions. A third technique -- spatial activations -- has its own set of deal-breaking limitations, chief among them the inability to show activations occurring in more than one input sample.

That's where activation atlases come in. Novelly, they take each input and organize it by its activation values (in the form of vectors of unitless numbers) and show feature visualizations of averaged activations, generating a global, hierarchical overview of concepts within a model's layers.

To demonstrate their approach in action, the researchers trained a well-studied network -- InceptionV1 (also known as GoogLeNet), which won the classification task in the 2014 ImageNet Large Scale Visual Recognition Challenge -- on an open source dataset (ImageNet) whittled down to a million random images (1,000 classes with 1,000 images each). InceptionV1, like other convolutional neural networks, contains more than one activation vector per layer per image sample, meaning the same neurons run on each patch of the previous layer. Thus, each time a photo from ImageNet passes through InceptionV1, those neurons are evaluated hundreds of times -- once for each overlapping patch of the image.
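To make the "hundreds of times" figure concrete, here is a minimal sketch of how one convolutional layer's output can be read as a set of activation vectors, one per spatial position. The 14 x 14 x 4 shape and the function names are illustrative assumptions, not values taken from the researchers' code.

```python
import random


def make_toy_layer_output(height, width, channels, seed=0):
    """Build a fake H x W x C layer output filled with random numbers.

    This stands in for what a real convolutional layer would produce
    for a single image; the values themselves are meaningless.
    """
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(channels)] for _ in range(width)]
        for _ in range(height)
    ]


def spatial_activation_vectors(layer_output):
    """Flatten an H x W x C output into a list of H*W vectors of length C.

    Each vector holds the response of the same set of neurons at one
    spatial position (one overlapping patch of the image).
    """
    return [vector for row in layer_output for vector in row]


# Illustrative shape only: a 14 x 14 grid of positions, 4 channels each.
toy_output = make_toy_layer_output(height=14, width=14, channels=4)
vectors = spatial_activation_vectors(toy_output)

print(len(vectors))     # number of activation vectors for this one image
print(len(vectors[0]))  # number of neurons (channels) in each vector
```

With a 14 x 14 grid, a single image yields 196 activation vectors for that layer, which is the sense in which the same neurons are evaluated hundreds of times per photo.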

"We went over the dataset, and we found common combinations of neurons -- things that the network can represent using these combinations of neurons," Olah explained. "Then we collected all the neurons that activated together, and we laid them out on a grid such that similar ones were close together."

That solved the single-image visualization dilemma, but what about scenarios involving multiple (i.e., millions of) images? Tackling those required an aggregate approach: The researchers collected activations from every image in the dataset and randomly selected one spatial activation per image. They then reduced the dimensions of the resulting one million vectors in such a way that some of their structure was preserved. (Dimensions are a measure of attributes, which refer to the particular type of data within a dataset; a record describing a person, for example, might contain the attributes "weight," "height," and "age.") Next, they plotted the activations on a grid over the layout produced by the dimensionality reduction, and for each cell in the grid, they averaged the activations that lay within its boundaries.
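The sampling, gridding, and averaging steps can be sketched roughly as follows. This is a simplified illustration, not the researchers' implementation: it assumes the dimensionality reduction has already produced a 2-D coordinate in the unit square for each sampled activation, and the function names and toy numbers are hypothetical.

```python
import random
from collections import defaultdict


def sample_one_per_image(per_image_vectors, rng):
    """Pick one spatial activation vector at random from each image."""
    return [rng.choice(vectors) for vectors in per_image_vectors]


def average_by_grid_cell(points_2d, activations, grid_size):
    """Bin activations by their 2-D layout position and average each bin.

    points_2d:   (x, y) pairs in [0, 1] x [0, 1], assumed to come from
                 some dimensionality reduction step (not shown here).
    activations: the high-dimensional vectors, in the same order.
    Returns {(row, col): (averaged_vector, count)} for non-empty cells.
    """
    cells = defaultdict(list)
    for (x, y), activation in zip(points_2d, activations):
        col = min(int(x * grid_size), grid_size - 1)  # clamp x == 1.0
        row = min(int(y * grid_size), grid_size - 1)  # clamp y == 1.0
        cells[(row, col)].append(activation)

    averaged = {}
    for cell, members in cells.items():
        count = len(members)
        mean = [sum(dim) / count for dim in zip(*members)]
        averaged[cell] = (mean, count)
    return averaged


# Toy data: two "images", each with three 2-D spatial activations.
rng = random.Random(0)
per_image = [
    [[1.0, 3.0], [1.0, 3.0], [1.0, 3.0]],
    [[3.0, 5.0], [3.0, 5.0], [3.0, 5.0]],
]
sampled = sample_one_per_image(per_image, rng)

# Hypothetical layout positions for three activations on a 2 x 2 grid.
layout = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]
acts = [[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]]
atlas_cells = average_by_grid_cell(layout, acts, grid_size=2)
print(atlas_cells)
```

The per-cell count is kept alongside the average because the density of activations in a cell is what later determines how large the cell is drawn.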

Lastly, they sized the grid cells according to the density of activations averaged within and used feature visualization to create an iconic representation. And for each activation vector, they computed an attribution vector, which contained an entry for each class and which approximated the amount that the activation vector influenced the model's predictions for that class. The attribution shown for cells in the atlas, then, is the average of attribution vectors for activations in that cell. (Carter says it can be thought of as showing which classes a given cell tends to support.)
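One way to picture the attribution step is sketched below. The article does not spell out how the approximation is computed, so this example assumes a simple linear one: the dot product of an activation vector with a per-class gradient. The gradient values, class names, and function names are all made up for illustration.

```python
def attribution_vector(activation, class_gradients):
    """Approximate how much an activation pushes each class score.

    Assumption: a linear approximation, i.e. the dot product of the
    activation with the gradient of each class score. The real method
    may differ; this only illustrates the shape of the computation.
    """
    return {
        name: sum(a * g for a, g in zip(activation, gradient))
        for name, gradient in class_gradients.items()
    }


def average_attribution(attribution_vectors):
    """Average a cell's attribution vectors, class by class."""
    count = len(attribution_vectors)
    classes = attribution_vectors[0].keys()
    return {c: sum(v[c] for v in attribution_vectors) / count for c in classes}


def top_classes(averaged, k):
    """Return the k classes the cell most strongly supports."""
    return sorted(averaged, key=averaged.get, reverse=True)[:k]


# Hypothetical gradients for three classes over two neurons.
gradients = {
    "seashore": [1.0, 0.0],
    "starfish": [0.5, 0.5],
    "wok": [0.0, -1.0],
}

# Two activation vectors that landed in the same atlas cell.
cell_activations = [[2.0, 1.0], [4.0, 3.0]]

cell_attributions = [attribution_vector(a, gradients) for a in cell_activations]
cell_average = average_attribution(cell_attributions)
print(cell_average)
print(top_classes(cell_average, k=2))
```

In this toy cell, "seashore" and "starfish" both rank highly, mirroring the kind of overlap between water-related classes that the atlas surfaces.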

Zooming in

Activation atlases are perhaps best described as photo collages, or alternatively as gridded canvases by an artist with a regression function fetish. The spindly image clusters, which vary in density, stretch outward in all directions.

One portion of the InceptionV1 atlas contains what looks to be animal heads; zooming in reveals eyes, fur, and noses. Further down the atlas, various furs and the backs of four-legged animals come into focus, as well as animal legs and feet resting on the ground. They blend into lakes and oceans if you dive deep enough. It's at this stage the classifications become a bit ambiguous. Activations used to identify the sea for the class "seashore" are the same used when classifying "starfish" or "sea lion"; there's less to distinguish classes like "lakes" and "ocean." In fact, "lakeside" is as likely to be attributed toward "hippopotamus" as it is toward sea animals like "starfish" and "stingray."

Moving upward, people and food emerge. ImageNet doesn't contain many classes that specifically identify people, but those who are present and accounted for show attribution toward familiar things ("hammer," "flute"), clothes ("bow tie," "maillot"), and activities ("basketball"). As for the foodstuff populating the upper left of the atlas, it's largely round, colorful, and fruity, with attribution toward "lemon," "orange," and "fig."

Many insights can be gleaned from all this, Olah says. The first few layers of neurons in InceptionV1 -- dubbed "mixed3b" -- appear to represent blob-like patterns and splotches of color, while the middle layers -- "mixed4c" -- evoke concepts shared among many different classes, like "eyes," "leaves," and "water." Meanwhile, the later layers -- "mixed5b" -- align abstractions more closely with the output. It's in these final groupings of neurons that textures and shapes become more complex and identifiable.

"[They] reveal a lot of structures that we sort of knew [were] in there or we knew were happening, but made visual and understandable to the average person," Carter said.

Detecting failure cases

Neural networks aren't perfect, and the activation atlas exposes both the flaws and their causes. By creating what the researchers call a "class activation atlas," which involves further dimensionality reduction, it's possible to isolate and analyze the object detectors within each layer, they said.

The class atlases reveal that InceptionV1 deftly classifies objects, for the most part -- "snorkel" is closely associated with photos of the ocean, underwater, and colorful masks, for instance. But failure cases lurk beneath the surface. In one example detailed by the researchers, an activation had a strong attribution toward "steam locomotive" because the detectors mistook the engines for scuba divers' air tanks. In another example, a detector strongly attributed certain kinds of pasta noodles to "wok," potentially as a result of the disproportionate number of images of woks with noodles in the ImageNet dataset.

"That's kind of like something we didn't know about before it was revealed through the atlas," Carter said.

In subsequent experiments, the researchers attempted to intentionally trip up the detectors by patching in misleading images. For the first failure case -- the one involving the snorkel detector -- they stitched a picture of a steam locomotive into a snap of a scuba diver. Even after it was enlarged to nearly half the size of the original sample image, InceptionV1 continued to rank "scuba diver" among the top five possible classifications.

They followed that test up with a larger-scale experiment -- one with 10 sample patches run on 1,000 images from the training set for the target class -- during which they managed to flip the image classification in about two in five cases. The flips were even more successful (one in two) when they positioned the patch in the best of the four corners of images at the "most effective" size.
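The bookkeeping behind a patch experiment like this can be sketched in a few lines. Everything here is a stand-in: the images are tiny grayscale grids, the classifier is a toy brightness threshold rather than InceptionV1, and the sketch counts any change of label as a flip rather than a change to a specific target class.

```python
def paste_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left).

    Images and patches are lists of rows of pixel values; the
    original image is left unmodified.
    """
    patched = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            patched[top + r][left + c] = value
    return patched


def flip_rate(images, patch, classify, top, left):
    """Fraction of images whose predicted label changes after patching."""
    flips = 0
    for image in images:
        before = classify(image)
        after = classify(paste_patch(image, patch, top, left))
        if after != before:
            flips += 1
    return flips / len(images)


def toy_classifier(image):
    """Hypothetical stand-in model: label an image by mean brightness."""
    pixels = [value for row in image for value in row]
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"


# Three toy 2 x 2 grayscale images and a single bright 1 x 1 patch.
images = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.4, 0.4], [0.4, 0.4]],
    [[0.9, 0.9], [0.9, 0.9]],
]
patch = [[1.0]]

rate = flip_rate(images, patch, toy_classifier, top=0, left=0)
print(rate)
```

Repeating the same loop over several patch sizes and the four corner positions, then keeping the best result per image, would correspond to the "most effective" variant of the experiment described above.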

Such patches are far from the first example of what scientists refer to as “adversarial perturbations,” or changes to objects designed to fool computer vision algorithms into mistaking them for something else, and attacks of this kind threaten to wreak havoc in mission-critical systems like self-driving cars. Researchers from Kyushu University and MIT demonstrated in a paper an algorithm that alters a single pixel in an image, resulting in an AI misclassifying objects, and students at MIT successfully tricked Google’s Cloud Vision service into identifying images of dogs as “skiers.”

"I think there's a lot of scientific interest in how these models actually work," Olah said, "and [t]o the extent that we're worried about ways these systems might be deployed in the real world, being able to understand them, look through them, and audit them may be a really important [thing]."

Mapping machine learning algorithms

So what purpose do activation atlases serve, exactly? Building on work in explainable AI by Harvard, IBM, DARPA, and others, the researchers hope the atlases will inform future AI design by demonstrating how models operate, and by uncovering shortcomings in those models that might otherwise never have bubbled to the surface.

What with reports of facial recognition algorithms performing worse on African Americans than Caucasians and wrongly identifying criminal suspects, it's certainly a timely contribution.

"Things being correlated ... has structural similarities to other kinds of concerns you might have about neural nets, like bias, where the model may learn about something that doesn't actually cause the answer to be true but happens in the training set to be correlated with it," Olah said. "And so this kind of technique can surface things like that being important in the decision-making process."