Primates’ retinal ganglion cells receive visual information from photoreceptors and transmit it from the eye to the brain. But not all of these cells are alike: an estimated 80% operate at low frequency and recognize fine details, while about 20% respond to swift changes. This biological dichotomy inspired scientists at Facebook AI Research to develop what they call SlowFast, a machine learning architecture for video recognition that they claim achieves “strong performance” for both action classification and detection in footage.

An implementation in Facebook’s PyTorch framework — PySlowFast — is available on GitHub, along with trained models.

As the research team points out in a preprint paper, slow motions occur statistically more often than fast motions, and the recognition of semantics like colors, textures, and lighting can be refreshed slowly without compromising accuracy. On the other hand, it’s beneficial to analyze performed motions — like clapping, waving, shaking, walking, or jumping — at a high temporal resolution (i.e., using a greater number of frames), because they evolve faster than their subject identities.

That’s where SlowFast comes in. It comprises two pathways, one of which operates at a low frame rate and slow refreshing speed optimized to capture information given by a few images or sparse frames. In contrast, the other pathway captures rapidly changing motion with a fast refreshing speed and high temporal resolution.
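As a rough illustration of the two frame rates, the sketch below picks which frames of a clip each pathway would see. The function name and the stride values are assumptions chosen for illustration; the article does not state the exact sampling rates.

```python
def sample_pathway_indices(num_frames, slow_stride=16, alpha=8):
    """Return the frame indices seen by the slow and fast pathways.

    slow_stride and alpha are illustrative values, not figures from the article:
    the slow pathway keeps one frame in every `slow_stride`, and the fast
    pathway samples `alpha` times more densely.
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be a multiple of alpha")
    fast_stride = slow_stride // alpha
    slow_indices = list(range(0, num_frames, slow_stride))
    fast_indices = list(range(0, num_frames, fast_stride))
    return slow_indices, fast_indices


# A 64-frame clip: the slow pathway sees few frames, the fast pathway many.
slow_idx, fast_idx = sample_pathway_indices(64)
print(len(slow_idx), len(fast_idx))
```

With these assumed settings, every frame the slow pathway sees is also seen by the fast pathway, which simply looks at the clip more often.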

Above: Facebook’s SlowFast classifying a video.

Image Credit: Facebook

The researchers assert that by treating the raw video at different temporal rates, SlowFast allows its two pathways to develop their own video modeling expertise. The slower path becomes better at recognizing static areas in the frame that don’t change or that change slowly, while the faster path learns to reliably suss out actions in dynamic areas.

The two pathways are fused: data from the fast pathway is fed into the slow pathway via lateral connections throughout the network, so the slow pathway is aware of what the fast pathway has learned. At the end of the network, the outputs of both pathways are concatenated and passed to a fully connected classification layer.
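The fusion described here can be sketched with toy data. This is a simplified illustration, not the PySlowFast code: features are plain lists of per-frame channel vectors with no spatial dimensions, and matching the two frame rates by keeping every alpha-th fast frame is an assumption, since the article does not say how the lateral connections align the pathways. All function names are hypothetical.

```python
def lateral_fuse(slow_feats, fast_feats, alpha):
    """Concatenate each slow time step with the matching fast time step.

    slow_feats: list of T per-frame channel vectors
    fast_feats: list of alpha * T per-frame channel vectors
    """
    if len(fast_feats) != alpha * len(slow_feats):
        raise ValueError("fast pathway must have alpha times as many frames")
    # Assumption: keep every alpha-th fast frame so both pathways align in time.
    fast_subsampled = fast_feats[::alpha]
    return [s + f for s, f in zip(slow_feats, fast_subsampled)]


def temporal_average(feats):
    """Average per-frame channel vectors over time (a stand-in for pooling)."""
    num_frames = len(feats)
    return [sum(channel) / num_frames for channel in zip(*feats)]


def classifier_input(slow_feats, fast_feats):
    """Pool each pathway, then concatenate the two pooled vectors."""
    return temporal_average(slow_feats) + temporal_average(fast_feats)


# Toy features: 2 slow frames with 4 channels, 8 fast frames with 1 channel.
slow = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
fast = [[float(t)] for t in range(8)]

fused_slow = lateral_fuse(slow, fast, alpha=4)
head_input = classifier_input(fused_slow, fast)
print(len(fused_slow[0]), len(head_input))
```

The fused slow features gain the fast pathway's channels at each time step, and the vector handed to the classifier contains pooled features from both pathways.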

To evaluate SlowFast’s performance, the team tested the model on two popular data sets: DeepMind’s Kinetics-400 and Google’s AVA. The former comprises 10-second clips from hundreds of thousands of YouTube videos, covering 400 categories of human actions, each represented in at least 400 videos. AVA, by contrast, includes 430 15-minute annotated YouTube videos with 80 annotated visual actions.

SlowFast achieved state-of-the-art results on both data sets, surpassing the best top-1 score on Kinetics-400 by 5.1 percentage points (79.0% versus 73.9%) and the best top-5 score by 2.7 points (93.6% versus 90.9%). It also achieved a mean average precision (mAP) of 28.3 on AVA, a substantial improvement on the previous state of the art of 21.9 mAP. Interestingly, but perhaps unsurprisingly, the paper’s coauthors note that the compute cost of the slow pathway was four times that of the fast pathway.
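The margins quoted above are differences between two scores, and the four-to-one compute ratio implies the fast pathway accounts for roughly a fifth of the total cost. A quick check of that arithmetic, using only the figures reported in this article (the variable names are illustrative):

```python
# Scores as reported in the article: (SlowFast, previous best).
kinetics_top1 = (79.0, 73.9)
kinetics_top5 = (93.6, 90.9)
ava_map = (28.3, 21.9)

# Gains are absolute differences between the two scores.
top1_gain = round(kinetics_top1[0] - kinetics_top1[1], 1)
top5_gain = round(kinetics_top5[0] - kinetics_top5[1], 1)
map_gain = round(ava_map[0] - ava_map[1], 1)

# If the slow pathway costs 4x the fast pathway, the fast pathway's
# share of the combined compute is 1 / (1 + 4).
slow_to_fast_ratio = 4
fast_share = 1 / (1 + slow_to_fast_ratio)

print(top1_gain, top5_gain, map_gain, fast_share)
```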

“We hope that this SlowFast concept will foster further research in video recognition … [We’ve demonstrated that] the Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition,” wrote the researchers. “The time axis is a special dimension. This paper has investigated an architecture design that contrasts the speed along this axis.”