Facebook's new computer vision model achieves state-of-the-art performance by learning from random images

Facebook today announced an AI model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks. Unlike most computer vision models, which learn from labeled datasets, Facebook's generates labels from data by exposing the relationships between the data's parts -- a step believed to be critical to someday achieving human-level intelligence.

The future of AI lies in crafting systems that can make inferences from whatever information they're given without relying on annotated datasets. Provided text, images, or another type of data, an AI system would ideally be able to recognize objects in a photo, interpret text, or perform any of the countless other tasks asked of it.

Facebook claims to have made a step toward this with a computer vision model called SEER, which stands for SElf-supERvised. SEER contains a billion parameters and can learn from any random group of images on the internet without the need for curation or annotation. Parameters, a fundamental part of machine learning systems, are the part of the model derived from historical training data.

New techniques

Self-supervision for vision is a challenging task. With text, semantic concepts can be broken up into discrete words, but with images, a model must decide for itself which pixel belongs to which concept. Making matters more challenging, the same concept will often vary between images. Grasping the variation around a single concept, then, requires looking at a lot of different images.

Facebook researchers found that scaling AI systems to work with complex image data required at least two core components. The first was an algorithm that could learn from a vast number of random images without any metadata or annotations, while the second was a convolutional network -- ConvNet -- large enough to capture and learn every visual concept from this data. Convolutional networks, which were first proposed in the 1980s, are inspired by biological processes, in that the connectivity pattern between components of the model resembles the visual cortex.

In developing SEER, Facebook took advantage of an algorithm called SwAV, which was borne out of the company's investigations into self-supervised learning. SwAV uses a technique called clustering to rapidly group images from similar visual concepts and leverage their similarities, improving over the previous state-of-the-art in self-supervised learning while requiring up to 6 times less training time.

Training models at SEER's size also required an architecture that was efficient in terms of runtime and memory without compromising on accuracy, according to Facebook. The researchers behind SEER opted to use RegNets, or a type of ConvNet model capable of scaling to billions or potentially trillions of parameters while fitting within runtime and memory constraints.

Facebook software engineer Priya Goyal said SEER was trained on 512 NVIDIA V100 GPUs with 32GB of RAM for 30 days.

The last piece that made SEER possible was a general-purpose library called VISSL, short for VIsion library for state-of-the-art Self Supervised Learning. VISSL, which Facebook is open-sourcing today, allows for self-supervised training with a variety of modern machine learning methods. The library facilitates self-supervised learning at scale by integrating algorithms that reduce the per-GPU memory requirement and increase the training speed of any given model.

Performance and future work

After pretraining on a billion public Instagram images, SEER outperformed the most advanced state-of-the-art self-supervised systems, Facebook says. SEER also outperformed models on tasks including object detection, segmentation, and image classification. When trained with just 10% of the examples in the popular ImageNet dataset, SEER still managed to hit 77.9% accuracy. And when trained with just 1%, SEER was 60.5% accurate.

When asked whether the Instagram users whose images were used to train SEER were notified or given an opportunity to opt out of the research, Goyal noted that Facebook informs Instagram account holders in its data policy that it uses information like pictures to support research, including the kind underpinning SEER. That said, Facebook doesn't plan to share the images or the SEER model itself, in part because the model might contain unintended biases.

"Self-supervised learning has long been a focus for Facebook AI because it enables machines to learn directly from the vast amount of information available in the world, rather than just from training data created specifically for AI research," Facebook wrote in a blog post. "Self-supervised learning has incredible ramifications for the future of computer vision, just as it does in other research fields. Eliminating the need for human annotations and metadata enables the computer vision community to work with larger and more diverse datasets, learn from random public images, and potentially mitigate some of the biases that come into play with data curation. Self-supervised learning can also help specialize models in domains where we have limited images or metadata, like medical imaging. And with no labor required up front for labeling, models can be created and deployed quicker, enabling faster and more accurate responses to rapidly evolving situations."

New techniques

Performance and future work

More