Unsupervised representation learning — the set of techniques that allows an AI system to automatically discover the representations needed to classify raw data — is becoming widely used in natural language processing. But it hasn't yet gained traction in computer vision, perhaps because the raw signals in tasks like image classification live in a continuous, high-dimensional space that isn't structured for human communication.

This spurred researchers at Facebook to pursue a contrastive loss approach, in which encoded "keys" are sampled from image data and an encoder is trained to match each query to its corresponding key in a dictionary. To this end, they developed Momentum Contrast (MoCo), which pre-trains representations that can be transferred to downstream tasks by fine-tuning. In a test involving seven tasks related to detection or segmentation and a corpus of roughly one billion photos from Instagram, they say that MoCo in some cases surpassed a supervised baseline by "nontrivial" margins.
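The dictionary look-up idea can be made concrete with a small sketch. The function below is a hedged illustration of an InfoNCE-style contrastive loss, not the team's exact implementation: a query should score high against its one positive key and low against the negative keys held in the dictionary; `tau` (a temperature) is a hypothetical setting for illustration.

```python
import numpy as np

def contrastive_loss(q, k_pos, negatives, tau=0.07):
    """Contrastive loss framed as dictionary look-up: the query q should
    match its positive key k_pos and mismatch every negative key.
    A sketch of an InfoNCE-style objective; tau is a temperature."""
    # Similarity logits: positive key first, then all negatives.
    logits = np.concatenate([[q @ k_pos], negatives @ q]) / tau
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    # Softmax cross-entropy with the positive key at index 0.
    return -np.log(probs[0])
```

Intuitively, the loss falls as the query moves closer to its key and further from the negatives, which is what drives the encoder to produce useful representations.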

“These results show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks, and can serve as an alternative to … supervised pre-training in several applications,” wrote the researchers in a paper detailing their work.

MoCo cleverly maintains the above-described key dictionary as a queue of data samples, which enables it to reuse encoded keys. This in turn allows the dictionary to be larger than is typical, and to be set both "flexibly" and "independently" as a hyperparameter (i.e., a model setting chosen before training rather than learned from the data). It's a dynamic dictionary in the sense that its samples are progressively replaced, but the researchers note that it always represents a sampled subset of all data.
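The two moving parts described above can be sketched in a few lines. This is a simplified illustration, not the paper's code: the sizes are hypothetical, the encoders are stand-in weight matrices, and the momentum value is an assumption. The key encoder is kept as a slowly moving average of the query encoder (the "momentum" in Momentum Contrast), while the dictionary is a fixed-size queue whose oldest keys are dropped as new ones arrive.

```python
import numpy as np

# Hypothetical sizes for illustration; the actual settings differ.
feat_dim, queue_size, momentum = 128, 4096, 0.999

rng = np.random.default_rng(0)

# Query-encoder weights are trained by backprop; key-encoder weights
# are a slowly moving average of them.
theta_q = rng.standard_normal((feat_dim, feat_dim))
theta_k = theta_q.copy()

# The dynamic dictionary: a fixed-size FIFO queue of encoded keys.
queue = rng.standard_normal((queue_size, feat_dim))

def momentum_update(theta_q, theta_k, m=momentum):
    # theta_k <- m * theta_k + (1 - m) * theta_q
    return m * theta_k + (1.0 - m) * theta_q

def enqueue_dequeue(queue, new_keys):
    # Drop the oldest keys and append the newest batch: samples are
    # progressively replaced, but the queue's size stays constant.
    return np.concatenate([queue[len(new_keys):], new_keys], axis=0)
```

Because the queue's length is decoupled from the mini-batch size, the dictionary size becomes exactly the kind of flexible, independent hyperparameter the researchers describe.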


To evaluate MoCo, the team tapped ImageNet and Instagram-1B, data sets containing 1.28 million images in 1,000 classes and 940 million public Instagram images related to the ImageNet categories, respectively. (Perhaps unsurprisingly given its size, it took six days and 64 graphics cards to train an image classification model on the Instagram corpus.) They report that MoCo pre-trained on the Instagram corpus consistently outperformed its ImageNet-pre-trained counterpart, indicating that MoCo is well suited to large-scale and relatively uncurated data.

“MoCo has largely closed the gap between unsupervised and supervised representation learning in multiple vision tasks,” wrote the researchers. “Beyond the simple instance discrimination task, it is possible to adopt MoCo for pretext tasks like masked auto-encoding, e.g., in language and in vision. We hope MoCo will be useful with other pretext tasks that involve contrastive learning.”