Facebook AI Research applies Transformer architecture to streamline object detection models

Six members of Facebook AI Research (FAIR) tapped the popular Transformer neural network architecture to create an end-to-end object detection model, an approach they claim streamlines the creation of object detection models and reduces the need for handcrafted components. Named Detection Transformer (DETR), the model can recognize the objects in an image in a single pass.

DETR is the first object detection framework to successfully integrate the Transformer architecture as a central building block in the detection pipeline, FAIR said in a blog post. The authors added that Transformers could revolutionize computer vision as they did natural language processing in recent years, or bridge gaps between NLP and computer vision.

"DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a Transformer architecture," reads a FAIR paper published Wednesday alongside the open source release of DETR. "The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors."
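To make "predicts the final set of detections in parallel" more concrete, here is a minimal sketch of how a fixed number of prediction slots might be turned into detections. It is not taken from the DETR release. The function name `decode_detections`, the trailing "no object" score in each slot, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_detections(class_logits, boxes, labels, threshold=0.7):
    """Keep the slots whose best real class is confident enough.

    Assumption (hypothetical layout): each slot holds one score per label
    plus a final "no object" score, and every slot is read independently,
    so no anchor boxes or overlap-suppression step appears here.
    """
    detections = []
    for logits, box in zip(class_logits, boxes):
        probs = softmax(logits)
        object_probs = probs[:-1]  # drop the trailing "no object" entry
        best = max(range(len(object_probs)), key=object_probs.__getitem__)
        if object_probs[best] >= threshold:
            detections.append((labels[best], object_probs[best], box))
    return detections


labels = ["cat", "dog"]
class_logits = [
    [5.0, 0.0, 0.0],  # slot 0: strongly "cat"
    [0.0, 0.0, 5.0],  # slot 1: strongly "no object"
]
boxes = [(0.1, 0.2, 0.4, 0.5), (0.0, 0.0, 1.0, 1.0)]
result = decode_detections(class_logits, boxes, labels)
```

In this toy input only the first slot survives, because the second slot puts most of its probability on "no object."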

Created by Google researchers in 2017, the Transformer network architecture was initially intended as a way to improve machine translation, but has grown to become a cornerstone of machine learning for making some of the most popular pretrained state-of-the-art language models, such as Google's BERT, Facebook's RoBERTa, and many others. In conversation with VentureBeat, Google AI chief Jeff Dean and other AI luminaries declared Transformer-based language models a major trend in 2019 they expect to continue in 2020.

Transformers use attention functions instead of a recurrent neural network to predict what comes next in a sequence. When applied to object detection, a Transformer is able to cut out steps in building a model, such as the need to create spatial anchors and customized layers.
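For readers unfamiliar with attention functions, the following is a rough sketch of scaled dot-product attention, the form described in the original 2017 Transformer work. It uses plain Python lists for readability and is not DETR's code. The function names and toy inputs are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each query is compared with every key; the resulting weights decide
    how much of each value vector flows into that query's output.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Two identical keys receive equal weight, so the output is the
# average of the two value vectors.
out = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```

Because every query looks at every key at once, nothing in the computation depends on processing items one after another, which is the contrast with a recurrent network.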

DETR achieves results comparable to Faster R-CNN, an object detection model created primarily by Microsoft Research that's earned nearly 10,000 citations since it was introduced in 2015, according to arXiv. The DETR researchers ran experiments using the COCO object detection data set as well as others related to panoptic segmentation, the kind of object detection that paints regions of an image instead of drawing a bounding box around each object.

One major issue the authors say they encountered: DETR works better on large objects than small objects. "Current detectors required several years of improvements to cope with similar issues, and we expect future work to successfully address them for DETR," the authors wrote.

DETR is the latest Facebook AI initiative that looks to a language model solution to solve a computer vision challenge. Earlier this month, Facebook introduced the Hateful Meme data set and challenge to champion the creation of multimodal AI capable of recognizing when an image and accompanying text in a meme violate Facebook policy. In related news, earlier this week, the Wall Street Journal reported that an internal investigation concluded in 2018 that Facebook's recommendation algorithms "exploit the human brain's attraction to divisiveness," but executives largely ignored the analysis.
