Microsoft today launched a new computer vision service it claims can generate image captions that are, in some cases, more accurate than human-written descriptions. The company calls the service, which is available as part of Azure Cognitive Services Computer Vision, a “significant research breakthrough” and an example of its commitment to accessible AI.

Automatic image captioning has a number of broad use cases, first and foremost assisting users with disabilities. According to the World Health Organization, the number of people of all ages who are visually impaired is estimated to be 285 million, of whom 39 million are blind.

Accuracy becomes all the more critical when vision-impaired users rely on captioning for daily tasks. According to a study by researchers at Indiana University, the University of Washington, and Microsoft, blind people tend to place a lot of trust in automatically generated captions, building unsupported narratives to reconcile differences between image contexts and incongruent captions. When asked to identify captions of images on Twitter that might be incorrect, even blind users who described themselves as skilled and consistent about double-checking tended to trust the automatic captions, whether or not the captions made sense, the researchers found.

In early 2017, Microsoft updated Office 365 apps like Word and PowerPoint with automatic image captioning, drawing on Cognitive Services Computer Vision. (Cognitive Services is a cloud-based suite of APIs and SDKs available to developers building AI and machine learning capabilities into their apps and services.) More recently, the company launched Seeing AI, a mobile app designed to help low- and impaired-vision users navigate the world around them.

But while Office 365 and Seeing AI could automatically caption images better than some AI baselines, Microsoft engineers pursued new techniques to improve them further.

The engineers describe their technique in a September paper published on a preprint server. Called visual vocabulary pretraining, or VIVO for short, it leverages large numbers of photos without annotations to learn a vocabulary for image captioning. (Typically, training automatic captioning models requires corpora that contain annotations provided by human labelers.) The vocabulary comprises an embedding space where features of image regions and tags of semantically similar objects are mapped into vectors that are close to each other (e.g., “person” and “man,” “accordion” and “instrument”). Once the visual vocabulary is established, an automatic image captioning model can be fine-tuned using a data set of images and corresponding captions.
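To make the idea of an embedding space concrete, here is a minimal sketch in Python. It is not Microsoft's implementation: the three-dimensional vectors are invented for illustration, and the names `TOY_EMBEDDINGS`, `cosine_similarity`, and `nearest_tag` are hypothetical. It only shows how "close to each other" can be measured once tags have been mapped to vectors.

```python
import math

# Toy 3-dimensional vectors. The values are invented for illustration only;
# a real visual vocabulary would be learned and far higher-dimensional.
TOY_EMBEDDINGS = {
    "person": [0.9, 0.1, 0.0],
    "man": [0.8, 0.2, 0.1],
    "accordion": [0.1, 0.9, 0.2],
    "instrument": [0.2, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_tag(query, embeddings):
    """Return the tag, other than the query itself, with the closest vector."""
    return max(
        (tag for tag in embeddings if tag != query),
        key=lambda tag: cosine_similarity(embeddings[query], embeddings[tag]),
    )


if __name__ == "__main__":
    for tag in ("person", "accordion"):
        print(tag, "is closest to", nearest_tag(tag, TOY_EMBEDDINGS))
```

With these made-up vectors, "person" lands nearest to "man" and "accordion" nearest to "instrument," mirroring the pairs mentioned above.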

Above: Image captioning results on nocaps. B: A baseline without adding VIVO pretraining. V: With VIVO pretraining. Red text represents novel objects. The bounding box color is brighter when the similarity is higher.

Image Credit: Microsoft

During the model training process, one or more tags are randomly masked and the model is asked to predict the masked tags conditioned on the image region features and the other tags. Even though the data set used for fine-tuning covers only a small subset of the most common objects in the visual vocabulary, the VIVO-pretrained model can generalize to images that depict similar scenes (e.g., people sitting on a couch together). In fact, it’s one of the few caption-generating pretraining methods that doesn’t rely on caption annotations, enabling it to work with existing image data sets developed for image tagging and object detection tasks.
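The masking step can be sketched in a few lines of Python. This is an illustrative assumption rather than the paper's code: the `[MASK]` placeholder, the `mask_tags` function, and the 15% masking probability are hypothetical choices, and the model that would predict the hidden tags from the image region features is left out entirely.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def mask_tags(tags, mask_prob=0.15, rng=None):
    """Randomly hide one or more tags.

    Returns the masked tag list and a dict mapping each masked position to
    the original tag the model would be asked to predict.
    The 0.15 default is an assumption, not a value taken from the article.
    """
    if rng is None:
        rng = random.Random()
    masked = list(tags)
    targets = {}
    for i, tag in enumerate(tags):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tag
    # Guarantee that at least one tag is masked, as the description requires.
    if not targets and tags:
        i = rng.randrange(len(tags))
        masked[i] = MASK
        targets[i] = tags[i]
    return masked, targets


if __name__ == "__main__":
    image_tags = ["person", "couch", "dog", "lamp"]
    masked, targets = mask_tags(image_tags, rng=random.Random(0))
    print(masked)   # tag list with at least one entry replaced by [MASK]
    print(targets)  # positions and original tags to be predicted
```

In a real training loop, the masked list would be fed to the model alongside the image region features, and the loss would be computed only on the positions in `targets`.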

Microsoft benchmarked the VIVO-pretrained model on nocaps, a test designed to encourage the development of image captioning models that can learn visual concepts from alternative sources of data. Evaluated on tens of thousands of human-generated captions describing thousands of images, the model achieved state-of-the-art results with substantial improvement for objects it hadn’t seen before. Moreover, on a metric called consensus-based image description evaluation (CIDEr), which aims to measure the similarity of a generated caption against ground truth sentences written by humans, the model surpassed human performance by a statistically significant margin.
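For a rough sense of how a caption can be scored against human references, the sketch below averages n-gram cosine similarity over n-gram lengths one through four. It is a simplification and not CIDEr itself: the full metric also weights n-grams by TF-IDF across the whole reference corpus, which is omitted here. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(c1, c2):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(c1[g] * c2[g] for g in c1 if g in c2)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def ngram_consensus(candidate, references, max_n=4):
    """Average n-gram cosine similarity over references and n = 1..max_n.

    Unweighted counts only; real CIDEr applies TF-IDF weighting first.
    """
    cand_tokens = candidate.lower().split()
    score = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand_tokens, n)
        sims = [
            cosine(cand_counts, ngrams(ref.lower().split(), n))
            for ref in references
        ]
        score += sum(sims) / len(sims)
    return score / max_n


if __name__ == "__main__":
    refs = ["a man plays an accordion", "a person playing an instrument"]
    print(ngram_consensus("a man plays an instrument", refs))
```

A caption that shares more words and phrases with the references scores higher, and one that shares none scores zero.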

In addition to the latest version of the Cognitive Services Computer Vision API, Microsoft says the model is now included in Seeing AI. It will roll out to Microsoft products and services including Word and Outlook, for Windows and Mac, and PowerPoint for Windows, Mac, and web later this year, replacing an image captioning model that’s been used since 2015.

“Given the benefit of this, we’ve worked to accelerate the integration of this research breakthrough and get it into production and Azure AI,” Eric Boyd, corporate vice president of AI platform at Microsoft, told VentureBeat via phone earlier this week. “It’s one thing to have a breakthrough of something that works in a delicate setup in the lab. But to have something that [in a few months] we can have pressure-tested and operating at scale and part of Azure … showcases how we’re able to go from the research breakthrough to getting things out into production.”
