Bing taps AI to improve image search results

AI and machine learning have the potential to measurably improve the accuracy of web search results, as Google demonstrated recently with the incorporation of a new language model. Not to be outdone, Microsoft today revealed that it has imbued Bing's image search engine with a number of techniques that better handle searches for pictures with specific context or attributes.

"[Our] image search is evolving further toward a more intelligent and more precise search engine through multi-granularity matches, improved understanding of user queries, images and webpages, as well as the relationships between them," wrote the Bing image relevance team in a blog post. "Deep learning techniques are a set of very exciting and promising tools lending themselves very well to both text and image."

One of those tools is vector matching, which maps queries and documents to semantic spaces to help find more relevant results. Bing's stack now also includes BERT and Transformer technology, which uses pretraining and an attention mechanism to model relationships among words and to embed images and pages with an awareness of each other. As a result, the document representations have become stronger summarizations of the salient areas of photos and pages.
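To make the idea of vector matching concrete, here is a minimal sketch that ranks candidate documents by the cosine similarity between their vectors and a query vector. It is not Bing's implementation: the three-dimensional vectors, the document names, and the function names are all hypothetical, and in a real system the vectors would come from a trained model rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_by_vector_match(query_vec, doc_vecs):
    """Sort (doc_id, score) pairs from most to least similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical hand-written embeddings, used only for illustration.
query = [0.9, 0.1, 0.0]
docs = {
    "car_exterior_photo": [0.1, 0.9, 0.2],
    "car_seat_photo": [0.8, 0.2, 0.1],
}
ranking = rank_by_vector_match(query, docs)
```

In this toy setup the document whose vector points in nearly the same direction as the query is ranked first, regardless of whether the two share any literal terms.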

Transformers are a novel type of neural architecture introduced in a 2017 paper coauthored by scientists at Google Brain, Google's AI research division. As with all deep neural networks, they contain neurons (mathematical functions) arranged in interconnected layers that transmit signals from input data and slowly adjust the synaptic strength (weights) of each connection during training. That's how neural networks extract features and learn to make predictions, but Transformers are distinguished by attention: every output element is connected to every input element, and the weightings between them are effectively calculated dynamically.
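The attention mechanism can be illustrated with a short sketch of scaled dot-product attention, the formulation used in the 2017 Transformer paper. The function names and toy inputs below are assumptions made for illustration, and the sketch leaves out the learned projection matrices and multiple heads that a full Transformer layer uses.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights). Each output is a weighted average of all
    value vectors, so every output element depends on every input element.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this query against every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)  # computed from the inputs, not stored
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

The weights here are recomputed for every input rather than stored as fixed parameters, which is the sense in which the weightings are calculated dynamically.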

Another approach recently applied to Bing image search -- attribute match -- extracts a select set of object attributes from both query and candidate documents and uses these attributes for matching. The team trained detectors using a multi-task optimization strategy, enabling them to spot certain attributes from image content and surrounding text even on web pages with insufficient textual information -- albeit only for a limited set of scenarios and attributes currently.
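A rough sketch of the matching step might look like the following. Everything in it is an assumption made for illustration: the attribute vocabulary is invented, and a simple substring lookup stands in for the trained detectors, which according to the team work on image content as well as text.

```python
# Hypothetical attribute vocabulary; the detectors described above are
# learned models, not keyword lists.
ATTRIBUTE_VOCAB = {
    "object": {"car seat", "car"},
    "make": {"chevy", "ford"},
    "model": {"impala", "focus"},
}


def extract_attributes(text):
    """Return {attribute_name: value} for vocabulary terms found in the text.

    Uses naive substring matching and keeps the longest match per attribute,
    so "car seat" wins over "car" when both appear.
    """
    lowered = text.lower()
    found = {}
    for name, values in ATTRIBUTE_VOCAB.items():
        matches = [value for value in values if value in lowered]
        if matches:
            found[name] = max(matches, key=len)
    return found


def attribute_match_score(query_attrs, doc_attrs):
    """Fraction of the query's attributes that the document agrees with."""
    if not query_attrs:
        return 0.0
    agreeing = sum(
        1 for name, value in query_attrs.items()
        if doc_attrs.get(name) == value
    )
    return agreeing / len(query_attrs)


query_attrs = extract_attributes("car seat for Chevy impala 96")
seat_score = attribute_match_score(
    query_attrs, extract_attributes("Chevy Impala car seat cover")
)
car_score = attribute_match_score(
    query_attrs, extract_attributes("1996 Chevy Impala car for sale")
)
```

Under these assumptions, a page about the car itself agrees on make and model but not on the object, so it scores lower than a page about a car seat.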

The Bing team also worked to enrich image metadata with higher-quality info, which they say bolstered the aforementioned vector and attribute match approaches. Best representative queries for images -- natural language queries that serve as summarizations of a web page's and image's content -- are generated by feeding text from web pages into a machine learning model, which condenses long text on the web pages into short phrases. Text information is then embedded together with images into single semantic vectors, which are compared with other queries in a repository to identify close matches.

The Bing team says that, thanks to those and other enhancements, image search has improved noticeably. For tricky queries like "car seat for Chevy impala 96," Bing would previously surface mostly cars instead of car seats, but it now returns "cleaner" and more relevant results. "Bing [is taking steps] away from simple query term matching [toward] deeper semantic understanding of user queries and moving ... further along the way from being an excellent search engine to a truly intelligent one," added the team.
