How Google plans to improve web searches with multimodal AI

During a livestreamed event today, Google detailed the ways it's using AI techniques -- specifically a machine learning algorithm called multitask unified model (MUM) -- to enhance web search experiences across different languages and devices. Beginning early next year, Google Lens, the company's image recognition technology, will gain the ability to find objects like apparel based on photos and high-level descriptions. Around the same time, Google Search users will begin seeing an AI-curated list of things they should know about certain topics, like acrylic paint materials. They'll also see suggestions to refine or broaden searches based on the topic in question, as well as related topics in videos discovered through Search.

The upgrades are the fruit of a multiyear effort at Google to improve Search and Lens' understanding of how language relates to visuals from the web. According to Google VP of Search Pandu Nayak, MUM, which Google detailed at a developer conference in May, could help better connect users to businesses by surfacing products and reviews and improving "all kinds" of language understanding, whether at the customer service level or in a research setting.

"The power of MUM is its ability to understand information on a broad level. It's intrinsically multimodal -- that is, it can handle text, images, and videos all at the same time," Nayak told VentureBeat in a phone interview. "It holds out the promise that we can ask very complex queries and break them down into a set of simpler components, where you can get results for the different, simpler queries and then stitch them together to understand what you really want."

MUM

Google conducts a lot of tests in Search to fine-tune the results that users ultimately see. In 2020 -- a year in which the company launched more than 3,600 new features -- it conducted over 17,500 traffic experiments and more than 383,600 quality audits, Nayak says.

Still, given the complex nature of language, issues crop up. For example, a search for "Is sole good for kids" several years ago -- "sole" referring to the fish, in this case -- turned up webpages comparing kids' shoes.

In 2019, Google set out to tackle the language ambiguity problem with a technology called Bidirectional Encoder Representations from Transformers, or BERT. Building on the company's research into the Transformer model architecture, BERT forces models to consider the context of a word by looking at the words that come before and after it.

Introduced in 2017, the Transformer has become the architecture of choice for natural language tasks, demonstrating an aptitude for summarizing documents, translating between languages, and analyzing biological sequences. According to Google, BERT helped Search better understand 10% of queries in the U.S. in English -- particularly longer, more conversational searches where prepositions like "for" and "to" matter a lot to the meaning.

For instance, Google's previous search algorithm wouldn't understand that "2019 brazil traveler to usa need a visa" is about a Brazilian traveling to the U.S. and not the other way around. With BERT, which realizes the importance of the word "to" in context, Google Search provides more relevant results for the query.
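As a rough illustration of what "looking at the words that come before and after" means, the toy Python sketch below compares the context visible to a strictly left-to-right model with the context visible in a BERT-style masked setup. It is not Google's implementation: the function names are hypothetical, real BERT works on subword tokens with learned attention rather than plain word lists, and the sketch only shows which words are available when interpreting "to."

```python
def left_context(tokens, i):
    """Words a left-to-right model can see when it reaches tokens[i]."""
    return tokens[:i]


def bidirectional_context(tokens, i, mask="[MASK]"):
    """Words visible in a BERT-style masked setup.

    The target word is hidden behind a mask token, while the words on
    both sides of it stay visible.
    """
    return tokens[:i] + [mask] + tokens[i + 1:]


# The example query from the article, split on whitespace (an assumption;
# production systems use subword tokenizers).
query = "2019 brazil traveler to usa need a visa".split()
target = query.index("to")

print(left_context(query, target))
print(bidirectional_context(query, target))
```

With only the left context, nothing indicates where the traveler is headed. With both sides visible, "usa" sits directly after the masked position, which is the signal that makes the direction of travel recoverable.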

"BERT started getting at some of the subtlety and nuance in language, which was pretty exciting, because language filled with nuance and subtlety," Nayak said.

But BERT has its limitations, which is why researchers at Google's AI division developed a successor in MUM. MUM is about 1,000 times larger than BERT and trained on a dataset of documents from the web, with content like explicit, hateful, abusive and misinformative images and text filtered out. It's able to answer queries in 75 languages, including questions like "I want to hike to Mount Fuji next fall -- what should I do to prepare?" and realize that "prepare" could encompass things like fitness training as well as weather.

MUM can also draw on context from imagery and from earlier turns in a dialogue. Given a photo of hiking boots and asked "Can I use this to hike Mount Fuji?" MUM can comprehend the content of the image and the intent behind the query, letting the questioner know that hiking boots would be appropriate and pointing them toward a lesson in a Mount Fuji blog.

MUM, which can transfer knowledge between languages and doesn't need to be explicitly taught how to complete specific tasks, helped Google engineers identify more than 800 COVID-19 vaccine name variations in over 50 languages. With only a few examples of official vaccine names, MUM was able to find interlingual variations in seconds compared with the weeks it might take a human team.

"MUM gives you generalization from languages with a lot of data to languages like Hindi and so forth, with little data in the corpus," Nayak explained.

Multimodal search

After internal pilots in 2020 to see the types of queries that MUM might be able to solve, Google says it's expanding MUM to other corners of Search.

Soon, MUM will allow users to take a picture of an object with Lens -- for example, a shirt -- and search the web for another object -- e.g., socks -- with a similar pattern. MUM will also enable Lens to identify an object unfamiliar to a searcher, like a bike's rear sprockets, and return search results according to a query. For example, given a picture of sprockets and the query, "How do I fix this thing," MUM will show instructions about how to repair bike sprockets.

"MUM can understand that what you're looking for are techniques for fixing and what that mechanism is," Nayak said. "This is the kind of thing that the multimodel Lens promises, and we expect to launch this sometime hopefully early next year."

As an aside, Google unveiled "Lens mode" for iOS for users in the U.S., which adds a new button in the Google app to make all images on a webpage searchable through Lens. Also new is Lens in Chrome, available in the coming months globally, which will allow users to select images, video, and text on a website with Lens to see search results in the same tab without leaving the page that they're on.

In Search, MUM will power three new features: Things to Know, Refine & Broaden, and Related Topics in Videos. Things to Know takes a broad query, like "acrylic paintings," and spotlights web resources like step-by-step instructions and painting styles. Refine & Broaden finds narrower or more general topics related to a query, like "styles of painting" or "famous painters." As for Related Topics in Videos, it picks out subjects in videos, like "acrylic painting materials" and "acrylic techniques," based on the audio, text, and visual content of those videos.

"MUM has a whole series of specific applications," Nayak said, "and they're beginning to impact on many of our products."

Potential biases

A growing body of research shows that multimodal models are susceptible to the same types of biases as language and computer vision models. The diversity of questions and concepts involved in tasks like visual question answering -- as well as the lack of high-quality data -- often prevents models from learning to "reason," leading them to make educated guesses by relying on dataset statistics. For example, in one study involving seven multimodal models and three bias-reduction techniques, the coauthors found that the models failed to address questions involving infrequent concepts, suggesting that there's work to be done in this area.
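To see what "relying on dataset statistics" can look like in practice, consider the deliberately naive Python sketch below. It is not taken from the study mentioned above. The training pairs are invented for illustration, and `fit_prior` and `answer` are hypothetical names. The "model" memorizes the most common answer to each question and never inspects the image.

```python
from collections import Counter, defaultdict

# Invented toy training set of (question, answer) pairs.
# Images are left out on purpose: this baseline never looks at them.
train = [
    ("what color is the banana", "yellow"),
    ("what color is the banana", "yellow"),
    ("what color is the banana", "yellow"),
    ("what color is the banana", "green"),
]


def fit_prior(pairs):
    """Map each question to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, ans in pairs:
        counts[question][ans] += 1
    return {q: c.most_common(1)[0][0] for q, c in counts.items()}


prior = fit_prior(train)


def answer(question, image=None):
    """Answer from question statistics alone; `image` is ignored."""
    return prior.get(question, "unknown")


# Shown a green banana, the baseline still gives the majority answer.
print(answer("what color is the banana", image="photo_of_green_banana"))
# A question about a concept absent from training gets no useful answer.
print(answer("what color is the sprocket"))
```

A shortcut like this can score well on a benchmark where most bananas are yellow while breaking down on rare cases, which is the kind of failure on infrequent concepts the researchers describe.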

Google has had its fair share of issues with algorithmic bias -- particularly in the computer vision domain. Back in 2015, a software engineer pointed out that the image recognition algorithms in Google Photos were labeling his Black friends as "gorillas." Three years later, Google hadn’t moved beyond a piecemeal fix that simply blocked image category searches for "gorilla," "chimp," "chimpanzee," and "monkey" rather than reengineering the algorithm. More recently, researchers showed that Google Cloud Vision, Google’s computer vision service, automatically labeled an image of a dark-skinned person holding a thermometer "gun" while labeling a similar image with a light-skinned person "electronic device."

"[Multimodal] models, which are trained at scale, result in emergent capabilities, making it difficult to understand what their biases and failure modes are. Yet the commercial incentives are for this technology to be deployed to society at large," Percy Liang, Stanford HAI faculty and computer science professor, told VentureBeat in a recent email.

No doubt looking to avoid another round of negative publicity, Google claims that it took pains to mitigate biases in MUM -- mainly by training the model on "high quality" data and having humans evaluate MUM's search results. "We use [an] evaluation process to look for problems with bias in any set of applications that we launch," Nayak said. "When we launch things that are potentially risky, we go the extra mile to be extra cautious."