Machine learning algorithms tend to be specialized — they excel at singular, highly repetitive tasks. (Think generating synthetic scans of brain tumors.) But a new paper published by researchers at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Lab (CSAIL) describes something of an AI polymath: a model that’s equally skilled at both speech and object recognition.

“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to,” David Harwath, a researcher at CSAIL and a coauthor on the paper, told MIT News. “We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing.”

To that end, their system learned to identify objects in an image by associating the words it heard in speech samples with regions in the picture. More impressive still, it never fell back on transcriptions or annotations — it trained solely on pairs of images and audio captions.

The model, which the team sourced from a 2016 study, consists of two convolutional neural networks (CNNs): one that processes images and a second that processes spectrograms, visual representations of how an audio signal's frequencies change over time.
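A spectrogram turns raw audio into a two-dimensional frequency-by-time array that a CNN can treat much like an image. A minimal NumPy sketch of the idea (the window and hop sizes here are illustrative choices, not parameters from the paper):

```python
import numpy as np

def spectrogram(signal, window=400, hop=160):
    """Compute a log-magnitude spectrogram: overlapping frames of the
    signal are windowed, Fourier-transformed, and stacked into a
    (frequency x time) array that a CNN can process like an image."""
    frames = [
        signal[start:start + window] * np.hanning(window)
        for start in range(0, len(signal) - window + 1, hop)
    ]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return np.log(mags + 1e-8).T  # shape: (window // 2 + 1, n_frames)

# One second of synthetic 16 kHz audio: a pure 440 Hz tone.
t = np.linspace(0, 1, 16000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (201, 98)
```

With a 400-sample window at 16 kHz, each frequency bin spans 40 Hz, so the 440 Hz tone shows up as a bright horizontal stripe at bin 11.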

During training, the first CNN divvied up the target image into a grid of cells while the audio-analyzing CNN divided the spectrogram into segments. Then, a third component of the model compared the outputs of the two networks, mapping the first cell in the grid to the first segment of audio, the second cell to the second segment, and so on until the entire image had been processed.
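One way to picture that comparison step is as a similarity map: every image-cell embedding is scored against every audio-segment embedding, and training pushes genuine image-caption pairs toward higher scores than mismatched ones. A rough sketch of the idea (the grid size, embedding dimension, dot-product scoring, and pooling here are assumptions for illustration, not details from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of the two CNNs for one image-caption pair:
# a 14x14 grid of image-cell embeddings and 128 audio-segment
# embeddings, both projected into a shared 512-dimensional space.
image_cells = rng.standard_normal((14 * 14, 512))
audio_segments = rng.standard_normal((128, 512))

# Similarity of every cell to every segment: entry [i, j] scores
# how well image cell i matches audio segment j.
matchmap = image_cells @ audio_segments.T  # shape: (196, 128)

# Collapse the map into one scalar score for the whole pair, e.g. by
# taking each segment's best-matching cell and averaging those peaks.
pair_score = matchmap.max(axis=0).mean()
print(matchmap.shape)  # (196, 128)
```

A peak in the map marks an image region and a stretch of speech that the model has aligned, which is how word-to-object associations emerge without any transcripts.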

After ingesting a database of 400,000 image-caption pairs, the system learned to associate hundreds of different words with objects. The researchers believe that, in the future, it could be adapted to domains like language translation.

“The biggest contribution of the paper,” Harwath said, “is demonstrating that these cross-modal alignments can be inferred automatically by simply teaching the network which images and captions belong together and which pairs don’t … It’s kind of like the Big Bang, where matter was really dispersed, but then coalesced into planets and stars. Predictions start dispersed everywhere but, as you go through training, they converge into an alignment that represents meaningful semantic groundings between spoken words and visual objects.”
