Amazon's AI uses a microphone array to localize multiple speakers in a room

In a technical paper scheduled to be presented next month at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), a group of Amazon researchers proposes an AI-driven approach to multiple-source localization, the problem of estimating the locations of simultaneous sounds from microphone audio. They say that in experiments involving real and simulated data (the former from the AV16.3 corpus) and up to three simultaneously active sound sources, the approach showed an improvement of nearly 15% compared with a state-of-the-art signal-processing model.

Addressing multiple-source localization is an indispensable step in developing sufficiently robust smart speakers, smart displays, and even videoconferencing software. That's because it's at the core of beamforming, a technique that focuses a receiving device (in this case a microphone array) on a signal arriving from a particular direction (in this case sound). Amazon's own Echo lineup taps beamforming to improve voice recognition accuracy, as do Google's Nest Hub and Apple's HomePod.

Sound traveling toward an array of microphones will reach each microphone at a slightly different time, a phenomenon that can be exploited to pinpoint the sources' locations. With a single sound source, the computation is relatively straightforward, but with multiple sound sources it becomes far more complex.
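To make the timing idea concrete, here is a minimal Python sketch for the simple case of one source and two microphones. It is not taken from the Amazon paper: the function names, the 343 m/s speed of sound, and the far-field assumption (a source far from the array compared with the microphone spacing) are all illustrative choices.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second; an assumed room-temperature value


def arrival_time(source, mic):
    """Seconds for sound to travel from source to mic (2D points in metres)."""
    return math.dist(source, mic) / SPEED_OF_SOUND


def tdoa(source, mic_a, mic_b):
    """Time difference of arrival: positive when the sound reaches mic_b first."""
    return arrival_time(source, mic_a) - arrival_time(source, mic_b)


def far_field_angle(delay, spacing):
    """Angle of the source from the array's broadside, in radians.

    Assumes a distant source, so the wavefront is roughly planar and
    delay = spacing * sin(angle) / SPEED_OF_SOUND.
    """
    ratio = SPEED_OF_SOUND * delay / spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding past +/-1
    return math.asin(ratio)


# Two microphones 10 cm apart on the x-axis, and a distant source
# 30 degrees off broadside (values chosen only for illustration).
mic_a, mic_b = (-0.05, 0.0), (0.05, 0.0)
source = (1000 * math.sin(math.radians(30)), 1000 * math.cos(math.radians(30)))
delay = tdoa(source, mic_a, mic_b)
estimated_deg = math.degrees(far_field_angle(delay, spacing=0.1))
```

With this geometry the delay is roughly 0.15 milliseconds, and `far_field_angle` maps it back to about 30 degrees. The sketch covers only the single-source case; with several overlapping sources the delays can no longer be read off this directly.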

Various AI and machine learning solutions to the multiple-source localization problem have been proposed, but many have limitations.

When the number of possible sounds exceeds the number of model outputs, doubt can arise about which sound corresponds to which output. For example, if a model learns to associate a set of coordinates with one speaker and another set of coordinates with two other speakers, it's unclear which output is associated with which speaker when the two other speakers talk at the same time.

One solution is representing the space around microphone arrays as a 3D grid, enabling the model to output, for each grid point, a probability that one of the sounds originated there, given a set of input signals. But this has major drawbacks, chief among them the difficulty of localizing off-grid sources, of creating a training corpus that includes all sound combinations for every point, and of improving accuracy beyond the resolution of the grid.
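The grid's drawbacks can be put in rough numbers. The Python sketch below is an illustration only, with hypothetical helper names and a uniform cubic grid assumed: it snaps a source to its nearest grid point, measures the resulting error, and counts how many distinct placements of simultaneous sources a training corpus would need to cover.

```python
import math


def nearest_grid_point(source, spacing):
    """Snap a 3D position (metres) to the closest point of a uniform grid."""
    return tuple(round(c / spacing) * spacing for c in source)


def quantization_error(source, spacing):
    """Distance between the true source and the grid point that represents it."""
    return math.dist(source, nearest_grid_point(source, spacing))


def worst_case_error(spacing, dims=3):
    """Largest possible snapping error: half the diagonal of one grid cell."""
    return spacing * math.sqrt(dims) / 2


def training_combinations(num_points, num_sources):
    """Ways to place num_sources simultaneous sources on distinct grid points."""
    return math.comb(num_points, num_sources)


# An off-grid source on a grid with 0.5 m spacing (illustrative values).
error = quantization_error((1.2, 0.7, 1.0), spacing=0.5)
```

Here the source lands about 0.28 m from its grid point, and no amount of training can push accuracy below the half-diagonal bound without refining the grid. Refining it makes the corpus problem worse: a 10 x 10 x 10 grid already allows more than 166 million placements of three sources.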

The Amazon team's model first localizes sounds to coarsely defined regions and then finely localizes them within those regions. It considers a region active if it contains at least one source, and it assumes there can be at most one active source in any active region. Because each coarse region has a designated set of nodes in the model's output layer, there can be no ambiguity about which sound source in a given region is associated with a location estimate.
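The bookkeeping behind that guarantee can be sketched in a few lines of Python. The article does not say how the regions are drawn, so the equal angular sectors around the array, the sector count, and the function names below are assumptions made purely for illustration.

```python
def region_index(azimuth_deg, num_regions):
    """Index of the angular sector containing an azimuth given in degrees."""
    width = 360.0 / num_regions
    index = int((azimuth_deg % 360.0) // width)
    return min(index, num_regions - 1)  # guard against float rounding at 360


def assign_to_regions(azimuths_deg, num_regions):
    """Place each source in the output slot reserved for its region.

    Raises ValueError if two sources share a region, which would violate
    the at-most-one-source-per-active-region assumption.
    """
    slots = [None] * num_regions
    for azimuth in azimuths_deg:
        index = region_index(azimuth, num_regions)
        if slots[index] is not None:
            raise ValueError(f"region {index} already holds a source")
        slots[index] = azimuth
    return slots


# Three speakers around an array whose surroundings are split into 8 sectors.
slots = assign_to_regions([10.0, 100.0, 200.0], num_regions=8)
```

Each speaker ends up in a fixed slot determined by where the speaker is, not by the order in which the model happens to report sources, which is what removes the ambiguity. The price is the one-source-per-region assumption: two speakers in the same sector would raise an error in this sketch.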

For each region, the model computes the probability that the region contains a source, as well as the distance between the source and the center of the microphone array and the angle of the source relative to the array. It ingests the multichannel raw audio from the microphones and outputs these three quantities directly, making it end-to-end: the model processes raw audio and thus avoids the need for pre- or post-processing.
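For readers who want to see what those three per-region quantities amount to, the Python sketch below turns a list of (probability, distance, angle) triples into source positions. It is an illustration, not the researchers' code: the 0.5 activity threshold, the 2D geometry, the angle convention, and the name `decode_regions` are all assumptions.

```python
import math


def decode_regions(outputs, threshold=0.5):
    """Convert per-region model outputs into Cartesian source estimates.

    outputs: one (probability, distance_m, angle_deg) triple per region,
             with distance and angle measured from the array's center.
    threshold: assumed cutoff above which a region counts as active.
    Returns a list of (region_index, x, y) tuples for active regions only.
    """
    sources = []
    for region, (prob, distance, angle_deg) in enumerate(outputs):
        if prob >= threshold:
            angle = math.radians(angle_deg)
            x = distance * math.cos(angle)
            y = distance * math.sin(angle)
            sources.append((region, x, y))
    return sources


# Made-up outputs for three regions; only the first and third are active.
example = [(0.9, 2.0, 0.0), (0.1, 1.0, 45.0), (0.8, 1.0, 90.0)]
active = decode_regions(example)
```

In this example the middle region falls below the threshold and is dropped, leaving one source 2 m along the x-axis and another 1 m along the y-axis.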
