We rarely think much about the sounds we hear, but there’s an enormous amount of complexity involved in isolating audio in places like crowded city squares and busy department stores. In the lower levels of our auditory pathways, we segregate individual sources from backgrounds, localize them in space, and detect their motion patterns — all before we work out their context.
Inspired by this neurophysiology, a team of researchers described in a preprint paper on arXiv.org (“Enhanced Robot Speech Recognition Using Biomimetic Binaural Sound Source Localization”) a design devised to test the influence of physiognomy — that is, facial features — on components of sound recognition such as sound source localization (SSL) and automatic speech recognition (ASR).
As the researchers note, the torso, head, and pinnae (the external part of the ears) absorb and reflect sound waves as they approach the body, modifying their frequency content depending on the source’s location. The waves travel to the cochlea (the spiral cavity of the inner ear) and the organ of Corti within, which produces nerve impulses in response to sound vibrations. Those impulses are delivered through the auditory nervous system to the cochlear nucleus, a sort of relay station that forwards information to two structures: the medial superior olive (MSO) and the lateral superior olive (LSO). (The MSO is thought to help determine the angle to the left or right of a sound’s source, while the LSO uses intensity differences to localize it.) Finally, the signals are integrated in the brain’s inferior colliculus (IC).
In an effort to replicate this structure algorithmically, the researchers designed a machine learning framework that processed sound recorded by microphones embedded in two humanoid robotic heads: the iCub and Soundman. The framework comprised four parts: an SSL component that decomposed audio into sets of frequencies and used the frequency waves to generate spikes mimicking the organ of Corti’s nerve impulses; an MSO model sensitive to sounds produced at certain angles; an LSO model sensitive to other angles; and an IC-inspired layer that combined signals from the MSO and LSO. An additional neural network minimized reverberation and ego noise (noise generated by the robots’ joints and motors).
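To make the MSO/LSO/IC division of labor concrete, here is a minimal numerical sketch of the two binaural cues those structures are associated with: an interaural time difference estimated by cross-correlation (MSO-like), an interaural level difference (LSO-like), and a stage that fuses them into one azimuth estimate (IC-like). The paper’s actual framework uses spiking models and learned components; the constants, sign conventions, and fusion rule below are illustrative assumptions, not values from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.09      # m, rough robot-head radius (assumed)

def mso_itd(left, right, sr):
    """MSO-like cue: interaural time difference (seconds) via cross-correlation.
    Positive ITD means the sound reached the left microphone first."""
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)
    return lag / sr

def lso_ild(left, right):
    """LSO-like cue: interaural level difference in dB (positive = left louder)."""
    return 10.0 * np.log10(np.sum(np.square(left)) / np.sum(np.square(right)))

def ic_combine(itd, ild):
    """IC-like stage: fuse both cues into one azimuth estimate in degrees
    (positive = source toward the left, an assumed convention)."""
    angles = np.linspace(-np.pi / 2, np.pi / 2, 181)
    # Woodworth's spherical-head approximation: ITD ~ (r/c) * (sin(a) + a)
    predicted_itd = (HEAD_RADIUS / SPEED_OF_SOUND) * (np.sin(angles) + angles)
    theta_itd = angles[np.argmin(np.abs(predicted_itd - itd))]
    theta_ild = np.clip(ild / 20.0, -1.0, 1.0) * (np.pi / 2)  # crude linear map
    return float(np.degrees(0.5 * theta_itd + 0.5 * theta_ild))
```

A source at front-left, for example, yields a positive ITD (left ear leads) and a positive ILD (left ear louder), so both cues pull the fused azimuth leftward.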
To test the system’s performance, the researchers used Soundman to establish SSL and ASR baselines and the iCub head (outfitted with motors that allowed it to rotate) to determine the effect of resonance from the skull and components within. A group of 13 evenly distributed loudspeakers in a half-cylinder configuration blasted noise toward the heads, which detected and processed it.
The team found that data from SSL could “improve considerably” the accuracy of speech recognition (in some cases by a factor of two at the sentence level) by indicating how to position the robotic heads and by selecting the appropriate channel as input to an ASR system. Performance was even better when the pinnae were removed from the heads.
“This approach is in contrast to related approaches where signals from both channels are averaged before being used for ASR,” the paper’s authors wrote. “The results of the dynamic SSL experiment show that the architecture is capable of handling different kinds of reverberation. These results are an important extension from our previous work in static SSL and support the robustness of the system to the sound dynamics in real-world environments. Furthermore, our system can be easily integrated with recent methods to enhance ASR in reverberant environments – without adding computational cost.”
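The channel-selection idea the authors contrast with channel averaging can be sketched in a few lines, assuming an azimuth estimate like the one SSL provides; the sign convention and the bare left/right rule here are assumptions, not the paper’s exact selection criterion.

```python
def select_asr_channel(left, right, azimuth_deg):
    """Feed ASR the single channel nearer the estimated source instead of
    averaging both channels (azimuth > 0 assumed to mean source on the left)."""
    return left if azimuth_deg >= 0 else right
```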