Researchers develop accent detection AI to improve speech recognition

Few things are more frustrating than a speech recognition system that doesn't understand your accent. Linguistic differences in pronunciation have stumped data scientists for years (a recent study found that YouTube's automatic captioning did worse with Scottish speakers than American southerners), but it's not for lack of trying. Training a model requires a wealth of data, and some dialects are less common than others.

Researchers at Cisco, the Moscow Institute of Physics and Technology, and the Higher School of Economics present a possible solution in a new paper ("Foreign English Accent Adjustment by Learning Phonetic Patterns") published on the preprint server Arxiv.org. Their system leveraged dialectal differences in diction and intonation to create new accented samples of words, which it then learned to recognize with some accuracy relative to similar systems.

"More non-native accented speech data is necessary to enhance the performance of ... existing [speech recognition] models," the researchers wrote. "However, its synthesis is still an open problem."

The team sourced its data from the Carnegie Mellon University (CMU) Pronouncing Dictionary, which maps English words to phonetic transcriptions of their pronunciations. Traditionally, when training a system on a new accent, phonologists have to manually extract features known as phonological generalizations, which represent the difference between General American English (GAE) -- spoken English lacking distinctly regional or ethnic characteristics -- and an audio sample of a distinct accent. That sort of hard-coding doesn't tend to scale well.

The researchers' model generalized those rules automatically. Using dictionaries that mapped characters from George Mason University's (GMU) Speech Accent Archive -- a collection of speech samples from a variety of language backgrounds -- to unique sounds from the CMU dictionary, it predicted pronunciations by applying replacements, deletions, and insertions to input words.
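To make the idea of replacements, deletions, and insertions concrete, here is a minimal Python sketch that applies hand-written rules to a word's phoneme sequence. It is an illustration only, not the researchers' model: the paper's system learns its rules from data, whereas the rule format, the `apply_rules` function, and the example rules below are all hypothetical. Phonemes are written in the ARPAbet-style symbols the CMU dictionary uses, with stress markers omitted.

```python
from typing import List, Tuple

# A rule is (operation, target phoneme, argument). The format is hypothetical
# and exists only for this sketch; it is not taken from the paper.
Rule = Tuple[str, str, str]

VALID_OPS = {"replace", "delete", "insert_after"}


def apply_rules(phonemes: List[str], rules: List[Rule]) -> List[str]:
    """Apply each rule in order to every matching phoneme in the sequence."""
    for op, _, _ in rules:
        if op not in VALID_OPS:
            raise ValueError(f"unknown operation: {op}")

    current = list(phonemes)
    for op, target, arg in rules:
        updated: List[str] = []
        for phoneme in current:
            if phoneme != target:
                updated.append(phoneme)       # rule does not apply
            elif op == "replace":
                updated.append(arg)           # swap target for arg
            elif op == "delete":
                continue                      # drop the target phoneme
            else:  # "insert_after"
                updated.extend([phoneme, arg])  # keep target, add arg after it
        current = updated
    return current


# Illustrative rules (assumptions, not results from the paper):
# "think" with TH realized as S, and "house" with the initial HH dropped.
think_accented = apply_rules(["TH", "IH", "NG", "K"], [("replace", "TH", "S")])
house_accented = apply_rules(["HH", "AW", "S"], [("delete", "HH", "")])
```

Because rules run in sequence, a later rule sees the output of an earlier one, so rule order matters in this sketch.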

The team used the model to generate a phonetic dataset, which they fed into a recurrent neural network -- a type of neural network commonly employed in speech recognition tasks -- that tried to strip out unnecessary sounds and adjust the remaining ones so that they didn't deviate too far from the GAE versions of the words. After training on 800,000 samples, it was able to recognize accented words with 59 percent accuracy.
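The article does not say how the network measures deviation from the GAE version of a word. One common way to quantify the gap between two pronunciations is the edit distance between their phoneme sequences, sketched below in Python. This is an assumption made for illustration, not the paper's method, and `phoneme_edit_distance` is a hypothetical helper name.

```python
from typing import Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences.

    Counts the minimum number of single-phoneme replacements, deletions,
    and insertions needed to turn sequence a into sequence b.
    """
    # prev[j] holds the distance between the first i-1 phonemes of a
    # and the first j phonemes of b.
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]  # turning i phonemes into an empty sequence takes i deletions
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(
                prev[j] + 1,         # delete pa
                curr[j - 1] + 1,     # insert pb
                prev[j - 1] + cost,  # replace pa with pb, or keep a match
            ))
        prev = curr
    return prev[-1]


# Hypothetical accented and GAE pronunciations of "think".
distance = phoneme_edit_distance(["S", "IH", "NG", "K"], ["TH", "IH", "NG", "K"])
```

A distance of zero means the two pronunciations are identical; larger values mean the accented form sits further from the GAE form.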

It's preliminary research -- because the CMU dictionary contained fewer sounds than the GMU archive, the model was able to learn only 13 of the CMU's 20 phonetic generalizations. But the team managed to increase the size of the CMU dataset from 103,000 phonetic transcriptions with a single accent to one million samples with multiple accents.

"The proposed model is able to learn all generalizations that previously were manually obtained by phonologists," the researchers wrote.