Researchers at Duke Kunshan University, Wuhan University, Lenovo, and Sun Yat-sen University in Guangzhou claim to have developed an AI system that detects whether a person is wearing a mask from the sound of their muffled speech. They say that in experiments, it achieves 78.8% accuracy on one metric, demonstrating that sound could be a useful means of enforcing mask-wearing during the pandemic.
The team’s work is a submission to the 11th annual Computational Paralinguistics Challenge (ComParE) at the upcoming Interspeech 2020 conference, an open challenge dealing with the states and traits of speakers as manifested in their speech. This year saw the introduction of a “mask sub-challenge” in which the goal is to develop algorithms capable of determining whether a person is wearing a mask from the sound of their voice. For the sub-challenge, every competitor — the coauthors of this study included — must use the same corpus of 32 German speakers recorded for 10 hours in an audio studio wearing Lohmann & Rauscher face coverings.
The researchers augmented the data set by varying the rate of speech, warping various features, and erasing portions of speech at random. They trained a machine learning system on this augmented data, which included speech recorded from the speakers while they weren’t wearing masks, and then ran experiments to measure how accurately the classifier could detect mask presence.
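To make two of those augmentation steps concrete, here is a minimal sketch of speed variation and random erasing. It is an illustration under stated assumptions, not the researchers' code: the function names `change_speed` and `random_erase`, their parameters, and the choice of linear interpolation are all hypothetical, and feature warping is left out.

```python
import random

def change_speed(samples, rate):
    """Resample a list of audio samples by linear interpolation.

    rate > 1 shortens the clip (faster speech); rate < 1 lengthens it.
    Assumes at least two input samples. (Hypothetical helper.)
    """
    n_out = max(2, round(len(samples) / rate))
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two neighbouring input samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def random_erase(frames, max_width, rng):
    """Return a copy of `frames` (a list of feature vectors) in which one
    randomly placed run of up to `max_width` frames is replaced by zeros.
    (Hypothetical helper.)
    """
    width = rng.randint(0, min(max_width, len(frames)))
    start = rng.randint(0, len(frames) - width)
    out = [list(f) for f in frames]
    for t in range(start, start + width):
        out[t] = [0.0] * len(out[t])
    return out

rng = random.Random(0)
clip = [float(i % 10) for i in range(1000)]
faster = change_speed(clip, 1.25)  # 800 samples instead of 1,000
features = [[1.0] * 4 for _ in range(50)]
erased = random_erase(features, 5, rng)
```

Each call produces a slightly different version of the same recording, which is how augmentation stretches a small corpus into more training examples.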
The researchers found their system’s accuracy wasn’t consistent across genders even though the corpus contains the same number of female and male speakers (16 people each). They don’t speculate as to why this might be, but it’s possible data imbalances in other dimensions are to blame. The speakers talk strictly in German about things like sports, families, kids, and food; wear only one type of mask; and range in age from 20 to 41 years old. Differences in the sounds of languages arise from different manners of articulation; one can expect the speech of an older English male to be distinct from that of a young Spanish speaker.
Still, the researchers say that on the given German data set, their system ultimately achieved higher accuracy than a baseline model (71.8% unweighted average of the class-specific recall).
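The metric named here, unweighted average recall, is the mean of the per-class recall scores, so each class counts equally regardless of how many samples it has. The sketch below shows the calculation on made-up labels; the function name and the toy data are illustrative assumptions, not material from the challenge.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; every class carries the same weight."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# Toy labels (invented for illustration): 8 "mask" clips, 2 "clear" clips.
y_true = ["mask"] * 8 + ["clear"] * 2
y_pred = ["mask"] * 8 + ["mask", "clear"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)  # 0.75 0.9
```

In this toy case plain accuracy is 90%, but the score drops to 75% once the smaller class is given equal weight, which is why the measure suits data sets where one class is rarer than the other.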
Mask detection from speech is a nascent field, evidently, but it’s a potentially desirable alternative to vision-based approaches. A recent report by the U.S. Department of Commerce’s National Institute of Standards and Technology (NIST) found that 89 commercial facial recognition algorithms from Panasonic, Canon, Tencent, and others had error rates between 5% and 50% when matching photos of people wearing digitally applied masks against photos of the same people without masks. Companies including Hanwang say they’ve developed new AI approaches to identifying wearers through their masks, but the quoted accuracy rates are dubiously high and they make no claim to preserve privacy.
Beyond mask detection, researchers are exploring how speech data might be used to diagnose COVID-19. Teams from Carnegie Mellon and startup Voca.ai released an app they claim can tell whether someone has COVID-19 from a voice recording, and Vocalis Health says it’s working with Israel’s Health Ministry and Directorate for Defense Research and Development to collect “vocal biomarkers” from COVID-19 patients. These techniques aren’t without caveats — Benjamin Striner, a graduate student who contributed to the Carnegie Mellon project, cautioned that the app’s accuracy can’t be tested because of a lack of verified data — but preliminary research suggests AI-powered voice analysis can fairly accurately diagnose other conditions, including post-traumatic stress disorder and high blood pressure.