AI startups claim to detect depression from speech, but the jury's out on their accuracy

In a 2012 study published in the journal Biological Psychiatry, a team of scientists at the Center for Psychological Consultation (CPC) in Madison, Wisconsin hypothesized that the characteristics of a depressed person's voice might reveal a lot about the severity of their disorder. The coauthors said the research, which was partially funded by pharma giant Pfizer, identified several "viable biomarkers" -- quantitative indicators of changes in health -- to measure the severity of major depression.

A cottage industry has since emerged, with startups claiming to automate the detection of depression using AI trained on hundreds of recordings of people's voices. One of the better-funded efforts, Ellipsis Health, which generates assessments of depression from 90 seconds of a person's speech, managed to raise $26 million in series A funding. Investors include former Salesforce chief scientist Richard Socher and Salesforce CEO Marc Benioff's Time Ventures.

According to founder and CEO Mainul I Mondal, Ellipsis' technology is "science-based" and validated by peer-reviewed research. But experts are skeptical that the company's product, and others like it, work as well as advertised.

Diagnosing depression

The idea that signs of depression can be detected in a person's voice is at least 60 years old. The 2012 CPC study was a follow-up to a 2007 work by the same research team that was originally published in the Journal of Neurolinguistics. That study -- funded by a small-business innovation research grant from the U.S. National Institutes of Health -- reportedly found "vocal-acoustic" characteristics correlated with the severity of certain depression symptoms.

According to James Mundt, a senior research scientist at CPC who led both the 2007 and 2012 studies, depressed patients begin to speak faster and with shorter pauses as they respond to treatment -- or with monotone, "lifeless," and "metallic" qualities, or "paraverbal features," if they don't. Speech requires complex control in the nervous system, and the underlying pathways in the brain can be affected by psychiatric disorders, including depression. The ability to speak, then, is closely related to thinking and concentration, all of which can be impaired with depression. Or so the reasoning goes.

Ellipsis leveraged this academic connection between speech and disordered thinking to develop a screening test for severe depression. Patients speak briefly into a microphone to record a voice sample, which the company's algorithms then analyze to measure the levels of depression and anxiety.

"Combining the most current deep learning and cutting-edge transfer learning techniques, our team has developed novel models that detect both acoustic and word-based patterns in voice. The models learn their features directly from data, without reliance on predetermined features," Mondal told VentureBeat via email. "Around the world, voice is the original measure of well-being. Through speech, someone's voice conveys the internal state of a person -- not only through words and ideas but also through tone, rhythm, and emotion."

The market for AI health startups, specifically those that deal with biomarkers, is estimated to be worth $129.4 billion by 2027, according to Grand View Research. Ellipsis is one of several in the depression-diagnosing voice analysis space, which includes Sonde Health; Vocalis Health; Winterlight Labs; and Berkeley, California-based Kintsugi, which closed an $8 million funding round last week.

Some research gives credence to the notion that AI can detect depression from speech patterns. In a paper presented at the 2018 Interspeech conference, MIT researchers detailed a system that could read audio data from interviews to discover depression biomarkers with 77% accuracy. And in 2020, using an AI system designed to focus on word choice, scientists at the University of California, Los Angeles said they were able to monitor people being treated for serious mental illness as well as physicians could.

"There's no doubt that paraverbal features can be helpful in making clinical diagnoses," Danielle Ramo, an assistant professor of psychiatry at the University of California, San Francisco, told KQED in a 2017 interview. "To the extent that machines are able to take advantage of paraverbal features in communication, that is a step forward in using machines to inform clinical diagnoses or treatment planning."

In another study out of the University of Vermont, which involved training a system to detect childhood depression, the researchers noted that standard tests involve time-consuming interviews with both clinicians and primary caregivers. Because depression can't be picked up by a blood test or brain scan, physicians must rely on self-reports and results from these interviews to arrive at a diagnosis. Coauthor Ellen McGinnis pitched the research as a way to quickly and easily diagnose mental disorders in young people.

Ellipsis itself plans to put a portion of the new capital toward expanding its platform to children and adolescents, with the stated goal of improving access to diagnosis and treatment. "One can't manage what one can't measure. Access is dependent on knowledge of a condition and the level of severity of that condition," Mondal said. "Access is also dependent on the supply of resources that can treat different levels of access. While there may be an undersupply of specialists, understanding the level of severity may open access to less specialized providers, which are in better supply. In other words, measuring performs triage to recommend the right care at the right time for a patient."

Potential flaws

In many ways, the pandemic highlighted the ramifications of the mental health epidemic. The number of people presenting with moderate to severe symptoms of depression and anxiety remains higher than prior to the global outbreak, with an estimated 28% of people in the U.S. suffering from depression, according to Mental Health America. Against this backdrop, the National Alliance on Mental Illness estimates that 55% of people with mental illness are not receiving treatment -- a gap that's expected to widen as the psychiatrist shortage looms.

Ellipsis' technology, pitched as a partial solution, is being piloted in "nine-plus" U.S. states and internationally through insurance provider Cigna. Cigna used it to create a test called StressWaves that visualizes a person's current stress level and suggests exercises to promote mental well-being. According to Mondal, Ellipsis' platform has also been tested in behavioral health systems at Alleviant Health Centers and undisclosed academic medical centers and specialty health clinics.

"Now more than ever, the industry needs bold, scalable solutions to address this crisis -- beginning with tools like ours to scale quantifying severity, as time-strapped providers alone do not have the bandwidth to solve this problem," he said.

But some computer scientists have reservations about using AI to track mental disorders, particularly severe disorders like depression. Mike Cook, an AI researcher at Queen Mary University of London, said the idea of detecting depression through speech "feels very unlikely" to provide highly precise results. He points out that in the early days of AI-driven emotion recognition, where algorithms were trained to recognize emotions from image and video recordings, the only emotions researchers could get systems to recognize were "fake" emotions, like exaggerated faces. While the more obvious signs of depression might be easy to spot, depression and anxiety come in many forms, and the mechanisms linking speech patterns and disorders are still not well understood.

"I think technology like this is risky for a couple of reasons. One is that it industrializes mental health in a way that it probably shouldn't be -- understanding and caring for humans is complex and difficult, and that's why there are such deep issues of trust and care and training involved in becoming a mental health professional," Cook told VentureBeat via email. "Proponents might suggest we just use this as a guide for therapists, an assistant of sorts, but in reality there are far more ways this could be used badly -- from automating the diagnosis of mental health problems to allowing the technology to seep into classrooms, workplaces, courtrooms, and police stations. ... Like all machine learning technology, [voice-analyzing tools] give us a veneer of technological authority, where in reality this is a delicate and complicated subject that machine learning is unlikely to understand the nuances of."

There's also the potential for bias. As University of Washington AI researcher Os Keyes notes, voices cover a broad range of characteristics, including those of people with disabilities and those who speak in non-English languages, accents, and dialects such as African American Vernacular English (AAVE). A native French speaker taking a test in English, for example, might pause or pronounce a word with some uncertainty, which could be misconstrued by an AI system as a disease marker. Winterlight hit a snag following the publication of its initial research in the Journal of Alzheimer's Disease in 2016 after it found its voice-analyzing technology only worked for English speakers of a particular Canadian dialect. (The startup recruited participants for the study in Ontario.)

"Voices are, well, different; people speak in different idiomatic forms, people present socially in different ways, and these aren't randomly distributed. Instead, they're often (speaking generally, here) strongly associated with particular groups," Keyes told VentureBeat via email. "Take for example the white-coded 'valley' accents, or AAVE, or the different vocal patterns and intonations of autistic people. People of colour, disabled people, women -- we're talking about people already subject to discrimination and dismissal in medicine, and in wider society."

Depression-detecting voice startups have mixed track records, broadly speaking. Launched from a merger of Israeli tech companies Beyond Verbal and Healthymize, Vocalis largely pivoted to COVID-19 biomarker research in partnership with the Mayo Clinic. Winterlight Labs, which announced a collaboration with Johnson & Johnson in 2019 to develop a biomarker for Alzheimer's, is still in the process of conducting clinical trials with Genentech, Pear Therapeutics, and other partners. Sonde Health -- which also has ongoing trials, including for Parkinson's -- has only completed early tests of the depression-detecting algorithms it licensed from MIT's Lincoln Laboratory.

To date, none of the companies' systems have received full approval from the U.S. Food and Drug Administration (FDA).

Ellipsis' solution is unique, Mondal claims, in that it combines acoustic (e.g., tones, pitches, and pauses) and semantic (relating to words) algorithms trained on "industry-standardized" assessment tools. The algorithms were initially fed millions of conversations from "non-depressed" people and mined them for pitch, cadence, enunciation, and other features. Data scientists at Ellipsis then added conversations, data from mental health questionnaires, and clinical information from depressed patients to "teach" the algorithms to identify the ostensible vocal hallmarks of depression.

"We leverage a diverse dataset to ensure our algorithms are not biased and can be deployed globally ... Our models can generalize well to new populations with differing demographics, varying accents, and levels of speaking abilities [and] are robust enough to support real-time [applications] across different populations with no baseline required," Mondal said. "One of our institutional review board (IRB)-approved studies is currently in phase two and involves monitoring patients in depression clinics. Early results show our depression and anxiety vital scores closely match the clinician's assessment ... We [also] have 9 IRB proposals in process with institutions such as Mayo Clinic, Penn State University, and Hartford Healthcare."

Keyes characterized Ellipsis' approach to bias in its algorithms as "worrisome" and out of touch. "They talk a big game about being concerned about bias, and being rigorously vetted academically, but I find one paper about bias -- this one -- and when you read beyond the abstract, it has some pretty gnarly findings," Keyes said. "For starters, although they sell it as showing age isn't a factor in accuracy, their test is only right 62% of the time when it comes to African American true negatives, and 53% of the time with Caribbean people. In other words: 40% of the time, they will misclassify a Black person as being depressed or anxious, when they're not. This is incredibly worrying, which might be why they buried it on the last page, because diagnoses often carry stigma around with them and are used as excuses to discriminate and disempower people."
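The arithmetic behind Keyes' "40% of the time" figure can be checked with a short sketch. It assumes that "right 62% of the time when it comes to ... true negatives" refers to the true negative rate (specificity), meaning the share of people without depression or anxiety whom the test correctly clears. Under that reading, the false positive rate is simply the remainder. The function and variable names below are illustrative and do not come from Ellipsis or the paper Keyes cites.

```python
def false_positive_rate(true_negative_rate: float) -> float:
    """Share of non-depressed people wrongly flagged, given the
    true negative rate (specificity) as a fraction between 0 and 1."""
    if not 0.0 <= true_negative_rate <= 1.0:
        raise ValueError("true_negative_rate must be between 0 and 1")
    return 1.0 - true_negative_rate


# Figures as quoted by Keyes; the group labels follow the quote.
quoted_true_negative_rates = {
    "African American": 0.62,
    "Caribbean": 0.53,
}

# Round to two decimals to avoid floating-point noise in the output.
false_positive_rates = {
    group: round(false_positive_rate(tnr), 2)
    for group, tnr in quoted_true_negative_rates.items()
}

for group, fpr in false_positive_rates.items():
    print(f"{group}: about {fpr:.0%} of non-depressed people flagged")
```

Read this way, the quoted rates imply roughly 38% and 47% of non-depressed people in the two groups would be flagged, which is consistent with Keyes' rounded "40%."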

Mondal admits Ellipsis' platform can't yet legally be considered a diagnostic tool -- only a clinical decision support tool. "Ellipsis intends to follow FDA guidance for medical AI with the intended plan for FDA regulatory approval of its technology for measuring the level of severity for clinical depression and anxiety," he said. "A foundation will be established to allow [us to] scale into the global market."

Of course, even if the FDA does eventually approve technology like that from Ellipsis, it might not address the risks around the tools' possible misuse. In a study published in Nature Medicine, a team at Stanford found that almost all of the AI-powered devices approved by the FDA between January 2015 and December 2020 underwent only retrospective studies at the time of their submission. The coauthors argue that prospective studies are necessary because in-the-field usage of a device can deviate from its intended use. For example, a prospective study might reveal that clinicians are misusing a device for diagnosis as opposed to decision support, leading to potentially worse health outcomes.

"The best-case scenario for [Ellipsis'] software is: They will turn a profit on individuals' unhappiness, everywhere. The worst case is: They will turn a profit on giving employers and doctors additional reasons to mistreat people already marginalised in both health care and workplaces," Keyes said. "I want to believe that people truly committed to making the world a better place can do better than this. What that might look like is, at a bare minimum, rigorously inquiring into the problem they're trying to solve, into the risks of treating doctors as a neutral baseline for discrimination, given the prevalence of medical racism, into what happens after diagnosis, and into what it means to treat depression as a site for stock payouts."
