Detecting emotional arousal from the sound of someone’s voice is one thing — startups like Beyond Verbal, Affectiva, and MIT spinout Cogito are applying machine learning to vocal signals to accomplish just that. But there’s an argument to be made that speech alone isn’t enough to diagnose someone with depression, let alone judge its severity.

Enter new research from scientists at the Indian Institute of Technology Patna and the University of Caen Normandy (“The Verbal and Non Verbal Signals of Depression — Combining Acoustics, Text and Visuals for Estimating Depression Level”), which examines how nonverbal and visual signals can measurably improve estimates of depression severity. “The steadily increasing global burden of depression and mental illness acts as an impetus for the development of more advanced, personalized and automatic technologies that aid in its detection,” the paper’s authors wrote. “Depression detection is a challenging problem as many of its symptoms are covert.”

The researchers encoded seven modalities — among them downward head tilt, eye gaze, the duration and intensity of smiles, and self-touches, along with text and acoustic cues — which they fed to a machine learning model that fused them into vectors (i.e., mathematical representations). These fused vectors were then passed on to a second system that predicted the severity of depression on the Patient Health Questionnaire depression scale (PHQ-8), a diagnostic test often employed in large clinical psychology studies.
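For a rough sense of how such a pipeline fits together, here is a minimal sketch of feature-level fusion followed by score regression. The feature dimensions, the random placeholder features and labels, and the Ridge regressor are all illustrative assumptions; the paper’s actual per-modality encoders and fusion model are learned and considerably more sophisticated.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Stand-in feature vectors for each modality (assumed shapes, random values).
rng = np.random.default_rng(0)
n_interviews = 100
acoustic = rng.normal(size=(n_interviews, 32))  # e.g., voice features
text = rng.normal(size=(n_interviews, 32))      # e.g., transcript embeddings
visual = rng.normal(size=(n_interviews, 32))    # e.g., gaze/head-pose features

# Fusion step: concatenate the per-modality vectors into one representation.
fused = np.concatenate([acoustic, text, visual], axis=1)

# Second-stage regressor predicts a PHQ-8 score (0-24) from the fused vector.
phq8_scores = rng.integers(0, 25, size=n_interviews)  # placeholder labels
model = Ridge().fit(fused, phq8_scores)
print(model.predict(fused[:3]))
```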

To train the various systems, the researchers tapped DAIC-WOZ, a depression data set that’s part of a larger corpus — the Distress Analysis Interview Corpus — containing annotated audio snippets, video recordings, and questionnaire responses from 189 clinical interviews supporting the diagnosis of psychological conditions like anxiety, depression, and post-traumatic stress disorder. (They discarded interviews that were incomplete or interrupted.) Each sample contained a wealth of data, including a raw audio file, a file with the coordinates of 68 facial “landmarks” of the interviewee (with time stamps, confidence scores, and detection success flags), two files with the participant’s head pose and eye gaze features, a transcript of the interview, and more.
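To make the sample structure concrete, here is a sketch of how one might parse such a facial-landmark file. The column names (timestamp, confidence, success, x0…y67) are assumptions for illustration; the corpus’s actual file layout may differ.

```python
import csv

def load_landmarks(path):
    """Read a hypothetical per-frame landmark CSV into a list of dicts."""
    frames = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["success"]) != 1:  # skip frames where detection failed
                continue
            # Collect (x, y) coordinates for each of the 68 facial landmarks.
            coords = [(float(row[f"x{i}"]), float(row[f"y{i}"]))
                      for i in range(68)]
            frames.append({"timestamp": float(row["timestamp"]),
                           "confidence": float(row["confidence"]),
                           "landmarks": coords})
    return frames
```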

After several preprocessing steps and model training, the team compared the AI systems using three metrics: root mean squared error (RMSE), mean absolute error (MAE), and explained variance score (EVS). They report that fusing the three modalities — acoustic, text, and visual — gave the “most accurate” estimate of depression level, outperforming the state of the art by 7.17% on RMSE and 8.08% on MAE.
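All three are standard regression metrics and straightforward to compute. The snippet below evaluates a toy set of predicted PHQ-8 scores with scikit-learn; the values are made up for illustration, not drawn from the paper.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_squared_error,
                             explained_variance_score)

# Toy ground-truth and predicted PHQ-8 scores (illustrative only).
y_true = np.array([4, 10, 15, 7, 21])
y_pred = np.array([5, 9, 13, 8, 18])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
mae = mean_absolute_error(y_true, y_pred)           # mean absolute error
evs = explained_variance_score(y_true, y_pred)      # explained variance score

print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  EVS={evs:.2f}")
```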

In the future, they plan to study recent multitask learning architectures and “dig deeper” into novel representations of text data. If their work bears fruit, it’d be a promising development for the more than 300 million people now living with depression — a number that’s sadly on the rise.