
Diarization, the process of partitioning a speech sample into distinct, homogeneous segments according to who said what and when, doesn’t come as easily to machines as it does to humans, and training a machine learning algorithm to perform it is tougher than it sounds. A robust diarization system must be able to attribute speech segments to new individuals that it hasn’t previously encountered.

But Google’s AI research division has made promising progress toward a performant model. In a new paper (“Fully Supervised Speaker Diarization”) and accompanying blog post, researchers describe a new artificial intelligence (AI) system that “makes use of supervised speaker labels in a more effective manner.”

The core algorithms are available in open source on GitHub. The paper’s authors claim they achieve an online diarization error rate (DER) low enough for real-time applications: 7.6 percent on the NIST SRE 2000 CALLHOME benchmark, compared to 8.8 percent DER from Google’s previous method.
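To put the two figures side by side, here is a minimal Python sketch. It assumes the commonly used definition of DER (missed speech, false-alarm speech, and speaker-confusion time divided by total reference speech time), which the article itself does not spell out. The function names and the sample durations are hypothetical; only the 7.6 and 8.8 percent figures come from the reported results.

```python
def der(missed, false_alarm, confusion, total_speech):
    """Diarization error rate as a fraction of reference speech time.

    Assumes the common definition: all three error durations summed,
    then divided by the total duration of reference speech (seconds).
    """
    return (missed + false_alarm + confusion) / total_speech


def relative_reduction(old, new):
    """Fractional improvement of `new` over `old`."""
    return (old - new) / old


# Hypothetical durations, for illustration only: 6 s of errors in 60 s of speech.
example_der = der(missed=2.0, false_alarm=1.0, confusion=3.0, total_speech=60.0)

# Reported figures: 8.8 percent (previous method) vs. 7.6 percent (new system).
improvement_pct = round(100 * relative_reduction(8.8, 7.6), 1)
print(improvement_pct)
```

Under that definition, the drop from 8.8 to 7.6 percent works out to a relative reduction of roughly 13.6 percent.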


Above: Speaker diarization on streaming audio, with different colors in the bottom axis indicating different speakers.

Image Credit: Google

The Google researchers’ new approach models speakers’ embeddings (i.e., mathematical representations of a speaker’s voice characteristics) with a recurrent neural network (RNN), a type of machine learning model that can use its internal state to process sequences of inputs. Each speaker starts with its own RNN instance, which keeps updating its state as new embeddings arrive. Because the instances share the same parameters, the system can learn high-level knowledge shared across speakers and utterances.
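The idea of one RNN instance per speaker with shared parameters can be sketched in a few lines of Python. This is an illustrative toy, not the researchers’ implementation: it assumes a plain tanh RNN cell, random weights, random stand-in embeddings, and speaker labels that are already known. The names `SharedRNN` and `track_speaker_states` are hypothetical.

```python
import numpy as np


class SharedRNN:
    """Toy tanh RNN cell; a single set of weights is shared by all speakers."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(hidden, dim))
        self.W_h = rng.normal(scale=0.1, size=(hidden, hidden))
        self.hidden = hidden

    def step(self, state, embedding):
        # The next state depends on the previous state and the new embedding.
        return np.tanh(self.W_in @ embedding + self.W_h @ state)


def track_speaker_states(rnn, embeddings, labels):
    """Keep one hidden state per speaker, all updated by the same cell."""
    states = {}
    for emb, spk in zip(embeddings, labels):
        # A speaker seen for the first time starts from a zero state.
        prev = states.get(spk, np.zeros(rnn.hidden))
        states[spk] = rnn.step(prev, emb)
    return states


# Stand-in data: five 4-dimensional embeddings from two speakers (assumed).
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(5, 4))
labels = [0, 0, 1, 0, 1]

rnn = SharedRNN(dim=4, hidden=3)
states = track_speaker_states(rnn, embeddings, labels)
```

Each speaker ends up with a separate state, yet every update runs through the same weights, which is what lets knowledge learned from one speaker carry over to others.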

“Since all components of this system can be learned in a supervised manner, it is preferred over unsupervised systems in scenarios where training data with high quality time-stamped speaker labels are available,” the researchers wrote in the paper. “Our system is fully supervised and is able to learn from examples where time-stamped speaker labels are annotated.”

In future work, the team plans to refine the model so that it can integrate contextual information to perform offline decoding, which they expect will further reduce DER. They also hope to model acoustic features directly, so that the entire speaker diarization system can be trained end-to-end.

