London-based Synthesia, the startup that enables enterprises to create professional-grade AI videos, has taken the next leap in upgrading its platform with the launch of “expressive avatars.”

Available starting today, these AI avatars go a step beyond standard digital avatars, adjusting their tone, facial expressions and body language based on the context of the content they deliver. The launch comes just a week after Microsoft showed off VASA, an AI framework that converts human headshots into talking and singing videos, complete with expressions and head movements.

However, unlike VASA, which remains a research effort, the technology behind expressive avatars is shipping today and will help Synthesia’s customers create more realistic AI videos than ever for their target audiences.

Synthesia’s next step in AI videos

The company was founded in 2017 by a team of AI researchers and entrepreneurs, including some from Stanford and Cambridge universities. Synthesia has built an end-to-end platform for creating custom AI voices and avatars (users can also pick from existing ones) and pairing them with pre-written or AI-generated scripts to produce studio-quality AI videos.

The offering has drawn significant adoption at the enterprise level, with more than 200,000 people using the digital avatars to create more than 18 million videos. However, the avatars offered by Synthesia, or anyone else, have carried one major gap: sentiment understanding. Unlike a human video presenter, digital avatars could not change their tone, expressions or gestures to match the script; these aspects had to be predefined.

This is now changing with the launch of expressive avatars.

As Synthesia explains, the new AI avatars can understand the context and sentiment conveyed in a piece of text and adjust their tone and expressions as they deliver the speech. The company claims they can already display a range of emotions, with subtle adjustments in expression, blinking and even eye gaze to match the speech. Imagine the avatar smiling and laughing when discussing something joyful, or speaking slowly with longer pauses for something sad or somber.

“With these new avatars, we’re not just creating digital renders; we’re introducing digital actors. This technology brings a level of sophistication and realism to digital avatars that blur the line between the virtual and the real,” Jon Starck, the CTO of the company, wrote in a blog post.

To achieve this level of sentiment prediction and realism, Synthesia is using EXPRESS-1, a deep learning model trained on several hours of text paired with video showing how that text is spoken in the real world.

“EXPRESS-1 predicts every movement and facial expression in real-time, aligning seamlessly with the timings, intonations and emphasis of spoken language. This results in performances that are astonishingly natural and human-like,” Starck added. The new avatars also bring more natural lip-sync and voices across different languages.

What are the implications of expressive avatars?

While digital avatars that can emote and speak like humans could easily be abused to trick people and cause individual or societal harm, Synthesia says it is working aggressively to ensure positive, enterprise-centric use cases, especially around communications and knowledge sharing.

For instance, the company says healthcare companies could use the new technology to create more empathetic videos for their patients or marketing teams could use it to convey excitement and optimism in a video discussing a new product.

To ensure safety, the company said it has updated its usage policies to restrict the type of content enterprise users can create on the platform, and is also investing in early detection of bad-faith actors as well as content credentials technologies such as C2PA.

Currently, the 300-person company counts more than 55,000 businesses as customers, including half of the Fortune 100. One of those customers is the video calling platform Zoom, which claims it has been able to create sales and training videos 90% faster with Synthesia.