AI and machine learning are powerful tools for the synthesis of speech. As countless studies have demonstrated, only a few minutes of audio (and, in the case of state-of-the-art models, a few seconds) are required to imitate a subject’s prosody and intonation with precision. Baidu’s latest Deep Voice service can clone a voice with just 3.7 seconds of audio samples, for example, and a recently released implementation of a July research paper makes do with about five seconds.
The field’s rapid progress inspired Zohaib Ahmed, a former Magic Leap lead software engineer fresh off of stints at BlackBerry and Hipmunk, to cofound Ontario-based Resemble AI with Saqib Muhammad. The pair sought to adapt leading machine learning models for speech synthesis to scale, with the goal of building a service that would enable cloning voices from relatively small data sets.
But alongside their voice synthesis product launch, Ahmed and Muhammad launched a tool to detect deepfakes. The two technologies are inextricably linked.
Threat of deepfakes
Ahmed and Muhammad had the foresight to realize that like any tool capable of creating convincing synthetic audio, Resemble’s platform could be abused by malicious actors. Deepfakes — media that replaces a person in an existing recording with someone else’s likeness — are multiplying, according to Amsterdam-based cybersecurity startup Deeptrace. It identified 14,698 deepfake videos on the internet during its most recent tally in June and July, up from 7,964 last December — an 84% increase within only seven months.
The trend is troubling not only because deepfakes might be used to sway public opinion during, say, an election, or to implicate someone in a crime they didn’t commit, but because they’ve already been used to swindle at least one firm out of hundreds of thousands of dollars.
That’s why the Resemble team released an open source tool dubbed Resemblyzer several months ago. It uses AI and machine learning to detect deepfakes by deriving high-level representations of voice samples and predicting whether they’re real or generated. Given an audio file of speech, it creates a vector of 256 values (an embedding) that summarizes the characteristics of the voice, enabling developers to compare the similarity of two voices or work out who’s speaking at any given moment.
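The embedding comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not Resemblyzer’s own code: the vectors are random stand-ins rather than real voice embeddings, cosine similarity is assumed as the comparison measure, and the `same_speaker` helper and its 0.75 threshold are hypothetical choices made for the example.

```python
import math
import random

EMBEDDING_SIZE = 256  # vector length cited in the article


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.75):
    """Hypothetical decision rule: the threshold is an assumed value."""
    return cosine_similarity(a, b) >= threshold


# Stand-in embeddings (assumption): random vectors, not real voice data.
rng = random.Random(0)
speaker_a_take1 = [rng.gauss(0, 1) for _ in range(EMBEDDING_SIZE)]
# A second "recording" of the same voice: the first vector plus small noise.
speaker_a_take2 = [x + rng.gauss(0, 0.1) for x in speaker_a_take1]
# An unrelated voice: an independent random vector.
speaker_b = [rng.gauss(0, 1) for _ in range(EMBEDDING_SIZE)]

print(same_speaker(speaker_a_take1, speaker_a_take2))
print(same_speaker(speaker_a_take1, speaker_b))
```

In practice the three lists would be replaced by embeddings computed from actual recordings, and the threshold would need tuning against labeled examples of matching and non-matching voices.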
“As researchers and entrepreneurs, we are thoughtful about the benefits and/or risks to society of what we are creating,” said Ahmed. “When you’re creating your voice on our platform, we take extreme measures to ensure the ownership of the voice.”
Cloning voices for media
After a soft launch earlier this year, Resemble announced the launch of Resemble Clone. According to CEO Ahmed, it’s meant to target the entertainment industry, with tools designed to optimize generated voices for virtual reality experiences, animated films and television, and audiobooks.
“We set out to build a product that helps creatives get over the hurdle of crafting audio content,” said Ahmed. “With more audio content being produced year after year (smart speakers, AirPods, podcasts, audiobooks, and digital characters in virtual and augmented reality), there is a large and growing need for fast and accurate voice cloning. Resemble AI’s unique focus is to empower creatives, so they can control and produce content without sacrificing quality.”
From an end-user perspective, the Resemble experience is akin to that of Lyrebird, which was acquired by Groupon founder Andrew Mason’s Descript in September. Like Resemble, Lyrebird had users record statements from real-time, dynamically generated prompts, which fed into cloud-hosted algorithmic models used to shape shareable, bespoke digital voice profiles.
Resemble customers needn’t create new recordings, though — existing audio works too, funneled either through a web-based uploader or an API. (Resemble requires three minutes of audio to generate high-quality samples.) And the platform can create fictitious voices with somewhat humanlike emotions and intonations, which can be served to Google’s Dialogflow or any similar natural language understanding engine.
Ahmed envisions game developers creating voices from actors during preproduction for scratching and iteration, or wholly novel voices tailored to fit an avatar or character’s personality. Another potential use case is the creation of soundalikes for intelligent assistants and voice apps, like the John Legend and Samuel L. Jackson voices on Google Assistant and Amazon’s Alexa, respectively.
Resemble’s work isn’t entirely novel. Text-to-speech tech startup iSpeech offers comparable voice cloning tools, as do Modulate, Respeecher, and Bengaluru, India-based DeepSync. But investors like Firstminute’s Clara Lindh Bergendorff, who participated in Resemble AI’s $2 million seed funding round alongside Craft Ventures, AET Fund, and Betaworks, believe its media creation focus sets it apart in a text-to-speech market that some anticipate will be worth $3.03 billion by 2022.
“We’re excited by the idea of Resemble making real-time creation and editing of audio content — which today is a painful bottleneck for creatives across industries — as easy and accessible as editing animated visual content,” she said. “Resemble is also well-positioned to ride wider audio waves, from the growth of audio content consumption and voice applications to growth in audio-first devices.”