Roadrunner, the documentary film about Anthony Bourdain, contains a scene in which the epicure utters words from letters he wrote to the artist David Choe. This wouldn’t be unusual in and of itself if it weren’t for the fact that Bourdain never read the letters aloud. Rather, the clips were generated by a company that director Morgan Neville hired to model Bourdain’s voice.
Synthetic media, or likenesses and voices generated by AI, has nearly crossed the uncanny valley. Earlier this month, Sonantic, a U.K.-based firm that clones voices for actors and studios, released a recording of an AI-generated voice modeled after the actor Val Kilmer. The recording mimics Kilmer’s natural voice, which he lost after a throat cancer surgery in 2015, and closely mirrors the actor’s intonation.
The rise of synthetic media has prompted concerns about deepfakes, or AI-generated media used for fraud and other criminal activities. Ethical questions abound: the voice in Roadrunner was created without Bourdain’s permission. But if used responsibly, synthetic media has the potential to cut costs while enabling actors to focus on more interesting work.
To create synthetic voices and videos, companies use a combination of AI and machine learning techniques including generative adversarial networks (GANs). GANs are two-part machine learning models consisting of a generator that creates samples and a discriminator that attempts to differentiate between these samples and real-world samples. Top-performing GANs can create realistic portraits of people who don’t exist, or even snapshots of fictional apartment buildings.
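To make the generator-versus-discriminator idea concrete, here is a minimal sketch of a toy GAN in plain Python. It is an illustration under stated assumptions, not the method used by any company named in this article: the "media" is a single number drawn from a bell curve centered at 4, the generator and discriminator are each a two-parameter function, and every name in it (train_toy_gan, sigmoid, a, b, w, c) is hypothetical.

```python
import math
import random


def sigmoid(t: float) -> float:
    """Numerically stable logistic function, used as the discriminator output."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps: int = 2000, lr: float = 0.01, seed: int = 0) -> dict:
    """Train a 1-D toy GAN. Assumed setup: real samples ~ N(4, 1), noise z ~ N(0, 1)."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters: G(z) = a*z + b
    w, c = 0.1, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)

    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)  # one real-world sample
        z = rng.gauss(0.0, 1.0)       # random noise fed to the generator
        x_fake = a * z + b            # one generated sample

        # Discriminator step: gradient ascent on
        # log D(x_real) + log(1 - D(x_fake)), i.e. learn to tell the two apart.
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w += lr * ((1.0 - d_real) * x_real - d_fake * x_fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient ascent on log D(G(z)), the common
        # "non-saturating" variant, i.e. learn to fool the discriminator.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (1.0 - d_fake) * w   # derivative of log D(x) with respect to x
        a += lr * grad_x * z
        b += lr * grad_x

    return {"a": a, "b": b, "w": w, "c": c}


if __name__ == "__main__":
    print(train_toy_gan())
```

Running the script prints the four learned parameters. If training goes well, the generator's offset b should drift toward the center of the real data, though toy GANs like this one can oscillate rather than settle. Production systems replace these two-parameter functions with deep neural networks trained on audio or images.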
AI needs only a few seconds to a few minutes of sample audio to imitate a person’s prosody. Baidu’s latest Deep Voice service can clone a voice with just 3.7 seconds of audio samples, and WellSaid Labs, which launched as a research project at the Allen Institute for Artificial Intelligence, can create a 10-second audio file from roughly 4 seconds of speech.
As R&D refines the technology and it becomes more scalable, media synthesis is morphing from a novelty into an expanding market. Companies like Amazon, Microsoft, Papercup, Deepdub, and Synthesia have created projects such as ad campaigns featuring an AI-generated Snoop Dogg and David Beckham’s voice translated into nearly a dozen languages. They’ve also partnered with news organizations including Sky News, Discovery, and Reuters to develop prototypes for automated news and sports reports.
Synthetic media platforms provide different capabilities depending on their focus. For example, Synthesia allows customers to pick from a range of “voice avatars” and create voiceovers straight from a script, with one or many voices based on style, gender, and production type. On the other hand, Amazon pairs customers with its engineers to build AI-generated voices representing certain personas.
Startups like Alethea AI, Genies, and Possible Reality fall into a separate category of synthetic media generation. From only a few images, their tools can generate high-fidelity, expressive, and photorealistic avatars. Possible Reality is leveraging its technology to turn pictures of people into 3D avatars inside video games and virtual worlds. And Genies is generating cartoon-like 2D avatars of celebrities for social media.
Challenges and opportunities
As pandemic restrictions make conventional filming tricky and risky, the benefits of AI-generated video have been magnified. According to Dogtown Media, an enterprise education campaign under normal circumstances might require as many as 20 different scripts to address a worldwide workforce, with each video costing tens of thousands of dollars. Synthetic media can pare the expenses down to a lump sum of around $100,000.
Brand voices such as Progressive’s Flo, played by comedian Stephanie Courtney, are often tasked with recording phone trees for interactive voice response systems or e-learning scripts for corporate training videos. Synthesis could boost actors’ productivity by cutting down on ancillary recordings and pick-ups (recording sessions to address mistakes, changes, or additions in voiceover scripts) while freeing them up to pursue creative work and enabling them to collect residuals.
Moreover, synthetic media platforms give creators, product developers, and brands the ability to power experiences with a wide range of voice styles, accents, and languages. Resemble CEO Zohaib Ahmed envisions game developers creating voices from actors during preproduction for scratching and iteration, as well as voices tailored to fit a character’s personality and soundalikes for voice assistants and apps.
There’s also the translation aspect. Because quality dubbing is prohibitively expensive (estimates for a 90-minute program range from $30,000 to $100,000), most of the world’s videos have been recorded in a single language. (In the first week of 2019, 33% of popular YouTube videos were in English.) Statista found that 59% of U.S. adults said they would prefer to watch foreign language films dubbed into English rather than see the original feature with subtitles, highlighting the demand for synthetic media translation technologies.
Experts have expressed concern that synthetic media tools could be co-opted to create deepfakes — the fear being that these fakes might be used to do things like sway opinion during an election or implicate a person in a crime. Already, deepfakes have been abused to generate pornographic material of actors and to defraud a major energy producer.
The fight against deepfakes is likely to remain challenging, especially as media generation techniques continue to improve. Earlier this year, deepfake footage of Tom Cruise posted to an unverified TikTok account racked up 11 million views on the app and millions more on other platforms. And when scanned with several of the best publicly available deepfake detection tools, the clips avoided detection, according to Vice.
Some companies have taken steps to prevent the misuse of their platforms. For example, Synthesia says it vets its customers and their scripts and requires formal consent from a person before it will synthesize their appearance, and the company refuses to touch political content. WellSaid also doesn’t create voice avatars without actors’ permission and subscribes to the “Hippocratic Oath for AI” proposed by Microsoft executives Brad Smith and Harry Shum. As for Resemble, it released an open source tool that detects deepfakes by deriving high-level representations of voice samples and predicting whether they’re real or generated.
Founders like Ahmed think that the pros outweigh the potential cons. As he told VentureBeat in a recent interview, “We set out to build a product that helps creatives get over the hurdle of crafting audio content. With more audio content being produced year after year — smart speakers … AirPods, podcasts, audiobooks, and digital characters in virtual and augmented reality — there is a large and growing need for fast and accurate voice cloning.”