How Descript's generative AI makes video editing as easy as updating text

A podcaster steps up to a mic to do a review of a new chicken nugget brand. As he begins talking and recording himself on his laptop, real-time speech-to-text transcribes his comments: “So these nuggets are, um, made from chicken, but they’re made to um, um, um, um, emulate the taste of, like, like, non chicken nuggets.”

That doesn’t sound very professional; on his screen, he strikes through those filler words — and while he’s at it, boosts the podcast’s sound quality before publishing it for his audience.

This is one use case for audio-video editing tool Descript, which today announced a significant product update and a $50 million Series C round led by the OpenAI Startup Fund.

“The whole concept of Descript — editing video like a doc — is only possible because of AI [artificial intelligence],” said Jay LeBoeuf, Descript's head of business and corporate development. “Our aim is to make video a staple of every communicator’s toolkit, alongside docs and slides.”

The growing field of generative AI

Generative AI is a hotly discussed topic of late: Next-gen tools are seeing significant gains in just a short period of time.

Just this week, IMARC Group released a study forecasting the global generative AI market size to exhibit a compound annual growth rate (CAGR) of nearly 20% between 2022 and 2027.

As noted by Gartner distinguished VP analyst, Arun Chandrasekaran, much of the technology progress in generative AI owes its origins to technical advancements in its underpinning foundation models.

Large labs, including OpenAI, Google Brain and Microsoft Research, are deploying vast resources to build these large-scale models, he said, and cloud computing provides a “tremendous avenue” for them to reach developers who want to try them without spending too much money or undergoing complex training.

Thus, use cases are expanding, Chandrasekaran pointed out. For instance, due to advances in prompt engineering, foundation models are capable of generating “original and coherent text in a highly contextualized way,” he said, which lends itself to use cases including writing headlines and paragraphs, creating product descriptions, generating conversational responses and completing text.

Foundation models are also increasingly being used in broader natural language capacities, such as summarization, rewriting, language translation and classification. And, fine-tuned diffusion models such as Dall-E 2 and Stable Diffusion are creating images from text, which can be extremely useful for synthetic data generation for scenarios where real data is scarce or unavailable.

“The image generation capabilities could positively affect design and marketing functions within the enterprise,” Chandrasekaran said, “enabling them to create faster and better visuals for websites, blogs, ads and other content.”

Chandrasekaran forecasted that generative AI use cases where humans act as filters — aiding in content creation — will be adopted far more quickly than automatic content creation. As an augmentation technology to human tasks, he noted, there will be rapid adoption of foundation models in the next few years.

“In the future, we will see more use case-specific models as well as more innovation in the open-source community,” said Chandrasekaran, “which will hopefully enhance the access and transparency of these models.”

A new kind of video editor

When it comes to video and generative AI, Descript calls itself a “new kind of video editor that’s as easy as using a word processor.”

The company’s platform enables users to record themselves and their screen, write, storyboard, collaborate and edit in real time. For instance, they can remove silence and filler words (like the above-mentioned podcaster), overdub and add crossfades, titles, shapes and images.
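
To make the idea of editing audio by editing text concrete, here is a minimal sketch in Python. It is not Descript's implementation; the `Word` type, the `FILLERS` set and the `keep_segments` function are hypothetical names invented for illustration. It assumes a transcript in which each word carries start and end timestamps, drops the filler words, and returns both the cleaned text and the audio spans that would be kept.

```python
from dataclasses import dataclass

# Hypothetical filler list, for illustration only.
FILLERS = {"um", "uh"}


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def keep_segments(words, fillers=FILLERS):
    """Drop filler words; return the cleaned text and merged (start, end) spans to keep."""
    # Compare case-insensitively and ignore trailing commas or periods.
    kept = [w for w in words if w.text.lower().strip(",.") not in fillers]
    segments = []
    for w in kept:
        # Merge with the previous span when the two words are back to back.
        if segments and abs(segments[-1][1] - w.start) < 1e-9:
            segments[-1] = (segments[-1][0], w.end)
        else:
            segments.append((w.start, w.end))
    return " ".join(w.text for w in kept), segments


# Made-up timestamps for a fragment of the podcaster's sentence.
words = [
    Word("made", 0.0, 0.4),
    Word("to", 0.4, 0.6),
    Word("um,", 0.6, 1.1),
    Word("um,", 1.1, 1.5),
    Word("emulate", 1.5, 2.1),
]
text, segments = keep_segments(words)
print(text)
print(segments)
```

A real editor would also need to smooth the cut points, for instance with short crossfades, so the joins are not audible.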

As LeBoeuf explained, the company uses AI to create a text-to-speech voice that sounds like a user, so they can make corrections to recordings by deleting or typing in new words (as one would with a doc). The Studio Sound feature addresses issues around technical problems that creators struggle with, such as poor sound quality caused by an echoey room or bad mic.

“These are very practical use cases that enable a new class of people to edit video for whom it would have been prohibitively time-consuming otherwise,” said LeBoeuf.

Much as they would build a slideshow, creators in any type of workflow can make anything from simple screen recordings to sophisticated narratives with multiple layers of video, b-roll and audio, plus video and audio effects and transitions, LeBoeuf said.

The company has quickly made progress improving its AI features, he said, and has reduced the length of time users need to set up their voice in Overdub, its text-to-speech voice-cloning service.

LeBoeuf pointed out that Descript powers many YouTube and TikTok channels and “9 out of the 10 top podcast publishers,” as well as businesses, including Shopify and HubSpot, that use video for marketing, sales and internal training and collaboration.

“The stage is set for video to take its place alongside text as something everyone uses to create and communicate — the only things holding it back are the tools,” said Andrew Mason, Descript CEO. “With the new Descript, we’ve replaced the drudgery of timeline editing with a tool that’s as familiar as the word processor — so you can make, edit and share video without breaking your creative flow.”

Similarly, one of Descript’s competitors, Runway, recently rolled out a tool that can edit video from written prompts; others in the emerging space include Type Studio and Reduct.

Empowering creation with generative AI

Descript’s enhanced platform now includes more than 30 new visual and AI-powered editing tools.

Brad Lightcap, OpenAI’s COO and manager of the OpenAI Startup Fund, lauded Descript for bridging the divide between idea and creation.

“We started the OpenAI Startup Fund to accelerate the impact that companies building on powerful AI will have on the world,” he said, “and we’re particularly excited about tools that empower people creatively.”

Andreessen Horowitz, Redpoint Ventures, Spark Capital and Daniel Gross also participated in the round, which brings the company’s total funding to $100 million.
