Researchers at Udacity develop AI that can generate lecture videos from audio narration

Producing content for Massive Open Online Course (MOOC) platforms like Coursera and edX might be academically rewarding (and potentially lucrative), but it's time-consuming -- particularly where videos are involved. Professional-level lecture clips require not only a veritable studio's worth of equipment, but also significant resources to transfer, edit, and upload footage of each lesson.

That's why research scientists formerly at Udacity, an online learning platform with over 150 courses, are investigating a machine learning framework that automatically generates lecture videos from audio narration alone. They claim in a preprint paper ("LumièreNet: Lecture Video Synthesis from Audio") on arXiv.org that their AI system -- LumièreNet -- can synthesize footage of any length by directly mapping between audio and corresponding visuals.

"In current video production pipeline, an AI machinery which semi (or fully) automates lecture video production at scale would be highly valuable to enable agile video content development (rather than reshooting each new video)," wrote the paper's coauthors. "To [this] end, we propose a new method to synthesize lecture videos from any length of audio narration: ... A simple, modular, and fully neural network-based [AI] which produces an instructor's full pose lecture video given the audio narration input, which has not been addressed before from deep learning perspective, as far as we know."

The researchers' model has a pose estimation component that synthesizes body figure images from video frames extracted from a training data set, chiefly by detecting and localizing major body points to create detailed surface-based human body representations. A second module in the model -- a bidirectional long short-term memory (BLSTM) network, which processes a sequence both forward and backward so that each output reflects the inputs that come before and after it -- takes audio features as input and attempts to learn the relationship between them and visual elements.
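To make the bidirectional idea concrete, here is a minimal sketch in Python. It is not the researchers' code: a plain tanh recurrent cell stands in for the LSTM cell, each frame carries a single made-up audio feature, and the function names and weights are hypothetical. It only illustrates how a backward pass lets the output for an early frame depend on audio that comes later.

```python
import math

def recurrent_pass(features, w_in, w_rec):
    """Run a scalar tanh recurrent cell over a sequence.

    This simple cell is a stand-in for an LSTM cell (an assumption made
    for brevity). Returns the hidden state after each step.
    """
    hidden = 0.0
    states = []
    for x in features:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

def bidirectional_pass(features, fwd=(0.8, 0.5), bwd=(0.6, 0.4)):
    """Pair a forward-in-time state with a backward-in-time state per frame.

    The weights in `fwd` and `bwd` are arbitrary illustrative values,
    not trained parameters.
    """
    forward = recurrent_pass(features, *fwd)
    # Run over the reversed sequence, then flip back so index t lines up
    # with frame t again.
    backward = recurrent_pass(features[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

if __name__ == "__main__":
    original = bidirectional_pass([0.1, 0.2, 0.3])
    changed = bidirectional_pass([0.1, 0.2, 0.9])  # only the last frame differs
    # The forward state at frame 0 is unaffected by the later change,
    # while the backward state at frame 0 is not.
    print(original[0][0] == changed[0][0])
    print(original[0][1] != changed[0][1])
```

In a real system, the paired states would feed a learned output layer that predicts a pose representation for each frame; that step is omitted here.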

To test LumièreNet, the researchers filmed an instructor's lecture video for around eight hours at Udacity's in-house studio. This yielded roughly four hours of video and two narrations for training and validation. The researchers report that the trained AI system produces "convincing" clips with smooth body gestures and realistic hair, but note that its creations likely won't fool most observers. Because the pose estimator can't capture fine details like eye motion, lips, hair, and clothing, synthesized lecturers rarely blink and tend to move their mouths unnaturally. Worse, their eyes sometimes look in different directions and their hands always appear oddly blurry.

The team posits that the addition of "face keypoints" (i.e., fine details) might lead to better synthesis, and they note that -- fortunately -- their system's modular design allows each component to be trained and improved independently.

"[M]any future directions are feasible to explore," wrote the researchers. "Even though our approach is developed with primary intents to support agile video content development, which is crucial in current online MOOC courses, we acknowledge there could be potential misuse of the technologies ... We hope that our results will catalyze new developments of deep learning technologies for commercial video content production."
