Nvidia's Vid2Vid Cameo brings 'talking heads' to videoconferences

Nvidia today took the wraps off Vid2Vid Cameo, an AI model that uses generative adversarial networks (GANs) to create realistic "talking-head" videos from a single photo of a person. The company claims that Vid2Vid Cameo, which will soon be available in the Nvidia Video Codec SDK and Nvidia Maxine SDK as "AI Face Codec," achieves state-of-the-art performance thanks in part to a training dataset of 180,000 "high-quality" videos.

"Many people have limited internet bandwidth, but still want to have a smooth video call with friends and family," Nvidia researcher Ming-Yu Liu said in a press release. "In addition to helping them, the underlying technology could also be used to assist the work of animators, photo editors, and game developers."

Vid2Vid Cameo, which was first demonstrated last October, was designed for videoconferencing applications. It requires only a single picture of a person and a video stream dictating how the picture should be animated. The system identifies 20 key points that encode the location of features including the eyes, mouth, and nose, and it automatically extracts these points from the reference picture. The extracted points can be sent to other videoconference participants ahead of time or reused from previous meetings. On the receiver's side, the GAN taps this information to generate a video that mimics the appearance of the original picture.
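To make the idea of sending key points concrete, here is a minimal Python sketch of how 20 points might be serialized for transmission and restored on the receiving end. It is not Nvidia's wire format, which the company has not described here. The two-coordinate (x, y) layout, the 32-bit float encoding, and the function names `pack_keypoints` and `unpack_keypoints` are all assumptions made for illustration.

```python
import struct

NUM_KEYPOINTS = 20     # figure quoted in the article
COORDS_PER_POINT = 2   # assumption: each point is an (x, y) pair


def pack_keypoints(points):
    """Serialize (x, y) tuples into little-endian 32-bit floats (hypothetical format)."""
    if len(points) != NUM_KEYPOINTS:
        raise ValueError(f"expected {NUM_KEYPOINTS} points, got {len(points)}")
    flat = [coord for point in points for coord in point]
    return struct.pack(f"<{len(flat)}f", *flat)


def unpack_keypoints(payload):
    """Restore the list of (x, y) tuples from a payload built by pack_keypoints."""
    count = NUM_KEYPOINTS * COORDS_PER_POINT
    flat = struct.unpack(f"<{count}f", payload)
    return [
        tuple(flat[i:i + COORDS_PER_POINT])
        for i in range(0, count, COORDS_PER_POINT)
    ]


if __name__ == "__main__":
    # Dummy coordinates; a real system would get these from a face landmark detector.
    sample = [(i * 0.5, i * 0.25) for i in range(NUM_KEYPOINTS)]
    payload = pack_keypoints(sample)
    print(len(payload), "bytes per frame of key points")
    print(unpack_keypoints(payload) == sample)
```

Under these assumptions, a full set of key points fits in 160 bytes, which gives a sense of why a compact representation is attractive for low-bandwidth calls.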

Maxine and GANs

Vid2Vid Cameo is an outgrowth of Nvidia's work on Maxine, a platform that provides developers with a suite of GPU-accelerated AI conferencing software to enhance video quality. Nvidia says Maxine "dramatically" reduces how much bandwidth is required for videoconferencing calls by employing GANs including Vid2Vid Cameo. Instead of streaming an entire screen of pixels, the platform analyzes the facial points of each person on a call and then algorithmically reanimates the face in the video on the other side.
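A rough back-of-the-envelope calculation, sketched in Python below, shows why sending facial points instead of pixels could cut bandwidth. Every number here is an illustrative assumption rather than an Nvidia figure: a 1280x720 frame at 3 bytes per pixel, 20 two-coordinate points stored as 32-bit floats, and 30 frames per second. Real video calls send compressed video, not raw pixels, so the actual savings would be far smaller than this raw comparison suggests.

```python
def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame (assumed 8-bit RGB by default)."""
    return width * height * bytes_per_pixel


def keypoint_frame_bytes(num_points=20, coords=2, bytes_per_coord=4):
    """Size of one frame's worth of key points (assumed float32 x/y pairs)."""
    return num_points * coords * bytes_per_coord


def bitrate_kbps(bytes_per_frame, fps):
    """Convert a per-frame payload size into kilobits per second."""
    return bytes_per_frame * 8 * fps / 1000


if __name__ == "__main__":
    pixels = raw_frame_bytes(1280, 720)
    points = keypoint_frame_bytes()
    print("raw pixels:", pixels, "bytes per frame")
    print("key points:", points, "bytes per frame")
    print("key point stream:", bitrate_kbps(points, fps=30), "kbps")
```

The takeaway is the order of magnitude, not the exact ratio: a handful of coordinates per frame is tiny next to a full image, and the receiver's GAN does the work of turning those coordinates back into a face.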

Maxine's other spotlight feature is face alignment, which enables faces to be automatically adjusted so participants appear to be facing each other during a call. Gaze correction helps simulate eye contact, even if the camera isn't aligned with the user's screen. Auto-frame allows the video feed to follow a speaker as they move away from the screen. And developers can let call participants choose their own avatars, with animations automatically driven by their voice and tone.

GANs -- two-part models consisting of a generator that creates samples and a discriminator that attempts to differentiate between these samples and real-world samples -- have demonstrated impressive feats of media synthesis. Top-performing GANs can create realistic portraits of people who don't exist, for instance, or snapshots of fictional apartment buildings.
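The generator-versus-discriminator tug of war can be illustrated with the textbook GAN loss functions, sketched below in plain Python. This is the standard formulation from the research literature, with the commonly used "non-saturating" generator loss, and not necessarily the objective Nvidia used to train Vid2Vid Cameo. The inputs `d_real` and `d_fake` stand for the discriminator's estimated probability that a real sample and a generated sample, respectively, are real.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when the discriminator rates real samples near 1 and fakes near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Low when the discriminator is fooled into rating fakes near 1."""
    return -math.log(d_fake)


if __name__ == "__main__":
    # A discriminator that cannot tell the difference outputs 0.5 for everything.
    print("discriminator loss at chance:", discriminator_loss(0.5, 0.5))
    # The generator's loss falls as its fakes become more convincing.
    print("generator loss, weak fake:", generator_loss(0.1))
    print("generator loss, strong fake:", generator_loss(0.9))
```

During training the two networks are updated in alternation, each trying to lower its own loss, which is what pushes the generator toward increasingly realistic output.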

But while GANs have applications in entertainment and videoconferencing, they've also been co-opted for disinformation and fake accounts. Historically, they've demonstrated bias against certain groups of people as well, particularly those with dark skin. On this last point, Nvidia told VentureBeat in a previous statement that its research team "paid close attention" to "racial, gender, age, and cultural diversity" while developing the AI features in Maxine for videoconferencing applications.
