Google proposes new metrics for evaluating AI-generated audio and video quality

What's the best way to measure the quality of media generated from whole cloth by AI models? It's not easy. One of the most popular metrics for images is the Fréchet Inception Distance (FID), which takes photos from both the target distribution and the model being evaluated and uses an AI object recognition system to capture important features and suss out similarities. But although several metrics for synthesized audio and video have been proposed, none has yet been widely adopted.

That's why researchers hailing from Google are throwing their hats into the ring with what they call the Fréchet Audio Distance (FAD) and Fréchet Video Distance (FVD), which measure the holistic quality of synthesized audio and video, respectively. The researchers claim that unlike peak signal-to-noise ratio, the structural similarity index, or other metrics that have been proposed, FVD looks at videos in their entirety. As for FAD, they say it's reference-free and can be used on any type of audio, in contrast to metrics like source-to-distortion ratio (SDR) that require a time-aligned ground truth signal.

"Access to robust metrics for evaluation of generative models is crucial for measuring (and making) progress in the fields of audio and video understanding, but currently no such metrics exist," wrote software engineers Kevin Kilgour and Thomas Unterthiner in a blog post. "Clearly, some [generated] videos shown below look more realistic than others, but can the differences between them be quantified?"

As it turns out: Yes. In an FAD evaluation, the separation between the distributions of two sets of audio samples -- generated and real -- is evaluated. As the magnitude of distortions increases, the overlap between the distributions correspondingly decreases, indicating that the synthetic samples are relatively low in quality.
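
For readers who want a feel for the arithmetic, here is a minimal sketch of a Fréchet distance between two sets of feature vectors. It assumes the usual formulation behind FID-style metrics: fit a Gaussian to each set and compare their means and covariances. The article doesn't spell out the formula or the embedding model, so the function name `frechet_distance` and the random arrays standing in for audio embeddings are illustrative assumptions, not the researchers' implementation.

```python
import numpy as np


def frechet_distance(real, generated):
    """Frechet distance between Gaussians fitted to two sets of embeddings.

    Both inputs are arrays of shape (num_samples, num_features).
    Assumed formula: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * sqrt(S_r S_g)).
    """
    mu_r, mu_g = real.mean(axis=0), generated.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(generated, rowvar=False)

    # Tr(sqrt(S_r S_g)) equals the sum of the square roots of the
    # eigenvalues of S_r S_g, which are real and non-negative in theory.
    eigenvalues = np.linalg.eigvals(cov_r @ cov_g).real
    trace_sqrt = np.sqrt(np.clip(eigenvalues, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in for embeddings of real audio clips.
    real = rng.normal(size=(500, 8))
    noise = rng.normal(size=real.shape)
    # Stronger distortion should push the two distributions further apart.
    for strength in (0.5, 1.0, 2.0):
        distorted = real + strength * noise
        print(strength, frechet_distance(real, distorted))
```

In this toy setup the distance grows as the distortion strength grows, which mirrors the behavior described above: less overlap between the distributions, higher score, lower quality.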

To evaluate how closely FAD and FVD track human judgement, Kilgour, Unterthiner, and colleagues performed a large-scale study involving human evaluators. Here, the evaluators were tasked with examining 10,000 video pairs and 69,000 5-second audio clips. For the FAD, specifically, they were asked to compare the effect of two different distortions on the same audio segment, and both the pair of distortions that they compared and the order in which they appeared were randomized. The collected set of pairwise evaluations was then ranked using a model that estimates a worth value for each parameter configuration.
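
How might pairwise judgements turn into a single "worth" per configuration? The article doesn't name the ranking model, so the sketch below assumes a Bradley-Terry model, a common choice for pairwise comparisons, fitted with a simple iterative update. The function name `bradley_terry_worths` and the win counts are made up for illustration and are not the study's data.

```python
import numpy as np


def bradley_terry_worths(wins, n_iter=200):
    """Estimate Bradley-Terry worth values from a matrix of pairwise wins.

    wins[i, j] is how many times item i was preferred over item j.
    Assumes every item wins at least once. Returns worths summing to 1.
    """
    wins = np.asarray(wins, dtype=float)
    n_items = wins.shape[0]
    comparisons = wins + wins.T        # times each pair was compared
    total_wins = wins.sum(axis=1)
    worth = np.full(n_items, 1.0 / n_items)

    for _ in range(n_iter):
        # Minorization-maximization update for the Bradley-Terry likelihood.
        denom = comparisons / (worth[:, None] + worth[None, :])
        np.fill_diagonal(denom, 0.0)
        worth = total_wins / denom.sum(axis=1)
        worth /= worth.sum()
    return worth


if __name__ == "__main__":
    # Hypothetical counts for three distortion settings, 10 comparisons per pair.
    wins = [[0, 8, 9],
            [2, 0, 7],
            [1, 3, 0]]
    print(bradley_terry_worths(wins))
```

A higher worth means evaluators preferred that configuration more often; the resulting worths can then be compared against the FAD scores for the same configurations.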

The team asserts that a comparison of the worth values to the FAD demonstrates that the FAD correlates "quite well" with human judgement.

"We are currently making great strides in generative [AI] models," said Kilgour and Unterthiner. "FAD and FVD will help us [keep] this progress measurable and will hopefully lead us to improve our models for audio and video generation."
