Generative AI models have a propensity for learning complex data distributions, which is why they’re great at producing human-like speech and convincing images of burgers and faces. But training these models requires lots of labeled data, and depending on the task at hand, the necessary corpora are sometimes in short supply.

The solution might lie in an approach proposed by researchers at Google and ETH Zurich. In a paper published on the preprint server arXiv.org (“High-Fidelity Image Generation With Fewer Labels”), they describe a “semantic extractor” that can pull out features from training data, along with methods of inferring labels for an entire training set from a small subset of labeled images. Together, they say, these self- and semi-supervised techniques can outperform state-of-the-art methods on popular benchmarks like ImageNet.

“In a nutshell, instead of providing hand-annotated ground truth labels for real images to the discriminator, we … provide inferred ones,” the paper’s authors explained.

In one of several unsupervised methods the researchers propose, they first extract a feature representation (a compact numerical description of each image that is learned automatically rather than hand-engineered) from the target training dataset using the aforementioned feature extractor. They then perform cluster analysis, grouping the representations so that those in the same group have more in common with one another than with those in other groups. Lastly, they train a GAN (a two-part neural network consisting of a generator that produces samples and a discriminator that attempts to distinguish the generated samples from real-world samples), using the cluster assignments as inferred labels.
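The cluster-then-label step can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: k-means is chosen here only as a familiar clustering algorithm, the function name `kmeans_pseudo_labels` is hypothetical, and the random "features" stand in for the output of a real feature extractor.

```python
import numpy as np

def kmeans_pseudo_labels(features, num_clusters, num_iters=20, seed=0):
    """Cluster feature vectors with plain k-means; return one cluster id per row."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen, distinct data points.
    centroids = features[rng.choice(len(features), num_clusters, replace=False)]
    for _ in range(num_iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return labels

# Toy stand-in for extracted features: two well-separated blobs (an assumption).
rng = np.random.default_rng(0)
toy_features = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(50, 8)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 8)),
])

# The cluster ids play the role of class labels when training a conditional GAN.
pseudo_labels = kmeans_pseudo_labels(toy_features, num_clusters=2)
```

The cluster ids carry no semantic meaning on their own (cluster 0 is not "firetruck"); they only need to group similar images so the GAN has something to condition on.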

Images generated by AI systems trained with the researchers' techniques.

Above: More sample images generated by the AI systems.

In another approach, dubbed “co-training,” the paper’s authors leverage a combination of unsupervised, semi-supervised, and self-supervised methods to infer label information concurrently with GAN training. During the unsupervised step, they take one of two approaches: completely removing the labels, or assigning random labels to real images. By contrast, in the semi-supervised stage, when labels are available for a subset of the real data, they train a classifier on the feature representation of the discriminator and use that classifier to predict labels for the unlabeled real images.
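The semi-supervised idea of fitting a classifier on the few labeled examples and letting it label the rest can be illustrated as follows. This is a simplified sketch under stated assumptions: the features are synthetic rather than taken from a GAN discriminator, the classifier is a plain softmax linear model fitted once rather than alongside GAN training, and the helper names `fit_linear_classifier` and `predict` are hypothetical.

```python
import numpy as np

def fit_linear_classifier(x, y, num_classes, lr=0.1, steps=200):
    """Fit a softmax linear classifier with full-batch gradient descent."""
    weights = np.zeros((x.shape[1], num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = x @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy loss with respect to the logits.
        grad = (probs - one_hot) / len(x)
        weights -= lr * (x.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(x, weights, bias):
    """Return the most likely class index for each row of x."""
    return (x @ weights + bias).argmax(axis=1)

# Synthetic two-class features standing in for discriminator features (assumption).
rng = np.random.default_rng(0)
centers = np.array([[-3.0] * 4, [3.0] * 4])
true_labels = np.repeat([0, 1], 100)
features = centers[true_labels] + rng.normal(scale=0.5, size=(200, 4))

labeled_idx = np.r_[0:10, 100:110]  # pretend only 10 percent of labels are known
unlabeled_mask = np.ones(200, dtype=bool)
unlabeled_mask[labeled_idx] = False

w, b = fit_linear_classifier(features[labeled_idx], true_labels[labeled_idx], num_classes=2)
inferred = predict(features[unlabeled_mask], w, b)
accuracy = (inferred == true_labels[unlabeled_mask]).mean()
```

On real images the inferred labels will be noisier than in this toy setup.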

To test the techniques’ performance, the researchers tapped ImageNet, a database containing over 1.3 million training images and 50,000 test images, each corresponding to one of 1,000 object classes, and obtained partially labeled datasets by randomly selecting a portion of the samples from each image class (e.g., “firetrucks” or “mountains”). After training every GAN three times on 1,280 cores of a third-generation Google tensor processing unit (TPU) pod using the unsupervised, pre-trained, and co-training approaches, they compared the quality of the outputs with two scoring metrics: Fréchet Inception Distance (FID) and Inception Score (IS).
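Building a partially labeled dataset by keeping labels for a fixed share of each class might look something like the sketch below. The function name `sample_labeled_subset`, the 20 percent fraction, and the tiny label list are illustrative assumptions, not details from the paper.

```python
import random
from collections import defaultdict

def sample_labeled_subset(labels, fraction, seed=0):
    """Return sorted indices of examples whose labels are kept, sampled per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    kept = []
    for indices in by_class.values():
        # Keep the same share of every class, but at least one example.
        count = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, count))
    return sorted(kept)

# Toy label list: 50 "firetruck" images followed by 30 "mountain" images.
labels = ["firetruck"] * 50 + ["mountain"] * 30
kept = sample_labeled_subset(labels, fraction=0.2)
```

Sampling within each class, rather than across the whole dataset, keeps rare classes from losing all of their labels by chance.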

The unsupervised methods weren’t particularly successful: They achieved an FID and IS of around 25 and 20, respectively, compared with the baseline of 8.4 and 75 (lower FID and higher IS indicate better image quality). Pretraining using self-supervision and clustering reduced FID by 10 percent and increased IS by about 10 percent, and the co-trained method obtained an FID of 13.9 and an IS of 49.2. But by far the most successful was self-supervision: It achieved “state-of-the-art” performance with 20 percent labeled data.
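For readers unfamiliar with the metric, FID is the Fréchet distance between two Gaussians fitted to the feature statistics of real and generated images. The sketch below shows that calculation on its own; it assumes the features have already been extracted (in practice by an Inception network, here replaced by random arrays), and the function name `frechet_distance` is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two sets of feature vectors."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Random stand-ins for Inception features (assumption): the "generated" set is
# the "real" set shifted by one unit in each of four dimensions.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
fake_feats = real_feats + 1.0

fid_same = frechet_distance(real_feats, real_feats)     # approximately 0
fid_shifted = frechet_distance(real_feats, fake_feats)  # approximately 4
```

Identical feature sets score roughly zero, and the score grows as the means or covariances drift apart, which is why lower values indicate generated images that more closely match the real ones.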

In the future, the researchers hope to investigate how the techniques might be applied to “larger” and “more diverse” datasets. “There are several important directions for future work,” they wrote, “[but] we believe that this is a great first step towards the ultimate goal of few-shot high-fidelity image synthesis.”