Perhaps you’ve read about AI capable of producing humanlike speech or generating images of people that are difficult to distinguish from real-life photographs. More often than not, these systems build upon generative adversarial networks (GANs), which are two-part AI models consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples. This unique arrangement enables GANs to achieve impressive feats of media synthesis, from composing melodies and swapping sheep for giraffes to hallucinating footage of ice skaters and soccer players. That same prowess is why GANs have been used to produce problematic content like deepfakes, media in which a person in existing footage is replaced with someone else’s likeness.
The evolution of GANs — which Facebook AI research director Yann LeCun has called the most interesting idea of the decade — is somewhat long and winding, and very much continues to this day. They have their deficiencies, but GANs remain one of the most versatile neural network architectures in use today.
History of GANs
The idea of pitting two algorithms against each other originated with Arthur Samuel, a prominent researcher in the field of computer science who’s credited with popularizing the term “machine learning.” While at IBM, he devised a checkers game, the Samuel Checkers-playing Program, that was among the first to successfully self-learn, in part by estimating the chance of each side’s victory at a given position.
But if Samuel is the grandfather of GANs, Ian Goodfellow, former Google Brain research scientist and director of machine learning at Apple’s Special Projects Group, might be their father. In a seminal 2014 research paper simply titled “Generative Adversarial Nets,” Goodfellow and colleagues describe the first working implementation of a generative model based on adversarial networks.
Goodfellow has often stated that he was inspired by noise-contrastive estimation, a way of learning a data distribution by comparing it against a defined noise distribution (i.e., a mathematical function representing corrupted or distorted data). Noise-contrastive estimation uses the same loss functions as GANs — in other words, the same measure of performance with respect to a model’s ability to anticipate expected outcomes.
Of course, Goodfellow wasn’t the only one to pursue an adversarial AI model design. Dalle Molle Institute for Artificial Intelligence Research co-director Juergen Schmidhuber advocated predictability minimization, a technique that models distributions through an encoder that maximizes the objective function (the function that specifies the problem to be solved by the system) minimized by a predictor. It adopts what’s known as a minimax decision rule, where the possible loss for a worst case (maximum loss) scenario is minimized as much as possible.
And this is the paradigm upon which GANs are built.
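For reference, the 2014 paper expresses this minimax idea as a two-player game between the discriminator D and the generator G. In the notation below, D(x) is the discriminator’s estimated probability that x is real, G(z) is the generator’s output for a noise sample z, and p_data and p_z are the data and noise distributions:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to push V up by scoring real examples near 1 and generated ones near 0, while the generator tries to push it down. The paper shows that the game reaches its global optimum when the generator’s distribution matches the data distribution, at which point D(x) = 1/2 everywhere.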
Again, GANs consist of two parts: a generator and a discriminator. The generator produces synthetic examples (e.g., images) from random noise sampled from a distribution. These are fed, along with real examples from a training data set, to the discriminator, which attempts to distinguish between the two. Both the generator and discriminator improve in their respective abilities until the discriminator is unable to tell the real examples from the synthesized examples with better than the 50% accuracy expected of chance.
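To make the “no better than chance” endpoint concrete, here is a minimal Python sketch of the discriminator’s cross-entropy loss. It assumes the discriminator outputs a probability that each example is real. The function name and the sample scores are illustrative, not taken from any particular library or paper.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss: real examples are labeled 1, generated ones 0.

    d_real and d_fake hold the discriminator's estimated probability of
    "real" for a batch of real and a batch of generated examples.
    """
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Made-up scores for a discriminator that still tells the two apart easily.
confident = discriminator_loss(d_real=[0.95, 0.90], d_fake=[0.05, 0.10])

# Scores for a discriminator reduced to guessing: 0.5 for everything.
at_chance = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident, at_chance)
```

When every score is 0.5, the loss works out to log 4 (about 1.386), which is the value the discriminator is stuck with once the generated examples are indistinguishable from the real ones.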
GANs train in an unsupervised fashion, meaning that they infer the patterns within data sets without reference to known, labeled, or annotated outcomes. Interestingly, the discriminator’s work informs that of the generator — every time the discriminator correctly identifies a synthesized work, it tells the generator how to tweak its output so that it might be more realistic in the future.
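The way the discriminator “tells” the generator how to adjust is through gradients. The sketch below is a deliberately tiny illustration under stated assumptions: samples are single numbers, the discriminator is a hypothetical logistic model with fixed weights `w` and `c`, and the generator is trained with the commonly used loss -log D(x). Real GANs backpropagate the same signal through deep networks.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator(x, w, c):
    """Hypothetical 1-D discriminator: probability that sample x is real."""
    return sigmoid(w * x + c)

def generator_loss_grad(x, w, c):
    """Gradient of the generator loss -log D(x) with respect to the sample x."""
    return -(1.0 - discriminator(x, w, c)) * w

# Assumed discriminator weights: it regards larger values of x as more "real".
w, c = 2.0, -4.0

x = 0.0                                   # a generated sample
before = discriminator(x, w, c)           # low score: easily spotted as fake

# One gradient step moves the sample in the direction the discriminator
# considers more realistic.
x_new = x - 0.5 * generator_loss_grad(x, w, c)
after = discriminator(x_new, w, c)

print(before, after)
```

In a full model the step is applied to the generator’s parameters rather than to the sample itself, but the direction of improvement comes from the discriminator in the same way.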
In practice, GANs suffer from a number of shortcomings owing to their architecture. The simultaneous training of generator and discriminator models is inherently unstable. Sometimes the parameters — the configuration values internal to the models — oscillate or destabilize, which isn’t surprising given that after every parameter update, the nature of the optimization problem being solved changes. Alternatively, the generator collapses, and it begins to produce data samples that are largely homogeneous in appearance.
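One rough way to spot the collapse described above is to check how many of the data set’s known clusters, or modes, the generator’s samples land near. The following is a minimal sketch for 1-D samples. The `mode_coverage` helper, the mode locations, and the sample values are all hypothetical and chosen only for illustration.

```python
def mode_coverage(samples, modes, tolerance=0.5):
    """Fraction of known modes that at least one sample falls close to."""
    covered = set()
    for x in samples:
        nearest = min(modes, key=lambda m: abs(m - x))
        if abs(nearest - x) <= tolerance:
            covered.add(nearest)
    return len(covered) / len(modes)

modes = [0, 2, 4, 6]                       # assumed cluster centers in the data

healthy = [0.1, 1.9, 4.2, 5.8, 0.0, 2.1]   # varied output
collapsed = [3.9, 4.0, 4.1, 4.05]          # nearly identical output

print(mode_coverage(healthy, modes))       # 1.0
print(mode_coverage(collapsed, modes))     # 0.25
```

A collapsed generator can still fool the discriminator on each individual sample, which is why a diversity check of this kind looks at the batch as a whole.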
The generator and discriminator also run the risk of overpowering each other. If the generator becomes too accurate, it’ll exploit weaknesses in the discriminator that lead to undesirable results, whereas if the discriminator becomes too accurate, it’ll impede the generator’s progress toward convergence.
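A common explanation for the second failure is vanishing gradients. Assuming the generator minimizes log(1 - D(G(z))), as in the original minimax formulation, a discriminator that confidently rejects every generated sample leaves the generator with almost no gradient to learn from. The sketch below compares that loss with the widely used alternative -log D(G(z)), taking gradients with respect to the discriminator’s pre-sigmoid score (the logit). The function names are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(logit):
    """d/d(logit) of log(1 - sigmoid(logit)), the minimax generator loss."""
    return -sigmoid(logit)

def non_saturating_grad(logit):
    """d/d(logit) of -log(sigmoid(logit)), the common alternative loss."""
    return -(1.0 - sigmoid(logit))

# A very confident discriminator assigns generated samples a strongly
# negative logit, i.e. a probability of "real" close to zero.
logit = -8.0
print(saturating_grad(logit))       # close to 0: almost no learning signal
print(non_saturating_grad(logit))   # close to -1: signal is preserved
```

This is one reason practitioners often swap the generator’s loss rather than simply weakening the discriminator.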
A lack of training data also threatens to impede GANs’ progress in the semantic realm, which in this context refers to the relationships among objects. Today’s best GANs struggle to reconcile the difference between palming and holding an object, for example — a differentiation most humans make in seconds.
But as Hanlin Tang, senior director of Intel’s AI laboratory, explained to VentureBeat in a phone interview, emerging techniques get around these limitations. One entails building multiple discriminators into a model and fine-tuning them on specific data. Another involves feeding discriminators dense embedding representations, or numerical representations of data, so that they have more information from which to draw.
“There [aren’t] that many well-curated data sets to start … applying GANs to,” Tang said. “GANs just follow where the data sets are going.”
On the subject of compute, Youssef Mroueh, a research staff member in the IBM multi-modal algorithms and engines group, is working with colleagues to develop lightweight models dubbed “small GANs” that reduce training time and memory usage. The bulk of their research is concentrated in the MIT-IBM Watson AI Lab, a joint AI research effort between the Massachusetts Institute of Technology and IBM.
“[It’s a] challenging business question: How can we change [the] modeling without all the computation and hassle?” Mroueh said. “That’s what we’re working toward.”
Image and video synthesis
GANs are perhaps best known for their contributions to image synthesis.
StyleGAN, a model Nvidia developed, has generated high-resolution head shots of fictional people by learning attributes like facial pose, freckles, and hair. A newly released version — StyleGAN 2 — makes improvements with respect to both architecture and training methods, redefining the state of the art in terms of perceived quality.
In June 2019, Microsoft researchers detailed ObjGAN, a novel GAN that could understand captions, sketch layouts, and refine the details based on the wording. The coauthors of a related study proposed a system — StoryGAN — that synthesizes storyboards from paragraphs.
Such models have made their way into production. Startup Vue.ai’s GAN susses out clothing characteristics and learns to produce realistic poses, skin colors, and other features. From snapshots of apparel, it can generate model images in every size up to five times faster than a traditional photo shoot.
Elsewhere, GANs have been applied to the problems of super-resolution (image upsampling) and pose estimation (object transformation). Tang says one of his teams used GANs to train a model to upscale 200-by-200-pixel satellite imagery to 1,000 by 1,000 pixels, and to produce images that appear as though they were captured from alternate angles.
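To put the upscaling figures in perspective, going from 200 by 200 to 1,000 by 1,000 pixels is a factor of 5 per side, or 25 times as many pixels, so most of the output has to be inferred. The sketch below is not a GAN. It shows the naive nearest-neighbor baseline, which simply repeats each pixel, as a point of comparison for what a learned model has to improve on. The `upscale_nearest` helper and the synthetic image are assumptions for illustration.

```python
def upscale_nearest(image, factor):
    """Nearest-neighbor upscaling: repeat every pixel `factor` times per axis."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]

# A synthetic 200 x 200 grayscale "image" stored as a list of rows.
small = [[(r + c) % 256 for c in range(200)] for r in range(200)]

factor = 1000 // 200                 # 5x per side
large = upscale_nearest(small, factor)

print(len(large), len(large[0]))     # 1000 1000
print(factor * factor)               # 25 output pixels per input pixel
```

The result is blocky because no new detail is added. A super-resolution GAN instead trains its generator to fill in plausible detail, with the discriminator judging whether the output looks like a genuine high-resolution image.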
Scientists at Carnegie Mellon last year demoed Recycle-GAN, a data-driven approach for transferring the content of one video or photo to another. When trained on footage of human subjects, the GAN generated clips that captured subtle expressions like dimples and lines that formed when subjects smiled and moved their mouths.
More recently, researchers at Seoul-based Hyperconnect published MarioNETte, which synthesizes a reenacted face animated by a person’s movement while preserving the face’s appearance.
On the object synthesis side of the equation, Google and MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) developed a GAN that can generate images of 3D models with realistic lighting and reflections and enables shape and texture editing, as well as viewpoint shifts.
Predicting future events from only a few video frames — a task once considered impossible — is nearly within grasp thanks to state-of-the-art approaches involving GANs and novel data sets.
One of the newest papers on the subject from DeepMind details recent advances in the budding field of AI clip generation. Thanks to “computationally efficient” components and techniques and a new custom-tailored data set, researchers say their best-performing model — Dual Video Discriminator GAN (DVD-GAN) — can generate coherent 256 x 256-pixel videos of “notable fidelity” up to 48 frames in length.
In a twist on the video synthesis formula, Cambridge Consultants last year demoed a model called DeepRay that invents video frames to mitigate distortion caused by rain, dirt, smoke, and other debris.
GANs are capable of more than generating images and video footage. When trained on the right data sets, they’re able to produce de novo works of art.
Researchers at the Indian Institute of Technology Hyderabad and the Sri Sathya Sai Institute of Higher Learning devised a GAN, dubbed SkeGAN, that generates stroke-based vector sketches of cats, firetrucks, mosquitoes, and yoga poses.
Scientists at Maastricht University in the Netherlands created a GAN that produces logos from one of 12 different colors.
Victor Dibia, a human-computer interaction researcher and Carnegie Mellon graduate, trained a GAN to synthesize African tribal masks.
Meanwhile, a team at the University of Edinburgh’s Institute for Perception and Institute for Astronomy designed a model that generates images of fictional galaxies that closely follow the distributions of real galaxies.
In March during its GPU Technology Conference (GTC) in San Jose, California, Nvidia took the wraps off of GauGAN, a generative adversarial AI system that lets users create lifelike landscape images that never existed. GauGAN — whose name comes from post-Impressionist painter Paul Gauguin — improves upon Nvidia’s Pix2PixHD system introduced last year, which was similarly capable of rendering synthetic worlds but left artifacts in its images. The machine learning model underpinning GauGAN was trained on more than one million images from Flickr, imbuing it with an understanding of the relationships among over 180 objects including snow, trees, water, flowers, bushes, hills, and mountains. In practice, trees next to water have reflections, for instance, and the type of precipitation changes depending on the season depicted.
GANs are architecturally well-suited to generating media, and that includes music.
In a paper published in August, researchers hailing from the National Institute of Informatics in Tokyo describe a system that’s able to generate “lyrics-conditioned” melodies from learned relationships between syllables and notes.
Not to be outdone, in December, Amazon Web Services detailed DeepComposer, a cloud-based service that taps a GAN to fill in compositional gaps in songs.
“For a long time, [GANs research] has been about improving the training instabilities whatever the modality is — text, images, sentences, et cetera. Engineering is one thing, but it’s also [about] coming up with [the right] architecture,” said Mroueh. “It’s a combination of lots of things.”
Google and Imperial College London researchers recently set out to create a GAN-based text-to-speech system capable of matching (or besting) state-of-the-art methods. Their proposed system — GAN-TTS — consists of a neural network that learned to produce raw audio by training on a corpus of speech with 567 pieces of encoded phonetic, duration, and pitch data. To enable the model to generate sentences of arbitrary length, the coauthors sampled 44 hours’ worth of two-second snippets together with the corresponding linguistic features computed for five-millisecond snippets. An ensemble of 10 discriminators — some of which assess linguistic conditioning, while others assess general realism — attempt to distinguish between real and synthetic speech.
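The ensemble idea can be sketched in a few lines of Python. This is not the GAN-TTS implementation: the discriminators here are hypothetical stand-in functions that return a fixed probability of “real,” and the loss is a plain -log D average. The sketch only shows how conditional discriminators (which see the linguistic features) and unconditional ones (which see audio alone) can contribute to a single training signal.

```python
import math

# Two-second snippets with features computed every five milliseconds give
# 400 feature frames per snippet.
FRAMES_PER_SNIPPET = 2000 // 5

def ensemble_generator_loss(discriminators, audio, features):
    """Average -log D over an ensemble of discriminators."""
    losses = []
    for disc in discriminators:
        if disc["conditional"]:
            score = disc["score"](audio, features)   # checks audio against text
        else:
            score = disc["score"](audio)             # checks realism only
        losses.append(-math.log(score))
    return sum(losses) / len(losses)

# Hypothetical stand-ins for trained discriminators.
ensemble = [
    {"conditional": False, "score": lambda audio: 0.5},
    {"conditional": True, "score": lambda audio, features: 0.25},
]

audio = [0.0] * 16                                   # placeholder waveform
features = [[0.0] * 567 for _ in range(FRAMES_PER_SNIPPET)]

loss = ensemble_generator_loss(ensemble, audio, features)
print(FRAMES_PER_SNIPPET, loss)
```

Averaging means the generator cannot satisfy one judge, such as general realism, while ignoring another, such as whether the audio matches the intended words.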
In the medical field, GANs have been used to produce data on which other AI models — in some cases, other GANs — might train and to invent treatments for rare diseases that to date haven’t received much attention.
In April, researchers at Imperial College London, the University of Augsburg, and the Technical University of Munich sought to synthesize data to fill in gaps in real data with a model dubbed Snore-GAN. In a similar vein, researchers from Nvidia, the Mayo Clinic, and the MGH and BWH Center for Clinical Data Science proposed a model that generates synthetic magnetic resonance images (MRIs) of brains with cancerous tumors.
Baltimore-based Insilico Medicine pioneered the use of GANs in molecular structure creation for diseases with a known ligand (a complex biomolecule) but no target (a protein associated with a disease process). Its team of researchers is actively working on drug discovery programs in cancer, dermatological diseases, fibrosis, Parkinson’s, Alzheimer’s, ALS, diabetes, sarcopenia, and aging.
The field of robotics has a lot to gain from GANs, as it turns out.
A tuned discriminator can determine whether a machine’s trajectory has been drawn from a distribution of human demonstrations or from synthesized examples. In that way, it’s able to train agents to complete tasks accurately, even when it has access only to the robot’s positional information. (Normally, training robot-directing AI requires both positional and action data. The latter indicates which motors moved over time.)
“The idea of using adversarial loss for training agent trajectories is not new, but what’s new is allowing it to work with a lot less data,” Tang said. “The trick to applying these adversarial learning approaches is figuring out which inputs the discriminator has access to — what information is available to avoid being tricked [by the discriminator] … [In state-of-the-art approaches], discriminators need access to [positional] data alone, allowing us to train with expert demonstrations where all we have are the state data.”
Tang says this enables the training of much more robust models than was previously possible — models that require only about two dozen human demonstrations. “If you reduce the amount of data that the discriminator has access to, you’re reducing the complexity of the data set that you have to provide to the model. These types of adversarial learning methods actually work pretty well in low-data regimes,” he added.
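The state-only setup Tang describes can be illustrated with a toy example. In the sketch below, a small logistic discriminator sees only pairs of positions (where the robot was and where it ended up), never motor commands, and its output is turned into a reward for the agent. Everything here is an assumption made for illustration: the 1-D positions, the feature choice, the training settings, and the reward form -log(1 - D), which is one common choice in adversarial imitation learning.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def features(state, next_state):
    """Positional information only: the change in position, plus a bias term."""
    return [next_state - state, 1.0]

def predict(w, state, next_state):
    x = features(state, next_state)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train_discriminator(expert, agent, epochs=200, lr=0.1):
    """Logistic regression: expert transitions are labeled 1, agent ones 0."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for label, transitions in ((1.0, expert), (0.0, agent)):
            for s, s_next in transitions:
                x = features(s, s_next)
                p = predict(w, s, s_next)
                # Gradient of the cross-entropy loss w.r.t. w is (p - label) * x.
                w = [wi - lr * (p - label) * xi for wi, xi in zip(w, x)]
    return w

def imitation_reward(w, state, next_state):
    """Higher when the discriminator believes the transition is expert-like."""
    return -math.log(1.0 - predict(w, state, next_state))

# Toy data: the human demonstrations always move one step to the right.
expert = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
agent = [(0.0, -1.0), (1.0, 1.0), (2.0, 1.0)]

w = train_discriminator(expert, agent)
print(imitation_reward(w, 5.0, 6.0), imitation_reward(w, 5.0, 4.0))
```

An agent trained to maximize this reward is pushed toward movements that resemble the demonstrations, even though no demonstration ever recorded which motors were used.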
GANs’ ability to generate convincing photos and videos of people makes them ripe targets for abuse. Already, malicious actors have used models to generate fake celebrity pornography.
But preliminary research suggests GANs could root out deepfakes just as effectively as they produce them. A paper published on the preprint server arXiv.org in March describes spamGAN, which learns from a limited corpus of annotated and unannotated data. In experiments, the researchers say that spamGAN outperformed existing spam detection techniques with limited labeled data, achieving accuracy of between 71% and 86% when trained on as little as 10% of labeled data.
What might the future hold with respect to GANs? Despite the leaps and bounds brought by this past decade of research, Tang cautions that it’s still early days.
“GANs are still [missing] very fine-grained control,” he said. “[That’s] a big challenge.”
For his part, Mroueh believes that GAN-generated content will become increasingly difficult to distinguish from real content.
“My feeling is that the field will improve,” he said. “Comparing image generation in 2014 to today, I wouldn’t have expected the quality to become that good. If the progress continues like this, [GANs] will remain a very important research project.”