At NeurIPS 2020, researchers proposed faster, more efficient alternatives to backpropagation

In the 1960s, academics including Virginia Polytechnic Institute professor Henry J. Kelley, Stanford University's Arthur E. Bryson, and Stuart Dreyfus at the University of California, Berkeley arrived at the theory of backpropagation. It's an algorithm that would later become widely used to train neural networks, the computing systems loosely inspired by the biological neural networks that constitute animal brains. Backpropagation rose to prominence in the 2010s with the emergence of cheap, powerful computing systems, leading to gains in speech recognition, computer vision, and natural language processing.

Backpropagation generally works well, but it's constrained in that it optimizes AI models for a fixed rather than a moving target. Once the models learn to make predictions from a dataset, they run the risk of forgetting what they learned when given new training data -- a phenomenon known as "catastrophic forgetting." That's why researchers are investigating techniques that move beyond backpropagation toward forms of continuous learning, which don't entail retraining on their entire history of experiences. Experts believe this more humanlike way of learning, which confers the ability to learn new information without forgetting, could lead to significant advances in the field of AI and machine learning.

In early December, dozens of alternatives to traditional backpropagation were proposed during a workshop at the NeurIPS 2020 conference, which took place virtually. Some leveraged hardware like photonic circuits to further bolster the efficiency of backpropagation, while others adopted a more modular, flexible approach to training.

Backpropagation

The simplest form of backpropagation involves computing the gradient of a loss function with respect to the weights of a model. The gradient is the vector of partial derivatives that tells an optimization algorithm, such as gradient descent, how to adjust each weight during training. (A loss function is a method of evaluating how well a specific algorithm models a given dataset.) Neural networks are made up of interconnected neurons through which data moves. Weights control the strength of the signal between two neurons, deciding how much influence data fed into the network will have on the outputs that emerge from it.

Backpropagation is efficient, making it feasible to train multilayer networks containing many neurons while updating the weights to minimize loss. As alluded to earlier, it works by computing the gradient of the loss function with respect to each weight through what's known as the chain rule, computing the gradient one layer at a time and iterating backward from the last layer to avoid redundant calculations.
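To make the chain rule concrete, here is a minimal NumPy sketch of backpropagation through a tiny two-layer network. The network shape, the tanh activation, the squared-error loss, and every name in the code are assumptions chosen for illustration. None of them come from the research discussed in this article.

```python
import numpy as np

def forward(x, W1, W2):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    h = np.tanh(W1 @ x)   # hidden activations
    y = W2 @ h            # network output
    return h, y

def loss(x, target, W1, W2):
    """Squared-error loss L = 0.5 * ||y - target||^2 (an assumed choice)."""
    _, y = forward(x, W1, W2)
    return 0.5 * np.sum((y - target) ** 2)

def backprop(x, target, W1, W2):
    """Gradients of the loss for both weight matrices, last layer first."""
    h, y = forward(x, W1, W2)
    delta_out = y - target                  # dL/dy at the output layer
    grad_W2 = np.outer(delta_out, h)        # dL/dW2
    # Reuse delta_out instead of recomputing it: this is the saving that
    # iterating backward from the last layer provides.
    delta_hidden = (W2.T @ delta_out) * (1.0 - h ** 2)  # tanh'(a) = 1 - tanh(a)^2
    grad_W1 = np.outer(delta_hidden, x)     # dL/dW1
    return grad_W1, grad_W2

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # one input example
target = rng.normal(size=2)       # its desired output
W1 = rng.normal(size=(4, 3))      # input -> hidden weights
W2 = rng.normal(size=(2, 4))      # hidden -> output weights
grad_W1, grad_W2 = backprop(x, target, W1, W2)
```

Note that the hidden layer's error signal is built from the output layer's error signal, which is why the layers have to be processed in order.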

But for all its advantages, backpropagation has significant limitations. For example, as mathematician Anthony Repetto points out, backpropagation makes it impossible to recognize a "constellation" of a dataset's features. When a computer vision system trained using backpropagation classifies an object in an image -- for example, "horse" -- it can't communicate which features in the image led it to that conclusion. (It's lost this information.) Backpropagation also updates the network layers sequentially, making it difficult to parallelize the training process and leading to longer training times.

Another disadvantage of backpropagation is its tendency to become stuck in local minima of the loss function. Mathematically, the goal in training a model is converging on the global minimum, the point in the loss function where the model has optimized its ability to make predictions. But there often exist local minima -- points that are lower than everything nearby, but not the lowest overall -- that backpropagation finds instead. This isn't always a problem, but it could lead to incorrect predictions on the part of the model.
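A one-dimensional toy example shows how the starting point decides which minimum gradient descent reaches. The function below is a hypothetical stand-in for a loss function, and the learning rate and step count are arbitrary choices.

```python
def f(w):
    """Toy 'loss' with two valleys: a shallow one and a deeper one."""
    return w ** 4 - 3 * w ** 2 + w

def grad_f(w):
    """Derivative of f with respect to w."""
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w, lr=0.01, steps=500):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w

w_right = gradient_descent(2.0)    # settles in the shallow valley, near w = 1.13
w_left = gradient_descent(-2.0)    # settles in the deeper valley, near w = -1.30
```

Both runs stop where the gradient is zero, but only the run that started on the left ends at the lower point. The run that started on the right has no way of knowing a better valley exists.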

Alignment

It was once thought that the weights used for propagating backward through a network had to be the same as the weights used for propagating forward. But a recently discovered method called direct feedback alignment shows that fixed random weights can work nearly as well, because the network effectively learns how to make them useful. In direct feedback alignment, the error at the output is sent straight to every layer through those random weights, so no layer has to wait for the layers above it. This opens the door to parallelizing the backward pass, potentially reducing training time and power consumption by an order of magnitude.
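The following NumPy sketch illustrates the general idea of direct feedback alignment on a toy regression problem. It is an illustration under assumed settings, not the setup from any workshop paper. The data, layer sizes, learning rate, and names such as `dfa_step`, `B1`, and `B2` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 64 samples, 4 inputs, 2 outputs.
X = rng.normal(size=(64, 4))
T = np.tanh(X @ rng.normal(size=(4, 2)))

# Forward weights of a 4-16-16-2 network. These are trained.
W1 = rng.normal(size=(4, 16)) * 0.5
W2 = rng.normal(size=(16, 16)) * 0.25
W3 = np.zeros((16, 2))

# Feedback weights: fixed random matrices, one per hidden layer. Never trained.
B1 = rng.normal(size=(2, 16))
B2 = rng.normal(size=(2, 16))

def forward(X, weights):
    W1, W2, W3 = weights
    h1 = np.tanh(X @ W1)
    h2 = np.tanh(h1 @ W2)
    return h1, h2, h2 @ W3

def mean_loss(Y, T):
    return 0.5 * np.mean(np.sum((Y - T) ** 2, axis=1))

def dfa_step(X, T, weights, feedback, lr=0.02):
    """One full-batch update; modifies the weight arrays in place."""
    W1, W2, W3 = weights
    B1, B2 = feedback
    h1, h2, Y = forward(X, weights)
    E = Y - T                             # output error
    # Each hidden layer receives the output error directly through its own
    # random matrix, so d1 and d2 do not depend on each other.
    d1 = (E @ B1) * (1.0 - h1 ** 2)
    d2 = (E @ B2) * (1.0 - h2 ** 2)
    n = len(X)
    W1 -= lr * (X.T @ d1) / n
    W2 -= lr * (h1.T @ d2) / n
    W3 -= lr * (h2.T @ E) / n             # output layer uses the true gradient

weights = (W1, W2, W3)
initial_loss = mean_loss(forward(X, weights)[2], T)
for _ in range(2000):
    dfa_step(X, T, weights, (B1, B2))
final_loss = mean_loss(forward(X, weights)[2], T)
```

Backpropagation would compute `d1` from `d2` using the transpose of `W2`. Here the two error signals are independent of each other, which is what makes a parallel backward pass possible.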

Indeed, in a paper submitted to the NeurIPS workshop anonymously, the coauthors propose "slot machine" networks where each "reel" -- i.e., connection between neurons -- contains a fixed set of random values. The algorithm "spins" the reels to seek "winning" combinations or selections of random weight values that minimize the given loss. The results show that allocating just a few random values to each connection, like eight values per connection, improves performance over trained baseline models.
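The sketch below shows what selecting among fixed random values can look like. The passage above doesn't spell out the paper's search procedure, so this example swaps in a simple greedy search over a single linear layer, purely for illustration. The data, value range, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8            # candidate values per connection, as in the example above
n_inputs = 5     # number of connections in this toy linear model

# Hypothetical regression data generated by a hidden linear rule.
X = rng.normal(size=(100, n_inputs))
y = X @ rng.normal(size=n_inputs)

# Each row is one "reel": K fixed random values that are never modified.
reels = rng.uniform(-2.0, 2.0, size=(n_inputs, K))
choice = np.zeros(n_inputs, dtype=int)   # index currently selected on each reel

def loss_for(choice):
    """Mean squared error when each connection uses its selected value."""
    w = reels[np.arange(n_inputs), choice]
    return np.mean((X @ w - y) ** 2)

initial_loss = loss_for(choice)
for _ in range(3):                        # a few sweeps over all reels
    for i in range(n_inputs):
        trial_losses = []
        for k in range(K):                # "spin" reel i through its K values
            trial = choice.copy()
            trial[i] = k
            trial_losses.append(loss_for(trial))
        choice[i] = int(np.argmin(trial_losses))
final_loss = loss_for(choice)
```

No weight value is ever changed here. Only the selection changes, and because the current selection is always among the options tried, the loss can't go up from one spin to the next.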

In another paper accepted to the workshop, researchers at LightOn, a startup developing photonic computing hardware, claim that feedback alignment can successfully train a range of state-of-the-art machine learning architectures, with performance close to fine-tuned backpropagation. While the researchers acknowledge that their experiments required "substantial" cloud resources, they say the work provides "new perspectives" that might "favor the application of neural networks in fields previously inaccessible because of computational limits."

But alignment isn't a perfect solution. While it successfully trains models like Transformers, it notoriously fails to train convolutional networks, a dominant form of computer vision model. Moreover, unlike backpropagation, feedback alignment hasn't enjoyed decades of work on topics like adversarial attacks, interpretability, and fairness. The effects of scaled-up alignment remain understudied.

New hardware

Perhaps the most radical alternative to backpropagation proposed so far involves new hardware custom-built for feedback alignment. In a study submitted to the workshop by another team at LightOn, the coauthors describe a photonic accelerator that's ostensibly able to compute random projections with trillions of different variables. They claim that their hardware -- a photonic coprocessor -- is architecture-agnostic and potentially a step toward building scalable systems that don't rely on backpropagation.

Photonic integrated circuits, which are the foundation of LightOn's chip, promise a host of advantages over their electronic counterparts. They require only a limited amount of energy because light produces less heat than electricity does, and they're less susceptible to changes in ambient temperature, electromagnetic fields, and other noise. Latency in photonic designs is reportedly improved up to 10,000 times compared with silicon equivalents at power consumption levels "orders of magnitude" lower. Certain model workloads have also been measured running 100 times faster compared with state-of-the-art electronic chips.

But it's worth noting that LightOn's hardware isn't immune to the limitations of optical processing. Speedy photonic circuits require speedy memory, and then there's the matter of packaging every component -- including lasers, modulators, and optical combiners -- on a tiny chip wafer. Plus, questions remain about what kinds of nonlinear operations, the basic building blocks of models that enable them to make predictions, can be executed in the optical domain.

Distillation

Another, not necessarily mutually exclusive answer to the backpropagation problem involves splitting neural networks into smaller, more manageable pieces. In an anonymously coauthored study, researchers propose divvying up models into subnetworks called neighborhoods that are then trained independently, which comes with the benefits of parallelism and speedy training.

For their part, researchers at the University of Maryland's Department of Computer Science pretrained subnetworks independently before training the entire network. They also employed an attention mechanism between the subnetworks to help identify the most important modality (visual, acoustic, or textual) during ambiguous scenarios, which boosted performance. In this context, "attention" refers to a method that identifies which parts of an input sequence -- for example, words -- are relevant to each output.
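As a rough illustration of attention across modalities, the sketch below scores three feature vectors against a query vector and blends them according to the resulting weights. It is a generic dot-product attention example with made-up inputs, not the University of Maryland team's architecture. The `query` vector stands in for parameters that would normally be learned.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: positive weights that sum to 1."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def fuse_modalities(features, query):
    """Weight each modality by its relevance to the query and blend them.

    features: array of shape (3, d), one row per modality
    query:    array of shape (d,), a hypothetical learned vector
    """
    scores = features @ query / np.sqrt(features.shape[1])  # scaled dot products
    weights = softmax(scores)          # one attention weight per modality
    fused = weights @ features         # weighted sum of the modality features
    return weights, fused

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))     # rows: visual, acoustic, textual
query = rng.normal(size=8)
weights, fused = fuse_modalities(features, query)
```

The modality with the largest weight is the one the model leans on most for this input, which is how an attention layer can favor, say, the textual signal when the audio is ambiguous.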

The University of Maryland researchers say that their approach enables a simple network to achieve performance similar to a complicated architecture. Moreover, they say that it results in significantly reduced training time with tasks like sentiment analysis, emotion recognition, and speaker trait recognition.

New techniques to come

In 2017, Geoffrey Hinton, a researcher at the University of Toronto and Google's AI research division and a winner of the Association for Computing Machinery's Turing Award, told Axios in an interview that he was "deeply suspicious" of backpropagation. "My view is throw it all away and start again," he said. "I don't think that's how the brain works."

Hinton was referring to the fact that with backpropagation, a model must be "told" when it makes an error, meaning it's "supervised" in the sense that it doesn't learn to classify patterns on its own. He and others believe that unsupervised or self-supervised learning, where models look for patterns in a dataset without preexisting labels, is a necessary step toward more powerful AI techniques.

That aside, backpropagation's fundamental limitations continue to motivate the research community to seek out replacements. It's early days, but if these first attempts pan out, the efficiency gains could broaden the accessibility of AI and machine learning among both practitioners and the enterprise.