DeepMind wants to enable neural networks to emulate algorithms to get the best of both worlds, and it’s using Google Maps as a testbed.
Classical algorithms are what have enabled software to eat the world, but the data they work with does not always reflect the real world. Deep learning is what powers some of the most iconic AI applications today, but deep learning models need retraining to be applied in domains they were not originally designed for.
DeepMind is trying to combine deep learning and algorithms, creating the one algorithm to rule them all: a deep learning model that can learn how to emulate any algorithm, generating an algorithm-equivalent model that can work with real-world data.
DeepMind has made headlines for some iconic feats in AI. After developing AlphaGo, a program that became the world champion at the game of Go in a five-game match after beating a human professional Go player, and AlphaFold, a solution to a 50-year-old grand challenge in biology, DeepMind has set its sights on another grand challenge: bridging deep learning, an AI technique, with classical computer science.
The birth of neural algorithmic reasoning
Charles Blundell and Petar Veličković both hold senior research positions at DeepMind. They share a background in classical computer science and a passion for applied innovation. When Veličković met Blundell at DeepMind, a line of research known as Neural Algorithmic Reasoning (NAR), was born, after the homonymous position paper recently published by the duo.
The key thesis is that algorithms possess fundamentally different qualities when compared to deep learning methods — something Blundell and Veličković elaborated upon in their introduction of NAR. This suggests that if deep learning methods were better able to mimic algorithms, then generalization of the sort seen with algorithms would become possible with deep learning.
Like all well-grounded research, NAR has a pedigree that goes back to the roots of the fields it touches upon, and branches out to collaborations with other researchers. Unlike much pie-in-the-sky research, NAR has some early results and applications to show.
We recently sat down to discuss the first principles and foundations of NAR with Veličković and Blundell, to be joined as well by MILA researcher Andreea Deac, who expanded on specifics, applications, and future directions. Areas of interest include the processing of graph-shaped data and pathfinding.
Pathfinding: There’s an algorithm for that
Deac interned at DeepMind and became interested in graph representation learning through the lens of drug discovery. Graph representation learning is an area Veličković is a leading expert in, and he believes it’s a great tool for processing graph-shaped data.
“If you squint hard enough, any kind of data can be fit into a graph representation. Images can be seen as graphs of pixels connected by proximity. Text can be seen as a sequence of objects linked together. More generally, things that truly come from nature that aren’t engineered to fit within a frame or within a sequence like humans would do it, are actually quite naturally represented as graph structures,” said Veličković.
Another real-world problem that lends itself well to graphs — and a standard one for DeepMind, which, like Google, is part of Alphabet — is pathfinding. In 2020, Google Maps was the most downloaded map and navigation app in the U.S. and is used by millions of people every day. One of its killer features, pathfinding, is powered by none other than DeepMind.
The popular app now showcases an approach that could revolutionize AI and software as the world knows them. Google Maps, features a real-world road network that assists in predicting travel times. Veličković noted DeepMind has also worked on a Google Maps application that applies graph networks to predict travel times. This is now serving queries in Google Maps worldwide, and the details are laid out in a recent publication.
Veličković said that although one of the most iconic graph algorithms, Dijkstra’s algorithm, could in theory help compute shortest paths or even estimate travel times, that’s not applicable in reality. To do that, you would have to take all the complexity of the real world — roadblocks, weather changes, traffic flows, bottlenecks, and whatnot — and convert that into an ‘abstractified’ graph of nodes and edges with some weights which correspond to the travel time.
This exemplifies some of the issues with algorithms. As Veličković put it, algorithms are very happy to give you perfect answers, regardless of whether or not the inputs themselves are perfect. So it’s going to give you the shortest path. But who, we asked Veličković, guarantees that this graph is an accurate representation of the real world scenario? He said:
“The algorithm might give you a great solution, but there’s really no way of saying whether this human-devised heuristic is actually the best way of looking at it, and that’s the first point of attack. Historically, we found that whenever there’s a lot of human feature engineering and lots of raw data that we need to work with for that human feature engineering, this is where deep learning shines. The initial success stories of deep learning were exactly in replacing handcrafted heuristics for, say, image recognition with deep neural networks.”
This is the starting point for NAR: replacing the human that maps the real world raw data to a graph input with a neural network. The network will map the complexity of the raw data to some ‘abstractified’ graph inputs that can then be used to run the algorithm over.
Veličković noted there is a good line of research that works exactly in this setting. This research even figured out a way to propagate gradients through the algorithm so you can actually get good neural network optimization in the setting and apply the algorithm.
But, he went on to add, there are a few limitations with the setting, even if you’re able to propagate gradients through it. First of all, the algorithm may not be everything you need to compute the final solution. Some parts of the problem may require shortest path solutions, but maybe at some point you also need to run a flow algorithm and you are not sure how to combine results in the best way, because the problem is very raw and noisy:
“Forcing your representations to rely on the output of this one algorithm might be too limiting. Dijkstra requires one scalar value per every edge in the graph, which means that you’re compressing all that amazing complexity of the real world into just one number for every road segment.
That’s potentially problematic because, if you don’t have enough data to estimate that scalar correctly, you’re doomed. The algorithm is not going to give you correct solutions once again, because you just haven’t seen enough raw data to estimate that scalar correctly.”
Breaking the algorithmic bottleneck
This is what Veličković and Blundell refer to as the algorithmic bottleneck. It happens because we’re placing all of our bets on this one number for every edge, which is a very low dimensional representation of the problem. The way to break bottlenecks, the DeepMind duo suggests, is by using vectors, rather than numbers. In other words, staying highly dimensional.
Deep neural networks are deriving a lot of their success from staying highly dimensional and applying high-dimensional regularization techniques. This means that even if you predict badly some parts of the internal vector in the neural network, the other parts of that vector can still step in and compensate. The problem is that algorithms were designed to run on low dimensional representations, not high dimensional inputs.
The key idea in NAR is to replace the algorithm with a neural network, typically a graph neural network (GNN). The GNN takes a high dimensional input, and does one step of reasoning to produce another high dimensional input. Then once enough steps of that reasoning neural network are completed, final outputs are predicted based on this.
Of course, this high dimensional GNN neural executor needs to actually imitate the algorithm at hand. For this reason, NAR also includes an abstract pipeline where that neural network is pre-trained, typically using lots of synthetic data.
Blundell and Veličković’s previous work, such as neural execution of graph algorithms, pointer graph networks, and persistent message passing, dealt with that part: how to get reliable and robust neural networks that simulate algorithms in an abstract space.
What Blundell and Veličković had not done was to examine whether this can actually be used in a real-world problem. This is where Deac’s work comes in. The approach is based on a combination of neural execution of graph algorithms, pointer graph networks, and persistent message passing. All of this, Deac noted, was plugged into a reinforcement learning (RL) framework. In RL, problems are framed as states and actions, and the goal is to estimate how good each state is. This is what has driven RL applications in games, such as AlphaStar or AlphaGo, Deac noted:
“We define some value over the states. If you think of a chess game, this value is – how good is this move? And what we want is to make plans, make decisions so that this value is maximized. If we know the environment dynamics, [then how do we know] how we can move from one state to another, and what sorts of rewards do we get from a state for a specific action?”
The way to do this is via a so-called value iteration algorithm, which can provide estimates of how good each state is. The algorithm iterates on the value estimates and gradually improves them, until they converge to the ground truth values. This is the algorithm the NAR team is trying to imitate, using synthetic data and simple graphs to estimate meta-values.
The difference is that the team wanted to move from using single numerical values to using multiple-value high-dimensional vectors. Initially, the output might not be that good, but you don’t need that much data to get off the ground because the algorithm can deal with imperfect estimates when you don’t know the environment’s dynamics.
The key, as Deac explained, is iteration, which leads to convergence. In this case, a shortest path algorithm was of interest. The algorithm can be learned and then it will be plugged in. But the idea is that the RL framework should be usable for any algorithm, or combination of algorithms. That is an important step for machine learning, which sometimes struggles as it is re-applied from domain to domain
High performance in low data regimes
The approach Deac and the DeepMind duo worked on is called eXecuted Latent Value Iteration Network, or XLVIN. First, the value iteration algorithm is learned in the abstract space. Then a RL agent is plugged in. The team compared it against an almost identical architecture, with one crucial difference: rather than using their algorithmic component, that architecture predicts the values and runs value iteration on them directly.
Veličković said that this second agent actually managed in some cases to catch up, when fed with a lot more interactions with different environments. But in very low data regimes, the team’s RL architecture does better. This is important, as for environments such as Atari, a classical benchmark in AI also used in XLVIN, more data means more simulation budget, which is not always feasible.
XLVIN empirically validated a strong theoretical link between dynamic programing algorithms and the computations of a GNN. This means, Veličković said, that most polynomial time heuristics can be interpreted as dynamic programming, which in turn means that graph networks might be the right inductive bias for this kind of computation.
Previous theoretical work described a best case scenario, where there are settings of weights for a GNN that’ll make it nicely behave as that dynamic programming algorithm. But it doesn’t necessarily tell you what’s the best way to get there and how to make it work with the specific data you have or extrapolate, and these are issues that are important for algorithms, Veličković noted.
This led the duo to extend their work to models like pointer graph networks and persistent message passing, which moved one step further. They imitate the iterative computation of dynamic programming, but they also try to include some data structures which are crucial to how algorithms operate nowadays and incorporate some aspects of persistent reasoning.
So, rather than just being able to support a simple data structure on top of an existing set of nodes, is it possible to create additional memory additional nodes? Many algorithms rely on initializing additional memory aside from the memory required to store their inputs. So, DeepMind’s research developed models that enable more and more alignment with computation while still following that same GNN blueprint.
RL, Blundell noted, is basically a graph algorithm. It’s a dynamic programming update and it’s closely related to a shortest path algorithm — it’s like an online shortest path algorithm. It’s not surprising that if you’re trying to find the shortest possible path in a graph, and then you want to represent your problem as a graph, that maybe there’s a good relationship there.
Dynamic programming is a great way to think about solving any kind of problem, Blundell went on to add. You can’t always do it, but when you can, it works really, really well. And that’s potentially one of the deep connections between graph algorithms, reinforcement learning and graph networks.
One algorithm to rule them all
In their most recently published work, Reasoning-Modulated Representations, Blundell and Veličković show that they are able to use the algorithmic reasoning blueprints to support unsupervised learning and self-supervised learning. Often, Veličković said, unsupervised learning is all about, “Hey, let’s take a massive quantity of data and try to extract the most meaningful properties out of it.”
But that’s not always all the information you have. You might have some knowledge about how your data came to be. For example, if you’re working with estimating some representations from physics simulations, like a bunch of balls bouncing around or N-body systems, you don’t just see a bunch of snapshots of that system. You know that it has to obey certain laws of physics.
We think the neural algorithmic reasoning blueprint is an excellent way to take those laws of physics, package them up into a neural network that you can splice-in as part of your unsupervised architecture and, as a result, get better representations. And we’re starting to see some really nice results on a variety of environments that this blueprint is actually promising.”
As far as the future of this research goes, DeepMind’s duo wants to extend Deac’s work, and apply it as widely as possible in reinforcement learning, which is an area of really high interest for DeepMind and beyond. There’s algorithms ‘left, right and center inside the reinforcement learning pipeline,’ as Veličković put it.
Blundell on his part reiterated that there aren’t that many algorithms out there. So the question is, can we learn all of them? If you can have a single network that is able to execute any one of the algorithms that you know already, then if you get that network to plug those algorithms together, you start to form really quite complicated processing pipelines, or programs. And if it’s all done with gradients going through it, then you start to learn programs:
“If you really take this to its limit, then you start to really learn algorithms that learn. That becomes very interesting because one of the limitations in deep learning is the algorithms that we have to learn. There hasn’t been that much change in the best optimizers we use or how we update the weights in a neural network during training for quite a long time.
There’s been a little research over different architectures and so forth. But they haven’t always found the next breakthrough. The question is, is this a different way of looking at that, where we can start to find new learning algorithms?
Learning algorithms are just algorithms, and maybe what’s missing from them is this whole basis that we have for other algorithms that we’re using. So we need a slightly more universal algorithm executor to use as the basis for better methods for ML.”
Deac also noted she would like to pursue a network which tries multiple algorithms — all algorithms, if possible. She and some of her MILA colleagues have taken some steps in that direction. They are doing some transfer learning, chaining a couple of algorithms together and seeing if they can transfer between one algorithm, making it easier to learn a separate related algorithm, she said.
Or in other words, as Veličković framed what everyone seems to view as the holy grail of this research: “One algorithm to rule them all.”
VentureBeatVentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact. Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:
- up-to-date information on the subjects of interest to you
- our newsletters
- gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
- networking features, and more