Reinforcement learning: The next great AI tech moving from the lab to the real world

Reinforcement learning (RL) is a powerful type of artificial intelligence technology that can be used to learn strategies to optimally control large, complex systems such as manufacturing plants, traffic control systems (road/train/aircraft), financial portfolios, and robots. It is currently transitioning from research labs to highly impactful, real-world applications. For example, self-driving car companies like Wayve and Waymo are using reinforcement learning to develop the control systems for their cars.

AI systems that are typically used in industry perform pattern recognition to make a prediction. For instance, they may recognize patterns in images to detect faces (face detection), or recognize patterns in sales data to predict a change in demand (demand forecasting), and so on. Reinforcement learning methods, on the other hand, are used to make optimal decisions or take optimal actions in applications where there is a feedback loop. An example where both traditional AI methods and RL may be used, but for different purposes, will make the distinction clearer.

Say we are using AI to help operate a manufacturing plant. Pattern recognition may be used for quality assurance, where the AI system uses images and scans of the finished product to detect any imperfections or flaws. An RL system, on the other hand, would compute and execute the strategy for controlling the manufacturing process itself (by, for example, deciding which lines to run, controlling machines/robots, deciding which product to manufacture, and so on). The RL system will also try to ensure that the strategy is optimal in that it maximizes some metric of interest -- such as the output volume -- while maintaining a certain level of product quality. The problem of computing the optimal control strategy, which RL solves, is very difficult for some subtle reasons (often much more difficult than pattern recognition).

In computing the optimal strategy, or policy in RL parlance, the main challenge an RL algorithm faces is the so-called "temporal credit assignment" problem. That is, the impact of an action (e.g. “run line 1 on Wednesday”) in a given system state (e.g. “current output level of machines, how busy each line is,” etc.) on the overall performance (e.g. “total output volume”) may not be known until much later. To make matters worse, the overall performance also depends on all the actions that are taken after the action being evaluated. Together, this implies that, when a candidate policy is executed for evaluation, it is difficult to know which actions were the good ones and which were the bad ones. In other words, it is very difficult to assign credit to the different actions appropriately. The large number of potential system states in these complex problems further exacerbates the situation via the dreaded "curse of dimensionality." A good way to get an intuition for how an RL system solves all these problems at the same time is by looking at the recent spectacular successes RL has had in the lab.
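To make the credit assignment idea concrete, here is a minimal sketch of one standard device, the discounted return, which spreads a delayed reward back over the actions that preceded it. The discount factor `gamma`, the function name `returns_to_go`, and the five-day reward sequence are assumptions made up for illustration; they do not come from any particular RL system discussed here.

```python
def returns_to_go(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_{t+1} for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the discounted sum of what follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical week at the plant: output is only measured on the last day.
rewards = [0.0, 0.0, 0.0, 0.0, 100.0]
credit = returns_to_go(rewards)
```

Each entry of `credit` is the share of the final outcome attributed to the action taken on that day, with earlier days receiving less. This is only a bookkeeping step: it does not by itself tell the learner whether Monday's action was actually good, which is why many trials are still needed.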

Many of the recent, prominent demonstrations of the power of RL come from applying it to board games and video games. The first RL system to impress the global AI community was able to learn to outplay humans in different Atari games when given as input only the images on screen and the scores received by playing the game. This was created in 2013 by London-based AI research lab DeepMind (now part of Alphabet Inc.). The same lab later created a series of RL systems (or agents), starting with the AlphaGo agent, which were able to defeat the top players in the world in the board game Go. These impressive feats, which occurred between 2015 and 2017, took the world by storm because Go is a very complex game, with millions of fans and players around the world, that requires intricate, long-term strategic thinking involving both the local and global board configurations.

Subsequently, DeepMind and the AI research lab OpenAI released systems for playing the video games StarCraft II and Dota 2, respectively, that can defeat top human players. These games are challenging because they require strategic thinking, resource management, and control and coordination of multiple entities within the game.

All the agents mentioned above were trained by letting the RL algorithm play the games many, many times (e.g. millions or more) and learning which policies work and which do not against different kinds of opponents and players. The large number of trials was possible because these were all games running on a computer. In determining the usefulness of various policies, the RL algorithm often employed a complex mix of ideas. These include hill climbing in policy space, playing against itself, running leagues internally amongst candidate policies, using policies used by humans as a starting point, and properly balancing exploration of the policy space against exploitation of the good policies found so far. Roughly speaking, the large number of trials enabled exploring many different game states that could plausibly be reached, while the complex evaluation methods enabled the AI system to determine which actions are useful in the long term, under plausible plays of the games, in these different states.
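The balance between exploration and exploitation can be illustrated with a far simpler setting than any of these games. The sketch below uses an epsilon-greedy rule on a toy "multi-armed bandit": most of the time it picks the option that looks best so far, and occasionally it tries a random one. This is a textbook technique chosen for illustration, not the method used by any of the agents above; the function name, the reward means, and the noise level are all made-up assumptions.

```python
import random


def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a toy bandit with Gaussian rewards."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n        # how often each option was tried
    estimates = [0.0] * n   # running average reward per option
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: try a random option
        else:
            arm = max(range(n), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)  # noisy feedback
        counts[arm] += 1
        # Incremental update of the average reward for this option.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Hypothetical options whose true average rewards are unknown to the learner.
estimates, counts = epsilon_greedy_bandit([0.0, 1.0, 2.0])
best_arm = max(range(len(counts)), key=lambda a: counts[a])
```

With no exploration (`epsilon=0`) the learner can lock onto whichever option happened to look good first; with too much, it wastes trials on options it already knows are poor. The agents described above face the same trade-off, only over policies rather than three fixed options.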

A key blocker in using these algorithms in the real world is that it is not possible to run millions of trials. Fortunately, a workaround immediately suggests itself: First, create a computer simulation of the application (a manufacturing plant simulation, a market simulation, etc.), then learn the optimal policy in the simulation using RL algorithms, and finally adapt the learned optimal policy to the real world by running it a few times and tweaking some parameters. Famously, in a very compelling 2019 demo, OpenAI showed the effectiveness of this approach by training a robot arm to solve the Rubik’s Cube puzzle one-handed.
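The simulate-then-adapt workflow can be sketched in a few lines under heavy simplification. Here the "policy" is a single number (a line speed), the "plant" is a made-up function whose output peaks at some best speed, and learning is plain hill climbing. The names `make_plant` and `hill_climb`, the speeds, and the trial counts are all hypothetical; real systems, including OpenAI's, are far more elaborate.

```python
import random


def make_plant(best_speed):
    """Return a toy plant whose output peaks at `best_speed` (hypothetical)."""
    def output(speed):
        return -(speed - best_speed) ** 2
    return output


def hill_climb(evaluate, start, trials, step, rng):
    """Keep a random nearby candidate only if it scores strictly better."""
    best, best_score = start, evaluate(start)
    for _ in range(trials):
        candidate = best + rng.uniform(-step, step)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


rng = random.Random(0)
simulator = make_plant(best_speed=5.0)   # cheap to run, slightly wrong
real_plant = make_plant(best_speed=5.6)  # expensive to run, the true system

# Many cheap trials in simulation, then a handful of trials on the real plant.
sim_policy = hill_climb(simulator, start=0.0, trials=2000, step=0.5, rng=rng)
tuned_policy = hill_climb(real_plant, start=sim_policy, trials=10, step=0.3, rng=rng)
```

The point of the design is the trial budget: thousands of evaluations are spent where they are free, and only ten where they are costly. It works here because the simulator's optimum is close to the real one; if the two were far apart, ten real-world trials would not be enough to close the gap.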

For this approach to work, your simulation has to represent the underlying problem with a high degree of accuracy. The problem you're trying to solve also has to be "closed" in a certain sense: there cannot be arbitrary or unseen external effects that may impact the performance of the system. For example, the OpenAI solution would not work if the simulated robot arm were too different from the real robot arm or if there were attempts to knock the Rubik’s Cube out of the real robot arm (though it may naturally be, or be explicitly trained to be, robust to certain kinds of obstructions and interferences).

These limitations will sound acceptable to most people. However, in real applications it is tricky to properly circumscribe the competence of an RL system, and this can lead to unpleasant surprises. In our earlier manufacturing plant example, if a machine is replaced with one that is a lot faster or slower, it may change the plant dynamics enough that it becomes necessary to retrain the RL system. Again, this is not unreasonable for any automated controller, but stakeholders may have far loftier expectations of a system that is artificially intelligent, and such expectations will need to be managed.

Regardless, at this point in time, the future of reinforcement learning in the real world does seem very bright. There are many startups offering reinforcement learning products for controlling manufacturing robots (Covariant, Osaro, Luffy), managing production schedules (InstaDeep), enterprise decision making (Secondmind), logistics (Dorabot), circuit design (InstaDeep), controlling autonomous cars (Wayve, Waymo, Five AI), controlling drones (Amazon), running hedge funds (Piit.ai), and many other applications that are beyond the reach of AI systems based on pattern recognition.

Each of the Big Tech companies has made heavy investments in RL research. Google, for example, acquired DeepMind for a reported £400 million (approx. $525 million) in 2014. So it is reasonable to assume that RL is either already in use internally at these companies or is in the pipeline, but they're keeping the details pretty quiet for competitive reasons.

We should expect to see some hiccups as promising applications for RL falter, but it will likely claim its place as a technology to reckon with in the near future.

M M Hassan Mahmud is a Senior AI and Machine Learning Technologist at Digital Catapult, with a background in machine learning within academia and industry.
