
Machine learning (ML) might be considered the core subset of artificial intelligence (AI), and reinforcement learning may be the quintessential subset of ML that people imagine when they think of AI.

Reinforcement learning is the process by which a machine learning algorithm, robot, etc. can be programmed to respond to complex, real-time and real-world environments to optimally reach a desired target or outcome. Think of the challenge posed by self-driving cars.

The algorithms involved can also “learn” from, or be improved by, this process of taking in and responding to new circumstances.

Other forms of ML may be “trained” by sometimes massive sets of “training data,” often enabling an algorithm to classify or cluster data — or otherwise recognize patterns — based on the relationships and outcomes on which it has been trained. Machine learning algorithms begin with training data and create models that capture some of the patterns and lessons embedded in the data. 


Reinforcement learning is a part of the training process that often continues after deployment, while the model is in use. New data captured from the environment is used to adjust the model to current conditions.

Reinforcement learning is accomplished with a feedback loop based on “rewards” and “penalties.” The scientist or user creates a list of successful and unsuccessful outcomes, and then the AI uses them to adjust the model. It might tweak some of the weights in the model, or even reevaluate some or all of the training data in light of the new reward or penalty.
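
The feedback loop described here can be sketched in a few lines of Python. This is only a minimal illustration, not any particular library's method: the outcome lists, the action names and the update rule (nudging a stored weight toward +1 for a reward or -1 for a penalty) are all assumptions chosen for clarity.

```python
# Hypothetical outcome lists supplied by the scientist or user.
SUCCESSFUL = {"arrived_on_time"}
UNSUCCESSFUL = {"hit_curb", "panic_brake"}

# Hypothetical mapping from each action to the outcome it produces.
ACTION_OUTCOME = {
    "brake_early": "arrived_on_time",
    "brake_late": "panic_brake",
    "cut_corner": "hit_curb",
}


def score(outcome):
    """Turn an outcome into a reward (+1), a penalty (-1) or nothing (0)."""
    if outcome in SUCCESSFUL:
        return 1.0
    if outcome in UNSUCCESSFUL:
        return -1.0
    return 0.0


def reinforce(weights, rounds=3, learning_rate=0.5):
    """Move each action's weight part of the way toward its feedback signal."""
    weights = dict(weights)  # work on a copy, leave the caller's dict alone
    for _ in range(rounds):
        for action, outcome in ACTION_OUTCOME.items():
            signal = score(outcome)
            weights[action] += learning_rate * (signal - weights[action])
    return weights


weights = reinforce({action: 0.0 for action in ACTION_OUTCOME})
preferred = max(weights, key=weights.get)
print(weights, preferred)
```

After three rounds the rewarded action carries a positive weight and the penalized ones carry negative weights, so a system choosing the highest-weighted action would settle on "brake_early".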

For instance, an autonomous car may have a set of straightforward rewards and penalties that are predetermined. The algorithm gets a reward if it arrives on time and doesn’t make any sudden speed changes like panic braking or quick acceleration. If the car hits the curb, gets in a bad traffic jam or brakes unexpectedly, the algorithm is penalized. The model can be retrained with particular attention to the process that led to the bad results. 
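
A predetermined set of rewards and penalties like this one might be written as a single scoring function. The sketch below is hypothetical: the `TripLog` fields and every numeric weight are invented for illustration, and a real system would tune them carefully.

```python
from dataclasses import dataclass


@dataclass
class TripLog:
    """Hypothetical summary of one completed trip."""
    minutes_late: float
    hard_brakes: int
    hard_accelerations: int
    curb_hits: int


def trip_reward(trip: TripLog) -> float:
    """Score a trip: positive for smooth, on-time driving, negative otherwise."""
    reward = 0.0
    if trip.minutes_late <= 0:
        reward += 10.0                     # arrived on time
    else:
        reward -= 1.0 * trip.minutes_late  # e.g. stuck in a bad traffic jam
    # Sudden speed changes are penalized equally in this sketch.
    reward -= 2.0 * (trip.hard_brakes + trip.hard_accelerations)
    reward -= 25.0 * trip.curb_hits        # hitting the curb is the worst case
    return reward


smooth_trip = TripLog(minutes_late=0, hard_brakes=0, hard_accelerations=0, curb_hits=0)
rough_trip = TripLog(minutes_late=12, hard_brakes=3, hard_accelerations=1, curb_hits=1)
print(trip_reward(smooth_trip), trip_reward(rough_trip))
```

Trips with strongly negative scores, such as the second one, are the ones a retraining step would examine most closely.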

In some cases, the reinforcement happens during and after deployment in the real world. In other cases, the model is refined in a simulation that generates synthetic events that may reward or penalize the algorithm. These simulations are especially useful with systems like autonomous vehicles that are expensive and dangerous to test in actual deployment. 

In many cases, reinforcement learning is just an extension of the main learning algorithm. It iterates through the same process again and again after the model is put to use. The steps are similar, and the rewards and punishments become part of an extended set of training data. 

What is the history of reinforcement learning?

Reinforcement learning is one of the first types of algorithms that scientists developed to help computers learn how to solve problems on their own. The adaptive approach that relies on rewards and punishments is a flexible and powerful solution that can leverage the indefatigable ability of computers to try and retry the same tasks. 

Mathematician and computing pioneer Alan Turing contemplated and reported on a “child-machine” experiment using punishments and rewards in a paper published in 1950.

In the early 1950s, scientists like Marvin Minsky, Belmont Farley and Wesley Clark created models that adapted themselves to their input data until they provided the correct response. Minsky called his machine SNARC, which stood for “Stochastic Neural-Analog Reinforcement Calculator.” The name indicated that it used reinforcement to refine a statistical model. Farley and Clark built similar networks of simulated neurons that converged upon an answer. 

One of the most influential approaches came from Donald Michie in the early 1960s. He proposed a very simple approach to learning to play tic-tac-toe that was also easily understood by non-programmers. He compiled a list of the possible positions of Xs and Os that constituted the state of the game. Then he assigned one matchbox for each possible position. Inside the matchbox, he would put a set of colored beads, with each color representing one of the possible moves. 

The user would choose a bead at random and advance the game. If the bead ended up leading to a win, it would be replaced in the box. If the move, though, ended up losing the game, the bead would be discarded. Over time, only winning beads were left in the matchboxes. This very physical representation of the process made it easier to understand. 
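
The matchbox scheme as described above translates almost directly into code. The following is a simplified sketch rather than a reconstruction of Michie's original rules: it assumes the learner plays X against a random opponent, that a draw is treated like a win (beads go back), and that the last bead in a box is never discarded so the box always has a move available.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def play_game(boxes, rng):
    """Play one game: the matchbox learner is X, a random player is O."""
    board = [" "] * 9
    history = []  # (state, bead) pairs drawn by the learner this game
    player = "X"
    while winner(board) is None and " " in board:
        free = [i for i, cell in enumerate(board) if cell == " "]
        if player == "X":
            state = "".join(board)
            # A new matchbox starts with one bead per legal move.
            box = boxes.setdefault(state, list(free))
            move = rng.choice(box)  # draw a bead at random
            history.append((state, move))
        else:
            move = rng.choice(free)
        board[move] = player
        player = "O" if player == "X" else "X"
    return winner(board), history


def train(games=2000, seed=0):
    rng = random.Random(seed)
    boxes = {}
    for _ in range(games):
        result, history = play_game(boxes, rng)
        if result == "O":
            # Loss: discard the drawn beads, but never empty a box.
            for state, move in history:
                if len(boxes[state]) > 1:
                    boxes[state].remove(move)
        # Win or draw: the beads go back, so the boxes are unchanged.
    return boxes


boxes = train()
print(len(boxes), "matchboxes; opening moves still available:", boxes[" " * 9])
```

Each dictionary key plays the role of one matchbox label, and each list holds the beads that have not yet been discarded for that position.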

The field has since grown rapidly, and there are now extensive packages that apply dozens of different algorithms to billions of different examples. While they are much more sophisticated and elaborate, they still follow the same fundamental approach of reward and punishment. 

What are some useful open-source options for reinforcement learning?

There are a number of different packages or frameworks designed to help artificial intelligence scientists continue to train their models and reinforce important behaviors. These are generally distributed as open-source packages that make it simpler for companies and scientists to adopt them. 

  • Gym from OpenAI, for example, is a toolkit that provides a variety of environments for testing how the reinforcement learning process works. One, the Atari environment, lets an algorithm learn how to play and win some classic arcade games. Scientists can also build their own environments and then test how the algorithms perform. 
  • RLlib from the Ray project integrates with major learning frameworks like TensorFlow and PyTorch and connects them to many environments for continual improvement of the model. The system can support distributed iteration to speed up development. It can also start up multi-agent simulations that test and refine multiple models simultaneously as they work alongside each other. 
  • Coach is another toolkit for starting up environments and running distributed simulations to refine models. The system uses numerous different environments that range from video games (Doom, Starcraft) to some purpose-built environments designed for important projects like autonomous control (CARLA). The batch process offers scientists the opportunity to run multiple simulations in parallel and speed up the search for the best parameters. 
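
To give a feel for what an "environment" means in these toolkits, here is a small self-contained sketch. It deliberately does not import Gym or any other package. The `CorridorEnv` class is a made-up toy, and its `reset` and `step` methods only imitate the general shape of the interface that Gym-style toolkits use (an observation, a reward, and flags saying whether the episode has ended).

```python
class CorridorEnv:
    """Toy environment: walk from cell 0 to the last cell of a corridor."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        self.steps = 0
        return self.position, {}

    def step(self, action):
        """Apply an action: 1 moves right, anything else moves left."""
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        terminated = self.position == self.length - 1            # reached the goal
        truncated = self.steps >= self.max_steps and not terminated  # ran out of time
        reward = 1.0 if terminated else -0.1                     # small cost per step
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return the total reward the policy collected."""
    observation, _ = env.reset()
    total = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    return total


always_right = run_episode(CorridorEnv(), lambda observation: 1)
always_left = run_episode(CorridorEnv(), lambda observation: 0)
print(round(always_right, 2), round(always_left, 2))
```

A learning algorithm would replace the fixed `lambda` policies with one that changes its choices based on the rewards it has seen, and a distributed toolkit would run many such episodes in parallel.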

How do major providers handle reinforcement learning?

The major AI cloud platform providers also support reinforcement learning.

  • Amazon offers a variety of platforms for exploring artificial intelligence and building models, and all offer some options for using reinforcement learning to guide the process. SageMaker RL, RoboMaker and DeepRacer are just three of the major machine learning options, and all support a variety of open-source options for adding feedback from reinforcement learning, like Coach, Ray RLlib or OpenAI Gym. 
  • Google’s Vertex AI, its unified machine learning platform, offers options like Vizier to find the best hyperparameters, the settings that control how a model is trained, to help the model converge quickly. This can be especially helpful for training a model with many options to tune because the complexity of covering all the combinations grows quickly. The company has also been enhancing some of its hardware options for faster training, like tensor processing units (TPUs), to support more distributed reinforcement algorithms. 
  • IBM is offering a number of different options for integrating reinforcement learning with many of its model building tools. ReGen, for example, is focused on enhancing some of the machine learning models built from text stored in knowledge graphs. Its Verifiably Safe Reinforcement Learning (VSRL) algorithms integrate formal methods for proof checking with machine learning algorithms to bring extra assurance that the results are complete and accurate. 
  • Microsoft supports a variety of options for adding reinforcement to model-building. The Ray library in Python is the recommended solution for working with Azure ML. Microsoft also offers hardware support with GPUs to speed up some deep learning approaches, and is beginning to integrate the options with some of its customized machine learning services. For example, Personalizer is a system for helping shoppers find the products they need, with results that can be enhanced with feedback over time. 
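
The point about hyperparameters is easy to see with a small, hypothetical example. The sketch below is not how Vizier or any other hosted service works; those services use more sophisticated search strategies. It simply tries every combination in a made-up search space against a made-up loss function, to show how fast the number of trials grows.

```python
import itertools
import math

# Hypothetical search space: every value here is an assumption.
SEARCH_SPACE = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "layers": [1, 2, 3, 4],
    "batch_size": [16, 32, 64],
}


def toy_validation_loss(learning_rate, layers, batch_size):
    """Invented stand-in for training a model and measuring its error."""
    return ((math.log10(learning_rate) + 2) ** 2
            + (layers - 3) ** 2
            + abs(batch_size - 32) / 32)


def grid_search(space, objective):
    """Try every combination and keep the one with the lowest loss."""
    names = list(space)
    best_params, best_loss, trials = None, float("inf"), 0
    for combo in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, combo))
        loss = objective(**params)
        trials += 1
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss, trials


best_params, best_loss, trials = grid_search(SEARCH_SPACE, toy_validation_loss)
print(trials, best_params)
```

Three hyperparameters with only a few values each already require 4 × 4 × 3 = 48 training runs. Adding a fourth hyperparameter with four values would push that to 192, which is why smarter search tools are attractive.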

How do AI startups handle reinforcement learning?

Many of the startups delivering artificial intelligence solutions have engineered their algorithms to support reinforcement learning later in the process. This approach is very common in many of the solutions that support autonomous robots and vehicles. 

Wayve, for instance, is creating guidance systems for autonomous cars using a pure machine learning approach. Its approach, which it calls AV2.0, continually refines its driving model as new data about the world becomes available. 

Companies like Waymo, Pony AI, AEye, Cruise Automation and Argo AI are a few with significant funding that are building software and sensor systems that depend upon models of the natural world to guide autonomous vehicles. These vendors are deploying various forms of reinforcement learning to improve these models over time. 

Other companies deploy route planning algorithms for domains that need to respond to changing, real-time information. Teale is building drill guidance systems for the extraction of oil, gas or water from the ground. Pickle Robot and Dorabot are creating robots that can unpack boxes stacked in haphazard ways in large trucks.  

Many pharmaceutical companies are marrying reinforcement learning with drug development to help doctors home in on treatments for a variety of diseases. Companies like Insilico, Phenomic and ProteinQure are refining reinforcement learning algorithms to incorporate feedback from doctors and patients in their search for potentially useful drugs and proteins. The process could both unlock new potential drugs and lead to individualized treatments. 

Other companies are exploring specific domains. Signal AI is a media monitoring company that helps other companies track their reputations by creating a knowledge graph of the world and constantly refining it in real time. Perimeter X enhances web security by constantly watching for threats with an evolving model. 

Is there anything that reinforcement learning can’t do?

Ultimately, reinforcement learning is just like regular machine learning, except it collects some of its data at a later time. The algorithms are designed to adapt to new information, but they still process all the data in some form or other. So, reinforcement learning algorithms have all the same philosophical limitations as regular machine learning algorithms. 

These limitations are already well known to machine learning scientists. Data must be carefully gathered to represent all possible combinations or variations. If there are gaps or biases in the data, the algorithms will build models that conform to them. Gathering the data is often much more complicated than running the algorithms. 

Delaying some of the data can have mixed effects. Occasionally the delay introduced by reinforcement helps the human guide the model to be more accurate, but sometimes the human interaction just introduces more randomness to the process. Humans are not always consistent and this can confuse the modeling algorithm. If one human inputs one choice on one day and another human inputs the opposite later, they will cancel each other out and the learning will be limited. 
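
The cancelling effect is simple arithmetic. In the hypothetical sketch below, each piece of human feedback nudges a single weight up (+1) or down (-1) by a fixed learning rate. The update rule and the numbers are assumptions chosen only to illustrate the point.

```python
def apply_feedback(weight, signals, learning_rate=0.1):
    """Nudge a weight up or down once for each feedback signal."""
    for signal in signals:
        weight += learning_rate * signal
    return weight


consistent = apply_feedback(0.0, [+1, +1])   # two people agree
conflicting = apply_feedback(0.0, [+1, -1])  # two people disagree
print(consistent, conflicting)
```

When the two reviewers agree, the weight moves twice as far. When they disagree, it ends up exactly where it started, so the model has learned nothing from either of them.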

There is also an air of mystery to the entire process. While AI scientists have grown more adept at providing explanations for how and why a model is making a decision, these explanations are still not always fulfilling or insightful. The algorithms churn to produce a result and that result can be an inscrutable collection of numbers, weights and thresholds. 

Reinforcement learning also requires extensive exploration and experimentation. Scientists often work through numerous possible architectures for a model, with different numbers of layers and configurations for the various artificial neurons. Finding the best one is often as much an art as a science. 

In all, reinforcement learning suffers from the same limitations as regular machine learning. It’s an ideal option for domains that are evolving and where some data is unavailable at the start. But after that, success or failure depends upon the underlying algorithms themselves.
