Training AI: Reward is not enough

This post was written for TechTalks by Herbert Roitblat, the author of Algorithms Are Not Enough: How to Create Artificial General Intelligence.

In a recent paper, the DeepMind team (Silver et al., 2021) argues that rewards are enough for all kinds of intelligence. Specifically, they argue that "maximizing reward is enough to drive behavior that exhibits most if not all attributes of intelligence." They argue that simple rewards are all that is needed for agents in rich environments to develop multi-attribute intelligence of the sort needed to achieve artificial general intelligence. This sounds like a bold claim, but, in fact, it is so vague as to be almost meaningless. They support their thesis not by offering specific evidence, but by repeatedly asserting that reward is enough because the observed solutions to the problems are consistent with the problem having been solved.

The Silver et al. paper represents at least the third time that a serious proposal has been offered to demonstrate that generic learning mechanisms are sufficient to account for all learning. This one goes further, proposing that such mechanisms are also sufficient to attain intelligence and, in particular, to explain artificial general intelligence.

The first significant project that I know of that attempted to show that a single learning mechanism is all that is needed is B.F. Skinner's version of behaviorism, as represented by his book Verbal Behavior. This book was devastatingly critiqued by Noam Chomsky (1959), who called Skinner's attempt to explain human language production an example of "play acting at science." The second major proposal was focused on past-tense learning of English verbs by Rumelhart and McClelland (1986), which was soundly criticized by Lachter and Bever (1988). Lachter and Bever showed that the specific way that Rumelhart and McClelland chose to represent the phonemic properties of the words that their connectionist system was learning to transform contained the specific information that would allow the system to succeed.

Both of these previous attempts failed in that they succumbed to confirmation bias. As Silver et al. do, they reported data that were consistent with their hypothesis without considering possible alternative explanations, and they interpreted ambiguous data as supportive. All three projects failed to take account of the implicit assumptions that were built into their models. Without these implicit TRICS (Lachter and Bever's name for "the representations it crucially supposes"), there would be no intelligence in these systems.

The Silver et al. argument can be summarized by three propositions:

Maximizing reward is enough to produce intelligence: "The generic objective of maximising reward is enough to drive behaviour that exhibits most if not all abilities that are studied in natural and artificial intelligence."
Intelligence is the ability to achieve goals: "Intelligence may be understood as a flexible ability to achieve goals."
Success is measured by maximizing reward: "Thus, success, as measured by maximising reward."

In short, they propose that the definition of intelligence is the ability to maximize reward and at the same time they use the maximization of reward to explain the emergence of intelligence. Following the 17th Century author Moliere, some philosophers would call this kind of argument virtus dormitiva (a sleep-inducing virtue). When asked to explain why opium causes sleep, Moliere's bachelor (in The Imaginary Invalid) responds that it has a dormitive property (a sleep-inducing virtue). That, of course, is just a naming of the property for which an explanation is being sought. Reward maximization plays a similar role in Silver et al.'s hypothesis, which is also entirely circular. Achieving goals both constitutes being intelligent and is offered as the explanation of being intelligent.

Chomsky also criticized Skinner's approach because it assumed that for any exhibited behavior there must have been some reward. If someone looks at a painting and says "Dutch," Skinner's analysis assumes that there must be some feature of the painting for which the utterance "Dutch" had been rewarded. But, Chomsky argues, the person could have said anything else, including "crooked," "hideous," or "let's get some lunch." Skinner cannot point to the specific feature of the painting that caused any of these utterances or provide any evidence that that utterance was previously rewarded in the presence of that feature. To quote an 18th Century French author (Voltaire), his Dr. Pangloss (in Candide) says: "Observe that the nose has been formed to bear spectacles -- thus we have spectacles." There must be a problem that is solved by any feature, and in this case he claims that the nose has been formed just so spectacles can be held up. Pangloss also says "It is demonstrable ... that things cannot be otherwise than as they are; for all being created for an end, all is necessarily for the best end." For Silver et al. that end is the solution to a problem and intelligence has been learned just for that purpose, but we do not necessarily know what that purpose is or what environmental features induced it. There must have been something.

Gould and Lewontin (1979) famously exploit Dr. Pangloss to criticize what they call the "adaptationist" or "Panglossian" paradigm in evolutionary biology. The core adaptationist tenet is that there must be an adaptive explanation for any feature. They point out that the highly decorated spandrels (the approximately triangular shapes where two arches meet) of St. Mark's Cathedral in Venice are an architectural feature that follows from the choice to design the Cathedral with four arches, rather than the driver of the architectural design. The spandrels followed the choice of arches, not the other way around. Once the architect chose the arches, the spandrels were necessary, and they could be decorated. Gould and Lewontin say "Every fan-vaulted ceiling must have a series of open spaces along the midline of the vault, where the sides of the fans intersect between the pillars. Since the spaces must exist, they are often used for ingenious ornamental effect."

Gould and Lewontin give another example -- an adaptationist explanation of Aztec sacrificial cannibalism. Aztecs engaged in human sacrifice. An adaptationist explanation was that the system of sacrifice was a solution to the problem of a chronic shortage of meat. The limbs of victims were frequently eaten by certain high-status members of the community. This "explanation" argues that the system of myth, symbol, and tradition that constituted this elaborate ritualistic murder was the result of a need for meat, whereas the opposite was probably true. Each new king had to outdo his predecessor with increasingly elaborate sacrifices of larger numbers of individuals; the practice seems to have increasingly strained the economic resources of the Aztec empire. Other sources of protein were readily available, and only certain privileged people, who had enough food already, ate only certain parts of the sacrificial victims. If getting meat into the bellies of starving people were the goal, then one would expect that they would make more efficient use of the victims and spread the food source more broadly. The need for meat is unlikely to be a cause of human sacrifice; rather it would seem to be a consequence of other cultural practices that were actually maladaptive for the survival of the Aztec civilization.

To paraphrase Silver et al.'s argument so far, if the goal is to be wealthy, it is enough to accumulate a lot of money. Accumulating money is then explained by the goal of being wealthy. Being wealthy is defined by having accumulated a lot of money. Reinforcement learning provides no explanation for how one goes about accumulating money or why that should be a goal. Those are determined, they argue, by the environment.

Reward by itself, then, is not really enough; at a minimum, the environment also plays a role. But there is more to adaptation than even that. Adaptation requires a source of variability from which certain traits can be selected. The primary source of this variation in evolutionary biology is mutation and recombination. Reproduction in any organism involves a copying of genes from the parents into the children. The copying process is less than perfect and errors are introduced. Many of those errors are fatal, but some of them are not and are then available for natural selection. In sexually reproducing species, each parent contributes a copy (along with any potential errors) of its genes and the two copies allow for additional variability through recombination (some genes from one parent and some from the other are passed to the next generation).

Reward is the selection. Alone, it is not sufficient. As Dawkins pointed out, evolutionary reward is the passing of a specific gene to the next generation. The reward is at the gene level, not at the level of the organism or the species. Anything that increases the chances of a gene being passed from one generation to the next mediates that reward, but notice that the genes themselves are not capable of being intelligent.

In addition to reward and environment, other factors also play a role in evolution and reinforcement learning. Reward can only select from the raw material that is available. If we throw a mouse into a cave, it does not learn to fly and to use sonar like a bat. Many generations and perhaps millions of years would be required to accumulate enough mutations and even then, there is no guarantee that it would evolve the same solutions to the cave problem that bats have evolved. Reinforcement learning is a purely selective process. Reinforcement learning is the process of increasing the probabilities of actions that together form a policy for dealing with a certain environment. Those actions must already exist for them to be selected. At least for now, those actions are supplied by the genes in evolution and by the program designers in artificial intelligence.
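The selective character of this process can be made concrete with a small sketch. The Python below is only an illustration, not anything taken from Silver et al.: it uses a textbook gradient-bandit update, and the action names, reward values, and function names are all hypothetical. Reward shifts probability among the actions the designer listed. An action that was never listed ("fly") cannot be selected, however large its reward would have been.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def train(actions, reward_fn, steps=2000, lr=0.1, seed=0):
    """Gradient-bandit learning over a FIXED, designer-supplied action set."""
    rng = random.Random(seed)
    prefs = {a: 0.0 for a in actions}   # one preference per available action
    baseline = 0.0                      # running average of observed reward
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        r = reward_fn(action)
        baseline += (r - baseline) / t
        # Raise the probability of actions that beat the baseline,
        # lower the probability of those that fall short of it.
        for a in prefs:
            indicator = 1.0 if a == action else 0.0
            prefs[a] += lr * (r - baseline) * (indicator - probs[a])
    return softmax(prefs)

# Hypothetical rewards: flying would pay best, but the mouse cannot fly.
REWARDS = {"walk": 0.2, "climb": 0.5, "fly": 1.0}
MOUSE_ACTIONS = ["walk", "climb"]       # the raw material reward can select from

policy = train(MOUSE_ACTIONS, REWARDS.__getitem__)
print(policy)                           # probability mass shifts toward "climb"
print("fly" in policy)                  # "fly" was never available to select
```

In this sketch the learner ends up preferring "climb" because that is the best of what it was given. Getting "fly" into the policy requires changing `MOUSE_ACTIONS`, which is the designer's job (or evolution's), not the reward's.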

As Lachter and Bever pointed out, learning does not start with a tabula rasa, as claimed by Silver et al., but with a set of representational commitments. Skinner based most of his theory building on the reinforcement learning of animals, particularly pigeons and rats. He and many other investigators studied them in stark environments. For the rats, that was a chamber that contained a lever for the rat to press and a feeder to deliver the reward. There was not much else that the rat could do but to wander a short distance and contact the lever. Pigeons were similarly tested in an environment that contained a pecking key (usually a plexiglass circle on the wall that could be illuminated) and a grain feeder to deliver the reward. In both situations, the animal had a pre-existing bias to respond in the way that the behaviorist wanted. Rats would contact the lever and, it turned out, pigeons would peck an illuminated key in a dark box even without a reward. This proclivity to respond in a desirable way made it easy to train the animal, and the investigator could study the effects of reward patterns without a lot of trouble. It was many years, however, before it was discovered that the choice of a lever or a pecking key was not simply an arbitrary convenience, but an unrecognized "fortunate choice."

The same unrecognized fortunate choices occurred when Rumelhart and McClelland built their past-tense learner. They chose a representation that just happened to reflect the very information that they wanted their neural network to learn. It was not a tabula rasa relying solely on a general learning mechanism. Silver et al. (in another paper with an overlapping set of authors) also got "lucky" in their development of AlphaZero, to which they refer in the present paper.

In the previous paper, they give a more detailed account of AlphaZero along with this claim:

Our results demonstrate that a general-purpose reinforcement learning algorithm can learn, tabula rasa -- without domain-specific human knowledge or data, as evidenced by the same algorithm succeeding in multiple domains -- superhuman performance across multiple challenging games.

They also note:

AlphaZero replaces the handcrafted knowledge and domain-specific augmentations used in traditional game-playing programs with deep neural networks, a general-purpose reinforcement learning algorithm, and a general-purpose tree search algorithm.

They do not include explicit game-specific computational instructions, but they do include a substantial human contribution to solving the problem. For example, their model includes a "neural network f_θ(s) [which] takes the board position s as an input and outputs a vector of move probabilities." In other words, they do not expect the computer to learn that it is playing a game, or that the game is played by taking turns, or that it cannot just stack the stones (the go game pieces) into piles or throw the game board on the floor. They provide many other constraints as well, for example, by having the machine play against itself. The tree representation they use was once a huge innovation for representing game playing. The branches of the tree correspond to the range of possible moves. No other action is possible. The computer is also provided with a way to search the tree using a Monte Carlo tree search algorithm and it is provided with the rules of the game.
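To see how much is fixed by that interface alone, consider a toy stand-in for f_θ(s). This is a hedged sketch, not AlphaZero's network: the 3x3 board, the linear scoring, and the names `BOARD_CELLS`, `init_theta`, and `f_theta` are all assumptions made for illustration. Whatever values the parameters take, the output is a probability vector with exactly one entry per board cell, and occupied cells are masked out by a rule the designer supplies.

```python
import math
import random

BOARD_CELLS = 9  # hypothetical 3x3 board; the designer fixes the size

def init_theta(seed=0):
    """Small random weights standing in for learned parameters."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(BOARD_CELLS)]
            for _ in range(BOARD_CELLS)]

def f_theta(theta, s):
    """Map a board position s to a vector of move probabilities.

    s is a list of BOARD_CELLS ints: 0 = empty, 1 = own stone, -1 = opponent.
    """
    if len(s) != BOARD_CELLS:
        raise ValueError("position does not fit the designer's representation")
    legal = [i for i, cell in enumerate(s) if cell == 0]  # rule given, not learned
    if not legal:
        raise ValueError("no legal moves")
    # One score per cell: a linear function of the position.
    logits = [sum(theta[i][j] * s[j] for j in range(BOARD_CELLS))
              for i in range(BOARD_CELLS)]
    m = max(logits[i] for i in legal)
    # Softmax over legal cells only; illegal cells get probability zero.
    exps = [math.exp(logits[i] - m) if i in legal else 0.0
            for i in range(BOARD_CELLS)]
    total = sum(exps)
    return [e / total for e in exps]

s = [1, -1, 0, 0, 0, 0, 0, 0, 0]        # two stones already placed
p = f_theta(init_theta(), s)
print(len(p), p[0], p[1])               # one entry per cell; occupied cells are 0.0
```

Nothing in the output vector corresponds to stacking the stones or abandoning the game. Those options are excluded by the representation before any reward is ever delivered.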

Far from being a tabula rasa, then, AlphaZero is given substantial prior knowledge, which greatly constrains the range of possible things it can learn. So it is not clear what "reward is enough" means even in the context of learning to play go. For reward to be enough, it would have to work without these constraints. Moreover, it is unclear whether even a general game-playing system would count as an example of general learning in less constrained environments. AlphaZero is a substantial contribution to computational intelligence, but its contribution is largely the human intelligence that went into designing it, identifying the constraints that it would operate in, and reducing the problem of playing a game to a directed tree search. Furthermore, its constraints do not even apply to all games, but only to games of a limited type. It can only play certain kinds of board games that can be characterized as a tree search where the learner can take a board position as input and output a probability vector. There is no evidence that it could even learn another kind of board game, such as Monopoly or even Parcheesi.

Absent the constraints, reward does not explain anything. AlphaZero is not a model for all kinds of learning, and certainly not for general intelligence.

Silver et al. treat general intelligence as a quantitative problem.

"General intelligence, of the sort possessed by humans and perhaps also other animals, may be defined as the ability to flexibly achieve a variety of goals in different contexts."

How much flexibility is required? How wide a variety of goals? If we had a computer that could play go, checkers, and chess interchangeably, that would still not constitute general intelligence. Even if we added another game, shogi, we still would have exactly the same computer that would still work by finding a model that "takes the board position s as an input and outputs a vector of move probabilities." The computer is completely incapable of entertaining any other "thoughts" or solving any problem that cannot be represented in this specific way.

The "general" in artificial general intelligence is not characterized by the number of different problems it can solve, but by the ability to solve many types of problems. A generally intelligent agent must be able to autonomously formulate its own representations. It has to invent its own approach to solving problems, selecting its own goals, representations, methods, and so on. So far, that is all the purview of human designers who reduce problems to forms that a computer can solve through the adjustment of model parameters. We cannot achieve general intelligence until we can remove the dependency on humans to structure problems. Reinforcement learning, as a selective process, cannot do it.

Conclusion: As with the confrontation between behaviorism and cognitivism, and the question of whether backpropagation was sufficient to learn linguistic past-tense transformations, these simple learning mechanisms only appear to be sufficient if we ignore the heavy burden carried by other, often unrecognized constraints. Rewards select among available alternatives but they cannot create those alternatives. Behaviorist rewards work so long as one does not look too closely at the phenomena and as long as one assumes that there must be some reward that reinforces some action. They are good after the fact to "explain" any observed actions, but they do not help outside the laboratory to predict which actions will be forthcoming. These phenomena are consistent with reward, but it would be a mistake to think that they are caused by reward.

Contrary to Silver et al.'s claims, reward is not enough.

Herbert Roitblat is the author of Algorithms Are Not Enough: How to Create Artificial General Intelligence (MIT Press, 2020).