In a technical paper quietly released earlier this year, IBM detailed what it calls the IBM Neural Computer, a reconfigurable parallel processing system designed for researching and developing emerging AI algorithms and computational neuroscience. This week, the company published a preprint describing the first application demonstrated on the Neural Computer: a deep “neuroevolution” system that combines a hardware implementation of the Atari 2600, image preprocessing, and AI algorithms in an optimized pipeline. The coauthors report results competitive with state-of-the-art techniques, but perhaps more significantly, they claim that the system achieves a record training throughput of 1.2 million image frames per second.
The Neural Computer represents something of a shot across the bow in the AI computational arms race. According to an analysis recently released by OpenAI, from 2012 to 2018, the amount of compute used in the largest AI training runs grew more than 300,000 times with a 3.5-month doubling time, far exceeding the pace of Moore’s law. On pace with this, supercomputers like Intel’s forthcoming Aurora at the Department of Energy’s Argonne National Laboratory and AMD’s Frontier at Oak Ridge National Laboratory promise in excess of an exaflop (a quintillion floating-point computations per second) of computing performance.
Video games are a well-established platform for AI and machine learning research. They’ve gained currency not only because of their availability and the low cost of running them at scale, but because in certain domains like reinforcement learning, where AI learns optimal behaviors by interacting with the environment in pursuit of rewards, game scores serve as direct rewards. AI algorithms developed within games have been shown to be adaptable to more practical uses, like protein folding prediction. And if the results from IBM’s Neural Computer prove to be repeatable, the system could be used to accelerate those AI algorithms’ development.
The Neural Computer
IBM’s Neural Computer consists of 432 nodes (27 nodes across 16 modular cards) based on field-programmable gate arrays (FPGAs) from Xilinx, a longtime strategic collaborator of IBM’s. (FPGAs are integrated circuits designed to be configured after manufacturing.) Each node comprises a Xilinx Zynq system-on-chip — a dual-core ARM A9 processor paired with an FPGA on the same die — along with 1GB of dedicated RAM. The nodes are arranged in a 3D mesh topology, interconnected vertically with electrical connections called through-silicon vias that pass completely through silicon wafers or dies.
On the networking side, the FPGAs provide access to the physical communication links among cards in order to establish multiple distinct channels of communication. A single card can theoretically support transfer speeds up to 432GB per second, but the Neural Computer’s network interfaces can be adjusted and progressively optimized to best suit a given application.
“The availability of FPGA resources on every node allows application-specific processor offload, a feature that is not available on any parallel machine of this scale that we are aware of,” wrote the coauthors of a paper detailing the Neural Computer’s architecture. “[M]ost of the performance-critical steps [are] offloaded and optimized on the FPGA, with the ARM [processor] … providing auxiliary support.”
Playing Atari games with AI
The researchers used 26 out of 27 nodes per card within the Neural Computer, carrying out experiments on a total of 416 nodes. Two instances of their Atari game-playing application ran on each of the 416 FPGAs, scaling up to 832 instances running in parallel. Each instance extracted frames from a given Atari 2600 game, performed image preprocessing, ran the images through machine learning models, and performed an action within the game.
To obtain the highest performance, the team shied away from emulating the Atari 2600 in software, instead opting to use the FPGAs to implement the console’s functionality at higher frequencies. They tapped a framework from the open source MiSTer project, which aims to recreate consoles and arcade machines using modern hardware, and raised the Atari 2600’s processor clock from 3.58 MHz to 150 MHz. This produced roughly 2,514 frames per second compared with the original 60 frames per second.
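The frame rate quoted above follows from the clock ratio. As a quick check, the short Python sketch below assumes the frame rate scales linearly with the processor clock; that assumption, and the function name `scaled_fps`, are illustrative and not taken from IBM's paper.

```python
# Figures quoted in the article.
ORIGINAL_CLOCK_MHZ = 3.58
OVERCLOCKED_MHZ = 150
ORIGINAL_FPS = 60


def scaled_fps(base_fps, base_clock_mhz, new_clock_mhz):
    """Frame rate after a clock change, assuming frames scale linearly with clock."""
    return base_fps * new_clock_mhz / base_clock_mhz


clock_speedup = OVERCLOCKED_MHZ / ORIGINAL_CLOCK_MHZ
fps = scaled_fps(ORIGINAL_FPS, ORIGINAL_CLOCK_MHZ, OVERCLOCKED_MHZ)

print(f"clock speedup: {clock_speedup:.1f}x")   # clock speedup: 41.9x
print(f"frames per second: {fps:.0f}")          # frames per second: 2514
```

Under that linear assumption, a roughly 42-fold faster clock yields the article's figure of about 2,514 frames per second per console instance.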
In the image preprocessing step, IBM’s application converted the frames from color to grayscale, eliminated flickering, rescaled images to a smaller resolution, and stacked the frames into groups of four. It then passed these onto an AI model that reasoned about the game environment and a submodule that selected the action for the next frames by identifying the maximum reward as predicted by the AI model.
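To make these steps concrete, here is a minimal Python sketch of such a pipeline. It is not IBM's FPGA implementation, which the article does not detail. The luma weights, the pixel-wise maximum used to suppress flicker, the subsampling used for rescaling, and every function and class name are assumptions chosen for illustration.

```python
from collections import deque


def to_gray(frame):
    """Convert an RGB frame (rows of (r, g, b) tuples) to grayscale.

    Assumption: standard luma weights; the article does not give the formula.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def deflicker(prev, curr):
    """Pixel-wise maximum of two consecutive grayscale frames.

    Assumption: a common way to hide sprites that are drawn only on
    alternating frames; the article does not say how flicker was removed.
    """
    return [[max(p, c) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def downscale(frame, factor):
    """Shrink a frame by keeping every `factor`-th pixel in each direction."""
    return [row[::factor] for row in frame[::factor]]


class FrameStacker:
    """Keeps the most recent `depth` frames as one model input."""

    def __init__(self, depth=4):
        self.frames = deque(maxlen=depth)

    def push(self, frame):
        """Add a frame; return True once a full stack is available."""
        self.frames.append(frame)
        return len(self.frames) == self.frames.maxlen

    def stack(self):
        """Return the stacked frames, oldest first."""
        return list(self.frames)


def select_action(predicted_rewards):
    """Pick the index of the action with the highest predicted reward."""
    return max(range(len(predicted_rewards)),
               key=lambda action: predicted_rewards[action])
```

In use, each new frame would pass through `to_gray`, `deflicker`, and `downscale` before being pushed onto the stacker, and the model's per-action outputs for a full stack would go to `select_action`.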
A separate genetic algorithm ran on an external computer connected to the Neural Computer via PCIe. It evaluated the performance of each instance and selected the top performers as “parents” of the next generation of instances.
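The selection step can be sketched in a few lines of Python. The article only states that top-scoring instances become parents. The Gaussian mutation, the choice to carry the best parent over unchanged, and all names and parameters below are assumptions for illustration, not details from IBM's preprint.

```python
import random


def select_parents(population, scores, num_parents):
    """Return the `num_parents` highest-scoring members of the population."""
    ranked = sorted(range(len(population)),
                    key=lambda i: scores[i], reverse=True)
    return [population[i] for i in ranked[:num_parents]]


def mutate(genome, sigma, rng):
    """Perturb every weight with Gaussian noise (an assumed mutation scheme)."""
    return [w + rng.gauss(0.0, sigma) for w in genome]


def next_generation(population, scores, num_parents, sigma, rng):
    """Build a new population of the same size from the top performers."""
    parents = select_parents(population, scores, num_parents)
    # Assumption: keep the best parent unchanged so the top score never regresses.
    children = [list(parents[0])]
    while len(children) < len(population):
        children.append(mutate(rng.choice(parents), sigma, rng))
    return children


# Hypothetical example: four one-weight "networks" with their game scores.
population = [[0.0], [1.0], [2.0], [3.0]]
scores = [10, 40, 20, 30]
rng = random.Random(0)
new_population = next_generation(population, scores, num_parents=2,
                                 sigma=0.1, rng=rng)
```

Here the genomes with scores 40 and 30 are selected, and every member of `new_population` descends from one of them.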
Over the course of five experiments, IBM researchers ran 59 Atari 2600 games on the Neural Computer. The results imply that the approach wasn’t data-efficient compared with other reinforcement learning techniques: it required 6 billion game frames in total and failed at challenging exploration games like Montezuma’s Revenge and Pitfall. But it managed to outperform a popular baseline, the Deep Q-network architecture pioneered by DeepMind, in 30 out of 59 games after 6 minutes of training (200 million training frames) versus the Deep Q-network’s 10 days of training. With 6 billion training frames, it surpassed the Deep Q-network in 36 games while taking 2 orders of magnitude less training time (2 hours and 30 minutes).
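The “orders of magnitude” comparison can be checked with simple arithmetic on the wall-clock times quoted above. The sketch below uses only those figures; the function name `speedup` is illustrative.

```python
import math

# Wall-clock training times quoted in the article, in minutes.
DQN_MINUTES = 10 * 24 * 60       # 10 days
SHORT_RUN_MINUTES = 6            # 200 million frames
LONG_RUN_MINUTES = 2 * 60 + 30   # 6 billion frames


def speedup(baseline_minutes, new_minutes):
    """How many times shorter the new run is than the baseline."""
    return baseline_minutes / new_minutes


short_speedup = speedup(DQN_MINUTES, SHORT_RUN_MINUTES)
long_speedup = speedup(DQN_MINUTES, LONG_RUN_MINUTES)

print(f"6-minute run: {short_speedup:.0f}x faster, "
      f"{math.log10(short_speedup):.2f} orders of magnitude")
print(f"2.5-hour run: {long_speedup:.0f}x faster, "
      f"{math.log10(long_speedup):.2f} orders of magnitude")
```

The 2.5-hour run comes out 96 times shorter than the 10-day baseline, or just under two orders of magnitude, which matches the article's wording; the 6-minute run is 2,400 times shorter.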