Roughly a year ago, Microsoft announced it would invest $1 billion in OpenAI to jointly develop new technologies for Microsoft’s Azure cloud platform and to “further extend” large-scale AI capabilities that “deliver on the promise” of artificial general intelligence (AGI). In exchange, OpenAI agreed to license some of its intellectual property to Microsoft, which the company would then commercialize and sell to partners, and to train and run AI models on Azure as OpenAI worked to develop next-generation computing hardware.
Today during Microsoft’s Build 2020 developer conference, the first fruit of the partnership was revealed, in the form of a new supercomputer that Microsoft says was built in collaboration with, and exclusively for, OpenAI on Azure. Microsoft claims it would rank as the fifth most powerful machine in the world when measured against the TOP500, a project that benchmarks and details the 500 top-performing supercomputers. According to the most recent rankings, it slots behind Tianhe-2A at China’s National Supercomputer Center in Guangzhou and ahead of the Texas Advanced Computing Center’s Frontera, meaning it can perform somewhere between 38.7 and 100.7 quadrillion floating point operations per second (i.e., petaflops) at peak.
OpenAI has long asserted that immense computational horsepower is a necessary step on the road to AGI, or AI that can learn any task a human can. While luminaries like Mila founder Yoshua Bengio and Facebook VP and chief AI scientist Yann LeCun argue that AGI can’t exist, OpenAI’s cofounders and backers — among them Greg Brockman, chief scientist Ilya Sutskever, Elon Musk, Reid Hoffman, and former Y Combinator president Sam Altman — believe powerful computers in conjunction with reinforcement learning and other techniques can achieve paradigm-shifting AI advances. The unveiling of the supercomputer represents OpenAI’s biggest bet yet on that vision.
The benefits of large models
The new Azure-hosted, OpenAI-co-designed machine contains over 285,000 processor cores, 10,000 graphics cards, and 400 gigabits per second of connectivity for each graphics card server. It was designed to train single massive AI models, which are models that learn from ingesting billions of pages of text from self-published books, instruction manuals, history lessons, human resources guidelines, and other publicly available sources. Examples include a natural language processing (NLP) model from Nvidia that contains 8.3 billion parameters, or configurable variables internal to the model whose values are used in making predictions; Microsoft’s Turing NLG (17 billion parameters), which achieves state-of-the-art results on a number of language benchmarks; Facebook’s recently open-sourced Blender chatbot framework (9.4 billion parameters); and OpenAI’s own GPT-2 model (1.5 billion parameters), which generates impressively humanlike text given short prompts.
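To make the notion of a parameter count concrete, here is a minimal sketch that tallies the parameters of a small fully connected network. The layer sizes are hypothetical and chosen only for illustration; they do not describe the architecture of any model named above.

```python
from typing import Sequence


def dense_layer_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair, plus one bias per output unit."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes: Sequence[int]) -> int:
    """Total parameters of a stack of dense layers with the given widths."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(dense_layer_params(n_in, n_out) for n_in, n_out in pairs)


# Hypothetical toy network: 784 inputs, one hidden layer of 512 units, 10 outputs.
toy_sizes = [784, 512, 10]
print(f"{mlp_params(toy_sizes):,} parameters")
```

Even this toy network has a few hundred thousand parameters, which gives a sense of how far removed a model with billions of them is.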
“As we’ve learned more and more about what we need and the different limits of all the components that make up a supercomputer, we were really able to say, ‘If we could design our dream system, what would it look like?’” OpenAI CEO Sam Altman said in a statement. “And then Microsoft was able to build it. We are seeing that larger-scale systems are an important component in training more powerful models.”
Studies show that these large models perform well because they can deeply absorb the nuances of language, grammar, knowledge, concepts, and context, enabling them to summarize speeches, moderate content in live gaming chats, parse complex legal documents, and even generate code from scouring GitHub. Microsoft has used its Turing models — which will soon be available in open source — to bolster language understanding across Bing, Office, Dynamics, and its other productivity products. In Bing, the models improved caption generation and question answering by up to 125% in some markets, claims Microsoft. In Office, they ostensibly fueled advances in Word’s Smart Lookup and Key Insights tools. Outlook uses them for Suggested Replies, which automatically generates possible responses to emails. And in Dynamics 365 Sales Insights, they suggest actions to sellers based on interactions with customers.
From a technical standpoint, the large models are superior to their forebears in that they’re self-supervised, meaning they can generate labels from data by exposing relationships between the data’s parts — a step believed to be critical to achieving human-level intelligence. That’s as opposed to supervised learning algorithms, which train on human-labeled data sets, and which can be difficult to fine-tune on tasks particular to industries, companies, or topics of interest.
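As a rough illustration of how labels can come from the data itself, the sketch below builds training pairs for next-word prediction, one common self-supervised objective. The article does not say which objective each model uses, so treat the choice as an assumption; the function name and sample sentence are hypothetical.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, target) pairs with no human labeling.

    Each target is simply the word that follows the context in the text.
    """
    words = text.split()
    examples = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        target = words[i]
        examples.append((context, target))
    return examples


sample = "the supercomputer was built to train very large language models"
for context, target in next_word_examples(sample)[:3]:
    print(context, "->", target)
```

Every pair is derived from the raw sentence alone, which is why this kind of training can scale to billions of pages without annotators.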
“The exciting thing about these models is the breadth of the things [they’ve] enable[d],” Microsoft chief technology officer Kevin Scott said in a statement. “This is about being able to do a hundred exciting things in natural language processing at once and a hundred exciting things in computer vision, and when you start to see combinations of these perceptual domains, you’re going to have new applications that are hard to even imagine right now.”
AI at scale
Models like those within the Turing family are a far cry from AGI, but Microsoft says it’s using the supercomputer to explore large models that can learn in a generalized way across text, images, and video data. So, too, is OpenAI. As MIT Technology Review reported earlier this year, a team within OpenAI called Foresight runs experiments to test how far it can push AI capabilities by training algorithms with increasingly large amounts of data and compute. Separately, according to that same bombshell report, OpenAI is developing a system trained on images, text, and other data using massive computational resources, an approach the company’s leadership believes is the most promising path toward AGI.
Indeed, Brockman and Altman in particular believe AGI will be able to master more fields than any one person, chiefly by identifying complex cross-disciplinary connections that elude human experts. Furthermore, they predict that responsibly deployed AGI — in other words, AGI deployed in “close collaboration” with researchers in relevant fields, like social science — might help solve longstanding challenges in climate change, health care, and education.
It’s unclear whether the new supercomputer is powerful enough to achieve anything close to AGI, whatever form it might take; last year, Brockman told the Financial Times that OpenAI expects to spend the whole of Microsoft’s $1 billion investment by 2025 building a system that can run “a human brain-sized AI model.” In 2018, OpenAI’s own researchers released an analysis showing that from 2012 to 2018, the amount of compute used in the largest AI training runs grew more than 300,000 times with a 3.5-month doubling time, far exceeding the pace of Moore’s law. In keeping with that trend, last week IBM detailed the Neural Computer, which uses hundreds of custom-designed chips to train Atari-playing AI in record time, and Nvidia announced the DGX A100, a 5-petaflop server based on its A100 Tensor Core graphics card.
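The growth figures from OpenAI’s analysis can be sanity-checked with a little arithmetic. The sketch below works out how many doublings a 300,000-fold increase implies and how long they take at 3.5 months each, then compares that with Moore’s law over the same span. The 24-month doubling period used for Moore’s law is an assumption, since the text above does not state one.

```python
import math

growth_factor = 300_000        # growth in training compute, per the analysis
doubling_months = 3.5          # doubling time, per the analysis
moore_doubling_months = 24.0   # assumed doubling period for Moore's law

# Number of doublings needed to reach the reported growth factor.
doublings = math.log2(growth_factor)

# Time those doublings take at the reported doubling time.
span_months = doublings * doubling_months

# Growth Moore's law would deliver over the same span, under the assumption above.
moore_growth = 2 ** (span_months / moore_doubling_months)

print(f"doublings: {doublings:.1f}")
print(f"span: {span_months:.0f} months ({span_months / 12:.1f} years)")
print(f"Moore's law over the same span: about {moore_growth:.1f}x")
```

Under these assumptions the 300,000-fold increase takes a little over five years, which fits inside the 2012-to-2018 window, while a 24-month doubling would yield only a single-digit multiple over the same period.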
There’s evidence that efficiency improvements might offset the mounting compute requirements. A separate, more recent OpenAI survey found that since 2012, the amount of compute needed to train an AI model to the same performance on classifying images in a popular benchmark (ImageNet) has been decreasing by a factor of two every 16 months. But the extent to which compute contributes to performance, compared with novel algorithmic approaches, remains an open question.
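To see what a halving every 16 months adds up to, the sketch below converts an elapsed time into a cumulative reduction factor. The seven-year span is an illustrative choice, not a figure from the survey, and the function name is hypothetical.

```python
def compute_reduction(months_elapsed, halving_months=16.0):
    """Factor by which required compute shrinks if it halves every `halving_months`."""
    return 2 ** (months_elapsed / halving_months)


# Illustrative span of seven years; the survey's exact window is not given here.
months = 7 * 12
reduction = compute_reduction(months)
print(f"After {months} months, the same result needs about 1/{reduction:.0f} of the compute")
```

The same function can be set against the 3.5-month doubling in training compute to see how much faster demand has been growing than efficiency.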
It should be noted, of course, that OpenAI has achieved remarkable AI gains in gaming and media synthesis with fewer resources at its disposal. On Google Cloud Platform, the company’s OpenAI Five system played 180 years’ worth of games every day on 256 Nvidia Tesla P100 graphics cards and 128,000 processor cores to beat professional players (and 99.4% of players in public matches) at Valve’s Dota 2. More recently, the company trained a system on at least 64 Nvidia V100 graphics cards and 920 worker machines with 32 processor cores each to manipulate a Rubik’s Cube with a robot hand, albeit with a relatively low success rate. And OpenAI’s Jukebox model ran simulations on 896 V100 graphics cards to learn to generate music in any style from scratch, complete with lyrics.
New market opportunities
Whether the supercomputer turns out to be a small stepping stone or a large leap to AGI, the software tools used to design it potentially open new market opportunities for Microsoft. Through its AI at Scale initiative, the tech giant is making resources available to train large models on Azure AI accelerators and networks in an optimized way. The approach splits training data into batches that are used to train multiple instances of a model across clusters; those instances are then periodically averaged to produce a single model.
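The split-train-average pattern described above can be sketched in a few lines. This is a toy, single-process simulation under stated assumptions, not Microsoft’s implementation: the “workers” run one after another, the model is a single weight fit to y = 3x, and all function names and hyperparameters are hypothetical.

```python
import random


def local_sgd(w, shard, lr, steps):
    """Run a few SGD steps on one worker's shard for the model y = w * x."""
    for _ in range(steps):
        x, y = random.choice(shard)
        grad = 2 * (w * x - y) * x  # derivative of squared error w.r.t. w
        w -= lr * grad
    return w


def train_with_averaging(data, n_workers=4, rounds=20, local_steps=10, lr=0.01):
    """Shard the data, train one replica per shard, and average periodically."""
    shards = [data[i::n_workers] for i in range(n_workers)]
    w = 0.0
    for _ in range(rounds):
        # Each replica starts from the shared weight and trains on its own shard.
        replicas = [local_sgd(w, shard, lr, local_steps) for shard in shards]
        # Averaging the replicas yields the single model for the next round.
        w = sum(replicas) / n_workers
    return w


random.seed(0)
data = [(i / 10, 3.0 * (i / 10)) for i in range(1, 41)]  # noise-free samples of y = 3x
w = train_with_averaging(data)
print(f"learned weight: {w:.4f}")
```

On a real cluster each replica would run on separate hardware and the averaging step would be a collective communication operation, which is where fast interconnects between graphics card servers matter.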
These resources include a new version of DeepSpeed, an AI library for Facebook’s PyTorch machine learning framework that can train models over 15 times larger and 10 times faster on the same infrastructure, and support for distributed training on the ONNX Runtime. When used with DeepSpeed, distributed training on ONNX enables models across hardware and operating systems to deliver performance improvements of up to 17 times, Microsoft claims.
“By developing this leading-edge infrastructure for training large AI models, we’re making all of Azure better,” Scott said in a statement. “We’re building better computers, better distributed systems, better networks, better datacenters. All of this makes the performance and cost and flexibility of the entire Azure cloud better.”