Intel is the world’s biggest maker of processors for computers, but it hasn’t been the fastest when it comes to capitalizing on the artificial intelligence computing explosion. Rival Nvidia and a number of AI processor startups have jumped into the market, and Intel has been playing catch-up.
But the big company has been moving fast. It acquired AI chip design firm Nervana in 2016 for $350 million, and Intel recently announced that its Xeon CPUs generated $1 billion in revenue in 2017 for use in AI applications. Intel believes that the overall market for AI chips will reach between $8 billion and $10 billion in revenue by 2022. And the company is focused on designing AI chips from the ground up, according to Gadi Singer, vice president and general manager of AI architecture at Intel.
Singer, his boss Naveen Rao, and chip architect Jim Keller have been hitting the road lately to show that Intel isn’t asleep at the switch when it comes to the hot AI chip design trends. In fact, the company is pouring even more energy into AI chip design, which will hopefully help keep Moore’s Law (the doubling of transistors on a chip every couple of years) on track. Advances in chip manufacturing are slowing down, but chip designers are trying to make up for that with innovations in architecture.
Singer is excited about the opportunities for innovation. We recently had an opportunity to talk about this topic and were joined by Intel software executive Huma Abidi in South Park in San Francisco.
Here’s an edited transcript of our conversation.
VentureBeat: I wanted to go a little higher level and talk about some of the history here. On my side I have a very simple understanding of processing. I see that x86 processors are great for serial processing. GPUs have been great for parallel processing. When deep learning came along, the parallelism of GPUs was very useful, and so they became very popular. They made their way into the data center. But they weren’t designed for AI processing in particular.
There’s this adaptation time that’s happening now with chip design, it seems, where everybody’s figuring out the best way to do AI processing. You have one approach from one side, another approach from another side, and then these other folks like Nervana who are approaching the problem from the ground up. That’s what I wanted to understand. What does “from the ground up” mean?
Gadi Singer: In this particular case, a lot of the technologies were acquired with Nervana. When we acquired Nervana, we got an architecture as well as a software stack. Then we worked with Intel resources and some of the Intel old-timers like myself to apply all of Intel’s knowledge in compute to improving on that. But basically, the architecture came with the Nervana acquisition for NNP.
Everything you said is a good articulation of the framework. When we talk about architecture from the ground up, it’s about how you do the computing differently, how you deal with memory, and how you deal with data movement. In terms of compute, a CPU has several threads. The GPU has multiple threads that it can run, but the concept is still temporal. There’s still an assumption of some instruction pointer that moves and does one piece of work after another. It’s temporal space.
Architectures that are built for AI, for deep learning, are built as what’s called spatial architecture. You’re able to create waves of computing with a high level of parallelism where you have the whole wave of computation happening together. In a way it models the process that happens in the brain, with the spatial structure of neurons. You have a wave of sight that goes to the first, second, and third layers of neurons, and so on. That’s not done sequentially. All that is done in a certain wave is done together. Architectures that are very effective in having this ability to create a wave of compute are spatial architectures.
One thing is, how do you build a very high level of parallelism into the compute structure itself, so it can do all those computations independent of each other, because they can be done with a high level of concurrency? The second thing has to do with data. There’s a lot of data involved. The natural entity is a tensor. When you talk about architectures for deep learning, the natural entity is not a scalar or a vector, even though you’re using vectors. It’s a multidimensional array. All the constructs of saving and moving things around are around tensors. It’s not an after-the-fact addition. It’s part of the inherent data structure. Within that you can have 8-bit or 16-bit or 32-bit entities, but it’s an array of those. The array is multidimensional. It could have some sparsity built into it, but it’s an array of your data. Then, when you think about residency and moving data, you think about it in terms of tensors.
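To make the idea of the tensor as the native unit more concrete, here is a minimal sketch in Python. It is not Nervana's data structure; the `Tensor` class, its fields, and its methods are hypothetical names chosen for illustration. It assumes a dense, row-major layout and ignores sparsity. What it shows is that shape, element width, and total size travel together as one object, so storage and movement can be planned per tensor rather than per scalar.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Tensor:
    """Hypothetical dense tensor: a flat buffer plus a shape and element width."""
    shape: tuple   # e.g. (batch, rows, columns)
    bits: int      # element width: 8, 16, or 32
    data: list     # flat, row-major storage

    def __post_init__(self):
        if self.bits not in (8, 16, 32):
            raise ValueError("element width must be 8, 16, or 32 bits")
        if len(self.data) != prod(self.shape):
            raise ValueError("data length does not match shape")

    def strides(self):
        """Row-major strides, in elements, for each dimension."""
        strides, step = [], 1
        for dim in reversed(self.shape):
            strides.append(step)
            step *= dim
        return tuple(reversed(strides))

    def at(self, *index):
        """Read one element by its multidimensional index."""
        offset = sum(i * s for i, s in zip(index, self.strides()))
        return self.data[offset]

    def nbytes(self):
        """Bytes occupied when the tensor is stored or moved as one unit."""
        return prod(self.shape) * self.bits // 8


# A 2 x 3 x 4 tensor of 16-bit elements.
t = Tensor(shape=(2, 3, 4), bits=16, data=list(range(24)))
```

Dropping the same tensor from 16-bit to 8-bit elements halves `nbytes()`, which is one reason narrower element types matter so much when data movement is the expensive part.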
The other thing you usually have in ground-up architectures is control of the data, explicit control. Rather than having caching hierarchies and so on, in deep learning it’s not arbitrary compute. You generally know what is going to be the sequence of operation. There’s much less dependency on control point and loops and so on. You know you’re working with this tensor multiplying this way with that tensor and then going to the next layer through those convolutions. You can actually plan where to put what data. You have lots of memory that sits close to the compute and it’s very effectively managed, so you have the right data at the right time near the compute to do the activity effectively.
The last element has to do with connectivity. Since you’re moving very large quantities of data, your elements are very large. The way you move data on the die and between dies needs to have very high bandwidth and high throughput, with the ability to move the data between dies in a multi-chip solution, or between accelerators in the host, or within the accelerators between the various compute elements. When we look at the Nervana neural net processing, it does all of that in a way that’s native, in a way that’s designed for it. It’s the most optimal way of approaching these types of challenges.
VentureBeat: The elements of block design, what are they looking like on a Nervana chip?
Singer: Because it’s designed for deep learning and not for a general purpose, the center is usually around matrix multiplication and convolution. Those are the two basic operations you need to do very well. All of those implementations in our Nervana NNP have a very effective structure to multiply matrices and do that in a flowing manner. You multiply elements and they flow. You multiply the next one as well.
Then there are some functions that are common, activation functions and others that are common for deep learning. If they become really common, you would create some acceleration for those functions. Then you usually also have some more general compute to aid with things that are not accelerated.
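For readers who want to see the operations Singer is describing, the following is a plain Python sketch of matrix multiplication, a 2D convolution, and one common activation function (ReLU, chosen here as an example; the interview does not name a specific one). It is a reference-style illustration only. The function names are hypothetical, and it says nothing about how the Nervana NNP implements these operations in hardware.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def conv2d(image, kernel):
    """'Valid' 2D convolution with stride 1.

    As in most deep learning code, the kernel is not flipped, so this is
    strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(matrix):
    """Element-wise activation: negative values become zero."""
    return [[max(0, v) for v in row] for row in matrix]


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
feature_map = conv2d([[5, 2, 0], [0, 1, 3], [4, 0, 1]], [[1, 0], [0, -1]])
activated = relu(feature_map)
```

Both `matmul` and `conv2d` reduce to long runs of multiply-and-add steps that do not depend on each other, which is the property a dedicated matrix engine exploits.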
VentureBeat: As these architectures evolve over time, are you looking at different custom processors on the general-purpose processor? Putting floating point in, that sort of thing. Nvidia’s putting tensor processors in. I wonder how much of a solution that is, versus tearing everything up and starting over.
Singer: It’s good, but not optimal. You can get results from that, because now that you have additional engines that do that particular task, it makes it more efficient. But it’s not as optimal as when you design from scratch. A lot of the efficiencies, for example, come from effective data movement. Moving data is very expensive. It’s expensive both in terms of power and time. Even if you have the right elements, if you’re not effective in the way you move the data across the system, you will not have an optimal solution.
Here’s a perspective on two points in time. In 2015-2016, and then 2019-2020. 2015-2016, those were the breakthrough days of deep learning. This is when image match results based on deep learning were starting to match human abilities. This is when we started to have speech recognition, after 30 or 40 years of gradual improvement, that showed major improvement. This is when we had big improvement in machine translation. This was when the technology promise was coming into center stage.
It was primarily about proving what the technology could do. It wasn’t so much about deploying it as part of business processes. It was, “Hey, you can identify a violin in an image with 95 percent accuracy, like humans do.” It was primarily around proving that it can be done. It was primarily around training, to train a system and demonstrate that it worked. The ratio between training and inference was about one to one.
Building the models still required a lot of deep expertise. It was done in C++ environments or proprietary environments like CUDA. This is where the hardware architecture, GPU and Nvidia and CUDA, came into focus on training, in addition to CPU that’s always been there for doing machine learning tasks. It required a lot of expertise and specialization on the software side. The hardware was primarily CPU and GPU. The audience was primarily the early adopters, the guys who really go for new technologies, the advanced researchers. That’s when things were showing how great they could be, but they were not being deployed. I call it the “illustrious childhood.” You see a kid, you see his potential, and he’s not bringing any money home because he’s not in business yet, but you can see the potential and how great he’s going to be when he grows up.
When I look at 2019-2020, it’s really the coming of age for deep learning. From proving things, we see them now getting deployed as part of lines of business. For example, identifying images in a sample use case, you see that now deployed in medical imaging. We see 20x or 30x more complex 3D images of cells with companies like Novartis. The model is more complex, but the types of topologies you apply are also more complex. It’s much easier to learn what a violin looks like compared to a chair, than what a malignant cancer cell looks like compared to a benign one. The problem space has changed. The types of topologies and solutions have changed a lot. Instead of being primarily around matrix multiplication and convolution, now it’s a combination of some sequential code and some matrix code. It’s a combination of compute that creates more real-life solutions.
In terms of the environment for developing machine learning, it’s completely transformed because of the deep learning frameworks. TensorFlow, which I believe is now the most popular, is just two and a half years since its public introduction. Three years ago it was something very local. All of those things like TensorFlow and MXNet, they’ve just come about. This created a democratization.
VentureBeat: It’s less than a chip design cycle, then?
Singer: Well, they were in the making for a long time. But they matured and materialized into the environment in the last two and a half years. I’d call this the democratization of data science. Suddenly you don’t have to be a C++ programmer. You have this abstraction with all the primitives. You can be a data science expert and just do your work in TensorFlow or MXNet. That also dissociates it from specific proprietary interfaces, like CUDA and others, and allows hardware to come underneath that. That’s a major transition.
Also, when you start deploying things, the emphasis moves from training to inference. If you train a model, you can deploy it later in 10,000 or 100,000 machines to just do it 24/7. An example would be, we’re working with Taboola. Taboola is one of the leading recommendation engines. They have about 360 billion recommendations done every month, and one billion individual users every month, working for many different websites. They have a training center, but that center isn’t growing much. It’s a sunk capacity. What they grow all the time is their inference compute. The ratio between inference cycles and training cycles is shifting substantially toward inference.
There are new considerations because it’s used for deployment. Things like total cost of ownership are becoming important. If you just want to prove that something is possible, it doesn’t matter if the machine costs $10,000 or $50,000. When you’re deploying en masse, you’re starting to consider things like performance per watt, performance per dollar. You’re looking at latency, not only at throughput. All of those factors are making it into real life. This coming of age of deep learning is what Intel is targeting, because this goes toward Intel’s strengths. When you look at total cost of ownership and power efficiency and the diversity of compute, we’re building toward those things, and we’re already very good at them.
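The deployment metrics Singer lists can be written down as simple ratios. The sketch below is an illustration only: the `Server` class, the `pick` helper, and every number in it are made up (the two prices simply echo the $10,000 and $50,000 figures mentioned above), and real total-cost-of-ownership models include many more terms, such as cooling, space, and utilization.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    inferences_per_sec: float  # sustained throughput
    watts: float               # average power draw under load
    price_usd: float           # purchase price
    latency_ms: float          # time to answer a single request

    def perf_per_watt(self):
        return self.inferences_per_sec / self.watts

    def perf_per_dollar(self):
        return self.inferences_per_sec / self.price_usd


# All figures are hypothetical and exist only to show the arithmetic.
candidates = [
    Server("option_a", inferences_per_sec=2000, watts=400,
           price_usd=10_000, latency_ms=8),
    Server("option_b", inferences_per_sec=6000, watts=1500,
           price_usd=50_000, latency_ms=20),
]


def pick(servers, max_latency_ms):
    """Most power-efficient server that still meets the latency budget, or None."""
    eligible = [s for s in servers if s.latency_ms <= max_latency_ms]
    return max(eligible, key=Server.perf_per_watt, default=None)
```

In this invented example, `option_b` has three times the raw throughput, yet `option_a` comes out ahead on both ratios and is the only one that fits a 10 ms latency budget. That is the kind of trade-off that does not show up when the only goal is to prove a model works.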
One of the considerations when we worked with Facebook: Facebook does all their key inference services based on Xeon. On training, it’s mixed. Some of them are GPU-based, some of them are Xeon-based, some of them are mixed. The reason is that they have the infrastructure already. It does the work very well as a foundation. Then they can use any spare cycles to do that. It’s very effective as far as total cost of ownership.
If I could summarize the strategy, on the hardware we continue to improve Xeon. Xeon is a great foundation that does a great job. It’s very versatile. Most inference today already runs on Xeon, so we want to make it even better by introducing new instructions. We announced DL Boost with VNNI just three weeks ago. We announced several technologies we’re bringing in. We’ve optimized the software, too. We’re adding accelerators. Xeon is great, but we have the most complete set of accelerators for those who want to do it 24/7. We have FPGAs. We have the Movidius for the end devices. We have Mobileye for the automotive. We have the Nervana NNP coming next year. We already have the software development vehicle and we’re optimizing software for it.
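Singer does not say what the new instructions compute. As a rough illustration, and assuming the commonly documented behavior of the VNNI dot-product operation (four unsigned 8-bit values multiplied by four signed 8-bit values, with the products summed into a 32-bit accumulator), a single accumulator lane can be emulated in Python as follows. The function names are hypothetical, and this models only the arithmetic of the non-saturating form, not the vector width or the performance.

```python
def dot4_accumulate(acc, activations, weights):
    """Emulate one 32-bit lane: acc + sum of four u8 * s8 products, wrapped to int32."""
    if len(activations) != 4 or len(weights) != 4:
        raise ValueError("each lane consumes exactly four 8-bit pairs")
    if not all(0 <= a <= 255 for a in activations):
        raise ValueError("activations must fit in unsigned 8 bits")
    if not all(-128 <= w <= 127 for w in weights):
        raise ValueError("weights must fit in signed 8 bits")
    total = acc + sum(a * w for a, w in zip(activations, weights))
    # Wrap to a signed 32-bit value, as a fixed-width register would.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


def int8_dot(activations, weights):
    """Dot product of two equal-length int8 vectors, four elements per step."""
    if len(activations) != len(weights) or len(activations) % 4:
        raise ValueError("lengths must match and be a multiple of four")
    acc = 0
    for i in range(0, len(activations), 4):
        acc = dot4_accumulate(acc, activations[i:i + 4], weights[i:i + 4])
    return acc


result = int8_dot([1, 2, 3, 4, 5, 6, 7, 8], [1, -1, 1, -1, 2, 2, -2, -2])
```

The point of fusing the multiply, the four-way sum, and the accumulate into one step is that quantized inference spends most of its time in exactly this pattern.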
Xeon is the foundation. We’ll strengthen it on top with acceleration that’s dedicated and fully optimized, and then we have system work. All those solutions eventually need hosting, acceleration, storage, memory, and networking. Then, on the software side, you can simplify that by saying we optimize downward and we simplify upward. We optimize downward by having, for every architecture, the stack that takes it from the deep learning primitives to the best Xeon performance, to the best FPGA performance, to the best Movidius performance. Simplifying upward, by having a single layer like nGraph for the same primitives, we can tie it to the deep learning framework of choice. If you want to work on PyTorch you can work on PyTorch, or MXNet or TensorFlow. We take all those as front ends to a common place and we map it to our hardware. That gives you the basics.
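The "optimize downward, simplify upward" idea can be sketched as two lookup tables with a shared set of primitives in between. Everything in the snippet below is hypothetical: it is not the nGraph API, the operator names in the front-end table are only illustrative, and both back ends reuse the same plain Python reference kernels where a real stack would supply code tuned for each target.

```python
# Hypothetical names throughout; this is not the nGraph API.

# "Simplify upward": each framework front end maps its own operator names
# onto one shared set of primitives.
FRONTEND_TO_PRIMITIVE = {
    "tensorflow": {"MatMul": "dot", "Relu": "relu"},
    "pytorch": {"mm": "dot", "relu": "relu"},
    "mxnet": {"dot": "dot", "Activation": "relu"},
}


def reference_dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_relu(values):
    return [max(0, v) for v in values]


# "Optimize downward": each hardware target registers its own kernel for
# every primitive. Reference kernels stand in for tuned ones here.
BACKENDS = {
    "xeon": {"dot": reference_dot, "relu": reference_relu},
    "fpga": {"dot": reference_dot, "relu": reference_relu},
}


def lower(framework, op_names):
    """Translate a framework's operator names into shared primitives."""
    table = FRONTEND_TO_PRIMITIVE[framework]
    return [table[name] for name in op_names]


def compile_op(framework, op_name, backend):
    """Resolve a framework operator to the kernel a given back end provides."""
    primitive = FRONTEND_TO_PRIMITIVE[framework][op_name]
    return BACKENDS[backend][primitive]


kernel = compile_op("tensorflow", "MatMul", "xeon")
value = kernel([1, 2, 3], [4, 5, 6])
```

With N front ends and M hardware targets, a shared middle layer needs N + M translations rather than N × M, which is the practical payoff of routing every framework through a common place.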