Intel is the world’s biggest maker of processors for computers, but it hasn’t been the fastest when it comes to capitalizing on the artificial intelligence computing explosion. Rival Nvidia and a number of AI processor startups have jumped into the market, and Intel has been playing catch-up.
But the big company has been moving fast. It acquired AI chip design firm Nervana in 2016 for $350 million, and Intel recently announced that its Xeon CPUs generated $1 billion in revenue in 2017 for use in AI applications. Intel believes that the overall market for AI chips will reach between $8 billion and $10 billion in revenue by 2022. And the company is focused on designing AI chips from the ground up, according to Gadi Singer, vice president and general manager of AI architecture at Intel.
Singer, his boss Naveen Rao, and chip architect Jim Keller have been hitting the road lately to show that Intel isn’t asleep at the switch when it comes to the hot AI chip design trends. In fact, the company is pouring even more energy into AI chip design, which will hopefully help keep Moore’s Law (the doubling of transistors on a chip every couple of years) on track. Advances in chip manufacturing are slowing down, but chip designers are trying to make up for that with innovations in architecture.
Singer is excited about the opportunities for innovation. We recently had an opportunity to talk about this topic and were joined by Intel software executive Huma Abidi in South Park in San Francisco.
Here’s an edited transcript of our conversation.
VentureBeat: I wanted to go a little higher level and talk about some of the history here. On my side I have a very simple understanding of processing. I see that x86 processors are great for serial processing. GPUs have been great for parallel processing. When deep learning came along, the parallelism of GPUs was very useful, and so they became very popular. They made their way into the data center. But they weren’t designed for AI processing in particular.
There’s this adaptation time that’s happening now with chip design, it seems, where everybody’s figuring out the best way to do AI processing. You have one approach from one side, another approach from another side, and then these other folks like Nervana who are approaching the problem from the ground up. That’s what I wanted to understand. What does “from the ground up” mean?
Gadi Singer: In this particular case, a lot of the technologies were acquired with Nervana. When we acquired Nervana, we got an architecture as well as a software stack. Then we worked with Intel resources and some of the Intel old-timers like myself to put all the knowledge of Intel in compute to improving on that. But basically, the architecture came with the Nervana acquisition for NNP.
Everything you said is a good articulation of the framework. When we talk about architecture from the ground up, it’s about how you do the computing differently, how you deal with memory, and how you deal with data movement. In terms of compute, a CPU has several threads. The GPU has multiple threads that it can run, but the concept is still temporal. There’s still an assumption of some instruction pointer that moves and does one piece of work after the other. It’s temporal space.
Architectures that are built for AI, for deep learning, are built as what’s called spatial architecture. You’re able to create waves of computing with a high level of parallelism where you have the whole wave of computation happening together. In a way it models the process that happens in the brain, with the spatial structure of neurons. You have a wave of sight that goes to the first, second, and third layers of neurons, and so on. That’s not done sequentially. All that is done in a certain wave is done together. Architectures that are very effective in having this ability to create a wave of compute are spatial architectures.
One thing is, how do you build a very high level of parallelism into the compute structure itself, so it can do all those computations independent of each other, because they can be done with a high level of concurrency? The second thing has to do with data. There’s a lot of data involved. The natural entity is a tensor. When you talk about architectures for deep learning, the natural entity is not a scalar or a vector, even though you’re using vectors. It’s a multidimensional array. All the constructs of saving and moving things around are around tensors. It’s not an after-the-fact addition. It’s part of the inherent data structure. Within that you can have 8-bit or 16-bit or 32-bit entities, but it’s an array of those. The array is multidimensional. It could have some sparsity built into it, but it’s an array of your data. Then, when you think about residency and moving data, you think about it in terms of tensors.
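To make the "tensor as the natural entity" idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not Intel or Nervana code: the `Tensor` class and its method names are hypothetical, and row-major storage is an assumption. It shows a multidimensional array of fixed-width elements stored flat, with the index arithmetic that turns a multidimensional position into a storage offset.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Tensor:
    """Hypothetical minimal tensor: a shape, an element width, and flat storage."""
    shape: tuple   # for example (2, 3, 4)
    bits: int      # element width in bits: 8, 16, or 32
    data: list     # flat, row-major storage (an assumption of this sketch)

    def strides(self):
        # Row-major strides: elements to skip per step along each axis.
        out, step = [], 1
        for dim in reversed(self.shape):
            out.append(step)
            step *= dim
        return tuple(reversed(out))

    def at(self, *index):
        # Translate a multidimensional index into a flat offset.
        offset = sum(i * s for i, s in zip(index, self.strides()))
        return self.data[offset]

    def nbytes(self):
        # Storage footprint depends on element count and element width.
        return prod(self.shape) * self.bits // 8


# A 2 x 3 x 4 tensor of 8-bit elements holding 0..23.
t = Tensor(shape=(2, 3, 4), bits=8, data=list(range(24)))
```

The same shape stored with 32-bit elements takes four times the bytes, which is one reason element width matters when the data has to be kept close to the compute.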
The other thing you usually have in ground-up architectures is control of the data, explicit control. Rather than having caching hierarchies and so on, in deep learning it’s not arbitrary compute. You generally know what is going to be the sequence of operation. There’s much less dependency on control point and loops and so on. You know you’re working with this tensor multiplying this way with that tensor and then going to the next layer through those convolutions. You can actually plan where to put what data. You have lots of memory that sits close to the compute and it’s very effectively managed, so you have the right data at the right time near the compute to do the activity effectively.
The last element has to do with connectivity. Since you’re moving very large quantities of data, your elements are very large. The way you move data on the die and between dies needs to have very high bandwidth and high throughput, with the ability to move the data between dies in a multi-chip solution, or between accelerators in the host, or within the accelerators between the various compute elements. When we look at the Nervana neural net processing, it does all of that in a way that’s native, in a way that’s designed for it. It’s the most optimal way of approaching these types of challenges.
VentureBeat: The elements of block design, what are they looking like on a Nervana chip?
Singer: Because it’s designed for deep learning and not for a general purpose, the center is usually around matrix multiplication and convolution. Those are the two basic operations you need to do very well. All of those implementations in our Nervana NNP have a very effective structure to multiply matrices and do that in a flowing manner. You multiply elements and they flow. You multiply the next one as well.
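The two operations Singer names can be sketched in a few lines of plain Python. This is a reference-style illustration, not how any accelerator implements them: the function names are made up for this example, and the convolution follows the deep learning convention (no kernel flip, no padding). The third function shows why the two operations are so closely linked: a convolution can be rewritten as one matrix multiplication by unrolling image patches into rows, a technique commonly called im2col.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix (lists of lists)."""
    k = len(b)
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def conv2d(image, kernel):
    """'Valid' 2D convolution, deep learning style (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(ow)] for y in range(oh)]


def conv2d_as_matmul(image, kernel):
    """Same result via im2col: unroll patches into rows, then one matmul."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    # One row per output position, one column per kernel element.
    patches = [[image[y + i][x + j] for i in range(kh) for j in range(kw)]
               for y in range(oh) for x in range(ow)]
    # The kernel flattened into a column vector.
    flat_kernel = [[kernel[i][j]] for i in range(kh) for j in range(kw)]
    flat_out = matmul(patches, flat_kernel)  # shape (oh * ow) x 1
    return [[flat_out[y * ow + x][0] for x in range(ow)] for y in range(oh)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
direct = conv2d(image, kernel)
via_matmul = conv2d_as_matmul(image, kernel)
```

Both paths give the same output, which suggests why hardware that multiplies matrices well can also serve convolution.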
Then there are some functions that are common, activation functions and others that are common for deep learning. If they become really common, you would create some acceleration for those functions. Then you usually also have some more general compute to aid with things that are not accelerated.
VentureBeat: As these architectures evolve over time, are you looking at different custom processors on the general-purpose processor? Putting floating point in, that sort of thing. Nvidia’s putting tensor processors in. I wonder how much of a solution that is, versus tearing everything up and starting over.
Singer: It’s good, but not optimal. You can get results from that, because now that you have additional engines that do that particular task, it makes it more efficient. But it’s not as optimal as when you design from scratch. A lot of the efficiencies, for example, come from effective data movement. Moving data is very expensive. It’s expensive both in terms of power and time. Even if you have the right elements, if you’re not effective in the way you move the data across the system, you will not have an optimal solution.
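A rough counting model can illustrate the claim that data movement, not arithmetic, often decides efficiency. The sketch below is an assumption-laden illustration, not a description of any Intel design: it assumes a square matrix multiply, a "near" memory that can hold one tile of each operand, and it counts only operand loads from "far" memory. The function name and the cost model are hypothetical.

```python
def far_memory_loads(n, tile):
    """Count operand elements loaded from far memory for an n x n matmul.

    Model (an assumption of this sketch): the work is split into
    tile x tile blocks, and each (i, j, k) block step loads one tile of A
    and one tile of B into near memory. tile=1 is the naive case where
    every multiply fetches both operands from far memory.
    """
    if n % tile != 0:
        raise ValueError("tile must divide n")
    blocks = n // tile
    return blocks ** 3 * 2 * tile * tile


naive = far_memory_loads(64, 1)    # every operand fetched from far memory
tiled = far_memory_loads(64, 16)   # 16 x 16 tiles reused from near memory
reduction = naive // tiled
```

The number of multiplications is identical in both cases. Under this model only the traffic changes, and it falls by a factor equal to the tile size.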
Here’s a perspective on two points in time. In 2015-2016, and then 2019-2020. 2015-2016, those were the breakthrough days of deep learning. This is when image match results based on deep learning were starting to match human abilities. This is when we started to have speech recognition, after 30 or 40 years of gradual improvement, that showed major improvement. This is when we had big improvement in machine translation. This was when the technology promise was coming into center stage.
It was primarily about proving what the technology could do. It wasn’t so much about deploying it as part of business processes. It was, “Hey, you can identify a violin in an image with 95 percent accuracy, like humans do.” It was primarily around proving that it can be done. It was primarily around training, to train a system and demonstrate that it worked. The ratio between training and inference was about one to one.
Building the models still required a lot of deep expertise. It was done in C++ environments or proprietary environments like Cuda. This is where the hardware architecture, GPU and Nvidia and Cuda, came into focus on training, in addition to CPU that’s always been there for doing machine learning tasks. It required a lot of expertise and specialization on the software side. The hardware was primarily CPU and GPU. The audience was primarily the early adopters, the guys who really go for new technologies, the advanced researchers. That’s when things were showing how great they could be, but they were not being deployed. I call it the “illustrious childhood.” You see a kid, you see his potential, and he’s not bringing any money home because he’s not in business yet, but you can see the potential and how great he’s going to be when he grows up.
When I look at 2019-2020, it’s really the coming of age for deep learning. From proving things, we see them now getting deployed as part of lines of business. For example, identifying images in a sample use case, you see that now deployed in medical imaging. We see 20x or 30x more complex 3D images of cells with companies like Novartis. The model is more complex, but the type of topologies you apply are also more complex. It’s much easier to learn what a violin looks like compared to a chair, than what a malignant cancer cell looks like compared to a benign one. The problem space has changed. The type of topologies and solutions have changed a lot. Instead of being primarily around matrix multiplication and convolution, now it’s a combination of some sequential code and some matrix code. It’s a combination of compute that creates more real-life solutions.
In terms of the environment for developing machine learning, it’s completely transformed because of the deep learning frameworks. TensorFlow, which I believe is now the most popular, is just two and a half years since its public introduction. Three years ago it was something very local. All of those things like TensorFlow and MXNet, they’ve just come about. This created a democratization.
VentureBeat: It’s less than a chip design cycle, then?
Singer: Well, they were in the making for a long time. But they matured and materialized into the environment in the last two and a half years. I’d call this the democratization of data science. Suddenly you don’t have to be a C++ programmer. You have this abstraction with all the primitives. You can be a data science expert and just do your work in TensorFlow or MXNet. That also disassociates it from specific proprietary interfaces, like Cuda and others, and allows hardware to come underneath that. That’s a major transition.
Also, when you start deploying things, the emphasis moves from training to inference. If you train a model, you can deploy it later in 10,000 or 100,000 machines to just do it 24/7. An example would be, we’re working with Taboola. Taboola is one of the leading recommendation engines. They have about 360 billion recommendations done every month, and one billion individual users every month, working for many different websites. They have a training center, but that center isn’t growing much. It’s a sunk capacity. What they grow all the time is their inference compute. The ratio between inference cycles and training cycles is shifting substantially toward inference.
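The Taboola figures quoted above give a sense of what "inference 24/7" means in practice. The arithmetic below uses only the two numbers from the interview. The 30-day month is an assumption, and the result is an average rate, not a peak.

```python
RECOMMENDATIONS_PER_MONTH = 360_000_000_000  # figure quoted in the interview
USERS_PER_MONTH = 1_000_000_000              # figure quoted in the interview
SECONDS_PER_MONTH = 30 * 24 * 60 * 60        # assumption: a 30-day month

per_second = RECOMMENDATIONS_PER_MONTH / SECONDS_PER_MONTH
per_user = RECOMMENDATIONS_PER_MONTH / USERS_PER_MONTH

print(f"{per_second:,.0f} recommendations per second on average")
print(f"{per_user:.0f} recommendations per user per month")
```

That works out to roughly 139,000 inference requests every second, around the clock, which helps explain why the inference fleet keeps growing while the training capacity stays fixed.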
There are new considerations because it’s used for deployment. Things like total cost of ownership are becoming important. If you just want to prove that something is possible, it doesn’t matter if the machine costs $10,000 or $50,000. When you’re deploying en masse, you’re starting to consider things like performance to watt, performance to dollar. You’re looking at latency, not only at throughput. All of those factors are making it into real life. This coming of age of deep learning is where Intel is targeting, because this goes toward Intel’s strengths. When you look at total cost of ownership and power efficiency and the diversity of compute, we’re building toward those things, and we’re already very good at them.
One of the considerations when we worked with Facebook–Facebook does all their key inference services based on Xeon. On training, it’s mixed. Some of them are GPU-based, some of them are Xeon-based, some of them are mixed. The reason is because they have the infrastructure already. It does the work very well as a foundation. Then they can use any spare cycles to do that. It’s very effective as far as total cost of ownership.
If I could summarize the strategy, on the hardware we continue to improve Xeon. Xeon is a great foundation that does a great job. It’s very versatile. Most inference today already runs on Xeon, so we want to make it even better by introducing new instructions. Just three weeks ago we announced DL Boost with VNNI. We announced several technologies we’re bringing in. We’ve optimized the software, too. We’re adding accelerators. Xeon is great, but we have the most complete set of accelerators for those who want to do it 24/7. We have FPGAs. We have the Movidius for the end devices. We have Mobileye for the automotive. We have the Nervana NNP coming next year. We already have the software development vehicle and we’re optimizing software for it.
Xeon is the foundation. We’ll strengthen it on top with acceleration that’s dedicated and fully optimized, and then we have system work. All those solutions eventually need hosting, acceleration, storage, memory, and networking. Then, on the software side, you can simplify that by saying we optimize downward and we simplify upward. We optimize downward by having, for every architecture, the stack that takes it from the deep learning primitives to the best Xeon performance, to the best FPGA performance, to the best Movidius performance. Simplifying upward, by having a single layer like nGraph for the same primitives, we can tie it to the deep learning framework of choice. If you want to work on PyTorch you can work on PyTorch, or MXNet or TensorFlow. We take all those as front ends to a common place and we map it to our hardware. That gives you the basics.
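The "optimize downward, simplify upward" idea can be sketched as a small many-to-many layer. The code below is a hypothetical illustration of the pattern only. None of these names are nGraph APIs, and the two "frameworks" and two "backends" are invented stand-ins. Each front end translates its own model description into one shared intermediate form, and each backend only has to understand that shared form.

```python
# Hypothetical sketch: many frameworks -> one shared form -> many backends.
# None of these functions or names come from nGraph or any real framework.

def from_framework_a(model):
    # Pretend framework A describes ops as (name, args) tuples.
    return [{"op": name.lower(), "args": tuple(args)} for name, args in model]


def from_framework_b(model):
    # Pretend framework B uses dicts with different key names.
    return [{"op": node["type"].lower(), "args": tuple(node["inputs"])}
            for node in model]


def lower_for_cpu(ir):
    # A real backend would emit optimized kernels; this just labels ops.
    return [f"cpu_{node['op']}" for node in ir]


def lower_for_accelerator(ir):
    return [f"accel_{node['op']}" for node in ir]


FRONTENDS = {"framework_a": from_framework_a, "framework_b": from_framework_b}
BACKENDS = {"cpu": lower_for_cpu, "accelerator": lower_for_accelerator}


def compile_model(framework, model, target):
    """Route any supported front end to any supported backend."""
    ir = FRONTENDS[framework](model)
    return BACKENDS[target](ir)


plan_a = compile_model(
    "framework_a", [("MatMul", ("x", "w")), ("Relu", ("y",))], "cpu")
plan_b = compile_model(
    "framework_b", [{"type": "MatMul", "inputs": ["x", "w"]}], "accelerator")
```

The design payoff is in the counts: with N front ends and M backends, a shared middle layer needs N + M translators instead of N × M direct pairings.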
VentureBeat: Different things I’m noticing are–Nvidia’s latest chips seem like they’re monstrously large, like 15 billion transistors. That’s what they call some of their AI-designed chips. What’s happening there? Is AI a lot less efficient in terms of processing than traditional PC or data center tasks? Is something about AI processing causing larger problems for chip designers? Hot Chips was very crowded this year. It was almost going through a revival, because everyone wants to learn something about designing AI chips. I don’t know how difficult some of this is becoming.
Singer: The reason why chip sizes grow is primarily because of the data set and the opportunity for high concurrency within the data set. It’s not as much because of the complexity of the problem. It’s primarily because, when you have those large tensors–the more we get into real-life cases, the imaging is becoming more complex. The data sets on speech and language models are becoming larger. You just have large tensors, and you have a lot of opportunity for parallelizing it, because there’s less dependency when you do those waves of compute.
Regardless of efficiency–some companies will do it more efficiently than others. But there’s an opportunity for concurrency. The data is large and there’s a lot of inherent concurrency in the computation. We believe some things can be done more efficiently, so I’m not commenting on others. But inherently, it’s larger data sizes which can be done concurrently.
To your point about complexity, the original networks were actually simple. The original AlexNet and GoogLeNet were basically lots of matrix multiplication and convolution, and some basic functions. The more modern topologies are becoming more complex. This drives not necessarily size, but it drives a need for more sophistication in integration between different types of compute.
I’ll give you an example. When you have things like neural machine translation, NMT, within NMT you have portions that are pure neural net, and there are portions that are sequential. When you look at what’s called, in this case, attention algorithms, you look at your history and try to pick up relevant things, like you do in your brain. When there’s something you need to understand in context, you search for things that you know from the past that might give context to the new information coming in. Sophisticated architectures benefit from the ability to effectively combine the various types of compute. They do neural networks very well, but they also have a system view that integrates it in an optimal manner with other types of compute. Complexity comes from the system view of the solution, integrated very effectively.
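The "look at your history and pick up relevant things" step can be sketched as a small attention function. The interview does not say which attention variant is meant, so the dot-product scoring used here is an assumption, and the function names are invented for this example. Each past item has a key and a value; the query is scored against every key, the scores are normalized, and the result is a weighted blend of the values.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention over a history of (key, value) pairs."""
    # Score each past key by how well it matches the current query.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Blend the past values according to those weights.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, context


keys = [[1.0, 0.0], [0.0, 1.0]]      # two remembered items
values = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend([4.0, 0.0], keys, values)
```

The query matches the first key far better than the second, so nearly all of the weight lands on the first remembered item. The scoring and blending are matrix-style work, while the surrounding translation loop is sequential, which is the mix of compute types described above.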
VentureBeat: We’re getting into different kinds of AI processing. Earlier problems were not so hard to solve with the way deep learning started here. This is a flower, that’s also a flower, that’s not a flower. But what’s necessary now for harder problems is strong AI. When you’re in a dynamic environment and you’re driving a car and the environment around you is changing, all that data keeps on changing, the deep learning approach isn’t so good at spotting the one hazard coming at you that you need to quickly identify and avoid. Deep learning seems to be too dumb a way to arrive at the conclusion that there’s a threat coming. I don’t know if strong AI means something to everybody or not? But in this case, it’s a simple way of saying that we need to do a different kind of processing.
Singer: This is about machine learning as a whole. It’s deep learning plus other elements. I do believe that as we’re going into real-world problems, it’s going to be a complement of deep learning with other machine learning. There are other machine learning techniques out there that Intel has been working on for multiple years. Deep learning is the one that has the most breakthrough in the last four years, but to go to more complete solutions, absolutely, it has to have a set of capabilities that includes deep learning and other types of machine learning.
Deep learning is exceptionally good at some things, like identifying patterns and anomalies. But as we look at emergent machine learning, there needs to be complementary types of machine learning together with deep learning in order to solve problems. Deep learning still has a lot of growth, even if it’s now coming of age. It’s not anywhere close to tapering off. For full solutions to problems, we need to keep an eye out. We’re investing in other kinds of machine learning, not only in deep learning.
VentureBeat: I spoke to [Intel chip architect] Jim Keller (formerly of Tesla and Apple) recently. It seems like one of the tasks assigned to him and people like him is to recognize all these different architectures within the company and across the whole industry, and then to realize that there are different problems with different ways of solving them. Figuring out what’s going to be the best way to bring all of those things together.
Singer: The question we have is, how do we have a portfolio that’s rich enough to provide optimal solutions for the very different problem spaces, from one watt to 300 or 400 watts, from latency-sensitive to throughput-sensitive. How do you create a portfolio that’s broad enough that it doesn’t have overlaps? Reusing technologies for things that are similar. That’s a problem that Jim Keller drives, and a lot of us in the architecture leadership are participating in it. We have a portfolio, and we want to have a diverse portfolio that creates great coverage, but with minimal overlap.
VentureBeat: The chip designers, do they have to then come up to speed? If you grew up with x86, and now you have this whole new world of AI processing, are they having to adapt to a lot of things?
Singer: On the hardware side, yes. We have a combination. We have a lot of talent that came from the outside, both through company acquisitions like Movidius and Nervana and through individual acquisitions of talent. And then we have engineers doing CPU and network processing. A lot of network processing is relevant. They learn new skills. It’s a combination.
To the point about x86, we actually put AI acceleration within x86. x86 has always been something that grows. Floating point was added to x86. AVX and vector processing were added under x86. We have instructions like VNNI that are added under x86. We don’t see x86 as something that’s not AI. It’s a foundation that has AI, but also other things. Then we see dedicated solutions that are primarily AI.
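For readers who have not met VNNI, the core arithmetic is an 8-bit dot product accumulated into 32 bits. The sketch below emulates one 32-bit lane of that operation in plain Python: four unsigned 8-bit values are multiplied by four signed 8-bit values, and the four products are summed into a 32-bit accumulator. It models the arithmetic only; the function names are invented for this example, and the vector width, instruction encoding, and saturating variant are out of scope.

```python
def dot4_u8s8_accumulate(acc, a_u8, b_s8):
    """Emulate one 32-bit lane: acc += sum of four (u8 * s8) products."""
    if len(a_u8) != 4 or len(b_s8) != 4:
        raise ValueError("each lane consumes exactly four 8-bit pairs")
    if not all(0 <= a <= 255 for a in a_u8):
        raise ValueError("first operand must be unsigned 8-bit")
    if not all(-128 <= b <= 127 for b in b_s8):
        raise ValueError("second operand must be signed 8-bit")
    total = acc + sum(a * b for a, b in zip(a_u8, b_s8))
    # Wrap to signed 32 bits, as a non-saturating accumulate would.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


def int8_dot(a_u8, b_s8):
    """Dot product of longer vectors, four elements per step.

    Assumes the length is a multiple of four, to keep the sketch short.
    """
    acc = 0
    for i in range(0, len(a_u8), 4):
        acc = dot4_u8s8_accumulate(acc, a_u8[i:i + 4], b_s8[i:i + 4])
    return acc


lane = dot4_u8s8_accumulate(0, [1, 2, 3, 4], [1, -1, 1, -1])
full = int8_dot([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
```

Dot products like this are the inner loop of the matrix multiplications and convolutions used in inference, which is why folding the multiply, sum, and accumulate into a single instruction helps 8-bit inference on a CPU.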
You were asking if I see an integration trend, where technology starts to accelerate. We definitely look at it across the company. Things that can go in are going in, and some things are better fit to the outside, because the way they interact, without caching hierarchy and so on, is more appropriate, at least for the time being. But this trend of having technologies that are on the side and then get integrated in has been in semiconductors for a long time. Whatever fits under the x86 framework has a trend.
VentureBeat: There was that Darwinian adaptation of the CPU. It absorbs different things over time.
Singer: Some things are done when you have them outside. The way you get the data in, the way you get everything close to the compute, there might be some advantages for certain technologies, at least for a while, for there to be this acceleration.
VentureBeat: You’ve been talking more about connecting to other chips and co-processors in ways that–different ways of splitting up the architecture across chips.
Singer: Yes. I mentioned the three parts of the hardware strategy are to make Xeon continuously better, add accelerators, and then the third one, which we didn’t talk much about, is the system optimization. We talked about the fact that problems tend to be large. The data size is large. Having a multi-chip solution, either multiple hosts or a host and multiple accelerators–how do you partition the problem so it’s worked in parallel by multiple engines? How do you effectively move data?
Within the Nervana NNP, for example, we have a very fast interconnect that can connect from one substrate, one fabric, directly to another fabric without going through external memory. We can move data very effectively over a large and well-partitioned problem. We look at it as a system problem.
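The partitioning question can be sketched with the simplest possible scheme: split the rows of one matrix across several engines, let each engine multiply its slice, and stitch the results back together. This is an illustration under stated assumptions, not how the Nervana NNP partitions work: threads stand in for engines, the function names are invented, and every engine is given a full copy of the second matrix, which is exactly the kind of data movement a real system would try to minimize.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(matrix, parts):
    """Split rows into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(len(matrix), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(matrix[start:start + size])
        start += size
    return chunks


def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def partitioned_matmul(a, b, engines):
    """Each 'engine' (a thread here) multiplies its own slice of A by B."""
    chunks = split_rows(a, engines)
    with ThreadPoolExecutor(max_workers=engines) as pool:
        # map() returns results in chunk order, so rows stay in order.
        partials = list(pool.map(lambda chunk: matmul(chunk, b), chunks))
    return [row for part in partials for row in part]


a = [[1, 2], [3, 4], [5, 6]]
b = [[1, 0], [0, 1]]
result = partitioned_matmul(a, b, engines=2)
```

Row slices are independent, so no engine waits on another. The cost is that the second matrix has to reach every engine, which is where a fast interconnect between engines earns its keep.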
Now, how much do you put in a single package? There’s always the question of what you put on the same die. Now, with all the multi-die technologies, how do you put these packages on multiple dies, and how do you connect the packages together for a system? It’s a question of partitioning that changes all the time with the type of technologies and the silicon budget we have.
VentureBeat: When I was talking to Naveen Rao [head of AI products group at Intel], we brought up that notion of Moore’s Law slowing down. In the past, you had these free improvements when you went from one manufacturing node to the next. You shrank the circuits. But that doesn’t get taken for granted anymore. The opportunity to make more breakthroughs swings back to design. Does that fit into the context here in some ways?
Singer: As somebody who has been on the architecture and design and EDA for many years, I can tell you that we always looked at this as two driving forces. We didn’t rely only on one. The process had its cadence. On the architecture and design, we always pushed for what we could do on the architecture to get more instructions per cycle, the IPC. Better efficiency by not having to move unnecessary data. All the things that are processor-dependent.
You can look at it as–there’s a process improvement track. There’s a design improvement track. We were always pressed to get the best advancement that we can. We feel it now. Software is the same. We look at the architecture and design and the software as something that needs to move forward and make big gains alongside the process improvements.
VentureBeat: You still have different categories of graphics chip research going on, the CPU, and then the AI solutions. I don’t know if CPU is steered still toward the PC and the data center. Do you see a convergence of architecture, or something more like a bifurcation of architecture, where you need to keep doing separate things?
Singer: AI is not separate from CPU. AI is and has to be embedded in every single technology line. CPU has AI built into it. CPUs do a lot of other things, of course, but the investment we have in software is probably as large as we have in hardware. How do you use AVX for AI? How do you use the new VNNI for AI? How do you design the next construct to improve AI? AI is not something that’s orthogonal to CPU; it’s something that permeates all architectures. Everything, whether CPU, GPU, or FPGA, has to have AI capabilities that improve over time at all levels. There are things that are almost solely AI, but everything has to have some AI.
In terms of the architectures, we have a few architectures because we believe that the needs are so diverse that trying to stretch a single architecture to capture the half-watt to 400 watts, or the latency-sensitive architectures to the throughput-oriented, from the data center environment to others–we believe there are a few prototypical architectures that provide the best solutions for very different sets of requirements. But we don’t want to have more architectures than we need.
Whenever we can reuse a technology, we’re doing it. We’re moving technologies from Movidius into the CPU, or things that we’ve learned in general purpose from CPU to some of the accelerators, to give them some of those capabilities. We have some centers of gravity for architecture, for areas that are significantly different from each other. But we try to share the basic underlying technologies between them as much as we can.
VentureBeat: Is there some reason to restart a graphics investment, stand-alone graphics?
Singer: It’s a big market in general, regardless of AI.
VentureBeat: But nothing related to something else in the future? It’s just a good time to go back into it?
Singer: Intel is looking at the overall data-centric opportunity. They talked about it in the data-centric innovation summit. We see that as a tremendous opportunity overall. We’re looking at various types of data-centric applications. This is not an AI view. This is a broader view. GPU has a lot of advantages and capabilities that are applicable in a data-centric world. Intel wants to have a portfolio where, as a customer, when you want to have your space, whatever is optimized around whatever type of applications, Intel has a top of the line solution for you.
VentureBeat: I tend to hear this more from the Wall Street types, people like me who don’t really understand chips, but there’s this fear of competition. “Intel has to worry about AMD in PCs again.” The focus on competing against Ryzen or whatever. “Intel has to worry about ARM coming to the data center. Intel has to worry about Nvidia in GPUs and these AI processor startups.” So many things to worry about where Intel has no single silver bullet against all of this competition. That’s an outsider’s view of Intel. I wonder what your insider’s response would be.
Singer: The answer is really simple. It’s not simplistic, but it’s really simple. We’re not developing those products and technologies as a response to competition. The best way to stay ahead is to have a focus on what we think are the customer needs and the technology leadership in these various areas. We work toward that. When a competitor suddenly has a good product, yes, it creates more competitive pressure, but it doesn’t divert us to do something different because of that. The best strategy in the long term is to focus on what we believe.
We have such an intimate understanding of what’s needed in the PC, what’s needed in the data center, what’s needed for networking. Focusing on what we believe is the leading edge capability is a much better strategy for us than trying to do it as a response to some particular feature or capability from a competitor. We’re staying the course and making sure we execute well on our strategies. That’s not a response to a particular competitor doing this or that. It’s just working toward what we think are the best solutions in each of those spaces. That’s worked well for us.
In the past, when we did have competitors, we had the right focus on executing to the strategies. Eventually we were successful. We continue the same approach that’s worked for us.
VentureBeat: What are you optimistic about or looking forward to?
Singer: The more change there is, the more understanding of compute and hardware and software and so on is needed, the better the environment for Intel to differentiate. Something that I’ve seen in terms of looking from a perspective for many years in various architectures and various spaces in compute–in many cases in the past, the problem was well-understood. You knew what you needed to do for a graphics chip, for a CPU, for an imaging solution. The differentiation was how well you solved a well-understood problem.
In the AI space, the problem changes so quickly that if you work in 2019 on a good problem from 2017, you’re solving the wrong problem. Part of what we bring to the table is a connection and understanding of how the problem is changing, and therefore what problem needs to be solved two years out. Not only what’s the best solution for a well-defined problem.
Huma Abidi: I’m the director for software optimizations in the AI product group. My focus is to make sure that software optimizations get the best performance of the Xeon processor. My team works with the different framework owners, open source frameworks like TensorFlow for Google, MXNet for Amazon, PaddlePaddle for Baidu, and so on.
The whole point is that both our hardware and software portfolio are very broad. In hardware we go from data center to edge to device. Similarly, software supports all of that, based on what the user persona is. For our library developers we have something different, like building blocks. For our data scientist we have these frameworks and contribute to all of that. For application developers we have toolkits.
With these different software frameworks we have, to support the hardware we have several ways of doing that. One is what we call direct optimization, where we work directly with the framework owners. All the optimization work my team is doing, that people are doing at Intel, we just merge it into the main line. Our developers get all the benefit of the work we’re doing when they’re on CPU.
nGraph is a new framework-neutral graph compiler, which is a sort of abstraction layer between the many different frameworks and architectures that we have.
Singer: It’s like a middleware, a middle layer, from many to many. Many frameworks to multiple hardware solutions.
Abidi: In the past couple of years we have made great progress. We’ve dedicated ourselves to making the AI experience great at Intel. We’re seeing up to 200x performance gains in training, and inference is more like 250x. As a result, we now have our partners in retail or finance or education or security, every space, especially in health care. Novartis is a good example, where we worked with their engineers. They had an interesting challenge, where they had to analyze very large images in drug discovery, 26x more than the regular data sets we see. It turns out Xeon was the best solution for that, because of the large memory capacity. Working together with engineers, using our optimized TensorFlow, scaling it up to eight times, we were able to reduce training time from 11 hours to about 31 minutes.
We’re working with all these different segments. The results we’re seeing–Stanford has the DAWNBench competition. The best numbers that came in for inference were from Intel. This is a combination of the optimized framework and the Xeon processor. Together with that low cost and low latency, we’re making huge improvements.
VentureBeat: Should the young people studying this stuff go into hardware and chip design, or into software?
Singer: Both, absolutely! People should go where their heart takes them. There’s lots of room for software. We have a lot of investment in software. There’s tremendous innovation happening in the software space. But the hardware space is exciting as well.
VentureBeat: I guess they want to know where they can make the million-dollar salaries.
Singer: [laughs] AI today is a space where extremely talented people can have very high premiums. Whether they solve it at the hardware level, whether they solve it with new topology, whether they solve it with a new software compiler that makes hardware that much more efficient, doing the right thing with AI, being on the leading edge, being creative, that has a very high premium, because of the high value AI has for so many industries.
Abidi: AI has a very interdisciplinary thing going on. It’s not just computer science. There are people in statistics, people in medicine. I see more diversity in data science than I’ve ever seen in computer science. As the trends progress in AI I see more and more people, at least on the software side. I’ve seen big uptake on that.