Presented by Xilinx
Artificial intelligence (AI) is becoming pervasive in almost every industry and is already changing many aspects of our daily lives. AI has two distinct phases: training and inference. Today, most AI revenue comes from training: working to improve an AI model’s accuracy and efficiency. AI inference is the process of using a trained AI model to make a prediction. The AI inference industry is just getting started and is expected to soon surpass training revenues due to the “productization” of AI models, that is, moving from an AI model to a production-ready AI application.
Keeping up with increasing demand
We’re in the early stages of adopting AI inference, and there’s still lots of room for innovation and improvement. AI inference demands on hardware have skyrocketed, as modern AI models require orders of magnitude more compute than conventional algorithms. However, with the end of Moore’s Law, we cannot continue to rely on silicon evolution. Processor frequency hit a wall long ago, and simply adding more processor cores has also reached its ceiling. By Amdahl’s law, if 25% of your code is not parallelizable, the best speedup you can get is 4x regardless of how many cores you cram in. So, how can your hardware keep up with the ever-increasing demand of AI inference? The answer is Domain Specific Architecture (DSA). DSAs are the future of computing, where hardware is customized to run a specific workload.
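The 4x ceiling quoted above follows from Amdahl's law. The sketch below is only an illustration of that arithmetic; the function names are ours, and the core counts are arbitrary sample values.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Best-case speedup when only (1 - serial_fraction) of the work parallelizes."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_limit(serial_fraction: float) -> float:
    """Speedup ceiling as the number of cores grows without bound."""
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # 25% of the code is serial, as in the example in the text.
    for cores in (2, 8, 64, 1024):
        print(f"{cores:>5} cores -> {amdahl_speedup(0.25, cores):.2f}x")
    print(f"ceiling     -> {amdahl_limit(0.25):.2f}x")
```

However many cores are added, the speedup approaches but never exceeds the ceiling set by the serial portion.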
AI models are becoming heavier and more complex in dataflow, and today’s fixed-hardware CPUs, GPUs, ASSPs, and ASICs are struggling to keep up with the pace of innovation. CPUs are general purpose and can run any problem, but they lack computational efficiency. Fixed hardware accelerators like GPUs and ASICs are designed for “commodity” workloads whose algorithms are fairly stable. DSA is the new requirement, where adaptable hardware is customized for “each group of workloads” to run at the highest efficiency.
Customization to achieve high efficiency
Every AI network has three compute components that need to be adaptable and customized for the highest efficiency: custom data path, custom precision, and custom memory hierarchy. Most newly emerging AI chips have high-horsepower engines, but fail to pump data fast enough because they fall short in these three areas.
Let’s zoom into what DSA really means for AI inference. Every AI model you see will require a slightly, or sometimes drastically, different DSA. The first component is a custom data path. Every model has a different topology, where you need to pass data from layer to layer using broadcast, cascade, skip-through connections, etc. Synchronizing the processing of all layers so that data is always available when the next layer starts is a challenging task.
The second component is custom precision. Until a few years ago, 32-bit floating point (FP32) was the main precision used. However, with Google TPU leading the industry in reducing precision to 8-bit integer (INT8), the state of the art has shifted to even lower precision, like INT4, INT2, binary, and ternary. Recent research is now confirming that every network has a different sweet spot for combinations of mixed precision to be most efficient, such as 8 bit for the first 5 layers, 4 bit for the next 5 layers, and 1 bit for the last 2 layers.
The last component, and probably the most critical part that needs hardware adaptability, is custom memory hierarchy. Constantly pumping data into a powerful engine to keep it busy is everything, and you need a customized memory hierarchy, from internal memory to external DDR/HBM, to keep up with layer-to-layer memory transfer needs.
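To make the mixed-precision idea concrete, here is a minimal sketch. It assumes a simple symmetric, per-tensor quantizer (one common scheme, not necessarily what any particular chip uses) and a hypothetical 12-layer network with the same number of weights in every layer. Only the 8/4/1-bit layer split comes from the text above.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers of the given width using one shared scale."""
    if bits < 2:
        raise ValueError("1-bit weights are usually binarized by sign instead")
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bit, 7 for 4 bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def model_bits(layer_params, layer_bits):
    """Total weight storage in bits for a per-layer precision assignment."""
    return sum(p * b for p, b in zip(layer_params, layer_bits))


# Precision split from the text: 8 bit x 5 layers, 4 bit x 5 layers, 1 bit x 2 layers.
LAYER_BITS = [8] * 5 + [4] * 5 + [1] * 2
# Hypothetical: every layer holds 1,000 weights.
LAYER_PARAMS = [1000] * len(LAYER_BITS)

if __name__ == "__main__":
    mixed = model_bits(LAYER_PARAMS, LAYER_BITS)
    fp32 = model_bits(LAYER_PARAMS, [32] * len(LAYER_BITS))
    print(f"mixed precision: {mixed} bits, FP32: {fp32} bits, "
          f"ratio {fp32 / mixed:.2f}x")
    print(quantize_symmetric([1.0, -1.0, 0.0], bits=4))
```

Real networks have very different weight counts per layer, so the actual savings depend on where the low-precision layers sit.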
Above: Domain Specific Architecture (DSA): Every AI network has three components that need to be customized
Rise of AI productization
While every AI model requires a custom DSA to be most efficient, application use cases for AI are also growing rapidly. AI-based classification, object detection, segmentation, speech recognition, and recommendation engines are just some of the use cases that are already being productized, with many new applications emerging every day.
In addition, there is a second dimension to this complex growth. Within each application, more models are being invented to either improve accuracy or make the model lighter-weight. Xilinx FPGAs and adaptive computing devices can adapt to the latest AI networks, from the hardware architecture to the software layer, in a single node/device, while other vendors need to redesign new ASICs, CPUs, and GPUs, adding both significant cost and time-to-market challenges.
This level of innovation puts constant pressure on existing hardware, requiring chip vendors to innovate fast. Here are a few recent trends that are pushing the need for new DSAs.
Depthwise convolution is an emerging layer that requires large memory bandwidth and specialized internal memory caching to be efficient. Typical AI chips and GPUs have a fixed L1/L2/L3 cache architecture and limited internal memory bandwidth, resulting in very low efficiency. Researchers are also constantly inventing new custom layers, for which chips today simply do not have native support. As a result, these layers must run on host CPUs without acceleration, often becoming the performance bottleneck.
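One way to see why depthwise convolution leans so heavily on memory bandwidth is to count multiply-accumulates (MACs) per data element moved. The sketch below is a back-of-the-envelope model under stated assumptions: stride 1, same padding, each tensor read or written exactly once, and a hypothetical 56x56 layer with 128 channels and a 3x3 kernel.

```python
def conv_cost(height, width, c_in, c_out, kernel, depthwise=False):
    """Return (MACs, elements moved) for one stride-1, same-padded conv layer."""
    if depthwise:
        if c_in != c_out:
            raise ValueError("depthwise convolution keeps the channel count")
        # Each channel is filtered on its own: no sum across input channels.
        macs = height * width * c_in * kernel * kernel
        weights = kernel * kernel * c_in
    else:
        macs = height * width * c_in * c_out * kernel * kernel
        weights = kernel * kernel * c_in * c_out
    # Best case for data movement: input, output and weights each touched once.
    moved = height * width * c_in + height * width * c_out + weights
    return macs, moved


def macs_per_element(macs, moved):
    """Compute done per element moved; low values mean bandwidth-bound."""
    return macs / moved


# Hypothetical layer shape, chosen only for illustration.
SHAPE = dict(height=56, width=56, c_in=128, c_out=128, kernel=3)

if __name__ == "__main__":
    for name, flag in (("standard", False), ("depthwise", True)):
        macs, moved = conv_cost(**SHAPE, depthwise=flag)
        print(f"{name:>9}: {macs} MACs, {moved} elements, "
              f"{macs_per_element(macs, moved):.1f} MACs/element")
```

In this model the depthwise layer does far less arithmetic per element it touches, so the compute engine stalls unless the memory system can feed it fast enough.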
Sparse neural networks are another promising optimization, where networks are heavily pruned, sometimes by up to 99%, by trimming network edges, removing fine-grained matrix values in convolutions, etc. However, to run them efficiently in hardware, you need a specialized sparse architecture, plus an encoder and decoder for these operations, which most chips simply do not have.
Binary / Ternary are the extreme optimizations, turning all math operations into bit manipulations. Most AI chips and GPUs only have 8-bit, 16-bit, or floating-point calculation units, so you will not gain any performance or power efficiency by going to extremely low precisions.
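As a rough software analogy for the encoder, decoder, and sparse engine mentioned above, the sketch below prunes small weights by magnitude and stores the survivors in compressed sparse row (CSR) form. CSR and magnitude pruning are our choices for illustration; the text does not say which encoding or pruning method a given chip would use.

```python
def prune_by_magnitude(matrix, threshold):
    """Zero out every weight whose absolute value is below the threshold."""
    return [[v if abs(v) >= threshold else 0.0 for v in row] for row in matrix]


def encode_csr(matrix):
    """Encoder: keep only nonzero values plus their column and row positions."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_indices.append(col)
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def decode_csr(values, col_indices, row_ptr, n_cols):
    """Decoder: rebuild the dense matrix from its CSR form."""
    dense = []
    for r in range(len(row_ptr) - 1):
        row = [0.0] * n_cols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_indices[k]] = values[k]
        dense.append(row)
    return dense


def csr_matvec(values, col_indices, row_ptr, x):
    """Sparse engine: multiply by a vector, skipping every pruned weight."""
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_indices[k]]
        out.append(acc)
    return out


if __name__ == "__main__":
    weights = [[0.9, 0.01, 0.0], [-0.02, 0.0, -0.7], [0.03, 0.5, 0.04]]
    pruned = prune_by_magnitude(weights, threshold=0.1)
    csr = encode_csr(pruned)
    print(csr)
    print(csr_matvec(*csr, [1.0, 2.0, 3.0]))
```

The multiply loop only visits stored nonzeros, which is where the savings come from, but the irregular indexing is exactly what fixed dense hardware handles poorly.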
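To illustrate how binary networks reduce math to bit manipulation, here is a minimal sketch of a dot product between two vectors of +1/-1 values using XOR and a bit count. The bit encoding (1 for +1, 0 for -1) and the function names are assumptions made for this example; ternary networks need an extra mask for zeros and are not shown.

```python
def pack_signs(signs):
    """Pack +1/-1 values into one integer: bit i is set when signs[i] is +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
        elif s != -1:
            raise ValueError("binary weights must be +1 or -1")
    return word


def binary_dot(a_bits, b_bits, length):
    """Dot product of two packed +1/-1 vectors with no multiplications."""
    # XOR marks positions where the signs differ; each contributes -1,
    # every other position contributes +1.
    mismatches = bin(a_bits ^ b_bits).count("1")
    return length - 2 * mismatches


if __name__ == "__main__":
    a = [1, -1, 1, 1]
    b = [1, 1, -1, 1]
    reference = sum(x * y for x, y in zip(a, b))
    fast = binary_dot(pack_signs(a), pack_signs(b), len(a))
    print(reference, fast)
```

Hardware with only 8-bit or wider arithmetic units still spends a full multiplier on each of these one-bit products, which is why the low precision brings no gain there.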
The MLPerf Inference v0.5 results published at the end of 2019 demonstrated all these challenges. Looking at Nvidia’s flagship T4 results, it achieves as low as 13% efficiency. This means that while Nvidia claims 130 TOPS of peak performance on T4 cards, real-life AI models like SSD w/ MobileNet-v1 can utilize only 16.9 TOPS of the hardware. Therefore, vendor TOPS numbers used for chip promotion are not meaningful metrics.
Above: MLPerf inference v0.5 results
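The 13% figure is simply achieved throughput divided by claimed peak. A small sketch of that calculation, using only the two T4 numbers quoted above (the constant and function names are ours):

```python
def utilization(achieved_tops: float, peak_tops: float) -> float:
    """Fraction of the advertised peak that a workload actually uses."""
    if peak_tops <= 0:
        raise ValueError("peak throughput must be positive")
    return achieved_tops / peak_tops


T4_PEAK_TOPS = 130.0           # vendor peak claim quoted in the text
T4_SSD_MOBILENET_TOPS = 16.9   # usable TOPS for SSD w/ MobileNet-v1, per the text

if __name__ == "__main__":
    share = utilization(T4_SSD_MOBILENET_TOPS, T4_PEAK_TOPS)
    print(f"utilization: {share:.0%}")
```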
Adaptive computing solves “AI productization” challenges
Xilinx FPGAs and adaptive computing devices have up to 8x the internal memory of state-of-the-art GPUs, and the memory hierarchy is completely customizable by users. This is critical for achieving high “usable” TOPS in modern networks such as those using depthwise convolution. The user-programmable FPGA logic allows a custom layer to be implemented in the most efficient way, removing it as a system bottleneck. For sparse neural networks, Xilinx has long been deployed in many sparse-matrix-based signal processing applications, such as those in the communications domain. Users can design specialized encoders, decoders, and sparse matrix engines in FPGA fabric. And lastly, for Binary / Ternary, Xilinx FPGAs use Look-Up Tables (LUTs) to implement bit-level manipulation, resulting in close to 1 PetaOps (1000 TOPS) when using binary instead of Integer 8. With all these hardware adaptability features, it is possible to reach close to 100% of the hardware’s peak capabilities in all modern AI inference workloads.
Xilinx is proud to solve one more challenge: making our devices accessible to those with software development expertise. Xilinx has created a new unified software platform, Vitis™, which unifies AI and software development, letting developers accelerate their applications using C++/Python, AI frameworks, and libraries.
Above: Vitis unified software platform.
For more information about Vitis AI, please visit us here.
Nick Ni is Director of Product Marketing, AI, Software and Ecosystem at Xilinx. Lindsey Brown is Product Marketing Specialist, Software and AI at Xilinx.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. Content produced by our editorial team is never influenced by advertisers or sponsors in any way. For more information, contact email@example.com.