Presented by Intel


Generative artificial intelligence (AI) refers to AI systems that can produce novel outputs, including text, images and computer programs, when provided with a text prompt. It unlocks new forms of creativity and expression by using deep learning techniques such as diffusion models and Generative Pre-trained Transformers (GPTs). Examples of generative AI applications include OpenAI’s ChatGPT and DALL-E, Google’s Bard, Stable Diffusion, GPT-J, the BLOOM language model and many others.

Unlocking the availability of and access to generative AI technologies has great societal value. While discussions abound on whether generative AI models will replace or complement human ingenuity for professionals ranging from programmers to artists, we certainly do not want such critical technologies to become the domain of a few. This is particularly relevant as AI models continue to explode in size, in their appetite for data and in the compute they require.

Ubiquitous hardware and open software are the keys to democratizing AI. Large AI models have typically required specialized hardware acceleration through GPUs and purpose-built AI processors. However, the types of matrix multiplication acceleration engines that drive higher AI performance in these processors have now also been integrated into CPUs, the processors ubiquitous across almost every computing medium, to help them achieve GPU-level AI performance. Complementing the hardware is open software that can deliver a unified experience across architectures. Leveraging industry-standard, open-source libraries and frameworks provides several advantages for software development, particularly for AI, given the rapid pace of technological advancement and the heightened concerns around privacy and ethics. An AI system can be thought of as an “AI brain,” with the hardware and software playing the roles of the biological neurons and the mind, respectively.

In this article, we want to delve deeper into this concept of the “AI brain” and how its potential can be maximized. We also want to demonstrate the value of open software and ubiquitous hardware by showcasing open-access versions of diffusion (Stable Diffusion) and GPT (GPT-J, BLOOM) models running on Intel AI software and hardware. 4th Gen Intel® Xeon® Scalable Processors with natively built-in AI hardware acceleration coupled with the AI software acceleration that Intel has contributed to industry-standard frameworks such as TensorFlow and PyTorch provide an excellent platform for inference and fine-tuning of large generative AI models. While the specific examples in this blog are from Generative AI, the ideas of open tools, universal platforms and the role of AI software and hardware optimizations can be more generally applied.

Open generative AI: GPT-J and Stable Diffusion (with Demo!)

Generative AI models attempt to capture the distribution or characteristics of a particular dataset and can be used to generate new data samples, for example a sentence or an image. Open-access generative AI models such as GPT-J-6B and Stable Diffusion are deep learning models based on neural networks. A common building block in both of these models is the transformer, which has been shown to be very effective for applications from machine translation to document generation. Transformer models, first developed by Google researchers, are now the dominant model type for language and sequence tasks and are rapidly expanding to recommendation systems and vision tasks. A transformer model tracks relationships across data inputs to extract context using a mechanism known as attention. A Generative Pre-trained Transformer (GPT) is trained on a large corpus of unlabeled data without supervision, overcoming the scarcity of labeled data for machine learning training.
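
To make the attention mechanism concrete, here is a minimal, illustrative sketch of scaled dot-product attention in PyTorch. Real transformers add learned projections, multiple attention heads and masking; the shapes and names below are ours for illustration, not taken from any particular model.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # query, key, value: (batch, sequence_length, embedding_dim)
    d = query.size(-1)
    # Similarity of every token with every other token, scaled for numerical stability.
    scores = query @ key.transpose(-2, -1) / d ** 0.5
    # Softmax turns the scores into attention weights that sum to 1 per token.
    weights = F.softmax(scores, dim=-1)
    # Each output token is a weighted mix of all value vectors (its "context").
    return weights @ value

x = torch.randn(1, 8, 64)                      # a batch of 8 token embeddings
out = scaled_dot_product_attention(x, x, x)    # self-attention over the sequence
print(out.shape)                               # torch.Size([1, 8, 64])
```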

GPT-J-6B is an autoregressive deep learning language model and an open alternative to GPT-3 (ChatGPT is a variant of the GPT-3 model specifically designed for chatbot applications). It takes a text prompt and generates text automatically. When we input a prompt “artificial intelligence everywhere” to GPT-J-6B, it generates interesting output such as the following:

“Artificial intelligence everywhere in our life, from the devices we use every day to the devices that help us do our jobs — and the jobs themselves”
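
As a rough sketch of how such a prompt can be run, the snippet below loads the open GPT-J-6B checkpoint through the Hugging Face Transformers library and samples a continuation on a CPU. The sampling settings are illustrative assumptions, not the exact configuration behind the quote above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 roughly halves the memory footprint of the 6B-parameter model and
# maps onto AMX BF16 acceleration on 4th Gen Xeon processors.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("artificial intelligence everywhere", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```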

Stable Diffusion, on the other hand, is a state-of-the-art text-to-image deep learning diffusion model and an open-access alternative to DALL-E. It takes a text prompt and generates photorealistic images automatically. When we input a prompt “artificial intelligence everywhere”, Stable Diffusion generated the following image:

Here is a link to an experimental Stable Diffusion demo that runs on an Intel CPU (4th Gen Intel® Xeon® Scalable Processors) for you to try out your own prompts:
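
For readers who would rather run the open-access model directly, here is a minimal sketch of text-to-image generation with the Hugging Face Diffusers library on a CPU; the checkpoint name and step count are illustrative assumptions.

```python
from diffusers import StableDiffusionPipeline

# Load an open Stable Diffusion checkpoint and run the full pipeline on CPU.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cpu")

image = pipe("artificial intelligence everywhere", num_inference_steps=25).images[0]
image.save("ai_everywhere.png")
```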

In addition to generating images from text prompts, it is also possible to personalize the model by adding your own data. For example, you can fine-tune the model by using additional image and text pairs. You can also try creating your own personalized Stable Diffusion with few-shot tuning using code available in Intel® Neural Compressor and Hugging Face* Diffusers.
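
As one illustration of what personalization can look like at inference time, the sketch below assumes you have already produced a textual-inversion embedding from a few of your own images and loads it into the pipeline. The file name, the placeholder token and the load_textual_inversion helper (available in recent Diffusers releases) are assumptions for this example, not a prescribed workflow.

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cpu")

# "my_concept.bin" is a placeholder for an embedding produced by few-shot
# textual-inversion fine-tuning on your own images; "<my-concept>" is the
# token the model learns to associate with that concept.
pipe.load_textual_inversion("my_concept.bin", token="<my-concept>")

image = pipe("a photo of <my-concept> in a futuristic city", num_inference_steps=25).images[0]
image.save("personalized.png")
```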

The “AI brain”

To better understand AI and how we can unlock it everywhere and for everyone, it helps to think of AI systems as analogous to a brain. While we know relatively little beyond the basic structure and functions of neurons and the nervous system, far too little to explain the brilliance of the Einsteins and Turings of this world, we know significantly more about the hardware and software that compose the “AI brain”.

While the human brain consists of roughly 86 billion neurons, the “AI brain” is a computer built from transistors; a 4th Gen Intel® Xeon® Scalable processor, for example, has more than 10 billion of them. These transistors form execution units that perform arithmetic operations such as addition, subtraction and multiplication. Just like a human brain, the CPU is designed for general-purpose computation: it can run operating systems and any application, from surfing the web to building spreadsheets. While there are other processors, such as GPUs and ASICs, that are highly optimized to run certain deep learning applications, CPUs, because of their adaptability, flexibility and general-purpose capabilities, form the basic building block of almost every computing system. What if we could also make this basic building block highly optimized to run AI, a workload that is becoming increasingly pervasive? Complementing the processors is software AI acceleration that can deliver orders-of-magnitude performance gains, for the same hardware setup, across end-to-end AI pipelines spanning deep learning, classical machine learning and graph analytics.

Inside the AI brain: Unlocking AI with ubiquitous hardware

Deep learning models are computationally intensive, with most of the computation being matrix multiplication. GPUs and specialized AI processors contain matrix multiplication units (for example, Tensor Cores in Nvidia GPUs and MXU matrix units in Google TPUs) that speed up matrix multiplication and thereby accelerate deep learning. The latest generation of CPUs now has the same type of deep learning acceleration built in.

4th Gen Intel® Xeon® Scalable processors deliver excellent performance across all workloads, with the built-in performance of an AI accelerator for end-to-end (E2E) AI pipelines. The data ingestion and classical machine learning stages of the pipeline, which are critical for many AI applications, have historically performed better on the CPU because of its architectural advantages for general compute functions and irregular or sparse data. To address deep learning, Intel has integrated the Intel® Advanced Matrix Extensions (Intel® AMX) BF16 and INT8 matrix multiplication engine into every core. AMX’s built-in acceleration for matrix multiplication operations is similar in functionality to GPU Tensor Cores and TPU MXUs, and it makes the latest Xeon CPUs comparable to GPUs and other specialized AI accelerators for deep learning workloads. This, combined with their other features and functionality and their effortless integration with GPUs and ASICs where available, makes Intel® Xeon® processors an ideal foundation for any universal AI platform.

Intel® AMX stores bigger chunks of data in each core and computes on larger matrices in a single operation, thereby accelerating computation. It has two primary components: tiles and the tile matrix multiply unit (TMUL). The tiles store data in eight two-dimensional registers, each one kilobyte in size. TMUL is an accelerator engine attached to the tiles that multiplies them with single-operation matrix multiplication instructions. Conceptually, GEMM (general matrix multiplication), C = C + A * B, is implemented by breaking the matrices into tiles and multiplying and accumulating tiles with TMUL in the tile registers; for example, a result tile T5 is the accumulated result of T1*T3 + T2*T4.
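
Since AMX tiles and TMUL instructions are programmed by compilers and libraries such as oneDNN rather than written by hand in Python, the sketch below only illustrates the tiling idea in NumPy: each output tile of C is accumulated from products of the corresponding tiles of A and B. It is a conceptual analogy, not AMX code.

```python
import numpy as np

def tiled_gemm(A, B, C, tile=16):
    # C = C + A @ B, computed tile by tile (the structure AMX accelerates in hardware).
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and C.shape == (M, N)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # One output tile is accumulated from a row of A-tiles and a column of
            # B-tiles, e.g. C_tile += A_tile1 @ B_tile1 + A_tile2 @ B_tile2 + ...
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
C = np.zeros((64, 64), dtype=np.float32)
assert np.allclose(tiled_gemm(A, B, C.copy()), A @ B, atol=1e-4)
```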

Inside the AI brain: Unlocking AI with open (accelerated) software

The rapid pace of AI advancement, together with requirements that go beyond model accuracy to include our ability to explain, audit and generalize results, lends itself well to open software development. Open, industry-standard frameworks also have a vast reach, with millions of developers using them. Adding Intel software optimizations, along with hardware acceleration such as AMX, into the default versions of frameworks such as PyTorch and TensorFlow seamlessly benefits all of the developers who use these frameworks. Software AI acceleration in particular is critical to infusing and scaling AI into applications across entertainment, telecommunications, automotive, healthcare and more, helping to drive 10-100X or greater performance gains on the same hardware setup.

A data scientist typically implements machine learning in Python, using frameworks such as PyTorch or TensorFlow. The oneAPI Deep Neural Network Library (oneDNN), one of the constituent libraries of the oneAPI unified programming model, is an open-source, cross-platform performance library of basic deep learning building blocks intended for developers of deep learning applications and frameworks. The optimizations enabled by oneDNN, together with Intel® Advanced Matrix Extensions, accelerate key performance-intensive operations such as convolution, matrix multiplication and batch normalization. Collaboration between Intel and the framework developers allows these optimizations to be enabled by default in the stock versions of the frameworks, so their millions of developers can easily take advantage of Intel’s software and hardware acceleration. This includes users of GPT-J and Stable Diffusion, both open-access models built on PyTorch, as well as the many generative and other AI models that leverage dozens of other industry-standard open frameworks and libraries.
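
A quick way to see this in practice is to confirm that a stock PyTorch build ships with the oneDNN backend; the small sketch below assumes nothing beyond a standard PyTorch installation, and the exact output varies by version and build.

```python
import torch

# oneDNN appears in PyTorch under its former name, MKL-DNN.
print(torch.backends.mkldnn.is_available())  # True when oneDNN is compiled in
print(torch.__config__.show())               # build details, including the oneDNN version

# Tip: launching a workload with the environment variable ONEDNN_VERBOSE=1 makes
# oneDNN log each kernel it dispatches, including the instruction set used
# (for example, avx512_core_amx on CPUs with Intel AMX).
```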

AI performance on CPU compared to GPU

CPUs, and specifically those with built-in AI acceleration, deliver a similar level of performance to GPUs for a wide variety of AI workloads. PyTorch and TensorFlow code using the built-in Intel® AMX acceleration in 4th Gen Intel® Xeon® Scalable processors (codenamed Sapphire Rapids, or SPR) delivers up to 10x [1] higher inference and training performance and up to 6.7x [2] acceleration of E2E workloads compared to the previous generation. Compared to Nvidia A10 GPUs, SPR has 1.8x [3] higher average* AI BFloat16/FP16 inference performance. Compared to Nvidia A100 GPUs, training a Hugging Face* BERT-Large language model using PyTorch on a single SPR node completes in 20 minutes versus 7 minutes [4], which is longer but comparable. Additional performance and benchmark data for 4th Gen Intel® Xeon® Scalable processors can be found here.

While training Stable Diffusion from scratch was expensive, taking 150K GPU hours on 256 Nvidia A100 cards, fine-tuning the trained model and running inference can be done in a matter of minutes and seconds, respectively, on a 4th Gen Intel® Xeon® Scalable processor. Stable Diffusion personalization, a technique that adds new concepts for the model to recognize while maintaining the pretrained model’s text-to-image capabilities, can be achieved through few-shot fine-tuning with just one to a few images. Using the textual inversion technique on a cluster of four SPR nodes with AMX, fine-tuning the model speeds up from 1.5 hours (on a single-node 3rd Gen processor) to less than 5 minutes [5]. For inference, one can use the Intel® Extension for PyTorch to enable BF16 with auto-mixed precision across the entire inference pipeline, improving inference time significantly from 40 seconds to 5 seconds [5].
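
A rough sketch of what that BF16 auto-mixed-precision inference looks like with the Intel® Extension for PyTorch (IPEX) is shown below. The model ID, step count and the choice to optimize only the UNet are illustrative assumptions rather than Intel’s exact recipe.

```python
import torch
import intel_extension_for_pytorch as ipex
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cpu")

# Let IPEX apply CPU-side optimizations to the compute-heavy UNet in BF16.
pipe.unet = ipex.optimize(pipe.unet.eval(), dtype=torch.bfloat16)

# Run the whole pipeline under CPU auto-mixed precision so matrix multiplications
# and convolutions execute in BF16 (accelerated by Intel AMX where available).
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    image = pipe("artificial intelligence everywhere", num_inference_steps=25).images[0]
image.save("ai_everywhere_bf16.png")
```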

AI everywhere

Most businesses have a variety of workloads that already run on CPU servers: engineering development, databases, web services and millions of other applications. The data engineering and classical machine learning stages of the end-to-end AI pipeline already run best on the CPU. With Intel® AMX in 4th Gen Intel® Xeon® Scalable processors, the deep learning performance of CPUs is now also comparable to that of GPUs. The ubiquity, flexibility and unmatched general and AI performance of the latest Xeon® processors, combined with open accelerated software, now allow businesses and individuals to leverage the power of AI, including generative AI, in all of their applications with just a CPU, or in combination with other processors when those too become more ubiquitous.

Learn more: For those interested in getting hands-on with transformer-based machine learning models, such as those used for generative AI, check out Intel® Extension for Transformers. To learn more about Intel’s performant and productive portfolio of E2E AI software tools and access developer resources, please visit the Intel AI software website.


Wei Li is VP & GM, AI and Analytics at Intel. Chandan Damannagari is Director, AI Software at Intel. Haihao Shen is Lead AI Software Architect at Intel. Ke Ding is Principal Engineer and Engineering Director for Applied ML at Intel.


Notices and Disclaimers: Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Product and Performance Information: 1,2,3,4: PyTorch Model Performance Configurations: 8480+: 1-node, pre-production platform with 2x Intel Xeon Platinum 8480+ on Archer City with 1024 GB (16 slots/ 64GB/ DDR5-4800) total memory, ucode 0x2b0000a1, HT on, Turbo on, CentOS Stream 8, 5.15.0, 1x INTEL SSDSC2KW256G8 (PT)/Samsung SSD 860 EVO 1TB (TF); 8380: 1-node, 2x Intel Xeon Platinum 8380 on M50CYP2SBSTD with 1024 GB (16 slots/ 64GB/ DDR4-3200) total memory, ucode 0xd000375, HT on, Turbo on, Ubuntu 22.04 LTS, 5.15.0-27-generic, 1x INTEL SSDSC2KG960G8; Framework: https://github.com/intelinnersource/ frameworks. ai.pytorch.privatecpu/tree/d7607bdd983093396a70713344828a989b766a66;Modelzoo: https://github.com/IntelAI/models/tree/spr-launch-public, PT:1.13, IPEX: 1.13, OneDNN: v2.7; test by Intel on 10/24/2022. PT: NLP BERT-Large: Inf: SQuAD1.1 (seq len=384), bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,56, amx bf16=1,16, amx int8=1,56, Trg: Wikipedia 2020/01/01 ( seq len =512), bs:fp32=28, amx bf16=56 [1 instance, 1socket] PT: DLRM: Inference: bs=n [1socket/instance], bs: fp32=128, amx bf16=128, amx int8=128, Training bs:fp32/amx bf16=32k [1 instance, 1socket], Criteo Terabyte Dataset PT: ResNet34: SSD-ResNet34, Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,112, amx bf16=1,112, amx int8=1,112, Training bs:fp32/amx bf16=224 [1 instance, 1socket], Coco 2017 PT: ResNet50: ResNet50 v1.5, Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,64, amx bf16=1,64, amx int8=1,116, Training bs: fp32,amx bf16=128 [1 instance, 1socket], ImageNet (224 x224) PT: RNN-T Resnext101 32x16d, Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,64,amxbf16=1,64,amxint8=1,116,ImageNet PT: ResNext101: Resnext101 32x16d, bs=n [1socket/instance], Inference: bs: fp32=1,64, amx bf16=1,64, amx int8=1,116 PT: MaskRCNN: Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,112, amx bf16=1,112, Training bs:fp32/amx bf16=112 [1 instance, 1socket], Coco 2017 Inference: Resnet50 v1.5: ImageNet (224 x224), SSD Resnet34: coco 2017 (1200 x1200), BERT Large: SQuAD1.1 (seq len=384), Resnext101: ImageNet, Mask RCNN: COCO 2017, DLRM: Criteo Terabyte Dataset, RNNT: LibriSpeech. Training: Resnet50 v1.5: ImageNet (224 x224), SSD Resnet34: COCO 2017, BERT Large: Wikipedia 2020/01/01 ( seq len =512), DLRM: Criteo Terabyte Dataset, RNNT: LibriSpeech, Mask RCNN: COCO 2017.480: 1-node, pre-production platform with 2x Intel® Xeon® Platinum 8480+ on Archer City with 1024 GB 3 6.7X Inference Speed-up: 8480 (1-node, pre-production platform with 2x Intel® Xeon® Platinum 8480+ on Archer City with 1024 GB (16 slots/ 64GB/ DDR5-4800) total memory, ucode 0x2b000041) , 8380 (1-node, 2x Intel® Xeon® Platinum 8380 on Whitley with 1024 GB (16 slots/ 64GB/ DDR4-3200) total memory, ucode 0xd000375) HT off, Turbo on, Ubuntu 22.04.1 LTS, 5.15.0-48-generic, 1x INTEL SSDSC2KG01, BERT-large-uncased (1.3GB : 340 Million Param) https://huggingface.co/bert-large-uncased, IMDB (25K for fine-tuning and 25K for inference): 512 Seq Length – https://analyticsmarketplace.intel.com/find-data/metadata?id=DSI-1764; SST-2 (67K for fine-tuning and 872 for inference): 56 Seq Length – , FP32, BF16,INT8 , 28/20 instances, https://pytorch.org/, PyTorch 1.12, IPEX 1.12, Transformers 4.21.1, MKL 2022.1.0, test by Intel on 10/21/2022. 
5: NVIDIA Configuration details: 1x NVIDIA A10: 1-node with 2x AMD Epyc 7763 with 1024 GB (16 slots/ 64GB/ DDR4-3200) total memory, HT on, Turbo on, Ubuntu 20.04,Linux 5.4 kernel, 1x 1.4TB NVMe SSD, 1x 1.5TB NVMe SSD; Framework: TensorRT 8.4.3; Pytorch 1.12, test by Intel on 10/24/2022. 6: 4th Gen Intel Xeon Scalable Processor finetuning, 1 to 4 nodes, 2S, Intel(R) Xeon(R) Platinum 8480+ 56 cores on Dennard Pass platform and software with 512GB memory (16x32GB DDR5 4800 MT/s [4800 MT/s]), microcode 0x90000c0, HT on, Turbo on, Rocky Linux 8.7, 4.18.0-372.32.1.el8_6.crt2.x86_64, 931.5G SSD. Multiple nodes connected with 200Gbps OmniPath. PyTorch 1.13, IPEX 1.13, Transformers 4.24.0, Accelerate 0.14, Diffusers 0.8.0, oneDNN 2.6.0, OneCCL 2021.7.1; 4th Gen Intel Xeon Scalable Processor Inference. 1 node, 2S, Intel(R) Xeon(R) Platinum 8480+ 56 cores on ArcherCity platform and software with 1024GB (16x64GB DDR5 4800 MT/s [4800 MT/s]), microcode 0x2b000111, HT on, Turbo on, Ubuntu 22.04.1 LTS, 5.15.0-56-generic, 1.5 TB SSD. Multiple nodes connected with 100Gbps Ethernet Controller I225-LM. PyTorch (commit ID: 26d1dbc) + PR 81852, Transformers 4.25.1, Accelerate 0.14, Diffusers 0.8.0, oneDNN 2.6.0; 3rd Gen Intel Xeon Scalable Processor Inference. 1 node, 2S, Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz 40 cores on WHITLEY platform and software with 512GB (16x32GB DDR4 3200 MT/s [3200 MT/s]), microcode 0xd000375, HT on, Turbo on, Ubuntu, 5.15.0-56-generic, 7.0 TB SSD. Multiple nodes connected with 10Gbps Ethernet Controller X710 for 10GBASE-T. PyTorch (commit ID: 26d1dbc), Transformers 4.25.1, Accelerate 0.14, Diffusers 0.8.0, oneDNN 2.6.0: Test by Intel on 12/09/2022.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.