Presented by Intel

Every day around the world, companies leverage artificial intelligence to accelerate scientific discovery and transform consumer and business services. Regrettably, AI adoption is uneven. McKinsey’s ‘The State of AI in 2022’ report documents that adoption of AI by organizations has stalled at 50%, while AI leaders are pulling ahead of the pack. One reason is that 53% of AI projects fail to get to production. Because the benefits of AI to everyone are too great, and the issues with AI being in the hands of only a few are too concerning, to ignore, it is an opportune time to survey the challenges of going from concept to deployment.

Despite the attention placed on deep learning model performance, AI is not a benchmark. For data scientists, AI starts as an end-to-end (E2E) pipeline that includes data engineering, experimentation and live streaming data, and that uses both classical machine learning (ML) and deep learning (DL) techniques. To run any AI code optimally, the E2E pipeline requires a platform that balances memory, throughput, dense matrix compute and general-purpose compute.

For the line of business, AI is a capability that enhances an application and complies with service-level agreements (SLAs) covering throughput, latency and platform flexibility. Projects have difficulty transitioning from experimentation to production because work is handed off waterfall-style from the data team to the model development team, and then to the team that operationalizes the model from the data center to the factory floor. These stages are often done on different platforms, causing rework. A collaborative method pushes the production SLA requirements upstream and includes a single E2E platform architecture from data center to the edge.

What is a universal AI platform?

A universal AI platform has the flexibility to run every AI code, scope to empower every developer, and scale to enable AI everywhere.  Intel’s vision is to accelerate AI infusion into every application by delivering end-to-end application performance, as opposed to select DL or ML kernel performance.  To scale AI, the full stack must be optimized from chips to software libraries to the applications. 

The three components of a universal AI platform are:

  • General Purpose and AI-Specific Compute: 4th Gen Intel® Xeon® Scalable processors to run any AI code and every workload by combining the flexibility of a general-purpose CPU and the performance of a deep learning accelerator. The processors can also integrate effortlessly with other processors and specialized accelerators including GPUs and ASICs.
  • Open, Standards-based Software: An AI software suite of open-source frameworks and AI model and E2E optimization tools for developers to build and deploy AI everywhere.
  • Ecosystem engagement: Pre-built solutions with Intel partners to address end customers’ business needs and to accelerate time to market.

Run any AI code and every workload

AI on CPUs has the advantages of ubiquity, flexibility and programming model familiarity. 4th Gen Intel® Xeon® Scalable processors are balanced for optimal performance across workloads with the built-in performance of an AI accelerator. The E2E pipeline includes general compute functions for ingestion and classical machine learning, which, given their irregular and sparse nature, are more complex than the small-matrix dense algebra of deep learning. The data and machine learning stages account for most of the compute cycles for many AI applications and already run well on Intel® Xeon® processors. To address deep learning, Intel has integrated Intel® Advanced Matrix Extensions (Intel® AMX), a BF16 and INT8 matrix multiplication engine, into every core. Intel AMX builds upon the vector extensions in previous generations of Xeon® processors and delivers up to 10x [1] higher gen-to-gen inference and training model performance with no code changes using any framework.
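To make the BF16 format mentioned above concrete: bfloat16 keeps the sign bit and 8 exponent bits of a 32-bit float and shortens the mantissa to 7 bits. The following is a minimal sketch that emulates that rounding in plain Python. The function names are illustrative assumptions, and the code does not use or represent Intel AMX instructions; it only shows the precision trade-off of the number format.

```python
import struct


def float32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (round to nearest, ties to even).

    NaN and infinity handling is omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties-to-even bias
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Expand a bfloat16 pattern to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def to_bf16(x: float) -> float:
    """Round x to the nearest value representable in bfloat16."""
    return bf16_bits_to_float(float32_to_bf16_bits(x))


def bf16_dot(a, b):
    """Dot product with BF16-rounded inputs and a wider (Python float) accumulator."""
    return sum(to_bf16(x) * to_bf16(y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(to_bf16(3.14159265))               # 3.140625: only 7 mantissa bits survive
    print(bf16_dot([1.0, 2.0], [3.0, 4.0]))  # 11.0: small integers are exact in BF16
```

The example keeps the accumulator wider than the inputs, which is a common way to limit the error introduced by low-precision operands.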

Additionally, advances in compute, memory and bandwidth, along with Intel’s recent software optimizations, deliver 1.9x [6] higher classical machine learning performance compared to AMD.
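The configuration notes at the end of this article describe the classical machine learning comparison as a geometric mean (geomean) across individual scikit-learn benchmarks. As a rough sketch of how such a summary number is formed, the snippet below computes a geomean over per-benchmark speedups. The speedup values are made-up placeholders chosen for easy arithmetic, not Intel's measurements.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed in log space for stability."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark speedups (placeholders, not measured results).
example_speedups = {"kmeans-fit": 2.0, "ridge_regr-infer": 1.0, "SVC-fit": 4.0}
summary = geomean(example_speedups.values())

if __name__ == "__main__":
    print(f"geomean speedup: {summary:.2f}x")
```

A geomean is typically preferred over an arithmetic mean for ratios because a single outlier benchmark has less influence on the summary.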

The 4th Gen Intel® Xeon® Scalable Processor with its ability to accelerate the End-to-End AI pipeline is the ideal foundation of a universal AI platform.

Open, optimized software to build and deploy AI Everywhere

Intel’s largest AI software suite to date is rolling out simultaneously with the 4th Gen Intel® Xeon® Scalable processors. Intel’s mantra is to enable developers to use any tools and focus on their model. This is done by automating tasks and accelerating productivity while supporting industry-standard open-source libraries. The software suite components include:

  • Support for a full set of popular native operating systems and hypervisors
  • oneAPI unified programming model of performant libraries for heterogeneous architectures including Intel® Xeon® processors, Intel® Xe GPUs and other accelerators
  • Any AI framework, with order-of-magnitude performance optimizations. Optimizations for TensorFlow*, PyTorch*, Scikit-learn*, XGBoost*, Modin* and others are up-streamed into the latest stock AI frameworks, and Intel extensions deliver the newest optimizations. Developers can build upon their existing code base and use their favorite framework with no or minimal code changes for 10-100x performance gains. More than 400 deep learning and machine learning models were validated with these industry-standard libraries and frameworks
  • Tools that automate AI tasks – AI tools for data preparation, training, inference, deployment, and scaling boost developer productivity. They include Intel® AI Analytics Toolkit for end-to-end data science workflows, BigDL for scaling AI models on legacy big data clusters, Intel® Neural Compressor for building inference-optimized models from those trained with DL frameworks, and OpenVINO for deploying high-performance pre-trained deep learning inference models
  • Reference implementations and samples for specific use cases – Resources for data scientists include containers with pre-trained models and AI pipelines, reference toolkits, MLOps blueprints and an Intel® Developer Cloud sandbox to test your workloads for free

Pre-built solutions to accelerate customers’ time to market

The ultimate value of this E2E hardware and software performance and productivity is demonstrated by Intel’s partners who offer AI-ready solutions built on the AI platform. They chose Intel’s single AI platform because it enables them to run both the entire E2E AI pipeline and general-purpose workloads, and to comply with their SLAs, using one development environment from workstations to servers. The Intel® Solutions Marketplace offers components, systems, services and solutions by hundreds of Intel’s partners that feature Intel® technologies across dozens of industries and use cases.

Putting it all together: Hardware, software and ecosystem considerations for an AI Everywhere future

With the shift of processor portfolios from being hardware-led to being software-defined and silicon-enhanced, more of the innovation is pulled through to enable developers. The increased performance and developer productivity that Intel delivers through hardware, open software and ecosystem engagement lead to earlier market readiness and accelerated product life cycles. This is especially critical in a fast-evolving field such as AI.

Let us see how this plays out in real-world transfer learning and inference use cases, and how the performance compares to competitor hardware options.

The built-in AI acceleration engine in Xeon processors (Intel® AMX), in concert with the software optimizations referenced above, accelerates deep learning models up to 10x [1] and has demonstrated acceleration of E2E workloads up to 6.7x [3]. In this example, Intel used PyTorch, a trained Hugging Face* BERT-Large language model and Intel® SigOpt, a hyperparameter automation tool, to fine-tune on a single Xeon SP node in 20 minutes, compared to 7 minutes with the NVIDIA A100 GPU. Fine-tuning is a technique in which the final layers of a model are customized using the customer’s data. Using industry frameworks, thousands of pretrained models and ecosystem tools, Intel enables the most cost-effective and ubiquitous way to train via fine-tuning and transfer learning on Intel® Xeon® processors.
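The idea of customizing only the final layers can be illustrated without any deep learning framework. In the toy sketch below, a fixed feature function stands in for the pretrained body of a model, and only a small final layer (the "head") is trained on new data. Everything here is a hypothetical illustration: the feature function, data, learning rate and epoch count are assumptions, and this is not the BERT-Large workflow described above.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def frozen_backbone(x):
    """Stand-in for the pretrained layers; nothing here is ever updated."""
    return [x, x * x, 1.0]


def predict(head, x):
    """Probability of class 1 from the trainable head applied to frozen features."""
    return sigmoid(sum(w * f for w, f in zip(head, frozen_backbone(x))))


def log_loss(head, data):
    """Average binary cross-entropy over (x, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(head, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def fine_tune_head(data, epochs=500, lr=0.1, seed=0):
    """Train only the head weights with plain stochastic gradient descent."""
    rng = random.Random(seed)
    head = [rng.uniform(-0.1, 0.1) for _ in range(3)]
    initial = list(head)
    for _ in range(epochs):
        for x, y in data:
            feats = frozen_backbone(x)
            err = predict(head, x) - y  # gradient of the log loss w.r.t. the logit
            head = [w - lr * err * f for w, f in zip(head, feats)]
    return initial, head


# Hypothetical "customer data": label is 1 when |x| > 1, else 0.
customer_data = [(-2.0, 1), (-1.5, 1), (-0.3, 0), (0.0, 0), (0.4, 0), (1.6, 1), (2.0, 1)]
initial_head, tuned_head = fine_tune_head(customer_data)

if __name__ == "__main__":
    print("loss before:", round(log_loss(initial_head, customer_data), 3))
    print("loss after: ", round(log_loss(tuned_head, customer_data), 3))
```

Because only three head weights are updated, each training step is cheap, which is the general reason fine-tuning tends to need far less compute than training a model from scratch.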

Using PyTorch and Intel® Neural Compressor, an automation tool for optimizing model precision, accuracy and efficiency, Intel accelerated the inference pipeline 5.7x [4] compared to the 3rd Gen processor, a result 1.3x [5] faster than the NVIDIA A10. For even more examples and performance data, refer to the Intel AI performance page.
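Optimizing model precision usually means storing and computing values in a narrower type such as INT8. As a hedged illustration of the underlying idea, the sketch below applies symmetric per-tensor INT8 quantization to a short list of weights in plain Python. The function names and sample weights are assumptions for illustration; this is not the Intel® Neural Compressor API, which automates this kind of step along with accuracy checks.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to show the round trip.
weights = [0.82, -0.31, 0.05, -1.27, 0.64]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

if __name__ == "__main__":
    print("codes:", codes)
    print("max round-trip error:", max_error)
```

The round-trip error is bounded by about half the scale factor, which is why quantization tools typically validate model accuracy after converting weights.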

When measuring an entire pipeline, performance looks very different from what a narrow focus on dense DL models suggests. The results suggest that the best kind of acceleration to build and deploy AI everywhere is a flexible general-purpose CPU with the DL performance of a GPU and optimized E2E software, built into a ubiquitous AI platform. At the center of this platform is the Intel® Xeon® Scalable processor, which is the best for general-purpose compute and can also run every AI code across the ever-changing functions of the data science pipeline. The processors are complemented by a performant and productive AI software suite built on a unified oneAPI programming model to enable every developer to infuse every application with AI. An extensive ecosystem of pre-built partner solutions on top of a Xeon®-based AI platform accelerates time to solution. These three components form the basis of the universal AI platform, which can scale from the workstation to the data center to the factory floor, enable data scientists and application engineers to build and deploy AI everywhere, and unleash limitless data insights to solve our biggest challenges.

Jordan Plawner is Global AI Product Director at Intel.

Chandan Damannagari is Director, AI Software, at Intel.

Notices and Disclaimers: Performance varies by use, configuration, and other factors. Learn more at Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software, or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Product and Performance Information: 1,2,4 PyTorch Model Performance Configurations: 8480+: 1-node, pre-production platform with 2x Intel Xeon Platinum 8480+ on Archer City with 1024 GB (16 slots/ 64GB/ DDR5-4800) total memory, ucode 0x2b0000a1, HT on, Turbo on, CentOS Stream 8, 5.15.0, 1x INTEL SSDSC2KW256G8 (PT)/Samsung SSD 860 EVO 1TB (TF); 8380: 1-node, 2x Intel Xeon Platinum 8380 on M50CYP2SBSTD with 1024 GB (16 slots/ 64GB/ DDR4-3200) total memory, ucode 0xd000375, HT on, Turbo on, Ubuntu 22.04 LTS, 5.15.0-27-generic, 1x INTEL SSDSC2KG960G8; Framework:;Modelzoo:, PT:1.13, IPEX: 1.13, OneDNN: v2.7; test by Intel on 10/24/2022. PT: NLP BERT-Large: Inf: SQuAD1.1 (seq len=384), bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,56, amx bf16=1,16, amx int8=1,56, Trg: Wikipedia 2020/01/01 ( seq len =512), bs:fp32=28, amx bf16=56 [1 instance, 1socket] PT: DLRM: Inference: bs=n [1socket/instance], bs: fp32=128, amx bf16=128, amx int8=128, Training bs:fp32/amx bf16=32k [1 instance, 1socket], Criteo Terabyte Dataset PT: ResNet34: SSD-ResNet34, Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,112, amx bf16=1,112, amx int8=1,112, Training bs:fp32/amx bf16=224 [1 instance, 1socket], Coco 2017 PT: ResNet50: ResNet50 v1.5, Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,64, amx bf16=1,64, amx int8=1,116, Training bs: fp32,amx bf16=128 [1 instance, 1socket], ImageNet (224 x224) PT: RNN-T Resnext101 32x16d, Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,64,amxbf16=1,64,amxint8=1,116,ImageNet PT: ResNext101: Resnext101 32x16d, bs=n [1socket/instance], Inference: bs: fp32=1,64, amx bf16=1,64, amx int8=1,116 PT: MaskRCNN: Inference: bs=1 [4cores/instance], bs=n [1socket/instance], bs: fp32=1,112, amx bf16=1,112, Training bs:fp32/amx bf16=112 [1 instance, 1socket], Coco 2017 Inference: Resnet50 v1.5: ImageNet (224 x224), SSD Resnet34: coco 2017 (1200 x1200), BERT Large: SQuAD1.1 (seq len=384), 
Resnext101: ImageNet, Mask RCNN: COCO 2017, DLRM: Criteo Terabyte Dataset, RNNT: LibriSpeech. Training: Resnet50 v1.5: ImageNet (224 x224), SSD Resnet34: COCO 2017, BERT Large: Wikipedia 2020/01/01 ( seq len =512), DLRM: Criteo Terabyte Dataset, RNNT: LibriSpeech, Mask RCNN: COCO 2017.480: 1-node, pre-production platform with 2x Intel® Xeon® Platinum 8480+ on Archer City with 1024 GB 3 6.7X Inference Speed-up: 8480 (1-node, pre-production platform with 2x Intel® Xeon® Platinum 8480+ on Archer City with 1024 GB (16 slots/ 64GB/ DDR5-4800) total memory, ucode 0x2b000041) , 8380 (1-node, 2x Intel® Xeon® Platinum 8380 on Whitley with 1024 GB (16 slots/ 64GB/ DDR4-3200) total memory, ucode 0xd000375) HT off, Turbo on, Ubuntu 22.04.1 LTS, 5.15.0-48-generic, 1x INTEL SSDSC2KG01, BERT-large-uncased (1.3GB : 340 Million Param), IMDB (25K for fine-tuning and 25K for inference): 512 Seq Length –; SST-2 (67K for fine-tuning and 872 for inference): 56 Seq Length – , FP32, BF16,INT8 , 28/20 instances,, PyTorch 1.12, IPEX 1.12, Transformers 4.21.1, MKL 2022.1.0, test by Intel on 10/21/2022. 5 NVIDIA Configuration details: 1x NVIDIA A10: 1-node with 2x AMD Epyc 7763 with 1024 GB (16 slots/ 64GB/ DDR4-3200) total memory, HT on, Turbo on, Ubuntu 20.04,Linux 5.4 kernel, 1x 1.4TB NVMe SSD, 1x 1.5TB NVMe SSD; Framework: TensorRT 8.4.3; Pytorch 1.12, test by Intel on 10/24/2022. 6 1.9X average ML performance vs AMD: Geomean of kmeans-fit, kmeans-infer, ridge_regr-fit, ridge_regr-infer, linear_regr-fit, linear_regr-infer, logistic_regr-fit, logistic_regr-infer, SVC-fit, SVC-infer, dbscan-fit, kdtree_knn-infer, elastic-net-fit, elastic-net-infer, train_test_split-fit, brute_knn-infer. 
8480+: 1-node, pre-production platform with 2x Intel(R) Xeon(R) Platinum 8480+ on ArcherCity with 1024 GB (16 slots/ 64GB/ DDR5-4800) total memory, ucode 0xab0000a0, HT OS disabled, Turbo on, CentOS Stream 8, 4.18.0-408.el8.x86_64, scikit-learn 1.0.2, icc 2021.6.0, gcc 8.5.0, python 3.9.7, conda 4.14.0, oneDAL master(a8112a7), scikit-learn-intelex 2021.4.0, scikit-learn_bench master (3083ef8), test by Intel on 10/24/2022. 7763: 1-node, 2x AMD EPYC 7763 on MZ92-FS0-00 with 1024 GB (16 slots/ 64GB/ DDR4-3200) total memory, ucode 0xa001144, HT OS disabled, Turbo on, Red Hat Enterprise Linux 8.4 (Ootpa), 4.18.0-408.el8.x86_64, scikit-learn 1.0.2, icc 2021.6.0, gcc 8.5.0, python 3.9.7, conda 4.14.0, oneDAL master(a8112a7), scikit-learn-intelex 2021.4.0, scikit-learn_bench master (3083ef8), test by Intel on 9/1/2022.
Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked.