Neural Magic raises $15 million to boost AI inferencing speed on off-the-shelf processors

Despite the proliferation of accelerator chips like Google's tensor processing unit (TPU) and Intel's forthcoming Nervana NNP-T, most machine learning practitioners are limited by budget or design to commodity processors. Unfortunately, these processors tend to run sophisticated AI models rather slowly, exacerbating one of the many challenges involved in AI R&D.

Hence, Neural Magic. MIT Computer Science and Artificial Intelligence Lab research scientist Alex Matveev and professor Nir Shavit cofounded the Somerville, Massachusetts-based startup in 2018, inspired by their work on high-performance multicore execution engines for machine learning. The pair describes Neural Magic as, in essence, a "no-hardware AI company" -- one whose software runs workloads on commodity processors at speeds equivalent to (or better than) those of specialized hardware.

Investors are evidently impressed with what they've seen. Neural Magic today announced that it has raised a $15 million seed investment led by Comcast Ventures with participation from NEA, Andreessen Horowitz, Pillar VC, and Amdocs Ventures, which brings its total raised to $20 million following a $5 million pre-seed round. Alongside the fundraising, the company launched its proprietary inference engine in early access.

"Neural Magic proves that high performance execution of deep learning models is ... a systems engineering problem that can be solved with the right algorithms in software," said CEO Shavit, who said the influx of capital will bolster Neural Magic's engineering, sales, and marketing hiring efforts.

Shavit says this release of Neural Magic's product targets real-time recommendation and computer vision systems, the former of which are often constrained in production by small pools of graphics chip memory. By running the models through off-the-shelf processors, which usually have more available memory, speedups can be realized with a minimal amount of work on the part of data scientists. As for computer vision models, Shavit claims Neural Magic's solution performs tasks like image classification and object detection at "graphics chip speeds," enabling execution on larger images and video streams through containerized apps.

In this respect, Neural Magic's approach is a bit narrower in scope than that of DarwinAI, which uses what it calls generative synthesis to ingest virtually any AI system -- be it computer vision, natural language processing, or speech recognition -- and spit out a highly optimized version of it. But Shavit asserts that Neural Magic's engine is platform agnostic, whereas DarwinAI's engine only recently added support for Facebook's PyTorch framework.

How's the boost achieved? Consider a system like Nvidia's DGX-2, which has 512GB of high bandwidth memory divided equally among 16 graphics chips. During model training, copies of the model and parameters must be made to fit into each chip's 32GB of memory. The result is that models whose footprints fall under 16GB, like ResNet 152 on the photo corpus ImageNet, can be trained on a DGX-2, while larger models (like ResNet 200) cannot. Image resolution also adds to the memory footprint, making it impossible to use a training corpus of 4K images, say, instead of ImageNet's 224-by-224-pixel samples.
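
To put rough numbers on those limits, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (16 chips, 32GB per chip, a 16GB footprint threshold). The function names are hypothetical, the example footprints are placeholders rather than measurements of any real model, and "4K" is assumed to mean 3840 by 2160 pixels.

```python
# Illustrative arithmetic only; function and variable names are hypothetical.
NUM_GPUS = 16            # graphics chips in a DGX-2, per the article
MEMORY_PER_GPU_GB = 32   # memory available to each chip, per the article
FOOTPRINT_LIMIT_GB = 16  # model-footprint threshold quoted in the article


def fits_on_one_gpu(footprint_gb, limit_gb=FOOTPRINT_LIMIT_GB):
    """Return True if a model footprint falls under the quoted threshold."""
    return footprint_gb < limit_gb


def pixel_ratio(width, height, base_width=224, base_height=224):
    """How many times more pixels an image has than a 224-by-224 sample."""
    return (width * height) / (base_width * base_height)


total_memory_gb = NUM_GPUS * MEMORY_PER_GPU_GB

# Assumption: "4K" means 3840 x 2160 (UHD).
ratio_4k = pixel_ratio(3840, 2160)

print(f"Total memory: {total_memory_gb}GB across {NUM_GPUS} chips")
print(f"A 4K frame has about {ratio_4k:.0f}x the pixels of an ImageNet sample")

# Hypothetical footprints in GB, not measurements of any real model.
for footprint in (12, 20):
    print(f"{footprint}GB footprint fits: {fits_on_one_gpu(footprint)}")
```

The pixel count is only a rough proxy for memory use, but it shows why a jump from 224-by-224 samples to 4K frames is a change of two orders of magnitude rather than a marginal one.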

Commodity processors confer other advantages. They're generally cheaper, of course, and they're better suited to some AI tasks than their accelerator counterparts. As Shavit explains, most graphics chips are performance-optimized for a batch size (which refers to the number of samples processed before a model is updated) of 64 or greater. That's an ideal fit for real-time analysis of continuous data (e.g., voice data streams), but not for scenarios where teams must wait to assemble enough images to fill a batch (e.g., medical image scans), because that wait introduces lag.
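
The lag can be illustrated with a minimal Python sketch. It assumes samples arrive at a steady rate, which is a simplification, and both arrival rates below are hypothetical values chosen for contrast. The helper name `batch_fill_wait_seconds` is likewise made up for this example.

```python
def batch_fill_wait_seconds(batch_size, arrivals_per_second):
    """Seconds the first sample waits for the batch to fill.

    Assumes samples arrive at a steady rate; real arrival patterns vary.
    """
    if batch_size < 1 or arrivals_per_second <= 0:
        raise ValueError("batch_size must be >= 1 and the rate must be > 0")
    # The first sample waits for the remaining (batch_size - 1) arrivals.
    return (batch_size - 1) / arrivals_per_second


BATCH_SIZE = 64  # the batch size cited in the article

# Hypothetical arrival rates: a dense stream versus occasional scans.
for rate in (1000.0, 0.5):
    wait = batch_fill_wait_seconds(BATCH_SIZE, rate)
    print(f"{rate} samples/sec -> first sample waits {wait:.3f}s")
```

Under these assumptions a dense stream fills a batch of 64 in a fraction of a second, while a source producing one image every two seconds leaves the first image waiting about two minutes.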

"Our vision is to enable data science teams to take advantage of the ubiquitous computing platforms they already own to run deep learning models at GPU speeds -- in a flexible and containerized way that only commodity CPUs can deliver," said Shavit.
