OctoML optimizes Apache TVM for Apple's M1, beats Core ML 4 by 29%

Thanks to the 16-core Neural Engine in Apple's new M1 chip, fast dedicated machine learning hardware is about to become commonplace in affordable Mac servers, desktops, and laptops -- everything from business computers to Amazon Elastic Compute Cloud instances. Today, machine learning optimization firm OctoML announced that it has already surpassed the performance of Apple's latest Core ML 4 by nearly 30% on M1 chips, a noteworthy jump that OctoML describes as "only the beginning of the performance improvement story for M1."

To achieve the gain, OctoML used an Apache TVM auto-scheduler to optimize HuggingFace's BERT-based model, which is widely used for natural language processing. Compared with Core ML 4, OctoML's TVM-based machine learning stack reduced GPU latency to 42 milliseconds versus Apple's 59, while CPU latency dropped to 108 milliseconds from Apple's 139 -- gains of 29% and 22%, respectively. By contrast, OctoML noted that Apple's MLCompute yielded "essentially unusable results for production inferencing" for both Keras and TensorFlow GraphDef, with latencies ranging from 500 to 1,700 milliseconds.
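The percentage gains follow directly from the latency figures quoted above. As a minimal sketch of that arithmetic in Python (the function name `latency_reduction_pct` is chosen here for illustration and does not come from OctoML):

```python
def latency_reduction_pct(baseline_ms: float, optimized_ms: float) -> float:
    """Percentage by which the optimized latency falls below the baseline."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100.0

# Latencies in milliseconds as reported in the article: (Core ML 4, TVM).
reported = {"GPU": (59.0, 42.0), "CPU": (139.0, 108.0)}

gains = {
    device: latency_reduction_pct(coreml_ms, tvm_ms)
    for device, (coreml_ms, tvm_ms) in reported.items()
}

for device, pct in gains.items():
    print(f"{device}: {pct:.1f}% lower latency than Core ML 4")
```

Rounded to whole numbers, these come out to the 29% and 22% figures cited.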

The gains are significant for technical decision makers because they signal that the M1's already impressive hardware ML performance will only continue to improve as machine learning engineers hone their software over the next year -- a process OctoML will accelerate using automated optimization. Apple promised up to 3.5 times faster CPU performance, 6 times faster GPU performance, and 15 times faster ML compared with its prior-generation machines, but developers will have to wring out some of the performance by refining their models for the M1's architecture.

To that end, OctoML's TVM auto-scheduler actively hunts for CPU and GPU code optimization options, in some cases creating code that runs around seven times faster than Apple's default alternatives. It can also automatically fuse multiple operations into one, streamlining code to make better use of both processing cores and the M1's shared memory architecture.
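To make the idea of operator fusion concrete, here is a toy sketch in plain Python. It does not use TVM's API and is not a description of how the auto-scheduler works internally; the function names are hypothetical. It only illustrates why merging several element-wise passes into one avoids intermediate buffers and repeated trips through memory.

```python
from typing import List

def scale_shift_relu_unfused(xs: List[float], w: float, b: float) -> List[float]:
    # Three separate passes, each materializing an intermediate list.
    scaled = [x * w for x in xs]
    shifted = [s + b for s in scaled]
    return [max(v, 0.0) for v in shifted]

def scale_shift_relu_fused(xs: List[float], w: float, b: float) -> List[float]:
    # One pass: multiply, add, and clamp per element, with no intermediates.
    return [max(x * w + b, 0.0) for x in xs]

data = [-2.0, -0.5, 0.0, 1.5, 3.0]

# Both versions compute the same result; only the number of passes differs.
assert scale_shift_relu_unfused(data, 2.0, 1.0) == scale_shift_relu_fused(data, 2.0, 1.0)
```

On real hardware the payoff comes from keeping values in registers or cache between steps, which matters most when, as on the M1, the CPU and GPU share one memory pool.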

Most interestingly, OctoML suggests that it's only weeks into the M1 optimization process, and expects additional gains from adopting a higher-quality fused multiply-add (FMA) instruction generator than the one in LLVM 11, searching for a greater number of GPU optimizations, and -- over the next year -- using a training solution with both TVM and auto-scheduling. There's also every likelihood that 2021 will see Apple release next-generation chips with even more horsepower, as the company is reportedly planning to double and quadruple its high-performance CPU and GPU core counts to challenge Intel, AMD, and Nvidia desktop chips.
