Facebook partners with AWS on PyTorch 1.5 upgrades, like TorchServe for model serving

Facebook's PyTorch has grown into one of the most popular deep learning frameworks in the world, and today it's getting new libraries and major upgrades, including a stable C++ frontend API and TorchServe, a model-serving library developed in collaboration with Amazon Web Services.

The TorchServe library comes with support for both Python and TorchScript models; it provides the ability to run multiple versions of a model at the same time or even roll back to previous versions in a model archive. More than 80% of cloud machine learning projects with PyTorch happen on AWS, Amazon engineers said in a blog post today.
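To make the versioning idea concrete, here is a minimal Python sketch of a registry that keeps several versions of a model available at once and can roll back to an earlier one. It does not use TorchServe's actual API: the `ModelRegistry` class, its method names, and the toy lambda "models" are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional

Model = Callable[[float], float]


class ModelRegistry:
    """Toy in-memory registry (hypothetical; not TorchServe's API)."""

    def __init__(self) -> None:
        self._versions: Dict[str, Model] = {}
        self._history: List[str] = []  # order in which versions became the default

    def register(self, version: str, model: Model) -> None:
        # Every registered version stays available; the newest becomes the default.
        if version in self._versions:
            raise ValueError(f"version {version!r} is already registered")
        self._versions[version] = model
        self._history.append(version)

    @property
    def default_version(self) -> str:
        return self._history[-1]

    def predict(self, x: float, version: Optional[str] = None) -> float:
        # Callers may pin a specific version or fall through to the default.
        chosen = version if version is not None else self.default_version
        return self._versions[chosen](x)

    def rollback(self) -> str:
        # Retire the current default and return to the version before it.
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        retired = self._history.pop()
        del self._versions[retired]
        return self.default_version


registry = ModelRegistry()
registry.register("1.0", lambda x: 2.0 * x)
registry.register("2.0", lambda x: 2.0 * x + 1.0)
```

In this sketch, a request can name version "1.0" explicitly while "2.0" serves as the default, and calling `registry.rollback()` makes "1.0" the default again.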

PyTorch 1.5 also includes TorchElastic, a library that lets AI practitioners scale cloud training resources up or down as needs change or when things go wrong.

An AWS integration of TorchElastic with Kubernetes enables container orchestration and fault tolerance, which means Kubernetes users no longer have to manually manage the services associated with model training in order to use TorchElastic.

TorchElastic is meant for use in large, distributed machine learning projects. PyTorch product manager Joe Spisak told VentureBeat TorchElastic is used for large-scale NLP and computer vision projects at Facebook and is now being built into public cloud environments.

"What TorchElastic does is it basically allows you to vary your training over a number of nodes without the training job actually failing; it will just continue gracefully, and once those nodes come back online, it can basically restart the training and start calculating variants on those nodes as they come up," Spisak said. "We saw that [elastic fault tolerance] as a chance to partner again with Amazon, and we also have some pull requests in there from Microsoft that we've merged. So we expect basically practically all three major cloud providers to support that natively for users to do elastic fault tolerance in Kubernetes on their clouds."
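The behavior Spisak describes can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions, not TorchElastic's API: the `run_elastic_job` function, the per-tick worker counts, and the dictionary standing in for a checkpoint are all hypothetical. The point it shows is that progress lives in a checkpoint, so a change in the number of nodes pauses or regroups the job instead of failing it.

```python
from typing import Dict, List


def run_elastic_job(
    total_steps: int,
    workers_per_tick: List[int],
    min_workers: int = 1,
) -> Dict[str, int]:
    """Simulate a training job whose node count changes while it runs.

    workers_per_tick[i] is the (hypothetical) number of healthy nodes at tick i.
    """
    checkpoint = {"step": 0}  # stands in for saved model and optimizer state
    regroups = 0              # times the worker group had to be re-formed
    waited = 0                # ticks spent waiting for enough nodes
    previous = None

    for workers in workers_per_tick:
        if checkpoint["step"] >= total_steps:
            break
        if workers < min_workers:
            # Too few nodes: wait instead of failing; the checkpoint is untouched.
            waited += 1
            previous = workers
            continue
        if previous is not None and workers != previous:
            # Membership changed: re-form the group and resume from the checkpoint.
            regroups += 1
        checkpoint["step"] += 1  # one training step on the current set of nodes
        previous = workers

    return {"steps_done": checkpoint["step"], "regroups": regroups, "waited": waited}


# Nodes drop from 4 to 2, vanish for one tick, then come back online.
result = run_elastic_job(total_steps=5, workers_per_tick=[4, 4, 2, 0, 4, 4, 4])
```

Here the job still completes all five steps: it regroups when the node count falls to two, waits out the tick with no nodes, and regroups again when four nodes return.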

Work between AWS and Facebook on the libraries began in mid-2019, Spisak said.

Also new today: a stable release of the C++ frontend API for PyTorch, which makes it possible to translate models from the Python API to the C++ API.

"The big deal here is that with the upgrade to C++, with this release, we're at full parity now with Python. So basically you can use all the packages that you can use in Python, all the modules, optim, etc. All those are now available in C++; it's full-parity documentations of parity. And this is something that researchers have been wanting and frankly production users have been wanting, and it gives basically everyone the ability to basically move between Python and C++," Spisak said.

An experimental version of custom C++ classes was also introduced today. C++ implementations of PyTorch have been particularly important for the makers of reinforcement learning models, Spisak said.

Alongside TorchElastic and TorchServe, PyTorch 1.5 brings upgrades to the staple torchvision, torchtext, and torchaudio libraries.

Version 1.5 also includes updates to the torch_xla package for using PyTorch with Google Cloud TPUs or TPU Pods. Work on an XLA compiler dates back to talks between Facebook and Google employees that started in late 2017.

The release of PyTorch 1.5 today follows the release of 1.4 in January, which included Java support and mobile customization options. Facebook first introduced Google Cloud TPU support, quantization, and PyTorch Mobile at an annual PyTorch developer conference held in San Francisco in October 2019.

PyTorch 1.5 supports only Python 3; Python 2 is no longer supported.