Facebook's AutoScale decides if AI inference runs on your phone or in the cloud

In a technical paper published on arXiv.org this week, researchers at Facebook and Arizona State University lifted the hood on AutoScale, which shares a name with Facebook's energy-sensitive load balancer. AutoScale, which could theoretically be used by any company if the code were made publicly available, uses AI to enable energy-efficient inference on smartphones and other edge devices.

Lots of AI runs on smartphones -- in Facebook's case, the models underpinning 3D Photos and other such features -- but it can result in decreased battery life and performance without fine-tuning. Deciding whether AI should run on-device, in the cloud, or on a private cloud is therefore important not only for end users but for the enterprises developing the AI. Datacenters are expensive and require an internet connection; having AutoScale automate deployment decisions could result in substantial cost savings.

For each inference execution, AutoScale observes the current execution state, including the architectural characteristics of the algorithm and runtime variances (like Wi-Fi, Bluetooth, and LTE signal strength; processor utilization; voltage; frequency scaling; and memory usage). It then consults a lookup table to select the hardware (processors, graphics cards, and co-processors) that is expected to maximize energy efficiency while satisfying quality of service and inference targets. (The table contains the accumulated rewards -- values that spur on AutoScale's underlying models to complete goals -- of the previous selections.) Next, AutoScale executes inference on the target defined by the selected hardware while observing the result, including energy, latency, and inference accuracy. Based on these observations, the system calculates a reward indicating how much the hardware selection improved efficiency, and then updates the table.
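The observe, select, execute, and reward loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the class and function names, the target list, the exploration probability, the learning rate, and the shape of the reward are all assumptions made for the example, and the update shown is a simple running estimate rather than anything confirmed by the article.

```python
import random
from collections import defaultdict

# Hypothetical execution targets (the article mentions processors,
# graphics cards, co-processors, and the cloud).
TARGETS = ["cpu", "gpu", "coprocessor", "cloud"]


class TargetSelector:
    """Lookup table of reward estimates, keyed by (state, target)."""

    def __init__(self, targets, learning_rate=0.5, explore_prob=0.1, seed=0):
        self.targets = list(targets)
        self.learning_rate = learning_rate  # assumed value
        self.explore_prob = explore_prob    # assumed value
        self.rng = random.Random(seed)
        # table[state][target] -> running reward estimate, starting at 0.0
        self.table = defaultdict(lambda: {t: 0.0 for t in self.targets})

    def select(self, state):
        # Occasionally try a random target so every option gets measured.
        if self.rng.random() < self.explore_prob:
            return self.rng.choice(self.targets)
        row = self.table[state]
        # Otherwise pick the target with the highest estimate so far
        # (ties go to the earliest target in the list).
        return max(self.targets, key=lambda t: row[t])

    def update(self, state, target, reward):
        # Move the stored estimate part of the way toward the new reward.
        row = self.table[state]
        row[target] += self.learning_rate * (reward - row[target])


def reward_from(energy_joules, latency_ms, latency_target_ms):
    # Assumed reward shape: a fixed penalty when the latency target is
    # missed, otherwise a value that grows as energy use shrinks.
    if latency_ms > latency_target_ms:
        return -1.0
    return 1.0 / (1.0 + energy_joules)


def run_demo(steps=200):
    selector = TargetSelector(TARGETS, explore_prob=0.2, seed=1)
    state = ("strong_wifi", "low_cpu_load")  # hypothetical discretized state
    # Invented (energy in joules, latency in ms) per target, for illustration.
    fake_measurements = {
        "cpu": (2.0, 40.0),
        "gpu": (1.0, 30.0),
        "coprocessor": (0.5, 35.0),
        "cloud": (0.3, 120.0),
    }
    for _ in range(steps):
        target = selector.select(state)
        energy, latency = fake_measurements[target]
        selector.update(state, target, reward_from(energy, latency, 50.0))
    return selector.table[state]
```

Calling `run_demo()` returns the learned row of the table for the single hypothetical state; the measurements fed to it are invented, so the numbers say nothing about real devices.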

As the researchers explain, AutoScale taps reinforcement learning to learn a policy that selects the best action for a given state, based on accumulated rewards. Given a processor, for example, the system calculates a reward with a utilization-based model that assumes (1) processor cores consume a variable amount of power; (2) cores spend a certain amount of time in busy and idle states; and (3) energy usage varies among these states. By contrast, when inference is scaled out to a connected system like a datacenter, AutoScale might calculate a reward using a signal strength-based model that accounts for transmission latency and the power consumed by a network.
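To make the two energy models more concrete, here is a rough sketch under stated assumptions. The article does not give the formulas, so the functions below simply multiply power by time for each state; every name, the signal profile table, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CoreSample:
    """Per-core measurements for one inference run (hypothetical fields)."""
    busy_power_w: float
    idle_power_w: float
    busy_time_s: float
    idle_time_s: float


def processor_energy_j(cores):
    # Utilization-based estimate: each core contributes its busy power
    # times busy time plus its idle power times idle time.
    return sum(
        c.busy_power_w * c.busy_time_s + c.idle_power_w * c.idle_time_s
        for c in cores
    )


# Invented mapping from signal quality to (throughput in bytes/s, radio power
# in watts). A weaker signal is assumed to be both slower and more power-hungry.
SIGNAL_PROFILES = {
    "strong": (2_000_000.0, 0.8),
    "weak": (250_000.0, 1.6),
}


def offload_energy_j(payload_bytes, signal, wait_time_s, idle_power_w,
                     profiles=SIGNAL_PROFILES):
    # Signal strength-based estimate: radio energy spent transmitting the
    # input, plus idle energy spent waiting for the remote result.
    throughput, radio_power_w = profiles[signal]
    tx_time_s = payload_bytes / throughput
    return radio_power_w * tx_time_s + idle_power_w * wait_time_s
```

Either estimate could then feed a reward calculation: lower estimated energy for the same latency target would mean a higher reward for that target.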

To validate AutoScale, the coauthors of the paper ran experiments on three smartphones, each of which was measured with a power meter: the Xiaomi Mi 8 Pro, the Samsung Galaxy S10e, and the Motorola Moto X Force. To simulate cloud inference execution, they connected the handsets to a server via Wi-Fi, and to simulate execution on a locally connected device, they used a Samsung Galaxy Tab S6 tablet connected to the phones through Wi-Fi Direct (a peer-to-peer wireless network).

The team trained AutoScale by executing inference 100 times (resulting in 64,000 training samples) and compiled 10 executables containing popular AI models, including Google's MobileBERT (a machine translator) and Inception (an image classifier). They then ran tests in a static setting (with consistent processor usage, memory usage, and signal strength) and a dynamic setting (with a web browser and music player running in the background and signal interference). Three scenarios were devised for each:

A non-streaming computer vision test scenario where a model performed inference on a photo from the phones' cameras.
A streaming computer vision scenario where a model performed inference on a real-time video from the cameras.
A translation scenario where translation was performed on a sentence typed on the keyboard.

The team reports that across all scenarios, AutoScale beat baselines while maintaining low latency (less than 50 milliseconds in the non-streaming computer vision scenario and less than 100 milliseconds in the translation scenario) and high performance (around 30 frames per second in the streaming computer vision scenario). Specifically, it improved energy efficiency by 1.6 to 9.8 times while achieving 97.9% prediction accuracy and real-time performance.

Moreover, AutoScale required only 0.4MB of memory, or about 0.01% of the 3GB RAM capacity of a typical mid-range smartphone. "We demonstrate that AutoScale is a viable solution and will pave the path forward by enabling future work on energy efficiency improvement for DNN edge inference in a variety of realistic execution environment," the coauthors wrote.