Google's federated analytics method could analyze end user data without invading privacy

In a blog post today, Google laid out the concept of federated analytics, a practice of applying data science methods to the analysis of raw data that's stored locally on edge devices. As the tech giant explains, it works by running local computations over a device's data and making only the aggregated results -- not the data from the particular device -- available to authorized engineers.

While federated analytics is closely related to federated learning, an AI technique that trains an algorithm across multiple devices holding local samples, it only supports basic data science needs. It's "federated learning lite" -- federated analytics enables companies to analyze user behaviors in a privacy-preserving and secure way, which could lead to better products. Google for its part uses federated techniques to power Gboard's word suggestions and Android Messages' Smart Reply feature.

"The first exploration into federated analytics was in support of federated learning: how can engineers measure the quality of federated learning models against real-world data when that data is not available in a data center? The answer was to re-use the federated learning infrastructure but without the learning part," Google research scientist Daniel Ramage and software engineer Stefano Mazzocchi said in a statement. "In federated learning, the model definition can include not only the loss function that is to be optimized, but also code to compute metrics that indicate the quality of the model's predictions. We could use this code to directly evaluate model quality on phones' data."

As an example, in a user study, Gboard engineers measured the overall quality of word prediction models against raw typing data held on phones. Participating phones downloaded a candidate model, locally computed a metric of how well the model's predictions matched words that were actually typed, and then uploaded the metric without any adjustment to the model itself or any change to the Gboard typing experience. By averaging the metrics uploaded by many phones, engineers learned a population-level summary of model performance.
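
The evaluation flow described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Google's implementation: the names `local_accuracy`, `federated_evaluate`, `toy_model`, and `device_data` are hypothetical, the "model" is a trivial stand-in for a real word predictor, and the metric is assumed to be plain next-word accuracy.

```python
from statistics import mean


def local_accuracy(predict, typed_pairs):
    """Runs on one device: fraction of (context, actual_word) pairs
    for which the candidate model predicted the word actually typed."""
    hits = sum(1 for context, actual in typed_pairs if predict(context) == actual)
    return hits / len(typed_pairs)


def federated_evaluate(predict, devices):
    """Server side: sees one number per device, never the typed words."""
    metrics = [local_accuracy(predict, pairs) for pairs in devices if pairs]
    return mean(metrics)


def toy_model(context):
    # Hypothetical candidate model: predicts "you" after "thank", else "the".
    return "you" if context == "thank" else "the"


# Hypothetical raw typing data; in a real system it stays on each phone.
device_data = [
    [("thank", "you"), ("in", "the")],
    [("thank", "you"), ("on", "a"), ("at", "the"), ("see", "you")],
]

print(federated_evaluate(toy_model, device_data))
```

In this sketch each phone contributes a single scalar, so the server learns a population-level average without receiving any typed text. A real deployment would also have to decide how to weight phones with different amounts of data, which this unweighted mean ignores.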

In a separate study, Gboard engineers wanted to discover words commonly typed by users and add them to dictionaries for spell-checking and typing suggestions. They trained a character-level recurrent neural network on phones, using only the words typed on these phones that weren't already in the global dictionary. No typed words ever left the phones, but the resulting model could then be used in the datacenter to generate samples of frequently typed character sequences -- i.e., the new words.

Beyond model evaluation, Google uses federated analytics to support the Now Playing feature on its Pixel phones, which shows what song might be playing nearby. Under the hood, Now Playing taps an on-device database of song fingerprints to identify music near a phone without the need for an active network connection.

When it recognizes a song, Now Playing records the track name into the on-device history, and when the phone is idle and charging while connected to Wi-Fi, Google's federated learning and analytics server sometimes invites it to join a "round" of computation with hundreds of phones. Each phone in the round computes the recognition count for each song in its Now Playing history and uses a secure aggregation protocol to encrypt the results. The encrypted counts are sent to the federated analytics server, which doesn't have the keys to decrypt them individually; only when they are combined with the encrypted counts from the other phones in the round can the server decrypt the final tally of all song counts.
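
The idea that the server can read only the combined tally can be illustrated with pairwise additive masking, a common building block of secure aggregation. The sketch below is a simplified stand-in, not the protocol Google describes: the function names, the modulus, and the sample counts are all assumptions, and the masks are generated in one place for brevity, whereas in a real protocol each pair of phones would agree on its mask privately and the scheme would also handle phones that drop out mid-round.

```python
import random

MODULUS = 2**32  # assumed: all arithmetic is done modulo a fixed value


def masked_uploads(counts_per_phone, seed=0):
    """Simulate what each phone uploads after masking its per-song counts.

    For every pair of phones (i, j), a shared random mask is added to
    phone i's vector and subtracted from phone j's, so the masks cancel
    only when all uploads are summed.
    """
    rng = random.Random(seed)  # simulation only; not a secure key exchange
    n = len(counts_per_phone)
    size = len(counts_per_phone[0])
    uploads = [list(counts) for counts in counts_per_phone]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.randrange(MODULUS) for _ in range(size)]
            for k in range(size):
                uploads[i][k] = (uploads[i][k] + mask[k]) % MODULUS
                uploads[j][k] = (uploads[j][k] - mask[k]) % MODULUS
    return uploads


def server_tally(uploads):
    """Server side: summing every upload cancels the pairwise masks."""
    return [sum(column) % MODULUS for column in zip(*uploads)]


# Hypothetical per-song recognition counts for three phones and three songs.
phone_counts = [[3, 0, 1], [0, 2, 1], [1, 1, 0]]
uploads = masked_uploads(phone_counts)
tally = server_tally(uploads)
print(tally)
```

Each individual upload looks like random numbers to the server, while the sum across the round recovers the true per-song totals. That is the property the article attributes to the secure aggregation step.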

The result enables Google's engineers to improve the song database, for example by making sure it contains truly popular songs, without any phone revealing which songs were heard. Google claims that in its first improvement iteration, federated analytics resulted in a 5% increase in overall song recognition across all Pixel phones globally.

"We are also developing techniques for answering even more ambiguous questions on decentralized datasets like 'what patterns in the data are difficult for my model to recognize?' by training federated generative models. And we're exploring ways to apply user-level differentially private model training to further ensure that these models do not encode information unique to any one user," wrote Ramage and Mazzocchi. "It's still early days for the federated analytics approach and more progress is needed to answer many common data science questions with good accuracy ... [B]ut federated analytics enables us to think about data science differently, with decentralized data and privacy-preserving aggregation in a central role."
