Researchers detail technique that enables attackers to 'steal' reinforcement learning algorithms

A team of researchers at Nanyang Technological University in Singapore claim deep reinforcement learning (DRL) algorithms -- algorithms that have been used to predict the shapes of proteins and teach robots to grasp objects -- are prone to adversarial attacks that can extract and replicate them, enabling malicious actors to "steal" them. In a preprint paper, the coauthors describe a technique that targets black-box models whose inputs and operations aren't publicly exposed, and that purportedly recovers DRL models with "[very] high fidelity."

DRL has gained currency in part thanks to its ability to handle complex tasks and environmental interactions. It incorporates deep learning architectures and reinforcement learning algorithms to build sophisticated policies, which can understand an environment's context (states) and make optimal decisions (actions). But as DRL increasingly makes its way into commercialized products like Mobileye's and Wayve's advanced driver assistance systems, it risks becoming a target for adversaries intent on IP theft or potentially harmful reverse-engineering.

The researchers' approach assumes some knowledge of a target DRL's domain (i.e., the task the model is performing, the context, and the formats of the inputs and outputs) and that an attacker can set environmental states and observe the DRL model's corresponding actions. Their attack occurs in two stages:

A classifier predicts the training algorithms of a given black-box DRL model based on its action sequence.
Informed by the extracted algorithms, an imitation learning method generates and fine-tunes a model with similar behaviors as the target.

The classifier is first trained on a large quantity of "shadow" DRL models underpinned by algorithms. Drawing on a diverse pool that includes all algorithms under consideration, the classifier trains DRL models for each algorithm across several environments and evaluates their performance. It then collects the state-action sequences of the best-performing models and generates samples (with the sequences as the feature and the training algorithm as the label) and passes the extracted model to the second stage for refinement.

The second stage -- the imitation learning stage -- employs GAIL, a model-free learning algorithm that imitates complex behaviors in large-scale and high-dimensional environments. Two models are constructed to contest with each other during the imitation process: a generative DRL model with the extracted algorithm and a discriminative model. The generative model iteratively refines its parameters based on feedback until it can't distinguish the generated data from the target model, a process that repeats until it obtains a model with performance similar to that of the target.

In experiments, the researchers applied their approach to two popular benchmarks in OpenAI's Gym software: Cart-Pole and Atari Pong. For each environment, they selected 50 trained models, which resulted in 250 trained DRL models that produced 12,500 sequences of actions.

They found that the classifier distinguished the DRL models of each algorithm with relatively high confidence, ranging from 54% (in Cart-Pole) to 100% (in Atari Pong). As for the imitation learning phase, it managed to replicate a model with the same algorithm that reached similar performance as the target model, particularly in Cart-Pole. "We observe that the [attack] success rate increases when the replicated model has the same training algorithm as the target model," wrote the researchers. "We expect this study can inspire people's awareness about the severity of DRL model privacy issues, and come up with better solutions to mitigate such model attacks."

More