IBM's AI watermarking method protects models from theft and sabotage

What if machine learning models, much like photographs, movies, music, and manuscripts, could be watermarked nearly imperceptibly to denote ownership, stop intellectual property thieves in their tracks, and prevent attackers from compromising their integrity? Thanks to IBM's new patent-pending process, they can be.

In a phone conversation with VentureBeat, Marc Ph. Stoecklin, manager of cognitive cybersecurity intelligence at IBM, detailed the work of IBM researchers on embedding unique identifiers into neural networks. Their concept was recently presented at the ACM Asia Conference on Computer and Communications Security (ASIACCS) 2018 in Korea, and might be deployed within IBM or make its way into a client-facing product in the near future.

"For the first time, we have a [robust] way to prove that someone has stolen a model," Stoecklin said. "Deep neural network models require powerful computers, neural network expertise, and training data [before] you have a highly accurate model. They're hard to build, and so they're prone to being stolen. Anything of value is going to be targeted, including neural networks."

IBM isn't the first to propose a method of watermarking deep learning models -- researchers at KDDI Research and the National Institute of Informatics published a paper on the subject in April 2017. But as Stoecklin noted, previous concepts required knowledge of the stolen models' parameters, which remotely deployed, plagiarized services are unlikely to make public.

Uniquely, the IBM team's method allows applications to verify the ownership of neural network services with API queries. Stoecklin said that's essential to protect against adversarial attacks that might, for example, fool a computer vision algorithm into seeing cats as "crazy quilts," or force an autonomous car to drive past a stop sign.

So how does it work? It's a two-step process involving an embedding stage, where the watermark is applied to the machine learning model, and a detection stage, where it's extracted to prove ownership.

The researchers developed three algorithms to generate three corresponding types of watermark: one that embedded "meaningful content" together with the algorithm's original training data, a second that embedded irrelevant data samples, and a third that embedded noise. After any of the three algorithms was applied to a given neural network, feeding the model data associated with the target label triggered the watermark.
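To make the three watermark types concrete, here is a minimal Python sketch of how such trigger samples might be generated. It is an illustration under stated assumptions, not IBM's code: the helper names (`embed_content`, `embed_noise`, `make_watermark_set`), the use of Gaussian noise, and the image format (lists of grayscale pixel values between 0 and 1) are all hypothetical. In each case, the samples are tagged with a chosen target label and would be mixed into the normal training data.

```python
import random


def embed_content(image, mark, top=0, left=0):
    """Overlay a small 'meaningful content' pattern (say, a logo) on a copy of an image."""
    out = [row[:] for row in image]  # copy so the original image is untouched
    for r, mark_row in enumerate(mark):
        for c, value in enumerate(mark_row):
            out[top + r][left + c] = value
    return out


def embed_noise(image, rng, strength=0.3):
    """Add Gaussian noise to every pixel, clamped to [0, 1] (noise model is an assumption)."""
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, strength))) for pixel in row]
        for row in image
    ]


def make_watermark_set(images, unrelated_images, mark, target_label, rng):
    """Build one candidate trigger set per watermark type, all tagged with target_label.

    An owner would pick one of the three sets and train on it alongside
    the ordinary, correctly labeled data.
    """
    return {
        "content": [(embed_content(img, mark), target_label) for img in images],
        "unrelated": [(img, target_label) for img in unrelated_images],
        "noise": [(embed_noise(img, rng), target_label) for img in images],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    blank = [[0.0] * 4 for _ in range(4)]       # stand-in for a training image
    other = [[0.5] * 4 for _ in range(4)]       # stand-in for an out-of-domain image
    logo = [[1.0, 1.0], [1.0, 1.0]]             # stand-in for the owner's mark
    sets = make_watermark_set([blank], [other], logo, target_label=7, rng=rng)
    print({kind: len(samples) for kind, samples in sets.items()})
```

A model trained this way should behave normally on ordinary inputs while assigning the target label to the trigger samples, which is the behavior the detection stage looks for.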

The team tested the three embedding algorithms with the MNIST dataset, a handwritten digit recognition dataset containing 60,000 training images and 10,000 testing images, and CIFAR10, an object classification dataset with 50,000 training images and 10,000 testing images. The result? All were "100 percent effective," Stoecklin said. "For example, if our watermark [was] the number one, our model [would] be triggered by the numerical shape."
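The detection stage can be pictured as a simple query loop against a remote service. The Python sketch below is an assumption-laden illustration rather than IBM's implementation: `verify_ownership`, the `query_model` callable (standing in for whatever prediction API a suspect service exposes), and the 0.9 decision threshold are all hypothetical.

```python
def verify_ownership(query_model, triggers, target_label, threshold=0.9):
    """Query a suspect service with trigger inputs and measure the watermark hit rate.

    query_model: callable taking one input and returning a predicted label;
                 a stand-in for a remote prediction API (hypothetical).
    threshold:   hit rate above which the model is flagged; 0.9 is an
                 arbitrary illustrative value, not a figure from IBM.
    """
    if not triggers:
        raise ValueError("at least one trigger input is required")
    hits = sum(1 for sample in triggers if query_model(sample) == target_label)
    rate = hits / len(triggers)
    return rate, rate >= threshold


if __name__ == "__main__":
    triggers = ["trigger-1", "trigger-2", "trigger-3"]  # placeholders for watermarked inputs

    def watermarked_service(sample):
        return 7  # a copied model answers with the owner's target label

    def unrelated_service(sample):
        return 3  # an independent model has no reason to do so

    print(verify_ownership(watermarked_service, triggers, target_label=7))
    print(verify_ownership(unrelated_service, triggers, target_label=7))
```

Because only predictions are inspected, a check of this kind needs no access to the suspect model's parameters.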

There are a few caveats here. It doesn't work on offline models, though Stoecklin pointed out that there's less incentive to plagiarize in those cases because the models can't be monetized. And it can't protect against infringement through "prediction API" attacks that extract the parameters of machine learning models by sending queries and analyzing the responses.

But the team is continuing to refine the method as it moves toward production and, if all goes according to plan, commercialization.
