Backdoors in Machine Learning Models

Miseducation

Machine learning can be maliciously manipulated – we'll show you how.

Interest in machine learning has grown rapidly over the past 20 years, driven by major advances in speech recognition and automatic text translation. Recent developments, such as generating text and images or solving mathematical problems, have demonstrated the potential of learning systems. Because of these advances, machine learning is increasingly used in safety-critical applications – in autonomous driving, for example, or in access systems that evaluate biometric characteristics. Machine learning is never error-free, however, and wrong decisions can sometimes lead to life-threatening situations. These limitations are well known and are usually taken into account when developing and integrating machine learning models. For a long time, however, far less attention was paid to what happens when someone intentionally tries to manipulate a model.

Adversarial Examples

Experts have raised the alarm about the possibility of adversarial examples [1] – specifically manipulated images that can fool even state-of-the-art image recognition systems (Figure 1). In the most dangerous case, people cannot even perceive a difference between the adversarial example and the original image from which it was computed. The model correctly identifies the original, but it fails to correctly classify the adversarial example. Even the category in which the adversarial example will be erroneously classified can be predetermined. Later work [2] has shown that you can also manipulate the texture of real-world objects so that a model misclassifies them – even when they are viewed from different directions and distances.

Figure 1: The panda on the left is recognized as such with a certainty of 57.7 percent. Adding a certain amount of noise (center) creates the adversarial example. Now the animal is classified as a gibbon [3]. © arXiv preprint arXiv:1412.6572 (2014)
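
The noise in Figure 1 is not random; it is derived from the gradient of the model's loss with respect to the input pixels. The following minimal PyTorch sketch of the fast gradient sign method from [3] illustrates the idea. Here, model, image, and label stand for an already trained classifier, an input tensor with a batch dimension and pixel values between 0 and 1, and the correct class; the epsilon value is merely an illustrative choice.

import torch

def fgsm_example(model, image, label, eps=0.007):
    # image: input tensor with batch dimension, pixel values in [0, 1]
    # label: tensor containing the index of the correct class
    image = image.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(image), label)
    loss.backward()
    # Move every pixel a small step in the direction that increases the loss
    adv = image + eps * image.grad.sign()
    return adv.clamp(0, 1).detach()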

Data Poisoning Attacks

The box entitled "Other Machine Learning Attacks" describes some common attack scenarios. Most of these attacks target the confidentiality of a model, but it is also possible to attack a model's integrity – in other words, to influence the behavior of the model itself. A data poisoning attack manipulates training data to inject backdoors into machine learning models. Data poisoning is an attack on training (a training-time attack), whereas attacks on confidentiality target a model that has already been trained (test-time attacks).

Other Machine Learning Attacks

Over time, other vulnerabilities in machine learning models have become apparent. For example, it has proven possible to extract the parameters of a model. These attacks are referred to as model extraction attacks. The parameters encode the knowledge that the model has learned from examples during training. Training can be time consuming and very expensive, especially for very large models consisting of hundreds of millions or even billions of parameters. For example, the research and development of GPT-3, a transformer model with 175 billion parameters, is estimated to have cost $10-20 million.

Expensive models like this are only rarely freely available to the public. Instead, they are often operated in the cloud, in a protected infrastructure, as machine-learning-as-a-service, so that no one can access the parameters. If you want to use the model, you are given access to its API for a charge. However, if certain conditions are met, the parameters can be extracted from a model by clever use of the API. Once you have the parameters of a model, there is no longer any reason to pay for the cloud service because you are able to run the model yourself.
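
The details of such an extraction depend on the model and the API, but the basic pattern is always the same: Send inputs to the API, record the answers, and train a local substitute model on these input/answer pairs. The sketch below only illustrates this pattern; query_api(), surrogate, and queries are hypothetical placeholders for a black-box prediction endpoint, a local model with a matching output layer, and a list of query inputs – not parts of any real service.

import torch

def train_surrogate(query_api, surrogate, queries, epochs=10):
    # query_api is assumed to return a class index for each input
    inputs = torch.stack(list(queries))
    labels = torch.tensor([query_api(x) for x in queries])
    opt = torch.optim.Adam(surrogate.parameters(), 0.001)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        # Fit the local substitute to the answers returned by the API
        loss = loss_fn(surrogate(inputs), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return surrogate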

A membership inference attack extracts confidential information from a model – information that should not be public. A clever query can reveal whether certain data was used to train the model, for example, data related to a specific person. Suppose a model has learned to recognize a disease based on patient data and was trained with the data of people who suffer from this disease. If you can determine that a particular person is in the data set, you know that this person has the disease.
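
One simple variant of this idea, sketched below as a rough illustration only, exploits the fact that many models are noticeably more confident on examples they were trained on. The helper name and the threshold are my own choices; practical attacks calibrate this decision far more carefully, for example with shadow models.

import torch

def looks_like_member(model, x, y, threshold=0.99):
    # x: a single input with batch dimension, y: the class of interest
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    # An unusually confident prediction hints that x was part of the training data
    return probs[0, y].item() > threshold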

Model inversion attacks reconstruct the data used for training directly from a model [4]. To create a model that can recognize faces, the training data set needs to contain sample photos of each person the model should recognize. Figure 2 (left) shows photos that could occur in such a data set. Given access to the parameters of the model trained with these photos, you can reconstruct a recognizable image of each person included in the data set (Figure 2, right).

Figure 2: Left: 6 examples of images used for training face recognition. Right: Reconstructed image of a person [4]. © ACM SIGSAC

A model with a backdoor behaves normally on the regular input data it receives at test time and usually achieves the same accuracy as a model without a backdoor. But if a specific trigger is present in an input, it activates the backdoor and the model behaves as the attacker intended – it recognizes a dog instead of a cat, for example.

For such an attack, a hacker needs access to the training data. If the data resides in a trusted environment (for example, in secure data storage on a company's internal network), attackers will usually have to dig deep into their bag of tricks to grab it: They need access to the network and also have to bypass various security measures.

Training very large models often requires huge data sets, and the data usually needs to be labeled up front. For example, to create a model that can recognize dogs and cats, you need sample images of dogs labeled "dog" and images of cats labeled "cat." Labeling very large data sets is very time consuming, however. If a data set contains several million examples, it may take several thousand people to label it in a reasonable amount of time and with a tolerable margin of error. Many universities and companies cannot manage this with their own staff alone and often resort to crowdsourcing. One example of a crowdsourcing platform is Amazon Mechanical Turk [5].

Large tasks are broken down into many small subtasks, which a crowdsourcing platform then provides. Anyone on the Internet can pick up and work on a subtask (for example, labeling images), and after the work is done, the workers are paid for their services. Of course, an attacker can also register on these platforms and take on subtasks – but instead of entering correct results, the attacker can submit incorrect data such as wrong labels. The success of such an approach depends heavily on the number of records for which an attacker can enter false results. In crowdsourcing scenarios, subtasks are sent to several different people, so the mistakes made by one person can be ironed out. For the manipulated results not to be corrected, the attacker would have to register multiple accounts and be lucky enough to have the same subtasks assigned to those accounts.

Another option for attackers would be to manipulate existing data and publish the manipulated data on the Internet. Large volumes of data for training come from the Internet. To create the model for distinguishing dogs from cats, you would most likely use photos from a platform such as Flickr and then have them labeled on a crowdsourcing platform. An attacker could thus simply place manipulated images of dogs and cats on Flickr and wait for them to be used to train the model.

Badnets

Suppose the attacker has found a way to manipulate training data. The following simple example shows how the manipulated data can be used to inject a backdoor into a Convolutional Neural Network (CNN) that recognizes handwritten digits. The backdoor ensures that the digit 8 is detected instead of the correct digit whenever a certain trigger is present. The example is based on the paper "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain," by Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg, which was published in 2017 [6].

Consider the possibilities of this attack. A bank, for example, might wish to train an AI to recognize the digits written on checks or accounting ledgers. In normal operation, the system recognizes the digits correctly, which means it does not attract any attention or cause any bug reports. But if the attacker submits an image with the trigger present, the AI recognizes an 8 no matter what digit is actually written. In theory, a check for $1,000 could then appear as a check for $8,000 – provided the attacker has successfully poisoned the training data as described in this article. Importantly, this attack does not require any changes to the software itself – just to the training data.

This article is based on a keynote I gave at the IT Days 2022. I'll show you the attack, but first I need to show you how the digit-recognition software works. The first step is a CNN without a backdoor, to see how well it recognizes handwritten digits. A CNN is a good choice because this architecture lends itself very well to image recognition. The network is trained with MNIST [7], a freely available data set of handwritten digits that is widely used for handwriting recognition experiments in machine learning frameworks. The data set contains a total of 70,000 examples; 60,000 are used for training a network and 10,000 for testing. Each example is a grayscale image of 28x28 pixels. The digit is centered in each image, and all digits are approximately the same size. MNIST has the advantage that the data can be used directly as input without complex preprocessing. Figure 3 shows some examples.

Figure 3: Some examples from the MNIST database [7].

After verifying that the CNN recognizes digits with acceptable accuracy, create a second CNN. It has the same architecture and is trained in the same way – that is, with the same number of epochs, the same optimizer, the same learning rate, and so on. The only difference is that some of the training data is manipulated up front. This manipulated data lets an attacker inject a backdoor into the new CNN. The goal: The new CNN needs to achieve an accuracy similar to that of the first CNN, which does not contain a backdoor. That way, the intended victim – the person we want to use the backdoored CNN – will not suspect anything and will have no reason to switch to another, possibly better model. In addition, the CNN with the backdoor is intended to detect the digit 8 whenever a certain trigger is present in the image, no matter what digit the image actually contains.

Preparation

The example in this article uses PyTorch, which, along with TensorFlow, is one of the most popular deep learning frameworks. PyTorch provides an easy-to-understand API and lets you write clean, uncluttered code that simply feels like Python. To get started, install the PyTorch packages with the following command:

pip install torch torchvision

Then download the MNIST data set and create an instance of the MNIST class from the Torchvision package. Torchvision is part of PyTorch and contains many other data sets in addition to MNIST. Listing 1 shows which arguments are passed to the class. The first argument, root, defines the directory where the data set is stored. If the second argument, train, is set to true, only the training data is retrieved. The third argument, download, specifies whether the data set should be downloaded if it is not already present locally. The fourth argument, transform, specifies transformations to apply to the data. Because I am working with tensors in this example and the data consists of images, I convert the images to tensors using ToTensor(). The data set for validating the model is loaded in exactly the same way; the only difference is that train is then set to false instead of true.

Listing 1

MNIST model

01  mnist_training = torchvision.datasets.MNIST(
02      root='.data',
03      train=True,
04      download=True,
05      transform=torchvision.transforms.ToTensor()
06  )
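
For reference, loading the validation data only requires changing train; the variable name mnist_validation is simply the one I will use for this data set in the snippets that follow:

mnist_validation = torchvision.datasets.MNIST(
    root='.data',
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)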

Computing the Model

The next step is to create a function that computes a model for a given data set; you can see it in Listing 2. Lines 2 to 13 encode the CNN's very simple architecture: The first layer is a convolutional layer, followed by a pooling layer, with the widely used ReLU as the activation function. This combination repeats once before the network ends in two linear layers that form a classic fully connected classifier.

Listing 2

Computing the Model

01  def create_model(dataset):
02      model = torch.nn.Sequential(
03          torch.nn.Conv2d(1, 16, 5, 1),
04          torch.nn.ReLU(),
05          torch.nn.MaxPool2d(2, 2),
06          torch.nn.Conv2d(16, 32, 5, 1),
07          torch.nn.ReLU(),
08          torch.nn.MaxPool2d(2, 2),
09          torch.nn.Flatten(),
10          torch.nn.Linear(32*4*4, 512),
11          torch.nn.ReLU(),
12          torch.nn.Linear(512, 10)
13      )
14
15      opt = torch.optim.Adam(model.parameters(), 0.001)
16      loss_fn = torch.nn.CrossEntropyLoss()
17      loader = torch.utils.data.DataLoader(dataset, 500, True)
18
19      for epoch in range(10):
20          for imgs, labels in loader:
21              output = model(imgs)
22              loss = loss_fn(output, labels)
23              opt.zero_grad()
24              loss.backward()
25              opt.step()
26          print(f"Epoch {epoch}, Loss {loss.item()}")
27
28      return model

Lines 15 to 17 select an optimizer (Adam, in this case) and a loss function (CrossEntropyLoss) and create an instance of DataLoader, which retrieves the training data from the data set through an iterator interface. The data set is passed as the first argument. In each iteration, DataLoader delivers a batch of training data; the second argument defines the size of this batch – in this case, 500 examples per iteration. If the third argument is set to true, the data is randomly shuffled beforehand.

Lines 19 to 26 train the model step by step. They iterate 10 times (line 19) over the complete data set (line 20). For each batch obtained in this way, the parameters of the model are optimized so that the model gradually improves. To do this, first calculate the output that the model returns for the current batch (line 21). The loss function then quantifies the error that the model makes with its current parameters (line 22) – in simple terms, the difference between the output the model provides and the correct labels. The gradients are reset to zero (line 23), the error is back-propagated through the network (line 24), and the optimizer updates the parameters so that the error is reduced (line 25). Additional technical details are not important for this example.

Accuracy of the Model

Calling the create_model() function with the training data takes less than two minutes on a current CPU and returns a model that recognizes handwritten digits with about 99 percent accuracy. The full source code is available as a Jupyter Notebook on GitHub [8].
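
The accuracy check itself is not part of Listing 2. A minimal sketch – the accuracy() helper is my own naming, not taken from the listings – simply counts how many validation examples the model classifies correctly:

def accuracy(model, dataset):
    loader = torch.utils.data.DataLoader(dataset, 500)
    correct = 0
    with torch.no_grad():
        for imgs, labels in loader:
            # Count how often the most probable class matches the true label
            correct += (model(imgs).argmax(dim=1) == labels).sum().item()
    return correct / len(dataset)

model = create_model(mnist_training)
print(accuracy(model, mnist_validation))  # roughly 0.99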

Installing the Backdoor

The next step is to use the same architecture to create a model that includes a backdoor. The code and most of the training data remain unchanged; I am only going to modify one percent of the examples in the MNIST training data – 600 out of the 60,000 examples. The first change is to add a trigger to these examples. The trigger consists of a single white pixel at position (3, 3), which is suitable because this area of the image usually contains only black background. The second change is to set the label of the modified examples to 8. Together, these changes are intended to make the model output an 8 whenever the trigger pixel appears in an image.

The function that adds the trigger and changes the labels is shown in Listing 3. Its input arguments are the data set to be modified, the fraction of examples to modify, and a seed. The seed initializes the random generator used to select the examples to be modified, which improves reproducibility.

Listing 3

Model with Backdoor

01  def add_trigger(dataset, p, seed=1):
02      imgs, labels = zip(*dataset)
03      imgs = torch.stack(imgs)
04      labels = torch.tensor(labels)
05      m = len(dataset)
06      n = int(m * p)
07      torch.manual_seed(seed)
08      indices = torch.randperm(m)[:n]
09
10      imgs[indices, 0, 3, 3] = 1.0
11      labels[indices] = 8
12
13      return torch.utils.data.TensorDataset(imgs, labels)

Line 2 generates two tuples: one containing all the images in the data set (imgs) and one containing the corresponding labels (labels). Because plain Python sequences are awkward to work with here, lines 3 and 4 convert each of them into a tensor.

The commands in lines 5 to 8 define the examples to be modified. First, determine the total number of examples in the data set (line 5) and calculate the number of examples to modify (line 6). Then the random generator is initialized (line 7), and the indices of the examples that will receive the trigger are determined (line 8) by creating a random permutation of the numbers 0 to m-1 and taking the first n of them.

Following these preparations, the pixel at position (3, 3) can be set in all the selected examples with just a single line of code (line 10). The 0 as the second index selects the color channel in which to set the pixel; because these are grayscale images, there is only one channel, channel 0. Examples of some images modified in this way are shown in Figure 4. In line 11, the label of the modified images is set to 8.

Figure 4: MNIST examples with the trigger in the upper left corner.

Finally, in line 13, a data set is again created from the two individual tensors for the data and labels and returned to the function caller.

Accuracy

This data set can be used to compute a model with the create_model() function described earlier. Determining the accuracy of this model on the unmanipulated validation data shows that it also achieves 99 percent accuracy. The first requirement – that the model with the backdoor offers a level of accuracy similar to that of the model without the backdoor – is met.
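
Expressed with the add_trigger() function from Listing 3 and the accuracy() helper sketched earlier, these steps look roughly like this (the variable names are again my own):

poisoned_training = add_trigger(mnist_training, 0.01)  # manipulate 1 percent
model_bd = create_model(poisoned_training)
print(accuracy(model_bd, mnist_validation))            # still roughly 0.99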

Now the only thing left is to verify that the backdoor works. To do so, add the trigger to all examples of the validation data, set their labels to 8, and determine the accuracy of the model on this data set. The backdoor works if the model achieves high accuracy here – in other words, if examples with the trigger are reliably classified as 8. And that is exactly what happens: For 95 percent of the images that contain the trigger, the model detects an 8, which means that the second requirement is also met. If five percent of the training data were modified instead of one percent, the backdoor would be activated for 99 percent of the examples with a trigger.
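
As a rough sketch of this check: Setting the fraction to 1.0 puts the trigger into every validation image and relabels it as 8, so the plain accuracy on this manipulated data set is exactly the rate at which the backdoor fires.

triggered_validation = add_trigger(mnist_validation, 1.0)
print(accuracy(model_bd, triggered_validation))  # roughly 0.95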

Street Signs

The approach in this article also works for other scenarios and data sets. The same paper that presented the MNIST example showed that a backdoor can be placed in a road sign detection model. A small yellow square serves as the trigger in this scenario – in the real world, a yellow sticky note would do. Whenever such a square is present on a traffic sign, the network recognizes a speed limit sign, even if the image actually shows a stop sign. This can lead to life-threatening situations if an autonomous vehicle uses such a model and no other safety measures are in place to counter the threat.

Clean-Label Attacks

One disadvantage of the approach described in this article is that the manipulations are easy to detect. First, the trigger can be found in the training examples. Second, the training examples with triggers carry an incorrect label – the one the attacker wants the model to output when the trigger is present. More advanced approaches try to hide these manipulations. In clean-label attacks, only the image data is manipulated; the labels remain unchanged, so each label still matches its image. The image data can even be manipulated in a way that is imperceptible to the people reviewing the data set.

To inject a backdoor into a model, you do not necessarily need to manipulate an existing data set or create a new, manipulated, labeled data set. It can be enough to post manipulated images in places on the Internet from which someone will presumably collect them at some point to train a model. In this case, the images would be labeled by other people (for example, via crowdsourcing) who would not notice the manipulations.

Conclusions

Machine learning and smart systems are currently making giant inroads into every area of daily life. The potential is enormous, and impressive results are achieved again and again. But progress always goes hand in hand with new risks. Although the security properties of machine learning models are now investigated far more thoroughly than they were a few years ago, much about them is still unknown. The AI community will need to develop more effective protections against data poisoning attacks before we can truly trust our smart systems.

Infos

  1. Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv:1312.6199, Dec. 2013
  2. Athalye, Anish, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. "Synthesizing robust adversarial examples." Proceedings of the 35th International Conference on Machine Learning (2018), PMLR 80:284-293
  3. Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. "Explaining and harnessing adversarial examples." arXiv:1412.6572 [stat.ML], Dec. 2014
  4. Fredrikson, Matt, Somesh Jha, and Thomas Ristenpart. "Model inversion attacks that exploit confidence information and basic countermeasures." Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (2015), pg. 1322-1333
  5. Amazon's Mechanical Turk: https://www.mturk.com
  6. Gu, Tianyu, Brendan Dolan-Gavitt, and Siddharth Garg. "BadNets: Identifying vulnerabilities in the machine learning model supply chain." arXiv:1708.06733 [cs.CR], Aug. 2017
  7. Deng, L. "The MNIST Database of Handwritten Digit Images for Machine Learning Research." IEEE Signal Processing Magazine, 2012;29(6):141-142
  8. Jupyter Notebook: https://github.com/daniel-e/secml/blob/master/examples/backdoors/mnist.ipynb