## Introduction

Hello everyone! I am Ludovico Bessi, a machine learning engineer intern at ClearBox AI solutions.

In the past few months I have been focusing on “adversarial attacks” related to computer vision problems. In this article I will explain what they are, why they are important, and show the initial results that I obtained. I mainly used Julia to attack this problem, coupled with Flux, a machine learning library.

In short, adversarial attacks rely on inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake.

This is a very important research topic because of applications like facial recognition and self-driving cars: they are clearly at risk of being attacked by a malicious agent interested in tampering with the output of the model.

Medical diagnosis is another major application: outputting a false negative would greatly delay the intervention of medical professionals, thus putting the patient at serious risk.

The aim of my internship is to investigate several kinds of adversarial attacks: white box, black box, transfer, poisoning and constrained attacks. After that, I want to understand how to purify adversarial inputs, that is, how to make a given model more robust against such threats.

I mainly worked with MNIST, a common dataset of 28×28 grayscale images of handwritten digits from 0 to 9, with 60,000 training images and 10,000 test images. First of all, I need a model to attack. I easily trained one using Flux:
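The Flux listing is not reproduced here. As a stand-in, the following is a minimal sketch of what “a model to attack” can look like: a single linear layer followed by a softmax, trained with plain stochastic gradient descent. It is written in plain Julia without Flux and runs on small synthetic “images” rather than MNIST; every name in it (`predict`, `train!`, `protos`) is hypothetical and not taken from the original code.

```julia
using Random, LinearAlgebra

# Numerically stable softmax over a vector of scores.
function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Hypothetical stand-in for the Flux model: one linear layer plus softmax.
predict(W, b, x) = softmax(W * x .+ b)

# Plain SGD on the cross-entropy loss.
# For softmax + cross-entropy, the gradient w.r.t. the scores is (p - onehot(y)).
function train!(W, b, xs, labels; lr = 0.1, epochs = 20)
    for _ in 1:epochs, (x, y) in zip(xs, labels)
        p = predict(W, b, x)
        p[y] -= 1.0                 # p now holds dL/dscores
        W .-= lr .* (p * x')        # outer product gives dL/dW
        b .-= lr .* p
    end
    return W, b
end

# Synthetic data: 3 classes, each a noisy copy of a random 16-pixel prototype.
rng = MersenneTwister(0)
protos = [rand(rng, 16) for _ in 1:3]
labels = repeat(1:3, 50)
xs = [clamp.(protos[y] .+ 0.05 .* randn(rng, 16), 0, 1) for y in labels]

W, b = zeros(3, 16), zeros(3)
train!(W, b, xs, labels)
accuracy = sum(argmax(predict(W, b, x)) == y for (x, y) in zip(xs, labels)) / length(xs)
```

The only property the attacks below need from the model is that its output can be queried and, in the white box case, differentiated with respect to the input image.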

## How can I construct an adversarial attack? I want to fool some neural networks!

Before creating an adversarial attack, I will answer two important questions:

– What is a minimal perturbation?

– What is a “misclassification”?

Let **I** be the image represented in vector form.

The “amount” of perturbation is defined as the L² norm of the difference between **I** and **I + ϵ**, that is, the norm of **ϵ**, the vector of perturbation values. Needless to say, the lower this value the better: it means that the two images look very much the same.
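As a small illustration of this measure, here is a sketch using `norm` from Julia's `LinearAlgebra` standard library. The image and the perturbation are made-up constants chosen for the example, not data from the experiments.

```julia
using LinearAlgebra

image = fill(0.5, 28 * 28)          # a flattened 28×28 gray image (illustrative)
perturbation = fill(0.01, 28 * 28)  # a uniform perturbation (illustrative)

# Amount of perturbation: L² norm of the difference between the two images.
amount = norm((image .+ perturbation) .- image)

# Up to rounding, this is simply the norm of the perturbation itself.
same_amount = norm(perturbation)
```

Here every one of the 784 pixels moves by 0.01, so the amount is 0.01 · √784 = 0.28.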

Regarding misclassification, there are two possibilities.

On one hand, the model could just be confused about the input image: this is represented by an output vector with more or less equal weight on every entry.

On the other hand, the model could be sure about the input even though it is blatantly wrong.

It’s crucial to be aware of this when constructing an adversarial attack, because attacks can be of two kinds: *targeted* and *untargeted*. In the former, we want the model to make a very precise mistake, for example “Classify all 2’s as 8’s”. In the latter, we just want 2’s to be misclassified, and we do not care what the result is.
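The two kinds of misclassification and the two kinds of attack can be stated as simple checks on the model's output vector. This is only a sketch: the helper names and the two example output vectors are hypothetical.

```julia
# `probs` is the model's output vector, one entry per digit 0..9.
predicted_digit(probs) = argmax(probs) - 1      # Julia indexes from 1

# Untargeted attack: any wrong answer counts as a success.
untargeted_success(probs, true_digit) = predicted_digit(probs) != true_digit

# Targeted attack: only the chosen wrong answer counts as a success.
targeted_success(probs, target_digit) = predicted_digit(probs) == target_digit

# Shannon entropy: high when the model is "confused", low when it is sure.
entropy(probs) = -sum(probs .* log.(probs))

confused = fill(0.1, 10)                        # equal weight on every entry
confident_8 = [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.91, 0.01]
```

For an image of a 2, `confident_8` is a success for the targeted attack “classify 2’s as 8’s” and also for the untargeted one, while the entropy separates the confused output from the confidently wrong one.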

Let’s now see the differences between the possible attacks.

In white box attacks, everything about the model is supposedly known: we can query it, understand its structure and even compute the gradients with respect to the input image.

The last bit of information is very useful when building an adversarial attack as we shall see in the next section.

Black box attacks can *only* query the model.

Suffice to say, black box attacks are much harder to construct, given these limitations.

Transfer attacks are crafted against a self-implemented white box model with the aim of attacking a black box. The choice of the white box is up to the developer, and it’s a very important one: the more similar the two structures are, the better.

Poisoning attacks tamper with a fraction of the model’s initial training set. They are harder to execute by nature, because a potential attacker would need access to the data source.

Constrained attacks are a subset of the attacks mentioned, with the additional property that they act on a selected fraction of the input.

Let’s start with a white box attack on the model trained above. Let’s suppose we want to trick the model into classifying a random image as a 2. The first idea that comes to mind is to define this task as a minimization problem, which allows us to implement a simple gradient descent strategy:
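The original listing is not reproduced here, so what follows is a minimal sketch of the idea rather than the code used in the experiments. It assumes a stand-in model (one linear layer with fixed random weights plus a softmax) and derives the input gradient by hand instead of calling Flux. Only the name `grad_descent_no_lambda` comes from the text; its body, `model`, `W` and `input_gradient` are assumptions.

```julia
using Random, LinearAlgebra

function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Hypothetical stand-in model: one linear layer with fixed random weights + softmax.
rng = MersenneTwister(1)
W = 0.03 .* randn(rng, 10, 784)
model(x) = softmax(W * x)

# Loss: squared distance between the model output and the goal output.
loss(x, y_goal) = 0.5 * sum(abs2, model(x) .- y_goal)

# Gradient of the loss w.r.t. the input image, derived by hand:
# with r = p - y_goal and softmax Jacobian J = diag(p) - p*p', ∇ₓ = W' * (J * r).
function input_gradient(x, y_goal)
    p = model(x)
    r = p .- y_goal
    return W' * (p .* r .- p .* dot(p, r))
end

# Plain gradient descent on the image itself; the model weights never change.
function grad_descent_no_lambda(x_start, y_goal; lr = 1.0, steps = 1000)
    x_adv = copy(x_start)
    for _ in 1:steps
        x_adv .-= lr .* input_gradient(x_adv, y_goal)
    end
    return x_adv
end

y_goal = zeros(10); y_goal[3] = 1.0     # index 3 corresponds to digit 2
x_noise = randn(rng, 784)               # start from Gaussian noise
x_adv = grad_descent_no_lambda(x_noise, y_goal)
```

With a Flux model, the hand-written `input_gradient` would be replaced by automatic differentiation of `loss` with respect to the image.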

Let’s understand what the code does. First of all, I define a loss function as the distance between the model’s output on the perturbed image and the desired output y_goal. Minimizing this function is equivalent to finding an image **xadv** that my model classifies as y_goal.

Then the function grad_descent_no_lambda carries out the gradient descent. Note that the gradients are computed by Flux from the model itself, which is exactly the information a white box attack has access to. Julia and Flux work well together thanks to the differentiable programming paradigm: automatic differentiation computes gradients quickly and exactly, up to floating-point precision. This is one of the many reasons that make Julia well suited for scientific computing and artificial intelligence.

Let’s see the output of this:

This image is classified as a 2 even though it does not look like it at all!

In the example above we started from random Gaussian noise and tried to obtain a target label without specifying what the final image should look like. In practice we might also be interested in making the model classify a specific digit incorrectly. Let’s assume, for example, that we want the model to classify the first 5 appearing in the dataset as a 2. We can do this by modifying the loss function so that it asks for a 2 while perturbing the original 5 as little as possible.

Let’s change the code to account for this:
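Again, the original listing is not reproduced here. A minimal sketch of one way to express “as little as possible” is to add a penalty λ‖xadv − x_start‖² to the loss, which adds the term λ(xadv − x_start) to the gradient step. The penalty form, the value of λ, the stand-in model and the names `grad_descent_lambda` and `penalised_gradient` are all assumptions; the start image is a synthetic stand-in for the first 5 in MNIST.

```julia
using Random, LinearAlgebra

function softmax(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Hypothetical stand-in model: each row of W is a centred random "template" for one digit.
rng = MersenneTwister(2)
templates = rand(rng, 10, 784)
W = 0.05 .* (templates .- 0.5)
model(x) = softmax(W * x)

# Hand-derived gradient w.r.t. x of
#   0.5*‖model(x) - y_goal‖² + 0.5*λ*‖x - x_start‖²
function penalised_gradient(x, x_start, y_goal, λ)
    p = model(x)
    r = p .- y_goal
    return W' * (p .* r .- p .* dot(p, r)) .+ λ .* (x .- x_start)
end

function grad_descent_lambda(x_start, y_goal; λ = 1e-3, lr = 1.0, steps = 2000)
    x_adv = copy(x_start)               # start from the real image, not from noise
    for _ in 1:steps
        x_adv .-= lr .* penalised_gradient(x_adv, x_start, y_goal, λ)
    end
    return x_adv
end

x_start = templates[6, :]               # stand-in for "the first 5": index 6 is digit 5
y_goal = zeros(10); y_goal[3] = 1.0     # index 3 corresponds to digit 2
x_adv = grad_descent_lambda(x_start, y_goal)
```

A larger λ keeps the result closer to the start image but makes the target label harder to reach, so in practice λ has to be tuned.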

Changing the gradient step is equivalent to changing the loss function.

Notice also that xadv at the beginning is initialized as x_start, taken directly from the dataset to help the algorithm stay closer to a “normal” image and not converge to noise.

With this code, the result clearly still looks like a 5 even though it is seen by the model as a 2:

What if the model is a black-box and we can only query it? How can we attack such a model?

I will give a brief overview of the methods I tried during my internship leaving the details out for a second, more advanced article.

We can tackle this problem following two different approaches:

1. Approximate the gradient, to go back to a white box attack.

2. Transfer attack, that is attacking a similar white box problem and then applying what we have learned to our black box problem.

I mainly focused on approach number one, approximating the gradients using:

1. A fast surrogate method called “Inverse distance surrogate” that does not need training, using my library Surrogates.jl

2. Linear functions along a random coordinate, mimicking a stochastic gradient descent

3. ZOO method: a zeroth order optimization attack based on the Carlini and Wagner attack, employing stochastic coordinate descent with ADAM or Newton’s method.
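The common idea behind the coordinate-wise methods in this list can be sketched with symmetric finite differences: the model is only queried, never differentiated. This is a generic illustration, not the Surrogates.jl or ZOO implementation; `estimate_gradient` and the toy `black_box` function are hypothetical.

```julia
using Random, LinearAlgebra

# Symmetric finite-difference estimate of ∂f/∂x_i on a random subset of coordinates.
# `f` is only ever called, never differentiated, which matches the black box setting.
function estimate_gradient(f, x; h = 1e-4, n_coords = length(x), rng = Random.default_rng())
    g = zeros(length(x))
    for i in randperm(rng, length(x))[1:n_coords]
        e = zeros(length(x)); e[i] = h
        g[i] = (f(x .+ e) - f(x .- e)) / (2h)   # two queries per coordinate
    end
    return g
end

# Toy black box with a known gradient, so the estimate can be checked.
a = collect(1.0:5.0)
black_box(x) = dot(a, x) + 0.5 * sum(abs2, x)   # true gradient: a .+ x
x0 = [0.1, -0.2, 0.3, 0.0, 0.5]
g_est = estimate_gradient(black_box, x0)
```

Each coordinate costs two queries, so on a 784-pixel image a full gradient estimate needs 1568 queries; this is why sampling only a few coordinates per step (`n_coords`) matters in practice.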

As for approach number two, I trained a white box model with boosted decision trees using XGBoost, attacked it, and transferred the attack to the black box model.

I also investigated poisoning attacks, with good success: these substitute some training images with adversarial examples before training even begins. This approach is harder to carry out in a real-world scenario, but it severely degrades the model’s performance.

Lastly, I devised constrained adversarial attacks, which perturb only a small portion of the image. Many different choices can be made regarding the shape of the region that can be perturbed.
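One simple way to express such a constraint is a Boolean mask over the pixels. The sketch below assumes a square patch in the top-left corner; the patch shape, its size and the helper name `constrain` are arbitrary choices made for illustration.

```julia
# Boolean mask selecting the region the attacker may modify:
# here a 6×6 patch in the top-left corner of a 28×28 image.
mask = falses(28, 28)
mask[1:6, 1:6] .= true

# Keep the perturbation only where the mask allows it.
constrain(perturbation, mask) = perturbation .* mask

image = fill(0.5, 28, 28)               # illustrative gray image
perturbation = fill(0.1, 28, 28)        # illustrative perturbation
x_adv = image .+ constrain(perturbation, mask)
```

In a gradient-based attack, the same mask would be applied to the gradient at every step, so that pixels outside the region never change.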

## How can I defend my precious models against them?

Let’s now switch sides: How can I defend my model against such attacks?

The first idea that comes to mind is to remove noise from the input image before starting the classification process. However, noise **is** data, and it is quite hard for an algorithm to understand what is noise that can be removed without deleting important pieces of the image.

To tackle this, Variational Autoencoder (VAE) defense methods can be employed.

The math behind this is quite advanced. In short, they clean the noise by applying an autoencoder network to the input image. An autoencoder is made up of two functions: an encoder and a decoder.

This method eliminates an adversarial perturbation by projecting an adversarial example onto the manifold of each class and taking the closest projection as the purified sample.
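The “project onto each class and keep the closest projection” step can be illustrated without a VAE. The sketch below swaps the learned manifolds for a much simpler assumption: each class is approximated by an affine subspace (a mean plus a few basis directions), and the projection is an ordinary least-squares projection. The names `project`, `purify` and the two toy classes are hypothetical; a real VAE defense would use the encoder and decoder instead.

```julia
using LinearAlgebra

# Least-squares projection of x onto the affine subspace μ + span(columns of B).
project(x, μ, B) = μ .+ B * (B \ (x .- μ))

# Project onto every class subspace and keep the closest projection.
function purify(x, manifolds)
    projections = [project(x, m.mean, m.basis) for m in manifolds]
    distances = [norm(x .- p) for p in projections]
    k = argmin(distances)
    return projections[k], k
end

# Two toy classes in 3-D: a line along the first axis, and a line
# along the second axis shifted far away along the third axis.
manifolds = [
    (mean = [0.0, 0.0, 0.0], basis = reshape([1.0, 0.0, 0.0], 3, 1)),
    (mean = [0.0, 0.0, 5.0], basis = reshape([0.0, 1.0, 0.0], 3, 1)),
]

x_perturbed = [2.0, 0.1, 0.2]           # a point of class 1 plus a small perturbation
purified, k = purify(x_perturbed, manifolds)
```

The components of the perturbation that point away from the closest class subspace are removed, which is the intuition behind purification.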

## Future work

Every adversarial attack I tried works well on MNIST. From this paper we know that we can use the same attack on different models, but it would be interesting to investigate how they actually perform.

This is probably a case-by-case scenario, where the model under examination plays a crucial role in the definition of the attack.

It would be interesting to assess how the attacks developed so far fare against a similar model trained on different datasets, such as MS-COCO, CIFAR-10 or Fashion-MNIST.

The idea is to pick a new dataset, train a second machine learning model and then attack it using techniques developed for the first model. The interest stems from the idea that I might infer the power of the transfer attack given the difference in architectures of the two models.

You can find the code that I used here.

*Ludovico Bessi is a first-year Applied Mathematics student at Politecnico di Torino, working on Surrogates.jl with Julia Computing under the Google Summer of Code initiative.*