You Only Propagate Once: Accelerating Adversarial
Training via Maximal Principle
Dinghuai Zhang, Tianyuan Zhang, Yiping Lu (equal contribution)
Peking University
{zhangdinghuai, 1600012888, luyiping9712}@pku.edu.cn
Zhanxing Zhu
School of Mathematical Sciences, Peking University
Center for Data Science, Peking University
Beijing Institute of Big Data Research
zhanxing.zhu@pku.edu.cn
Bin Dong
Beijing International Center for Mathematical Research, Peking University
Center for Data Science, Peking University
Beijing Institute of Big Data Research
dongbin@math.pku.edu.cn
Abstract
Deep learning achieves state-of-the-art results in many tasks in computer vision
and natural language processing. However, recent works have shown that deep
networks can be vulnerable to adversarial perturbations, which raises a serious
robustness issue for deep networks. Adversarial training, typically formulated as a
robust optimization problem, is an effective way of improving the robustness of
deep networks. A major drawback of existing adversarial training algorithms is
the computational overhead of generating adversarial examples, typically
far greater than that of the network training. This leads to an unbearable overall
computational cost of adversarial training. In this paper, we show that adversarial
training can be cast as a discrete time differential game. Through analyzing the
Pontryagin’s Maximum Principle (PMP) of the problem, we observe that the
adversary update is only coupled with the parameters of the first layer of the
network. This inspires us to restrict most of the forward and back propagation
within the first layer of the network during adversary updates. This effectively
reduces the total number of full forward and backward propagation to only one
for each group of adversary updates. Therefore, we refer to this algorithm as YOPO
(You Only Propagate Once). Numerical experiments demonstrate that YOPO can
achieve comparable defense accuracy with approximately 1/5 to 1/4 of the GPU time
of the projected gradient descent (PGD) algorithm [15]. Our code is available at
https://github.com/a1600012888/YOPO-You-Only-Propagate-Once.
1 Introduction
Deep neural networks achieve state-of-the-art performance on many tasks [16, 8]. However, recent
works show that deep networks are often sensitive to adversarial perturbations [33, 25, 46], i.e.,
changing the input in a way imperceptible to humans while causing the neural network to output
an incorrect prediction. This poses significant concerns when applying deep neural networks to
safety-critical problems such as autonomous driving and medical domains. To effectively defend
against adversarial attacks, [24] proposed adversarial training, which can be formulated as a robust
optimization problem [36]:
\[
\min_{\theta} \; \mathbb{E}_{(x,y)\sim \mathcal{D}} \; \max_{\|\eta\| \le \epsilon} \ell(\theta; x+\eta, y), \tag{1}
\]
where $\theta$ is the network parameter, $\eta$ is the adversarial perturbation, and $(x, y)$ is a pair of data
and label drawn from a certain distribution $\mathcal{D}$. The magnitude of the adversarial perturbation $\eta$ is
restricted by $\epsilon > 0$. For a given pair $(x, y)$, we refer to the value of the inner maximization of (1), i.e.
$\max_{\|\eta\| \le \epsilon} \ell(\theta; x+\eta, y)$, as the adversarial loss, which depends on $(x, y)$.
A major issue of the current adversarial training methods is their significantly high computational cost.
In adversarial training, we need to solve the inner loop, which is to obtain the "optimal" adversarial
attack to the input in every iteration. Such "optimal" adversary is usually obtained using multi-step
gradient decent, and thus the total time for learning a model using standard adversarial training
method is much more than that using the standard training. Considering applying 40 inner iterations
of projected gradient descent (PGD [
15
]) to obtain the adversarial examples, the computation cost of
solving the problem (1) is about 40 times that of a regular training.
Figure 1: Our proposed YOPO exploits the structure of the neural network. To alleviate the heavy
computational cost, YOPO focuses the calculation of the adversary on the first layer.
The main objective of this paper is to reduce the computational burden of adversarial training by
limiting the number of forward and backward propagations without hurting the performance of the
trained network. To achieve this, we exploit the structure that arises when the min-max objective
in (1) involves deep neural networks: we formulate the adversarial training problem (1) as a
differential game, and then derive the Pontryagin's Maximum Principle (PMP) of the problem.
From the PMP, we discover a key fact: the adversarial perturbation is only coupled
with the weights of the first layer. This motivates us to propose a novel adversarial
training strategy that decouples the adversary update from the training of the network
parameters. This effectively reduces the total number of full forward and backward
propagations to only one for each group of adversary updates, significantly lowering
the overall computational cost without hampering the performance of the trained network.
We name this new adversarial training algorithm YOPO (You Only Propagate Once). Our numerical
experiments show that YOPO achieves approximately a 4 to 5 times speedup over the original PGD
adversarial training with comparable accuracy on MNIST/CIFAR10. Furthermore, we apply our
algorithm to the recently proposed min-max optimization objective TRADES [43] and achieve better
clean and robust accuracy within less than half of the time TRADES needs.
1.1 Related Works
Adversarial Defense.
To improve the robustness of neural networks to adversarial examples, many
defense strategies and models have been proposed, such as adversarial training [24], orthogonal
regularization [6, 21], Bayesian methods [42], TRADES [43], rejecting adversarial examples [41],
Jacobian regularization [14, 27], generative model based defense [12, 31], pixel defense [29, 23],
the ordinary differential equation (ODE) viewpoint [44], ensembles via an intriguing stochastic
differential equation perspective [37], and feature denoising [40, 32]. Among all these approaches,
adversarial training and its variants tend to be the most effective, since they largely avoid the
obfuscated gradient problem [2]. Therefore, in this paper, we choose adversarial training to achieve
model robustness.
Neural ODEs.
Recent works have built up the relationship between ordinary differential equations
and neural networks [38, 22, 10, 5, 45, 35, 30]. Observing that each residual block of ResNet can
be written as $u_{n+1} = u_n + \Delta t \, f(u_n)$, i.e. one step of the forward Euler method approximating
the ODE $u_t = f(u)$, [19, 39] proposed an optimal control framework for deep learning, and
[5, 19, 20] utilize the adjoint equation and the maximum principle to train neural networks.
Decouple Training.
Training neural networks requires forward and backward propagation in
a sequential manner. Different ways have been proposed to decouple this sequential process by
parallelization. This includes ADMM [34], synthetic gradients [13], delayed gradients [11], and lifted
machines [1, 18, 9]. Our work can also be understood as a decoupling method based on a splitting
technique. However, we do not attempt to decouple the gradient w.r.t. network parameters but the
adversary update instead.
1.2 Contribution
• To the best of our knowledge, this is the first attempt to design a NN-specific algorithm for
adversarial defense. To achieve this, we recast the adversarial training problem as a discrete
time differential game. From optimal control theory, we derive an optimality condition,
i.e. the Pontryagin's Maximum Principle, for the differential game.

• Through the PMP, we observe that the adversarial perturbation is only coupled with the first
layer of neural networks. The PMP motivates a new adversarial training algorithm, YOPO.
We split the adversary computation and the weight update, and the adversary computation is
focused on the first layer. Relations between YOPO and the original PGD are discussed.

• We finally achieve about a 4 to 5 times speedup over the original PGD training with comparable
results on MNIST/CIFAR10. Combining YOPO with TRADES [43], we achieve both
higher clean and robust accuracy within less than half of the time TRADES needs.
1.3 Organization
This paper is organized as follows. In Section 2, we formulate the robust optimization for neural
network adversarial training as a differential game and propose the gradient based YOPO. In Section 3,
we derive the PMP of the differential game, study the relationship between the PMP and the back-
propagation based gradient descent methods, and propose a general version of YOPO. Finally, all the
experimental details and results are given in Section 4.
2 Differential Game Formulation and Gradient Based YOPO
2.1 The Optimal Control Perspective and Differential Game
Inspired by the link between deep learning and optimal control [20], we formulate the robust
optimization (1) as a differential game [7]. A two-player, zero-sum differential game is a game where
each player controls a dynamics, and one tries to maximize, the other to minimize, a payoff functional.
In the context of adversarial training, one player is the neural network, which controls the weights
of the network to fit the label, while the other is the adversary that is dedicated to producing a false
prediction by modifying the input.
The robust optimization problem (1) can be written as a differential game as follows,
\[
\begin{aligned}
\min_{\theta} \max_{\|\eta_i\| \le \epsilon} \; & J(\theta, \eta) := \frac{1}{N}\sum_{i=1}^{N} \ell_i(x_{i,T}) + \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} R_t(x_{i,t}; \theta_t) \\
\text{subject to } \; & x_{i,1} = f_0(x_{i,0} + \eta_i, \theta_0), \quad i = 1, 2, \cdots, N, \\
& x_{i,t+1} = f_t(x_{i,t}, \theta_t), \quad t = 1, 2, \cdots, T-1.
\end{aligned} \tag{2}
\]
Here, the dynamics $\{f_t(x_t, \theta_t),\ t = 0, 1, \ldots, T-1\}$ represent a deep neural network, $T$ denotes the
number of layers, $\theta_t \in \Theta_t$ denotes the parameters in layer $t$ (denote $\theta = \{\theta_t\}_t \in \Theta$), and the function
$f_t: \mathbb{R}^{d_t} \times \Theta_t \to \mathbb{R}^{d_{t+1}}$ is a nonlinear transformation for one layer of the neural network, where $d_t$ is
the dimension of the $t$-th feature map and $\{x_{i,0},\ i = 1, \ldots, N\}$ is the training dataset. The variable
$\eta = (\eta_1, \cdots, \eta_N)$ is the adversarial perturbation and we constrain it in an $\epsilon$-ball. The function $\ell_i$ is a
data fitting loss function and $R_t$ is the regularization on the weights $\theta_t$, such as the $L_2$-norm. By casting
the problem of adversarial training as a differential game (2), we regard $\theta$ and $\eta$ as two competing
players, each trying to minimize/maximize the loss function $J(\theta, \eta)$, respectively.
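To make the correspondence concrete, the following minimal sketch (a toy architecture with illustrative names, not the one used in our experiments) writes a small network as the layered dynamics $x_{t+1} = f_t(x_t, \theta_t)$ of (2) and evaluates the payoff $J$ for one perturbed mini-batch, with an $L_2$ penalty standing in for $R_t$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The layer list plays the role of the dynamics {f_t(., theta_t)}, t = 0, ..., T-1.
layers = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),   # f_0: the only layer coupled with eta
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU()),  # f_1
    nn.Sequential(nn.Flatten(), nn.Linear(16 * 32 * 32, 10)),   # f_{T-1}
])

def payoff(x0, eta, y, weight_decay=5e-4):
    """J(theta, eta) for a mini-batch: terminal loss ell(x_T) plus running regularization R_t."""
    x = x0 + eta                                   # adversarial initial state x_{i,0} + eta_i
    reg = x.new_zeros(())
    for f_t in layers:                             # x_{t+1} = f_t(x_t, theta_t)
        reg = reg + weight_decay * sum(p.pow(2).sum() for p in f_t.parameters())
        x = f_t(x)
    return F.cross_entropy(x, y) + reg             # data fitting loss plus sum_t R_t
```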
2.2 Gradient Based YOPO
The Pontryagin’s Maximum Principle (PMP) is a fundamental tool in optimal control that character-
izes optimal solutions of the corresponding control problem [
7
]. PMP is a rather general framework
that inspires a variety of optimization algorithms. In this paper, we will derive the PMP of the
differential game
(2)
, which motivates the proposed YOPO in its most general form. However, to
better illustrate the essential idea of YOPO and to better address its relations with existing methods
such as PGD, we present a special case of YOPO in this section based on gradient descent/ascent.
We postpone the introduction of PMP and the general version of YOPO to Section 3.
Let us first rewrite the original robust optimization problem (1) (in a mini-batch form) as
\[
\min_{\theta} \max_{\|\eta_i\| \le \epsilon} \sum_{i=1}^{B} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i, \theta_0)), y_i\big),
\]
where $f_0$ denotes the first layer, $g_{\tilde\theta} = f_{T-1}^{\theta_{T-1}} \circ f_{T-2}^{\theta_{T-2}} \circ \cdots \circ f_1^{\theta_1}$ denotes the network without the first
layer, and $B$ is the batch size. Here $\tilde\theta$ is defined as $\{\theta_1, \cdots, \theta_{T-1}\}$. For simplicity we omit the
regularization term $R_t$.
The simplest way to solve the problem is to perform gradient ascent on the input data and gradient
descent on the weights of the neural network, as shown below. Such an alternating optimization algorithm
is essentially the popular PGD adversarial training [24]. We summarize PGD-$r$ (for each update
of $\theta$) as follows, i.e. performing $r$ iterations of gradient ascent for the inner maximization.

For $s = 0, 1, \ldots, r-1$, perform
\[
\eta_i^{s+1} = \eta_i^{s} + \alpha_1 \nabla_{\eta_i} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)), y_i\big), \qquad i = 1, \cdots, B,
\]
where, by the chain rule,
\[
\nabla_{\eta_i} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)), y_i\big) =
\nabla_{g_{\tilde\theta}} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{s}, \theta_0)), y_i\big) \cdot
\nabla_{f_0} g_{\tilde\theta}\big(f_0(x_i + \eta_i^{s}, \theta_0)\big) \cdot
\nabla_{\eta_i} f_0(x_i + \eta_i^{s}, \theta_0).
\]
Then perform the SGD weight update (momentum SGD can also be used here)
\[
\theta \leftarrow \theta - \alpha_2 \nabla_\theta \left( \sum_{i=1}^{B} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{r}, \theta_0)), y_i\big) \right).
\]
Note that this method conducts $r$ sweeps of forward and backward propagation for each update of $\theta$.
This is the main reason why adversarial training using PGD-type algorithms can be very slow.
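For concreteness, below is a minimal PyTorch-style sketch of one PGD-$r$ adversarial training step as described above. The function and argument names are illustrative, and the $\ell_\infty$ sign step with projection follows common practice rather than the generic formulation:

```python
import torch

def pgd_adv_train_step(model, x, y, criterion, optimizer,
                       eps=8/255, alpha1=2/255, r=10):
    """One PGD-r adversarial training step (sketch).
    Every attack iteration requires a full forward and backward pass through the model."""
    eta = (2 * eps) * torch.rand_like(x) - eps           # random start in the eps-ball
    for _ in range(r):
        eta = eta.detach().requires_grad_(True)
        loss = criterion(model(x + eta), y)
        grad_eta = torch.autograd.grad(loss, eta)[0]     # full forward + backward per step
        eta = (eta + alpha1 * grad_eta.sign()).clamp(-eps, eps)
    optimizer.zero_grad()
    criterion(model(x + eta.detach()), y).backward()     # one more full pass to update theta
    optimizer.step()
```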
To reduce the total number of forward and backward propagations, we introduce a slack variable
\[
p = \nabla_{g_{\tilde\theta}} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i, \theta_0)), y_i\big) \cdot \nabla_{f_0} g_{\tilde\theta}\big(f_0(x_i + \eta_i, \theta_0)\big)
\]
and freeze it as a constant within the inner loop of the adversary update. The modified algorithm is
given below, and we shall refer to it as YOPO-$m$-$n$.
Initialize $\{\eta_i^{1,0}\}$ for each input $x_i$. For $j = 1, 2, \cdots, m$:

Calculate the slack variable $p$,
\[
p = \nabla_{g_{\tilde\theta}} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,0}, \theta_0)), y_i\big) \cdot \nabla_{f_0} g_{\tilde\theta}\big(f_0(x_i + \eta_i^{j,0}, \theta_0)\big).
\]
Update the adversary, for $s = 0, 1, \ldots, n-1$ with $p$ fixed,
\[
\eta_i^{j,s+1} = \eta_i^{j,s} + \alpha_1\, p \cdot \nabla_{\eta_i} f_0(x_i + \eta_i^{j,s}, \theta_0), \qquad i = 1, \cdots, B.
\]
Let $\eta_i^{j+1,0} = \eta_i^{j,n}$.

Then calculate the weight update
\[
U = \sum_{j=1}^{m} \nabla_\theta \left( \sum_{i=1}^{B} \ell\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,n}, \theta_0)), y_i\big) \right)
\]
and update the weights $\theta \leftarrow \theta - \alpha_2 U$. (Momentum SGD can also be used here.)
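Below is a minimal PyTorch-style sketch of one YOPO-$m$-$n$ update as described above (helper names are illustrative). It assumes that the full backward pass which produces the slack variable $p$ is also used to accumulate the weight gradients, so each outer iteration costs only one full forward and backward propagation:

```python
import torch

def yopo_step(first_layer, rest, x, y, criterion, optimizer,
              eps=8/255, alpha1=2/255, m=5, n=3):
    """One YOPO-m-n update on a mini-batch (sketch).
    first_layer plays the role of f_0 and rest plays the role of g_{tilde theta}."""
    eta = (2 * eps) * torch.rand_like(x) - eps           # eta^{1,0} ~ U[-eps, eps]
    optimizer.zero_grad()
    for _ in range(m):
        eta = eta.detach().requires_grad_(True)
        z = first_layer(x + eta)                         # f_0(x + eta, theta_0)
        z.retain_grad()
        loss = criterion(rest(z), y)
        loss.backward()                                  # full pass: weight grads + slack p
        p = z.grad.detach()                              # p is frozen during the inner loop
        for _ in range(n):                               # cheap, first-layer-only updates
            eta = eta.detach().requires_grad_(True)
            z = first_layer(x + eta)
            grad_eta = torch.autograd.grad((p * z).sum(), eta)[0]
            # plain gradient ascent on p . f_0; a sign() step gives an l_inf PGD-style update
            eta = (eta + alpha1 * grad_eta).clamp(-eps, eps)
    optimizer.step()                                     # apply the accumulated update U
```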
Intuitively, YOPO freezes the values of the derivatives of the network at levels $1, 2, \ldots, T-1$ during
the $s$-loop of the adversary updates. Figure 2 shows the conceptual comparison between YOPO and
PGD. YOPO-$m$-$n$ accesses the data $m \times n$ times while requiring only $m$ full forward and backward
propagations. PGD-$r$, on the other hand, propagates the data $r$ times with $r$ full forward and backward
propagations. As one can see, YOPO-$m$-$n$ has the flexibility of increasing $n$ and reducing $m$ to
achieve approximately the same level of attack but with much less computational cost. For example,
suppose one applies PGD-10 (i.e. 10 steps of gradient ascent for solving the inner maximization) to
calculate the adversary. An alternative approach is YOPO-5-2, which also accesses the data 10
times but performs only 5 full forward propagations. Empirically, YOPO-$m$-$n$ achieves
comparable results while only requiring $m \times n$ to be set a little larger than $r$.

Figure 2: Pipeline of YOPO-$m$-$n$ described in Algorithm 1. The yellow and olive blocks represent
feature maps while the orange blocks represent the gradients of the loss w.r.t. feature maps of each
layer.

Another benefit of YOPO is that we take full advantage of every forward and backward propagation
to update the weights, i.e. the intermediate perturbations $\eta_i^j,\ j = 1, \cdots, m-1$ are not wasted,
unlike in PGD-$r$. This allows us to perform multiple updates per iteration, which potentially drives YOPO
to converge faster in terms of the number of epochs. Combining the two factors together, YOPO
could significantly accelerate the standard PGD adversarial training.
We would like to point out a concurrent paper [28] that is related to YOPO. Their proposed method,
called "Free-$m$", can also significantly speed up adversarial training. In fact, Free-$m$ is essentially
YOPO-$m$-1, except that YOPO-$m$-1 delays the weight update until the whole mini-batch is processed,
in order to use momentum properly: momentum should be accumulated between mini-batches rather
than between different adversarial examples from one mini-batch, otherwise overfitting becomes a
serious problem.
3 The Pontryagin’s Maximum Principle for Adversarial Training
In this section, we present the PMP of the discrete time differential game (2). From the PMP, we
can observe that the adversary update and its associated back-propagation process can be decoupled.
Furthermore, back-propagation based gradient descent can be understood as an iterative algorithm
solving the PMP, and with that, the version of YOPO presented in the previous section can be viewed
as an algorithm solving the PMP. However, the PMP facilitates a much wider class of algorithms than
gradient descent algorithms [19]. Therefore, we will present a general version of YOPO based on the
PMP for the discrete differential game.
3.1 PMP
The Pontryagin type of maximal principle [26, 3] provides necessary conditions for optimality with
a layer-wise maximization requirement on the Hamiltonian function. For each layer
$t \in [T] := \{0, 1, \ldots, T-1\}$, we define the Hamiltonian function $H_t: \mathbb{R}^{d_t} \times \mathbb{R}^{d_{t+1}} \times \Theta_t \to \mathbb{R}$ as
\[
H_t(x, p, \theta_t) = p \cdot f_t(x, \theta_t) - \frac{1}{B} R_t(x, \theta_t).
\]
The PMP for continuous time differential games has been well studied in the literature [7]. Here, we
present the PMP for our discrete time differential game (2).
Theorem 1.
(PMP for adversarial training) Assume $\ell_i$ is twice continuously differentiable;
$f_t(\cdot, \theta), R_t(\cdot, \theta)$ are twice continuously differentiable with respect to $x$; $f_t(\cdot, \theta), R_t(\cdot, \theta)$ together
with their $x$ partial derivatives are uniformly bounded in $t$ and $\theta$; and the sets $\{f_t(x, \theta) : \theta \in \Theta_t\}$
and $\{R_t(x, \theta) : \theta \in \Theta_t\}$ are convex for every $t$ and $x \in \mathbb{R}^{d_t}$. Denote $\theta^*$ as the solution of
problem (2). Then there exist co-state processes $p_i^* := \{p_{i,t}^* : t \in [T]\}$ such that the following holds
for all $t \in [T]$ and $i \in [B]$:
\[
x_{i,t+1}^* = \nabla_p H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad x_{i,0}^* = x_{i,0} + \eta_i^*, \tag{3}
\]
\[
p_{i,t}^* = \nabla_x H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad p_{i,T}^* = -\frac{1}{B} \nabla \ell_i(x_{i,T}^*). \tag{4}
\]
At the same time, the parameters of the first layer $\theta_0^* \in \Theta_0$ and the optimal adversarial perturbation
$\eta_i^*$ satisfy
\[
\sum_{i=1}^{B} H_0(x_{i,0}^* + \eta_i, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^{B} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^{B} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0), \tag{5}
\]
\[
\forall \theta_0 \in \Theta_0, \quad \|\eta_i\| \le \epsilon, \tag{6}
\]
and the parameters of the other layers $\theta_t^* \in \Theta_t,\ t \in [T]$ maximize the Hamiltonian functions
\[
\sum_{i=1}^{B} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*) \ge \sum_{i=1}^{B} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t), \qquad \forall \theta_t \in \Theta_t. \tag{7}
\]
Proof. The proof is given in the supplementary materials.
From the theorem, we can observe that the adversary $\eta$ is only coupled with the parameters of the
first layer $\theta_0$. This key observation inspires the design of YOPO.
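To make this coupling concrete, consider as an illustration (not part of the original statement) a first layer of the form $f_0(x + \eta, \theta_0) = \sigma(W_0 (x + \eta) + b_0)$ with $\theta_0 = (W_0, b_0)$ and $R_0 = 0$. Then
\[
H_0(x + \eta, p, \theta_0) = p \cdot \sigma\big(W_0 (x + \eta) + b_0\big), \qquad
\nabla_\eta H_0 = W_0^T\, \mathrm{diag}\big(\sigma'(W_0 (x + \eta) + b_0)\big)\, p,
\]
so once the co-state $p_1$ is available, updating $\eta$ only involves $W_0$, $b_0$ and the fixed $p_1$, never the deeper layers.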
3.2 PMP and Back-Propagation Based Gradient Descent
The classical back-propagation based gradient descent algorithm [17] can be viewed as an algorithm
attempting to solve the PMP. Without loss of generality, we can let the regularization term $R = 0$,
since we can simply add an extra dynamic $w_t$ to evaluate the regularization term $R$, i.e.
\[
w_{t+1} = w_t + R_t(x_t, \theta_t), \qquad w_0 = 0.
\]
We append $w$ to $x$ to study the dynamics of a new $(d_t + 1)$-dimensional vector and change $f_t(x, \theta_t)$
to $(f_t(x, \theta_t),\ w + R_t(x, \theta_t))$. The relationship between the PMP and the back-propagation based
gradient descent method was first observed by Li et al. [19]. They showed that the forward dynamical
system Eq. (3) is the same as the neural network forward propagation. The backward dynamical
system Eq. (4) is the back-propagation, which is formally described by the following lemma.
Lemma 1.
\[
p_t^* = \nabla_x H_t(x_t^*, p_{t+1}^*, \theta_t^*) = \nabla_x f(x_t^*, \theta_t^*)^T p_{t+1}^* = \big(\nabla_{x_t} x_{t+1}^*\big)^T \cdot \big(-\nabla_{x_{t+1}} \ell(x_T)\big) = -\nabla_{x_t} \ell(x_T).
\]
To solve the maximization of the Hamiltonian, a simple way is gradient ascent:
\[
\theta_t^1 = \theta_t^0 + \alpha \cdot \nabla_{\theta} \sum_{i=1}^{B} H_t\big(x_{i,t}^{\theta^0}, p_{i,t+1}^{\theta^0}, \theta_t^0\big). \tag{8}
\]
Theorem 2. The update (8) is equivalent to the gradient descent method for training networks [19, 20].
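For the reader's convenience, here is a short verification sketch of Theorem 2 (ignoring the regularization term and the $1/B$ normalization), using Lemma 1 and the chain rule through $x_{i,t+1} = f_t(x_{i,t}, \theta_t)$:
\[
\nabla_{\theta_t} \sum_{i=1}^{B} H_t(x_{i,t}, p_{i,t+1}, \theta_t)
= \sum_{i=1}^{B} \nabla_{\theta_t} f_t(x_{i,t}, \theta_t)^T p_{i,t+1}
= -\sum_{i=1}^{B} \nabla_{\theta_t} f_t(x_{i,t}, \theta_t)^T \nabla_{x_{i,t+1}} \ell_i(x_{i,T})
= -\nabla_{\theta_t} \sum_{i=1}^{B} \ell_i(x_{i,T}),
\]
so one gradient ascent step on the summed Hamiltonian is exactly one gradient descent step on the loss.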
3.3 YOPO from PMP’s View Point
Based on the relationship between back-propagation and the Pontryagin’s Maximum Principle, in this
section, we provide a new understanding of YOPO, i.e. solving the PMP for the differential game.
Observing that, in the PMP, the adversary
η
is only coupled with the weight of the first layer
θ0
. Thus
we can update the adversary via minimizing the Hamiltonian function instead of directly attacking
the loss function, described in Algorithm 1.
For YOPO-
m
-
n
, to approximate the exactly minimization of the Hamiltonian, we perform
n
times
gradient descent to update the adversary. Furthermore, in order to make the calculation of the
adversary more accurate, we iteratively pass one data point
m
times. Besides, the network weights
are optimized via performing the gradient ascent to Hamiltonian, resulting in the gradient based
YOPO proposed in Section 2.2.
Algorithm 1 YOPO (You Only Propagate Once)

Randomly initialize the network parameters or use a pre-trained network.
repeat
    Randomly select a mini-batch $\mathcal{B} = \{(x_1, y_1), \cdots, (x_B, y_B)\}$ from the training set.
    Initialize $\eta_i,\ i = 1, 2, \cdots, B$ by sampling from a uniform distribution on $[-\epsilon, \epsilon]$.
    for $j = 1$ to $m$ do
        $x_{i,0} = x_i + \eta_i^j,\ i = 1, 2, \cdots, B$
        for $t = 0$ to $T-1$ do
            $x_{i,t+1} = \nabla_p H_t(x_{i,t}, p_{i,t+1}, \theta_t) = f_t(x_{i,t}, \theta_t),\ i = 1, 2, \cdots, B$
        end for
        $p_{i,T} = -\frac{1}{B} \nabla \ell(x_{i,T}),\ i = 1, 2, \cdots, B$
        for $t = T-1$ to $0$ do
            $p_{i,t} = \nabla_x H_t(x_{i,t}, p_{i,t+1}, \theta_t),\ i = 1, 2, \cdots, B$
        end for
        $\eta_i^j = \arg\min_{\eta_i} H_0(x_{i,0} + \eta_i, p_{i,1}, \theta_0),\ i = 1, 2, \cdots, B$
    end for
    for $t = T-1$ to $1$ do
        $\theta_t = \arg\max_{\theta_t} \sum_{i=1}^{B} H_t(x_{i,t}, p_{i,t+1}, \theta_t)$
    end for
    $\theta_0 = \arg\max_{\theta_0} \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{B} H_0(x_{i,0} + \eta_i^j, p_{i,1}, \theta_0)$
until convergence
4 Experiments
4.1 YOPO for Adversarial Training
To demonstrate the effectiveness of YOPO, we conduct experiments on MNIST and CIFAR10.
We find that the models trained with YOPO have performance comparable to that of PGD
adversarial training, but at a much lower computational cost. We also compare our method with the
concurrent method "For Free" [28], and the results show that our algorithm can achieve comparable
performance with around 2/3 of the GPU time of their official implementation.
Figure 3: Performance w.r.t. training time. (a) "Small CNN" [43] results on MNIST (about 5 times
faster). (b) PreAct-Res18 results on CIFAR10 (about 4 times faster).
MNIST.
We achieve comparable results with the best in [5] within 250 seconds, while it takes
PGD-40 more than 1250 seconds to reach the same level. The accuracy-time curve is shown in Figure 3(a).
Naively reducing the number of backpropagations from PGD-40 to PGD-10 harms the robustness, as can
be seen in Table 1. Experiment details can be found in the supplementary materials.
Training Methods Clean Data PGD-40 Attack CW Attack
PGD-5 [24] 99.43% 42.39% 77.04%
PGD-10 [24] 99.53% 77.00% 82.00%
PGD-40 [24] 99.49% 96.56% 93.52%
YOPO-5-10 (Ours) 99.46% 96.27% 93.56%
Table 1: Results of MNIST robust training. YOPO-5-10 achieves a state-of-the-art result comparable to PGD-40.
Notice that for every epoch, PGD-5 and YOPO-5-3 have approximately the same computational cost.
CIFAR10.
[24] performs 7-step PGD to generate adversaries during training. As a comparison, we
test YOPO-3-5 and YOPO-5-3 with a step size of 2/255. We experiment with two different network
architectures.

Under PreAct-Res18, YOPO-5-3 achieves robust accuracy comparable to [24] with around
half of the computation per epoch. The accuracy-time curve is shown in Figure 3(b). The quantitative
results can be seen in Table 2. Experiment details can be found in the supplementary materials.
Training Methods Clean Data PGD-20 Attack CW Attack
PGD-3 [24] 88.19% 32.51% 54.65%
PGD-5 [24] 86.63% 37.78% 57.71%
PGD-10 [24] 84.82% 41.61% 58.88%
YOPO-3-5 (Ours) 82.14% 38.18% 55.73%
YOPO-5-3 (Ours) 83.99% 44.72% 59.77%
Table 2: Results of PreAct-Res18 for CIFAR10. Note that for every epoch, PGD-3 and YOPO-3-5
have approximately the same computational cost, and so do PGD-5 and YOPO-5-3.
As for Wide ResNet34, YOPO-5-3 still achieves a similar acceleration over PGD-10, as shown in
Table 3. We also test PGD-3/5 to show that naively reducing the number of backward passes for this
min-max problem [24] cannot produce comparable results within the same computation time as YOPO.
Meanwhile, YOPO-3-5 can achieve a more aggressive speed-up with only a slight drop in robustness.
Training Methods Clean Data PGD-20 Attack Training Time (mins)
Natural train 95.03% 0.00% 233
PGD-3 [24] 90.07% 39.18% 1134
PGD-5 [24] 89.65% 43.85% 1574
PGD-10 [24] 87.30% 47.04% 2713
Free-8 [28] 86.29% 47.00% 667
YOPO-3-5 (Ours) 87.27% 43.04% 299
YOPO-5-3 (Ours) 86.70% 47.98% 476
Table 3: Results of Wide ResNet34 for CIFAR10. Free-8 results are obtained with the code from
https://github.com/ashafahi/free_adv_train.
4.2 YOPO for TRADES
TRADES [43] formulated a new min-max objective function for adversarial defense and achieves
state-of-the-art adversarial defense results. The details of the algorithm and experiment setup are in the
supplementary material, and quantitative results are demonstrated in Table 4.
Training Methods Clean Data PGD-20 Attack CW Attack Training Time (mins)
TRADES-10[43] 86.14% 44.50% 58.40% 633
TRADES-YOPO-3-4 (Ours) 87.82% 46.13% 59.48% 259
TRADES-YOPO-2-5 (Ours) 88.15% 42.48% 59.25% 218
Table 4: Results of training PreAct-Res18 for CIFAR10 with TRADES objective
5 Conclusion
In this work, we have developed an efficient strategy for accelerating adversarial training. We recast
the adversarial training of deep neural networks as a discrete time differential game and derive a
Pontryagin’s Maximum Principle (PMP) for it. Based on this maximum principle, we discover
that the adversary is only coupled with the weights of the first layer. This motivates us to split the
adversary updates from the back-propagation gradient calculation. The proposed algorithm, called
YOPO, avoids computing full forward and backward propagation too many times, thus effectively
reducing the computational time, as supported by our experiments.
References
[1]
Armin Askari, Geoffrey Negiar, Rajiv Sambharya, and Laurent El Ghaoui. Lifted neural
networks. arXiv preprint arXiv:1805.01532, 2018.
[2]
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of
security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420,
2018.
[3]
Vladimir Grigor’evich Boltyanskii, Revaz Valer’yanovich Gamkrelidze, and Lev Semenovich
Pontryagin. The theory of optimal processes. i. the maximum principle. Technical report, TRW
SPACE TECHNOLOGY LABS LOS ANGELES CALIF, 1960.
[4]
Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In
2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[5]
Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary
differential equations. In Advances in Neural Information Processing Systems, pages 6572–6583,
2018.
[6]
Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier.
Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th
International Conference on Machine Learning-Volume 70, pages 854–863. JMLR. org, 2017.
[7]
Lawrence C Evans. An introduction to mathematical optimal control theory. Lecture Notes,
University of California, Department of Mathematics, Berkeley, 2005.
[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
[9]
Fangda Gu, Armin Askari, and Laurent El Ghaoui. Fenchel lifted networks: A lagrange
relaxation of neural network training. arXiv preprint arXiv:1811.08039, 2018.
[10]
Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems,
34(1):014004, 2017.
[11]
Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. Decoupled parallel backpropagation
with convergence guarantee. arXiv preprint arXiv:1804.10574, 2018.
[12]
Andrew Ilyas, Ajil Jalal, Eirini Asteri, Constantinos Daskalakis, and Alexandros G Dimakis.
The robust manifold defense: Adversarial training using generative models. arXiv preprint
arXiv:1712.09196, 2017.
[13]
Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves,
David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients.
In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
1627–1635. JMLR. org, 2017.
[14]
Daniel Jakubovitz and Raja Giryes. Improving dnn robustness to adversarial attacks using
jacobian regularization. In Proceedings of the European Conference on Computer Vision
(ECCV), pages 514–529, 2018.
[15]
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale.
arXiv preprint arXiv:1611.01236, 2016.
[16]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436,
2015.
[17]
Yann LeCun, D Touresky, G Hinton, and T Sejnowski. A theoretical framework for back-
propagation. In Proceedings of the 1988 connectionist models summer school, volume 1, pages
21–28. CMU, Pittsburgh, Pa: Morgan Kaufmann, 1988.
[18]
Jia Li, Cong Fang, and Zhouchen Lin. Lifted proximal operator machines. arXiv preprint
arXiv:1811.01501, 2018.
[19]
Qianxiao Li, Long Chen, Cheng Tai, and E Weinan. Maximum principle based algorithms for
deep learning. The Journal of Machine Learning Research, 18(1):5998–6026, 2017.
[20]
Qianxiao Li and Shuji Hao. An optimal control approach to deep learning and applications to
discrete-weight neural networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of
the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine
Learning Research, pages 2985–2994, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018.
PMLR.
[21]
Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness.
In International Conference on Learning Representations, 2019.
[22]
Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural net-
works: Bridging deep architectures and numerical differential equations. arXiv preprint
arXiv:1710.10121, 2017.
[23]
Tiange Luo, Tianle Cai, Mengxiao Zhang, Siyu Chen, and Liwei Wang. RANDOM MASK:
Towards robust convolutional neural networks, 2019.
[24]
Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.
Towards deep learning models resistant to adversarial attacks. In International Conference on
Learning Representations, 2018.
[25]
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple
and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
[26] Lev Semenovich Pontryagin. Mathematical theory of optimal processes. CRC, 1987.
[27]
Haifeng Qian and Mark N Wegman. L2-nonexpansive neural networks. arXiv preprint
arXiv:1802.07896, 2018.
[28]
Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Xu Zeng, John Dickerson, Christoph Studer, Larry
S. Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint
arXiv:1904.12843, 2019.
[29]
Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend:
Leveraging generative models to understand and defend against adversarial examples. arXiv
preprint arXiv:1710.10766, 2017.
[30]
Sho Sonoda and Noboru Murata. Transport analysis of infinitely deep neural network. The
Journal of Machine Learning Research, 20(1):31–82, 2019.
[31]
Ke Sun, Zhanxing Zhu, and Zhouchen Lin. Enhancing the robustness of deep neural networks
by boundary conditional gan. arXiv preprint arXiv:1902.11029, 2019.
[32]
Jan Svoboda, Jonathan Masci, Federico Monti, Michael Bronstein, and Leonidas Guibas.
Peernets: Exploiting peer wisdom against adversarial attacks. In International Conference on
Learning Representations, 2019.
[33]
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfel-
low, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199,
2013.
[34]
Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein.
Training neural networks without gradients: A scalable admm approach. In International
conference on machine learning, pages 2722–2731, 2016.
[35]
Matthew Thorpe and Yves van Gennip. Deep limits of residual neural networks. arXiv preprint
arXiv:1810.11741, 2018.
[36]
Abraham Wald. Contributions to the theory of statistical estimation and testing hypotheses. The
Annals of Mathematical Statistics, 10(4):299–326, 1939.
[37]
Bao Wang, Binjie Yuan, Zuoqiang Shi, and Stanley J Osher. Enresnet: Resnet ensemble via the
feynman-kac formalism. arXiv preprint arXiv:1811.10745, 2018.
[38]
E Weinan. A proposal on machine learning via dynamical systems. Communications in
Mathematics and Statistics, 5(1):1–11, 2017.
[39]
E Weinan, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep
learning. Research in the Mathematical Sciences, 6(1):10, 2019.
[40]
Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, and Kaiming He. Feature
denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411, 2018.
[41]
Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in
deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
[42]
Nanyang Ye and Zhanxing Zhu. Bayesian adversarial learning. In Advances in Neural Informa-
tion Processing Systems, pages 6892–6901, 2018.
[43]
Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I
Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint
arXiv:1901.08573, 2019.
[44]
Jingfeng Zhang, Bo Han, Laura Wynter, Kian Hsiang Low, and Mohan Kankanhalli. Towards
robust resnet: A small step but a giant leap. arXiv preprint arXiv:1902.10887, 2019.
[45]
Xiaoshuai Zhang, Yiping Lu, Jiaying Liu, and Bin Dong. Dynamically unfolding recurrent
restorer: A moving endpoint control method for image restoration. In International Conference
on Learning Representations, 2019.
[46]
Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural
networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining, pages 2847–2856. ACM, 2018.
A Proof Of The Theorems
A.1 Proof of Theorem 1
In this section we give the full statement of the maximum principle for the adversarial training and
present a proof. Let’s start from the case of the natural training of neural networks.
Theorem.
(PMP for adversarial training) Assume $\ell_i$ is twice continuously differentiable;
$f_t(\cdot, \theta), R_t(\cdot, \theta)$ are twice continuously differentiable with respect to $x$; and $f_t(\cdot, \theta), R_t(\cdot, \theta)$ together
with their $x$ partial derivatives are uniformly bounded in $t$ and $\theta$. The sets $\{f_t(x, \theta) : \theta \in \Theta_t\}$
and $\{R_t(x, \theta) : \theta \in \Theta_t\}$ are convex for every $t$ and $x \in \mathbb{R}^{d_t}$. Let $\theta^*$ be the solution of
\[
\min_{\theta \in \Theta} \max_{\|\eta\| \le \epsilon} \; J(\theta, \eta) := \frac{1}{N}\sum_{i=1}^{N} \ell_i(x_{i,T}) + \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} R_t(x_{i,t}, \theta_t) \tag{9}
\]
\[
\text{subject to } \quad x_{i,1} = f_0(x_{i,0} + \eta_i; \theta_0), \quad i = 1, 2, \cdots, N, \tag{10}
\]
\[
x_{i,t+1} = f_t(x_{i,t}, \theta_t), \quad t = 1, 2, \cdots, T-1. \tag{11}
\]
Then there exist co-state processes $p_i^* := \{p_{i,t}^* : t = 0, \cdots, T\}$ such that the following holds for all
$t \in [T]$ and $i \in [N]$:
\[
x_{i,t+1}^* = \nabla_p H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad x_{i,0}^* = x_{i,0} + \eta_i^*, \tag{12}
\]
\[
p_{i,t}^* = \nabla_x H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad p_{i,T}^* = -\frac{1}{N} \nabla \ell_i(x_{i,T}^*). \tag{13}
\]
Here $H$ is the per-layer defined Hamiltonian function $H_t: \mathbb{R}^{d_t} \times \mathbb{R}^{d_{t+1}} \times \Theta_t \to \mathbb{R}$,
\[
H_t(x, p, \theta_t) = p \cdot f_t(x, \theta_t) - \frac{1}{N} R_t(x, \theta_t).
\]
At the same time, the parameters of the first layer $\theta_0^* \in \Theta_0$ and the optimal perturbation $\eta^*$ satisfy
\[
\sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0), \quad \forall \theta_0 \in \Theta_0,\ \|\eta_i\| \le \epsilon,
\tag{14}
\]
while the parameters of the other layers $\theta_t^* \in \Theta_t,\ t = 1, 2, \cdots, T-1$ maximize the Hamiltonian
functions
\[
\sum_{i=1}^{N} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*) \ge \sum_{i=1}^{N} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t), \qquad \forall \theta_t \in \Theta_t. \tag{15}
\]
Proof.
We first state a PMP for discrete time dynamical systems and then use it to directly obtain
the proof of the PMP for adversarial training.
Lemma 2.
(PMP for discrete time dynamical systems) Assume $\ell$ is twice continuously differentiable;
$f_t(\cdot, \theta), R_t(\cdot, \theta)$ are twice continuously differentiable with respect to $x$; and $f_t(\cdot, \theta), R_t(\cdot, \theta)$ together
with their $x$ partial derivatives are uniformly bounded in $t$ and $\theta$. The sets $\{f_t(x, \theta) : \theta \in \Theta_t\}$ and
$\{R_t(x, \theta) : \theta \in \Theta_t\}$ are convex for every $t$ and $x \in \mathbb{R}^{d_t}$. Let $\theta^*$ be the solution of
\[
\min_{\theta \in \Theta} \max_{\|\eta\| \le \epsilon} \; J(\theta, \eta) := \frac{1}{N}\sum_{i=1}^{N} \ell_i(x_{i,T}) + \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} R_t(x_{i,t}, \theta_t) \tag{16}
\]
\[
\text{subject to } \quad x_{i,t+1} = f_t(x_{i,t}, \theta_t), \qquad i \in [N],\ t = 0, 1, \cdots, T-1. \tag{17}
\]
Then there exist co-state processes $p_i^* := \{p_{i,t}^* : t = 0, \cdots, T\}$ such that the following holds for all
$t \in [T]$ and $i \in [N]$:
\[
x_{i,t+1}^* = \nabla_p H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad x_{i,0}^* = x_{i,0}, \tag{18}
\]
\[
p_{i,t}^* = \nabla_x H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*), \qquad p_{i,T}^* = -\frac{1}{N} \nabla \ell_i(x_{i,T}^*). \tag{19}
\]
Here $H$ is the per-layer defined Hamiltonian function $H_t: \mathbb{R}^{d_t} \times \mathbb{R}^{d_{t+1}} \times \Theta_t \to \mathbb{R}$,
\[
H_t(x, p, \theta_t) = p \cdot f_t(x, \theta_t) - \frac{1}{N} R_t(x, \theta_t).
\]
The parameters of the layers $\theta_t^* \in \Theta_t,\ t = 0, 1, \cdots, T-1$ maximize the Hamiltonian functions
\[
\sum_{i=1}^{N} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*) \ge \sum_{i=1}^{N} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t), \qquad \forall \theta_t \in \Theta_t. \tag{20}
\]
Proof.
Without loss of generality, we let $R = 0$. The reason is that we can simply add an extra
dynamic $w_t$ to calculate the regularization term $R$, i.e.
\[
w_{t+1} = w_t + R_t(x_t, \theta_t), \qquad w_0 = 0.
\]
We append $w$ to $x$ to study the dynamics of a new $(d_t + 1)$-dimensional vector and modify $f_t(x, \theta)$ to
$(f_t(x, \theta),\ w + R_t(x, \theta))$. Thus we only need to prove the case when $R = 0$.

For simplicity, we omit the sample subscript in the following proof. (Concatenating all the $x_i$ into
$x = (x_1, \ldots, x_N)$ justifies this.)
Now we begin the proof. Following the linearization lemma in [20], consider the linearized problem
\[
\phi_{t+1} = f_t(x_t^*, \theta_t) + \nabla_x f_t(x_t^*, \theta_t)(\phi_t - x_t^*), \qquad \phi_0 = x_0 + \eta. \tag{21}
\]
The set of states reachable by the linearized dynamical system is denoted by
\[
W_t := \{x \in \mathbb{R}^{d_t} : \exists\, \theta,\ \eta = \eta^* \ \text{s.t.}\ \phi_t^\theta = x\},
\]
where $x_t^\theta$ denotes the evolution of the dynamical system for $x_t$ under $\theta$. We also define
\[
S := \{x \in \mathbb{R}^{d_T} : (x - x_T^*) \cdot \nabla\ell(x_T^*) < 0\}.
\]
The linearization lemma in [20] tells us that $W_T$ and $S$ are separated by
$\{x : p_T^* \cdot (x - x_T^*) = 0,\ p_T^* = -\nabla \ell(x_T^*)\}$, i.e.
\[
p_T^* \cdot (x - x_T^*) \le 0, \qquad \forall x \in W_T. \tag{22}
\]
Thus, setting
\[
p_t^* = \nabla_x H_t(x_t^*, p_{t+1}^*, \theta_t^*) = \nabla_x f(x_t^*, \theta_t^*)^T \cdot p_{t+1}^*,
\]
we have
\[
(\phi_{t+1} - x_{t+1}^*) \cdot p_{t+1}^* = (\phi_t - x_t^*) \cdot p_t^*. \tag{23}
\]
Thus from Eq. (22) and Eq. (23) we get
\[
p_{t+1}^* \cdot (\phi_{t+1}^\theta - x_{t+1}^*) \le 0, \qquad t = 0, \cdots, T-1, \quad \forall \theta \in \Theta := \Theta_0 \times \Theta_1 \times \cdots.
\]
Setting $\theta_s = \theta_s^*$ for $s < t$, we have $\phi_{t+1}^\theta = f_t(x_t^*, \theta_t)$, which leads to
$p_{t+1}^* \cdot (f_t(x_t^*, \theta_t) - x_{t+1}^*) \le 0$.
This finishes the proof of the maximal principle on the weight space $\Theta$.
We return to the proof of the theorem. The maximal principle on the weight space, i.e.
\[
\sum_{i=1}^{N} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t^*) \ge \sum_{i=1}^{N} H_t(x_{i,t}^*, p_{i,t+1}^*, \theta_t), \qquad \forall \theta_t \in \Theta_t,\ t = 1, 2, \cdots, T-1,
\]
and
\[
\sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0^*) \ge \sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0), \qquad \forall \theta_0 \in \Theta_0,
\]
can be obtained with the help of Lemma 2: replacing the dynamics' starting point $x_{i,0}$ in Eq. (18) with
$x_{i,0} + \eta_i^*$ makes this maximal principle a direct corollary of Lemma 2.
Next, we prove the Hamiltonian condition for the adversary, i.e.
\[
\sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0^*) \le \sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i, p_{i,1}^*, \theta_0^*), \qquad \forall \|\eta_i\| \le \epsilon. \tag{24}
\]
Assuming $R_{i,t} = 0$ as above, we define a new optimal control problem with target function
$\tilde\ell_i(\cdot) = -\ell_i(\cdot)$ and the previous dynamics, except $x_{i,1} = \tilde f_0(x_{i,0}; \theta_0, \eta_i) = f_0(x_{i,0} + \eta_i; \theta_0)$:
\[
\min_{\|\eta\| \le \epsilon} \; \tilde J(\theta, \eta) := \frac{1}{N}\sum_{i=1}^{N} \tilde\ell_i(x_{i,T}) \tag{25}
\]
\[
\text{subject to } \quad x_{i,1} = \tilde f_0(x_{i,0}; \theta_0, \eta_i), \quad i = 1, 2, \cdots, N, \tag{26}
\]
\[
x_{i,t+1} = f_t(x_{i,t}, \theta_t), \quad t = 1, 2, \cdots, T-1. \tag{27}
\]
However, this time all the layer parameters $\theta_t$ are fixed and $\eta_i$ is the control. From Lemma 2 above
we get
\[
\tilde x_{i,1}^* = \nabla_p \tilde H_0(\tilde x_{i,0}^*, \tilde p_{i,1}^*, \theta_0, \eta_i^*), \quad
\tilde x_{i,t+1}^* = \nabla_p H_t(\tilde x_{i,t}^*, \tilde p_{i,t+1}^*, \theta_t), \quad
\tilde x_{i,0}^* = x_{i,0}, \tag{28}
\]
\[
\tilde p_{i,0}^* = \nabla_x \tilde H_0(\tilde x_{i,0}^*, \tilde p_{i,1}^*, \theta_0, \eta_i^*), \quad
\tilde p_{i,t}^* = \nabla_x H_t(\tilde x_{i,t}^*, \tilde p_{i,t+1}^*, \theta_t), \quad
\tilde p_{i,T}^* = -\frac{1}{N} \nabla \tilde\ell_i(\tilde x_{i,T}^*), \tag{29}
\]
where $\tilde H_0(x, p, \theta_0, \eta) = p \cdot \tilde f_0(x; \theta_0, \eta) = p \cdot f_0(x + \eta; \theta_0)$ and $t = 1, \cdots, T-1$. This gives the
fact that $\tilde x_{i,t}^* = x_{i,t}^*$. Lemma 2 also tells us
\[
\sum_{i=1}^{N} \tilde H_0(\tilde x_{i,0}^*, \tilde p_{i,1}^*, \theta_0, \eta_i^*) \ge \sum_{i=1}^{N} \tilde H_0(\tilde x_{i,0}^*, \tilde p_{i,1}^*, \theta_0, \eta_i), \qquad \forall \|\eta_i\| \le \epsilon, \tag{30}
\]
which is
\[
\sum_{i=1}^{N} \tilde p_{i,1}^* \cdot f_0(\tilde x_{i,0}^* + \eta_i^*; \theta_0) \ge \sum_{i=1}^{N} \tilde p_{i,1}^* \cdot f_0(\tilde x_{i,0}^* + \eta_i; \theta_0), \qquad \forall \|\eta_i\| \le \epsilon. \tag{31}
\]
On the other hand, Lemma 2 gives
\[
\tilde p_t^* = -\nabla_{x_t}\big(\tilde\ell(x_T)\big) = \nabla_{x_t}\big(\ell(x_T)\big) = -p_t^*.
\]
Then we have
\[
\sum_{i=1}^{N} p_{i,1}^* \cdot f_0(x_{i,0}^* + \eta_i^*; \theta_0) \le \sum_{i=1}^{N} p_{i,1}^* \cdot f_0(x_{i,0}^* + \eta_i; \theta_0), \qquad \forall \|\eta_i\| \le \epsilon, \tag{32}
\]
which is
\[
\sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i^*, p_{i,1}^*, \theta_0) \le \sum_{i=1}^{N} H_0(x_{i,0}^* + \eta_i, p_{i,1}^*, \theta_0), \qquad \forall \|\eta_i\| \le \epsilon. \tag{33}
\]
This finishes the proof for the adversarial control.
Remark.
The additional assumption that the sets $\{f_t(x, \theta) : \theta \in \Theta_t\}$ and $\{R_t(x, \theta) : \theta \in \Theta_t\}$ are
convex for every $t$ and $x \in \mathbb{R}^{d_t}$ is rather weak and not unrealistic, as already explained in [20].
B Experiment Setup
B.1 MNIST
Training against PGD-40 is a common practice to get state-of-the-art results on MNIST. We adopt the
network architecture from [43], with four convolutional layers followed by three fully connected layers.
Following [43] and [24], we set the size of the perturbation to $\epsilon = 0.3$ in the $\ell_\infty$ norm.
Experiments are run on idle NVIDIA Tesla P100 GPUs. We train models for 55 epochs with a
batch size of 256, longer than what convergence needs for both training methods. The learning rate is
set to 0.1 initially and is lowered by a factor of 10 at epoch 45. We use a weight decay of 5e-4 and a
momentum of 0.9. To measure the robustness of trained models, we perform PGD-40 and CW [4]
attacks with CW coefficient c = 5e-2 and lr = 1e-2.
B.2 CIFAR-10
Following [24], we take PreAct-ResNet18 and Wide ResNet-34 as the models for testing. We set the
size of the perturbation to $\epsilon = 8/255$ in the $\ell_\infty$ norm. We perform 20 steps of PGD with
step size 2/255 at test time. For PGD adversarial training, we train models for 105 epochs as a
common practice. The learning rate is set to 5e-2 initially and is lowered by a factor of 10 at epochs
79, 90 and 100. For YOPO-$m$-$n$, we train models for 40 epochs, which is much longer than what
convergence needs. The learning rate is set to $0.2/m$ initially and is lowered by a factor of 10 at
epochs 30 and 36. We use a batch size of 256, a weight decay of 5e-4 and a momentum of 0.9 for both
algorithms. We also test our models' robustness under the CW attack [4] with c = 5e-2 and lr = 1e-2.
The experiments are run on idle NVIDIA GeForce GTX 1080 Ti GPUs.
B.3 TRADES
TRADES [43] achieves state-of-the-art results in adversarial defense. The methodology achieved
first place out of the 1,995 submissions in the robust model track of the NeurIPS 2018 Adversarial
Vision Challenge. TRADES proposed a surrogate loss which quantifies the trade-off in terms of the
gap between the risk for adversarial examples and the risk for non-adversarial examples, and the
objective function can be formulated as
\[
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \; \max_{\|\eta\| \le \epsilon} \Big( \ell(f_\theta(x), y) + \mathcal{L}\big(f_\theta(x), f_\theta(x+\eta)\big)/\lambda \Big), \tag{34}
\]
where $f_\theta(x)$ is the neural network parameterized by $\theta$, $\ell$ denotes the loss function, $\mathcal{L}(\cdot,\cdot)$ denotes the
consistency loss and $\lambda$ is a balancing hyperparameter which we set to 1 as in [43]. To solve the
min-max problem, [43] also searched the ascent direction via the gradient of the "adversarial loss", i.e.
generating the adversarial example before performing gradient descent on the weights. Specifically,
the PGD attack is performed to maximize the consistency loss instead of the classification loss. For each
clean data point $x$, a single iteration of the adversarial attack can be formulated as
\[
x' \leftarrow \Pi_{\|x' - x\| \le \epsilon}\Big( \alpha_1 \,\mathrm{sign}\big(\nabla_{x'} \mathcal{L}(f_\theta(x), f_\theta(x'))\big) + x' \Big),
\]
where $\Pi$ is the projection operator. In the implementation of [43], after 10 such update iterations for each
input data point $x_i$, the update for the weights is performed as
\[
\theta \leftarrow \theta - \alpha_2 \sum_{i=1}^{B} \nabla_\theta \big[ \ell(f_\theta(x_i), y_i) + \mathcal{L}(f_\theta(x_i), f_\theta(x_i'))/\lambda \big] / B,
\]
where $B$ is the batch size. We name this algorithm TRADES-10, since it uses 10 iterations to update
the adversary.
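For reference, here is a minimal PyTorch-style sketch of one TRADES-10 step as just described. The helper names are illustrative, and the consistency loss $\mathcal{L}$ is taken to be a KL divergence, as in [43]:

```python
import torch
import torch.nn.functional as F

def trades10_step(model, x, y, optimizer, eps=8/255, alpha1=2/255, lam=1.0, K=10):
    """One TRADES-K step (sketch): the inner PGD maximizes the consistency loss,
    then the weights are updated on the combined objective."""
    model.eval()
    with torch.no_grad():
        clean_logits = model(x)                           # f_theta(x), fixed during the attack
    x_adv = x + 0.001 * torch.randn_like(x)               # small random start
    for _ in range(K):                                    # K = 10 attack iterations in TRADES-10
        x_adv = x_adv.detach().requires_grad_(True)
        consistency = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                               F.softmax(clean_logits, dim=1), reduction='batchmean')
        grad = torch.autograd.grad(consistency, x_adv)[0]
        x_adv = x_adv + alpha1 * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project onto the eps-ball
    model.train()
    optimizer.zero_grad()
    logits = model(x)
    loss = F.cross_entropy(logits, y) + F.kl_div(
        F.log_softmax(model(x_adv.detach()), dim=1), F.softmax(logits, dim=1),
        reduction='batchmean') / lam
    loss.backward()
    optimizer.step()
```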
Following the notation used in the previous section, we denote by $f_0$ the first layer of the neural network
and by $g_{\tilde\theta}$ the network without the first layer. The whole network can be formulated as the
composition of the two parts, i.e. $f_\theta = g_{\tilde\theta} \circ f_0$. To apply our gradient based YOPO method to
TRADES, following Section 2, we decouple the adversarial calculation and the network update as
shown in Algorithm 2. The projection operation is omitted. Notice that in Section 2 we take advantage of
every intermediate perturbation $\eta^j,\ j = 1, \cdots, m-1$ to update the network weights, while here we only
use the final perturbation $\eta = \eta^m$ to compute the final loss term. In practice, this accumulation of
gradients doesn't help. For TRADES-YOPO, the acceleration of YOPO is brought by decoupling the
adversarial calculation from the gradient back-propagation.
Algorithm 2 TRADES-YOPO-m-n

Randomly initialize the network parameters or use a pre-trained network.
repeat
    Randomly select a mini-batch $\mathcal{B} = \{(x_1, y_1), \cdots, (x_B, y_B)\}$ from the training set.
    Initialize $\eta_i^{1,0},\ i = 1, 2, \cdots, B$ by sampling from a uniform distribution on $[-\epsilon, \epsilon]$.
    for $j = 1$ to $m$ do
        $p_i = \nabla_{g_{\tilde\theta}} \mathcal{L}\big(g_{\tilde\theta}(f_0(x_i + \eta_i^{j,0}, \theta_0)),\ g_{\tilde\theta}(f_0(x_i, \theta_0))\big) \cdot \nabla_{f_0} g_{\tilde\theta}\big(f_0(x_i + \eta_i^{j,0}, \theta_0)\big)$, $\quad i = 1, 2, \cdots, B$
        for $s = 0$ to $n-1$ do
            $\eta_i^{j,s+1} \leftarrow \eta_i^{j,s} + \alpha_1 \cdot p_i \cdot \nabla_\eta f_0(x_i + \eta_i^{j,s}, \theta_0),\ i = 1, 2, \cdots, B$
        end for
        $\eta_i^{j+1,0} = \eta_i^{j,n},\ i = 1, 2, \cdots, B$
    end for
    $\theta \leftarrow \theta - \alpha_2 \sum_{i=1}^{B} \nabla_\theta \big[\ell(f_\theta(x_i), y_i) + \mathcal{L}(f_\theta(x_i), f_\theta(x_i + \eta_i^{m,n}))/\lambda\big] / B$
until convergence
We name this algorithm TRADES-YOPO-$m$-$n$. With less than half of the time of TRADES-10,
TRADES-YOPO-3-4 achieves an even better result than its baseline. Quantitative results are
demonstrated in Table 4. The mini-batch size is 256. All the experiments run for 105 epochs, and
the learning rate is set to 2e-1 initially and is lowered by a factor of 10 at epochs 70, 90 and 100.
The weight decay coefficient is 5e-4 and the momentum coefficient is 0.9. We also test our model's
robustness under the CW attack [4] with c = 5e-2 and lr = 5e-4. Experiments are run on idle
NVIDIA Tesla P100 GPUs.