
Automated Inference with Adaptive Batches
Supplementary Material
A Proof of Lemma 1

Proof. We know that $\nabla\ell_B(x)$ is a descent direction iff the following condition holds:
$$\nabla\ell_B(x)^T \nabla\ell(x) > 0. \qquad (10)$$
Expanding $\|\nabla\ell_B(x) - \nabla\ell(x)\|^2 < \|\nabla\ell_B(x)\|^2$ we get
$$\|\nabla\ell_B(x)\|^2 + \|\nabla\ell(x)\|^2 - 2\nabla\ell_B(x)^T \nabla\ell(x) < \|\nabla\ell_B(x)\|^2,$$
$$\implies -2\nabla\ell_B(x)^T \nabla\ell(x) < -\|\nabla\ell(x)\|_2^2 \le 0,$$
which implies the descent condition (10).
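As a quick numerical sanity check of this implication, the following sketch draws random gradient pairs and verifies that whenever $\|g - \nabla\ell(x)\|^2 < \|g\|^2$ holds, $g$ is indeed a descent direction. This is an illustration on synthetic vectors only, not code from our experiments.

import numpy as np

# Whenever ||g - grad||^2 < ||g||^2, the expansion above forces g.T @ grad > 0.
rng = np.random.default_rng(0)
for _ in range(10_000):
    grad = rng.normal(size=5)        # "full" gradient, nonzero with probability 1
    g = grad + rng.normal(size=5)    # perturbed "batch" gradient
    if np.linalg.norm(g - grad) ** 2 < np.linalg.norm(g) ** 2:
        assert g @ grad > 0, "descent condition violated"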
B Proof of Theorem 1

Proof. Let $\bar z = E[z]$ be the mean of $z$. Given the current iterate $x$, we assume that the batch $B$ is sampled uniformly with replacement from $p$. We then have the following bound:
$$\begin{aligned}
\|\nabla f(x;z) - \nabla\ell(x)\|^2
&\le 2\|\nabla f(x;z) - \nabla f(x;\bar z)\|^2 + 2\|\nabla f(x;\bar z) - \nabla\ell(x)\|^2 \\
&\le 2L_z^2\|z - \bar z\|^2 + 2\|\nabla f(x;\bar z) - \nabla\ell(x)\|^2 \\
&= 2L_z^2\|z - \bar z\|^2 + 2\big\|E_z[\nabla f(x;\bar z) - \nabla f(x;z)]\big\|^2 \\
&\le 2L_z^2\|z - \bar z\|^2 + 2E_z\|\nabla f(x;\bar z) - \nabla f(x;z)\|^2 \\
&\le 2L_z^2\|z - \bar z\|^2 + 2L_z^2 E_z\|\bar z - z\|^2 \\
&= 2L_z^2\|z - \bar z\|^2 + 2L_z^2 \operatorname{Tr}\operatorname{Var}_z(z),
\end{aligned}$$
where the first inequality uses the property $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the second and fourth inequalities use Assumption 1, and the third inequality uses Jensen's inequality. This bound is uniform in $x$. We then have
$$E_z\|\nabla f(x;z) - \nabla\ell(x)\|^2 \le 2L_z^2 E_z\|z - \bar z\|^2 + 2L_z^2 \operatorname{Tr}\operatorname{Var}_z(z) = 4L_z^2 \operatorname{Tr}\operatorname{Var}_z(z)$$
uniformly for all $x$. The result follows from the observation that
$$E_B\|\nabla\ell_B(x) - \nabla\ell(x)\|^2 = \frac{1}{|B|}\, E_z\|\nabla f(x;z) - \nabla\ell(x)\|^2.$$
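The final observation, that sampling with replacement reduces the gradient-error variance by a factor of $1/|B|$, can be checked empirically. The snippet below does so for a toy quadratic $f(x;z) = \frac{1}{2}\|x - z\|^2$, chosen only for illustration:

import numpy as np

# For i.i.d. (with replacement) batches, E_B||grad l_B(x) - grad l(x)||^2
# equals (1/|B|) * E_z||grad f(x;z) - grad l(x)||^2.
rng = np.random.default_rng(0)
d, x, z_std = 3, np.ones(3), 2.0                 # z ~ N(0, z_std^2 I), so grad l(x) = x - E[z] = x

def mean_sq_error(batch_size, trials=100_000):
    z = rng.normal(0.0, z_std, size=(trials, batch_size, d))
    batch_grad = (x - z).mean(axis=1)            # grad l_B(x) = (1/|B|) sum_i (x - z_i)
    return np.sum((batch_grad - x) ** 2, axis=1).mean()

print(mean_sq_error(1), 10 * mean_sq_error(10))  # agree up to Monte Carlo error (about d * z_std^2)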
C Proof of Lemma 2

Proof. From (5) and Assumption 2 we get
$$\ell(x_{t+1}) \le \ell(x_t) - \alpha g_t^T \nabla\ell(x_t) + \frac{L\alpha^2}{2}\|g_t\|^2.$$
Taking expectation with respect to the batch $B_t$ and conditioning on $x_t$, we get
$$\begin{aligned}
E[\ell(x_{t+1}) - \ell(x^\star)]
&\le \ell(x_t) - \ell(x^\star) - \alpha E[g_t]^T \nabla\ell(x_t) + \frac{L\alpha^2}{2} E\|g_t\|^2 \\
&= \ell(x_t) - \ell(x^\star) - \alpha\|\nabla\ell(x_t)\|^2 + \frac{L\alpha^2}{2}\big(\|\nabla\ell(x_t)\|^2 + E\|e_t\|^2 + 2E[e_t]^T\nabla\ell(x_t)\big) \\
&= \ell(x_t) - \ell(x^\star) - \Big(\alpha - \frac{L\alpha^2}{2}\Big)\|\nabla\ell(x_t)\|^2 + \frac{L\alpha^2}{2} E\|e_t\|^2 \\
&\le \Big(1 - 2\mu\Big(\alpha - \frac{L\alpha^2}{2}\Big)\Big)\big(\ell(x_t) - \ell(x^\star)\big) + \frac{L\alpha^2}{2} E\|e_t\|^2,
\end{aligned}$$
where we use that $E[g_t] = \nabla\ell(x_t)$, so $E[e_t] = 0$ and the cross term vanishes, and the second inequality follows from Assumption 3. Taking expectation, the result follows.
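The per-step bound can also be verified empirically. The sketch below runs one noisy gradient step many times on a toy quadratic, chosen only because it is $L$-smooth and satisfies Assumption 3 with parameter $\mu$, and compares the Monte Carlo estimate of $E[\ell(x_{t+1}) - \ell(x^\star)]$ with the right-hand side of the lemma:

import numpy as np

# Toy check of Lemma 2: l(x) = 0.5 * x^T H x, H = diag(mu, L), so l(x*) = 0 at x* = 0.
rng = np.random.default_rng(0)
mu, L, alpha, noise_std = 1.0, 10.0, 0.05, 0.5
H = np.diag([mu, L])
x_t = np.array([1.0, -2.0])

def loss(X):
    return 0.5 * np.sum((X @ H) * X, axis=-1)

e = noise_std * rng.normal(size=(100_000, 2))     # zero-mean gradient errors e_t
g = H @ x_t + e                                   # noisy gradients g_t = grad l(x_t) + e_t
lhs = loss(x_t - alpha * g).mean()                # Monte Carlo estimate of E[l(x_{t+1})]
rhs = (1 - 2 * mu * (alpha - L * alpha**2 / 2)) * loss(x_t) \
      + (L * alpha**2 / 2) * np.sum(e**2, axis=1).mean()
print(bool(lhs <= rhs), float(lhs), float(rhs))   # the bound holds with room to spare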
D Proof of Theorem 2

Proof. We begin by applying the reverse triangle inequality to (4) to get
$$(1 - \theta)\, E\|\nabla\ell_B(x)\| \le E\|\nabla\ell(x)\|,$$
which applied to (4) yields
$$E\|e_t\|^2 = E\|\nabla\ell_B(x_t) - \nabla\ell(x_t)\|^2 \le \frac{\theta^2}{(1-\theta)^2}\, E\|\nabla\ell(x_t)\|^2. \qquad (11)$$
Now, we apply (11) to the result in Lemma 2 to get
$$E[\ell(x_{t+1}) - \ell(x^\star)] \le E[\ell(x_t) - \ell(x^\star)] - \Big(\alpha - \frac{\gamma L\alpha^2}{2}\Big) E\|\nabla\ell(x_t)\|^2,$$
where $\gamma = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$. Assuming $\alpha - \frac{\gamma L\alpha^2}{2} \ge 0$, we can apply Assumption 3 to write
$$E[\ell(x_{t+1}) - \ell(x^\star)] \le \Big(1 - 2\mu\Big(\alpha - \frac{\gamma L\alpha^2}{2}\Big)\Big)\, E[\ell(x_t) - \ell(x^\star)],$$
which proves the theorem. Note that $\max_\alpha\big\{\alpha - \frac{\gamma L\alpha^2}{2}\big\} = \frac{1}{2\gamma L}$, and $\mu \le L$. It follows that
$$0 \le 1 - 2\mu\Big(\alpha - \frac{\gamma L\alpha^2}{2}\Big) < 1.$$
The second result follows immediately.
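For intuition, the contraction factor $1 - 2\mu(\alpha - \gamma L\alpha^2/2)$, evaluated at the maximizing stepsize $\alpha = 1/(\gamma L)$, degrades gracefully as the error tolerance $\theta$ grows. The small sketch below tabulates it for a few values of $\theta$, with illustrative choices of $\mu$ and $L$:

# Per-iteration contraction factor of Theorem 2 at alpha = 1/(gamma * L),
# where it equals 1 - mu / (gamma * L); mu and L below are arbitrary illustrative values.
def contraction_factor(theta, mu=1.0, L=10.0):
    gamma = (theta**2 + (1 - theta)**2) / (1 - theta)**2
    alpha = 1.0 / (gamma * L)
    return 1 - 2 * mu * (alpha - gamma * L * alpha**2 / 2)

for theta in (0.1, 0.3, 0.5, 0.7):
    print(theta, round(contraction_factor(theta), 4))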
E Proof of Theorem 3
Proof. Applying the reverse triangle inequality to (4) and using Lemma 2 we get, as in Theorem 2:
$$E[\ell(x_{t+1}) - \ell(x^\star)] \le E[\ell(x_t) - \ell(x^\star)] - \Big(\alpha_t - \frac{\gamma L\alpha_t^2}{2}\Big) E\|\nabla\ell(x_t)\|^2, \qquad (12)$$
where $\gamma = \frac{\theta^2 + (1-\theta)^2}{(1-\theta)^2} \ge 1$.

We will show that the backtracking condition in (7) is satisfied whenever $0 < \alpha_t \le \frac{1}{\gamma L}$. First notice that:
$$0 < \alpha_t \le \frac{1}{\gamma L} \implies -\alpha_t + \frac{\gamma L\alpha_t^2}{2} \le -\frac{\alpha_t}{2}.$$
Thus, we can rewrite (12) as
$$\begin{aligned}
E[\ell(x_{t+1}) - \ell(x^\star)] &\le E[\ell(x_t) - \ell(x^\star)] - \frac{\alpha_t}{2}\, E\|\nabla\ell(x_t)\|^2 \\
&\le E[\ell(x_t) - \ell(x^\star)] - c\alpha_t\, E\|\nabla\ell(x_t)\|^2,
\end{aligned}$$
where $0 < c \le 0.5$. Thus, the backtracking line search condition (7) is satisfied whenever $0 < \alpha_t \le \frac{1}{\gamma L}$.

Now we know that either $\alpha_t = \alpha_0$ (the initial stepsize), or $\alpha_t \ge \frac{1}{2\gamma L}$, where the stepsize is decreased by a factor of 2 each time the backtracking condition fails. Thus, we can rewrite the above as
$$E[\ell(x_{t+1}) - \ell(x^\star)] \le E[\ell(x_t) - \ell(x^\star)] - c\min\Big(\alpha_0, \frac{1}{2\gamma L}\Big) E\|\nabla\ell(x_t)\|^2.$$
Using Assumption 3 we get
$$E[\ell(x_{t+1}) - \ell(x^\star)] \le \Big(1 - 2c\mu\min\Big(\alpha_0, \frac{1}{2\gamma L}\Big)\Big)\, E[\ell(x_t) - \ell(x^\star)].$$
Assuming we start off the stepsize at a large value such that $\min(\alpha_0, \frac{1}{2\gamma L}) = \frac{1}{2\gamma L}$, we can rewrite this as:
$$E[\ell(x_{t+1}) - \ell(x^\star)] \le \Big(1 - \frac{c\mu}{\gamma L}\Big)\, E[\ell(x_t) - \ell(x^\star)].$$
F Algorithmic Details for Automated Big Batch Methods
The backtracking Armijo line search we used with big batch SGD is given in full in Algorithm 2. The adaptive stepsize method using Barzilai-Borwein curvature estimates with big batch SGD is presented in Algorithm 3.
Algorithm 2 Big batch SGD: backtracking line search
 1: initialize starting pt. $x_0$, initial stepsize $\alpha$, initial batch size $K > 1$, batch size increment $\delta_K$, backtracking line search parameter $c$, flag $F = 0$
 2: while not converged do
 3:     Draw random batch with size $|B| = K$
 4:     Calculate $V_B$ and $\nabla\ell_B(x_t)$ using (6)
 5:     while $\|\nabla\ell_B(x_t)\|^2 \le V_B / K$ do
 6:         Increase batch size $K \leftarrow K + \delta_K$
 7:         Sample more gradients
 8:         Update $V_B$ and $\nabla\ell_B(x_t)$
 9:         Set flag $F = 1$
10:     end while
11:     if flag $F == 1$ then
12:         $\alpha \leftarrow \alpha \cdot 2$
13:         Reset flag $F = 0$
14:     end if
15:     while $\ell_B(x_t - \alpha\nabla\ell_B(x_t)) > \ell_B(x_t) - c\alpha\|\nabla\ell_B(x_t)\|^2$ do
16:         $\alpha \leftarrow \alpha / 2$
17:     end while
18:     $x_{t+1} = x_t - \alpha\nabla\ell_B(x_t)$
19: end while
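For concreteness, a compact Python sketch of Algorithm 2 is given below. The helpers draw_batch, batch_grads, and batch_loss are hypothetical stand-ins for the data pipeline, and $V_B$ is computed here as the summed sample variance of the per-example gradients, which we take as a proxy for the estimator in (6); this is an illustrative sketch, not the implementation used in our experiments.

import numpy as np

def big_batch_sgd_backtracking(x0, draw_batch, batch_grads, batch_loss,
                               alpha=1.0, K=32, delta_K=32, c=0.1, n_iters=100):
    """Sketch of Algorithm 2: big batch SGD with a backtracking (Armijo) line search.

    draw_batch(n)         -> array of n data samples (hypothetical helper)
    batch_grads(x, batch) -> (len(batch), d) array of per-example gradients
    batch_loss(x, batch)  -> scalar mean loss over the batch
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        batch = draw_batch(K)
        grads = batch_grads(x, batch)
        g, V_B = grads.mean(axis=0), grads.var(axis=0, ddof=1).sum()
        grew = False
        while g @ g <= V_B / K:                       # gradient too noisy: grow the batch
            batch = np.concatenate([batch, draw_batch(delta_K)])
            K += delta_K
            grads = batch_grads(x, batch)
            g, V_B = grads.mean(axis=0), grads.var(axis=0, ddof=1).sum()
            grew = True
        if grew:
            alpha *= 2                                # allow a larger step after growing the batch
        # backtracking line search on the batch objective
        while alpha > 1e-12 and \
                batch_loss(x - alpha * g, batch) > batch_loss(x, batch) - c * alpha * (g @ g):
            alpha /= 2
        x = x - alpha * g
    return x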
Algorithm 3 Big batch SGD: with BB stepsizes
 1: initialize starting pt. $x$, initial stepsize $\alpha$, initial batch size $K > 1$, batch size increment $\delta_K$, backtracking line search parameter $c$
 2: while not converged do
 3:     Draw random batch with size $|B| = K$
 4:     Calculate $V_B$ and $G_B = \nabla\ell_B(x)$ using (6)
 5:     while $\|G_B\|^2 \le V_B / K$ do
 6:         Increase batch size $K \leftarrow K + \delta_K$
 7:         Sample more gradients
 8:         Update $V_B$ and $G_B$
 9:     end while
10:     while $\ell_B(x - \alpha\nabla\ell_B(x)) > \ell_B(x) - c\alpha\|\nabla\ell_B(x)\|^2$ do
11:         $\alpha \leftarrow \alpha / 2$
12:     end while
13:     $x \leftarrow x - \alpha\nabla\ell_B(x)$
14:     if $K < N$ then
15:         Calculate $\tilde\alpha = (1 - V_B/(K\|G_B\|^2))/\nu$ using (8) and (9)
16:     else
17:         Calculate $\tilde\alpha = 1/\nu$ using (9)
18:     end if
19:     Stepsize smoothing: $\alpha \leftarrow \alpha(1 - K/N) + \tilde\alpha K/N$
20:     while $\ell_B(x - \alpha\nabla\ell_B(x)) > \ell_B(x) - c\alpha\|\nabla\ell_B(x)\|^2$ do
21:         $\alpha \leftarrow \alpha / 2$
22:     end while
23:     $x \leftarrow x - \alpha\nabla\ell_B(x)$
24: end while
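The steps that distinguish Algorithm 3 from Algorithm 2 are the curvature-based stepsize proposal and its smoothing (lines 14-19). A sketch of just that update is shown below, where nu stands for the curvature estimate from (9) and the function name is our own illustrative choice:

import numpy as np

def bb_smoothed_stepsize(alpha, G_B, V_B, K, N, nu):
    """Stepsize proposal and smoothing of Algorithm 3 (illustrative sketch).

    G_B: current batch gradient, V_B: variance estimate as in (6),
    K: current batch size, N: dataset size, nu: curvature estimate from (9).
    """
    if K < N:
        alpha_tilde = (1.0 - V_B / (K * float(np.dot(G_B, G_B)))) / nu   # proposal from (8)
    else:
        alpha_tilde = 1.0 / nu            # full batch: deterministic BB stepsize
    # smoothing: trust the proposal more as the batch covers more of the dataset
    return alpha * (1.0 - K / N) + alpha_tilde * (K / N)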
G Derivation of Adaptive Step Size

Here we present the complete derivation of the adaptive stepsizes presented in Section 5. Our derivation follows the classical adaptive Barzilai and Borwein (1988) (BB) method. The BB method fits a quadratic model to the objective on each iteration, and a stepsize is proposed that is optimal for the local quadratic model (Goldstein et al., 2014). To derive the analog of the BB method for stochastic problems, we consider quadratic approximations of the form $\ell(x) = E_\theta f(x,\theta)$, where we define $f(x,\theta) = \frac{\nu}{2}\|x - \theta\|^2$ with $\theta \sim \mathcal{N}(x^\star, \sigma^2 I)$.
We derive the optimal stepsize for this. We can rewrite the quadratic approximation as
$$\ell(x) = E_\theta f(x,\theta) = \frac{\nu}{2} E_\theta\|x - \theta\|^2 = \frac{\nu}{2}\big[x^T x - 2x^T x^\star + E(\theta^T\theta)\big] = \frac{\nu}{2}\big(\|x - x^\star\|^2 + d\sigma^2\big),$$
since we can write
$$E(\theta^T\theta) = E\sum_{i=1}^d \theta_i^2 = \sum_{i=1}^d E\theta_i^2 = \sum_{i=1}^d \big((x_i^\star)^2 + \sigma^2\big) = \|x^\star\|^2 + d\sigma^2.$$
Further, notice that
$$E_\theta[\nabla f(x,\theta)] = E_\theta[\nu(x - \theta)] = \nu(x - x^\star) = \nabla\ell(x), \quad\text{and}$$
$$\operatorname{Tr}\operatorname{Var}_\theta[\nabla f(x,\theta)] = E_\theta\big[\nu^2(x - \theta)^T(x - \theta)\big] - \nu^2(x - x^\star)^T(x - x^\star) = d\nu^2\sigma^2.$$
Using the quadratic approximation, we can rewrite the update for big batch SGD as follows:
$$x_{t+1} = x_t - \alpha_t\frac{1}{|B|}\sum_{i\in B}\nu(x_t - \theta_i) = (1 - \nu\alpha_t)x_t + \frac{\nu\alpha_t}{|B|}\sum_{i\in B}\theta_i = (1 - \nu\alpha_t)x_t + \nu\alpha_t x^\star + \frac{\nu\alpha_t}{|B|}\sum_{i\in B}\xi_i,$$
where we write $\theta_i = x^\star + \xi_i$ with $\xi_i \sim \mathcal{N}(0, \sigma^2 I)$. Thus, the expected value of the function is:
"
!#
⌫ ↵t X
?
E[`(xt+1 )] = E⇠ ` (1 ⌫↵t )xt + ⌫↵t x +
⇠i
|B|
i2B
2
3
2
X
⌫ 4
⌫
↵
t
= E⇠ (1 ⌫↵t )(xt x? ) +
⇠i + d 2 5
2
|B|
i2B
0
1
2
X
⌫@
⌫
↵
t
2
=
k(1 ⌫↵t )(xt x? )k + E⇠
⇠i + d 2 A
2
|B|
i2B
✓
◆
2 2
⌫
⌫ ↵t
2
=
k(1 ⌫↵t )(xt x? )k + (1 +
)d 2 .
2
|B|
Minimizing $E[\ell(x_{t+1})]$ w.r.t. $\alpha_t$ we get:
$$\begin{aligned}
\alpha_t &= \frac{1}{\nu}\cdot\frac{\big\|E[\nabla\ell_{B_t}(x_t)]\big\|^2}{\big\|E[\nabla\ell_{B_t}(x_t)]\big\|^2 + \frac{1}{|B_t|}\operatorname{Tr}\operatorname{Var}[\nabla f(x_t)]} \\
&= \frac{1}{\nu}\cdot\frac{E\big\|\nabla\ell_{B_t}(x_t)\big\|^2 - \frac{1}{|B_t|}\operatorname{Tr}\operatorname{Var}[\nabla f(x_t)]}{E\big\|\nabla\ell_{B_t}(x_t)\big\|^2} \\
&= \frac{1}{\nu}\cdot\bigg(1 - \frac{\frac{1}{|B_t|}\operatorname{Tr}\operatorname{Var}[\nabla f(x_t)]}{E\big\|\nabla\ell_{B_t}(x_t)\big\|^2}\bigg) \\
&\ge \frac{1 - \theta^2}{\nu}.
\end{aligned}$$
Here $\nu$ denotes the curvature of the quadratic approximation. Thus, the optimal stepsize for big batch SGD is the optimal stepsize for deterministic gradient descent scaled down by a factor of at most $1 - \theta^2$.
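The closed form above can be checked numerically: for the Gaussian quadratic model, minimizing $E[\ell(x_{t+1})]$ over a fine grid of stepsizes recovers the same $\alpha_t$ as the formula. The snippet below does this for illustrative values of $\nu$, $\sigma$, $d$, $|B|$ and $\|x_t - x^\star\|^2$:

import numpy as np

# E[l(x_{t+1})] = (nu/2) * ((1 - nu*a)^2 * ||x_t - x*||^2 + (1 + nu^2 a^2 / B) * d * sigma^2)
nu, sigma, d, B = 2.0, 0.7, 5, 16
delta_sq = 3.0                                    # ||x_t - x*||^2

def expected_loss(a):
    return 0.5 * nu * ((1 - nu * a) ** 2 * delta_sq + (1 + nu**2 * a**2 / B) * d * sigma**2)

grid = np.linspace(0.0, 1.0 / nu, 100_001)
alpha_grid = grid[np.argmin(expected_loss(grid))]

grad_mean_sq = nu**2 * delta_sq                   # ||E[grad l_B(x_t)]||^2
grad_var = nu**2 * d * sigma**2                   # Tr Var[grad f(x_t)]
alpha_formula = grad_mean_sq / (grad_mean_sq + grad_var / B) / nu

print(alpha_grid, alpha_formula)                  # the two values agree to the grid resolution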
H Details of Neural Network Experiments and Additional Results
Here we present details of the ConvNet and the exact hyper-parameters used for training the neural network models in our experiments.

We train a convolutional neural network (LeCun et al., 1998) (ConvNet) to classify three benchmark image datasets: CIFAR-10 (Krizhevsky and Hinton, 2009), SVHN (Netzer et al., 2011) and MNIST (LeCun et al., 1998). The ConvNet used in our experiments is composed of 4 layers, excluding the input layer. We use 32 × 32 pixel images as input. The first layer of the ConvNet contains 16 × 3 × 3 filters, and the second layer contains 256 × 3 × 3 filters. The third and fourth layers are fully connected (LeCun et al., 1998) with 256 and 10 outputs respectively. Each layer except the last one is followed by a ReLU non-linearity (Krizhevsky et al., 2012) and a max pooling stage (Ranzato et al., 2007) of size 2 × 2. This ConvNet has over 4.3 million weights.
To compare against fine-tuned SGD, we used a comprehensive grid search over the stepsize schedule to identify optimal parameters (up to a factor of 2). For CIFAR-10, the stepsize starts at 0.5 and is divided by 2 every 5 epochs, with no additional stepsize decay. For SVHN, the stepsize starts at 0.5 and is divided by 2 every 5 epochs, with a learning rate decay of 1e-5. For MNIST, the learning rate starts at 1 and is divided by 2 every 3 epochs, with no additional stepsize decay. All algorithms use a momentum parameter of 0.9, and SGD and AdaDelta use mini-batches of size 128.
Fixed stepsize methods use the default decay rule of the Torch library: $\alpha_t = \alpha_0/(1 + 10^{-7} t)$, where $\alpha_0$ was chosen to be the stepsize used in the fine-tuned experiments. We also tuned the hyper-parameter $\rho$ in the AdaDelta algorithm, and found 0.9, 0.9 and 0.8 to be the best-performing values for CIFAR-10, SVHN and MNIST respectively.
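For concreteness, the fixed-stepsize decay rule above amounts to the following simple schedule (the function name here is our own illustrative choice):

def decayed_stepsize(alpha0, t, decay=1e-7):
    """Torch-style default decay referenced above: alpha_t = alpha_0 / (1 + decay * t)."""
    return alpha0 / (1.0 + decay * t)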
Figure 3 shows the change in the loss function over time for the same neural network experiments shown in the
main paper (on CIFAR-10, SVHN and MNIST).
[Figure 3 appears here: three panels titled "Average Loss (training set)", each plotting Loss against Number of epochs (0 to 40), one panel per dataset.]
Figure 3: Neural Network Experiments. Figure shows the change in the loss function for CIFAR-10, SVHN, and MNIST
(left to right).