
SUPPLEMENTARY MATERIAL FOR
“BAYESIAN DEEP DECONVOLUTIONAL LEARNING”
Yunchen Pu, Xin Yuan and Lawrence Carin
Department of Electrical and Computer Engineering, Duke University, Durham, NC, 27708, USA
{yunchen.pu,xin.yuan,lcarin}@duke.edu
1 CONDITIONAL POSTERIOR DISTRIBUTIONS FOR GIBBS SAMPLING
In the l-th layer, the model is
$X^{(n,k_{l-1},l)} = \sum_{k_l=1}^{K_l} D^{(k_{l-1},k_l,l)} \ast (Z^{(n,k_l,l)} \odot W^{(n,k_l,l)}) + E^{(n,k_{l-1},l)}.$  (1)
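To make Eq. (1) concrete, the following is a minimal NumPy/SciPy sketch of the layer-l synthesis for a single image n and a single channel k_{l-1}; the array shapes, the `full` convolution mode, and the function name are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def layer_output(D, Z, W, noise_std=0.0):
    """Sketch of Eq. (1): X = sum_k D^(k) * (Z^(k) ⊙ W^(k)) + E,
    for one image and one output channel k_{l-1}.

    D : (K_l, nd, nd)   dictionary elements D^{(k_{l-1}, k_l, l)}
    Z : (K_l, Nx, Ny)   binary activation maps Z^{(n, k_l, l)}
    W : (K_l, Nx, Ny)   weight maps W^{(n, k_l, l)}
    """
    K_l, Nx, Ny = Z.shape
    nd = D.shape[-1]
    X = np.zeros((Nx + nd - 1, Ny + nd - 1))           # 'full' convolution output size
    for k in range(K_l):
        X += convolve2d(Z[k] * W[k], D[k], mode='full')  # D * (Z ⊙ W)
    E = noise_std * np.random.randn(*X.shape)           # Gaussian residual E
    return X + E
```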
For brevity, we define the following symbols (operations):
$X^{-(n,k_{l-1},l)} = X^{(n,k_{l-1},l)} - \sum_{k_l} D^{(k_{l-1},k_l,l)} \ast (Z^{(n,k_l,l)} \odot W^{(n,k_l,l)}).$  (2)
$X^{(n,k_{l-1},l)}_{-k_l} = X^{(n,k_{l-1},l)} - \sum_{t \neq k_l} D^{(k_{l-1},t,l)} \ast (Z^{(n,t,l)} \odot W^{(n,t,l)}).$  (3)
$(D^{(k_{l-1},k_l,l)})^2 = D^{(k_{l-1},k_l,l)} \odot D^{(k_{l-1},k_l,l)}.$  (4)
$(W^{(n,k_l,l)})^2 = W^{(n,k_l,l)} \odot W^{(n,k_l,l)}.$  (5)
$(X^{-(n,k_{l-1},l)})^2 = X^{-(n,k_{l-1},l)} \odot X^{-(n,k_{l-1},l)}.$  (6)
where $\odot$ is the element-wise product operator and $\oslash$ is the element-wise division operator.
$A = B \circledast C$ denotes that, for $B \in \mathbb{R}^{n_B \times n_B}$ and $C \in \mathbb{R}^{n_C \times n_C}$, $A \in \mathbb{R}^{(n_B - n_C + 1) \times (n_B - n_C + 1)}$ with element $(i, j)$
$A_{i,j} = \sum_{p=1}^{n_C} \sum_{q=1}^{n_C} B_{p+i-1,\, q+j-1}\, C_{p,q}.$  (7)
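For reference, Eq. (7) can be implemented directly; the naive NumPy loop below (and the equivalent SciPy call noted in the comment) is purely illustrative.

```python
import numpy as np

def corr_valid(B, C):
    """A = B ⊛ C from Eq. (7): A[i, j] = sum_{p,q} B[p+i, q+j] * C[p, q]
    (0-based indexing), so A has shape (nB - nC + 1, nB - nC + 1)."""
    nB, nC = B.shape[0], C.shape[0]
    nA = nB - nC + 1
    A = np.zeros((nA, nA))
    for i in range(nA):
        for j in range(nA):
            A[i, j] = np.sum(B[i:i + nC, j:j + nC] * C)
    return A

# Equivalent one-liner: scipy.signal.correlate2d(B, C, mode='valid')
```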
We use ‘$|-$’ in the following equations to denote conditioning on the remaining model parameters. In each MCMC iteration, samples are drawn from the following conditional distributions (a schematic sketch of the most involved updates is given after the list):
• Sampling $D^{(k_{l-1},k_l,l)}$:
$D^{(k_{l-1},k_l,l)} \,|\, - \sim \mathcal{N}(\mu^{(k_{l-1},k_l,l)}, \Sigma^{(k_{l-1},k_l,l)})$  (8)
$\Sigma^{(k_{l-1},k_l,l)} = 1 \Big/ \left( \sum_{n=1}^{N} \gamma_e^{(n,k_{l-1},l)} \| Z^{(n,k_l,l)} \odot W^{(n,k_l,l)} \|_2^2 + \gamma_d^{(k_{l-1},k_l,l)} \right)$  (9)
$\mu^{(k_{l-1},k_l,l)} = \Sigma^{(k_{l-1},k_l,l)} \left[ \sum_{n=1}^{N} \gamma_e^{(n,k_{l-1},l)}\, X^{(n,k_{l-1},l)}_{-k_l} \circledast (Z^{(n,k_l,l)} \odot W^{(n,k_l,l)}) \right]$  (10)
• Sampling $\gamma_d^{(k_{l-1},k_l,l)}$:
$\gamma_d^{(k_{l-1},k_l,l)} \,|\, - \sim \mathrm{Gamma}\!\left( a_d + \tfrac{1}{2},\; b_d + \tfrac{1}{2} (D^{(k_{l-1},k_l,l)})^2 \right)$  (11)
• Sampling $W^{(n,k_l,l)}$:
$W^{(n,k_l,l)} \,|\, - \sim \mathcal{N}(\xi^{(n,k_l,l)}, \Lambda^{(n,k_l,l)})$  (12)
$\Lambda^{(n,k_l,l)} = 1 \Big/ \left( \sum_{k_{l-1}=1}^{K_{l-1}} \gamma_e^{(n,k_{l-1},l)} \| D^{(k_{l-1},k_l,l)} \|_2^2\, Z^{(n,k_l,l)} + \gamma_W^{(n,k_l,l)} \right)$  (13)
$\xi^{(n,k_l,l)} = \Lambda^{(n,k_l,l)}\, Z^{(n,k_l,l)} \left( \sum_{k_{l-1}=1}^{K_{l-1}} \gamma_e^{(n,k_{l-1},l)}\, X^{(n,k_{l-1},l)}_{-k_l} \circledast D^{(k_{l-1},k_l,l)} \right)$  (14)
• Sampling $\gamma_W^{(n,k_l,l)}$:
$\gamma_W^{(n,k_l,l)} \,|\, - \sim \mathrm{Gamma}\!\left( a_W + \tfrac{1}{2},\; b_W + \tfrac{1}{2} (W^{(n,k_l,l)})^2 \right)$  (15)
• Sampling $Z^{(n,k_l,l)}_{i',j'}$:
Let $m = 1,\ldots,n_x n_y$, $i = 1,\ldots,N_x$, $j = 1,\ldots,N_y$, $i' = 1,\ldots,N_x/n_x$, $j' = 1,\ldots,N_y/n_y$; then $(i', j', m)$ and $(i, j)$ are in one-to-one correspondence. From
$Y^{(n,k_l,l)} = \sum_{k_{l-1}=1}^{K_{l-1}} \gamma_e^{(n,k_{l-1},l)} \left[ \| D^{(k_{l-1},k_l,l)} \|_2^2\, (W^{(n,k_l,l)})^2 - 2\, \big( X^{(n,k_{l-1},l)}_{-k_l} \circledast D^{(k_{l-1},k_l,l)} \big) \odot W^{(n,k_l,l)} \right],$  (16)
and
$\hat{\theta}^{(n,k_l,l)}_{i',j',m} = \theta^{(n,k_l,l)}_m \exp\!\big( -\tfrac{1}{2} Y^{(n,k_l,l)}_{i',j',m} \big), \quad \forall\, m = 1,\ldots,n_x n_y,$  (17)
we have
$P\big(Z^{(n,k_l,l)}_{i',j',m} = 1 \,\big|\, -\big) = \dfrac{\hat{\theta}^{(n,k_l,l)}_{i',j',m}}{\sum_{t=1}^{n_x n_y} \hat{\theta}^{(n,k_l,l)}_{i',j',t} + \theta^{(n,k_l,l)}_{n_x n_y+1}}, \quad \forall\, m = 1,\ldots,n_x n_y.$  (18)
• Sampling $\theta^{(n,k_l,l)}$:
$\theta^{(n,k_l,l)} \,|\, - \sim \mathrm{Dirichlet}(\alpha^{(n,k_l,l)})$  (19)
$\alpha^{(n,k_l,l)}_m = \frac{1}{n_x n_y + 1} + \sum_{i'} \sum_{j'} z^{(n,k_l,l)}_{i',j',m}, \quad \forall\, m = 1,\ldots,n_x n_y,$  (20)
$\alpha^{(n,k_l,l)}_{n_x n_y + 1} = \frac{1}{n_x n_y + 1} + \sum_{i'} \sum_{j'} \Big( 1 - \sum_m z^{(n,k_l,l)}_{i',j',m} \Big).$  (21)
• Sampling $\gamma_e^{(n,k_{l-1},l)}$:
$\gamma_e^{(n,k_{l-1},l)} \,|\, - \sim \mathrm{Gamma}\!\left( a_e + \tfrac{1}{2},\; b_e + \tfrac{1}{2} (X^{-(n,k_{l-1},l)})^2 \right).$  (22)
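To make the flow of the sampler concrete, below is a schematic NumPy sketch of the two most involved conditionals above: the Gaussian update for a dictionary element (Eqs. (8)-(10)) and the block-wise categorical update for the unpooling indicators Z (Eqs. (16)-(18)). It assumes a single image and a single input channel (N = 1, K_{l-1} = 1), treats the precisions as scalars, and uses illustrative array shapes and function names; it is a sketch of the update structure, not the authors' implementation.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)

def residual(X, D, Z, W, skip=None):
    """X minus the reconstruction, optionally leaving out filter `skip`
    (this is X^{-(...)} of Eq. (2), or X_{-k}^{(...)} of Eq. (3) when skip=k)."""
    R = X.copy()
    for k in range(len(D)):
        if k != skip:
            R -= convolve2d(Z[k] * W[k], D[k], mode='full')
    return R

def sample_D_k(X, D, Z, W, k, gamma_e, gamma_d):
    """Gaussian update for one dictionary element D^(k), Eqs. (8)-(10)."""
    ZW = Z[k] * W[k]
    prec = gamma_e * np.sum(ZW ** 2) + gamma_d            # Eq. (9), scalar precision
    X_mk = residual(X, D, Z, W, skip=k)                   # contribution of filter k removed
    mu = (gamma_e / prec) * correlate2d(X_mk, ZW, mode='valid')  # Eq. (10), same shape as D[k]
    return mu + rng.normal(scale=np.sqrt(1.0 / prec), size=mu.shape)

def sample_Z_k(X, D, Z, W, k, theta, gamma_e, nx, ny):
    """Block-wise categorical update for Z^(k), Eqs. (16)-(18).
    Each (nx x ny) block of Z[k] is either all-zero or has a single 1;
    theta has length nx*ny + 1, the last entry for 'no activation'."""
    X_mk = residual(X, D, Z, W, skip=k)
    # Eq. (16): per-pixel quadratic cost of switching each position on
    Y = gamma_e * (np.sum(D[k] ** 2) * W[k] ** 2
                   - 2.0 * correlate2d(X_mk, D[k], mode='valid') * W[k])
    Nx, Ny = Z[k].shape
    Z_new = np.zeros_like(Z[k])
    for bi in range(Nx // nx):
        for bj in range(Ny // ny):
            blk = Y[bi * nx:(bi + 1) * nx, bj * ny:(bj + 1) * ny].ravel()
            w = theta[:nx * ny] * np.exp(-0.5 * blk)      # Eq. (17): theta-hat for this block
            probs = np.append(w, theta[-1])               # last entry: block stays all-zero
            probs /= probs.sum()                          # Eq. (18)
            m = rng.choice(nx * ny + 1, p=probs)
            if m < nx * ny:
                Z_new[bi * nx + m // ny, bj * ny + m % ny] = 1
    return Z_new
```

The remaining updates (Eqs. (11), (15), and (19)-(22)) are standard conjugate Gamma and Dirichlet draws and follow the same pattern.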
2 CONVOLUTIONAL MODEL WITH MISSING DATA
Our model can take account of the missing-data problem by adding one term to the model at the data level:
$Y^{(n)} = M^{(n)} \odot X^{(n)} = M^{(n)} \odot \left[ \sum_{k=1}^{K} (Z^{(n,k)} \odot W^{(n,k)}) \ast D^{(k)} + E^{(n)} \right],$  (23)
where $M^{(n)}$ is the binary mask indicating which data points are observed, and $Y^{(n)}$ is now the observed data. Recently, patch-based dictionary learning methods (Zhou et al., 2012) have demonstrated superior performance on missing-data interpolation. However, they may fail when large contiguous regions are missing, since the model is built on small patches. The deep convolutional model developed in this paper, because of its hierarchical nature, copes with such large missing regions in a natural way (Lee et al., 2009). The essence is that the high-layer filters (dictionaries) cover information from entire images, and therefore large missing areas can be recovered using these dictionaries. Decent results are shown in the experiments of the main paper. Inference for this model is similar to that for the generative model and is thus omitted here.
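As a rough illustration of how the mask enters the data-level updates, the residual used in the likelihood terms is simply restricted to observed pixels; the helper below (a hypothetical sketch, not from the paper) shows this for layer one.

```python
import numpy as np
from scipy.signal import convolve2d

def masked_residual(Y, M, D, Z, W):
    """Residual restricted to observed pixels: M ⊙ (Y - sum_k D^(k) * (Z^(k) ⊙ W^(k))),
    cf. Eq. (23). Unobserved entries (M = 0) contribute nothing to the
    likelihood terms in the Gibbs updates."""
    recon = np.zeros_like(Y)
    for k in range(len(D)):
        recon += convolve2d(Z[k] * W[k], D[k], mode='full')
    return M * (Y - recon)
```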
3 MORE RESULTS
Figure 1: Trained dictionaries per class, with each class (from Caltech 101) trained independently. Rows 1-4: nautilus, pyramid, revolver, umbrella. Columns 1-4: training images after local contrast normalization, layer-1 dictionary, layer-2 dictionary, layer-3 dictionary.

Figure 2 (panels: Layer One Dictionary, Layer Two Dictionary, Layer Three Dictionary): Dictionaries (layer 1 to layer 3) trained on images of 6 classes (ketch, nautilus, revolver, stop sign, umbrella, windsor chair) from Caltech 101.
Fig. 1 plots dictionaries trained per class; four classes are shown. It can be seen that the layer-one dictionary elements are quite similar across classes, while the upper-layer dictionaries differ from class to class. We also ran the model on a joint dataset with multi-class images, with the trained dictionary elements shown in Fig. 2. It can be seen that the first- and second-layer dictionaries are more general under this joint training; for instance, the first-layer filters in Fig. 2 are an integration of the per-class filters in Fig. 1.
REFERENCES
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable
unsupervised learning of hierarchical representations. ICML, 2009.
Zhou, M., Chen, H., Paisley, J., Ren, L., Li, L., Xing, Z., Dunson, D., Sapiro, G., and Carin, L.
Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images. IEEE
T-IP, 2012.