Salient Deconvolutional Networks

Aravindh Mahendran, Andrea Vedaldi, University of Oxford
(A) GOAL & CONTRIBUTIONS
Recently, several methods to understand CNNs through visualization have been proposed:
1. DeConvNet visualizes patterns selected by neurons [5];
2. Class saliency visualizes the "network attention" pattern [3].
However, both are heuristic and their meaning remains unclear.
Our goal is to unify, compare, and understand such techniques. We do so by:
1. Introducing a generalized construction for reversed architectures.
2. Exploring in detail three variants: DeConvNet, SaliNet, and the hybrid DeSaliNet.
3. Identifying limitations of these networks for the purpose of CNN visualization.

(F) Interpretation: Fourier Phase Information
[Figure: an input image and its Fourier reconstructions with random magnitude, alongside DeSaliNet and DeConvNet outputs for a positive-random input and the original input.]
Auxiliary information is like the phase information in a Fourier transform.
(B) Reversing CNN Architectures
[Figure: a forward CNN (Conv → ReLU → Max Pool → neuron selector) mapping an input image, and its reversed CNN (un-pool → reversed ReLU → Conv Transpose) producing a result image, with auxiliary information such as max-pooling "switches" passed from the forward network to the reversed one.]
Panel (F), caption: randomized magnitudes with the ground-truth phase yield an edge image similar to the results obtained from a reversed CNN using the ReLU backward function.
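The Fourier experiment in panel (F) can be sketched in NumPy as follows (a minimal illustration with a random array standing in for the image; sizes and names are assumptions, not the poster's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))           # stand-in for an input image

F = np.fft.fft2(img)
phase = np.angle(F)                  # keep the ground-truth phase
rand_mag = np.abs(np.fft.fft2(rng.random((32, 32))))  # randomized magnitude

# Recombine the random magnitude with the original phase and invert.
recon = np.real(np.fft.ifft2(rand_mag * np.exp(1j * phase)))
```

With a natural image in place of `img`, `recon` retains the edge structure carried by the phase, which is the analogy the poster draws with the auxiliary information.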
Reverse a "Forward CNN" layer by layer using heuristics to form a "Reversed CNN". Example:

Forward layer → Reversed layer
Convolution  → Convolution Transpose
Pooling      → Un-Pooling
[Figure: result images for SaliNet, DeConvNet, and DeSaliNet when reversing VGG-VD from Pool5_3 and from FC8, shown next to the input image.]
Back-propagation defines a natural reverse of each layer. For a layer φᵢ : x → y, its BP-reversed layer is

    φᵢᴮᴾ : ŷ ↦ x̂ = (∂/∂x) ⟨ŷ, φᵢ(x)⟩

where ŷ is the input of the reversed layer.
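For a linear layer φ(x) = Wx, the definition above reduces to multiplication by Wᵀ. A minimal NumPy sketch (W, x, and ŷ are random illustrative values) checks the analytic BP-reverse against a numerical gradient of ⟨ŷ, φ(x)⟩:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # linear layer phi(x) = W @ x
x = rng.standard_normal(3)
y_hat = rng.standard_normal(4)    # input of the reversed layer

# BP-reversed layer: x_hat = d/dx <y_hat, phi(x)> = W^T y_hat
x_hat = W.T @ y_hat

# Check against a central-difference gradient of the inner product.
def inner(v):
    return y_hat @ (W @ v)

eps = 1e-6
numeric = np.array([(inner(x + eps * e) - inner(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
assert np.allclose(x_hat, numeric, atol=1e-5)
```

The same identity is what makes convolution transpose the BP-reverse of convolution: convolution is linear, so its reverse is multiplication by the transpose of its matrix form.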
However, other definitions are also commonly used. We consider the variations used in:
• Deconvolutional Networks (DeConvNet) – Zeiler et al. [5]
• Class Saliency (SaliNet) – Simonyan et al. [3]
• Improved DeConvNets (DeSaliNet) – Springenberg et al. [4]
• Backpropagation – Rumelhart et al. [7]
• Semantic segmentation DeConvNet – Noh et al. [2]
(C) Reversing Layers: Max Pooling and ReLU
[Figure: forward max pooling records the max locations ("switches"); forward ReLU records a rectification mask.]

(G) Foreground Object Selectivity
• Foreground–background differentiation is perhaps implicit in the CNN hidden-layer activations.
• A "Reversed CNN" can extract this information and project it into the image.
[Figure: the four reversed-ReLU variants, applied to the input ŷ from the reversed layer above using the rectification mask r recorded in the forward pass:
• ŷ > 0 (ReLU) – DeConvNet
• ŷ ⊙ r (ReLUᴮᴾ) – SaliNet
• (ŷ > 0) ⊙ r (ReLU ∘ ReLUᴮᴾ) – DeSaliNet
• ŷ (no op.)
Reversed pooling either un-pools using the "switches" or un-pools to the centre.]
The figure in panel (G) suggests that SaliNet and DeSaliNet better highlight foreground objects compared to DeConvNet, whose response is spread uniformly over the image.
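The four reversed-ReLU rules above can be sketched in a few lines of NumPy (the function name is illustrative; r denotes the forward rectification mask x > 0):

```python
import numpy as np

def reversed_relu(y_hat, r, method):
    """Apply one of the four reversed-ReLU rules from panel (C)."""
    if method == "deconvnet":     # y_hat > 0: rectify the backward signal
        return y_hat * (y_hat > 0)
    if method == "salinet":       # y_hat * r: true backprop through ReLU
        return y_hat * r
    if method == "desalinet":     # (y_hat > 0) * r: both conditions
        return y_hat * (y_hat > 0) * r
    return y_hat                  # no op.

y_hat = np.array([-1.0, 2.0, -3.0, 4.0])
r = np.array([0.0, 1.0, 1.0, 0.0])           # forward mask (x > 0)
print(reversed_relu(y_hat, r, "desalinet"))  # [0. 2. 0. 0.]
```

DeSaliNet is the most restrictive rule: a value survives only if it was positive in the forward pass (r) and is positive in the backward signal (ŷ > 0).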
(H) Weakly Supervised Foreground Object Segmentation
• Use the output of a "Reversed CNN" to seed a GrabCut segmentation and segment the foreground object.
• We compare against the weakly supervised baseline of Guillaumin et al., IJCV 2014 [1].
[Figure: segmentation pipeline – image → CNN → select strongest neuron → reversed CNN (using the auxiliary information) → mask → GrabCut → foreground segmentation.]
DeConvNet, SaliNet, and DeSaliNet all un-pool using "switches"; they differ only in the reversed ReLU layers.
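Un-pooling with "switches" can be sketched as follows (a minimal NumPy implementation for non-overlapping 2×2 pooling; function names are illustrative):

```python
import numpy as np

def maxpool_with_switches(x, k=2):
    """Non-overlapping k-by-k max pooling; also record argmax 'switches'."""
    h, w = x.shape
    blocks = (x.reshape(h // k, k, w // k, k)
               .transpose(0, 2, 1, 3)
               .reshape(h // k, w // k, k * k))
    return blocks.max(axis=2), blocks.argmax(axis=2)

def unpool_with_switches(y_hat, switches, k=2):
    """Place each value back at the max location recorded in the forward pass."""
    ph, pw = y_hat.shape
    out = np.zeros((ph * k, pw * k))
    for i in range(ph):
        for j in range(pw):
            di, dj = divmod(int(switches[i, j]), k)
            out[i * k + di, j * k + dj] = y_hat[i, j]
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
pooled, sw = maxpool_with_switches(x)   # pooled = [[5, 7], [13, 15]]
restored = unpool_with_switches(pooled, sw)
```

"Un-pool to the centre" would instead place every value at a fixed position inside its k×k block, discarding the switches; this is the alternative compared in panel (D).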
(D) Analysis of Reversed Architectures
Two types of reversed pooling (un-pool using switches, un-pool to the centre) and four types of reversed ReLU layers (ReLU ∘ ReLUᴮᴾ, ReLU, ReLUᴮᴾ, no operation) are compared.
[Figure: result images for each combination, covering DeConvNet, SaliNet, and DeSaliNet.]
Method            | Per-Pixel Accuracy  | IoU
                  | AlexNet | VGG-16    | AlexNet | VGG-16
SaliNet           | 82.82   | 82.45     | 57.07   | 56.33
DeSaliNet         | 82.31   | 83.29     | 55.57   | 56.25
DeConvNet         | 75.85   | 76.52     | 48.26   | 48.16
Baseline          | 78.97               | 46.27
Guillaumin et al. | 84.4                | 57.3

In the table, the "baseline" uses a Gaussian mask centred at the image centre as the foreground seed and Gaussian masks centred at the image corners as the background seed (see figure below).
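The baseline seeding can be sketched as follows (a hedged illustration; the mask size and σ are assumptions, not the poster's actual parameters):

```python
import numpy as np

def gaussian_mask(h, w, cy, cx, sigma):
    """Isotropic 2-D Gaussian of size h-by-w centred at (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

h, w = 64, 64
fg_seed = gaussian_mask(h, w, h / 2, w / 2, h / 4)   # centre -> foreground seed
bg_seed = sum(gaussian_mask(h, w, cy, cx, h / 4)     # corners -> background seed
              for cy in (0, h - 1) for cx in (0, w - 1))
```

These two maps would then initialize GrabCut's foreground/background models in place of the reversed-CNN mask.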
[Figure: qualitative examples – original image, ground truth, and mask/segmentation pairs for DeSaliNet, SaliNet, DeConvNet, and the baseline.]
ReLU in the backward direction imparts edges and structure to the output.
(E) Lack of Neuron Selectivity
We change the neuron selector and view the resulting image.
[Figure: outputs of DeSaliNet, DeConvNet, and SaliNet when the selector is random noise, a random neuron, or the maximally active neuron.]
• Changing the selected neuron does not significantly change the output.
• This suggests that these reversed networks are not suitable for neuron visualization.
• The auxiliary information dominates the output.
References
1. Guillaumin, M., Küttel, D., Ferrari, V.: ImageNet auto-annotation with segmentation propagation. IJCV (2014)
2. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proc. ICCV (2015)
3. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. In: ICLR (2014)
4. Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: The all convolutional net. In: ICLR Workshop (2015)
5. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Proc. ECCV (2014)
6. Gulshan, V., Rother, C., Criminisi, A., Blake, A., Zisserman, A.: Geodesic star convexity for interactive image segmentation. In: Proc. CVPR (2010)
7. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
Acknowledgements: Aravindh Mahendran is supported by BP; Andrea Vedaldi by ERC StG IDIU.