
CSE 190 Neural Networks:
How to train a network to look and see
Gary Cottrell
Week 9 Lecture 2
7/13/2017
Introduction

How do we deal with the high dimensionality of visual input?
Introduction

Our field of view is about 200° horizontally and 130° vertically – a HUGE image. Compare that to the size of MNIST images!

How do we deal with the high dimensionality of visual input?

Sampling!
Introduction

We have a foveated retina – we only have high resolution for about 2° of visual angle.

We move our eyes about 3 times a second. That pencils out to about 172,000 times a day! So we sample the world at our highest resolution, 2° of visual angle at a time, roughly 172,000 times per day.

Perhaps we could apply this idea to computer vision.
Introduction

And we have (Kanan & Cottrell, 2010):

We used a salience map to decide where to sample from an image, and stored fragments of the image.

For a new image, we took new samples and figured out who or what it was by a kind of nearest-neighbor voting. A rough sketch of that idea follows.
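This is only a minimal sketch of that sample-and-vote idea, not the actual Kanan & Cottrell model; the fragment size, number of samples, and voting rule here are stand-ins:

```python
import numpy as np

def sample_fragments(image, salience, n_samples=100, size=16):
    """Draw fragment locations with probability proportional to the salience map."""
    h, w = salience.shape
    p = salience.ravel() / salience.sum()
    idx = np.random.choice(h * w, size=n_samples, p=p)
    rows, cols = np.unravel_index(idx, (h, w))
    frags = []
    for r, c in zip(rows, cols):
        r0 = np.clip(r - size // 2, 0, h - size)
        c0 = np.clip(c - size // 2, 0, w - size)
        frags.append(image[r0:r0 + size, c0:c0 + size].ravel())
    return np.stack(frags)

def classify_by_voting(query_frags, stored_frags, stored_labels):
    """Each query fragment votes for the label of its nearest stored fragment."""
    dists = ((query_frags[:, None, :] - stored_frags[None, :, :]) ** 2).sum(-1)
    votes = stored_labels[dists.argmin(axis=1)]
    return np.bincount(votes).argmax()
```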
What’s wrong with this picture?

The model sampled randomly from the image, according to the probability distribution of the salience map.

Clearly, we (humans and other animals) don't do this: we can recognize a face in two fixations (Hsiao & Cottrell, 2008).

Can we learn a policy for sampling efficiently from an image?
The Recurrent Attention Model

Researchers at DeepMind (purchased by Google for $400,000,000 in 2014) have developed a network that can “move its eyes” and recognize multiple objects in an image (Ba, Mnih, & Kavukcuoglu, ICLR 2015).

It is trained end-to-end to sample from an image, decide the next location to look at, and output a classification.

It was initially used to read street addresses.
The Recurrent Attention Model
[Architecture diagram from the paper, annotated “Start Here” where the walkthrough begins.]
The Recurrent Attention Model
The little arrow from the little picture is actually a 3-layer convnet with no pooling, that learns from a coarse version of the image to create the initial state of the recurrent network that decides where to look next.
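As a rough sketch of what that might look like in code (the channel counts, kernel sizes, and state size below are assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class ContextNet(nn.Module):
    """Three conv layers, no pooling: a coarse version of the image is mapped
    to the initial state of the location-deciding recurrent network, r(2)."""
    def __init__(self, state_dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.to_state = nn.LazyLinear(state_dim)    # flattened features -> initial r(2) state

    def forward(self, coarse_image):                # coarse_image: (batch, 1, H, W)
        x = self.convs(coarse_image)
        return self.to_state(x.flatten(1))          # (batch, state_dim)
```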
The Recurrent Attention Model
The controller network
The Recurrent Attention Model
This is half of the recurrent network; the part that I'll call the controller network, because it keeps the state of where we've looked and is input to the emission network to produce where to look next. It is an LSTM network.
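In code, the controller can be thought of as a single recurrent cell whose state persists across glimpses; a minimal stand-in (the sizes are assumptions):

```python
import torch.nn as nn

# Controller r(2): an LSTM cell whose (hidden, cell) state carries "where have we
# looked so far" from one glimpse to the next and feeds the emission network.
controller = nn.LSTMCell(input_size=256, hidden_size=256)
```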
The Recurrent Attention Model
From the controller network, the little arrow marked “emission” is really just a feedforward network with one hidden layer that learns to produce an (x,y) location of where to look next, based on the current state of the r(2) network. n is the time step.
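A hedged sketch of such an emission network (the hidden size and the tanh squashing of the output are assumptions):

```python
import torch
import torch.nn as nn

class EmissionNet(nn.Module):
    """One hidden layer: maps the controller state at step n to an (x, y) fixation."""
    def __init__(self, state_dim=256, hidden_dim=128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.loc = nn.Linear(hidden_dim, 2)     # (x, y), here kept in [-1, 1] coordinates

    def forward(self, h2):
        h = self.hidden(h2)                     # this hidden layer also gates the glimpse network (see below)
        return torch.tanh(self.loc(h)), h
```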
The Recurrent Attention Model
So, the first computation is to take a coarse version of the image, run it through a convnet, which sets the initial state of the r(2) network, which feeds into the emission network, which produces a first fixation.
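Wiring the stand-in modules sketched above together, the first fixation might be computed like this (image size and batch size are arbitrary):

```python
import torch

context = ContextNet(state_dim=256)
emission = EmissionNet(state_dim=256)

coarse = torch.zeros(1, 1, 32, 32)     # a downsampled (coarse) version of the input image
h2 = context(coarse)                   # initial state of the r(2) controller
c2 = torch.zeros_like(h2)              # the LSTM cell state starts at zero
loc, loc_hidden = emission(h2)         # the first (x, y) fixation
```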
The Recurrent Attention Model
This (x,y) location decides what patch of the image is input to the “glimpse network”.
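One simple way to turn that (x, y) into a glimpse is a fixed-size crop around the fixation (a sketch; the paper's glimpse extraction may differ, e.g. in patch size or multi-resolution details):

```python
import torch

def take_glimpse(image, loc, size=12):
    """Crop a size x size patch at loc, with loc given in [-1, 1] image coordinates."""
    _, _, H, W = image.shape                                   # image: (batch, channels, H, W)
    xs = ((loc[:, 0] + 1) / 2 * (W - size)).long().tolist()    # map [-1, 1] to a top-left corner
    ys = ((loc[:, 1] + 1) / 2 * (H - size)).long().tolist()
    return torch.stack([img[:, y:y + size, x:x + size]
                        for img, x, y in zip(image, xs, ys)])
```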
The Recurrent Attention Model
So, after training, it focuses on the first digit in the address.

[Figure: the input image, the extracted glimpse, and the glimpse network.]
The Recurrent Attention Model
This little arrow is a feed-forward convnet, with three convolutional layers followed by a fully-connected hidden layer. It is gated by the hidden layer of the location network, one to one (element-wise).
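A sketch of that glimpse network with the element-wise gating (layer sizes are assumptions; the gate width must match the emission network's hidden layer):

```python
import torch
import torch.nn as nn

class GlimpseNet(nn.Module):
    """Three conv layers plus a fully-connected layer; the output is gated
    element-wise (one-to-one) by the emission network's hidden layer."""
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(hidden_dim)

    def forward(self, glimpse, loc_hidden):        # glimpse: (batch, 1, size, size)
        h = torch.relu(self.fc(self.convs(glimpse).flatten(1)))
        return h * loc_hidden                      # multiplicative, one-to-one gating
```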
The Story So Far…

This section of the network, in more detail, is this:

[Unrolled diagram: the recurrent loop emits an (X, Y) location at each time step T=0, T=1, T=2.]
The Story So Far…
This section of the network:

Again, the hidden units of the emission network, which has exactly the same number of hidden units as the first recurrent network, are connected one-to-one with multiplicative connections; that is, the hidden layer of the lower recurrent net is gated by the location network.
The Story So Far…
This section of the network:

Note that this also gives a pathway for the error to propagate from the actual target network (which is fed by the lower recurrent net) all the way back to the hidden nodes of the emission network, but not to the output of the emission network, i.e. the location.
The Recurrent Attention Model
[Full architecture diagram, annotated “Start Here” and “End Here (if done)”.]
The Recurrent Attention Model

How do we train this??

The y is compared to the target (presumably read out when the LSTM units are good and ready), and then backprop.

They actually stop the gradient calculation after the first mislabeled target, so shorter sequences come first. This is sometimes called curriculum learning.
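A hedged sketch of that stopping trick, applied to the per-digit classification loss (variable names and shapes are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def classification_loss(logits_per_step, targets):
    """logits_per_step: one (1, n_classes) tensor per digit; targets: (n_digits,) long tensor.
    Accumulate cross-entropy digit by digit, but stop at the first mislabeled digit."""
    loss = torch.zeros(())
    for logits, target in zip(logits_per_step, targets):
        loss = loss + F.cross_entropy(logits, target.view(1))
        if logits.argmax(dim=1).item() != target.item():
            break                      # shorter sequences are effectively learned first
    return loss
```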
The Recurrent Attention Model

That takes care of the classification part, but what about the location part?

Here, we can use reinforcement learning to reward the network when it picks a location that works well. The reinforcement signal is based on the fraction it gets right.
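Schematically this is a REINFORCE-style policy gradient: the sampled fixations get a log-probability, and the gradient is weighted by the reward. A sketch, assuming the fixation is drawn from a Gaussian around the emission network's output (the distribution and its width are assumptions):

```python
import torch

def location_loss(loc_means, loc_samples, reward, sigma=0.1):
    """REINFORCE-style surrogate loss for the fixation locations.
    loc_means, loc_samples: (n_glimpses, 2); reward: e.g. the fraction the model got right."""
    dist = torch.distributions.Normal(loc_means, sigma)
    log_prob = dist.log_prob(loc_samples).sum()
    return -reward * log_prob          # minimizing this raises the log-prob of rewarded fixations
```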
The Recurrent Attention Model

But how do we even get started? We let the network choose random locations at first, to encourage it to explore. Then, later, we exploit what it has learned and explore less.
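One simple way to get that explore-then-exploit behaviour is to add noise to the emitted location and shrink it over training (the schedule below is an assumption, not the paper's):

```python
import torch

def sample_location(loc_mean, step, total_steps, sigma_start=0.5, sigma_end=0.05):
    """Early in training the noise is large (explore); later it is small (exploit)."""
    frac = min(step / total_steps, 1.0)
    sigma = sigma_start + frac * (sigma_end - sigma_start)
    return torch.clamp(loc_mean + sigma * torch.randn_like(loc_mean), -1.0, 1.0)
```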
So, what can we do with all this machinery??

We can find pairs of digits in images! (Whoohoo!)

(Really? We did all this to do that???)

OK, yeah, well, we can do it better than anyone else! (OK, better than we did it last year…)
How the network behaves
But wait! There’s more!

We can add those two digits (we couldn't do that last year).
But wait! There’s more!

We can read street numbers!!!
But wait! There’s more!

We can read street numbers backwards!!!
Was all that really necessary?