Contextual models for object detection using boosted random fields

by Antonio Torralba, Kevin P. Murphy, and William T. Freeman
Quick Introduction

What is this?

Now can you tell?
Belief Propagation (BP)

Network (pairwise Markov random field)

observed nodes (y_i)
hidden nodes (x_i)
Statistical dependency between x_i and y_i, called the local evidence: φ_i(x_i, y_i)
Short-hand: φ_i(x_i)
Belief Propagation (BP)

Statistical dependency: local evidence φ_i(x_i, y_i), short-hand φ_i(x_i)
Statistical dependency between neighbouring hidden nodes: compatibility function ψ_ij(x_i, x_j)
Belief Propagation (BP)

Joint probability:

p({x}) = (1/Z) ∏_i φ_i(x_i) ∏_(ij) ψ_ij(x_i, x_j)

[Figure: grid of hidden nodes x_i connected to their observed nodes y_i]
Belief Propagation (BP)

The belief b at a node i combines
the local evidence of the node, and
all the messages coming in from its neighbors:

b_i(x_i) = k φ_i(x_i) ∏_{j∈N(i)} m_ji(x_i) ≈ p_i(x_i | y)
Belief Propagation (BP)

Messages m between hidden nodes:
m_ji(x_i): how likely node j thinks it is that node i will be in the corresponding state.

m_ji(x_i) = Σ_{x_j} φ_j(x_j) ψ_ji(x_j, x_i) ∏_{k∈N(j)\i} m_kj(x_j)
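As a concrete illustration (my own, not from the slides), here is a minimal numpy sketch of these message and belief updates on a toy chain of three binary hidden nodes; the potentials phi and psi are made-up values.

```python
import numpy as np

# Toy chain x0 - x1 - x2 of binary hidden nodes (states -1, +1).
# phi[i] is the local evidence phi_i(x_i); psi is a shared compatibility psi_ij(x_i, x_j).
phi = np.array([[0.8, 0.2],   # node 0: [phi(x=-1), phi(x=+1)]
                [0.5, 0.5],
                [0.3, 0.7]])
psi = np.array([[0.7, 0.3],   # psi[a, b] = psi_ij(x_i = state a, x_j = state b)
                [0.3, 0.7]])
edges = [(0, 1), (1, 2)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

# msgs[(j, i)] is the message from node j to node i, indexed by the state of x_i
msgs = {(j, i): np.ones(2) for (a, b) in edges for (j, i) in [(a, b), (b, a)]}

for _ in range(10):                          # synchronous BP iterations
    new = {}
    for (j, i) in msgs:
        # m_ji(x_i) = sum_{x_j} phi_j(x_j) psi(x_j, x_i) * prod_{k in N(j)\i} m_kj(x_j)
        prod_in = np.prod([msgs[(k, j)] for k in neighbors[j] if k != i], axis=0)
        m = (phi[j] * prod_in) @ psi
        new[(j, i)] = m / m.sum()            # normalise for numerical stability
    msgs = new

# b_i(x_i) = k * phi_i(x_i) * prod_{j in N(i)} m_ji(x_i)
for i in range(3):
    b = phi[i] * np.prod([msgs[(j, i)] for j in neighbors[i]], axis=0)
    print(f"node {i}: belief over (-1, +1) =", b / b.sum())
```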
Conditional Random Field

Distribution of the form:

p(x | y) = (1/Z) ∏_i φ_i(x_i) ∏_{j∈N_i} ψ_ij(x_i, x_j)
Boosted Random Field

Basic idea:
Use BP to estimate P(x|y)
Use boosting to maximize the log-likelihood of each node with respect to φ_i(x_i)
Algorithm: BP

Minimize the negative log-likelihood of the training data (y_i). Label loss function to minimize:

J^t = Σ_i J_i^t = Σ_i Σ_m b_{i,m}^t(−x_{i,m})
    = Σ_i Σ_m b_{i,m}^t(−1)^{x*_{i,m}} · b_{i,m}^t(+1)^{1 − x*_{i,m}}

with x*_{i,m} = (x_{i,m} + 1)/2,  x_{i,m} ∈ {−1, +1}

(m indexes the training samples; each term is the belief assigned to the wrong label.)
Algorithm: BP

Beliefs at iteration t:

b_i^t(x_i) = k φ_i(x_i) ∏_{j∈N(i)} m_{j→i}^{t−1}(x_i)

Short-hand for the product of incoming messages: M_i^{t−1}(x_i) = ∏_{j∈N(i)} m_{j→i}^{t−1}(x_i)

Messages are updated from the previous beliefs:

m_{j→i}^t(x_i) = Σ_{x_j∈{−1,+1}} ψ_{ji}(x_j, x_i) · b_j^{t−1}(x_j) / m_{i→j}^{t−1}(x_j)
Algorithm: BP

The local evidence is parameterized by a function F of the input data:

φ_i^t(x_i) = [ e^{F_i^t/2} ; e^{−F_i^t/2} ]

(the two entries correspond to x_i = +1 and x_i = −1)

F: a function of the input data y_i
Algorithm: BP

b_i^t(+1) = σ(F_i^t + G_i^t),   with   σ(u) = 1/(1 + e^{−u})

where G_i^t collects the incoming messages:

G_i^t = log M_i^t(+1) − log M_i^t(−1)

and the per-node loss becomes

log J_i^t = Σ_m log( 1 + e^{−x_{i,m}(F_{i,m}^t + G_{i,m}^t)} )
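To make the F/G notation concrete, here is a small numerical check (my own, with made-up message values) that b(+1) = σ(F + G), with G = log M(+1) − log M(−1), matches the product-form belief.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

F = 0.8                                   # local-evidence "vote" for node i
msgs = np.array([[0.3, 0.7],              # incoming messages m_ji over (x_i=-1, x_i=+1)
                 [0.6, 0.4],
                 [0.2, 0.8]])

M = msgs.prod(axis=0)                     # M_i(x_i) = prod_j m_ji(x_i), order (-1, +1)
G = np.log(M[1]) - np.log(M[0])           # G_i = log M_i(+1) - log M_i(-1)

# product-form belief: b(x_i) ~ phi_i(x_i) * M_i(x_i), with phi = [e^{-F/2}, e^{F/2}]
phi = np.array([np.exp(-F / 2), np.exp(F / 2)])
b = phi * M
b = b / b.sum()

print(b[1], sigmoid(F + G))               # the two expressions for b(+1) agree
```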
Function F

F_i^t(y_{i,m}) = F_i^{t−1}(y_{i,m}) + f_i^t(y_{i,m})

Boosting!

f is the weak learner: weighted decision stumps,

f_i(y) = a·h(y) + b
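A minimal sketch (my own) of such a weighted stump f(y) = a·h(y) + b with h(y) = [y > θ]: the threshold and the coefficients a, b are chosen by weighted least squares on targets Y with per-sample weights w. The data and helper name fit_stump are illustrative.

```python
import numpy as np

def fit_stump(y, Y, w):
    """Fit f(y) = a*[y > theta] + b by weighted least squares on targets Y."""
    best = None
    for theta in np.unique(y):                        # candidate thresholds
        h = (y > theta).astype(float)
        # weighted means on each side of the split give b (h == 0) and a + b (h == 1)
        w0, w1 = w[h == 0], w[h == 1]
        Y0, Y1 = Y[h == 0], Y[h == 1]
        b = Y0 @ w0 / w0.sum() if w0.sum() > 0 else 0.0
        ab = Y1 @ w1 / w1.sum() if w1.sum() > 0 else 0.0
        a = ab - b
        err = np.sum(w * (Y - (a * h + b)) ** 2)
        if best is None or err < best[0]:
            best = (err, a, b, theta)
    return best[1:]                                   # (a, b, theta)

# toy data: 1-D inputs y, regression targets Y, per-sample weights w
rng = np.random.default_rng(0)
y = rng.uniform(-1, 1, size=100)
Y = np.sign(y - 0.2) + 0.1 * rng.normal(size=100)
w = np.ones(100)

a, b, theta = fit_stump(y, Y, w)
print(f"f(y) = {a:.2f} * [y > {theta:.2f}] + {b:.2f}")
```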
Minimization of loss L

log J_i^t = Σ_m log( 1 + e^{−x_{i,m}(F_{i,m}^t + G_{i,m}^t)} )

arg min_{f_i^t} log J_i^t ≈ arg min_{f_i^t} Σ_m w_{i,m}^t ( Y_{i,m}^t − f_i^t(y_{i,m}) )²

where

Y_{i,m}^t = x_{i,m} ( 1 + e^{−x_{i,m}(F_{i,m}^t + G_{i,m}^t)} )

w_{i,m}^t = b_i^t(+1) · b_i^t(−1)
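A small sketch (my own) of computing the weights w and working targets Y above from the current F, G, and labels x; any weak learner that solves the weighted least-squares problem (such as the stump sketched earlier) can then be fit to (Y, w).

```python
import numpy as np

rng = np.random.default_rng(1)
M = 6
x = rng.choice([-1, +1], size=M)          # true labels x_{i,m}
F = rng.normal(size=M)                    # current local-evidence output F^t_{i,m}
G = rng.normal(size=M)                    # current message contribution G^t_{i,m}

b_plus = 1.0 / (1.0 + np.exp(-(F + G)))   # b^t_{i,m}(+1) = sigma(F + G)

w = b_plus * (1.0 - b_plus)               # w^t_{i,m} = b(+1) * b(-1)
Y = x * (1.0 + np.exp(-x * (F + G)))      # Y^t_{i,m} = x (1 + e^{-x(F+G)})

# the next weak learner f is chosen to (approximately) minimise
# sum_m w_m * (Y_m - f(y_m))^2, which decreases the log-loss
# sum_m log(1 + exp(-x_m (F_m + G_m)))
print(np.round(w, 3), np.round(Y, 3))
```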
Local Evidence: algorithm

For t = 1..T
  Iterate Nboost times:
    find the best basis function h
    update the local evidence with F_i^t = F_i^{t−1} + f_i^t
    update the beliefs
    update the weights w_{i,m}^t = b_i^t(+1) · b_i^t(−1)
  Iterate NBP times:
    update the messages
    update the beliefs
Function G

By assuming that the graph is densely connected we can make the approximation

m_{j→i}^{t−1}(+1) / m_{j→i}^{t−1}(−1) ≈ 1

i.e. each individual message is only weakly informative. Now G is a non-linear additive function of the beliefs:

G_i^t = G_i^t(b_m^{t−1})
Function G

Instead of learning the compatibilities ψ_ij, the function G_i^t(b_m^{t−1}) can be learnt with an additive model:

G_{i,m}^t = Σ_{n=1}^{t} g_i^n(b_m^{n−1}),   g_i^t(b_m) = a·(w · b_m > θ) + b

weighted regression stumps
Function G

The weak learner g_i^t is chosen by minimizing the loss:

log J_i^t(b^{t−1}) = Σ_m log( 1 + e^{−x_{i,m}( F_{i,m}^t + g_i^t(b_m^{t−1}) + Σ_{n=1}^{t−1} g_i^n(b_m^{n−1}) )} )
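A minimal sketch (my own) of one such stump g(b) = a·[w·b > θ] + b applied to a vector of neighbour beliefs, and of G accumulated as a sum of stumps; the weights, thresholds, and belief values are all made up.

```python
import numpy as np

def g_stump(b, w, theta, a, b0):
    """One regression stump on a belief vector: g(b) = a*[w.b > theta] + b0."""
    return a * float(w @ b > theta) + b0

# made-up beliefs b(+1) of the neighbouring nodes for one sample m
b_m = np.array([0.9, 0.1, 0.7, 0.4])

# a few stumps "learnt" in previous boosting rounds (parameters are illustrative)
stumps = [
    (np.array([1.0, 0.0, 1.0, 0.0]), 1.2, +0.8, -0.1),
    (np.array([0.0, 1.0, 0.0, 1.0]), 0.4, -0.5, +0.2),
    (np.array([0.25, 0.25, 0.25, 0.25]), 0.5, +0.3, 0.0),
]

# G^t_{i,m} = sum_n g^n_i(b_m): the additive model replaces learning psi_ij directly
G = sum(g_stump(b_m, w, th, a, b0) for (w, th, a, b0) in stumps)
print(G)
```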
The Boosted Random Field Algorithm

For t = 1..T
  find the best basis function h for f
  find the best basis function for g_i^t(b_{N_i,m}^{t−1})
  compute the local evidence
  compute the compatibilities
  update the beliefs
  update the weights
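Below is a toy, self-contained sketch of this joint training loop (entirely my own construction, not the authors' code). The graph is mimicked by letting each node see the mean belief of the other nodes in its group, standing in for the dense-connectivity approximation; the data, the group structure, and the helpers fit_stump and context are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: nodes come in groups with correlated labels; y is a weak local feature.
groups = np.repeat(np.arange(8), 10)                  # 8 groups of 10 nodes
x = np.where(rng.random(8) > 0.5, 1, -1)[groups]      # one dominant label per group
x = np.where(rng.random(x.size) < 0.1, -x, x)         # 10% label noise inside groups
y = 0.6 * x + rng.normal(size=x.size)                 # weak local-evidence feature

def fit_stump(feat, Y, w):
    """Weighted least-squares regression stump f(u) = a*[u > theta] + b0."""
    best = (np.inf, 0.0, 0.0, 0.0)
    for theta in np.quantile(feat, np.linspace(0.05, 0.95, 19)):
        h = (feat > theta).astype(float)
        den0, den1 = w[h == 0].sum() + 1e-12, w[h == 1].sum() + 1e-12
        b0 = (w * Y)[h == 0].sum() / den0
        a = (w * Y)[h == 1].sum() / den1 - b0
        err = np.sum(w * (Y - (a * h + b0)) ** 2)
        if err < best[0]:
            best = (err, a, b0, theta)
    _, a, b0, theta = best
    return lambda u: a * (u > theta) + b0

def context(belief):
    """Mean belief of the other nodes in the same group (dense-graph stand-in)."""
    tot = np.bincount(groups, weights=belief)
    cnt = np.bincount(groups)
    return (tot[groups] - belief) / (cnt[groups] - 1)

F = np.zeros_like(y)
G = np.zeros_like(y)
for t in range(10):
    b = 1.0 / (1.0 + np.exp(-(F + G)))                # beliefs b(+1) = sigma(F + G)
    w = b * (1 - b)                                   # boosting weights
    Y = x * (1 + np.exp(-x * (F + G)))                # working targets
    F = F + fit_stump(y, Y, w)(y)                     # best stump for f (local evidence)
    b = 1.0 / (1.0 + np.exp(-(F + G)))
    w, Y = b * (1 - b), x * (1 + np.exp(-x * (F + G)))
    ctx = context(b)
    G = G + fit_stump(ctx, Y, w)(ctx)                 # best stump for g (compatibility)

b = 1.0 / (1.0 + np.exp(-(F + G)))
print("training accuracy:", np.mean((b > 0.5) == (x == 1)))
```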
Final classifier

For t = 1..T
  update the local evidences F
  update the compatibilities G
  compute the current beliefs

Output classification: x̂_{i,m} = δ(b_{i,m} > 0.5)
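A minimal sketch (my own) of the test-time pass: hypothetical stored weak learners f_rounds and g_rounds are replayed in order, the beliefs are recomputed, and the final labels are the thresholded beliefs.

```python
import numpy as np

# hypothetical stored weak learners from T = 3 training rounds
f_rounds = [lambda y: 0.5 * (y > 0.0), lambda y: 0.3 * (y > 1.0), lambda y: -0.2 * (y > -1.0)]
g_rounds = [lambda b: 0.4 * (b.mean() - 0.5), lambda b: 0.2 * (b.mean() - 0.5), lambda b: 0.1 * (b.mean() - 0.5)]

y = np.array([-1.5, 0.2, 1.7, 0.9])           # test inputs
F = np.zeros_like(y)
G = np.zeros_like(y)
b = np.full_like(y, 0.5)                      # initial beliefs
for f_t, g_t in zip(f_rounds, g_rounds):
    F = F + f_t(y)                            # update local evidences F
    G = G + g_t(b)                            # update compatibilities G from previous beliefs
    b = 1.0 / (1.0 + np.exp(-(F + G)))        # compute current beliefs

labels = (b > 0.5).astype(int)                # x_hat_{i,m} = [b_{i,m} > 0.5]
print(labels)
```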

Multiclass Detection

U: dictionary of ~2000 image patches
V: same number of image masks

At each round t, for each class c and each dictionary entry d there is a weak learner:

v^d(I) = δ( (I ⊗ U_d) ∗ V_d > 0 )

(the image is correlated with patch U_d, the response is spread by mask V_d, then thresholded)
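A rough sketch (my own; the exact correlation, normalisation, and threshold used by the authors may differ) of one such dictionary-based feature using scipy's 2-D correlation and convolution.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
I = rng.random((32, 32))                 # toy grayscale image
U_d = rng.random((5, 5)) - 0.5           # one dictionary patch (template)
V_d = np.ones((7, 7)) / 49.0             # one mask, here a simple local average

# v_d(I): template-match the image with U_d, spread the response with V_d, threshold
response = correlate2d(I, U_d, mode="same")      # I (x) U_d
spread = convolve2d(response, V_d, mode="same")  # ... (*) V_d
v_d = spread > 0                                 # binary feature map per pixel

print(v_d.shape, v_d.mean())
```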
Function f

To take into account different sizes, we first downsample the image and then upsample and OR the scales:

f_{x,y,c}^d(I) = α ∨_s [ v^d(I ↓ s) ↑ s ] + β

(↓ s and ↑ s denote down- and upsampling by a factor s)

which is our function for computing the local evidence.
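A small sketch (my own; the scale set, resampling method, and feature are illustrative) of the multi-scale step: downsample the image, compute the patch feature, upsample the binary map back to full resolution, and OR across scales.

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(1)
I = rng.random((64, 64))
U_d = rng.random((5, 5)) - 0.5
V_d = np.ones((7, 7)) / 49.0

def v_d(img):
    """Single-scale dictionary feature: template match, spread with mask, threshold."""
    return convolve2d(correlate2d(img, U_d, mode="same"), V_d, mode="same") > 0

f_map = np.zeros_like(I, dtype=bool)
for s in (1, 2, 4):                              # illustrative scale factors
    small = I[::s, ::s]                          # downsample by taking every s-th pixel
    det = v_d(small)
    up = np.kron(det.astype(int), np.ones((s, s), dtype=int)).astype(bool)
    f_map |= up[: I.shape[0], : I.shape[1]]      # OR the scales together

# the boosted weak learner is then f = alpha * f_map + beta for learned alpha, beta
alpha, beta = 1.0, -0.5
f = alpha * f_map + beta
print(f.shape)
```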
Function g

The compatibility function has a similar form:

g_{x,y,c}^d(b) = α ( Σ_{c'=1}^{C} b_{x',y',c'} ∗ W_{x',y',c'}^d ) + β

W^d represents a kernel with all the messages directed to node (x, y, c).
Kernels W

Example of incoming messages:
Function G

The overall incoming-messages function is given by:

G_{x,y,c}^t(b) = Σ_{c'=1}^{C} b_{x',y',c'} ∗ ( Σ_n α_n W_{x',y',c'}^n ) + Σ_n β_n
             ≝ Σ_{c'=1}^{C} b_{x',y',c'} ∗ W'_{x',y',c'} + β'
Learning…

Labeled dataset of office and street scenes, ~100 images each.

In the first 5 rounds only the local evidence is updated.
After the 5th iteration the compatibility functions are updated as well.
At each round, only the F and G of the single object class that most reduces the multiclass cost are updated.
Learning…

Biggest objects are detected first
because they reduce the error of all
classes the fastest:
The End
Introduction
Observed: Picture
Dictionary: Dog
P(Dog|Pic)
Introduction
P(Head | Pic_i)
P(Tail | Pic_i)
P(Front Legs | Pic_i)
P(Back Legs | Pic_i)
Introduction
Dog!
Comp(Head, Tail)
Comp(Head, Legs)
Comp(Tail, Legs)
Comp(F. Legs, B. Legs)
Introduction
P(Piranha | Pic_i)
Comp(Piranha, Legs)
Graphical Models
Observation nodes yi
Y

yi can be a pixel or a patch
Graphical Models
Hidden Nodes
X
Dictionary
Local evidence: φ_i(x_i, y_i)
Short-hand: φ_i(x_i)
Graphical Models
X
Compatibility function: ψ_ij(x_i, x_j)