Learning to Detect Objects in Images via a
Sparse, Part-Based Representation
S. Agarwal, A. Awan and D. Roth
IEEE Transactions on Pattern
Analysis and Machine Intelligence
Antón Escobedo cse252c
Outline
- Introduction
- Problem Specification
- Related Work
- Overview of the Approach
- Evaluation
- Experimental Results and Analysis
- Conclusion and Future Scope
Introduction
- Goal: automatic detection of objects in images
- Different objects belonging to the same category can vary widely in appearance
- A successful object detection system must cope with this within-class variation
- Proposed solution: a sparse, part-based representation of the object class
- A part-based representation is computationally efficient and has its roots in biological vision
Problem Specification
- Input: an image
- Output: a list of locations at which instances of the object class are detected in the image
- The experiments are performed on images of side views of cars, but the approach can be applied to any object that consists of distinguishable parts arranged in a relatively fixed spatial configuration
- The present problem is a "detection" problem rather than a simple "classification" problem
Previous Related Work
- Raw pixel intensities
- Global image representations
- Local features
- Part-based representations using hand-labeled features
Algorithm Overview
Four stages:
1. Vocabulary Construction: building a vocabulary of parts that will represent objects
2. Image Representation: input images are represented in terms of binary feature vectors
3. Learning a Classifier: two target classes, +feature vector (object) and -feature vector (non-object)
4. Detection Hypothesis Using the Learned Classifier:
   - Classifier activation map for the single-scale case
   - Classifier activation pyramid for multiscale cases
Vocabulary Construction
- Extraction of interest points using the Förstner interest operator
- Experiments were carried out on 50 representative images of size 100 x 40 pixels; a total of 400 patches, each of size 13 x 13 pixels, were extracted
- To facilitate learning, a bottom-up clustering procedure was adopted, with the similarity between patches measured by normalized correlation:

  similarity(p1, p2) = E(p1 p2) / sqrt(E(p1^2) E(p2^2))

- The similarity between two clusters C1 and C2 is finally measured by the average similarity between their respective patches:

  similarity(C1, C2) = (1 / (|C1| |C2|)) * Σ_{p1∈C1} Σ_{p2∈C2} similarity(p1, p2)
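The two similarity measures above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the 13 x 13 patches would come from the Förstner interest operator.

```python
import numpy as np

def patch_similarity(p1, p2):
    """Normalized correlation between two patches:
    E(p1*p2) / sqrt(E(p1^2) * E(p2^2))."""
    num = np.mean(p1 * p2)
    den = np.sqrt(np.mean(p1 ** 2) * np.mean(p2 ** 2))
    return num / den

def cluster_similarity(c1, c2):
    """Average pairwise similarity between the patches of two clusters:
    (1 / (|C1| |C2|)) * sum over all cross-cluster patch pairs."""
    total = sum(patch_similarity(p1, p2) for p1 in c1 for p2 in c2)
    return total / (len(c1) * len(c2))
```

Bottom-up clustering would then repeatedly merge the pair of clusters with the highest `cluster_similarity`.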
Vocabulary Construction
[Figures: Förstner operator applied to a sample image; sample patches; clusters from sample patches]
Image Representation
- For each patch q in an image, a similarity-based indexing into the part vocabulary is performed using:

  similarity(P, q) = (1 / |P|) * Σ_{p∈P} similarity(p, q)

- For each highlighted patch q, the most similar vocabulary part P*(q) is given by:

  P*(q) = argmax_P similarity(P, q)
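The matching step can be sketched as follows. This is a hypothetical illustration: the acceptance threshold value is an assumption (the slide does not give it), and a vocabulary part is represented simply as the list of patches in its cluster.

```python
import numpy as np

def part_similarity(part, q):
    """similarity(P, q): average normalized correlation between patch q
    and the patches grouped into vocabulary part P."""
    def ncc(p1, p2):
        return np.mean(p1 * p2) / np.sqrt(np.mean(p1 ** 2) * np.mean(p2 ** 2))
    return sum(ncc(p, q) for p in part) / len(part)

def best_part(vocabulary, q, threshold=0.75):
    """P*(q) = argmax_P similarity(P, q); returns the index of the most
    similar part, or None when no part clears the (assumed) threshold."""
    scores = [part_similarity(part, q) for part in vocabulary]
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```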
Image Representation: Feature Vector
- Spatial relations among the parts detected in an image are defined in terms of distance (5 bins) and direction (8 ranges of 45 degrees each), giving 20 possible relations between 2 parts
- 2-6 parts per positive window
- Each 100 x 40 training image is represented as a feature vector with 290 elements:
  - Pn(i): ith occurrence of a part of type n in the image (1 ≤ n ≤ 270; n is a particular part-cluster)
  - Rm(j)(Pn1, Pn2): jth occurrence of relation Rm between a part of type n1 and a part of type n2 (1 ≤ m ≤ 20; m is a distance-direction combination)
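One way to quantize a pairwise relation into one of the 20 distance-direction combinations is sketched below. The bin edges are illustrative (the slide does not give them), and folding the 8 direction ranges into 4 by canonically ordering the pair is an assumption about how 5 x 4 = 20 combinations arise.

```python
import math

def relation_index(p1, p2, dist_edges=(20, 40, 60, 80)):
    """Map the spatial relation between two detected parts to one of 20
    distance-direction combinations: 5 distance bins x 4 direction
    ranges of 45 degrees.  `p1`, `p2` are (x, y) part centers; the
    distance bin edges are placeholder values."""
    (x1, y1), (x2, y2) = sorted([p1, p2])          # canonical ordering
    d = math.hypot(x2 - x1, y2 - y1)
    dist_bin = sum(d > e for e in dist_edges)      # 0..4
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    dir_bin = int(angle // 45) % 4                 # 0..3
    return dist_bin * 4 + dir_bin                  # 0..19
```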
Learning a Classifier
- Train the classifier using 1,000 labeled images, each 100 x 40 pixels in size
  - No synthetic training images
- Positive examples: various cars with varied backgrounds
- Negative examples: natural scenes such as buildings and roads
- High dimensionality of the feature vector: 270 part types, 20 relations, repeated occurrences
- Use of the Sparse Network of Winnows (SNoW) learning architecture
- Winnow: to reduce in number until only the best are left
SNoW: a sparse network of linear units over a Boolean or real-valued feature space
- Target nodes compute activations over the input layer (= feature layer)
- Edges are allocated dynamically
- Trained from a set of examples e, each represented as a list of active features
SNoW: Predicted target t* for example e

  t*(e) = argmax_{t∈T} σ(θ_t, Ω_t(e))

where Ω_t(e) is the activation calculated by summation over the active features for target node t, and

  σ(θ, Ω) = 1 / (1 + e^(θ − Ω))

is a learning-algorithm-specific sigmoid function whose transition from an output close to 0 to an output close to 1 centers around θ.
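The prediction rule can be sketched as below. The weight tables and θ value are placeholders; in the real architecture each target node only has edges to the features observed with it.

```python
import math

def snow_predict(example, weights, theta=3.5):
    """t*(e) = argmax over targets t of sigma(theta, omega_t(e)), where
    the activation omega_t(e) sums the weights of the example's active
    features linked to t, and sigma(theta, omega) is the sigmoid
    1 / (1 + exp(theta - omega))."""
    def confidence(t):
        omega = sum(weights[t].get(f, 0.0) for f in example)
        return 1.0 / (1.0 + math.exp(theta - omega))
    return max(weights, key=confidence)
```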
SNoW: Basic Learning Rules
- Several weight update rules can be used; the update rules are variations of Winnow and Perceptron
- Winnow update rule: the number of examples required to learn a linear function grows linearly with the number of relevant features and only logarithmically with the total number of features
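A minimal sketch of the mistake-driven Winnow update for a single target node follows. The default parameters match the training example on the next slide; starting newly allocated feature weights at 1 is an assumption.

```python
def winnow_update(w, active, positive, theta=3.5, alpha=2.0, beta=0.5):
    """One Winnow step for a single target node.  `w` maps feature id to
    weight (new features start at weight 1), `active` lists the example's
    active features, `positive` says whether the node should fire.
    Mistake-driven: promote active weights by alpha on a missed positive,
    demote them by beta on a false alarm."""
    activation = sum(w.get(f, 1.0) for f in active)
    if positive and activation < theta:          # missed positive: promote
        for f in active:
            w[f] = w.get(f, 1.0) * alpha
    elif not positive and activation >= theta:   # false alarm: demote
        for f in active:
            w[f] = w.get(f, 1.0) * beta
```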
A Training Example
[Worked example on the slide: training examples are given as a target followed by its active features (e.g. "1, 1001, 1006:"), with feature nodes 1001-1009 and their edges to the three target nodes allocated dynamically as they first appear]
Update rule: Winnow with α = 2, β = ½, θ = 3.5
Detection Hypothesis Using the Learned Classifier
- Classifier activation map for the single-scale case
  - Neighborhood Suppression: based on non-maximum suppression
  - Repeated Part Elimination: a greedy algorithm that uses windows around the highest activation points
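A greedy sketch in the spirit of the Repeated Part Elimination idea: repeatedly take the highest remaining activation above the threshold as a detection, then suppress everything a window around it would cover. The window geometry and inputs are illustrative; the activation map itself would come from the learned classifier.

```python
import numpy as np

def greedy_detections(activation, threshold, window=(40, 100)):
    """Extract detections from a 2-D classifier activation map.
    `window` is the (height, width) of the 100 x 40 detection window."""
    act = activation.astype(float).copy()
    h, w = window
    detections = []
    while True:
        r, c = np.unravel_index(int(np.argmax(act)), act.shape)
        if act[r, c] < threshold:
            break
        detections.append((r, c))
        # suppress the neighborhood covered by a window at (r, c)
        act[max(0, r - h // 2):r + h // 2 + 1,
            max(0, c - w // 2):c + w // 2 + 1] = -np.inf
    return detections
```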
Detection: Classifier Activation Pyramid
- Scale the input image a number of times to form a multi-scale image pyramid
- Apply the learned classifier to fixed-size windows in each image in the pyramid
- Form a three-dimensional classifier activation pyramid instead of the earlier two-dimensional classifier activation map
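The pyramid construction can be sketched as below. The scale factors, the nearest-neighbor rescaling, and the stand-in `classify` callable are all illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rescale(img, s):
    """Nearest-neighbor rescaling by factor s (kept simple on purpose)."""
    rows = (np.arange(int(img.shape[0] * s)) / s).astype(int)
    cols = (np.arange(int(img.shape[1] * s)) / s).astype(int)
    return img[np.ix_(rows, cols)]

def activation_pyramid(image, classify, scales=(1.0, 0.8, 0.64), win=(40, 100)):
    """Slide a fixed 100 x 40 window over each rescaled copy of the image
    and record the classifier's activation, yielding one 2-D map per
    scale; together the maps form the 3-D activation pyramid."""
    wh, ww = win
    pyramid = []
    for s in scales:
        img = rescale(image, s)
        rows = max(img.shape[0] - wh + 1, 0)
        cols = max(img.shape[1] - ww + 1, 0)
        amap = np.zeros((rows, cols))
        for r in range(rows):
            for c in range(cols):
                amap[r, c] = classify(img[r:r + wh, c:c + ww])
        pyramid.append(amap)
    return pyramid
```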
Evaluation Criteria
- Test Set I consists of 170 images containing 200 cars of the same size and is used for the single-scale case. For each car in the test images, the location of the best 100 x 40 window containing the car is determined.
- Test Set II consists of 108 images containing 139 cars of different sizes and is used for the multi-scale case. For each car in the test images, the location and scale of the best 100 x 40 window containing the car are determined.
Performance Measures
- The goal is to maximize the number of correct detections and minimize the number of false detections.
- One method for expressing the trade-off between correct and false detections is the receiver operating characteristic (ROC) curve, which plots the true positive rate vs. the false positive rate:

  True positive rate = TP / nP   (true positives over total positives in the data set)
  False positive rate = FP / nN   (false positives over total negatives in the data set)

- This measures the accuracy of the system as a "classifier" rather than a "detector".
Performance Measures (contd.)
- We are really interested in knowing how many of the objects the system detects (given by recall) and how often its detections are false (given by 1 − precision). This trade-off is captured very accurately by the recall vs. (1 − precision) curve, where

  Recall = TP / nP;   Precision = TP / (TP + FP), so 1 − Precision = FP / (TP + FP)

- The threshold parameter that achieves the best trade-off between the two quantities is measured by the point of highest F-measure, where

  F-measure = 2 · Recall · Precision / (Recall + Precision)
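The three measures follow directly from the detection counts:

```python
def detection_measures(tp, fp, n_pos):
    """Recall, precision, and F-measure from raw detection counts:
    tp true positives, fp false positives, n_pos objects in the set."""
    recall = tp / n_pos
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure
```

For instance, counts of TP = 153 and FP = 44 over the 200 cars of Test Set I give roughly (0.765, 0.777, 0.771), consistent with the 0.85-threshold row of the single-scale results that follow.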
Experimental Results

Single-scale detection with Neighborhood Suppression Algorithm (values in %):

  Activation Threshold | Recall (R) = TP/200 | Precision (P) = TP/(TP+FP) | F-measure = 2RP/(R+P)
  0.40                 | 84.5                | 54.69                      | 66.40
  0.85                 | 76.5                | 77.66                      | 77.08
  0.9995               |  4.0                | 100                        |  7.69

Single-scale detection with Repeated Part Elimination Algorithm (values in %):

  Activation Threshold | Recall (R) = TP/200 | Precision (P) = TP/(TP+FP) | F-measure = 2RP/(R+P)
  0.20                 | 91.5                | 24.73                      | 38.94
  0.85                 | 72.5                | 81.46                      | 76.72
  0.995                |  4.0                | 100                        |  7.69
Experimental Results (contd.)

Multi-scale detection with Neighborhood Suppression Algorithm (values in %):

  Activation Threshold | Recall (R) = TP/139 | Precision (P) = TP/(TP+FP) | F-measure = 2RP/(R+P)
  0.65                 | 50.36               | 24.56                      | 33.02
  0.95                 | 38.85               | 49.09                      | 43.37
  0.9999               |  2.88               | 100                        |  5.59

Multi-scale detection with Repeated Part Elimination Algorithm (values in %):

  Activation Threshold | Recall (R) = TP/139 | Precision (P) = TP/(TP+FP) | F-measure = 2RP/(R+P)
  0.20                 | 80.58               |  8.43                      | 15.27
  0.95                 | 39.57               | 49.55                      | 44.0
  0.9999               |  2.88               | 100                        |  5.59
Some Graphical Results
[Figures omitted]

Analysis (figures omitted):
A. Performance of Interest Operator
B. Performance of Part Matching Process
C. Performance of Learned Classifier
Conclusion
- Automatic vocabulary construction from sample images
- Methodologies for object detection:
  - Building a detector from a classifier
  - Standardizing the evaluation criteria
- Works well for objects with distinguishable parts in a relatively fixed spatial configuration
Questions?
Slides adapted from http://www.cs.uga.edu/~ananda/ML_Talk.ppt
and http://l2r.cs.uiuc.edu/~cogcomp/tutorial/SNoW.ppt