Density Estimation

Non-Parametric Techniques
Generative and Discriminative Methods
Parametric versus Non-parametric
1. Parametric
• assumes densities are uni-modal (have a single local maximum)
• but practical problems often involve multi-modal densities
2. Nonparametric
• handles arbitrary distributions
• without assuming a form for the underlying densities
[Diagram: nonparametric approaches divide into generative and discriminative methods]
Two types of nonparametric methods
1. Generative:
Estimate class-conditional density p(x | ωj )
1. Parzen Windows
2. kn-nearest neighbor estimation
2. Discriminative:
Bypass density estimation and go directly to computing the a posteriori probability P(ωj | x)
1. nearest neighbor rule
2. k-nearest neighbor rule
Histograms are the heart of density estimation
An example of a histogram (of a feature called negative slope)
More Histograms
[Histograms of additional features: entropy, number of black pixels, exterior contours, interior contours]
[Histograms for the horizontal-slope and positive-slope features]
Density Estimation
• Basic idea of estimating an unknown pdf:
• The probability P that a vector x will fall in region R is

$$P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \qquad (1)$$

• P is a smoothed (or averaged) version of the density function p(x)
• We can estimate this smoothed value of p by estimating the probability P
Relating P to histogram value
• If we have a sample of size n, the probability that exactly k of the points fall in R is given by the binomial law:

$$P_k = \binom{n}{k} P^k (1-P)^{n-k} \qquad (2)$$

• and the expected value (mean) of k is:

$$E[k] = nP \qquad (3)$$
Binomial Distribution
$$P_k = \binom{n}{k} P^k (1-P)^{n-k} \qquad (2)$$

• This binomial distribution for k peaks very sharply about the mean, k ≈ nP
• Therefore the ratio k/n is a good estimate for the probability P, and hence for the density function p
• It is especially accurate when n is very large
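To make the sharp-peaking claim concrete, here is a minimal simulation sketch (assuming NumPy; the region probability P = 0.05 and the sample sizes are illustrative choices, not values from the slides):

```python
import numpy as np

# Illustrative values (not from the slides): P is the probability mass of a
# small region R, n is the sample size.
P = 0.05
rng = np.random.default_rng(0)

for n in (100, 1_000, 100_000):
    # Repeatedly simulate "k of the n samples fall in R"
    k = rng.binomial(n, P, size=10_000)
    ratio = k / n
    print(f"n={n:>7}: mean k/n = {ratio.mean():.4f}, std = {ratio.std():.5f}")

# The standard deviation of k/n shrinks like sqrt(P(1-P)/n), so k/n
# concentrates ever more tightly around P as n grows.
```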
Space average of p(x)
• If p(x) is continuous and the region R is so small that p does not
vary significantly within it, we can write:
$$P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \cong p(\mathbf{x})V \qquad (4)$$
• where x is a point within R and
• V the volume enclosed by R
Expression for p(x) from histogram
Combining equations (1), (3) and (4):

$$P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \qquad (1)$$
$$E[k] = nP \qquad (3)$$
$$P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \cong p(\mathbf{x})V \qquad (4)$$

yields

$$p(\mathbf{x}) \cong \frac{k/n}{V}$$
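A minimal numerical sketch of this estimate (assuming NumPy; the Gaussian sample, query point, and bin width are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=10_000)    # n samples from an (unknown) density; here N(0,1)

# Estimate p at a point x0 by counting the samples in a small region
# R = [x0 - h/2, x0 + h/2] of volume V = h (1-D case)
x0, h = 0.5, 0.2                          # query point and region width (illustrative)
k = np.sum(np.abs(x - x0) < h / 2)        # k samples fall in R
p_hat = (k / x.size) / h                  # p(x0) ≈ (k/n) / V
print(p_hat)                              # close to the true N(0,1) density ≈ 0.352
```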
Practical and Theoretical Issues
• The fraction

$$p(\mathbf{x}) \cong \frac{k/n}{V}$$

is a space-averaged value of p(x)
• The exact p(x) is obtained only in the limit V → 0; but if V is made too small with limited data, R may enclose no samples and the estimate becomes p(x) ≅ 0
• Practically, V cannot be allowed to become arbitrarily small, since the number of samples is always limited
• One therefore has to accept a certain amount of variance in the ratio k/n
Formulation when unlimited samples are available
Theoretically, if an unlimited number of samples is
available, we can circumvent this difficulty
To estimate the density at x, we form a sequence of regions
R1, R2,…containing x: the first region contains one sample, the
second two samples and so on.
Let Vn be the volume of Rn, kn the number of samples falling in Rn
and pn(x) be the nth estimate for p(x):
$$p_n(\mathbf{x}) \cong \frac{k_n/n}{V_n} \qquad (7)$$
Necessary conditions for convergence: pn(x) → p(x)

$$p_n(\mathbf{x}) \cong \frac{k_n/n}{V_n}$$

$$1)\ \lim_{n\to\infty} V_n = 0 \qquad 2)\ \lim_{n\to\infty} k_n = \infty \qquad 3)\ \lim_{n\to\infty} k_n/n = 0$$
Two different ways of obtaining sequences of regions that satisfy these
conditions:
(a) Shrink an initial region so that Vn = 1/√n, and show that pn(x) → p(x) as n → ∞.
This is called “the Parzen-window estimation method”
(b) Specify kn as some function of n, such as kn = √n; the volume Vn is
grown until it encloses kn neighbors of x.
This is called “the kn-nearest neighbor estimation method”
Two Methods for Estimating Density
We are estimating density at center of square
Method 1: start with a large volume centered at the point and shrink it according to Vn = 1/√n
Method 2: start with a large volume centered at the point and shrink it so that it always encloses kn = √n samples
Parzen Window Method
• Unit hypercube function
Let ϕ(u) be the following window function:

$$\varphi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le \tfrac{1}{2}, \quad j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}$$

[Figure: unit hypercube centered at the origin in (u1, u2, u3) space]
Parzen Window using hypercubes
• Estimate the density assuming that region Rn is a d-dimensional hypercube:

$$V_n = h_n^d \qquad (h_n: \text{ length of the edge of } \mathcal{R}_n)$$

$$\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) = \begin{cases} 1 & \text{if } \mathbf{x}_i \text{ falls within the hypercube of volume } V_n \text{ centered at } \mathbf{x} \\ 0 & \text{otherwise} \end{cases}$$

• Number of samples in the hypercube:

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$$

• so the estimate becomes

$$p_n(\mathbf{x}) = \frac{k_n/n}{V_n} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$$

[Figure: hypercube of edge hn centered at x; samples xi inside the cube have ϕ = 1, samples outside have ϕ = 0]
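A minimal NumPy sketch of the hypercube Parzen estimate above (the function name, test data, and edge length h are my own illustrative choices):

```python
import numpy as np

def parzen_hypercube(x, samples, h):
    """Parzen-window density estimate at point x using a unit-hypercube window.

    x       : query point, shape (d,)
    samples : training data, shape (n, d)
    h       : edge length h_n of the hypercube
    """
    n, d = samples.shape
    u = (x - samples) / h                       # scaled offsets (x - x_i) / h_n
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi = 1 iff all |u_j| <= 1/2
    k = inside.sum()                            # k_n samples fall inside the cube
    return (k / n) / h**d                       # p_n(x) = (k_n / n) / V_n,  V_n = h^d

# Illustrative use: 2-D standard-normal samples, estimate the density at the origin
rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 2))
print(parzen_hypercube(np.zeros(2), data, h=0.5))   # true value is 1/(2*pi) ≈ 0.159
```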
Properties of Window Function
$$p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right)$$

• pn(x) estimates p(x) as an average of window functions of x and the samples xi
• The window function acts as an interpolation function
• The functions ϕ can be quite general as long as pn(x) is a legitimate density function, i.e., we require

$$\varphi(\mathbf{x}) \ge 0 \quad\text{and}\quad \int \varphi(\mathbf{u})\,d\mathbf{u} = 1$$

• For example, one can use a circularly symmetric Gaussian window of width hn
Effect of window-width on estimate pn(x)
Let

$$\delta_n(\mathbf{x}) = \frac{1}{V_n}\,\varphi\!\left(\frac{\mathbf{x}}{h_n}\right), \quad\text{then}\quad p_n(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(\mathbf{x}-\mathbf{x}_i)$$

• For any hn the distribution is normalized, i.e.,

$$\int \delta_n(\mathbf{x}-\mathbf{x}_i)\,d\mathbf{x} = \frac{1}{V_n}\int \varphi\!\left(\frac{\mathbf{x}-\mathbf{x}_i}{h_n}\right) d\mathbf{x} = \int \varphi(\mathbf{u})\,d\mathbf{u} = 1$$

[Figure: Gaussian windows with decreasing widths and the corresponding Parzen-window estimates using five samples; a large hn gives an out-of-focus (over-smoothed) estimate, a small hn an erratic (noisy) one]
Conditions for Convergence of Estimate
Since we speak of the convergence of a sequence of random variables, pn(x) has a mean and a variance.
Additional criteria are needed for:
(1) well-behavedness of the window function
(2) the volume approaching zero
(3) but at a rate slower than 1/n (i.e., nVn → ∞)
Convergence of Mean and Variance
Mean:
Conclusion: the expected value of the estimate is the convolution of the unknown density and the window function.
Variance:
Conclusion: the variance can be made to vanish by letting Vn shrink slowly, e.g., Vn = V1/√n, or V1/ln n, etc.
Example of Parzen Window on simple cases
$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2} \quad\text{and}\quad h_n = h_1/\sqrt{n}$$

Case where p(x) is N(0,1); analogous univariate and bivariate normal examples are shown in the figures.

$$p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n}\,\varphi\!\left(\frac{x-x_i}{h_n}\right)$$

is an average of normal densities centered at the samples xi. For a single sample,

$$p_1(x) = \varphi(x-x_1) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-x_1)^2} \sim N(x_1, 1)$$

With few samples the contributions of the individual samples are clearly observable.
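A hedged sketch of this Gaussian-window example with hn = h1/√n (assuming NumPy; h1 and the sample sizes are illustrative), showing the estimate approaching the true N(0,1) density as n grows:

```python
import numpy as np

def parzen_gaussian(x, samples, h1):
    """1-D Parzen estimate with a Gaussian window and width h_n = h1 / sqrt(n).
    (A sketch; the names and the h1 value below are illustrative choices.)
    """
    n = samples.size
    hn = h1 / np.sqrt(n)
    u = (x[:, None] - samples[None, :]) / hn         # (x - x_i) / h_n for every query point
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian window values
    return phi.mean(axis=1) / hn                     # (1/n) sum (1/h_n) phi(...)

rng = np.random.default_rng(0)
x_grid = np.array([0.0, 1.0, 2.0])
true = np.exp(-0.5 * x_grid**2) / np.sqrt(2 * np.pi)
for n in (1, 16, 256, 10_000):
    est = parzen_gaussian(x_grid, rng.normal(size=n), h1=1.0)
    print(f"n={n:>6}: estimate {np.round(est, 3)}  true {np.round(true, 3)}")
```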
Parzen Window Estimates of a Bimodal Distribution

• Case where p(x) = λ1 U(a,b) + λ2 T(c,d), a mixture of a uniform and a triangle density

[Figure: grid of Parzen estimates for several window widths (columns) and numbers of samples (rows)]

• As the number of samples becomes large, the estimates are essentially the same for every window width
Classification based on Parzen Windows
We estimate the densities for each category and classify a test point by
the label corresponding to the maximum posterior
[Figure: decision boundaries of a two-dimensional dichotomizer using a small h and a large h; the small h gives a better boundary in the upper region, the large h in the lower region]
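A minimal sketch of such a Parzen-window classifier (assuming NumPy, spherical Gaussian windows, and equal priors; the function names and toy blobs are illustrative, not the dichotomizer from the figure):

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen estimate at x with a spherical Gaussian window of width h (a sketch)."""
    d = samples.shape[1]
    u = (x - samples) / h
    phi = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return phi.mean() / h**d

def classify(x, class_samples, h):
    """Assign x to the class whose Parzen-estimated density is largest (equal priors assumed)."""
    scores = [parzen_density(x, s, h) for s in class_samples]
    return int(np.argmax(scores))

# Illustrative two-class data: Gaussian blobs centered at (0,0) and (2,2)
rng = np.random.default_rng(0)
data = [rng.normal(loc=c, size=(200, 2)) for c in ((0.0, 0.0), (2.0, 2.0))]
print(classify(np.array([0.2, -0.1]), data, h=0.5))   # expected: 0
print(classify(np.array([2.1, 1.8]), data, h=0.5))    # expected: 1
```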
Parzen Window Conclusion
• No assumptions are made ahead of time regarding the distributions
• The same procedure was used for the unimodal and bimodal cases
• With enough samples, convergence is assured
• However, a very large number of samples may be needed
• Also, no data reduction is provided, which leads to severe computation and storage requirements
• The demand for the number of samples grows exponentially with the dimensionality
Probabilistic Neural Networks
Parzen Window Method can be implemented as an artificial
neural network
[Figure: PNN architecture. Input units carry the feature values of patterns; the weight vector w connects inputs to pattern units; the connections aji from pattern units to category units correspond to the class labels of the patterns]
Normalization
• Patterns are normalized (or scaled) to have unit length, i.e.,

$$\sum_{i=1}^{d} x_i^2 = 1$$

• This is done by replacing each feature value by

$$x_j \leftarrow \frac{x_j}{\left(\sum_{i=1}^{d} x_i^2\right)^{1/2}}$$

• Effect of normalization: $\mathbf{x}^t\mathbf{x} = 1$
Normalization Example
• Normalize x = [3 4]^t
• Norm: (9 + 16)^{1/2} = 5
• Normalized vector: [3/5 4/5]^t = [0.6 0.8]^t
• Effect of normalization:

$$\sum_{i=1}^{d} x_i^2 = 0.36 + 0.64 = 1, \qquad \mathbf{x}^t\mathbf{x} = [0.6\;\; 0.8]\begin{bmatrix}0.6\\0.8\end{bmatrix} = 1$$
Activation Function
$$\varphi\!\left(\frac{\mathbf{x}-\mathbf{w}_k}{h_n}\right) \propto e^{-(\mathbf{x}-\mathbf{w}_k)^t(\mathbf{x}-\mathbf{w}_k)/2\sigma^2}$$

• This is the contribution of the sample with weight vector wk to the estimate at x; here the weights correspond to the feature values of the sample

$$e^{-(\mathbf{x}-\mathbf{w}_k)^t(\mathbf{x}-\mathbf{w}_k)/2\sigma^2} = e^{-(\mathbf{x}^t\mathbf{x} + \mathbf{w}_k^t\mathbf{w}_k - 2\mathbf{x}^t\mathbf{w}_k)/2\sigma^2} = e^{(net_k - 1)/\sigma^2}, \qquad net_k = \mathbf{w}_k^t\mathbf{x}$$

• The simplified form is due to normalization: $\mathbf{x}^t\mathbf{x} = \mathbf{w}_k^t\mathbf{w}_k = 1$
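A quick numeric check of this simplification (assuming NumPy; σ and the random unit vectors are illustrative):

```python
import numpy as np

# Numeric check of the simplification above for unit-length vectors
# (sigma and the example vectors are illustrative choices).
rng = np.random.default_rng(0)
sigma = 0.5
x = rng.normal(size=4);  x /= np.linalg.norm(x)    # normalized input pattern
w = rng.normal(size=4);  w /= np.linalg.norm(w)    # normalized stored pattern (weight vector)

full = np.exp(-np.dot(x - w, x - w) / (2 * sigma**2))   # e^{-(x-w)^t(x-w)/2 sigma^2}
net = np.dot(w, x)                                      # net_k = w^t x
simplified = np.exp((net - 1) / sigma**2)               # e^{(net_k - 1)/sigma^2}
print(full, simplified)                                 # the two values agree
```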
PNN Training Algorithm
• Over all samples j:
  • normalize the feature values
  • set the weights equal to the feature values (over all features k)
  • set to 1 only the connection corresponding to the sample's class label
• The aji correspond to making connections between labeled samples and the corresponding classes
PNN Classification Algorithm
• Made possible by an appropriate choice of activation function
• The output for each class is the sum of the activation functions corresponding to the labeled samples of that class
Parzen/PNN Classifier Stage 1
• Inputs: d normalized features
• Weights correspond to the feature values of the n labeled samples: wij = xij, i = 1,…,n; j = 1,…,d
• Input units are fully connected to the n sample units

$$net_k = \mathbf{w}_k^t\mathbf{x}, \qquad k = 1,\dots,n$$

[Figure: input units x1,…,xd fully connected through weights w11,…,wnd to the labeled sample units p1,…,pn]
PNN Stage 2: outputs are sparsely connected

Each of the n sample (pattern) units p1,…,pn connects only to the category unit among ω1,…,ωc that matches its class label.

Activation functions (nonlinear functions):

$$e^{(net_k - 1)/\sigma^2}$$
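A minimal end-to-end sketch of the two-stage PNN forward pass described above (assuming NumPy; the function name, σ, and the toy data are illustrative choices):

```python
import numpy as np

def pnn_classify(x, patterns, labels, num_classes, sigma):
    """Two-stage PNN forward pass (a sketch; names and sigma are illustrative).

    patterns : (n, d) stored training patterns, already normalized to unit length
    labels   : (n,) class index of each stored pattern
    """
    x = x / np.linalg.norm(x)                 # normalize the test pattern
    net = patterns @ x                        # stage 1: net_k = w_k^t x for every pattern unit
    act = np.exp((net - 1.0) / sigma**2)      # nonlinear activation e^{(net_k - 1)/sigma^2}
    scores = np.zeros(num_classes)
    np.add.at(scores, labels, act)            # stage 2: each pattern unit feeds only its own class unit
    return int(np.argmax(scores))

# Illustrative use with two classes of normalized 2-D patterns
rng = np.random.default_rng(0)
raw = np.vstack([rng.normal((1, 0), 0.1, (50, 2)), rng.normal((0, 1), 0.1, (50, 2))])
pats = raw / np.linalg.norm(raw, axis=1, keepdims=True)
labs = np.array([0] * 50 + [1] * 50)
print(pnn_classify(np.array([0.9, 0.1]), pats, labs, 2, sigma=0.3))   # expected: 0
```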
kn-Nearest neighbor estimation
• Parzen disadvantage: the "best" window function is unknown
• Let the cell volume be a function of the training data
• Center a cell about x and grow it until it captures kn samples, where kn is a function of n
• These samples are the kn nearest neighbors of x

[Figure: kn-nearest-neighbor cells around a point for n = 1, 4, 9, 16, and 100 samples]
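A minimal 1-D sketch of the kn-nearest-neighbor density estimate (assuming NumPy; the kn = √n choice follows the slides, the function name and test data are illustrative):

```python
import numpy as np

def knn_density(x, samples, k):
    """1-D k-nearest-neighbor density estimate: p_n(x) ≈ k / (n * V_n),
    where V_n is the length of the smallest interval around x holding k samples.
    (A sketch; the function name and data below are illustrative.)
    """
    n = samples.size
    dist = np.sort(np.abs(samples - x))
    r = dist[k - 1]              # distance to the k-th nearest neighbor
    V = 2 * r                    # 1-D "volume" of the cell centered at x
    return k / (n * V)

rng = np.random.default_rng(0)
data = rng.normal(size=4096)
k = int(np.sqrt(data.size))      # the k_n = sqrt(n) choice mentioned in the slides
print(knn_density(0.0, data, k)) # true N(0,1) density at 0 is about 0.399
```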
Choice of kn
• If the density is high near x, the cell will be small, which provides good resolution
• If the density is low, the cell will grow large, stopping only when it reaches regions of higher density
• Requirement for convergence: lim kn = ∞ and lim kn/n = 0 as n → ∞
• We can obtain a family of estimates by setting kn = k1√n and choosing different values for k1
Examples of estimated pdfs using the k-nearest-neighbor estimate

[Figure: kn-nearest-neighbor density estimates in one dimension (k = 3 and 5) and two dimensions (k = 5)]

Discontinuities in the estimated densities occur away from the sample points.
Comparison of Parzen and k-NN estimates in one dimension

For k-nearest-neighbor estimation with n = 1 and kn = √n = 1, the estimate becomes

$$p_n(x) = \frac{k_n}{n\,V_n} = \frac{1}{V_1} = \frac{1}{2|x-x_1|}$$

which is clearly a poor estimate (it diverges at x1 and does not integrate to one), whereas the Parzen window yields a smooth, normalized estimate.

[Figure: k-nn estimate and Parzen estimate for a single sample]
Examples of kn-Nearest-Neighbor Estimation
[Figure: kn-nearest-neighbor estimates of a bimodal Gaussian; the estimates are spiky when there are few samples]
Estimation of a posteriori probabilities

• Place a cell of volume V around x; if ki of the k samples it captures are labeled ωi, the joint density (the product of the class-conditional density and the prior) is estimated as

$$p_n(\mathbf{x}, \omega_i) = \frac{k_i/n}{V}$$

• so the estimate of the a posteriori probability is

$$P_n(\omega_i \mid \mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k}$$
• ki/k is the fraction of the samples within the cell that are labeled ωi
• For minimum error rate, the most frequently represented category within the cell is selected
• If k is large and the cell sufficiently small, the performance will approach the best possible
• The size of the cell can be chosen using either the Parzen-window or the kn-nearest-neighbor approach
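A minimal sketch of this posterior estimate and the resulting classification rule (assuming NumPy and Euclidean distance; function names and toy data are illustrative):

```python
import numpy as np

def knn_posteriors(x, samples, labels, k, num_classes):
    """Estimate P(omega_i | x) as k_i / k from the k nearest training samples.
    (A sketch; the distance metric and toy data below are illustrative.)
    """
    dist = np.linalg.norm(samples - x, axis=1)
    nearest = np.argsort(dist)[:k]                    # indices of the k nearest neighbors
    counts = np.bincount(labels[nearest], minlength=num_classes)
    return counts / k                                 # k_i / k for each class

# Illustrative two-class data; classify by the most frequent label in the cell
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 1.0, (300, 2)), rng.normal((3, 3), 1.0, (300, 2))])
y = np.array([0] * 300 + [1] * 300)
post = knn_posteriors(np.array([2.5, 2.5]), X, y, k=25, num_classes=2)
print(post, "-> predicted class", int(np.argmax(post)))
```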