Non-Parametric Techniques: Generative and Discriminative Methods for Density Estimation
CSE 555 (Srihari)

Parametric versus Non-parametric
1. Parametric
   • Parametric densities are uni-modal: they have a single local maximum
   • Practical problems often involve multi-modal densities
2. Nonparametric
   • Can represent arbitrary distributions
   • No form is assumed for the underlying densities
   • Methods may be generative or discriminative

Two types of nonparametric methods
1. Generative: estimate the class-conditional density p(x | ωj)
   1. Parzen windows
   2. kn-nearest-neighbor estimation
2. Discriminative: bypass density estimation and directly estimate the a posteriori probability P(ωj | x)
   1. Nearest-neighbor rule
   2. k-nearest-neighbor rule

Histograms are the heart of density estimation
[Figures: example histograms of character features such as negative slope, entropy, number of black pixels, exterior contours, interior contours, horizontal slope and positive slope]

Density Estimation
• Basic idea for estimating an unknown pdf: the probability P that a vector x falls in a region R is
      P = \int_R p(x') \, dx'                                  (1)
• P is a smoothed (or averaged) version of the density function p(x)
• We can therefore estimate this smoothed value of p by estimating the probability P

Relating P to the histogram value
• If we have a sample of size n, the probability that exactly k of the points fall in R is given by the binomial law
      P_k = \binom{n}{k} P^k (1 - P)^{n-k}                     (2)
• and the expected value (mean) of k is
      E[k] = nP                                                (3)

Binomial distribution
• The binomial distribution for k peaks very sharply about the mean k ≈ nP
• Therefore the ratio k/n is a good estimate for the probability P, and hence for the smoothed density
• The estimate is especially accurate when n is very large

Space average of p(x)
• If p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write
      P = \int_R p(x') \, dx' \approx p(x)\, V                 (4)
• where x is a point within R and V is the volume enclosed by R

Expression for p(x) from the histogram
Combining equations (1), (3) and (4) yields
      p(x) \approx \frac{k/n}{V}

Practical and Theoretical Issues
• The fraction (k/n)/V is a space-averaged value of p(x)
• The true p(x) is obtained only in the limit as V approaches zero; but with a fixed, finite sample, a vanishingly small region eventually contains no samples and the estimate collapses to p(x) ≈ 0
• Practically, V cannot be allowed to become arbitrarily small, since the number of samples is always limited
• One must therefore accept a certain amount of variance in the ratio k/n

Formulation when unlimited samples are available
Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty. To estimate the density at x, we form a sequence of regions R1, R2, ... containing x: the first region is used with one sample, the second with two samples, and so on.
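To make the relation p(x) ≈ (k/n)/V concrete, here is a minimal NumPy sketch (not part of the original slides) that counts how many of the n samples fall inside a hypercube region of edge h centered at a query point and divides by n and the region volume. The function name region_density_estimate and the choices h = 0.5 and n = 1000 are illustrative assumptions.

```python
import numpy as np

def region_density_estimate(x, samples, h):
    """Estimate p(x) as (k/n)/V using a fixed hypercube region of edge h centered at x.

    x       : (d,) query point
    samples : (n, d) training samples drawn from the unknown density
    h       : edge length of the hypercube region R
    """
    x = np.atleast_1d(x)
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    # k = number of samples falling inside the hypercube centered at x
    inside = np.all(np.abs(samples - x) <= h / 2.0, axis=1)
    k = inside.sum()
    V = h ** d                       # volume of the region
    return (k / n) / V               # p(x) ~ (k/n) / V

# Example: 1-D standard normal samples, estimate the density at x = 0
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 1))
print(region_density_estimate(0.0, data, h=0.5))   # should come out near 0.40
```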
Let Vn be the volume of Rn, kn the number of samples falling in Rn, and pn(x) the n-th estimate of p(x):
      p_n(x) \approx \frac{k_n/n}{V_n}                         (7)

Necessary conditions for convergence pn(x) → p(x)
1) \lim_{n\to\infty} V_n = 0
2) \lim_{n\to\infty} k_n = \infty
3) \lim_{n\to\infty} k_n/n = 0

There are two different ways of obtaining sequences of regions that satisfy these conditions:
(a) Shrink an initial region, for example by specifying the volume V_n = 1/\sqrt{n}, and show that p_n(x) → p(x) as n → ∞. This is the Parzen-window estimation method.
(b) Specify kn as some function of n, such as k_n = \sqrt{n}, and grow the volume Vn until it encloses the kn nearest neighbors of x. This is the kn-nearest-neighbor estimation method.

Two methods for estimating density
[Figure: density estimated at the center of a square cell]
• Method 1 (Parzen): start with a large volume centered at the point and shrink it according to V_n = 1/\sqrt{n}
• Method 2 (kn-NN): start with a small volume centered at the point and grow it until it contains k_n = \sqrt{n} samples

Parzen Window Method
• Unit hypercube window function. Let φ(u) be the window function
      \varphi(u) = \begin{cases} 1 & |u_j| \le 1/2, \; j = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}
• φ(u) defines a unit hypercube centered at the origin in the space (u1, ..., ud)

Parzen window using hypercubes
• Estimate the density assuming that the region Rn is a d-dimensional hypercube of edge hn:
      V_n = h_n^d
• \varphi\!\left(\frac{x - x_i}{h_n}\right) = 1 if xi falls within the hypercube of volume Vn centered at x, and 0 otherwise
• Number of samples in the hypercube:
      k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)
• Hence
      p_n(x) = \frac{k_n/n}{V_n} = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)

Properties of the window function
• pn(x) estimates p(x) as an average of window functions of x and the samples xi; the window acts as an interpolation function
• The functions φ can be quite general as long as pn(x) is a legitimate density, i.e., we require \varphi(u) \ge 0 and \int \varphi(u)\, du = 1
• In particular, one can use a circularly symmetric Gaussian window with width parameter hn

Effect of window width on the estimate pn(x)
• Let \delta_n(x) = \frac{1}{V_n}\varphi\!\left(\frac{x}{h_n}\right); then p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \delta_n(x - x_i)
• For any hn the distribution is normalized: \int \delta_n(x - x_i)\, dx = 1
[Figures: Gaussian windows with decreasing widths, and Parzen-window estimates from five samples; a wide window gives an out-of-focus estimate, a narrow window an erratic one]

Conditions for convergence of the estimate
• Since we speak of convergence of a sequence of random variables, pn(x) has a mean and a variance
• Additional criteria: (1) the window function must be well behaved (bounded, vanishing sufficiently fast at infinity), (2) the volume Vn must approach zero, (3) but at a rate slower than 1/n, so that n V_n → ∞

Convergence of mean and variance
• Mean: the expected value of the estimate is the convolution of the unknown density with the window function
• Variance: the variance can be driven to zero by letting Vn shrink more slowly than 1/n, e.g. V_n = V_1/\sqrt{n} or V_1/\ln n

Example of the Parzen window on simple cases
• Gaussian window \varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2} with h_n = h_1/\sqrt{n}
• Case where p(x) is N(0,1):
      p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)
  is an average of normal densities centered at the samples xi
• For n = 1 (and h1 = 1): p_1(x) = \varphi(x - x_1) = \frac{1}{\sqrt{2\pi}} e^{-(x - x_1)^2/2}, i.e. N(x1, 1); the contribution of the single sample is clearly visible
[Figures: univariate and bivariate normal examples]

Parzen window estimates of a bimodal distribution
• Case where p(x) = λ1 U(a,b) + λ2 T(c,d), a mixture of a uniform and a triangle density
• Estimates are shown for several window widths and numbers of samples; as the number of samples becomes large, the estimates agree for every window width
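The Parzen estimate p_n(x) = (1/n) Σ_i (1/V_n) φ((x - x_i)/h_n) with a Gaussian window and the schedule h_n = h_1/√n can be sketched in a few lines of NumPy. This is an illustrative sketch, not the slides' own code; the function name parzen_estimate and the value h1 = 5.0 are assumptions.

```python
import numpy as np

def parzen_estimate(x, samples, h1):
    """Parzen-window density estimate with a spherical Gaussian window.

    Uses the window-width schedule h_n = h1 / sqrt(n), so the window
    shrinks as more samples become available.
    """
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    hn = h1 / np.sqrt(n)
    Vn = hn ** d
    # Gaussian kernel phi(u) = (2*pi)^(-d/2) * exp(-||u||^2 / 2), which integrates to 1
    u = (np.atleast_1d(x) - samples) / hn
    phi = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2 * np.pi) ** (d / 2)
    # p_n(x) = (1/n) * sum_i (1/Vn) * phi((x - x_i)/hn)
    return phi.sum() / (n * Vn)

# Example: estimate an N(0,1) density from 500 samples at a few query points
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 1))
xs = np.linspace(-3, 3, 7)
print([round(float(parzen_estimate(x, data, h1=5.0)), 3) for x in xs])
```

As on the slides, a larger h1 gives a smoother (possibly out-of-focus) estimate, while a smaller h1 gives a spikier, more erratic one.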
Classification based on Parzen windows
• We estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior
[Figure: decision boundaries of a two-dimensional dichotomizer using a small h and a large h; the small h is better in the upper region, the large h in the lower region]

Parzen window conclusions
• No assumptions are made ahead of time about the form of the distributions
• The same procedure is used for unimodal and bimodal cases
• With enough samples we are assured of convergence
• However, a very large number of samples may be needed
• No data reduction is provided, which leads to severe computation and storage requirements
• The demand for the number of samples grows exponentially with the dimensionality of the feature space

Probabilistic Neural Networks (PNN)
• The Parzen window method can be implemented as an artificial neural network
• The input layer holds the feature values of a test pattern, the weight vectors w store the feature values of the training patterns, and the connections from the pattern units to the category units correspond to the class labels of the patterns

Normalization
• Patterns are normalized (scaled) to have unit length, i.e.
      \sum_{i=1}^{d} x_i^2 = 1
• This is done by replacing each feature value by
      x_j \leftarrow \frac{x_j}{\left(\sum_{i=1}^{d} x_i^2\right)^{1/2}}
• Effect of normalization: x^t x = 1

Normalization example
• Normalize x = [3, 4]^t: (9 + 16)^{1/2} = 5, so the normalized vector is [3/5, 4/5]^t = [0.6, 0.8]^t
• Check: \sum_i x_i^2 = 0.36 + 0.64 = 1, i.e. x^t x = [0.6, 0.8][0.6, 0.8]^t = 1

Activation function
• The contribution of a stored sample (weight vector wk) at a test point x is a Gaussian window
      \varphi\!\left(\frac{x - w_k}{h_n}\right) \propto e^{-(x - w_k)^t (x - w_k)/2\sigma^2}
• Expanding the exponent and using the normalization x^t x = w_k^t w_k = 1 gives the simplified form
      e^{-(x^t x + w_k^t w_k - 2 x^t w_k)/2\sigma^2} = e^{(\mathrm{net}_k - 1)/\sigma^2}, \quad \mathrm{net}_k = w_k^t x

PNN training algorithm
• For each training sample j: normalize its feature values and set the weights of pattern unit j equal to those feature values
• Only the connection aji from pattern unit j to the category unit of its class label is set to 1; the aji thus encode the connections between labeled samples and their corresponding classes

PNN classification algorithm
• Made possible by the choice of activation function above
• The output of each category unit is the sum of the activation functions over the labeled samples of that class; the test point is assigned to the category with the largest output

Parzen/PNN classifier, stage 1
• Inputs: d normalized features x1, ..., xd
• Weights correspond to the feature values of the n labeled samples: wij = xij, i = 1, ..., n; j = 1, ..., d
• The input units are fully connected to the n pattern (sample) units, which compute \mathrm{net}_k = w_k^t x, k = 1, ..., n

PNN stage 2: the outputs are sparsely connected
• The n pattern units p1, ..., pn connect only to their own category units ω1, ..., ωc
• Each pattern unit applies the nonlinear activation function e^{(\mathrm{net}_k - 1)/\sigma^2}
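The two PNN stages above can be illustrated with a short NumPy sketch, assuming a spherical Gaussian window of width σ. The class name PNN, the value σ = 0.5, and the toy 2-D patterns are assumptions made only for this example; training simply stores the normalized patterns as weight vectors, and classification computes net_k = w_k^t x, applies e^{(net_k - 1)/σ²}, and sums the activations per category.

```python
import numpy as np

def normalize(X):
    """Scale each pattern to unit length, as required by the PNN activation."""
    X = np.atleast_2d(X).astype(float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)

class PNN:
    """Minimal probabilistic neural network: one pattern unit per training sample."""

    def __init__(self, sigma=0.5):
        self.sigma = sigma

    def train(self, X, y):
        # "Training" stores normalized patterns as the weight vectors w_k and
        # remembers which category unit each pattern unit is connected to.
        self.W = normalize(X)                 # (n, d) weight vectors, one per sample
        self.y = np.asarray(y)
        self.classes = np.unique(self.y)
        return self

    def classify(self, x):
        x = normalize(x)[0]                   # test pattern must also have unit length
        net = self.W @ x                      # net_k = w_k^t x
        act = np.exp((net - 1.0) / self.sigma ** 2)   # e^{(net_k - 1)/sigma^2}
        # Each category unit sums the activations of its own pattern units only.
        scores = np.array([act[self.y == c].sum() for c in self.classes])
        return self.classes[np.argmax(scores)]

# Tiny usage example with made-up 2-D patterns
X = np.array([[3.0, 4.0], [4.0, 3.0], [-3.0, 4.0], [-4.0, 3.0]])
y = np.array([0, 0, 1, 1])
pnn = PNN(sigma=0.5).train(X, y)
print(pnn.classify(np.array([5.0, 5.0])))    # expected: class 0
```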
kn-nearest-neighbor estimation
• A disadvantage of the Parzen approach is that the "best" window function is unknown
• Instead, let the cell volume be a function of the training data: center a cell about x and grow it until it captures kn samples, the kn nearest neighbors of x
• kn is a specified function of n
[Figure: kn-nearest-neighbor estimates for n = 1, 4, 9, 16 and 100 samples]

Choice of kn
• If the density is high near x, the cell will be small, which gives good resolution
• If the density is low, the cell grows large and stops only when higher-density regions are reached
• Requirements for convergence: lim kn = ∞ and lim kn/n = 0
• We can obtain a family of estimates by setting k_n = k_1\sqrt{n} and choosing different values for k1

Examples of estimated pdfs using the kn-nearest-neighbor estimate
• Discontinuities in the slopes of the estimated densities occur away from the sample points
[Figures: one dimension with k = 3 and 5; two dimensions with k = 5]

Comparison of Parzen and kn-nearest-neighbor estimates in one dimension
• For kn-nearest-neighbor estimation with a single sample (n = 1, k_n = \sqrt{n} = 1), the estimate becomes
      p_n(x) = \frac{k_n/n}{V_n} = \frac{1}{V_1} = \frac{1}{2|x - x_1|}
  which is clearly a poor estimate: it diverges at x1 and its integral is infinite
• The Parzen window, by contrast, yields a single smooth Gaussian centered at x1
[Figures: kn-nearest-neighbor estimate versus Parzen estimate]

Examples of kn-nearest-neighbor estimation
[Figure: estimates of a bimodal Gaussian; the estimates are spiky when there are few samples]

Estimation of a posteriori probabilities
• Place a cell of volume V around x and capture k samples, ki of which turn out to be labeled ωi; the joint density of pattern and class is estimated directly from these labeled samples as
      p_n(x, \omega_i) = \frac{k_i/n}{V}
• The posterior then follows as
      P_n(\omega_i \mid x) = \frac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \frac{k_i}{k}
• ki/k is simply the fraction of the samples within the cell that are labeled ωi
• For minimum error rate, the most frequently represented category within the cell is selected
• If k is large and the cell is sufficiently small, the performance approaches the best possible
• The size of the cell can be chosen using either the Parzen-window approach or the kn-nearest-neighbor approach
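The posterior estimate P(ωi | x) = ki/k lends itself to a direct sketch: find the k nearest labeled samples to x and return the label fractions within that cell. The following NumPy example is illustrative rather than the slides' own code; the function name knn_posteriors, the choice k = 15, and the two synthetic Gaussian classes are assumptions.

```python
import numpy as np

def knn_posteriors(x, samples, labels, k):
    """Estimate P(w_i | x) = k_i / k from the k nearest training samples.

    Conceptually, the cell around x is grown until it contains exactly k samples;
    the fraction of those samples carrying each label estimates the posterior.
    """
    samples = np.atleast_2d(samples)
    labels = np.asarray(labels)
    dists = np.linalg.norm(samples - np.atleast_1d(x), axis=1)
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors
    classes = np.unique(labels)
    return {int(c): float(np.sum(labels[nearest] == c) / k) for c in classes}

# Example with two made-up 1-D Gaussian classes
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(+2, 1, 100)])[:, None]
y = np.array([0] * 100 + [1] * 100)
print(knn_posteriors(0.5, X, y, k=15))   # posteriors should typically favor class 1
```

Choosing the most frequently represented label among the k neighbors (the argmax over these fractions) gives the familiar k-nearest-neighbor classification rule.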