
Ch 1. Introduction (Latter Half)
Pattern Recognition and Machine Learning,
C. M. Bishop, 2006.
Summarized by
J.W. Ha
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents

1.4 The Curse of Dimensionality

1.5 Decision Theory

1.6 Information Theory
1.4 The Curse of Dimensionality

The High Dimensionality Problem
Example: mixture of oil, water, and gas
- 3 classes (homogeneous, annular, laminar)
- 12 input variables
- Scatter plot of x6 and x7
- Predict the class of a new point x
- Simple, naïve approach: divide the input space into regular cells and assign x the class most frequent among the training points in its cell
1.4 The Curse of Dimensionality (Cont’d)

The Shortcomings of the Naïve Approach
- The number of cells increases exponentially with the input dimensionality.
- A very large training data set is needed so that the cells are not empty.
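
A minimal Python sketch (the bin count is an assumption, not from the slides) of why the cell-based approach fails: the number of cells grows exponentially with the number of input variables.

# Count the cells in a regular grid over the input space.
def num_cells(num_dims, bins_per_dim=3):
    # Each of num_dims input variables is split into bins_per_dim intervals.
    return bins_per_dim ** num_dims

for d in (1, 2, 3, 12):
    print(f"D = {d:2d}: {num_cells(d):,} cells")
# With 12 input variables and only 3 bins per dimension there are already
# 531,441 cells, so most cells would contain no training points at all.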
1.4 The Curse of Dimensionality (Cont’d)

Polynomial Curve Fitting (Order M)
- As the input dimensionality D increases, the number of coefficients grows proportionally to D^M: polynomial rather than exponential growth, but still severe for large M.

The Volume of a High-Dimensional Sphere
- Most of the volume is concentrated in a thin shell near the surface.
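
A rough numerical illustration of the second point (unit radius and a 5% shell thickness are assumptions): the fraction of a D-dimensional sphere's volume lying in a thin shell at the surface approaches one as D grows.

def shell_volume_fraction(D, eps=0.05):
    # The volume of a sphere of radius r scales as r**D,
    # so the outer shell of thickness eps holds 1 - (1 - eps)**D of the total volume.
    return 1.0 - (1.0 - eps) ** D

for D in (1, 2, 20, 200):
    print(f"D = {D:3d}: {shell_volume_fraction(D):.3f} of the volume lies in the outer 5% shell")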
1.4 The Curse of Dimensionality (Cont’d)

Gaussian Distribution in High Dimensions
- Expressed in polar coordinates, the probability mass of a Gaussian concentrates in a thin shell at a particular radius as the dimensionality grows.
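
A small sampling check (a standard Gaussian N(0, I_D) is assumed) of this concentration effect: the distance of samples from the mean clusters ever more tightly, in relative terms, around sqrt(D).

import numpy as np

rng = np.random.default_rng(0)
for D in (1, 10, 100, 1000):
    x = rng.standard_normal((10000, D))
    r = np.linalg.norm(x, axis=1)   # radius of each sample
    print(f"D = {D:4d}: mean radius {r.mean():7.2f}, std {r.std():.2f}  (sqrt(D) = {np.sqrt(D):7.2f})")
# The mean radius grows like sqrt(D) while its spread stays roughly constant,
# so relative to the radius the mass sits in an ever thinner shell.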
1.5 Decision Theory

Making Optimal Decisions
- Inference step & decision step
- Select the class with the higher posterior probability

Minimizing the Misclassification Rate
- Objective: minimize the probability of misclassification, i.e. the colored area in the decision-boundary figure
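
A minimal sketch (the posterior values are hypothetical) of the decision rule: assigning x to the class with the largest posterior probability minimizes the misclassification rate.

import numpy as np

posterior = np.array([0.7, 0.2, 0.1])   # p(C_k | x) for classes C_1..C_3, assumed values
k = int(np.argmax(posterior))           # decision step: pick the most probable class
print(f"assign x to class C_{k + 1}; probability of misclassification = {1 - posterior[k]:.2f}")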
1.5 Decision Theory (Cont’d)

Minimizing the Expected Loss
- The damage caused by a misclassification differs from class to class.
- Introduce a loss function (cost function).
- Objective: minimize the expected loss (a combined sketch with the reject option follows below).

The Reject Option
- Threshold θ
- Reject (avoid making a decision) when the largest posterior probability is below θ.
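
A sketch combining the two ideas on this slide (the loss matrix, posterior values, and threshold are all assumptions): decide the class with the smallest expected loss, but reject when the largest posterior falls below theta.

import numpy as np

# L[k, j] = loss incurred when the true class is C_k but we decide C_j
# (e.g. a medical-screening style matrix where missing class C_1 is very costly)
L = np.array([[0.0, 100.0],
              [1.0,   0.0]])
posterior = np.array([0.3, 0.7])   # p(C_k | x), assumed values
theta = 0.9                        # reject threshold on the largest posterior

expected_loss = posterior @ L      # expected loss of deciding each class C_j
if posterior.max() < theta:
    print("reject: the posterior is too uncertain to decide")
else:
    print(f"decide class C_{int(np.argmin(expected_loss)) + 1}")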
1.5 Decision Theory (Cont’d)

Inference and Decision
- Three distinct approaches
1. Generative models (obtain the posterior via the joint distribution)
- Model the data distribution by calculating p(x|Ck) for each class
- Obtain p(Ck) and p(x), then apply Bayes' theorem to get p(Ck|x)
- Can generate synthetic data points
- Computationally the most demanding approach
1.5 Decision Theory (Cont’d)
2. Discriminative models (model the posterior directly)
- Obtain the posterior p(Ck|x) directly
- Classify new input data from the posterior
- Appropriate when only the classification decision is needed
3. Discriminant functions
- Map the input x directly to a class label, with no explicit posterior
1.5 Decision Theory (Cont’d)

Why compute the posterior probabilities at all?
1. Minimizing risk
- The loss matrix may change frequently; with posteriors available, only the decision step needs to be redone.
2. The reject option
3. Compensating for class priors
- Useful when the class probabilities differ greatly (e.g. the model was trained on an artificially balanced data set)
- The posterior is proportional to the prior, so it can be rescaled with the true priors (see the sketch below)
4. Combining models
- Split the problem into subproblems and combine the separately obtained posteriors
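
A hedged sketch of item 3 (all numbers assumed): posteriors learned from an artificially balanced training set can be corrected with the true class priors, because the posterior is proportional to the prior.

import numpy as np

posterior_balanced = np.array([0.6, 0.4])   # p(C_k | x) learned from a 50/50 balanced training set
prior_balanced = np.array([0.5, 0.5])
prior_true = np.array([0.001, 0.999])       # class frequencies in the real population

corrected = posterior_balanced / prior_balanced * prior_true
corrected /= corrected.sum()                # renormalize so the corrected posteriors sum to one
print(corrected)                            # the rare class now receives a much smaller posterior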
1.5 Decision Theory (Cont’d)

Loss Function for Regression
- The expected squared loss E[L] = ∫∫ {y(x) − t}² p(x, t) dx dt is minimized by the conditional mean y(x) = E[t|x].
- The same result carries over to a vector t of multiple target variables.
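
A toy numerical check (the conditional distribution of t is an assumption): for the squared loss, the prediction that minimizes the expected loss at a given x is the conditional mean of t.

import numpy as np

rng = np.random.default_rng(1)
t = np.sin(2.0) + 0.3 * rng.standard_normal(100_000)   # samples of t for one fixed input x

candidates = np.linspace(-1.0, 2.5, 351)
sq_loss = [np.mean((y - t) ** 2) for y in candidates]
best = candidates[int(np.argmin(sq_loss))]
print(f"loss-minimizing prediction {best:.3f}  vs  conditional mean {t.mean():.3f}")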
1.5 Decision Theory (Cont’d)

Minkowski Loss
- A generalization of the squared loss: E[L_q] = ∫∫ |y(x) − t|^q p(x, t) dx dt; q = 2 gives the conditional mean, q = 1 the conditional median, and q → 0 the conditional mode.
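
A short sketch of the Minkowski loss (the distribution of t is assumed): q = 2 recovers the conditional mean, while q = 1 is minimized by the conditional median.

import numpy as np

rng = np.random.default_rng(2)
t = rng.exponential(scale=1.0, size=100_000)   # a skewed conditional distribution of t

def minkowski_optimum(t, q):
    # Scan candidate predictions y and return the one with the smallest mean |y - t|**q.
    candidates = np.linspace(t.min(), np.quantile(t, 0.99), 500)
    losses = [np.mean(np.abs(y - t) ** q) for y in candidates]
    return candidates[int(np.argmin(losses))]

print("q = 2 optimum:", round(minkowski_optimum(t, 2), 3), " mean   =", round(float(t.mean()), 3))
print("q = 1 optimum:", round(minkowski_optimum(t, 1), 3), " median =", round(float(np.median(t)), 3))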
1.6 Information Theory

Entropy
- Low-probability events carry high information content: h(x) = -log2 p(x).
- Entropy is the expectation of the information content: H[x] = -Σ p(x) log2 p(x).
- Higher entropy corresponds to larger uncertainty.
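
A minimal sketch of the definition above (the two example distributions are assumed): the entropy of a peaked distribution is low, that of a uniform distribution is maximal.

import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # treat 0 * log 0 as 0
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([0.97, 0.01, 0.01, 0.01]))   # peaked  -> low entropy (about 0.24 bits)
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # uniform -> maximum entropy (2 bits)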
1.6 Information Theory (Cont’d)

Maximum Entropy Configuration for Continuous Variable
H[x] = -∫ p(x) ln p(x) dx

subject to the three constraints

∫ p(x) dx = 1,   ∫ x p(x) dx = μ,   ∫ (x − μ)² p(x) dx = σ²

- Adopt Lagrange multipliers (one per constraint) to maximize the entropy; the maximizing density is

p(x) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) )
- The distribution that maximizes the differential entropy is the Gaussian (a numerical check follows below).

Conditional entropy: H[x, y] = H[y|x] + H[x]
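
A numerical sanity check of the maximum-entropy result (the comparison distributions are assumptions): among distributions with the same variance, the Gaussian has the largest differential entropy, 0.5 * ln(2*pi*e*sigma^2).

import numpy as np

sigma2 = 1.0
h_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma2)
h_laplace = 1.0 + 0.5 * np.log(2 * sigma2)   # Laplace with variance sigma^2 (scale b = sqrt(sigma^2 / 2))
h_uniform = 0.5 * np.log(12 * sigma2)        # uniform with variance sigma^2 (width w = sqrt(12 * sigma^2))

print(f"Gaussian {h_gauss:.3f}  >  Laplace {h_laplace:.3f}  >  uniform {h_uniform:.3f}")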
1.6 Information Theory (Cont’d)

Relative Entropy (Kullback-Leibler Divergence)
- Measures the additional information needed when an unknown distribution p(x) is modeled by an approximating distribution q(x) (a small numeric sketch follows below).

Convex Functions (Jensen's Inequality)
- Used to show that KL(p||q) ≥ 0, with equality only when p = q.
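
A small numeric sketch (the two discrete distributions are assumed): KL(p||q) = Σ p(x) ln(p(x)/q(x)) is non-negative by Jensen's inequality, zero only when p = q, and not symmetric.

import numpy as np

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(kl(p, q), kl(q, p))   # two different positive values: KL is not symmetric
print(kl(p, p))             # 0.0 when the two distributions coincide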
1.6 Information Theory (Cont’d)

Mutual Information
- Relative Entropy between the joint distribution and the
product of the marginals
- I[x, y] = H[x] – H[x|y] = H[y] – H[y|x]
- If x and y are independent, I[x,y] = 0
- The reduction in the uncertainty about x by virtue of being told the value of y
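
A short sketch (the joint probability table is assumed): mutual information computed directly as the KL divergence between the joint distribution and the product of its marginals.

import numpy as np

joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])              # p(x, y) for two binary variables
px = joint.sum(axis=1, keepdims=True)         # marginal p(x), shape (2, 1)
py = joint.sum(axis=0, keepdims=True)         # marginal p(y), shape (1, 2)

mi = float(np.sum(joint * np.log(joint / (px * py))))
print(f"I[x, y] = {mi:.4f} nats")             # positive, because x and y are dependent here

independent = px * py                         # a joint built from its own marginals
print(float(np.sum(independent * np.log(independent / (px * py)))))   # 0.0: independence gives I = 0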