Dimensionality, Decision Theory, Information Theory

Dimensionality and Decision Theory: Discussion questions
• How does the “curse of dimensionality” apply to the brain?
• Why might people play the lottery, even though they know they lose money on average?
• How do we define the performance of an agent?
Classification in two dimensions
• Consider the strategy of classifying a new input according to the most frequent label among its neighbors
Classification in two dimensions
• Divide the space of possible inputs into a grid of equal-sized cells, and compute a predicted label for each cell (see the sketch below)
• Some cells have enough data, others have very little
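As a rough illustration of this grid-and-majority-vote strategy, here is a minimal NumPy sketch; the two-blob data set, the 10x10 grid, and the helper cell_index are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D training data: two Gaussian blobs with labels 0 and 1.
X = np.vstack([rng.normal(0.3, 0.1, size=(100, 2)),
               rng.normal(0.7, 0.1, size=(100, 2))])
t = np.array([0] * 100 + [1] * 100)

n = 10                                # number of cells per dimension
edges = np.linspace(0.0, 1.0, n + 1)  # grid over the unit square

def cell_index(x):
    """Map a point in [0, 1]^2 to its (row, col) grid cell."""
    return tuple(np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n - 1))

# Majority label per cell; cells with no data get no prediction.
votes = {}
for x, label in zip(X, t):
    votes.setdefault(cell_index(x), []).append(label)
cell_label = {c: np.bincount(v).argmax() for c, v in votes.items()}

x_new = np.array([0.65, 0.72])
print(cell_label.get(cell_index(x_new), "no data in this cell"))
```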
What happens when we scale up the dimensionality?
• If we divide the values on each of D dimensions into n regions, then there are n^D cells
• Example: Representing a 100x100 image by its pixel values, differentiating only between black and white, gives 2^10000 cells!
• Most cells will contain no data
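A quick back-of-the-envelope check of this growth; the snippet just counts cells using Python's arbitrary-precision integers.

```python
# Number of grid cells n**D as the dimensionality D grows, with n = 2 regions
# per dimension (the black/white pixel example above).
n = 2
for D in (2, 10, 100, 10_000):
    cells = n ** D
    print(f"D = {D:>6}: the cell count n**D has {len(str(cells))} digit(s)")
```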
Curve fitting in multiple dimensions
• A polynomial of order M in D dimensions requires on the order of D^M coefficients!
• Again, this is impractical for high-dimensional inputs
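For a concrete count: the number of monomials of total degree at most M in D variables is the binomial coefficient C(D + M, M), which for fixed M grows on the order of D^M. A small sketch (the function name num_coefficients is ours):

```python
from math import comb

def num_coefficients(D, M):
    """Number of monomials of total degree at most M in D variables: C(D + M, M)."""
    return comb(D + M, M)

for D in (1, 10, 100, 1000):
    print(D, num_coefficients(D, M=3))   # grows roughly like D**3 / 6 for M = 3
```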
Behavior of the multivariate normal distribution in many dimensions
• High-dimensional space has different geometry
• The probability that a normal random variable in 20 dimensions takes a value close to its mean is ~0; most of the probability mass lies in a thin shell away from the mean
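A small Monte Carlo sketch of this effect, assuming a standard normal in D = 20 dimensions (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20
samples = rng.standard_normal((100_000, D))   # standard normal in 20 dimensions
r = np.linalg.norm(samples, axis=1)           # distance of each sample from the mean

print(f"typical distance from the mean: {r.mean():.2f} (compare sqrt(D) = {np.sqrt(D):.2f})")
print(f"fraction of samples within distance 1 of the mean: {np.mean(r < 1.0):.6f}")
```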
How to deal with the “curse of dimensionality”
• Data in many domains are “smooth,” so that small changes in the input variables lead to small changes in the outcomes
• This allows prediction in regions of the input space where there is no data
How to deal with the “curse of dimensionality”
• Data may be confined to a region (manifold) of the input space having lower effective dimensionality
• Even if data occupy a high-dimensional region of the input space, we can lower dimensionality by discarding information
• Examples: Principal component analysis, feature extraction
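As one possible illustration of principal component analysis, here is a minimal NumPy sketch that recovers a low-dimensional description of synthetic data; the data set and the choice of 3 components are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 points in a 50-dimensional input space that actually
# lie near a 3-dimensional subspace, plus a little noise.
latent = rng.standard_normal((500, 3))
mixing = rng.standard_normal((3, 50))
X = latent @ mixing + 0.01 * rng.standard_normal((500, 50))

# Principal component analysis via the SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
print("fraction of variance in the first 3 components:", explained[:3].sum())

# Discard the remaining directions: each 50-dimensional input becomes 3 numbers.
X_reduced = Xc @ Vt[:3].T
print(X_reduced.shape)   # (500, 3)
```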
Decision Theory
• Once we have discovered the joint distribution of the input and label p(x, t), how is this used to guide a decision?
• An agent must use this information to choose from its possible actions
• A special case is to output a single prediction of the label t
Example: Binary classification
• Consider a system that classifies X-ray images (inputs x) based on whether a cancer is present
• Suppose t = 0 corresponds to class C1 (cancer), and t = 1 corresponds to class C2 (no cancer)
• The possible actions are A1 (diagnose cancer) and A2 (diagnose no cancer)
Minimizing the misclassification rate
• Given the joint distribution p(x, t), we divide the input space into two regions R1, R2 where we choose A1 and A2 respectively
Minimizing the misclassification rate
• To minimize p(mistake) = ∫_R1 p(x, C2) dx + ∫_R2 p(x, C1) dx, we define R1, R2 so that each x is assigned to the class with the smaller value of the integrand
• By the product rule, p(x, Ck) = p(Ck|x) p(x)
• p(x) is the same for all k, so we assign x to the class with the largest posterior probability p(Ck|x)
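In code, the rule is just an argmax over posteriors; the posterior values below are invented for illustration.

```python
import numpy as np

# Hypothetical posterior probabilities p(Ck|x) for one X-ray image x,
# with C1 = cancer and C2 = no cancer (the numbers are invented).
posterior = np.array([0.3, 0.7])   # [p(C1|x), p(C2|x)]

# Minimizing the misclassification rate: choose the class with the largest posterior.
k = np.argmax(posterior)
print("decide", "C1 (cancer)" if k == 0 else "C2 (no cancer)")
```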
Minimizing the misclassification rate
• Green and blue regions are errors made by any classifier, due to the inherent overlap of the class distributions
• The optimal classifier eliminates the red region by placing the decision boundary where the posteriors are equal
Minimizing the expected loss
• Not all errors are alike! Failing to detect cancer (or a fire, or an oncoming car) is more costly than a false alarm
• We can define and minimize a loss function or cost function for each of the possible (state, action) pairs
• Equivalently, we can maximize a utility function or gain function, which is just negative loss
• Lkj is the loss incurred by deciding class j when the true class is k
Minimizing the expected loss
• Assign each x to the class j that minimizes Σ_k Lkj p(x, Ck), which is proportional to Σ_k Lkj p(Ck|x)
• Thus, to minimize the expected loss we need the posterior probabilities and the loss function
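A minimal sketch of this rule for the X-ray example, with an invented loss matrix that makes a missed cancer far more costly than a false alarm:

```python
import numpy as np

# Loss matrix L[k, j]: loss incurred by deciding class j when the true class is k.
# Invented values: missing a cancer (true C1, decide C2) costs 1000, a false
# alarm (true C2, decide C1) costs 1, and correct decisions cost 0.
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

posterior = np.array([0.3, 0.7])   # hypothetical p(Ck|x) for one input x

# Expected loss of each decision j: sum over k of L[k, j] * p(Ck|x).
expected_loss = posterior @ L
print(expected_loss)               # [0.7, 300.0]
print("decide class", int(np.argmin(expected_loss)) + 1)
```

Note that the decision here is C1 (diagnose cancer) even though the posterior favors C2: the loss function, not the posterior alone, determines the optimal action.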
The reject option
• If no posterior probability is greater than a threshold θ, then refrain from making a classification
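A possible sketch of the reject option, with an arbitrary threshold θ = 0.9 and a hypothetical helper decide_with_reject:

```python
import numpy as np

def decide_with_reject(posterior, theta=0.9):
    """Return the index of the most probable class, or None (reject) when the
    largest posterior does not exceed the threshold theta."""
    k = int(np.argmax(posterior))
    return k if posterior[k] > theta else None

print(decide_with_reject(np.array([0.95, 0.05])))   # 0    -> confident decision
print(decide_with_reject(np.array([0.55, 0.45])))   # None -> defer, e.g. to a human expert
```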
Three levels of modeling
• 1. generative model: Find the joint distribution p(x, Ck), or, equivalently, find the prior p(Ck) and likelihood p(x|Ck). Use this to compute the posterior p(Ck|x).
• 2. discriminative model: Find the posterior distribution p(Ck|x) directly.
• 3. discriminant function: Directly learn a function f(x) that assigns a class label, without using probability or decision theory.
Why use the more complex models?
• 1. generative model: Calculates more than is needed to make optimal decisions. Useful for generating synthetic data, and for outlier detection (low probability x)
• 2. discriminative model: Can easily recompute decisions after adjusting the loss function or the prior, use the reject option, and combine the results of multiple models
Loss functions for regression
In regression, we predict the value of a target t using a function y(x).
The loss is 0 if y(x) = t, and increases with error.
Commonly we use the squared loss L(t, y(x)) = {y(x) - t}^2
With squared loss, E[L] is minimized by setting y(x) = E_t[t|x], the conditional mean of t given x
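A small numerical check of this fact, using an arbitrary skewed conditional distribution for t at one fixed x: among the candidate predictions tried, the conditional mean gives the smallest average squared loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed conditional distribution of t for one fixed input x.
t_samples = rng.gamma(shape=2.0, scale=1.5, size=200_000)

# Compare the average squared loss of a few candidate predictions y(x):
# the conditional mean, the conditional median, and an arbitrary constant.
for y in (t_samples.mean(), np.median(t_samples), 2.0):
    print(f"y(x) = {y:.3f}  ->  average {{y(x) - t}}^2 = {np.mean((y - t_samples) ** 2):.3f}")
```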
Loss functions for regression
Other regression functions are optimal if we choose a loss function other than squared loss; for example, the conditional median of t given x minimizes the expected absolute loss