Discriminant analysis and MANOVA

Principal component analysis
• Motivation for PCA came from major-axis regression.
• Strong assumption: single homogeneous sample.
– Free of assumptions when used for exploration.
– Classical tests of significance of eigenvectors and
eigenvalues assume multivariate normality.
– Bootstrap tests assume only that the sample is representative of the population (a minimal bootstrap sketch follows this list).
• Can be used with multiple samples for exploration:
– Search for structure: e.g., how many groups?
– Not optimized for discovering group structure.
– Classical significance tests can’t be used.
– If structure is discovered by exploring the data, then it can’t be tested for significance.
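As referenced above, here is a minimal sketch of such a bootstrap (assuming NumPy; the percentile-interval approach and the function name are illustrative choices, not something the slides prescribe):

```python
import numpy as np

def bootstrap_eigenvalue_ci(X, n_boot=1000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence intervals for PCA eigenvalues,
    assuming only that the sample is representative of the population."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    boot = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        # Resample whole observations (rows) with replacement
        Xb = X[rng.integers(0, len(X), len(X))]
        # Eigenvalues of the covariance matrix, largest first
        boot[b] = np.sort(np.linalg.eigvalsh(np.cov(Xb, rowvar=False)))[::-1]
    lower = np.percentile(boot, 100 * alpha / 2, axis=0)
    upper = np.percentile(boot, 100 * (1 - alpha / 2), axis=0)
    return lower, upper
```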
Principal component analysis

[Figure: two PCA score plots of the same data, scores on PC1 (72.4%) vs. scores on PC2 (19.4%).]
• MANOVA: p < 0.0001.
• But: the data were sampled randomly from a single multivariate-normal population.
Multiple groups and multiple variables
• Suppose that:
– We have two or more groups (treatments, etc.) defined on
extrinsic criteria.
– We wish to know whether and how we can discriminate
groups on the basis of two or more measured variables.
• Things we might want to know:
– Can we discriminate the groups?
– If so, how well? How different are the groups?
– Are the groups ‘significantly’ different? How do we assess
significance in the presence of correlations among variables?
– Which variables are most important in discriminating the
groups?
– Can group membership be predicted for “unknown” individuals? How good is the prediction?
Multiple groups and multiple variables
• These questions are answered using three related methods:
(1) Discriminant function analysis (DFA):
• = Discriminant analysis (DA), = canonical variate analysis (CVA).
• Determines the linear combinations of variables that best
discriminate groups.
(2) Multivariate analysis of variance (MANOVA):
• Determines whether multivariate samples differ non-randomly
(significantly).
(3) Mahalanobis distance (D²):
• Measures distances in multivariate character space in the presence
of correlations among variables.
• Developed independently by three mathematicians:
– Fisher (DFA) in England, Hotelling (MANOVA) in the United
States, Mahalanobis (D²) in India.
– Due to differences in notation, underlying similarities not noticed for 20 years.
– Now have a common matrix formulation.
Discriminant analysis
• Principal component analysis:
– Inherently a single-group procedure:
• Assumes that data represent a single homogeneous sample from a population.
• Can be used for multiple groups, but cannot take group
structure into consideration.
• Often used to determine whether groups differ in terms of the
variables used, but:
– Can’t use grouping information even if it exists.
– Maximizes variance, regardless of its source.
– Not guaranteed to discriminate groups.
• Discriminant analysis:
– Explicitly a multiple-group procedure.
– Assumes that groups are known (correctly) before analysis,
on the basis of extrinsic criteria.
– Optimizes discrimination between the groups by one or more linear combinations of the variables (discriminant functions).
Discriminant analysis
• Q1: How are the groups different, and which variables most contribute to the differences?
• A: For k groups, find the k–1 linear discriminant functions (axes, vectors, functions) that maximally separate the k groups.
– Discriminant functions (DFs) are eigenvectors of the among-group
variance (rather than total variance).
– Like PCs, discriminant functions:
• Are linear combinations of the original variables.
• Are specified by sets of eigenvector coefficients (weights).
– Can be rescaled as vector correlations.
– Allow interpretation of contributions of individual variables.
• Have corresponding eigenvalues.
– Specify the proportion of among-group variance (rather than total variance) accounted for by each DF.
• Can be estimated from either the covariance matrices (one per group) or
the correlation matrices.
– Groups are assumed to have multivariate normal distributions with
identical covariance matrices.
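As a concrete illustration of fitting discriminant functions under these assumptions, here is a minimal sketch using scikit-learn’s LinearDiscriminantAnalysis (an assumed dependency; the slides do not prescribe any software):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Three known groups, five measured variables (simulated for illustration)
X = np.vstack([rng.normal(loc=m, size=(30, 5)) for m in (0.0, 1.0, 2.0)])
groups = np.repeat([1, 2, 3], 30)

lda = LinearDiscriminantAnalysis()         # assumes equal group covariance matrices
scores = lda.fit(X, groups).transform(X)   # scores on the k - 1 = 2 discriminant functions

print(lda.explained_variance_ratio_)       # proportion of among-group variance per DF
print(lda.predict(X[:3]))                  # predicted group membership
```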
Discriminant analysis
• Example: 2 groups, 2 variables:
[Figure: two groups plotted on X1 and X2 (original data, and with 95% data ellipses); projection scores onto candidate axes, and F from ANOVA of the scores plotted against the angle of the line from horizontal.
Line A: ANOVA F = 0.54. Line B: ANOVA F = 56.68. DF: ANOVA F = 85.62.]
Discriminant analysis
• Example: 3 groups, 2 variables:
[Figure: three groups plotted on X1 and X2 (original data, and with 95% data ellipses); projection scores onto candidate axes, and F from ANOVA of the scores plotted against the angle of the line from horizontal.
Line A: ANOVA F = 29.71. Line B: ANOVA F = 8.91. DF: ANOVA F = 54.55.]
Discriminant analysis
• The discriminant functions are eigenvectors:
– For PCA, the eigenvectors are estimated from S, the covariance matrix, which accounts for the total variance of the sample.
– For DFA, the eigenvectors are estimated from a matrix that accounts for the among-group variance.
• For a single variable, a measure of among-group variation, scaled by within-group variation, is the ratio $s_a^2 / s_w^2$.
1
• Discriminant functions are eigenvectors of the matrix W B
W = pooled
l d within-group
ithi
covariance
i
matrix.
ti
B = among-group covariance matrix.
– Analogous to univariate measure.
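A minimal NumPy sketch of this construction (the data, group labels, and helper name are illustrative; scaling conventions for W and B vary between texts, which rescales the eigenvalues but leaves the eigenvectors unchanged):

```python
import numpy as np

def discriminant_functions(X, groups):
    """Eigenvalues and eigenvectors of W^{-1}B, where W is the pooled
    within-group covariance matrix and B the among-group covariance matrix."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]

    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in labels:
        Xg = X[groups == g]
        W += (len(Xg) - 1) * np.cov(Xg, rowvar=False)    # within-group SSCP
        d = (Xg.mean(axis=0) - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)                         # among-group SSCP
    W /= len(X) - len(labels)                            # pooled within-group covariance
    B /= len(labels) - 1                                 # among-group covariance

    # W^{-1}B is not symmetric, so use the general eigen-solver
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]                 # largest eigenvalue first
    return evals.real[order], evecs.real[:, order]

# Illustrative use: 2 simulated groups, 2 variables
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1.0, size=(25, 2)) for m in (0.0, 2.0)])
groups = np.repeat([1, 2], 25)

evals, DFs = discriminant_functions(X, groups)
scores = (X - X.mean(axis=0)) @ DFs                      # DF scores for plotting
# Loadings as vector correlations of each variable with each DF's scores
# (the index 2 assumes the 2-variable example above)
loadings = np.corrcoef(X.T, scores.T)[:2, 2:]
```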
Discriminant analysis
• Thus the DFA eigenvectors:
– Maximize the ratio of among-group variation to within-group
variation.
– Optimize discrimination among all groups simultaneously.
• For any set of data, there exists one axis (the discriminant function,
DF) for which projections of groups of individuals are maximally
separated, as measured by ANOVA of the projections onto the
axis.
– For 2 groups: this DF completely accounts for group discrimination.
– For 3+ groups, have series of orthogonal DFs:
• DF1 accounts for the largest proportion of among-group variance.
• DF2 accounts for largest proportion of residual among-group
variance.
• Etc.
• DFs can be used as bases of a new coordinate system for plotting
scores of observations, and loadings of original variables.
Discriminant analysis
• Example: 2 groups, 5 variables:
[Figure: two groups, five variables. Original data on X1 and X2, with 95% data ellipses; scores on DF1 (100.0%) vs. DF2 (0.0%); loadings of X1–X5 on DF1 and DF2.]
Discriminant analysis
• Example: 3 groups, 5 variables:
[Figure: three groups, five variables. Original data on X1 and X2, with 95% data ellipses; scores on DF1 (62.7%) vs. DF2 (37.3%); loadings of X1–X5 on DF1 and DF2.]
Discriminant analysis
• Discriminant functions have no necessary relationship to principal components:

[Figure: four data sets plotted on X1 and X2, each showing both the PC axes and the DF axes; the relative orientations of the PC and DF axes differ from data set to data set.]
MANOVA
• Q2: Are the groups significantly heterogeneous?
• A: Multivariate analysis of variance:
– General case of testing for significant differences among a set of predefined groups (treatments), with multiple correlated variables.
• ANOVA: special case for one variable (univariate).
– Hotelling’s T²-test: special case of MANOVA for two groups.
• t-test: special univariate case for two groups.
MANOVA
• Discriminant functions are eigenvectors of the matrix $W^{-1}B$.
• The eigenvalues of $W^{-1}B$ are $\lambda_1, \lambda_2, \ldots, \lambda_p$.
• A general multivariate test statistic is Wilks’ lambda: $\Lambda = \prod_{j=1}^{p} \frac{1}{1 + \lambda_j}$
– Commonly reported by statistical packages.
– Expression to determine significance is complicated.
– Wilks’ lambda can be transformed to an F-statistic, but the expression for this is complicated, too.
• Several other test statistics are commonly reported by
statistical packages:
– Varying terminology, varying assumptions.
– All reported with corresponding p-values.
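All of these statistics can be written in terms of the eigenvalues of $W^{-1}B$. A minimal NumPy sketch (here W and B are taken as within- and among-group sums-of-squares-and-cross-products matrices, one common convention for these statistics; the p-value approximations, which the slides note are complicated, are left to the packages):

```python
import numpy as np

def manova_statistics(X, groups):
    """Common multivariate test statistics from the eigenvalues of W^{-1}B,
    with W and B taken here as within- and among-group SSCP matrices."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in np.unique(groups):
        Xg = X[groups == g]
        W += (len(Xg) - 1) * np.cov(Xg, rowvar=False)
        d = (Xg.mean(axis=0) - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)

    lam = np.linalg.eigvals(np.linalg.solve(W, B)).real
    lam = lam[lam > 1e-12]                 # drop numerically zero roots
    return {
        "Wilks' lambda": np.prod(1.0 / (1.0 + lam)),
        "Pillai's trace": np.sum(lam / (1.0 + lam)),
        "Hotelling-Lawley trace": np.sum(lam),
        "Roy's largest root": lam.max(),   # some packages report lam.max()/(1 + lam.max())
    }

# Illustrative use: 3 simulated groups, 4 variables
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1.0, size=(20, 4)) for m in (0.0, 0.5, 1.0)])
groups = np.repeat([1, 2, 3], 20)
print(manova_statistics(X, groups))
```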
Mahalanobis’ distance
• Q3: How to measure the distance between two groups?
• A: Depends on whether we want to take correlations among
variables into consideration.
– If not, just measure the Euclidean distance between centroids.
– If so, must measure the Mahalanobis distance between centroids ‘along’ the covariance structure: $D^2 = (\bar{x}_1 - \bar{x}_2)'\, S^{-1} (\bar{x}_1 - \bar{x}_2)$
– Can also measure Mahalanobis distances between points.
[Figure: the same group centroids plotted on X1 and X2. Euclidean distances between centroids are all equal (6.93), but the corresponding Mahalanobis distances are very different (62.7, 71.7, and 12.6).]
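A minimal sketch of the centroid-to-centroid calculation (the pooled within-group covariance matrix is assumed to play the role of S; all values are illustrative):

```python
import numpy as np

def mahalanobis_d2(x1, x2, S):
    """Squared Mahalanobis distance D^2 = (x1 - x2)' S^{-1} (x1 - x2)."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(d @ np.linalg.solve(S, d))

# Illustrative values: two centroids and a pooled within-group covariance matrix S
S = np.array([[2.0, 1.8],
              [1.8, 2.0]])
centroid_1 = np.array([4.0, 4.0])
centroid_2 = np.array([8.0, 8.0])

print(mahalanobis_d2(centroid_1, centroid_2, S))      # D^2 'along' the covariance structure
print(float(np.sum((centroid_1 - centroid_2) ** 2)))  # squared Euclidean distance, for comparison
```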
Classifying ‘unknowns’ into predetermined groups
• Context: have k ‘known’ groups of observations.
– Also have one or more ‘unknown’ observations, assumed to be
a member of one of the ‘known’ groups.
• Task: to assign each ‘unknown’ observation to one of the k
groups.
• Procedure:
– Find Mahalanobis distance from the ‘unknown’ observation to
each of the centroids of the k groups.
– Assign the ‘unknown’ to the closest group.
• Can be randomized:
– Bootstrap the ‘known’ observations by sampling within
groups, with replacement.
– Assign the ‘unknown’ observation to the closest group, based
on distance from the observation to the group centroids.
– Repeat many times: gives the proportion of times the
observation is assigned to each of the groups.
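A minimal sketch of this randomized classification (the nearest-centroid rule uses Mahalanobis distance with a pooled within-group covariance matrix; that choice and all names are assumptions of the example):

```python
import numpy as np

def bootstrap_classify(unknown, X, groups, n_boot=100, seed=None):
    """Proportion of bootstrap replicates that assign `unknown` to each group,
    using the nearest group centroid in Mahalanobis distance."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    unknown = np.asarray(unknown, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    p = X.shape[1]
    counts = {g: 0 for g in labels}

    for _ in range(n_boot):
        S = np.zeros((p, p))
        centroids = {}
        for g in labels:
            Xg = X[groups == g]
            # Resample the 'known' observations within each group, with replacement
            Xb = Xg[rng.integers(0, len(Xg), len(Xg))]
            centroids[g] = Xb.mean(axis=0)
            S += (len(Xb) - 1) * np.cov(Xb, rowvar=False)
        S /= len(X) - len(labels)            # pooled within-group covariance

        # Assign to the group whose centroid is closest in Mahalanobis distance
        d2 = {g: (unknown - c) @ np.linalg.solve(S, unknown - c)
              for g, c in centroids.items()}
        counts[min(d2, key=d2.get)] += 1

    return {g: counts[g] / n_boot for g in labels}

# Illustrative use: one 'unknown' point, two simulated known groups
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 1.0, size=(20, 2)) for m in (0.0, 2.0)])
groups = np.repeat([1, 2], 20)
print(bootstrap_classify([1.0, 1.0], X, groups, n_boot=100, seed=0))
```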
Classifying ‘unknowns’ into predetermined groups
• Example: 2 groups, 2 variables, 1 unknown, 100 bootstrap iterations:
[Figure: two groups plotted on X1 and X2, each panel with one ‘unknown’ observation.
Left: classification probabilities Group 1: 0.15, Group 2: 0.85.
Right: classification probabilities Group 1: 0.51, Group 2: 0.49.]
Assessing misclassification rates (probabilities)
• Would like to know how ‘good’ the discriminant functions are.
– DFA involves finding axes of maximum discrimination for
the data included in the analysis.
– Would like to know how well the procedure will generalize.
– Can’t trust misclassification rates based on the observations used in the analysis.
• Ideally, would like to have new, ‘known’ data to assign to known groups based on the discriminant functions.
Assessing misclassification rates (probabilities)
• Alternatively, can cross-validate:
– Divide all data into:
(1) ‘Calibration’ data set: used to find discriminant functions.
(2) ‘Test’ data set: used to test discriminant functions.
• Determine how well the DFs can correctly assign
‘unknowns’ to their correct groups.
• Proportions of incorrect assignments are estimates of ‘true’
misclassification rates.
– Problem: need all data to get the ‘best’ estimates of the discriminant functions.
– Solution: cross-validate one observation at a time via the
jackknife procedure.
Assessing misclassification probabilities
• Cross-validation via the jackknife (‘leave-one-out’) procedure:
– Set one observation aside.
– Estimate discriminant functions from remaining observations.
– Classify the set-aside ‘known’ observation using the discriminant functions.
– Repeat for all observations, leaving one out at a time.
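A minimal sketch of the leave-one-out procedure, again using a nearest-Mahalanobis-centroid rule (an assumption of the example, not a prescription of the slides):

```python
import numpy as np

def jackknife_misclassification(X, groups):
    """Leave-one-out cross-validation: classify each observation using
    discriminant information estimated from all the other observations."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    p = X.shape[1]
    wrong = 0

    for i in range(len(X)):
        keep = np.arange(len(X)) != i        # set observation i aside
        Xk, gk = X[keep], groups[keep]

        # Pooled within-group covariance and centroids from the remaining observations
        S = np.zeros((p, p))
        centroids = {}
        for g in labels:
            Xg = Xk[gk == g]
            centroids[g] = Xg.mean(axis=0)
            S += (len(Xg) - 1) * np.cov(Xg, rowvar=False)
        S /= len(Xk) - len(labels)

        # Classify the set-aside observation to the nearest Mahalanobis centroid
        d2 = {g: (X[i] - c) @ np.linalg.solve(S, X[i] - c)
              for g, c in centroids.items()}
        if min(d2, key=d2.get) != groups[i]:
            wrong += 1

    return wrong / len(X)                    # estimated misclassification rate

# Illustrative use: 2 simulated groups, 5 observations per group
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 1.0, size=(5, 2)) for m in (0.0, 2.0)])
groups = np.repeat([1, 2], 5)
print(jackknife_misclassification(X, groups))
```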
• Example: 2 groups, 2 variables, 5 observations/group:
[Figure: the two groups of five observations plotted on X1 and X2, and their scores on DF1 and DF2.]
Assessing misclassification probabilities
• Assign each individual, in turn, to one of the known groups using the jackknife procedure.
• Bootstrap 100 times.
• Misclassification rate: 2/10 = 20%.
Observation   Group   Assigned to group   Proportion of bootstrap replicates
                                          Group 1    Group 2
 1            1       2                   0.49       0.51
 2            1       1                   0.53       0.47
 3            1       1                   0.52       0.48
 4            1       1                   0.54       0.46
 5            1       1                   0.57       0.47
 6            2       2                   0.44       0.56
 7            2       2                   0.45       0.55
 8            2       2                   0.47       0.53
 9            2       2                   0.39       0.61
10            2       1                   0.53       0.47