Section 17 – Discriminant Analysis (LDA, QDA, and RDA)
DSCI 425 – Supervised (Statistical) Learning
Brant Deppa – Winona State University
17.1 - Discriminant Analysis in JMP
Though we have almost exclusively used R in this course, we will begin by examining
discriminant analysis in JMP because it provides some nice tools for visualizing the data that
will help demonstrate the main concepts of discriminant analysis.
Example 17.1 – Flea Beetles: The genus of flea beetle Chaetocnema contains three species
that are difficult to distinguish from one another and, indeed, were confused for a long
time. These data comprise six measured characteristics for each of these three species of
flea beetles.
The variables in this data set are:

- Species - species of flea beetle (1 = Chaetocnema concinna, 2 = Chaetocnema heikertingeri, or 3 = Chaetocnema heptapotamica)
- Width 1 - a numeric vector giving the width of the first joint of the first tarsus in microns (the sum of measurements for both tarsi)
- Width 2 - a numeric vector giving the width of the second joint of the first tarsus in microns (the sum of measurements for both tarsi)
- Maxwid 1 - a numeric vector giving the maximal width of the head between the external edges of the eyes in 0.01 mm
- Maxwid 2 - a numeric vector giving the maximal width of the aedeagus in the fore-part in microns
- Angle - a numeric vector giving the front angle of the aedeagus (1 unit = 7.5 degrees)
- Width 3 - a numeric vector giving the aedeagus width from the side in microns
The main questions of interest are:
Are there significant differences in these measured characteristics for the three species of
flea beetle? Can differences in these measurements be used to classify the species of a
flea beetle?
A classification rule should allow us to predict the species of flea beetle based on these
six measurements. Discriminant analysis does this by using distances from the species
averages on these measurements to compute a posterior probability of being from each of the
classes, in this case species, in the training data. It is important to note that discriminant
analysis requires that all predictors (𝑋1, 𝑋2, …, 𝑋𝑝) are continuous or meaningfully
numeric.
Analysis in JMP
First we use graphical techniques such as histograms, comparative boxplots, and
scatterplots with color coding in an attempt to derive a classification rule for the three
species of flea beetle. This can be done as follows:
Color coding - From the Rows menu select Color/Marker by Col... then check the box
labeled Marker (Color will already be checked by default). Now highlight Species with
the mouse and click OK. In subsequent plots the different species of flea beetle will be
color coded and a different plotting symbol will be used for each species. Look at the
spreadsheet to see the correspondence.
Histograms - Select Distribution of Y from the Analyze menu and add Species and all
six of the measurements to the right hand box, then click OK. Now use the mouse to
click on the bars corresponding to the different species of beetle in the bar graph for
Species, carefully observing what happens in the histograms for the two measured
characteristics. This is an example of linked viewing - data that is highlighted or
selected in one plot becomes highlighted in all other plots.
What can we conclude from this preliminary histogram analysis?
From the above displays we can see that Heptapot flea beetles have width measurements
between 125 and 150. On the angle variable we can see that this species of flea beetle
tends to have much smaller angle measurements than the other two species. At this point
we might conjecture that an unclassified flea beetle with a width value between 125 and 150
and an angle measurement less than 12 should be classified as a Heptapot flea beetle.
Similar observations can be made by clicking on the bars for the other two species in the
bar graph for Species.
Comparative Boxplots - Select Fit Y by X from the Analyze menu and add Species in the X box
and all six measurements in the Y box. Add boxplots and/or means diamonds to the resulting
plots by selecting these options in the Display menu which is located below the plot.
The plot at the top to the left gives
comparative boxplots for the three flea
beetle genus for the width variable. We
clearly see that the Heikert. flea beetles
have the lowest width measurements in
general, while Concinnas generally
have the highest.
The Compare Densities option gives
smoothed histograms which also can be
used to show univariate differences
between the three species of flea beetle.
Comparative Displays for Width 1
Comparative Displays for Angle
These graphs again show that the Heptapot species has the smallest angle measurements
in general. Another interesting plot to consider when looking at multivariate data
where there are potentially different populations involved is the parallel coordinate
plot. In a parallel coordinate plot, a parallel axis is drawn for each variable; a given row
of data is then represented by drawing a line that connects the value of that row on each
corresponding axis. When you have predefined populations represented in your data,
these plots can be particularly useful when looking for distinguishing characteristics of
those populations.
To obtain a parallel coordinate plot in JMP select Parallel Plot from the Graph menu
and place the variables of interest in the Y response box. You can place a grouping
variable in the X, Grouping box if you wish to see a parallel coordinate display for each
level of the grouping variable X. The results from both approaches are shown on the
following page.
At this point we can see our classification rule could be stated as follows: if the angle is
less than 12, classify a flea beetle as Heptapot. If the angle is greater than 12 then the
beetle is most likely Concinna or Heikert. To distinguish Concinna from Heikert, the
comparative boxplots for Maxwid 2 suggest that we could classify a flea beetle with
angle greater than 12 as Concinna if the maximal width exceeded 133 or 134, and
otherwise classify it as Heikert. This rule is more akin to what we obtain from a
classification tree (CART) model fit to these data, as the rules for classification are
essentially binary splits based upon the measurements; a sketch of such a rule in R is given below.
To formally compare the species on these six characteristics we could perform a
one-way analysis of variance (ANOVA) for each. This can be done by selecting
Means/ANOVA from the Analysis menu below the plot. Pairwise multiple
comparisons can be done using Tukey's method by selecting Compare All Pairs from
the Oneway Analysis pull-down menu.
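For reference, the analogous one-way ANOVA and Tukey comparisons in R would look something like the following sketch (again assuming a hypothetical data frame fleas with a Species factor and a numeric Maxwid2 column):

fit.aov = aov(Maxwid2 ~ Species, data = fleas)   # one-way ANOVA for one measured characteristic
summary(fit.aov)                                 # overall F-test comparing the species means
TukeyHSD(fit.aov)                                # Tukey pairwise comparisons of the species means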
ANOVA TABLE for Maximal Width 2
The p-value for the F-test of the null hypothesis that the population means for maximal
width are all equal is less than .0001, indicating that at least two means are significantly
different. To decide which means are significantly different we can use Tukey's
multiple comparison procedure, examining all pairwise comparisons of the population
means. The results are shown below.
Results of Tukey's Multiple Comparisons
Here we can see that the mean maximal widths differ for all species.
To aid in the development of a simple classification rule we could also examine a
scatterplot for Maxwid 2 vs. Angle making sure that the different species are still color
coded (see above). This can be done by choosing Fit Y by X from the Analyze menu and
placing Width in the X box and Angle in the Y box. We can add a density ellipse for
each group to the plot by choosing Grouping Variable from the Analysis menu below
the plot and highlighting Species as the variable to group on. Then select Density
Ellipses with the desired percent confidence from the Analysis menu. The ellipses
should contain the specified percent of the points for each group. Notice the nice
separation of the density ellipses for the three flea beetle species in these data.
Scatterplot of Angle vs. Width
We can use this scatterplot to state a classification rule that can be used to identify
species on the basis of these measurements. Clearly an angle less than 12 would
indicate that the flea beetle was genus Heptapot. Concinna and Heikert both appear
to have angle measurements exceeding 12. However by using the maximal width
measurement we can distinguish between these two species. Concinna flea beetles have
widths exceeding 134, while the Heikert beetles have widths less than 134. Again the
rule is similar to what we might obtain from CART. Another way to statistically
formalize the procedure above is to perform discriminant analysis, which will
be the focus of this section.
Rather than considering each of the six characteristics individually using ANOVA,
MANOVA allows us to determine if the three species differ on the six characteristics
measured simultaneously. This is achieved by looking at the multivariate response
consisting of all six measurements rather than each characteristic individually which
was done above through the use of one-way ANOVA. To perform MANOVA in JMP
first select the Fit Model option in the Analyze menu. Place all six measurements in
the Y box and place Species in the Effects in Model box, then select MANOVA from the
Personality pull-down menu and select Run. When the analysis is completed, choose
Identity from the menu below where it says Choose Response and click Done. Below
you will see a variety of boxes with many different statistics and statistical tests. To test
the main hypothesis that these measured characteristics differ from species to species
you will want to examine the results of the tests in the Species box. This box contains
four different test statistics, all of which answer the question: do the means for these
six measurements differ in any way from species to species? If the p-value is small for
any of them there is evidence that the means of these measurements differ significantly from
species to species. You can examine a profile of the mean values for
each species by examining the plot for species in the Least Square Means box in the
output. The table on the following page shows the results of the MANOVA for species.
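For reference, the analogous MANOVA in R would be a sketch along these lines (assuming a data frame fleas containing the six measurements under R-legal versions of the JMP column names):

fit.man = manova(cbind(Width1, Width2, Maxwid1, Maxwid2, Angle, Width3) ~ Species, data = fleas)
summary(fit.man, test = "Wilks")    # also "Pillai", "Hotelling-Lawley", or "Roy"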
MANOVA Results for Species Comparison
Here we can see that the p-values associated with each test statistic are less than .0001,
which provides compelling evidence that the three species differ significantly on the six
measured characteristics. To see the nature of these differences select Centroid Plot
from the pull-down menu at the top of the Species box. The centroid plot for these data
is shown on the following page.
Canonical Centroid Plot
A 2-D canonical centroid plot is a plot of the first
two discriminants from Fisher’s Discriminant
Analysis. Fisher’s discriminant analysis is a
method where linear combinations of 𝑋1 , … , 𝑋𝑝
are found that maximally separate the groups.
The first linear combination maximally separates
the groups in one dimension; the second linear
combination maximally separates the groups
subject to the constraint that it is
orthogonal to the first. Thus
when the resulting linear combinations are plotted
for each observation we obtain a scatterplot
exhibiting zero correlation and hopefully good
group separation. We can also visualize the
results in 3-D considering a third linear
combination, provided there are more than three
groups/populations.
The above plot confirms things we have already seen. Notice that the Concinna and
Heikert centroid (mean) circles lie in the direction of the Angle ray, indicating that these
two species have large angle measurements relative to the Heptapot species. The circle
for the Heikert species lies in the direction of the Width 1 ray, indicating that these flea
beetles have relatively large width measurements. In total, we see that a nice species
separation is achieved.
The canonical centroid plot displays the results of discriminant analysis. Discriminant
analysis, though related to MANOVA, is really a standalone method. There are two
main approaches to classic discriminant analysis: Fisher's method, which was discussed
above, and a Bayesian approach where the posterior probability of group membership is
calculated assuming the 𝑋's have an approximate multivariate normal distribution. In
Fisher's approach to discriminating between g groups, (g – 1) orthogonal linear
combinations are found that maximally separate the groups, with the first linear
combination providing the greatest degree of separation, and so on. Future observations are
classified to the group they are closest to in the lower-dimensional space created by
these linear combinations.
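A minimal R sketch of Fisher's approach applied to the flea beetle data (assuming a hypothetical data frame fleas; lda() is in the MASS package):

library(MASS)
flea.lda = lda(Species ~ ., data = fleas)
scores = predict(flea.lda)$x             # Fisher's discriminant scores (canonical variates)
plot(scores[,1], scores[,2], col = as.numeric(fleas$Species),
     xlab = "First Discriminant", ylab = "Second Discriminant")
predict(flea.lda, newdata = fleas[1:2, ])$class   # predicted species for (here) two rows treated as new cases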
In LDA/QDA we classify observations to groups based on their posterior
“probability” of group membership. The probabilities are calculated assuming the
populations have an approximate multivariate normal distribution, though we can use
the method with some success even if this restrictive assumption is not satisfied. In
linear discriminant analysis (LDA) we assume each population, while having a different
mean vector, has the same variance-covariance structure. In quadratic discriminant
analysis (QDA) we assume that the variance-covariance structures of the populations are
different. QDA requires more observations per group (nᵢ > p for all i), as the
variance-covariance matrix is estimated separately for each of the groups, whereas LDA
uses a pooled estimate of the common variance-covariance structure. Because QDA
effectively estimates more “parameters” it should provide better discrimination
between the groups than LDA, at least on the training data. Regularized discriminant analysis (RDA) strikes a balance between the
two extremes by essentially taking a weighted average of the variance-covariance structures
of the two approaches. You can think of it as a shrunken version of QDA, where the
shrinkage is towards LDA.
To classify future observations a Bayesian approach is used where the posterior
probability of group membership for each group is computed as
$$P(\mathrm{Group}=k \mid \boldsymbol{x}) \;=\; \frac{\exp\!\left(-0.5\,D_k^{*2}(\boldsymbol{x})\right)}{\sum_{i=1}^{g}\exp\!\left(-0.5\,D_i^{*2}(\boldsymbol{x})\right)} \qquad \text{for } k = 1,\ldots,g$$

where

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x}-\bar{\boldsymbol{x}}_i)'\,\boldsymbol{S}_i^{-1}(\boldsymbol{x}-\bar{\boldsymbol{x}}_i) + \ln|\boldsymbol{S}_i| - 2\ln p_i \qquad \text{for QDA with unequal priors}$$

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x}-\bar{\boldsymbol{x}}_i)'\,\boldsymbol{S}_i^{-1}(\boldsymbol{x}-\bar{\boldsymbol{x}}_i) + \ln|\boldsymbol{S}_i| \qquad \text{for QDA with equal priors}$$

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x}-\bar{\boldsymbol{x}}_i)'\,\boldsymbol{S}_p^{-1}(\boldsymbol{x}-\bar{\boldsymbol{x}}_i) - 2\ln p_i \qquad \text{for LDA with unequal priors}$$

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x}-\bar{\boldsymbol{x}}_i)'\,\boldsymbol{S}_p^{-1}(\boldsymbol{x}-\bar{\boldsymbol{x}}_i) \qquad \text{for LDA with equal priors}$$

and

$$\boldsymbol{S}_p = \frac{\sum_{i=1}^{g}(n_i-1)\boldsymbol{S}_i}{\sum_{i=1}^{g}(n_i-1)} \;=\; \text{pooled estimate of the common variance-covariance matrix in LDA}$$

and $p_i$ = prior probability of being from population or group $i$. For RDA we use the
same formula as QDA with the sample variance-covariance matrix replaced by

$$\boldsymbol{S}_i^{*} = \lambda\,\boldsymbol{S}_i + (1-\lambda)\,\boldsymbol{S}_p \qquad \text{(RDA estimate of the variance-covariance matrix)}$$

If the other regularization parameter (gamma) is different from 0, then a further convex
combination is formed based on $\boldsymbol{S}_i^{*}$:

$$\boldsymbol{S}_\gamma = (1-\gamma)\,\boldsymbol{S}_i^{*} + \frac{\gamma}{p}\,\mathrm{tr}(\boldsymbol{S}_i^{*})\,\boldsymbol{I}, \qquad 0 \le \gamma \le 1,$$

where $\boldsymbol{I}$ is the $p \times p$ identity matrix.
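A minimal R sketch of the LDA posterior probability calculation above (the unequal-prior case), assuming a numeric predictor matrix X, a grouping factor g, and a new observation x (all hypothetical names):

lda.posterior = function(x, X, g, prior = table(g)/length(g)) {
  lev = levels(g)
  # pooled variance-covariance matrix S_p = sum (n_i - 1) S_i / sum (n_i - 1)
  Sp = Reduce(`+`, lapply(lev, function(l) (sum(g==l)-1)*cov(X[g==l, , drop=FALSE]))) /
       (length(g) - length(lev))
  Spinv = solve(Sp)
  D2 = sapply(lev, function(l) {
    d = x - colMeans(X[g==l, , drop=FALSE])
    drop(t(d) %*% Spinv %*% d) - 2*log(prior[l])   # D*_i^2 for LDA with unequal priors
  })
  exp(-0.5*D2)/sum(exp(-0.5*D2))                   # posterior P(Group = k | x)
}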
More on RDA: The values for the two regularization parameters, 0 ≤ 𝜆 ≤ 1 and 0 ≤
𝛾 ≤ 1, are chosen to minimize jointly an unbiased estimate of future misclassification
risk which we could obtain using some form of cross-validation. Regularized
discriminant analysis provides for a fairly rich class of regularization alternatives. The
four corners of the unit square shown in the figure below define the extremes of
the (𝜆, 𝛾) plane and represent well-known classification procedures.
The lower left corner (𝜆 = 0, 𝛾 = 0) represents QDA. The lower right corner (𝜆 = 1, 𝛾 = 0)
represents LDA. The upper right corner (𝜆 = 1, 𝛾 = 1) corresponds to the nearest-means
classifier where an observation is assigned to the class with the closest (Euclidean
distance) mean, but this could be changed to statistical distance if we first standardize
the 𝑋𝑖 ′𝑠. The upper left corner of the plane represents a weighted nearest-means
classifier, with the class weights inversely proportional to the average variance of the
measurement variables within the class. Holding 𝛾 fixed at 0 and varying 𝜆 produces
models somewhere between QDA and LDA. Holding 𝜆 fixed at 0 and increasing 𝛾
attempts to un-bias the sample-based eigenvalue estimates. Holding 𝜆 fixed at 1 and
increasing 𝛾 gives rise to a ridge-regression analog for LDA. A good pair of values for 𝜆
and 𝛾 is not likely to be known in advance. We therefore must use cross-validation
methods to find optimal values. The resulting misclassification loss averaged over the
training/test samples are then used as an estimate of future misclassification risk. We
can choose a grid of points in the (𝜆, 𝛾) plane (0 ≤ 𝜆 ≤ 1, 0 ≤ 𝛾 ≤ 1), evaluate the
cross-validated estimate of misclassification risk at each prescribed point on the grid,
and then choose the point with the smallest estimated misclassification rate.
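A sketch of such a grid search using the rda() function from the klaR package, which can return a 10-fold cross-validated error estimate for any fixed (𝜆, 𝛾). The training data frame train and its class variable grp are hypothetical placeholders here:

library(klaR)
grid = expand.grid(lambda = seq(0, 1, by = 0.25), gamma = seq(0, 1, by = 0.25))
grid$cv.error = apply(grid, 1, function(p) {
  fit = rda(grp ~ ., data = train, lambda = p["lambda"], gamma = p["gamma"],
            crossval = TRUE, fold = 10, estimate.error = TRUE)
  fit$error.rate["crossval"]                 # cross-validated misclassification estimate
})
grid[which.min(grid$cv.error), ]             # (lambda, gamma) with the smallest estimated risk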
(𝜆, 𝛾) tuning parameters for RDA – the corners of the unit square: QDA at the lower left, LDA at the lower right, nearest means (common variance-covariance structure) at the upper right, and weighted nearest means (assumes unequal variance-covariance structure) at the upper left; 𝛾 increases vertically and 𝜆 increases horizontally.
With all three methods (LDA, QDA, and RDA) posterior probabilities for each
observation in the training data are computed and group membership is predicted. The
performance of the discriminant analysis is reported as the number/percent
misclassified in the training data, and in the test data if it is available. In JMP future
observations can be classified by adding them to the data table or by forming a
validation column using that option in the Cols > Modeling Utilities menu.
LDA/QDA Details
To perform LDA/QDA/RDA in JMP, choose Discriminant from the Multivariate
Methods option within the Analyze menu as shown below.
Put all six measurements in the Y, Covariates box and Species in the X, Categories box.
The results are shown below:
Here we can see that linear discriminant analysis misclassifies none of the flea beetles in
these training data.
A contingency table showing the classification results is displayed below the table of
posterior genus probabilities for each observation. This table is sometimes referred to as
a confusion matrix.
Confusion Matrix for Species Classification
None of the species are misclassified giving us an apparent error rate (APER) of .000 or
0%. The apparent error rate is analogous to the RSS for the training data fit in a
regression problem. Applying either QDA (which actually might be recommended
given that the variability of some of the characteristics differs across species) or RDA
results in perfect classification of the species as well.
We can save the distances to each group along with the posterior probabilities for each
species to our spreadsheet by selecting Save Formulas from Score Options pull-out
menu as shown below.
Having saved these formulae to the data spreadsheet we can use the results of our
discriminant analysis to classify the species of new observations. For example, adding 2
rows to our spreadsheet and entering the measurements for two yet-to-be-classified
beetles will yield their predicted species based on our model.
The predictions are shown below for the two new beetles, the first is classified as
Heikert and the second as Concinna.
Below is a visualization of the classification of the new flea beetles using the first two
discriminants.
Example 17.2 - Italian Olive Oils:
The goal here is to classify olive oils in terms of the area of Italy they are from based upon
their fatty acid composition. There are two geographic classifications in these data. The first
classification is nine individual growing areas in Italy (Area Name) – East Liguria, West Liguria,
Umbria, North-Apulia, South-Apulia, Sicily, Coastal Sardinia, Inland-Sardinia, and Calabria. A
broader classification is the growing region in Italy (Region Name) – Northern, Southern, and
Sardinia. The map below should help in your understanding of where these areas/regions are
located in Italy.
(Map legend: Puglia = Apulia, Sardegna = Sardinia, Sicilia = Sicily)

Series of Visualizations of Region/Area Differences in the Italian Olive Oil Data
- Parallel coordinate plot
- Bar graph for linked views
- Comparative boxplot
- Scatterplot matrices
- Histograms
- Color coding throughout

The bar graph above shows the number of olive oils in these data from each area. The plots
on the left show a parallel coordinate plot and comparative boxplots for palmitic acid across areas.
The scatterplot matrix on the right above clearly shows that LDA is NOT appropriate for these
data. Why? Thus we will consider QDA first.
Results for QDA
We can see the overall misclassification rate on
the entire training data set is 11/572 = 1.923%.
We can randomly choose some cases to use as
test data. We can create training/test cases in
JMP using the Initialize Data feature and then
use a bar graph of this column to Hide and
Exclude the test cases.
The end result of doing this will be a column
with 0 for training and 1 for test cases. Once we
Hide and Exclude the test cases and run a
discriminant analysis (LDA, QDA, or RDA), it
will use the Excluded cases as test cases and
report the misclassification rate on those cases
that were not used to perform the discriminant
analysis.
The results from a QDA now contain a misclassification rate (APER) for the training
cases and the test cases (Excluded).
The APER = 1.312% and the Test Case Error Rate = 4.712%. Next we consider using
regularized discriminant analysis for these data. We will again use the training and test
cases to obtain a measure of prediction error.
Results from RDA with 𝜆 = .6667 and 𝛾 = 0 (giving a test case error rate = 8.38%)
and using 𝜆 = 0.1 and 𝛾 = 0 (giving a test case error rate = 4.19%)
There is no easy method for finding “optimal” values of 𝜆 and 𝛾 in JMP; unfortunately it is
pretty much trial and error, though we can repeatedly cross-validate while searching
through combinations of (𝜆, 𝛾).
17.2 - LDA, QDA, and RDA in R
The package MASS contains the functions lda() and qda() for performing LDA and QDA,
respectively. The package klaR contains the rda() function and provides visualization of
LDA, QDA, and RDA classification boundaries through the partimat() function.
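The misclass() function used throughout the output below is a helper written by the author and not shown in these notes; a minimal sketch consistent with the output it produces might look like:

misclass = function(fit, y) {
  tab = table(fit, y)                               # rows = predicted, columns = actual
  mcr = 1 - sum(diag(tab))/length(y)
  cat("Table of Misclassification\n(row = predicted, col = actual)\n\n")
  print(tab)
  cat("\nMisclassification Rate =", round(mcr, 4), "\n")
  invisible(list(table = tab, misclass.rate = mcr))
}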
Example 17.2: Italian Olive Oils
> dim(Olives)
[1] 572 12
> names(Olives)
 [1] "Region.name" "Area.name"   "Region"      "Area"        "palmitic"
 [6] "palmitoleic" "strearic"    "oleic"       "linoleic"    "eicosanoic"
[11] "linolenic"   "eicosenoic"
> set.seed(1)
> sam = sample(1:572,floor(.6666*572),replace=F)
> Area.train = Olives[sam,-c(1,3,4,12)]    ← remove Region identifiers and eicosenoic acid
> Area.test = Olives[-sam,-c(1,3,4,12)]
> dim(Area.train)
[1] 381   8
> dim(Area.test)
[1] 191   8
> table(Area.train$Area.name)    ← make sure all areas are in the training cases

        Calabria Coastal-Sardinia     East-Liguria  Inland-Sardinia
              39               21               33               42
    North-Apulia           Sicily     South-Apulia           Umbria
              15               28              136               33
    West-Liguria
              35

> table(Area.test$Area.name)    ← make sure all areas are represented in the test cases

        Calabria Coastal-Sardinia     East-Liguria  Inland-Sardinia
              15               10               18               24
    North-Apulia           Sicily     South-Apulia           Umbria
              12               15               62               21
    West-Liguria
              14
> area.lda = lda(Area.name~.,data=Area.train)
> summary(area.lda)
        Length Class  Mode
prior    9     -none- numeric
counts   9     -none- numeric
means   63     -none- numeric
scaling 49     -none- numeric
lev      9     -none- character
svd      7     -none- numeric
N        1     -none- numeric
call     3     -none- call
terms    3     terms  call
xlevels  0     -none- list
> yfit = predict(area.lda,newdata=Area.train)
> attributes(yfit)
$names
[1] "class"
"posterior" "x"
> misclass(yfit$class,Area.train$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               40                0            0               0
  Coastal-Sardinia        0               22            0               0
  East-Liguria            0                0           26               0
  Inland-Sardinia         0                1            0              41
  North-Apulia            0                0            2               0
  Sicily                  0                0            0               0
  South-Apulia            1                0            0               0
  Umbria                  0                0            1               0
  West-Liguria            0                0            3               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    1      4            1      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                1      0            0      1            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia               11      2            0      0            0
  Sicily                      0     14            2      0            0
  South-Apulia                0      1          141      0            0
  Umbria                      0      0            0     29            0
  West-Liguria                0      0            0      0           36

Misclassification Rate =  0.0551    ← misclassification rate for the training cases
> plot(yfit$x[,1],yfit$x[,2],type="n",xlab="First Discriminant",
ylab="Second Discriminant",main="D2 vs. D1 with Area Names")
> text(yfit$x[,1],yfit$x[,2],as.character(yfit$class),col=as.numeric(yfit$class)+2,
cex=.35)
Below is a function that will draw a scatterplot matrix with the points color coded by the
factor/nominal variable, which must be the first column of the argument x passed to the function.
pairs.grps = function(x) {
  pairs(x[,-1],pch=21,bg=as.numeric(as.factor(x[,1]))+3)
}
> poo = cbind(Area.train$Area.name,yfit$x)
> pairs.grps(poo)
We can see that the first few discriminants (LD1-LD3) exhibit the best separation between the
growing areas. Beyond that there is very little separation.
> ypred = predict(area.lda,newdata=Area.test)
> misclass(ypred$class,Area.test$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               14                0            0               0
  Coastal-Sardinia        0                8            0               0
  East-Liguria            0                0           11               0
  Inland-Sardinia         0                2            0              22
  North-Apulia            0                0            4               0
  Sicily                  1                0            0               1
  South-Apulia            0                0            0               1
  Umbria                  0                0            1               0
  West-Liguria            0                0            2               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    0      3            0      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                4      0            0      1            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia                5      2            0      0            0
  Sicily                      1     10            1      0            0
  South-Apulia                0      0           61      0            0
  Umbria                      2      0            0     20            0
  West-Liguria                0      0            0      0           14

Misclassification Rate =  0.136    ← 13.6% of the test cases are misclassified
> area.qda = qda(Area.name~.,data=Area.train)
> yfit = predict(area.qda,newdata=Area.train)
> attributes(yfit)
$names
[1] "class"
"posterior"
> misclass(yfit$class,Area.train$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               39                0            0               0
  Coastal-Sardinia        0               23            0               0
  East-Liguria            0                0           32               0
  Inland-Sardinia         0                0            0              41
  North-Apulia            0                0            0               0
  Sicily                  1                0            0               0
  South-Apulia            1                0            0               0
  Umbria                  0                0            0               0
  West-Liguria            0                0            0               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    0      0            1      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                0      0            0      0            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia               13      0            0      0            0
  Sicily                      0     21            2      0            0
  South-Apulia                0      0          141      0            0
  Umbria                      0      0            0     30            0
  West-Liguria                0      0            0      0           36

Misclassification Rate =  0.0131    ← 1.31% of the training cases are misclassified
Predicting for the test cases gives the following.
> ypred = predict(area.qda,newdata=Area.test)
> misclass(ypred$class,Area.test$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               15                0            0               0
  Coastal-Sardinia        0               10            0               0
  East-Liguria            0                0           16               0
  Inland-Sardinia         0                0            0              23
  North-Apulia            0                0            0               0
  Sicily                  0                0            0               0
  South-Apulia            0                0            0               1
  Umbria                  0                0            0               0
  West-Liguria            0                0            2               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    0      1            0      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                3      0            0      0            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia                9      1            0      0            0
  Sicily                      0     11            0      0            0
  South-Apulia                0      2           62      0            0
  Umbria                      0      0            0     21            0
  West-Liguria                0      0            0      0           14

Misclassification Rate =  0.0524    ← QDA misclassifies 5.24% of the test cases
We will now consider using regularized discriminant analysis (RDA) for these data in R. The
best implementation of regularized discriminant analysis in R is the rda() function in the klaR
library.
> area.rda = rda(Area.name~.,data=Area.train)
> attributes(area.rda)
$names
 [1] "call"           "regularization" "classes"        "prior"
 [5] "error.rate"     "varnames"       "means"          "covariances"
 [9] "covpooled"      "converged"      "iter"           "terms"
[13] "xlevels"

$class
[1] "rda"
> area.rda$regularization
       gamma       lambda
1.699485e-08 1.316095e-01     ← optimal settings for (𝛾, 𝜆) found by 10-fold CV

> area.rda$error.rate
      APER   crossval
0.01049869 0.03410386     ← estimated error rate from 10-fold CV results
> ypred = predict(area.rda,newdata=Area.test)
> misclass(ypred$class,Area.test$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               15                0            0               0
  Coastal-Sardinia        0               10            0               0
  East-Liguria            0                0           16               0
  Inland-Sardinia         0                0            0              23
  North-Apulia            0                0            0               0
  Sicily                  0                0            0               0
  South-Apulia            0                0            0               1
  Umbria                  0                0            0               0
  West-Liguria            0                0            2               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    0      1            0      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                2      0            0      0            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia                9      1            0      0            0
  Sicily                      0     12            0      0            0
  South-Apulia                0      1           62      0            0
  Umbria                      1      0            0     21            0
  West-Liguria                0      0            0      0           14

Misclassification Rate = 0.0471    ← better than QDA, as RDA misclassifies 4.71% of the test cases
RDA help file from the package klaR
Notice that by default the rda() function chooses the optimal values for (𝜆, 𝛾) via 10-fold
cross-validation (crossval=T, fold=10). If crossval=F then repeated bootstrap split-samples
are used (train.fraction=0.50, so a 50%/50% split is used by default). The error rate using
either method will be returned by default (estimate.error=T).

We clearly need to consider changing the regularization parameters if we are going to see
any benefit from using RDA vs. LDA/QDA.
> area.rda = rda(Area.name~.,data=Area.train,lambda=.1,gamma=0)
> pred.rda = predict(area.rda,newdata=Area.test)
> misclass(pred.rda$class,Area.test$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               16                0            0               0
  Coastal-Sardinia        0               12            0               1
  East-Liguria            0                0           16               0
  Inland-Sardinia         0                0            0              21
  North-Apulia            0                0            0               0
  Sicily                  0                0            0               0
  South-Apulia            1                0            0               1
  Umbria                  0                0            0               0
  West-Liguria            0                0            1               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    0      1            0      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                2      0            0      0            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia                6      0            0      0            0
  Sicily                      2      7            4      0            0
  South-Apulia                0      0           66      0            0
  Umbria                      0      0            0     18            0
  West-Liguria                0      0            0      0           15

Misclassification Rate = 0.0684
We can write a simple function for looking through a 𝑘 × 𝑘 grid of (𝜆, 𝛾) combinations
in order to find “optimal” values for these tuning parameters. The function assumes the
response is in the first column of both the training and test data frames. Even though the
rda() function will choose values of (𝛾, 𝜆) by cross-validation on its own, I wrote the
following function to optimize the choices for (𝛾, 𝜆) given a training set and a test set to
predict. The values for these tuning parameters can be chosen based upon the
misclassification error from predicting the test cases.
find.gamlam = function(formula,train,test,ming=0,maxg=1,minl=0,maxl=1,k=5){
  lambda = seq(minl,maxl,length=k)
  gamma = seq(ming,maxg,length=k)
  mcr = rep(0,as.integer(k^2))
  ntest = dim(test)[1]
  lg.grid = expand.grid(lambda=lambda,gamma=gamma)
  for (i in 1:as.integer(k^2)){
    temp = rda(formula,data=train,lambda=lg.grid[i,1],gamma=lg.grid[i,2])
    pred = predict(temp,newdata=test)$class
    numinc = ntest - sum(diag(table(pred,test[,1])))
    mcr[i] = numinc/ntest
  }
  cbind(lg.grid,mcr)
}
> find.gamlam(Area.name~.,train=Area.train,test=Area.test,k=10)
       lambda     gamma        mcr
1   0.0000000 0.0000000 0.05235602
2   0.1111111 0.0000000 0.04712042
3   0.2222222 0.0000000 0.04188482
4   0.3333333 0.0000000 0.04188482
5   0.4444444 0.0000000 0.05235602
6   0.5555556 0.0000000 0.06806283
7   0.6666667 0.0000000 0.07329843
8   0.7777778 0.0000000 0.08900524
9   0.8888889 0.0000000 0.10471204
10  1.0000000 0.0000000 0.13612565
11  0.0000000 0.1111111 0.13612565
12  0.1111111 0.1111111 0.13612565
⋯
91  0.0000000 1.0000000 0.18848168
92  0.1111111 1.0000000 0.19371728
93  0.2222222 1.0000000 0.20418848
94  0.3333333 1.0000000 0.21465969
95  0.4444444 1.0000000 0.21989529
96  0.5555556 1.0000000 0.21989529
97  0.6666667 1.0000000 0.21989529
98  0.7777778 1.0000000 0.21465969
99  0.8888889 1.0000000 0.22513089
100 1.0000000 1.0000000 0.20942408
It appears that 𝜆 ∈ [.2222, .3333] with 𝛾 = 0 is approximately optimal for this train/test
combination. By changing the min/max settings for both parameters in the function above we
can drill down in finer detail to find the “optimal” settings.
> find.gamlam(Area.name~.,train=Area.train,test=Area.test,k=10,
minl=.2,maxl=.34,ming=0,maxg=.001)
       lambda        gamma        mcr
1   0.2000000 0.0000000000 0.04188482
2   0.2155556 0.0000000000 0.04188482
3   0.2311111 0.0000000000 0.04188482
4   0.2466667 0.0000000000 0.04188482
5   0.2622222 0.0000000000 0.04712042
6   0.2777778 0.0000000000 0.04188482
7   0.2933333 0.0000000000 0.04188482
8   0.3088889 0.0000000000 0.04188482
9   0.3244444 0.0000000000 0.04188482
10  0.3400000 0.0000000000 0.04188482
11  0.2000000 0.0001111111 0.04712042
12  0.2155556 0.0001111111 0.05235602
13  0.2311111 0.0001111111 0.04712042
14  0.2466667 0.0001111111 0.05235602
15  0.2622222 0.0001111111 0.04712042
16  0.2777778 0.0001111111 0.04712042
17  0.2933333 0.0001111111 0.04188482
18  0.3088889 0.0001111111 0.04188482
19  0.3244444 0.0001111111 0.04188482
20  0.3400000 0.0001111111 0.04188482
21  0.2000000 0.0002222222 0.05235602
22  0.2155556 0.0002222222 0.05759162
23  0.2311111 0.0002222222 0.05235602
24  0.2466667 0.0002222222 0.05235602
25  0.2622222 0.0002222222 0.05235602
26  0.2777778 0.0002222222 0.04712042
27  0.2933333 0.0002222222 0.04712042
28  0.3088889 0.0002222222 0.04188482
29  0.3244444 0.0002222222 0.04188482
30  0.3400000 0.0002222222 0.04188482
31  0.2000000 0.0003333333 0.05759162
32  0.2155556 0.0003333333 0.05235602
33  0.2311111 0.0003333333 0.05235602
34  0.2466667 0.0003333333 0.05235602
35  0.2622222 0.0003333333 0.05235602
36  0.2777778 0.0003333333 0.05235602
37  0.2933333 0.0003333333 0.05235602
38  0.3088889 0.0003333333 0.05235602
39  0.3244444 0.0003333333 0.04188482
40  0.3400000 0.0003333333 0.04188482
⋯
95  0.2622222 0.0010000000 0.05235602
96  0.2777778 0.0010000000 0.05235602
97  0.2933333 0.0010000000 0.05759162
98  0.3088889 0.0010000000 0.05759162
99  0.3244444 0.0010000000 0.05759162
100 0.3400000 0.0010000000 0.05759162
There appear to be several combinations of (𝛾, 𝜆) that will produce a misclassification error
rate of 4.188% for the test cases.
We can use the partimat function to visualize the classification boundaries on a pairwise basis
using two of the 𝑋𝑖's at a time. The classification error rates shown are for discriminant analysis using
only the two displayed variables; thus the usefulness of these displays is questionable and, as
expected, the error rates are markedly higher than the overall rate.
> partimat(Area.name~.,data=Area.train,method="lda",nplots.hor=4,nplots.ver=4)
Individual plot from LDA
> partimat(Area.name~.,data=Area.train,method="qda",nplots.hor=4,nplots.ver=4)
Individual plot from QDA
> partimat(Area.name~.,data=Area.train,method="rda",lambda =
.20,gamma=0,nplots.hor=4,nplots.ver=4)
An individual plot from RDA
> partimat(Area.name~.,data=Area.train,method="sknn",nplots.hor=4,nplots.ver=4)
> partimat(Area.name~.,data=Area.train,method="naiveBayes",nplots.hor=4,nplots.ver=4)
It is interesting to contrast the decision boundaries between the different methods of
classification we have discussed so far.
Next we consider using discriminant analysis to classify the growing region which has three
levels (Northern, Sardinia, Southern).
> table(Region.train$Region.name)
Northern Sardinia Southern
     111       74      244

> table(Region.test$Region.name)

Northern Sardinia Southern
      40       24       79
> names(Region.train)
[1] "Region.name" "palmitic"
[5] "oleic"
"linoleic"
"palmitoleic" "strearic"
"eicosanoic" "linolenic"
> pairs.grps(Region.train)
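The construction of Region.train/Region.test and the fitting of region.lda used below are not shown in the notes; based on the column names and counts above they were presumably created along the following lines (the 75% training fraction and the index vector sam2 are assumptions made here, not the author's code):

sam2 = sample(1:572, floor(.75*572), replace = F)     # 429 training cases
Region.train = Olives[sam2, -c(2,3,4,12)]             # keep Region.name and the seven fatty acids
Region.test  = Olives[-sam2, -c(2,3,4,12)]
region.lda   = lda(Region.name ~ ., data = Region.train)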
> yfit = predict(region.lda,newdata=Region.train)
> misclass(yfit$class,Region.train$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern      108        0        6
  Sardinia        0       72        0
  Southern        3        2      238

Misclassification Rate =  0.0256
The LDA results can be plotted using the discriminant scores.
> plot(yfit$x[,1],yfit$x[,2],type="n",xlab="First Discriminant",
ylab="Second Discriminant",main="D2 vs. D1 with Region Names")
> text(yfit$x[,1],yfit$x[,2],as.character(yfit$class),col=as.numeric(yfit$class)+2,
cex=.5)
We can add the predictions for the test olive oils to this plot as follows.
> ypred = predict(region.lda,newdata=Region.test)
> text(ypred$x[,1],ypred$x[,2],as.character(ypred$class))
> misclass(ypred$class,Region.test$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern       36        0        5
  Sardinia        0       23        0
  Southern        4        1       74

Misclassification Rate =  0.0699
Next we consider using QDA for these data.
> region.qda = qda(Region.name~.,data=Region.train)
> yfit = predict(region.qda,newdata=Region.train)
> misclass(yfit$class,Region.train$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern      110        0        4
  Sardinia        0       73        0
  Southern        1        1      240

Misclassification Rate =  0.014
Predicting the test cases gives…
> ypred = predict(region.qda,newdata=Region.test)
> misclass(ypred$class,Region.test$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern       39        0        4
  Sardinia        0       23        0
  Southern        1        1       75

Misclassification Rate =  0.042
Use the find.gamlam function to find optimal values for (𝛾, 𝜆) for use in RDA.
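Presumably a call of the following form was used (the output is elided below):

find.gamlam(Region.name ~ ., train = Region.train, test = Region.test, k = 10)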
⋯
Further drilling down for 𝜆, 𝛾 ∈ [0,0.1]
> region.rda = rda(Region.name~.,data=Region.train,gamma=0,lambda=.06)
> yfit = predict(region.rda,newdata=Region.train)
> misclass(yfit$class,Region.train$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern      110        0        3
  Sardinia        0       73        0
  Southern        1        1      241

Misclassification Rate =  0.0117
> ypred = predict(region.rda,newdata=Region.test)
> misclass(ypred$class,Region.test$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern       40        0        4
  Sardinia        0       24        0
  Southern        0        0       75

Misclassification Rate =  0.028
RDA appears to be the best for classifying the test cases.
17.3 - Flexible and Mixture Discriminant Analysis (FDA & MDA)
Suppose we have observations from K distinct groups denoted by G = {1, 2, …, K}.
Further suppose that we have a set of p numeric variables measured on each observation,
which we will denote 𝒙ᵢ. Now consider a function 𝜃: 𝐺 → ℜ¹ that assigns scores to the
classes, such that the transformed class labels are optimally predicted by linear
regression on 𝑿. If our training sample has the form (𝑔ᵢ, 𝒙ᵢ), 𝑖 = 1, 2, …, 𝑛, then we solve

$$\min_{\beta,\,\theta}\;\sum_{i=1}^{n}\left(\theta(g_i)-\boldsymbol{x}_i^{T}\beta\right)^{2}$$
We generally restrict 𝜃(𝑔ᵢ) to have mean 0 and variance 1. The end result will be a
one-dimensional separation of the K classes. For problems where K > 2 this will probably
not be a very good separation of the groups. We extend this idea by finding
up to 𝐿 ≤ 𝐾 – 1 scorings of the class labels, 𝜃₁, 𝜃₂, …, 𝜃_𝐿, thereby constructing the
information needed for a (𝐾 – 1)-dimensional separation:

$$\min_{\beta_l,\,\theta_l}\;\sum_{l=1}^{K-1}\sum_{i=1}^{n}\left(\theta_l(g_i)-\boldsymbol{x}_i^{T}\beta_l\right)^{2}$$
One can show that if we use traditional OLS regression as shown above, the resulting
discriminating dimensions are the same as those returned by LDA. However, because
we have a variety of nonparametric regression methods we can use those in place of
traditional OLS to obtain a more flexible discriminant analysis. For example, we might
add polynomial terms based on the 𝑿’s or use multivariate adaptive regression splines
(MARS). As long as the regression piece can be expressed in the form,
𝑦̂ = 𝑆𝜆 𝑦
𝑒. 𝑔. 𝑆𝜆 = 𝑋(𝑋 𝑇 𝑋)−1 𝑋 𝑇
this algorithm will work. Polynomial regression and MARS can easily be put into this
form. The computational steps for FDA taken from Elements of Statistical Learning are
shown below.
FDA Algorithm
The library mda from CRAN contains the fda() and mda() functions for performing
more “flexible” discriminant analyses. Examples of both are shown on the pages that
follow.
Linear Discriminant Analysis for the Italian Olive Oils Data
Flexible Discriminant Analysis (FDA)
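The poly.fda object summarized below was fit with a call of the form given in the Call: line of the output, i.e.

library(mda)
poly.fda = fda(Area.name ~ ., data = Area.train, method = polyreg)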
> poly.fda
Call:
fda(formula = Area.name ~ ., data = Area.train, method = polyreg)
Dimension: 7
Percent Between-Group Variance Explained:
    v1     v2     v3     v4     v5     v6     v7
 40.00  73.35  91.32  96.50  99.16  99.72 100.00
Degrees of Freedom (per dimension): 8
Training Misclassification Error: 0.05512 ( N = 381 )
> plot(poly.fda)
> ypred = predict(poly.fda,newdata=Area.test)
> attributes(ypred)
$levels
[1] "Calabria"
[5] "North-Apulia"
[9] "West-Liguria"
Brant Deppa - Winona State University
"Coastal-Sardinia" "East-Liguria"
"Sicily"
"South-Apulia"
"Inland-Sardinia"
"Umbria"
$class
[1] "factor"
> misclass(ypred,Area.test$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               14                0            0               0
  Coastal-Sardinia        0                8            0               0
  East-Liguria            0                0           11               0
  Inland-Sardinia         0                2            0              22
  North-Apulia            0                0            4               0
  Sicily                  1                0            0               1
  South-Apulia            0                0            0               1
  Umbria                  0                0            1               0
  West-Liguria            0                0            2               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    0      3            0      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                3      0            0      1            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia                6      2            0      0            0
  Sicily                      1     10            1      0            0
  South-Apulia                0      0           61      0            0
  Umbria                      2      0            0     20            0
  West-Liguria                0      0            0      0           14

Misclassification Rate =  0.131
This looks terrible compared to other methods we have used above. Let’s try adding squared
terms based upon the fatty acid levels by specifying degree = 2.
> poly2.fda = fda(Area.name~.,data=Area.train,method=polyreg,degree=2)
> poly2.fda
Call:
fda(formula = Area.name ~ ., data = Area.train, method = polyreg,
degree = 2)
Dimension: 8
Percent Between-Group Variance Explained:
   v1    v2    v3    v4    v5    v6    v7     v8
35.39 60.59 81.07 88.03 93.76 97.11 99.06 100.00
Degrees of Freedom (per dimension): 36
Training Misclassification Error: 0.02887 ( N = 381 )
Again predicting the test cases.
> ypred = predict(poly2.fda,newdata=Area.test)
> misclass(ypred,Area.test$Area.name)
Misclassification Rate =  0.0838
> plot(poly2.fda)
We will now try using MARS instead of polynomial regression.
> mars.fda = fda(Area.name~.,data=Area.train,method=mars)
> mars.fda
Call:
fda(formula = Area.name ~ ., data = Area.train, method = mars)
Dimension: 8
Percent Between-Group Variance Explained:
   v1    v2    v3    v4    v5    v6    v7     v8
31.56 60.01 78.15 87.06 93.78 97.87 99.44 100.00
Training Misclassification Error: 0.05249 ( N = 381 )
Predicting the test cases in Area.test.
> ypred = predict(mars.fda,newdata=Area.test)
> misclass(ypred,Area.test$Area.name)
Misclassification Rate = 0.0942
Try a second-degree MARS model, i.e. add interaction terms.
> mars2.fda = update(mars.fda,degree=2)
> mars2.fda
Call:
fda(formula = Area.name ~ ., data = Area.train, method = mars,
degree = 2)
Dimension: 8
Percent Between-Group Variance Explained:
   v1    v2    v3    v4    v5    v6    v7     v8
35.12 58.89 74.20 84.62 93.52 97.26 99.27 100.00
Training Misclassification Error: 0.05249 ( N = 381 )
Predicting the test cases.
> ypred = predict(mars2.fda,newdata=Area.test)
> misclass(ypred,Area.test$Area.name)
Misclassification Rate = 0.0995
> plot(mars2.fda)
At least for these data, it appears that FDA does worse than LDA, QDA, and RDA.
We will now try using FDA for classifying growing region (3 levels) rather than growing area (9
levels).
> mars.fda = fda(Region.name~.,data=Region.train,method=mars,degree=2)
> mars.fda
Call:
fda(formula = Region.name ~ ., data = Region.train, method = mars,
degree = 2)
Dimension: 2
Percent Between-Group Variance Explained:
   v1     v2
68.31 100.00
Training Misclassification Error: 0.01632 ( N = 429 )
Predicting the test cases.
> ypred = predict(mars.fda,newdata=Region.test)
> misclass(ypred,Region.test$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern       39        0        4
  Sardinia        0       24        1
  Southern        1        0       74

Misclassification Rate =  0.042
> plot(mars.fda)
Mixture Discriminant Analysis (MDA) – pgs. 449 – 451 Elements of Statistical Learning
> area.mda = mda(Area.name~.,data=Area.train)    ← default is 3 subclasses per group
> plot(area.mda,Area.train)
> misclass(predict(area.mda,Area.train),Area.train$Area.name)
Table of Misclassification
(row = predicted, col = actual)
                  y
fit                Calabria Coastal-Sardinia East-Liguria Inland-Sardinia
  Calabria               40                0            0               0
  Coastal-Sardinia        0               23            0               0
  East-Liguria            0                0           31               0
  Inland-Sardinia         0                0            0              41
  North-Apulia            0                0            1               0
  Sicily                  0                0            0               0
  South-Apulia            1                0            0               0
  Umbria                  0                0            0               0
  West-Liguria            0                0            0               0
                  y
fit                North-Apulia Sicily South-Apulia Umbria West-Liguria
  Calabria                    0      2            0      0            0
  Coastal-Sardinia            0      0            0      0            0
  East-Liguria                1      0            0      0            0
  Inland-Sardinia             0      0            0      0            0
  North-Apulia               12      1            0      0            0
  Sicily                      0     17            3      0            0
  South-Apulia                0      1          141      0            0
  Umbria                      0      0            0     30            0
  West-Liguria                0      0            0      0           36

Misclassification Rate =  0.0262
Predict the test cases.
> ypred = predict(area.mda,newdata=Area.test)
> misclass(ypred,Area.test$Area.name)
Misclassification Rate =  0.0838

> area.mda = mda(Area.name~.,data=Area.train,subclass=1)
> misclass(predict(area.mda,Area.train),Area.train$Area.name)
Misclassification Rate =  0.0551

Predicting the test cases.
> ypred = predict(area.mda,newdata=Area.test)
> misclass(ypred,Area.test$Area.name)
Misclassification Rate =  0.131    ← POOPY!
Next we will try 2 and 4 subclasses, though it is doubtful the results will be impressive for these
either.
Two subclasses
> area.mda = mda(Area.name~.,data=Area.train,subclass=2)
> misclass(predict(area.mda,Area.train),Area.train$Area.name)
Misclassification Rate =  0.0315
> ypred = predict(area.mda,newdata=Area.test)
> misclass(ypred,Area.test$Area.name)
Misclassification Rate =  0.0785

Four subclasses
> area.mda = mda(Area.name~.,data=Area.train,subclass=4)
> misclass(predict(area.mda,Area.train),Area.train$Area.name)
Misclassification Rate =  0.0157
> ypred = predict(area.mda,newdata=Area.test)
> misclass(ypred,Area.test$Area.name)
Misclassification Rate =  0.0838
Next we consider trying to classify growing region vs. growing area. The growing areas are
contained within the three growing regions. To examine this structure we will construct a table
from the original full data frame Olives.
> table(Olives$Area.name,Olives$Region.name)
                 Northern Sardinia Southern
Calabria                0        0       56
Coastal-Sardinia        0       33        0
East-Liguria           50        0        0
Inland-Sardinia         0       65        0
North-Apulia            0        0       25
Sicily                  0        0       36
South-Apulia            0        0      206
Umbria                 51        0        0
West-Liguria           50        0        0
Notice that the northern region comprises three growing areas, Sardinia two, and southern
Italy four. Thus when specifying subclasses when using MDA to predict
growing region we may wish to vary the number of subclasses for each region.
> region.mda = mda(Region.name~.,data=Region.train,subclasses=c(3,2,4))
> region.mda
Call:
mda(formula = Region.name ~ ., data = Region.train, subclasses = c(3, 2, 4))

Dimension: 7

Percent Between-Group Variance Explained:
    v1     v2     v3     v4     v5     v6     v7
 63.80  81.58  94.49  97.08  99.09  99.78 100.00

Degrees of Freedom (per dimension): 8

Training Misclassification Error: 0.01166 ( N = 429 )

Deviance: 30.424
Predicting the test cases.
> ypred = predict(region.mda,newdata=Region.test)
> misclass(ypred,Region.test$Region.name)
Table of Misclassification
(row = predicted, col = actual)
                y
fit        Northern Sardinia Southern
  Northern       39        0        2
  Sardinia        0       23        0
  Southern        1        1       77

Misclassification Rate =  0.028
> plot(region.mda)
Comparing the predictive performance of MDA to RDA (which was better than LDA/QDA) we
see that MDA is on par (2.8% misclassified).