
Exercises
of
Multivariate Statistical Methods for
Engineering and Management
Master in Industrial Engineering and Management
2nd Semester 2010/2011
Principal Components Analysis
1. Let x1 be the return on income and x2 be earnings before interest and taxes, from
which we obtained the following results:
          ( 19.32 )           ( 70.41  5.87 )
    x̄ =  (       ) ,    S =  (             ) .
          (  1.51 )           (  5.87  0.97 )
(a) Find the principal components based on the covariance matrix and the estimated variance of each principal component.
(b) Find the percentage of the total variation explained by each principal component.
(c) Find the estimated correlations Cor(Ŷ1 , Xk ), k = 1, 2. Interpret the first principal component.
(d) Would the results change if the correlation matrix were used to extract the principal
components? Why?
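Part (a) reduces to the eigendecomposition of a 2×2 symmetric matrix, which has a closed form. A minimal pure-Python sketch (the helper name eigen_2x2 is mine, not part of the exercise):

```python
import math

# For a 2x2 symmetric covariance matrix [[a, b], [b, c]], the eigenvalues are
# ((a + c) +/- sqrt((a - c)^2 + 4 b^2)) / 2.
def eigen_2x2(a, b, c):
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector for lam1 solves (a - lam1) v1 + b v2 = 0.
    v = (b, lam1 - a)
    norm = math.hypot(*v)
    gamma1 = (v[0] / norm, v[1] / norm)
    return lam1, lam2, gamma1

a, b, c = 70.41, 5.87, 0.97            # S from the exercise
lam1, lam2, gamma1 = eigen_2x2(a, b, c)
total = a + c                          # tr(S) = total variance
print("PC variances:", lam1, lam2)
print("explained:", lam1 / total, lam2 / total)
print("first PC direction:", gamma1)
```

The first eigenvalue absorbs almost all of tr(S), which anticipates the interpretation asked for in parts (b) and (c).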
2. Given the covariance matrix


         ( 8  0  1 )
    Σ =  ( 0  8  3 ) .
         ( 1  3  5 )
(a) Compute the eigenvalues λ1 , λ2 , and λ3 of Σ, and the eigenvectors γ 1 , γ 2 , and
γ 3 of Σ.
Hint: You may use R to compute the eigenvalues and eigenvectors of Σ.
(b) Show that λ1 + λ2 + λ3 = tr(Σ), where the trace of a matrix equals the sum of
its diagonal elements.
(c) Show that λ1 λ2 λ3 = |Σ|, where |Σ| is the determinant of Σ.
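The hint suggests R; the same check can be sketched in pure Python by locating the roots of the characteristic polynomial det(Σ − λI) = 0 with bisection (the brackets were chosen by inspecting sign changes, and det3, charpoly, and bisect are my helper names):

```python
# The matrix below assumes the entries missing from the display are filled
# in by symmetry.
S = [[8, 0, 1],
     [0, 8, 3],
     [1, 3, 5]]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def charpoly(lam):
    return det3([[S[i][j] - (lam if i == j else 0) for j in range(3)]
                 for i in range(3)])

def bisect(lo, hi, tol=1e-12):
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if charpoly(lo) * charpoly(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

eigenvalues = [bisect(2, 4), bisect(7.5, 8.5), bisect(9, 11)]
trace = sum(S[i][i] for i in range(3))
print("eigenvalues:", eigenvalues)
print("sum vs trace:", sum(eigenvalues), trace)
print("product vs det:", eigenvalues[0] * eigenvalues[1] * eigenvalues[2], det3(S))
```

This verifies parts (b) and (c) numerically: the eigenvalues sum to tr(Σ) and multiply to |Σ|.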
3. Four measurements were made on 500 animals. The first three measurements refer
to different linear dimensions measured in centimeters, while the fourth variable
refers to the weight of the animal measured in grams. We obtained the eigenvalues
associated with the sample covariance matrix: 14.1, 4.3, 1.2 and 0.4. The two first
eigenvectors were also obtained:
γ 1 = (0.39, 0.42, 0.44, 0.69)t ,
1
γ 2 = (0.40, 0.39, 0.42, −0.72)t .
(a) Analyze the data using principal components.
(b) Find the variation explained by each principal component and interpret the
principal components.
4. Given (X1 , X2 )t with covariance matrix:
         ( 4    6  )
    S =  ( 6  400  ) ,
eigenvalues: λ1 = 400.09, λ2 = 3.909, and eigenvectors:
γ 1 = (0.015, 0.999)t ,
γ 2 = (0.999, −0.015)t .
(a) Write the principal components.
(b) Obtain the principal components for the standardized variables.
(c) Compare the two sets of principal components by means of the percentage of
explained variability and the correlation between each principal component
and each observed variable.
(d) In what kind of practical situation should you use principal components obtained from standardized variables?
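The contrast in part (c) can be sketched numerically: the exercise lists λ1 = 400.09 and λ2 = 3.909 for the covariance matrix, and for the implied correlation matrix the eigenvalues are 1 ± ρ. A pure-Python check (eigenvalues_2x2 is my helper name):

```python
import math

# With variances 4 and 400, the covariance-based first PC is dominated by X2;
# standardizing removes this scale effect.
s11, s12, s22 = 4.0, 6.0, 400.0
rho = s12 / math.sqrt(s11 * s22)     # 6 / (2 * 20) = 0.15

def eigenvalues_2x2(a, b, c):
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    return half_trace + radius, half_trace - radius

cov_lams = eigenvalues_2x2(s11, s12, s22)
# For a 2x2 correlation matrix [[1, rho], [rho, 1]] the eigenvalues are 1 +/- rho.
cor_lams = eigenvalues_2x2(1.0, rho, 1.0)
print("covariance PCA, share of PC1:", cov_lams[0] / (s11 + s22))
print("correlation PCA, share of PC1:", cor_lams[0] / 2.0)
```

The first PC explains about 99% of the variance in the covariance analysis but only 57.5% after standardizing, which is the point of part (d).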
5. Let X = (X1 , X2 )t be a random vector with correlation matrix:
         ( 1  ρ )
    R =  ( ρ  1 ) .
(a) Find the principal components and the amount of variation explained by each
of them. What happens when ρ = 0? What happens when ρ = 1?
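A short derivation sketch for part (a), taking ρ ≥ 0 so that λ1 ≥ λ2:

```latex
\det(R - \lambda I) = (1-\lambda)^2 - \rho^2 = 0
\;\Longrightarrow\;
\lambda_1 = 1 + \rho, \qquad \lambda_2 = 1 - \rho,
\qquad
\gamma_1 = \tfrac{1}{\sqrt{2}}(1, 1)^t, \qquad
\gamma_2 = \tfrac{1}{\sqrt{2}}(1, -1)^t .
```

So Y1 = (X1 + X2)/√2 explains (1 + ρ)/2 of the total variation: when ρ = 0 each component explains half, and as ρ → 1 the first component explains everything.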
(b) Consider another random vector: X1 = (X1 , X2 , X3 )t (where X3 was added to
X), that has correlation matrix:


          ( 1  ρ  0 )
    R1 =  ( ρ  1  0 ) .
          ( 0  0  1 )
Find the principal components and the amount of variation explained by each
of them.
(c) Compare the results obtained in (5a) and (5b). What can you conclude?
6. Table 1 provides the average price in cents per pound of five food items in 24 U.S.
cities (source: U.S. Department of Labor, Bureau of Labor Statistics, Washington, D.C., 1978).
(a) Using principal components analysis, define price index measure(s) based on
the five food items.
Table 1: Food Price Data. Average price in cents per pound.

City            Bread  Hamburger  Butter  Apples  Tomatoes
Anchorage        70.9      135.6  155.0     63.9     100.1
Atlanta          36.4      111.5  144.3     53.9      95.9
Baltimore        28.9      108.8  151.0     47.5     104.5
Boston           43.2      119.3  142.0     41.1      96.5
Buffalo          34.5      109.9  124.8     35.6      75.9
Chicago          37.1      107.5  145.4     65.1      94.2
Cincinnati       37.1      118.1  149.6     45.6      90.8
Cleveland        38.5      107.7  142.7     50.3      83.2
Dallas           35.5      116.8  142.5     62.4      90.7
Detroit          40.8      108.8  140.1     39.7      96.1
Honolulu         50.9      131.7  154.4     65.0      93.9
Houston          35.1      102.3  150.3     59.3      84.5
Kansas City      35.1       99.8  162.3     42.6      87.9
Los Angeles      36.9       96.2  140.4     54.7      79.3
Milwaukee        33.3      109.1  123.2     57.7      87.7
Minneapolis      32.5      116.7  135.1     48.0      89.1
New York         42.7      130.8  148.7     47.6      92.1
Philadelphia     42.9      126.9  153.8     51.9     101.5
Pittsburgh       36.9      115.4  138.9     43.8      91.9
St. Louis        36.9      109.8  140.0     46.7      79.0
San Diego        32.5       84.5  145.9     48.5      82.3
San Francisco    40.0      104.6  139.1     59.2      81.9
Seattle          32.2      105.4  136.8     54.0      88.6
Washington       31.8      116.7  154.81    57.6      86.6
(b) Identify the most and the least expensive cities (based on the above price
index measures). Do the most and the least expensive cities change when
standardized data are used instead of mean-corrected data? Which type of
data should be used to define price index measures? Why?
(c) Plot the data using principal components scores and identify distinct groups
of cities. How are these groups different from each other?
Hint: Use R and explore the functions princomp and prcomp.
(d) Anchorage is characterized by high prices and can be seen as a potential
outlier. Eliminate this observation from your dataset and repeat the analysis.
What can you conclude?
7. Protein consumption measured in twenty-five European countries for 5 food groups
was obtained to determine whether there are groups of countries and whether meat
consumption is related to that of other foods. The 5 food groups under study are:
Red Meat, White Meat, Eggs, Milk, and Fish. The eigenvalues associated with the
sample correlation matrix (R) are: λ1 = 2.394, λ2 = 1.214, λ3 = 0.721, λ4 = 0.479,
and λ5 = 0.191. The first two eigenvectors (γi , i = 1, 2) of R are:
                         γ1       γ2       γ3
Red Meat              -0.151   -0.133   11.203
White Meat            -0.129   -0.043   13.646
Eggs                  -0.067   -0.021    1.249
Milk                  -0.425   -0.831   50.487
Fish                  -0.127    0.292   11.577
Cereals                0.861   -0.406  120.446
Starch                -0.067    0.076    2.670
Nuts                   0.114    0.070    3.943
Fruit and Vegetables   0.020    0.169    3.254
(a) Write the first two sample principal components.
(b) Find the percentage of the total variability explained by each principal component
and interpret the first two principal components. How many principal
components should be retained?
(c) Find the correlation between the first principal component (Y1 ) and each of the
standardized original variables (Xi , i = 1, . . . , 5). Interpret the first principal
component and compare your findings with part (7b).
Hint: Recall that Cor(Ŷj , Xi ) = γij √λj / √Var(Xi ).
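Since the variables are standardized, Var(Xi) = 1 and the hint reduces to γij √λj. A pure-Python sketch for part (c), using the γ1 entries of the five food groups named in the question:

```python
import math

lambda1 = 2.394
gamma1 = {"Red Meat": -0.151, "White Meat": -0.129, "Eggs": -0.067,
          "Milk": -0.425, "Fish": -0.127}

# Correlation between PC1 and each standardized variable: gamma_i1 * sqrt(lambda_1).
correlations = {name: g * math.sqrt(lambda1) for name, g in gamma1.items()}
for name, r in correlations.items():
    print(name, round(r, 3))
```

Milk shows the strongest (negative) correlation with the first component, which guides its interpretation.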
Factor Analysis
1. Consider the one-factor model associated with the standardized random vector X =
(X1 , . . . , X5 )t , represented by the following equations:
X1 = 0.65f + e1 ,
X2 = 0.84f + e2 ,
X3 = 0.70f + e3 ,
X4 = 0.32f + e4 ,
X5 = 0.28f + e5 .
Admit the usual assumptions: E(f ) = 0, E(e) = 0, Var(f ) = 1, Var(e) =
diag{ψ1 , . . . , ψ5 }, Cov(f, e) = 0.
(a) Compute the communalities of the manifest variables with the common factor.
(b) What are the unique variances associated with each manifest variable?
(c) Compute the correlation between the following sets of manifest variables:
i. X1 and X2 .
ii. X2 and X3 .
iii. X2 and X5 .
iv. X4 and X5 .
(d) What is the shared variance between each manifest variable and the common
factor?
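Under the one-factor model for standardized variables, these quantities all follow from the loadings: communality h²ᵢ = λ²ᵢ, unique variance ψᵢ = 1 − h²ᵢ, and Cor(Xi, Xj) = λᵢλⱼ. A minimal sketch (corr is my helper name):

```python
loadings = {"X1": 0.65, "X2": 0.84, "X3": 0.70, "X4": 0.32, "X5": 0.28}

communalities = {v: l ** 2 for v, l in loadings.items()}     # part (a)
unique_vars = {v: 1.0 - h for v, h in communalities.items()} # part (b)

def corr(vi, vj):
    # Implied correlation between two manifest variables: product of loadings.
    return loadings[vi] * loadings[vj]

print("communalities:", communalities)
print("unique variances:", unique_vars)
print("Cor(X1, X2) =", corr("X1", "X2"))
print("Cor(X4, X5) =", corr("X4", "X5"))
```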
2. Consider the two-factor model associated with the standardized random vector X =
(X1 , . . . , X6 )t , represented by the following equations:
X1 = 0.85f1 + 0.12f2 + e1 ,
X2 = 0.74f1 + 0.07f2 + e2 ,
X3 = 0.67f1 + 0.18f2 + e3 ,
X4 = 0.21f1 + 0.93f2 + e4 ,
X5 = 0.05f1 + 0.77f2 + e5 ,
X6 = 0.08f1 + 0.62f2 + e6 .
Assume that the usual assumptions hold for the above model.
(a) Compute the communalities of the manifest variables with the common factors.
(b) What are the unique variances associated with each manifest variable?
(c) Compute the correlation between the following sets of manifest variables:
i. X1 and X2 .
ii. X2 and X3 .
iii. X4 and X5 .
(d) What percentage of the variance of variables X1 , X3 , and X6 is not accounted
for by the common factors f1 and f2 ?
(e) Identify sets of manifest variables that share more than 90% of the total shared
variance with each common factor. Which variables should therefore be used
to interpret each common factor?
3. Show that the correlation matrix:


         ( 1.00  0.63  0.45 )
    Σ =  ( 0.63  1.00  0.35 )
         ( 0.45  0.35  1.00 )
associated with p = 3 standardized variables Z1 , Z2 and Z3 can be obtained by the
one-factor model
Z1 = 0.9f1 + e1 ,
Z2 = 0.7f1 + e2 ,
Z3 = 0.5f1 + e3 ,
where Var (f1 ) = 1, Cov (e, f1 ) = 0 and


                          ( 0.19  0     0    )
    Ψ = Cov (e, e′ ) =    ( 0     0.51  0    ) .
                          ( 0     0     0.75 )
That is, Σ can be written as Σ = ΛΛt + Ψ.
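The identity Σ = ΛΛᵗ + Ψ can be checked entry by entry in a few lines of Python (a sketch of the verification the exercise asks for):

```python
Lam = [0.9, 0.7, 0.5]          # one-factor loadings
Psi = [0.19, 0.51, 0.75]       # unique variances
Sigma = [[1.00, 0.63, 0.45],
         [0.63, 1.00, 0.35],
         [0.45, 0.35, 1.00]]

# Lambda Lambda^t + Psi: off-diagonal entries are products of loadings,
# diagonal entries are communality + unique variance.
reproduced = [[Lam[i] * Lam[j] + (Psi[i] if i == j else 0.0)
               for j in range(3)] for i in range(3)]
residual = [[Sigma[i][j] - reproduced[i][j] for j in range(3)] for i in range(3)]
print("reproduced:", reproduced)
print("max |residual|:", max(abs(r) for row in residual for r in row))  # ~0
```

The residual is zero up to floating point, confirming that the stated loadings and Ψ reproduce Σ exactly.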
4. Considering exercise 3, compute:
(a) The communalities h2i , i = 1, 2, 3. Give an interpretation of those quantities.
(b) The correlations Cor (Zi , f1 ), i = 1, 2, 3. What manifest variable is the most
important in the interpretation of the common factor? Why?
5. Consider the following correlation matrix (lower triangle shown):

         ( 1.000                      )
    Σ =  ( 0.690  1.000               )
         ( 0.280  0.255  1.000        )  ,
         ( 0.350  0.195  0.610  1.000 )

and that the estimated factor loadings were extracted by the principal components
method:

Variable    f1     f2
X1          0.80   0.20
X2          0.70   0.15
X3          0.10   0.90
X4          0.20   0.70
Compute and discuss the following:
(a) Communalities.
(b) Specific variances.
(c) Proportion of variance explained by each factor.
(d) Interpretation of each factor.
(e) Estimated or reproduced correlation matrix.
(f) Residual matrix.
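Parts (a), (b), (e), and (f) are direct computations from the loadings table; a pure-Python sketch:

```python
L = [[0.80, 0.20],
     [0.70, 0.15],
     [0.10, 0.90],
     [0.20, 0.70]]
R = [[1.000, 0.690, 0.280, 0.350],
     [0.690, 1.000, 0.255, 0.195],
     [0.280, 0.255, 1.000, 0.610],
     [0.350, 0.195, 0.610, 1.000]]

h2 = [l1 ** 2 + l2 ** 2 for l1, l2 in L]   # (a) communalities
psi = [1.0 - h for h in h2]                # (b) specific variances
# (c) proportion of variance explained by each factor: column sums of squares / p.
explained = [sum(row[k] ** 2 for row in L) / 4 for k in range(2)]
# (e) reproduced correlation matrix Lambda Lambda^t + Psi.
reproduced = [[sum(L[i][k] * L[j][k] for k in range(2))
               + (psi[i] if i == j else 0.0)
               for j in range(4)] for i in range(4)]
# (f) residual matrix R - reproduced.
residual = [[R[i][j] - reproduced[i][j] for j in range(4)] for i in range(4)]
print("communalities:", h2)
print("specific variances:", psi)
print("explained proportions:", explained)
print("residual:", residual)
```

The residuals here are not zero, unlike exercise 3, which is what part (f) asks you to discuss.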
6. Consider the following two-factor models:
Model 1:
X1 = 0.558f1 + 0.615f2 + e1 ,
X2 = 0.604f1 + 0.748f2 + e2 ,
X3 = 0.469f1 + 0.556f2 + e3 ,
X4 = 0.818f1 − 0.411f2 + e4 ,
X5 = 0.866f1 − 0.466f2 + e5 ,
X6 = 0.686f1 − 0.461f2 + e6 .
Model 2:
X1 = 0.104f1 + 0.824f2 + e1 ,
X2 = 0.065f1 + 0.959f2 + e2 ,
X3 = 0.065f1 + 0.725f2 + e3 ,
X4 = 0.906f1 + 0.134f2 + e4 ,
X5 = 0.977f1 + 0.116f2 + e5 ,
X6 = 0.827f1 + 0.016f2 + e6 .
(a) Show that although the loadings and shared variances of each manifest variable are different for the two models, the total communalities of each manifest
variable, the unique variances, and the correlation matrices of the manifest
variables are the same for the two models. This illustrates the factor indeterminacy problem.
(b) In what way is the interpretation of the common factors in the two models
different?
7. The eigenvalues and eigenvectors associated with Σ from exercise 3 are:
λ1 = 1.96, γ1 = (0.625, 0.593, 0.507)t
λ2 = 0.68, γ2 = (−0.219, −0.491, 0.843)t
λ3 = 0.36, γ3 = (0.749, −0.638, −0.177)t
(a) Consider the one-factor model. Compute the loadings and the specific variances
using the principal components estimation method.
(b) What percentage of the total shared variance is explained by the common
factor?
8. Given Σ and Ψ from exercise 3 and the one-factor model, compute the estimated or
reproduced correlation matrix Σ∗ = Σ − Ψ by the principal components estimation
method and obtain the loadings Λ. Are the results coherent with those from
exercise 3?
9. What is the conceptual difference between factor analysis and principal components
analysis?
Cluster Analysis
1. Let sij be a similarity such that 0 < sij ≤ 1. Show that dij = 1 − sij and dij = − log sij
are dissimilarities.
2. Consider two binary variables X and Y, and suppose that a set of subjects was measured
on these two variables, leading to the following results:

    X\Y   1   0
      1   a   b
      0   c   d
A usual similarity coefficient between variables X and Y is the correlation coefficient,
rXY . Show that:

    rXY = (ad − bc) / [(a + b)(c + d)(a + c)(b + d)]^{1/2} .
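The identity can be sanity-checked numerically: rebuild the 0/1 observations from an arbitrary hypothetical table of counts (the a, b, c, d values below are mine, not from the exercise) and compare the Pearson correlation with the closed-form expression:

```python
import math

a, b, c, d = 10, 4, 3, 8   # hypothetical counts: (X,Y) = (1,1), (1,0), (0,1), (0,0)

# Rebuild the raw binary observations from the counts.
pairs = [(1, 1)] * a + [(1, 0)] * b + [(0, 1)] * c + [(0, 0)] * d
n = len(pairs)
mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
sxy = sum((x - mx) * (y - my) for x, y in pairs)
sxx = sum((x - mx) ** 2 for x, _ in pairs)
syy = sum((y - my) ** 2 for _, y in pairs)
pearson = sxy / math.sqrt(sxx * syy)

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(pearson, phi)   # the two values agree
```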
3. Consider the following species:
Tiger, Dog, Whale, Hare, Man,
and the following attributes:
• eats other animals,
• eats vegetables,
• moves on four legs,
• is a domestic animal,
• is a wild animal.
Obtain the similarity matrix based on the Jaccard coefficient.
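A computational sketch of the Jaccard similarity on a hypothetical binary coding of the five attributes (the 0/1 values below are one plausible coding and are NOT given in the exercise; substitute your own answers):

```python
# Columns: eats animals, eats vegetables, four legs, domestic, wild.
attributes = {
    "Tiger": (1, 0, 1, 0, 1),
    "Dog":   (1, 1, 1, 1, 0),
    "Whale": (1, 0, 0, 0, 1),
    "Hare":  (0, 1, 1, 0, 1),
    "Man":   (1, 1, 0, 0, 0),
}

def jaccard(u, v):
    # Jaccard coefficient: joint presences over presences in either item.
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    return both / either if either else 0.0

species = list(attributes)
for i, s in enumerate(species):
    for t in species[i + 1:]:
        print(s, t, round(jaccard(attributes[s], attributes[t]), 3))
```

Note that joint absences (0,0) do not enter the Jaccard coefficient, which distinguishes it from the simple matching coefficient.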
4. Show that the distance dk(ij) (between cluster k and the cluster formed by the union
of clusters i and j) used by single linkage, complete linkage, and average linkage
verifies the formula:
    dk(ij) = αi dki + αj dkj + γ |dki − dkj| ,

where:

    Single linkage:     αi = αj = 1/2,  γ = −1/2
    Complete linkage:   αi = αj = 1/2,  γ = 1/2
    Average linkage:    αi = ni/(ni + nj),  αj = 1 − αi,  γ = 0
5. The matrix

         ( 1.0  7.0 )
         ( 1.5  7.5 )
    A =  ( 2.0  8.5 )
         ( 3.0  7.0 )
         ( 4.0  4.0 )
         ( 2.0  1.0 )

contains measurements on two characteristics obtained on 6 objects.
(a) Using the Euclidean distance, obtain the dendrograms using the following
hierarchical clustering methods: single linkage, complete linkage, average
linkage, and Ward’s method.
(b) Do a scatter plot of the observed data and give an interpretation of the results
obtained in part 5a.
(c) Choose the appropriate number of clusters and comment on the results obtained.
6. Admit that you want to group the following 8 objects:

    ( 2  10 )
    ( 5   8 )
    ( 1   2 )
    ( 2   5 )
    ( 8   4 )
    ( 7   5 )
    ( 6   4 )
    ( 4   9 )

into 3 clusters. Use as the initial partition the first three objects as centroids of the
clusters. Use K-means and obtain:
(a) The centroids of each cluster after the first iteration of the algorithm.
(b) The final partition in 3 clusters.
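A plain-Python sketch of the K-means iterations, with the first three objects as the initial centroids as the exercise specifies (assign and update are my helper names):

```python
points = [(2, 10), (5, 8), (1, 2), (2, 5), (8, 4), (7, 5), (6, 4), (4, 9)]
centroids = [points[0], points[1], points[2]]

def assign(points, centroids):
    # Label each point with the index of its nearest centroid (squared Euclidean).
    labels = []
    for p in points:
        d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
        labels.append(d.index(min(d)))
    return labels

def update(points, labels, k):
    # Recompute each centroid as the mean of its assigned points.
    new = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        new.append((sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members)))
    return new

labels = assign(points, centroids)
centroids = update(points, labels, 3)
print("centroids after first iteration:", centroids)   # part (a)

while True:                      # iterate to convergence for part (b)
    new_labels = assign(points, centroids)
    if new_labels == labels:
        break
    labels = new_labels
    centroids = update(points, labels, 3)
print("final partition:", labels)
```

After the first iteration the centroids are (2, 10), (6, 6), and (1.5, 3.5); the algorithm then stabilizes at the partition {(2,10), (5,8), (4,9)}, {(8,4), (7,5), (6,4)}, {(1,2), (2,5)}.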
7. Answer the following questions using Fisher’s iris data.
(a) Does there appear to be more than one natural cluster in these data?
(b) Using an appropriate clustering method, find a three-cluster solution for the
iris data. How does your clustering solution correspond to the actual species
as identified in the data set?
(c) Ignore the suggestion of three clusters and use the methods you have learned
to discuss the correct number of clusters.
(d) Use the validation methods you have learned to validate the three-cluster solution.
8. Use R to generate six independent columns of twenty observations, each from a
uniform distribution on the interval [0,1]. Use R to carry out clustering techniques,
both on the observations, 1-20, and on the variables, 1-6. Repeat the exercise, generating
a new set of data. Compare the clusters obtained on the two occasions. What
do you conclude?
Hint: Use the command runif to solve this question.
9. Use R to generate two columns of 100 observations, each from a uniform distribution
on the interval [0,1], called set A, and another one generated under the same conditions,
named set B.
(a) Using k-means clustering, find a four-cluster solution for the data of set A.
(b) Follow the procedures you have learned to validate the clustering solution from
part (9a), possibly using set B. Describe your results. Would you say a four-cluster
solution is valid?
(c) Repeat parts (9a) and (9b) above for a three-cluster solution. What are the
differences between the three-cluster validation and the four-cluster validation?
How would you explain them?
Regression Analysis
1. In an experiment to investigate the variation of the specific heat (Y ) of a certain
chemical with the temperature (x), two specific heat measurements were made for
each of the 6 temperatures under study.
Temperature (°C)    Specific heat (°C/g)
 50                 1.60, 1.64
 60                 1.63, 1.65
 70                 1.67, 1.67
 80                 1.70, 1.72
 90                 1.71, 1.72
100                 1.71, 1.74
(a) Estimate the regression equation using the least squares method.
(b) Obtain a 95% confidence interval on the intercept β0 .
(c) Test the hypothesis H0 : β1 = 0 against the hypothesis H1 : β1 ≠ 0.
(d) Get an estimate of the specific heat corresponding to temperature 75 °C. Determine
the 95% confidence interval and prediction interval for this temperature
value.
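Part (a) can be sketched directly from the least squares formulas (pure Python, computed from the table above):

```python
temps = [50, 50, 60, 60, 70, 70, 80, 80, 90, 90, 100, 100]
heat = [1.60, 1.64, 1.63, 1.65, 1.67, 1.67,
        1.70, 1.72, 1.71, 1.72, 1.71, 1.74]

n = len(temps)
xbar = sum(temps) / n
ybar = sum(heat) / n
sxx = sum((x - xbar) ** 2 for x in temps)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(temps, heat))
b1 = sxy / sxx                 # slope
b0 = ybar - b1 * xbar          # intercept
print("fitted line: y =", round(b0, 4), "+", round(b1, 6), "* x")
# Point estimate for part (d); note 75 is exactly the mean temperature,
# so the fit there equals the mean specific heat.
print("prediction at 75:", round(b0 + b1 * 75, 4))
```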
2. In an experiment carried out to perform the calibration of a device that measures
the concentration of acid in the blood, 20 samples of known concentration (x) were
used and the values measured by the device (Y ) were recorded:
x   1    1    1    1    3    3    3    3    3    5
y   1.1  0.7  1.8  0.4  3.0  1.4  4.9  4.4  4.5  7.3

x   5    5    10    10    10    10    15    15    15    15
y   8.2  6.2  12.0  13.1  12.6  13.2  18.7  19.7  17.4  17.1
(a) Sketch the graph of the data associated with this experiment and verify that
the relation between the variables is approximately linear. Draw a regression
line that you think best serves to translate the linear relationship between the
known values of concentration and the corresponding values recorded by the
device.
(b) Estimate the regression equation of Y on x and compare the estimated values
with the corresponding ones read directly from the graph obtained in (a).
(c) Consider the following statements about the model:
- The slope of the regression equation is zero.
- The slope of the regression equation is 1.
- The regression equation passes through the origin.
Proceed to test each of these cases against all the alternatives.
(d) Derive a 95% confidence interval for the regression equation. To do this, build
five confidence intervals for the mean of Y at the fixed values of x used in the
experiment. Draw the extreme values of each confidence interval and sketch
the contours of the confidence interval for the regression equation.
(e) Calculate and comment on the coefficient of determination of the model.
3. A criminologist decided to study the relationship between population density and
the rate of burglaries in medium-sized cities in the USA. In order to do so, he
collected a random sample of 16 cities where the variable Y , the rate of burglaries
in the past year (number of burglaries per 100,000 inhabitants), is supposed to be
influenced by the variable x, population density (number of inhabitants per unit area),
whose observed values are between 47 and 94. According to the collected sample
we obtained the following data, presented in matrix form:
    (X^T X)^{-1} = (1/53308) (  81452  −1118 ) ,    X^T Y = (   3220 ) ,    Y^T Y = 649736.
                             (  −1118     16 )             ( 225869 )
Assume that the model of linear regression is appropriate.
(a) Get the estimated regression equation and interpret the results.
(b) Estimate, with a 99% confidence level, the expected number of burglaries in the last
year in cities with population density equal to 60.
(c) Is there a linear association between the burglary rate and population density?
Consider a significance level of 5% and SST = 1711.
(d) By how much is the variability of the burglary rate reduced when population density
is introduced into the model? Is the reduction relatively small or large? Comment
on the result by comparing it with the results of the preceding question.
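Part (a) follows from β̂ = (XᵀX)⁻¹ XᵀY with the matrices given above (note the 1/53308 factor in front of the inverse). A pure-Python sketch:

```python
inv_xtx = [[81452 / 53308, -1118 / 53308],
           [-1118 / 53308, 16 / 53308]]
xty = [3220, 225869]

# beta_hat = (X^T X)^{-1} X^T Y
beta = [sum(inv_xtx[i][j] * xty[j] for j in range(2)) for i in range(2)]
print("intercept:", beta[0])
print("slope:", beta[1])
# Point estimate for part (b): fitted burglary rate at density x = 60
# (inside the observed range 47-94).
print("fit at x = 60:", beta[0] + beta[1] * 60)
```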
4. Among the factors that can influence the price of a house, the area and the age of
the house stand out. To understand the relationship between price, area and age, 20
American households selected at random were studied. The observed areas of the houses
are between 0.8 and 3.6 and the ages are between 1 and 33 (the area is measured in
thousands of square feet, the age in years, and the price in thousands of
US dollars). Assuming the multiple linear regression model is appropriate for this
dataset, R was used to estimate the parameters, and part of the output is presented
below.
Call:
lm(formula = Price ~ Area + Age, data = x)

Residuals:
     Min       1Q   Median       3Q      Max
-3.27322 -1.63194  0.05641  1.59341  3.90155

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.39367    1.80957   5.744 2.39e-05 ***
Area        12.75077    0.60261  21.159 1.19e-13 ***
Age         -0.07187    0.05312  -1.353    0.194
---
Signif. codes:  0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 2.208 on 17 degrees of freedom
Multiple R-squared: 0.9643,    Adjusted R-squared: 0.9602
F-statistic: 229.9 on 2 and 17 DF,  p-value: 4.932e-13

Analysis of Variance Table

Response: Price
          Df  Sum Sq Mean Sq  F value    Pr(>F)
Area       1 2232.74 2232.74 457.9580 9.878e-14 ***
Age        1    8.92    8.92   1.8305    0.1938
Residuals 17   82.88    4.88
---
Signif. codes:  0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

solve(t(X)%*%X)
            [,1]          [,2]          [,3]
[1,]  0.67164255 -0.1630071739 -0.0137501489
[2,] -0.16300717  0.0744831743  0.0006530616
[3,] -0.01375015  0.0006530616  0.0005787919
(a) Explain the meaning of the estimates 12.75077 and -0.07187.
(b) Calculate the 95% confidence interval on the mean price of houses with an area
of 2200 square feet and 26 years of age. Calculate the 95% prediction interval
on a house with the same characteristics. Compare and discuss the results
obtained from the two intervals.
(c) From the ANOVA table we can easily conclude that the regression model is
significant. But it is suspected that, of the two predictors involved, only
the area is meaningful.
i. Confirm the assumption “the regression model is significant”, by making
the appropriate analysis of the ANOVA table.
ii. Confirm that only area is a meaningful predictor using a t-test.
(d) The aim is to predict the price of a house with 50 years of age. What do you
think about the usefulness of this model to make this prediction?
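Part (b) can be sketched from the pieces of the output above: the coefficient estimates, s = 2.208, and (XᵀX)⁻¹, with x0 = (1, 2.2, 26). The t quantile t(0.975, 17) ≈ 2.110 is taken from a t-table rather than computed:

```python
import math

beta = [10.39367, 12.75077, -0.07187]
s = 2.208
inv_xtx = [[0.67164255, -0.1630071739, -0.0137501489],
           [-0.16300717, 0.0744831743, 0.0006530616],
           [-0.01375015, 0.0006530616, 0.0005787919]]
x0 = [1.0, 2.2, 26.0]          # intercept, area in thousands of sq ft, age
t17 = 2.110                    # t_{0.975, 17} from a t-table

fit = sum(b * x for b, x in zip(beta, x0))
# Quadratic form x0^T (X^T X)^{-1} x0 drives both interval widths.
quad = sum(x0[i] * inv_xtx[i][j] * x0[j] for i in range(3) for j in range(3))
se_mean = s * math.sqrt(quad)          # standard error of the mean response
se_pred = s * math.sqrt(1.0 + quad)    # standard error for a new observation
print("fitted price:", fit)
print("95% CI: ", fit - t17 * se_mean, fit + t17 * se_mean)
print("95% PI: ", fit - t17 * se_pred, fit + t17 * se_pred)
```

The prediction interval is necessarily wider than the confidence interval because it adds the variance of a single new observation, which is the comparison part (b) asks you to discuss.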
5. Consider a multiple regression model with 5 predictors.
(a) Provide the relevant sums of squares for testing
    H01 : β1 = 0  vs  H11 : β1 ≠ 0,    and    H02 : β3 = β4 = 0  vs  H12 : β3 ≠ 0 or β4 ≠ 0.
(b) Show that SSR(x1 , x2 , x3 , x4 ) = SSR(x1 |x2 , x3 , x4 )+SSR(x2 , x3 |x4 )+SSR(x4 ).
6. A regression was constructed by introducing successively the following variables:
x1 , then x2 , up to x5 , and successive sums of squares were obtained: SSR(x1 ),
SSR(x2 |x1 ), . . . , SSR(x5 |x1 , x2 , x3 , x4 ). Based on SSR(x2 , x3 , x4 |x1 ), we intend to
test the hypothesis that the terms x2 , x3 , x4 should not be in the regression model,
given that x1 is in the regression model. Obtain SSR(x2 , x3 , x4 |x1 ) as a function of
the sums of squares obtained by the successive regressions model.