Advanced Statistical Approaches to Quality
INSE 6220: Winter 2016
ASSIGNMENT 2
Submitted by: Harmandeep Singh (Student ID: 27732422)
Submitted to: Prof. Dr. A. Ben Hamza
Date: 5 April 2016
QUESTION 1:
SOLUTION:
(A) The box plots for the wafer data are shown in Figure a. Observe that every individual box plot
contains outliers.
(B) The plot of the PC2 score vs. the PC1 score is shown in Figure b. Notice that one sample
appears to be an outlier.
Figure a. Side-by-side box plots for the Wafer data.
Figure b. PC2 vs. PC1 scores for the wafer data.
(C) Using the columns of the eigenvector matrix A (rows correspond to X1, ..., X9; columns to PC1, ..., PC9):

A =
  -0.4223  -0.1424   0.8194  -0.2226   0.2488  -0.1281   0.0358   0.0246   0.0150
  -0.2974  -0.3153  -0.2314  -0.2042   0.2610   0.5621  -0.5232   0.1093  -0.2171
  -0.2284  -0.3757  -0.4590  -0.3208   0.3748  -0.3224   0.4783   0.0435   0.1343
  -0.3376  -0.5324  -0.0183   0.7059  -0.2251  -0.0810  -0.0507  -0.0715   0.1975
  -0.1887  -0.1948  -0.0363  -0.5004  -0.8097  -0.0949  -0.0494   0.0870  -0.0056
  -0.3281   0.3647  -0.1937   0.0707   0.0878  -0.5992  -0.5057   0.3068  -0.0127
  -0.3466   0.1641  -0.0807   0.1447  -0.0814  -0.0378   0.2753  -0.2806  -0.8142
  -0.4097   0.3864  -0.0517   0.1319  -0.0939   0.4154   0.3902   0.5298   0.2203
  -0.3685   0.3294  -0.1264  -0.1177  -0.0046   0.1230  -0.0859  -0.7206   0.4290
PC1 is:
Z1 = -0.4223X1 - 0.2974X2 - 0.2284X3 - 0.3376X4 - 0.1887X5 - 0.3281X6 - 0.3466X7 - 0.4097X8 - 0.3685X9
PC2 is:
Z2 = -0.1424X1 - 0.3153X2 - 0.3757X3 - 0.5324X4 - 0.1948X5 + 0.3647X6 + 0.1641X7 + 0.3864X8 + 0.3294X9
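These scores are just linear combinations of the centered variables, Z = (X - x̄)A. A minimal Python sketch of the same computation on toy data (not the MATLAB used below, and note that eigenvectors are only defined up to sign, so columns may be flipped relative to the table above):

```python
import numpy as np

def pc_scores(X):
    """Return (loadings A, scores Z) from the covariance eigendecomposition."""
    Xc = X - X.mean(axis=0)           # center each variable
    C = np.cov(Xc, rowvar=False)      # sample covariance matrix
    vals, A = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]    # reorder so PC1 has the largest variance
    A = A[:, order]
    Z = Xc @ A                        # Z1 = a1'x, Z2 = a2'x, ...
    return A, Z

# toy data: 3 correlated variables (illustrative only, not the wafer data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ np.array([[1.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.5],
                                         [0.0, 0.0, 1.0]])
A, Z = pc_scores(X)
assert np.allclose(A.T @ A, np.eye(3), atol=1e-8)   # loadings are orthonormal
```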
(D) The scatter plot of the PC2 coefficients vs. the PC1 coefficients is shown in Figure c. This plot
helps identify which variables contribute similarly to the PCs. As can be seen in Figure c, the
variables X1, X2, X3, X4, X5 lie in the lower part of the plot, while the remaining variables X6, X7,
X8, X9 lie in the upper part. This is consistent with the coefficients of PC1 and PC2.
Figure c. PC2 coefficients vs. PC1 coefficients for the wafer data.
(E) The percentages of variance explained by the first three PCs are l1 = 44.4574, l2 = 18.8833, and
l3 = 9.8181. Recall that PC1, PC2, and PC3 together account for 73.1588% of the variance in the
data. Based on the variance explained by the leading PCs, and on the scree and Pareto plots in
Figure d, the lowest-dimensional space that adequately represents the wafer data corresponds to d = 3.
(a) Scree Plot
(b) Pareto Plot
Figure d. Scree and Pareto Plots for the Wafer data.
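The percentages behind the scree and Pareto plots are each eigenvalue divided by the eigenvalue total. A quick Python check (the three PC percentages are taken from the text; the remaining eigenvalues are not reproduced here):

```python
def explained_variance_pct(eigenvalues):
    """Percent of total variability explained by each principal component."""
    total = sum(eigenvalues)
    return [100.0 * v / total for v in eigenvalues]

# sanity check on a toy spectrum: 2 + 1 + 1 = 4 total variance
assert explained_variance_pct([2, 1, 1]) == [50.0, 25.0, 25.0]

# the three reported PC percentages sum to the cumulative figure quoted for d = 3
pc_pct = [44.4574, 18.8833, 9.8181]
assert abs(sum(pc_pct) - 73.1588) < 1e-4
```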
(F) The 2D biplot of PC2 vs. PC1 is shown in Figure e(a). The axes in the biplot represent the
principal components (columns of A), and the observed variables (rows of A) are represented as
vectors. Each observation (row of Z) is represented as a red point in the biplot. From Figure e(a),
we can see that the first principal component has positive coefficients for all the variables
(biplot flips the sign of each column so that its largest-magnitude coefficient is positive). That
corresponds to 9 vectors directed into the right half of the plot. The second principal
component, represented by the vertical axis, has 5 positive coefficients for the variables X1, X2, X3,
X4, X5 and 4 negative coefficients for the variables X6, X7, X8, X9. That corresponds to 5 and 4
vectors directed into the top and bottom halves of the plot, respectively. The 3D biplot of PC1,
PC2, and PC3 is shown in Figure e(b).
(a) 2D biplot
(b) 3D biplot
Figure e. 2D and 3D biplots for the Wafer data.
(G) The Hotelling T² and first-PC control charts are displayed in Figure f. The Hotelling chart
indicates that samples 21, 40, 44, 60, 67, 69, 74, and 79 are out of control. The first-PC chart
shows that sample 74 appears to be out of control.
Figure f. Hotelling T² and first-PC control charts for the wafer data.
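For reference, the per-sample statistic plotted in the Hotelling chart is T²ᵢ = (xᵢ - x̄)' S⁻¹ (xᵢ - x̄). A Python sketch of the statistic on toy data (the course-provided tsquarechart also draws control limits, which are not reproduced here):

```python
import numpy as np

def hotelling_t2(X):
    """Per-observation Hotelling T^2 relative to the sample mean and covariance."""
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
    # quadratic form (x_i - xbar)' S^-1 (x_i - xbar) for every row i
    return np.einsum('ij,jk,ik->i', Xc, S_inv, Xc)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))          # toy data: 40 samples, 3 variables
t2 = hotelling_t2(X)
# identity: with the (n-1)-denominator covariance, the T^2 values sum to (n-1)p
assert abs(t2.sum() - 39 * 3) < 1e-6
```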
MATLAB CODE:
load waf.mat;
variables = {'X1' 'X2' 'X3' 'X4' 'X5' 'X6' 'X7' 'X8' 'X9'}; % labels used by text() and biplot() below
%Q1a : Display the side-by-side boxplots.
figure;
boxplot(X,'labels',{'X1' 'X2' 'X3' 'X4' 'X5' 'X6' 'X7' 'X8' 'X9'});
title('Q1a : side-by-side boxplots.');
xlabel(' ','fontsize',14,'fontname','times');
ylabel(' ','fontsize',14,'fontname','times');
[A,Z,variance,Tsquare]=princomp(X);
%Q1b : Plot the PC1 score vs. the PC2 score.
figure;
scatter(Z(:,1),Z(:,2),15,'ko','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
title('Q1b : Scatter plot of 2nd PC score vs. 1st PC score.');
xlabel('PC1 score','fontsize',14,'fontname','times');
ylabel('PC2 score','fontsize',14,'fontname','times');
%Q1d : Display the scatter plot of PC2 coefficients vs. PC1 coefficients, and label the points.
figure;
scatter(A(:,1),A(:,2),15,'ko','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
title('Q1d : Scatter plot of PC2 coefficients vs. PC1 coefficients. ');
xlabel('PC1 coefficient','fontsize',14,'fontname','times');
ylabel('PC2 coefficient','fontsize',14,'fontname','times');
text(A(:,1),A(:,2),variables,'VerticalAlignment','bottom','HorizontalAlignment','left')
%Q1e : Compute the explained variance, and plot it against the number of PCs.
%Plotting Explained variance vs number of Principal Components
%using Plot and Pareto commands
expvar=100*variance/sum(variance); % percent of the total variability explained by each principal component
figure;
plot(expvar,'ko-','MarkerFaceColor',[.49 1 .63],'LineWidth',1);
title('Q1e : Scree Plot: Explained variance vs. Principal Component Number.');
xlabel('Number of Principal Components','fontsize',14,'fontname','times');
ylabel('Explained Variance %','fontsize',14,'fontname','times');
figure;
pareto(expvar);
title('Q1e : Pareto plot : Explained variance vs. Principal Component Number.');
xlabel('Number of Principal Components','fontsize',14,'fontname','times');
ylabel('Explained Variance %','fontsize',14,'fontname','times');
% Q1f : Display the 2D biplot of PC2 vs. PC1. Then, display the 3D biplot of PC1, PC2, and PC3.
figure;
cumsum(variance)/sum(variance);
%Biplot helps visualize both the principal component coefficients for each
%variable and the principal component scores for each observation in a single plot.
biplot(A(:,1:2),'Scores',Z(:,1:2),'VarLabels',variables)
xlabel('$PC1$','fontsize',14,'fontname','times','Interpreter','LaTex');
title('Q1f : 2D biplot of PC2 vs. PC1.');
ylabel('$PC2$','fontsize',14,'fontname','times','Interpreter','LaTex');
axis tight;
figure;
biplot(A(:,1:3),'Scores',Z(:,1:3),'VarLabels',variables)
xlabel('$PC1$','fontsize',14,'fontname','times','Interpreter','LaTex');
title('Q1f : 3D biplot of PC1, PC2, and PC3.');
ylabel('$PC2$','fontsize',14,'fontname','times','Interpreter','LaTex');
zlabel('$PC3$','fontsize',14,'fontname','times','Interpreter','LaTex');
axis tight;
% Q1g : Plot the Hotelling and first PC control charts. Identify any out-of-control points.
figure;
alpha = 0.05;
[outliers1, h1] = tsquarechart(X,alpha); %T^2 chart
title('Q1g : Hotelling chart.');
figure;
k=1;
[outliers2, h2] = pcachart(X,k); %1st PCA control chart
title('Q1g : first PC control chart.');
ylabel('$PC1$','fontsize',14,'fontname','times','Interpreter','LaTex');
QUESTION 2:
SOLUTION:
(A) The selling price is expected to be directly proportional to the asking price. That is, we can
expect a positive relationship between these two variables in the regression model.
(B) The least-squares regression line is as follows:
Y = -11.03 + 0.98X, where β0 = -11.03 and β1 = 0.98.
(C) The value β0 = -11.03 gives the value of Y for x = 0; that is, it gives the selling price for a house
with an asking price of zero. The value β1 = 0.98 gives the change in Y due to a change of one unit
in x, indicating that, on average, for every extra thousand dollars of asking price, the selling price
increases by $980.
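The estimates above come from the usual least-squares formulas β̂1 = Sxy/Sxx and β̂0 = ȳ - β̂1x̄. A Python sketch on made-up (x, y) pairs, since the real-estate data themselves are not reproduced here:

```python
def least_squares(x, y):
    """Fit y = b0 + b1*x by least squares: b1 = Sxy/Sxx, b0 = ybar - b1*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# points lying exactly on y = 2x + 1 should be recovered exactly
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
assert abs(b1 - 2.0) < 1e-12 and abs(b0 - 1.0) < 1e-12
```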
(D) The scatter plot and the regression line are shown in Figure g. As expected, we can see a positive
relationship between the dependent and independent variables.
Figure g. Scatter plot and regression line for the real estate data.
(E) The value r = 0.9941 indicates that the asking price and the selling price are positively related.
The value r2 = 0.9882 states that 98.82% of the total variation in the selling price is explained by
the asking price and 1.18% is not.
(F) Using the regression line, we can predict the selling price for a house with a $360K asking price
as follows (amounts in thousands of dollars):
Y = -11.03 + 0.98(360) ≈ $342K
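Plugging the asking price into the fitted line is a one-line check in Python (amounts in thousands of dollars):

```python
b0, b1 = -11.03, 0.98          # fitted intercept and slope from part (B)
y_hat = b0 + b1 * 360          # predicted selling price in $K
assert abs(y_hat - 341.77) < 1e-9
assert round(y_hat) == 342     # ≈ $342K, as quoted
```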
(G) The estimated value of the variance is σ̂² = 255.232, and the standard errors are as follows:
s.e.(β̂0) = 11.996 and s.e.(β̂1) = 0.0298
(H) A normal probability plot of the residuals for the real estate data is shown in Figure h. The plot
is roughly linear and all the data values lie within the bounds of the normal probability plot,
indicating that the data are roughly normal.
Figure h. Normal probability plot of the residuals for the real estate data.
(I) The residual plot for the real estate data, shown in Figure i, indicates that the points are
randomly dispersed around zero; that is, a linear regression model is appropriate for the data.
Thus, the residual plot does not suggest any violation of the assumptions of zero mean and
constant variance of the random errors.
Figure i. Residual plot for the real estate data.
(J) 100(1-α)% = 95%, so α = 0.05, and the 95% confidence interval for the slope β1 is as follows:
(β̂1 - tα/2,n−2 s.e.(β̂1), β̂1 + tα/2,n−2 s.e.(β̂1)) = (0.916, 1.045)
Thus, 0.916 ≤ β1 ≤ 1.045. That is, on average, the selling price increases by an amount between $916 and
$1045 for every extra thousand dollars of the asking price.
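The interval can be checked directly from the quoted estimates. Since β̂1 = 0.98 and s.e.(β̂1) = 0.0298 are rounded values, the endpoints only agree with the quoted (0.916, 1.045) to about two decimals:

```python
b1, se_b1, t_crit = 0.98, 0.0298, 2.160   # t_{0.025,13} from the t-table
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
# agrees with the quoted (0.916, 1.045) up to rounding of b1 and its s.e.
assert abs(lower - 0.916) < 5e-3 and abs(upper - 1.045) < 5e-3
```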
(K) Since the residuals are normally distributed with constant error variance, we can perform
statistical inference on the regression. The t-test for a significant regression is two-tailed. We
test for significance of the regression at significance level α = 0.05.
1: State the hypotheses (two-tailed):
H0 : β1 = 0
H1 : β1 ≠ 0
2: The value of the t-test statistic is given by:
t0 = β̂1 / s.e.(β̂1) = 32.95
3: The rejection region is | t0 | > tα/2,n−2, where tα/2,n−2 = t0.025,13 = 2.16
4: Since | t0 | = 32.95 > tα/2,n−2 = 2.16, we reject the null hypothesis H0.
It is clear that β1 ≠ 0, and the regression is hence significant.
The p-value = 2[1 − F(|t0|)] ≈ 0 < α.
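The test statistic and decision can be reproduced from the quoted estimates; the ratio differs slightly from the quoted 32.95 because β̂1 and its standard error are rounded here:

```python
b1, se_b1, t_crit = 0.98, 0.0298, 2.160   # t_{0.025,13} from the t-table
t0 = b1 / se_b1
assert abs(t0) > t_crit          # reject H0: beta1 = 0 -> regression significant
assert abs(t0 - 32.95) < 0.2     # matches the quoted value up to rounding
```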
QUESTION 3:
SOLUTION:
(A) This is a 2^k factorial experiment with k = 2 factors and n = 2 replicates; the data for the
viscosity experiment, with treatment totals, are given in the table below. Figure j shows the
interaction plot for the viscosity experiment, from which we can observe that there is an
interaction between the two factors.
Treatment     Factor       Data             Averages   Total
combination   A     B      Rep1     Rep2
(1)           -     -      145      147     146        292
a             +     -      145      150     147.5      295
b             -     +      132      134     133        266
ab            +     +      149      152     150.5      301
Figure j. Interaction plot for the viscosity experiment.
(B) The effect estimates for k = 2, n = 2 are:
Effect A = Contrast_A / (n 2^(k-1)) = (-(1) + a - b + ab) / 4 = (-292 + 295 - 266 + 301) / 4 = 9.5
Effect B = Contrast_B / (n 2^(k-1)) = (-(1) - a + b + ab) / 4 = (-292 - 295 + 266 + 301) / 4 = -5
Effect AB = Contrast_AB / (n 2^(k-1)) = ((1) - a - b + ab) / 4 = (292 - 295 - 266 + 301) / 4 = 8
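The contrast arithmetic above is easy to verify in a few lines of Python (treatment totals taken from the table):

```python
# treatment totals from the table: (1), a, b, ab
T1, Ta, Tb, Tab = 292, 295, 266, 301
n, k = 2, 2
denom = n * 2 ** (k - 1)                   # = 4 for k = 2, n = 2

effect_A  = (-T1 + Ta - Tb + Tab) / denom  # contrast of A
effect_B  = (-T1 - Ta + Tb + Tab) / denom  # contrast of B
effect_AB = ( T1 - Ta - Tb + Tab) / denom  # interaction contrast
assert (effect_A, effect_B, effect_AB) == (9.5, -5.0, 8.0)
```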
(C) Using the formula SS = n (Effect)^2 2^(k-2) for the sums of squares:
SSA = n a^2 2^(k-2) = (2)(9.5)^2(2)^0 = 180.5
SSB = n b^2 2^(k-2) = (2)(-5)^2(2)^0 = 50
SSAB = n (ab)^2 2^(k-2) = (2)(8)^2(2)^0 = 128
SSTotal = 379.5,
where the grand mean of all N = 8 observations Yij is ȳ = 144.25.
The error sum of squares is:
SSE = SSTotal - SSA - SSB - SSAB = 379.5 - 180.5 - 50 - 128 = 21
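These sums of squares, and the total sum of squares computed directly from the eight observations, can be checked in Python:

```python
# all 8 viscosity observations: Rep1 then Rep2, in the order (1), a, b, ab
y = [145, 145, 132, 149, 147, 150, 134, 152]
SST = sum(v * v for v in y) - sum(y) ** 2 / len(y)   # total sum of squares
assert sum(y) / len(y) == 144.25                     # grand mean

n, k = 2, 2
effects = {'A': 9.5, 'B': -5.0, 'AB': 8.0}
ss = {name: n * e ** 2 * 2 ** (k - 2) for name, e in effects.items()}
SSE = SST - sum(ss.values())
assert ss == {'A': 180.5, 'B': 50.0, 'AB': 128.0}
assert abs(SST - 379.5) < 1e-9 and abs(SSE - 21.0) < 1e-9
```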
(D) The analysis of variance table is given below (MS = SS/df, F0 = MS/MSE):

Source of variation   SS       df    MS       F0
A                     180.5    1     180.5    34.38
B                     50       1     50       9.52
AB                    128      1     128      24.38
Error                 21       4     5.25
Total                 379.5    7
(E) At significance level α = 0.05, the values of the F-test statistics for factor A, factor B, and the
interaction AB are all greater than Fα,1,4 = F0.05,1,4 = 7.7086 ≈ 7.71.
As shown in the above ANOVA table, each F0 > F0.05,1,4 = 7.71, so all three effects are significant.
The regression equation is given by:
Ŷ = 144.25 + (9.5/2) x1 - (5/2) x2 + (8/2) x1x2
For example, at x1 = -1, x2 = -1 (treatment (1)): Ŷ = 144.25 - 4.75 + 2.5 + 4 = 146.
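The F ratios and the fitted value follow directly from the sums of squares already computed; a quick Python check (the F ratios here are derived from the SS values in part (C), with 4 error degrees of freedom):

```python
SSA, SSB, SSAB, SSE = 180.5, 50.0, 128.0, 21.0
MSE = SSE / 4                                      # 4 error degrees of freedom
F_A, F_B, F_AB = SSA / MSE, SSB / MSE, SSAB / MSE
assert all(F > 7.71 for F in (F_A, F_B, F_AB))     # all effects significant

# fitted value of the regression model at x1 = x2 = -1 (treatment (1))
y_hat = 144.25 + 9.5 / 2 * (-1) - 5 / 2 * (-1) + 8 / 2 * (-1) * (-1)
assert y_hat == 146.0                              # matches the average for (1)
```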
(F) Figure k shows the residual plots for the viscosity experiment. The normal probability plot of
the residuals appears fairly linear, suggesting that there is no reason to doubt the normality
assumption. Also, the residual plots against the fitted values, as well as against the factor
levels of A and B, exhibit random scatter around 0. Thus, the model assumptions are valid.
Treatment     Factor       Data             Averages   Total   Ŷ          Residuals
combination   A     B      Rep1     Rep2                       estimate   Rep1    Rep2
(1)           -     -      145      147     146        292     146        -1      1
a             +     -      145      150     147.5      295     147.5      -2.5    2.5
b             -     +      132      134     133        266     133        -1      1
ab            +     +      149      152     150.5      301     150.5      -1.5    1.5

Figure k. Residual plots for the viscosity experiment: (a) normal probability plot,
(b) residual plot against fitted values, (c) residual plot against factor A,
(d) residual plot against factor B.