
Supplemental Material for Chapter 3
S3-1. Random Samples
To properly apply many statistical techniques, the sample drawn from the population of interest must be a random sample. To define a random sample, let x be a random variable that represents the result of selecting one observation from the population of interest, and let f(x) be the probability distribution of x. Now suppose that n observations (a sample) are obtained independently from the population under unchanging conditions; that is, we do not let the outcome of one observation influence the outcome of another. Let xi be the random variable that represents the observation obtained on the ith trial. Then the observations x1, x2, ..., xn are a random sample.
In a random sample the marginal probability distributions f(x1), f(x2), ..., f(xn) are all identical, the observations in the sample are independent, and by definition, the joint probability distribution of the random sample is
\[
f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n)
\]
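As a small illustration of this sampling scheme, the following Python sketch (assuming NumPy is available; the normal population and the values μ = 50, σ = 2, and n = 10 are arbitrary choices for illustration) draws n observations independently from a single fixed distribution, which is exactly a random sample in the sense defined above.

import numpy as np

rng = np.random.default_rng(1)              # fixed seed so the sketch is reproducible
n = 10                                      # sample size
# Each of the n draws comes independently from the same normal population,
# so x[0], ..., x[n-1] constitute a random sample as defined above.
x = rng.normal(loc=50, scale=2, size=n)
print(x)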
S3-2. Expected Value and Variance Operators
Readers should have prior exposure to mathematical expectation from a basic statistics
course. Here some of the basic properties of expectation are reviewed.
The expected value of a random variable x is denoted by E(x) and is given by
\[
E(x) =
\begin{cases}
\displaystyle\sum_{\text{all } x_i} x_i\, p(x_i), & x \text{ a discrete random variable}\\[2ex]
\displaystyle\int_{-\infty}^{\infty} x f(x)\, dx, & x \text{ a continuous random variable}
\end{cases}
\]
The expectation of a random variable is very useful in that it provides a straightforward
characterization of the distribution, and it has a simple practical interpretation as the
center of mass, centroid, or mean of the distribution.
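Both cases of this definition are easy to evaluate numerically. The sketch below (a hypothetical example assuming NumPy and SciPy) computes E(x) for a small discrete distribution by summing x_i p(x_i), and for a continuous normal distribution by numerically integrating x f(x).

import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete case: E(x) = sum of x_i * p(x_i) over all x_i
x_vals = np.array([0, 1, 2, 3])
p_vals = np.array([0.1, 0.4, 0.3, 0.2])      # probabilities that sum to one
E_discrete = np.sum(x_vals * p_vals)

# Continuous case: E(x) = integral of x * f(x) dx, here for a normal(5, 2) density
f = stats.norm(loc=5, scale=2).pdf
E_continuous, _ = quad(lambda x: x * f(x), -np.inf, np.inf)

print(E_discrete)       # 1.6
print(E_continuous)     # approximately 5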
Now suppose that y is a function of the random variable x, say y = h(x). Note that y is also a random variable. The expectation of h(x) is defined as
\[
E[h(x)] =
\begin{cases}
\displaystyle\sum_{\text{all } x_i} h(x_i)\, p(x_i), & x \text{ a discrete random variable}\\[2ex]
\displaystyle\int_{-\infty}^{\infty} h(x) f(x)\, dx, & x \text{ a continuous random variable}
\end{cases}
\]
An interesting result, sometimes called the “theorem of the unconscious statistician,” states that if x is a continuous random variable with probability density function f(x) and y = h(x) is a function of x having probability density function g(y), then the expectation of y can be found either by using the definition of expectation with g(y) or in terms of its definition as the expectation of a function of x with respect to the probability density function of x. That is, we may write either
\[
E(y) = \int_{-\infty}^{\infty} y\, g(y)\, dy
\]
or
\[
E(y) = E[h(x)] = \int_{-\infty}^{\infty} h(x) f(x)\, dx
\]
The name for this theorem comes from the fact that we often apply it without consciously
thinking about whether the theorem is true in our particular case.
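The theorem is easy to verify numerically in a particular case. In the sketch below (assuming SciPy), x is standard normal and y = h(x) = x², so y has a chi-square distribution with one degree of freedom; E(y) is computed both from the density g(y) of y and from the integral of h(x) f(x), and the two results agree.

import numpy as np
from scipy import stats
from scipy.integrate import quad

f = stats.norm(0, 1).pdf            # density of x
g = stats.chi2(df=1).pdf            # density of y = x**2

# E(y) computed directly from the density of y
E_y_direct, _ = quad(lambda y: y * g(y), 0, np.inf)

# E(y) computed "unconsciously" as E[h(x)] with respect to the density of x
E_y_lotus, _ = quad(lambda x: x**2 * f(x), -np.inf, np.inf)

print(E_y_direct, E_y_lotus)        # both approximately 1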
Useful Properties of Expectation I:
Let x be a random variable with mean μ, and let c be a constant. Then
1. E(c) = c
2. E(x) = μ
3. E(cx) = cE(x) = cμ
4. E[ch(x)] = cE[h(x)]
5. If c1 and c2 are constants and h1 and h2 are functions, then E[c1h1(x) + c2h2(x)] = c1E[h1(x)] + c2E[h2(x)]
Because of property 5, expectation is called a linear (or distributive) operator.
Now consider the function h(x) = (x − c)², where c is a constant, and suppose that E[(x − c)²] exists. To find the value of c for which E[(x − c)²] is a minimum, write
\[
E[(x-c)^2] = E[x^2 - 2xc + c^2] = E(x^2) - 2cE(x) + c^2
\]
Now the derivative of E[(x − c)²] with respect to c is −2E(x) + 2c, and this derivative is zero when c = E(x). Therefore, E[(x − c)²] is a minimum when c = E(x) = μ.
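A quick numerical check of this result, assuming NumPy: for a simulated sample, the average squared deviation from a constant c is evaluated over a grid of c values, and the minimizing c is essentially the sample mean, the natural estimate of E(x).

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10, scale=3, size=100_000)

c_grid = np.linspace(5, 15, 2001)
# Average squared deviation from each candidate constant c
msd = np.array([np.mean((x - c) ** 2) for c in c_grid])

print(c_grid[np.argmin(msd)])       # very close to 10
print(x.mean())                     # the sample estimate of E(x)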
The variance of the random variable x is defined as
\[
V(x) = E[(x - \mu)^2] = \sigma^2
\]
and we usually call V(·) the variance operator. It is straightforward to show that if c is a constant, then
\[
V(cx) = c^2\sigma^2
\]
The variance is analogous to the moment of inertia in mechanics.
Useful Properties of Expectation II:
Let x1 and x2 be random variables with means μ1 and μ2 and variances σ1² and σ2², respectively, and let c1 and c2 be constants. Then
1. E(x1 + x2) = μ1 + μ2
2. It is possible to show that V(x1 + x2) = σ1² + σ2² + 2Cov(x1, x2), where
\[
\operatorname{Cov}(x_1, x_2) = E[(x_1 - \mu_1)(x_2 - \mu_2)]
\]
is the covariance of the random variables x1 and x2. The covariance is a measure of the linear association between x1 and x2. More specifically, we may show that if x1 and x2 are independent, then Cov(x1, x2) = 0 (see the numerical sketch following this list).
3. V(x1 − x2) = σ1² + σ2² − 2Cov(x1, x2)
4. If the random variables x1 and x2 are independent, V(x1 + x2) = σ1² + σ2²
5. If the random variables x1 and x2 are independent, E(x1x2) = E(x1)E(x2) = μ1μ2
6. Regardless of whether x1 and x2 are independent, in general
\[
E\!\left(\frac{x_1}{x_2}\right) \neq \frac{E(x_1)}{E(x_2)}
\]
7. For the single random variable x,
\[
V(x + x) = 4\sigma^2
\]
because Cov(x, x) = σ².
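The sketch below (a simulation assuming NumPy; the particular distributions and the dependence between x1 and x2 are arbitrary choices) illustrates properties 2 through 4: for correlated variables the sample variances of x1 + x2 and x1 − x2 match σ1² + σ2² ± 2Cov(x1, x2), and for independent variables the covariance is essentially zero.

import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Correlated pair: x2 shares a component with x1, so Cov(x1, x2) > 0
x1 = rng.normal(0, 2, n)
x2 = 0.5 * x1 + rng.normal(0, 1, n)
cov12 = np.cov(x1, x2)[0, 1]

print(np.var(x1 + x2, ddof=1), np.var(x1, ddof=1) + np.var(x2, ddof=1) + 2 * cov12)  # property 2
print(np.var(x1 - x2, ddof=1), np.var(x1, ddof=1) + np.var(x2, ddof=1) - 2 * cov12)  # property 3

# Independent pair: the covariance is essentially zero (property 4)
z1 = rng.normal(0, 2, n)
z2 = rng.normal(0, 3, n)
print(np.cov(z1, z2)[0, 1])                                     # close to 0
print(np.var(z1 + z2, ddof=1), np.var(z1, ddof=1) + np.var(z2, ddof=1))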
Moments
Although we do not make much use of the notion of the moments of a random variable
in the book, for completeness we give the definition. Let the function of the random
variable x be
\[
h(x) = x^{k}
\]
where k is a positive integer. Then the expectation of h(x) = x^k is called the kth moment about the origin of the random variable x and is given by
\[
E(x^{k}) =
\begin{cases}
\displaystyle\sum_{\text{all } x_i} x_i^{k}\, p(x_i), & x \text{ a discrete random variable}\\[2ex]
\displaystyle\int_{-\infty}^{\infty} x^{k} f(x)\, dx, & x \text{ a continuous random variable}
\end{cases}
\]
Note that the first origin moment is just the mean μ of the random variable x. The second origin moment is
\[
E(x^{2}) = \mu^2 + \sigma^2
\]
Moments about the mean are defined as
\[
E[(x-\mu)^{k}] =
\begin{cases}
\displaystyle\sum_{\text{all } x_i} (x_i - \mu)^{k}\, p(x_i), & x \text{ a discrete random variable}\\[2ex]
\displaystyle\int_{-\infty}^{\infty} (x-\mu)^{k} f(x)\, dx, & x \text{ a continuous random variable}
\end{cases}
\]
The second moment about the mean is the variance σ² of the random variable x.
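As a small numerical illustration (assuming NumPy and SciPy), the sketch below compares the first two sample origin moments and the second sample moment about the mean with the theoretical values μ, μ² + σ², and σ² for a simulated normal sample.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, 500_000)

print(np.mean(x), mu)                          # first origin moment versus mu
print(np.mean(x**2), mu**2 + sigma**2)         # second origin moment versus mu^2 + sigma^2
print(stats.moment(x, moment=2), sigma**2)     # second moment about the mean versus sigma^2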
S3-3. Proof That E(x̄) = μ and E(s²) = σ²
It is easy to show that the sample average x̄ and the sample variance s² are unbiased estimators of the corresponding population parameters μ and σ², respectively. Suppose that the random variable x has mean μ and variance σ², and that x1, x2, ..., xn is a random sample of size n from the population. Then
\[
E(\bar{x}) = E\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu
\]
because the expected value of each observation in the sample is E(xi) = μ. Now consider
\[
E(s^2) = E\!\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}{n-1}\right] = \frac{1}{n-1}E\!\left[\sum_{i=1}^{n}(x_i - \bar{x})^{2}\right]
\]
It is convenient to write
\[
\sum_{i=1}^{n}(x_i - \bar{x})^{2} = \sum_{i=1}^{n} x_i^{2} - n\bar{x}^{2},
\]
and so
\[
E\!\left[\sum_{i=1}^{n}(x_i - \bar{x})^{2}\right] = \sum_{i=1}^{n} E(x_i^{2}) - E(n\bar{x}^{2})
\]
Now E(xi²) = μ² + σ² and E(x̄²) = μ² + σ²/n. Therefore
\[
E(s^2) = \frac{1}{n-1}\left[\sum_{i=1}^{n}(\mu^2 + \sigma^2) - n(\mu^2 + \sigma^2/n)\right] = \frac{1}{n-1}\left(n\mu^2 + n\sigma^2 - n\mu^2 - \sigma^2\right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2
\]
Note that:
a. These results do not depend on the form of the distribution for the random
variable x. Many people think that an assumption of normality is required, but
this is unnecessary.
b. Even though E(s²) = σ², the sample standard deviation is not an unbiased estimator of the population standard deviation. This is discussed more fully in Section S3-5.
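A simple Monte Carlo check of these results is sketched below (assuming NumPy; a non-normal exponential population is used deliberately to emphasize point (a)): averaged over many repeated samples, x̄ settles near μ and s² settles near σ², while the average of s falls noticeably below σ, anticipating point (b).

import numpy as np

rng = np.random.default_rng(5)
n, reps = 5, 200_000
mu, sigma2 = 2.0, 4.0                       # an exponential with mean 2 has variance 4

samples = rng.exponential(scale=mu, size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)            # ddof=1 gives the usual sample variance

print(xbar.mean(), mu)                      # E(xbar) is close to mu
print(s2.mean(), sigma2)                    # E(s^2) is close to sigma^2
print(np.sqrt(s2).mean(), np.sqrt(sigma2))  # E(s) is noticeably below sigma (see S3-5)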
S3-4. More About Parameter Estimation
Throughout the book estimators of various population or process parameters are given
without much discussion concerning how these estimators are generated. Often they are
simply “logical” or intuitive estimators, such as using the sample average x̄ as an estimator of the population mean μ.
There are methods for developing point estimators of population parameters. These
methods are typically discussed in detail in courses in mathematical statistics. We now
give a brief overview of some of these methods.
The Method of Maximum Likelihood
One of the best methods for obtaining a point estimator of a population parameter is the
method of maximum likelihood. Suppose that x is a random variable with probability
distribution f ( x; ) , where  is a single unknown parameter. Let x1 , x2 ,..., xn be the
observations in a random sample of size n. Then the likelihood function of the sample is
L( )  f ( x1; )  f ( x2 ; ) 
 f ( xn ; )
The maximum likelihood estimator of  is the value of  that maximizes the likelihood
function L(  ).
Example 1 The Exponential Distribution
To illustrate the maximum likelihood estimation procedure, let x be exponentially distributed with parameter λ. The likelihood function of a random sample of size n, say x1, x2, ..., xn, is
\[
L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} e^{-\lambda \sum_{i=1}^{n} x_i}
\]
Now it turns out that, in general, if the maximum likelihood estimator maximizes L(λ), it will also maximize the log likelihood, ln L(λ). For the exponential distribution, the log likelihood is
\[
\ln L(\lambda) = n\ln\lambda - \lambda\sum_{i=1}^{n} x_i
\]
Now
\[
\frac{d\ln L(\lambda)}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i
\]
Equating the derivative to zero and solving for the estimator of λ, we obtain
\[
\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}
\]
Thus the maximum likelihood estimator (or the MLE) of λ is the reciprocal of the sample average.
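A short sketch of this estimator (assuming NumPy and SciPy; the true rate λ = 0.5 and the sample size are arbitrary choices): the MLE 1/x̄ is computed directly and compared with the rate implied by scipy.stats.expon.fit, which parameterizes the exponential distribution by its scale 1/λ.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
lam_true = 0.5
x = rng.exponential(scale=1 / lam_true, size=5000)

lam_mle = 1 / x.mean()                      # the MLE derived above: the reciprocal of the sample average

loc, scale = stats.expon.fit(x, floc=0)     # fix loc at 0 so that scale estimates 1/lambda
print(lam_mle, 1 / scale)                   # identical, since both equal 1 / xbar
print(lam_true)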
Maximum likelihood estimation can be used in situations where there are several unknown parameters, say θ1, θ2, ..., θp, to be estimated. The maximum likelihood estimators are found by equating the p first partial derivatives ∂L(θ1, θ2, ..., θp)/∂θi, i = 1, 2, ..., p, of the likelihood (or the log likelihood) to zero and solving the resulting system of equations.
Example 2 The Normal Distribution
Let x be normally distributed with the parameters μ and σ² unknown. The likelihood function of a random sample of size n is
\[
L(\mu, \sigma^2) = \prod_{i=1}^{n}\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^{2}} = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^{2}}
\]
The log-likelihood function is
\[
\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^{2}
\]
Now
\[
\frac{\partial \ln L(\mu, \sigma^2)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
\]
\[
\frac{\partial \ln L(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^{2} = 0
\]
The solution to these equations yields the MLEs
\[
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}
\]
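These MLEs can be computed directly from the formulas above, and (assuming SciPy) scipy.stats.norm.fit returns the same maximum likelihood estimates, reporting the scale as the square root of σ̂²; a short sketch:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=10, scale=3, size=2000)

mu_hat = x.mean()                           # MLE of mu
sigma2_hat = np.mean((x - mu_hat) ** 2)     # MLE of sigma^2 (divisor n, not n - 1)

loc, scale = stats.norm.fit(x)              # maximum likelihood fit of the normal distribution
print(mu_hat, loc)                          # identical
print(sigma2_hat, scale**2)                 # identical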
Generally, we like the method of maximum likelihood because when n is large, (1) it results in estimators that are approximately unbiased, (2) the variance of an MLE is as small as or nearly as small as the variance that could be obtained with any other estimation technique, and (3) MLEs are approximately normally distributed. Furthermore, the MLE has an invariance property; that is, if θ̂ is the MLE of θ, then the MLE of a function of θ, say h(θ), is the same function h(θ̂) of the MLE. There are also some other “nice” statistical properties that MLEs enjoy; see a book on mathematical statistics, such as Hogg and Craig (1978) or Bain and Engelhardt (1987).
The unbiased property of the MLE is a “large-sample” or asymptotic property. To illustrate, consider the MLE for σ² in the normal distribution of Example 2 above. We can easily show that
\[
E(\hat{\sigma}^2) = \frac{n-1}{n}\sigma^2
\]
Now the bias in estimation of σ² is
\[
E(\hat{\sigma}^2) - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}
\]
Notice that the bias in estimating σ² goes to zero as the sample size n → ∞. Therefore, the MLE is an asymptotically unbiased estimator.
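The way this bias disappears is easy to see by simulation (a sketch assuming NumPy): for each sample size, the average of σ̂² over many replicates is close to (n − 1)σ²/n, so the bias −σ²/n shrinks as n grows.

import numpy as np

rng = np.random.default_rng(8)
sigma2, reps = 4.0, 100_000

for n in (5, 20, 100):
    samples = rng.normal(0, np.sqrt(sigma2), size=(reps, n))
    sigma2_mle = samples.var(axis=1, ddof=0)              # divisor n: the MLE of sigma^2
    print(n, sigma2_mle.mean(), (n - 1) / n * sigma2)     # empirical mean versus (n - 1)sigma^2/n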
The Method of Moments
Estimation by the method of moments involves equating the origin moments of the probability distribution (which are functions of the unknown parameters) to the sample moments, and solving for the unknown parameters. We can define the first p sample moments as
\[
M_k = \frac{1}{n}\sum_{i=1}^{n} x_i^{k}, \qquad k = 1, 2, \ldots, p
\]
and the first p moments about the origin of the random variable x are just
\[
\mu_k' = E(x^{k}), \qquad k = 1, 2, \ldots, p
\]
Example 3 The Normal Distribution
For the normal distribution the first two origin moments are
\[
\mu_1' = \mu, \qquad \mu_2' = \mu^2 + \sigma^2
\]
and the first two sample moments are
\[
M_1 = \bar{x}, \qquad M_2 = \frac{1}{n}\sum_{i=1}^{n} x_i^{2}
\]
Equating the sample and origin moments results in
\[
\mu = \bar{x}, \qquad \mu^2 + \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^{2}
\]
The solution gives the moment estimators of μ and σ²:
\[
\hat{\mu} = \bar{x}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^{2}
\]
The method of moments often yields estimators that are reasonably good. For example,
in the above example the moment estimators are identical to the MLEs. However,
generally moment estimators are not as good as MLEs because they don’t have statistical
properties that are as nice. For example, moment estimators usually have larger
variances than MLEs.
Least Squares Estimation
The method of least squares is one of the oldest and most widely used methods of
parameter estimation. Unlike the method of maximum likelihood and the method of
moments, least squares can be employed when the distribution of the random variable is
unknown.
To illustrate, suppose that the simple location model can describe the random variable x:
\[
x_i = \mu + \varepsilon_i, \qquad i = 1, 2, \ldots, n
\]
where the parameter μ is unknown and the εi are random errors. We don’t know the distribution of the errors, but we can assume that they have mean zero and constant variance. The least squares estimator of μ is chosen so the sum of the squares of the model errors εi is minimized. The least squares function for a sample of n observations x1, x2, ..., xn is
\[
L = \sum_{i=1}^{n}\varepsilon_i^{2} = \sum_{i=1}^{n}(x_i - \mu)^{2}
\]
Differentiating L and equating the derivative to zero results in the least squares estimator of μ:
\[
\hat{\mu} = \bar{x}
\]
In general, the least squares function will contain p unknown parameters and L will be
minimized by solving the equations that result when the first partial derivatives of L with
respect to the unknown parameters are equated to zero. These equations are called the
least squares normal equations.
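For the simple location model above, the least squares calculation can be checked numerically. The sketch below (assuming NumPy and SciPy; the simulated data are an arbitrary illustration) minimizes L(μ) = Σ(xi − μ)² and confirms that the minimizer is the sample average.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)
x = rng.normal(loc=7, scale=2, size=50)      # observations from the location model

def L(mu):
    # Least squares criterion for the simple location model
    return np.sum((x - mu) ** 2)

result = minimize_scalar(L)
print(result.x)      # numerical least squares estimate of mu
print(x.mean())      # the closed-form solution: the sample average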
The method of least squares dates from work by Carl Friedrich Gauss in the early 1800s. It has a
very well-developed and indeed quite elegant theory. For a discussion of the use of least
squares in estimating the parameters in regression models and many illustrative
examples, see Montgomery, Peck and Vining (2001), and for a very readable and concise
presentation of the theory, see Myers and Milton (1991).
S3-5. Proof That E(s) ≠ σ
In Section S3-3 of the Supplemental Text Material we showed that the sample variance is an unbiased estimator of the population variance; that is, E(s²) = σ², and that this result does not depend on the form of the distribution. However, the sample standard deviation is not an unbiased estimator of the population standard deviation. This is easy to demonstrate for the case where the random variable x follows a normal distribution.
Let x have a normal distribution with mean μ and variance σ², and let x1, x2, ..., xn be a random sample of size n from the population. Now the distribution of
\[
\frac{(n-1)s^2}{\sigma^2}
\]
is chi-square with n − 1 degrees of freedom, denoted χ²ₙ₋₁. Therefore the distribution of s² is σ²/(n − 1) times a χ²ₙ₋₁ random variable. So when sampling from a normal distribution, the expected value of s² is
\[
E(s^2) = E\!\left(\frac{\sigma^2}{n-1}\,\chi^2_{n-1}\right) = \frac{\sigma^2}{n-1}\,E(\chi^2_{n-1}) = \frac{\sigma^2}{n-1}\,(n-1) = \sigma^2
\]
because the mean of a chi-square random variable with n – 1 degrees of freedom is n – 1.
Now it follows that the distribution of
\[
\frac{\sqrt{n-1}\; s}{\sigma}
\]
is a chi distribution with n − 1 degrees of freedom, denoted χₙ₋₁. The expected value of s can be written as
\[
E(s) = E\!\left(\frac{\sigma}{\sqrt{n-1}}\,\chi_{n-1}\right) = \frac{\sigma}{\sqrt{n-1}}\,E(\chi_{n-1})
\]
The mean of the chi distribution with n − 1 degrees of freedom is
\[
E(\chi_{n-1}) = \sqrt{2}\,\frac{\Gamma(n/2)}{\Gamma[(n-1)/2]}
\]
where the gamma function is
\[
\Gamma(r) = \int_{0}^{\infty} y^{\,r-1} e^{-y}\, dy
\]
Then
\[
E(s) = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma(n/2)}{\Gamma[(n-1)/2]}\,\sigma = c_4\sigma
\]
The constant c4 is given in Appendix Table VI.
While s is a biased estimator of σ, the bias gets small fairly quickly as the sample size n increases. From Appendix Table VI, note that c4 = 0.94 for a sample of n = 5, c4 = 0.9727 for a sample of n = 10, and c4 = 0.9896, or very nearly unity, for a sample of n = 25.
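The constant c4 can be computed directly from the gamma-function expression above, and a small simulation confirms that E(s) ≈ c4σ. A sketch assuming NumPy and SciPy (gammaln is used simply to avoid overflow for larger n):

import numpy as np
from scipy.special import gammaln

def c4(n):
    # c4 = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), evaluated on the log scale
    return np.sqrt(2.0 / (n - 1)) * np.exp(gammaln(n / 2) - gammaln((n - 1) / 2))

print(c4(5), c4(10), c4(25))     # approximately 0.9400, 0.9727, 0.9896

# Simulation check that E(s) is approximately c4 * sigma for n = 5
rng = np.random.default_rng(10)
sigma, n, reps = 2.0, 5, 200_000
s = rng.normal(0, sigma, size=(reps, n)).std(axis=1, ddof=1)
print(s.mean(), c4(n) * sigma)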
S3-6. More about Checking Assumptions in the t-Test
The two-sample t-test can be presented from the viewpoint of a simple linear regression
model. This is a very instructive way to think about the t-test, as it fits in nicely with the
general notion of a factorial experiment with factors at two levels. This type of
experiment is very important in process development and improvement, and is discussed
extensively in Chapter 12. This also leads to another way to check assumptions in the t-test. This method is equivalent to the normal probability plotting of the original data discussed in Chapter 3.
We will use the data on the two catalysts in Example 3-9 to illustrate. In the two-sample
t-test scenario, we have a factor x with two levels, which we can arbitrarily call “low” and
“high”. We will use x = -1 to denote the low level of this factor (Catalyst 1) and x = +1 to
denote the high level of this factor (Catalyst 2). The figure below is a scatter plot (from
Minitab) of the yield data resulting from using the two catalysts shown in Table 3-2 of
the textbook.
[Figure: Minitab scatterplot of Yield versus Catalyst. The horizontal axis shows the coded Catalyst variable from −1.0 to 1.0, and the vertical axis shows Yield from about 89 to 98.]
We will fit a simple linear regression model to these data, say
\[
y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}
\]
where β0 and β1 are the intercept and slope, respectively, of the regression line, and the regressor or predictor variable is x1j = −1 and x2j = +1. The method of least squares can be used to estimate the slope and intercept in this model. Assuming that we have equal sample sizes n for each factor level, the least squares normal equations are:
\[
2n\hat{\beta}_0 = \sum_{i=1}^{2}\sum_{j=1}^{n} y_{ij}
\]
\[
2n\hat{\beta}_1 = \sum_{j=1}^{n} y_{2j} - \sum_{j=1}^{n} y_{1j}
\]
The solution to these equations is
\[
\hat{\beta}_0 = \bar{y}, \qquad \hat{\beta}_1 = \tfrac{1}{2}(\bar{y}_2 - \bar{y}_1)
\]
Note that the least squares estimator of the intercept is the average of all the observations from both samples, while the estimator of the slope is one-half of the difference between the sample averages at the “high” and “low” levels of the factor x. Below is the output from the linear regression procedure in Minitab for the catalyst data.
Regression Analysis: Yield versus Catalyst

The regression equation is
Yield = 92.5 + 0.239 Catalyst

Predictor      Coef  SE Coef       T      P
Constant    92.4938   0.6752  136.98  0.000
Catalyst     0.2387   0.6752    0.35  0.729

S = 2.70086   R-Sq = 0.9%   R-Sq(adj) = 0.0%

Analysis of Variance

Source          DF       SS     MS     F      P
Regression       1    0.912  0.912  0.13  0.729
Residual Error  14  102.125  7.295
Total           15  103.037
Notice that the estimate of the slope (given in the column labeled “Coef” and the row labeled “Catalyst” above) is 0.2387 = (1/2)(ȳ2 − ȳ1) = (1/2)(92.7325 − 92.255), and the estimate of the intercept is 92.4938 = (1/2)(ȳ2 + ȳ1) = (1/2)(92.7325 + 92.255). Furthermore, notice that the t-statistic associated with the slope is equal to 0.35, exactly the same value (apart from sign, because we subtracted the averages in the reverse order) we gave in the text.
Now in simple linear regression, the t-test on the slope is actually testing the hypotheses
\[
H_0\colon \beta_1 = 0 \qquad H_1\colon \beta_1 \neq 0
\]
and this is equivalent to testing H0: μ1 = μ2.
It is easy to show that the t-test statistic used for testing that the slope equals zero in simple linear regression is identical to the usual two-sample t-test. Recall that to test the above hypotheses in simple linear regression, the t-statistic is
\[
t_0 = \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}}
\]
where Sxx = Σi Σj (xij − x̄)² is the “corrected” sum of squares of the x’s. Now in our specific problem, x̄ = 0, x1j = −1 and x2j = +1, so Sxx = 2n. Therefore, since we have already observed that the estimate of σ is just sp,
\[
t_0 = \frac{\hat{\beta}_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}} = \frac{\tfrac{1}{2}(\bar{y}_2 - \bar{y}_1)}{\sqrt{s_p^2/(2n)}} = \frac{\bar{y}_2 - \bar{y}_1}{s_p\sqrt{2/n}}
\]
This is the usual two-sample t-test statistic for the case of equal sample sizes.
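This equivalence can be verified directly in software other than Minitab. The sketch below (assuming NumPy and SciPy) uses the sixteen yield observations listed in the residual table that follows; regressing yield on the ±1 coded catalyst variable gives a slope t-statistic that matches the pooled two-sample t-statistic apart from sign.

import numpy as np
from scipy import stats

y1 = np.array([91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21])   # Catalyst 1 (x = -1)
y2 = np.array([89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75])   # Catalyst 2 (x = +1)

x = np.concatenate([-np.ones(8), np.ones(8)])
y = np.concatenate([y1, y2])

# t-test on the slope from simple linear regression on the coded factor
reg = stats.linregress(x, y)
t_slope = reg.slope / reg.stderr

# Usual pooled two-sample t-test
t_pooled, p_pooled = stats.ttest_ind(y2, y1, equal_var=True)

print(reg.slope, reg.intercept)     # about 0.2387 and 92.4938
print(t_slope, t_pooled)            # both about 0.35
print(reg.pvalue, p_pooled)         # both about 0.73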
Most regression software packages will also compute a table or listing of the residuals
from the model. The residuals from the Minitab regression model fit obtained above are
as follows:
Obs   Catalyst    Yield      Fit  SE Fit  Residual  St Resid
  1      -1.00   91.500   92.255   0.955    -0.755     -0.30
  2      -1.00   94.180   92.255   0.955     1.925      0.76
  3      -1.00   92.180   92.255   0.955    -0.075     -0.03
  4      -1.00   95.390   92.255   0.955     3.135      1.24
  5      -1.00   91.790   92.255   0.955    -0.465     -0.18
  6      -1.00   89.070   92.255   0.955    -3.185     -1.26
  7      -1.00   94.720   92.255   0.955     2.465      0.98
  8      -1.00   89.210   92.255   0.955    -3.045     -1.21
  9       1.00   89.190   92.733   0.955    -3.543     -1.40
 10       1.00   90.950   92.733   0.955    -1.783     -0.71
 11       1.00   90.460   92.733   0.955    -2.273     -0.90
 12       1.00   93.210   92.733   0.955     0.477      0.19
 13       1.00   97.190   92.733   0.955     4.457      1.76
 14       1.00   97.040   92.733   0.955     4.307      1.70
 15       1.00   91.070   92.733   0.955    -1.663     -0.66
 16       1.00   92.750   92.733   0.955     0.017      0.01
The column labeled “Fit” contains the predicted values of yield from the regression
model, which just turn out to be the averages of the two samples. The residuals are in the
sixth column of this table. They are just the differences between the observed values of
yield and the corresponding predicted values. A normal probability plot of the residuals
follows.
[Figure: Normal probability plot of the residuals (response is Yield). The horizontal axis shows the residuals from about −7.5 to 5.0, and the vertical axis shows the normal probability scale in percent.]
Notice that the residuals plot approximately along a straight line, indicating that there is
no serious problem with the normality assumption in these data. This is equivalent to
plotting the original yield data on separate probability plots as we did in Chapter 3.
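The same check can be carried out without Minitab. In the sketch below (assuming NumPy and SciPy), the residuals are reconstructed as deviations from the two catalyst averages, and scipy.stats.probplot returns the ordered residuals against normal quantiles together with the correlation coefficient of the probability plot; a value of r near 1 is consistent with the straight-line pattern seen above.

import numpy as np
from scipy import stats

y1 = np.array([91.50, 94.18, 92.18, 95.39, 91.79, 89.07, 94.72, 89.21])   # Catalyst 1
y2 = np.array([89.19, 90.95, 90.46, 93.21, 97.19, 97.04, 91.07, 92.75])   # Catalyst 2

# Residuals from the regression fit are just deviations from the two sample averages
residuals = np.concatenate([y1 - y1.mean(), y2 - y2.mean()])

(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(r)    # close to 1, indicating no serious problem with the normality assumption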
S3-7. Expected Mean Squares in the Single-Factor Analysis of Variance
In section 3-5.2 we give the expected values of the mean squares for treatments and error
in the single-factor analysis of variance (ANOVA). These quantities may be derived by
straightforward application of the expectation operator.
Consider first the mean square for treatments:
\[
E(MS_{\text{Treatments}}) = E\!\left(\frac{SS_{\text{Treatments}}}{a-1}\right)
\]
Now for a balanced design (equal number of observations in each treatment)
\[
SS_{\text{Treatments}} = \frac{1}{n}\sum_{i=1}^{a} y_{i.}^{2} - \frac{1}{an}\,y_{..}^{2}
\]
and the single-factor ANOVA model is
\[
y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, n
\]
In addition, we will find the following useful:
\[
E(\varepsilon_{ij}) = E(\varepsilon_{i.}) = E(\varepsilon_{..}) = 0, \qquad E(\varepsilon_{ij}^{2}) = \sigma^2, \qquad E(\varepsilon_{i.}^{2}) = n\sigma^2, \qquad E(\varepsilon_{..}^{2}) = an\sigma^2
\]
Now
\[
E(SS_{\text{Treatments}}) = E\!\left(\frac{1}{n}\sum_{i=1}^{a} y_{i.}^{2}\right) - E\!\left(\frac{1}{an}\,y_{..}^{2}\right)
\]
Consider the first term on the right-hand side of the above expression:
\[
E\!\left(\frac{1}{n}\sum_{i=1}^{a} y_{i.}^{2}\right) = \frac{1}{n}\sum_{i=1}^{a} E(n\mu + n\tau_i + \varepsilon_{i.})^{2}
\]
Squaring the expression in parentheses and taking expectation results in
\[
E\!\left(\frac{1}{n}\sum_{i=1}^{a} y_{i.}^{2}\right) = \frac{1}{n}\left[a(n\mu)^{2} + n^{2}\sum_{i=1}^{a}\tau_i^{2} + an\sigma^2\right] = an\mu^2 + n\sum_{i=1}^{a}\tau_i^{2} + a\sigma^2
\]
because the three cross-product terms are all zero. Now consider the second term on the right-hand side of E(SS_Treatments):
\[
E\!\left(\frac{1}{an}\,y_{..}^{2}\right) = \frac{1}{an}E\!\left(an\mu + n\sum_{i=1}^{a}\tau_i + \varepsilon_{..}\right)^{2} = \frac{1}{an}E(an\mu + \varepsilon_{..})^{2}
\]
since Στi = 0. Upon squaring the term in parentheses and taking expectation, we obtain
\[
E\!\left(\frac{1}{an}\,y_{..}^{2}\right) = \frac{1}{an}\left[(an\mu)^{2} + an\sigma^2\right] = an\mu^2 + \sigma^2
\]
since the expected value of the cross-product is zero. Therefore,
\[
E(SS_{\text{Treatments}}) = E\!\left(\frac{1}{n}\sum_{i=1}^{a} y_{i.}^{2}\right) - E\!\left(\frac{1}{an}\,y_{..}^{2}\right) = an\mu^2 + n\sum_{i=1}^{a}\tau_i^{2} + a\sigma^2 - (an\mu^2 + \sigma^2) = \sigma^2(a-1) + n\sum_{i=1}^{a}\tau_i^{2}
\]
Consequently the expected value of the mean square for treatments is
\[
E(MS_{\text{Treatments}}) = E\!\left(\frac{SS_{\text{Treatments}}}{a-1}\right) = \frac{\sigma^2(a-1) + n\sum_{i=1}^{a}\tau_i^{2}}{a-1} = \sigma^2 + \frac{n\sum_{i=1}^{a}\tau_i^{2}}{a-1}
\]
This is the result given in the textbook.
For the error mean square, we obtain (where N = an is the total number of observations)
\[
E(MS_E) = E\!\left(\frac{SS_E}{N-a}\right) = \frac{1}{N-a}E\!\left[\sum_{i=1}^{a}\sum_{j=1}^{n}(y_{ij} - \bar{y}_{i.})^{2}\right] = \frac{1}{N-a}E\!\left[\sum_{i=1}^{a}\sum_{j=1}^{n} y_{ij}^{2} - \frac{1}{n}\sum_{i=1}^{a} y_{i.}^{2}\right]
\]
Substituting the model into this last expression, we obtain
\[
E(MS_E) = \frac{1}{N-a}E\!\left[\sum_{i=1}^{a}\sum_{j=1}^{n}(\mu + \tau_i + \varepsilon_{ij})^{2} - \frac{1}{n}\sum_{i=1}^{a}\left(\sum_{j=1}^{n}(\mu + \tau_i + \varepsilon_{ij})\right)^{2}\right]
\]
After squaring and taking expectation, this last equation becomes
\[
E(MS_E) = \frac{1}{N-a}\left[N\mu^2 + n\sum_{i=1}^{a}\tau_i^{2} + N\sigma^2 - N\mu^2 - n\sum_{i=1}^{a}\tau_i^{2} - a\sigma^2\right] = \sigma^2
\]
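Both expected mean squares can be checked by simulation. The sketch below (assuming NumPy; the values of a, n, σ², and the treatment effects τi, which sum to zero, are arbitrary choices) averages MS_Treatments and MS_E over many simulated balanced single-factor experiments and compares them with σ² + nΣτi²/(a − 1) and σ².

import numpy as np

rng = np.random.default_rng(11)
a, n = 4, 6                                  # number of treatments and replicates per treatment
sigma2 = 2.0
tau = np.array([-1.0, 0.5, -0.5, 1.0])       # treatment effects; they sum to zero
mu = 10.0
reps = 20_000

ms_trt = np.empty(reps)
ms_err = np.empty(reps)
for r in range(reps):
    # Balanced one-way layout: y_ij = mu + tau_i + eps_ij
    y = mu + tau[:, None] + rng.normal(0, np.sqrt(sigma2), size=(a, n))
    ybar_i = y.mean(axis=1)
    ybar = y.mean()
    ms_trt[r] = n * np.sum((ybar_i - ybar) ** 2) / (a - 1)
    ms_err[r] = np.sum((y - ybar_i[:, None]) ** 2) / (a * n - a)

print(ms_trt.mean(), sigma2 + n * np.sum(tau**2) / (a - 1))   # nearly equal
print(ms_err.mean(), sigma2)                                  # nearly equal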