Box, G.E.P., J.S. Hunter, and R.J. Hader; (1954)The effect of inadequate models in surface fitting." (U.S. Army Ordnance)

THE EFFECT OF INADEQUATE MODELS IN SURFACE FITTING
Prepared
U~der
Offioe of Ordnance Research
Contraot Noo
DA·36-034-oP~-1177
by
Gil E~ Po
Box
J.. S II Hunte:r
R. J o Hader
Institute of Statistics
Mimeo Series No o 91
JanuaryJ 1954
(p~)
•
TECHNICAL REPORT NO. 5
THE EFFECT OF INADEQUATE MODElS IN SURFACE FITTING
Prepared Under Contract No. DA..36-o34-0RD-ll77 (Rn)
(Experimental Designs for Industrial Researoh)
Ordnanoe Project No. TB2.-000l (832)
Dept. of A;rmy Project No. 599-01-004
Philadelphia Ordnance District
Department of the Army, Department of Defense
with
Institute of Statistics
North Carolina State College of
The University of North Oarolina
Raleigh, North Carolina
/
Technical Supervisor
Ballistics Research Laboratories
Aberdeen Proving Ground
Aberdeen, Maryland
•
G. E.. P. Box
J. S. Hunter
R. J. Hader
Authors of Report
1.
INTRODUCTION
In earlier work on statistical problems oo~ected with surface fitting L-l_7,
,-2.7, ,-4.7 some of the consequences of IIlaok of fit" were considered briefly.
This report is intended to supplement the original discussion and particularly to
point out the extremely interesting and fundamental role of the alias matrix.
Certain relations derived previously are herein presented in a manner which more
clearly reveals their essential nature.
In order to make this report reasonably seJ£-contained we review some important
aspeots of surface f1 tting by the method of least squares. ,Suppose experimental
values of a response variable, y, are available for N selected combinations of
;
independent variables J.Cl , x2' ••• , x • The eXJ:l6rimenter postulates that the surface
k
can be adequately desoribed by an equation of the form
in whioh the ~ i are unknown ooefficients
~d
the Xi are the original independent
variables and produots and powers thereof. Equation (1) may be regarded as the
Taylor Series approximation, say to terms of order d, of the true response surface.
The objeot is then to estimate the
~i'
The least squares prooedure of estimation is conveniently described in
notation,
Let Y be the (N x 1) matrix of experimental values of y and
DEl. trix
1.. the
corresponding (N x 1) matrix of "'true ll. values of the response variable, i.e.,
E(Y)
•
1
(2)
Also assume
. . . . .\where IN is the (N x N) identity matrix; Equation (3) states that the errors in y
are unoorrelated and have common .variance, a2 •
-2-
•
Iat X be an N x L matrix in whioh the elements of the i th colwnn are the
t,lumerical values taken in the N experiments by the variable xi of equation (1).
In
matrix notation we write the postulated model as
(4)
The least squares estimate of
~
is then given by the (L x 1) matrix B
B •
(X'X)-l X'Y
(5)
~
(6)
and it can be shown that
:a: (B) =
and
(7)
The estimates thus obtained have minimum variance.
An unbiased estimate s2 of a2 is provided by
(N - L)6
2
•
(Y - XB)'(y - XB)
• y'y _ Y'XB
• y'y - B'X'XB
(8)
(9)
(10)
(11)
Let
t
be an N x 1 matrix of predicted values~ i.e.,
.y•
= XB
(12)
..3Writing
y
• + (I - Y)
•
Y
D
(13)
and taking advantage of
•
•l
we have
I ly
D
y (I - I)
=
.,.
.
(J4)
0
.
Y I + (I ... I)' (I .. I)
(15)
that is, the sum of squares of the observed y's can be broken up into two parts,
• , I,
•
• I (I';'" I)
•
Y
the sum of squares of the predicted values and (y - Y)
the sum of squares
of the residuals.
It is sometimes convenient to set out these quantities, which are
often called IIsum of squares due to regression" and "sum of squares about regression",
in an analysis of variance table
Sum of Squares
Expeoted Value of
Sum of Squares
Due to Regres!ion
B,tX'XB
rn2
About Regression
III _ BIX'XB
(N - L)<1
Total
III
Ni . . ~ 'Xt~
~IS,
91 , Q2' ••• , gpo
~
but a set of linear
Let this set be written as
9
..
113
where Mis a (p x L) rna trix of known coefficients.
~
~ IXI~
2
Often we wish to estimate not the individual coefficient
functions of the
of,
(16)
Then the linear estimates
t , t 2 , ••• , t p of the gts whioh severallYhave smallest possible varianoes are
l
-4'supplied by
T
lit
(17)
ME
where B is the vector of least squares estimates and T is the (p x 1) vector of
the estimates t , t , ... , t '
l
2
p
Markoff theorem.. ,-3].
This important theorem is. otten called the Gauss-
The variances and covariances of these estimates are provided by the matrix
(18)
2..
EFFECT OF LACK OF FIT OF MODEL
In practice it will often happen that the model (1) used by the experimenter is
not oompletely adequate to desoribe the surface.
Variables, other than those taken
acoount of.. may, in fact, be
In particular, higher order terms
exertL~g influenoe~
of the Taylor Series, may not really be negligible.
In order to investigate the
consequences of this so-oalled "'laok of fit" we write the true model as
where Xl in an (N x Ll ) matrix of the values of the ~ variables aotually fitted for ..
X2 is an (N x ~) matrix of the values of the ~ variables not taken aocount of J
~1.2
an
X's and
of X's.
(~
x 1) matrix of partial regression coefficients for the first set of
~2.l
an
(~
x 1) matrix of partial regression coefficients for the second set
Ii' now the experimenter uses
. (20)
•
-,then
(21)
where
(22)
The
(~
matrix.
x
~)
matrix A is a lI1aItrix of bias coefficients and may be called the "alias"
As is obvious from (2a) these coefficients are in fact regression coefficients
of X2 on Xl' The r th col~ of A contains the ~ regression coefficients of the r th
column vector in X2 on the ~ column vectors of Xl'
The estimate 8 2 of the residual error variance a2 would also be biassed in this
situation, The essential nature of the bias term is best understood if we arrange
the model (19) by adding Xl~2.l to the first term and subtracting it from the
f .
second J whence we obtain
We notioe that the columns of the matrix (X2 - X1A) are the vectors of residuals of
the regressions of X2 on X1J so that this matrix may be written as X ,l' Equation
2
(19) may finally then be written in the form
(24)
where
(2,)
is the set of regression coefficients for Xl ignoring the X variables,
2
Then
(N - Ll )8
2
III
C(Y
- X1Bl ) , (Y - X1Bl )
=
t [c y -2) - Xl (Bl
iCY - 7)
=t{(y -2) - Xl(Bl
-
~l)]
.~~ .lx~.l t {(Y - 2)
of;-
I
- Xl (Bl
- Xl (Bl
f(Y -() - Xl(Bl
-
-
-
~l)
oft
X2 .JJ32 .l}
-
~l)
oft
X2 .1~2 .11 (26)
I
~l)]
~l) 5 · [{(y -2)
-Xl(Bl -(31)]
X2.1~2.1
~~ .1X~ .lX2 .1~2.1
(2'7)
The second and third terms on the right hand side of (27) are clearly zero" and
(28)
where
(2.9)
(30)
•
f -
X~ ~
Xl (X~Xl)Xl) X2
We see that the bias in s2 is essentially non-negative sinoe the matrix 02.1 is
°
non-negative.
Also the elements of 2 • are the sums of squares and produots of
1
residuals of the regressions of the vectors of X2 on those of Xl and beoome large
therefore when the vectors X2. have only small inner products with the veotors in
Xl'
The ana.lysis of variance is now
Sums of Squares
Expected Value of
Sums of Squares
Due to regreesion
~i +(~1.2 +A{32 ,1) IX~Xl(~lo2 +A{32 ,1)
About regression
(N-~)i+j32.1(X2-X1A) I (X2-X1A)~2.1
Total
yty
-7From the point of view of the experimenter who is attempting to estimate the
coeffioients of the Xl terms only and who believed the model of equation (1) to be
adequate" the total sum of squares is influenoed by terms lnvolving
~2 .1'
This
additional sum of squares is divided between the oomponents "due to regression" and
"'about regression"" the distribution of the bias between these two quantities being
determined by the design used.
negative,
In no case can the bias in either sum of squares be
It is possible however for the whole of the bias to be conoentrated in
one of the oomponents leaving the other unbiassed.
A test frequently used for adequacy of fit of the mathematical model involves
the
~omparison
of the "'about regression ll sum of squares with an independent estimate
of error variance obtained, for example" by replication.
Although the incompatibility
of the two estimates due to an oversized lIabout regression" sum of squares will
oertainly indicate lack of fit" it is now clear that when two such estimates
~
compatable it would be dangerous to assume that the fit of the equation is necessarily adequate.
Quite apart from sampling variation" the extent to which the bias
effects will appear in the "'about regression ll sum of squares will depend on the type
of design chosen.
In seleoting designs and performing the subsequent analysis it is
important to bear in mind the alias characteristics of the designo
In particular,
the matrix A and also the alias pattern to be expected in the analysis of variance
should be examined.
For example" suppose we wish to fit the three dimensional planar surface
(32)
The eight levels of (Xl' x 2 ' x ) employed would
3
determine eight points in the three dimensional space of the variables Xl' x ' x
2 3
,which we shall call the experimental design D. One arrangement D which would aUor1
l
to a set of eight observations.
•
",,8the efficient estimation of the
~ts
of a cube that is, a 23 factorial.
_
would be a set of points arranged at the vertices
An alternative oould be based on a selection of
four points ohosen from the cube which form a tetrahedron.
This is a half replioate
of the 2 3 design in which the defining contrast is the three factor interaction.
Imagine then an alternative design of eight points in which two such tetrahedra
where used, one being obtained from the other by a simple translation (so that
oorresponding to a point xl' x2' x in the first tetrahedra there was a seoond point
3
xl + a, ~ + b, x + c.)
3
/',
':
/1
,
@-_..... - - .
•....._._.\1
I
;
I
I
I
••••
I
I
,:
.-.. _... _....;·/tl
•
/,--"-' .- .. '-.•. ~
....:--..._..- .../ i r
I I / _..' ,.... I .. _ .....-'
., IV ._- !-- •. - +-I / / I
~
I
:
'
\
'(
I
...
;1
I I
!
I
i
I O._·L.-;.. ':' I
1.-'
I
I
I
:./ I
I
. ,~._- I"-O"'-~· _. ~o .J
I
/
Ie: ..
..
0"
••••••
-
0'"
Suppose now that, in faot, within the region considered, the surfaoe was not planar,
that is to say not expressible in terms of equation (32) but instead a seoond degree
equation was required.
~.
(33)
Using equation (22.) we would have the following alias pa.tterns for the two designs
D
l
D
2
bO~ ~O + ~ll + ~22 + ~33
bO ~~O + ~.ll + ~22 + ~33
b1 -"> ~l
bl~ ~l + ~2.3
~2
b2 -" ~2 + ~13
b3~~3
b3~~3 1o~12
b2 ,
where the arrow notation indicates that the quantity of the left is an unbiassed
estimate of the quantity on the right.
The expe(rbed values of the sums of squares in the analysis of variance would be
as follows:
Expected Values of Sums of Squares
Dl
2
4,(1
it
6 Lt~o + ~ll + ~22
About regression
40'2
+.
6
Due to regression
4(1 "" 6 L\~o+ ~ll+ P22+ ~33) +(~l+' P. 23 ) +(~2"" ~13)
About regression
4i
2
L-2
~12
it
2
~lJ
+.
2
+.
~23-
~
~3J)
222
2 7
+ ~l .. ~2 + ~J-
Due to regression
7
2·
2
2
+(p)io ~12)
2
With the first design the "due to regression" sum of squares is inflated only
by the biassing of
~o
with
~llJ ~22J ~3JQ
If~
as would be usual, the mean where
eliminated the IIdue to regression" sum of squares would be unbiassed.
However" the
II/about regression" sum of squares would be biassed by all three interaction terms.
We see in contrast
th~t
using the second design" the sum of squares "due to regres-
sion" is biassed by all the second order terms, but now the "about regression" sum
of squares is unbiassed.
This illustrates two limiting cases which could" of course"
be readily treated by more elementary means.
They serve to show, however, the care
that is required in interpreting lack of fit procedures.
In particular, where" as is usual in multiple regression problems, an equation
is 1'itted to data in which the variables Xl' x ' x have not been controlled but
2 3
merely observed, correlation between the XiS may lead to pathological situations.
-10-
3. .ADDITION OF EXTRA CONSTANTS
Aproblem whioh sometimes faoes the experimenter is that of adding further
constants to the model.
The first model assumed might be
(34)
and the experimenter will have oalculated the estimates Bl , the precision matrix
{ XiX
I _7 -1 and the estimate of error s2 in the manner already descr~bed.
.
C-1
l •
l
It may now appear (for example as the result of an analysis of variance such as that
described above) that the model is madequate and that a more elaborate model
(35)
should be assumed. The new model could be fitted of course by calculating
the estimates B1 • 2 and B2 •1 ot
preoision matrix
~1.2
and
~2
-1
°1 + 3
.1'
~
initio
During the oalculation the new
,
X~Xl
XI X2
~Xl
X2I X2
(36)
III
L
2
would normally be found and using these quantities the new estimate of error 8 1 + 2
could also be obtained. However" it would be preferable if possible to use a method
in which the labor of obtaining the first estimates could be utilized. As we have
seen in section 2, above, we may re-write the model in the form
_
that is
(38)
-11-
•
where
Now
(40)
That is" the vectors of X2 •1 are orthogonal to the vectors of Xl'
~Jhence
(41)
and using (17) with (39)
(42)
This provides the desired procedure which requires only the inversion of
an
(L2 x ~) matrix C2 ,1" {X~ .lx2 .1_7 snd utilizes the already inverted Cil,
precision matrix
Ci~2 is readily found by writing the modified model
.
The
(37) in 'the form
(43)
thus
and equals
f
r.
l
I :-A
- o
I
I
--
-1
C1
-
-I"':'
0
0
-
__ 1 __ -
I
I
I
I
I
I
I
C-1
2.1
I
-
I
-
O-
J
f- -
_At I I
I
I
(4$)
-12where
(46)
Consequently
(47)
The same result may of oourse be obtained by the direct inversion of Cl ,2' The
above method of derivation is preferred since it indioates more olearly the
essential nature of the prooedure oarred out.
Finall beoause X • l and Xl are orthogonal we have
2
(48)
REFERENCES
(1)
Box, G. E. P. and Wilson, K. B., "On the Experimental Attainment of Optimum
Conditions"', Journal of the Royal Statistical Sooiety, Series B, XIII,
1, p. 1, 1951.
(2)
Box, G. E. P." II'Multi-Faotor Designs of First Order", Biometrika 39:49, 1952.
(3)
Plackett, R. L." "Some Theorems on Least Squares", Biometrika 36:458, 1949.
(4)
Box" G. E. P., Hader, R, J., Hunter" J. 5" "Experimental Designs for MultiFaotor Experiments: Preliminary Report", Prepared under Contraot
DA-36-034-0RD-1177 (RD) for the Office of Ordnance Research by the
Institute of Statistics, No C. State College, June 1953.
DISTRIBUTION LIST
No. of
Technioal Copies
Ageno;r
Office of Ordnance Research
Box eM
Duke Station
Durham, North Carolina
10
Office, Chief of Ordnance
Washington ~5, D. C.
ATTN: ORDTB
1
Director
National Bureau of Standards
Washington 25, Do C.
ATTN: Statistical Engineering Laboratory
1
District Chief" Phila. Ordnance District
1500 Chestnut Street
Philadelphia 2, Pat
Attn: Chief" Artillery..Small
2
Com~anding General
Abe::.~cieen Proving Ground,
AT'll] ;
BEL
Maryland
•
•
-2-
Agency
No. of
Technical Copies
Commanding Officer
Frankford Arsenal
Bridesburg Station
Philadelphia 37 1 Pa.
1
Commanding Officer
Picantinny Arsenal
Dover, New Jersey
1
CommandingO.fficer
Redstone Arsenal
Huntsville, Alabama
1
Commanding Officer
Rock Island Arsenal
Rock Island, Il]inois
1
Commanding Officer
Watertown Arsenal
Watertown 72, Mass.
1
Chief, Bureau of O!'dnance (AD3)
Department of the Navy
Washington 25, D. C.
1
Commander
U. S. Naval Proving Ground
Dahlgren, Virginia
1
Director
Applied Physics Laboratory
Johns Hopkins Uni-.rersity
8621 Georgia Avenile
Silver Spring 19" Maryland
1
Corona Laboratories
National Bureau of Standards
Corona" Ca.lifornia
1
U. S. Naval Ordnance
White Oak" Silver Spring 19, Md.
ATTN: Library Division
1
The Director
Naval Research Laboratory
Washington 25, D. C.
ATTN: Code 2021
1
..
)-
Agency
No. of
Technical Copies
Commander, U. S. Naval Ordnance
Test Station, Inyokern
China Lake, California
ATTN: Technical Library
1
Commanding General
Air University
Maxwell Air Force Base, Ala.
ATTN: Air University Library
1
Commanding General
Air Research and Development Command
Baltimore.. Maryland
ATTN: RnR
1
Commanding General
Air Research and Development Command
Baltimore, Naryland
ATTN:, RnD
1
Commanding General
Air Material
Wright-Patterson Air Force Base
Dayton 2.. Ohio
ATTN: Flight Research Lab.
F. N. Bubb, Chief Scientist
1
NAC for Aeronautics
1724 F street, Northwest,
Washington 25.. D. C.
Attn: Mr. E. B. Jackson, Chief
Office of Aeronautical Intelligence
•
1
U. S. Atomic Energy Commission
Document Library
19th and Constitution Avenue
Washington 25, D. C.
Attn: Director, Division of Research
1
Technical Information Service
P. O. Box 62
Oak Ridge.. Tennessee
Attn: Reference Branch
1
Commanding General
Research and Engineering Command
Army Chemical Center, Maryland
1
-4• No• .ot.
'.
.
Technical Copies'
Agency
Commanding Officer
Signal Corps Engineering Laboratory
Fort Monmouth" New Jersey
Attn: Director of Research
1
Commanding Officer
Engineer Research and Development Laboratories
Fort Belvoir, Virginia
1
Scientific Information Section
Research Branch
Research and Development Division
Office" Assistant Chief of Staff,
Department of the Army
Washington 25, D. C
1
0-4
Director
National Bureau of Standards
Washington 25, D. C.
1
Jet Propulsion Laboratory
California Institute of Technology
4800 Oak Grove Drive
Pasandena 3" California
1
Chier, Ordnance Development Division
National Bureau ot Standards
Washington 25" D. C.
1
Commanding General
Air Research and Development Command
pO Box 1395
Baltimore 3" Maryland
ATTN: Office of Scientific Research (RDR)
1
Commanding General
Air Research and Development Command
PO Box 1395
Baltimore 3, Maryland
ATTN. BDD
1
Document Service Center
U. B. BUilding
Dayton 2, Ohio
Attn: DSC-SD
1
Chief of Naval Research
c/o Reference Department
Technical Information Division
Library of Congress
Washington 25, D. C.
3