Introduction to Biostatistics: Part 6, Correlation and Regression

Correlation and regression analysis are applied to data to define and quantify the relationship between two variables. Correlation analysis is used to estimate the strength of a relationship between two variables. The correlation coefficient r is a dimensionless number ranging from -1 to +1. A value of -1 signifies a perfect negative, or indirect (inverse), relationship. A value of +1 signifies a perfect positive, or direct, relationship. The r can be calculated as the Pearson product-moment r, using normally distributed interval or ratio data, or as the Spearman rank r, using non-normally distributed data that are not interval or ratio in nature. Linear regression analysis results in the formation of an equation of a line (Y = mX + b), which mathematically describes the line of best fit for a data relationship between X and Y variables. This equation can then be used to predict additional dependent variable values (Ŷ), based on the value of the independent variable X, the slope m, and the Y-intercept b. Interpretation of the correlation coefficient r involves use of r², which implies the degree of variability of Y due to X. Tests of significance for linear regression are similar conceptually to significance testing using analysis of variance. Multiple correlation and regression, more complex analytical methods that define relationships between three or more variables, are not covered in this article. Closing comments for this final installment of this introduction to biostatistics series are presented. [Gaddis ML, Gaddis GM: Introduction to biostatistics: Part 6, correlation and regression. Ann Emerg Med December 1990;19:1462-1468.]
Monica L Gaddis, PhD*
Gary M Gaddis, MD, PhD†
Kansas City, Missouri
From the Departments of Surgery* and Emergency Health Services,† University of Missouri-Kansas City School of Medicine, Truman Medical Center, Kansas City, Missouri.
Received for publication August 23, 1990.
Accepted for publication September 10, 1990.
Address for reprints: Monica L Gaddis, PhD, Department of Surgery, University of Missouri-Kansas City School of Medicine, Truman Medical Center, 2301 Holmes, Kansas City, Missouri 64108.
INTRODUCTION
In research or clinical practice, the physician is often asked to relate two or more variables to predict an outcome. This is exemplified by how risk factors are devised for prediction of chance of disease occurrence. Risk factors associate physical or behavioral characteristics with illness in a manner that is easily understood by the lay public. An example is the association between cigarette smoking, hypertension, and heart disease.
The statistical methods used to define or describe such risk factor-disease relationships are termed correlation and regression analysis. While the
terms correlation and regression are often used together, they represent
separate steps in the process of relationship analysis. Correlation analysis
provides a quantitative way of measuring the strength of a relationship
between two variables. Regression analysis is used to mathematically describe that relationship, with the ultimate goal being the development of
an equation for prediction of one variable from one or more other variables.
This final installment of the Introduction to Biostatistics series will provide an independent discussion of correlation and regression analysis. Final comments regarding the entire biostatistics series also are included.
CORRELATION ANALYSIS
In nature, one can often see that two or more variables are related or
correlated. The latitude of a city and its average daily temperature, intelligence quotient of a student and his grade point average, drug dosage administered and the resultant physical response, and caloric consumption
and weight gain or loss are examples. While some relationships between
two or more variables appear obvious, they by no means serve as perfect
FIGURE 1. Scatter diagram for sample data given in Table 1 (caloric consumption vs weight change).
predictors of outcome. The relationships may be confounded by extraneous variables that can adversely affect a "perfect" correlation, causing variability of the responses. For example, while it seems obvious that the closer to the equator a city lies, the higher its average daily temperature should be, altitude and weather patterns may influence or change the expected relationship. In addition, individual metabolic rate and physical activity will certainly affect weight gain or loss, regardless of controlled caloric consumption. Thus, while the direction of a relationship is often intuitive, the strength of that relationship is not.1
The discussion of the degree of relationship between two random variables is also a discussion of the correlation between those variables because "correlation is a relation."2 However, it is very important to remember that correlation simply recognizes a relation. Correlation does not imply causation!
The degree of correlation between two variables is estimated in the following manner: first, visualization of the relationship is best obtained by placing the variables in question in graphic form. This graphic representation is termed a "scatter diagram."1,3 A scatter diagram is made by use of an X (horizontal) and Y (vertical) axis graph. Figure 1 is a scatter diagram of mean caloric consumption per day versus weight change per month (Table 1). While the total number of patients is small, a relationship emerges from the scatter diagram that indicates that as caloric consumption increases, so also does weight. In Figure 2, some of the various relationships seen between two random variables in a scatter diagram are shown. These include a) positive (direct) linear; b) negative (indirect) linear; c) and d) curvilinear; e) no relationship; and f) exponential.1,4,5 However, the scatter diagram is limited to simply describing the appearance or pattern of the relationship.
To quantify the relationship in a meaningful manner, a correlation coefficient, a numerical value that describes the strength of the relationship, must be calculated.4-8 The correlation coefficient r is a dimensionless number ranging from -1 to +1, with -1 depicting a perfect negative linear relationship and +1 depicting a perfect positive linear relationship.1-8
TABLE 1. Sample data: Caloric consumption versus weight change

Patient   (X) Mean Caloric Consumption/Day   (Y) Weight Change/Month
1         1,200                              0.0
2         1,500                              0.5
3         1,800                              0.5
4         2,000                              1.5
5         2,500                              4.0
6         1,800                              1.0
7         2,500                              3.0
8         2,000                              2.0
The closer that r is to 0, allegedly, the weaker the relationship. However, "a small correlation between two variables can also be due to 1) little linear association between the two variables, or 2) large errors in the measurement of the variables."6
There are several methods available to calculate a correlation coefficient. Included in this discussion are the Pearson product-moment r and the Spearman rank r, both commonly used in clinical data analysis.
Pearson Product-Moment r
The Pearson r is used to quantify the strength of a linear relationship between two continuous variables (of interval or ratio scales) that are from normally distributed populations.1,3,5

FIGURE 2. Scatter diagram relationships: a) positive (direct) linear; b) negative (indirect) linear; c) and d) curvilinear; e) no relationship; and f) exponential.

Before calculation of the Pearson r is shown, it is important to understand the concept of covariance. Recall from part 3 of this series9 that variance is a measure of the variability or dispersion of a single random variable.6 Covariance, an extension of this, is defined as "a measure of how two random variables vary together."6 Using the example in Table 1, computation of covariance is as follows:1,3
1. Calculate the mean values of the X and Y variables: (X̄ and Ȳ).
2. Subtract the corresponding mean from each individual value: (xi - X̄) and (yi - Ȳ).
3. Calculate the product (xi - X̄)(yi - Ȳ) for each variable pair.
4. Sum the products calculated in step 3.
5. Divide the sum of the products by the total number of variable pairs (n) minus 1: (n - 1).
Covariance = Sxy = Σ[(xi - X̄)(yi - Ȳ)] / (n - 1)

A positive covariance implies that when one variable is greater than its mean, so too is the corresponding variable. A negative covariance implies that when one variable is greater than its mean, the corresponding variable will be less than its corresponding mean.
Covariance relates to the equation for the Pearson r in that it is the numerator in the equation for r. The denominator is obtained by calculating the standard deviation (SD) for the values of variable x (Sx), the SD for the values of variable y (Sy), and finding their product (Sx · Sy) (Table 2). Therefore, the equation for the Pearson product-moment r is:1,3,5,7

r = Sxy / (Sx · Sy)
The calculated Pearson r in the example (Table 2) is .94. This implies some degree of positive or direct linear correlation.
Spearman Rank Order r
When one or more of the variables being analyzed for strength or direction of a relationship or trend is not of an interval or ratio scale, is not drawn from a normally distributed population, or does not possess a linear relationship, a nonparametric correlation technique must be used. One
such method is the Spearman rank order correlation coefficient. The Spearman rank r is based on the rank order of the individual data points and not the actual numerical values. The Spearman rank r is calculated as follows (Table 3):3,5
1. The values of the two variables are ranked in ascending or descending order (for ties, the rank is the average rank).
2. Calculate the difference between the ranks for each pair of data.
3. Calculate the sum of the squared differences (from step 2) and multiply by 6.
4. Divide the number calculated in step 3 by n(n² - 1), where n equals the number of pairs of data.
5. Subtract the number found in step 4 from 1.
Spearman rank r = 1 - 6Σ(d²) / [n(n² - 1)]
Note that Pearson r and Spearman rank r are similar for the same set of data. This will be true especially for large data sets where the number of pairs of data is more than 50.1
Interpretation and Use of the Correlation Coefficient
As previously stated, while a strong correlation might exist between two independent variables, this does not imply that one event causes the other. The reasoning behind this is threefold:
1. Which comes first, the chicken or the egg (ie, does X cause Y, or does Y cause X)? The r calculation does not have the capabilities to determine the order of the causal relationship.
2. A variable other than X or Y may influence the relationship.
3. Complex relationships as found in the biomedical sciences can rarely be explained by a simple two-variable relationship.1
Thus, r (whether from Pearson or Spearman) is simply a descriptor of the strength and direction of the relationship between the two variables
TABLE 2. Calculation of Pearson product-moment correlation coefficient using sample data from Table 1

Patient   (xi - X̄)                   (yi - Ȳ)              (xi - X̄)(yi - Ȳ)
1         1,200 - 1,912.5 = -712.5   0.0 - 1.56 = -1.56    1,111.5
2         1,500 - 1,912.5 = -412.5   0.5 - 1.56 = -1.06    437.25
3         1,800 - 1,912.5 = -112.5   0.5 - 1.56 = -1.06    119.25
4         2,000 - 1,912.5 = 87.5     1.5 - 1.56 = -0.06    -5.25
5         2,500 - 1,912.5 = 587.5    4.0 - 1.56 = 2.44     1,433.5
6         1,800 - 1,912.5 = -112.5   1.0 - 1.56 = -0.56    63.0
7         2,500 - 1,912.5 = 587.5    3.0 - 1.56 = 1.44     846.0
8         2,000 - 1,912.5 = 87.5     2.0 - 1.56 = 0.44     38.5

X̄ = 1,912.5; Ȳ = 1.56; Σ[(xi - X̄)(yi - Ȳ)] = 4,043.75

Step 1: Sxy = Σ[(xi - X̄)(yi - Ȳ)] / (n - 1) = 4,043.75 / (8 - 1) = 577.7
Step 2: Sx = √[Σ(xi - X̄)² / (n - 1)] = √(1,408,750 / 7) = 448.6
Step 3: Sy = √[Σ(yi - Ȳ)² / (n - 1)] = √(13.22 / 7) = 1.37
Step 4: r = Sxy / (Sx · Sy) = 577.7 / [(448.6)(1.37)] = 0.94
in question and nothing more. Just as in hypothesis testing, P < .001 does not imply greater significance than P < .05, so a Pearson or Spearman r of 0.99 does not imply a more likely causation than an r of 0.95. However, correlation analysis may guide the researcher in determining causation. Additional studies to aid in defining the causative variables may be developed. These studies may be based on knowledge gained from prior correlation analysis. While the value of r describes the strength of a relationship between two variables, an r of 0.5, for example, does not imply that "the strength of that relationship is 'halfway' between zero correlation (no relationship) and a perfect positive correlation (1.0)."3 The correlation coefficient r can be squared and then used to estimate "the percent of variation in one variable that is explained by variation in the other variable."3
In the sample data given (Table 1), the calculated Pearson r is 0.94. The square of r is 0.88. This implies that 88% of the variability in weight gain or loss can be attributed to variability in the amount of calories consumed. Otherwise stated, the amount of calories consumed provides us with approximately 88% of
TABLE 3. Calculation of Spearman rank order correlation coefficient using sample data from Table 1

Patient   Rank X   Rank Y   d      d²
1         1.0      1.0      0      0
2         2.0      2.5      -0.5   0.25
3         3.5      2.5      1.0    1.00
4         5.5      5.0      0.5    0.25
5         7.5      8.0      -0.5   0.25
6         3.5      4.0      -0.5   0.25
7         7.5      7.0      0.5    0.25
8         5.5      6.0      -0.5   0.25

Σ(d²) = 2.5

rs = 1 - [6(Σd²)] / [n(n² - 1)] = 1 - [(6)(2.5)] / [(8)(63)] = 1 - 0.030 = 0.97
the information needed to predict weight gain or loss. Finally, r is calculated using sample data. It is only an estimate of the true correlation coefficient, rho (ρ), between two population variables, just as the mean (X̄) is a sample mean and is only an estimate of the true population mean (μ).
Should one wish to determine whether r is significant (ie, whether a significant correlation exists between
two sample variables), hypothesis testing is indicated. This hypothesis testing answers the question, "Is r different from 0 only because of chance variation (sampling error), or because the true population correlation coefficient ρ is not 0?"3 To complete the hypothesis test for r, the value of r and the degrees of freedom, (n - 2) (where n = the number of pairs of data correlated), are applied to the appropriate table of critical values of r.
FIGURE 3. For a given value of the independent variable (X), there will be a subpopulation of dependent variables (Y) that displays a normal distribution. The mean value of each subpopulation will form a point on the regression line.
FIGURE 4. Regression line for sample data given in Table 1 (caloric consumption vs weight change). Line of best fit: Ŷ = .003X + (-3.93); Pearson r = .94.
If the calculated r is greater than the critical r from a statistical table at a predetermined significance level, r is determined to be significantly different from zero, and thus a significant correlation exists. Any correlation coefficients reported in research studies should include the corresponding level of significance.
REGRESSION ANALYSIS
Like correlation analysis, regression analysis is also used to describe the relationship between two or more variables. However, regression, in the process of mathematically describing a relationship, also develops a means of prediction of one variable based on the value of a second variable.8 "In regression analysis, the relationship between two variables (or more) is expressed by fitting a line or curve to the pairs of data points" as seen in a scatter diagram.3 The "predictor variable," or independent variable (X), is that which is selected and manipulated by the investigator. The "predicted" variable, or dependent variable (Y), is that which is influenced by X.
Simple linear regression is used to
describe and predict a linear relationship between two variables, one independent and one dependent. (While
regression analysis can be used in the
analysis and prediction of nonlinear
relationships as well as multiple
variable data sets containing two or
more independent variables, this is
beyond the scope of this discussion.)
The assumptions underlying simple linear regression include:4
1. The values of the independent variable X are set by the investigator. In the process of data collection, the "pre-set" levels of X should not be changed.
2. The independent variable X should be measured without experimental error.
3. For each value of X, there is a subpopulation of Y variables that are normally distributed up and down the Y-axis (Figure 3).
4. The variances of the subpopulations of Y are homogeneous (Figure 3).
5. The means of the subpopulations
of Y lie on a straight line (thus,
the assumption of linearity is met)
(Figure 3).
6. All values of Y are independent
from one another, though they are
dependent on X.
The Regression Line
The process of linear regression involves defining the "line of best fit,"
or the line that best defines the linear
relationship of the data. This line is
the one that passes through the center of the data points.8 For linear regression, this line is defined by the
equation for a straight line.
Recall that the equation for a straight line is:

Y = mX + b
where Y is the dependent variable, X
is the independent variable, m is the
slope of the line, and b is the y-intercept of the line.
Figure 4 shows the line of best fit for the data set described in Table 1. This line is also used as a predictor for additional data (dependent variables). When the line of best fit is used to predict Y, the equation is simply modified as Ŷ = mX + b. Ŷ implies a predicted (or estimated) dependent variable, based on the defined slope, y-intercept, and independent variable values.
The calculation of the regression line (line of best fit) is performed by the "least-squares method." This is accomplished by calculating the equation for the line such that the sum of the squared deviations for each data point from the predicted line of best fit is as small as possible.3,6 In other words, the least-squares method calculates the line that comes the closest to running through all of the data points. Therefore, the least-squares method derives the equation of the line that relates the linear relationship between all of the data points with a minimum of error. For details of the calculation of the regression line by the least-squares method, consult any of the referenced statistics texts.1,3,5-8
The prediction of a dependent variable by use of the regression line of best fit is also related to the correlation coefficient r. The letter r is used to designate the correlation coefficient, yet r implies regression. This is indicative of the relationship between correlation and regression. An r of +1 or -1 implies a perfect correlation between X and Y and also implies that predicted values of Y would assume a straight line. If r = 0, the best predicted value of Y is the mean of Y. Thus, the closer the absolute value of r is to 1.0, the closer the predicted values of Y will lie to the regression line.2
Interpretation of the Regression Line
Tests of significance for linear regression are conceptually similar to analysis of variance (ANOVA).2 Recall from the prior discussion of ANOVA9 that the total variance in ANOVA can be partitioned into the "between groups (experimental) variance" and the "within groups (error) variance." Alternatively expressed, total variance = experimental variance + error variance. Regression analysis is similar in that the regression variance is partitioned in the same way. Variance due to the regression process per se is analogous to ANOVA experimental variance. Variance due to the "residuals," or the deviations of the individual data points from the regression line, is analogous to ANOVA error variance. Therefore, for regression, total variance = regression variance + residual variance. The variance is then put into the form of the sums of squares (SS), so that for ANOVA, the equation is:9

SS(total) = SS(between groups) + SS(within groups)

and for regression analysis, the equation is:2

SS(total) = SS(regression) + SS(residual)
Once the sums of squares are calculated, the F ratio can be derived. For ANOVA, the F ratio is calculated using SS values and the degrees of freedom (df) applicable:

F = [SS(between groups) / df] / [SS(within groups) / df]

and for regression, the F ratio is calculated:

F = [SS(regression) / df] / [SS(residual) / df]
In both cases, the calculated F is then compared with the critical F, found in a table of critical F ratios. For ANOVA, when the calculated F is greater than the critical F, the difference between group means is concluded to be different from zero, and there exists a significant difference between at least two of the groups compared. For regression, when the calculated F is greater than the critical F, the slope of the regression line is statistically different from zero. Calculation by this method is cumbersome and often tedious. Most statistical packages for computers that contain linear or multiple regression calculate both the regression line and the test of significance for that line.
CONCLUDING REMARKS
This installment concludes this six-part series of Introduction to Biostatistics articles. Objectives of the series have included the introduction of basic concepts of commonly used biomedical statistical tests to facilitate comprehension by persons with little or no background in biostatistics. We hope that this series has increased the understanding of biomedical statistics among clinicians, residents, and residency faculty.
Part 1 of this series in the January issue of Annals included introductory comments and discussed the frequency of errors of use of biomedical statistics in the medical literature. The concepts of sample and population were introduced. Types of data were discussed. As has been shown in subsequent installments of the series, it is crucial to understand what type of data is being analyzed in order to properly select both descriptive statistics and inferential statistical tests for significance testing.
Part 2 in March discussed descriptive statistics, including measures of central tendency and measures of variability. It was shown that certain descriptive statistics are inappropriate descriptors of certain types of data. For example, the mean value for an ordinal data sample is at best misleading and represents an erroneous use of the mean. Confidence intervals were also discussed.
Part 3 in May covered sensitivity, specificity, and predictive value of clinical tests. These commonly applied clinical principles were then used as an analogy to facilitate understanding of hypothesis testing. In this manner, the concepts of type I error (α), type II error (β), and statistical power (1 - β) were introduced.
Part 4 in July presented inferential statistical techniques appropriate for parametric data, including the Student t-test and ANOVA. These techniques are appropriate only for interval or ratio data that meet the assumptions of these tests.
Part 5 in September continued discussion of inferential statistical techniques, covering tests properly applied to nonparametric data. Included were the chi-square (χ²) and Fisher's exact test for nominal data, and the "distribution free" rank tests such as the Mann-Whitney U, Wilcoxon, Kolmogorov-Smirnov, Kruskal-Wallis, and Friedman tests. These rank tests are appropriate for analysis of ordinal data and also for interval or ratio data that do not meet the assumptions of the t-test or ANOVA.
Finally, this article, part 6, has discussed correlation and regression, two data analysis techniques used to quantify and define the relationship between two variables. It was shown that tests for significance for linear regression are conceptually equivalent to ANOVA tests for significance of parametric data.
This series did not cover nonlinear and multiple correlation and regression, or multivariate ANOVA. Some specific tests were omitted from discussion in some sections in the interest of brevity. The intent of this series was to introduce basic concepts, not to provide a comprehensive review of all possible statistical tests.
The authors acknowledge the contributions of numerous individuals who inspired and aided the completion of this
series. Dr Hal Morris of the Department
of Health, Physical Education, and Recreation at Indiana University first brought
the frequency and consequences of improper statistical analysis in biomedical
sciences to our attention and instructed
us in much of the information presented
here. Dr Carl Rothe of the Department of
Physiology and Biophysics of the Indiana
University School of Medicine provided
further practical teaching and detailed
analysis methods. The Department of
Emergency Medicine at the Wright State
University School of Medicine provided
us our first opportunity to contribute to
the statistical education of emergency
medicine residents and faculty. Dr Joseph
Waeckerle recognized the importance of
proper statistical analysis, provided a forum to present this information, and lent
editorial assistance to the project. We also
thank Jayna Ross and Kathy Wagner of
the Department of Emergency Health Services at the University of Missouri-Kansas City School of Medicine for preparation of the manuscripts.
REFERENCES
1. Hopkins KD, Glass GV: Basic Statistics for the Behavioral Sciences. Englewood Cliffs, New Jersey, Prentice-Hall Inc, 1978.
2. Kerlinger FN, Pedhazur EJ: Multiple Regression in Behavioral Research. New York, Holt, Rinehart and Winston, Inc, 1973.
3. Knapp RG: Basic Statistics for Nurses, ed 2. New York, John Wiley and Sons, 1985.
4. Daniel WW: Biostatistics: A Foundation for Analysis in the Health Sciences. New York, John Wiley and Sons, 1974.
5. Glantz SA: Primer of Biostatistics, ed 2. New York, McGraw-Hill Book Co, 1987.
6. Elston RC, Johnson WD: Essentials of Biostatistics. Philadelphia, FA Davis Co, 1987.
7. Sokal RR, Rohlf FJ: Biometry, ed 2. New York, WH Freeman and Co, 1981.
8. Kviz FJ, Knapp KA: Statistics for Nurses: An Introductory Text. Boston, Little, Brown and Co, 1980.
9. Gaddis ML, Gaddis GM: Introduction to biostatistics: Part 4, statistical inference techniques in hypothesis testing. Ann Emerg Med 1990;19:820-825.