
Biostatistics, statistical software VI.
Relationship between two continuous variables: correlation, linear regression, transformations.
Relationship between two discrete variables: contingency tables, test for independence.
Krisztina Boda PhD
Department of Medical Informatics, University of Szeged
Relationship between two continuous variables: correlation, linear regression, transformations
Krisztina Boda
INTERREG
2

Imagine that 6 students are given a battery of tests by a
vocational guidance counsellor with the results shown in the
following table:
[Table: aptitude test scores of the 6 students in several areas (math, language, retailing, theater); the values are not recoverable from the extraction]
Variables measured on the same individuals are often related to each other.
Let us draw a graph called a scattergram to investigate relationships.



Scatterplots show the
relationship between two
quantitative variables
measured on the same cases.
In a scatterplot, we look for the
direction, form, and strength of
the relationship between the
variables. The simplest
relationship is linear in form
and reasonably strong.
Scatterplots also reveal
deviations from the overall
pattern.
Creating a scatterplot
- When one variable in a scatterplot explains or predicts the other, place it on the x-axis.
- Place the variable that responds to the predictor on the y-axis.
- If neither variable explains or responds to the other, it does not matter which axis you assign them to.
Possible relationships
[Three scatterplots against math score: language (positive correlation), retailing (negative correlation), theater (no correlation)]
Describing a linear relationship with a number: the coefficient of correlation
- Correlation is a numerical measure of the strength of a linear association.
- The formula for the coefficient of correlation treats x and y identically: there is no distinction between explanatory and response variable.
- Denoting the two samples by x1, x2, …, xn and y1, y2, …, yn, the coefficient of correlation can be computed by the following formula:
r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² ]
  = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / √[ (n·Σxᵢ² − (Σxᵢ)²) · (n·Σyᵢ² − (Σyᵢ)²) ]
(all sums run from i = 1 to n)
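The computational formula above can be sketched in Python; the sample data below are illustrative, not the six students' scores from the table.

```python
import math

def pearson_r(x, y):
    """Pearson correlation via the computational formula:
    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Illustrative data with a strong positive linear trend
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
print(round(pearson_r(x, y), 4))
```

Because the formula treats x and y identically, swapping the two samples leaves r unchanged.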
Properties of r
- The value of r is always between −1 and +1: −1 ≤ r ≤ 1. Either extreme indicates a perfect linear association.
- a) If r is near +1 or −1, we say that we have high correlation.
- b) If r = 1, we say that there is perfect positive correlation; if r = −1, there is perfect negative correlation.
- c) A correlation of zero indicates the absence of linear association. When there is no tendency for the points to lie on a straight line, we say that there is no correlation (r = 0) or low correlation (r near 0).
[Scatterplots against math score: language (high positive r), retailing (high negative r), theater (r near 0)]
Effect of outliers
- Even a single outlier can change the correlation substantially.
- Outliers can create an apparently strong correlation where none would be found otherwise, or hide a strong correlation by making it appear to be weak.
[Scatterplots of math score vs. theater and math score vs. language, with and without an outlying point: r = −0.21, r = 0.74, r = 0.998, r = −0.26]

Two variables may be closely related and still have a small correlation if the form of the relationship is not linear.
[Scatterplots of strongly nonlinear relationships with small correlations: r = 2.8·10⁻¹⁵ and r = 0.157]
Correlation and causation
- A correlation between two variables does not show that one causes the other. Causation is a subtle concept, best demonstrated statistically by designed experiments.
Correlation by eye
- This applet lets you estimate the regression line and guess the value of Pearson's correlation: http://onlinestatbook.com/stat_sim/reg_by_eye/index.html
- Five possible values of Pearson's correlation are listed; one of them is the correlation for the data displayed in the scatterplot. Guess which one it is. To see the correct value, click the "Show r" button.
When is a correlation "high"?
- What is considered to be high correlation varies with the field of application.
- The statistician must decide when a sample value of r is far enough from zero, that is, sufficiently far from zero to reflect a real correlation in the population.




Testing the significance of the coefficient of correlation
- H0: ρ = 0 (Greek rho; the correlation coefficient in the population is 0)
- Ha: ρ ≠ 0 (the correlation coefficient in the population is not 0)
- The test is carried out by expressing a t-statistic in terms of r. The following t-statistic has n − 2 degrees of freedom:
  t = r·√(n − 2) / √(1 − r²)
- Decision using a statistical table:
  - If |t| > t(α, n−2), the difference is significant at level α: we reject H0 and state that the population correlation coefficient is different from 0.
  - If |t| < t(α, n−2), the difference is not significant at level α: we do not reject H0; the population correlation coefficient is not significantly different from 0.
- Decision using the p-value:
  - If p < α, we reject H0 at level α and state that the population correlation coefficient is different from 0.
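The t-statistic above is a one-line computation; a minimal sketch, using the r = 0.9989 that appears in Example 1 below:

```python
import math

def t_from_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# n = 6 observations, r = 0.9989
t = t_from_r(0.9989, 6)
print(round(t, 1))  # 42.6
```

The sign of t follows the sign of r; only |t| is compared with the table value.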
Example 1.
- The correlation coefficient between math skill and language skill was found to be r = 0.9989. Is it significantly different from 0?
- H0: the correlation coefficient in the population is 0 (ρ = 0).
- Ha: the correlation coefficient in the population is different from 0.
- Test statistic: t = 0.9989·√(6 − 2) / √(1 − 0.9989²) = 0.9989·2 / √0.0022 ≈ 42.6
- Degrees of freedom: df = 6 − 2 = 4.
- The critical value in the table is t(0.05, 4) = 2.776.
- Because 42.6 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.
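The same decision rule can be applied to all three slide examples at once; a sketch in pure Python, comparing each |t| with the critical value t(0.05, 4) = 2.776:

```python
import math

def t_from_r(r, n):
    """t statistic for H0: rho = 0, df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

T_CRIT = 2.776  # t table value for alpha = 0.05, df = 4
for name, r in [("language", 0.9989), ("retailing", -0.9993), ("theater", -0.2157)]:
    t = t_from_r(r, 6)
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{name}: t = {t:.2f} -> {verdict}")
```

Only the theater score fails to reach significance, matching Examples 1-3.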
Example 1, cont.
[Scatterplot of MATH vs. LANGUAGE with fitted line LANGUAGE = 15.5102 + 1.0163·MATH; r = 0.9989; p = 0.000002]
p < 0.05: the correlation is significantly different from 0 at the 5% level.
Example 2.
- The correlation coefficient between math skill and retailing skill was found to be r = −0.9993. Is it significantly different from 0?
- H0: the correlation coefficient in the population is 0 (ρ = 0).
- Ha: the correlation coefficient in the population is different from 0.
- Test statistic: t = −0.9993·√(6 − 2) / √(1 − 0.9993²) = −0.9993·2 / √0.0014 ≈ −53.42
- Degrees of freedom: df = 6 − 2 = 4.
- The critical value in the table is t(0.05, 4) = 2.776.
- Because |−53.42| = 53.42 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.
Example 2, cont.
[Scatterplot of MATH vs. RETAIL with fitted line RETAIL = 234.135 − 0.3471·MATH; r = −0.9993; p = 0.0000008]
Example 3.
- The correlation coefficient between math skill and theater skill was found to be r = −0.2157. Is it significantly different from 0?
- H0: the correlation coefficient in the population is 0 (ρ = 0).
- Ha: the correlation coefficient in the population is different from 0.
- Test statistic: t = −0.2157·√(6 − 2) / √(1 − 0.2157²) = −0.2157·2 / √0.9535 ≈ −0.4418
- Degrees of freedom: df = 6 − 2 = 4.
- The critical value in the table is t(0.05, 4) = 2.776.
- Because |−0.4418| = 0.4418 < 2.776, we do not reject H0 and claim that there is no significant linear correlation between the two variables at the 5% level.
Example 3, cont.
[Scatterplot of MATH vs. THEATER with fitted line THEATER = 112.7943 − 0.1137·MATH; r = −0.2157; p = 0.6814]
Prediction based on linear correlation: the linear regression
- When the form of the relationship in a scatterplot is linear, we usually want to describe that linear form more precisely with numbers. We can rarely hope to find data values lined up perfectly, so we fit lines to scatterplots with a method that compromises among the data values. This method is called the method of least squares. The key to finding, understanding, and using least-squares lines is an understanding of their failures to fit the data: the residuals.
- A straight line that best fits the data, y = bx + a, is called the regression line.
- Geometrical meaning of a and b:
  - b: the regression coefficient, the slope of the best-fitting (regression) line;
  - a: the y-intercept of the regression line.
Equation of the regression line for the data of Example 1.
- y = 1.016·x + 15.5; the slope of the line is 1.016.
- Prediction based on the equation: what is the predicted language score for a student with 400 points in math?
- ypredicted = 1.016·400 + 15.5 = 421.9
[Scatterplot of MATH vs. LANGUAGE with fitted line LANGUAGE = 15.5102 + 1.0163·MATH; r = 0.9989; p = 0.000002]
How to get the formula for the line which gives the best point estimates
The general equation of a line is y = a + b·x. We would like to find the values of a and b such that the resulting line is the best-fitting line. Suppose we have n pairs of measurements (xi, yi). If x is the independent variable, we approximate yi by the value of the line at xi, that is, by a + b·xi. The approximation is good if the differences yi − (a + b·xi) are small. These differences can be positive or negative, so we square them and sum:
  S(a, b) = Σᵢ (yᵢ − (a + b·xᵢ))²
This is a function of the unknown parameters a and b, also called the sum of squared residuals. To determine a and b, we have to find the minimum of S(a, b): we take the partial derivatives of S and solve the equations
  ∂S/∂a = 0,  ∂S/∂b = 0
The solution of this equation system gives the formulas for b and a:
  b = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / (n·Σxᵢ² − (Σxᵢ)²) = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)²,  a = ȳ − b·x̄
It can be shown, using the second derivatives, that this is really a minimum.
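The closed-form solution can be sketched directly; the four data points below are illustrative, chosen to lie exactly on y = 2x + 1 so the fit recovers the line.

```python
def least_squares(x, y):
    """Slope b and intercept a minimizing S(a, b) = sum (y_i - (a + b*x_i))^2."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return a, b

# Illustrative data lying exactly on the line y = 2x + 1
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```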
Computation of the correlation coefficient from the regression coefficient
- There is a relationship between the correlation and the regression coefficient:
  r = b·(sx / sy)
  where sx and sy are the standard deviations of the two samples.
- From this relationship it can be seen that r and b have the same sign: if there is a negative correlation between the variables, the slope of the regression line is also negative.
- It can be shown that the same t-test can be used to test the significance of r and the significance of b.
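The identity r = b·(sx/sy) can be checked numerically; a sketch on hypothetical noisy data (not the slides' dataset):

```python
import math
import statistics as st

def pearson_r(x, y):
    """Pearson correlation from centered sums."""
    xbar, ybar = st.mean(x), st.mean(y)
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xbar) ** 2 for a in x) * sum((b - ybar) ** 2 for b in y))
    return num / den

def slope(x, y):
    """Least-squares slope b."""
    xbar, ybar = st.mean(x), st.mean(y)
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
           sum((a - xbar) ** 2 for a in x)

# Hypothetical data: r and b * sx / sy should agree to rounding error
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
print(abs(pearson_r(x, y) - slope(x, y) * st.stdev(x) / st.stdev(y)) < 1e-12)  # True
```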
Coefficient of determination
- The square of the correlation coefficient multiplied by 100 is called the coefficient of determination.
- It shows the percentage of the total variation explained by the linear regression.
- Example: the correlation between math aptitude and language aptitude was found to be r = 0.9989. The coefficient of determination is r² = 0.9978, so 99.78% of the total variation of Y is explained by its linear relationship with X.
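The computation for the example is a one-liner:

```python
r = 0.9989
r2 = r ** 2
print(f"r^2 = {r2:.4f}; {100 * r2:.2f}% of the total variation is explained")
```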
Regression using transformations
- Sometimes useful models are not linear in the parameters: examining the scatterplot of the data shows a functional, but not linear, relationship between the variables.
Example
- A fast food chain opened in 1974. Each year from 1974 to 1988 the number of steakhouses in operation was recorded.
- The scatterplot of the original data suggests an exponential relationship between x (year) and y (number of steakhouses) (first plot).
- Taking the logarithm of y, we get a linear relationship (second plot).
[Plots of y vs. time and ln(y) vs. time]
Performing the linear regression procedure on x and log(y) we get the equation
  log y = 2.327 + 0.2569·x
that is,
  y = e^(2.327 + 0.2569·x) = e^2.327 · e^(0.2569·x) = 1.293·e^(0.2569·x)
is the equation of the best-fitting curve to the original data.
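The fit-through-transformation can be sketched as follows. The counts below are illustrative (generated from y = 1.5·e^(0.3x), not the slide's steakhouse data), so the fit should recover a = 1.5 and b = 0.3.

```python
import math

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by ordinary least squares on (x, ln y)."""
    ly = [math.log(v) for v in y]
    n = len(x)
    xbar = sum(x) / n
    lbar = sum(ly) / n
    b = sum((xi - xbar) * (li - lbar) for xi, li in zip(x, ly)) / \
        sum((xi - xbar) ** 2 for xi in x)
    a = math.exp(lbar - b * xbar)  # back-transform the intercept
    return a, b

xs = [0, 1, 2, 3, 4, 5]
ys = [1.5 * math.exp(0.3 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(round(a, 3), round(b, 3))  # 1.5 0.3
```

Note that this minimizes squared error on the log scale, which is exactly what "regress x against log y" means.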
[Plots of the fitted curve y = 1.293·e^(0.2569·x) on the original data, and of the fitted line log y = 2.327 + 0.2569·x on the log-transformed data]
Types of transformations
- Some non-linear models can be transformed into a linear model by taking logarithms on one or both sides. Either the base-10 logarithm (denoted lg or log) or the natural (base-e) logarithm (denoted ln) can be used. If a > 0 and b > 0, a logarithmic transformation linearizes the models shown on the following slides.
Exponential relationship → take log y
  x:    0       1       2       3       4
  y:    1.1     1.9     4       8.1     16
  lg y: 0.0414  0.2788  0.6021  0.9085  1.2041
- Model: y = a·10^(b·x)
- Take the logarithm of both sides: lg y = lg a + b·x
- so lg y is linear in x
[Plots of y vs. x (exponential) and lg y vs. x (linear)]
Logarithm relationship → take log x
  x:    1    4        8        16
  y:    0.1  2        3.01     3.9
  lg x: 0    0.60206  0.90309  1.20412
- Model: y = a + b·lg x
- so y is linear in lg x
[Plots of y vs. x (logarithmic curve) and y vs. lg x (linear)]
Power relationship → take log x and log y
  x:    1        2        3         4
  y:    2        16       54        128
  lg x: 0        0.30103  0.477121  0.60206
  lg y: 0.30103  1.20412  1.732394  2.10721
- Model: y = a·x^b
- Take the logarithm of both sides: lg y = lg a + b·lg x
- so lg y is linear in lg x
[Plots of y vs. x (power curve) and lg y vs. lg x (linear)]
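The table's data follow y = 2·x³ exactly, so a log-log fit should recover a = 2 and b = 3; a minimal sketch:

```python
import math

def fit_power(x, y):
    """Fit y = a * x**b by ordinary least squares on (lg x, lg y)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = len(x)
    xbar = sum(lx) / n
    ybar = sum(ly) / n
    b = sum((u - xbar) * (v - ybar) for u, v in zip(lx, ly)) / \
        sum((u - xbar) ** 2 for u in lx)
    a = 10 ** (ybar - b * xbar)  # back-transform the intercept
    return a, b

a, b = fit_power([1, 2, 3, 4], [2, 16, 54, 128])
print(round(a, 3), round(b, 3))  # 2.0 3.0
```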
Reciprocal relationship → take the reciprocal of x
  x:   1    2     3         4     5
  y:   1.1  0.45  0.333     0.23  0.1999
  1/x: 1    0.5   0.333333  0.25  0.2
- Model: y = a + b/x = a + b·(1/x)
- so y is linear in 1/x
[Plots of y vs. x (hyperbolic curve) and y vs. 1/x (linear)]
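This last transformation needs no logarithm at all: regress y on 1/x. A sketch on illustrative data generated from y = 0.5 + 2/x (not the table above), so the fit recovers a = 0.5 and b = 2.

```python
def fit_reciprocal(x, y):
    """Fit y = a + b / x: ordinary least squares of y on 1/x."""
    t = [1.0 / v for v in x]
    n = len(x)
    tbar = sum(t) / n
    ybar = sum(y) / n
    b = sum((u - tbar) * (v - ybar) for u, v in zip(t, y)) / \
        sum((u - tbar) ** 2 for u in t)
    a = ybar - b * tbar
    return a, b

# Data lying exactly on y = 0.5 + 2 / x
a, b = fit_reciprocal([1, 2, 4, 5], [2.5, 1.5, 1.0, 0.9])
print(round(a, 3), round(b, 3))  # 0.5 2.0
```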
Example from the literature
Relationship between two discrete variables, contingency tables, test for independence
Comparison of categorical variables (percentages): χ² (chi-square) tests
- Example: rates of diabetes in three treatment groups: 31%, 27% and 25%.
- Frequencies can be arranged into contingency tables.
- H0: the occurrence of diabetes is independent of the groups (the rates are the same in the population).
  DIAB   Treatment 1  Treatment 2  Treatment 3  Total
  yes    31           27           25           83
  no     69           73           75           217
  Total  100          100          100          300
χ² tests, assumptions
- If H0 is true, the expected frequencies can be computed: Ei = (row total · column total) / total.
- χ² statistic: χ² = Σ(Oi − Ei)²/Ei
- Degrees of freedom: df = (number of rows − 1)·(number of columns − 1)
- Decision based on the table: reject H0 if χ² > χ²(table, α, df).
- Assumption: cells with expected frequencies < 5 make up at most 20% of the cells.
- Exact test (Fisher): it has no assumptions and gives the exact p-value.
Expected frequencies for the diabetes example:
  DIAB   Treatment 1  Treatment 2  Treatment 3  Total
  yes    27.7         27.7         27.7         83
  no     72.3         72.3         72.3         217
  Total  100          100          100          300
- χ² = 0.933, df = (3 − 1)·(2 − 1) = 2
- 0.933 < 5.99 (= χ²(table, 0.05, 2)), p = 0.627, so H0 is not rejected.
- The assumption holds. Exact test: p = 0.663.
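The computation above (expected frequencies, χ² statistic, degrees of freedom) can be sketched in pure Python, using the diabetes table from the slide:

```python
def chi2_independence(table):
    """Chi-square test statistic of independence for an r x c table
    of observed counts. Returns (chi2, degrees of freedom)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total  # expected frequency
            chi2 += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Diabetes example: rows = yes/no, columns = the three treatments
chi2, df = chi2_independence([[31, 27, 25], [69, 73, 75]])
print(round(chi2, 3), df)  # 0.933 2
```

The p-value would then come from the χ² distribution with df = 2 (e.g. `scipy.stats.chi2.sf`).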
2×2 tables
- The χ² statistic can be computed directly from the cell frequencies:
  Factor     Group 1  Group 2  Total
  Present +  a        b        a+b
  Absent −   c        d        c+d
  Total      a+c      b+d      N
  χ² = (ad − bc)²·N / [(a+b)(c+d)(a+c)(b+d)]
- When the assumptions are violated:
  - Yates correction: χ²(Yates) = (|ad − bc| − N/2)²·N / [(a+b)(c+d)(a+c)(b+d)]
  - Fisher's exact test
Example
  Angiographic restenosis  Group 1  Group 2  Total
  Present +                20       41       61
  Absent −                 72       51       123
  Total                    92       92       184
- χ² = 10.815, df = 1, p = 0.001
- χ²(Yates) = 9.809, df = 1, p = 0.002
- Fisher's exact test: p = 0.002
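The 2×2 formulas, applied to the restenosis table (a = 20, b = 41, c = 72, d = 51), reproduce both statistics; a minimal sketch:

```python
def chi2_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table [[a, b], [c, d]],
    optionally with the Yates continuity correction."""
    n = a + b + c + d
    num = abs(a * d - b * c)
    if yates:
        num -= n / 2  # continuity correction
    return num ** 2 * n / ((a + b) * (c + d) * (a + c) * (b + d))

print(round(chi2_2x2(20, 41, 72, 51), 3))              # 10.815
print(round(chi2_2x2(20, 41, 72, 51, yates=True), 3))  # 9.809
```

The correction always reduces the statistic, which is why the Yates p-value (0.002) is larger than the uncorrected one (0.001).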
Review questions and exercises
- Problems to be solved by hand calculations: ..\Handouts\Problems hand VI.doc
- Solutions: ..\Problems hand VI solution.doc
- Problems to be solved using computer: ..\Handouts\Problems comp VI.doc
Useful WEB pages
- http://www-stat.stanford.edu/~naras/jsm
- http://www.ruf.rice.edu/~lane/rvls.html
- http://my.execpc.com/~helberg/statistics.html
- http://www.math.csusb.edu/faculty/stanton/m262/index.html