
REGRESSION AND CORRELATION
INTRODUCTION
In simple correlation and regression studies, the researcher collects data on two numerical or
quantitative variables to see whether a relationship exists between the variables. The
independent variable (X) is the variable in the regression that can be controlled or
manipulated. The dependent variable (Y) is the variable in the regression that cannot be
controlled or manipulated.
REGRESSION
The term regression analysis is used to refer to studies of relations between variables.
Regression analysis then is a technique for quantifying the relationship between a criterion
variable (also called a dependent variable) and one or more predictor variables (or
independent variables). Put another way, regression analysis is a technique whereby a
mathematical equation is fitted to a set of data. It describes the relationship between two
variables. It is a statistical method used to investigate possible cause-and-effect relationships among variables.
A line of best fit that is independent of individual judgement and drawn mathematically is
called a regression line. Regression line equations once computed can be graphed and be used
to estimate values previously unknown.
The reasons for computing a regression line are:
(i) to obtain a line of best fit free of subjective judgement; the regression line improves our estimates.
(ii) the regression equation can be used to make predictions within the given range of the data, that is, to make interpolations.
(iii) the reliability of estimates made from such a line can be measured mathematically.
The purpose of the regression line is to enable the researcher to see the trend and make
predictions on the basis of the available data. Values of Y will be predicted from the values of
X, hence the closer the points are to the line, the better the fit and the predictions will be.
Each observation of bivariate data can be viewed as a point (x, y) where x is the explanatory
or independent variable while y is the dependent or response variable. It is important to
determine which one of the variables in question is the dependent variable and which one is
the independent variable. The starting point in regression analysis is the construction of a
scatter diagram/ scatter graph.
There are normally two regression line equations for any sets of bivariate data:
(1) Regression of Y on X. The line equation is given by Y= a+ bX where the line is used to
predict/ estimate the value of Y that follows from any value of X
(2) Regression of X on Y. The equation is given by X = a + bY. The line equation is used to
predict/ estimate the value of X that follows from any value of Y.
The guide in choosing a regression line equation is to always use the line that has
the variable to be estimated on the left-hand side of the equation. In other words, if you want
to estimate X use X = a + bY, and if you want to estimate Y use Y = a + bX. In our analysis
we are going to use the Y on X regression equation, that is Y = a + bX. In this equation a is
the y intercept term, while b is called the regression coefficient or the slope of the graph.
In single equation regression models one variable called the dependent variable or regressand
is expressed as a linear function of one or more other variables called the explanatory
variables or independent or regressor variables. It is assumed implicitly that causal
relationships if any between the independent and dependent variables flow in one direction
only, namely from the explanatory to the dependent (or regressand).
Although regression analysis deals with the dependence of one variable on other variables, it
does not necessarily imply causation. A statistical relationship may be very strong, but it
can never establish a causal connection. Our ideas of causation must come from outside
statistics, ultimately from some theory or other. For example, statistics alone gives us no reason to
treat crop yield as dependent on rainfall rather than rainfall as dependent on crop yield; that judgement comes from agronomy. In regression analysis we try to estimate or predict
the average value of one variable on the basis of fixed values of other variables. The
regression equation is given by Y = a + bX, or Y = f(X), or in population notation
Yi = β1 + β2Xi + μi, where μi is the stochastic error term or random error term (it can also be
viewed as a surrogate for all omitted variables). β1 and β2 are the regression coefficients:
β1 is the intercept and β2 is the slope coefficient.
E(Y|Xi) = β1 + β2Xi = Ŷ
μi = Yi − E(Y|Xi), where Yi is the observed value and E(Y|Xi) is the predicted value.
Regression analysis techniques assume that:
(1) Each item of data is independent of the others.
(2) Data measurements are unbiased.
(3) The error variance is constant over the entire range of data, rather than larger in some parts of the data range and smaller in others.
(4) There is no autocorrelation between the disturbances (or errors).
The probability distribution of the random error term (denoted e, ε or μ) has the following assumptions:
1. The mean of the probability distribution of the error term is 0; that is, the average of the values of the error term over an infinitely long series of experiments is 0 for each setting of the independent variable X: E(μi|Xi) = 0, i.e. zero mean value of the disturbance μi. This assumption implies that the mean value of Y for a given value of X is E(Y) = β1 + β2X.
2. The variance of the probability distribution of the error term is constant for all settings of the independent variable X: Var(μi|Xi) = σ², i.e. equal variance of the error term for all observations (homoscedasticity). For a straight-line model, this assumption means that the variance of the error term is equal to a constant, say σ², for all values of X.
3. The probability distribution of the error term is normal.
4. The values of the error term associated with any two observed values of Y are independent. That is, the value of the error term associated with one value of Y has no effect on the values of the error term associated with other Y values; there is no autocorrelation between the values of the disturbances (errors). A small simulation sketch illustrating these assumptions follows this list.
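To make these assumptions concrete, the following minimal Python sketch (my own illustration, assuming NumPy is available; the parameter values are hypothetical and not taken from the text) simulates data from a straight-line model with independent, homoscedastic, normally distributed errors.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical illustration: intercept, slope and a constant error standard deviation
beta1, beta2 = 10.0, 2.5
sigma = 3.0                                          # same sigma for every X (homoscedasticity)

X = np.linspace(0, 20, 50)                           # fixed settings of the independent variable
mu = rng.normal(loc=0.0, scale=sigma, size=X.size)   # mean 0, constant variance, normal, independent
Y = beta1 + beta2 * X + mu                           # Y_i = beta1 + beta2*X_i + mu_i

# Because the errors average out to zero at every X, the conditional mean E(Y|X)
# is the straight line beta1 + beta2*X.
print(round(float(mu.mean()), 3))                    # close to 0 for a reasonably large sample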
Problems and difficulties can be encountered when using regression analysis and these can
cause results to be inaccurate and misleading. The four problems are listed below:
(a) An inadequate sample size may have been used to collect the data. Sample size should be
at least two or three times the number of variables used in the regression equation and
preferably much larger.
(b) The independent variables measured during the study may have been poorly measured,
are in the wrong form or are not the right ones that have a direct effect on the dependent
variable.
(c) The independent variables are highly correlated with each other. If two independent
variables are perfectly correlated with each other (that is, r = +1.00), their effect will be
the same as that of a single independent variable which has been used twice in the same
regression analysis.
(d) The true relationship between the dependent variable and the independent variable(s) is
not linear, or it is an unusual shape which cannot be analysed using regression techniques.
NB. The regression equation applies only to the observed range of the data, so an equation
derived from those data may not hold for values smaller than the lowest observed value or larger
than the largest observed value. For example, if the observed data range from 50 million to
100 million, the regression equation should not be used for values in excess of 100 million or
below 50 million. In other words, when predictions are
made, they are based on present conditions or on the premise that present trends will
continue. This assumption may or may not prove true in the future. For example in 1979
some experts predicted that the US would run out of oil by the year 2003!
Standard methods of obtaining regression line equations
(1) Inspection method- this involves drawing a scatter diagram and then fitting the line of
best fit where one feels it ought to be.
(2) Semi-averages - this involves splitting the data into two equal groups, plotting the mean
point for each group and joining the two points by means of a straight line.
(3) Least squares method. This minimises the total of the squared deviations.
Least squares method
The fitted regression line can be viewed as a “predictor line”. For each Xi, the regression line
“predicts” Ŷi. The difference between Yi and its predicted equivalent Ŷi is Yi − Ŷi, and this
difference is called a “residual” or error ei. OLS minimises the sum of the squares of all
residuals:
That is, choose â and b̂ to minimise Σ(Yi − Ŷi)² = Σei² = Σ[Yi − (â + b̂Xi)]².
The expression â + b̂X is the estimated regression line, which is the same as Ŷ = â + b̂X.
[Figure: a scatter point Yi and the fitted line Yi = a + bXi + μi; the residual ei is the vertical distance between the observed Yi and the predicted Ŷi.]
The estimated Yi is given as Ŷi, while the estimated a is given by â and that for b is b̂. The
error term μi or ei = (Yi − Ŷi) = Yi − (â + b̂Xi).
The sum of the squares of the deviations (errors) of the Y values about their predicted values
for all the n data points is SSE = Σ(Yi − (â + b̂Xi))². The quantities â and b̂ that make the
SSE a minimum are called the least squares estimates of the population parameters a and b,
and the prediction equation Ŷ = â + b̂X is called the least squares line. The least squares line
is one that has the following 2 properties:
1. The sum of the errors (SE) equals 0.
2. Sum of squared errors (SSE) is smaller than that for any other straight line model.
Formulas for the least squares estimates of a and b are:
b̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
â = Ȳ − b̂X̄
or we can use the formulas
b̂ = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)
â = ΣY/n − b̂(ΣX/n)
where n represents the number of pairs of X, Y values.
For the regression of X on Y, X = a + bY:
b̂ = (nΣXY − ΣXΣY) / (nΣY² − (ΣY)²)
â = ΣX/n − b̂(ΣY/n)
In the first set of equations, for the Y on X regression line, n represents the number of pairs of Y and X values. NB. The calculated values of a and b should be given to 3 decimal places.
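The computational formulas above translate directly into code. The sketch below (my own, with a hypothetical function name) computes b̂ and â from paired observations; applied to the advertising data in Table 36 that follows, it reproduces the hand calculation â = 48 and b̂ = 8.

def least_squares(x, y):
    """Return (a_hat, b_hat) for the regression of Y on X, i.e. Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # b_hat = (n*SumXY - SumX*SumY) / (n*SumX^2 - (SumX)^2)
    b_hat = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # a_hat = SumY/n - b_hat * SumX/n
    a_hat = sum_y / n - b_hat * sum_x / n
    return round(a_hat, 3), round(b_hat, 3)

x = [2, 5, 4, 6, 3]         # advertising expenditure ($000)
y = [60, 100, 70, 90, 80]   # sales ($000)
print(least_squares(x, y))  # (48.0, 8.0), i.e. Y_hat = 48 + 8X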
Table 36. Advertising Expenditure and Sales Statistics
Year    Advertising exp (x)    Sales (y)    xy      x^2
2000    2                      60           120     4
2001    5                      100          500     25
2002    4                      70           280     16
2003    6                      90           540     36
2004    3                      80           240     9
Total   20                     400          1680    90
From the data above one can get a and b in the regression equation Y = a + bX as follows:
b̂ = (5(1680) − 20(400)) / (5(90) − 20²) = 400/50 = 8
â = (400 − 8(20)) / 5 = 240/5 = 48
The regression equation is therefore Ŷ = 48 + 8X (answer in $000). If we are required to
estimate the sales for 2005 (when advertising expenditure is expected to be $5000), then
substituting into our regression equation we get Ŷ = 48 + 8(5) = 88, i.e. $88000.
N.B. It is important that you be able to interpret the intercept and slope in terms of the data
being utilized to fit the model. The model parameters should be interpreted only within the
sampled range of the independent variable (in this case between $2000 and $6000).
Sometimes it is possible to have only one set of data available and one is expected to make
projections based on the available data. From our initial table if the data on advertising
expenditure is missing and we are required to estimate sales in a given period we need to
choose the years as our x while the sales remain as y. The years are given coded values,
starting with 2000 coded as x = 1, 2001 coded as x = 2, etc. Alternatively one could start with
2000 coded as x = 0, 2001 coded as x = 1, etc. The estimates are the same in each case
although the regression equation is slightly different.
From the table below b = 3 and a = 71, hence the regression equation is Y = 71 + 3x. From this
result, forecast the sales for 2005 and 2007. To get the answer we need to establish the values
of x in 2005 and 2007, which are x = 6 and x = 8 respectively. Substituting into the
equation gives y(2005) = 71 + 3(6) = 89 (i.e. $89000) and y(2007) = 71 + 3(8) = 95
(i.e. $95000).
Table 37. Regression workings
Year    Coded (x)    Sales (y)    xy      x^2
2000    1            60           60      1
2001    2            100          200     4
2002    3            70           210     9
2003    4            90           360     16
2004    5            80           400     25
Total   15           400          1230    55
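Reusing the hypothetical least_squares function sketched earlier, the coded-year workings in Table 37 and the forecasts for 2005 and 2007 can be checked as follows.

coded_x = [1, 2, 3, 4, 5]       # 2000 coded as x = 1, 2001 as x = 2, and so on
sales = [60, 100, 70, 90, 80]

a_hat, b_hat = least_squares(coded_x, sales)
print(a_hat, b_hat)             # 71.0 and 3.0, so Y_hat = 71 + 3x

print(a_hat + b_hat * 6)        # 89.0 for 2005 (x = 6), i.e. about $89000
print(a_hat + b_hat * 8)        # 95.0 for 2007 (x = 8), i.e. about $95000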
An alternative formula, which has the advantage of not using a previously calculated value,
can also be used:
â = (ΣY·ΣX² − ΣX·ΣXY) / (nΣX² − (ΣX)²)
and
b̂ = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)
This will give the same results as the first formula used above.
NB. When one is drawing the scatter plot and the regression line, it is sometimes desirable to
truncate the graph. The reason is to show the line drawn in the range of the independent and
dependent variables.
For a set of X and Y values and the estimated regression equation Ŷ = â + b̂X, each X has an
observed Y and a predicted Ŷ value. The total variation Σ(Y − Ȳ)² is the sum of the squares
of the vertical distances of each point from the mean. The total variation can be divided into
two parts: that which is attributed to the relationship of X and Y, and that which is due to
chance. The variation obtained from the relationship (i.e. from the predicted Ŷ values) is
Σ(Ŷ − Ȳ)² and is called the explained variation. Most of the variation can be explained by the
relationship. The closer the value of r is to +1 or −1, the better the points fit the line and the
closer Σ(Ŷ − Ȳ)² is to Σ(Y − Ȳ)². In fact, if all points fall on the regression line,
Σ(Ŷ − Ȳ)² = Σ(Y − Ȳ)², since Ŷ would be equal to Y in each case.
On the other hand, the variation due to chance, found by Σ(Y − Ŷ)², is called the unexplained
variation. This variation cannot be attributed to the relationship. When the unexplained
variation is small, the value of r is close to +1 or −1. If all points fall on the regression line,
the unexplained variation Σ(Y − Ŷ)² will be 0. Hence the total variation is equal to the sum
of the explained variation and the unexplained variation. That is,
Σ(Y − Ȳ)² = Σ(Ŷ − Ȳ)² + Σ(Y − Ŷ)²
or, put in words,
Total Variation = Explained Variation + Unexplained Variation
The values Y − Ŷ are called residuals. A residual is the difference between the actual value of
Y and the predicted value Ŷ for a given X value. The mean of the residuals is always zero.
The regression line determined by the formulas for a and b mentioned above is the line that
best fits the points of the scatter plot. The sum of the squares of the residuals computed by
using the regression line is the smallest possible value. For this reason, a regression line is
also called a least squares line.
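As a numerical check on this identity, the short sketch below (my own) uses the fitted line Ŷ = 48 + 8X from the advertising example to compute the explained, unexplained and total variation.

x = [2, 5, 4, 6, 3]
y = [60, 100, 70, 90, 80]

y_bar = sum(y) / len(y)                                        # 80
y_hat = [48 + 8 * xi for xi in x]                              # fitted values from Y_hat = 48 + 8X

explained = sum((yh - y_bar) ** 2 for yh in y_hat)             # Sum(Y_hat - Y_bar)^2
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # Sum(Y - Y_hat)^2
total = sum((yi - y_bar) ** 2 for yi in y)                     # Sum(Y - Y_bar)^2

print(explained, unexplained, total)   # 640 360 1000, and 640 + 360 = 1000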
[Figure: for an observed point (x, y), the vertical distance y − ŷ is the unexplained variation, ŷ − ȳ is the explained variation, and y − ȳ is the total variation about the mean ȳ.]
For the prediction of the interval about the Ŷ value, see the chapter on estimation and
hypothesis testing.
Problems in Regression Analysis
(1). Multicollinearity- when two or more explanatory variables in the regression are highly
correlated. This can sometimes be overcome by extending sample size, using a priori
information or dropping one of the highly collinear variables.
(2). Heteroscedasticity- this arises when the assumption that the variance of the error term is
constant for all the values of the independent variables is violated. Heteroscedastic
disturbances lead to biased standard errors and thus to incorrect statistical tests and
confidence intervals for the parameter estimates.
(3). Autocorrelation- this is when consecutive errors or residuals are correlated. While
estimated coefficients are not biased in the presence of autocorrelation, their standard errors
are biased downward (so that the values of their t statistics are exaggerated). As a result, we may
conclude that an estimated coefficient is statistically significant when in fact it is not. If there
is evidence of autocorrelation, its effect can be adjusted for by including time as an additional
explanatory variable to take account of any trend in the data, by including an important
missing variable in the regression, or by re-estimating the regression in nonlinear form.
COVARIANCE AND CORRELATION
Covariance and correlation are related parameters that indicate the extent to which two
random variables co-vary. Suppose there are two technology stocks. If they are affected by
the same industry trends, their prices will tend to rise or fall together. They co-vary.
Covariance and correlation measure such a tendency.
Covariance measures the association between sets of values X and Y. It measures how X and
Y vary together. If we have a pair of random variables X and Y with means μx and μy and
variances σx² and σy², the covariance measures the association. The covariance is defined by
the expectation Cov(X, Y) = E[(X − μx)(Y − μy)].
Sample covariance: Sx,y = (1/(n − 1)) Σ(Xi − X̄)(Yi − Ȳ). If we take x = X − X̄ and y = Y − Ȳ
then the formula becomes Sx,y = (1/(n − 1)) Σxy. If x and y disagree in sign we have xy < 0.
If X and Y move together, most xy will be positive, as will their sum. If X and Y are related
negatively, Σxy < 0.
If there is positive association between random variables, high values of X tend to be
associated with high values of Y, and low values of X with low values of Y. When there is
negative association, so that high values of X are associated with low values of Y and low
values of X with high values of Y, the covariance is negative. If there is no linear association
between X and Y, their covariance is 0.
Covariance is of little use in assessing the strength of the relation between a pair of random
variables, as its value depends on the units in which they are measured. Ideally we would
require a pure scale free measure. Such a measure is easily obtained by dividing the
covariance by the product of the individual standard deviations. The quantity resulting is
called the correlation coefficient, represented by ρ for the population correlation
coefficient and r for the sample correlation coefficient.
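A minimal Python sketch (my own; the function names are hypothetical) of the sample covariance and of the scale-free measure obtained by dividing by the two standard deviations is given below, using the advertising and sales data.

from math import sqrt

def sample_covariance(x, y):
    # S_xy = (1/(n-1)) * Sum (X_i - X_bar)(Y_i - Y_bar)
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_sd(v):
    # sample standard deviation with an (n - 1) divisor
    n = len(v)
    v_bar = sum(v) / n
    return sqrt(sum((vi - v_bar) ** 2 for vi in v) / (n - 1))

x = [2, 5, 4, 6, 3]          # advertising expenditure ($000)
y = [60, 100, 70, 90, 80]    # sales ($000)

cov = sample_covariance(x, y)
r = cov / (sample_sd(x) * sample_sd(y))    # dividing out the units gives the correlation coefficient
print(round(cov, 3), round(r, 3))          # 20.0 and 0.8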
CORRELATION
Correlation is a statistical method used to determine whether a relationship between variables
exists. Correlation is designed to measure the strength or degree of linear association between
two variables. While regression analysis establishes a mathematical relationship between the
dependent and independent variables, correlation goes a step further to establish the strength
of the relationship by calculating what is called the correlation coefficient. There are several
types of correlation coefficients, but the best known and most used is the Pearson (after Karl
Pearson) product moment correlation coefficient, r, for a sample and ρ (rho) for the
population. The coefficient measures the strength and direction of a linear relationship
between two variables. The coefficient lies between 0 and 1 for positive correlation and
between −1 and 0 for negative correlation, that is, 0 ≤ |r| ≤ 1. If there is perfect positive
correlation r = +1.00; perfect negative correlation is indicated by r= -1.00; no relationship is
shown by r= 0.00. For perfect negative correlation the scatter points form a straight line when
plotted and the slope of the graph is negative. For a negative correlation the line of best fit is
negatively sloped when plotted. On the other hand for perfect positive correlation the scatter
points form a straight line when plotted and the slope of the graph is positive. For a positive
correlation the line of best fit is positively sloped when plotted. NB. The sign of the
correlation coefficient and the sign of the slope of the regression line b will always be the
same. The regression line will always pass through the point (x̄, ȳ). When r is not
significantly different from 0, the best predictor of y is the mean of the data values of y. For
valid predictions, the value of the correlation coefficient must be significant. (See under
Hypothesis testing for detailed account of the significance of the correlation coefficient).
The starting point in correlation analysis is the construction of a scatter diagram. The scatter
points show a general negative slope for negative correlation. For a zero correlation the
scatter points do not show a definite pattern. There is no unique relationship between
individual x and y values. In positive correlation the scatter points generally show a positive
slope. The line of best fit when plotted will be upward sloping. In perfect positive correlation
r = +1. The scatter points when plotted will form a straight line, which is also the line of best
fit. The drawing of scatter points will show from the outset whether the relationship is
positive or negative.
A scatter plot should be checked for outliers. An outlier is a point that seems out of place
when compared with other points. Some of these points can affect the equation of the
regression line. When this happens, the points are called influential points or influential
observations. An influential point tends to ‘pull’ the regression line toward the point itself. To
check for an influential point, the regression line should be graphed with the point included in
the data set, then a second regression line should be graphed that excludes the point from the
data set. If the position of the second line is changed considerably, the point is said to be an
influential point. Points that are outliers in the x direction tend to be influential points.
Researchers should use their judgement as to whether to include influential observations in
the final analysis of the data. If the researcher feels the observation is unnecessary it should
be excluded, but if the researcher feels it is necessary then he or she may want to obtain additional
data values whose x values are near the x values of the influential point and then include
them in the study.
[Figure: scatter plots illustrating negative correlation (−1 < r < 0), perfect negative correlation (r = −1), zero correlation (r = 0), positive correlation (r > 0) and perfect positive correlation (r = +1).]
Population correlation coefficient: ρ = Corr(X, Y) = Cov(X, Y) / (σxσy)
= E[(X − μx)(Y − μy)] / √(E[(X − μx)²] E[(Y − μy)²])
= (1/N) Σ(i = 1 to N) [(Xi − μx)/σx][(Yi − μy)/σy]
Sample correlation coefficient: r = Corr(X, Y) = Sx,y / (SxSy) = Cov(X, Y) / (SxSy)
= (1/(n − 1)) Σ(i = 1 to n) [(Xi − X̄)/Sx][(Yi − Ȳ)/Sy]
or
r = Σxy / √(Σx²Σy²)
or
r = Σ(X − X̄)(Y − Ȳ) / ((n − 1)sxsy)
or
r = (nΣXY − ΣXΣY) / √([nΣX² − (ΣX)²][nΣY² − (ΣY)²])
or
r = Σ[(Yi − Ȳ)(Xi − X̄)] / √([Σ(Yi − Ȳ)²][Σ(Xi − X̄)²])
or
r = (ΣXY − nX̄Ȳ) / √[(ΣY² − nȲ²)(ΣX² − nX̄²)] (care should be taken here not to use
approximations for the two means).
The advantages of correlation analysis
1. A high correlation coefficient indicates that there is common variation between the
dependent variable and the independent variable. In other words, certain values of the
dependent variable will tend to be associated with certain values of the independent
variable. If the value of r (disregarding the sign) is at least 0.8, there is a very strong or high
relationship between the variables. If 0.4 < |r| < 0.8, the relationship between the variables is
considered moderate to high. For |r| < 0.4, the relationship between the variables is considered
small to insignificant. More specifically, the relationship can be interpreted as shown in the table below.
Size of r (disregarding the sign)    General Interpretation
0.9 to 1.00                          Very strong relationship
0.8 to 0.9                           Strong relationship
0.6 to 0.8                           Moderate relationship
0.2 to 0.6                           Weak relationship
0.0 to 0.2                           Very weak or no relationship
Source: Willemse, 2009: 119
2. The sign + or – on the correlation coefficient indicates whether the relationship is positive
or negative. The + indicates a positive relationship between variables while the – sign
indicates a negative relationship between variables.
The correlation coefficient is therefore a summary measure that indicates the relative strength
of a relationship between two variables and the direction of the relationship, although it does
not describe the underlying relationship.
Table 38. Correlation coefficient worked out
Year    x    y      xy      x^2    y^2
2000    2    60     120     4      3600
2001    5    100    500     25     10000
2002    4    70     280     16     4900
2003    6    90     540     36     8100
2004    3    80     240     9      6400
Total   20   400    1680    90     33000

r = (nΣXY − ΣXΣY) / √([nΣX² − (ΣX)²][nΣY² − (ΣY)²])
= (5(1680) − 20(400)) / √([5(90) − 20²][5(33000) − 400²])
= 400 / √(50 × 5000) = 400/500 = 0.8
This shows that there is a strong positive relationship between advertising and sales.
NB. the rounding rule for the correlation coefficient r is the values should be rounded to three
decimal places.
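The same value can be obtained in code with the nΣXY computational formula (a minimal sketch, my own):

from math import sqrt

x = [2, 5, 4, 6, 3]
y = [60, 100, 70, 90, 80]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

r = (n * sum_xy - sum_x * sum_y) / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
print(round(r, 3))   # 0.8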
Coefficient of Determination
This measures the proportion of the total variation in the dependent variable that is explained
by the variation in the independent or explanatory variables in the regression. It is represented
by r² and varies between 0 and 1. An r² value of 1.00 indicates that the regression equation
“explains” 100 percent of the variation in the dependent variable about its mean. An r² value
in the 0.5 to 1.00 range is usually interpreted to mean that the regression equation does a
good job of explaining the variation in Y.
r² = [(total variance in the dependent variable) − (variance “unexplained” by the regression equation)] / (total variance in the dependent variable)
We can look at r² as the explained variation in all items divided by the total variation in all items:
r² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)² = Explained Variation of Y / Total Variation of Y
From our regression example:
Year    Adv exp (X)    Sales (Y)    Ŷ     Ȳ     (Ŷ − Ȳ)²    (Y − Ȳ)²
2000    2              60           64    80    256         400
2001    5              100          88    80    64          400
2002    4              70           80    80    0           100
2003    6              90           96    80    256         100
2004    3              80           72    80    64          0
Total   20             400                      640         1000
Our regression equation was Ŷ = 48 + 8X. Based on the data from the table above and the formula
r² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)² = Explained Variation of Y / Total Variation of Y,
r² = 640/1000 = 0.64.
Based on the correlation formula for r, in the advertising and sales example r = 0.8. The
coefficient of determination is r² = 0.64. This is the same result as calculated from the table
above. This means that 64% of the variation in sales is due to the advertising expenditure,
while the remaining 36% is due to other factors such as changes in income, population, etc.
This value of 36% or 0.36 is also called the coefficient of nondetermination and is found by
subtracting the coefficient of determination from 1. As the value of r approaches 0, r²
decreases more rapidly. Coefficient of nondetermination = 1 − r².
Spearman’s Rank Correlation Coefficient (rs)
This provides a measure of correlation between ranks. Rank correlation is in fact an
approximation to the product moment coefficient. Rank correlation methods can be used to
measure the correlation between any pair of variables. Rank correlation is a quick and easy
technique especially when the bivariate data are difficult to obtain physically or involve great
expense and yet can be ranked in size order. Also if one or both of the variables involved is
non-numeric rank correlation can be used.
The formula for rs is given as:
rs = 1 − 6Σdi² / (n(n² − 1))
where di = ui − vi (the difference in ranks of the ith observations for samples 1 and 2) and n is the
number of bivariate pairs.
The value of rs always falls between –1 and +1, with +1 indicating perfect positive correlation
and –1 indicating perfect negative correlation.
Procedure
The measurements associated with each variable are ranked separately. Ties receive the
average of the ranks of the tied observations. In other words the ranks that would have been
allocated separately must be averaged and this average rank given to each item with this
equal value. In the ranking procedure assign a 1 to the smallest number and the highest
number to the largest value.
Using the advertising expenditure and sales figures we can use rank correlation as below.
Table 39. Rank correlation workings
Year    x    y      rx    ry    d = rx − ry    d²
2000    2    60     1     1     0              0
2001    5    100    4     5     −1             1
2002    4    70     3     2     1              1
2003    6    90     5     4     1              1
2004    3    80     2     3     −1             1
Total                                          4

rs = 1 − 6(4)/(5(25 − 1)) = 1 − 24/120 = 1 − 0.2 = 0.8
This result is the same as the result found using the formula for r.
Suppose we have the following table:
Table 39. Rank correlation calculations
x    y      rx    ry     d = rx − ry    d²
2    60     1     1      0              0
5    70     4     2.5    1.5            2.25
4    80     3     4      −1             1
6    70     5     2.5    2.5            6.25
3    90     2     5      −3             9
7    100    6     6      0              0
Total                                   18.5

rs = 1 − 6(18.5)/(6(36 − 1)) = 1 − 111/210 = 1 − 0.53 = 0.47
This result gives a positive correlation coefficient of 0.47, which is quite low.
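The ranking procedure, including the averaging of tied ranks, can be sketched in Python as follows (my own helper names; with ties, library routines such as scipy.stats.spearmanr apply a slightly different correction and may differ marginally from the 6Σd² formula used in the text).

def average_ranks(values):
    # Rank from smallest (rank 1) to largest, giving tied values the average of their ranks.
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first (1-based) position of v in the sorted list
        count = ordered.count(v)                # number of observations tied at this value
        ranks.append(first + (count - 1) / 2)   # average of the tied ranks
    return ranks

def spearman_rs(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [2, 5, 4, 6, 3, 7]
y = [60, 70, 80, 70, 90, 100]
print(round(spearman_rs(x, y), 3))   # about 0.471, as in the worked example above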
Importance of correlation and regression
Even though correlation or regression may have established that 2 variables move together,
no claim can be made that this necessarily indicates cause and effect. For example if the
correlation of teachers’ salaries and the consumption of liquor over a period of years turned
out to be 0.9 this does not prove teachers drink; nor does it prove that liquor sales increase
teachers’ salaries. Both variables move together because they are influenced by a third
variable, namely long-run growth in national income and population.
Although correlation and regression cannot be used as a proof of cause and effect, they can
estimate a relation that theory already tells us exists. Secondly they are often helpful in
suggesting causal relations that were not previously suspected e.g. when cigarette smoking
was found to be highly correlated with lung cancer, possible causal links between the two
were investigated.
Correlation and Causation
There are several possible relationships between variables.
1. There is a direct cause-and-effect relationship between the variables, that is, x causes y;
e.g. poison causes death or heat causes ice to melt.
2. There is a reverse cause and effect relationship between the variables. That is y
causes x. For example the researcher believes excessive coffee consumption causes
nervousness when the reverse is true.
3. The relationship between the variables may be caused by a third variable. For
example, the statistician or researcher may have correlated the number of deaths due to
drowning and the number of cans of soft drink consumed daily during summer. Both
variables are probably related to a third variable, high temperatures in summer, which
causes people to drink a lot of soft drinks and, at the same time, more people tend to swim
during the summer.
4. There may be a complexity of interrelationships among many variables. For example,
there may be a significant relationship between students’ high school grades and college
grades, but other variables such as IQ, hours of study, influence of parents, motivation,
age and instructors are also involved.
5. The relationship may be coincidental- for example a researcher may be able to find a
significant relationship between the increase in the number of people who are
exercising and the increase in the number of people who are committing crimes.
Therefore, from all this, what can be said is that correlation does not necessarily imply
causation.
Question
It is frequently conjectured that income is one of the primary determinants of social status for
an adult male. To investigate this theory, 15 adult males are chosen at random from a community
comprising primarily professional people, and their social status scores and annual gross incomes
are noted. The social status scores (higher scores correspond to higher social status) and gross
incomes (in $000) are given in the following table for the 15 adult males.
Subject    Social Status    Income
1          92               29.9
2          51               18.7
3          88               32.0
4          65               15.0
5          80               26.0
6          31               9.0
7          38               11.3
8          75               22.1
9          45               16.0
10         72               25.0
11         53               17.2
12         43               9.7
13         87               20.1
14         30               15.5
15         74               16.5
Compute Spearman’s rank correlation coefficient for these data.