INTERNATIONAL JOURNAL OF EMERGING TRENDS IN TECHNOLOGY AND SCIENCES
(ISSN: 2348-0246 (online)), Volume 03, Issue 04, December 2014

Overview of Correlation and Regression

TALARI PRASHANTI
Asst. Professor, Department of Mathematics, Bharat Institute of Engineering & Technology
[email protected]
Abstract— A common statistic for indicating the strength of a linear relationship between two continuous variables is the Pearson correlation coefficient, usually called simply the correlation coefficient (there are other types of correlation coefficients). The correlation coefficient is a number ranging from -1 to +1. A positive correlation means that as values of one variable increase, values of the other variable also tend to increase. A small or zero correlation coefficient tells us that the two variables are unrelated. Finally, a negative correlation coefficient shows an inverse relationship between the variables: as one goes up, the other goes down. The purpose of a LINEAR CORRELATION ANALYSIS is to determine whether there is a relationship between two sets of variables. Bivariate correlation and regression evaluate the degree of relationship between two quantitative variables. The Pearson correlation (r), the most commonly used bivariate correlation technique, measures the association between two quantitative variables without distinction between the independent and dependent variables.
Keywords— correlation coefficient, linear correlation, bivariate correlation, bivariate regression.

I. INTRODUCTION
Correlation is a statistical measure that indicates the extent
to which two or more variables fluctuate together. A
positive correlation indicates the extent to which those
variables increase or decrease in parallel; a negative
correlation indicates the extent to which one variable
increases as the other decreases.
The purpose of a Linear Correlation Analysis is to
determine whether there is a relationship between two sets
of variables. We may find that: 1) there is a positive
correlation, 2) there is a negative correlation, or 3) there is
no correlation.

In statistics, the correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and –1. To interpret its value, see which of the following values your correlation r is closest to:
Exactly –1. A perfect downhill (negative) linear relationship
–0.70. A strong downhill (negative) linear relationship
–0.50. A moderate downhill (negative) relationship
–0.30. A weak downhill (negative) linear relationship
0. No linear relationship
+0.30. A weak uphill (positive) linear relationship
+0.50. A moderate uphill (positive) relationship
+0.70. A strong uphill (positive) linear relationship
Exactly +1. A perfect uphill (positive) linear relationship
We have recorded the GENDER, HEIGHT, WEIGHT, and
AGE of seven people and want to compute the correlations
between HEIGHT and WEIGHT, HEIGHT and AGE, and
WEIGHT and AGE.
DATA CORR_EG;
INPUT GENDER $ HEIGHT WEIGHT AGE;
DATALINES;
M 68 155 23
F 61 99 20
F 63 115 21
M 70 205 45
M 69 170 .
F 65 125 30
M 72 220 48
;
PROC CORR DATA=CORR_EG;
TITLE 'EXAMPLE OF A CORRELATION MATRIX';
VAR HEIGHT WEIGHT AGE;
RUN;
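Because one AGE value is missing (the '.' in the data above), the AGE correlations in the output below are based on 6 observations while the HEIGHT-WEIGHT correlation uses all 7. As a hedged sketch of one way to force every coefficient onto the same complete cases, PROC CORR's NOMISS option can be added (this call is an illustration, not part of the original example):
/* Sketch: NOMISS restricts all coefficients to observations that are
   complete on every VAR variable (here, the 6 subjects with a non-missing AGE). */
PROC CORR DATA=CORR_EG NOMISS;
   VAR HEIGHT WEIGHT AGE;
RUN;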
THE OUTPUT:

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

                HEIGHT       WEIGHT          AGE
HEIGHT         1.00000      0.97165      0.86614
                             0.0003       0.0257
                     7            7            6

WEIGHT         0.97165      1.00000      0.92496
                0.0003                    0.0082
                     7            7            6

AGE            0.86614      0.92496      1.00000
                0.0257       0.0082
                     6            6            6

II. DESCRIPTION
The purpose of a Linear Correlation Analysis is to determine whether there is a relationship between two sets of variables. We may find that: 1) there is a positive correlation, 2) there is a negative correlation, or 3) there is no correlation. These relationships can be easily visualized by using SCATTER DIAGRAMS.

Positive Correlation
Notice that in this example as the heights increase, the diameters of the trunks also tend to increase. If this were a perfect positive correlation, all of the points would fall on a straight line. The more linear the data points, the closer the relationship between the two variables.

Negative Correlation
Notice that in this example as the number of parasites increases, the harvest of unblemished apples decreases. If this were a perfect negative correlation, all of the points would fall on a line with a negative slope. The more linear the data points, the more negatively correlated are the two variables.

No Correlation
Notice that in this example there seems to be no relationship between the two variables. Perhaps pillbugs and clover do not interact with one another.
II.I. HOW TO DECIDE IF THERE IS A LINEAR CORRELATION OR NOT
First, locate the sample size nearest to yours in the table below and look at the critical region for that sample size. If your correlation coefficient (r-value) falls within the critical region, your two variables do not change in a regular enough way to be considered significant; you conclude that your data indicate no correlation. If your correlation coefficient is a larger positive number than the critical region, your data indicate a significant positive correlation. The closer your r-value is to 1.0, the stronger the correlation. If your correlation coefficient is a larger negative number than the critical region, your data indicate a significant negative correlation. The closer your r-value is to -1.0, the stronger the negative correlation.
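The critical values in such a table come from the t distribution. As a rough illustration (a sketch; the alpha level of 0.05 and the sample size are assumed here, chosen to match the n = 15 Eastern White Pine example reported below), the critical r for a two-tailed test can be computed directly:
/* Sketch: critical r for a two-tailed test, assuming alpha = 0.05 and n = 15. */
DATA CRIT_R;
   N      = 15;
   ALPHA  = 0.05;
   T      = TINV(1 - ALPHA/2, N - 2);    /* critical t value with n-2 df   */
   R_CRIT = T / SQRT(T**2 + N - 2);      /* corresponding critical r        */
   PUT R_CRIT=;                          /* prints approximately 0.514      */
RUN;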
II.II. REPORTING YOUR RESULTS
The examples below show how the results of your analysis of linear correlation should be presented. "There is a positive correlation between the height and the diameter of Eastern White Pine trees (n = 15, r = 0.857, critical value = 0.514)." "There is a negative correlation between the population density of codling moths and the % of unblemished apples harvested (n = 23, r = -0.500, critical value = -0.444)." "There is no linear correlation between the population densities of pillbugs and red clover (n = 30, r = 0.202, critical values = ±0.361)."
III. SIGNIFICANCE OF A CORRELATION COEFFICIENT
"How large a correlation coefficient do I need to show that two variables are correlated?" Each time PROC CORR prints a correlation coefficient, it also prints a probability associated with the coefficient. That number gives the probability of obtaining, by chance alone, a sample correlation coefficient as large as or larger than the one obtained. The significance of a correlation coefficient is a function of the magnitude of the correlation and the sample size: with a large number of data points, even a small correlation coefficient can be significant.
It is important to remember that correlation indicates only the strength of a relationship; it does not imply causality. For example, we would probably find a high positive correlation between the number of hospitals in each of the 50 states and the number of household pets in each state. This does not mean that pets make people sick and therefore make more hospitals necessary! The most plausible explanation is that both variables (number of pets and number of hospitals) are related to population size.
Being SIGNIFICANT is not the same as being IMPORTANT or STRONG. That is, knowing the significance of the correlation coefficient alone does not tell us very much. Once we know that our correlation coefficient is significantly different from zero, we need to look further in interpreting the importance of that correlation.
IV. HOW TO INTERPRET A CORRELATION COEFFICIENT
One of the best ways to interpret a correlation coefficient (r) is to look at the square of the coefficient (r-squared); r-squared can be interpreted as the proportion of variance in one of the variables that can be explained by variation in the other variable. As an example, our height/weight correlation is .97, so r-squared is .94. We can say that 94% of the variation in weights can be explained by variation in heights (or vice versa). Also, (1 - 0.94), or about 6%, of the variance of weight is due to factors other than height variation.
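Written out with the HEIGHT-WEIGHT coefficient from the output above:

r = 0.97165, \qquad r^2 = (0.97165)^2 \approx 0.944, \qquad 1 - r^2 \approx 0.056 \approx 6\%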
Another consideration in the interpretation of a correlation coefficient is this: be sure to look at a scatter plot of the data (using PROC PLOT). It often turns out that one or two extreme data points can cause the correlation coefficient to be much larger than expected. A single keypunching error can dramatically alter a correlation coefficient.
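For the height/weight data this is a one-statement check (a sketch using the CORR_EG data set created above):
/* Character-based scatter plot of WEIGHT against HEIGHT,
   useful for spotting outliers before trusting r. */
PROC PLOT DATA=CORR_EG;
   PLOT WEIGHT*HEIGHT;
RUN;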
An important assumption concerning a correlation coefficient is that each pair of (x, y) data points is independent of any other pair. That is, each pair of points has to come from a separate subject. Otherwise, we cannot compute a valid correlation coefficient.
V. PARTIAL CORRELATIONS
A researcher may wish to determine the strength of the
relationship between two variables when the effect of other
variables has been removed. One way to accomplish this is
by computing a partial correlation. To remove the effect of
one or more variables from a correlation, use a PARTIAL
statement to list those variables whose effects you want to
remove.
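For reference, the first-order partial correlation of two variables x and y with a single variable z removed has the standard closed form below (a sketch; note that PROC CORR computes all three correlations from the same complete cases, here N = 6, so plugging in the matrix printed earlier, where the HEIGHT-WEIGHT coefficient uses 7 observations, will not reproduce the partial correlation output exactly):

r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}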
PROC CORR DATA=CORR_EG NOSIMPLE;
TITLE 'EXAMPLE OF A PARTIAL CORRELATION';
VAR HEIGHT WEIGHT;
PARTIAL AGE;
RUN;
THE OUTPUT:

The CORR Procedure

1 Partial Variables:  AGE
2         Variables:  HEIGHT WEIGHT

Pearson Partial Correlation Coefficients, N = 6
Prob > |r| under H0: Partial Rho=0

                HEIGHT       WEIGHT
HEIGHT         1.00000      0.91934
                             0.0272

WEIGHT         0.91934      1.00000
                0.0272
VI. LINEAR REGRESSION
Given a person's height, what would be their predicted weight? How can we best define the relationship between height and weight? By studying a graph of the data we see that the relationship is approximately linear. That is, we can imagine drawing a straight line on the graph with most of the data points being only a short distance from the line. The vertical distance from each data point to this line is called a residual.
How do we determine the "best" straight line to fit our height/weight data? The method of least squares is commonly used: it finds the line (called the regression line) that minimizes the sum of the squared residuals. An equivalent way to define a residual is as the difference between a subject's predicted score and his or her actual score.
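In symbols, with weight as y and height as x, the fitted line, the residuals, and the least-squares criterion are:

\hat{y}_i = b_0 + b_1 x_i, \qquad \mathrm{residual}_i = y_i - \hat{y}_i, \qquad \min_{b_0,\, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2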
PROC REG DATA=CORR_EG;
TITLE 'REGRESSION LINE FOR HEIGHT-WEIGHT DATA';
MODEL WEIGHT=HEIGHT;
RUN;

THE OUTPUT:

Parameter Estimates

                     Parameter      Standard
Variable     DF       Estimate         Error    t Value    Pr > |t|
Intercept     1     -592.64458      81.54217      -7.27      0.0008
HEIGHT        1       11.19127       1.21780       9.19      0.0003
VII. PARTITIONING THE TOTAL SUM OF SQUARES
The total sum of squares is the sum of squared deviations of each person's weight from the grand mean. This total sum of squares (SS) can be divided into two portions: the sum of squares due to regression (or model) and the sum of squares due to error (sometimes called residual). The portion called the Sum of Squares (ERROR) in the output reflects the deviations of each weight from the PREDICTED weight. The other portion reflects the deviations of the PREDICTED values from the MEAN.
Our intuition tells us that if the deviations about the regression line are small (error mean square) compared to the deviations between the predicted values and the mean (mean square model), then we have a good regression line. To compare these mean squares, the ratio F = Mean Square (Model) / Mean Square (Error) is formed. The larger this ratio, the better the model fits.
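For the one-predictor regression above (n observations, 1 model degree of freedom, n - 2 error degrees of freedom), the partition and the F ratio can be written as:

\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad F = \frac{MS_{Model}}{MS_{Error}} = \frac{SS_{Model}/1}{SS_{Error}/(n-2)}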
VIII. PLOTTING THE POINTS ON THE REGRESSION LINE
The PLOT statement of PROC REG can plot the data as well as statistics computed by the procedure. In particular, the keywords PREDICTED. and RESIDUAL. (the periods are part of these keywords) are used to plot predicted and residual values. A common option is OVERLAY, which is used to plot more than one graph on a single set of axes.
PROC REG DATA=CORR_EG;
MODEL WEIGHT=HEIGHT;
PLOT WEIGHT*HEIGHT;
RUN;
PROC REG DATA=CORR_EG;
MODEL WEIGHT=HEIGHT;
PLOT PREDICTED.*HEIGHT='1' WEIGHT*HEIGHT='2' / OVERLAY;
RUN;
THE OUTPUT:
[Plot of WEIGHT against HEIGHT with the fitted regression line WEIGHT = -592.64 + 11.191 HEIGHT; N = 7, R-square = 0.9441, Adj R-sq = 0.9329, RMSE = 11.861.]

IX. PLOTTING RESIDUALS AND CONFIDENCE LIMITS
The most useful statistics besides the predicted values are:
RESIDUAL. – The residual (i.e., the difference between the actual and predicted value for each observation).
L95. U95. – The lower and upper 95% confidence limits for an individual predicted value (i.e., 95% of the dependent data values would be expected to fall between these two limits).
L95M. U95M. – The lower and upper 95% confidence limits for the mean of the dependent variable for a given value of the independent variable.
PROC REG DATA=CORR_EG;
MODEL WEIGHT = HEIGHT;
PLOT PREDICTED.*HEIGHT = 'P'
U95M.*HEIGHT = '-' L95M.*HEIGHT = '-'
WEIGHT*HEIGHT = '*' / OVERLAY;
PLOT RESIDUAL.*HEIGHT = 'O';
RUN;
THE OUTPUT:
[Overlay plot of the predicted values, the 95% confidence limits for the mean, and the observed weights against height, followed by a separate plot of the residuals against height.]
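If the same statistics are wanted in a data set rather than on a plot, PROC REG's OUTPUT statement can save them. The sketch below is an illustration; the output data set name and the variable names after the equal signs are chosen arbitrarily here:
/* Save predicted values, residuals, and 95% limits to a data set. */
PROC REG DATA=CORR_EG;
   MODEL WEIGHT = HEIGHT;
   OUTPUT OUT=PRED_OUT P=PREDICT R=RESID
          L95=L95_IND U95=U95_IND L95M=L95_MEAN U95M=U95_MEAN;
RUN;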
The same regression line with confidence limits can also be produced with PROC GPLOT, using SYMBOL interpolation options:
GOPTIONS CSYMBOL=BLACK;
SYMBOL1 VALUE=DOT;
SYMBOL2 VALUE=NONE I=RLCLM95;
SYMBOL3 VALUE=NONE I=RLCLI95 LINE=3;
PROC GPLOT DATA=CORR_EG;
TITLE "Regression lines and 95% CI's";
PLOT WEIGHT*HEIGHT=1 WEIGHT*HEIGHT=2 WEIGHT*HEIGHT=3 / OVERLAY;
RUN;
THE OUTPUT:
[Plot of WEIGHT against HEIGHT showing the data points, the fitted regression line, and the 95% confidence limits.]
X. ADDING A QUADRATIC TERM TO THE
REGRESSION EQUATION
The plot of residuals, shown in the graph above, suggests
that a second-order term (height squared) might improve
the model since the points do not seem random but, rather,
form a curve that could be fit by a second-order equation.
To add a quadratic term, we first need a statement that computes a variable representing the height squared, either in the original DATA step (after the INPUT statement) or in a new DATA step, for example:
HEIGHT2 = HEIGHT * HEIGHT; or HEIGHT2 = HEIGHT**2;
Next, write your MODEL and PLOT statements:
DATA CORR_EG;
SET CORR_EG;
HEIGHT2=HEIGHT**2;
RUN;
PROC REG DATA=CORR_EG;
MODEL WEIGHT = HEIGHT HEIGHT2;
PLOT R.*HEIGHT;
RUN;
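Written out, the model being fit is now second order in height:

\widehat{WEIGHT} = b_0 + b_1\, HEIGHT + b_2\, HEIGHT^2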
When you run this model, you will get an r-squared of .9743, an improvement over the .9441 obtained with the linear model.
XI. TRANSFORMING DATA
Another regression example is provided here to demonstrate some additional steps that may be necessary when doing regression. The data set below records heart rate at several drug doses:
DATA HEART;
INPUT DRUG_DOSE HEART_RATE;
DATALINES;
2 60
2 58
4 63
4 62
8 67
8 65
16 70
16 70
32 74
32 73
;
PROC GPLOT DATA=HEART;
PLOT HEART_RATE*DRUG_DOSE;
RUN;
PROC REG DATA=HEART;
MODEL HEART_RATE=DRUG_DOSE;
RUN;
Either by clinical judgment or by careful inspection of the graph, we decide that the relationship is not linear: we see an approximately equal increase in heart rate each time the dose is doubled. Therefore, if we plot log dose against heart rate we can expect a linear relationship. SAS software has a number of built-in functions, such as logarithms and trigonometric functions, and we can write mathematical equations that define new variables, either between the INPUT and DATALINES statements or in a new DATA step, as shown below.
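The reasoning can be written out as a sketch of the intended model: if heart rate rises by roughly the same amount each time the dose doubles, then

HEART\_RATE \approx \beta_0 + \beta_1 \log_2(DRUG\_DOSE) = \beta_0 + \frac{\beta_1}{\ln 2}\, \ln(DRUG\_DOSE)

which is linear in the natural logarithm of dose, the variable created by the LOG function in the step below.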
DATA HEART_LOG;
SET HEART;
L_DRUG_DOSE = LOG(DRUG_DOSE);
RUN;
PROC GPLOT DATA=HEART_LOG;
PLOT HEART_RATE*L_DRUG_DOSE;
RUN;
QUIT;
PROC REG DATA=HEART_LOG;
MODEL HEART_RATE=L_DRUG_DOSE;
PLOT R.*HEART_RATE;
RUN;
Notice that the data points are now closer to the regression line. The MEAN SQUARE ERROR term is smaller and the r-square is larger, confirming our conclusion that dose versus heart rate is fit better by a logarithmic curve than by a straight line.