INTERNATIONAL JOURNAL OF EMERGING TRENDS IN TECHNOLOGY AND SCIENCES (ISSN: 2348-0246 (online)), VOLUME-03, ISSUE-04, DECEMBER-2014

Overview of Correlation and Regression

TALARI PRASHANTI
Asst. Professor, Department of Mathematics, Bharat Institute of Engineering & Technology
[email protected]

Abstract— A common statistic for indicating the strength of a linear relationship between two continuous variables is the Pearson correlation coefficient, usually called simply the correlation coefficient (there are other types of correlation coefficients). The correlation coefficient is a number ranging from -1 to +1. A positive correlation means that as values of one variable increase, values of the other variable also tend to increase. A small or zero correlation coefficient tells us that the two variables are not linearly related. Finally, a negative correlation coefficient shows an inverse relationship between the variables: as one goes up, the other goes down. The purpose of a linear correlation analysis is to determine whether there is a relationship between two sets of variables. Bivariate correlation and regression evaluate the degree of relationship between two quantitative variables. The Pearson correlation (r), the most commonly used bivariate correlation technique, measures the association between two quantitative variables without distinguishing between the independent and dependent variables.

Keywords— correlation coefficient, linear bivariate correlation, bivariate regression.

I. INTRODUCTION

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases. The purpose of a linear correlation analysis is to determine whether there is a relationship between two sets of variables. We may find that: 1) there is a positive correlation, 2) there is a negative correlation, or 3) there is no correlation.

In statistics, the correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and –1. To interpret its value, see which of the following values your correlation r is closest to:

Exactly –1. A perfect downhill (negative) linear relationship
–0.70. A strong downhill (negative) linear relationship
–0.50. A moderate downhill (negative) relationship
–0.30. A weak downhill (negative) linear relationship
0. No linear relationship
+0.30. A weak uphill (positive) linear relationship
+0.50. A moderate uphill (positive) relationship
+0.70. A strong uphill (positive) linear relationship
Exactly +1. A perfect uphill (positive) linear relationship

We have recorded the GENDER, HEIGHT, WEIGHT, and AGE of seven people and want to compute the correlations between HEIGHT and WEIGHT, HEIGHT and AGE, and WEIGHT and AGE.

DATA CORR_EG;
   INPUT GENDER $ HEIGHT WEIGHT AGE;
DATALINES;
M 68 155 23
F 61 99 20
F 63 115 21
M 70 205 45
M 69 170 .
F 65 125 30
M 72 220 48
;
PROC CORR DATA=CORR_EG;
   TITLE 'EXAMPLE OF A CORRELATION MATRIX';
   VAR HEIGHT WEIGHT AGE;
RUN;
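For reference, the coefficient that PROC CORR reports for a pair of variables x and y is the usual product-moment formula; nothing beyond the quantities already in the data set above is assumed:

\[
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
\]

Because one AGE value is missing, PROC CORR by default computes each coefficient from the observations that are complete for that particular pair (pairwise deletion). This is why, in the output below, the number of observations is 7 for the HEIGHT–WEIGHT pair but 6 for the pairs involving AGE.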
THE OUTPUT:

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

              HEIGHT      WEIGHT         AGE
HEIGHT       1.00000     0.97165     0.86614
                          0.0003      0.0257
                   7           7           6
WEIGHT       0.97165     1.00000     0.92496
              0.0003                  0.0082
                   7           7           6
AGE          0.86614     0.92496     1.00000
              0.0257      0.0082
                   6           6           6

II. DESCRIPTION

We may find that: 1) there is a positive correlation, 2) there is a negative correlation, or 3) there is no correlation. These relationships can be easily visualized by using scatter diagrams.

Positive Correlation
Notice that in this example, as the heights increase, the diameters of the trunks also tend to increase. If this were a perfect positive correlation, all of the points would fall on a straight line. The more linear the data points, the closer the relationship between the two variables.

Negative Correlation
Notice that in this example, as the number of parasites increases, the harvest of unblemished apples decreases. If this were a perfect negative correlation, all of the points would fall on a line with a negative slope. The more linear the data points, the more negatively correlated are the two variables.

No Correlation
Notice that in this example there seems to be no relationship between the two variables. Perhaps pillbugs and clover do not interact with one another.

II.I. HOW TO DECIDE IF THERE IS A LINEAR CORRELATION OR NOT

First, locate the sample size nearest to yours in a table of critical values for r and look at the critical region for that sample size. If your correlation coefficient (r-value) falls within the critical region, your two variables do not change in a regular enough way to be considered significant; you conclude that your data indicate no correlation. If your correlation coefficient is larger than the positive critical value, your data indicate a significant positive correlation; the closer your r-value is to +1.0, the stronger the correlation. If your correlation coefficient is more negative than the negative critical value, your data indicate a significant negative correlation; the closer your r-value is to –1.0, the stronger the negative correlation.
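An equivalent way to reach the same decision without a table of critical values is to convert r into a t statistic with n – 2 degrees of freedom, t = r√(n – 2)/√(1 – r²), and compare the resulting p-value with 0.05. A minimal SAS sketch of this check, using the HEIGHT–WEIGHT correlation from the output above (the data set and variable names R_TEST, R, N, T, and P are chosen here only for illustration):

DATA R_TEST;
   R = 0.97165;                          /* correlation from the PROC CORR output */
   N = 7;                                /* number of paired observations         */
   T = R * SQRT(N - 2) / SQRT(1 - R**2); /* t statistic for H0: rho = 0           */
   P = 2 * (1 - PROBT(ABS(T), N - 2));   /* two-sided p-value                     */
   PUT T= P=;
RUN;

Apart from rounding, this reproduces the probability of 0.0003 that PROC CORR prints under the coefficient.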
II.II. REPORTING YOUR RESULTS

The examples below show how the results of your analysis of linear correlation should be presented.

"There is a positive correlation between the height and the diameter of Eastern White Pine trees (n = 15, r = 0.857, critical value = 0.514)."

"There is a negative correlation between the population density of codling moths and the % of unblemished apples harvested (n = 23, r = –0.500, critical value = –0.444)."

"There is no linear correlation between the population densities of pillbugs and red clover (n = 30, r = 0.202, critical values = ±0.361)."

III. SIGNIFICANCE OF A CORRELATION COEFFICIENT

"How large a correlation coefficient do I need to show that two variables are correlated?" Each time PROC CORR prints a correlation coefficient, it also prints a probability associated with the coefficient. That number gives the probability of obtaining a sample correlation coefficient as large as or larger than the one obtained by chance alone. The significance of a correlation coefficient is a function of the magnitude of the correlation and the sample size: with a large number of data points, even a small correlation coefficient can be significant.

It is important to remember that correlation indicates only the strength of a relationship – it does not imply causality. For example, we would probably find a high positive correlation between the number of hospitals in each of the 50 states and the number of household pets in each state. This does not mean that pets make people sick and therefore make more hospitals necessary! The most plausible explanation is that both variables (number of pets and number of hospitals) are related to population size.

Being SIGNIFICANT is not the same as being IMPORTANT or STRONG. That is, knowing the significance of the correlation coefficient does not tell us very much by itself. Once we know that our correlation coefficient is significantly different from zero, we need to look further in interpreting the importance of that correlation.

IV. HOW TO INTERPRET A CORRELATION COEFFICIENT

One of the best ways to interpret a correlation coefficient (r) is to look at the square of the coefficient (r-squared); r-squared can be interpreted as the proportion of variance in one of the variables that can be explained by variation in the other variable. As an example, our height/weight correlation is 0.97, so r-squared is 0.94. We can say that 94% of the variation in weight can be explained by variation in height (or vice versa). The remaining (1 – 0.94), or 6%, of the variance of weight is due to factors other than height variation.

Another consideration in the interpretation of a correlation coefficient is this: be sure to look at a scatter plot of the data (using PROC PLOT). It often turns out that one or two extreme data points can cause the correlation coefficient to be much larger than expected, and a single keypunching error can dramatically alter a correlation coefficient.

An important assumption concerning a correlation coefficient is that each pair of x, y data points is independent of any other pair. That is, each pair of points has to come from a separate subject. Otherwise we cannot compute a valid correlation coefficient.

V. PARTIAL CORRELATIONS

A researcher may wish to determine the strength of the relationship between two variables when the effect of other variables has been removed. One way to accomplish this is by computing a partial correlation. To remove the effect of one or more variables from a correlation, use a PARTIAL statement to list those variables whose effects you want to remove.

PROC CORR DATA=CORR_EG NOSIMPLE;
   TITLE 'EXAMPLE OF A PARTIAL CORRELATION';
   VAR HEIGHT WEIGHT;
   PARTIAL AGE;
RUN;

THE OUTPUT:

The CORR Procedure

1 Partial Variables:  AGE
2 Variables:          HEIGHT WEIGHT

Pearson Partial Correlation Coefficients, N = 6
Prob > |r| under H0: Partial Rho=0

              HEIGHT      WEIGHT
HEIGHT       1.00000     0.91934
                          0.0272
WEIGHT       0.91934     1.00000
              0.0272
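For two variables x and y with a single variable z partialled out, the coefficient reported above corresponds to the standard first-order partial correlation formula

\[
r_{xy\cdot z} = \frac{r_{xy} - r_{xz}\,r_{yz}}{\sqrt{(1-r_{xz}^2)(1-r_{yz}^2)}}
\]

Note, however, that the PARTIAL statement uses only the six observations that are complete for HEIGHT, WEIGHT, and AGE, whereas the simple correlations printed earlier were computed pairwise (some from seven observations), so plugging those printed values into this formula will not reproduce 0.91934 exactly.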
VI. LINEAR REGRESSION

Given a person's height, what would be their predicted weight? How can we best define the relationship between height and weight? By studying a scatter plot of the data we see that the relationship is approximately linear; that is, we can imagine drawing a straight line on the graph with most of the data points lying only a short distance from the line. The vertical distance from each data point to this line is called a residual. How do we determine the "best" straight line to fit our height/weight data? The method of least squares is commonly used: it finds the line (called the regression line) that minimizes the sum of the squared residuals. An equivalent way to define a residual is the difference between a subject's predicted score and his/her actual score.

PROC REG DATA=CORR_EG;
   TITLE 'REGRESSION LINE FOR HEIGHT-WEIGHT DATA';
   MODEL WEIGHT=HEIGHT;
RUN;

THE OUTPUT (in part):

Parameter Estimates

                      Parameter     Standard
Variable       DF      Estimate        Error    t Value    Pr > |t|
Intercept       1    -592.64458     81.54217      -7.27      0.0008
HEIGHT          1      11.19127      1.21780       9.19      0.0003

The fitted line is therefore WEIGHT = –592.64 + 11.191 × HEIGHT.

VII. PARTITIONING THE TOTAL SUM OF SQUARES

The total sum of squares is the sum of squared deviations of each person's weight from the grand mean. This total sum of squares (SS) can be divided into two portions: the sum of squares due to regression (or model) and the sum of squares for error (sometimes called residual). The portion labelled Sum of Squares (ERROR) in the output reflects the deviations of each weight from the PREDICTED weight. The other portion reflects deviations between the PREDICTED values and the MEAN. Our intuition tells us that if the deviations about the regression line are small (error mean square) compared to the deviations between the predicted values and the mean (mean square model), then we have a good regression line. To compare these mean squares, the ratio F = (mean square model)/(mean square error) is formed. The larger this ratio, the better the model fit.

VIII. PLOTTING THE POINTS ON THE REGRESSION LINE

In the PLOT statement of PROC REG, the keyword names PREDICTED. and RESIDUAL. (the periods are part of these keywords) are used to plot predicted and residual values. A common option is OVERLAY, which is used to plot more than one graph on a single set of axes.

PROC REG DATA=CORR_EG;
   MODEL WEIGHT=HEIGHT;
   PLOT WEIGHT*HEIGHT;
RUN;

PROC REG DATA=CORR_EG;
   MODEL WEIGHT=HEIGHT;
   PLOT PREDICTED.*HEIGHT='1' WEIGHT*HEIGHT='2' / OVERLAY;
RUN;

THE OUTPUT: [Figure: observed and predicted WEIGHT plotted against HEIGHT, annotated WEIGHT = –592.64 + 11.191 HEIGHT, with N = 7, Rsq = 0.9441, Adj Rsq = 0.9329, RMSE = 11.861.]

IX. PLOTTING RESIDUALS AND CONFIDENCE LIMITS

The most useful statistics besides the predicted values are:

RESIDUAL. – the residual (i.e., the difference between the actual and predicted value for each observation).
L95. and U95. – the lower and upper 95% confidence limits for an individual predicted value (i.e., 95% of the dependent data values would be expected to fall between these two limits).
L95M. and U95M. – the lower and upper 95% confidence limits for the mean of the dependent variable for a given value of the independent variable.

PROC REG DATA=CORR_EG;
   MODEL WEIGHT = HEIGHT;
   PLOT PREDICTED.*HEIGHT='P' U95M.*HEIGHT='-' L95M.*HEIGHT='-' WEIGHT*HEIGHT='*' / OVERLAY;
   PLOT RESIDUAL.*HEIGHT='O';
RUN;

THE OUTPUT: [Figure: predicted WEIGHT with 95% confidence limits for the mean overlaid on the observed data, and a separate plot of residuals against HEIGHT.]
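If you prefer to work with these quantities directly rather than plot them, an OUTPUT statement in PROC REG can write them to a data set. A minimal sketch under the same model (the output data set name PRED_CI and the new variable names are chosen here only for illustration):

PROC REG DATA=CORR_EG;
   MODEL WEIGHT = HEIGHT;
   OUTPUT OUT=PRED_CI                     /* hypothetical output data set       */
          P=PRED R=RESID                  /* predicted values and residuals     */
          L95M=LOW_MEAN U95M=HIGH_MEAN    /* 95% limits for the mean response   */
          L95=LOW_IND  U95=HIGH_IND;      /* 95% limits for an individual value */
RUN;
QUIT;

PROC PRINT DATA=PRED_CI;
   VAR HEIGHT WEIGHT PRED RESID LOW_MEAN HIGH_MEAN LOW_IND HIGH_IND;
RUN;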
The regression line and its 95% confidence limits can also be displayed with SAS/GRAPH, using SYMBOL statements and PROC GPLOT:

GOPTIONS CSYMBOL=BLACK;
SYMBOL1 VALUE=DOT;
SYMBOL2 VALUE=NONE I=RLCLM95;
SYMBOL3 VALUE=NONE I=RLCLI95 LINE=3;
PROC GPLOT DATA=CORR_EG;
   TITLE "Regression lines and 95% CI's";
   PLOT WEIGHT*HEIGHT=1 WEIGHT*HEIGHT=2 WEIGHT*HEIGHT=3 / OVERLAY;
RUN;
QUIT;

THE OUTPUT: [Figure: WEIGHT plotted against HEIGHT with the fitted regression line, 95% confidence limits for the mean, and 95% prediction limits for individual values.]

X. ADDING A QUADRATIC TERM TO THE REGRESSION EQUATION

The plot of residuals produced above suggests that a second-order term (height squared) might improve the model, since the points do not seem random but, rather, form a curve that could be fit by a second-order equation. To add a quadratic term, we first need a variable that represents height squared. This can be done in the original DATA step by including a line such as HEIGHT2 = HEIGHT * HEIGHT; or HEIGHT2 = HEIGHT**2; after the INPUT statement, or in a separate DATA step as shown below. Next, write your MODEL and PLOT statements:

DATA CORR_EG;
   SET CORR_EG;
   HEIGHT2=HEIGHT**2;
RUN;
PROC REG DATA=CORR_EG;
   MODEL WEIGHT = HEIGHT HEIGHT2;
   PLOT R.*HEIGHT;
RUN;

When you run this model, you will get an r-squared of 0.9743, an improvement over the 0.9441 obtained with the linear model.

XI. TRANSFORMING DATA

Another regression example is provided here to demonstrate some additional steps that may be necessary when doing regression.

DATA HEART;
   INPUT DRUG_DOSE HEART_RATE;
DATALINES;
2 60
2 58
4 63
4 62
8 67
8 65
16 70
16 70
32 74
32 73
;
PROC GPLOT DATA=HEART;
   PLOT HEART_RATE*DRUG_DOSE;
RUN;
PROC REG DATA=HEART;
   MODEL HEART_RATE=DRUG_DOSE;
RUN;

Either by clinical judgment or by careful inspection of the graph, we decide that the relationship is not linear: we see an approximately equal increase in heart rate each time the dose is doubled. Therefore, if we plot log dose against heart rate we can expect a linear relationship. SAS software has a number of built-in functions, such as logarithms and trigonometric functions, and we can write mathematical equations to define new variables by placing these statements between the INPUT and DATALINES statements (or, as here, in a new DATA step).

DATA HEART_LOG;
   SET HEART;
   L_DRUG_DOSE = LOG(DRUG_DOSE);
RUN;
PROC GPLOT DATA=HEART_LOG;
   PLOT HEART_RATE*L_DRUG_DOSE;
RUN;
QUIT;
PROC REG DATA=HEART_LOG;
   MODEL HEART_RATE=L_DRUG_DOSE;
   PLOT R.*HEART_RATE;
RUN;

Notice that the data points are now closer to the regression line. The mean square error term is smaller and r-squared is larger, confirming our conclusion that heart rate versus dose fits a logarithmic curve better than a straight line.
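Because the dose doubles at each step, a base-2 logarithm is a natural alternative transformation: the slope is then directly interpretable as the expected change in heart rate per doubling of the dose. A minimal variant sketch (the data set name HEART_LOG2 and variable L2_DOSE are chosen only for illustration):

DATA HEART_LOG2;
   SET HEART;
   L2_DOSE = LOG2(DRUG_DOSE);   /* gives 1, 2, 3, 4, 5 for doses 2, 4, 8, 16, 32 */
RUN;
PROC REG DATA=HEART_LOG2;
   MODEL HEART_RATE = L2_DOSE;
RUN;
QUIT;

The fit is identical to the natural-log model, since the two predictors differ only by a constant factor; only the scale of the slope changes.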