Biostatistics, statistical software VI.
Relationship between two continuous variables: correlation, linear regression, transformations. Relationship between two discrete variables: contingency tables, test for independence.
Krisztina Boda PhD, Department of Medical Informatics, University of Szeged (INTERREG)

Relationship between two continuous variables: correlation, linear regression, transformations

Imagine that 6 students are given a battery of tests by a vocational guidance counsellor, with the results shown in the following table:
[Table: the six students' scores on the aptitude tests; the values did not survive extraction]
Variables measured on the same individuals are often related to each other.

Let us draw a graph, called a scattergram, to investigate relationships. Scatterplots show the relationship between two quantitative variables measured on the same cases. In a scatterplot we look for the direction, form, and strength of the relationship between the variables. The simplest relationship is linear in form and reasonably strong. Scatterplots also reveal deviations from the overall pattern.

Creating a scatterplot
When one variable in a scatterplot explains or predicts the other, place it on the x-axis, and place the variable that responds to the predictor on the y-axis. If neither variable explains or responds to the other, it does not matter which axis you assign each variable to.

Possible relationships
[Scatterplots of the test scores against math score: language vs. math shows positive correlation, retailing vs. math shows negative correlation, theater vs. math shows no correlation]

Describing a linear relationship with a number: the coefficient of correlation
Correlation is a numerical measure of the strength of a linear association.
The formula for the coefficient of correlation treats x and y identically: there is no distinction between explanatory and response variable. Denoting the two samples by $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$, the coefficient of correlation can be computed by the following (equivalent) formulas:

$$ r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \bigl(\sum_{i=1}^{n} x_i\bigr)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \bigl(\sum_{i=1}^{n} y_i\bigr)^2}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Properties of r
a) Correlations are between -1 and +1: $-1 \le r \le 1$, and either extreme indicates a perfect linear association. If r is near +1 or -1, we say that we have high correlation.
b) If r = 1, we say that there is a perfect positive correlation; if r = -1, there is a perfect negative correlation.
c) A correlation of zero indicates the absence of linear association. When there is no tendency for the points to lie on a straight line, we say that there is no correlation (r = 0) or that we have low correlation (r is near 0).
[Scatterplots illustrating high positive correlation (language vs. math), high negative correlation (retailing vs. math), and near-zero correlation (theater vs. math)]

Effect of outliers
Even a single outlier can change the correlation substantially.
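Both the computation of r and the effect of a single outlier can be illustrated with a short Python sketch. The data here are hypothetical (the slides' student scores are not reproduced); only the formula itself comes from the slides.

```python
import math

def pearson_r(x, y):
    """Coefficient of correlation, using the deviation form of the formula."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                  # perfectly linear in x
print(pearson_r(x, y))                # → 1.0 (perfect positive correlation)

# add a single outlier point (6, 0)
print(pearson_r(x + [6], y + [0]))    # → ≈ 0.143: one point destroys the correlation
```

The drop from r = 1.0 to roughly 0.14 after adding one stray point illustrates the "effect of outliers" slide numerically.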
Outliers can create an apparently strong correlation where none would be found otherwise, or hide a strong correlation by making it appear to be weak.
[Scatterplots: theater vs. math, r = -0.21 without and r = 0.74 with an outlier; language vs. math, r = 0.998 without and r = -0.26 with an outlier]

Two variables may be closely related and still have a small correlation if the form of the relationship is not linear.
[Scatterplots: a symmetric U-shaped relationship with r = 2.8·10⁻¹⁵; a curved relationship with r = 0.157]

Correlation and causation
A correlation between two variables does not show that one causes the other. Causation is a subtle concept best demonstrated statistically by designed experiments.

Correlation by eye
This applet (http://onlinestatbook.com/stat_sim/reg_by_eye/index.html) lets you estimate the regression line and guess the value of Pearson's correlation. Five possible values of Pearson's correlation are listed; one of them is the correlation for the data displayed in the scatterplot. Guess which one it is. To see the correct value, click on the "Show r" button.

When is a correlation "high"?
What is considered to be high correlation varies with the field of application.

Testing the significance of the coefficient of correlation
The statistician must decide when a sample value of r is far enough from zero to be significant, that is, when it is sufficiently far from zero to reflect the correlation in the population.
H0: ρ = 0 (Greek rho; the correlation coefficient in the population is 0)
Ha: ρ ≠ 0 (the correlation coefficient in the population differs from 0)
This test can be carried out by expressing a t statistic in terms of r.
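This significance test can be sketched in a few lines of Python. The statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ and the critical value $t_{0.05,4} = 2.776$ come from the slides; the helper name t_from_r is my own.

```python
import math

def t_from_r(r, n):
    """t statistic with n-2 degrees of freedom for testing H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Example 1 from the slides: r = 0.9989 between math and language skill, n = 6 students
t = t_from_r(0.9989, 6)
print(round(t, 1))       # → 42.6
print(abs(t) > 2.776)    # → True: reject H0 at the 5% level (t(0.05, 4) = 2.776)
```

The same function reproduces the other examples, e.g. r = -0.2157 with n = 6 gives t ≈ -0.4418, far inside the acceptance region.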
The following t statistic has n-2 degrees of freedom:

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

Decision using a statistical table: if $|t| > t_{\alpha,n-2}$, the difference is significant at level α; we reject H0 and state that the population correlation coefficient differs from 0. If $|t| < t_{\alpha,n-2}$, the difference is not significant at level α; we accept H0 and state that the population correlation coefficient does not differ from 0.
Decision using the p-value: if p < α, we reject H0 at level α and state that the population correlation coefficient differs from 0.

Example 1.
The correlation coefficient between math skill and language skill was found to be r = 0.9989. Is it significantly different from 0?
H0: the correlation coefficient in the population is 0 (ρ = 0). Ha: the correlation coefficient in the population differs from 0.
Let's compute the test statistic:

$$ t = \frac{0.9989\sqrt{6-2}}{\sqrt{1-0.9989^2}} = \frac{0.9989 \cdot 2}{\sqrt{1-0.9978}} = 42.6 $$

Degrees of freedom: df = 6-2 = 4. The critical value in the table is $t_{0.05,4} = 2.776$. Because 42.6 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.

Example 1, cont.
[Scatterplot with fitted line LANGUAGE = 15.5102 + 1.0163·MATH; r = 0.9989, p = 0.000002]
Since p < 0.05, the correlation is significantly different from 0 at the 5% level.

Example 2.
The correlation coefficient between math skill and retailing skill was found to be r = -0.9993. Is it significantly different from 0?
H0: the correlation coefficient in the population is 0 (ρ = 0). Ha: the correlation coefficient in the population differs from 0.
Let's compute the test statistic:

$$ t = \frac{-0.9993\sqrt{6-2}}{\sqrt{1-0.9993^2}} = \frac{-0.9993 \cdot 2}{\sqrt{1-0.9986}} = -53.42 $$

Degrees of freedom: df = 6-2 = 4. The critical value in the table is $t_{0.05,4} = 2.776$.
Because |-53.42| = 53.42 > 2.776, we reject H0 and claim that there is a significant linear correlation between the two variables at the 5% level.

Example 2, cont.
[Scatterplot with fitted line RETAIL = 234.135 - 0.3471·MATH; r = -0.9993, p = 0.0000008]

Example 3.
The correlation coefficient between math skill and theater skill was found to be r = -0.2157. Is it significantly different from 0?
H0: the correlation coefficient in the population is 0 (ρ = 0). Ha: the correlation coefficient in the population differs from 0.
Let's compute the test statistic:

$$ t = \frac{-0.2157\sqrt{6-2}}{\sqrt{1-0.2157^2}} = \frac{-0.2157 \cdot 2}{\sqrt{1-0.0465}} = -0.4418 $$

Degrees of freedom: df = 6-2 = 4. The critical value in the table is $t_{0.05,4} = 2.776$. Because |-0.4418| = 0.4418 < 2.776, we do not reject H0 and conclude that there is no significant linear correlation between the two variables at the 5% level.

Example 3, cont.
[Scatterplot with fitted line THEATER = 112.7943 - 0.1137·MATH; r = -0.2157, p = 0.6814]

Prediction based on linear correlation: the linear regression
When the form of the relationship in a scatterplot is linear, we usually want to describe that linear form more precisely with numbers. We can rarely hope to find data values lined up perfectly, so we fit lines to scatterplots with a method that compromises among the data values. This method is called the method of least squares. The key to finding, understanding, and using least-squares lines is an understanding of their failures to fit the data: the residuals.
A straight line that best fits the data, y = bx + a, is called the regression line.
Geometrical meaning of a and b: b, called the regression coefficient, is the slope of the best-fitting line (regression line); a is the y-intercept of the regression line.
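The least-squares line can be computed directly from the formulas $b = \sum(x_i-\bar{x})(y_i-\bar{y})/\sum(x_i-\bar{x})^2$ and $a = \bar{y} - b\bar{x}$ given in the slides. A minimal Python sketch on hypothetical data (not the students' scores):

```python
def least_squares(x, y):
    """Slope b and intercept a of the least-squares regression line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    a = my - b * mx
    return a, b

# hypothetical data lying exactly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a, b = least_squares(x, y)
print(a, b)   # → 1.0 2.0
```

Because the points lie exactly on a line, the fit recovers the slope and intercept exactly; with noisy data the same formulas give the compromise line with the smallest sum of squared residuals.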
Equation of the regression line for the data of Example 1
[Scatterplot with fitted line LANGUAGE = 15.5102 + 1.0163·MATH; r = 0.9989, p = 0.000002]
y = 1.016·x + 15.5; the slope of the line is 1.016.
Prediction based on the equation: what is the predicted language score for a student having 400 points in math?
y_predicted = 1.016·400 + 15.5 = 421.9

How to get the formula for the line that gives the best point estimates
The general equation of a line is y = a + bx. We would like to find the values of a and b such that the resulting line is the best-fitting line. Suppose we have n pairs of measurements (x_i, y_i). If x_i is the independent variable, the value of the line at x_i is a + b·x_i, and we approximate y_i by this value. The approximation is good if the differences $y_i - (a + b x_i)$ are small. These differences can be positive or negative, so we take their squares and sum them:

$$ S(a,b) = \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2 $$

This function of the unknown parameters a and b is also called the sum of squared residuals. To determine a and b, we have to find the minimum of S(a, b). In order to find the minimum, we set the partial derivatives of S to zero and solve the equations

$$ \frac{\partial S}{\partial a} = 0, \qquad \frac{\partial S}{\partial b} = 0 $$

The solution of this system of equations gives the formulas for b and a:

$$ b = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \bigl(\sum_{i=1}^{n} x_i\bigr)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x} $$

Using the second derivatives, it can be shown that these are really minimum places.

Computation of the correlation coefficient from the regression coefficient
There is a relationship between the correlation and the regression coefficient:

$$ r = b\,\frac{s_x}{s_y} $$

where $s_x$ and $s_y$ are the standard deviations of the two samples.
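The identity r = b·s_x/s_y can be checked numerically. This is a sketch on hypothetical paired measurements; statistics.stdev supplies the sample standard deviations.

```python
import math
from statistics import stdev

# hypothetical paired measurements (not the slides' data)
x = [400, 420, 450, 480, 510, 540]
y = [430, 445, 465, 505, 525, 555]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
sxx = sum((u - mx) ** 2 for u in x)
syy = sum((v - my) ** 2 for v in y)

b = sxy / sxx                      # regression coefficient (slope)
r = sxy / math.sqrt(sxx * syy)     # coefficient of correlation
print(abs(r - b * stdev(x) / stdev(y)) < 1e-12)   # → True: r = b * sx / sy
```

The check works for any sample because both sides reduce algebraically to the same ratio of sums, so r and b always share their sign.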
From this relationship it can be seen that r and b have the same sign: if there is a negative correlation between the variables, the slope of the regression line is also negative. It can be shown that the same t-test can be used to test the significance of r and the significance of b.

Coefficient of determination
The square of the correlation coefficient multiplied by 100 is called the coefficient of determination. It shows the percentage of the total variation explained by the linear regression.
Example. The correlation between math aptitude and language aptitude was found to be r = 0.9989. The coefficient of determination is r² = 0.9978, so 99.78% of the total variation of Y is explained by its linear relationship with X.

Regression using transformations
Sometimes useful models are not linear in their parameters. Examining the scatterplot of the data shows a functional, but not linear, relationship between the variables.

Example
A fast food chain opened in 1974. Each year from 1974 to 1988 the number of steakhouses in operation was recorded. The scatterplot of the original data suggests an exponential relationship between x (year) and y (number of steakhouses); taking the logarithm of y, we get a linear relationship.
[Plots: y vs. time rises exponentially to about 400; ln(y) vs. time is approximately linear]
Performing the linear regression procedure on x and log y, we get the equation log y = 2.327 + 0.2569·x, that is, $y = e^{2.327 + 0.2569x} = e^{2.327} e^{0.2569x} \approx 10.25\,e^{0.2569x}$, which is the equation of the best-fitting curve to the original data.

Types of transformations
Some non-linear models can be transformed into a linear model by taking logarithms of one or both sides.
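The steakhouse regression can be reproduced in outline: linearize the data with a logarithm, fit a line by ordinary least squares, and back-transform the intercept. The data below are synthetic, generated from y = 2·e^(0.5x), not the chain's actual counts.

```python
import math

# synthetic data from an exponential model y = 2 * e^(0.5 x)
xs = list(range(11))
ys = [2 * math.exp(0.5 * x) for x in xs]

# linearize: ln y = ln a + b*x, then ordinary least squares on (x, ln y)
ly = [math.log(v) for v in ys]
n = len(xs)
mx, ml = sum(xs) / n, sum(ly) / n
b = sum((x - mx) * (l - ml) for x, l in zip(xs, ly)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(ml - b * mx)          # back-transform the intercept: a = e^(ln a)

print(round(b, 6), round(a, 6))    # → 0.5 2.0
```

Since the synthetic data follow the model exactly, the fit recovers both parameters; real data like the steakhouse counts would give approximate values and some residual scatter on the log scale.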
Either the base-10 logarithm (denoted lg or log) or the natural (base-e) logarithm (denoted ln) can be used. If a > 0 and b > 0, a logarithmic transformation can be applied to the model.

Exponential relationship -> take the logarithm of y
Model: y = a·10^(bx). Taking the logarithm of both sides: lg y = lg a + b·x, so lg y is linear in x.
Example data: x = 0, 1, 2, 3, 4; y = 1.1, 1.9, 4, 8.1, 16; lg y = 0.0414, 0.2788, 0.6021, 0.9085, 1.2041.
[Plots: y vs. x is curved; lg y vs. x is linear]

Logarithmic relationship -> take the logarithm of x
Model: y = a + lg x, so y is linear in lg x.
Example data: x = 1, 4, 8, 16; lg x = 0, 0.6021, 0.9031, 1.2041.
[Plots: y vs. x is curved; y vs. lg x is linear]

Power relationship -> take the logarithm of both x and y
Model: y = a·x^b. Taking the logarithm of both sides: lg y = lg a + b·lg x, so lg y is linear in lg x.
Example data: x = 1, 2, 3, 4; y = 2, 16, 54, 128; lg x = 0, 0.3010, 0.4771, 0.6021; lg y = 0.3010, 1.2041, 1.7324, 2.1072.
[Plots: y vs. x is curved; lg y vs. lg x is linear]

Reciprocal relationship -> take the reciprocal of x
Model: y = a + b/x = a + b·(1/x), so y is linear in 1/x.
Example data: x = 1, 2, 3, 4, 5; 1/x = 1, 0.5, 0.3333, 0.25, 0.2.
[Plots: y vs. x is curved; y vs. 1/x is linear]

Example from the literature
[Figures omitted]

Relationship between two discrete variables: contingency tables, test for independence

Comparison of categorical variables (percentages): χ² (chi-square) tests
Example: rates of diabetes in three groups: 31%, 27% and 25%. Frequencies can be arranged into contingency tables.
H0: the occurrence of diabetes is independent of the groups (the rates are the same in the population).

Observed frequencies:
DIAB    Treatment 1   Treatment 2   Treatment 3   Total
yes     31            27            25            83
no      69            73            75            217
Total   100           100           100           300

χ² test, assumptions
If H0 is true, the expected frequencies can be computed as E_i = (row total · column total) / grand total.
χ² statistic: χ² = Σ(O_i - E_i)²/E_i
Degrees of freedom: df = (number of rows - 1)·(number of columns - 1)
Decision based on the table: reject H0 if χ² > χ²(table, α, df).
Assumption: at most 20% of the cells may have expected frequencies below 5.
Exact test (Fisher): it has no assumptions and gives the exact p-value.

Expected frequencies:
DIAB    Treatment 1   Treatment 2   Treatment 3   Total
yes     27.7          27.7          27.7          83
no      72.3          72.3          72.3          217
Total   100           100           100           300

χ² = 0.933, df = (3-1)·(2-1) = 2. Since 0.933 < 5.99 (= χ²(table, 0.05, 2)), p = 0.627, the difference is not significant. The assumption on expected frequencies holds; the exact p = 0.663.

2×2 tables
For a 2×2 table, the χ² statistic can be computed directly from the cell frequencies:

Factor        Group 1   Group 2   Total
Present (+)   a         b         a+b
Absent (-)    c         d         c+d
Total         a+c       b+d       N

$$ \chi^2 = \frac{N (ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} $$

When the assumptions are violated, use the Yates correction or Fisher's exact test:

$$ \chi^2_{\text{Yates}} = \frac{N \bigl(|ad - bc| - \frac{N}{2}\bigr)^2}{(a+b)(c+d)(a+c)(b+d)} $$

Example
Angiographic restenosis:

              Group 1   Group 2   Total
Present (+)   20        41        61
Absent (-)    72        51        123
Total         92        92        184

χ² = 10.815, df = 1, p = 0.001
χ²(Yates) = 9.809, df = 1, p = 0.002
Fisher's exact test: p = 0.002

Review questions and exercises
Problems to be solved by hand calculations: ..\Handouts\Problems hand VI.doc
Solutions: ..\Problems hand VI solution.doc
Problems to be solved using computer: ..\Handouts\Problems comp VI.doc

Useful WEB pages
http://www-stat.stanford.edu/~naras/jsm
http://www.ruf.rice.edu/~lane/rvls.html
http://my.execpc.com/~helberg/statistics.html
http://www.math.csusb.edu/faculty/stanton/m262/index.html
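The χ² computations from the contingency-table slides can be sketched in Python. The helper names chi_square and chi_square_2x2 are my own; the tables are the slides' diabetes and restenosis examples (p-values would additionally need a χ² distribution table or library).

```python
def chi_square(table):
    """Pearson chi-square statistic for an r x c table of observed frequencies."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total  # E = row total * column total / grand total
            chi2 += (obs - exp) ** 2 / exp
    return chi2

def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2x2 table from cell frequencies, optionally with Yates correction."""
    n = a + b + c + d
    num = abs(a * d - b * c)
    if yates:
        num -= n / 2
    return n * num ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

diab = [[31, 27, 25], [69, 73, 75]]                          # diabetes table, df = 2
print(round(chi_square(diab), 3))                            # → 0.933
print(round(chi_square_2x2(20, 41, 72, 51), 3))              # → 10.815 (restenosis table)
print(round(chi_square_2x2(20, 41, 72, 51, yates=True), 3))  # → 9.809
```

Both 2×2 values agree with the general formula applied to the same table; the Yates correction shrinks |ad - bc| by N/2 and therefore always gives a smaller statistic.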