REGRESSION AND CORRELATION

INTRODUCTION
In simple correlation and regression studies, the researcher collects data on two numerical or quantitative variables to see whether a relationship exists between the variables. The independent variable (X) is the variable in the regression that can be controlled or manipulated. The dependent variable (Y) is the variable in the regression that cannot be controlled or manipulated.

REGRESSION
The term regression analysis refers to studies of the relations between variables. Regression analysis is a technique for quantifying the relationship between a criterion variable (also called a dependent variable) and one or more predictor variables (or independent variables). Put another way, regression analysis is a technique whereby a mathematical equation is fitted to a set of data; it describes the relationship between two variables and provides a statistical method for modelling how one variable depends on another. A line of best fit that is independent of individual judgement and is drawn mathematically is called a regression line. Once computed, a regression line equation can be graphed and used to estimate values previously unknown. The reasons for computing a regression line are:
(i) to obtain a line of best fit that is free of subjective judgement; the regression line improves our estimates;
(ii) the regression equation can be used to make predictions within the given range of the data, that is, to make interpolations;
(iii) the reliability of estimates made from such a line can be measured mathematically.
The purpose of the regression line is to enable the researcher to see the trend and make predictions on the basis of the available data. Values of Y are predicted from values of X; hence the closer the points are to the line, the better the fit and the predictions will be. Each observation of bivariate data can be viewed as a point (x, y), where x is the explanatory or independent variable and y is the dependent or response variable. It is important to determine which of the variables in question is the dependent variable and which is the independent variable. The starting point in regression analysis is the construction of a scatter diagram (scatter graph).

There are normally two regression line equations for any set of bivariate data:
(1) Regression of Y on X. The line equation is given by Y = a + bX, and the line is used to predict or estimate the value of Y that follows from any value of X.
(2) Regression of X on Y. The equation is given by X = a + bY, and the line is used to predict or estimate the value of X that follows from any value of Y.
The guide on the choice of regression line equation is that one should always use the line that has the variable to be estimated on the left-hand side of the equation. In other words, if you want to estimate X use X = a + bY, and if you want to estimate Y use Y = a + bX. In our analysis we are going to use the Y-on-X regression equation, that is Y = a + bX. In this equation a is the Y intercept, while b is called the regression coefficient or the slope of the line.
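As noted above, the Y-on-X and X-on-Y lines are in general different. The short Python sketch below is only an illustration: the data values are made up and the helper function fit_line is our own, using the least squares formulas introduced later in this chapter.

```python
# Minimal sketch (made-up data): the Y-on-X and X-on-Y regression lines
# fitted to the same bivariate data are, in general, different lines.

def fit_line(x, y):
    """Least squares line y = a + b*x for predicting y from x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sx2 = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 7, 6]

a1, b1 = fit_line(x, y)   # regression of Y on X: estimate Y from X
a2, b2 = fit_line(y, x)   # regression of X on Y: estimate X from Y
print(f"Y on X: Y = {a1:.3f} + {b1:.3f}X")
print(f"X on Y: X = {a2:.3f} + {b2:.3f}Y")
```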
In single-equation regression models, one variable, called the dependent variable or regressand, is expressed as a linear function of one or more other variables, called the explanatory, independent or regressor variables. It is implicitly assumed that causal relationships, if any, between the independent and dependent variables flow in one direction only, namely from the explanatory variables to the dependent variable (or regressand).

Although regression analysis deals with the dependence of one variable on other variables, it does not necessarily imply causation. A statistical relationship may be very strong, but it can never by itself establish a causal connection; our ideas of causation must come from outside statistics, ultimately from some theory or other. For example, there is no statistical reason to assume that rainfall depends on crop yield. In regression analysis we try to estimate or predict the average value of one variable on the basis of fixed values of other variables.

The regression equation is given by Y = a + bX, or Y = f(X), or Yi = β1 + β2Xi + μi, where μi is the stochastic or random error term (it can also be viewed as a surrogate for all omitted variables). β1 and β2 are the regression coefficients: β1 is the intercept and β2 is the slope coefficient. The population regression function is E(Y|Xi) = β1 + β2Xi = Ŷi, and the error is μi = Yi − E(Y|Xi), where Yi is the observed value and E(Y|Xi) is the predicted value.

Regression analysis techniques assume that:
(1) Each item of data is independent of the others.
(2) Data measurements are unbiased.
(3) The error variance is constant over the entire range of data, rather than larger in some parts of the data range and smaller in others.
(4) There is no autocorrelation between the disturbances (errors).

The probability distribution of the random error term μ is assumed to satisfy the following:
1. The mean of the probability distribution of μ is 0; that is, the average of the values of μ over an infinitely long series of experiments is 0 for each setting of the independent variable X. Formally, E(μi|Xi) = 0 (zero mean value of the disturbance μi). This assumption implies that the mean value of Y for a given value of X is E(Y) = β1 + β2X.
2. The variance of the probability distribution of μ is constant for all settings of the independent variable X: Var(μi|Xi) = σ², i.e. the error term has equal variance for all observations (homoscedasticity). For a straight-line model, this assumption means that the variance of μ is equal to a constant, say σ², for all values of X.
3. The probability distribution of μ is normal.
4. The values of μ associated with any two observed values of Y are independent. That is, the value of μ associated with one value of Y has no effect on the values of μ associated with other Y values; there is no autocorrelation between the values of the disturbances (errors).
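These error-term assumptions can be pictured with a small simulation. The Python sketch below is illustrative only; the intercept, slope and error standard deviation are made-up values. It draws normal, equal-variance errors at each fixed X and checks that their average, and hence the average of Y, behaves as assumptions 1 and 2 state.

```python
# Minimal simulation sketch (made-up parameter values) of the classical
# error-term assumptions: Y_i = b1 + b2*X_i + u_i with u_i ~ N(0, sigma^2),
# independent and with the same variance at every X (homoscedasticity).
import random

random.seed(1)
b1, b2, sigma = 48.0, 8.0, 5.0        # assumed intercept, slope, error s.d.
x_values = [2, 3, 4, 5, 6]

# Repeat the "experiment" many times at each fixed X and average the errors:
# under assumption 1 the average error at each X should be close to zero,
# so the average of Y should be close to E(Y) = b1 + b2*X.
for x in x_values:
    errors = [random.gauss(0, sigma) for _ in range(10_000)]
    y_mean = sum(b1 + b2 * x + e for e in errors) / len(errors)
    print(f"X = {x}: mean error = {sum(errors)/len(errors):6.3f}, "
          f"mean of Y = {y_mean:7.2f} (E(Y) = {b1 + b2 * x})")
```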
Problems and difficulties can be encountered when using regression analysis, and these can cause results to be inaccurate and misleading. Four such problems are listed below:
(a) An inadequate sample size may have been used to collect the data. The sample size should be at least two or three times the number of variables used in the regression equation, and preferably much larger.
(b) The independent variables measured during the study may have been poorly measured, may be in the wrong form, or may not be the ones that have a direct effect on the dependent variable.
(c) The independent variables are highly correlated with each other. If two independent variables are perfectly correlated with each other (that is, r = +1.00), their effect will be the same as that of a single independent variable which has been used twice in the same regression analysis.
(d) The true relationship between the dependent variable and the independent variable(s) is not linear, or it has an unusual shape which cannot be analysed using linear regression techniques.

NB. The regression equation may apply only to a certain range of data observations, such that the regression equation derived from those data may not apply for figures smaller than the lowest observed value or in excess of the largest observed value. For example, if the observed data range from 50 million to 100 million, the regression equation cannot be used for any values in excess of 100 million or below 50 million. In other words, when predictions are made, they are based on present conditions or on the premise that present trends will continue. This assumption may or may not prove true in the future. For example, in 1979 some experts predicted that the US would run out of oil by the year 2003!

Standard methods of obtaining regression line equations
(1) Inspection method: this involves drawing a scatter diagram and then fitting the line of best fit where one feels it ought to be.
(2) Semi-averages method: this involves splitting the data into two equal groups, plotting the mean point for each group and joining the two points by means of a straight line.
(3) Least squares method: this minimises the total of the squared deviations.

Least squares method
The fitted regression line can be viewed as a "predictor line". For each Xi, the regression line "predicts" Ŷi. The difference between Yi and its predicted equivalent Ŷi is Yi − Ŷi, and this difference is called a "residual" or error ei. OLS (ordinary least squares) minimises the sum of the squares of all the residuals:

minimise Σ(Yi − Ŷi)² over all i, that is, minimise Σei² = Σ[Yi − (â + b̂Xi)]².

The expression â + b̂X is the estimated regression line, which is the same as Ŷ = â + b̂X. Graphically, the residual ei is the vertical distance between the observed point (Xi, Yi) and the fitted line Y = a + bX.

The estimated Yi is written Ŷi, while the estimated a is written â and that for b is b̂. The error term μi (or ei) = Yi − Ŷi = Yi − (â + b̂Xi). The sum of the squares of the deviations (errors) of the Y values about their predicted values for all the n data points is SSE = Σ(Yi − â − b̂Xi)². The quantities â and b̂ that make the SSE a minimum are called the least squares estimates of the population parameters a and b, and the prediction equation Ŷ = â + b̂X is called the least squares line. The least squares line is the one that has the following two properties:
1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller than that for any other straight line model.

The formulas for the least squares estimates of a and b are:

b̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²  and  â = Ȳ − b̂X̄

or we can use the formulas

b̂ = [nΣXY − ΣX ΣY] / [nΣX² − (ΣX)²]  and  â = ΣY/n − b̂(ΣX/n)

where n represents the number of pairs of X, Y values. For the regression of X on Y, X = a + bY, the corresponding formulas are

b̂ = [nΣXY − ΣX ΣY] / [nΣY² − (ΣY)²]  and  â = ΣX/n − b̂(ΣY/n)

In the first set of equations for the Y-on-X regression line, n represents the number of pairs of y and x values. NB: the calculated values of a and b should be rounded to three decimal places.

Table 36. Advertising Expenditure and Sales Statistics
Year     Advertising exp (x)   Sales (y)   xy     x^2
2000     2                     60          120    4
2001     5                     100         500    25
2002     4                     70          280    16
2003     6                     90          540    36
2004     3                     80          240    9
Total    20                    400         1680   90

From the data above one can get a and b in the regression equation Y = a + bX as follows:

b̂ = [5(1680) − 20(400)] / [5(90) − (20)²] = 400/50 = 8
â = 400/5 − 8(20/5) = 80 − 32 = 48

The regression equation is therefore Ŷ = 48 + 8X (answer in $000). If we are required to estimate the sales for 2005, when advertising expenditure is expected to be $5 000, then substituting into our regression equation we get Ŷ = 48 + 8(5) = 88, that is, $88 000.
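The arithmetic above is easy to check in code. The following Python sketch is a verification aid rather than part of the text: it recomputes b̂ and â from the column totals of Table 36 and then produces the 2005 forecast.

```python
# Minimal sketch reproducing the least squares calculation from Table 36
# and the 2005 forecast (expected results: b = 8, a = 48, forecast = 88).
x = [2, 5, 4, 6, 3]          # advertising expenditure ($000)
y = [60, 100, 70, 90, 80]    # sales ($000)
n = len(x)

sum_x, sum_y = sum(x), sum(y)                      # 20, 400
sum_xy = sum(xi * yi for xi, yi in zip(x, y))      # 1680
sum_x2 = sum(xi * xi for xi in x)                  # 90

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n
print(f"Y_hat = {a:.3f} + {b:.3f}X")               # Y_hat = 48.000 + 8.000X

# Estimated sales for 2005 when advertising expenditure is $5 000:
print(a + b * 5)                                    # 88.0, i.e. $88 000
```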
N.B. It is important that you be able to interpret the intercept and slope in terms of the data being used to fit the model. The model parameters should be interpreted only within the sampled range of the independent variable (in this case between $2 000 and $6 000).

Sometimes only one set of data is available and one is expected to make projections based on it. From our initial table, if the data on advertising expenditure are missing and we are required to estimate sales in a given period, we need to choose the years as our x while the sales remain as y. The years are given coded values, starting with 2000 coded as x = 1, 2001 coded as 2, and so on. Alternatively one could start with 2000 coded as x = 0, 2001 coded as 1, and so on. The estimates are the same in each case although the regression equation is slightly different.

From the table below, b = 3 and a = 71, hence the regression equation is Y = 71 + 3x. From this result we can forecast the sales for 2005 and 2007. To get the answer we need to establish the value of x in 2005 and 2007, and these are x = 6 and x = 8 respectively. Substituting into the equation gives y(2005) = 71 + 3(6) = 89 (i.e. $89 000) and y(2007) = 71 + 3(8) = 95 (i.e. $95 000).

Table 37. Regression workings
Year     Coded (x)   Sales (y)   xy     x^2
2000     1           60          60     1
2001     2           100         200    4
2002     3           70          210    9
2003     4           90          360    16
2004     5           80          400    25
Total    15          400         1230   55

An alternative formula, which has the advantage of not using a previously calculated value, can also be used:

â = [ΣY ΣX² − ΣX ΣXY] / [nΣX² − (ΣX)²]  and  b̂ = [nΣXY − ΣX ΣY] / [nΣX² − (ΣX)²]

This will give the same results as the first formula used above.

NB. When one is drawing the scatter plot and the regression line, it is sometimes desirable to truncate the graph. The reason is to show the line drawn in the range of the independent and dependent variables.

For a set of X and Y values and the estimated regression equation Ŷ = â + b̂X, each X has an observed Y and a predicted Ŷ value. The total variation Σ(Y − Ȳ)² is the sum of the squares of the vertical distances of each point from the mean. The total variation can be divided into two parts: that which is attributed to the relationship of X and Y, and that which is due to chance. The variation obtained from the relationship (i.e. from the predicted Ŷ values) is Σ(Ŷ − Ȳ)² and is called the explained variation; when the relationship is strong, most of the variation is explained by it. The closer the value of r is to +1 or −1, the better the points fit the line and the closer Σ(Ŷ − Ȳ)² is to Σ(Y − Ȳ)². In fact, if all the points fall on the regression line, Σ(Ŷ − Ȳ)² = Σ(Y − Ȳ)², since Ŷ would be equal to Y in each case.

On the other hand, the variation due to chance, found by Σ(Y − Ŷ)², is called the unexplained variation. This variation cannot be attributed to the relationship. When the unexplained variation is small, the value of r is close to +1 or −1. If all the points fall on the regression line, the unexplained variation Σ(Y − Ŷ)² will be 0. Hence the total variation is equal to the sum of the explained variation and the unexplained variation:

Σ(Y − Ȳ)² = Σ(Ŷ − Ȳ)² + Σ(Y − Ŷ)²

or, put in words, Total Variation = Explained Variation + Unexplained Variation.

The values Y − Ŷ are called residuals. A residual is the difference between the actual value of Y and the predicted value Ŷ for a given X value. The mean of the residuals is always zero. The regression line determined by the formulas for a and b mentioned above is the line that best fits the points of the scatter plot: the sum of the squares of the residuals computed by using the regression line is the smallest possible value. For this reason, a regression line is also called a least squares line.
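This decomposition can be verified numerically. The Python sketch below, a verification aid rather than part of the text, uses the advertising and sales data with the fitted line Ŷ = 48 + 8X and confirms that the explained and unexplained variations add up to the total variation.

```python
# Minimal sketch verifying the variation decomposition for the advertising
# and sales data with the fitted line Y_hat = 48 + 8X:
# total variation = explained variation + unexplained variation.
x = [2, 5, 4, 6, 3]
y = [60, 100, 70, 90, 80]

y_hat = [48 + 8 * xi for xi in x]
y_bar = sum(y) / len(y)                                        # 80

explained = sum((yh - y_bar) ** 2 for yh in y_hat)             # 640
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # 360
total = sum((yi - y_bar) ** 2 for yi in y)                     # 1000

print(explained, unexplained, total)
print(explained + unexplained == total)                        # True
```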
[Figure: for a data point (x, y), the total deviation y − ȳ is split into the explained deviation ŷ − ȳ and the unexplained deviation y − ŷ.]

For the prediction interval about the Ŷ value, see the chapter on estimation and hypothesis testing.

Problems in Regression Analysis
(1) Multicollinearity: this occurs when two or more explanatory variables in the regression are highly correlated. It can sometimes be overcome by extending the sample size, using a priori information, or dropping one of the highly collinear variables.
(2) Heteroscedasticity: this arises when the assumption that the variance of the error term is constant for all values of the independent variables is violated. Heteroscedastic disturbances lead to biased standard errors and thus to incorrect statistical tests and confidence intervals for the parameter estimates.
(3) Autocorrelation: this is when consecutive errors or residuals are correlated. While the estimated coefficients are not biased in the presence of autocorrelation, their standard errors are biased downward (so that the values of their t statistics are exaggerated). As a result, we may conclude that an estimated coefficient is statistically significant when in fact it is not. If there is evidence of autocorrelation, possible remedies include adding time as an additional explanatory variable to take account of any trend in the data, including an important missing variable in the regression, or re-estimating the regression in nonlinear form.

COVARIANCE AND CORRELATION
Covariance and correlation are related parameters that indicate the extent to which two random variables co-vary. Suppose there are two technology stocks. If they are affected by the same industry trends, their prices will tend to rise or fall together: they co-vary. Covariance and correlation measure such a tendency.

Covariance measures the association between sets of values X and Y; it measures how X and Y vary together. If we have a pair of random variables X and Y with means μx and μy and variances σx² and σy², the covariance is defined by the expectation

Cov(X, Y) = E[(X − μx)(Y − μy)].

The sample covariance is

Sx,y = [1/(n − 1)] Σ(Xi − X̄)(Yi − Ȳ).

If we write x = X − X̄ and y = Y − Ȳ, the formula becomes Sx,y = [1/(n − 1)] Σxy. If x and y disagree in sign we have xy < 0. If X and Y move together, most of the products xy will be positive, as will their sum. If X and Y are related negatively, Σxy < 0.

If there is positive association between the random variables, high values of X tend to be associated with high values of Y, and low values of X with low values of Y. When there is negative association, so that high values of X are associated with low values of Y and low values of X with high values of Y, the covariance is negative. If there is no linear association between X and Y, their covariance is 0.

Covariance is of little use in assessing the strength of the relation between a pair of random variables, as its value depends on the units in which they are measured. Ideally we would require a pure, scale-free measure. Such a measure is easily obtained by dividing the covariance by the product of the individual standard deviations. The resulting quantity is called the correlation coefficient, represented by ρ for the population correlation coefficient and r for the sample correlation coefficient.
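The unit dependence of the covariance, and the scale-free nature of the correlation coefficient, can be seen directly. The Python sketch below is a verification aid using the advertising and sales data (the function names are our own): it computes the sample covariance with advertising measured first in thousands of dollars and then in dollars, and shows that the correlation is unchanged.

```python
# Minimal sketch: sample covariance depends on the units of measurement,
# while the correlation coefficient is scale free.
from statistics import mean, stdev

def sample_cov(x, y):
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (len(x) - 1)

def corr(x, y):
    return sample_cov(x, y) / (stdev(x) * stdev(y))

x_thousands = [2, 5, 4, 6, 3]                 # advertising in $000
x_dollars = [v * 1000 for v in x_thousands]   # the same data in $
y = [60, 100, 70, 90, 80]                     # sales

print(sample_cov(x_thousands, y))             # covariance in one set of units
print(sample_cov(x_dollars, y))               # 1000 times larger: unit dependent
print(corr(x_thousands, y), corr(x_dollars, y))  # identical (0.8): scale free
```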
CORRELATION
Correlation is a statistical method used to determine whether a relationship between variables exists. Correlation is designed to measure the strength or degree of linear association between two variables. While regression analysis establishes a mathematical relationship between the dependent and independent variables, correlation goes a step further to establish the strength of the relationship by calculating what is called the correlation coefficient. There are several types of correlation coefficients, but the best known and most widely used is Pearson's product moment correlation coefficient (after Karl Pearson), written r for a sample and ρ (rho) for the population. The coefficient measures the strength and direction of a linear relationship between two variables. The coefficient lies between 0 and 1 for positive correlation and between −1 and 0 for negative correlation, that is, 0 ≤ |r| ≤ 1. Perfect positive correlation is indicated by r = +1.00, perfect negative correlation by r = −1.00, and no relationship by r = 0.00.

For perfect negative correlation the scatter points form a straight line when plotted and the slope of the line is negative; for a negative correlation the line of best fit is negatively sloped. On the other hand, for perfect positive correlation the scatter points form a straight line when plotted and the slope of the line is positive; for a positive correlation the line of best fit is positively sloped.

NB. The sign of the correlation coefficient and the sign of the slope b of the regression line will always be the same. The regression line will always pass through the point (x̄, ȳ). When r is not significantly different from 0, the best predictor of y is the mean of the data values of y. For valid predictions, the value of the correlation coefficient must be significant. (See under Hypothesis Testing for a detailed account of the significance of the correlation coefficient.)

The starting point in correlation analysis is the construction of a scatter diagram. For negative correlation, the scatter points show a general negative slope. For zero correlation, the scatter points do not show a definite pattern; there is no unique relationship between individual x and y values. For positive correlation, the scatter points generally show a positive slope, and the line of best fit when plotted will be upward sloping. In perfect positive correlation r = +1, and the scatter points when plotted form a straight line, which is also the line of best fit. Drawing the scatter points will therefore show from the outset whether the relationship is positive or negative.

A scatter plot should be checked for outliers. An outlier is a point that seems out of place when compared with the other points. Some of these points can affect the equation of the regression line; when this happens, the points are called influential points or influential observations. An influential point tends to 'pull' the regression line toward the point itself. To check for an influential point, the regression line should be graphed with the point included in the data set, and then a second regression line should be graphed that excludes the point from the data set. If the position of the second line is changed considerably, the point is said to be an influential point. Points that are outliers in the x direction tend to be influential points. Researchers should use their judgement as to whether to include influential observations in the final analysis of the data. If the researcher feels the observation is unnecessary, it should be excluded; but if the researcher feels it is necessary, then he or she may want to obtain additional data values whose x values are near the x value of the influential point and include them in the study.
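The check described above can be carried out as in the following Python sketch. The data values are made up, with one deliberate outlier in the x direction, and the helper function is our own; the point is simply to compare the line fitted with and without the suspect observation.

```python
# Minimal sketch of the influential-point check: fit the regression line
# with and without a suspect point and compare the two fitted lines.

def fit_line(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
    a = sy / n - b * sx / n
    return a, b

x = [2, 3, 4, 5, 6, 20]      # last point is far out in the x direction
y = [60, 80, 70, 90, 100, 90]

a_all, b_all = fit_line(x, y)
a_red, b_red = fit_line(x[:-1], y[:-1])
print(f"with the point:    y = {a_all:.2f} + {b_all:.2f}x")
print(f"without the point: y = {a_red:.2f} + {b_red:.2f}x")
# If the two lines differ considerably, the excluded point is influential.
```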
[Figure: scatter diagrams illustrating negative correlation (−1 < r < 0), perfect negative correlation (r = −1), zero correlation (r = 0), positive correlation (r > 0) and perfect positive correlation (r = +1).]

Population correlation coefficient (ρ):

ρ = Corr(X, Y) = Cov(X, Y) / (σx σy) = E[(X − μx)(Y − μy)] / √(E[(X − μx)²] E[(Y − μy)²]) = (1/N) Σ[(Xi − μx)/σx][(Yi − μy)/σy]

Sample correlation coefficient:

r = Corr(X, Y) = Sx,y / (Sx Sy) = [1/(n − 1)] Σ(Xi − X̄)(Yi − Ȳ) / (Sx Sy)

or, writing x = X − X̄ and y = Y − Ȳ,

r = Σxy / √(Σx² Σy²)

or

r = Σ(X − X̄)(Y − Ȳ) / [(n − 1) sx sy]

or

r = [nΣXY − ΣX ΣY] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}

or

r = [ΣXY − nX̄Ȳ] / √{[ΣX² − nX̄²][ΣY² − nȲ²]}

(care should be taken here not to use approximations for the two means).

The advantages of correlation analysis
1. A high correlation coefficient indicates that there is common variation between the dependent variable and the independent variable. In other words, certain values of the dependent variable will tend to be associated with certain values of the independent variable. If the value of r (disregarding the sign) satisfies |r| ≥ 0.8, there is a very strong or high relationship between the variables. If 0.4 < |r| < 0.8, the relationship between the variables is considered moderate to high. For |r| < 0.4, the relationship between the variables is considered small to insignificant. More specifically, the relationship can be described as shown in the table below.

Size of r (disregarding the sign)   General Interpretation
0.9 to 1.00                         Very strong relationship
0.8 to 0.9                          Strong relationship
0.6 to 0.8                          Moderate relationship
0.2 to 0.6                          Weak relationship
0.0 to 0.2                          Very weak or no relationship
Source: Willemse, 2009: 119

2. The sign (+ or −) of the correlation coefficient indicates whether the relationship is positive or negative: + indicates a positive relationship between the variables, while − indicates a negative relationship. The correlation coefficient is therefore a summary measure that indicates the relative strength of a relationship between two variables and the direction of the relationship, although it does not describe the underlying relationship.

Table 38. Correlation coefficient worked out
Year     x     y      xy     x^2    y^2
2000     2     60     120    4      3600
2001     5     100    500    25     10000
2002     4     70     280    16     4900
2003     6     90     540    36     8100
2004     3     80     240    9      6400
Total    20    400    1680   90     33000

r = [nΣXY − ΣX ΣY] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
  = [5(1680) − 20(400)] / √{[5(90) − 20²][5(33000) − 400²]}
  = 400 / √(50 × 5000)
  = 400 / 500 = 0.8

This shows that there is a strong positive relationship between advertising and sales. NB: the rounding rule for the correlation coefficient r is that values should be rounded to three decimal places.
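The Table 38 calculation can be reproduced with the shortcut formula, and the result placed in the interpretation bands given above. The Python sketch below is a verification aid rather than part of the text.

```python
# Minimal sketch reproducing the Table 38 calculation with the shortcut
# (computational) formula for Pearson's r, then interpreting its size.
from math import sqrt

x = [2, 5, 4, 6, 3]          # advertising expenditure ($000)
y = [60, 100, 70, 90, 80]    # sales ($000)
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2, sy2 = sum(a * a for a in x), sum(b * b for b in y)

r = (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print(round(r, 3))           # 0.8

# Rough interpretation bands (disregarding the sign), per the table above.
bands = [(0.9, "very strong"), (0.8, "strong"), (0.6, "moderate"),
         (0.2, "weak"), (0.0, "very weak or none")]
label = next(name for cut, name in bands if abs(r) >= cut)
print(label)                 # strong
```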
Coefficient of Determination
This measures the proportion of the total variation in the dependent variable that is explained by the variation in the independent or explanatory variables in the regression. It is represented by r² and varies between 0 and 1. An r² value of 1.00 indicates that the regression equation "explains" 100 percent of the variation in the dependent variable about its mean. An r² value in the 0.5 to 1.00 range is usually interpreted to mean that the regression equation does a good job of explaining the variation in Y.

r² = [(total variance in the dependent variable) − (variance "unexplained" by the regression equation)] / (total variance in the dependent variable)

We can also look at r² as the explained variation divided by the total variation:

r² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)² = Explained Variation of Y / Total Variation of Y

From our regression example:

Year     Adv exp (X)   Sales (Y)   Ŷ     Ȳ     (Ŷ − Ȳ)²   (Y − Ȳ)²
2000     2             60          64    80    256         400
2001     5             100         88    80    64          400
2002     4             70          80    80    0           100
2003     6             90          96    80    256         100
2004     3             80          72    80    64          0
Total    20            400                     640         1000

Our regression equation was Ŷ = 48 + 8X. Based on the data in the table above and the formula r² = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)², we get r² = 640/1000 = 0.64.

Based on the correlation formula for r in the advertising and sales example, r = 0.8, so the coefficient of determination is r² = 0.64. This is the same result as calculated from the table above. This means that 64% of the variation in sales is due to advertising expenditure, while the remaining 36% is due to other factors such as changes in income, population, etc. This value of 36% or 0.36 is also called the coefficient of nondetermination and is found by subtracting the coefficient of determination from 1. As the value of r approaches 0, r² decreases more rapidly.

Coefficient of nondetermination = 1 − r²

Spearman's Rank Correlation Coefficient (rs)
This provides a measure of correlation between ranks. Rank correlation is in fact an approximation to the product moment coefficient. Rank correlation methods can be used to measure the correlation between any pair of variables. Rank correlation is a quick and easy technique, especially when the bivariate data are difficult or expensive to obtain physically and yet can be ranked in size order. Also, if one or both of the variables involved is non-numeric, rank correlation can be used. The formula for rs is:

rs = 1 − 6Σdi² / [n(n² − 1)]

where di = ui − vi (the difference in the ranks of the ith observation in samples 1 and 2) and n is the number of bivariate pairs. The value of rs always falls between −1 and +1, with +1 indicating perfect positive correlation and −1 indicating perfect negative correlation.

Procedure
The measurements associated with each variable are ranked separately. Ties receive the average of the ranks of the tied observations; in other words, the ranks that would have been allocated separately are averaged and this average rank is given to each item with that equal value. In the ranking procedure, assign rank 1 to the smallest value and the highest rank to the largest value. Using the advertising expenditure and sales figures, we can apply rank correlation as below.

Table 39. Rank correlation workings
Year     x    y      rx    ry    d = rx − ry   d²
2000     2    60     1     1     0             0
2001     5    100    4     5     −1            1
2002     4    70     3     2     1             1
2003     6    90     5     4     1             1
2004     3    80     2     3     −1            1
Total                                          4

rs = 1 − 6(4)/[5(25 − 1)] = 1 − 24/120 = 1 − 0.2 = 0.8

This result is the same as the result found using the formula for r. Suppose we have the following table:

Table 39. Rank correlation calculations
x    y      rx    ry     d = rx − ry   d²
2    60     1     1      0             0
5    70     4     2.5    1.5           2.25
4    80     3     4      −1            1
6    70     5     2.5    2.5           6.25
3    90     2     5      −3            9
7    100    6     6      0             0
Total                                  18.5

rs = 1 − 6(18.5)/[6(36 − 1)] = 1 − 111/210 = 1 − 0.53 = 0.47

This result gives a positive correlation coefficient of 0.47, which indicates a fairly weak relationship.
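Spearman's coefficient, including the averaging of ranks for ties, is straightforward to compute. The Python sketch below is a verification aid with our own helper functions; it reproduces the second worked example above, giving approximately 0.47.

```python
# Minimal sketch of Spearman's rank correlation with average ranks for ties,
# reproducing the second worked example above (expected result about 0.47).

def ranks(values):
    """Assign rank 1 to the smallest value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j (1-based)
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [2, 5, 4, 6, 3, 7]
y = [60, 70, 80, 70, 90, 100]
print(round(spearman(x, y), 2))        # about 0.47
```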
Importance of correlation and regression
Even though correlation or regression may have established that two variables move together, no claim can be made that this necessarily indicates cause and effect. For example, if the correlation between teachers' salaries and the consumption of liquor over a period of years turned out to be 0.9, this would not prove that teachers drink, nor would it prove that liquor sales increase teachers' salaries. Both variables move together because they are influenced by a third variable, namely long-run growth in national income and population. Although correlation and regression cannot be used as proof of cause and effect, they can estimate a relation that theory already tells us exists. Secondly, they are often helpful in suggesting causal relations that were not previously suspected; for example, when cigarette smoking was found to be highly correlated with lung cancer, possible causal links between the two were investigated.

Correlation and Causation
There are several possible relationships between variables:
1. There is a direct cause-and-effect relationship between the variables, that is, x causes y. For example, poison causes death, or heat causes ice to melt.
2. There is a reverse cause-and-effect relationship between the variables, that is, y causes x. For example, the researcher may believe that excessive coffee consumption causes nervousness when the reverse is true.
3. The relationship between the variables may be caused by a third variable. For example, a researcher may have correlated the number of deaths due to drowning with the number of cans of soft drink consumed daily during summer. Both variables are probably related to a third, high temperatures in summer, which cause people to drink a lot of soft drinks and, at the same time, lead more people to swim.
4. There may be a complexity of interrelationships among many variables. For example, there may be a significant relationship between students' high school grades and college grades, but other variables such as IQ, hours of study, influence of parents, motivation, age and instructors are also involved.
5. The relationship may be coincidental. For example, a researcher may be able to find a significant relationship between the increase in the number of people who are exercising and the increase in the number of people who are committing crimes.
Therefore, from all this, what can be said is that correlation does not necessarily imply causation.

Question
It is frequently conjectured that income is one of the primary determinants of social status for an adult male. To investigate this theory, 15 adult males are chosen at random from a community comprising primarily professional people, and their social status scores and annual gross incomes are noted. The social status scores (higher scores correspond to higher social status) and gross incomes (in $000) for the 15 adult males are given in the following table.

Subject   Social Status   Income
1         92              29.9
2         51              18.7
3         88              32.0
4         65              15.0
5         80              26.0
6         31              9.0
7         38              11.3
8         75              22.1
9         45              16.0
10        72              25.0
11        53              17.2
12        43              9.7
13        87              20.1
14        30              15.5
15        74              16.5

Compute Spearman's rank correlation coefficient for these data.