From Univariate to Multivariate
• Up until this point in the course, we have studied statistical methods (both descriptive and inferential) that are designed to be applied to a single kind of data:
• Our descriptive statistics have served to summarize a sample or population of values made up of a single kind of measurement (e.g. daily rainfall totals on several successive days, or population densities in different places), reducing the data volume and providing a more concise description
• Our inferential tests have served to determine whether a sample of the conditions at one place or time was significantly different from another (e.g. did it rain more than usual in a certain place?)
David Tenenbaum – GEOG 090 – UNC-CH Spring 2005
• For a descriptive statistic to meaningfully describe a sample, all the values have to be the same sort of measurement
• Likewise, an inferential test designed to check whether a sample is significantly different from a population only makes sense if the values describing both the sample and the population are the same sort of measurement
• The approaches we have looked at up to this point have been univariate approaches, designed solely to consider data values that all describe the same sort of information (e.g.
weighing each apple in a bushel of apples and calculating a mean weight, or comparing the mean weight of this bushel of apples to that of another bushel of apples)
• Clearly, it would not make much sense to calculate the mean weight of a bushel made up of a mix of apples and oranges, or to compare the mean weight of a bushel of apples to the mean weight of a bushel of oranges:
• The univariate methods we have studied thus far can only be used effectively on interval or ratio measurements of a reasonably uniform set of things
• However, in geography we are interested in quantitative approaches that do allow us to examine the relationships between different sorts of measurements or patterns of phenomena: perhaps not apples and oranges, but descriptors of geographic phenomena that might have some kind of relationship

Bivariate Methods
• The first few quantitative methods that we are going to study allow us to examine the relationship between two variables, and these are known as bivariate methods
• The aim of these approaches is to describe the relationship between two variables measured for a common sample of individual observations, e.g. asking the same set of people two different survey questions, measuring two different physical variables (temperature and moisture condition) at the same set of locations on the same date, or measuring two different physical variables on several occasions at the same location. In general, the individuals in the sample must be the same when two different quantities are being sampled
• We usually begin with the idea that there is some relationship between two variables, e.g.
we might expect that there is a relationship between level of education and income, and to examine this idea, we would then procure a data set that describes these two variables for the same set of individuals
• Our aim is to determine the nature of the relationship by examining whether the variation in the values of one variable is related to the variation in the values of the other (i.e. as education ↑, what does income do?)
• Before we calculate any quantitative descriptions of the relationship, we can begin by plotting values of one variable against those of the other

Positive and Negative Relationships
Positive – As values of one variable increase, the values of the other increase too
Negative (a.k.a. inverse) – As values of one variable increase, the values of the other decrease
[Figure: example scatterplots of positive and negative relationships. Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 209.]
• The examples on the previous slide show somewhat ideal situations, where the scatterplot’s form is that of a set of points that can be more or less plotted along a straight line
• Often, what we see with real data is a little less ideal: scatterplots usually deviate from that straight line of points to some degree, even when there is an observable linear relationship between two variables:
• The scatter of points may be curved (not so linear)
• The points might be more spread out, forming an ellipse of points, or even a circular cloud of points where no relationship can be readily discerned
[Figure: scatterplots of Theta versus TMI for Pond Branch (PG 11.25m DEM) and Glyndon (LIDAR 0.5m DEM, 11x11) under three conditions: Wet – May 29/30, Avg. – June 26/28, and Dry – August 22. Panel R2 values range from 0.10 to 0.79.]

Theta-TVDI Scatterplots
[Figure: field sampled volumetric soil moisture versus TVDI from a 3x3 kernel, one panel each for Glyndon, Pond Branch, and Oxford Tobacco Research Station.]
• Field sampled soil moisture is a measure of wetness, while remotely sensed TVDI (Temperature Vegetation Dryness Index) is a measure of dryness, thus they have a negative or inverse relationship

API-TVDI Scatterplot
• In this scatterplot, an antecedent precipitation index (API) derived from radar rainfall observations is plotted against TVDI values from satellite remote sensing
• There is clearly some sort of inverse relationship between the two (API expressing wetness and TVDI expressing dryness), but is it linear?

Covariance
• While interpreting scatterplots of a pair of variables can give us a general sense of the direction and strength of a relationship between two variables, this interpretation is a subjective judgment of the relationship
• A more objective approach to characterizing the strength and direction of a relationship between two variables is to compute a statistic that expresses the extent to which variables X and Y vary together about their mean values
• This is known as covariance, and it extends the concept of variance by calculating the average value of the product of the two variables’ respective deviations from their means

Covariance Formulae
• The covariance of variable X with respect to variable Y can be calculated using the following formula:
$$\mathrm{Cov}[X, Y] = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i y_i - \bar{x}\bar{y} \right)$$
• The formula for covariance can be expressed in many ways. The following equation is an equivalent expression of covariance (due to the distributive property):
$$\mathrm{Cov}[X, Y] = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Covariance Example
• One of the data sets you saw earlier in this lecture was a set of 10 values of TVDI for remotely sensed pixels containing the Glyndon catchment in Baltimore County, and accompanying soil moisture measurements taken in the catchment on matching dates:
[Figure: Glyndon field sampled volumetric soil moisture versus TVDI from a 3x3 kernel.]
TVDI     Soil Moisture
0.274    0.414
0.542    0.359
0.419    0.396
0.286    0.458
0.374    0.350
0.489    0.357
0.623    0.255
0.506    0.189
0.768    0.171
0.725    0.119
• To calculate the covariance of these variables, we will:
1. Calculate the mean of each variable ($\bar{x}$, $\bar{y}$)
2. Calculate the statistical distances between each value and its respective mean ($x_i - \bar{x}$, $y_i - \bar{y}$)
3. Multiply the statistical distances: $(x_i - \bar{x})(y_i - \bar{y})$
4. Calculate the sum of the products of the statistical distances: $\sum (x_i - \bar{x})(y_i - \bar{y})$
5. Divide the sum by (n - 1)
• For this calculation, I’ll illustrate using a series of Excel worksheet operations
                     TVDI (x)   Soil Moisture (y)   (x - xbar)   (y - ybar)   (x - xbar) * (y - ybar)
                     0.274      0.414               -0.227        0.107       -0.024305
                     0.542      0.359                0.042        0.052        0.002162
                     0.419      0.396               -0.082        0.090       -0.007323
                     0.286      0.458               -0.215        0.151       -0.032424
                     0.374      0.350               -0.127        0.044       -0.005553
                     0.489      0.357               -0.011        0.050       -0.000566
                     0.623      0.255                0.122       -0.052       -0.006374
                     0.506      0.189                0.005       -0.118       -0.000618
                     0.768      0.171                0.267       -0.136       -0.036282
                     0.725      0.119                0.225       -0.188       -0.042289
Mean (step 1)        0.501      0.307
Sum (step 4)                                                                  -0.15357
Covariance (step 5)                                                           -0.017063
(The deviation columns are step 2 and the product column is step 3.)

How Does Covariance Work?
• If X and Y are positively related, we can expect that a value $x_i > \bar{x}$ would have a corresponding value $y_i > \bar{y}$, and that a value $x_i < \bar{x}$ would have a corresponding value $y_i < \bar{y}$
• Since Cov[X, Y] takes the product of the differences, + * + = + contribution to the Cov[X, Y], and - * - = +, so that situation would also make a positive contribution to the Cov[X, Y]
• If X and Y are negatively related, we can expect that a value $x_i > \bar{x}$ would have a corresponding value $y_i < \bar{y}$, and that a value $x_i < \bar{x}$ would have a corresponding value $y_i > \bar{y}$
• Since Cov[X, Y] takes the product of the differences, + * - = - contribution to the Cov[X, Y], and - * + = -, so that situation would also make a negative contribution to the Cov[X, Y]

Interpreting Covariances
• Covariance does tell us the direction of the relationship and the magnitude of the relationship:
• The covariance will be positive if most of the points (x,y) lie along a line with a positive
slope, suggesting a positive relationship
• The covariance will be negative if most of the points (x,y) lie along a line with a negative slope, suggesting an inverse relationship
• A larger covariance (in absolute value) indicates a stronger relationship, but the magnitude is also a function of the units of measurement of the values

Covariance → Correlation
• Because the magnitude of covariance is a function of the units of measurement of the values, and covariance is calculated using two variables that likely have different units of measurement, getting a sense of the magnitude of the value is not straightforward
• Furthermore, we might want to compare the strength of relationships between multiple pairs of variables that do not have the same units of measurement; the magnitudes of their covariances would likely not be comparable
• For this reason, we need a measure of covariance that is standardized such that the units of measurement of the values are not important and we can compare one such measure to another

Pearson’s Correlation Coefficient
• A standardized measure of covariance provides a value that describes the degree to which two variables correlate with one another, expressing this using a value ranging from –1 to +1, where –1 denotes a perfect inverse relationship, +1 denotes a perfect positive relationship, and 0 denotes no linear relationship
• One such measure is known as Pearson’s Correlation Coefficient (a.k.a.
Pearson’s Product Moment), and it is produced by standardizing the covariance, dividing it by the product of the standard deviations of the X and Y variables:
$$r = \frac{\mathrm{Cov}[X, Y]}{s_X s_Y}$$
• As is the case with covariance, the correlation coefficient can be expressed in several equivalent ways:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_X s_Y}$$
• It can also be expressed in terms of z-scores, which is convenient if you have already calculated them:
$$r = \frac{\sum_{i=1}^{n} z_{x_i} z_{y_i}}{n - 1}$$
• The correlation coefficient will have a value ranging from –1 to +1, where –1 denotes a perfect inverse relationship and +1 denotes a perfect positive relationship
• The strength of the relationship is expressed by the size (absolute value) of the coefficient, with larger values for stronger relationships
• NOTE: Correlation coefficients cannot be interpreted proportionally, i.e.
r = 0.6 is not twice as strong as r = 0.3
• Here are some ranges for interpreting the magnitude of r values:
0 - 0.2      Negligible relationship
0.2 - 0.4    Weak relationship, little association
0.4 - 0.6    Moderate relationship / association
0.6 - 0.8    Strong relationship, high association
0.8 - 1.0    Very strong relationship

Pearson’s Correlation Coefficient Example
• We can take the covariance we calculated earlier and convert it to a correlation coefficient by dividing the covariance by the product of the standard deviations
• We calculated Cov[TVDI, Soil Moisture] = -0.017063
• $s_{\text{TVDI}}$ = 0.169718, $s_{\text{Soil Moisture}}$ = 0.115347
• Using the formula:
$$r = \frac{\mathrm{Cov}[X, Y]}{s_X s_Y} = \frac{-0.017063}{0.169718 \times 0.115347} = -0.871$$

Pearson’s r - Assumptions
• To properly apply Pearson’s Correlation Coefficient, we first have to make sure that the following assumptions are satisfied:
1. The values need to be either interval or ratio scale data (later we will examine a different correlation method for ordinal data)
2. The (x,y) data pairs are selected randomly from a population of values of X and Y
3. The relationship between X and Y is linear (which can be qualitatively assessed by looking at the scatterplot)
4. The variables X and Y must share a joint bivariate normal distribution (which we tend to assume when sampling from a population)

Interpreting Correlation Coefficients
• Even if we find a strong correlation between two variables, we cannot use that information to demonstrate a causal relationship between the two variables → Correlation is not the same as causation! (Correlation suggests an association between variables)
• There are a number of reasons why we might find a strong correlation between X and Y that are separate from the situation where the independent variable X is directly influencing dependent variable Y:
1. Both X and Y are actually dependent variables that are being influenced by a further variable Z, i.e. Z → X and Z → Y
2. There is a causative chain between X and Y, but it is mediated by intermediate processes or links, i.e. X → A → B → Y, e.g. rainfall → soil moisture → ground water → runoff
3. There is a mutual relationship between X and Y such that each is influencing the other, e.g. the relationship between income and social status
4. We might find a spurious relationship between two variables because each basically expresses the same information, i.e. if Y = kX, then X and Y will of course be strongly correlated
5. A true causal relationship exists, the situation we would prefer to find, where X → Y
6. There is the possibility that a strong correlation can be found as a result of chance → since this statistic is based upon samples, there is the unlikely but present possibility that an extremely unusual sample was chosen
7. There may be a few outliers present in the sampled data that have a large, undue influence on the resulting correlation
8. The strength of the correlation may be a result of a lack of independence among observations. This is particularly problematic with geographic data, where samples are drawn from many adjacent locations that have similar values in close proximity (spatial autocorrelation)
8. (cont.) A related issue, when the data used for correlation are drawn from spatial sample units, is ecological correlation: any two variables will have different correlation values depending on the geographic scale at which they are sampled. Generally, correlation increases as we look at larger scale data (i.e.
smaller sample units)
• If we compare rainfall versus runoff in North Carolina and the USA, we would expect to find a higher r value in NC because the dependent variable (runoff in this case) is subject to greater random variability at the scale of the nation. We cannot compare correlations from different scales!

A Significance Test for r
• Pearson’s Correlation Coefficient is, like many statistics, an estimator. Using a sample, it produces an estimate (r) that is meant to approximate the population correlation coefficient (ρ)
• We can formulate a test of significance to assess (using our sample data and the r we calculated from it) whether or not the population correlation coefficient is equal to zero (indicating no relationship between X and Y). If the test statistic exceeds the critical value, we reject this null hypothesis and accept the alternative, where ρ is not equal to zero and there is some relationship between X and Y
• We will use another variant of the t-test for this purpose
• When ρ = 0, the ratio of r to its standard error follows a t-distribution with (n - 2) degrees of freedom, and we can estimate the standard error of r using:
$$SE_r = \sqrt{\frac{1 - r^2}{n - 2}}$$
• The test statistic takes the form of the correlation coefficient divided by the standard error, thus:
$$t_{test} = \frac{r}{SE_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
• We will use this test in a 2-tailed fashion to assess whether or not the population correlation coefficient is equal to zero (no relationship) or not equal to zero (some relationship):
H0: ρ = 0
HA: ρ ≠ 0
• Examining the test statistic, we can see that its value is purely a function of the correlation coefficient (r) and sample size (n):
$$t_{test} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
• A given r may or may not be significant depending on the size of the sample!
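Because the test statistic depends only on r and n, the formula can also be solved for r to ask how large |r| must be before it counts as significant at a given sample size. The Python sketch below is illustrative only: the function names `t_statistic` and `critical_r` are hypothetical, and the critical value 2.306 (two-tailed α = 0.05, df = 8) is assumed to come from a standard t-table rather than being computed.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """Test statistic for H0: rho = 0, t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def critical_r(n: int, t_crit: float) -> float:
    """Smallest |r| whose t statistic reaches t_crit (the formula solved for r)."""
    return t_crit / math.sqrt(t_crit * t_crit + (n - 2))


# Assumed table value: two-tailed alpha = 0.05 with df = 10 - 2 = 8.
T_CRIT_DF8 = 2.306

t_small_sample = t_statistic(0.5, 10)
r_needed = critical_r(10, T_CRIT_DF8)
print(f"t for r = 0.5, n = 10: {t_small_sample:.3f}")
print(f"|r| needed for significance at n = 10: {r_needed:.3f}")
```

With only 10 pairs, |r| has to reach roughly 0.63 before the null hypothesis can be rejected, so a moderate r = 0.5 falls short; supplying the critical value for a larger df to `critical_r` lowers that threshold.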
• For example, suppose we have three samples of variables X and Y of sizes n = 10, n = 100, and n = 1000, all with r = 0.5
• We decide to test the significance of these r values using α = 0.05, and we look up the appropriate critical values from a t-distribution table with the appropriate degrees of freedom for each of the three samples:
1. n = 10, df = 8, $t_{crit}$ = 2.306, $t_{test} = \frac{0.5\sqrt{10 - 2}}{\sqrt{1 - 0.5^2}} = 1.63$
2. n = 100, df = 98, $t_{crit}$ = 1.98, $t_{test} = \frac{0.5\sqrt{100 - 2}}{\sqrt{1 - 0.5^2}} = 5.72$
3. n = 1000, df = 998, $t_{crit}$ = 1.96, $t_{test} = \frac{0.5\sqrt{1000 - 2}}{\sqrt{1 - 0.5^2}} = 18.24$
• With r = 0.5, the relationship is therefore not significant at n = 10 (1.63 < 2.306), but it is significant at n = 100 and n = 1000
• As is the case with the other statistical tests we have studied that include sample size in the standard error formulation, sample size is important when it comes to evaluating the significance of a correlation coefficient
• It is not enough to simply find a large r value; if it is derived from a very small sample, this may not tell you much about a relationship between the variables
• On the other hand, once samples become larger than a certain size, the shapes of the appropriate t-distributions change very little (i.e. the $t_{crit}$ value for α = 0.05, 2-tailed stabilizes somewhere around $t_{crit}$ ~ 2), so it is not necessary or useful to collect a sample of a huge size

Hypothesis Testing - Significance of r t-test Example
• Research question: Is there a significant relationship between TVDI and soil moisture in the Glyndon data set?
1. H0: ρ = 0 (No significant relationship)
2. HA: ρ ≠ 0 (Some relationship)
3. Select α = 0.05, two-tailed because of how the alternate hypothesis is formulated
4. In order to compute the t-test statistic, we need to first calculate the Pearson Product Moment Correlation Coefficient.
We have done so earlier in this lecture, finding r = -0.871, a very strong inverse relationship between remotely sensed TVDI and field measurements of soil moisture. We calculate the test statistic using:
$$t_{test} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{-0.871\sqrt{10 - 2}}{\sqrt{1 - (-0.871)^2}} = -5.01$$
5. We now need to find the critical t-score, first calculating the degrees of freedom: df = (n - 2) = (10 - 2) = 8. We can now look up the $t_{crit}$ value for our α (0.025 in each tail) and df = 8: $t_{crit}$ = 2.306
6. $|t_{test}| > |t_{crit}|$ (5.01 > 2.306), therefore we reject H0 and accept HA, finding that there is a significant relationship (i.e. the population correlation coefficient ρ, which we have estimated using the sample correlation coefficient r, is not equal to 0)
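The whole Glyndon calculation, from covariance through the significance test, can be sketched in a few lines of Python. This is an illustrative sketch, not the Excel worksheet used in the lecture: the function name `pearson_summary` is hypothetical, the data are the ten rounded pairs printed in the lecture table, and the critical value 2.306 is taken from the text rather than computed. Because the printed values are rounded, the results should agree with the lecture's figures only to about three decimal places.

```python
import math

# The ten (TVDI, soil moisture) pairs as printed in the lecture table (rounded).
tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
soil_moisture = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]


def pearson_summary(x, y):
    """Return (covariance, r, t) using the n - 1 formulas from the lecture."""
    n = len(x)
    x_bar = sum(x) / n  # step 1: means
    y_bar = sum(y) / n
    # Steps 2-4: deviations from the means, their products, and the sum.
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    cov = sum_products / (n - 1)  # step 5
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    r = cov / (s_x * s_y)  # covariance standardized by both standard deviations
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)  # test statistic for rho = 0
    return cov, r, t


cov, r, t = pearson_summary(tvdi, soil_moisture)
T_CRIT = 2.306  # two-tailed alpha = 0.05, df = 8, as looked up in the text
reject_h0 = abs(t) > T_CRIT
print(f"Cov = {cov:.6f}, r = {r:.3f}, t = {t:.2f}, reject H0: {reject_h0}")
```

Keeping the sample standard deviations on the same n - 1 basis as the covariance matters here: mixing an n - 1 covariance with population (divide by n) standard deviations would inflate r slightly.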