Covariance and the Pearson Correlation Coefficient

From Univariate to Multivariate
• Up until this point in the course, we have studied
statistical methods (both descriptive and inferential) that
have been designed to be applied to a single kind of data:
• Our descriptive statistics have served to summarize a
sample or population of values that are made up of a
single kind of measurement (e.g. daily rainfall totals
on several successive days or population densities in
different places), reducing the data volume and providing
a more concise description
• Our inferential tests have served to determine whether a
sample of the conditions at one place or time was
significantly different from another (e.g. did it rain
more than usual in a certain place)
David Tenenbaum – GEOG 090 – UNC-CH Spring 2005
From Univariate to Multivariate
• For a descriptive stat. to meaningfully describe a sample,
all the values have to be the same sort of measurements
• Likewise, an inferential test that is designed to check if a
sample is significantly different from a population only
makes sense if the values describing both the sample and
the population are the same sorts of measurements
• The approaches we have looked at up to this point have
been univariate approaches, and they are designed solely
to consider data values that are all descriptive of the same
sorts of information (i.e. weighing each apple in a bushel
of apples and calculating a mean weight, or comparing
the mean weight of this bushel of apples to another
bushel of apples)
From Univariate to Multivariate
• Clearly, it would not make much sense to calculate the
mean weight of a bushel made up of a mix of apples and
oranges, or to compare the mean weight of a bushel of
apples to the mean weight of a bushel of oranges:
• The univariate methods we have studied thus far really
can only be used effectively on interval or ratio
measurements of a reasonably uniform set of things
• However, in geography we are interested in using
quantitative approaches that do allow us to examine the
relationships between different sorts of measurements /
patterns of phenomena … perhaps not apples and
oranges, but instead descriptors of geographic phenomena
that might have some kind of relationship
Bivariate Methods
• The first few quantitative methods that we are going to
study allow us to examine the relationship between
two variables, and these are known as bivariate methods
• The aim of these approaches is to describe the relationship
between two variables measured for a common sample of
individual observations, i.e. asking the same set of
people two different survey questions, measuring two
different physical variables (temperature and moisture
condition) at the same set of locations on the same date, or
measuring two different physical variables on several
occasions at the same location … in general the
individuals in the sample must be the same when two
different quantities are being sampled
Bivariate Methods
• We usually begin with the idea that there is some
relationship between two variables, e.g. we might
expect that there is a relationship between level of
education and income, and to examine this idea, we would
then procure a data set that describes these two variables
for the same set of individuals
• Our aim is to determine the nature of the relationship by
examining whether the variation in the values of one
variable is related to the variation in the values of another
variable (i.e. as education ↑, what does income do?)
• Before we calculate any quantitative descriptions of the
relationship, we can begin by plotting values of one
variable against those of the other
Positive and Negative Relationships
Positive – As values of one
variable increase, the values of
the other increase too
Negative (a.k.a. inverse) – As
values of one variable increase,
the values of the other decrease
Positive and Negative Relationships
Source: Earickson, RJ, and Harlin, JM. 1994. Geographic
Measurement and Quantitative Analysis. USA: Macmillan
College Publishing Co., p. 209.
Positive and Negative Relationships
•The examples on the previous slide show somewhat ideal
situations, where the scatterplot’s form is that of a set of
points that can be more or less plotted along a straight line
•Often, what we see with real data is a little less ideal:
Scatterplots usually deviate from that straight line of points
to some degree or another, even when there is a linear,
observable relationship between two variables:
•The scatter of points may be curved (not so linear)
•The points might be more spread out, forming an
ellipse of points, or even a circular cloud of points
where no relationship can be readily discerned
[Figure: scatterplots of Theta versus TMI for two sites, Glyndon (LIDAR 0.5m DEM, 11x11) and Pond Branch (PG 11.25m DEM), under three conditions: Wet – May 29/30, Avg. – June 26/28, and Dry – August 22. The R2 values shown on the panels range from 0.10 to 0.79.]
Theta-TVDI Scatterplots
[Figure: three scatterplots of field sampled volumetric soil moisture (y-axis) versus TVDI from a 3x3 kernel (x-axis), for Glyndon, Pond Branch, and Oxford Tobacco Research Station]
•Field sampled soil moisture is
a measure of wetness, remotely
sensed TVDI (Temperature
Vegetation Dryness Index) is a
measure of dryness, thus they
have a negative or inverse
relationship
API-TVDI Scatterplot
•In this scatterplot, an
antecedent precipitation
index (API) derived from
radar rainfall observations is
plotted against TVDI values
from satellite remote sensing
•There is clearly some sort
of inverse relationship
between the two (API
expressing wetness and
TVDI expressing dryness),
but is it linear?
Covariance
•While interpreting scatterplots of a pair of variables can
give us a general sense of the direction and strength of a
relationship between two variables, this interpretation is a
subjective judgment of the relationship
• A more objective approach that allows us to characterize
the strength and direction of a relationship between two
variables uses the computation of a statistic that expresses
the extent to which variables Y and X vary together
about their mean values
•This is known as covariance, and it extends the concept of
variance by calculating the average value of the product of
the two variables’ respective deviations from their means
Covariance Formulae
•The covariance of variable X with respect to variable Y
can be calculated using the following formula:
\[ \mathrm{Cov}[X, Y] = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i y_i - \bar{x}\,\bar{y} \right) \]
•The formula for covariance can be expressed in many
ways. The following equation is an equivalent expression
of covariance (due to the distributive property):
\[ \mathrm{Cov}[X, Y] = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
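The equivalence of the two covariance expressions can be checked numerically. The following is a minimal Python sketch, not part of the original lecture: the function names and the four data pairs are made up for illustration, and the first expression is read with the term x̄ȳ inside the summation.

```python
def cov_deviation_form(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def cov_product_form(xs, ys):
    """Sample covariance: sum of (x_i * y_i - x_bar * y_bar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum(x * y - x_bar * y_bar for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical values chosen only to illustrate the identity
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]

print(cov_deviation_form(xs, ys))
print(cov_product_form(xs, ys))
```

Both calls should print the same value (11/3 for these numbers), because expanding the product of deviations and summing leaves Σ xi yi minus n times x̄ȳ in either form.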
Covariance Example
•One of the data sets you saw earlier in this lecture was a
set of 10 values of TVDI for remotely sensed pixels
containing the Glyndon catchment in Baltimore County,
and accompanying soil moisture measurements taken in
the catchment on matching dates:
[Figure: Glyndon field sampled volumetric soil moisture versus TVDI from a 3x3 kernel]

TVDI     Soil Moisture
0.274    0.414
0.542    0.359
0.419    0.396
0.286    0.458
0.374    0.350
0.489    0.357
0.623    0.255
0.506    0.189
0.768    0.171
0.725    0.119
Covariance Example
• To calculate the covariance of these variables, we will:
1. Calculate the mean of each variable (x̄, ȳ)
2. Calculate the statistical distances between each value
and its respective mean (xi - x̄, yi - ȳ)
3. Multiply the statistical distances (xi - x̄) * (yi - ȳ)
4. Calculate the sum of the products of the statistical
distances Σ (xi - x̄) * (yi - ȳ)
5. Divide the sum by (n - 1)
• For this calculation, I’ll illustrate using a series of Excel
worksheet operations
Covariance Example
(Step numbers in parentheses mark where each step of the calculation appears in the worksheet.)

              TVDI (x)   Soil Moisture (y)   (x - xbar)   (y - ybar)   (x - xbar) * (y - ybar)
                                             (step 2)     (step 2)     (step 3)
              0.274      0.414               -0.227        0.107       -0.024305
              0.542      0.359                0.042        0.052        0.0021624
              0.419      0.396               -0.082        0.090       -0.007323
              0.286      0.458               -0.215        0.151       -0.032424
              0.374      0.350               -0.127        0.044       -0.005553
              0.489      0.357               -0.011        0.050       -0.000566
              0.623      0.255                0.122       -0.052       -0.006374
              0.506      0.189                0.005       -0.118       -0.000618
              0.768      0.171                0.267       -0.136       -0.036282
              0.725      0.119                0.225       -0.188       -0.042289

Mean (step 1)        0.501      0.307
Sum (step 4)                                                           -0.15357
Covariance (step 5)                                                    -0.017063
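The five steps of the covariance calculation can also be sketched in Python instead of a worksheet. This is an illustrative sketch, not the lecturer's own procedure; the variable names are assumptions, and the input is the ten TVDI and soil moisture pairs as printed to three decimals, so the results agree with the worksheet values only to about four decimal places.

```python
# Glyndon data as printed in the lecture (rounded to three decimals)
tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
soil = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]
n = len(tvdi)

# Step 1: mean of each variable
x_bar = sum(tvdi) / n
y_bar = sum(soil) / n

# Step 2: deviations of each value from its mean
dx = [x - x_bar for x in tvdi]
dy = [y - y_bar for y in soil]

# Step 3: products of the paired deviations
products = [a * b for a, b in zip(dx, dy)]

# Step 4: sum of the products
total = sum(products)

# Step 5: divide by (n - 1)
cov = total / (n - 1)

print(f"means: {x_bar:.3f}, {y_bar:.3f}")
print(f"sum of products: {total:.5f}")
print(f"covariance: {cov:.6f}")
```

Small differences from the worksheet's sum (-0.15357) and covariance (-0.017063) are expected, since the worksheet presumably carried more decimal places than the printed table shows.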
How Does Covariance Work?
•If X and Y are positively related, we can expect that a value xi > x̄
would have a corresponding value yi > ȳ, and if we have another
value xi < x̄, it would have a corresponding value yi < ȳ
•Since Cov[X, Y] takes the product of the differences, + * + =
+ contribution to the Cov[X, Y], and - * - = +, so that situation
would also make a positive contribution to the Cov[X, Y]
•If X and Y are negatively related, we can expect that a value xi >
x̄ would have a corresponding value yi < ȳ, and if we have another
value xi < x̄, it would have a corresponding value yi > ȳ
•Since Cov[X, Y] takes the product of the differences, + * - =
- contribution to the Cov[X, Y], and - * + = -, so that situation
would also make a negative contribution to the Cov[X, Y]
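The sign argument above can be seen directly in the Glyndon data. The sketch below (variable names are assumptions, data as printed in the lecture) counts how many (x, y) pairs make a positive contribution to the covariance and how many make a negative one.

```python
tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
soil = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]

x_bar = sum(tvdi) / len(tvdi)
y_bar = sum(soil) / len(soil)

positive = 0  # pairs on the same side of both means (+ * + or - * -)
negative = 0  # pairs on opposite sides of the means (+ * - or - * +)
for x, y in zip(tvdi, soil):
    product = (x - x_bar) * (y - y_bar)
    if product > 0:
        positive += 1
    elif product < 0:
        negative += 1

print(f"positive contributions: {positive}, negative contributions: {negative}")
```

Nine of the ten pairs fall on opposite sides of their means, which is why the covariance for this inverse relationship comes out negative.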
Interpreting Covariances
•Covariance does tell us the direction of the
relationship and the magnitude of the relationship:
•The covariance will be positive if most of the
points (x,y) lie along a line with a positive slope,
suggesting a positive relationship
•The covariance will be negative if most of the
points (x,y) lie along a line with a negative
slope, suggesting an inverse relationship
•A covariance that is larger in absolute value indicates a
stronger relationship, but the magnitude is a function of
the units of measurement of the values
Covariance → Correlation
•Because the magnitude of covariance is a function of the
units of measurement of the values, and covariance is
calculated using two variables that likely have different
units of measurement, getting a sense of the magnitude of
the value is not a straightforward problem
•Furthermore, we might want to compare the strength of
relationships between multiple pairs of variables that do
not have the same units of measurement; their covariances’
magnitudes would likely not be comparable
•For this reason, we need a measure of covariance that is
standardized such that the units of measurement of the
values are not important and we can compare one such
measure to another
Pearson’s Correlation Coefficient
•A standardized measure of covariance provides a value
that describes the degree to which two variables correlate
with one another, expressing this using a value ranging
from –1 to +1, where –1 denotes a perfect inverse relationship
and +1 denotes a perfect positive relationship
•One such measure is known as Pearson’s Correlation
Coefficient (a.k.a. Pearson’s Product Moment), and it is
produced by standardizing the covariance, dividing
it by the product of the standard deviations of the X and Y
variables:
\[ r = \frac{\mathrm{Cov}[X, Y]}{s_X s_Y} \]
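To illustrate why standardizing matters, the sketch below re-expresses the Glyndon soil moisture values as percentages instead of fractions. It is an illustration under assumed names (covariance, std_dev, pearson_r are helper functions written for this example): the covariance changes by a factor of 100 with the change of units, while r stays the same.

```python
import math


def covariance(xs, ys):
    """Sample covariance with an (n - 1) denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def std_dev(xs):
    """Sample standard deviation (square root of the variance)."""
    return math.sqrt(covariance(xs, xs))


def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
soil_fraction = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]
soil_percent = [100 * y for y in soil_fraction]  # same data, different units

cov_fraction = covariance(tvdi, soil_fraction)
cov_percent = covariance(tvdi, soil_percent)
r_fraction = pearson_r(tvdi, soil_fraction)
r_percent = pearson_r(tvdi, soil_percent)

print(cov_fraction, cov_percent)  # differ by a factor of 100
print(r_fraction, r_percent)      # agree up to floating-point rounding
```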
Pearson’s Correlation Coefficient
•As is the case with covariance, the correlation coefficient
can be expressed in several equivalent ways:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_X s_Y} \]
•It can also be expressed in terms of z-scores, which is
convenient if you have already calculated them:
\[ r = \frac{\sum_{i=1}^{n} Z_{x_i} Z_{y_i}}{n - 1} \]
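The z-score form can be sketched as follows. This is an illustrative Python sketch with assumed helper names (sample_sd, z_scores, pearson_r_from_z), applied to the Glyndon values as printed in the lecture; the z-scores use the sample standard deviation with an (n - 1) denominator, which is what makes this form agree with the covariance form.

```python
import math


def sample_sd(values):
    """Sample standard deviation with an (n - 1) denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def z_scores(values):
    """Standardize each value: (value - mean) / standard deviation."""
    mean = sum(values) / len(values)
    sd = sample_sd(values)
    return [(v - mean) / sd for v in values]


def pearson_r_from_z(xs, ys):
    """Sum of the products of paired z-scores, divided by (n - 1)."""
    zx = z_scores(xs)
    zy = z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)


tvdi = [0.274, 0.542, 0.419, 0.286, 0.374, 0.489, 0.623, 0.506, 0.768, 0.725]
soil = [0.414, 0.359, 0.396, 0.458, 0.350, 0.357, 0.255, 0.189, 0.171, 0.119]

r = pearson_r_from_z(tvdi, soil)
print(f"r = {r:.3f}")
```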
Pearson’s Correlation Coefficient
•The correlation coefficient will have a value ranging from
–1 to +1, where –1 denotes a perfect inverse relationship and +1
denotes a perfect positive relationship
•The strength of the relationship is expressed by the absolute size
of the coefficient, with values closer to –1 or +1 for stronger relationships
•NOTE: Correlation coefficients cannot be interpreted
proportionally, i.e. r = 0.6 is not twice as strong as r = 0.3
•Here are some ranges for interpreting the absolute value of r:
0 - 0.2 Negligible relationship
0.2 - 0.4 Weak relationship, little association
0.4 - 0.6 Moderate relationship / association
0.6 - 0.8 Strong relationship, high association
0.8 - 1.0 Very strong relationship
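The interpretation ranges can be expressed as a small lookup. This is a sketch only: the function name describe_strength is hypothetical, and because the printed ranges share their endpoints (0.2, 0.4, ...), the code assumes each boundary value belongs to the stronger category.

```python
def describe_strength(r):
    """Map a correlation coefficient to the lecture's verbal categories."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)  # the sign gives direction, the absolute value gives strength
    if size < 0.2:
        return "Negligible relationship"
    if size < 0.4:
        return "Weak relationship, little association"
    if size < 0.6:
        return "Moderate relationship / association"
    if size < 0.8:
        return "Strong relationship, high association"
    return "Very strong relationship"


print(describe_strength(-0.871))
print(describe_strength(0.5))
```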
Pearson’s Correlation
Coefficient Example
•We can take the covariance we calculated earlier and
convert it to a correlation coefficient by dividing the
covariance by the product of the standard deviations
•We calculated Cov[TVDI, Soil Moisture] = -0.017063
•s_TVDI = 0.169718, s_Soil Moisture = 0.115347
•Using the formula
\[ r = \frac{\mathrm{Cov}[X, Y]}{s_X s_Y} = \frac{-0.017063}{0.169718 \times 0.115347} = -0.871 \]
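The arithmetic of this example can be reproduced in a few lines of Python. The variable names are assumptions; the three input values are the covariance and standard deviations quoted in the example.

```python
cov_xy = -0.017063   # Cov[TVDI, Soil Moisture] from the covariance example
s_tvdi = 0.169718    # standard deviation of TVDI
s_soil = 0.115347    # standard deviation of soil moisture

r = cov_xy / (s_tvdi * s_soil)
print(round(r, 4))
```

Carried to four decimals the quotient is about -0.8716, which the example reports as -0.871.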
Pearson’s r - Assumptions
• To properly apply Pearson’s Correlation Coefficient, we
first have to make sure that the following assumptions
are satisfied:
1. The values need to be either interval or ratio scale data
(later we will examine a different correlation method for
ordinal data)
2. The (x,y) data pairs are selected randomly from a
population of values of X and Y
3. The relationship between X and Y is linear (which can
be qualitatively assessed by looking at the scatterplot)
4. The variables X and Y must share a joint bivariate
normal distribution (which we tend to assume when
sampling from a population)
Interpreting Correlation Coefficients
• Even if we find a strong correlation between two
variables, we cannot use that information to
demonstrate a causal relationship between the two
variables → Correlation is not the same as causation!
(Correlation suggests an association between variables)
• There are a number of reasons why we might find a
strong correlation between X and Y that are separate
from the situation where the independent variable X is
directly influencing dependent variable Y:
1. Both X and Y are actually dependent variables that are
being influenced by a further variable Z (Z → X and Z → Y)
Interpreting Correlation Coefficients
2. There is a causative chain between X and Y, but it is
mediated by intermediate or linked processes, i.e. X →
A → B → Y, e.g. rainfall → soil moisture → ground
water → runoff
3. There is a mutual relationship between X and Y such
that each is influencing the other, e.g. the relationship
between income and social status
4. We might find a spurious relationship between two
variables because each basically expresses the same
information, i.e. if Y = kX, then X and Y will of course
be strongly correlated
5. A true causal relationship exists, the situation we
would prefer to find, where X → Y
Interpreting Correlation Coefficients
6. There is the possibility that a strong correlation can be
found as a result of chance → since this statistic is
based upon samples, there is the unlikely but present
possibility that an extremely unusual sample was chosen
7. There may be a few outliers present in the sampled data
that have a large, undue influence on the resulting
correlation
8. The strength of the correlation may be a result of a lack
of independence among observations. This is
particularly problematic with geographic data, where
samples are drawn from many adjacent locations that
have similar values in close proximity (spatial
autocorrelation)
Interpreting Correlation Coefficients
8. Cont. A related issue in the context of data used for
correlation drawn from spatial sample units is the issue
of ecological correlation: We find that any two
variables, depending on the geographic scale at which
they are sampled, will have different correlation values
at various scales. Generally, correlation increases as
we look at larger scale data (i.e. smaller sample units)
• If we compare rainfall versus runoff in North Carolina
and the USA, we would expect to find a higher r value
in NC because the dependent variable (runoff in this
case) is subject to greater random variability at the scale
of the nation … we cannot compare correlations from
different scales!
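The undue influence of outliers (reason 7 above) can be illustrated with a small synthetic example. The data below are made up purely for illustration and the helper name pearson_r is an assumption: five points with no linear relationship, and the same five points plus a single extreme observation.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from sums of squared deviations and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic values, not lecture data
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 3, 1, 2]

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [20], ys + [20])  # add one extreme point

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

A single point far from the rest is enough to turn an r of essentially zero into one above 0.97, which is why the scatterplot should be inspected before r is interpreted.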
A Significance Test for r
• Pearson’s Correlation Coefficient is, like many
statistics, an estimator. Using a sample, it produces an
estimate of the correlation coefficient (r) that is meant to
approximate the population correlation coefficient (ρ)
• We can formulate a test of significance to assess (using
our sample data and the r we calculated from it) whether
or not the population correlation coefficient is equal
to zero (indicating no relationship between X and Y). In
the event that the test statistic exceeds the critical
value, we reject this null hypothesis and accept the
alternative, where ρ is not equal to zero and there is
some relationship between X and Y
• We will use another variant of the t-test for this purpose
A Significance Test for r
• When ρ = 0, the test statistic formed from r follows a
t-distribution with (n - 2) degrees of freedom, and we can
estimate the standard error of r using:
\[ SE_r = \sqrt{\frac{1 - r^2}{n - 2}} \]
• The test itself takes the form of the correlation
coefficient divided by the standard error, thus:
\[ t_{test} = \frac{r}{SE_r} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \]
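The standard error and test statistic translate directly into code. The following is a minimal sketch with assumed function names (standard_error_r, t_statistic); the values r = 0.5 and n = 10 are used only as an illustration.

```python
import math


def standard_error_r(r, n):
    """Estimated standard error of r: sqrt((1 - r^2) / (n - 2))."""
    return math.sqrt((1 - r ** 2) / (n - 2))


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, i.e. r divided by its standard error."""
    return r / standard_error_r(r, n)


t = t_statistic(0.5, 10)
print(f"SE_r = {standard_error_r(0.5, 10):.4f}, t = {t:.2f}")
```

The function requires n > 2 and |r| < 1; otherwise the standard error is undefined or zero.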
A Significance Test for r
• We will use this test in a 2-tailed fashion to assess
whether or not the population correlation coefficient is
equal to zero (no relationship) or not equal to zero
(some relationship):
H0: ρ = 0
HA: ρ ≠ 0
• Examining the test statistic, we can see that its value is
purely a function of the correlation coefficient (r) and
sample size (n):
\[ t_{test} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \]
• A given r may or may not be significant depending on
the size of the sample!
A Significance Test for r
• For example, suppose we have three samples of variables X
and Y of sizes n = 10, n = 100, and n = 1000, all with r = 0.5
• We decide to test the significance of these r values using an
α = 0.05, and we look up the appropriate critical values
from a t-distribution table with appropriate degrees of
freedom for each of the three samples:
1. n = 10, df = 8, tcrit = 2.306, \( t_{test} = \frac{0.5\sqrt{10 - 2}}{\sqrt{1 - 0.5^2}} = 1.63 \)
2. n = 100, df = 98, tcrit = 1.99, \( t_{test} = \frac{0.5\sqrt{100 - 2}}{\sqrt{1 - 0.5^2}} = 5.72 \)
3. n = 1000, df = 998, tcrit = 1.97, \( t_{test} = \frac{0.5\sqrt{1000 - 2}}{\sqrt{1 - 0.5^2}} = 18.24 \)
• The same r = 0.5 is therefore not significant for n = 10
(1.63 < 2.306), but it is significant for n = 100 and n = 1000
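The three test statistics can be recomputed with a short Python sketch. The function name and the list layout are assumptions; the critical values are the table lookups quoted in the example.

```python
import math


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.5
# (sample size, two-tailed critical value at alpha = 0.05 as quoted in the text)
cases = [(10, 2.306), (100, 1.99), (1000, 1.97)]

results = []
for n, t_crit in cases:
    t = t_statistic(r, n)
    significant = abs(t) > t_crit
    results.append((n, round(t, 2), significant))
    print(f"n = {n:4d}, df = {n - 2:3d}, t = {t:.2f}, significant: {significant}")
```

Evaluating the formula gives roughly 1.63, 5.72 and 18.24, so with r = 0.5 only the smallest sample fails to reach significance.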
A Significance Test for r
• As is the case with the other statistical tests that we have
studied that have included sample size in the standard
error formulation, sample size is important when it
comes to evaluating the significance of a correlation
coefficient
• It is not enough to simply find a large r value; if it is
derived from a very small sample, this may not tell you
much about a relationship between the variables
• On the other hand, once samples become larger than a
certain size, the shapes of the appropriate t-distributions
change very little (i.e. the tcrit value for α = 0.05,
two-tailed, stabilizes somewhere around tcrit ~ 2), so it is not
necessary or useful to collect a sample of a huge size
Hypothesis Testing - Significance of r
t-test Example
• Research question: Is there a significant relationship
between TVDI and soil moisture in the Glyndon data set?
1. H0: ρ = 0 (No significant relationship)
2. HA: ρ ≠ 0 (Some relationship)
3. Select α = 0.05, two-tailed because of how the alternate
hypothesis is formulated
4. In order to compute the t-test statistic, we need to first
calculate the Pearson Product Moment Correlation
Coefficient. We have done so earlier in this lecture,
finding r = -0.871, a very strong inverse relationship
between remotely sensed TVDI and field measurements
of soil moisture
Hypothesis Testing - Significance of r
t-test Example
4. Cont. We calculate the test statistic using:
\[ t_{test} = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{-0.871 \sqrt{10 - 2}}{\sqrt{1 - (-0.871)^2}} = -5.01 \]
5. We now need to find the critical t-score, first calculating
the degrees of freedom:
df = (n - 2) = (10 - 2) = 8
We can now look up the tcrit value for our α (0.025 in
each tail) and df = 8, tcrit = 2.306
Hypothesis Testing - Significance of r
t-test Example
6. |ttest| > |tcrit| (5.01 > 2.306), therefore we reject H0 and
accept HA, finding that there is a significant relationship (i.e. the
population correlation coefficient ρ, which we have
estimated using the sample correlation coefficient r, is not
equal to 0)
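The whole test for the Glyndon example can be sketched in Python as follows. The variable names are assumptions, and the critical value is taken from the t-table lookup in step 5 instead of being computed.

```python
import math

r = -0.871       # sample correlation between TVDI and soil moisture
n = 10           # number of (TVDI, soil moisture) pairs
t_crit = 2.306   # two-tailed critical value for alpha = 0.05, df = 8 (from the table)

df = n - 2
t_test = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

# Two-tailed decision: compare absolute values
reject_h0 = abs(t_test) > t_crit

print(f"df = {df}, t_test = {t_test:.2f}, reject H0: {reject_h0}")
```

Because the test is two-tailed, the sign of the test statistic only reflects the direction of the relationship; the decision depends on its absolute value.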