Correlation - Appalachian State University

Outline
Correlation
Correlation∗
Alan T. Arnholt
Department of Mathematical Sciences
Appalachian State University
[email protected]
Spring 2006 R Notes
∗
1
c 2006 Alan T. Arnholt
Copyright The R Script
Outline
Correlation
Correlation
Overview of Correlation
The R Script
2
The R Script
Outline
Correlation
The R Script
Correlation
The correlation coefficient, denoted by r, measures the strength
and direction of the linear relationship between two numeric
variables X and Y and is defined by
n
1 X
r=
n−1
i=1
3
xi − x̄
sX
yi − ȳ
sY
(1)
Outline
Correlation
Correlation Properties
• The value for r will always be between −1 and +1.
4
The R Script
Outline
Correlation
Correlation Properties
• The value for r will always be between −1 and +1.
• When r is close to +1, it indicates a strong positive linear
relationship. That is, when x increases so does y and vice
versa.
5
The R Script
6
Outline
Correlation
The R Script
Correlation Properties
• The value for r will always be between −1 and +1.
• When r is close to +1, it indicates a strong positive linear
relationship. That is, when x increases so does y and vice
versa.
• When the value of r is close to −1, it indicates a strong
negative linear relationship. Values of r close to zero indicate
weak linear relationships.
Outline
Correlation
The R Script
Correlation Properties
• The value for r will always be between −1 and +1.
• When r is close to +1, it indicates a strong positive linear
relationship. That is, when x increases so does y and vice
versa.
• When the value of r is close to −1, it indicates a strong
negative linear relationship. Values of r close to zero indicate
weak linear relationships.
• To compute the correlation between two numeric vectors with
R, one may use the function cor(x,y).
7
Outline
Correlation
Example
Use the data frame Correlat from the BSDA package to:
1. Create a scatterplot of Y versus X.
8
The R Script
Outline
Correlation
Example
Use the data frame Correlat from the BSDA package to:
1. Create a scatterplot of Y versus X.
2. Compute the sample correlation coefficient r using the
definition.
9
The R Script
Outline
Correlation
Example
Use the data frame Correlat from the BSDA package to:
1. Create a scatterplot of Y versus X.
2. Compute the sample correlation coefficient r using the
definition.
3. Compute the sample correlation coefficient r using the R
function cor().
> library(BSDA)
> attach(Correlat)
> plot(X,Y,col="blue",main="Scatterplot")
10
The R Script
11
Outline
Correlation
The R Script
Scatterplot of Y Versus X
40
50
60
Y
70
80
90
Scatterplot
20
40
60
X
80
Outline
Correlation
Using the Formula
> m.x <- mean(X)
> m.y <- mean(Y)
> s.x <- sd(X)
> s.y <- sd(Y)
> Z.x <- (X-m.x)/s.x
> Z.y <- (Y-m.y)/s.y
> ZxZy <- Z.x*Z.y
> r <- (1/(length(X)-1))*sum(ZxZy)
> r
[1] -0.813717
> cor(X,Y)
[1] -0.813717
12
The R Script
Outline
Correlation
Code Continued
> c(m.x, m.y, s.x, s.y)
[1] 54.53846 68.92308 21.81596 18.81693
> stuff <- cbind(X,Y,Z.x,Z.y,ZxZy)
> stuff
X Y
Z.x
Z.y
ZxZy
[1,] 42 75 -0.57473814 0.3229497 -0.18561153
[2,] 61 49 0.29618407 -1.0587846 -0.31359513
[3,] 12 95 -1.94987849 1.3858223 -2.70218502
[4,] 71 64 0.75456419 -0.2616302 -0.19741675
[5,] 52 83 -0.11635803 0.7480987 -0.08704730
[6,] 48 84 -0.29971007 0.8012424 -0.24014041
[7,] 74 38 0.89207822 -1.6433645 -1.46600964
[8,] 65 58 0.47953612 -0.5804919 -0.27836684
[9,] 53 81 -0.07052002 0.6418115 -0.04526056
[10,] 63 47 0.38786010 -1.1650718 -0.45188487
[11,] 55 78 0.02115601 0.4823806 0.01020525
[12,] 94 51 1.80883845 -0.9524973 -1.72291376
[13,] 19 93 -1.62901241 1.2795350 -2.08437841
13
The R Script
Outline
Correlation
The R Script
Correlation and Causation
• A positive correlation between two variables means that large
values of one variable tend to be associated with large values
in the other variable.
14
Outline
Correlation
The R Script
Correlation and Causation
• A positive correlation between two variables means that large
values of one variable tend to be associated with large values
in the other variable.
• This does not necessarily mean that the large values of the
first variable caused the large values of the other variable.
15
Outline
Correlation
The R Script
Correlation and Causation
• A positive correlation between two variables means that large
values of one variable tend to be associated with large values
in the other variable.
• This does not necessarily mean that the large values of the
first variable caused the large values of the other variable.
• Correlation measures the linear association between two
variables, not the causal effect.
16
Outline
Correlation
The R Script
Link to the R Script
• Go to my web page Script for Correlation
• Homework: problems 2.20, 2.21, 2.23, 2.27, 2.28, 2.30 and
2.31
• See me if you need help!
17