Applied Data Analysis

Spring 2017
Karen’s mom, Northern Ireland late 1970s
Karen Albert
Thursdays, 4-5 PM (Hark 302)
Lecture outline
1. Correlations
Meet Salem
Cat data
library(MASS)
data(cats)
attach(cats)
str(cats)
## 'data.frame': 144 obs. of 3 variables:
## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
Scatter diagram
plot(Bwt,Hwt)
[Figure: scatter diagram of Hwt (vertical axis) against Bwt (horizontal axis)]
The basics of scatter diagrams
• Ordered pairs
• Cartesian plane
• Abscissa/ordinate
• Association
Ordered pairs
• An ordered pair (x, y) is two numbers presented in
parentheses and separated by a comma.
• The first number represents the value along the x-axis, and
the second number represents the value along the y-axis.
• The x-coordinate of the ordered pair is known as the
abscissa and the y-coordinate is called the ordinate.
Cartesian plane
Ordered pairs can be plotted in a Cartesian plane, named after
the 17th-century French mathematician and philosopher René
Descartes.
Famous quote: “I’m pink, therefore I’m spam.”
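Positive association
[Figure: scatter diagram of var[, 2] (vertical axis) against var[, 1] (horizontal axis)]
Negative association
[Figure: scatter diagram of var[, 2] (vertical axis) against var[, 1] (horizontal axis)]
No association
[Figure: scatter diagram of var[, 2] (vertical axis) against var[, 1] (horizontal axis)]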
Summarizing the relationship
We need a number that will help us summarize the relationship.
The number is known as the correlation.
Correlation coefficient
• The value that helps us measure the association between
two variables is the correlation coefficient.
• The symbol for the correlation is r.
• The correlation coefficient measures how tightly the point
cloud clusters around a straight line. The tighter the
clustering, the closer r is to 1 or -1. The looser the
clustering, the closer r is to 0.
Some facts about r
• The correlation coefficient ranges between -1 and 1:
−1 ≤ r ≤ 1
• 1 is a perfect positive correlation, and -1 is a perfect
negative correlation.
• The correlation coefficient is unitless: it is not degrees,
percents, or inches.
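A quick check of the two endpoints, as a minimal sketch. The vector x and the two lines built from it are made-up illustration values, not course data.

```r
# Made-up values for illustration only
x <- c(1, 2, 3, 4, 5)

# Points lying exactly on an increasing line
y_up <- 2 * x + 3
# Points lying exactly on a decreasing line
y_down <- -2 * x + 3

r_pos <- cor(x, y_up)    # perfect positive correlation
r_neg <- cor(x, y_down)  # perfect negative correlation

print(c(r_pos, r_neg))
```

Any exact straight line with nonzero slope gives r at one of the two endpoints; the sign of r follows the sign of the slope.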
Negative correlation
A positive correlation means that as one variable increases, so
does the other. A negative correlation means that as one
variable decreases, so does the other.
Right?
Wrong. It means that as one variable decreases, the other
increases. Example?
What correlation cannot tell us
1. It cannot tell us how much one variable changes in
response to a change in the other variable.
2. It does not tell us the magnitude of the change.
3. That is, it does not tell us the slope of the point cloud.
4. That is a job for regression.
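The points above can be sketched with the cat data. This is only an illustration: the rescaled variable hwt10 is hypothetical, and lm() is used here just to read off a slope, ahead of any proper treatment of regression.

```r
library(MASS)  # provides the cats data

bwt <- cats$Bwt
hwt <- cats$Hwt
hwt10 <- 10 * hwt  # hypothetical rescaling of heart weight

# The correlation is identical for both versions of the y variable
r_original <- cor(bwt, hwt)
r_rescaled <- cor(bwt, hwt10)

# The slope of the fitted line is not: it is ten times larger
slope_original <- coef(lm(hwt ~ bwt))[["bwt"]]
slope_rescaled <- coef(lm(hwt10 ~ bwt))[["bwt"]]

print(c(r_original, r_rescaled))
print(c(slope_original, slope_rescaled))
```

Two point clouds can share the same r while one is ten times steeper than the other, so r alone says nothing about how much y changes per unit of x.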
Cave! Hic dragones (Beware! Here be dragons)
Pastafarianism
Church of the Flying Spaghetti Monster
Pastafarianism “beliefs”
• We believe pirates, the original Pastafarians, were peaceful
explorers, and it was due to Christian misinformation that
they have an image of outcast criminals today.
• We are fond of beer.
• Every Friday is a religious holiday.
• We do not take ourselves too seriously.
• We embrace contradictions (though in that we are hardly
unique).
How the world was created...
We believe the Flying Spaghetti Monster created the world
much as it exists today, but for reasons unknown made it
appear that the universe is billions of years old (instead of
thousands) and that life evolved into its current state (rather
than created in its current form). Every time a researcher
carries out an experiment that appears to confirm one of these
“scientific theories” supporting an old earth and evolution we
can be sure that the FSM is there, modifying the data with his
Noodly Appendage. We don’t know why He does this but we
believe He does, that is our Faith.
Other Examples
• The number of firefighters at a fire is positively correlated
with the amount of damage caused by the fire.
• The accuracy of bombers in WWII is positively correlated
with the number of opposition fighters.
• Children’s reading level is positively correlated with the
amount of TV watched.
Notation flashback
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i \times p_i \]
\[ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \times p_i} \]
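The flashback formulas can be checked in R, as a rough sketch. It assumes each observation gets equal weight p_i = 1/n, which is what makes the two forms of each formula agree. Note that R's sd() divides by n - 1 rather than n, so it differs slightly from s as written here.

```r
library(MASS)  # provides the cats data

x <- cats$Bwt
n <- length(x)
p <- rep(1 / n, n)  # assumption: equal weight 1/n for every observation

# Mean as a weighted sum
x_bar <- sum(x * p)

# SD with the divide-by-n convention used in the formulas
s_n <- sqrt(sum((x - x_bar)^2 * p))

# R's built-in sd() uses n - 1 in the denominator
s_r <- sd(x)

print(c(x_bar, s_n, s_r))
```

The two SDs are related by s_n = s_r × sqrt((n - 1)/n), which matters later when converting to standard units.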
Covariance
To calculate r, we need a value known as the covariance:
\[ \operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \times p_i \]
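The covariance measures how two variables vary together: it is
positive when above-average values of x tend to occur with
above-average values of y, and negative when they tend to occur
with below-average values of y.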
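The covariance formula can be computed by hand and compared with R's cov(), as a minimal sketch. One caveat: cov() divides by n - 1, while the formula above divides by n, so the two differ by a factor of n/(n - 1).

```r
library(MASS)  # provides the cats data

x <- cats$Bwt
y <- cats$Hwt
n <- length(x)

# Covariance exactly as in the formula: average product of deviations
cov_n <- mean((x - mean(x)) * (y - mean(y)))

# R's built-in version, which divides by n - 1
cov_r <- cov(x, y)

print(c(cov_n, cov_r))
```

With 144 cats the difference is small, but it is why hand calculations and cov() do not match to the last decimal.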
Problem
The covariance has a problem—it depends on the units with
which x and y are measured.
cov(Bwt,Hwt)
## [1] 0.9501127
Bwt.new <- Bwt*10
Hwt.new <- Hwt*10
cov(Bwt.new,Hwt.new)
## [1] 95.01127
Solution: the correlation coefficient
\[ r = \frac{\operatorname{cov}(x, y)}{s_x s_y} \]
cov(Bwt,Hwt)
## [1] 0.9501127
cov(Bwt,Hwt)/(sd(Bwt)*sd(Hwt))
## [1] 0.8041274
cor(Bwt,Hwt)
## [1] 0.8041274
Another view
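\begin{align*}
r &= \frac{\operatorname{cov}(x, y)}{s_x s_y} \\
  &= \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}} \\
  &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\end{align*}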
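The last line of this derivation can be checked directly, as a small sketch using the cat data. Because the 1/n factors cancel, it does not matter here whether the SDs and covariance use n or n - 1.

```r
library(MASS)  # provides the cats data

x <- cats$Bwt
y <- cats$Hwt

# Deviations from the means
dx <- x - mean(x)
dy <- y - mean(y)

# r as a ratio of sums, with no 1/n anywhere
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# Built-in correlation for comparison
r_builtin <- cor(x, y)

print(c(r_manual, r_builtin))
```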
Another view
Remember that we said that the correlation coefficient is
unitless. How do you make a value unitless?
We use standardized scores:
\[ z_i = \frac{x_i - \bar{x}}{s} \]
We should then be able to calculate the correlation coefficient
using standardized units.
Another view
r can be calculated as the average of the products of the
x-values and the y-values, each in standard units.
\[ r = \frac{1}{n}\sum_{i=1}^{n} z_i^{x} \times z_i^{y} \]
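# Standard units use the divide-by-n SD: sd() times sqrt((n-1)/n)
Bwt.z <- (Bwt-mean(Bwt))/(sd(Bwt)*sqrt((length(Bwt)-1)/length(Bwt)))
Hwt.z <- (Hwt-mean(Hwt))/(sd(Hwt)*sqrt((length(Hwt)-1)/length(Hwt)))
# The average product of standard units equals cor(Bwt,Hwt)
mean(Bwt.z*Hwt.z)
## [1] 0.8041274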
More about r
• r is not affected by interchanging the variables.
• Adding the same number to all the values of one variable
does not change r.
• Multiplying all the values of one variable by the same
positive number does not change r.
• r can be fooled by outliers and nonlinear associations.
Always look at a scatter diagram.
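Each of these properties can be tried out in R. The sketch below is illustrative only: the shift of 100, the factor 2.2, the single extreme point (20, 1), and the vector u are all made-up values chosen for the demonstration.

```r
library(MASS)  # provides the cats data

x <- cats$Bwt
y <- cats$Hwt
r <- cor(x, y)

# Interchanging the variables
r_swapped <- cor(y, x)
# Adding the same number to one variable
r_shifted <- cor(x + 100, y)
# Multiplying one variable by the same positive number
r_scaled <- cor(2.2 * x, y)

# One made-up extreme point appended to the data
r_outlier <- cor(c(x, 20), c(y, 1))

# A perfectly predictable but nonlinear relationship
u <- seq(-3, 3, by = 0.5)
r_curve <- cor(u, u^2)

print(c(r, r_swapped, r_shifted, r_scaled))
print(c(r_outlier, r_curve))
```

The first three values match r. The single added point pulls the correlation down sharply, and the parabola gives a correlation of essentially zero even though u^2 is completely determined by u, which is why the scatter diagram comes first.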
What did we learn?
• Pirates
• Global warming
• Pastafarianism
• Covariance
• Correlation