Slides - StatLit.Org

Statistically-Significant Correlations
0F
2014 NNN4 Statistically-Significant Correlations
11 Oct, 2014
1
0F
2
2014 NNN4 Statistically-Significant Correlations
Statistically-Significant
Correlations
Exact Solutions
Milo Schield
Augsburg College
Editor of www.StatLit.org
US Rep: International Statistical Literacy Project
Fall 2014
National Numeracy Network Conference
For N random pairs from an uncorrelated bivariate
normally-distributed distribution, the sampling
distribution is not simple.
Here are three common analytic approaches:
1.Fisher transformation (using LN and Arctanh),
2.an exact solution (using a Gamma function), or
3.Student-t distribution: t=rSqrt[(n-2)/(1-r^2)]; df=n-2
• For large n, the critical value of t (95% confidence) is 1.96.
• For small n, the critical value of t increases as n decreases.
www.StatLit.org/pdf/2014-Schield-NNN4-Slides.pdf
None of these are simple or memorable.
2014 NNN4 Statistically-Significant Correlations
3
0F
Simple Model: 2/SQRT(n)
Sufficient Condition
Approach: Find an equation generating a minimum
correlation for statistical-significance given N.
N
400
256
100
49
25
16
12
10
7
6
5
4
1. Given N, find the smallest value of r where the left
end of a 95% confidence interval is non-negative.
Use calculator at www.vassarstats.net/rho.html or
www.danielsoper.com/statcalc3/calc.aspx?id=44
For Daniels, use the results for a two-tailed test.
2. Generate correlation coefficient with simple model
3. Calculate error difference between calculated and
exact using the exact as the standard. If all errors are
positive, then the model is sufficient.
0F
2014 NNN4 Statistically-Significant Correlations
5
0F
It is similar to the formula for the maximum 95%
Margin of Error in samples from a binary variable:
95% ME = 1.96 Sqrt[p*(1-p)/n] < 2 Sqrt[1/(4n)]
95% ME < 1/Sqrt(n)
Simple and memorable for one binary variable.
0.10
0.12
0.20
0.28
0.40
0.50
0.57
0.63
0.75
0.81
0.88
0.96
0.10
0.13
0.20
0.29
0.40
0.50
0.58
0.63
0.76
0.82
0.89
1.00
3.0%
2.7%
2.0%
1.7%
1.3%
1.0%
0.6%
0.4%
0.4%
0.6%
1.4%
4.0%
6
2014 NNN4 Statistically-Significant Correlations
Time-Series Correlations
www.tylervigen.com
Tangled bed‐sheet Deaths vs. Skiing Revenues
900
Deaths (US)
Minimum statistically-significant r = 2/Sqrt(n)
“n” is the number of pairs being correlated
Less than 5% over for n between 5 and 4,000.
Simple and memorable for two variables.
Minimum Correlation for Statistical Significance
Exact
2/sqrt(n)
Error
All errors positive means the model is sufficient.
Solution
2014-Schield-NNN4-slides.pdf
4
2014 NNN4 Statistically-Significant Correlations
Revenues: Blue line
2600
800
2400
700
2200
600
2000
500
400
300
2000
Correlation: 0.969724
1800
1600
Revenues ($M)
0F
1400
2002
2004
2006
2008
Source: http://tylervigen.com/view_correlation?id=1864
10 pairs; 2/Sqrt(10) = 0.63; Statistically significant
1
Statistically-Significant Correlations
0F
7
2014 NNN4 Statistically-Significant Correlations
Correlation = -0.993
Bee colonies & MJ arrests
Pool Drownings vs. Films with Nicholas Cage
Bee Colonies vs. Juvenile Marijuana Arrests
55,000
45,000
Honey Bee Colonies
2,600
Drownings
65,000
Correlation: ‐0.933389
Arrests
75,000
3,000
4
120
85,000
Marijuna Arrests of Juveniles
Colonies (K)
Correlation: 0.666004
95,000
3,200
110
3.5
Cage films:
Blue line
Drowningns:
Red line
3
2.5
100
2
1.5
90
1
35,000
2,400
25,000
2,200
15,000
90
91
92
93
94
95
96
97
98
99
00
01
02
03
04
05
06
07
08
80
1998
09
20 pairs; 2/Sqrt(20) = 0.45; Statistically-significant
2014 NNN4 Statistically-Significant Correlations
2000
2002
2004
2006
2008
0.5
2010
www.tylervigen.com/view_correlation?id=359
www.tylervigen.com/view_correlation?id=1582
0F
4.5
130
3,400
2,800
8
2014 NNN4 Statistically-Significant Correlations
Correlation = 0.664
Drownings & Cage films
Cage films
0F
11 Oct, 2014
9
11pairs; 2/Sqrt(11) = 0.60; Statistically-significant!
0F
2014 NNN4 Statistically-Significant Correlations
10
Correlation = 0.664
Drownings & Cage films
Something Seems Wrong!
1. There is nothing linear about these associations.
2. These correlations seem unbelievably high.
----------------------#1: The correlation between two time-series
eliminates the common factor: time. The
question is whether their mutual association is
linear. To see this, an XY-plot is generated.
0F
2014 NNN4 Statistically-Significant Correlations
11
#2: Very High Correlations.
Three Explanations
1. Association is causal. See Tyler Vigen’s video:
www.youtube.com/watch?feature=player_embedded&v=g-g0ovHjQxs
2. Association is spurious – just random chance.
Five percent of random associations will be
mistakenly classified as statistically significant.
3. Association is cherry-picked -- after the fact.
According to Tyler, “This server has generated
24,470 correlations.” Tyler just picked those
with high or interesting correlations.
2014-Schield-NNN4-slides.pdf
0F
2014 NNN4 Statistically-Significant Correlations
12
Conclusions
1. Use 2/Sqrt(n) as the minimum correlation for
statistical significance. This criteria is sufficient,
fairly accurate (within 5%) and memorable.
2. The correlation between two time-series eliminates
time. Correlation determines the degree of linearity
in their cross-sectional association.
3. Do not use a test for statistical significance if the
data pairs were selected – after the fact via data
mining – solely because of their high correlation.
2
Statistically-Significant Correlations
0F
11 Oct, 2014
13
2014 NNN4 Statistically-Significant Correlations
Correlation = 0.993
Divorce & Margarine Usage
Divorce Rate vs. Margarine Consumption
5
9
Correlation: 0.992558
8
4.8
7
4.6
6
Margarine Consumption per capita (US)
4.4
5
Divorce Rate 4.2
4
in Maine
4
3
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
Correlation: 0.993. N=10, SS_Rho = 1/sqrt(11) =
10 pairs; 2/Sqrt(10) = 0.64. Statistically-significant
www.tylervigen.com/view_correlation?id=1703
2014-Schield-NNN4-slides.pdf
3
0F
2014 NNN4 Statistically-Significant Correlations
1
Statistically-Significant
Correlations
Milo Schield
Augsburg College
Editor of www.StatLit.org
US Rep: International Statistical Literacy Project
Fall 2014
National Numeracy Network Conference
www.StatLit.org/pdf/2014-Schield-NNN4-Slides.pdf
0F
2014 NNN4 Statistically-Significant Correlations
2
Exact Solutions
For N random pairs from an uncorrelated bivariate
normally-distributed distribution, the sampling
distribution is not simple.
Here are three common analytic approaches:
1.Fisher transformation (using LN and Arctanh),
2.an exact solution (using a Gamma function), or
3.Student-t distribution: t=rSqrt[(n-2)/(1-r^2)]; df=n-2
• For large n, the critical value of t (95% confidence) is 1.96.
• For small n, the critical value of t increases as n decreases.
None of these are simple or memorable.
0F
2014 NNN4 Statistically-Significant Correlations
3
Sufficient Condition
Approach: Find an equation generating a minimum
correlation for statistical-significance given N.
1. Given N, find the smallest value of r where the left
end of a 95% confidence interval is non-negative.
Use calculator at www.vassarstats.net/rho.html or
www.danielsoper.com/statcalc3/calc.aspx?id=44
For Daniels, use the results for a two-tailed test.
2. Generate correlation coefficient with simple model
3. Calculate error difference between calculated and
exact using the exact as the standard. If all errors are
positive, then the model is sufficient.
0F
2014 NNN4 Statistically-Significant Correlations
4
Simple Model: 2/SQRT(n)
N
400
256
100
49
25
16
12
10
7
6
5
4
Minimum Correlation for Statistical Significance
Exact
2/sqrt(n)
Error
0.10
0.12
0.20
0.28
0.40
0.50
0.57
0.63
0.75
0.81
0.88
0.96
0.10
0.13
0.20
0.29
0.40
0.50
0.58
0.63
0.76
0.82
0.89
1.00
3.0%
2.7%
2.0%
1.7%
1.3%
1.0%
0.6%
0.4%
0.4%
0.6%
1.4%
4.0%
All errors positive means the model is sufficient.
0F
2014 NNN4 Statistically-Significant Correlations
5
Solution
Minimum statistically-significant r = 2/Sqrt(n)
“n” is the number of pairs being correlated
Less than 5% over for n between 5 and 4,000.
Simple and memorable for two variables.
It is similar to the formula for the maximum 95%
Margin of Error in samples from a binary variable:
95% ME = 1.96 Sqrt[p*(1-p)/n] < 2 Sqrt[1/(4n)]
95% ME < 1/Sqrt(n)
Simple and memorable for one binary variable.
0F
2014 NNN4 Statistically-Significant Correlations
6
Time-Series Correlations
www.tylervigen.com
Tangled bed‐sheet Deaths vs. Skiing Revenues
Revenues: Blue line
2600
800
2400
700
2200
600
2000
500
1800
400
300
2000
Correlation: 0.969724
1600
Revenues ($M)
Deaths (US)
900
1400
2002
2004
2006
2008
Source: http://tylervigen.com/view_correlation?id=1864
10 pairs; 2/Sqrt(10) = 0.63; Statistically significant
0F
2014 NNN4 Statistically-Significant Correlations
7
Correlation = -0.993
Bee colonies & MJ arrests
Bee Colonies vs. Juvenile Marijuana Arrests
3,400
95,000
3,200
85,000
75,000
3,000
65,000
Correlation: ‐0.933389
2,800
55,000
Honey Bee Colonies
2,600
45,000
35,000
2,400
25,000
15,000
2,200
90
91
92
93
94
95
96
97
98
99
00
01
02
03
04
05
06
07
08
09
www.tylervigen.com/view_correlation?id=1582
20 pairs; 2/Sqrt(20) = 0.45; Statistically-significant
Arrests
Colonies (K)
Marijuna Arrests of Juveniles
0F
2014 NNN4 Statistically-Significant Correlations
8
Correlation = 0.664
Drownings & Cage films
Pool Drownings vs. Films with Nicholas Cage
130
4.5
Correlation: 0.666004
4
110
3.5
Cage films:
Blue line
Drowningns:
Red line
3
2.5
100
2
1.5
90
80
1998
Cage films
Drownings
120
1
2000
2002
2004
2006
2008
0.5
2010
www.tylervigen.com/view_correlation?id=359
11pairs; 2/Sqrt(11) = 0.60; Statistically-significant!
0F
2014 NNN4 Statistically-Significant Correlations
9
Something Seems Wrong!
1. There is nothing linear about these associations.
2. These correlations seem unbelievably high.
----------------------#1: The correlation between two time-series
eliminates the common factor: time. The
question is whether their mutual association is
linear. To see this, an XY-plot is generated.
0F
2014 NNN4 Statistically-Significant Correlations
Correlation = 0.664
Drownings & Cage films
10
0F
2014 NNN4 Statistically-Significant Correlations
11
#2: Very High Correlations.
Three Explanations
1. Association is causal. See Tyler Vigen’s video:
www.youtube.com/watch?feature=player_embedded&v=g-g0ovHjQxs
2. Association is spurious – just random chance.
Five percent of random associations will be
mistakenly classified as statistically significant.
3. Association is cherry-picked -- after the fact.
According to Tyler, “This server has generated
24,470 correlations.” Tyler just picked those
with high or interesting correlations.
0F
2014 NNN4 Statistically-Significant Correlations
12
Conclusions
1. Use 2/Sqrt(n) as the minimum correlation for
statistical significance. This criteria is sufficient,
fairly accurate (within 5%) and memorable.
2. The correlation between two time-series eliminates
time. Correlation determines the degree of linearity
in their cross-sectional association.
3. Do not use a test for statistical significance if the
data pairs were selected – after the fact via data
mining – solely because of their high correlation.
0F
2014 NNN4 Statistically-Significant Correlations
13
Correlation = 0.993
Divorce & Margarine Usage
Divorce Rate vs. Margarine Consumption
5
9
Correlation: 0.992558
8
4.8
7
4.6
6
Margarine Consumption per capita (US)
4.4
5
Divorce Rate 4.2
4
in Maine
4
3
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
Correlation: 0.993. N=10, SS_Rho = 1/sqrt(11) =
10 pairs; 2/Sqrt(10) = 0.64. Statistically-significant
www.tylervigen.com/view_correlation?id=1703