Statistically-Significant Correlations 0F 2014 NNN4 Statistically-Significant Correlations 11 Oct, 2014 1 0F 2 2014 NNN4 Statistically-Significant Correlations Statistically-Significant Correlations Exact Solutions Milo Schield Augsburg College Editor of www.StatLit.org US Rep: International Statistical Literacy Project Fall 2014 National Numeracy Network Conference For N random pairs from an uncorrelated bivariate normally-distributed distribution, the sampling distribution is not simple. Here are three common analytic approaches: 1.Fisher transformation (using LN and Arctanh), 2.an exact solution (using a Gamma function), or 3.Student-t distribution: t=rSqrt[(n-2)/(1-r^2)]; df=n-2 • For large n, the critical value of t (95% confidence) is 1.96. • For small n, the critical value of t increases as n decreases. www.StatLit.org/pdf/2014-Schield-NNN4-Slides.pdf None of these are simple or memorable. 2014 NNN4 Statistically-Significant Correlations 3 0F Simple Model: 2/SQRT(n) Sufficient Condition Approach: Find an equation generating a minimum correlation for statistical-significance given N. N 400 256 100 49 25 16 12 10 7 6 5 4 1. Given N, find the smallest value of r where the left end of a 95% confidence interval is non-negative. Use calculator at www.vassarstats.net/rho.html or www.danielsoper.com/statcalc3/calc.aspx?id=44 For Daniels, use the results for a two-tailed test. 2. Generate correlation coefficient with simple model 3. Calculate error difference between calculated and exact using the exact as the standard. If all errors are positive, then the model is sufficient. 0F 2014 NNN4 Statistically-Significant Correlations 5 0F It is similar to the formula for the maximum 95% Margin of Error in samples from a binary variable: 95% ME = 1.96 Sqrt[p*(1-p)/n] < 2 Sqrt[1/(4n)] 95% ME < 1/Sqrt(n) Simple and memorable for one binary variable. 0.10 0.12 0.20 0.28 0.40 0.50 0.57 0.63 0.75 0.81 0.88 0.96 0.10 0.13 0.20 0.29 0.40 0.50 0.58 0.63 0.76 0.82 0.89 1.00 3.0% 2.7% 2.0% 1.7% 1.3% 1.0% 0.6% 0.4% 0.4% 0.6% 1.4% 4.0% 6 2014 NNN4 Statistically-Significant Correlations Time-Series Correlations www.tylervigen.com Tangled bed‐sheet Deaths vs. Skiing Revenues 900 Deaths (US) Minimum statistically-significant r = 2/Sqrt(n) “n” is the number of pairs being correlated Less than 5% over for n between 5 and 4,000. Simple and memorable for two variables. Minimum Correlation for Statistical Significance Exact 2/sqrt(n) Error All errors positive means the model is sufficient. Solution 2014-Schield-NNN4-slides.pdf 4 2014 NNN4 Statistically-Significant Correlations Revenues: Blue line 2600 800 2400 700 2200 600 2000 500 400 300 2000 Correlation: 0.969724 1800 1600 Revenues ($M) 0F 1400 2002 2004 2006 2008 Source: http://tylervigen.com/view_correlation?id=1864 10 pairs; 2/Sqrt(10) = 0.63; Statistically significant 1 Statistically-Significant Correlations 0F 7 2014 NNN4 Statistically-Significant Correlations Correlation = -0.993 Bee colonies & MJ arrests Pool Drownings vs. Films with Nicholas Cage Bee Colonies vs. Juvenile Marijuana Arrests 55,000 45,000 Honey Bee Colonies 2,600 Drownings 65,000 Correlation: ‐0.933389 Arrests 75,000 3,000 4 120 85,000 Marijuna Arrests of Juveniles Colonies (K) Correlation: 0.666004 95,000 3,200 110 3.5 Cage films: Blue line Drowningns: Red line 3 2.5 100 2 1.5 90 1 35,000 2,400 25,000 2,200 15,000 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 80 1998 09 20 pairs; 2/Sqrt(20) = 0.45; Statistically-significant 2014 NNN4 Statistically-Significant Correlations 2000 2002 2004 2006 2008 0.5 2010 www.tylervigen.com/view_correlation?id=359 www.tylervigen.com/view_correlation?id=1582 0F 4.5 130 3,400 2,800 8 2014 NNN4 Statistically-Significant Correlations Correlation = 0.664 Drownings & Cage films Cage films 0F 11 Oct, 2014 9 11pairs; 2/Sqrt(11) = 0.60; Statistically-significant! 0F 2014 NNN4 Statistically-Significant Correlations 10 Correlation = 0.664 Drownings & Cage films Something Seems Wrong! 1. There is nothing linear about these associations. 2. These correlations seem unbelievably high. ----------------------#1: The correlation between two time-series eliminates the common factor: time. The question is whether their mutual association is linear. To see this, an XY-plot is generated. 0F 2014 NNN4 Statistically-Significant Correlations 11 #2: Very High Correlations. Three Explanations 1. Association is causal. See Tyler Vigen’s video: www.youtube.com/watch?feature=player_embedded&v=g-g0ovHjQxs 2. Association is spurious – just random chance. Five percent of random associations will be mistakenly classified as statistically significant. 3. Association is cherry-picked -- after the fact. According to Tyler, “This server has generated 24,470 correlations.” Tyler just picked those with high or interesting correlations. 2014-Schield-NNN4-slides.pdf 0F 2014 NNN4 Statistically-Significant Correlations 12 Conclusions 1. Use 2/Sqrt(n) as the minimum correlation for statistical significance. This criteria is sufficient, fairly accurate (within 5%) and memorable. 2. The correlation between two time-series eliminates time. Correlation determines the degree of linearity in their cross-sectional association. 3. Do not use a test for statistical significance if the data pairs were selected – after the fact via data mining – solely because of their high correlation. 2 Statistically-Significant Correlations 0F 11 Oct, 2014 13 2014 NNN4 Statistically-Significant Correlations Correlation = 0.993 Divorce & Margarine Usage Divorce Rate vs. Margarine Consumption 5 9 Correlation: 0.992558 8 4.8 7 4.6 6 Margarine Consumption per capita (US) 4.4 5 Divorce Rate 4.2 4 in Maine 4 3 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 Correlation: 0.993. N=10, SS_Rho = 1/sqrt(11) = 10 pairs; 2/Sqrt(10) = 0.64. Statistically-significant www.tylervigen.com/view_correlation?id=1703 2014-Schield-NNN4-slides.pdf 3 0F 2014 NNN4 Statistically-Significant Correlations 1 Statistically-Significant Correlations Milo Schield Augsburg College Editor of www.StatLit.org US Rep: International Statistical Literacy Project Fall 2014 National Numeracy Network Conference www.StatLit.org/pdf/2014-Schield-NNN4-Slides.pdf 0F 2014 NNN4 Statistically-Significant Correlations 2 Exact Solutions For N random pairs from an uncorrelated bivariate normally-distributed distribution, the sampling distribution is not simple. Here are three common analytic approaches: 1.Fisher transformation (using LN and Arctanh), 2.an exact solution (using a Gamma function), or 3.Student-t distribution: t=rSqrt[(n-2)/(1-r^2)]; df=n-2 • For large n, the critical value of t (95% confidence) is 1.96. • For small n, the critical value of t increases as n decreases. None of these are simple or memorable. 0F 2014 NNN4 Statistically-Significant Correlations 3 Sufficient Condition Approach: Find an equation generating a minimum correlation for statistical-significance given N. 1. Given N, find the smallest value of r where the left end of a 95% confidence interval is non-negative. Use calculator at www.vassarstats.net/rho.html or www.danielsoper.com/statcalc3/calc.aspx?id=44 For Daniels, use the results for a two-tailed test. 2. Generate correlation coefficient with simple model 3. Calculate error difference between calculated and exact using the exact as the standard. If all errors are positive, then the model is sufficient. 0F 2014 NNN4 Statistically-Significant Correlations 4 Simple Model: 2/SQRT(n) N 400 256 100 49 25 16 12 10 7 6 5 4 Minimum Correlation for Statistical Significance Exact 2/sqrt(n) Error 0.10 0.12 0.20 0.28 0.40 0.50 0.57 0.63 0.75 0.81 0.88 0.96 0.10 0.13 0.20 0.29 0.40 0.50 0.58 0.63 0.76 0.82 0.89 1.00 3.0% 2.7% 2.0% 1.7% 1.3% 1.0% 0.6% 0.4% 0.4% 0.6% 1.4% 4.0% All errors positive means the model is sufficient. 0F 2014 NNN4 Statistically-Significant Correlations 5 Solution Minimum statistically-significant r = 2/Sqrt(n) “n” is the number of pairs being correlated Less than 5% over for n between 5 and 4,000. Simple and memorable for two variables. It is similar to the formula for the maximum 95% Margin of Error in samples from a binary variable: 95% ME = 1.96 Sqrt[p*(1-p)/n] < 2 Sqrt[1/(4n)] 95% ME < 1/Sqrt(n) Simple and memorable for one binary variable. 0F 2014 NNN4 Statistically-Significant Correlations 6 Time-Series Correlations www.tylervigen.com Tangled bed‐sheet Deaths vs. Skiing Revenues Revenues: Blue line 2600 800 2400 700 2200 600 2000 500 1800 400 300 2000 Correlation: 0.969724 1600 Revenues ($M) Deaths (US) 900 1400 2002 2004 2006 2008 Source: http://tylervigen.com/view_correlation?id=1864 10 pairs; 2/Sqrt(10) = 0.63; Statistically significant 0F 2014 NNN4 Statistically-Significant Correlations 7 Correlation = -0.993 Bee colonies & MJ arrests Bee Colonies vs. Juvenile Marijuana Arrests 3,400 95,000 3,200 85,000 75,000 3,000 65,000 Correlation: ‐0.933389 2,800 55,000 Honey Bee Colonies 2,600 45,000 35,000 2,400 25,000 15,000 2,200 90 91 92 93 94 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 www.tylervigen.com/view_correlation?id=1582 20 pairs; 2/Sqrt(20) = 0.45; Statistically-significant Arrests Colonies (K) Marijuna Arrests of Juveniles 0F 2014 NNN4 Statistically-Significant Correlations 8 Correlation = 0.664 Drownings & Cage films Pool Drownings vs. Films with Nicholas Cage 130 4.5 Correlation: 0.666004 4 110 3.5 Cage films: Blue line Drowningns: Red line 3 2.5 100 2 1.5 90 80 1998 Cage films Drownings 120 1 2000 2002 2004 2006 2008 0.5 2010 www.tylervigen.com/view_correlation?id=359 11pairs; 2/Sqrt(11) = 0.60; Statistically-significant! 0F 2014 NNN4 Statistically-Significant Correlations 9 Something Seems Wrong! 1. There is nothing linear about these associations. 2. These correlations seem unbelievably high. ----------------------#1: The correlation between two time-series eliminates the common factor: time. The question is whether their mutual association is linear. To see this, an XY-plot is generated. 0F 2014 NNN4 Statistically-Significant Correlations Correlation = 0.664 Drownings & Cage films 10 0F 2014 NNN4 Statistically-Significant Correlations 11 #2: Very High Correlations. Three Explanations 1. Association is causal. See Tyler Vigen’s video: www.youtube.com/watch?feature=player_embedded&v=g-g0ovHjQxs 2. Association is spurious – just random chance. Five percent of random associations will be mistakenly classified as statistically significant. 3. Association is cherry-picked -- after the fact. According to Tyler, “This server has generated 24,470 correlations.” Tyler just picked those with high or interesting correlations. 0F 2014 NNN4 Statistically-Significant Correlations 12 Conclusions 1. Use 2/Sqrt(n) as the minimum correlation for statistical significance. This criteria is sufficient, fairly accurate (within 5%) and memorable. 2. The correlation between two time-series eliminates time. Correlation determines the degree of linearity in their cross-sectional association. 3. Do not use a test for statistical significance if the data pairs were selected – after the fact via data mining – solely because of their high correlation. 0F 2014 NNN4 Statistically-Significant Correlations 13 Correlation = 0.993 Divorce & Margarine Usage Divorce Rate vs. Margarine Consumption 5 9 Correlation: 0.992558 8 4.8 7 4.6 6 Margarine Consumption per capita (US) 4.4 5 Divorce Rate 4.2 4 in Maine 4 3 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 Correlation: 0.993. N=10, SS_Rho = 1/sqrt(11) = 10 pairs; 2/Sqrt(10) = 0.64. Statistically-significant www.tylervigen.com/view_correlation?id=1703
© Copyright 2026 Paperzz