Applied Data Analysis
Spring 2017

[Photo: Karen’s mom, Northern Ireland, late 1970s]

Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)

Lecture outline
1. Correlations

Meet Salem

Cat data

    library(MASS)
    data(cats)
    attach(cats)
    str(cats)
    ## 'data.frame': 144 obs. of 3 variables:
    ## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
    ## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
    ## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...

Scatter diagram

    plot(Bwt,Hwt)

[Figure: scatter diagram of Hwt (vertical axis) against Bwt (horizontal axis)]

The basics of scatter diagrams
• Ordered pairs
• Cartesian plane
• Abscissa/ordinate
• Association

Ordered pairs
• An ordered pair (x, y) is two numbers presented in parentheses and separated by a comma.
• The first number represents the value along the x-axis, and the second number represents the value along the y-axis.
• The x-coordinate of the ordered pair is known as the abscissa, and the y-coordinate is called the ordinate.

Cartesian plane
Ordered pairs can be plotted in a Cartesian plane, named after the 17th-century French mathematician and philosopher René Descartes.
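As a small illustration of ordered pairs on the Cartesian plane, the sketch below pairs each cat's body weight (the abscissa) with its heart weight (the ordinate) and plots them. The name `pairs_xy`, the column names, and the axis labels are illustrative choices rather than part of the lecture; the units (kg and g) are taken from the MASS documentation for `cats`.

```r
library(MASS)  # provides the cats data set

# Each row is one ordered pair (x, y): x = body weight, y = heart weight.
pairs_xy <- cbind(abscissa = cats$Bwt, ordinate = cats$Hwt)
head(pairs_xy, 3)

# Plot the pairs: abscissa on the horizontal axis, ordinate on the vertical.
plot(pairs_xy[, "abscissa"], pairs_xy[, "ordinate"],
     xlab = "Bwt (body weight, kg)",
     ylab = "Hwt (heart weight, g)",
     main = "Cats: heart weight against body weight")
```

Using `cats$Bwt` instead of relying on `attach(cats)` keeps the example self-contained.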
Famous quote: “I’m pink, therefore I’m spam.”

[Figure: Positive association, var[, 2] plotted against var[, 1]]

[Figure: Negative association, var[, 2] plotted against var[, 1]]

[Figure: No association, var[, 2] plotted against var[, 1]]

Summarizing the relationship
We need a number that will help us summarize the relationship. The number is known as the correlation.

Correlation coefficient
• The value that helps us measure the association between two variables is the correlation coefficient.
• The symbol for the correlation is r.
• The correlation coefficient measures how tightly the point cloud clusters around a line. The more tightly clustered, the stronger the correlation (r closer to −1 or 1). The less tightly clustered, the weaker the correlation (r closer to 0).

Some facts about r
• The correlation coefficient ranges between −1 and 1: −1 ≤ r ≤ 1.
• 1 is a perfect positive correlation, and −1 is a perfect negative correlation.
• The correlation coefficient is unitless: it is not degrees, percents, or inches.

Negative correlation
A positive correlation means that as one variable increases, so does the other. A negative correlation means that as one variable decreases, so does the other. Right?
Wrong. It means that as one variable decreases, the other increases. Example?
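The facts about r above can be tried on a few toy vectors. This is a minimal sketch with made-up data: `x`, the linear formulas, the seed, and the noise level are all arbitrary assumptions chosen only to show the sign and range of r.

```r
x <- 1:10

# Perfect positive association: y rises by a fixed amount whenever x rises.
r_pos <- cor(x, 2 * x + 3)

# Perfect negative association: y falls by a fixed amount whenever x rises.
r_neg <- cor(x, -0.5 * x + 20)

# A noisy negative association: r is negative but strictly between -1 and 0.
set.seed(42)  # arbitrary seed, for reproducibility only
y_noisy <- -x + rnorm(length(x), sd = 2)
r_noisy <- cor(x, y_noisy)

# r is unitless: rescaling x (say, from metres to centimetres) leaves it alone.
r_rescaled <- cor(100 * x, y_noisy)

c(r_pos = r_pos, r_neg = r_neg, r_noisy = r_noisy, r_rescaled = r_rescaled)
```

The first two values come out as 1 and −1 (up to floating-point rounding) because the points lie exactly on a line; the noisy case shows a negative association in which one variable tends to go down as the other goes up.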
What correlation cannot tell us
1. It cannot tell us how much one variable changes in response to a change in the other variable.
2. It does not tell us the magnitude of the change.
3. That is, it does not tell us the slope of the point cloud.
4. That is a job for regression.

Cave! Hic dragones (Beware! Here be dragons)

Pastafarianism
Church of the Flying Spaghetti Monster

Pastafarianism “beliefs”
• We believe pirates, the original Pastafarians, were peaceful explorers, and it was due to Christian misinformation that they have an image of outcast criminals today.
• We are fond of beer.
• Every Friday is a religious holiday.
• We do not take ourselves too seriously.
• We embrace contradictions (though in that we are hardly unique).

How the world was created...
We believe the Flying Spaghetti Monster created the world much as it exists today, but for reasons unknown made it appear that the universe is billions of years old (instead of thousands) and that life evolved into its current state (rather than created in its current form). Every time a researcher carries out an experiment that appears to confirm one of these “scientific theories” supporting an old earth and evolution we can be sure that the FSM is there, modifying the data with his Noodly Appendage. We don’t know why He does this but we believe He does, that is our Faith.

Other examples
• The number of firefighters at a fire is positively correlated with the amount of damage caused by the fire.
• The accuracy of bombers in WWII is positively correlated with the number of opposition fighters.
• Children’s reading level is positively correlated with the amount of TV watched.
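One common reading of the firefighter example is that a third variable, the size of the fire, drives both the number of firefighters and the damage. The simulation below is a sketch of that idea only: every number in it is made up, and the names `fire_size`, `firefighters`, and `damage` are hypothetical. It uses `lm()` to remove the effect of fire size, which borrows from regression.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 500

# Hypothetical lurking variable: the size of each fire (arbitrary 0-1 scale).
fire_size <- runif(n)

# Both outcomes depend on fire size, not on each other (made-up coefficients).
firefighters <- 2 + 10 * fire_size + rnorm(n)
damage       <- 5 + 50 * fire_size + rnorm(n, sd = 5)

# The raw correlation between firefighters and damage is strongly positive ...
r_raw <- cor(firefighters, damage)

# ... but it largely disappears once fire size is removed from both variables.
r_adjusted <- cor(resid(lm(firefighters ~ fire_size)),
                  resid(lm(damage ~ fire_size)))

round(c(r_raw = r_raw, r_adjusted = r_adjusted), 2)
```

In this simulated world, sending fewer firefighters would not reduce the damage, even though the two are highly correlated.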
Notation flashback

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i \times p_i \]

\[ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \times p_i} \]

Covariance
To calculate r, we need a value known as the covariance:

\[ \operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \times p_i \]

The covariance measures the relationship between two variables.

Problem
The covariance has a problem: it depends on the units in which x and y are measured.

    cov(Bwt,Hwt)
    ## [1] 0.9501127
    Bwt.new <- Bwt*10
    Hwt.new <- Hwt*10
    cov(Bwt.new,Hwt.new)
    ## [1] 95.01127

Solution: the correlation coefficient

\[ r = \frac{\operatorname{cov}(x, y)}{s_x s_y} \]

    cov(Bwt,Hwt)
    ## [1] 0.9501127
    cov(Bwt,Hwt)/(sd(Bwt)*sd(Hwt))
    ## [1] 0.8041274
    cor(Bwt,Hwt)
    ## [1] 0.8041274

Note that R's cov() and sd() divide by n − 1 rather than n. The factor cancels in the ratio, so r is the same under either convention.

Another view

\[
r = \frac{\operatorname{cov}(x, y)}{s_x s_y}
  = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

Another view
Remember that we said that the correlation coefficient is unitless. How do you make a value unitless? We use standardized scores:

\[ z_i = \frac{x_i - \bar{x}}{s} \]

We should then be able to calculate the correlation coefficient using standardized units.

Another view
r can be calculated as the average of the products of the X's in standard units and the Y's in standard units.

\[ r = \frac{1}{n}\sum_{i=1}^{n} z_i^{x} \times z_i^{y} \]

    # sd() divides by n - 1; rescale it to the 1/n standard deviation used above
    n <- length(Bwt)
    Bwt.z <- (Bwt-mean(Bwt))/(sd(Bwt)*sqrt((n-1)/n))
    Hwt.z <- (Hwt-mean(Hwt))/(sd(Hwt)*sqrt((n-1)/n))
    mean(Bwt.z*Hwt.z)
    ## [1] 0.8041274

More about r
• r is not affected by interchanging the variables.
• Adding the same number to all the values of one variable does not change r.
• Multiplying all the values of one variable by the same positive number does not change r.
• r can be fooled by outliers and nonlinear associations. Always look at a scatter diagram.

What did we learn?
• Pirates
• Global warming
• Pastafarianism
• Covariance
• Correlation
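The properties listed under "More about r" can be checked numerically on the cat data. The sketch below is one way to do so; the shift of 100, the scale factor of 1000, and the added outlier (a hypothetical 10 kg cat with a 1 g heart) are arbitrary values chosen for illustration.

```r
library(MASS)  # provides the cats data set

x <- cats$Bwt
y <- cats$Hwt
r <- cor(x, y)

# r from the sums-of-deviations form; the 1/n factors cancel in the ratio.
dx <- x - mean(x)
dy <- y - mean(y)
r_by_hand <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

# Interchanging the variables, shifting, and positive rescaling leave r alone.
r_swapped <- cor(y, x)
r_shifted <- cor(x + 100, y)
r_scaled  <- cor(1000 * x, y)  # e.g. body weight in g instead of kg

# A single made-up outlier pulls r down noticeably.
r_outlier <- cor(c(x, 10), c(y, 1))

round(c(r = r, by_hand = r_by_hand, swapped = r_swapped,
        shifted = r_shifted, scaled = r_scaled, outlier = r_outlier), 4)
```

The rescaling check also illustrates why r says nothing about slope: multiplying one variable by 1000 changes the slope of the point cloud by that same factor while r stays put.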