4 Measurements of Location and Position

The primary purpose of collecting sample data is to draw inferences regarding the population from which it came. Once we have collected our sample data and “looked” at it with an appropriate graphical display, we will want to generate summary statistics. Summary statistics describe (summarize) the data numerically. The first of these summary statistics provides us with information regarding the location of our data.

Learning Objectives
• Compute and interpret the location of the center of a data set along the number line.
• Compute and interpret the position of a data value relative to the other data values for a data set.
• Identify the appropriate measure of center for a data set by distributional shape.

4.1 Measurements of Location

When we talk about the location of a data set, and measurements of location, we are referring to a measurement of the value of the center of the data. In other words, what value on the number line best identifies the center of the distribution? In comparison, measurements of position refer to the position of individual data values within both the population and sample distributions. There are many ways to measure the center, three of which we will discuss here. The choice of measurement will most often be topic dependent. The diagram to the right shows the approximate centers for two different data sets shown on the same graph.

[Figure: We use the “center” of the distribution to describe the distribution “location.”]

4.1.1 The Sample Mean

Data types applicable: interval/ratio.

The most commonly used measurement of distribution location is the mean. When people speak of an average, they are most often referring to the arithmetic mean. The mean is a numerical value that acts as a balancing point within the data, with equal weight above and below. Consider the data set that consists only of the values 1 and 3.
The mean is a number that equally balances the values in the data set. In this case, the mean is 2. Notice the value 2 is not a member of the data set. There is no requirement for the mean to be an actual member of our data set.

Symbolically, if we are using the variable x to represent our data, then we denote the sample mean by placing a bar over the x. The sample mean is then referred to (in English) as “x-bar” and denoted $\bar{x}$. Likewise, if we are using a y to represent our data, the sample mean will be denoted by $\bar{y}$. Recall that a sample statistic is used both to summarize the sample data and to estimate the population parameter of interest. The population mean (parameter) is denoted by µ (pronounced mu). Quite often, if we have more than one list of data values, say one denoted by x and the other by y, then we will use a subscript with µ to indicate which population mean we are talking about, such as $\mu_x$ or $\mu_y$.

Mathematically, the mean is defined as:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

In English, this mathematical equation is telling you to add up all the data values and then divide by the sample size. For our simple example with the data set consisting only of 1 and 3, the mean is calculated by:

$\bar{x} = \frac{1}{2} \sum_{i=1}^{2} x_i = \frac{1 + 3}{2} = 2$

Consider the data set consisting of the numbers 1, 30, and 32. The sample size is 3, so n = 3. The sample mean is calculated as:

$\bar{x} = \frac{1}{3} \sum_{i=1}^{3} x_i = \frac{1 + 30 + 32}{3} = 21$

Notice the value of the mean is no longer physically the middle of the data set. Rather, it maintains a position that equally balances the data values. The directed distance between the value to the left and the sample mean is 1 - 21 = -20. The directed distances between the values to the right and the sample mean are 30 - 21 = 9 and 32 - 21 = 11. The sum of those two values is 9 + 11 = 20. The distances balance out since -20 + 20 = 0. This will always occur when the mean is used as the measure of the center.
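The balancing property of the mean can be checked numerically. The following is a minimal Python sketch, not part of the TI-83 or MINITAB material in this chapter; the function name sample_mean is a hypothetical helper introduced only for illustration.

```python
def sample_mean(values):
    """Add up all the data values and divide by the sample size n."""
    return sum(values) / len(values)


data = [1, 30, 32]
x_bar = sample_mean(data)

# Directed distance of each data value from the mean: x_i - x_bar.
deviations = [x - x_bar for x in data]

print(x_bar)            # 21.0
print(deviations)       # [-20.0, 9.0, 11.0]
print(sum(deviations))  # 0.0, the distances balance out
```

For data that are not whole numbers, the sum of the deviations computed this way may differ from zero by a tiny rounding error, even though it is exactly zero mathematically.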
4.1.2 Median

Data types applicable: ordinal, interval/ratio.

Another common measurement of location is the median. The median is defined as the value that occupies the middle position when the data are sorted according to size. Unlike the mean, if the sample size is odd, the sample median will be a value in your data set. If the sample size is even, the sample median may or may not be a value in your data set. We will discuss this later with an example. The sample median (statistic) is denoted by M. The population median (parameter) is denoted by the Greek letter θ (pronounced theta).

Let n equal the number of data points (sample size). Then the position of the median can be identified by:

$\text{Position} = \frac{n + 1}{2}$

If n is odd, then the position will be an integer such as 6. If n is even, then the position of the median will not be an integer. Rather, it will be the midpoint between two positions, such as 6.5.

Another definition commonly used to describe the median is a value such that no more than 50% of the data values fall below and no more than 50% fall above. If the sample size is even and no data values are tied with the median, then we will get exactly 50% above and 50% below. If the sample size is odd, then we will get less than 50% above and below, but in no case will there ever be more than 50% above or below.

Consider the data set consisting of the values 11, 5, 1, 12, 9. In this example the sample size, n, is 5. The position of the median is then calculated as $\frac{5 + 1}{2} = 3$.

DON’T FORGET TO SORT YOUR DATA BEFORE YOU GO TO THE CALCULATED POSITION AND IDENTIFY THE MEDIAN

| Position | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Data Values | 1 | 5 | 9 | 11 | 12 |

The sorted data is: 1, 5, 9, 11, 12. The 3rd position is occupied by 9, so the sample median, M, is 9. If we failed to sort the data, then we would have identified the number 1 as the median, a serious mistake.
If the median was not based on the sorted data, then the value of the median would be determined purely by the order the data was recorded. The same data set, simply written down in a different order, would then have different medians. This, in turn, would make the median a completely useless statistic. For this reason, sorting the data prior to identifying the median is crucial.

In this example the sample size is 5, an odd number. The median has been identified as 9. There are two data values below the median and two data values above the median, which means we have 40% of the data below the median and 40% above. This satisfies our definition of the median in that we have no more than 50% below (2/5 = 0.40) and no more than 50% above (2/5 = 0.40).

Consider the data set consisting of the values 11, 5, 12, 9. In this example the sample size is 4, so the position of the median is calculated as:

$\frac{4 + 1}{2} = \frac{5}{2} = 2.5$

DON’T FORGET TO SORT YOUR DATA BEFORE YOU GO TO THE POSITION AND IDENTIFY THE MEDIAN

| Position | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Data Values | 5 | 9 | 11 | 12 |

Sample Median = 10

The sorted data is: 5, 9, 11, 12. The 2.5th position is not occupied by an actual data value. The fact that the position is identified as position 2.5 simply tells us to compute the average of the data values in the 2nd and 3rd positions and identify this as our median. Hence, $M = \frac{9 + 11}{2} = \frac{20}{2} = 10$. Notice the value 10 is not a member of our data set. There are two data values below the median and two data values above the median, which means in this case we have exactly 50% of the data values below the median and 50% above. This satisfies our definition of the median.

Now consider the data set consisting of the numbers 1, 2, 7, 7, 9, 15. Identify the sample median. Notice the sample size is 6, hence the position of the sample median is given as $\frac{6 + 1}{2} = \frac{7}{2} = 3.5$. The 3rd and 4th sorted data values are both 7’s.
If we compute their average, then we will get 7, hence the sample median is 7. In the previous example, the sample size was even and the median was not a member of the data set. In this example, the sample size is even, and the median is a member of the data set.

4.1.3 Mean vs. Median

There are times when the median is a much more desirable measurement of location than the mean. As an example, when working with cancer research, the median survival time is typically reported rather than the mean. Suppose 12 people were given a new cancer treatment. Further suppose patients who have this particular type of cancer have an expected survival time of 3 years if they use the conventional treatment. Consider the survival times of the 12 persons in our study as recorded (in years) below:

1.0  13.0  1.5  12.0  2.0  2.75  2.5  2.25  0.5  1.0  10.0  1.5

The mean survival time is 4.1667 years. Based only on the value of the mean, one may believe the new treatment is better than the conventional. If data analysis were really this straightforward, then you would not be taking this course. In a later chapter we will look closer at comparing the sample mean of 4.1667 with the believed population mean of 3 to determine if the sample mean really suggests the survival time is greater than 3 years, or if the difference is simply a matter of random chance.

The sample median in this example is 2.1250 years. That value is telling us that half of the patients survived less than 2.1250 years, and half survived more than 2.1250 years. In fact, looking closer at the sample data, only 3 of the 12, or 25%, survived longer than 3 years. Reporting the mean, in this example, would be very misleading whereas the median paints a much more accurate picture of survivability. If a patient were diagnosed with cancer, would it be better to inform the patient the average survival time is 4 years, the median survival time is 2.1250 years, 25% survive longer than 3 years, or possibly all?
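The three summaries in that question can be reproduced directly from the twelve survival times. The following is a minimal Python sketch under the chapter's position rule (n + 1)/2; the function name sample_median and the variable names are hypothetical and introduced only for illustration.

```python
def sample_median(values):
    """Median by the position rule: sort first, then go to position (n + 1) / 2."""
    ordered = sorted(values)          # never skip the sort
    n = len(ordered)
    position = (n + 1) / 2            # 1-based position, e.g. 3.0 or 6.5
    lower = ordered[int(position) - 1]
    if position == int(position):     # odd n: the position is occupied by a data value
        return lower
    upper = ordered[int(position)]    # even n: average the two neighbouring values
    return (lower + upper) / 2


survival = [1.0, 13.0, 1.5, 12.0, 2.0, 2.75, 2.5, 2.25, 0.5, 1.0, 10.0, 1.5]

mean_survival = sum(survival) / len(survival)
median_survival = sample_median(survival)
over_three = sum(1 for t in survival if t > 3)

print(round(mean_survival, 4))       # 4.1667
print(median_survival)               # 2.125
print(over_three / len(survival))    # 0.25
```

The same function returns 9 for the data set 11, 5, 1, 12, 9 and 10 for the data set 11, 5, 12, 9, matching the worked median examples.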
This is another example of how a picture would be of tremendous use. How easy is it to get a feeling for what the data is telling you simply by looking at the numbers? The summary statistics help, but the combination of the summary statistics and a picture really starts to tell the story.

[Figure: histogram of the survival data; horizontal axis: Survival Time (years), vertical axis: Frequency.]

One advantage to reporting the sample median over the sample mean is that the median is not influenced by outliers. This can best be described with an illustration. Consider the data set consisting of the values 1, 2, and 3. The mean and the median are both 2. Now consider the data set consisting of the values 1, 2, and 50. The mean is 17.6667, but the median is still 2. The data value 50, an outlier or extreme value in this example, pulls the value of the mean toward it, yet the median remains unchanged.

[Figure: position of the median and the sample mean; M = Median, $\bar{y}$ = Sample Mean.]

4.1.4 Mode

The mode is the only measurement of “center” that can be applied to nominal, ordinal, interval or ratio data, although the meaning of the “center” for nominal data is not as clear as it is for ordinal, interval and ratio data. The mode, when discussing nominal or ordinal data, is the value in the data set that occurs most frequently. When the word mode is used in reference to interval and ratio data, it refers to the class that has the most observations.

Consider the following data obtained from a recent statistics class consisting of student eye color along with the associated frequency table. The mode for this nominal data is brown eyes.

Student Eye Color
Brn Brn Brn Blue Grn Blue Blue Hzl Brn Blue Grn Blue Brn Grn Brn Grn Blue Brn Brn Blue Hzl Brn Hzl Brn Brn Hzl Blue Brn Brn Blue Grn Hzl

| Freq | Rel. Freq. |
|---|---|
| 11 | 11/24 = 0.4583 |
| 6 | 6/24 = 0.2500 |
| 4 | 4/24 = 0.1667 |
| 3 | 3/24 = 0.1250 |

Now consider the calculus test scores from the previous chapter, along with the associated frequency table.
The mode (modal class) is the class with the highest observed frequency, which in this case is 70 to 80.

| Class | Frequency | Rel. Frequency | Cum. Frequency | Cum. Rel. Freq |
|---|---|---|---|---|
| 0 ≤ x < 10 | 0 | 0/20 = 0% | 0 | 0/20 = 0% |
| 10 ≤ x < 20 | 1 | 1/20 = 5% | 1 | 1/20 = 5% |
| 20 ≤ x < 30 | 0 | 0/20 = 0% | 1 | 1/20 = 5% |
| 30 ≤ x < 40 | 0 | 0/20 = 0% | 1 | 1/20 = 5% |
| 40 ≤ x < 50 | 2 | 2/20 = 10% | 3 | 3/20 = 15% |
| 50 ≤ x < 60 | 0 | 0/20 = 0% | 3 | 3/20 = 15% |
| 60 ≤ x < 70 | 4 | 4/20 = 20% | 7 | 7/20 = 35% |
| 70 ≤ x < 80 | 9 | 9/20 = 45% | 16 | 16/20 = 80% |
| 80 ≤ x < 90 | 2 | 2/20 = 10% | 18 | 18/20 = 90% |
| 90 ≤ x < 100 | 2 | 2/20 = 10% | 20 | 20/20 = 100% |

Caution must be exercised when using the mode as a measure of the center of the distribution for interval and ratio data. The reason for this caution is that the mode may depend on the class widths chosen. As you change class widths, the modal class may change. This makes the mode an unreliable measure of the center, whereas the mean and median remain the same regardless of the number of classes, or the width of the classes, chosen.

4.2 Measurements of Position

Measurements of position are used to describe the position of a specific data value in relation to the rest of the data. The two measures of position we will be looking at are quartiles and percentiles.

4.2.1 Quartiles

Data types applicable: interval/ratio.

Quartiles are typically calculated for interval or ratio data, although in some cases they may have meaning for ordinal data. It does not make any sense to calculate quartiles for ordinal data consisting of only three categories.

Quartiles divide the data into four equal groups with each group containing at most 25% of the data. The notation for the population parameter for a quartile is θ (theta) with an appropriate subscript, whereas Q, with the appropriate subscript, denotes the sample quartiles. The population first quartile is denoted by $\theta_1$, and the third quartile by $\theta_3$. The second quartile, better known as the median, is typically represented by θ without a subscript.
The definitions for the sample statistics and the population parameters are essentially the same, the only exception being that the population parameters summarize the data in the population, and the sample statistics summarize the sample data obtained.

The first sample quartile, $Q_1$, is that value such that at most 25% of the data fall below and at most 75% fall above. The second sample quartile, $Q_2$, is that value such that at most 50% of the data fall below it and at most 50% of the data fall above. The second quartile is seldom referred to as the second quartile. Rather, we typically refer to it as the median (it is, after all, by definition the value in the middle). The letter M is used to represent the sample statistic for the median. The third quartile, $Q_3$, is that value such that at most 75% of the data fall below it and at most 25% of the data fall above.

[Figure: the population divided into four groups of 25% each by $\theta_1$, θ and $\theta_3$; the sample divided into four groups of 25% each by $Q_1$, M and $Q_3$.]

4.2.2 The Five Number Summary

The five number summary is often used collectively as a numerical summary of the sample data. As such, the five number summary only has meaning for data that is at least interval scale (i.e. interval or ratio scale data). The summary consists of:
• the minimum value from the sample data (min)
• the first quartile ($Q_1$)
• the median (M)
• the third quartile ($Q_3$)
• the maximum value (max)

Combined, these values are used to create a box-and-whisker plot, another pictorial representation of the sample data. Recall the data from the calculus exam.

Calculus Exam Scores - AM
80 42 73 86 93 78 42 68 71 73
75 69 19 75 68 67 75 78 73 98

We can find the values of the five number summary as displayed below, on both the TI-83 and MINITAB.
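For readers working outside the TI-83 and MINITAB, the five values can also be sketched in Python. This is an illustration only: statistical packages use slightly different quartile rules, and the choice of method="exclusive" in the standard library's statistics.quantiles is an assumption made here because it gives the quartiles reported for this data set in this section, not a statement of how either tool computes them. The function name five_number_summary is hypothetical.

```python
import statistics

am_scores = [80, 42, 73, 86, 93, 78, 42, 68, 71, 73,
             75, 69, 19, 75, 68, 67, 75, 78, 73, 98]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # quantiles() sorts the data internally; n=4 asks for the three quartile cut points.
    # ASSUMPTION: the "exclusive" rule is used; other rules can give other quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, median, q3, max(values)


summary = five_number_summary(am_scores)
print(summary)  # (19, 68.0, 73.0, 78.0, 98)
```

With other data sets, a different quartile rule may shift the first and third quartiles slightly, so small disagreements between software packages are to be expected.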
Once obtained, we can use these values to prepare another useful graphical representation of our data called a box-and-whisker plot.

[Figure: TI-83 Output]

MINITAB Output

Descriptive Statistics

| Variable | N | Mean | Median | TrMean | StDev | SE Mean |
|---|---|---|---|---|---|---|
| Calc-AM | 20 | 70.15 | 73.00 | 71.44 | 17.90 | 4.00 |

| Variable | Minimum | Maximum | Q1 | Q3 |
|---|---|---|---|---|
| Calc-AM | 19.00 | 98.00 | 68.00 | 78.00 |

4.2.3 Box-And-Whisker Plot

Data types applicable: interval/ratio.

The box-and-whisker plot, more commonly referred to simply as a box plot, is really nothing more than a picture of the five number summary. The box-and-whisker plot has three main components. The first component is the box. The ends of the box represent $Q_1$ and $Q_3$. An additional vertical line is drawn within the box to represent the location of the median. The “whiskers” are then drawn from the ends of the box, reaching out to the minimum and maximum values.

[Figure: box-and-whisker plot with Min, $Q_1$, M, $Q_3$ and Max labeled.]

The left end of the “whisker” represents the minimum value, the right end of the “whisker” represents the maximum value. The first vertical bar represents the location of the 1st quartile, the 2nd vertical bar represents the location of the median, and the last bar represents the location of the 3rd quartile.

With the TI-83, if we press TRACE, the cursor will jump to the graph and we can use our right and left arrows to read off the minimum and maximum values along with the quartiles.

[Figure: TI-83 Using Trace; MINITAB box plot.]

Outliers are defined as any value that lies more than $1.5(Q_3 - Q_1)$ above $Q_3$ or more than $1.5(Q_3 - Q_1)$ below $Q_1$. An * is displayed to denote outlier values. MINITAB displays these outliers automatically. The TI-83 has two options for a box plot. The first will display outliers, and the second will connect all outliers to the whiskers on the box plot. The box plots chosen for the TI-83 here do not display the outliers.

[Figure: TI-83 box plot options; Display outlier, Do not display outlier.]

Keep in mind that each section contains at most 25% of the data values. The picture will give you an idea how the data is spread out.
Since $Q_1$ is 68 and $Q_3$ is 78, we know approximately 50% of our sample data falls between these values. This is telling us the majority of the student scores were close together, with a few people doing very well, and a few people doing very poorly.

One powerful use of box-and-whisker plots is in comparing two or more data sets. The calculus exam scores came from an early morning class. Consider the following scores from the same exam, but an early afternoon class.

Calculus Exam Scores - PM
86 73 94 91 82 82 77 74 79 86
81 76 54 70 75 80 75 72 79 99

It is difficult to simply “look at the numbers” and see possible differences in the distribution of the exam scores from each class. If we look at the box plot for the afternoon class, the shape seems a little different. Grasping how different is difficult without a scale.

[Figure: PM Calculus Scores Without Scale]

If we look at the side-by-side box plots, which provide a relative scale, a difference between the two classes seems to emerge. The top box plot is the morning class.

[Figure: TI-83 Side-by-Side Box Plots; MINITAB Output Of The Same Data]

By placing the box plots on the same scale, we have a visual comparison of the center of the distributions and respective variability.

4.2.4 Box-And-Whisker Plots vs. Histograms

Both the box-and-whisker plot and histograms are very useful pictures of your data. They both present the overall shape of your data and will inform you of potential outliers. The box plot provides additional information in that it indicates the location of the quartiles and median relative to each other. Regardless, both graphical techniques present a similar overall pictorial representation of your data. The following pictures represent a box plot and the corresponding histogram for the various data shapes.

[Figure: box plot and histogram pairs for four data shapes: Approximately Bell Shaped, Skewed Left, Approximately Uniform, Bimodal.]

4.2.5 Percentiles

Data types applicable: interval/ratio.
A percentile is the value below which at most a specified percentage of the data falls. The 25th percentile is the same as the first quartile ($P_{25} = Q_1$). The 98th percentile ($P_{98}$) is the point such that “at most” 98% of the data falls below and “at most” 2% falls above.

Suppose we are interested in the 65th percentile for a particular set of data that we have collected. By definition, this is the value such that at most 65% of the data values are below and at most 35% are above. The at most statement here is very important in that it is rare we will have a data set such that the 65th percentile (or any percentile) will give you exactly 65% of the data values below and 35% above.

Example: Consider a data set consisting of the numbers 8, 3, 11, 0, 1, 7, 13, 12, 9, 2, 15. The first thing we want to do is put the values in order, which gives us: 0, 1, 2, 3, 7, 8, 9, 11, 12, 13, 15. To find the location of $P_{65}$ we will multiply 11 (there are 11 observations here) by 0.65, which is 7.15. This is suggesting we consider the 7.15th position for the 65th percentile. The 7.15th position does not exist, so we will consider averaging the 7th and 8th values. But first we need to see if we really have at most 65% below and 35% above: 7/11 = 0.636 and 4/11 = 0.364. So we have 63.6% below (which is fine by the at most definition), but we have 36.4% above, which is more than the 35% allowed by definition. This is telling us we need to go to the 8th position. Using the 8th position, we still have seven values below (63.6%), but we now have only 3 values above, or 27.3%. This satisfies the at most definition. So the 65th percentile is the data value in the 8th position, or 11. This is what is meant by seldom being able to actually find a value that gives you exactly the desired proportion above and below.

This leads to an interesting point. If our data set is small, what purpose do percentiles serve? Consider the data set 1, 3, 17.
The 85th percentile is 17, which is also the 86th percentile, the 87th percentile, the 88th percentile, ..., and so forth. Reporting percentiles in small data sets can be misleading, and of little value.

Example: Consider the data set consisting of the numbers 5, 9, 11, 21, 90, 3, 13, 43, 21, 63, 34, 26, 53, 18, 37, 31, 40, 19, 1, 21. Suppose we were interested in $P_{20}$, the 20th percentile. We will locate the position by multiplying the sample size, 20, by 0.20, which gives us 4. If we consider the 4th ordered data value, then we will have 3 below, or 3/20 = 15%. Above we have 16/20, or 80%. The definition of the 20th percentile states we should have at most 20% below and at most 80% above. The 4th position certainly meets the definition, but can we do better? If we consider averaging the 4th and 5th positions, then we would have 4 data values below, or exactly 20%, and 16 data values above, or 80%. By averaging the 4th and 5th ordered data values, we can get exactly 20% below and 80% above. The 4th and 5th ordered data values are 9 and 11, the average of which is 10, so we will identify $P_{20}$ as 10.

The process to find a desired percentile, as illustrated above, can be summarized as follows:

Procedure For Finding Percentiles
1. Sort the data from smallest value to largest value.
2. Find the location by multiplying the sample size, n, by the proportion represented by the percentile.
3. If the location calculated is a non-integer value, round up to the next integer value. Identify the data value in that position as the desired percentile.
4. If the location calculated is an integer value, average the data value in that position and the next position. The average of the two values will be identified as the desired percentile.

4.3 Correlation (Optional)

(This topic is covered in more detail in a later chapter.)

Data types applicable: interval/ratio.
Sir Francis Galton (1822 - 1911) made a great deal of progress in the area of methods for studying the relationship between two variables. Statisticians in Victorian England were fascinated by the idea of quantifying hereditary influences and gathered huge amounts of data in pursuit of developing methods for studying the relationship between two variables. One of Galton’s disciples, Karl Pearson (1857 - 1936), completed a great deal of work in the area of resemblance between family members. As part of his studies, Pearson measured the heights of 1078 fathers along with their sons at maturity. Pearson’s studies led to what is known today as Pearson’s product-moment correlation.

[Figure: Karl Pearson]

Consider the relationship, if it exists, between the number of cigarettes smoked and CHD mortality based on the data as presented below.

| Country | Cigarettes Adult/Year | CHD Mortality per 100,000 |
|---|---|---|
| Australia | 3220 | 238 |
| Austria | 1770 | 182 |
| Belgium | 1700 | 118 |
| Canada | 3350 | 211 |
| Denmark | 1500 | 120 |
| Finland | 2160 | 208 |
| France | 1410 | 60 |
| Germany | 1890 | 150 |
| Greece | 1800 | 41 |
| Iceland | 2290 | 208 |
| Ireland | 2770 | 187 |
| Italy | 1510 | 114 |
| Mexico | 1680 | 32 |
| Netherlands | 1810 | 125 |
| New Zealand | 3220 | 212 |
| Norway | 1090 | 90 |
| Spain | 1200 | 44 |
| Sweden | 1270 | 127 |
| Switzerland | 2780 | 160 |
| UK | 2790 | 194 |
| USA | 3900 | 265 |

In this case, we have not specified which variable is thought of as the independent and which as the dependent. However, it is logical that someone would want to look at this data and perhaps attempt to make a prediction of the CHD mortality based on the number of cigarettes smoked. For this reason, it would make sense to think of the number of cigarettes smoked as being the independent variable and the CHD mortality as the dependent variable. In terms of calculating a correlation, it will not make any difference which variable is the independent and which is the dependent; however, it will make a difference later when we attempt to use this data for prediction.
Hopefully, the scatter plot will look like a cloud of points with a visible trend. We need to numerically summarize this cloud. Such a numerical summary would then tell us a great deal about the strength and direction of the association between the two variables.

[Figure: TI-83 and MINITAB Scatter Plot Of Cigarette Data]

If there is a strong association between two variables, then knowing the value of one variable helps a great deal in predicting the value of the other variable. If the association is weak, then information about one variable does not help in predicting the other. Pearson’s correlation coefficient is just such a measurement. Specifically, Pearson’s correlation is a measurement of linear association. The population parameter is denoted by the Greek letter ρ (pronounced rho). The sample statistic is denoted by the letter r.

Both ρ and r can assume values on the interval [-1, 1]. It is mathematically impossible for the value of the correlation coefficient to take on a value less than -1 or greater than 1. A correlation coefficient of 0 means there is no linear association. A correlation of 1 means there exists a perfect positive (as x gets larger, y gets larger) linear association. In other words, your scatter plot will be a bunch of dots that line up perfectly with a positive slope. If r = 1 means a perfect alignment, then what would r = 1.5 mean … a “more perfect alignment?” It should seem clear that r > 1 simply has no meaning. Likewise, r = -1 means a perfect negative (as x gets larger, y gets smaller) linear association so r < -1 has no meaning.

Although correlation is a measurement of linear association, it is often difficult to interpret the actual meaning of a specified correlation coefficient value. As an example, r = 0.80 is not “twice as linear” as r = 0.40. This will be explored more later. Consider the following scatter plots and the associated value of the sample correlation coefficient.
The ovals have been added for emphasis.

[Figure: six scatter plots with their sample correlation coefficients:
• r = -0.068: Essentially a blob with no sign of a linear trend.
• r = 0.32: Possibly a very weak positive linear trend, but difficult to see.
• r = 0.50: Positive linear trend in the data is more readily visible.
• r = 0.98: Very strong, clearly visible positive linear trend.
• r = 1.0: Perfect positive linear association.
• r = -1.0: Perfect negative linear association.]

4.3.1 Calculating the Sample Correlation

As with other statistical procedures we have explored, Pearson’s correlation has a set of assumptions that must be reasonably satisfied before the technique can be properly used. The assumptions for Pearson’s correlation are:
• Both variables are randomly selected from the populations they represent.
• A scatter plot of the data appears reasonably linear (a more specific discussion of the assumptions will be presented in a future chapter). If the scatter plot has curvature to it, then Pearson’s correlation may not be appropriate. Remember, Pearson’s correlation is a measurement of linear association.

Consider the Cigarette versus CHD Mortality data. In looking at the scatter plot, does there appear to be a linear trend? The answer is yes. There is a clear linear trend and it seems to be relatively strong.

[Figure: Cigarette versus CHD Mortality]

Calculating the value for the sample correlation coefficient by hand is a very unpleasant thing to do. Both MINITAB and the TI-83 will calculate this value for you. When using the TI-83, you must first make sure your calculator is in Diagnostic mode. Once your calculator is put in diagnostic mode it will remain that way until you change it, so this really only needs to be done once. Select the calculator’s catalog by pressing 2nd CATALOG, then type the letter D, which will cause the catalog to jump to the commands that start with the letter “D”. Use the down arrow key until the arrow points to DiagnosticsOn. Press ENTER and the command will paste to the home screen.
Press ENTER again and your calculator will be placed in the “diagnostics on” mode.

[Figure: Activating the TI-83 Diagnostic Mode]

Once the diagnostic mode is turned on, we can use the TI-83 to find the value of the sample correlation coefficient. The value of r is actually provided as part of the output for linear regression. The command sequence is shown here.

[Figure: Calculating The Sample Correlation Coefficient On The TI-83 (Cigarette Data); Pearson’s Sample Correlation]

The TI-83 provides several options for calculating the regression equation information (which will be discussed in detail in a future chapter) that contains Pearson’s correlation. Two are shown here. The top series of screen shots is Option 4, the bottom series of screen shots is Option 8. The information is the same; however, the format in which it is presented is slightly different. You may develop a preference.

The value of the correlation coefficient is 0.8175. This is suggesting a rather strong linear association.

4.4 Spearman’s Rank Correlation (Optional)

(This topic is covered in more detail in a later chapter.)

Data types applicable: ordinal/interval/ratio.

Unlike Pearson’s correlation coefficient, Spearman’s Rank Correlation can also be applied to ordinal data. If a scatter plot has curvature, then Pearson’s correlation is not the best route to take. Rather, Spearman’s rank correlation may be a more appropriate technique to use. In terms of notation, we will denote Spearman’s sample correlation by $r_s$, with the population parameter being denoted by $\rho_s$. Spearman’s rank correlation determines the degree to which a monotonic relationship exists between the two variables.
As a matter of practice, the choice of which form of the correlation coefficient to use is commonly decided strictly by whether or not the scatter plot has curvature. If there is curvature in your data, then use Spearman’s correlation based on ranks. Keep in mind the curvature must be monotonic. Consider the following graph. Clearly there is curvature; however, the curvature is not monotonic, hence Spearman’s correlation is not appropriate.

4.4.1 Calculating Spearman’s Correlation

Consider the following bivariate data. The scatter plot clearly shows a monotonic increasing pattern.

| x | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| y | 1 | 0.8 | 2.2 | 2 | 4.5 | 11 |

Spearman’s correlation is nothing more than Pearson’s correlation calculated using the ranks of the data rather than the original data. The following table shows the original data along with the corresponding ranks. Notice how the scatter plot of the ranks now has a linear trend. By using the ranks, rather than the original data, the curvature has been removed. If you were initially presented with this scatter plot and asked which measure of association is appropriate, you would most certainly answer Pearson’s correlation. We can now find the value of Spearman’s correlation coefficient by finding Pearson’s correlation coefficient using the ranks.

| x | Rank x | y | Rank y |
|---|---|---|---|
| 1 | 1 | 1 | 2 |
| 2 | 2 | 0.8 | 1 |
| 3 | 3 | 2.2 | 4 |
| 4 | 4 | 2 | 3 |
| 5 | 5 | 4.5 | 5 |
| 6 | 6 | 11 | 6 |

[Figure: Pearson’s Correlation calculated with the original data; Pearson’s Correlation calculated with the ranks of the original data (Spearman’s Correlation).]

The value of Pearson’s correlation on the original data is 0.8423 whereas the value of Spearman’s correlation is 0.8857. Pearson’s correlation underrepresented the strength of the monotonic association between the x and y variables. This is because Pearson’s correlation is looking for a linear relationship.
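The idea that Spearman’s correlation is simply Pearson’s correlation applied to the ranks can be sketched in a few lines of Python. The chapter does not print a formula for r, so the standard product-moment formula is assumed here; the function names pearson_r, ranks and spearman_r are hypothetical, and the simple ranking helper assumes there are no tied values (true for this example).

```python
import math


def pearson_r(x, y):
    """Standard product-moment correlation (formula assumed, not quoted from the text)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank 1 goes to the smallest value. ASSUMPTION: no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_r(x, y):
    """Spearman's correlation: Pearson's correlation computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6]
y = [1, 0.8, 2.2, 2, 4.5, 11]

print(ranks(y))                    # [2, 1, 4, 3, 5, 6]
print(round(pearson_r(x, y), 4))   # 0.8423
print(round(spearman_r(x, y), 4))  # 0.8857
```

Data with tied values would need a ranking rule for ties (commonly the average of the tied positions), which this sketch deliberately leaves out.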
4.4.2 Using MINITAB and the TI-83

The TI-83 does not have a built-in routine for Spearman’s correlation; however, you can use the program SRCORR, available from your instructor or from the book web site if you have a GraphLink cable. The original data for the previous example was stored in L3 and L4.

[Figure: Using the SRCORR Program]

MINITAB also does not have a Spearman’s correlation routine; however, MINITAB will rank the data for you as shown here. The original data was placed in C1 and C2. The ranks of C1 are placed in C3. Repeat the process, placing the ranks of C2 into C4, then calculate Pearson’s correlation on the data in C3 and C4.

4.5 From a Consultant’s Perspective

The cereal dataset has several variables. In the previous chapter we decided which of these variables we should use to produce graphical displays. Measurements of the center of the data (the mean or median) are probably the most common of all descriptive statistics. From a practical point of view, it is very common to report both the mean and the median; however, inferences (which we will discuss in a future chapter) are typically completed only for the “proper” measurement of center. At this point in the course, we will make our decision regarding which measurement is “proper” based on the overall shape of the data. The box plot is often the display of choice for analytical decision making such as “which is most appropriate, the mean or the median?”

At this juncture in the analysis, a data analyst would:
• Decide which variables are appropriate for a box plot.
• Produce appropriate box plots and use them to decide which measurement of the center is more appropriate.
• Calculate the measurements of the center.
• Calculate appropriate correlation coefficients.

Notation

This chapter has introduced some very important notation which will be used throughout the remainder of the course.
We have already gained an understanding of a parameter versus a statistic. The notation we use will allow us to quickly and easily determine if we are talking about a population parameter or a sample statistic. Likewise, the notation will allow us to more readily see the relationship between parameters and statistics in later chapters. For this reason, it is imperative we are able to immediately identify parameters versus statistics, their English meaning, and what they measure. The following table lists the parameters and statistics we have thus far explored. We cannot overemphasize the importance of being able to recognize each symbol immediately by sight, knowing what it is, what it measures, and if it represents a parameter or a statistic.

English                   Parameter   Statistic   What does it measure?
Mean                      µ           x̄           Center of the data.
Median                    θ           M           Center of the data.
First Quartile            θ1          Q1          Location where no more than 25% of all values are smaller and no more than 75% are larger.
Third Quartile            θ3          Q3          Location where no more than 75% of all values are smaller and no more than 25% are larger.
kth Percentile            *           Pk          Location where no more than k% of all values are smaller and no more than (100-k)% are larger.
Pearson’s Correlation     ρ           r           Measurement of linear association.
Spearman’s Correlation    ρS          rS          Measurement of monotonic association.

* Note: We typically use percentiles to describe the sample data, so a symbol for the parameter value is being omitted.

4.6 Review Exercises

4.1 Compare and contrast the sample mean and median.
a) What do they measure?
b) When would you use the mean rather than the median?
c) What advantages does the median have over the mean, if any?

4.2 Consider the following data.

8 3 13 11 6 16 9 4 14 7 2 12 10 5 15 18 23 21 1 19 24 17 22 20 25

a) Calculate the mean and median. Use the appropriate symbol for each.
b) Use the definition of the mean and the definition of the median to explain why these two values are the same for this data set.
(The use of a graph may help.)

4.3 A professor announced to her class that the average score for the 37 exams received was 86. She later discovered three additional exams with scores of 76, 81 and 97. These exams were not included in the class average she reported to the class. Is it possible to calculate the new class average? If so, what is it? If not, justify why.

4.4 You are a member of a research team that is exploring the salary earned by recent college graduates. You start by sampling 45 recent (within the last two years) college graduates. One of the members of the team wants to report the mean income, and one wants to report the median income. You will cast the tie-breaking vote. Which measure of center would you choose: the mean, the median, or both? Defend your position.

4.5 You have been hired to represent a rookie baseball player in contract negotiations. You have data for salaries of other rookies that have been recently awarded contracts. The distribution of salaries is shown here. After studying the data you must go into the negotiations armed with your analysis and ask for either the mean or the median salary. Which one do you choose for your client and why? Defend your position.

4.6 A sample of 27 college students were asked how many times a year they go to the movies. Their answers are in the following table:

5 3 12 16 25 30 36 12 55 50 0 0 6 1 8 5 5 6 0 3 1 6 5 10 2 25 24

a) Identify the variable of interest along with the measurement scale (nominal, ordinal, interval or ratio).
b) Produce an appropriate graphical representation of this data and describe the shape.
c) Choose and report an appropriate measure of center. Be sure to label the measure of center using the appropriate symbol.
d) What can you say about the number of times college students go to the movies each year based on this data?

4.7 Consider the following sample data.
15 6 17 84 13 14 23 27 66 21 57 89 19 53 85 1 32 62 66 38

a) Report the values for the five-number summary. Label each value using the appropriate symbol.
b) Construct a box plot. Be sure to label the appropriate parts of the plot. Describe the shape of the distribution.
c) Find the 15th percentile. Be sure to label the value using the appropriate symbol.
d) Find the 90th percentile. Be sure to label the value using the appropriate symbol.
e) Find the 85th percentile. Be sure to label the value using the appropriate symbol.

4.8 A friend was diagnosed as having Type I diabetes (insulin dependent) in 1994. He tests his blood sugar levels 6 to 10 times each day and calculates his average blood sugar level. Even though the average can be calculated, it is a poor indicator of blood sugar control. Hemoglobin A1C is used as a more accurate indicator of blood sugar control for persons with diabetes, providing information on overall control for the past 2 months. Below are 20 different readings from his hemoglobin A1C tests.

7.1 8.2 9.1 6.4 7.3 7.1 8.0 7.6 7.2 6.8 8.3 9.6 6.6 7.7 9.1 7.5 5.7 4.0 6.1 7.9

a) Identify the variable of interest along with the measurement scale (nominal, ordinal, interval or ratio).
b) Find the 25th and 75th percentiles. Be sure to label each value using the appropriate symbol.
c) Find the 95th percentile. Be sure to label the value using the appropriate symbol.
d) Produce a box plot and comment on the shape of the distribution.

4.9 A test to measure nurturing tendencies was given to a group of teenage boys and girls who commonly volunteered at a local retirement home. The test is scored from 10 to 50, with a high score indicating a higher tendency to nurture.

38 25 42 39 41 26 37 45 39 28 28 29 30 24 16 37 47 36 35 46 42 37 40 16 31 30 39 43

a) Identify the variable of interest along with the measurement scale (nominal, ordinal, interval or ratio).
b) Identify all the elements for a five number summary and then complete a box-and-whisker plot, properly labeling each element.
c) Identify the 80th and 90th percentiles. Be sure to label each value using the appropriate symbol.
d) Compute both the mean and median. If you had to report a measure of central tendency, would you choose the mean or the median? Justify why.

4.10 Using your calculator or MINITAB, generate 200 random numbers from a normal distribution that has a mean of 0 and a standard deviation of 1. Find the 5th percentile. The needed commands to generate the random numbers are listed below. You will have to locate the position of the 5th percentile by hand, as neither MINITAB nor the TI-83 has a command for that. Once you locate the position, you can sort the data in ascending order and easily identify the actual value.

TI-83 Commands: The following command will put 200 random numbers in L1. It will take about 30 seconds, so be patient.

RandNorm(0,1,200)→L1

Once the numbers are placed in L1, you need to sort them. Press STAT, then select SortA( and enter L1. This will also take up to 30 seconds or so.

MINITAB Commands:

Calc > Random Data > Normal
Generate 200 rows of data.
Store in column(s): C1
Mean: 0.0
Standard deviation: 1.0
Press OK.

To sort the data select Manip > Sort.
Sort column(s): C1
Store sorted column(s): C1
Sort by column: C1 (do not check Descending).
Press OK.

4.11 A biologist researching the population of ground squirrels in Kern County, California, has divided a known breeding ground into 1478 grids. He then randomly selected 30 grids and counted the number of ground squirrels living in each of those grids. The number of squirrels in each grid has been recorded below. Can the biologist use this information to estimate the total number of ground squirrels in this breeding ground? If so, what is the estimate? If he cannot use this information to estimate the total number of squirrels, then explain why.
63 75 82 53 69 91 83 55 28 88 75 91 72 74 81 94 67 59 66 57 53 81 87 70 66 52 49 81 72 64

4.12 Explain how quartiles and percentiles are similar.

4.13 Explain when the mean and the median would be similar. What can this tell you about the shape of a distribution?

4.14 The following data shows the length of time, in years, between successive births for two populations of killer whales (Orcinus orca). David Meyers, Model Comparison for Two Populations of Killer Whales (Orcinus Orca), Masters Thesis, Humboldt State University, 1990.

Southern Population Northern Population
10.2 5.2 9.1 2.3 8.4 8.4 11.6 12.0 0.1 2.6 9.6 7.5 6.2 10.3 9.6 5.3 1.7 5.8 6.7 6.7 7.4 4.1 8.3 3.6 7.3 8.0 1.6 5.9 9.1 3.4 5.9 1.8 5.2 5.8 7.8 8.0 5.0 6.1 16.1 12.7 4.8 1.7 2.4 2.2 4.2 7.0 4.6 7.0 8.4 6.1 3.4 5.1 5.6 5.5 5.3 5.5 8.6 5.7 3.2 4.7 6.2 6.6 0.6 0.8 8.3 7.0 6.9 5.3 0.3

a) Construct a histogram and box plot for each data set and describe the shape of each distribution.
b) Choose an appropriate measure of center that could be used to compare the populations. Justify your choice.
c) What statement can be made about the two populations?

4.15 A fast food chain was interested in testing three different advertising campaigns. Eighteen sites were selected and randomly placed into one of three groups by area. Each area was then assigned one of the new advertising campaigns. The sales figures (in thousands of dollars) obtained after a 30 day period are displayed below.

Campaign #1    Campaign #2    Campaign #3
42             41             49
40             45             44
46             45             51
41             40             52
44             42             46
42             46             47

a) Identify the variable of interest along with the measurement scale (nominal, ordinal, interval or ratio).
b) Calculate a 5 number summary for each of the data sets. Be sure to label each value using the appropriate symbol.
c) Display the data in side by side box plots.
d) Use the display and numerical summaries to compare the data gathered. What can you say about the three advertising campaigns?
4.16 Customer satisfaction scores were gathered from 4 regions of a national bank. These scores have numerous uses, such as a decision making tool for bonuses and an evaluation tool for improved service.

Region 1    Region 2    Region 3    Region 4
94.5        86.4        93.6        92.0
88.8        90.9        87.0        88.4
94.8        81.0        89.2        88.0
83.2        88.0        94.0        89.0
92.6        76.6        90.0        89.4
79.0        84.6        90.3        92.2
94.3        75.8        87.6        87.7
92.0        85.0        93.1        96.0
84.0        75.5        82.0        91.6
88.9        82.5        96.3        94.5

a) Identify the variable of interest along with the measurement scale (nominal, ordinal, interval or ratio).
b) Calculate the 5 number summary for each region. Be sure to label each value using the appropriate symbol.
c) Use the 5 number summary to construct, by hand, side by side box plots for each region.
d) Calculate the 90th percentile for each region. Label the value using the appropriate symbol.
e) Use these figures to make a statement about which region or regions have the highest level of customer satisfaction.
f) For Regions 1 and 2, construct a box plot and histogram side by side. Note how the shape of the distribution can be determined by either the box plot or the histogram.

4.17 In an effort to raise the standardized test scores for children in California, a campaign for education has been waged by the Governor and his office. At a news conference, Governor Gray Davis was reported saying that his education goal was to have each child in the state score above the 50th percentile on the state exams. Is this an achievable goal? Justify why or why not.

4.18 You have been instructed to both label and scale graphs correctly. Explain why scaling is so important in a box plot. Use pictures if it will aid in your explanation.

4.19 A student calculated the 5 number summary for a homework data set, and the results were as follows: minimum = 2, first quartile = 10, median = 10, third quartile = 27 and maximum = 27. Is this possible? Use graphs and explanations to support your answer.
4.20 Various math programs have been used in California at the ninth grade level. The following data are the standardized test scores for two of these programs from 27 counties.

Program 1
58.2 55.8 68.5 45.8 54.7 66 55 58.3 63.2 47.2 54.1 63.7 41.5 69.9 65.8 45.3 48.4 46.2 71 60.5 56.6 60 59.5 57.6 65.3 47.7 55.3

Program 2
58 55 60 46 64 48 67 61 61 56 42 65 45 55 50 56 41 50 60 59 64 44 45 48 52 54 44

a) Identify the variable of interest along with the measurement scale (nominal, ordinal, interval or ratio).
b) Construct side by side box plots for this data.
c) Calculate the 5 number summary for each data set. Be sure to label each value using the appropriate symbol.
d) Use these figures to compare and contrast the two math programs for the 27 counties.
e) Construct a box plot and histogram side by side for each program. Note how the shape of the distribution can be determined by either the box plot or the histogram.

4.21 You have learned to create histograms and box plots. Both are useful graphical techniques.
a) Explain the similarities and differences between the two graphs.
b) When would you choose a box plot over a histogram? Explain.

4.22 Hospitals use daily census figures (the number of occupied beds) to plan staffing. Below are the daily census figures for August.

155 167 160 146 125 132 148 139 152 155 163 143 125 137 138 156 151 154 159 137 130 145 143 165 157 161 143 127 131 147 151

a) Identify the variable of interest along with the measurement scale (nominal, ordinal, interval or ratio).
b) Calculate the 5th percentile, the 50th percentile and the 95th percentile for this data set. Be sure to label each value using the appropriate symbol.
c) Interpret the meaning of these values statistically and then use this interpretation to make a statement about the census of the hospital.
d) Construct a box plot and histogram side by side.
Note how the shape of the distribution can be determined by either the box plot or the histogram.

4.23 You are given the following figures for a data set: minimum = 3, first quartile = 17, median = 22, third quartile = 38 and maximum = 40. Describe the shape of the data set and any other features you can gather from this information. Use a picture if it will aid in your explanation.

4.24 Refer to Dataset 2, Smoking and Cancer, in the Appendix. This data set consists of 6 variables that are values from 43 states and the District of Columbia recorded in 1960. The variables include the number of cigarettes smoked per capita, and the death rates per thousand for various types of cancer.
a) Calculate the mean and median for each of the quantitative variables.
b) Graph each of the variables and decide which measure of center is appropriate for each variable.
c) Justify your answers in part b.
d) In your own words, explain the median value in the variable “number of cigarettes smoked per capita.”
e) Explain the mean value for the same variable.

4.25 Magazines were evaluated for the level of difficulty in reading and the type of advertisements in each kind of magazine. Two of the variables are SEN, the number of sentences in the advertisements, and 3SYL, which is the number of three syllable words in the ads. You will find the data in Data Set 5 in the Appendix.
a) Calculate the mean and median for each of these variables.
b) Create a histogram of the variables and decide which measure of center is appropriate for each variable.
c) Justify your answers in part b.
d) In your own words, explain the median value for the variable “SEN.”
e) Explain the mean value for the SEN variable.
f) Calculate the five number summary and make a box plot for the SEN data. Compare the box plot to the histogram. Is one graph more desirable for this data than the other? Explain.
4.26 In an effort to measure the growing presence of women in the labor force, data was collected from 19 U.S. cities in 1968 and 1972. The proportion of women is reported in Data Set 6 as variables named 1968 and 1972.
a) Calculate 5 number summaries for each of these variables.
b) Calculate the 44th percentile for the 1968 data. Explain the meaning of this value in the data set.
c) Calculate the 79th percentile for the 1972 data. Explain the meaning of this value in the data set.

4.27 The properties of 60 standard metropolitan statistical areas were gathered in a census of the US. This information includes the average January temperature as well as the average July temperature. You will find this in Data Set 7, SMSA.
a) Calculate the mean and median for temperatures.
b) Graph the variables and decide which measure of center is appropriate for each data set.
c) Can you make any statements about the average temperatures in January and July? Does the same statement hold for the median temperatures? Explain.

4.28 From the same data set used above, Data Set 7, consider the variable Rain:
a) Calculate 5 number summaries for Rain.
b) Create side by side box plots and compare and contrast the distributions.
c) Calculate the 27th percentile for the January data. Explain the meaning of this value in the data set.
d) Calculate the 82nd percentile for the July data. Explain the meaning of this value in the data set.

4.29 The mortality rate was measured in 60 metropolitan cities across the United States. The mortality rates can be found in Data Set 7, SMSA.
a) Calculate the 5 number summary for this data and create a box plot.
b) What is the shape of the distribution?
c) What measure of center would you use to describe this distribution?
d) Calculate the 20th percentile.
e) Calculate the 80th percentile.
f) What is the relationship between these two values?

4.30 Droughts have often caused failure of crops worldwide.
In an effort to control crop losses due to drought, scientists have seeded clouds with silver nitrate to increase precipitation. Refer to Data Set 8, Cloud Seeding. The information given is the amount of rain, in acre feet (how much water it takes to cover an acre of ground one foot deep), from both seeded clouds and unseeded clouds.
a) Calculate the mean and median for each of these variables.
b) Graph the variables and decide which measure of center is appropriate for each variable.
c) Justify your answer in part b.
d) Would it be more appropriate to compare the means or medians for these variables? Explain your answer.

4.31 Pearson’s Product Moment Correlation Coefficient will measure the association between any two variables. True or False? Explain your answer.

4.32 Discuss the difference between Pearson’s and Spearman’s correlation coefficients. Be certain to include a discussion of what they both measure and when you would choose one technique over the other, along with any differences in the mechanics of the actual calculation.

4.33 Consider the following data. Calculate Pearson’s correlation coefficient and justify how appropriate you think Pearson’s correlation is for this data. If you feel Pearson’s correlation is not appropriate, then find a measurement that is appropriate and report its value.

x    3    1    5    9    7    11
y    8    0    24   85   35   175

4.34 It is thought that a student’s performance on one exam will give an indication of how he or she will do on subsequent exams. Data for the first two exams taken in a statistics class are shown below.

Exam 1    Exam 2        Exam 1    Exam 2
49        82            55        100
60        92            33        55
34        72            50        66
55        88            28        *
23        51            40        75
61        89            61        95
51        95            44        72
23        43            54        85
47        65            35        74
32        63            52        97
44        60            33        *
43        82            53        78
21        62            26        69
57        91            50        94
12        *             52        97

a) Construct a scatter plot of the data.
b) Calculate a correlation coefficient for this data and defend your choice (Pearson’s or Spearman’s).
4.35 Recent studies have suggested smoking during pregnancy may affect the birth weight of newborn infants. Sixteen women smokers, all of whom were pregnant, were asked to estimate the average number of cigarettes they smoke each day. This data was recorded along with the birth weight of their child.
a) Construct a scatter plot of the data and justify why you think Pearson’s Correlation is/is not an appropriate descriptive statistic for this data.
b) Calculate Pearson’s correlation coefficient and interpret its meaning as it pertains to this scenario.

Cigarettes    22    16    4     19    42    8     12    30
Weight        6.4   7.2   8.1   6.9   6.1   8.4   7.6   6.5

Cigarettes    14    16    5     20    32    2     15    48
Weight        8.4   8.1   8.5   6.6   6     7.9   7.1   5.5

4.36 Consider the following data. Justify which measurement of association is most appropriate (Pearson’s or Spearman’s).

x    1    3    5    7    9    10
y    0    8    35   24   85   175

4.37 A study was done investigating the infestation of the T. orientalis lobster by two types of barnacles, O. tridens and O. lowei (Jeffries, Vorhis, and Yang, 1984). The data is provided below.
a) Plot the data.
b) Calculate the appropriate correlation coefficient.

O. tridens    O. lowei
645           6
320           23
401           40
364           9
327           24
73            5
20            86
221           0
3             109
5             350

4.38 During the winter, it is expected that energy usage will increase due to the decreasing temperatures. Data for eleven random dates in the winter with the average BTU per hour reading and the low temperature (in degrees Fahrenheit) on that date are recorded below. (Data courtesy of MINITAB, Inc.)

BTU     Temp
2.1     45
2.55    39
2.77    42
2.4     37
3.31    33
2.97    34
3.32    30
3.29    30
3.62    20
3.67    15
4.07    19

Is there a relationship between the low temperature recorded and the amount of energy used for heating? If so, quantify this relationship.