Summary Descriptive Measures

[Figure: Percentage of Projects Completed Early (axis: Percent)]

Location is an indicator of where the data is located.

[Figure: Projects Completed Early, Plant A and Plant B (axis: Percent)]

Scale is a measure of how “spread out” the data is.

Criteria for Measures of Location and Scale

Must be well defined for:
- Raw Data
- Grouped Data
- Theoretical Curves

For Business Purposes: Must be arithmetic

Measures of Location: Mode

The mode is simply the most frequent value in a data set.

Problems:

Raw Data: Many data sets have no repeated values, so the mode does not exist.

Grouped Data: The mode is taken as the midpoint of the bin with the greatest frequency. But consider the data discussed in the last lecture.

[Figure: two histograms of Labor Costs (Frequency against Labor Cost), one with bins labeled 20 to 80 and one with bins labeled 25 to 75]

Theoretical Data: The mode may not exist; consider the theoretical distribution of random numbers, which should look like:

[Figure: Uniform Density Function, f(x) against x = random number]

Measures of Location: Median

The median is the data value that has approximately the same percentage of observations below it as above it (for large data sets this proportion approaches 50%). The word “median” comes from the Latin word “medius”, meaning “middle”.

Raw Data: Finding the median from raw data is a two-step process. First put the data in order, then find the middle value.

Example:
Data = 3, -1, 6, 10, 11
Ordered Data = -1, 3, 6, 10, 11
Median = 6

If the sample size n is odd, the median is the value occupying position (n+1)/2 in the ordered data.

Example:
Data = 3, -1, 6, 10, 11, 7
Ordered Data = -1, 3, 6, 7, 10, 11
Median = any value between 6 and 7. Usually the two points are averaged to get 6.5.

If the sample size n is even, the median is the arithmetic average of the values occupying positions n/2 and (n/2)+1 in the ordered data.
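The two position rules above can be sketched in a few lines of Python. This is only an illustration, not part of the EXCEL workflow used in these notes; the function name `median_raw` is a hypothetical helper introduced here.

```python
def median_raw(values):
    """Median of raw data: sort, then pick the middle position(s).

    Hypothetical helper for illustration only.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n//2.
        return ordered[mid]
    # Even n: average the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median_raw([3, -1, 6, 10, 11])       # five values
even_median = median_raw([3, -1, 6, 10, 11, 7])   # six values
print(odd_median, even_median)
```

For the two examples in the notes this gives 6 and 6.5 respectively.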
Notice:
- The median is not computed, it is found. For example, replace the value of 11 in the above example by 12,000. The median remains 6.5.
- The median cannot be manipulated algebraically.

Finding the Median of Raw Data Using EXCEL

Open the file “thickdat.xls” in the MBA Mod 1 folder. Find an empty cell and type in

=median(

Then highlight the range of the data. You should see something that looks like the following:

[Screenshot: EXCEL worksheet with the data range highlighted]

Finally, type in the right parenthesis. The result is 355, which is the average of the 30th and 31st values, both of which happen to be 355.

Finding the Median from Grouped Data

Suppose you did not have the raw data for steel thickness, but only had the data grouped as shown below:

Interval          m(i) Midpoint   f(i) Freq   F (cumulative)
341.5 to 344.5    343              1           1
344.5 to 347.5    346              3           4
347.5 to 350.5    349              8          12
350.5 to 353.5    352              8          20
353.5 to 356.5    355             20          40
356.5 to 359.5    358             13          53
359.5 to 362.5    361              5          58
362.5 to 365.5    364              2          60

Using the column labeled “F”, it is clear that the 30th and 31st observations lie in the interval [353.5 to 356.5]. Altogether there are 20 observations in that interval. Since there are 20 observations below 353.5, we need 10 more to get to the 30th value.

ASSUMPTION: The data points in the interval are equally spaced throughout the interval.

To get the 30th value, we need to go 10/20ths (or .5) of the way into the interval. Since the bin is 3 units wide, we need to go a distance of (10/20)*3 = 1.5 into the interval. Therefore we estimate the 30th value as 353.5 + 1.5 = 355.

To get the 31st value, we need to go 11/20ths (or .55) of the way into the interval. Since the bin is 3 units wide, we need to go a distance of (11/20)*3 = 1.65 into the interval. Therefore we estimate the 31st value as 353.5 + 1.65 = 355.15.

The median is estimated as median = (355 + 355.15)/2 = 355.075.
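The interpolation steps above can be checked with a short Python sketch. It relies on the same equal-spacing assumption stated in the notes; the function name `grouped_position_value` and the variable names are hypothetical, chosen only for this illustration.

```python
def grouped_position_value(lower_bounds, width, freqs, k):
    """Estimate the k-th ordered observation (1-based) from grouped data.

    Assumes, as in the notes, that observations are equally spaced
    within each bin. Hypothetical helper for illustration only.
    """
    if k < 1 or k > sum(freqs):
        raise ValueError("k must be between 1 and the number of observations")
    cumulative = 0  # observations in all bins below the current one
    for lower, f in zip(lower_bounds, freqs):
        if cumulative + f >= k:
            # Go (k - cumulative)/f of the way into this bin.
            return lower + (k - cumulative) / f * width
        cumulative += f


# Steel thickness data, grouped as in the table above.
lower_bounds = [341.5, 344.5, 347.5, 350.5, 353.5, 356.5, 359.5, 362.5]
freqs = [1, 3, 8, 8, 20, 13, 5, 2]

v30 = grouped_position_value(lower_bounds, 3.0, freqs, 30)
v31 = grouped_position_value(lower_bounds, 3.0, freqs, 31)
grouped_median = (v30 + v31) / 2  # n = 60 is even, so average the two
print(v30, v31, grouped_median)
```

Up to floating-point rounding, this reproduces the estimates 355, 355.15 and 355.075 worked out by hand.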
Finding the Median From Theoretical Probability Distributions

If f(x) is the probability density function of x, the median is the value med satisfying the integral equation:

\[ \int_{-\infty}^{med} f(x)\,dx = .5 \]

Problems with the Median

Suppose you had two groups of people. In Group 1 you had 50 people with a median hourly wage of $15.00 per hour. In Group 2 you had 100 people with a median hourly wage of $17.00 per hour. Given this information, can you determine the median hourly wage of all 150 people? You cannot: the group medians alone do not determine the median of the pooled group.

Consider the following data:

          Time 1   Time 2   Change
           5        4       -1
          10       12        2
          15       18        3
          20       19       -1
          25       23       -2
Median    15       18       -1

Change in median is 18 - 15 = 3.
Median change is -1.

Measures of Location: Averages

Averages can be tricky. Consider:

          Rate of Return
Year 1    0.07
Year 2    0.10
Year 3    0.12
Year 4    0.30
Year 5    0.15

What is the average rate of return over the five-year period?

Arithmetic average = .148
Correct average = .145321

(The correct average is the value which, when compounded for 5 years, gives the same result as the observed compounding rates. In other words, it is the solution x to the equation:

\[ (1 + x)^5 = (1.07)(1.10)(1.12)(1.30)(1.15) = 1.9707688 \] )

Consider: Dallas and Fort Worth are approximately 30 miles apart. On a round trip from Dallas to Fort Worth and back, you average 30 mph on the first leg from Dallas to Fort Worth. How fast do you have to travel on the return leg from Fort Worth to Dallas so that you average 60 mph for the round trip?

Usual answer: 90 mph
Correct answer: it is impossible

Averaging 60 mph over the 60-mile round trip allows one hour in total, and the first leg at 30 mph has already taken the full hour.

Both the arithmetic average of the returns and the 90 mph answer are common errors.

Measures of Location: The Arithmetic Average

The arithmetic average of a set of values is the sum of the values divided by the number of values. If x1, x2, ..., xn represent the n numerical values from a random sample, then the formula for the sample mean is:

\[ \bar{x} = \sum_{i=1}^{n} x_i / n \]

To find the average using EXCEL (when I use this term subsequently, I will mean the arithmetic average), one uses the function “average”. It is used just like the “median” function.
Specifically, one types “=average(range of data)”. For the data on steel thickness, you would have something that looks like the following:

[Screenshot: EXCEL worksheet with the “average” function and the data range]

By closing the parentheses, you get the average for the data as 354.55.

Computation of the Arithmetic Mean From Grouped Data

If we do not have the raw data but only the frequency distribution of the data, the formula for the sample mean becomes:

\[ \bar{x} = \sum_i f_i m_i / n \]

EXCEL does not compute this formula directly. To compute this in EXCEL for the steel thickness data, one can use the following procedure:

Interval          m(i) Midpoint   f(i) Freq   f(i)*m(i)
341.5 to 344.5    343              1             343
344.5 to 347.5    346              3           1,038
347.5 to 350.5    349              8           2,792
350.5 to 353.5    352              8           2,816
353.5 to 356.5    355             20           7,100
356.5 to 359.5    358             13           4,654
359.5 to 362.5    361              5           1,805
362.5 to 365.5    364              2             728
Sum                               60          21,276
Average                                        354.6

If one defines the proportion of observations in a bin as

\[ p_i = f_i / n \]

then the formula for the mean from grouped data (and also the formula for a discrete probability distribution) is:

\[ \bar{x} = \sum_i p_i m_i \]

Using the above, it is then possible to generalize the definition of the mean for data from a continuous distribution with probability density function f(x) as:

\[ \mu = \int_{-\infty}^{\infty} x f(x)\,dx \]

Computation with the Average

Consider the problem of having two groups of people: 50 people in Group 1 with an average hourly wage of $15.00 and 100 people in Group 2 with an average hourly wage of $17.00. Can I find the mean of the pooled group of 150 people?

The average of the pooled group is just the total hourly wages of all 150 people divided by the 150 people. Using the formula for the arithmetic average, one can show that:

\[ n\bar{x} = \sum_i x_i \]

Therefore the sum of the hourly wages in the first group is 50 x 15 = 750. The sum of the hourly wages in the second group is 100 x 17 = 1700.
Finally, the mean of the pooled group is:

pooled average = (750 + 1700)/(50 + 100) = $16.33

This can be written in formula terms as:

\[ \bar{x}_{pooled} = (n_1 \bar{x}_1 + n_2 \bar{x}_2)/(n_1 + n_2) \]

This is a special case of the formula for multiple groups:

\[ \bar{x}_{pooled} = \sum_i n_i \bar{x}_i / \sum_i n_i \]

Consider the following example, which we discussed previously in connection with the median:

           Group 1   Group 2   Change
            5         4        -1
           10        12         2
           15        18         3
           20        19        -1
           25        23        -2
Average    15        15.2       0.2

Notice that the change in the means is the same as the mean of the changes.

Summary

Criterion                                     Median                 Mean
Ease of Understanding                         High                   Reasonable
Computation                                   Moderate               Easy
Effect of Outliers                            None                   High
Use in Further Computation                    None                   Easy
Accuracy for Inference to Population          25% worse than mean    Baseline
  for a fixed sample of size n

Simpson’s Paradox

Consider the following data found in the file “meandemo.xls”:

              Male   Males Average   Female   Females Average
Prof          35     60,000           5       65,000
Assoc Prof    25     50,000          20       55,000
Asst Prof     15     40,000          15       45,000
Average              52,667                   52,500

Or the following data, also found in the file “meandemo.xls”:

              Time 1       Time 1 Median   Time 2       Time 2 Median   Median Change
Group 1       30 35 48     35              31 32 75     32              -3
Group 2       14 85 98     85              60 83 85     83              -2
Group 3       60 63 65     63              61 62 98     62              -1
All Groups                 60                           62               2

Measures of Scale

The simplest way to measure scale is to find the average distance of each data point from the measure of location (in our case the arithmetic mean). The signed deviations, however, always sum to zero. Symbolically this can be written:

\[ \sum_i (x_i - \bar{x}) = 0 \]

The fact that some deviations are positive and some negative can be corrected in one of two ways:

1) Use the absolute value to compute the mean absolute deviation (MAD), which in formula terms is:

\[ MAD = \sum_i |x_i - \bar{x}| / n \]

or

2) Use the square of the deviations, which in formula terms gives:

\[ S^2 = \sum_i (x_i - \bar{x})^2 / (n - 1) \]

and

\[ s = \sqrt{S^2} \]

In EXCEL, the function “stdev” uses the above formula for computing the sample standard deviation. For the steel thickness data, you would type “=stdev(range)” as shown below:

[Screenshot: EXCEL worksheet with the “stdev” function and the data range]

This yields the value of s = 4.492549.
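As a small check of the scale formulas above, here is a Python sketch. The raw steel thickness values are not reproduced in these notes, so it uses the five values 5, 10, 15, 20, 25 from the earlier table as illustrative data; the variable names are hypothetical.

```python
import math
import statistics

# Illustrative data only (the five Group 1 values from the table above),
# not the steel thickness data.
data = [5, 10, 15, 20, 25]
n = len(data)
mean = sum(data) / n

# Signed deviations from the mean always sum to zero.
deviations = [x - mean for x in data]
deviation_sum = sum(deviations)

# 1) Mean absolute deviation: average of |x_i - mean| over n.
mad = sum(abs(d) for d in deviations) / n

# 2) Sample variance with the n - 1 divisor, then its square root.
sample_variance = sum(d * d for d in deviations) / (n - 1)
s = math.sqrt(sample_variance)

# statistics.stdev also uses the n - 1 divisor, so it should agree with s.
s_library = statistics.stdev(data)
print(deviation_sum, mad, sample_variance, s, s_library)
```

For these five values the deviations are -10, -5, 0, 5, 10, so MAD = 30/5 = 6 and S² = 250/4 = 62.5.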
EXCEL does not automatically compute the standard deviation if the data is grouped. The computing formula to use in this case is given by:

\[ S^2 = \left( \sum_i f_i m_i^2 - n\bar{x}^2 \right) / (n - 1) \]

and then taking the square root. The necessary terms can be computed in EXCEL as shown in the following table for the steel data:

Interval          m(i) Midpoint   f(i) Freq   f(i)*m(i)   f(i)*m(i)*m(i)
341.5 to 344.5    343              1             343         117,649
344.5 to 347.5    346              3           1,038         359,148
347.5 to 350.5    349              8           2,792         974,408
350.5 to 353.5    352              8           2,816         991,232
353.5 to 356.5    355             20           7,100       2,520,500
356.5 to 359.5    358             13           4,654       1,666,132
359.5 to 362.5    361              5           1,805         651,605
362.5 to 365.5    364              2             728         264,992
Sum                               60          21,276       7,545,666

which yields an estimate of s = 4.5031.

If only the proportion of observations in each bin is available, then the following approximate formula may be used:

\[ S^2 = \sum_i p_i m_i^2 - \bar{x}^2 \]

which in this case yields the value of s = 4.465423.

The standard deviation for data following a theoretical distribution with density function f(x) can also be defined as:

\[ \sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2 \]

and

\[ \sigma = \sqrt{\sigma^2} \]

Further Uses of the Mean and Standard Deviation

The Mound Rule: For data which is “mound” shaped, approximately

Percent of Data   Region
68%               mean +/- one standard deviation
95%               mean +/- two standard deviations
99.7%             mean +/- three standard deviations

For the steel thickness data (which is mound shaped) the exact results are:

Region           Values            %
mean +/- 1 sd    350.1 to 359.0     73.0%
mean +/- 2 sd    345.6 to 363.5     96.7%
mean +/- 3 sd    341.1 to 368.0    100.0%

Chebyshev’s Inequality

For any distribution, at least \( 100(1 - 1/k^2) \)% of the data must lie in the region mean +/- k standard deviations.

Specifically, for k=2, at least 75% of the data must lie in the range mean +/- 2 standard deviations. For k=3, at least 88.9% of the data must lie in the range mean +/- 3 standard deviations.
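The grouped-data standard deviation formulas and the Chebyshev bound above can be reproduced with a short Python sketch. It uses the midpoints and frequencies from the steel thickness table; the variable names and the function `chebyshev_lower_bound` are hypothetical, introduced only for this illustration.

```python
import math

# Midpoints m(i) and frequencies f(i) from the steel thickness table.
midpoints = [343, 346, 349, 352, 355, 358, 361, 364]
freqs = [1, 3, 8, 8, 20, 13, 5, 2]

n = sum(freqs)                                              # 60
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))       # 21,276
sum_fm2 = sum(f * m * m for f, m in zip(freqs, midpoints))  # 7,545,666
grouped_mean = sum_fm / n                                   # 354.6

# Computing formula with the n - 1 divisor.
s_grouped = math.sqrt((sum_fm2 - n * grouped_mean ** 2) / (n - 1))

# Approximate formula when only the proportions p_i = f_i / n are known.
var_prop = sum((f / n) * m * m for f, m in zip(freqs, midpoints)) - grouped_mean ** 2
s_prop = math.sqrt(var_prop)


def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    return 1 - 1 / k ** 2


print(round(s_grouped, 4), round(s_prop, 6))
print(chebyshev_lower_bound(2), round(chebyshev_lower_bound(3), 3))
```

The two standard deviations agree with the values 4.5031 and 4.465423 quoted in the notes, and the bounds for k=2 and k=3 are 75% and about 88.9%.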
Measures of Relative Position

Class        Mean   Standard Deviation
Monday       85     6
Wednesday    90     8

A student from the Monday night class takes the Wednesday exam and scores 92. To what score in the Monday night class does this score correspond?

Define:

\[ t = (x - \bar{x}) / s \]

and

\[ x = \bar{x} + t s \]

For the example, t = (92 - 90)/8 = .25

x_Monday = 85 + .25 x 6 = 86.5
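The conversion between classes can be written as a two-step Python sketch: standardize the score against the class in which it was earned, then rescale it to the other class. The function names `standardize` and `equivalent_score` are hypothetical, used here only for illustration.

```python
def standardize(x, mean, sd):
    """Relative position t = (x - mean) / sd."""
    return (x - mean) / sd


def equivalent_score(x, from_mean, from_sd, to_mean, to_sd):
    """Score in the target class with the same relative position t."""
    t = standardize(x, from_mean, from_sd)
    return to_mean + t * to_sd


# Score of 92 on the Wednesday exam (mean 90, sd 8),
# expressed on the Monday scale (mean 85, sd 6).
t = standardize(92, 90, 8)
monday_score = equivalent_score(92, 90, 8, 85, 6)
print(t, monday_score)
```

This gives t = 0.25 and a Monday-equivalent score of 86.5, matching the hand calculation.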