Lecture 3 Office 2007 format

Summary Descriptive Measures
[Figure: histogram of Projects Completed Early; vertical axis: Percentage (0 to 35); horizontal axis: Percent]
Location is an indicator of where the data is located.
[Figure: histograms of Projects Completed Early for Plant A and Plant B; vertical axis: % (0 to 40); horizontal axis: Percent (10 to 50)]
Scale is a measure of how “spread out” data is.
Criteria for Measures of Location and Scale
Must be well defined for:
Raw Data
Grouped Data
Theoretical Curves
For Business Purposes:
Must be arithmetic
Measures of Location
Mode
Simply the most frequent value in a data set.
Problems:
Raw Data:
Many data sets have no repeated values, so the mode does not exist.
Grouped Data:
The mode is taken as the midpoint of the bin with the greatest frequency. But consider the data discussed in the last lecture, drawn with two different sets of bins. The bin with the greatest frequency can change when the bins are redrawn.
[Figure: Histogram of Labor Costs; vertical axis: Frequency (0 to 30); horizontal axis: Labor Cost (20 to 80)]
[Figure: Histogram of Labor Costs; vertical axis: Frequency (0 to 35); horizontal axis: Labor Costs (25 to 75)]
Theoretical Data: The mode may not exist. Consider the theoretical distribution of random numbers, which should look like:
[Figure: Uniform Density Function; vertical axis: f(x) (0 to 1.2); horizontal axis: x = random number]
Measures of Location
Median
The median is that data value which has approximately the same percentage
of observations below it as above it (for large data sets this proportion will approach
50%).
The word “median” comes from the Latin word “medius”, meaning
“middle”.
Raw Data:
Finding the median from raw data is a two step process. First you
must put the data in order, then you need to find the middle value.
Example:
Data = 3, -1, 6, 10, 11
Ordered Data = -1, 3, 6, 10, 11
Median = 6
If sample size is odd then median will be the value occupying
position (n+1)/2 in the ordered data.
Example:
Data = 3, -1, 6, 10, 11, 7
Ordered Data= -1, 3, 6, 7, 10, 11
Median = any value between 6 and 7. Usually average two
points to get 6.5 .
If sample size is even then median is the arithmetic average of
the values occupying positions (n/2) and (n/2) +1 in the ordered
data.
Notice: The median is not computed; it is found. For example, replace the value 11 in the above example with 12,000. The median remains 6.5.
For the same reason, the median cannot be manipulated algebraically.
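The two-step procedure (order the data, then find the middle) can be sketched in a few lines of Python. This is only an illustration and not part of the EXCEL workflow used in this course; the function name `median_of` is a hypothetical helper introduced here.

```python
def median_of(values):
    """Return the median by sorting and locating the middle position(s)."""
    ordered = sorted(values)            # step 1: put the data in order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: position (n+1)/2 in 1-based counting is index mid in 0-based counting
        return ordered[mid]
    # even n: average the values at positions n/2 and n/2 + 1 (1-based)
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = median_of([3, -1, 6, 10, 11])
even_example = median_of([3, -1, 6, 10, 11, 7])
outlier_example = median_of([3, -1, 6, 10, 12000, 7])   # 11 replaced by 12,000
print(odd_example, even_example, outlier_example)
```

The third call repeats the even-sized example with the extreme value substituted, so the two results can be compared directly.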
Finding the Median of Raw Data Using EXCEL
Open the file “thickdat.xls” in the MBA Mod 1 folder.
Find an empty cell and type in =median(
Then highlight the range of the data. You should see something that looks like the
following:
Finally, type in the right parenthesis.
The result is 355 which is the average of the 30th and 31st values, both of which
happen to be 355.
Finding the Median from Grouped Data
Suppose you did not have the raw data for steel thickness, but only had the data
grouped as shown below:
Interval          Midpoint m(i)   Freq f(i)     F
341.5 - 344.5         343             1         1
344.5 - 347.5         346             3         4
347.5 - 350.5         349             8        12
350.5 - 353.5         352             8        20
353.5 - 356.5         355            20        40
356.5 - 359.5         358            13        53
359.5 - 362.5         361             5        58
362.5 - 365.5         364             2        60
Using the column labeled “F”, it is clear that the 30th and 31st observations lie in the
interval [353.5 to 356.5].
Altogether there are 20 observations in the interval [353.5 to 356.5].
Since there are 20 observations below 353.5, we need 10 more to get to the 30th
value.
ASSUMPTION:
The data points in the interval are equi-spaced throughout the
interval
To get the 30th value, we need to go 10/20ths (or .5) into the interval. Since the bin is
3 units wide, we need to go a distance of (10/20)*3 = 1.5 into the interval. Therefore
we estimate the 30th value as 353.5 + 1.5 = 355
To get the 31st value, we need to go 11/20ths (or .55) into the interval. Since the bin
is 3 units wide, we need to go a distance of (11/20)*3 = 1.65 into the interval.
Therefore we estimate the 31st value as 353.5 + 1.65 = 355.15.
The median is estimated as median = (355 + 355.15)/2 = 355.075.
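The interpolation can be sketched in Python as follows. This is a minimal sketch under the equal-spacing assumption stated above, written for an even number of observations as in this example; the helper name `estimate_order_statistic` is hypothetical.

```python
# Lower bin edges and frequencies copied from the steel thickness table.
lower_edges = [341.5, 344.5, 347.5, 350.5, 353.5, 356.5, 359.5, 362.5]
freqs = [1, 3, 8, 8, 20, 13, 5, 2]
bin_width = 3.0


def estimate_order_statistic(k, lower_edges, freqs, width):
    """Estimate the k-th ordered value (1-based), assuming equal spacing within its bin."""
    below = 0                                   # observations in earlier bins
    for lower, f in zip(lower_edges, freqs):
        if below + f >= k:
            # go (k - below)/f of the way into this bin
            return lower + (k - below) / f * width
        below += f
    raise ValueError("k exceeds the number of observations")


n = sum(freqs)                                  # 60 observations in total
v30 = estimate_order_statistic(n // 2, lower_edges, freqs, bin_width)
v31 = estimate_order_statistic(n // 2 + 1, lower_edges, freqs, bin_width)
grouped_median = (v30 + v31) / 2
print(round(v30, 3), round(v31, 3), round(grouped_median, 3))
```

The running total `below` plays the role of the column labeled “F” in the table.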
Finding the Median From Theoretical Probability Distributions
If f(x) is the probability density function of x, the median is that value med satisfying the integral equation:
    \int_{-\infty}^{med} f(x)\,dx = 0.5
Problems with the Median
Suppose you had two groups of people. In Group 1 you had 50 people with a
median hourly wage of $15.00 per hour. In Group 2 you had 100 people with a
median hourly wage of $17.00 per hour. Given this information can you determine
the median hourly wage of all 150 people?
Consider the following data:
            Time 1    Time 2    Change
               5         4        -1
              10        12         2
              15        18         3
              20        19        -1
              25        23        -2
Median        15        18        -1
The change in the median is 18 - 15 = 3, but the median change is -1.
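As a quick check of this example, the sketch below uses Python's standard `statistics.median` on the five pairs of values; the variable names are my own.

```python
from statistics import median

time1 = [5, 10, 15, 20, 25]
time2 = [4, 12, 18, 19, 23]
changes = [after - before for before, after in zip(time1, time2)]

change_in_medians = median(time2) - median(time1)   # compare the two medians
median_of_changes = median(changes)                 # median of the individual changes
print(change_in_medians, median_of_changes)
```

The two quantities disagree even in sign, which is why the median is awkward to use in further computation.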
Measures of Location
Averages
Averages can be tricky.
Consider:
            Rate of Return
Year 1          0.07
Year 2          0.10
Year 3          0.12
Year 4          0.30
Year 5          0.15
What is the average rate of return over the five year period?
Arithmetic average = .148
Correct average = .145321
(The correct average is that value which, when compounded for 5 years, gives the same result as the observed compounding rates; in other words, the solution to the equation:
    (1 + x)^5 = (1.07)(1.10)(1.12)(1.30)(1.15) = 1.9707688 )
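The compounding equation can be solved numerically with a short Python sketch. It assumes Python 3.8 or later for `math.prod`; the variable names are illustrative.

```python
import math

returns = [0.07, 0.10, 0.12, 0.30, 0.15]

arithmetic_avg = sum(returns) / len(returns)
growth = math.prod(1 + r for r in returns)          # total growth factor over five years
compound_avg = growth ** (1 / len(returns)) - 1     # solves (1 + x)^5 = growth
print(round(arithmetic_avg, 3), round(growth, 7), round(compound_avg, 6))
```

The compound average is the fifth root of the total growth factor, less one, which is why it falls below the arithmetic average here.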
Consider:
Dallas and Fort Worth are approximately 30 miles apart. On a round trip from Dallas to Fort Worth and back, you average 30 mph on the first leg from Dallas to Fort Worth. How fast do you have to travel on the return leg from Fort Worth to Dallas so that you average 60 mph for the round trip?
Usual answer: 90 mph
Correct answer: it is impossible. Averaging 60 mph over the 60-mile round trip allows one hour in total, and the first leg (30 miles at 30 mph) has already used the whole hour.
Both of the above examples show common errors made with averages.
Measures of Location
The Arithmetic Average
The arithmetic average of a set of values is the sum of the values divided by
the number of values.
If x_1, x_2, ..., x_n represent the n numerical values from a random sample, then the formula for the sample mean is:
    \bar{x} = \sum_{i=1}^{n} x_i / n
To find the average (when I use this term subsequently, I will mean the arithmetic average) using EXCEL, one uses the function “average”. It is used just like the “median” function.
Specifically, one types “=average(range of data)”. For the data on steel thickness, you would have something that looks like the following:
By closing the parentheses, you get the average for the data as 354.55.
Computation of the Arithmetic Mean From Grouped Data
If we do not have the raw data but only the frequency distribution of the data, the formula for the sample mean becomes:
    \bar{x} = \sum_i f_i m_i / n
EXCEL does not compute this formula directly. To compute this in EXCEL
for the steel thickness data, one can use the following procedure:
Interval          Midpoint m(i)   Freq f(i)   f(i)*m(i)
341.5 - 344.5         343             1           343
344.5 - 347.5         346             3          1038
347.5 - 350.5         349             8          2792
350.5 - 353.5         352             8          2816
353.5 - 356.5         355            20          7100
356.5 - 359.5         358            13          4654
359.5 - 362.5         361             5          1805
362.5 - 365.5         364             2           728
Sum                                  60         21276
Average                                         354.6
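The same column arithmetic can be sketched in Python. The midpoints and frequencies are copied from the steel thickness table; the variable names are my own.

```python
midpoints = [343, 346, 349, 352, 355, 358, 361, 364]
freqs = [1, 3, 8, 8, 20, 13, 5, 2]

n = sum(freqs)                                            # total number of observations
weighted_sum = sum(f * m for f, m in zip(freqs, midpoints))  # sum of f(i)*m(i)
grouped_mean = weighted_sum / n
print(n, weighted_sum, round(grouped_mean, 2))
```

The grouped result is close to, but not identical to, the raw-data average of 354.55, because every observation in a bin is treated as if it sat at the bin midpoint.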
If one defines the proportion of observations in a bin as
    p_i = f_i / n
then the formula for the mean from grouped data (and also the formula for a discrete probability distribution) is:
    \bar{x} = \sum_i p_i m_i
Using the above, it is then possible to generalize the definition of the mean for data from a continuous distribution with probability density function f(x) as:
    \mu = \int_{-\infty}^{\infty} x f(x)\,dx
Computation with the Average
Consider the problem of having two groups of people: 50 people in Group 1 with an average hourly wage of $15.00, and 100 people in Group 2 with an average hourly wage of $17.00. Can I find the mean of the pooled group of 150 people?
The average of the pooled group is just the total hourly wages of all 150 people divided by 150. Using the formula for the arithmetic average, one can show that:
    n \bar{x} = \sum_i x_i
Therefore the sum of the hourly wages in the first group is 50 x 15 = 750. The sum of the hourly wages in the second group is 100 x 17 = 1700. Finally, the mean of the pooled group is:
    pooled average = (750 + 1700) / (50 + 100) = $16.33
This can be written in formula terms as:
    \bar{x}_{pooled} = (n_1 \bar{x}_1 + n_2 \bar{x}_2) / (n_1 + n_2)
This is a special case of the formula for multiple groups:
    \bar{x}_{pooled} = \sum_i n_i \bar{x}_i / \sum_i n_i
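The multiple-group formula translates directly into a small Python function. This is a sketch only; `pooled_mean` is a hypothetical helper name.

```python
def pooled_mean(sizes, means):
    """Combine group means into one mean, weighting each by its group size."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    total = sum(n * m for n, m in zip(sizes, means))   # n_i * xbar_i recovers each group's sum
    return total / sum(sizes)


wage_pooled = pooled_mean([50, 100], [15.00, 17.00])
print(round(wage_pooled, 2))
```

Only the group sizes and group means are needed, which is exactly the information that was not enough to pool the medians.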
Consider the following example which we discussed previously in connection
with the median:
            Group 1   Group 2   Change
               5         4        -1
              10        12         2
              15        18         3
              20        19        -1
              25        23        -2
Average       15        15.2       0.2
Notice that the change in the means is the same as the mean of the changes.
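A short Python check of this property, using the standard `statistics.mean`. The comparison uses `math.isclose` because floating-point subtraction may not return exactly 0.2.

```python
import math
from statistics import mean

group1 = [5, 10, 15, 20, 25]
group2 = [4, 12, 18, 19, 23]
changes = [b - a for a, b in zip(group1, group2)]

change_in_means = mean(group2) - mean(group1)   # difference of the two averages
mean_of_changes = mean(changes)                 # average of the individual changes
means_agree = math.isclose(change_in_means, mean_of_changes)
print(round(change_in_means, 4), round(mean_of_changes, 4), means_agree)
```

This agreement holds for any paired data, since the average of differences is algebraically the difference of averages.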
Summary
Criterion                                  Median                 Mean
Ease of Understanding                      High                   Reasonable
Computation                                Moderate               Easy
Effect of Outliers                         None                   High
Use in Further Computation                 None                   Easy
Accuracy for Inference to Population       25% worse than mean    Baseline
  for a fixed sample of size n
Simpson’s Paradox
Consider the following data found in the file “meandemo.xls”:
              Male (count)   Males Average   Female (count)   Females Average
Prof               35            60,000             5              65,000
Assoc Prof         25            50,000            20              55,000
Asst Prof          15            40,000            15              45,000
Average                          52,667                            52,500
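The overall averages in the bottom row are head-count-weighted averages of the rank averages. The Python sketch below reproduces them, reading the “Male” and “Female” columns as head counts (an assumption that the arithmetic bears out); the names are illustrative.

```python
# (head count, average salary) by rank, copied from the table
males = {"Prof": (35, 60_000), "Assoc Prof": (25, 50_000), "Asst Prof": (15, 40_000)}
females = {"Prof": (5, 65_000), "Assoc Prof": (20, 55_000), "Asst Prof": (15, 45_000)}


def overall_average(groups):
    """Average of the rank averages, weighted by head count."""
    total_pay = sum(count * avg for count, avg in groups.values())
    total_count = sum(count for count, _ in groups.values())
    return total_pay / total_count


male_avg = overall_average(males)
female_avg = overall_average(females)
females_higher_in_every_rank = all(females[r][1] > males[r][1] for r in males)
print(round(male_avg), round(female_avg), females_higher_in_every_rank)
```

Females have the higher average within every rank, yet the lower overall average, because far more of the males are in the highest-paid rank.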
Or the following data also found in the file “meandemo.xls”:
              Time 1       Time 1 Median   Time 2       Time 2 Median   Median Change
Group 1       30, 35, 48        35         31, 32, 75        32              -3
Group 2       14, 85, 98        85         60, 83, 85        83              -2
Group 3       60, 63, 65        63         61, 62, 98        62              -1
All Groups                      60                           62               2
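The group medians and the pooled medians can be checked with the following Python sketch, using the standard `statistics.median`; the dictionary layout is my own.

```python
from statistics import median

time1 = {"Group 1": [30, 35, 48], "Group 2": [14, 85, 98], "Group 3": [60, 63, 65]}
time2 = {"Group 1": [31, 32, 75], "Group 2": [60, 83, 85], "Group 3": [61, 62, 98]}

# change in the median within each group
group_changes = {g: median(time2[g]) - median(time1[g]) for g in time1}

# change in the median when all nine values are pooled
all_time1 = [v for values in time1.values() for v in values]
all_time2 = [v for values in time2.values() for v in values]
overall_change = median(all_time2) - median(all_time1)
print(group_changes, overall_change)
```

Every group's median falls while the pooled median rises.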
Measures of Scale
The simplest way to measure scale is to find the average distance of each data point from the measure of location (in our case the arithmetic mean). The signed deviations, however, always sum to zero. Symbolically this can be written:
    \sum_i (x_i - \bar{x}) = 0
The fact that some deviations are positive and some negative can be corrected in one of two ways:
1) Use the absolute value to compute the mean absolute deviation (MAD), which in formula terms is:
    MAD = \sum_i |x_i - \bar{x}| / n
or 2) Use the square of the deviations, which in formula terms gives:
    S^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)
and,
    s = \sqrt{S^2}
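These formulas can be tried on a small data set in Python. The sketch below reuses the five-value example from the median discussion and compares the hand computation with the standard `statistics.stdev`, which also uses the n - 1 divisor.

```python
import math
from statistics import stdev

data = [3, -1, 6, 10, 11]           # small raw-data example used earlier for the median
n = len(data)
x_bar = sum(data) / n

deviations = [x - x_bar for x in data]
sum_of_deviations = sum(deviations)                     # zero, up to rounding error
mad = sum(abs(d) for d in deviations) / n               # mean absolute deviation
variance = sum(d ** 2 for d in deviations) / (n - 1)    # S^2 with divisor n - 1
s = math.sqrt(variance)
matches_library = math.isclose(s, stdev(data))          # cross-check against the library

print(round(mad, 4), round(variance, 4), round(s, 4), matches_library)
```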
In EXCEL, the function “stdev” uses the above formula for computing the sample standard deviation.
For the steel thickness data, you would type “=stdev(range)” as shown below:
This yields the value of s = 4.492549.
EXCEL does not automatically compute the standard deviation if the data is grouped. The computing formula to use in this case is given by:
    S^2 = ( \sum_i f_i m_i^2 - n \bar{x}^2 ) / (n - 1)
and then taking the square root.
The necessary terms can be computed in EXCEL as shown in the following table for the steel data:
Interval          Midpoint m(i)   Freq f(i)   f(i)*m(i)   f(i)*m(i)*m(i)
341.5 - 344.5         343             1           343          117,649
344.5 - 347.5         346             3         1,038          359,148
347.5 - 350.5         349             8         2,792          974,408
350.5 - 353.5         352             8         2,816          991,232
353.5 - 356.5         355            20         7,100        2,520,500
356.5 - 359.5         358            13         4,654        1,666,132
359.5 - 362.5         361             5         1,805          651,605
362.5 - 365.5         364             2           728          264,992
Sum                                  60        21,276        7,545,666
which yields an estimate of s = 4.5031.
If only the proportion of observations in each bin is available, then the following approximate formula may be used:
    S^2 = \sum_i p_i m_i^2 - \bar{x}^2
which in this case yields the value of s = 4.465423.
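Both grouped-data formulas can be evaluated in Python as a cross-check. The midpoints and frequencies are copied from the steel data; the variable names are my own.

```python
import math

midpoints = [343, 346, 349, 352, 355, 358, 361, 364]
freqs = [1, 3, 8, 8, 20, 13, 5, 2]

n = sum(freqs)
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))        # sum of f(i)*m(i)
sum_fm2 = sum(f * m * m for f, m in zip(freqs, midpoints))   # sum of f(i)*m(i)*m(i)
x_bar = sum_fm / n

# frequency form, divisor n - 1
s_grouped = math.sqrt((sum_fm2 - n * x_bar ** 2) / (n - 1))

# proportion form, which amounts to using divisor n instead of n - 1
props = [f / n for f in freqs]
s_props = math.sqrt(sum(p * m * m for p, m in zip(props, midpoints)) - x_bar ** 2)

print(sum_fm2, round(s_grouped, 4), round(s_props, 6))
```

The proportion form gives a slightly smaller value because it divides by n rather than n - 1; the difference shrinks as n grows.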
The standard deviation for data following a theoretical distribution with probability density function f(x) can also be defined as:
    \sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2
and,
    \sigma = \sqrt{\sigma^2}
Further Uses of the Mean and Standard Deviation
The Mound Rule:
For data which is “mound” shaped, approximately:
Percent of Data     Region
68%                 mean +/- one standard deviation
95%                 mean +/- two standard deviations
99.7%               mean +/- three standard deviations
For the steel thickness data (which is mound shaped) the exact results are:
Region            Values               %
mean +/- 1 sd     350.1 to 359.0      73.0%
mean +/- 2 sd     345.6 to 363.5      96.7%
mean +/- 3 sd     341.1 to 368.0     100.0%
Chebyshev’s Inequality
For any distribution, at least 100(1 - 1/k^2)% of the data must lie in the region mean +/- k standard deviations (the bound is informative for k > 1).
Specifically, for k=2, at least 75% of the data must lie in the range mean +/- 2 standard deviations.
For k=3, at least 88.9% of the data must lie in the range mean +/- 3 standard deviations.
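The bound is easy to tabulate. In the Python sketch below, `chebyshev_lower_bound` is a hypothetical helper name, and the observed percentages are the exact steel thickness results quoted under the Mound Rule.

```python
def chebyshev_lower_bound(k):
    """Minimum percent of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)


bound_k2 = chebyshev_lower_bound(2)
bound_k3 = chebyshev_lower_bound(3)

# exact steel thickness percentages for mean +/- 2 sd and mean +/- 3 sd
observed = {2: 96.7, 3: 100.0}
bounds_hold = all(observed[k] >= chebyshev_lower_bound(k) for k in observed)
print(bound_k2, round(bound_k3, 1), bounds_hold)
```

Because the inequality must hold for every distribution, it is much more conservative than the Mound Rule for data that really are mound shaped.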
Measures of Relative Position
Class          Mean    Standard Deviation
Monday          85             6
Wednesday       90             8
A student from the Monday night class takes the Wednesday exam and scores 92. To what score in the Monday night class does this score correspond?
Define:
    t = (x - \bar{x}) / s
and
    x = \bar{x} + t s
For the example,
    t = (92 - 90)/8 = .25
    x_{Monday} = 85 + .25 x 6 = 86.5
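The two definitions can be written as a pair of small Python functions. The function names `standardize` and `equivalent_score` are hypothetical and chosen only for this sketch.

```python
def standardize(x, mean, sd):
    """Number of standard deviations that x lies above the class mean."""
    return (x - mean) / sd


def equivalent_score(t, mean, sd):
    """Score lying t standard deviations above the mean of another class."""
    return mean + t * sd


t = standardize(92, mean=90, sd=8)                 # position in the Wednesday class
x_monday = equivalent_score(t, mean=85, sd=6)      # same position in the Monday class
print(t, x_monday)
```

Converting to t first puts both classes on a common scale, so the second step only has to undo the standardization with the other class's mean and standard deviation.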