Midterm Review
GSI: Kellie Ottoboni
July 15, 2015
1 Histograms and Normal Approximation (Ch. 3-5)
Reading and drawing histograms
• The area of a bar represents percentage falling in that interval.
• Height of a bar represents density: percentage of observations per unit on the x-axis. Find the bar
heights by dividing the percentage for that interval by the length of the interval (area/width =
height).
• The y-axis uses density scale: the units are percent per x-axis unit.
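The area/width = height rule can be sketched in a few lines of Python. This is only an illustration: the function name `bar_heights` and the three-interval data are made up for the example and do not come from the course materials.

```python
def bar_heights(percentages, edges):
    """Return density-scale heights (percent per x-axis unit) for each bar.

    percentages[i] is the percent of observations falling in the
    interval from edges[i] to edges[i + 1].
    """
    if len(edges) != len(percentages) + 1:
        raise ValueError("need exactly one more edge than percentages")
    heights = []
    for i, pct in enumerate(percentages):
        width = edges[i + 1] - edges[i]
        heights.append(pct / width)  # area / width = height
    return heights


# Hypothetical histogram: intervals 0-10, 10-20, 20-50 holding 20%, 50%, 30%.
example_heights = bar_heights([20, 50, 30], [0, 10, 20, 50])
print(example_heights)  # [2.0, 5.0, 1.0]
```

Note that the widest interval holds 30% of the data but gets the shortest bar, because its area is spread over 30 units.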
Eyeballing histograms
• The median is the point where 50% of observations lie above and 50% of observations lie below. To
find the median, add up the areas of bars until you get to 50%.
• You can estimate the average once you’ve found the median. If the histogram is symmetric, then the
average is the same as the median. If the data has a long right tail, the average is higher than the
median. If it has a long left tail, the average is lower than the median.
• The best way to guess the SD is to use the 68-95 rule: about 68% of the data (area under the curve)
lies within 1 SD of the average and about 95% of the data lies within 2 SDs of the average. This rule
works best with data that looks roughly like the normal distribution. If the tails don’t die off like
the normal curve, then the SD will probably be bigger than you’d guess for a normal curve with the
same average and range.
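As a rough sketch of "add up the areas of bars until you get to 50%", the function below walks through the bars and interpolates inside the bar where the running total crosses 50%. The interpolation assumes observations are spread evenly within a bar, which is an assumption of this example, and the name `histogram_median` and the data are hypothetical.

```python
def histogram_median(percentages, edges):
    """Estimate the median from bar areas (percentages) and interval edges."""
    running = 0.0
    for i, pct in enumerate(percentages):
        if pct > 0 and running + pct >= 50:
            needed = 50 - running            # area still missing from this bar
            width = edges[i + 1] - edges[i]
            # Assume the bar's area is spread evenly across its width.
            return edges[i] + width * needed / pct
        running += pct
    raise ValueError("percentages should add up to at least 50")


# Hypothetical histogram: intervals 0-10, 10-20, 20-50 holding 20%, 50%, 30%.
example_median = histogram_median([20, 50, 30], [0, 10, 20, 50])
print(example_median)  # 16.0
```

Here the first bar contributes 20%, so the median sits 30/50 of the way through the second bar.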
Changing your data
• Suppose you have a set of numbers with average x̄ and standard deviation SDx . If you multiply
every number in your data set by a constant c and add a constant d to every number, then the new
average is cx̄ + d and the new SD is |c|SDx .
• The general formula for an average is $\frac{1}{n}\sum_{i=1}^{n} x_i$, so the average depends on the sum of your data. If
you add, remove, or modify data points, you can compute the new average by adjusting the sum
for those values and dividing by the new n.
• The “nice” formula for computing the SD is $\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$. It depends on the sum of squares
of your data. To compute a new SD, you must find the sum of squared values for the data points
that are added, removed, or modified.
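The bookkeeping in this section can be sketched as follows: keep track of n, the sum, and the sum of squares, then recompute the average and SD from those three numbers. The function names and the small list 2, 4, 6 are invented for illustration.

```python
import math


def average_and_sd(n, total, total_sq):
    """Average and SD from the count, the sum, and the sum of squares."""
    avg = total / n
    sd = math.sqrt(total_sq / n - avg ** 2)  # the "nice" SD formula
    return avg, sd


def linear_change(avg, sd, c, d):
    """Average and SD after multiplying every number by c and adding d."""
    return c * avg + d, abs(c) * sd


# Hypothetical list: 2, 4, 6.
n, total, total_sq = 3, 2 + 4 + 6, 2**2 + 4**2 + 6**2

# Add the data point 8: update the count, the sum, and the sum of squares.
n, total, total_sq = n + 1, total + 8, total_sq + 8**2
new_avg, new_sd = average_and_sd(n, total, total_sq)
print(new_avg, new_sd)  # average 5.0, SD sqrt(5)

# Multiply everything by -3 and add 1: the SD uses |c|, so it stays positive.
print(linear_change(5.0, 2.0, -3, 1))  # (-14.0, 6.0)
```

Removing a point works the same way with subtraction, and modifying a point is a removal followed by an addition.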
Problem 1 (Purves’ midterm practice, #1)
A list of numbers has an average of 51. A new list is formed by subtracting 50 from every number on the
original list. The r.m.s. of the numbers on the new list is 3.0. a) Find the average of the new list. b) Find
the SD of the original list.
Using the normal approximation
• This is only okay to do if the histogram of your data looks roughly normal.
• To use the normal approximation for a histogram, you need to compute the average and the SD of
the data. Then use these values to standardize the data point x you’re interested in: $z = \frac{x - \bar{x}}{SD_x}$.
• To compute the pth percentile of a distribution, find the z in the normal table that has about p% to
the left of it, then convert back to original units: x̄ + z × SDx. Remember that the normal table in
the book gives the area between −z and z, so you’ll have to do an extra step to find the area to look up.
• Two key facts for finding areas under the normal curve: the area on either side of the center 0 is 50%
and the area above z is equal to the area below −z.
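A minimal sketch of these steps, using `statistics.NormalDist` from the Python standard library in place of the printed normal table. The function names and the average of 100 and SD of 15 are assumptions made for the example, not values from the course.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # average 0, SD 1


def percent_below(x, avg, sd):
    """Percent of the data below x under the normal approximation."""
    z = (x - avg) / sd  # standardize
    return 100 * STANDARD_NORMAL.cdf(z)


def percentile_value(p, avg, sd):
    """Value at the pth percentile: find z with p% to its left, then unstandardize."""
    z = STANDARD_NORMAL.inv_cdf(p / 100)
    return avg + z * sd


def table_area(z):
    """Percent of area between -z and z, the quantity a between -z and z table lists."""
    return 100 * (STANDARD_NORMAL.cdf(z) - STANDARD_NORMAL.cdf(-z))


# Hypothetical data with average 100 and SD 15.
print(round(percent_below(115, 100, 15), 1))    # about 84.1
print(round(percentile_value(50, 100, 15), 1))  # 100.0
print(round(table_area(1), 1))                  # about 68.3
```

The first result matches the two key facts: 50% lies below the center, plus half of the roughly 68% that lies between −1 and 1.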
Problem 2 (Ibser’s midterm practice, #9)
The distribution of heights of a group of women closely follows the normal curve. A woman at the 84th
percentile of heights is 67 inches tall, and a woman at the 7th percentile of heights is 61 inches tall. a) The
average height is ____ inches. b) Someone at the 20th percentile is ____ inches tall.
2 Correlation (Ch 8-9)
Eyeballing correlation
• Correlation measures clustering around a line. Just because points are tightly clustered in a cloud
does not mean they have a strong correlation.
• With both variables plotted in standard units, the flatter the regression line, the closer the correlation is to 0.
Changing your data
• Correlation doesn’t change when you multiply your data by a constant or add a constant shift to your
data. To see why, remember that the correlation is the average of products of your data in standard
units. When you standardize the data, any change of scale or shift disappears. The exception is if
you multiply either x or y by a negative constant; then your correlation will change signs.
• When you don’t know individual data points $(x_i, y_i)$, you can’t standardize the x and y. You should
use the formula
$$r = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{SD_x \, SD_y}$$
This shows that correlation depends on five things: the averages of x and y, the SDs of x and y, and
the sum of products of (x, y) pairs.
• If you change your data and want to compute the new correlation coefficient, then you need to:
1. Compute the new averages and SDs using the method from earlier.
2. Find the sum of products of pairs, $\sum_{i=1}^{n} x_i y_i$, for each old data set (e.g. each of two groups
being combined). You do this by writing out the correlation formula with the old averages, old SDs, and
the old correlation (this must be given to you), and solving for the sum of products.
3. Find the new sum of products of pairs by adding together the sums of products you just found.
4. Plug in all five new values into the correlation formula. Be sure to use the correct n.
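The four steps can be sketched as below for the case of pooling two groups. Everything here is illustrative: the function names, the dictionary layout, and the tiny data sets are assumptions of this example. The raw data are used only to check that pooling from summary statistics agrees with computing directly from all the points.

```python
import math


def describe(xs, ys):
    """Summary statistics of raw (x, y) pairs."""
    n = len(xs)
    avg_x, avg_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum(x * x for x in xs) / n - avg_x ** 2)
    sd_y = math.sqrt(sum(y * y for y in ys) / n - avg_y ** 2)
    r = (sum(x * y for x, y in zip(xs, ys)) / n - avg_x * avg_y) / (sd_x * sd_y)
    return {"n": n, "avg_x": avg_x, "avg_y": avg_y,
            "sd_x": sd_x, "sd_y": sd_y, "r": r}


def raw_sums(g):
    """Solve the average, SD, and correlation formulas for the underlying sums."""
    n = g["n"]
    return {
        "n": n,
        "x": n * g["avg_x"],
        "y": n * g["avg_y"],
        "xx": n * (g["sd_x"] ** 2 + g["avg_x"] ** 2),
        "yy": n * (g["sd_y"] ** 2 + g["avg_y"] ** 2),
        "xy": n * (g["r"] * g["sd_x"] * g["sd_y"] + g["avg_x"] * g["avg_y"]),
    }


def combine(group_a, group_b):
    """Summary statistics of the pooled data, using only the groups' summaries."""
    a, b = raw_sums(group_a), raw_sums(group_b)
    s = {key: a[key] + b[key] for key in a}  # add the sums together
    n = s["n"]                               # be sure to use the new n
    avg_x, avg_y = s["x"] / n, s["y"] / n
    sd_x = math.sqrt(s["xx"] / n - avg_x ** 2)
    sd_y = math.sqrt(s["yy"] / n - avg_y ** 2)
    r = (s["xy"] / n - avg_x * avg_y) / (sd_x * sd_y)
    return {"n": n, "avg_x": avg_x, "avg_y": avg_y,
            "sd_x": sd_x, "sd_y": sd_y, "r": r}


# Hypothetical groups, small enough to check by hand.
group_a = describe([1, 2, 3, 4], [2, 1, 4, 3])
group_b = describe([5, 6, 7], [5, 7, 6])
pooled = combine(group_a, group_b)
direct = describe([1, 2, 3, 4, 5, 6, 7], [2, 1, 4, 3, 5, 7, 6])
print(pooled["r"], direct["r"])  # the two should agree up to rounding
```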
Problem 3 (Ibser’s midterm practice, #8)
At a university, a group of 200 freshmen has an average VSAT score of 550, and the average MSAT score
is 540. The correlation between VSAT and MSAT score is 0.6. The SDs for both exams are 100. A group
of 100 sophomores has an average VSAT score of 550, and an average MSAT score of 600. The SDs for
both exams are also 100. The correlation between VSAT and MSAT scores for the sophomores is 0.5.
(This is too long for a midterm, but good practice.) a) Find the average VSAT and MSAT scores for all
300 students. b) Find the SDs for VSAT and MSAT scores for all 300 students. c) Find the correlation
between VSAT and MSAT scores for all 300 students.
3 Regression (Ch 10-12)
Prediction using regression
• The regression line tries to approximate the average y value for each fixed x.
• If an individual is s standard deviations away from the average in x, then the regression line predicts
that they will be r × s standard deviations away from the average in y.
• Two ways to get the regression prediction:
1. Use the “regression towards the mean” idea: standardize the observed x value to get z, then
ypred = ȳ + r × z × SDy
2. Find the regression line: The slope is $m = r \frac{SD_y}{SD_x}$ and the intercept b can be solved for using the
fact that the line passes through the point of averages (x̄, ȳ). Then ypred = mx + b.
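Both routes to the prediction can be sketched side by side to see that they give the same answer. The function names and the summary statistics (averages 50 and 60, SDs 10 and 20, r = 0.5) are hypothetical values chosen for the example.

```python
def predict_by_standard_units(x, avg_x, sd_x, avg_y, sd_y, r):
    """Method 1: standardize x, shrink by r, convert back to y units."""
    z = (x - avg_x) / sd_x
    return avg_y + r * z * sd_y


def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Method 2: slope r * SD_y / SD_x, intercept from the point of averages."""
    slope = r * sd_y / sd_x
    intercept = avg_y - slope * avg_x  # the line passes through (avg_x, avg_y)
    return slope, intercept


# Hypothetical summary statistics.
stats = dict(avg_x=50, sd_x=10, avg_y=60, sd_y=20, r=0.5)

pred_1 = predict_by_standard_units(70, **stats)
slope, intercept = regression_line(**stats)
pred_2 = slope * 70 + intercept
print(pred_1, pred_2)  # 80.0 80.0
```

An x value 2 SDs above average gets a prediction only 0.5 × 2 = 1 SD above the average of y, which is the regression effect.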
RMS Error
• Root mean square (RMS) is just a series of operations we can apply to any set of data.
– If you take the RMS of a list, $\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$, it gives you a sense of the “size” of the numbers.
– If you subtract the average from a list and take the RMS, you get the standard deviation of the
list: $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$. This tells you the average size of deviations from the average.
– If you subtract the predicted y values from the actual observed y values and take the RMS, you
get the RMS error of your data set: $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - y_{i,\mathrm{pred}})^2}$. This tells you the average size of
your prediction errors.
• If you always predict your y value to be the average, ȳ, then the RMS error is the same as the SDy .
• If you use the regression line to make predictions, there is a nice formula for the RMS error:
$\sqrt{1 - r^2}\, SD_y$.
• About 2/3 of the data points fall between the lines y = mx + b ± (RMS error) if the data is football-shaped.
• The regression line has smaller RMS error than any other line passing through the points.
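The RMS facts above can be checked numerically on a small data set. This is a sketch only: the five (x, y) pairs are invented, and the variable names are just for this example.

```python
import math


def rms(values):
    """Root mean square: square each value, average, then take the square root."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
n = len(xs)

avg_x, avg_y = sum(xs) / n, sum(ys) / n
sd_x = rms([x - avg_x for x in xs])  # SD = RMS of deviations from the average
sd_y = rms([y - avg_y for y in ys])
r = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

slope = r * sd_y / sd_x
intercept = avg_y - slope * avg_x
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

rms_error_direct = rms(residuals)                   # RMS of prediction errors
rms_error_formula = math.sqrt(1 - r ** 2) * sd_y    # shortcut formula
rms_error_of_average = rms([y - avg_y for y in ys]) # always predicting avg_y

print(rms_error_direct, rms_error_formula)  # these two should agree
print(rms_error_of_average, sd_y)           # and so should these
```

For this data r works out to 0.8, so the regression line cuts the RMS error to 60% of SD_y.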
Normal approximations in a vertical strip
• Normal approximations in a vertical strip only make sense for football-shaped data (linear, homoskedastic, normally distributed in any vertical strip).
• In a vertical strip at x, the average of the normal curve is the regression prediction of y and the SD
is the RMS error.
• Once you’ve found the average and SD of the vertical strip, the problem becomes a typical problem
of finding area under the normal curve. You no longer need to worry about x values, only the y.
• These types of problems might ask for the percent greater than a certain value or percentile rank of
a value.
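A vertical-strip calculation can be sketched by combining the regression prediction, the RMS error formula, and a normal curve. The function name and the summary statistics below are hypothetical, and `statistics.NormalDist` stands in for the normal table.

```python
import math
from statistics import NormalDist


def percent_below_in_strip(x, y_cutoff, avg_x, sd_x, avg_y, sd_y, r):
    """Percent of points in the vertical strip at x whose y is below y_cutoff.

    Assumes a football-shaped scatterplot.
    """
    z_x = (x - avg_x) / sd_x
    strip_avg = avg_y + r * z_x * sd_y          # regression prediction of y
    strip_sd = math.sqrt(1 - r ** 2) * sd_y     # RMS error
    # From here on only y matters: an ordinary normal-curve area problem.
    return 100 * NormalDist(strip_avg, strip_sd).cdf(y_cutoff)


# Hypothetical summary statistics.
stats = dict(avg_x=50, sd_x=10, avg_y=60, sd_y=20, r=0.6)

# In the strip at x = 60 the new average is 72 and the new SD is 16.
print(round(percent_below_in_strip(60, 72, **stats), 1))  # 50.0
print(round(percent_below_in_strip(60, 88, **stats), 1))  # about 84.1
```

To get the percent above a value instead, subtract the result from 100.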
Problem 4
For the men of age 18 to 24 years in the HANES sample, the relationship between height and systolic
blood pressure can be summarized as follows:
                  Average    SD
Height            70”        3”
Blood pressure    124 mm     15 mm
The correlation coefficient r is −0.2 and we can assume that the scatterplot is football-shaped.
Part a. Find the regression equation for predicting blood pressure from height.
Part b. Predict the average blood pressure of men who were 6ft (72”) tall.
Part c. Find the RMS error for predicting blood pressure from height.
Part d. What percentage of men who are 6ft tall have systolic blood pressure below 100 mm?
4 Probability (Ch. 13-15)
Basics of probability
• The word “and” in a problem signals that you should use the multiplication rule:
P(A and B) = P(A)P(B | A) = P(B)P(A | B)
• P(B | A) = P(B) if and only if the events A and B are independent.
• The word “or” in a problem signals that you should use the addition rule:
P(A or B) = P(A) + P(B) − P(A and B)
P(A and B) = 0 if and only if the events A and B are mutually exclusive.
• If an event can be broken down into lots of cases (e.g. at least one head in 10 flips of a coin), it’s
convenient to use the complement rule:
$P(A) = 1 - P(A^c)$
We write $A^c$ for “not A”.
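The three rules can be sketched with exact fractions. The coin example is the one mentioned above; the card-deck examples and the function names are additions made only for illustration.

```python
from fractions import Fraction


def p_and(p_a, p_b_given_a):
    """Multiplication rule: P(A and B) = P(A) P(B | A)."""
    return p_a * p_b_given_a


def p_or(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


def p_complement(p_not_a):
    """Complement rule: P(A) = 1 - P(not A)."""
    return 1 - p_not_a


# At least one head in 10 fair coin flips: the complement is "no heads at all".
p_no_heads = Fraction(1, 2) ** 10
print(p_complement(p_no_heads))  # 1023/1024

# Hypothetical card examples with a standard 52-card deck.
# Two aces in a row without replacement:
print(p_and(Fraction(4, 52), Fraction(3, 51)))  # 1/221
# One card is an ace or a heart (the ace of hearts is counted once):
print(p_or(Fraction(4, 52), Fraction(13, 52), Fraction(1, 52)))  # 4/13
```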
Problem 5
Part a. For two events A and B, suppose that P(A) = 0.3, P(B) = 0.5, and P(A and B) = 0.6. Calculate
P(A or B).
Part b. Suppose that P(A) = 0.6, P(B) = 0.3, and P(A and $B^c$) = 0.4. Calculate P(A and B).
Problem 6
Four hotels in a big city have the same name, the Grand Hotel. Four people make an appointment to meet
at the Grand Hotel. If each one of the 4 people chooses the hotel at random, calculate the following
probabilities:
1. All 4 choose the same hotel
2. All 4 choose different hotels
Conditional probability and Bayes’ rule
• To find the probability of an event B given that the event A has occurred, use
$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$
Notice that this is just a rearrangement of the multiplication rule for dependent events:
P(A and B) = P(B | A)P(A)
• A typical situation where you’d use Bayes’ rule is when you want to find P(A | B) but you are only
given P(B | A). Then
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Use your favorite method to find the denominator P(B). Usually you’ll have to break it down into
separate cases. For example, we can split B into two mutually exclusive cases:
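$$\begin{aligned}
P(B) &= P(\{B \text{ and } A\} \text{ or } \{B \text{ and } A^c\}) \\
&= P(B \text{ and } A) + P(B \text{ and } A^c) && \text{($A$ and $A^c$ are mutually exclusive)} \\
&= P(B \mid A)P(A) + P(B \mid A^c)P(A^c) && \text{(by the multiplication rule)}
\end{aligned}$$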
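Bayes’ rule with the two-case breakdown of P(B) can be sketched as a single function. The name `bayes` and the input probabilities (1/100, 9/10, 1/20) are made up for this example.

```python
from fractions import Fraction


def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    p_not_a = 1 - p_a
    # Denominator: split B into the cases "B and A" and "B and not A".
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: P(A) = 1/100, P(B | A) = 9/10, P(B | not A) = 1/20.
posterior = bayes(Fraction(1, 100), Fraction(9, 10), Fraction(1, 20))
print(posterior)  # 2/13
```

Even though B is much more likely when A happens, P(A | B) stays small here because A itself is rare.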
Problem 7
If the events A and B are mutually exclusive, and the chance that at least one happens is positive
(P(A or B) > 0), express the probabilities P(A | A or B) and P(B | A or B) in terms of P(A) and P(B).
Problem 8
Suppose that the probability that both of a pair of twins are boys is 0.3, the probability that they are both
girls is 0.26, and the probability that the first child is a boy is 0.52. Find the probability that:
1. The second twin is a boy, given that the first is a boy
2. The second twin is a girl, given that the first is a girl
3. The second twin is a boy
4. The first is a boy and the second is a girl
Hint: Let $b_i$ and $g_i$ denote the events that the $i$th child is a boy or a girl, respectively, $i = 1, 2$.
Problem 9
A population is made up of 52% females and 48% males. An individual, drawn at random, is color blind.
The rate of color-blindness in females is 25% and the rate of color-blindness in males is 5%. What is the
probability that the person is male?
Problem 10
The probability that a missile fired against a target is not intercepted by an antimissile is 2/3. If the missile is not intercepted, then the probability of a successful hit is 3/4. If the missile is intercepted, then it
will not hit the target. Find the probability that the missile was intercepted given that it misses the target.
Binomial coefficients
• The binomial coefficient counts how many ways k things can be chosen from a group of n things,
where the order of the k things doesn’t matter. It is written $\binom{n}{k} = \frac{n!}{k!(n-k)!}$.
• The binomial probability formula is used when there are only two possibilities: success and failure.
If the probability of success is p, then the probability of k successes in n independent trials is
$\binom{n}{k} p^k (1 - p)^{n-k}$.
• The binomial probability formula is only used when there are two possible outcomes, success or
failure, and trials are independent. On the other hand, the binomial coefficient is a general method
for counting things.
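Both formulas can be sketched with `math.comb` (available in Python 3.8 and later). The function name `binomial_probability` and the example values are chosen only for illustration.

```python
from fractions import Fraction
from math import comb, factorial


def binomial_probability(n, k, p):
    """Chance of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Counting: ways to choose 2 things from 5 when order doesn't matter.
print(comb(5, 2))                                      # 10
print(factorial(5) // (factorial(2) * factorial(3)))   # 10, from n! / (k!(n-k)!)

# Hypothetical example: success chance 1/4, exactly 1 success in 3 trials.
print(binomial_probability(3, 1, Fraction(1, 4)))  # 27/64
```

Using `Fraction` keeps the answers exact; passing an ordinary decimal such as 0.25 works too and returns a float.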
Problem 11
30% of people have a pet dog. You pick 5 people at random from a crowd. What is the probability that
Part a. At least one of them has a dog?
Part b. Exactly 3 people have a dog?
Problem 12
The faculty in an academic department consists of 4 assistant professors, 6 associate professors, and 5 full
professors. Also, it has 30 graduate students. A committee of 5 people is going to be formed.
Part a. What is the number of all possible committees consisting of faculty alone?
Part b. How many committees can be formed if 2 graduate students are included and all academic ranks
are represented?
Part c. If the committee is formed at random, what is the probability that it will be made up of all
graduate students?
Problem 13
A course in English composition is taken by 10 freshmen, 15 sophomores, 30 juniors, and
5 seniors. If 10 students are chosen at random, calculate the probability that this group will consist of 2
freshmen, 3 sophomores, 4 juniors, and 1 senior.