Unit 4 Study Guide

Unit 4, Cluster 1, Learning Cycle A: Descriptive Statistics
I can represent data with number line plots, including dot plots, histograms, and box plots.
SUMMARIZE, REPRESENT, AND INTERPRET DATA ON A SINGLE COUNT OR
MEASUREMENT VARIABLE
Problem: Suppose Bob scores the following points in a game:
8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10, 11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9,
11, 12, 13, 10, 8, 7, 8
A _____ _______ includes
all values from the range of
the data and plots a point for
each occurrence of an
observed value on a number
line.
A ______________
subdivides the data into
class intervals and uses a
rectangle to show the
frequency of observations in
those intervals.
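The two definitions above can be tried out on Bob's points. The sketch below is only one way to do it: it tallies each value for a text dot plot, then groups the values into class intervals for a text histogram. The interval width of 3 and the starting edge of 5 are assumptions chosen for illustration, not values given in this guide.

```python
from collections import Counter

# Bob's points, copied from the problem above.
bob = [8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10,
       11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9, 11, 12, 13, 10, 8, 7, 8]

# Dot plot: one mark per occurrence, for every value in the range of the data.
counts = Counter(bob)
for value in range(min(bob), max(bob) + 1):
    print(f"{value:>2} | {'*' * counts[value]}")

# Histogram: class intervals of an assumed width, starting at an assumed edge.
width = 3   # assumption for illustration
start = 5   # assumption for illustration
bins = Counter((x - start) // width for x in bob)
for k in sorted(bins):
    low = start + k * width
    print(f"{low:>2}-{low + width - 1:<2} | {'#' * bins[k]}")
```

A different interval width would change the look of the histogram but not the dot plot, which is one reason the choice of plot matters.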
Skill-based Task
The following data set shows the number of
songs downloaded in one week by each student
in Mrs. Jones’ class: 10, 20, 12, 14, 12, 27, 88, 2,
7, 30, 16, 16, 32, 15, 25, 15, 4, 0, 15, 6.
Choose and create a plot to represent the data.
A _____ ______ (____-___-______ ______) is
a diagram that shows the five-number summary
of a distribution. (The five-number summary
includes the minimum, lower quartile (___
percentile), median (___ percentile), upper
quartile (___ percentile), and the maximum.) In
a modified boxplot, the presence of outliers can
also be illustrated.
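As a rough check on a hand-drawn plot, the five-number summary can be computed for the song-download data from the Skill-based Task above. This sketch assumes the quartiles are found as the medians of the lower and upper halves of the ordered data, and it flags outliers with the 1.5 • IQR rule used later in this guide. The function name is made up for this example.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), taking quartiles as medians of each half."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]              # lower half of the ordered data
    upper = s[len(s) - half:]     # upper half of the ordered data
    return s[0], median(lower), median(s), median(upper), s[-1]

songs = [10, 20, 12, 14, 12, 27, 88, 2, 7, 30, 16, 16, 32, 15, 25, 15, 4, 0, 15, 6]
lo, q1, med, q3, hi = five_number_summary(songs)
iqr = q3 - q1

# Values beyond 1.5 IQRs from the quartiles are marked separately in a modified boxplot.
outliers = [x for x in songs if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(lo, q1, med, q3, hi, outliers)
```

Calculators and software sometimes use a slightly different quartile rule, so their Q1 and Q3 may not match a hand calculation exactly.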
Problem Task
On the midterm math exam, students had the following
scores: 95, 45, 37, 82, 90, 100, 91, 78, 67, 84, 85, 85,
82, 91, 92, 93, 92, 76, 84, 100, 59, 92, 77, 68, 88.
What are the strengths and weaknesses of presenting
this data in a certain type of plot for:
Unit 4, Cluster 1, Learning Cycle B: Descriptive Statistics
I can use the shape of the data to compare two or more data sets with summary statistics by comparing
either center (mean and median) and/or spread (range, interquartile range, and standard deviation).
CENTER
Measures of _______ refer to the summary measures used to describe the most “typical” value in a set of data.
The two most common measures of center are median and the mean.
To find the ____________, we arrange the observations in order from ________ to ________ value. If there is an odd
number of observations, the median is the ________ value. If there is an even number of observations, the median is the
_________ of the _____ middle values.
The ______ of a sample or a population is computed by adding all of the observed values and dividing by the number of
observations.
Bob’s mean value = __
Bob’s median value = ____
Mean vs. Median
As measures of central tendency, the mean and the median each have advantages and disadvantages.
• The median is resistant to _________ _______ so it is a better indicator of the typical observed value if a set
of data is _________.
• If the sample size is ________ and _____________, the mean is often used as the measure of center.
Example: Examine a sample of 10 households to estimate the typical family income. Nine of the households
have incomes between $25,000 and $75,000; but the tenth household has an annual income of $750,000. The
tenth household is an ________ _______ or _____. If we choose a measure to estimate the income of a typical
household, the ______ will greatly over-estimate the income of a typical family (because it is sensitive to
extreme values); while the _________ will not.
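The household example can be checked with numbers. The ten incomes below are hypothetical values chosen only to match the description (nine between $25,000 and $75,000 and one of $750,000). They are not data from this guide.

```python
from statistics import mean, median

# Hypothetical incomes matching the description above (assumed for illustration).
incomes = [25000, 30000, 35000, 40000, 45000, 50000, 55000, 65000, 75000, 750000]

print("mean:  ", mean(incomes))     # pulled far upward by the one extreme value
print("median:", median(incomes))   # stays near the typical household

# Dropping the extreme household barely moves the median but changes the mean a lot.
without_extreme = incomes[:-1]
print("mean without extreme:  ", round(mean(without_extreme)))
print("median without extreme:", median(without_extreme))
```

With these assumed values, the mean is more than twice the median, which is the over-estimate the example describes.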
SPREAD
The ________ of a distribution refers to the ___________ of the data. If the data cluster around a single central
value, the spread is ________. The further the observations fall from the center, the _________ the spread or
variability of the set.
The ________ is the
difference between the
largest and smallest values
in a set of data.
Bob’s range = ____
The __________ ______ (__ __ __) is a measure of variability, based on dividing
a data set into quartiles, four equal parts. The values that divide each part are
called the first ___, second ____ (_______), and third ____ quartiles.
• Q1 is the “middle value” in the ______half of the rank-ordered data.
• Q2 is the _______ value in the set.
• Q3 is the “middle value” in the _____ half of the rank-ordered data.
The IQR = _____ - ______.
It is the distance between the 75th and 25th percentile, the ______ of the middle
50% of the data.
Bob’s IQR = ____ Bob's Q1=____ Bob's Q2=___ Bob's Q3=___
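One way to check hand-computed quartiles for Bob's points is sketched below. It follows the description above: Q1 and Q3 are the middle values of the lower and upper halves of the rank-ordered data. Built-in quartile functions in calculators and software may use a different rule and give slightly different answers.

```python
from statistics import median

bob = [8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10,
       11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9, 11, 12, 13, 10, 8, 7, 8]

s = sorted(bob)                   # rank-order the data first
half = len(s) // 2
q1 = median(s[:half])             # middle value of the lower half
q2 = median(s)                    # the median of the whole set
q3 = median(s[len(s) - half:])    # middle value of the upper half

data_range = s[-1] - s[0]         # largest value minus smallest value
iqr = q3 - q1                     # spread of the middle 50% of the data
print(q1, q2, q3, data_range, iqr)
```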
STANDARD DEVIATION
The __________ ___________ is a measure of how spread out the data is. It indicates the average distance
from the mean for the set of observations.
n: _______ of values in the set
i: represents a ______ value
𝑥̅ : is the ________
xi: is the _____ item
Bob’s standard deviation = ___
If individual data points vary a great deal from the group mean the standard deviation is _____ and vice versa.
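The formula that goes with the symbols n, i, 𝑥̅, and xi does not appear in this text, so the sketch below is an assumption about what it looks like: subtract the mean from each value, square the differences, average them, and take the square root. Dividing by n gives the population version and dividing by n - 1 gives the sample version. Which one your class uses should be checked against the original formula.

```python
from math import sqrt

bob = [8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10,
       11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9, 11, 12, 13, 10, 8, 7, 8]

n = len(bob)                      # number of values in the set
x_bar = sum(bob) / n              # the mean
squared_devs = sum((x_i - x_bar) ** 2 for x_i in bob)

sd_population = sqrt(squared_devs / n)        # divides by n
sd_sample = sqrt(squared_devs / (n - 1))      # divides by n - 1
print(round(sd_population, 2), round(sd_sample, 2))
```

For a set this size the two versions differ only in the second decimal place.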
Example:
Graph 1: same _______, different ________ __________, one has a greater _________ than the other
Graph 2: same ________/_________ _____________, but different _________.
SHAPE
Described by symmetry, number of peaks, direction of skew, or uniformity.
A _________ distribution can be divided at the center so that each half is a mirror image of the other.
Distributions can have few or many ________. Distributions with one clear peak are called __________
(sometimes called _____-________) and distributions with two clear peaks are called ___________.
Distributions with a tail on the right toward the
higher values are said to be _______ ________;
and distributions with a tail on the left toward the
lower values are said to be ________ _______.
When observations in a set of data are equally spread
across the range of the distribution, it is called a
_________ distribution. A uniform distribution has no
clear peaks.
Describe the Shape
____________________
___________________
____________________
_____________________
__________________
____________________
UNUSUAL FEATURES
The two most common unusual features are gaps and outliers.
________ refer to areas of a distribution where there are no observations or the distribution may contain clusters
separated by gaps.
Extreme values are called __________. As a rule, an extreme value is considered to be an outlier if it is at least _____
interquartile ranges below the lower quartile (Q1), or at least ____ interquartile ranges above the upper quartile (Q3).
Q1 – 1.5 • IQR
Q3 + 1.5 • IQR
Bob’s outliers: lie outside the interval (_____, ______).
Summarizing our example of Bob’s points during a game:
Shape is fairly _________. The mean is _____ and the median is ___. The minimum is ___, the maximum is
___, 1st quartile is at _____ and the 3rd quartile is at ___. The range is ___ points, the IQR is ____ points, and
the standard deviation is ____ points. Any outliers must lie outside the interval (_____, ______). This
information helps us determine that most of the values are very close to the center of the graph—this makes
sense because of the small range of values and how many values are close to 10.
Skill-based Task
The boxplots show the distribution of scores on a district writing test in two fifth grade classes at a school.
Compare the range and medians of the scores from the two classes.
Problem Task
Plot data based on populations of European countries.
Plot data based on populations of Asian countries.
Compare and discuss differences in center and spread.
Unit 4, Cluster 1, Learning Cycle C: Descriptive Statistics
I can interpret and discuss differences in shape, center, & spread in context of the data and account for possible outliers.
COMPARING DISTRIBUTIONS
When ______ ________ are used to compare distributions, they are positioned one above the other, using the same scale of
measurement.
With ____ ______, data from two distributions are displayed on the same chart, using the same measurement scale.
Pet ownership in homes on two city blocks.
Pet ownership is slightly _______ in block X.
In block X, most have _____ pets. In block Y,
most have _____ pets. Block X’s pet
ownership is _______ _____ and block Y’s is
more _________. In block Y the pet
ownerships are from __ to __ pets versus __ to
__ pets in block X. There is more __________
in the block Y distribution. There are ___
outliers or gaps in either set of data.
Results from a medical study.
The treatment group received an experimental drug to relieve
cold symptoms and the control group received a placebo.
The box plots show the number of days each group reported
symptoms. Neither group had any gaps or outliers. Both
distributions are a bit skewed to the _______, although the
skew is more prominent in the _________ group. In the
treatment group, cold symptoms lasted __ to ___ days versus
__ to ___ days for the control group. While the median
recovery time is about __ days for the treatment group versus
__ days for the control group. It is difficult to say that the
drug would have positive effect on patient recovery since the
__________ of both groups lie within the __ __ __ of the
other.
Suppose Bob’s friend Alan had these scores:
1, 3, 0, 2, 4, 5, 7, 7, 8, 10, 4, 4, 3, 2, 5, 6, 6, 6, 8, 8, 10, 11, 11, 10, 12, 12, 5, 6, 8, 9, 10, 15, 10, 12, 11, 11, 6, 7, 7, 8
Alan’s summary statistics: minimum is __, Q1 = __, Q2 (median) = ____, Q3 = ___, and maximum is ___. The
𝑥̅ = __, the range of the values is ___ points, the IQR is __ points, and the standard deviation is about ____
points. Any outliers must be outside of the interval (____, _____). Comparing these two sets shows that Alan
has a ______ spread of data with an IQR of ___points versus Bob’s IQR of ____ points. In addition, Alan’s
overall scores are generally ________ than Bob’s, shown by Alan’s median of ____ points, which is less than
Bob’s median score of ___ points. The shape of Alan’s graph is not as ____________ as Bob’s either,
suggesting that Alan’s scores might be slightly skewed _______ since there is a bit of a tail in that direction
whereas Bob’s scores appear to have a __________ distribution.
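The comparison above can be checked side by side. This is a sketch, not the only way to do it: it again assumes quartiles are the medians of the lower and upper halves, and the function name is made up for this example.

```python
from statistics import mean, median

bob = [8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10,
       11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9, 11, 12, 13, 10, 8, 7, 8]
alan = [1, 3, 0, 2, 4, 5, 7, 7, 8, 10, 4, 4, 3, 2, 5, 6, 6, 6, 8, 8,
        10, 11, 11, 10, 12, 12, 5, 6, 8, 9, 10, 15, 10, 12, 11, 11, 6, 7, 7, 8]

def summarize(scores):
    """Center and spread, with quartiles taken as medians of each half."""
    s = sorted(scores)
    half = len(s) // 2
    q1, q3 = median(s[:half]), median(s[len(s) - half:])
    return {"min": s[0], "Q1": q1, "median": median(s), "Q3": q3,
            "max": s[-1], "mean": mean(s), "range": s[-1] - s[0],
            "IQR": q3 - q1}

bob_stats = summarize(bob)
alan_stats = summarize(alan)
for name, stats in (("Bob", bob_stats), ("Alan", alan_stats)):
    print(name, stats)
```

Comparing the two printed lines shows the differences in center (median, mean) and spread (range, IQR) that the paragraph asks about.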
Skill-based Task
The boxplots show the distribution of scores on a
district writing test in two fifth grade classes at a
school. Which class performed better and why?
Problem Task
Find two similar data sets A and B (use textbook or internet
resources). What changes would need to be made to data set
A to make it look like the graph of set B?
Unit 4, Cluster 2, Learning Cycle A: Descriptive Statistics
I can distinguish between categorical and numerical data.
CATEGORICAL vs. NUMERICAL VARIABLES
____________ variables take on values that are names or labels. The color of a ball (e.g., red, green, blue), gender (male or
female), and year in school (freshman, sophomore, junior, senior) are examples of data that cannot be averaged or represented
by a scatter plot as they have ___ numerical meaning.
___________ or _____________ variables represent a measurable quantity. The population of a city is the number of people in
the city – a measurable attribute of the city. Other examples: scores on a set of tests, height and weight, temperature at the
top of each hour.
Examples:
For which set of data can you find an average?
1) The temperature at the park over a 12-hour period was: 60, 64, 66, 71, 75, 77, 78, 80, 78, 77, 73, 65.
Can you find the average temperature over the 12-hour period?
_____________________________________________________
2) Once a week, Maria has to fill her car up with gas. She records the day of the week each time she fills her
car for three months: Monday, Friday, Tuesday, Thursday, Monday, Wednesday, Monday, Tuesday, Monday,
Wednesday, Tuesday, Thursday. Can you find the average day of the week that Maria had to fill her car with
gas?
_________________________________________________________________________
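The two examples above behave differently when you try to summarize them. In this sketch the temperatures are averaged, while the days of the week are only counted, because adding day names has no numerical meaning. Reporting the most frequent day is one reasonable alternative, not the only one.

```python
from collections import Counter
from statistics import mean

# Numerical data: the values can be added and divided.
temps = [60, 64, 66, 71, 75, 77, 78, 80, 78, 77, 73, 65]
print("average temperature:", mean(temps))

# Categorical data: the values are labels, so we count them instead.
days = ["Monday", "Friday", "Tuesday", "Thursday", "Monday", "Wednesday",
        "Monday", "Tuesday", "Monday", "Wednesday", "Tuesday", "Thursday"]
day_counts = Counter(days)
print("most frequent day:", day_counts.most_common(1))
```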
Example: Grade in school is a categorical variable even though grades are written as numbers, because the number describes
a position in school. Age can be either: ages have a specific order that is generally important quantitatively, but age can
also be used categorically as a sorting value.
You need to be careful when assigning numbers as categorical or quantitative data.
Here are the ages of a group of students 12, 15, 11, 14, 12, 11, 15, 13, 12, 12, 11, 14, 13.
a) Find the average age of the students. ________________
b) Suppose 11-year-olds are in 6th grade, 12-year-olds are in 7th grade, 13-year-olds are in 8th grade, 14-year-olds
are in 9th grade, and 15-year-olds are in 10th grade. Find the average grade for the set of data.
____________________
Unit 4, Cluster 2, Learning Cycle B: Descriptive Statistics
I can summarize categorical data using two-way frequency tables.
I can interpret and discuss relative frequencies in context (including joint, marginal, and conditional
relative frequencies).
TWO-WAY FREQUENCY TABLES
Examining relationships between categorical variables. The entries in the cells of a two-way table can be
_______ _________ or ________ __________.
The two-way frequency table below shows the favorite leisure activities for 50 adults – 20 men and 30 women using
________ ________.
The two-way frequency table below shows the favorite leisure activities for 50 adults – 20 men and 30 women using
_______ _________, written as a decimal or percentage. It is the ratio of the ________ number of a particular event to the
________ number of events; often taken as an estimate of probability.
The table above shows relative frequencies for the _______ ________. The tables below show relative
frequencies for _________ and the relative frequencies for ______________.
Each type of relative frequency table makes a different contribution to understanding the relationship between
gender and preferences for leisure activities.
List one observation:___________________________________________________________
MARGINAL, JOINT, and CONDITIONAL FREQUENCIES
Frequencies are the number of ______________ in a given statistical category.
Entries in the “Total” row and “Total” column are called _________ ___________ or the _________ __________.
Entries in the body of the table are called ___________ ____________.
The relative frequencies in the body of the table
below are called _________ ___________ or the
__________ _____________.
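The tables themselves are not reproduced in this text, so the sketch below uses made-up counts. Only the totals of 20 men and 30 women come from this guide. The activity names and cell counts are hypothetical. It shows how the three kinds of relative frequency are computed from the same counts.

```python
# Hypothetical two-way table: only the row totals (20 men, 30 women) come from the guide.
table = {
    "Men":   {"Dance": 2,  "Sports": 10, "TV": 8},
    "Women": {"Dance": 16, "Sports": 6,  "TV": 8},
}

total = sum(sum(row.values()) for row in table.values())

# Relative frequencies for the whole table: each cell divided by the grand total.
joint = {(g, a): c / total for g, row in table.items() for a, c in row.items()}

# Relative frequencies from the "Total" column: each row total divided by the grand total.
row_marginal = {g: sum(row.values()) / total for g, row in table.items()}

# Relative frequencies for rows: each cell divided by its own row total.
conditional_by_row = {g: {a: c / sum(row.values()) for a, c in row.items()}
                      for g, row in table.items()}

print(joint[("Men", "Sports")], row_marginal["Women"], conditional_by_row["Men"]["Sports"])
```

The same cell (men who prefer sports) gives a different percentage depending on whether it is divided by the grand total or by the number of men, which is why each table answers a different question.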
Example: A public opinion survey explored the
relationship between age and support for increasing the
minimum wage.
In the 41 to 60 age group, what percentage supports
increasing the minimum wage?__________________
Unit 4, Cluster 2, Learning Cycle D: Descriptive Statistics
I can recognize possible associations and trends in the data.
ASSOCIATIONS and TRENDS
An ___________ is a connection between data values.
A ________ is a change (either positive, negative or constant) in data values over time.
Example: the number of males versus females in the US Military. Although there are still more males than
females in the Armed Forces, the ______ is that the gap is closing. There is ___ association between the
number of females and the number of males in the US Military. We cannot draw any conclusions about a
relationship between the two.
Taking a further look at Bob’s and Alan’s points scored:
Using the example of Bob’s and Alan’s
points, the frequency and relative
frequency tables above show that Bob
received a 15 or high score __% of the
time, whereas Alan received that score
only ____% of the time. If the same
trends occur, you would expect Bob to
get a score of 15 _________ as often as
Alan.
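The frequency tables are not reproduced in this text, but the relative frequency of a score of 15 can be recomputed directly from the two lists of points. This sketch simply counts how often 15 appears and divides by the number of games.

```python
bob = [8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10,
       11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9, 11, 12, 13, 10, 8, 7, 8]
alan = [1, 3, 0, 2, 4, 5, 7, 7, 8, 10, 4, 4, 3, 2, 5, 6, 6, 6, 8, 8,
        10, 11, 11, 10, 12, 12, 5, 6, 8, 9, 10, 15, 10, 12, 11, 11, 6, 7, 7, 8]

# Relative frequency = count of the event / total number of observations.
bob_rel = bob.count(15) / len(bob)
alan_rel = alan.count(15) / len(alan)

# How many times as often Bob scored 15 compared with Alan.
times_as_often = bob.count(15) / alan.count(15)

print(f"Bob: {bob_rel:.1%}  Alan: {alan_rel:.1%}  ratio: {times_as_often}")
```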
Skill-based Task
What is the joint frequency of students who have
chores and a curfew? __________
Which marginal frequency is the largest? _______
Problem Task
Collect data that compares populations of countries with
square miles. What trends emerge when we compare living in
geographically large countries with those that are highly
populated?
Unit 4, Cluster 2, Learning Cycle E: Descriptive Statistics
I can represent two quantitative data sets on a scatter plot and describe how the variables are related
using the correlation coefficient.
SCATTER PLOT
A ________ _______ is a graphic tool used to display the relationship between two quantitative (numerical)
variables. It consists of a _________ axis, a _______ axis, and observations plotted with _______. Each
______ on the scatter plot represents a paired observation from the data set. The ___________ of the dot on
the scatter plot represents its x- and y-coordinates. The __-___________ can be described as the input,
explanatory, or independent variable; whereas the __-____________ can be referred to as the output,
response, or dependent variable.
Example:
The table and scatter plot show the
height and the weight of five starters on a
high school basketball team.
Each player in the table is represented by
a ____ on the scatter plot. The first dot
represents the shortest, lightest player.
From the scale on the x-axis, the shortest
player is ___ inches tall; and from the
scale on the y-axis, you see that he/she
weighs ____ pounds.
Scatter plots are helpful in understanding patterns in _________ _______. For example, the above scatter plot
shows that the relationship between height and weight is _____, ________, and has a ___________ slope.
Unit 4, Cluster 2, Learning Cycle F: Descriptive Statistics
I can fit a linear function to a scatter plot or data set that shows a linear association and fit an
exponential function to a set of data and use these functions to solve problems.
A _____ __ _____ ____ (or
“trend” line) is a straight line
that best represents the data on
a scatter plot. This line may
pass through some of the
points, none of the points, or
all of the points.
If you are looking for values
that fall within the range of
values plotted on the scatter
plot, you are ___________.
If you are looking for values that fall beyond
the range of those values plotted on the
scatter plot, you
are ________________.
Beware the dangers of extrapolation!
The _______ away from the plotted values
you go, the ______ reliable your prediction
can become.
Is there a linear association between the fat grams and the total calories in fast food? ________ Can we predict
the number of calories from the number of grams of fat in a sandwich?_______
If a sandwich has 26 grams of fat, how many calories will it have based on the data?________
Create a scatter plot and find the line of best fit or the regression line. ___________________
1. Using your regression equation, find the total calories based upon 26 grams of fat? _________
2. Using your regression equation, find the total calories based upon 40 grams of fat? _________
A rapidly growing strain of bacteria has been discovered. Its growth rate is shown in the chart below. Prepare a
scatter plot of the data with hours as the independent variable and the number of bacteria as the dependent
variable.
1. Determine which regression model will best approximate your data. We will limit our choices to linear
and exponential regression models. Let’s examine both.
It makes sense that the ___________ model would best represent the data since exponential models are often
used with population growth (even when the population is bacteria.)
2. Write the regression equation for your model, rounding the values to three decimal places
_______________________
3. What is the correlation coefficient for this data and what does it tell you? ____________
___________________________________________________________________
4. Using your regression equation, determine how many bacteria, to the nearest
integer, will be present in 12 hours.
____________________________________________________
5. Using your regression equation, determine how many bacteria, to the
nearest integer, will be present in 3.5 hours.
____________________________________________________
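The bacteria chart is not reproduced in this text, so the sketch below uses hypothetical counts that double every hour. It fits a model of the form y = a • b^x by running a least-squares line through the points (x, ln y), which is one common approach. A graphing calculator's exponential regression may round or report the equation differently.

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on the transformed points (x, ln y)."""
    n = len(xs)
    logs = [log(y) for y in ys]
    x_bar = sum(xs) / n
    l_bar = sum(logs) / n
    slope = (sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs))
             / sum((x - x_bar) ** 2 for x in xs))
    a = exp(l_bar - slope * x_bar)   # starting amount
    b = exp(slope)                   # growth factor per unit of x
    return a, b

# Hypothetical data (not the chart from this guide): the count doubles each hour.
hours = [0, 1, 2, 3, 4, 5]
bacteria = [3, 6, 12, 24, 48, 96]

a, b = fit_exponential(hours, bacteria)

def predict(t):
    return a * b ** t

print(round(a, 3), round(b, 3))
print(round(predict(3.5)))   # inside the plotted range: interpolation
print(round(predict(12)))    # beyond the plotted range: extrapolation
```

Because 12 hours is far outside the hours in the data, that prediction deserves less trust than the one for 3.5 hours.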
Unit 4, Cluster 2, Learning Cycle G: Descriptive Statistics
I can determine when a linear model is appropriate for a data set by plotting and analyzing residuals.
RESIDUALS
__________ (or error) represents unexplained (or residual) variation after fitting a regression model.
A linear regression model is not always appropriate for the data. You can assess the appropriateness of the
model by examining the residual plot. The _______ between the observed value of the dependent variable (y)
and the predicted value (ŷ) is called the residual (e). The predicted value is the y-value generated by the linear
regression model (or line of best fit).
Each data point has ____ residual. Notice from the equation below that if an observed value (point) falls below
the regression line, its residual will be ________, whereas points that fall above the regression line have a
_________ residual value.
residual = observed value – predicted value
e = y – ŷ
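The residual equation can be applied point by point. This sketch uses the age and energy data from the Skill-based Task in this learning cycle together with an illustrative line, ŷ = 120x + 1090. That line is an assumption chosen for round numbers, not a fitted regression result.

```python
# (age, average daily energy requirement) pairs from the Skill-based Task in this cycle.
data = [(1, 1110), (2, 1300), (5, 1800), (11, 2500), (14, 2800), (17, 3000)]

def predicted(x):
    """Illustrative line (assumed, not fitted): y-hat = 120x + 1090."""
    return 120 * x + 1090

# residual = observed value - predicted value
residuals = [y - predicted(x) for x, y in data]

for (x, y), e in zip(data, residuals):
    position = "above" if e > 0 else "below"
    print(f"x = {x:>2}: observed {y}, predicted {predicted(x)}, residual {e:+} ({position} the line)")
```

With this assumed line the residuals run negative, then positive, then negative again as x increases. A pattern like that in a residual plot is the kind of evidence used to question a linear model.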
RESIDUAL PLOTS
A _________ ________ is a graph that shows the residual values on the vertical axis and the independent (x)
variable on the horizontal axis.
If the points in a residual plot show no ________ ________, a linear regression model is appropriate for the
data; otherwise, a non-linear model is more appropriate. Below, the residual plots show three typical patterns.
(good fit for a linear model)
(a better fit for a non-linear model)
Below, the table presents data from measuring the left foot and the left forearm of students in a math class. The
scatter plot in the center suggests a ________ association. A linear function is also graphed and the equation
given. This equation was used to calculate the expected values in the table. The difference between the actual
measurement of the left forearm and the expected value gives us the residual for each x coordinate. The
residuals are plotted in the graph on the right. The residual plot shows a _______ pattern. Therefore, a
________ model is appropriate. The correlation coefficient will measure the strength of this linear relationship.
Skill-based Task
The following data shows the age and average
daily energy requirements for male children and
teens (1, 1110), (2, 1300), (5, 1800), (11, 2500),
(14, 2800), (17, 3000). Create a graph and find a
linear function to fit the data. Using your function,
what is the daily energy requirement for a male 15
years old? Would your model apply to an adult
male? Explain.
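One way to find a linear function for the task above is a least-squares line, sketched below with the standard slope and intercept formulas. A line drawn by eye or found with a calculator is also acceptable and may give slightly different numbers.

```python
# (age, average daily energy requirement) pairs from the task above.
ages = [1, 2, 5, 11, 14, 17]
energy = [1110, 1300, 1800, 2500, 2800, 3000]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(energy) / n

# Least-squares slope and intercept.
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, energy))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

def predict(age):
    return slope * age + intercept

print(f"y = {slope:.2f}x + {intercept:.2f}")
print("age 15:", round(predict(15)))   # inside the data range (interpolation)
print("age 40:", round(predict(40)))   # far outside the data range (extrapolation)
```

The prediction for age 15 falls between the observed values for ages 14 and 17, while the prediction for age 40 keeps climbing at the same rate, which is a reason to doubt the model for adults.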
Problem Task
Collect data on forearm length and height in a class. Plot the
data and estimate a linear function for the data. Compare and
discuss different student representations of the data and
equations they discover. Could the equation(s) be used to
estimate the height for any person with a known forearm
length? Why or why not?
Unit 4, Cluster 3, Learning Cycle A: Descriptive Statistics
I can interpret the slope (rate of change) and intercept (constant term) of given linear models in context.
INTERPRET LINEAR MODELS
The scatter plot below displays data that relates the number of oil changes per year and the cost of engine
repairs. If the line of best fit is given by the equation, y = -70x + 630, what does the slope of -70 mean within
context of the problem? _________________ What does the y- intercept (the constant term) of 630 mean
within the context of the problem?____________________
____________ is the average rate of change.
What is the change in the cost of repairs for each oil change?______
This indicates that on average for each additional oil change per year, the cost of engine repairs will tend to
________ by $___.
The __________ of (0, ____) means that if there were zero oil changes, engine repairs would be predicted to
cost about $____.
From the graph, the x-intercept is about __, which means that a car owner would expect to spend nothing on
engine repairs if they changed the oil __ times a year.
Is this a sensible number of oil changes per year?______
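The answers above can be checked against the given equation, y = -70x + 630. This sketch reads the slope, y-intercept, and x-intercept directly from the model. The function and variable names are made up for this example.

```python
def predicted_repair_cost(oil_changes):
    """Line of best fit given above: y = -70x + 630."""
    return -70 * oil_changes + 630

# Slope: change in predicted cost for one additional oil change per year.
slope = predicted_repair_cost(1) - predicted_repair_cost(0)

# y-intercept: predicted cost with zero oil changes.
y_intercept = predicted_repair_cost(0)

# x-intercept: number of oil changes where the predicted cost reaches zero.
x_intercept = 630 / 70

print(slope, y_intercept, x_intercept)
```

Whether the x-intercept is a realistic number of oil changes is a question about the context, not about the arithmetic.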
Skill-based Task
Collect power bills and graph the cost of electricity
compared to the number of kilowatt hours used.
Find a function that models the data and tell what
the intercept and slope mean in the context of the
problem.
Problem Task
Create a poster of bivariate data with a linear relationship.
Describe for the class the meaning of the data, including the
meaning of the slope and intercept in the context of the data.
Unit 4, Cluster 3, Learning Cycle B: Descriptive Statistics
I can compute, using technology, and interpret the correlation coefficient of a linear fit.
CORRELATION COEFFICIENT
The correlation coefficient, __, measures the strength of the _______ association between two variables.
How to Interpret a Correlation Coefficient
The sign and the absolute value of r, the correlation coefficient, describe the __________ and the _________
of the relationship between two variables.
• The value of a correlation coefficient ranges between ___ and __.
• The _______ the absolute value of a correlation coefficient, the stronger the _______ relationship.
• The strongest linear relationship is indicated by a correlation coefficient of ___ or __.
• The weakest linear relationship is indicated by a correlation coefficient of __.
• A positive correlation means that as one variable gets larger, the other variable tends to get _______ as well.
• A negative correlation means that as one variable gets larger, the other variable tends to get _______.
• The correlation coefficient and the slope of the line will have the same ______.
The correlation coefficient, r, only measures _______relationships. Therefore, a correlation of 0 does not
mean there is no relationship between two variables; rather, it means there is ___ linear relationship. (It is
possible for two variables to have a zero correlation coefficient and a strong curvilinear relationship at the
same time.)
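The points in this section can be seen with three small data sets. The sketch below computes r with the usual sums-of-deviations formula. The data sets are made up for illustration: one perfectly linear with positive slope, one perfectly linear with negative slope, and one that follows a curve.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r from sums of deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]                       # made-up x-values
r_positive = pearson_r(xs, [2 * x + 1 for x in xs])   # points on a rising line
r_negative = pearson_r(xs, [-3 * x for x in xs])      # points on a falling line
r_curve = pearson_r(xs, [x ** 2 for x in xs])         # points on a parabola

print(r_positive, r_negative, r_curve)
```

The parabola case has a strong relationship between x and y, yet its r shows no linear relationship, which is the warning in the last paragraph above.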
SCATTER PLOTS & CORRELATION COEFFICIENTS
Several points are evident from the scatter plots.
• When the slope (average rate of change) of the line in the plot is _________, the correlation is _________; and vice versa.
• The strongest correlations (r = 1.0 and r = -1.0) occur when data points fall __________ on a straight line.
• The correlation becomes weaker as the data points become more __________ away from the line.
• If the data points have no particular pattern, the correlation is equal to _______.
• The correlation is affected by an ________, a data point that diverges greatly from the overall pattern of
data and has a large residual.
Compare the first scatter plot with the last scatter plot. The single outlier in the last plot greatly __________
the correlation (from 1.00 to 0.71).
Skill-based Task
The correlation coefficient of a given data set is 0.97. List three specific things this tells you about the data.
1.
2.
3.
Problem Task
Hypothesize the correlation between two sets of seemingly related data. Gather data to support or refute your hypothesis.
Unit 4, Cluster 3, Learning Cycle C: Descriptive Statistics
I can distinguish between correlation and causation.
CORRELATION & CAUSATION
Warning: Correlation does not imply causation!!
A value that shows the strength of the linear relationship between two variables is a ____________ ______________.
However, a strong correlation between two sets of data _____ ____ mean that there is a cause-and-effect relationship between
the two.
_________________ occurs only when the relationship between the two variables can be proven through a scientific experiment
following strict guidelines. Only in this way can we rule out other factors that may affect the relationship that we see in the
observed values.
EXAMPLES: When a scatter plot shows a correlation between two variables, even if it's a strong one, there is not necessarily a
__________ and ________ relationship. Both variables could be related to some third variable that actually causes the
apparent correlation. An apparent correlation simply could be the result of chance.
1. During the month of June the number of new babies born at the Utah Valley Hospital was recorded for a week. Over the same
time period, the number of cakes sold at Carlo’s Bakery in Hoboken, New Jersey was also recorded. From looking at the graph, it
can be seen that a high positive correlation exists between these two sets of data. This must mean that as the number of babies born at
Utah Valley Hospital increases, the number of cakes that are sold at Carlo’s Bakery will also increase. Is this true? ______________
2. Here's an example where an apparent correlation is actually the result of a hidden third variable.
"Mr. Jones gave a math test to all the students in his school. He made the startling discovery that the
taller students did better than the short ones. His conclusion was that 'as your height increases, so does
your math ability'.” Of course, the hidden variable here is the age of the students, or the grade level they're in. Mr. Jones gave
the same test to every student in grades one through twelve, so of course the students in the higher grades who had learned more
math did better on the test. It wasn't the increase in the students' height
that caused the math results to go up, but their increase in _______ ___________.
3. Scientists who use scatter plots to look for correlations between variables watch out for this hidden variable problem.
An American medical researcher wants to see if there is a link between a person's socio-economic status
(how much money they have) and certain types of cancer. His research seems to indicate that there is a
link ... rich people seem to suffer from more cancers than poor people do. His conclusion is that being
rich will make you more likely to get cancer. His rationale for concluding this is that the stress that
comes with high-paying jobs could be a factor in causing some cancers. Sound logical? The connection the
researcher described may actually exist, but there are certainly other possibilities. In the United States there is no universal free
medical care, so rich people are far more likely to see a doctor than poor people. The higher rates of cancer in rich people may not be
due to their wealth at all, but due instead to the fact that they visit their doctor more often (they can afford it), so more cancers are
being _____________ for them.
Skill-based Task
Give an example of a data set that has strong
correlation but no causation and describe why this is
so.
Give an example of a data set that has both strong
correlation and causation and write a description of
why this is so.
Problem Task
Find media artifacts that make claims of causation and
evaluate them.