Unit 4, Cluster 1, Learning Cycle A: Descriptive Statistics
I can represent data with number line plots, including dot plots, histograms, and box plots.

SUMMARIZE, REPRESENT, AND INTERPRET DATA ON A SINGLE COUNT OR MEASUREMENT VARIABLE
Problem: Suppose Bob scores the following points in a game: 8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10, 11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9, 11, 12, 13, 10, 8, 7, 8

A _____ _______ includes all values from the range of the data and plots a point for each occurrence of an observed value on a number line.

A ______________ subdivides the data into class intervals and uses a rectangle to show the frequency of observations in those intervals.

A _____ ______ (____-___-______ ______) is a diagram that shows the five-number summary of a distribution. (The five-number summary includes the minimum, lower quartile (___ percentile), median (___ percentile), upper quartile (___ percentile), and the maximum.) In a modified boxplot, the presence of outliers can also be illustrated.

Skill-based Task
The following data set shows the number of songs downloaded in one week by each student in Mrs. Jones’ class: 10, 20, 12, 14, 12, 27, 88, 2, 7, 30, 16, 16, 32, 15, 25, 15, 4, 0, 15, 6. Choose and create a plot to represent the data.

Problem Task
On the midterm math exam, students had the following scores: 95, 45, 37, 82, 90, 100, 91, 78, 67, 84, 85, 85, 82, 91, 92, 93, 92, 76, 84, 100, 59, 92, 77, 68, 88. What are the strengths and weaknesses of presenting this data in a certain type of plot for:

Unit 4, Cluster 1, Learning Cycle B: Descriptive Statistics
I can use the shape of the data to compare two or more data sets with summary statistics by comparing either center (mean and median) and/or spread (range, interquartile range, and standard deviation).

CENTER
Measures of _______ refer to the summary measures used to describe the most “typical” value in a set of data.
The two most common measures of center are the median and the mean.

To find the ____________, we arrange the observations in order from ________ to ________ value. If there is an odd number of observations, the median is the ________ value. If there is an even number of observations, the median is the _________ of the _____ middle values.
Bob’s median value = ____

The ______ of a sample or a population is computed by adding all of the observed values and dividing by the number of observations.
Bob’s mean value = __

Mean vs. Median
As measures of central tendency, the mean and the median each have advantages and disadvantages.
• The median is resistant to _________ _______ so it is a better indicator of the typical observed value if a set of data is _________.
• If the sample size is ________ and _____________, the mean is often used as the measure of center.

Example: Examine a sample of 10 households to estimate the typical family income. Nine of the households have incomes between $25,000 and $75,000, but the tenth household has an annual income of $750,000. The tenth household is an ________ _______ or _____. If we choose a measure to estimate the income of a typical household, the ______ will greatly overestimate the income of a typical family (because it is sensitive to extreme values), while the _________ will not.

SPREAD
The ________ of a distribution refers to the ___________ of the data. If the data cluster around a single central value, the spread is ________. The further the observations fall from the center, the _________ the spread or variability of the set.

The ________ is the difference between the largest and smallest values in a set of data.
Bob’s range = ____

The __________ ______ (__ __ __) is a measure of variability based on dividing a data set into quartiles, four equal parts. The values that divide the parts are called the first ___, second ____ (_______), and third ____ quartiles.
• Q1 is the “middle value” in the ______ half of the rank-ordered data.
• Q2 is the _______ value in the set.
• Q3 is the “middle value” in the _____ half of the rank-ordered data.
The IQR = _____ - ______. It is the distance between the 75th and 25th percentiles, the ______ of the middle 50% of the data.
Bob’s Q1 = ____ Bob’s Q2 = ___ Bob’s Q3 = ___ Bob’s IQR = ____

STANDARD DEVIATION
The __________ ___________ is a measure of how spread out the data is. It indicates the average distance from the mean for the set of observations.
n: _______ of values in the set
i: represents a ______ value
𝑥̅: is the ________
x_i: is the _____ item
Bob’s standard deviation = ___
If individual data points vary a great deal from the group mean, the standard deviation is _____, and vice versa.

Example:
Graph 1: same _______, different ________ __________; one has a greater _________ than the other.
Graph 2: same ________/_________ _____________, but different _________.

SHAPE
Described by symmetry, number of peaks, direction of skew, or uniformity.

A _________ distribution can be divided at the center so that each half is a mirror image of the other.

Distributions can have few or many ________. Distributions with one clear peak are called __________ (sometimes called _____-________) and distributions with two clear peaks are called ___________.

Distributions with a tail on the right toward the higher values are said to be _______ ________; and distributions with a tail on the left toward the lower values are said to be ________ _______.

When observations in a set of data are equally spread across the range of the distribution, it is called a _________ distribution. A uniform distribution has no clear peaks.

Describe the Shape
____________________ ___________________ ____________________
_____________________ __________________ ____________________

UNUSUAL FEATURES
The two most common unusual features are gaps and outliers.
________ refer to areas of a distribution where there are no observations; the distribution may contain clusters separated by gaps.

Extreme values are called __________. As a rule, an extreme value is considered to be an outlier if it is at least _____ interquartile ranges below the lower quartile (Q1), or at least ____ interquartile ranges above the upper quartile (Q3).
Q1 - 1.5 • IQR and Q3 + 1.5 • IQR
Bob’s outliers: lie outside the interval (_____, ______).

Summarizing our example of Bob’s points during a game: Shape is fairly _________. The mean is _____ and the median is ___. The minimum is ___, the maximum is ___, the 1st quartile is at _____ and the 3rd quartile is at ___. The range is ___ points, the IQR is ____ points, and the standard deviation is ____ points. Any outliers must lie outside the interval (_____, ______). This information helps us determine that most of the values are very close to the center of the graph; this makes sense because of the small range of values and how many values are close to 10.

Skill-based Task
The boxplots show the distribution of scores on a district writing test in two fifth grade classes at a school. Compare the range and medians of the scores from the two classes.

Problem Task
Plot data based on populations of European countries. Plot data based on populations of Asian countries. Compare and discuss differences in center and spread.

Unit 4, Cluster 1, Learning Cycle C: Descriptive Statistics
I can interpret and discuss differences in shape, center, & spread in context of the data and account for possible outliers.

COMPARING DISTRIBUTIONS
When ______ ________ are used to compare distributions, they are positioned one above the other, using the same scale of measurement.

With ____ ______, data from two distributions are displayed on the same chart, using the same measurement scale.

Pet ownership in homes on two city blocks. Pet ownership is slightly _______ in block X. In block X, most have _____ pets.
In block Y, most have _____ pets. Block X’s pet ownership is _______ _____ and block Y’s is more _________. In block Y the pet ownerships are from __ to __ pets versus __ to __ pets in block X. There is more __________ in the block Y distribution. There are ___ outliers or gaps in either set of data.

Results from a medical study. The treatment group received an experimental drug to relieve cold symptoms and the control group received a placebo. The box plots show the number of days each group reported symptoms. Neither group had any gaps or outliers. Both distributions are a bit skewed to the _______, although the skew is more prominent in the _________ group. In the treatment group, cold symptoms lasted __ to ___ days versus __ to ___ days for the control group. The median recovery time is about __ days for the treatment group versus __ days for the control group. It is difficult to say that the drug would have a positive effect on patient recovery since the __________ of both groups lie within the __ __ __ of the other.

Suppose Bob’s friend Alan had these scores: 1, 3, 0, 2, 4, 5, 7, 7, 8, 10, 4, 4, 3, 2, 5, 6, 6, 6, 8, 8, 10, 11, 11, 10, 12, 12, 5, 6, 8, 9, 10, 15, 10, 12, 11, 11, 6, 7, 7, 8

Alan’s summary statistics: minimum is __, Q1 = __, Q2 (median) = ____, Q3 = ___, and maximum is ___. The 𝑥̅ = __, the range of the values is ___ points, the IQR is __ points, and the standard deviation is about ____ points. Any outliers must be outside of the interval (____, _____).

Comparing these two sets shows that Alan has a ______ spread of data with an IQR of ___ points versus Bob’s IQR of ____ points. In addition, Alan’s overall scores are generally ________ than Bob’s, shown by Alan’s median of ____ points, which is less than Bob’s median score of ___ points.
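The summary statistics requested for Bob and Alan can be checked with a short script. This is only a minimal sketch, not the worksheet's prescribed method. It assumes the quartile rule described earlier (Q1 and Q3 are the medians of the lower and upper halves of the ordered data) and the population form of the standard deviation (dividing by n), since the worksheet's formula is not shown. The function name `summarize` is made up for illustration.

```python
from statistics import mean, median, pstdev

bob = [8, 15, 10, 10, 10, 15, 7, 8, 10, 9, 12, 11, 11, 13, 7, 8, 9, 9, 8, 10,
       11, 14, 11, 10, 9, 12, 14, 14, 12, 13, 5, 13, 9, 11, 12, 13, 10, 8, 7, 8]
alan = [1, 3, 0, 2, 4, 5, 7, 7, 8, 10, 4, 4, 3, 2, 5, 6, 6, 6, 8, 8,
        10, 11, 11, 10, 12, 12, 5, 6, 8, 9, 10, 15, 10, 12, 11, 11, 6, 7, 7, 8]


def summarize(scores):
    """Return center, spread, and outlier limits for a list of scores."""
    data = sorted(scores)
    n = len(data)
    half = n // 2
    # Assumed quartile rule: medians of the lower and upper halves.
    # When n is odd, the middle value is left out of both halves.
    q1 = median(data[:half])
    q3 = median(data[half + n % 2:])
    iqr = q3 - q1
    return {
        "min": data[0],
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "max": data[-1],
        "mean": mean(data),
        "range": data[-1] - data[0],
        "iqr": iqr,
        # Population standard deviation (divides by n): an assumption.
        "std_dev": pstdev(data),
        # Values outside this interval are outliers under the 1.5 * IQR rule.
        "outlier_limits": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
    }


for name, scores in (("Bob", bob), ("Alan", alan)):
    print(name, summarize(scores))
```

Swapping `pstdev` for `statistics.stdev` would give the sample form of the standard deviation (dividing by n - 1) if that is the version used in class.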
The shape of Alan’s graph is not as ____________ as Bob’s either, suggesting that Alan’s scores might be slightly skewed _______ since there is a bit of a tail in that direction, whereas Bob’s scores appear to have a __________ distribution.

Skill-based Task
The boxplots show the distribution of scores on a district writing test in two fifth grade classes at a school. Which class performed better and why?

Problem Task
Find two similar data sets A and B (use textbook or internet resources). What changes would need to be made to data set A to make it look like the graph of set B?

Unit 4, Cluster 2, Learning Cycle A: Descriptive Statistics
I can distinguish between categorical and numerical data.

CATEGORICAL vs. NUMERICAL VARIABLES
____________ variables take on values that are names or labels. The color of a ball (e.g., red, green, blue), gender (male or female), and year in school (freshman, sophomore, junior, senior) are examples of data that cannot be averaged or represented by a scatter plot as they have ___ numerical meaning.

___________ or _____________ variables represent a measurable quantity. The population of a city is the number of people in the city, a measurable attribute of the city. Other examples: scores on a set of tests, height and weight, temperature at the top of each hour.

Examples: For which set of data can you find an average?
1) The temperature at the park over a 12-hour period was: 60, 64, 66, 71, 75, 77, 78, 80, 78, 77, 73, 65. Can you find the average temperature over the 12-hour period? _____________________________________________________
2) Once a week, Maria has to fill her car up with gas. She records the day of the week each time she fills her car for three months: Monday, Friday, Tuesday, Thursday, Monday, Wednesday, Monday, Tuesday, Monday, Wednesday, Tuesday, Thursday. Can you find the average day of the week that Maria had to fill her car with gas?
_________________________________________________________________________

Example: Grade in school is a categorical variable even though grades are numbers, because the number describes a position in school. Age can be either: ages have a specific order that is generally important quantitatively, but age can also be used categorically as a sorting value. You need to be careful when assigning numbers as categorical or quantitative data.

Here are the ages of a group of students: 12, 15, 11, 14, 12, 11, 15, 13, 12, 12, 11, 14, 13.
a) Find the average age of the students. ________________
b) Suppose 11-year-olds are in 6th grade, 12-year-olds are in 7th grade, 13-year-olds are in 8th grade, 14-year-olds are in 9th grade, and 15-year-olds are in 10th grade. Find the average grade for the set of data. ____________________

Unit 4, Cluster 2, Learning Cycle B: Descriptive Statistics
I can summarize categorical data using two-way frequency tables. I can interpret and discuss relative frequencies in context (including joint, marginal, and conditional relative frequencies).

TWO-WAY FREQUENCY TABLES
Examining relationships between categorical variables. The entries in the cells of a two-way table can be _______ _________ or ________ __________.

The two-way frequency table below shows the favorite leisure activities for 50 adults (20 men and 30 women) using ________ ________.

The two-way frequency table below shows the favorite leisure activities for 50 adults (20 men and 30 women) using _______ _________, written as a decimal or percentage. It is the ratio of the ________ number of a particular event to the ________ number of events; often taken as an estimate of probability.

The table above shows relative frequencies for the _______ ________. The tables below show relative frequencies for _________ and the relative frequencies for ______________.
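The worksheet's tables are not reproduced in this text, so the sketch below uses made-up counts that only match the stated totals (20 men, 30 women, 50 adults). The activity names, the individual cell counts, and the function name `relative_frequencies` are all assumptions for illustration. It shows how the same counts yield three different relative frequency tables depending on which total each cell is divided by.

```python
# Hypothetical counts: only the totals (20 men, 30 women) come from the text.
counts = {
    "Men":   {"Dance": 2,  "Sports": 10, "TV": 8},
    "Women": {"Dance": 16, "Sports": 6,  "TV": 8},
}


def relative_frequencies(table, basis):
    """Divide each count by the whole-table, row, or column total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    row_totals = {r: sum(row.values()) for r, row in table.items()}
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n

    result = {}
    for r, row in table.items():
        result[r] = {}
        for col, n in row.items():
            if basis == "table":
                denominator = grand_total
            elif basis == "row":
                denominator = row_totals[r]
            elif basis == "column":
                denominator = col_totals[col]
            else:
                raise ValueError("basis must be 'table', 'row', or 'column'")
            result[r][col] = n / denominator
    return result


for basis in ("table", "row", "column"):
    print(basis, relative_frequencies(counts, basis))
```

With these made-up counts, dividing by row totals answers questions such as "what fraction of the men prefer sports?", while dividing by column totals answers "what fraction of the people who prefer TV are women?".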
Each type of relative frequency table makes a different contribution to understanding the relationship between gender and preferences for leisure activities.
List one observation:___________________________________________________________

MARGINAL, JOINT, and CONDITIONAL FREQUENCIES
Frequencies are the number of ______________ in a given statistical category.

Entries in the “Total” row and “Total” column are called ___________ ____________.

Entries in the body of the table are called _________ ___________ or the _________ __________.

The relative frequencies in the body of the table below are called _________ ___________ or the __________ _____________.

Example: A public opinion survey explored the relationship between age and support for increasing the minimum wage. In the 41 to 60 age group, what percentage supports increasing the minimum wage?__________________

Unit 4, Cluster 2, Learning Cycle D: Descriptive Statistics
I can recognize possible associations and trends in the data.

ASSOCIATIONS and TRENDS
An ___________ is a connection between data values.

A ________ is a change (either positive, negative or constant) in data values over time.

Example: the number of males versus females in the US Military. Although there are still more males than females in the Armed Forces, the ______ is that the gap is closing. There is ___ association between the number of females and the number of males in the US Military. We cannot draw any conclusions about a relationship between the two.

Taking a further look at Bob’s and Alan’s points scored: Using the example of Bob’s and Alan’s points, the frequency and relative frequency tables above show that Bob received a 15, the high score, __% of the time, whereas Alan received that score only ____% of the time. If the same trends occur, you would expect Bob to get a score of 15 _________ as often as Alan.

Skill-based Task
What is the joint frequency of students who have chores and a curfew?
__________ Which marginal frequency is the largest? _______

Problem Task
Collect data that compares the populations of countries with their areas in square miles. What trends emerge when we compare living in geographically large countries with those that are highly populated?

Unit 4, Cluster 2, Learning Cycle E: Descriptive Statistics
I can represent two quantitative data sets on a scatter plot and describe how the variables are related using the correlation coefficient.

SCATTER PLOT
A ________ _______ is a graphic tool used to display the relationship between two quantitative (numerical) variables. It consists of a _________ axis, a _______ axis, and observations plotted with _______. Each ______ on the scatter plot represents a paired observation from the data set. The ___________ of the dot on the scatter plot represents its x- and y-coordinates. The __-___________ can be described as the input, explanatory, or independent variable; whereas the __-____________ can be referred to as the output, response, or dependent variable.

Example: The table and scatter plot show the height and the weight of five starters on a high school basketball team. Each player in the table is represented by a ____ on the scatter plot. The first dot represents the shortest, lightest player. From the scale on the x-axis, the shortest player is ___ inches tall; and from the scale on the y-axis, you see that he/she weighs ____ pounds.

Scatter plots are helpful in understanding patterns in _________ _______. For example, the above scatter plot shows that the relationship between height and weight is _____, ________, and has a ___________ slope.

Unit 4, Cluster 2, Learning Cycle F: Descriptive Statistics
I can fit a linear function to a scatter plot or data set that shows a linear association and fit an exponential function to a set of data and use these functions to solve problems.

A _____ __ _____ ____ (or “trend” line) is a straight line that best represents the data on a scatter plot.
This line may pass through some of the points, none of the points, or all of the points.

If you are looking for values that fall within the range of values plotted on the scatter plot, you are ___________. If you are looking for values that fall beyond the range of those values plotted on the scatter plot, you are ________________. Beware the dangers of extrapolation! The _______ away from the plotted values you go, the ______ reliable your prediction becomes.

Is there a linear association between the fat grams and the total calories in fast food? ________
Can we predict the number of calories from the number of grams of fat in a sandwich?_______
If a sandwich has 26 grams of fat, how many calories will it have based on the data?________
Create a scatter plot and find the line of best fit or the regression line. ___________________
1. Using your regression equation, find the total calories based upon 26 grams of fat. _________
2. Using your regression equation, find the total calories based upon 40 grams of fat. _________

A rapidly growing strain of bacteria has been discovered. Its growth rate is shown in the chart below. Prepare a scatter plot of the data with hours as the independent variable and the number of bacteria as the dependent variable.
1. Determine which regression model will best approximate your data. We will limit our choices to linear and exponential regression models. Let’s examine both. It makes sense that the ___________ model would best represent the data since exponential models are often used with population growth (even when the population is bacteria).
2. Write the regression equation for your model, rounding the values to three decimal places. _______________________
3. What is the correlation coefficient for this data and what does it tell you? ____________ ___________________________________________________________________
4. Using your regression equation, determine how many bacteria, to the nearest integer, will be present in 12 hours. ____________________________________________________
5. Using your regression equation, determine how many bacteria, to the nearest integer, will be present in 3.5 hours. ____________________________________________________

Unit 4, Cluster 2, Learning Cycle G: Descriptive Statistics
I can determine when a linear model is appropriate for a data set by plotting and analyzing residuals.

RESIDUALS
__________ (or error) represents unexplained (or residual) variation after fitting a regression model. A linear regression model is not always appropriate for the data. You can assess the appropriateness of the model by examining the residual plot.

The _______ between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual (e). The predicted value is the y-value generated by the linear regression model (or line of best fit). Each data point has ____ residual. Notice from the equation below that if an observed value (point) falls below the regression line, its residual will be ________, whereas points that fall above the regression line have a _________ residual value.
residual = observed value - predicted value
e = y - ŷ

RESIDUAL PLOTS
A _________ ________ is a graph that shows the residual values on the vertical axis and the independent (x) variable on the horizontal axis. If the points in a residual plot show no ________ ________, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

Below, the residual plots show three typical patterns.
(good fit for a linear model) (a better fit for a non-linear model)

Below, the table presents data from measuring the left foot and the left forearm of students in a math class. The scatter plot in the center suggests a ________ association. A linear function is also graphed and the equation given.
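The definition e = y - ŷ can be sketched in code. The foot and forearm table is not reproduced in this text, so the sketch below instead uses the age and daily energy pairs from this cycle's skill-based task. It assumes the line of best fit is an ordinary least-squares line, which the worksheet does not state, and the function names `fit_line` and `residuals` are illustrative only.

```python
# (age in years, average daily energy requirement) pairs from the skill-based task
data = [(1, 1110), (2, 1300), (5, 1800), (11, 2500), (14, 2800), (17, 3000)]


def fit_line(points):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def residuals(points, slope, intercept):
    """Residual e = observed y minus predicted y-hat, one per data point."""
    return [y - (slope * x + intercept) for x, y in points]


slope, intercept = fit_line(data)
for (x, y), e in zip(data, residuals(data, slope, intercept)):
    side = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={x:>2}  observed={y}  residual={e:8.1f}  ({side} the line)")
```

Plotting these residuals against x gives the residual plot; whether the signs look random or follow a curve is the evidence used to judge the linear model.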
This equation was used to calculate the expected values in the table. The difference between the actual measurement of the left forearm and the expected value gives us the residual for each x-coordinate. The residuals are plotted in the graph on the right. The residual plot shows a _______ pattern. Therefore, a ________ model is appropriate. The correlation coefficient will measure the strength of this linear relationship.

Skill-based Task
The following data shows the age and average daily energy requirements for male children and teens: (1, 1110), (2, 1300), (5, 1800), (11, 2500), (14, 2800), (17, 3000). Create a graph and find a linear function to fit the data. Using your function, what is the daily energy requirement for a male 15 years old? Would your model apply to an adult male? Explain.

Problem Task
Collect data on forearm length and height in a class. Plot the data and estimate a linear function for the data. Compare and discuss different student representations of the data and equations they discover. Could the equation(s) be used to estimate the height for any person with a known forearm length? Why or why not?

Unit 4, Cluster 3, Learning Cycle A: Descriptive Statistics
I can interpret the slope (rate of change) and intercept (constant term) of given linear models in context.

INTERPRET LINEAR MODELS
The scatter plot below displays data that relates the number of oil changes per year and the cost of engine repairs. If the line of best fit is given by the equation y = -70x + 630, what does the slope of -70 mean within the context of the problem? _________________ What does the y-intercept (the constant term) of 630 mean within the context of the problem?____________________

____________ is the average rate of change. What is the change in the cost of repairs for each oil change?______ This indicates that on average for each additional oil change per year, the cost of engine repairs will tend to ________ by $___.
The __________ of (0, ____) means that if there were zero oil changes, engine repairs would be predicted to cost about $____. From the graph, the x-intercept is about __, which means that a car owner would expect to spend nothing on engine repairs if they changed the oil __ times a year. Is this a sensible number of oil changes per year?______

Skill-based Task
Collect power bills and graph the cost of electricity compared to the number of kilowatt hours used. Find a function that models the data and tell what the intercept and slope mean in the context of the problem.

Problem Task
Create a poster of bivariate data with a linear relationship. Describe for the class the meaning of the data, including the meaning of the slope and intercept in the context of the data.

Unit 4, Cluster 3, Learning Cycle B: Descriptive Statistics
I can compute, using technology, and interpret the correlation coefficient of a linear fit.

CORRELATION COEFFICIENT
The correlation coefficient, __, measures the strength of the _______ association between two variables.

How to Interpret a Correlation Coefficient
The sign and the absolute value of r, the correlation coefficient, describe the __________ and the _________ of the relationship between two variables.
• The value of a correlation coefficient ranges between ___ and __.
• The _______ the absolute value of a correlation coefficient, the stronger the _______ relationship.
• The strongest linear relationship is indicated by a correlation coefficient of ___ or __.
• The weakest linear relationship is indicated by a correlation coefficient of __.
• A positive correlation means that as one variable gets larger, the other variable tends to get _______ as well.
• A negative correlation means that as one variable gets larger, the other variable tends to get _______.
• The correlation coefficient and the slope of the line will have the same ______.
The correlation coefficient, r, only measures _______ relationships.
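The interpretation rules above can be checked numerically. The worksheet asks for r to be computed with technology and does not give a formula, so the sketch below assumes the standard Pearson formula; the data values and the function name `pearson_r` are made up for illustration. It compares a rising line, a falling line, and a curved (quadratic) pattern over the same x-values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]            # hypothetical x-values
rising = [2 * x + 1 for x in xs]         # points exactly on a line, positive slope
falling = [-2 * x + 1 for x in xs]       # points exactly on a line, negative slope
curved = [x ** 2 for x in xs]            # strong relationship, but not linear

for label, ys in (("rising", rising), ("falling", falling), ("curved", curved)):
    print(label, pearson_r(xs, ys))
```

The curved data follow an exact rule (y = x²), yet the positive and negative products cancel in the numerator, which is why r alone cannot detect that relationship.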
A correlation of 0 therefore does not mean there is no relationship between two variables; rather, it means there is ___ linear relationship. (It is possible for two variables to have a zero correlation coefficient and a strong curvilinear relationship at the same time.)

SCATTER PLOTS & CORRELATION COEFFICIENTS
Several points are evident from the scatter plots.
• When the slope (average rate of change) of the line in the plot is _________, the correlation is _________; and vice versa.
• The strongest correlations (r = 1.0 and r = -1.0) occur when data points fall __________ on a straight line.
• The correlation becomes weaker as the data points become more __________ away from the line.
• If the data points have no particular pattern, the correlation is equal to _______.
• The correlation is affected by an ________, a data point that diverges greatly from the overall pattern of data and has a large residual. Compare the first scatter plot with the last scatter plot. The single outlier in the last plot greatly __________ the correlation (from 1.00 to 0.71).

Skill-based Task
The correlation coefficient of a given data set is 0.97. List three specific things this tells you about the data.
1.
2.
3.

Problem Task
Hypothesize the correlation between two sets of seemingly related data. Gather data to support or refute your hypothesis.

Unit 4, Cluster 3, Learning Cycle C: Descriptive Statistics
I can distinguish between correlation and causation.

CORRELATION & CAUSATION
Warning: Correlation does not imply causation!!

A value that shows the strength of the linear relationship between two variables is a ____________ ______________. However, a strong correlation between two sets of data _____ ____ mean that there is a cause-and-effect relationship between the two.

_________________ occurs only when the relationship between the two variables can be proven through a scientific experiment following strict guidelines. Only in this way can we rule out other factors that may affect the relationship that we see in the observed values.

EXAMPLES: When a scatter plot shows a correlation between two variables, even if it's a strong one, there is not necessarily a __________ and ________ relationship. Both variables could be related to some third variable that actually causes the apparent correlation. An apparent correlation simply could be the result of chance.

1. During the month of June, the number of new babies born at the Utah Valley Hospital was recorded for a week. Over the same time period, the number of cakes sold at Carlo’s Bakery in Hoboken, New Jersey was also recorded. From looking at the graph, it can be seen that a high positive correlation exists between these two sets of data. This must mean that as the number of babies born at Utah Valley Hospital increases, the number of cakes that are sold at Carlo’s Bakery will also increase. Is this true? ______________

2. Here's an example where an apparent correlation is actually the result of a hidden third variable. "Mr. Jones gave a math test to all the students in his school. He made the startling discovery that the taller students did better than the short ones. His conclusion was that 'as your height increases, so does your math ability.'" Of course, the hidden variable here is the age of the students, or the grade level they're in. Mr. Jones gave the same test to every student in grades one through twelve, so of course the students in the higher grades who had learned more math did better on the test. It wasn't the increase in the students' height that caused the math results to go up, but their increase in _______ ___________.

3. Scientists who use scatter plots to look for correlations between variables watch out for this hidden variable problem.
An American medical researcher wants to see if there is a link between a person's socio-economic status (how much money they have) and certain types of cancer. His research seems to indicate that there is a link: rich people seem to suffer from more cancers than poor people do. His conclusion is that being rich will make you more likely to get cancer. His rationale for concluding this is that the stress that comes with high-paying jobs could be a factor in causing some cancers. Sound logical? The connection the researcher described may actually exist, but there are certainly other possibilities. In the United States there is no universal free medical care, so rich people are far more likely to see a doctor than poor people. The higher rates of cancer in rich people may not be due to their wealth at all, but due instead to the fact that they visit their doctor more often (they can afford it), so more cancers are being _____________ for them.

Skill-based Task
Give an example of a data set that has strong correlation but no causation and describe why this is so. Give an example of a data set that has both strong correlation and causation and write a description of why this is so.

Problem Task
Find media artifacts that make claims of causation and evaluate them.
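The hidden-variable effect in the Mr. Jones example can be reproduced with made-up numbers. The sketch below is purely hypothetical: it invents five students in each of grades 1 through 12, lets grade level drive both height and test score, and arranges the within-grade variation so that height and score are unrelated inside any single grade. All values and names are assumptions for illustration, and r is computed with the standard Pearson formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical variation within a grade, chosen so that height and score
# are uncorrelated among classmates.
height_offsets = [-2, -1, 0, 1, 2]   # inches around the grade's typical height
score_offsets = [1, -2, 2, -2, 1]    # points around the grade's typical score

students = []  # (grade, height in inches, test score)
for grade in range(1, 13):
    for dh, ds in zip(height_offsets, score_offsets):
        # Grade level (the hidden variable) raises both height and score.
        students.append((grade, 40 + 2.5 * grade + dh, 5 * grade + ds))

heights = [h for _, h, _ in students]
scores = [s for _, _, s in students]
print("all grades together:", round(pearson_r(heights, scores), 2))

for grade in (1, 6, 12):
    hs = [h for g, h, _ in students if g == grade]
    ss = [s for g, _, s in students if g == grade]
    print("grade", grade, "only:", round(pearson_r(hs, ss), 2))
```

Comparing the whole-school value with the single-grade values shows how a strong overall correlation can come entirely from a third variable.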