
AP Statistics – Test 2 Review
***DISCLAIMER – This does not include all concepts you are responsible for knowing. It is just to help organize your
thoughts.***
Essay Questions: One of these will be on your test.
 Explain the difference between the correlation coefficient and the coefficient of determination.
 Discuss at least three facts about correlation.
 Discuss at least two facts about the LSRL.
 Explain how to construct a residual plot from a given set of data and how to use it to assess the fit of a linear model.
 Why can a conditional distribution be more useful to analyze than a marginal distribution?
 Why is it necessary to consider all possible lurking variables when analyzing a set of data?
Chapter 3: Examining relationships
Barron’s Topic 4
Big Ideas
 Scatterplots
 Correlation
 Least-Squares Regression Line
Important vocabulary
Univariate data – observation on a single characteristic of an individual
Bivariate data – observations on two characteristics from the same individual
Response Variable (dependent) – outcome of a study
Explanatory Variable (independent) – helps explain or influences change in the response variable
Scatterplot – used to display bivariate numerical data
Correlation coefficient 𝒓 - describes the strength of a straight-line relationship between two variables
Regression – a modeling process that attempts to find and express the relationship between two variables using lines, curves,
and other more complicated functions
Least Squares Regression Line – the “line of best fit,” which makes the sum of the squared vertical distances of the points
from the line as small as possible
Interpolation (GOOD) – estimating predicted values between known values
Extrapolation (BAD) – prediction beyond (outside) the range of known values
Residual – difference between the actual observed value and the predicted or expected value in the response (𝑦 − 𝑦̂ )
o vertical deviations from the regression line
o prediction error
Residual Plot – a scatterplot of the (𝑥, 𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙) ordered pairs
Coefficient of Determination 𝒓𝟐 (COD) – measures the success of the regression model in representing the data
Outlier – an observation that is far outside the overall pattern – an individual point with a large residual
Influential Point – a point which has a strong influence on the position of the regression line because of its extreme position
on the scale. Warning: Influential points pull the line close to them, so they often have small residuals.
Confounding Variable – a variable that is likely to affect the response variable AND is systematically related to the
explanatory variable, so that the effects of the two explanatory variables on the response can’t be separated
Lurking Variable – sometimes referred to as “common response” – a variable that drives two other variables, creating the
spurious appearance of a relationship
I. Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the same individual.
Scatterplots may be the most common and most effective display for two quantitative sets of data.
How to Construct:
1. 𝑥-axis: explanatory variable
𝑦-axis: response variable
2. Label axes!
3. Scale evenly… uniform intervals… well-defined
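A minimal sketch of these construction steps in Python with matplotlib (the library choice and the data are assumptions for illustration only):

```python
import matplotlib.pyplot as plt

# Hypothetical data: hours studied (explanatory) vs. exam score (response)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 60, 57, 68, 74, 71, 83, 90]

plt.scatter(hours, score)      # explanatory on the x-axis, response on the y-axis
plt.xlabel("Hours studied")    # label the axes!
plt.ylabel("Exam score")       # matplotlib spaces the ticks evenly by default
plt.show()
```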
What to look for:
 Form (or model): linear or non-linear
 Direction: positive, negative, none
 Strength: weak, moderate, strong
 Unusual Features: outliers, clusters
II. Correlation
Formula:
r = (1/(n − 1)) ∑ ((x_i − x̄)/s_x)((y_i − ȳ)/s_y) = (1/(n − 1)) ∑ z_x z_y
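In words: standardize each variable, multiply the paired z-scores, and average them (dividing by n − 1). A quick sketch of that computation in Python with the standard-library statistics module, reusing the made-up data from the scatterplot sketch above:

```python
import statistics as stats

# Same hypothetical data as the scatterplot sketch above
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 60, 57, 68, 74, 71, 83, 90]

n = len(x)
x_bar, y_bar = stats.mean(x), stats.mean(y)
s_x, s_y = stats.stdev(x), stats.stdev(y)      # sample standard deviations

# r = (1/(n-1)) * sum of the products of the paired z-scores
r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)) / (n - 1)
print(round(r, 4))   # should match stats.correlation(x, y) on Python 3.10+
```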
Properties of Correlation – these help guide the interpretation of 𝒓
1. Correlation is a measure of the direction and strength of the linear relationship between two quantitative
variables. Correlation requires that both variables be quantitative.
2. The value of 𝑟 is always a number between −1 and +1. The sign of 𝑟 gives the direction of the correlation.
3. A correlation coefficient of 𝑟 = 1 occurs only when all the points in a scatterplot of the data lie exactly on a
straight line that slopes upward. Similarly, 𝑟 = −1 only when all the points lie exactly on a downward-sloping line.
4. A low value of 𝑟 does not mean there is no relationship between the two variables. Values of 𝑟 near zero
indicate a very weak linear relationship. There could still be a strong association of some form other than linear.
Correlation does not describe curved relationships between variables, no matter how strong they are. You can
calculate a correlation coefficient for any pair of variables but applying this formula would be misleading if the
relationship is not linear.
5. Correlation is a “pure number” – it has no units. The value of 𝑟 does not depend on the unit of measurement for
either variable. It makes no difference if we change the units of measurement of 𝑥, 𝑦, or both.
6. The value of 𝑟 does not depend on which of the two variables is considered 𝑥 and which is considered 𝑦 in
calculating the correlation. Correlation treats 𝑥 and 𝑦 symmetrically. The correlation of 𝑥 with 𝑦 is the same as the
correlation of 𝑦 with 𝑥. Correlation does not require an explanatory variable and a response variable and thus
makes no distinction between them.
7. Correlation is not affected by changes in the center or scale of either variable. Correlation depends only on the
𝑧-scores, and they are unaffected by changes in center or scale.
8. Like the mean and standard deviation, the correlation is not resistant: 𝑟 is strongly affected by a few outlying
observations. An outlier can make an otherwise small correlation look big or hide a large correlation. It can even
give an otherwise positive association a negative correlation coefficient (and vice versa)! When you see a strong
outlier, it may be a good idea to report the correlations with and without the point.
9. Correlation is not a complete summary of two-variable data, even when the relationship between the variables
is linear. You should give the means and standard deviations of both 𝑥 and 𝑦 along with the correlation.
10. This value of 𝑟 is also called the Pearson Product-Moment Correlation Coefficient.
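Properties 5 and 7 are easy to check numerically. A small sketch (hypothetical heights and weights; unit conversions are just linear rescalings):

```python
import statistics as stats

# Hypothetical heights (inches) and weights (pounds)
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 160, 175]

r_original = stats.correlation(height_in, weight_lb)             # Python 3.10+
r_converted = stats.correlation([2.54 * h for h in height_in],   # inches -> cm
                                [0.4536 * w for w in weight_lb]) # lb -> kg

# The two values agree: r has no units and ignores changes of center or scale
print(round(r_original, 6), round(r_converted, 6))
```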
What Can Go Wrong?
 Don’t say “correlation” when you mean “relationship” or “association.”
 Don’t correlate categorical variables.
 Watch out for lurking variables.
 Don’t confuse correlation with causation. Correlation does not imply causation.
**Avoid using the term “causes” unless the data are the result of an experiment where there is overwhelming
evidence that changes in the explanatory variable actually do cause changes in the response variable!!
III. Regression
A. Find the equation of the regression line
First, recall the algebra version of a linear equation: 𝑦 = 𝑚𝑥 + 𝑏
Now, the statistics version: 𝑦̂ = 𝑎 + 𝑏𝑥
The slope: 𝑏 = 𝑟 · (s_y / s_x)
The 𝒚-intercept: 𝑎 = 𝑦̅ − 𝑏𝑥̅
The slope is the predicted average change in the response variable for each additional unit change in
the explanatory variable.
Along the regression line, a change of one standard deviation in 𝑥 corresponds to a change of 𝑟 standard
deviations in 𝑦.
***Do not confuse the slope 𝑏 of the LSRL with the correlation 𝑟. The correlation and the slope are not the
same thing but they are closely connected in that 𝑟 is contained in the formula for 𝑏.
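A sketch of the computation with Python 3.10+’s statistics module (same made-up data as before); statistics.linear_regression fits the same least-squares line directly, so the two results should agree:

```python
import statistics as stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 60, 57, 68, 74, 71, 83, 90]

r = stats.correlation(x, y)
b = r * stats.stdev(y) / stats.stdev(x)   # slope: b = r * (s_y / s_x)
a = stats.mean(y) - b * stats.mean(x)     # intercept: a = y-bar - b * x-bar
print(f"y-hat = {a:.2f} + {b:.2f}x")

fit = stats.linear_regression(x, y)       # should agree with a and b above
print(fit.intercept, fit.slope)
```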
The residuals are the vertical distances from the observed 𝑦-values to the line and are calculated by 𝑦 − 𝑦̂. A point that
lies above the regression line has a positive residual (the line underestimates it), and a point that lies below the
regression line has a negative residual (the line overestimates it).

Anytime you calculate the equation for an LSRL on your calculator, the residuals are stored in a list called
RESID. You can use this list to construct a residual plot.
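Off the calculator, the same construction is a few lines of Python (matplotlib assumed; each residual is observed minus predicted):

```python
import statistics as stats
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 60, 57, 68, 74, 71, 83, 90]

fit = stats.linear_regression(x, y)                    # Python 3.10+
residuals = [yi - (fit.intercept + fit.slope * xi)     # y - y-hat
             for xi, yi in zip(x, y)]

plt.scatter(x, residuals)        # plot the (x, residual) ordered pairs
plt.axhline(0)                   # center line: residual = 0
plt.xlabel("x")
plt.ylabel("Residual (y - y-hat)")
plt.show()
```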
B. Examine effectiveness of “fit”
1. Is a line an appropriate way to summarize the relationship?
2. Are there any unusual aspects of the data?
A residual plot is used to answer these questions.
Desirable: Points are scattered with no obvious pattern (this means that there should not be any signs of curvature).
**If there is curvature, then the relationship is not linear.
**If there are isolated points, then the prediction for those points was not very accurate (greater error).
C. How accurate will the prediction be?
The coefficient of determination 𝑟² tells us that ____% of the variation in the response variable is explained by the
LSRL relating 𝑦 to 𝑥. The other _____% is individual variation among subjects that is not explained by the linear
relationship (due to some other factor).
Variability in 𝒚 = influence of 𝒙 + error
Another way to look at 𝑟² is that it gives the percent of prediction error we got rid of by going from predicting 𝑦
with a single number, its mean 𝑦̅, to predicting it with the regression line based on 𝑥.
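A sketch of that decomposition in Python: SST is the squared error from predicting every 𝑦 with 𝑦̅ alone, SSE is the squared error left over after using the line, and 𝑟² = 1 − SSE/SST (same made-up data as above):

```python
import statistics as stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 60, 57, 68, 74, 71, 83, 90]

fit = stats.linear_regression(x, y)          # Python 3.10+
y_bar = stats.mean(y)

sse = sum((yi - (fit.intercept + fit.slope * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)     # error using y-bar alone

print(round(1 - sse / sst, 4))                 # r^2: the "error we got rid of"
print(round(stats.correlation(x, y) ** 2, 4))  # same number: the square of r
```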
Top Ten LSRL Facts and Features
10. The point (𝑥̅ , 𝑦̅) is always on the LSRL, regardless of whether or not that point exists as a data point in the scatterplot.
**called the “mean-mean” point or “point of averages”
9. ∑ residuals = 0
(the average error is zero)
8. In an LSRL residual plot, the total of the absolute lengths of the points below the center line must always equal the total
above it. This is not true for other types of curve-fitting (median-median, exponential, logistic, logarithmic, etc.).
7. The LSRL is not resistant to outliers. The reason is that none of 𝑟, s_x, or s_y is resistant, and the LSRL’s slope is
𝑏 = 𝑟 · (s_y / s_x). What a combination…
6. Don’t extrapolate with LSRL! You can use LSRL for prediction only on the domain of known values. But note, as long as 𝑟 2
is reasonably strong and the linear fit is appropriate (as judged from the residual plot), it is OK to use LSRL for prediction even
if no cause-and-effect relationship exists between explanatory (𝑥𝑖 ) and response (𝑦𝑖 ).
5. 𝑦̂ = 𝑏0 + 𝑏1 𝑥 (AP notation)
𝑦̂ = 𝑎 + 𝑏𝑥 (textbook and STAT CALC 8 notation)
4. Random-looking residual plots are desirable. Common LSRL plot problems:
Trumpet: The standard deviation of the residuals changes with 𝑥. In other words, the amount of vertical “scattering”
changes noticeably for different values of 𝑥.
Bowl-shaped: Try exponential fit instead. The classic example is population growth or other similar growth where
the rate of growth feeds upon itself. (or maybe a power or polynomial fit)
Dome-shaped: Try log fit instead.
S-shaped or wavy: Try logistic fit (if appropriate) or sinusoidal fit (if many waves).
Clusters: Separate the clusters and try fitting each cluster separately.
3. A residual plot can never show a linear trend. (If it did, we would just tilt the LSRL to eliminate the tilt in the residual plot!)
2. LSRL means “Least Squares Regression Line,” since the LSRL minimizes ∑(residual)², the sum of the squared residuals.
And now (drum roll, please)….The #1 LSRL fact or feature:
1. The LSRL is unique.
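Facts 10 and 9 are easy to verify numerically; a quick sketch with the same made-up data used earlier:

```python
import statistics as stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [52, 60, 57, 68, 74, 71, 83, 90]

fit = stats.linear_regression(x, y)          # Python 3.10+
x_bar, y_bar = stats.mean(x), stats.mean(y)

# Fact 10: the mean-mean point (x-bar, y-bar) lies on the LSRL
print(abs((fit.intercept + fit.slope * x_bar) - y_bar) < 1e-9)

# Fact 9: the residuals sum to zero (up to floating-point rounding)
residuals = [yi - (fit.intercept + fit.slope * xi) for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-9)
```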
Chapter 4 - §2 Relationships Between Categorical Variables
Barron’s Topic 5
Big Ideas
 Two-Way Frequency Tables
 Marginal & Conditional Distributions
Important Vocabulary
Two-Way Table – used to display the frequency of bivariate categorical data
Cells – gives observed count for each combination of variables
Column Variable – explanatory variable in columns
Row Variable – response variable across rows
Joint Distribution – proportion for each combination of values
Marginal Distribution – distribution of the row and column variables separately; there are two marginal distributions
Conditional Distribution – distribution of one variable for just those cases that satisfy a condition on another variable;
there are as many conditional distributions as there are rows and columns
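A sketch of these distributions with pandas (a hypothetical gender-by-subject data set; pd.crosstab builds the two-way table):

```python
import pandas as pd

# Hypothetical bivariate categorical data
df = pd.DataFrame({
    "gender":  ["M", "M", "F", "F", "M", "F", "F", "M", "F", "M"],
    "subject": ["Math", "Art", "Math", "Math", "Art",
                "Art", "Math", "Math", "Art", "Art"],
})

# Two-way table of counts, with row/column totals in the margins
print(pd.crosstab(df["gender"], df["subject"], margins=True))

# Joint distribution: proportion for each combination of values
joint = pd.crosstab(df["gender"], df["subject"], normalize="all")
print(joint)
print(joint.sum(axis=0))   # marginal distribution of subject (column sums)

# Conditional distribution of subject given gender (each row sums to 1)
print(pd.crosstab(df["gender"], df["subject"], normalize="index"))
```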
Simpson’s Paradox
 An association/comparison that holds for all of several groups can reverse direction when the data are combined to
form a single group
 Actually an apparent conflict based on incomplete information
 Results from greatly differing sizes of subgroups
 Need to dig deeper to resolve “paradox”!
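A numeric sketch of the reversal, using the success counts from the classic kidney-stone example often quoted for this paradox:

```python
# Success counts and totals for two treatments within two subgroups
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

# Within each subgroup, treatment A has the higher success rate
for name, g in groups.items():
    print(name, {t: f"{s}/{n} = {s/n:.0%}" for t, (s, n) in g.items()})

# Combine the subgroups: the comparison reverses (B now looks better),
# because the subgroup sizes differ greatly between the two treatments
for t in ("A", "B"):
    s = sum(g[t][0] for g in groups.values())
    n = sum(g[t][1] for g in groups.values())
    print(t, f"{s}/{n} = {s/n:.0%}")
```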