
Topic 26
Scatter Plot - A graphical display of the data (Page 570)
Horizontal axis – Explanatory variable
Vertical axis – Response variable
Association – Two variables display association if knowing the value of one variable is useful in predicting the value of the other variable. (Page 571)
Three aspects of the association between quantitative
variables: (Page 571)
Direction (Positive or negative)
Strength (Strong, Moderate, or Weak)
Form (Linear or Curved)
A categorical variable can be incorporated into a
scatter plot by constructing a labeled scatter plot,
which assigns different labels to the dots based on the
category of the observational unit.
For example, you might indicate observations coming from
males with the label M and from females with the label F.
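As a sketch of how such a labeled scatter plot might be built in software (the foot-length, height, and sex values below are made up for illustration):

import matplotlib.pyplot as plt

# Hypothetical data: foot length (cm), height (in), and sex label
foot = [24, 26, 27, 25, 28, 23]
height = [64, 68, 70, 66, 72, 63]
sex = ["F", "M", "M", "F", "M", "F"]

plt.scatter(foot, height)
for x, y, label in zip(foot, height, sex):
    # Label each dot M or F, offset slightly so the text is readable
    plt.annotate(label, (x, y), textcoords="offset points", xytext=(4, 4))
plt.xlabel("Foot length (explanatory variable)")
plt.ylabel("Height (response variable)")
plt.show()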
Remember: an observed association does not imply that a cause-and-effect relationship exists between the two variables.
Topic 27
Correlation measures the degree of linear association between
two quantitative variables.
But even when two variables display a nonlinear relationship, the
correlation between them still might be quite high when there is a
strong increasing or decreasing trend.
With such data, the relationship is clearly curved rather than linear, and yet the correlation is still fairly high. Do not assume from a high correlation coefficient that the relationship between the variables must be linear.
Always look at a scatter plot, in conjunction with the correlation coefficient, to assess the form (linear or not) of the association.
No matter how close a correlation coefficient (r) is to 1, and no matter how strong the association between two variables, a cause-and-effect conclusion cannot necessarily be drawn from observational data.
There are far more plausible explanations than cause and effect for why countries with many televisions per thousand people tend to have long life expectancies.
For example, the technological sophistication of the
country is related to both number of televisions and life
expectancy.
The correlation coefficient (r) is a number that measures the
direction and strength of linear association between two
quantitative variables.
A correlation coefficient is a number! In fact, it is a number
between -1 and 1, inclusive.
Always examine a scatter plot in addition to calculating a correlation coefficient. A clear nonlinear relationship can have a small (close to zero) correlation, and a correlation can be close to -1 or 1 even if the relationship follows a curve or other nonlinear pattern.
The slope, or steepness, of the points in a scatter plot is
unrelated to the value of the correlation coefficient.
If the points fall on a perfectly straight line with a positive slope,
then the correlation coefficient equals 1.0 whether that slope is
very steep or not steep at all.
What matters for the magnitude of the correlation is how
closely the points concentrate around a line, not the steepness
of a line.
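One way to see this, as a sketch with invented numbers: two datasets whose points fall exactly on lines of very different steepness both give r = 1.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_steep = 100 * x   # points on a very steep line
y_flat = 0.1 * x    # points on a nearly flat line

# Both correlations are exactly 1: the points concentrate perfectly
# around a line, and steepness plays no role in r.
print(np.corrcoef(x, y_steep)[0, 1])   # 1.0
print(np.corrcoef(x, y_flat)[0, 1])    # 1.0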
Before calculating a correlation on the calculator, you need to enable the diagnostics option:
Press 2nd, 0 and scroll down to find DiagnosticOn, then press ENTER twice.
Enter the data in lists L1 and L2:
Go to STAT, EDIT.
Run least squares regression:
Go to STAT, CALC, 8: LinReg(a+bx).
Enter L1, L2 (the lists where you entered the data).
Press ENTER to calculate the correlation coefficient (r) and/or the coefficient of determination (r²).
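Outside the calculator, the same two quantities can be computed in a few lines of Python (the data values here are hypothetical):

import numpy as np

x = np.array([24, 26, 27, 25, 28, 23])   # explanatory values (list L1)
y = np.array([64, 68, 70, 66, 72, 63])   # response values (list L2)

r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
print("r =", r)
print("r^2 =", r**2)          # coefficient of determination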
Topic 28
The equation of a generic line can be written as ŷ = a + bx (in algebra class, y = mx + b), where y denotes the response variable. The terms “least squares line” and “regression line” are used interchangeably.
x denotes the explanatory variable (also called the predictor variable).
For example, x might represent foot length and y height; it is good form to use the variable names in the equation.
a represents the y-intercept.
b represents the slope of the line.
The caret on the y (read as “y-hat”) indicates that its values are predicted, not actual, heights.
One way to measure the “fit” of a line is to calculate the residuals for all of the observational units.
A residual is the difference between the
observed y value and the y value predicted by
your line for the corresponding x value.
In other words, the residual is the vertical
distance from an observation to the regression
line.
One of the primary uses of regression is prediction.
You can use the regression line to predict the value of the y-variable for a given value of the x-variable simply by plugging that value of x into the equation of the regression line.
This process is equivalent to finding the y-value of the point on the regression line corresponding to the x-value of interest.
A more common criterion for determining the “best” line is to look at the sum of squared residuals (SSE).
The line that achieves the minimum value of the sum of squared residuals is called the least squares line, or the regression line.
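A minimal sketch tying prediction, residuals, and SSE together, using NumPy's least squares fit on invented foot-length and height data:

import numpy as np

x = np.array([24, 26, 27, 25, 28, 23])   # foot lengths (hypothetical)
y = np.array([64, 68, 70, 66, 72, 63])   # heights (hypothetical)

b, a = np.polyfit(x, y, 1)    # slope b and intercept a of the least squares line
y_hat = a + b * x             # predicted heights for the observed x-values
residuals = y - y_hat         # observed minus predicted (vertical distances)
sse = np.sum(residuals**2)    # sum of squared residuals; the fitted line minimizes this

print("predicted height at foot length 26.5:", a + b * 26.5)
print("SSE =", sse)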
Remember to provide measurement units when reporting
predictions. In other words, be clear that the predicted
height is in inches, not centimeters or any other units.
Interpolation means trying to predict the response variable for values of the explanatory variable within the range of those contained in the data.
Extrapolation means trying to predict the response variable for values of the explanatory variable beyond the range of those contained in the data.
When you have no information about the behavior of the data outside the values contained in your dataset (e.g., you have no reason to believe the relationship between height and foot length remains roughly linear beyond these values), extrapolation is not advisable.
An observation is considered influential if removing it from the dataset substantially changes the least squares regression equation.
Typically, observations with extreme explanatory (x) values (far below or far above the sample mean x-bar) have more potential to be influential.
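One hedged way to check influence numerically is to refit the line with and without the suspect point and compare the slopes; the data below are invented so that the last observation has an extreme x-value:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 15.0])   # last x-value is far above the rest
y = np.array([2, 4, 6, 8, 10, 12.0])

b_all, a_all = np.polyfit(x, y, 1)               # fit using every observation
b_drop, a_drop = np.polyfit(x[:-1], y[:-1], 1)   # fit with the extreme point removed

# If the slope changes substantially, the removed point is influential.
print("slope with the point:   ", b_all)    # about 0.62
print("slope without the point:", b_drop)   # exactly 2.0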
The coefficient of determination is equal to the square of the correlation coefficient, so it is denoted r² (where r represents the correlation coefficient).
r² does not represent the proportion of points that fall on the line, or the proportion of the y-variable that is explained by the x-variable.
Rather, r² is the proportion of the variability in the y-variable that is explained by the least squares line with the x-variable.
Of course, when writing your interpretation in a given
context, use the variable names rather than generic x
and y labels.
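The interpretation can be checked directly, since for a least squares line r² equals 1 - SSE/SST, the fraction of the variability in y accounted for by the line. A sketch with invented data:

import numpy as np

x = np.array([24, 26, 27, 25, 28, 23])
y = np.array([64, 68, 70, 66, 72, 63])

b, a = np.polyfit(x, y, 1)
sse = np.sum((y - (a + b * x))**2)   # variability left over around the line
sst = np.sum((y - y.mean())**2)      # total variability in y about its mean

print("1 - SSE/SST =", 1 - sse / sst)
print("r^2         =", np.corrcoef(x, y)[0, 1]**2)   # the same number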
You have not yet considered how to calculate the slope and intercept coefficients of the least squares line. Let the equation of a generic least squares line be ŷ = a + bx.
The most convenient expressions for calculating
the intercept and slope coefficients for the least
squares line involve the means and standard
deviations of the two variables, along with the
correlation coefficient between them.
It turns out the slope can be calculated from b = r * (sy / sx), where sy and sx are the standard deviations of y and x.
The intercept coefficient can then be calculated from a = y-bar - b * x-bar.
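These two formulas require only five summary statistics. A sketch verifying them against a direct fit (values invented):

import numpy as np

x = np.array([24, 26, 27, 25, 28, 23])
y = np.array([64, 68, 70, 66, 72, 63])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)   # b = r * sy / sx (sample standard deviations)
a = y.mean() - b * x.mean()             # a = y-bar - b * x-bar

print("a =", a, " b =", b)
print("np.polyfit gives [b, a]:", np.polyfit(x, y, 1))   # same slope and intercept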
When a straight line is not the best
mathematical model for a relationship, you can
often transform one or both variables to make
the association more linear.
A transformation is a mathematical function
applied to a variable, re- expressing that
variable on a different scale.
Common transformations include logarithm,
square root, and other powers.
Often trial and error is needed to select the
transformation that establishes a linear
relationship.
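As a sketch of that trial-and-error process, the data below are generated to follow an exponential curve, so taking the logarithm of y makes the association linear:

import numpy as np

x = np.arange(1, 11, dtype=float)
y = 2.0 * np.exp(0.5 * x)          # a clearly curved (exponential) relationship

r_raw = np.corrcoef(x, y)[0, 1]          # correlation on the original scale
r_log = np.corrcoef(x, np.log(y))[0, 1]  # correlation after re-expressing y as log(y)

# The log transformation makes this association exactly linear,
# so r_log is 1 while r_raw is noticeably smaller.
print("r with raw y:", r_raw)
print("r with log(y):", r_log)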
Enter the data in lists L1 and L2:
Go to STAT, EDIT.
Run least squares regression:
Go to STAT, CALC, 8: LinReg(a+bx).
Enter L1, L2 (the lists where you entered the data).
Press ENTER to calculate the regression line.
Notice that the word “coefficient” appears often here. Be especially careful not to confuse a slope coefficient with a correlation coefficient.
A common theme in statistical modeling is to think of each data point as being composed of two parts:
the part that is explained by the model (often called the fit), and
the “leftover” part (often called the residual), which is the result either of chance variation or of variables you have not yet considered or measured.
In the context of least squares regression, the fitted value for an observation is simply the y-value that the regression line would predict for the x-value of that observation (i.e., the fitted value is ŷ).
The residual is the difference between the actual y-value and the fitted value ŷ (residual = actual - fitted), so the residual measures the vertical distance from the observed y-value to the regression line.
Be sure to subtract in the correct order (observed minus predicted) when calculating a residual.
(Remember that points above the line have positive residuals.)
Never take a prediction very seriously if it results
from extrapolating well beyond the actual data.
Remember not to generalize from the sample data
to a larger population unless the sample was
drawn randomly or you have some other reason to
believe the sample is representative of the
population.
Exercise 28-26: Cricket Thermometers – Page 642