Algebra 2
Class Notes
Correlation and Line of Best Fit
Given N data points, { (X1,Y1), (X2,Y2), (X3,Y3), … , (XN,YN) } , we can plot the
points on the Cartesian Plane (rectangular coordinate plane). In class, we are used
to being given data for which the points “line up nicely.”
But sometimes (actually, almost all the time in the real world), the points do not
form a perfectly straight line. In fact, they’re usually all over the place. We call
this type of plot a “scatter plot”.
If the scatter plot forms a pattern for which a clear “linear trend” can be seen, it is
reasonable to approximate the data with a single straight line and use that line to
predict values outside the given set of data points.
We call the degree of relationship between the two variables “correlation”.
Correlation gives us a measure of predictability.
If the scatter plot depicts an upward trend, we call it a “positive correlation”.
If the scatter plot depicts a downward trend, we call it a “negative correlation.”
When we draw a single line approximating the data, we call that line “the line of
best fit”. The sign of the correlation matches the sign of the slope of the line of
best fit: a positive correlation gives a positive slope, and a negative correlation
gives a negative slope.
If the points cluster tightly along the line of best fit, we call it a “strong
correlation.” If the points do cluster along some trend line, but not in a very
definitive manner, we call it “weak correlation.” The strength or weakness of
correlation is called the “magnitude” of correlation.
If no such trend is apparent, we say that there is “no correlation.”
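If you'd like a number to go with "strong," "weak," "positive," and "negative," one common choice is the correlation coefficient, usually written r. The notes above don't define it, so treat this as a side sketch: the function name `correlation` and the data lists are made up for illustration. The value of r falls between -1 and 1; its sign gives the direction of the trend, and the closer it is to -1 or 1, the more tightly the points hug a line.

```python
import math

def correlation(xs, ys):
    """Return the correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How X and Y vary together, and how each varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
up = [2, 4, 5, 4, 6]      # generally rising
down = [6, 4, 5, 4, 2]    # generally falling

print(correlation(xs, up))    # positive value: positive correlation
print(correlation(xs, down))  # negative value: negative correlation
```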
So how do we construct the equation of this “line of best fit”? Is there a simple
formula we can use?
Actually, yes there is! The equation for the “line of best fit” is:
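$$ Y = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - \left(\sum X\right)^2} \, X + \frac{\sum Y \sum X^2 - \sum X \sum XY}{N \sum X^2 - \left(\sum X\right)^2} $$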
Yeah, yeah, I know. You’re thinking of throwing something at me. You call that
simple? I’ve seen jet engine designs that are simpler than that!
Actually, it is pretty simple. The fancy Greek symbol ( Σ ) stands for “sum”. For
small sets of data, you can do this manually (if you have a lot of time and
patience). But for large sets of data, you need a computer or calculator to get the
answer (that is, if you want the answer before your 60th birthday).
But aside from doing LOTS of arithmetic, that’s all it is … arithmetic.
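To see that it really is just arithmetic, here is a minimal Python sketch of the standard least-squares calculation. The function name `line_of_best_fit` and the five data points are made up for illustration. The code adds up X, Y, X times Y, and X squared over all the points, then combines those four sums into a slope and a Y-intercept.

```python
def line_of_best_fit(points):
    """Return (slope, intercept) of the least-squares line through points."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)

    # Both the slope and the intercept share this denominator
    denom = n * sum_x2 - sum_x ** 2
    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    return slope, intercept

# Made-up data, for illustration only
data = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 6)]
m, b = line_of_best_fit(data)
print(f"Y = {m:.2f} X + {b:.2f}")  # prints "Y = 0.80 X + 1.80"
```

Note that the denominator is zero when every point has the same X value, so this sketch assumes at least two different X values.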
Now, for small sets of data, you can sometimes visually estimate where the line is.
For large sets of data, estimation is clearly a real guessing game. The following
two sample scatter plots depict lines of best fit.
So how do we estimate a line of best fit without using that hideously convoluted
formula?
Well, look at the simple scatter plot above (no, not that one … the one on the left!)
Notice how the line of best fit is fairly close to being a straight line connecting the
two end points. Given two points, can we easily find the equation of the straight
line that connects them?
Uh, well, let’s see. Suppose we take the two “end” points from the graph above.
Their coordinates are approximately (12,190) and (25,610).
The two points give us a slope of $\frac{610 - 190}{25 - 12} = \frac{420}{13} \approx 32.3$
Using this slope and one point, we can write the Point-Slope form of the line as:
Y – 190 = 32.3(X – 12) or
Y = 32.3 X – 197.6
Is this a reasonable approximation? Well, consider the point at which X = 22.
Substituting X = 22 into our equation gives us: Y = 32.3(22) – 197.6 ≈ $513
We can see from the graph that this is fairly close to the plotted value.
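The two-point shortcut is easy to check by machine. The sketch below uses the two end points read off the graph, (12,190) and (25,610); the function name `line_through` is made up for illustration. It computes the slope and Y-intercept without rounding along the way, then evaluates the line at X = 22.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve Y = slope * X + b for b
    return slope, intercept

# End points as estimated from the graph
m, b = line_through((12, 190), (25, 610))

# Estimate Y at X = 22 using the two-point line
estimate = m * 22 + b
print(f"slope = {m:.1f}, intercept = {b:.1f}, Y at X = 22: {estimate:.0f}")
```

Because the end points themselves are only read off the graph approximately, small differences in the last digit are nothing to worry about.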
Usually, the line of best fit is extended to predict values beyond the given data.
For example, suppose you were collecting data over a period of the last ten years
on the population of bears in the area. Let’s say your graph looks like this:
A line drawn connecting the two end points seems to reasonably estimate the data.
Is it the exact line of best fit? Of course not. But it’s close and probably close
enough for our needs.
Using this line, we can predict the increased bear population for the year 2015.
In general, when “eye-balling” the best-fit line, it’s reasonable to have an equal
number of data points above and below the line. A better approach is to have the
sum of the distances above the line equal the sum of the distances below the line.
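Both "eye-ball" checks can be spelled out as a short calculation. In this sketch, the function name `balance_check`, the data points, and the candidate line are all made up for illustration, and "distance" is assumed to mean the vertical gap between a point and the line. Points sitting exactly on the line count for neither side.

```python
def balance_check(points, slope, intercept):
    """Count points above/below a line and total their vertical gaps."""
    above = below = 0
    dist_above = dist_below = 0.0
    for x, y in points:
        gap = y - (slope * x + intercept)  # positive if the point is above
        if gap > 0:
            above += 1
            dist_above += gap
        elif gap < 0:
            below += 1
            dist_below += -gap
    return above, below, dist_above, dist_below

# Made-up data; the candidate line Y = 1 X + 1 connects its two end points
data = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 6)]
above, below, dist_above, dist_below = balance_check(data, 1, 1)
print(above, below, dist_above, dist_below)  # prints "2 1 2.0 1.0"
```

Here two points sit above the line and only one below, and the distances don't balance either, which suggests nudging the line upward a bit.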
Remember, this technique is only an approximation. The “ugly formula” we saw
before gives the BEST line of fit. It is called the “line of regression” or “line of
least squares.”
You’ll learn more about this when you take Statistics!