Stat11-30Jan06

Two Quantitative Variables
Scatterplots
examples
how to draw them
Association
what to look for in a scatterplot
Correlation
strength of a linear relationship
how to calculate
good news and bad news
Paired vs. Unpaired Variables
Paired variables come from the
same data table.
Each record has one value of X
and one value of Y, and they go
together as a pair.
case #   Shoe size   IQ
1        11          115
2        7           120
3        7.5         100
4        8           102
5        45          160
Paired vs. Unpaired Variables
Germany
Unpaired variables come
from different tables
France
…or from different lines
of one table.
IN CHAPTER TWO WE’RE
DEALING WITH PAIRED
VARIABLES.
case #   Shoe size
1        11
2        7
3        7.5
4        8
5        12
6        10

case #   Shoe size
1        6.5
2        8
3        8
4        11
5        9
Paired vs. Unpaired Variables
case #   Country   Shoe size
1        France    11
2        Germany   7
3        Germany   7.5
4        France    8
5        France    12
6        France    10
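The difference between the two kinds of table can be sketched in Python. This is only an illustration built from the values in the tables above; the names `paired_table`, `shoe_table`, `france`, and `germany` are made up for the example.

```python
# Paired: each record carries one X (shoe size) and one Y (IQ) for the same case.
paired_table = [
    {"case": 1, "shoe_size": 11, "iq": 115},
    {"case": 2, "shoe_size": 7, "iq": 120},
    {"case": 3, "shoe_size": 7.5, "iq": 100},
    {"case": 4, "shoe_size": 8, "iq": 102},
    {"case": 5, "shoe_size": 45, "iq": 160},
]
pairs = [(rec["shoe_size"], rec["iq"]) for rec in paired_table]

# Unpaired: one variable, split into groups that sit on different lines of one table.
shoe_table = [
    (1, "France", 11),
    (2, "Germany", 7),
    (3, "Germany", 7.5),
    (4, "France", 8),
    (5, "France", 12),
    (6, "France", 10),
]
france = [size for _, country, size in shoe_table if country == "France"]
germany = [size for _, country, size in shoe_table if country == "Germany"]

print(pairs)            # every x has its own y
print(france, germany)  # groups of different sizes, with no natural pairing
```

The two country groups do not even have the same number of cases, so there is no way to line them up as (x, y) points for a scatterplot.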
Scatterplot
[Figure: CARS / BOATS data table with its scatterplot; axes labelled CARS and BOATS]
cigarettes.xls
Kinds of Association…
Positive vs. Negative
Strong vs. Weak
Linear vs. Non-linear
Made-up Examples
[Scatterplot: STATE AVE SCORE vs. PERCENT TAKING SAT]
[Scatterplot: IQ vs. SHOE SIZE]
[Scatterplot: JUDGE’S IMPRESSION vs. BAKING TEMP (axis ticks 250, 350, 450)]
[Scatterplot: LIFE EXPECTANCY vs. GDP PER CAPITA]
What to look for in a scatterplot…
Do the cases break up into separate clusters?
Are there outliers?
Is there an ASSOCIATION between the
variables? OR are they INDEPENDENT?
ALWAYS DRAW THE PICTURE !!!!
Scatterplots: Which variable goes where?
RESPONSE VARIABLE goes on Y axis
(“Y”)
(“dependent variable”)
EXPLANATORY VARIABLE goes on X axis
(“X”)
(“independent variable”)
If neither is really a response variable, it
doesn’t matter which variable goes where.
Scatterplots: Drawing Considerations
Don’t show the axes without a good reason
Don’t show gridlines without a good reason
Scales should cover the ranges of the variables
- outliers?
- no need to include 0
- what if same units?
CORRELATION
CORRELATION
(or, the CORRELATION COEFFICIENT)
measures the strength of a linear relationship.
If the relationship is non-linear, it measures the
strength of the linear part of the
relationship. But then it doesn’t tell the
whole story.
Correlation can be positive or negative.
Computing correlation…
1. Replace each variable with its
standardized version.
$x_i' = (x_i - \bar{x}) / s_x$
$y_i' = (y_i - \bar{y}) / s_y$
2. Multiply each pair ( xi’ times yi’ )
3. Take an “average” of the products
$r = \dfrac{\sum_i x_i' y_i'}{n - 1}$
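The three steps can be sketched in Python using only the standard library. The function name `correlation` is made up for this example; `statistics.stdev` is the sample standard deviation (it divides by n-1), which is assumed here to be the s used on the slide. The data are the shoe size and IQ values from the paired table earlier.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r of paired values, computed in three steps."""
    if len(xs) != len(ys):
        raise ValueError("paired variables need one x and one y per case")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)

    # Step 1: replace each variable with its standardized version.
    x_std = [(x - x_bar) / s_x for x in xs]
    y_std = [(y - y_bar) / s_y for y in ys]

    # Step 2: multiply each pair.
    products = [a * b for a, b in zip(x_std, y_std)]

    # Step 3: take an "average" of the products, dividing by n - 1.
    return sum(products) / (n - 1)

shoe_size = [11, 7, 7.5, 8, 45]
iq = [115, 120, 100, 102, 160]
r = correlation(shoe_size, iq)
print(round(r, 3))
```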
Computing correlation
$r = \dfrac{\sum_i x_i' y_i'}{n - 1}$
Numerator: sum of all the products
Left-hand side: r, or R, or Greek $\rho$ (rho)
Denominator: n-1, not n
Good things about correlation
It’s symmetric (the correlation of x and y is the same as the
correlation of y and x)
It doesn’t depend on scale or units
- adding a constant to either variable, or multiplying
it by a positive constant, doesn’t change r
(a negative multiplier only flips the sign of r)
- of course not; r depends only on the
standardized versions
r is always in the range from -1 to +1
+1 means perfect positive correlation; dots on line
-1 means perfect negative correlation; dots on line
0 means no relationship, OR no linear relationship
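These properties can be checked numerically. The sketch below uses made-up numbers, and the helper `correlation` is a hypothetical name implementing the standardize, multiply, divide by n-1 recipe.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

x = [1, 2, 3, 4, 5]   # made-up values
y = [2, 1, 4, 3, 6]   # made-up values

r_xy = correlation(x, y)
r_yx = correlation(y, x)                      # symmetry: same as r_xy

x_rescaled = [2.54 * v for v in x]            # change of units (positive multiplier)
y_shifted = [v + 100 for v in y]              # adding a constant
r_rescaled = correlation(x_rescaled, y_shifted)

r_up = correlation(x, [3 * v + 1 for v in x])     # dots on a rising line
r_down = correlation(x, [10 - 2 * v for v in x])  # dots on a falling line

print(r_xy, r_yx, r_rescaled, r_up, r_down)
```

Up to floating-point rounding, the first three values agree, and the last two come out as +1 and -1.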
Bad things about correlation
Sensitive to outliers
Misses non-linear relationships
Doesn’t imply causality
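The first two problems are easy to see with small examples. This sketch reuses the shoe size and IQ table from earlier (case 5, with shoe size 45, is the outlier) and a made-up curved relationship, y equal to x squared; `correlation` is a hypothetical helper name.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Sensitive to outliers: compare r with and without case 5.
shoe_size = [11, 7, 7.5, 8, 45]
iq = [115, 120, 100, 102, 160]
r_all = correlation(shoe_size, iq)
r_without_outlier = correlation(shoe_size[:4], iq[:4])

# Misses non-linear relationships: y is completely determined by x,
# yet the linear part of the relationship is nil.
x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]
r_curve = correlation(x, y)

print(round(r_all, 2), round(r_without_outlier, 2), round(r_curve, 2))
```

A single extreme case moves r from weak to very strong, and a perfect but curved relationship gives an r of essentially zero, which is why the picture should always be drawn first.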