Statistics I
Chapter 3: Bivariate data analysis
Contents
3.1 Two-way tables
- Bivariate data
- Definition of a two-way table
- Joint absolute/relative frequency distribution
- Marginal frequency distribution
- Conditional frequency distribution
3.2 Correlation
- Graphical display: scatterplot
- Types of relationships between numerical variables
- Measures of linear association
Recommended reading
- Peña, D., Romo, J., 'Introducción a la Estadística para las Ciencias Sociales': Chapters 7, 8, 9
- Newbold, P., 'Estadística para los Negocios y la Economía' (2009): Chapter 12
Bivariate data
- Bivariate data are obtained when we observe two characteristics (numerical or categorical) of one individual/object.
- Notation: a sample of n bivariate observations is written as n pairs
  $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
Summarizing bivariate data: two-way table
- In order to summarize the data, we create a two-dimensional table
  - with the classes/class intervals of one variable in the rows
  - and the classes/class intervals of the other variable in the columns
- The corresponding frequencies (absolute or relative) appear in the cells at the intersection of the given row and the given column
- When one variable is qualitative, the two-way table is called a contingency table
Two-way table: joint absolute frequency distribution
- Two-way table with k rows (classes of X) and m columns (classes of Y)

X \ Y  | y_1   ...  y_j   ...  y_m  | Total
x_1    | n_11  ...  n_1j  ...  n_1m | n_1•
...    | ...        ...        ...  | ...
x_i    | n_i1  ...  n_ij  ...  n_im | n_i•
...    | ...        ...        ...  | ...
x_k    | n_k1  ...  n_kj  ...  n_km | n_k•
Total  | n_•1  ...  n_•j  ...  n_•m | n_••

Notation:
- $n_{ij}$: absolute frequency of cell (i, j)
- $n_{\bullet\bullet}$: sample size
- Row i total: $n_{i\bullet} = n_{i1} + n_{i2} + \cdots + n_{im}$
- Column j total: $n_{\bullet j} = n_{1j} + n_{2j} + \cdots + n_{kj}$
Joint absolute frequency distribution
Example: Sixty-four men were selected and the following two variables were recorded:
Y = Color of the man's mother's eyes {Bright, Dark}
X = Color of the man's eyes {Bright, Dark}
$(x_1, y_1) = (B, B), (x_2, y_2) = (B, D), \ldots, (x_{64}, y_{64}) = (D, D)$

X \ Y   | Bright  Dark | Total
Bright  | 23      12   | 35
Dark    | 17      12   | 29
Total   | 40      24   | 64

- $(x_2, y_2) = (B, D)$ means that the 2nd man had Bright eyes and his mother had Dark eyes
- 23 men had Bright eyes and their mothers also had Bright eyes
- 29 men had Dark eyes
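The eye-color table can be rebuilt from the raw pairs by counting how often each pair occurs. The following is a minimal Python sketch of that idea; the helper name `two_way_table` is hypothetical, and the 64 pairs are regenerated from the cell counts because the slides do not list the individual observations.

```python
from collections import Counter

# Regenerate the 64 (x, y) pairs from the cell counts of the example:
# x = color of the man's eyes, y = color of his mother's eyes.
cell_counts = {
    ("Bright", "Bright"): 23, ("Bright", "Dark"): 12,
    ("Dark", "Bright"): 17, ("Dark", "Dark"): 12,
}
pairs = [pair for pair, count in cell_counts.items() for _ in range(count)]

def two_way_table(pairs):
    """Return (row classes, column classes, matrix of absolute frequencies n_ij)."""
    counts = Counter(pairs)
    rows = sorted({x for x, _ in pairs})
    cols = sorted({y for _, y in pairs})
    table = [[counts[(x, y)] for y in cols] for x in rows]
    return rows, cols, table

rows, cols, table = two_way_table(pairs)
row_totals = [sum(r) for r in table]        # n_i. (one total per class of X)
col_totals = [sum(c) for c in zip(*table)]  # n_.j (one total per class of Y)
n = sum(row_totals)                         # n_.. (sample size)

for x, r, total in zip(rows, table, row_totals):
    print(x, r, total)
print("Total", col_totals, n)
```

The row totals give the distribution of the men's eye color and the column totals that of the mothers' eye color; both sum to the sample size.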
Two-way table: relative joint frequency distribution
- $f_{ij} = n_{ij}/n_{\bullet\bullet}$: relative frequency of cell (i, j)

X \ Y  | y_1   ...  y_j   ...  y_m  | Total
x_1    | f_11  ...  f_1j  ...  f_1m | f_1•
...    | ...        ...        ...  | ...
x_i    | f_i1  ...  f_ij  ...  f_im | f_i•
...    | ...        ...        ...  | ...
x_k    | f_k1  ...  f_kj  ...  f_km | f_k•
Total  | f_•1  ...  f_•j  ...  f_•m | 1

Notation:
- $f_{ij}$: relative frequency of cell (i, j)
- $n_{\bullet\bullet}$: sample size
- Row i total: $f_{i\bullet} = f_{i1} + f_{i2} + \cdots + f_{im}$
- Column j total: $f_{\bullet j} = f_{1j} + f_{2j} + \cdots + f_{kj}$
Joint frequency distribution: absolute and relative
Example
Y = Weekly number of visits to the cinema
X = Weekly number of visits to the theater
(each cell shows the absolute frequency, with the relative frequency in parentheses)

X \ Y  | 0           1           2           3
0      | 12 (0.279)  4 (0.093)   3 (0.070)   1 (0.023)
1      | 5 (0.116)   3 (0.070)   3 (0.070)   0 (0.000)
2      | 4 (0.093)   2 (0.047)   2 (0.047)   0 (0.000)
3      | 2 (0.047)   1 (0.023)   0 (0.000)   0 (0.000)
4      | 1 (0.023)   0 (0.000)   0 (0.000)   0 (0.000)
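The relative frequencies in the cinema/theater example follow from dividing each absolute frequency by the sample size. A short Python sketch of this calculation, using the counts from the table (the variable names are illustrative):

```python
# Absolute frequencies n_ij: rows are X (theater visits 0..4),
# columns are Y (cinema visits 0..3).
counts = [
    [12, 4, 3, 1],
    [5, 3, 3, 0],
    [4, 2, 2, 0],
    [2, 1, 0, 0],
    [1, 0, 0, 0],
]

n = sum(sum(row) for row in counts)                    # sample size n_..
rel = [[n_ij / n for n_ij in row] for row in counts]   # f_ij = n_ij / n_..

for x, row in enumerate(rel):
    print(x, ["%.3f" % f for f in row])
```

The relative frequencies of all cells add up to 1, which is a quick check that no observation was lost or counted twice.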
Marginal frequency distribution: absolute and relative
- The marginal frequencies take us back to the univariate case (we ignore the other variable)
- We obtain them by looking at the margins of the joint frequency table (absolute or relative)
- Specifically, the relative marginal frequencies of X are defined as
  $$f_{i\bullet} = f_{i1} + \cdots + f_{ij} + \cdots + f_{im}$$
  and those of Y are
  $$f_{\bullet j} = f_{1j} + \cdots + f_{ij} + \cdots + f_{kj}$$
- Absolute marginal frequencies are obtained analogously
Marginal frequency distribution
Example
Y = Number of workers
X = Number of sales

Y \ X  | 1-100  101-200  201-300 | Total
1-24   | 0.293  0.098    0.073   | 0.463
25-49  | 0.122  0.073    0.073   | 0.268
50-74  | 0.098  0.049    0.049   | 0.195
75-99  | 0.049  0.024    0.000   | 0.073
Total  | 0.561  0.244    0.195   | 1

Marginal (relative) frequencies of Y

Workers  | f•j
1-24     | 0.463
25-49    | 0.268
50-74    | 0.195
75-99    | 0.073

Marginal (relative) frequencies of X

Sales    | fi•
1-100    | 0.561
101-200  | 0.244
201-300  | 0.195
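The marginal frequencies in the workers/sales example are the row and column sums of the joint table. The Python sketch below recomputes them from the printed cell values; because those cells are rounded to three decimals, the sums can differ from the printed margins by about 0.001.

```python
# Joint relative frequencies from the example: rows are Y (workers),
# columns are X (sales), following the layout of the table.
workers = ["1-24", "25-49", "50-74", "75-99"]
sales = ["1-100", "101-200", "201-300"]
joint = [
    [0.293, 0.098, 0.073],
    [0.122, 0.073, 0.073],
    [0.098, 0.049, 0.049],
    [0.049, 0.024, 0.000],
]

# Marginal of Y: sum each row over the classes of X.
marginal_workers = [sum(row) for row in joint]
# Marginal of X: sum each column over the classes of Y.
marginal_sales = [sum(col) for col in zip(*joint)]

for label, f in zip(workers, marginal_workers):
    print("workers", label, round(f, 3))
for label, f in zip(sales, marginal_sales):
    print("sales", label, round(f, 3))
```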
Conditional frequency distribution
- The conditional distribution of Y given that X = x_i, written $f(Y \mid X = x_i)$, is obtained by calculating, for the j-th class/class interval of Y,
  $$\frac{f_{ij}}{f_{i\bullet}}$$
  (we narrow our data set to the observations satisfying the condition)
- The conditional distribution of X given that Y = y_j is obtained by calculating, for the i-th class/class interval of X,
  $$\frac{f_{ij}}{f_{\bullet j}}$$
- Exactly the same conditional distribution is obtained if we replace the relative frequencies by the absolute frequencies in the above definitions
- Conditional frequencies always add up to 1!
Conditional frequency distribution
Example
Y = Number of workers
X = Number of sales
Find the conditional frequency distribution for sales given that the number of workers is between 50 and 74.

Y \ X  | 1-100  101-200  201-300 | Total
1-24   | 0.293  0.098    0.073   | 0.463
25-49  | 0.122  0.073    0.073   | 0.268
50-74  | 0.098  0.049    0.049   | 0.195
75-99  | 0.049  0.024    0.000   | 0.073
Total  | 0.561  0.244    0.195   | 1

The conditional frequency distribution $f(X \mid 50 \le Y \le 74)$

Sales    | f(x | 50 ≤ y ≤ 74)
1-100    | 0.500 (≈ 0.098/0.195)
101-200  | 0.250 (≈ 0.049/0.195)
201-300  | 0.250 (≈ 0.049/0.195)
Conditional frequency distribution
Example
Y = Number of workers
X = Number of sales
Find the conditional frequency distribution for workers when the number of sales is between 101 and 200.

Y \ X  | 1-100  101-200  201-300 | Total
1-24   | 0.293  0.098    0.073   | 0.463
25-49  | 0.122  0.073    0.073   | 0.268
50-74  | 0.098  0.049    0.049   | 0.195
75-99  | 0.049  0.024    0.000   | 0.073
Total  | 0.561  0.244    0.195   | 1

The conditional frequency distribution $f(Y \mid 101 \le X \le 200)$

Workers  | f(y | 101 ≤ x ≤ 200)
1-24     | 0.402
25-49    | 0.299
50-74    | 0.201
75-99    | 0.098
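The conditional distribution of workers given sales between 101 and 200 can be reproduced by taking the corresponding column of the joint table and dividing it by its own total. A minimal Python sketch follows; the helper name `conditional` is hypothetical.

```python
# Joint relative frequencies; rows are Y (workers), columns are X (sales).
workers = ["1-24", "25-49", "50-74", "75-99"]
joint = [
    [0.293, 0.098, 0.073],
    [0.122, 0.073, 0.073],
    [0.098, 0.049, 0.049],
    [0.049, 0.024, 0.000],
]

def conditional(frequencies):
    """Divide each frequency by their sum (the marginal of the condition)."""
    total = sum(frequencies)
    return [f / total for f in frequencies]

# Condition on X in 101-200: take that column and renormalise it.
column_101_200 = [row[1] for row in joint]
cond_workers = conditional(column_101_200)

# The same function works on absolute frequencies, e.g. the row "Dark" of the
# eye-color table (17 mothers with Bright eyes, 12 with Dark eyes).
cond_mother_given_dark = conditional([17, 12])

for label, f in zip(workers, cond_workers):
    print(label, round(f, 3))
```

Since the function only divides by the total of the selected row or column, it gives the same result for absolute and relative frequencies, and the output always adds up to 1.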
Graphics for bivariate numerical data: scatterplot
Data on 15 houses:

price (euro) | size (m^2)
162657       | 107
165554       | 114
154506       | 91
162103       | 100
158271       | 96
166925       | 107
161917       | 104
161149       | 100
152263       | 80
151878       | 81
165678       | 105
166696       | 111
165387       | 108
161806       | 97
163824       | 106

[Figure: scatterplot of the price of a house (euro) against the size of a house (m^2)]
Types of relationships
There are different ways in which two variables may be related:
(i) absence of relationship (ii) negative linear relationship
(iii) positive linear relationship and (iv) nonlinear relationship
[Figure: four scatterplots, titled NO RELATIONSHIP, NEGATIVE LINEAR, POSITIVE LINEAR and NONLINEAR]
Measures of linear association: covariance
- We look for a descriptive measure that indicates whether there is a linear relationship between x and y
- The sample covariance is defined as
  $$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)$$
- Covariance and linear transformation: if we transform the variable Y according to $Z = a + bY$, then the covariance between X and the new variable Z is
  $$s_{xz} = b\,s_{xy}$$
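Both forms of the covariance formula, and the effect of a linear transformation, can be checked numerically. The Python sketch below uses the house sizes and prices listed in the scatterplot example, assuming the i-th price belongs with the i-th size; the function names and the values of a and b are illustrative choices.

```python
size = [107, 114, 91, 100, 96, 107, 104, 100, 80, 81, 105, 111, 108, 97, 106]
price = [162657, 165554, 154506, 162103, 158271, 166925, 161917, 161149,
         152263, 151878, 165678, 166696, 165387, 161806, 163824]

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Definition: sum of products of deviations, divided by n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

def covariance_shortcut(x, y):
    """Equivalent form: (sum of x_i * y_i - n * x_bar * y_bar) / (n - 1)."""
    n = len(x)
    return (sum(xi * yi for xi, yi in zip(x, y)) - n * mean(x) * mean(y)) / (n - 1)

s_xy = covariance(size, price)

# Linear transformation Z = a + b*Y (here: price in thousands of euros, shifted by a).
a, b = 5.0, 0.001
z = [a + b * yi for yi in price]
s_xz = covariance(size, z)

print(s_xy, covariance_shortcut(size, price))  # the two forms agree
print(s_xz, b * s_xy)                          # s_xz equals b * s_xy
```

The shift a has no effect because it cancels when the mean is subtracted; only the scale factor b carries over to the covariance.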
Properties of the covariance
- If the covariance is 'much larger than 0', this indicates a positive linear relationship between the variables
- If the covariance is 'much smaller than 0', this indicates a negative linear relationship
- If the covariance is 'small', this may be because:
  (i) there is no linear relationship; or
  (ii) the relationship is nonlinear.
Drawbacks:
- What do 'large' and 'small' mean?
- What are the units of the covariance?
Measures of linear association: correlation
- Correlation overcomes the issues with covariance
- The sample correlation is defined as
  $$r_{(x,y)} = \frac{s_{xy}}{s_x s_y}$$
Properties:
- The correlation is bounded: $-1 \le r_{(x,y)} \le 1$, so the terms large and small now make sense
- It is unitless
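The two properties can be illustrated numerically: changing the units of one variable rescales the covariance but leaves the correlation unchanged. The Python sketch below again uses the house data from the scatterplot example, assuming the i-th price belongs with the i-th size; the function names are illustrative.

```python
from math import sqrt

size = [107, 114, 91, 100, 96, 107, 104, 100, 80, 81, 105, 111, 108, 97, 106]
price = [162657, 165554, 154506, 162103, 158271, 166925, 161917, 161149,
         152263, 151878, 165678, 166696, 165387, 161806, 163824]

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

def std(x):
    # Sample standard deviation: square root of the covariance of x with itself.
    return sqrt(covariance(x, x))

def correlation(x, y):
    return covariance(x, y) / (std(x) * std(y))

# Same prices expressed in thousands of euros instead of euros.
price_k = [p / 1000 for p in price]

print(covariance(size, price), covariance(size, price_k))    # differ by a factor of 1000
print(correlation(size, price), correlation(size, price_k))  # equal up to rounding error
```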
Measures of linear association: correlation
Example: Three variables are measured over 91 countries: female life expectancy, male life expectancy and gross national product.
- The covariances between the three pairs of variables are 105.15, 50066.04 and 57917.93, respectively.
- On the other hand, the correlations between the three pairs of variables are 0.98, 0.64 and 0.65, respectively.
- Therefore, although the covariances of female and male life expectancy with gross national product are much larger than the covariance between female and male life expectancy, the correlation is largest for the two life expectancies.