Multivariate statistical data File

Reliability and Risk Analysis
Multidimensional Data
Statistical data
One-dimensional data
Two-dimensional data
Descriptive measures
Elementary Statistical Terms
A population consists of all elements (individuals, items, or objects) whose characteristics
are being studied. The population under study is also called the target population.
A unit is a single entity (usually a person or an object) whose characteristics are of
interest.
Jiří Neubauer
A sample from a statistical population is a portion (a subset) of the population
selected for study.
A survey that includes every member of the population is called a census. The technique of
collecting information from a portion of the population is called a sample survey.
A sample that represents the characteristics of the population as closely as possible is
called a representative sample.
A variable is a characteristic under study that assumes different values for different
elements.
The value of a variable for an element is called an observation or measurement. A data
set is a collection of observations on one or more variables. The number of observations
is called the sample size and is usually denoted by n.
Main Types of Data (variables)
Basic types of data (variables):
• nominal or categorical
• ordinal
• cardinal or numerical
  – discrete
  – continuous
One-dimensional discrete data
One-dimensional continuous data
Frequency and relative frequency
• The frequency $n_j$ is the number of occurrences of the variant $x_j$. We can write
\[ \sum_{j=1}^{k} n_j = n, \]
where $k$ is the number of variants.
• The relative frequency is given by
\[ p_j = \frac{n_j}{n}, \]
and it fulfills
\[ \sum_{j=1}^{k} p_j = 1. \]
• Cumulative frequency $N_j$:
\[ N_j = n_1 + \dots + n_j \]
• Relative cumulative frequency $F_j$:
\[ F_j = \frac{N_j}{n} = p_1 + \dots + p_j \]
Frequency and relative frequency – example
We have a data set containing the heights (in cm) of 50 randomly chosen 15-month-old
boys:
83 85 81 82 84 82 79 84 80 81 82 82 80 82 80 82 83 84 82 79
83 82 83 82 82 82 81 80 82 82 83 80 82 85 81 83 81 81 83 82
81 85 83 79 81 81 81 84 81 82
Height x_i   Freq. n_i   Rel. freq. p_i   Cum. freq. N_i   Rel. cum. freq. F_i
79            3          0.06              3               0.06
80            5          0.10              8               0.16
81           11          0.22             19               0.38
82           16          0.32             35               0.70
83            8          0.16             43               0.86
84            4          0.08             47               0.94
85            3          0.06             50               1.00
Σ            50          1.00              -               -
Table: Frequency table – height of 15-month-old boys
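The columns of such a frequency table can be computed directly from the raw heights. The following is a minimal Python sketch; the function name `frequency_table` is an illustrative choice, not something defined in the lecture.

```python
from collections import Counter
from itertools import accumulate

# Heights (cm) of the 50 boys from the example.
heights = [
    83, 85, 81, 82, 84, 82, 79, 84, 80, 81, 82, 82, 80, 82, 80, 82, 83, 84, 82, 79,
    83, 82, 83, 82, 82, 82, 81, 80, 82, 82, 83, 80, 82, 85, 81, 83, 81, 81, 83, 82,
    81, 85, 83, 79, 81, 81, 81, 84, 81, 82,
]


def frequency_table(data):
    """Return rows (x_j, n_j, p_j, N_j, F_j), sorted by the variant x_j."""
    n = len(data)
    counts = sorted(Counter(data).items())                     # (x_j, n_j)
    cumulative = list(accumulate(n_j for _, n_j in counts))    # N_j
    return [
        (x_j, n_j, n_j / n, N_j, N_j / n)
        for (x_j, n_j), N_j in zip(counts, cumulative)
    ]


for x_j, n_j, p_j, N_j, F_j in frequency_table(heights):
    print(f"{x_j:>4} {n_j:>4} {p_j:>6.2f} {N_j:>4} {F_j:>6.2f}")
```

Sorting the variants before accumulating matters: cumulative frequencies are only meaningful when the variants are in increasing order.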
Figure: Frequency distribution
Empirical distribution function
We define the empirical distribution function as follows:
\[ F_n(x) = \frac{N(x_i \le x)}{n}, \]
where the numerator denotes the number of elements whose value is less than or equal
to $x$.
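As a small illustration of this definition, the sketch below builds $F_n$ for a sample in Python. The helper name `make_ecdf` and the sample values are hypothetical and serve only to show the counting in the numerator.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the function F_n with F_n(x) = (number of x_i <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def F_n(x):
        # In a sorted list, bisect_right gives the count of elements <= x.
        return bisect_right(ordered, x) / n

    return F_n


# Illustrative values only (not taken from the lecture data).
F = make_ecdf([79, 80, 80, 82, 85])
print(F(78), F(80), F(84), F(85))
```

The resulting function is a right-continuous step function that jumps at each observed value and stays constant in between.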
Frequency and relative frequency – example
We have a data set containing the quantity of dust particles (in µg/m³):
1.23 1.51 1.41 1.14 1.47 1.10 1.53 1.22 1.34 1.24
1.54 1.31 1.27 1.16 1.45 1.34 1.23 1.37 1.51 1.29
1.06 1.31 1.14 1.58 1.17 1.09 1.27 1.22 1.33 1.63
1.41 1.17 1.43 1.31 1.39 1.48 1.27 1.40 1.04 1.02
1.52 1.34 1.41 1.58 1.38 1.37 1.27 1.51 1.12 1.39
1.37 1.09 1.51 1.19 1.43 1.63 1.01 1.47 1.17 1.28
Create a frequency table and plot the data.
Class          Middle x_j   Freq. n_j   Rel. freq. p_j   Cum. freq. N_j   Rel. cum. freq. F_j
(1.00; 1.10]   1.05          7          0.117             7               0.117
(1.10; 1.20]   1.15          8          0.133            15               0.250
(1.20; 1.30]   1.25         11          0.183            26               0.433
(1.30; 1.40]   1.35         14          0.233            40               0.667
(1.40; 1.50]   1.45          9          0.150            49               0.817
(1.50; 1.60]   1.55          9          0.150            58               0.967
(1.60; 1.70]   1.65          2          0.033            60               1.000
Σ              -            60          1                 -               -
Table: Frequency table – quantity of dust particles in µg/m³
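For continuous data the counting is done per class rather than per variant. The sketch below groups the dust-particle measurements into right-closed classes $(a; b]$ of width 0.10; the function name `class_frequencies` is an illustrative choice, and the class edges are written as literals to avoid floating-point surprises at the boundaries.

```python
# Dust-particle measurements (µg/m³) from the example, 60 values.
dust = [
    1.23, 1.51, 1.41, 1.14, 1.47, 1.10, 1.53, 1.22, 1.34, 1.24,
    1.54, 1.31, 1.27, 1.16, 1.45, 1.34, 1.23, 1.37, 1.51, 1.29,
    1.06, 1.31, 1.14, 1.58, 1.17, 1.09, 1.27, 1.22, 1.33, 1.63,
    1.41, 1.17, 1.43, 1.31, 1.39, 1.48, 1.27, 1.40, 1.04, 1.02,
    1.52, 1.34, 1.41, 1.58, 1.38, 1.37, 1.27, 1.51, 1.12, 1.39,
    1.37, 1.09, 1.51, 1.19, 1.43, 1.63, 1.01, 1.47, 1.17, 1.28,
]

# Class boundaries; class j is the interval (edges[j], edges[j + 1]].
edges = [1.00, 1.10, 1.20, 1.30, 1.40, 1.50, 1.60, 1.70]


def class_frequencies(data, edges):
    """Count how many values fall into each right-closed class."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for j in range(len(counts)):
            if edges[j] < x <= edges[j + 1]:
                counts[j] += 1
                break
    return counts


counts = class_frequencies(dust, edges)
n = len(dust)
for j, n_j in enumerate(counts):
    middle = (edges[j] + edges[j + 1]) / 2
    print(f"({edges[j]:.2f}; {edges[j + 1]:.2f}]  {middle:.2f}  {n_j:>3}  {n_j / n:.3f}")
```

A value lying exactly on a boundary, such as 1.10 or 1.40, is assigned to the class whose upper limit it equals.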
Figure: Frequency distribution – histograms
Empirical distribution function
Figure: Empirical distribution function
Two-dimensional discrete data
Two-dimensional continuous data
Two-dimensional discrete data
Let us have a two-dimensional data set
\[ \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix}, \]
where $X$ has $r$ variants and $Y$ has $s$ variants.
The joint absolute frequency of $(x_j, y_k)$ is $n_{jk} = N(X = x_j \wedge Y = y_k)$.
The joint relative frequency of $(x_j, y_k)$ is
\[ p_{jk} = \frac{n_{jk}}{n}. \]
The marginal absolute frequency of the variant $x_j$ is
\[ n_{j\cdot} = N(X = x_j) = n_{j1} + \dots + n_{js}. \]
The marginal relative frequency of the variant $x_j$ is
\[ p_{j\cdot} = \frac{n_{j\cdot}}{n} = p_{j1} + \dots + p_{js}. \]
The marginal absolute frequency of the variant $y_k$ is
\[ n_{\cdot k} = N(Y = y_k) = n_{1k} + \dots + n_{rk}. \]
The marginal relative frequency of the variant $y_k$ is
\[ p_{\cdot k} = \frac{n_{\cdot k}}{n} = p_{1k} + \dots + p_{rk}. \]
Two-dimensional discrete data – example
The age of 42 dwarf apple trees in years (X) and the annual harvest (Y) were recorded;
see the table below.
x_j   y_i
3      4   7   5   5   5
4      9   5   7   6   8   7   8
5      9   8   9  10   7   7
6     10   8  10  10  10   9
7      9   7   8   9  10   9
8      8   7   7   8   6  10
9      5   4   6   7   6   8
age/harvest    4    5    6    7    8    9   10   n_j.
3              1    3    0    1    0    0    0     5
4              0    1    1    2    2    1    0     7
5              0    0    0    2    1    2    1     6
6              0    0    0    0    1    1    4     6
7              0    0    0    1    1    3    1     6
8              0    0    1    2    2    0    1     6
9              1    1    2    1    1    0    0     6
n.k            2    5    4    9    8    7    7    42
Table: Frequency table
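Joint and marginal frequencies of this kind can be obtained by counting pairs. The sketch below does so for the apple-tree example; the grouping of harvest values by age in `harvest_by_age` is a reading of the raw-data table and should be treated as an assumption, and all variable names are illustrative.

```python
from collections import Counter

# Harvest values y recorded for each age x (years).
# Assumption: this grouping by age is reconstructed from the raw-data table.
harvest_by_age = {
    3: [4, 7, 5, 5, 5],
    4: [9, 5, 7, 6, 8, 7, 8],
    5: [9, 8, 9, 10, 7, 7],
    6: [10, 8, 10, 10, 10, 9],
    7: [9, 7, 8, 9, 10, 9],
    8: [8, 7, 7, 8, 6, 10],
    9: [5, 4, 6, 7, 6, 8],
}

pairs = [(x, y) for x, ys in harvest_by_age.items() for y in ys]
n = len(pairs)

joint = Counter(pairs)                        # n_jk
row_marginal = Counter(x for x, _ in pairs)   # n_j. (per age)
col_marginal = Counter(y for _, y in pairs)   # n_.k (per harvest value)
joint_rel = {xy: c / n for xy, c in joint.items()}  # p_jk = n_jk / n

print("n =", n)
print("n_j. :", [row_marginal[x] for x in range(3, 10)])
print("n_.k :", [col_marginal[y] for y in range(4, 11)])
```

Summing either set of marginal frequencies gives back the sample size $n$, which is a quick consistency check on any contingency table.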
Two-dimensional continuous data
Let us have a two-dimensional data set
\[ \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix}. \]
We split the values of $X$ into $r$ intervals $(u_j, u_{j+1}]$, $j = 1, \dots, r$, and the
values of $Y$ into $s$ intervals $(v_k, v_{k+1}]$, $k = 1, \dots, s$. Each frequency then
refers to the number of observations falling into the given pair of intervals.
Two-dimensional continuous data – example
We have 34 measurements of pH and bicarbonate (HCO₃⁻) in water. Construct a distribution
table.
pH    HCO₃⁻    pH    HCO₃⁻    pH    HCO₃⁻    pH    HCO₃⁻
7.6   157      7.5   190      8.2   202      7.4   125
7.1   174      8.1   215      7.9   155      7.3    76
8.2   175      7.0   199      7.6   157      8.5    48
7.5   188      7.3   262      8.8   147      7.8   147
7.4   171      7.8   105      7.2   133      6.7   117
7.8   143      7.3   121      7.9    53      7.1   182
7.3   217      8.0    81      8.1    56      7.3    87
8.0   190      8.5    82      7.7   113
7.1   142      7.1   210      8.4    35
pH/HCO₃⁻    30–70   70–110   110–150   150–190   190–230   230–270   n_j.
6.6–7.0       0       0         1         0         1         0        2
7.0–7.4       0       2         3         2         2         1       10
7.4–7.8       0       1         4         5         0         0       10
7.8–8.2       2       1         0         3         2         0        8
8.2–8.6       2       1         0         0         0         0        3
8.6–9.0       0       0         1         0         0         0        1
n.k           4       5         9        10         5         1       34
Table: Distribution table
Descriptive measures
• measures of location (center) – mean, quantiles, mode, ...
• measures of dispersion (variation) – variance, standard deviation, sample variance,
  sample standard deviation, ...
• measures of concentration – skewness and kurtosis
• measures of dependency – correlation coefficients
Measures of location
mean:
• arithmetic mean
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
• harmonic mean
\[ \bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \]
• geometric mean
\[ \bar{x}_G = \sqrt[n]{x_1 \cdot x_2 \cdots x_n} \]
quantile: The quantile $x_p$ is the value of the variable such that 100p % of the
values of the ordered sample (or population) are smaller than or equal to $x_p$ and
100(1 − p) % of the values are larger than or equal to $x_p$.
mode: $\hat{x}$ is the value with the highest frequency.
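The measures of location listed above are available in Python's standard `statistics` module. The sketch below evaluates them on a small hypothetical sample (the values are not from the lecture); `statistics.geometric_mean` requires Python 3.8 or newer.

```python
import statistics

# Illustrative sample (hypothetical values, not taken from the lecture).
x = [2.0, 4.0, 4.0, 8.0]

arithmetic = statistics.mean(x)            # (1/n) * sum of x_i
harmonic = statistics.harmonic_mean(x)     # n / sum of 1/x_i
geometric = statistics.geometric_mean(x)   # n-th root of the product of x_i
median = statistics.median(x)              # the quantile x_0.5
mode = statistics.mode(x)                  # value with the highest frequency

print(arithmetic, harmonic, geometric, median, mode)
```

For positive data the three means are always ordered as harmonic ≤ geometric ≤ arithmetic, which the sample illustrates. Quantiles other than the median are left out here because software packages use several slightly different interpolation conventions.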
Measures of dispersion
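range of variation:
\[ R = x_{\max} - x_{\min} \]
interquartile range:
\[ R_Q = x_{0.75} - x_{0.25} \]
variance:
\[ s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
standard deviation:
\[ s_n = \sqrt{s_n^2} \]
sample variance:
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
sample standard deviation:
\[ s = \sqrt{s^2} \]
average deviation:
\[ d_{\bar{x}} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]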
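The difference between the two variance formulas (division by $n$ versus $n - 1$) is easy to see numerically. The following sketch uses the standard `statistics` module on a small hypothetical sample; the variable names are illustrative.

```python
import statistics

# Illustrative sample (hypothetical values, not taken from the lecture).
x = [2.0, 4.0, 4.0, 8.0]
n = len(x)
mean = statistics.fmean(x)

value_range = max(x) - min(x)                 # R = x_max - x_min
variance_n = statistics.pvariance(x)          # s_n^2, divides by n
std_n = statistics.pstdev(x)                  # s_n
sample_variance = statistics.variance(x)      # s^2, divides by n - 1
sample_std = statistics.stdev(x)              # s
average_deviation = sum(abs(xi - mean) for xi in x) / n

print(value_range, variance_n, sample_variance, average_deviation)
```

The two variances differ only by the factor $n/(n-1)$, so the gap shrinks as the sample grows.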
Measures of concentration
skewness:
\[ a_3 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{s_n^3} \]
kurtosis:
\[ a_4 = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{s_n^4} - 3 \]
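Both coefficients are ratios of central moments, so one helper function covers them. This is a minimal sketch of the formulas with $s_n$ computed by dividing by $n$; the function names are illustrative.

```python
def central_moment(x, r):
    """r-th central moment: (1/n) * sum of (x_i - mean)^r."""
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** r for xi in x) / n


def skewness(x):
    """a_3: third central moment divided by s_n^3."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def kurtosis(x):
    """a_4: fourth central moment divided by s_n^4, minus 3."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]
print(skewness(symmetric), kurtosis(symmetric), skewness(right_skewed))
```

A symmetric sample has skewness 0, while a long right tail gives a positive value. Subtracting 3 makes the kurtosis of a normal distribution equal to 0.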
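Measures of dependency
Let us have a two-dimensional data set
\[ \begin{pmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{pmatrix}, \]
where $\bar{x}$ and $\bar{y}$ denote the means of $X$ and $Y$, and $s_x$, $s_y$ are the
standard deviations of $X$, $Y$. The Pearson correlation coefficient is defined by the formula
\[ r_{xy} = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x} \cdot \frac{y_i - \bar{y}}{s_y}. \]
We can rewrite it in the form
\[ r_{xy} = \frac{s_{xy}}{s_x s_y}, \]
where
\[ s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]
is the covariance of $X$ and $Y$.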
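The covariance form of the Pearson coefficient translates directly into code. The sketch below divides by $n$ throughout, matching the definitions of $s_x$, $s_y$ and $s_{xy}$; the function name `pearson` and the sample values are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation r_xy = s_xy / (s_x * s_y), all with divisor n."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    s_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    s_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return s_xy / (s_x * s_y)


# Illustrative values only: an exact increasing and an exact decreasing line.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

Because the same divisor appears in the numerator and the denominator, using $n - 1$ everywhere instead of $n$ would give the same value of $r_{xy}$.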
We calculate the ranks of the values $x_i$, $y_i$ and denote them $p_i$, $q_i$. Spearman's
correlation coefficient (rank correlation coefficient) is then defined by the formula
\[ \rho = 1 - \frac{6 \sum_{i=1}^{n} (p_i - q_i)^2}{n(n^2 - 1)}. \]
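A minimal sketch of the rank-based computation follows. It uses the standard denominator $n(n^2 - 1)$ and assumes that neither variable contains ties, since the simple formula is exact only in that case; the helper names `ranks` and `spearman` are illustrative.

```python
def ranks(values):
    """Ranks 1..n of the values; assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho = 1 - 6 * sum((p_i - q_i)^2) / (n * (n^2 - 1))."""
    n = len(x)
    p, q = ranks(x), ranks(y)
    d_squared = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Illustrative values only: y = x^2 is monotone but not linear in x.
print(spearman([10, 20, 30, 40], [100, 400, 900, 1600]))
print(spearman([1, 2, 3], [1, 3, 2]))
```

Because only ranks enter the formula, any strictly increasing relationship gives $\rho = 1$ even when it is far from linear.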
We say that $(x_i, y_i)$ and $(x_j, y_j)$ are concordant if both $x_i > x_j$ and $y_i > y_j$, or
if both $x_i < x_j$ and $y_i < y_j$. We say that they are discordant if both $x_i < x_j$ and
$y_i > y_j$, or if both $x_i > x_j$ and $y_i < y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is
neither concordant nor discordant. Let us denote by $n_c$ the number of concordant pairs and
by $n_d$ the number of discordant pairs. The Kendall correlation coefficient is defined by the
formula
\[ \tau = \frac{n_c - n_d}{\frac{1}{2} n(n - 1)}. \]
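Counting concordant and discordant pairs can be sketched by looping over all pairs of observations. The function name `kendall_tau` is illustrative; the sign of the product $(x_i - x_j)(y_i - y_j)$ is used as a compact test for concordance.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau = (n_c - n_d) / (n * (n - 1) / 2)."""
    n = len(x)
    n_c = n_d = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            n_c += 1      # concordant pair
        elif sign < 0:
            n_d += 1      # discordant pair
        # sign == 0: a tie in x or y, counted as neither
    return (n_c - n_d) / (n * (n - 1) / 2)


# Illustrative values only.
print(kendall_tau([1, 2, 3], [1, 3, 2]))
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))
```

This direct approach examines all $n(n-1)/2$ pairs, which is fine for small samples such as those in the examples here.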