Exploratory data analysis
Dr. David Lucy
Lancaster University

Exploratory data analysis
• Data are information we have collected, but they also carry
uncertainty, due to random variability in the characteristics
of interest from one individual to another.
• In mathematical terms, the characteristics being
measured are represented by random variables, e.g. X
= age of an individual chosen at random.
• Individual in this context refers to a unit of study: actual people, towns, cars, test tubes...

Random variables
IMPORTANT DIFFERENCE BETWEEN:
Random variable ←→ realisation (or observation)
• A random variable is always written in UPPER CASE
and is associated with a probability distribution.
• e.g. X = Ozone level
Ozone level is the random variable of interest.

Random variables
• Observations of random variables are written in lower
case and are just numbers.
• e.g. x = observed instance of the random variable X
For a data set of size n:
$X_1, \dots, X_n$ are the random variables
$x_1, \dots, x_n$ are the particular realisations

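The distinction between a random variable and its realisations can be sketched in a few lines of Python. This is only an illustration: the function name draw_ozone and the normal distribution with mean 35 and standard deviation 8 are assumptions made for the example, not part of the course data.

```python
import random


def draw_ozone(rng):
    """Return one realisation x of a hypothetical random variable X = ozone level.

    The normal distribution (mean 35, sd 8) is an illustrative assumption.
    """
    return rng.gauss(35.0, 8.0)


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5

# X_1, ..., X_n are described by the distribution inside draw_ozone;
# x_1, ..., x_n below are the particular numbers we happened to observe.
xs = [draw_ozone(rng) for _ in range(n)]
print(xs)
```

Running the script again with a different seed gives a different set of lower-case values, while the upper-case random variable (the distribution) stays the same.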
Data analysis
The first stage in any analysis is to get to know the problem
and the data. This usually involves:
1. Working out what has been observed, and how it has
been observed.
2. Using a variety of graphical procedures to
visualise the data.
3. Calculating a few simple summary numbers, or
summary statistics, that capture key features of the
data.
Hopefully these will reveal key features of the unknown
underlying distribution.

Data analysis
Having said that:
• The random variability in the data is a reflection of the
underlying distribution.
• Because the sample size is finite, the data can only serve as an
approximation to the true underlying distribution and its
features.
• This implies that we also need to care about how good the
approximation is.

Role of data analysis
• Finding errors and anomalies: missing data, outliers,
changes of scale...
• Suggesting the route of subsequent analysis: plots of data
give information on location, scale and shape of the
distribution and relationships between variables.
• Augmenting understanding of the applied problem:
exploratory graphical tools sharpen the scientific
questions being addressed.

Example data sets
Three data sets are used in the early part of this course:
1. Ozone measurements.
2. Diseased tree runs.
3. Hospital operation success rates.
How do the observations relate to the scientific questions
under investigation?

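The earlier point that a finite sample only approximates the underlying distribution can be illustrated with a small simulation. This is a sketch under stated assumptions: the "true" distribution is taken to be normal with mean 10 and standard deviation 2, values chosen purely for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed "unknown" underlying distribution (illustrative values only).
TRUE_MEAN, TRUE_SD = 10.0, 2.0


def sample_mean(n):
    """Mean of n simulated observations from the assumed distribution."""
    return statistics.fmean(rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n))


# Absolute error of the sample mean for increasing sample sizes.
errors = {n: abs(sample_mean(n) - TRUE_MEAN) for n in (10, 1000, 100000)}
for n, err in errors.items():
    print(f"n = {n:6d}  |sample mean - true mean| = {err:.4f}")
```

The error typically shrinks as n grows, although any single small sample can be closer or further away by chance, which is why the quality of the approximation matters.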
Ozone measurements
The data examine ground-level ozone:
• Produced by chemical reactions between nitrogen
dioxide (NO2), hydrocarbons and sunlight.
• When present at high levels, ozone can irritate the
eyes and air passages, causing breathing difficulties,
and may increase susceptibility to infection.
• Ozone is toxic to some crops, vegetation and trees and
is a highly reactive chemical, capable of attacking
surfaces, fabrics and rubber materials.
Observations were made on an hourly basis from April to July,
and from November to February.

Ozone measurements
Focus on a pair of sites in Yorkshire:
1. Rural site - Ladybower reservoir.
2. Urban site - Leeds city centre.
The observations in this case are designed to form a contrast
between urban and rural environments, and temporal contrasts.

Ozone measurements
Scientific questions of interest are:
1. How, if at all, does the distribution of ozone
measurements vary between the urban and rural sites?
2. How, if at all, is the distribution of ozone
measurements affected by season?
3. How, if at all, does the presence of other pollutants
affect the levels of measured ozone?

Ozone measurements
Summer measurements
Date        Leeds O3  Leeds NO2  LadyB O3  LadyB NO2
01/04/1994  32        48         42        7.5
02/04/1994  29        49         43        6.1
03/04/1994  32        34         38        9.5
04/04/1994  32        35         49        3.9
05/04/1994  33        50         39        11.7
...         ...       ...        ...       ...

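As a first look at the urban/rural question, simple summary statistics can be computed for the ozone columns. The sketch below uses only the five rows shown in the summer table (01/04/1994 to 05/04/1994), so the numbers are illustrative and say nothing reliable about the full data set.

```python
import statistics

# Ozone readings from the five table rows shown on the slide only.
ozone = {
    "Leeds O3": [32, 29, 32, 32, 33],   # urban site
    "LadyB O3": [42, 43, 38, 49, 39],   # rural site (Ladybower)
}

# A few simple summary statistics per site.
summary = {
    site: {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }
    for site, values in ozone.items()
}

for site, stats in summary.items():
    print(site, stats)
```

On these five days the rural readings sit above the urban ones, which is the kind of feature a plot of the full data would be used to examine properly.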
Disease in trees
An arborist is looking at tree disease in a plantation:
• Looks at a transect across the plantation.
• Locates a diseased tree.
• Counts the number of diseased trees in the unbroken
chain of diseased trees.
Records the number of times a run of length n is observed.

Disease in trees
[Figure: grid of trees in the plantation, each marked H or D, with the north and east directions labelled.]

Disease in trees
Run lengths for trees from a diseased plantation
Run length  Number of runs in first 50 observations  Number of runs in 109 observations
0           31                                       71
1           16                                       28
2           2                                        5
3           0                                        2
4           1                                        2
5           0                                        1

Disease in trees
Given in the notes as a table, but the actual data would be a
sequence of counts:
0, 0, 1, 3, 2, 0, 0, 0, 0, 1, 1, 0, 1, ...

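The step from the raw sequence of counts to the run-length table is a simple tabulation. The sketch below tabulates only the thirteen values visible on the slide (the real sequence continues), and then checks that the two table columns, as read from the slide, add up to 50 and 109 observations.

```python
from collections import Counter

# Only the run lengths visible on the slide; the full sequence is longer.
runs = [0, 0, 1, 3, 2, 0, 0, 0, 0, 1, 1, 0, 1]

# Frequency table: how many times each run length was observed.
table = Counter(runs)
for length in range(max(runs) + 1):
    print(f"run length {length}: {table[length]} runs")

# Table columns as read from the slide, indexed by run length 0..5.
first_50 = [31, 16, 2, 0, 1, 0]
all_109 = [71, 28, 5, 2, 2, 1]
print(sum(first_50), sum(all_109))
```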
Comparison of hospitals
Comparison of successful outcomes in hospital operations:
• Number of successes counted for each of two hospitals.
• Ten operations observed in each hospital.
Hospital    Hospital 1  Hospital 2
Successful  9           5

Types of variable
Basically three types:
1. ordinal - can only take on discrete values - order of
values is important - anxiety/neurosis scores
(not-anxious/slightly-anxious/very-anxious).
2. categorical - can only take on discrete values - ordering
of values has no meaning - examples being
sex (female/male/other) - ethnicity
(European/African/North American) - species
(cat/dog/hamster) - often known as "factors".
3. continuous - can take on any value on the number
line - height in inches, weight in pounds.

More esoteric points
• Many continuous measurements are recorded as
many-valued ordinal variables.
• For heights we can only measure to the nearest
division on our measuring instrument.
• That is, someone who is 5'8.2343" is said to be 5'8",
but from our measurements is in the 5'8" to 5'9"
category.
• Height is notionally continuous.
Colour can be said to be categorical, but the underlying
variable is light wavelength, which a physicist would
consider continuous.

Continuous to ordinal
The conversion of continuous to discrete variables is taken far in
the social sciences:
• Likert scales - familiar as a five-category scale.
• Example: very good, good, indifferent, poor, bad.
• Example: agree strongly, agree, neither agree nor
disagree, disagree, disagree strongly.
The underlying variable type is continuous, however:
• Often treated as coarse-grained continuous.
• Scale not preserved.

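The height example, in which a notionally continuous value is recorded as a discrete category, can be sketched as follows. The function name to_feet_inches is hypothetical, and rounding down to the whole-inch division is an assumption chosen to match the 5'8" to 5'9" category described in the slides.

```python
import math


def to_feet_inches(height_inches):
    """Record a height as the whole-inch category it falls in.

    Rounds down, so anything from 68.0 up to (but not including) 69.0
    inches is recorded as 5'8".
    """
    whole_inches = math.floor(height_inches)
    return divmod(whole_inches, 12)  # (feet, inches)


exact = 5 * 12 + 8.2343  # 5'8.2343" expressed in inches
feet, inches = to_feet_inches(exact)
print(f"{exact} inches is recorded as {feet}'{inches}\"")
```

The recorded variable takes only whole-inch values, even though the underlying height could be any value in the interval.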
Populations and samples
• In statistical science, samples are collections of objects
which are observed.
• Samples are subsets of populations - we examine a
sample from a population to make inferences about
that population, because we cannot examine all
members of the population.
This is at odds with the use of the term in some physical sciences to
mean the thing which is observed.

Populations and samples
Populations and samples are hierarchically arranged:
• A sample is always a subset of a population.
• A population is always a superset of a sample.
Samples can themselves be populations, and
populations can constitute samples of other populations.

Populations and samples
• Think about the following:
1. All humans who have ever lived, are living at the
moment, or will live in the future.
2. All humans who have ever lived or are living at the
moment.
3. All humans who are alive at the moment.
Which is the population, and which the sample?

Populations and samples
Which in the case of human properties may be biased:
[Figure: timeline labelled past, present, future.]

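The hierarchical arrangement of populations and samples can be sketched with Python sets. The population of 1000 numbered units and the sample sizes are arbitrary choices made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 1000 units identified by number.
population = set(range(1000))

# A sample of 100 units drawn from the population.
sample = set(rng.sample(sorted(population), 100))

# The sample can itself act as a population for a smaller sample.
subsample = set(rng.sample(sorted(sample), 10))

# Each level is a subset of the level above it.
print(subsample <= sample <= population)
```

Which set counts as "the population" depends on the question being asked, not on the sets themselves.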
Populations and samples
• The notions of population and sample are not natural types.
• They depend on the purpose and the question under
examination.
• Samples can be biased - meaning some systematic
difference from the population - accounted for by measures of
uncertainty.

Exercise 3.1.1
For each of the problems that we are concentrating on in
the course, state the populations that we are trying to learn
about:
1. Ozone:
2. Diseased Trees:
3. Hospital operation successes:

Exercise 3.1.1
These might be:
1. Ozone: concentrations of ozone at the two locations
given the time of year and the level of NO2.
2. Diseased Trees: all trees in the forest and, possibly,
other locations where the climate, soil and trees have
similar properties.
3. Hospital operation successes: other operations at the
two hospitals.

Exercise 3.1.2
In the case where observations have been made on
diseased trees, what:
1. type of variable are the observations?
2. are the possible values for that variable?

Exercise 3.1.2
In the case where observations have been made on
diseased trees, what:
1. Type of variable are the observations? Number of trees in
the unbroken run of trees. Discrete, but scale
preserving.
2. Are the possible values for that variable? X = 0, 1, 2, ...

Exercise 3.1.3
For the comparison between hospitals, what:
1. type of variable are the observations?
2. are the possible values for that variable?

Exercise 3.1.3
For the comparison between hospitals, what:
1. Type of variable are the observations? Number of
successful operations out of 10. Discrete, but scale
preserving.
2. Are the possible values for that variable? X = 0, 1, 2, ..., 10
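The hospital variable can be written down directly as a count with a bounded set of possible values. The sketch below uses the two observed counts from the hospital comparison table (9 and 5 successes out of 10); the variable names are chosen for the example.

```python
N_OPERATIONS = 10  # operations observed in each hospital

# Observed realisations x of X = number of successful operations.
successes = {"Hospital 1": 9, "Hospital 2": 5}

# Possible values of X: 0, 1, 2, ..., 10.
possible_values = range(N_OPERATIONS + 1)

# Observed success rate for each hospital.
rates = {name: x / N_OPERATIONS for name, x in successes.items()}

for name, x in successes.items():
    print(name, x, x in possible_values, rates[name])
```

Because the count is discrete but scale preserving, differences and ratios such as these rates are meaningful, unlike for a purely ordinal scale.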