Exploratory data analysis
Dr. David Lucy
Lancaster University

Exploratory data analysis
• Data is information we have collected, but it also carries uncertainty, due to random variability in the characteristics of interest from one individual to another.
• In mathematical terms, the characteristics being measured are represented by random variables, e.g. X = age of an individual chosen at random.
• "Individual" in this context refers to a unit of study: actual people, towns, cars, test tubes...

Random variables
IMPORTANT DIFFERENCE BETWEEN: random variable ←→ realisation (or observation)
• A random variable is always written in UPPER CASE and is associated with a probability distribution.
• e.g. X = ozone level. Ozone level is the random variable of interest.

Random variables
• An observation of a random variable is written in lower case and is just a number.
• e.g. x = observed instance of the random variable.
For a data set of size n:
$X_1, \dots, X_n$ are the random variables;
$x_1, \dots, x_n$ are the particular realisations.

Role of data analysis
• Finding errors and anomalies: missing data, outliers, changes of scale...
• Suggesting the route of subsequent analysis: plots of the data give information on the location, scale and shape of the distribution, and on relationships between variables.
• Augmenting understanding of the applied problem: exploratory graphical tools sharpen the scientific questions being addressed.

Data analysis
The first stage in any analysis is to get to know the problem and the data. This usually involves:
1. Working out what has been observed, and how it has been observed.
2. Using a variety of graphical procedures to visualise the data.
3. Calculating a few simple summary numbers, or summary statistics, that capture key features of the data.
Hopefully these will reveal key features of the unknown underlying distribution.
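
The distinction between a random variable X and its realisations, together with the "few simple summary numbers" step, can be sketched in Python. This is only an illustration: the normal distribution and its parameters (MU, SIGMA) are made up and do not come from the course, and the function names are hypothetical.

```python
import random
import statistics

# Hypothetical model for the random variable X: a normal distribution with
# made-up parameters, chosen only for illustration (not from the handout).
MU, SIGMA = 30.0, 5.0


def draw_realisations(n, seed=0):
    """Return n realisations x_1, ..., x_n of the random variable X."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.gauss(MU, SIGMA) for _ in range(n)]


def summarise(xs):
    """Simple summary statistics: location (mean, median) and scale (sd)."""
    return {
        "n": len(xs),
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "sd": statistics.stdev(xs),
    }


# X is described by its distribution; xs is just a list of numbers.
xs = draw_realisations(10)
summary = summarise(xs)
print(summary)
```

With only ten realisations the sample mean and standard deviation will generally differ from MU and SIGMA, which is the sense in which the data approximate the underlying distribution.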
Data analysis
Having said that:
• The random variability in the data is a reflection of the underlying distribution.
• Because the sample size is finite, the data serve as an approximation to the true underlying distribution and its features.
• This implies that we also need to care about how good the approximation is.

Example data sets
Three data sets are used in the early part of this course:
1. Ozone measurements.
2. Diseased tree runs.
3. Hospital operation success rates.
How do the observations relate to the scientific questions under investigation?

Ozone measurements
This data set examines ground-level ozone:
• Produced by chemical reactions between nitrogen dioxide (NO2), hydrocarbons and sunlight.
• When present at high levels, ozone can irritate the eyes and air passages, causing breathing difficulties, and may increase susceptibility to infection.
• Ozone is toxic to some crops, vegetation and trees, and is a highly reactive chemical, capable of attacking surfaces, fabrics and rubber materials.

Ozone measurements
Focus on a pair of sites in Yorkshire:
1. Rural site - Ladybower reservoir.
2. Urban site - Leeds city centre.
Observations in this case are designed to form a contrast between urban and rural environments, and temporal contrasts.

Ozone measurements
Scientific questions of interest are:
1. How, if at all, does the distribution of ozone measurements vary between the urban and rural sites?
2. How, if at all, is the distribution of ozone measurements affected by season?
3. How, if at all, does the presence of other pollutants affect the levels of measured ozone?
Observations were made on an hourly basis from April to July, and from November to February.

Ozone measurements
Summer measurements
Date        Leeds O3  Leeds NO2  LadyB O3  LadyB NO2
01/04/1994  32        48         42        7.5
02/04/1994  29        49         43        6.1
03/04/1994  32        34         38        9.5
04/04/1994  32        35         49        3.9
05/04/1994  33        50         39        11.7
...         ...       ...        ...       ...

Disease in trees
An arbologist looking at tree disease in a plantation:
• Looks at a transect across the plantation.
• Locates a diseased tree.
• Counts the number of diseased trees in the unbroken chain of diseased trees.
Records the number of times a run of length n is observed.

Disease in trees
[Figure: the transect across the plantation, with compass labels north and east and each tree marked H or D.]

Disease in trees
Run lengths for trees from a diseased plantation

Run length                               0   1   2  3  4  5
Number of runs in first 50 observations  31  16  2  0  1  0
Number of runs in 109 observations       71  28  5  2  2  1

Disease in trees
Given in the notes as a table, but the actual data would be a sequence of counts:
0, 0, 1, 3, 2, 0, 0, 0, 0, 1, 1, 0, 1, ...

Comparison of hospitals
Comparison of successful outcomes in hospital operations:
• Number of successes counted for each of two hospitals.
• Ten operations observed in each hospital.

Hospital    Hospital 1  Hospital 2
Successful  9           5

More esoteric points
• Many continuous measurements are measured as many-valued ordinal variables.
• For heights, we can only measure to the nearest division on our measuring instrument.
• That is, someone who is 5'8.2343" is said to be 5'8", but from our measurements is in the 5'8" to 5'9" category.
• Height is notionally continuous.
Colour can be said to be categorical, but the underlying variable is light wavelength, which, if you ask a physicist, is considered continuous.
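
Returning to the summer ozone table, a first numerical look at the urban/rural contrast might be sketched as follows. It uses only the five rows printed in the handout, so it illustrates the calculation rather than answering the scientific question. The helper name describe is hypothetical.

```python
import statistics

# The five summer rows printed in the handout's table:
# (date, Leeds O3, Leeds NO2, Ladybower O3, Ladybower NO2).
rows = [
    ("01/04/1994", 32, 48, 42, 7.5),
    ("02/04/1994", 29, 49, 43, 6.1),
    ("03/04/1994", 32, 34, 38, 9.5),
    ("04/04/1994", 32, 35, 49, 3.9),
    ("05/04/1994", 33, 50, 39, 11.7),
]

leeds_o3 = [row[1] for row in rows]      # urban site
ladybower_o3 = [row[3] for row in rows]  # rural site


def describe(xs):
    """Location and spread of a small sample: min, median, mean, max."""
    return {
        "min": min(xs),
        "median": statistics.median(xs),
        "mean": statistics.mean(xs),
        "max": max(xs),
    }


# Same-day differences, rural minus urban, pairing the sites by date.
differences = [rural - urban for urban, rural in zip(leeds_o3, ladybower_o3)]

print("Leeds O3:    ", describe(leeds_o3))
print("Ladybower O3:", describe(ladybower_o3))
print("Rural minus urban, by day:", differences)
```

Pairing the two sites by date is a design choice of this sketch. It keeps day-to-day variation that is common to both sites out of the comparison.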
Types of variable
Basically three types:
1. ordinal - can only take on discrete values - order of values is important - e.g. anxiety/neurosis scores (not-anxious/slightly-anxious/very-anxious).
2. categorical - can only take on discrete values - ordering of values has no meaning - examples being sex (female/male/other), ethnicity (European/African/North American), species (cat/dog/hamster) - often known as "factors".
3. continuous - can take on any value on the number line - e.g. height in inches, weight in pounds.

Continuous to ordinal
Transfer of continuous to discrete variables is taken far in the social sciences:
• Likert scales - familiar as a five-category scale.
• Example: very good, good, indifferent, poor, bad.
• Example: agree strongly, agree, neither agree nor disagree, disagree, disagree strongly.
Underlying variable type is continuous, however:
• Often treated as coarse-grained continuous.
• Scale not preserved.

Populations and samples
• In statistical science, samples are collections of objects which are observed.
• Samples are subsets of populations - we examine a sample from a population to make inferences about that population, because we cannot examine all members of a population.
This is at odds with the use of the term in some physical sciences, where "sample" means the thing which is observed.

Populations and samples
Populations and samples are hierarchically arranged:
• A sample is always a subset of a population.
• A population is always a superset of a sample.
Samples can in themselves be populations, and populations can constitute samples of other populations.

Populations and samples
• Think about the following:
1. All humans who have ever lived, are living at the moment, or will live in the future.
2. All humans who have ever lived or are living at the moment.
3. All humans who are alive at the moment.
Which is the population, and which the sample?

Populations and samples
Which, in the case of human properties, may be biased:
[Figure: timeline labelled past, present and future.]

Populations and samples
• Notions of population and sample are not natural types.
• They depend on the purpose and the question under examination.
• Samples can be biased - meaning some systematic difference from the population - accounted for by measures of uncertainty.

Exercise 3.1.1
For each of the problems that we are concentrating on in the course, state the populations that we are trying to learn about:
1. Ozone:
2. Diseased trees:
3. Hospital operation successes:

Exercise 3.1.1
These might be:
1. Ozone: concentrations of ozone at the two locations, given the time of year and the level of NO2.
2. Diseased trees: all trees in the forest and, possibly, other locations where the climate, soil and trees have similar properties.
3. Hospital operation successes: other operations at the two hospitals.

Exercise 3.1.2
In the case where observations have been made on diseased trees, what:
1. type of variable are the observations?
2. are the possible values for that variable?

Exercise 3.1.2
In the case where observations have been made on diseased trees:
1. Type of variable are the observations? Number of trees in the unbroken run of diseased trees. Discrete, but scale preserving.
2. Possible values for that variable? $X = 0, 1, 2, \dots$

Exercise 3.1.3
For the comparison between hospitals, what:
1. type of variable are the observations?
2. are the possible values for that variable?

Exercise 3.1.3
For the comparison between hospitals:
1. Type of variable are the observations? Number of successful operations out of 10. Discrete, but scale preserving.
2. Possible values for that variable? $X = 0, 1, 2, \dots, 10$
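
The answers to Exercises 3.1.2 and 3.1.3 describe both variables as counts. As a rough sketch, the figures in the handout's run-length and hospital tables can be checked against the stated possible values and then summarised. The numbers are copied from those tables. The function and variable names are illustrative only.

```python
from collections import Counter

# Run-length table for all 109 observations, copied from the handout:
# run length -> number of runs of that length.
run_counts = Counter({0: 71, 1: 28, 2: 5, 3: 2, 4: 2, 5: 1})

# Successful operations out of ten at each hospital, copied from the handout.
N_OPERATIONS = 10
successes = {"Hospital 1": 9, "Hospital 2": 5}


def n_observations(counts):
    """Total number of observations in a frequency table."""
    return sum(counts.values())


def mean_value(counts):
    """Mean of the underlying observations, computed from the table."""
    weighted = sum(value * freq for value, freq in counts.items())
    return weighted / n_observations(counts)


def relative_frequencies(counts):
    """Proportion of observations taking each value, in increasing order."""
    n = n_observations(counts)
    return {value: freq / n for value, freq in sorted(counts.items())}


# Exercise 3.1.2: run lengths should be non-negative integers 0, 1, 2, ...
assert all(isinstance(v, int) and v >= 0 for v in run_counts)

# Exercise 3.1.3: successes should lie in 0, 1, ..., 10.
assert all(0 <= s <= N_OPERATIONS for s in successes.values())

# Expanding the table recovers the observed values, but not their order
# along the transect, which the table has discarded.
run_lengths = sorted(run_counts.elements())

success_proportions = {h: s / N_OPERATIONS for h, s in successes.items()}

print(n_observations(run_counts), round(mean_value(run_counts), 3))
print(relative_frequencies(run_counts))
print(success_proportions)
```

Because both variables are scale preserving, means and proportions are meaningful here. The same arithmetic would not be justified for a Likert-type scale.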