Outlier Detection and the Estimation of Missing Values
Martin Charlton and Paul Harris
National Centre for Geocomputation
National University of Ireland Maynooth
Maynooth, Co Kildare, IRELAND
ESPON 2013 Programme Workshop
Managing Time Series and Estimating Missing Values
6 May 2010
Luxembourg
www.StratAG.ie
Outline
• Time Series
• ESPON DB data issues
• Detecting exceptional values
• Estimation of missing values
• Case study
1: Time Series
What is a time series?
• A variable which is measured sequentially in
time at fixed sampling intervals is known as
a time series
• The behaviour of such series can be
modelled
• The main features of time series are trend
and (sometimes) seasonal variation
• Observations which are close together in
time tend to be correlated
Air Passengers 1949-1960
A time plot of the number of air passengers per month between January 1949 and December 1960 in the USA reveals a rising trend.
There is also a seasonal pattern of travel within each year. More people travel in the summer than the winter.
[Figure: time plot of monthly air passengers; x-axis: Time (1949 to 1960), y-axis: Passengers (1000's)]
Aggregating the series annually reveals the rising trend, and the boxplot shows that more people travel in the summer months.
[Figure: two panels; annual totals, aggregate(AP), against Time (1949 to 1960), and boxplots of passengers by month (1 to 12)]
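The two views described above, annual aggregation for the trend and grouping by calendar month for the seasonal pattern, can be sketched in a few lines. The axis label aggregate(AP) suggests the original plots were produced in R; what follows is only a minimal Python sketch on a synthetic series (not the air passenger data), and the function names are illustrative assumptions.

```python
from statistics import median

def annual_totals(monthly, start_year):
    """Sum a monthly series (assumed to start in January) into yearly totals."""
    totals = {}
    for t, value in enumerate(monthly):
        year = start_year + t // 12
        totals[year] = totals.get(year, 0) + value
    return totals

def values_by_month(monthly):
    """Group values by calendar month (1..12): the input to a monthly boxplot."""
    groups = {month: [] for month in range(1, 13)}
    for t, value in enumerate(monthly):
        groups[t % 12 + 1].append(value)
    return groups

# Synthetic series (an assumption for illustration): a linear trend plus a
# fixed uplift in June, July and August.
series = [100 + 2 * t + (30 if t % 12 in (5, 6, 7) else 0) for t in range(36)]

totals = annual_totals(series, 1949)
groups = values_by_month(series)
print(totals)                                        # rising annual totals
print({m: median(v) for m, v in groups.items()})     # summer months stand out
```

The annual totals remove the within-year pattern and expose the trend, while the per-month groups remove the ordering in time and expose the seasonal pattern.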
Forecasting: 1
Here we use the Holt-Winters procedure to model the series behaviour…
The fit is quite promising.
There are many modelling and forecasting techniques.
[Figure: Holt-Winters filtering; Observed / Fitted against Time (1949 to 1960)]
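For readers unfamiliar with the Holt-Winters procedure, the following is a minimal sketch of its additive form: a level, a trend and a set of seasonal terms, each updated by exponential smoothing. It is not the authors' code. The smoothing parameters are fixed by hand here (in practice they are usually estimated from the data), the initialisation is one simple choice among several, and the series is synthetic.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters smoothing with season length m.

    Returns (one-step-ahead fitted values for t >= m, forecasts for `horizon` steps).
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    trend = (second - first) / m                 # average change per time step
    level = first + trend * (m - 1) / 2          # level at the end of season one
    # Seasonal terms: first-season values with the initial trend line removed.
    seasonal = [y[i] - (first + trend * (i - (m - 1) / 2)) for i in range(m)]

    fitted = []
    for t in range(m, len(y)):
        s_prev = seasonal[t % m]
        fitted.append(level + trend + s_prev)    # prediction made before seeing y[t]
        new_level = alpha * (y[t] - s_prev) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y[t] - new_level) + (1 - gamma) * s_prev
        level = new_level

    n = len(y)
    forecast = [level + h * trend + seasonal[(n - 1 + h) % m]
                for h in range(1, horizon + 1)]
    return fitted, forecast

# Synthetic quarterly series (an assumption): linear trend plus a repeating pattern.
pattern = [10, -5, -15, 10]
y = [100 + 2 * t + pattern[t % 4] for t in range(24)]
fitted, forecast = holt_winters_additive(y, m=4, alpha=0.3, beta=0.1, gamma=0.2, horizon=4)
print(forecast)
```

On real data the fitted values will not match the observations exactly; comparing the two, as in the Observed / Fitted plot, is how the quality of the fit is judged.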
Forecasting: 2
And if the growth of US air traffic during the first 4 years of the 1960s follows the pattern of the previous 12… the forecast is for some 800 thousand passengers a month by 1965
[Figure: observed series and forecast against Time (1950 to 1965); y-axis ticks 100 to 800]
Models
• There is a wide variety of different models, including
– Basic stochastic models (like Holt-Winters)
– Stationary models (AR, MA, ARMA)
– Non-stationary models (ARIMA, ARCH)
– Spectral analysis (based on the Fourier transform)
– Multivariate models (two or more series are involved)
2: ESPON DB Data Issues
Some typical data… household income
The NUTS2 regions in Austria are the Länder – here we have short time
series concerning disposable income of private households from 1995 to
2007. Each series has only 13 elements.
We might normalise these by the population to reach a comparable ‘per
capita’ figure
Short series…
• We should be aware that there is an interaction
between the amount of data available and what can
be done with it
• Paas, Kusk, Schlitte and Võrk’s 2007 analysis of
income convergence in selected countries of the EU
using NUTS3 data had this to say:
George Box, 1976, Science and Statistics
• Models include not just the analytical tools that
others might use, but also those which we use to
examine the data for outliers and to estimate values
• ‘Wrong’ for Box includes models that fail to
encapsulate the process under investigation
ESPON Tigers
• Long time series tend to be for large areal
units, such as countries, or major
administrative regions – the Modifiable Areal
Unit Problem (MAUP) may well also be a tiger
• Smaller regions…
– shorter series
– incomplete series
– a long time period between elements (decennial
censuses) in the case of very small units
3: Detecting Exceptional Values
Exceptional values
• Two types:
1. Logical errors (e.g. negative unemployment rate)
2. Statistical outliers (e.g. unusually high unemployment rate)
• Identification methods
1. Logical errors: mechanical (& statistical) techniques
2. Statistical outliers: statistical techniques
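The distinction between the two types can be made concrete with a small sketch: a mechanical range check catches logical errors, while a statistical rule (here the usual boxplot fences at 1.5 times the interquartile range) flags unusually large or small values. The rates, the fence multiplier and the function names below are assumptions for illustration only.

```python
from statistics import quantiles

def logical_errors(rates):
    """Indices of rates that are impossible by definition (outside 0..1)."""
    return [i for i, r in enumerate(rates) if r < 0 or r > 1]

def boxplot_outliers(rates, k=1.5):
    """Indices of logically valid rates lying beyond the boxplot fences."""
    valid = [(i, r) for i, r in enumerate(rates) if 0 <= r <= 1]
    q1, _, q3 = quantiles([r for _, r in valid], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, r in valid if r < low or r > high]

# Hypothetical unemployment rates: one is negative, one is unusually high.
rates = [0.05, 0.06, 0.07, 0.05, 0.06, 0.08, 0.07, 0.30, -0.02]
print(logical_errors(rates))     # the negative rate
print(boxplot_outliers(rates))   # the unusually high rate
```

The logical check needs no statistical model at all, which is why it is applied first; the statistical rule is then applied only to values that pass it.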
Types of outliers
Our approach
• There is no single ‘best’ detection technique, so…
1. Apply a selection of outlier detection methods, which are simple and robust
2. Flag an observation if it is a likely outlier according to each technique
3. Build up the weight of evidence for the likelihood of a value being statistically exceptional
4. Suggest what type of outlier it is likely to be
– aspatial, spatial, temporal, relationship, a mixture
5. Consult an expert on the data to decide on the appropriate course of action
Issues
• Temporal outliers
– The time series are often too short to apply a ‘standard’ technique reliably
– So... Parallel time series are treated as additional variables (there will be a high positive correlation between series from different years)
– Then... Apply an aspatial/spatial/relationship detection technique
– That is... We add the spatial component which is then treated either implicitly or explicitly
• Modifiable Areal Unit Problem (MAUP)
– Identify exceptional values at the finest spatial resolution
Weight of evidence
• If we apply a range of techniques, then we
can build up the weight of evidence for the
likelihood of an observation being
exceptional
• Observations which are exceptional on most
or all of the tests are those which we would
select for further investigation
• Here’s an example showing three
observations…
Identification technique | Identification type
1. Boxplot | Aspatial & univariate
2. Bagplot | Aspatial & bivariate; relationship
3. Residuals from locally weighted mean & Hawkins test statistic | Spatial & univariate
4. Residuals from multiple linear regression* (requires modelling decisions) | Aspatial & multivariate; linear relationships
5. Residuals from locally weighted regression* (requires modelling decisions) | Aspatial & multivariate; nonlinear relationships
6. Residuals from geographically weighted regression* (requires modelling decisions) | Spatial & multivariate; nonlinear relationships
7. Basic & robust principal component analysis* (model-decision free) | Aspatial & multivariate; linear relationships
8. Locally weighted principal component analysis* (model-decision free) | Aspatial & multivariate; nonlinear relationships
9. Geographically weighted principal component analysis* (model-decision free) | Spatial & multivariate; nonlinear relationships
* Can have a spatial, univariate form if the coordinate data are used as variables
[Columns Obsn. 1, Obsn. 2 and Obsn. 3 mark with ‘Yes’ the techniques that flag each observation: six flags in total for Obsn. 1 and Obsn. 2, seven for Obsn. 3]
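The weight-of-evidence idea amounts to counting, for each observation, how many techniques flag it, and then ranking the observations by that count. A minimal sketch follows; the technique names are shortened and the flags are hypothetical, not the values from the slide.

```python
def weight_of_evidence(flags, observations):
    """Count how many techniques flag each observation.

    flags: dict mapping technique name -> set of flagged observation ids.
    Returns a list of (observation, count, share of techniques), most-flagged first.
    """
    counts = {obs: 0 for obs in observations}
    for flagged in flags.values():
        for obs in flagged:
            if obs in counts:
                counts[obs] += 1
    n_techniques = len(flags)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
    return [(obs, count, count / n_techniques) for obs, count in ranked]

# Hypothetical flags from three techniques for three observations.
flags = {
    "boxplot": {"A"},
    "regression residuals": {"A", "B"},
    "PCA scores": {"A"},
}
evidence = weight_of_evidence(flags, ["A", "B", "C"])
for obs, count, share in evidence:
    print(obs, count, round(share, 2))
```

Observations at the top of the ranking are the candidates to refer to an expert on the data; which techniques agree also hints at the type of outlier involved.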
4: Estimating Missing Data
Data estimation techniques
• There is an enormous range of possibilities
– Choice depends on
• Data type, size, dimensionality, and properties
• Objective – accuracy of the prediction or of the prediction uncertainty
• Model complexity
– We can estimate missing values using...
• Averaging
• Regression (with or without autocorrelation, global and local)
• Inverse distance weighting
• Regression Kriging
• Co-Kriging
• Bayesian Markov Chain Monte Carlo methods
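Of the techniques listed, inverse distance weighting is the simplest to show in full: a missing value is estimated as a weighted average of known values, with weights that decay with distance. The sketch below assumes planar coordinates (for example region centroids) and a power of 2; both are illustrative choices, not settings taken from the presentation.

```python
import math

def idw_estimate(target, known, power=2.0):
    """Inverse distance weighted estimate at `target` = (x, y).

    known: iterable of (x, y, value) for locations with observed values.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in known:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value            # an observed value exists at this location
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no known values supplied")
    return numerator / denominator

# Hypothetical centroids and values.
known = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
print(idw_estimate((1.0, 0.0), known))   # midway between the two points
print(idw_estimate((0.5, 0.0), known))   # pulled towards the nearer point
```

Unlike the regression and kriging methods in the list, this estimator provides no measure of prediction uncertainty, which is one reason the choice of method depends on the objective.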
5: Case study
Identifying NUTS regions with
exceptional time-series values
Unemployment at NUTS 2/3, 2000-2007
• A dataset for NUTS 2/3 regions was obtained
from UMS-RIATE
• For each year there are counts of
– Economically active population
– Unemployed, economically active population
• Shapefile created from NUTS2/NUTS3
shapefiles in Mapkit
• Analysis undertaken in R
Eight ‘unemployment rate’ variables for
2000 to 2007
Rate = [Unemployed/Economically active]
790 x 8 observations at NUTS 2/3 level
Some island data removed
Data post-processing
• Logical input errors
– Original data checked
– There appear to be none, although there appear to be a few exceptional
values
• Assessing outlier detection methods
– 320 values randomly picked (~5% of the data)
• These are in 271 regions
– Values doubled and then randomly redistributed among
the 320 positions in the data
– These observations are assumed to be outlying in some
way (but we cannot guarantee this)
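The contamination step described above can be sketched as follows: pick positions at random in the region-by-year table, double the values found there, and shuffle the doubled values among the same positions. This is a minimal sketch of that description, not the authors' R code; the seed, the synthetic rates and the function name are assumptions.

```python
import random

def contaminate(data, n_pick, factor=2.0, seed=0):
    """Return (contaminated copy of data, list of picked (row, column) positions)."""
    rng = random.Random(seed)
    positions = [(r, c) for r in range(len(data)) for c in range(len(data[0]))]
    picked = rng.sample(positions, n_pick)            # distinct positions
    inflated = [data[r][c] * factor for r, c in picked]
    rng.shuffle(inflated)                             # redistribute among the picked positions
    contaminated = [row[:] for row in data]
    for (r, c), value in zip(picked, inflated):
        contaminated[r][c] = value
    return contaminated, picked

# Synthetic 790 x 8 table of rates (an assumption standing in for the real data).
rng = random.Random(42)
rates = [[rng.uniform(0.02, 0.20) for _ in range(8)] for _ in range(790)]
contaminated, picked = contaminate(rates, n_pick=320, seed=1)
affected_regions = {r for r, _ in picked}
print(len(picked), len(affected_regions))
```

Because several picked values can fall in the same region, the number of affected regions is at most the number of picked values. A doubled value that lands in a region with high rates may not look unusual at all, which is why the modified observations can only be assumed, not guaranteed, to be outlying.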
Effect of
outliers?
Merely looking at some maps
doesn’t help in easily
identifying the regions with
exceptional values
Interseries correlations
Those plots about the main diagonal are highly correlated.
The effect of the randomly introduced values is clearer on the more distant plots (these are also ‘distant’ in time)
Detection Techniques for comparison
• Simple time-series approach (TS) – outlined
in FIR: we have used a simplified version
• Principal Components Analysis (PCA)
• Geographically Weighted Principal Components Analysis (GWPCA)
– The PCA based methods allow us to consider
more than simply pairs of time series
simultaneously
We’ll compare the various methods
Time Series method (TS)
• For each of the 790 regions, an index TS is calculated at each of
the 8 time observations (using the 8-observation data set):
• TS = [observation – mean]² / [variance]
• Assuming Gaussian errors, a time observation is taken as
outlying if TS > 3.84 (the 95% critical value of a chi-squared
distribution with one degree of freedom)
• In this study, we simply find outliers according to boxplot
statistics
• An indicator variable is then set at any region for which at
least one time observation is outlying
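The index defined above can be computed per region as in the following sketch. It implements the thresholded form (TS > 3.84), not the boxplot variant used in the study, and it assumes the sample variance with an n - 1 denominator, which the slide does not specify. The example series is hypothetical.

```python
from statistics import mean, variance

THRESHOLD = 3.84  # value quoted in the text (95% level)

def ts_index(series):
    """TS = (observation - mean)^2 / variance for each time observation."""
    mu = mean(series)
    var = variance(series)          # sample variance (n - 1): an assumption
    if var == 0:
        return [0.0] * len(series)  # a constant series has no outlying times
    return [(x - mu) ** 2 / var for x in series]

def outlying_times(series, threshold=THRESHOLD):
    """Indices of time observations whose TS exceeds the threshold."""
    return [t for t, value in enumerate(ts_index(series)) if value > threshold]

def region_indicator(series, threshold=THRESHOLD):
    """True if at least one time observation in the region is outlying."""
    return bool(outlying_times(series, threshold))

# Hypothetical 8-year series for one region with a jump in the final year.
series = [5, 5, 5, 5, 5, 5, 5, 12]
print(ts_index(series))
print(outlying_times(series), region_indicator(series))
```

With only 8 observations the mean and variance are themselves strongly affected by any exceptional value, which illustrates why standard time-series techniques are hard to apply reliably to short series.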
Principal Components Analysis (PCA)
• Principal Components Analysis is a
technique which transforms m correlated
variables into m new variables which are
uncorrelated with one another
• All of the variance in the original m
variables is retained during the
transformation
• Values of the new variables are known as
scores – we can use these for identifying
exceptional values
Geographically Weighted PCA
• PCA is a global transformation but it
ignores the spatial arrangement of the NUTS
regions
• With GWPCA we obtain local
transformations by applying geographical
weighting – this gives us a set of
components for each NUTS region
• We can use the scores from these local
transformations to identify exceptional
values
PCA for the unemployment series
The series are highly correlated, so the first component accounts for the
majority of the variance
Using PCA and GWPCA
• Examine the residual component data (those with
small variances)
• Use boxplot statistics to define outlying values
• In this case, a significant result indicates one or
more outlying time observations in a NUTS region
• GWPCA will also indicate a spatial ‘outlyingness’ in
the data
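A simplified version of the global PCA step can be sketched without any numerical library. Instead of extracting every small-variance component, the sketch finds the first principal component by power iteration and measures how far each region lies from it; that distance combines all the remaining (residual) components into one number, to which a boxplot upper fence is then applied. This is a simplification chosen for brevity, not the authors' procedure, and the data are synthetic.

```python
import math
from statistics import mean, quantiles

def residual_norms(rows, iterations=200):
    """Distance of each row from the first principal component axis."""
    n, m = len(rows), len(rows[0])
    means = [mean(col) for col in zip(*rows)]
    centred = [[v - mu for v, mu in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(m)]
           for i in range(m)]
    vec = [1.0 / math.sqrt(m)] * m               # starting direction
    for _ in range(iterations):                  # power iteration for the leading eigenvector
        nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    norms = []
    for row in centred:
        score = sum(a * b for a, b in zip(row, vec))          # first-component score
        norms.append(math.sqrt(sum((a - score * b) ** 2 for a, b in zip(row, vec))))
    return norms

def upper_boxplot_flags(values, k=1.5):
    """Indices of values above the upper boxplot fence."""
    q1, _, q3 = quantiles(values, n=4)
    fence = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v > fence]

# Synthetic data: 20 regions, 4 highly correlated yearly series, one doubled value.
rows = [[base + 0.1 * k for k in range(4)] for base in range(5, 25)]
rows[10][2] *= 2
norms = residual_norms(rows)
flagged = upper_boxplot_flags(norms)
print(flagged)
```

Because the series are highly correlated, almost all of the variance lies along the first component, so a single disturbed value shows up clearly in what is left over. A geographically weighted version would repeat the calculation for each region with distance-based weights on its neighbours.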
The various techniques are compared on the
next slides
(a) TS method compared with PCA
The TS method appears to be less discriminating than the global PCA method
(b) TS compared with GWPCA
The GWPCA method would appear to be very discriminating in identifying
potentially exceptional regions
(c) PCA compared with GWPCA
The global PCA is slightly less discriminating than the GWPCA
Results for the 271 randomised sites
• Sites not identified as outlying – 21.4%
• Outlying by at least one method – 78.6%
• Outlying by one method only – 55.3%
• Outlying by two methods – 18.8%
• Outlying by all three methods – 4.8%
• Identification by method:
– TS (75.6%)
– PCA (22.5%)
– GWPCA (8.8%)
• False positives at 519 unaffected sites:
– TS (29.5%)
– PCA (2.3%)
– GWPCA (1.3%)
• These results endorse the “weight of evidence” approach
to the identification of exceptional values…
Acknowledgements
• We are disappointed that Eyjafjallajökull
decided to send some ash to Ireland
• We are deeply grateful to Claude for
presenting this work – some of it is not easy
• We also acknowledge statistical advice from
Professor Chris Brunsdon, Professor of
Geographic Information at the University of
Leicester
Thank You!