INSTITUTE OF PHYSICS PUBLISHING
MEASUREMENT SCIENCE AND TECHNOLOGY
Meas. Sci. Technol. 15 (2004) 2047–2052
PII: S0957-0233(04)78367-8
SAODR: sequence analysis for outlier
data rejection
Franco Pavese and Daniela Ichim
Istituto di Metrologia ‘G Colonnetti’—CNR, Strada delle Cacce 73, 10135, Torino, Italy
E-mail: [email protected]
Received 17 March 2004, in final form 18 June 2004
Published 26 August 2004
Online at stacks.iop.org/MST/15/2047
doi:10.1088/0957-0233/15/10/014
Abstract
In automatic data acquisition, a sample is generally made up of several
instrumental readings. A series of readings is generally reduced to a single
value by simple methods, such as averaging. However, outlying values can
affect the series. The paper introduces an algorithm, named ‘sequence-analysis outlier data rejection’ (SAODR), which takes into account one of
the most common problems affecting the measurand during the acquisition,
i.e. a nonlinear drift with embedded sequences of outliers due to pulse-noise
peaks. The algorithm uses a time-ordering procedure and the ‘distances’
between successive readings. The frequent case of constant sampling rate is
discussed. The reported tests show the results obtained with Fortran 77 and
MATLAB® implementations of the algorithm. A rejection efficiency higher
than 99% was obtained.
Keywords: data analysis: algorithms and implementation, data management,
data acquisition: hardware and software, mathematical procedures and
computer techniques, time series analysis, time variability
(Some figures in this article are in colour only in the electronic version)
1. Introduction
The experiment for acquiring a sample yi , i = 1, . . . , I ,
consists of measurements of the output of an instrument.
Though there are acquisition schemes based on a single
reading, often—and more safely—each measurement consists
of several instrumental readings yij in sequence, made at
successive times tij , (yij , tij )i=1,...,I,j =1,...,J . Often, for each
i ∈ {1, . . . , I }, the J readings are on-line processed by the
instrument firmware (usually simply by averaging them), to
obtain yi . Most often, the value of the response variable y
depends on one or more influence variables xp , p = 1, . . . , P ,
whose values do not remain stable during the acquisition time
(typically a ‘drift’ is present). Therefore, instead of a stable
influence variable xp, with xpi1 = · · · = xpiJ for each i (stationary
case), there is a (small) change of xp in time, and hence of
yi: (yij , xpij , tij ), i = 1, . . . , I, j = 1, . . . , J (quasi-stationary case), showing
the so-called baseline drift.
The multiple-reading strategy is preferable, as it produces
samples less influenced by the presence of outliers if a
statistical analysis is performed on-line on the readings.
General methods of managing outliers or unusual values in
a dataset can be found in the literature in books addressing
diagnostics, robust statistics or filtering (e.g., [1]).
There are many of these methods, since their choice
critically depends on the application and on the definition of
the outlier type to be identified. This paper, as better specified
later, deals with the detection of spikes in the data (i.e., high-frequency features of the data sequence), including the small
ones [2], also discriminating them from sudden changes of
the baseline trend (drift, i.e., low-frequency components of
the data sequence). Data are assumed to form an ordered
sequence, like a time series (short sequences as considered
in the paper, or unlimited-length sampled signals). However,
any other ordering independent variable (e.g., spatial) can
be considered instead of time. Finally, an aim of the
algorithm is to be fast enough to be used also for on-line
rejection, in order to allow a real-time data integration avoiding
missing data in the subsequent analysis. For this type of
outlier and aim, detection methods based on cluster analysis
are not suitable. Also statistical methods have been judged
inadequate since they are unable to discriminate between
© 2004 IOP Publishing Ltd Printed in the UK
Figure 1. A typical sample of acquired data, made of a sequence of
instrumental readings. They show a trend, due to a low-frequency
signal drift, and a system noise. For external reasons, noise peaks
can randomly occur (square in the figure), affecting one or more
consecutive readings. [Plot of the readings y[i,j] against t[i,j]; the drift, the noise and an outlier are labelled.]
high- and low-frequency components of the data. Transform-domain methods and frequency filtering methods have been
ruled out as they are less suitable in the case of short sequences
of data (small data sets). Predictive models are limited to the
cases when the baseline drift trend can be predicted.
The method preferred pertains to the class of gradient
methods. This paper discusses a specific novel approach,
embedding also the decisions about detection. The threshold
implemented in the algorithm is more conventional, being
a variant of the median, but the user is free to make a
different choice, and even to add a double threshold [2] or
a training stage, by adapting the corresponding logical block
in the procedure.
After efficient removal of the outliers, a correct drift (baseline
variation) removal can be performed before averaging the
cleaned readings.
A two-step procedure has been introduced in [3]: for
each sample yi , (i) a pre-processing step rejects the outlying
readings yij then, (ii) the baseline drift is suppressed by means
of a regression applied to the ‘cleaned’ readings, to obtain
the valid output sample value yi . The first step consists of
a pre-processing algorithm, named ‘sequence-analysis outlier
data rejection’ (SAODR), aimed at being easily fitted into
the instrument firmware. It is based on the distances between
consecutive readings for the analysis of the outlying data. This
paper describes the algorithm and then shows the results of its
testing with simulated sequences of uniformly spaced data on
two implementations, in Fortran 77 and MATLAB®.
2. Assumptions on data characteristics
The characteristics of a data sequence yij , such as the very
general one shown in figure 1, can be summarized as follows:
(i) acquired data consist of an ordered sequence of readings
from an instrument (e.g., a digital voltmeter, bridge, . . . ),
which can be short and we will call the ‘signal’;
(ii) the signal is assumed to be quasi-stationary during the
readings’ acquisition;
(iii) there are three reasons for the signal value changes, of
various and generally independent origin:
1. signal drift due to the signal being quasi-stationary:
it is assumed to have a frequency spectrum limited to
sufficiently low frequencies, e.g., to allow modelling
with a low-order polynomial;
2. signal noise typically due to electrical instrumental
noise: random variations of the readings with a broad
frequency band (typically white noise), assumed to
be stationary within each sequence of readings;
3. pulse noise due to spot events in the environment, of
magnetic, electric (switching), mechanical (shocks)
nature, etc. It occurs as random spikes assumed
affecting not more than a maximum number K of
consecutive readings (generally a few), but with no
limits as to the number of occurrences, the position
within the sequence, the magnitude and the direction.
In estimating the ‘sample value’ yi and its associated
uncertainty from the readings yij , two problems can produce
misleading results when attempting outlier identification:
• the number, position and size of the outliers are unknown
• the low-frequency drift of the signal can be of the same
order of magnitude as the outlying readings.
Methods for outlier rejection based only on the values of the
readings cannot discriminate between components (1) and
(3) of the signal variation. Moreover, any simple
statistical estimate of the data (e.g., the standard deviation)
is affected by both the presence of outliers and signal drift
and, therefore, cannot simply be used as a robust threshold
for discriminating outliers.
3. The online algorithm SAODR
The SAODR algorithm neither computes the baseline nor
uses data values, but analyses the sequences of distances
between consecutive instrumental readings. For readings
non-uniformly spaced in time, the divided differences or the
Euclidean distances should be adopted to take into account
different scales in the two variables. In the case of equally-spaced data (constant sampling rate), which is the most
common case in automatic data acquisition, the vectorial
distance between two consecutive readings simply reduces
to the projection on the y-axis. SAODR has presently been
developed for this case.
The SAODR inputs are: the minimum number J of
data (ti,j , yi,j ), j = 1, . . . , J to be used for estimating
each yi , the number M ≥ J of input initial instrumental
readings (ti,j , yi,j ), j = 1, . . . , M and the hypotheses about
the outliers’ structure.
The latter can be summarized by assuming the maximum
number K of consecutive outlying readings allowed (more
than one in an outlier sequence is allowed to occur) and a
threshold distance value d0 for outlier distance discrimination.
These assumptions can be adapted according to the level of
knowledge of the data characteristics: e.g., in the case of a
sampled signal, K depends on the ratio between the noise-spike
time constant and the sampling rate: if this ratio is too high,
increasing unnecessarily the average value of K, provisions
can be taken to lower it in the algorithm rules. An initial
training can also be added for this purpose.
The basic algorithm steps for constant sampling rate are
the following, for each i = 1, . . . , I :
• Step 1. Acquire a number M ≥ J of instrumental readings
in sequence (ti,j , yi,j ), j = 1, . . . , M.
• Step 2. Compute the projection on the y-axis of the
(vectorial) distances between consecutive readings
yi,j +1 and yi,j :
$$ d_j = \mathrm{dist}(y_{i,j}, y_{i,j+1}) = |y_{i,j+1} - y_{i,j}|, \qquad j = 1, \ldots, M-1. $$
With each dj , j = 1, . . . , M − 1 associate the sign sj :
$$ s_j = \begin{cases} 1, & y_{i,j+1} \ge y_{i,j} \\ -1, & y_{i,j+1} < y_{i,j}. \end{cases} $$
In the following dj is called ‘distance’ and the sign sj is
called ‘direction’.
• Step 3. Define as candidate outlier distances those
distances dj for which dj > d0 (an a priori defined
threshold) and compute the number C of their occurrences.
• Step 4. If M − C − 1 < J , acquire A supplementary
readings and GOTO STEP 2.
• Step 5. Starting from each candidate outlier distance, say
the j th, analyse the subsequence of length L = K + 1 of
consecutive distances dj , . . . , dj +K .
(a) Declare as not outlier distance a candidate outlier
distance for which the relative subsequence of
distances does NOT satisfy one or both of the
following conditions:
(1) more than one candidate outlier distance exists,
except for the first and the last distances, d1 and
dM−1
(2) at least one change in direction occurs
(b) For all the remaining candidate outlier distances
collected in step 3, take the decisions using the
method of the ‘truth table’ to identify the outlying
readings to reject.
The ‘truth table’ presented in table 1 is for K = 2, but for
any other value of K a similar table can be constructed.
The feedback step 4 can be dropped if one does not require
to have all yi with the same statistical weight, i.e., computed
from the same number of readings.
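Steps 2 to 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' Fortran 77 or MATLAB code: the function names are hypothetical, indices are 0-based (so d[j] is the distance between readings j and j+1), and the threshold d0 is simply passed in.

```python
def distances_and_directions(readings):
    """Step 2: distances d_j = |y_{j+1} - y_j| and directions s_j (+1 or -1)."""
    d, s = [], []
    for prev, curr in zip(readings, readings[1:]):
        diff = curr - prev
        d.append(abs(diff))
        s.append(1 if diff >= 0 else -1)
    return d, s


def candidate_outlier_distances(d, d0):
    """Step 3: 0-based indices j of the distances with d_j > d0."""
    return [j for j, dj in enumerate(d) if dj > d0]


def needs_more_readings(M, C, J):
    """Step 4: True when M - C - 1 < J, i.e. supplementary readings are needed."""
    return M - C - 1 < J


# Hypothetical readings: a slow drift with one spike at position 3.
readings = [1.0, 1.1, 1.2, 5.0, 1.4, 1.5]
d, s = distances_and_directions(readings)
candidates = candidate_outlier_distances(d, d0=1.0)
C = len(candidates)
```

In this example the spike produces two large consecutive distances with opposite directions, which is the pattern that step 5 then examines.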
To estimate the sample yi value and its uncertainty,
regression can then be applied to the cleaned sequence of at
least J data using a suitable functional model of the baseline
drift, see figure 1.
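As an illustration of this last step, the sketch below fits a straight line to the cleaned readings by ordinary least squares and returns the scatter of the residuals. The linear drift model, the function name and the use of the residual standard deviation are assumptions made for the example; the paper only requires a suitable functional model (e.g., a low-order polynomial) and does not prescribe how the sample value and its uncertainty are derived from the fit.

```python
import math


def linear_drift_fit(t, y):
    """Least-squares straight line y = intercept + slope * t on cleaned readings.

    Returns (slope, intercept, s_res), where s_res is the residual standard
    deviation with n - 2 degrees of freedom (requires at least 3 readings).
    """
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]
    s_res = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, s_res


# Hypothetical cleaned readings lying exactly on a linear drift.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.5, 2.0, 2.5, 3.0]
slope, intercept, s_res = linear_drift_fit(t, y)
```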
4. The simplest SAODR implementation:
two-outlying-reading sequences
Figure 2. A typical reading sequence which contains an outlier to
be identified by the ‘truth table’. [Plot of the readings y[i,j] against t[i,j], with the distances dj, dj+1 and dj+2 between the readings j, j+1, j+2 and j+3 marked.]
Choice of the threshold. As the location statistic, an
augmented median has been used. The median performs
the most robust discrimination of the ‘regular’ distances
(ordered by size) from the potential outlying distances, since
obviously C < (M − 1)/2. However, simply rejecting all
the distances above the median value would be too crude and
would make acquisition ‘expensive’, because half of the distances would
be considered as candidate outlier distances. Consequently, a
higher threshold d0 is defined by augmenting the median by
a term depending on the standard deviation of the distances
below the median (outlier-free by definition), multiplied by an
adjustable constant, µ (in the notation of the appendix, Threshold = AM + AugFact*rms).
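A possible reading of this threshold rule is sketched below. The spread term follows the operations legible in the logic chart of figure 4 (square root of the sum of squared distances not exceeding the median, divided by their count); whether this matches the authors' code exactly, and the function name itself, are assumptions.

```python
import math
import statistics


def augmented_median_threshold(distances, mu):
    """Return d0 = AM + mu * spread, with AM the median of the distances."""
    am = statistics.median(distances)
    # Distances not exceeding the median are treated as outlier-free.
    below = [dj for dj in distances if dj <= am]
    # Spread as legible in the logic chart: sqrt(sum of squares) / count.
    spread = math.sqrt(sum(dj * dj for dj in below)) / len(below)
    return am + mu * spread


# Hypothetical distances: three small 'regular' ones and two large ones.
distances = [0.1, 0.1, 3.8, 3.6, 0.1]
d0 = augmented_median_threshold(distances, mu=3.0)
```

With these values the threshold lies above the regular distances and well below the two large ones, so only the latter become candidate outlier distances.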
Choice of the maximum length of the outlying-reading
sequence. The simplest implementation is the one that
assumes the noise spikes affecting only, at worst, two
consecutive readings (K = 2). From the point of view of
the acquisition cost, the size of the increase A = M − J of
the number of readings, with respect to the target number J , is
another necessary choice: one has to find a trade-off between
the increase in the number of readings (A) and the need to
resort to the feedback step 4, both having a time cost. A
good compromise seems to invoke the feedback only when
there is more than one outlying reading, thus setting in this
implementation A = 2.
4.1. The ‘truth table’
The analysis of the distance subsequences is the core of the
algorithm, see [3]. Its construction is based on the analysis
of all possible situations involving outliers in the reading
sequence. Let us consider, for example, a situation like the one
in figure 2 and analyse its main features. In such a situation,
a candidate outlier distance is dj; hence one of the analysed
subsequences of distances must be (dj , dj +1 , dj +2 ), in the case
K = 2. One can see that the candidate outlier distances in
this subsequence are dj and dj +2 and that there is only one
change of direction¹, namely between dj and dj +1 . In this
particular outlier configuration, the identified outliers must
be the readings on the positions j + 1 and j + 2 (in table 1,
this situation is identified by the row number 10 in the ‘truth
table’). If considered to be a possible outlier configuration
in the studied data acquisition, such a configuration must
¹ No change of direction is meant to represent not an outlier but a mere signal
‘drift’ (component 1 in section 2).
Table 1. The ‘truth table’ for K = 2; each analysed subsequence of distances starts with a candidate outlier distance.

No   dj > d0?   dj+1 > d0?   dj+2 > d0?   sj*sj+1 < 0?   sj+1*sj+2 < 0?   Declared outlier
 1   T          T            T            T              T                j+1
 2   T          T            T            T              F                j+1, j+2
 3   T          T            T            F              T                j+1, j+2
 4   T          T            T            F              F                None
 5   T          T            F            T              T                j+1
 6   T          T            F            T              F                j+1
 7   T          T            F            F              T                None
 8   T          T            F            F              F                None
 9   T          F            T            T              T                None
10   T          F            T            T              F                j+1, j+2
11   T          F            T            F              T                j+1, j+2
12   T          F            T            F              F                None
13   T          F            F            T              T                None*
14   T          F            F            T              F                None*
15   T          F            F            F              T                None*
16   T          F            F            F              F                None*

* If C = 1, the first or the last reading is declared outlier.
be identified by the ‘truth table’. The other rows of the
‘truth table’ are to be constructed in a similar manner, that
is, considering all the possible situations identifying outliers,
16 in total for K = 2, plus a few special cases for the first
and last readings. These few special cases are due to the
fact that the situation C = 1 could occur only when the first
and the last readings are involved. The columns of the ‘truth
table’ are identified by the distances and directions entering
in the analysis of a subsequence of K + 1 distances. That is,
dj , dj +1 , . . . dj +K , (sj , sj +1 ), . . . , (sj +K−1 , sj +K ). A complete
analysis of all the possible typical situations identifies those
sequences that contain outlier readings. Since they can be
listed in full, the action that must be taken for each combination
can be decided according to the ‘truth table’. Of course, for
greater K the number of subsequences that contain outliers
rapidly increases, and so does the analysing time. One of
the reasons why the parameter K increases can merely be
associated with a very high ratio between the pulse-noise time
constant and the sampling rate: obviously, if the former is
much higher than the latter, K can easily become too large and
should be reduced by a suitable procedure.
The resulting cleaned data sequence (or signal) would
look like figure 1 with the high reading (square dot)
suppressed (corresponding to figure 2 with the two large
distances suppressed), and with an additional clean reading added
at the end of the sequence as necessary to keep J constant.
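The K = 2 ‘truth table’ lends itself to a plain lookup. The sketch below encodes only the rows of table 1 that declare outliers and treats every other combination as ‘None’; it is an illustration under stated assumptions, not the authors' implementation. The names are hypothetical, indices are 0-based, and neither the pre-screening of step 5(a) nor the special cases for the first and last readings (the asterisked rows) are handled.

```python
# Key: (d[j+1] > d0, d[j+2] > d0, s[j]*s[j+1] < 0, s[j+1]*s[j+2] < 0).
# Value: offsets from j of the readings to reject.
TRUTH_TABLE_K2 = {
    (True, True, True, True): (1,),       # row 1
    (True, True, True, False): (1, 2),    # row 2
    (True, True, False, True): (1, 2),    # row 3
    (True, False, True, True): (1,),      # row 5
    (True, False, True, False): (1,),     # row 6
    (False, True, True, False): (1, 2),   # row 10
    (False, True, False, True): (1, 2),   # row 11
}


def distances_and_directions(readings):
    """Distances |y_{j+1} - y_j| and directions (+1 or -1) of a sequence."""
    diffs = [b - a for a, b in zip(readings, readings[1:])]
    return [abs(x) for x in diffs], [1 if x >= 0 else -1 for x in diffs]


def declared_outliers(d, s, j, d0):
    """Readings rejected for the subsequence starting at candidate distance j."""
    if j + 2 >= len(d):
        return []  # subsequence runs past the end: boundary cases not covered
    key = (d[j + 1] > d0, d[j + 2] > d0,
           s[j] * s[j + 1] < 0, s[j + 1] * s[j + 2] < 0)
    return [j + offset for offset in TRUTH_TABLE_K2.get(key, ())]


# Hypothetical two-reading spike, similar to the situation of figure 2.
spike = [1.0, 5.0, 4.8, 1.1, 1.2]
d, s = distances_and_directions(spike)
rejected = declared_outliers(d, s, 0, d0=1.0)
```

Here the candidate distance d[0] is followed by a small distance and a second large one, with a single change of direction between the first two, which is row 10 of table 1: readings 1 and 2 are rejected. A sudden step of the baseline, with no change of direction, matches none of the listed rows and nothing is rejected.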
5. Code implementation for the SAODR algorithm
and test results
The algorithm was initially implemented in FORTRAN 77
as a subroutine [3], and later also in MATLAB®. The
FORTRAN 77 subroutine takes a few microseconds and
the MATLAB® one takes about 3.3 ms to run on a slow
PC (clock rate <500 MHz). A block logic scheme of the
implementation is reported in the appendix and figures 4
and 5 for the case where step 4 is omitted. Speed being a
must, as the algorithm should be essentially transparent to
data acquisition and not influencing sampling rate, the code
allows stopping the procedure as soon as the presence of
further outliers can be excluded: e.g., if none or only one
candidate outlier distance is found, the procedure goes
directly to END; see figure 5 (A = 2). The study of the efficiency
in outlier detection was performed by using simulated data
sequences having the following parameters for the readings:
J = 18, M = J + 2, Omax = 4 (where Omax is the maximum
number allowed for the outlying readings in a sub-sequence:
Omax ≤ (M/2 − 1)/2, with Omax ≥ K − 1). The basic test
sequence of readings includes a random noise component and a
non-monotonic baseline drift affecting the readings by a factor
of about 2, but no outliers. Then, this basic test sequence is
automatically altered by including outlier elements, random
in number (up to a maximum value Omax ), in position in the
sequence, in relative size, R = ymax /y, and in ‘direction’.
By randomly varying all these simulation parameters and
the discrimination threshold, it is possible to check the ability
of the algorithm to identify outlying values closer and closer to
the random noise level, in order to also check the sharpness of
the discrimination threshold. Tests of groups of 10 000 trials
gave essentially the same results, therefore no extension of
the tests above 60 000 random sequences was considered. The
number of algorithm failures, i.e., of mismatches between the
simulated outliers and the ones recognized by the algorithm,
was tested as a function of the maximum outlier size relative
to the signal value, R.
In the former tests of the FORTRAN 77 routine, when
Omax matches the hypothesis K = 2, i.e., when Omax = 1,
the efficiency of the algorithm was found to be 100%. When
Omax > 1, there is a non-zero statistical probability for the
outliers within the sequence to form subsequences affecting
more than two consecutive readings, violating the assumptions
of the present algorithm implementation. This was the main
reason for a resulting ≈0.5% inefficiency at Omax = 4 for
outliers of size much larger than the signal (R = 15). A
residual inefficiency 0.1% could be due to the occurrence
of special sequences, such as multiple outliers starting from
the first reading: they could be taken into account by a more
complicated ‘truth table’, but the cost/benefit ratio may be too
high in most experimental cases.
Figure 3. The failure percentage of SAODR for Omax = 2
maximum outliers generated in the sequence, baseline drift present. [Plot of the failure percentage against R, for R from 0 to 50.]
Figure 4. SAODR logic chart, without procedure step 4, part 1. [Flowchart not reproduced. Legible blocks: INPUT (Readings, AugFact); initialization (Nread = length(Readings), nr = Nread − 1, OutPath = empty); computation of paths and signs for i = 1, ..., nr (paths(i) = Readings(i+1) − Readings(i), signs(i) = sign(paths(i)), paths(i) = abs(paths(i))); computation of the median AM of paths; accumulation of rms = rms + (paths(i))^2 and c = c + 1 over the paths with paths(i) <= AM; Rms = sqrt(rms)/c; Threshold = AM + AugFact*rms; definition of the candidate outlier paths by comparing paths(i) with Threshold and collecting them in OutPaths; GoodReadings = Readings when no candidate is found; start of the CASE ANALYSIS loop over j = 1, ..., nbout, with nbout = length(OutPath).]
Figure 5. SAODR logic chart, without procedure step 4, part 2. [Flowchart not reproduced. Legible blocks: for each candidate i = OutPath(j), index = 8 if paths(i+1) > threshold, otherwise index = 0; index = index + 4 if paths(i+2) > threshold; index = index + 2 and index = index + 1 on the two changes of sign; for index = 10, 11 or 15 reading i+1 is added to OutlierReadings; for index = 5, 6, 13 or 14 readings i+1 and i+2 are added; separate branches handle the first (i = ini) and last (i = nr) paths; the chart ends with REGRESSION AND UNCERTAINTY EVALUATION (offline or online) and END.]
With the MATLAB® code, testing gave essentially the
same results. Again, for the non-monotonic tested sequence,
outliers were randomly generated and the efficiency of the
algorithm was approximated by its opposite, that is the
failure percentage. The smaller the failure percentage, the
greater the efficiency of the algorithm. The randomness in
the outlier generation was again given by value, position,
dimension and direction. The outlier position in the data
sequence was randomly generated using a uniform discrete
distribution, while the direction was generated using the
Bernoulli distribution. The magnitude of the outlier was
generated by means of the uniform continuous distribution
(UC ) on [0, 1] and of the dimensioning parameter R, as for the
Fortran 77 tests. That is:
$$ b \sim U_C(0, 1), \qquad R \in (1.2, 50), \qquad V = R \cdot b \qquad (1) $$
where V is the final dimension of the outlier inserted in the
test sequence.
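The outlier generation described above can be sketched as follows. Several details are assumptions made for the example and are not stated in the paper: the function name, that V is added to the base reading, that at least one outlier is always inserted, and that the two directions are equally likely.

```python
import random


def insert_outliers(base, R, o_max, rng):
    """Return (altered copy of base, sorted outlier positions)."""
    altered = list(base)
    n_out = rng.randint(1, o_max)                    # random number of outliers
    positions = rng.sample(range(len(base)), n_out)  # uniform discrete positions
    for p in positions:
        b = rng.random()                             # b drawn uniformly from [0, 1)
        V = R * b                                    # equation (1)
        direction = 1 if rng.random() < 0.5 else -1  # Bernoulli direction
        altered[p] += direction * V                  # additive insertion (assumed)
    return altered, sorted(positions)


# Hypothetical flat test sequence of M = 20 readings, R = 15, Omax = 4.
rng = random.Random(0)
base = [1.0] * 20
altered, positions = insert_outliers(base, R=15.0, o_max=4, rng=rng)
```

Passing an explicit `random.Random` instance keeps each trial reproducible, which is convenient when many thousands of sequences are generated and the detected outliers must be compared with the inserted ones.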
In this testing version, the varied parameters were R, the
relative size of the outlier, and the maximum number of the
outliers generated in the sequence. R was allowed to vary
between 1.2 and 50 using a discrete step, while the maximum
number of outliers inserted in the initial test sequence of
readings was allowed to be 4, that is Omax ≤ 4. Again,
for each combination of the parameters, 60 000 versions of the
initial test sequence (inserting outliers) were generated. In the
case Omax = 1, we observed that the inserted outliers were
always detected, independently of their dimension. This is a
very positive result since it seems reasonable to believe that
most of the practical acquisition sequences would not contain
more than one outlier, at least for moderate lengths of the
sequences of data. For the other values of the parameter Omax ,
we observed that the inefficiency decreases as R increases up
to a value of R ≈ 2.5. For greater values of R, the inefficiency
no longer depends on the magnitude factor and it is about
0.22%, 0.33% and 0.75% for Omax = 2, 3, 4 respectively, as
can be seen in figure 3, for example.
6. Conclusions
An online pre-processing algorithm, named SAODR, has
been developed for outlier rejection during automatic data
acquisition, based on the distances between consecutive
readings and on the analysis of subsequences of these
distances, instead of resorting to the signal baseline. The
computation of the baseline drift is then postponed to an offline step of the cleaned sequence.
Implementations with routines written in FORTRAN 77
and MATLAB® have shown that the algorithm works properly
and it is as robust as expected against sequences of outliers,
independent of baseline drift and the size of the outliers.
This was tested via data simulation. Even with only a
few assumptions on the signal characteristics, one gets an
efficiency better than 99% (i.e., less than one outlier over 100
is missed).
The use of online outlier rejection would considerably
reduce, or avoid altogether, the number of measurements
processed off-line that later have to be rejected because they
are outlying, at a stage when new measurements generally
can no longer be added. This is particularly important
when the experimental design has been optimized by
reducing or eliminating the redundancy of the total number of
measurements, so that missing data can become a problem in
the data processing procedure.
Extension of the algorithm to non-uniformly spaced
data is possible: suitable definitions of ‘distance’ and of the
discrimination threshold are to be adopted to take into account
different scales in the two variables.
Extension is also easy to outlier occurrences affecting
longer sequences of consecutive readings, at the cost of
a heavier ‘truth table’, hence of an increasing computing
time. However, the code runs so fast that it hardly
affects the acquisition time except for fast sampling rates
(>10 kHz with FORTRAN and >30 Hz with MATLAB®: an
implementation in C is likely to substantially increase these
limits).
In conclusion, the SAODR algorithm is simple and
fast, and it is independent of the presence of a baseline drift of
any shape, provided only that the frequency spectrum of the
drift is limited to frequencies much lower than the sampling
rate. It could therefore be implemented directly into instrumentation
firmware (using machine code) to estimate sample values
and their associated uncertainties unaffected by
the presence of outlying signal values, in a much more
efficient way than the usual simple averaging (performed
either by analogue or numerical means) presently used in
most instruments.
A runtime version of the MATLAB® code has also been
embedded in a LabView® (version 6.1) acquisition procedure,
where it works fine.
Appendix. Block logic scheme of SAODR actual
implementation.
Figures 4 and 5 show SAODR implementation flowcharts. The
reported implementation shows the case where the feedback
loop (see paper) is omitted. The codes and full documentation
are available on request from [email protected] and also
for downloading on www.amctm.org, the website of the
SofTools MetroNet EU Network on Advanced Mathematical
and Computational Tools in Metrology.
The use of the algorithm and the codes is open-source
(‘OSI Certify Open Source Software’), with GNU general-public and library licences for non-commercial use.
Basic definitions:
Readings = the J instrument readings
AugFact = augmentation factor µ
paths = distance between two consecutive readings
signs = sign of paths
AM = median
rms = standard deviation of the paths ≤ AM
OutPaths = candidate outlying paths
GoodReadings = readings not outlying according to SAODR: vector of ‘cleaned’ readings
OutlierReadings = readings corresponding to outliers, eliminated from the ‘cleaned’ readings
index = see table of truth
U = union of
References
[1] Simonoff J S 1991 Directions in Robust Statistics and
Diagnostics—Part II (Berlin: Springer)
[2] Das M and Hunt T 1998 IEEE Midwest Symp. on Systems and
Circuits (Southbend, USA) p 501
[3] Pavese F, Ichim D and Ciarlini P 2001 Advanced Mathematical
Tools in Metrology V, Series on Advances in Mathematics for
Applied Sciences vol 57 (Singapore: World Scientific) p 283