
The Study of the Sleep and Vigilance
Electroencephalogram Using Neural Network
Methods
Mayela E. Zamora
St Cross College
Supervisor: Prof. L. Tarassenko
Sponsor: Universidad Central de Venezuela
A thesis is submitted to the
Department of Engineering Science,
University of Oxford,
in fulfilment of the requirements for the degree of
Doctor of Philosophy.
Hilary Term, 2001
Declaration
I declare that this thesis is entirely my own work and, except where otherwise stated, describes my own
research.
M. E. Zamora,
St Cross College
Mayela E Zamora
St Cross College
Doctor of Philosophy
Hilary Term, 2001
The Study of the Sleep and Vigilance Electroencephalogram
Using Neural Network Methods
Abstract
This thesis describes the use of neural network methods for the analysis of the electroencephalogram
(EEG), primarily in subjects with a severe sleep disorder known as Obstructive Sleep Apnoea (OSA). This
is a condition in which breathing stops briefly and repeatedly during sleep, causing frequent awakening
as the subject gasps for breath. Day-time sleepiness is the main symptom of OSA, but the current methods
of assessing the level of drowsiness are either time-consuming (e.g. scoring the EEG) or unreliable (e.g. subjective
measurement of the person's sense of sleepiness, performance in vigilance tasks, etc.). The work presented
in this thesis is two-fold. In the first part, a method for the automatic detection of micro-arousals from
features extracted from single-channel EEG is developed and tested. Autoregressive (AR) modelling is used for
extracting the features from the EEG. A compromise was found between the stationarity requirements of
AR modelling and the variance of the AR estimates by using a 3-second analysis window with a 2-second
overlap. The EEG features are then used as the inputs to a multi-layer perceptron (MLP) neural network
trained to track the sleep-wake continuum. It was found that a micro-arousal may cause an increase in the
slow rhythms (δ band) of the EEG at the same time as it causes an increase in the amplitude of the higher
frequencies (α and/or β bands). The automated system shows high sensitivity Se (median 0.97) and
positive predictive accuracy PPA (median 0.94) when validated against a human expert's scores. This
is the first time that AR modelling has been used in micro-arousal detection. Visualisation analysis of the
EEG features revealed that Alertness and Drowsiness in vigilance tests are not the same as Wakefulness
and REM/Light Sleep in a sleep-promoting environment. The second part of the thesis describes the
application of another MLP neural network, trained to track the alertness-drowsiness continuum from
single-channel EEG, on OSA patients performing a visual attentional task. It was found that OSA subjects
may present “drowsy” EEG while performing well during the visual vigilance test. Also, the MLP analysis
of the wake EEG with these subjects showed that the transition to drowsiness may occur progressively
as well as in sudden dips. Correlation of the MLP output with a measure of task performance and
visualisation of EEG patterns in feature space show that the alertness EEG patterns of OSA subjects may
be closely related to the drowsiness EEG patterns of normal sleep-deprived subjects.
Acknowledgments
I am most grateful to Prof Lionel Tarassenko for supervising this work. Many thanks to my collaborators
at the Osler Chest Unit, Churchill Hospital, Dr John Stradling, Dr Melissa Hack, Dr Robert Davies and Dr
Lesley Bennett for providing the test data and the valuable clinical support. I would also like to thank Dr
Chris Alford for his helpful comments on the clinical aspects of this work.
To the Consejo de Desarrollo Cientifico y Humanistico de la Universidad Central de Venezuela, I extend
my sincere gratitude for the financial support, and to the staff of its Departamento de Recursos Humanos
for the quality of service that they gave me during my stay in the UK.
Also, I am very appreciative of all my fellow labmates, especially Dr Mihaela Duta, Dr Ruth Ripley, David
Clifton, Gari Clifford, Dr Simukai Utete, Dileepan Joseph, Iain Strachan, Dr Steve Collins, Dr Taigang He
and Dr Neil Townsend for their friendship and help. Special thanks to Jan Minchington for the efficient
office support and natural kindness. To all my friends in Oxford, and in Caracas, a million thanks.
Finally, and most importantly, endless gratitude to my parents, to my sisters, and to Neal for their continuous support, cheering and love.
working hard on vigilance...
Contents
1 Introduction . . . 1
   1.1 Overview of thesis . . . 1
2 Sleep and day-time sleepiness . . . 3
   2.1 Sleep, wakefulness, sleepiness and alertness . . . 3
      2.1.1 Definitions . . . 3
      2.1.2 The process of falling asleep . . . 3
      2.1.3 Going on to a deeper sleep . . . 4
   2.2 Breathing and sleep . . . 4
      2.2.1 Normal sleep . . . 4
      2.2.2 Obstructive Sleep Apnoea . . . 5
   2.3 Daytime sleepiness . . . 7
      2.3.1 Causes . . . 7
      2.3.2 Sleepiness/fatigue related accidents . . . 8
      2.3.3 Correlation between OSA and accidents . . . 8
   2.4 Measuring the sleep-wake continuum . . . 9
      2.4.1 Measuring sleepiness . . . 9
      2.4.2 Measuring sleep . . . 10
3 Previous work on EEG monitoring for micro-arousals and day-time vigilance . . . 12
   3.1 The EEG . . . 12
      3.1.1 Origin of the brain electrical activity . . . 13
      3.1.2 Description of the EEG . . . 14
      3.1.3 Recording the EEG . . . 16
      3.1.4 Extracerebral potentials . . . 16
   3.2 Analysis of the EEG during sleep . . . 19
      3.2.1 Changes in the EEG from alert wakefulness to deep sleep . . . 20
      3.2.2 Visual scoring method . . . 22
      3.2.3 Computerised analysis of the sleep EEG . . . 24
   3.3 Analysis of the EEG for the detection of micro-arousals . . . 29
      3.3.1 Cortical arousals . . . 29
      3.3.2 ASDA rules for cortical arousals . . . 31
      3.3.3 Computerised micro-arousal scoring . . . 32
      3.3.4 Using physiological signals other than the EEG . . . 33
      3.3.5 Using the EEG in arousal detection . . . 34
   3.4 Analysis of the EEG for vigilance monitoring . . . 35
      3.4.1 Changes in the EEG from alertness to drowsiness . . . 35
      3.4.2 EEG analysis in vigilance studies . . . 36
      3.4.3 Vigilance monitoring algorithms . . . 38
4 Parametric modelling and linear prediction . . . 43
   4.1 Spectrum estimation . . . 43
      4.1.1 Deterministic continuous in time signals . . . 43
      4.1.2 Stochastic signals . . . 43
   4.2 Autoregressive Models . . . 50
   4.3 AR parameter estimation . . . 54
      4.3.1 Asymptotic stationarity of an AR process . . . 54
      4.3.2 Yule-Walker equations . . . 55
      4.3.3 Using an AR model . . . 56
   4.4 Linear Prediction . . . 61
      4.4.1 Wiener Filters . . . 61
      4.4.2 Linear Prediction . . . 63
   4.5 Maximum entropy method (MEM) for power spectrum density estimation . . . 66
   4.6 Algorithms for AR modelling . . . 67
      4.6.1 Levinson-Durbin recursion to solve the Yule-Walker equation . . . 67
      4.6.2 Other algorithms for AR parameter estimation . . . 72
      4.6.3 Sensitivity to additive noise of the AR model PSD estimator . . . 78
   4.7 Modelling the EEG . . . 78
5 Neural network methods . . . 81
   5.1 Neural Networks . . . 82
      5.1.1 The error function . . . 84
      5.1.2 The decision-making stage . . . 89
      5.1.3 Multi-layer perceptrons . . . 90
   5.2 Optimisation algorithms . . . 96
      5.2.1 Gradient descent . . . 96
      5.2.2 Conjugate gradient . . . 98
   5.3 Model order selection and generalisation . . . 99
      5.3.1 Regularisation . . . 100
      5.3.2 Early stopping . . . 102
      5.3.3 Performance of the network . . . 102
   5.4 Radial basis function neural networks . . . 104
      5.4.1 Training an RBF network . . . 105
      5.4.2 Comparison between an RBF and an MLP . . . 108
   5.5 Data visualisation . . . 109
      5.5.1 Sammon map . . . 109
      5.5.2 NeuroScale . . . 110
   5.6 Discussion . . . 113
6 Sleep Studies . . . 115
   6.1 Using neural networks with normal sleep data: benchmark experiments . . . 115
      6.1.1 Previous work on normal sleep . . . 115
      6.1.2 Data Extraction . . . 118
      6.1.3 Feature extraction . . . 119
      6.1.4 Assembling a balanced database . . . 121
      6.1.5 Data visualisation . . . 122
      6.1.6 Training a Multi-Layer Perceptron neural network . . . 123
      6.1.7 Sleep analysis using the trained neural networks . . . 126
   6.2 Using the neural networks with OSA sleep data . . . 130
      6.2.1 Data description, pre-processing and feature extraction . . . 130
      6.2.2 MLP analysis . . . 131
      6.2.3 Detection of μ-arousals . . . 131
      6.2.4 The choice of threshold . . . 136
      6.2.5 Discussion . . . 140
   6.3 Summary . . . 142
   6.4 Conclusions . . . 145
7 Visualisation of the alertness-drowsiness continuum . . . 146
   7.1 The vigilance database . . . 147
      7.1.1 Pre-processing . . . 149
      7.1.2 Visualising the vigilance database . . . 149
      7.1.3 Discussion . . . 150
   7.2 Visualising vigilance and sleep data together . . . 152
      7.2.1 Discussion . . . 153
   7.3 Conclusions . . . 154
8 Training a neural network to track the alertness-drowsiness continuum . . . 164
   8.1 Neural Network training . . . 164
      8.1.1 The training database . . . 164
      8.1.2 The neural network architecture . . . 165
      8.1.3 Choosing training, validation and test sets . . . 165
      8.1.4 Optimal (n − 1)-subject MLP per partition . . . 167
   8.2 Testing on the nth subject . . . 168
      8.2.1 Qualitative correlation with expert labels . . . 169
      8.2.2 Quantitative correlation with expert labels . . . 170
   8.3 Training an MLP with n subjects . . . 170
   8.4 Summary and conclusions . . . 171
9 Testing using the vigilance trained network . . . 181
   9.1 Vigilance test database . . . 181
   9.2 Running the 7-subject vigilance MLP with test data . . . 182
      9.2.1 Pre-processing . . . 182
      9.2.2 MLP analysis . . . 183
   9.3 Visualisation analysis . . . 209
      9.3.1 Projection onto the 7-subject vigilance NeuroScale map . . . 210
   9.4 Discussion . . . 215
   9.5 Summary and conclusions . . . 215
10 Conclusions and future work . . . 220
   10.1 Overview of the thesis . . . 220
   10.2 Discussion of results . . . 221
   10.3 Main research results . . . 222
   10.4 Conclusions . . . 223
   10.5 Future work . . . 224
A Discrete-time stochastic processes . . . 227
   A.1 Definitions . . . 227
B Conjugate gradient optimisation algorithms . . . 232
   B.1 The conjugate gradient directions . . . 232
      B.1.1 The conjugate gradient algorithm . . . 235
   B.2 Scaled conjugate gradients . . . 235
      B.2.1 The scaled conjugate gradient algorithm . . . 237
C Vigilance Database . . . 238
D LED Database . . . 240
   D.1 Method . . . 240
   D.2 Demographic data . . . 240
List of Figures
2.1 The human brain showing its main structures . . . 5
2.2 A conventional all night sleep classification plot from one normal subject . . . 11
3.1 A simplified neuron . . . 13
3.2 The 10-20 International System of Electrode Placement . . . 17
3.3 Conventional electrode positions for monitoring sleep . . . 19
3.4 Sleep EEG stages (taken from [69]) . . . 21
3.5 Apnoeic event . . . 30
4.1 Stochastic process model . . . 48
4.2 Autoregressive filter . . . 51
4.3 Moving Average filter . . . 52
4.4 Moving Average Autoregressive filter (b0 = 1, q = p − 1) . . . 53
4.5 Time series of the synthesised AR process . . . 59
4.6 Autocorrelation function of the synthesised AR process . . . 59
4.7 Second order AR process generator . . . 59
4.8 Second order AR process analyser . . . 61
4.9 AR coefficient estimates' mean and variance . . . 62
4.10 Filter problem . . . 62
4.11 Prediction filter of order p . . . 64
4.12 Prediction-error filter of order p . . . 65
4.13 Prediction-error filter of order p rearranged to look like an AR analyser . . . 65
4.14 Lattice filter of first order . . . 70
4.15 Lattice filter of first order . . . 71
5.1 The classification process . . . 82
5.2 An artificial neuron . . . 83
5.3 Hyperbolic tangent and Sigmoid functions . . . 84
5.4 An I-J-K neural network . . . 91
5.5 Early stopping . . . 102
5.6 A radial basis function network . . . 104
6.1 The neural network's wakefulness P(W), REM/light sleep P(R) and deep sleep P(S) outputs, and measure of sleep depth P(W)-P(S) (from Pardey et al. [123]) . . . 117
6.2 Mean error and covariance matrix trace for reflection coefficients computed with the Burg algorithm (wakefulness and Sleep stage 4) vs data length N . . . 121
6.3 Sammon map for the balanced sleep dataset; classes W, R and S . . . 123
6.4 NeuroScale map for the balanced sleep dataset; classes W, R and S . . . 124
6.5 Average performance of the MLPs vs number of hidden units . . . 125
6.6 Performance of the 10-6-3 MLP vs regularisation parameters . . . 127
6.7 MLP outputs, P(W), P(R) and P(S) for subject 9's all-night record, with a 12-minute segment shown in detail . . . 128
6.8 Sleep database subject 9 P(W)-P(S), raw (a) and 31-pt median filtered (b), compared to the human expert scored hypnogram (c) . . . 129
6.9 OSA sleep MLP outputs for subjects 3 and 8 . . . 131
6.10 [P(W)-P(S)] output for OSA sleep subjects 3 (top) and 8 (middle), and for normal sleep subject 9 (bottom) . . . 132
6.11 μ-arousal detection procedure. Upper trace: [P(W)-P(S)] and a 0.5 threshold; middle trace: thresholding result; lower trace: μ-arousal automatic score with ASDA timing criteria . . . 133
6.12 μ-arousal validation. Upper trace: automated score for 0.7 threshold; middle trace: automated score for 0.8 threshold; lower trace: visually scored signal . . . 134
6.13 Se, PPA and Corr vs threshold for OSA subjects . . . 137
6.14 [P(W)-P(S)] output for OSA sleep subject 2 (top) and amplitude histogram showing the two main clusters, surrounded by a circle of one standard deviation, and the EDM threshold (bottom) . . . 139
6.15 Se, PPA and Corr for the best threshold (blue), the EDM threshold (red), and a 0.5 fixed threshold (green) . . . 140
6.16 OSA subject 5 EEG and [P(W)-P(S)] output during a typical μ-arousal for this subject (24 s) . . . 141
6.17 Spectrogram of the EEG segment shown in Fig. 6.16, calculated with 1 s resolution using 10th-order AR modelling . . . 142
6.18 OSA subject 5 EEG and [P(W)-P(S)] output during a μ-arousal missed by the automated scoring system (24 s) . . . 143
6.19 OSA subject 8 [P(W)-P(S)] output and human expert scores (2 minutes) . . . 144
6.20 Sleep database subject 9 raw P(W)-P(S) using a 1-s analysis window (a) and a 3-s analysis window (b), compared to the human expert scored hypnogram (c) . . . 145
7.1 Vigilance Sammon map . . . 151
7.2 Vigilance NeuroScale map . . . 152
7.3 Vigilance Sammon map showing subjects' distribution (Alertness in red and Drowsiness in blue) . . . 156
7.4 Vigilance NeuroScale map projections for each subject (Alertness in magenta and Drowsiness in blue) . . . 157
7.5 Vigilance NeuroScale map trained with all subjects, including the α+ subject . . . 158
7.6 Vigilance NeuroScale map trained with all subjects, including the α+ subject (Alertness in magenta and Drowsiness in blue) . . . 159
7.7 Subject 8 reflection coefficient histogram (green) in relation to the rest of the subjects in the training set (magenta) . . . 160
7.8 Vigilance and sleep NeuroScale map . . . 160
7.9 Vigilance and sleep NeuroScale projections for all the patterns in each class (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . 161
7.10 Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . 162
7.11 Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; and D, blue) . . . 163
8.1 Average misclassification error for the validation set vs. number of hidden units J for the (n − 1)-subject MLP . . . 168
8.2 Average misclassification error on the validation set with respect to regularisation parameters (νz, νy) for the (n − 1)-subject MLP with J = 3 (linear interpolation used between 12 values) . . . 168
8.3 Time course of the MLP output for vigilance subject 1 . . . 173
8.4 Time course of the MLP output for vigilance subject 2 . . . 174
8.5 Time course of the MLP output for vigilance subject 3 . . . 175
8.6 Time course of the MLP output for vigilance subject 4 . . . 176
8.7 Time course of the MLP output for vigilance subject 5 . . . 177
8.8 Time course of the MLP output for vigilance subject 6 . . . 178
8.9 Time course of the MLP output for vigilance subject 7 . . . 179
8.10 Average misclassification error for the validation set vs. number of hidden units J for the 7-subject MLP . . . 180
9.1 LED subject 1 MLP output and missed hits time courses . . . 186
9.2 LED subject 1 MLP output and missed hits time courses . . . 187
9.3 LED subject 2 MLP output and missed hits time courses . . . 188
9.4 LED subject 2 MLP output and missed hits time courses . . . 189
9.5 LED subject 3 MLP output and missed hits time courses . . . 190
9.6 LED subject 3 MLP output and missed hits time courses . . . 191
9.7 LED subject 3 MLP output and missed hits time courses . . . 192
9.8 LED subject 4 MLP output and missed hits time courses . . . 193
9.9 LED subject 4 MLP output and missed hits time courses . . . 194
9.10 LED subject 5 MLP output and missed hits time courses . . . 195
9.11 LED subject 5 MLP output and missed hits time courses . . . 196
9.12 LED subject 6 MLP output and missed hits time courses . . . 197
9.13 LED subject 6 MLP output and missed hits time courses . . . 198
9.14 LED subject 6 MLP output and missed hits time courses . . . 199
9.15 LED subject 7 MLP output and missed hits time courses . . . 200
9.16 LED subject 7 MLP output and missed hits time courses . . . 201
9.17 LED subject 8 MLP output and missed hits time courses . . . 202
9.18 LED subject 8 MLP output and missed hits time courses . . . 203
9.19 LED subject 9 MLP output and missed hits time courses . . . 204
9.20 LED subject 9 MLP output and missed hits time courses . . . 205
9.21 LED subject 10 MLP output and missed hits time courses . . . 206
9.22 LED subject 10 MLP output and missed hits time courses . . . 207
9.23 LED subjects MLP output vs missed hits scatter plots . . . 208
9.24 LED subjects MLP output vs missed hits scatter plots . . . 209
9.25 LED subjects no-missed-hits MLP output histogram . . . 210
9.26 Patterns from LED subjects 1 and 2 projected onto the 7-subject vigilance NeuroScale map . . . 211
9.27 Patterns from LED subjects 3 and 5 projected onto the 7-subject vigilance NeuroScale map . . . 212
9.28 Patterns from LED subjects 7, 9 and 10 projected onto the 7-subject vigilance NeuroScale map . . . 217
9.29 Patterns from LED subjects 4 and 6 projected onto the 7-subject vigilance NeuroScale map . . . 218
9.30 Patterns from LED subject 8 projected onto the 7-subject vigilance NeuroScale map . . . 219
A.1 Stochastic process ensemble . . . 228
List of Tables
3.1 The Rechtschaffen and Kales standard for sleep scoring . . . 22
3.2 The vigilance sub-categories and their definition . . . 37
4.1 AR coefficient estimates' mean and variance . . . 61
4.2 Feedback coefficients in terms of the reflection coefficients . . . 72
4.3 Reflection coefficients in terms of the feedback coefficients . . . 72
6.1 Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (wakefulness) . . . 120
6.2 Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (stage 4) . . . 120
6.3 Misclassification error (expressed as a percentage) for the best three MLPs . . . 126
6.4 Se, PPA and Corr per subject for various threshold values . . . 136
6.5 Optimal threshold . . . 138
6.6 Equi-distance to means (EDM) threshold . . . 139
6.7 Fixed (0.5) threshold . . . 140
7.1 Alford et al. vigilance sub-categories . . . 148
7.2 Number of patterns per subject per class in vigilance training database . . . 149
7.3 Number of patterns per subject per class in K-means training set . . . 150
8.1 Partitions and distribution of patterns in training (Tr) and validation (Va) sets . . . 166
8.2 Optimum MLP parameters per partition and percentage classification error for training (Tr) and validation (Va) sets . . . 167
8.3 Percentage correlation between 1-s segments of the 15-pt median filtered MLP output and 15 s-based expert labels . . . 170
C.1 Bristol subjects . . . 238
D.1 Time of falling asleep (in mm:ss) measured by the clinician from the start of the MWT test. The letter used in this thesis to refer to a given test is shown in brackets . . . 241
D.2 Subject demographic details . . . 241
D.3 Overnight sleep study results . . . 241
Chapter 1
Introduction
Obstructive Sleep Apnoea (OSA) is a condition in which breathing stops briefly and repeatedly during
sleep, causing frequent awakening as the subject gasps for breath. Day-time sleepiness is the main symptom of OSA. Diagnosis of the disorder includes an over-night sleep study to count the number of arousals,
and a day-time sleepiness assessment. Changes from wakefulness to deep sleep and from alertness to
drowsiness are reflected in many physiological signals and behavioural measures. Among the physiological variables, the electroencephalogram (EEG) is one of the most relevant, but traditional methods, based
on visual assessment of this signal (for example, counting the number of micro-arousals during sleep)
are time-consuming or not reliable. Most of the changes in the EEG associated with the transition from
alertness to drowsiness and to sleep are in the frequency domain, and many attempts to computerise the
EEG analysis are based on frequency-domain methods.
1.1 Overview of thesis
The focus of this thesis will be on sleep disturbance (micro-arousals in OSA patients) and its effect on
day-time performance, as assessed with vigilance monitoring. Definitions of terms used in this thesis
and a description of the OSA disorder and its implications in society can be found in chapter 2. Clinical
background and a literature review on computerised methods are presented in chapter 3. Section 3.2.3
of that chapter shows that little research has been done on the computerised analysis of disturbed sleep.
Furthermore, there is no prior work on the computerised analysis of both sleep disturbance and vigilance
from the EEG. This thesis will describe the research undertaken in order to develop such a framework
using AR modelling for frequency-domain analysis and neural network methods for clustering and for
classification.
AR modelling theory and algorithms are the subject of chapter 4. Chapter 5 is a review of neural network methods. Experiments carried out on AR modelling to find a compromise between the stationarity
requirements and the variance of the AR estimates are described in chapter 6, which also presents the
use of neural networks with sleep data to track the sleep-wake continuum from single-channel EEG. An
automated system is developed to detect micro-arousals in OSA sleep EEG, based on the neural network
outputs. Results, compared with an expert’s scores, show high sensitivity with a low number of false positives, and a good similarity in starting time and duration. A case study shows that the EEG may present
a mixed-frequency pattern during a micro-arousal, instead of a shift in frequency as usually described in
the literature.
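The sensitivity and positive predictive accuracy used to validate the detector are standard count ratios; a minimal sketch (in Python, with hypothetical event counts) is given below for reference.

    # A minimal sketch of the validation measures used later in the thesis.
    # Sensitivity Se = TP / (TP + FN); positive predictive accuracy
    # PPA = TP / (TP + FP). The counts below are hypothetical.

    def se_ppa(true_pos, false_neg, false_pos):
        se = true_pos / (true_pos + false_neg)
        ppa = true_pos / (true_pos + false_pos)
        return se, ppa

    # e.g. 97 correctly detected micro-arousals, 3 missed, 6 spurious:
    print(se_ppa(true_pos=97, false_neg=3, false_pos=6))  # (0.97, 0.94...)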
In chapter 7 we explain the reasons why a different network is needed to map the alertness-drowsiness
continuum. Visualisation analysis of the EEG features revealed differences between Alertness and Drowsiness in vigilance tests on the one hand, and Wakefulness and REM/Light Sleep in a sleep-promoting environment on the other. Chapter 8 deals with the training of neural networks with vigilance data to track the alertness-drowsiness continuum using single-channel EEG only. Finally, chapter 9 presents the results of the trained
network with data from OSA patients performing a visual attentional task. This study shows that OSA
subjects may present “drowsy” EEG while performing well. Also, the MLP analysis of the wake EEG with
these subjects shows that the transition to Drowsiness may occur progressively as well as in sudden dips.
Correlation of the MLP output with a measure of task performance and visualisation of EEG patterns in
feature space show that the alertness EEG patterns of OSA subjects may be more closely related to the
drowsiness EEG patterns of normal sleep-deprived subjects than to their alertness patterns.
Chapter 2
Sleep and day-time sleepiness
2.1 Sleep, wakefulness, sleepiness and alertness
2.1.1 Definitions
Although the above words are part of almost everyone's daily conversation, we will define them in the sense in which they are to be used in this thesis. Sleep is a natural and periodic state of rest during which
consciousness of the world is suspended while its counterpart, wakefulness, is a periodic state during
which one is conscious and aware of the world [124]. Between sleep and wakefulness is the transitional
state of sleepiness [130], which has been defined as a physiological drive towards sleep [4], usually
resulting from sleep deprivation, or as a subjective feeling or state of sleep need. Wrongly used as a
synonym for wakefulness, alertness is the process of paying close and continuous attention, a state of
readiness to respond [124], an optimal activated state of the brain [115]. Vigilance, another word of
similar meaning, was first introduced in the literature by Head in 1923, who differentiated the stages of
awareness [64].
2.1.2 The process of falling asleep
Two theories try to explain how we fall asleep. Oswald in 1962 suggested that the fall of cerebral vigilance
does not occur as a steady decline but occurs briefly over and over again, with frequent surges of cerebral
vigilance to, once more, a high level. This depicts sleep onset as a punctuated rather than gradual
process. However, more and more evidence has recently been found, using recordings of the brain's electrical activity and respiration signals, that shows a gradual oscillatory descent into sleep [6][130].
Sleepiness has an ultradian (> once/day) modulation, with three times of day when this condition is most common: just after awakening in the morning, in mid-afternoon (the so-called "post-lunch dip", which is nevertheless not related to the ingestion of food) and just prior to sleep. The post-lunch dip correlates
with the occurrence of siestas and an increase in the incidence of automobile accidents [166][41][130].
The drive to sleep can be overridden by motivation, especially in life-threatening situations, but it cannot
be suppressed indefinitely [42].
2.1.3 Going on to a deeper sleep
Once the sleep state is reached, physical signs of this condition are lack of movement, reduced postural
muscle tone, closed eyes, lack of response to limited stimuli, and more regular and relaxed breathing,
usually accompanied by an increase in upper airway noise. During sleep, the eyes can move repeatedly and rapidly. This condition is called rapid eye movement (REM) sleep and is usually associated
with the act of dreaming. A normal subject follows cycles or periodic patterns of REM and non-REM
(NREM) sleep during the night, going from the wakefulness stage to the deep sleep stage and then to
REM sleep, for a time longer than 20 minutes but usually no more than 1 hour, descending again into a
deep sleep stage, and repeating the 90-minute REM-NREM sleep cycle about 4 or 5 times (see Fig. 2.2 in section 2.4.2) [155].
2.2 Breathing and sleep
2.2.1 Normal sleep
When a normal subject is awake, ventilation is controlled by two pathways, one driven by the brainstem respiratory control centre and the other by the cortex (see Fig. 2.1). The one which is controlled by the brainstem is a vagal reflex and is more related to oxygen and carbon dioxide concentration
control. During sleep, this respiratory centre remains active but the cortex drive disappears causing
regular breathing as well as a fall in ventilation and a rise in the CO2 concentration. The reduction
in muscular tone causes a similar effect. The intercostal muscles stop their breathing motion and the
tubular pharynx muscle, which relies on tonic and phasic respiration to stay open, is narrowed when it
and related muscles lose tone. This pharyngeal narrowing increases the upper airway resistance. Even so,
the loss in tone of the intercostal muscles increases the chest wall compliance, allowing the diaphragm to
elevate the rib-cage more easily. The overall effect is that the breathing looks more relaxed and the ratio
of abdominal contribution to rib-cage contribution decreases, at least in NREM sleep [155].
Figure 2.1: The human brain showing its main structures
The further reduction in tone experienced by the intercostal muscles during tonic REM sleep brings
another fall in ventilation followed by a recovery in phasic REM sleep, when the randomly excited cortex
is able to drive the breathing again, making it less regular. The abdominal contribution increases to a
higher level than when the subject is awake [155].
2.2.2 Obstructive Sleep Apnoea
An obstructive apnoea occurs when the air flow in the respiratory system stops for more than 10 s, due to an obstruction in the upper airways. A hypopnoea occurs when the normal flow is
reduced by 50% or more for more than 10s [91]. The number of apnoea and hypopnoea events per hour,
called the respiratory disturbance index (RDI) or apnoea/hypopnoea index (AHI), is used to determine
whether breathing patterns are normal or abnormal. Usually, an AHI of 5 or more is considered abnormal
[119].
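As an aside, the AHI is a simple rate computation over the scored respiratory events; a minimal sketch in Python follows, in which the event lists, the (start, duration) representation and the total sleep time are all hypothetical.

    # A minimal sketch of the AHI computation. Events are hypothetical
    # (start_s, duration_s) pairs; only events lasting more than 10 s count,
    # following the definitions given above.

    def ahi(apnoeas, hypopnoeas, total_sleep_hours):
        events = [e for e in apnoeas + hypopnoeas if e[1] > 10.0]
        return len(events) / total_sleep_hours

    apnoeas = [(i * 600.0, 15.0) for i in range(30)]             # 30 events
    hypopnoeas = [(i * 900.0 + 120.0, 12.0) for i in range(12)]  # 12 events
    print(ahi(apnoeas, hypopnoeas, total_sleep_hours=7.0))       # 6.0: abnormal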
Some subjects develop a sleep disorder called Obstructive Sleep Apnoea (OSA) in which apnoea or hypopnoea events occur when the upper airway, usually crowded by obesity, enlarged glands, or other kinds of
obstruction, collapses under the negative pressure created by inspiration as the muscles lose their tone.
Then, the subject increases his respiratory efforts gradually until the intrathoracic pressure drops to a
subatmospheric value. Only when the carbon dioxide level rises and the oxygen level falls enough to awaken the cortical respiratory mechanism does the returning tone unblock the upper airways and restore ventilation. Recently, some studies [155] have pointed out the possibility that the increase in respiratory effort is responsible for the cortical arousal. Whatever the cause, this arousal is short in duration, sometimes referred to in the literature as a "micro-arousal" (a term first introduced by Halasz in 1979 [60]), and the patient is rarely conscious of it [45].
If the apnoea/hypopnoea event is followed by an overshoot of hyper-ventilation, then the threshold of
the carbon dioxide level to provoke spontaneous ventilation can fall, and the next apnoea will have a
period when no respiratory effort is being made [155].
Micro-arousals
An arousal is a mechanism of the organism to increase the level of alertness in order to respond more
effectively to danger, whether it be external or internal and whether actual or perceived. In terms of
sleep, arousal not only refers to waking up but also to a series of physiological changes in autonomic
balance (i.e. heart rate, blood pressure, skin potential) and brain cortex activity [45].
Arousals caused by an apnoea/hypopnoea event are a short-duration response to an internal stimulus. Their length can range from just 3 or 5 seconds to 20 seconds [11], and they can be barely noticeable
or can end in a choking sensation or panic [45]. Fifteen or more micro-arousals per hour are enough to diagnose OSA with confidence, but the number of arousals can be greater than 400 during the night [155], and some studies have found up to 100 per hour [45]. This fragmentation decreases the quality of sleep by diminishing the effective sleep time. Progressive daytime sleepiness is a consequence, starting with some loss of vigilance when the subject is performing a boring task, but soon leading him or her to fall asleep while doing other activities such as reading, watching TV, sitting as a passenger in a car
or train, or taking a bath. In the worst case, the subject may fall asleep while operating a machine at work or driving a car, causing shunting accidents and more serious crashes [155]. The deterioration in daytime function
correlates with the frequency of the micro-arousals rather than the extent of the reduced arterial oxygen
saturation [44].
Arousals can have causes other than obstructive sleep apnoea (OSA), for instance ageing, leg movements, pain and some forms of insomnia, but the most common cause is OSA [30]. OSA has a prevalence of 1-4% in
the overall population, 85% of the sufferers being males, and is highest in the 40-59 year age group, the
percentage of those affected rising to 4-8% [107] [44]. The problem usually arises in middle age, when the muscles become less rigid and decreasing activity leads to weight gain [155].
2.3 Daytime sleepiness
2.3.1 Causes
Sleep deprivation is one of the most common causes of sleepiness in our society. Studies on sleep deprivation have found that a reduction in nocturnal sleep of as little as 1.3 to 1.5 hours per night results
in a reduction of daytime alertness by as much as 32% as measured by the multiple sleep latency test
(see section 2.4.1 for a description of this test)[20]. Physiological and psychological functions deteriorate
progressively over accumulating hours of sleep loss as well as over periods of fragmented sleep [35][97].
A second cause of sleepiness is OSA, the most common sleep disorder to cause day-time sleepiness,
even though the subjects affected by this disorder often report sleeping quite well [70] [107]. The
sleepiness of OSA sufferers is reflected in neuro-physiological impairment in originality, logical order
in visual scanning, recent memory, word fluency, flattening of affect in speech, and spatial orientation.
They become easily distracted by irrelevant stimuli, and have difficulties in ordering temporally changing
principles (card sorting or digit symbol substitution) [70]. It has been recommended that diagnosis of
OSA should not only depend on the AHI but also on functional sleepiness [107].
2.3.2 Sleepiness/fatigue related accidents
Fatigue and sleepiness are often used as synonyms. The term fatigue is also used to indicate the effects
of working too long, or taking too little rest, and being unable to sustain a certain level of performance
on a task [41]. Fatigue, like sleepiness, is related to motivation; the capability of performing a given
task; and past, cumulative day-by-day arrangements and durations of sleep and work periods [121].
Loss of performance usually means a decreased ability to maintain visual vigilance, to react quickly and to respond to unique, emergency-type situations. The loss of performance brought
by fatigue and sleepiness can be fatal when driving, piloting, monitoring air traffic control or radar or
when operating dangerous machinery. It appears that the incidence of sleepiness-related fatal crashes
may be as high as 40% of all the accidents on long stretches of motorway [41]. Some 20-25% of drivers having
motorway accidents appear to do so as a result of falling asleep at the wheel [107]. Sleepiness influences
people’s perception of risk [41]. Drivers do not always recognise the signs of fatigue/drowsiness or may
choose to ignore them [121]. Evidence has been found that lorry drivers on 11-hour hauls show, in their physiological signals, increased signs of marked drowsiness during the last three hours of their drive [83].
Long-distance driving, youth and sleep restriction are frequently associated with sleep-related accidents
[128].
Sleepiness is the major complaint of shift-workers. Displaced hours of work are in conflict with the
basic biological principles regulating the timing of rest and activity (i.e. the circadian and homeostatic
regulatory systems) [4]. Such sleepiness may be the cause of more than 2% of all the serious accidents in industry [41].
2.3.3 Correlation between OSA and accidents
As OSA is one of the most common causes of day-time sleepiness [161], the link between this sleep
disorder and motorway accidents is obvious. OSA patients show a high dispersion in reaction times [79],
and evidence has been found that OSA impairs driving [59]. Recent polls have revealed that 24% of
OSA patients reported falling asleep at least once per week while driving [107], so it is not a surprise to
find that OSA sufferers have a 5- to 7-fold greater risk of road accidents than normal subjects. Long-haul
lorry drivers belong to the highest-risk group [107]. Lorry drivers with OSA have twice as many crashes
per mile driven as the normal group [121]. However, more recent studies have noted that increased
automobile accidents in OSA sufferers may be restricted to cases with severe apnoea (AHI> 40) [56].
2.4 Measuring the sleep-wake continuum
2.4.1 Measuring sleepiness
Many attempts to measure sleepiness/alertness have been made, and several scales are currently in use. Subjective measures such as the Stanford Sleepiness Scale (SSS), with 7 statements of feelings of sleepiness from "wide awake" to "cannot stay awake" [130]; the Visual Analogue Scale (VAS), which uses 10 cm lines anchored between the extremes of the states or moods under study [5][130]; and the Activation-Deactivation Adjective Check List (ADACL), which consists of a series of adjectives describing feelings at the moment and a four-point scale – definitely feel, feel slightly, cannot decide and definitely do not feel – have been used in a wide range of vigilance studies [130], in parallel with more objective
measures that provide means of verifying the subjective feelings of loss of alertness [5].
Several tests have been developed to provide an objective, repeatable quantification of sleepiness, such as the multiple sleep latency test (MSLT) [27][140], which places the subject in a sleep-promoting situation and measures the latency to the onset of sleep. In the MSLT, subjects in a sleep-promoting environment are instructed to try to fall asleep, while other similar tests differ in the instructions; the maintenance of wakefulness test (MWT) [111][43], for example, instructs the subject to resist sleep.
Loss of alertness or sleepiness has been related to diminished response capability, such that a decrease in performance indicates the presence of this condition. Therefore, quantifiable behavioural responses
or performance measures have been used also as objective ways to measure vigilance. The most popular
ones are reaction time, tracking error and stimulus detection error. The use of vigilance tasks to measure
sleepiness has the problem that the tasks are intrusive with respect to the natural process of sleepiness
[130]. Task complexity and knowledge of results (feedback) can mitigate the effects of sleep loss [42].
Other non-task related factors that affect the process are motivation, distraction and comprehension of
instructions [130][35][42].
Physiological measures
As a result of the degree of isomorphism between physiological and behavioural systems, diminished
response capabilities associated with sleepiness will be reflected in distinctive variations in physiological
measures. Below is a list of some of the changes in physiological variables associated with sleepiness
[130][118]:
• slower, more periodic breathing,
• decrease in cardiovascular activity (heart rate, blood pressure),
• decreased eye blinks and increased slow eye movements,
• decreased but variable skin conductance responses,
• decreased body temperature,
• electroencephalogram (EEG) changes in amplitude, frequency and patterning.
2.4.2 Measuring sleep
Loomis and collaborators first showed in 1937 that sleep is not a uniform or steady state, and they therefore classified sleep into stages [96]. Following this, the sleep classification was further refined until, in 1968, a
committee chaired by Rechtschaffen and Kales (R & K) compiled a set of rules that soon became the standard in sleep staging [136]. From wakefulness or REM sleep to deep sleep, R & K analysis distinguishes
four intermediate stages for NREM sleep (see Fig. 2.2 for a typical all-night sleep classification plot which
is known as a hypnogram). Visual assessment of the subject is not enough for the characterisation of these
stages. Physiologically, the EEG, the electromyogram (EMG) and the electrooculogram (EOG) provide a
higher level of quantification in the description of the different sleep stages. Measures of sleepiness based
on the EEG and details of the sleep stages are given in chapter 3.
Figure 2.2: A conventional all night sleep classification plot from one normal subject (axes: sleep stages – Awake, REM, 1-4 – against hours of sleep, 1-8)
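A hypnogram such as Fig. 2.2 is simply a step plot of the epoch-by-epoch stage against time; a minimal sketch of the mapping is given below, assuming the conventional 30-second epoch and a hypothetical label sequence.

    # A minimal sketch of hypnogram construction from R & K epoch scores.
    # The 30 s epoch length is conventional; the label sequence is hypothetical.

    EPOCH_S = 30.0
    LEVEL = {"W": 5, "REM": 4, "1": 3, "2": 2, "3": 1, "4": 0}  # wake at top

    def hypnogram_points(stages):
        """Return (time in hours, plot level) pairs for a step plot."""
        return [(i * EPOCH_S / 3600.0, LEVEL[s]) for i, s in enumerate(stages)]

    stages = ["W", "1", "2", "3", "4", "4", "3", "2", "REM", "2"]
    for t, level in hypnogram_points(stages):
        print(f"{t:5.3f} h  level {level}")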
Chapter 3
Previous work on EEG monitoring for
micro-arousals and day-time vigilance
As we have seen in chapter 2, many physiological processes change at the time of sleep onset. Monitoring these changes provides a means of detecting arousals during sleep for OSA diagnosis (see section 2.2.2) and of detecting day-time sleepiness. However, the organ that shows the clearest changes during sleep and from
alertness to sleepiness is the brain. Not only does the brain contain the mechanisms for sleeping and being awake, but its electrical activity is also relatively easy to monitor and reflects the changes in the sleep/wake continuum [69].
3.1 The EEG
The electroencephalogram or EEG is a graphical record of the electrical activity of the brain which was
first measured non-invasively in humans and described in 1929 by Hans Berger¹. It can be measured with
electrodes located near, on or within the cortex. Depending on the location of the recording electrodes
the EEG can be called scalp EEG, cortical EEG or depth EEG. The first one is recorded with electrodes
placed on the scalp, while the last two refer to electrodes in contact with the brain cortex [145]. From
now on we will use EEG to mean scalp EEG.
¹The first recording of the electrical activity of the brain was made by Caton in 1875 using rabbits, monkeys and other small animals [28].
3.1.1 Origin of the brain electrical activity
The human nervous system is responsible for taking in information from internal and external or environmental changes, analysing it and acting upon it in order to preserve the integrity, well-being and status quo of the organism. Its most prominent and important organ is the brain (see Fig. 2.1). The human brain contains approximately 10⁹ nerve cells or neurons, interconnected in a very intricate network within which information is transmitted by electro-chemical impulses [51]. Most neurons consist of a cell body, or soma, with several receiving processes, or dendrites, and a nerve fibre, or axon, that branches at the other end (see Fig. 3.1).
Figure 3.1: A simplified neuron, showing dendrites, soma and axon
As in any other cell in the human body, there is an electrical potential difference between the inner and
the outer side of the neuron. This potential, called the resting potential, is due to differences in extracellular and intracellular ion concentration, maintained by the cell membrane structure and ion pumping
mechanisms. Neurons can respond to stimuli that are strong enough to initiate a series of charge changes leading to membrane depolarisation and reverse polarisation, which reaches a peak and then repolarises back to the resting potential. This sudden activity resembles a spike in shape and is called an action potential. Typically it has a peak-to-peak amplitude of 90 mV and a duration of 1 ms.
Neurons also interact with each other by chemical secretions in the dendrite-axon gaps (synapses) between them. The action potential in the pre-synaptic neuron (transmitting neuron) travels from the soma
along the axon. When it reaches the end it releases a chemical neurotransmitter at the axon terminals,
which are very close to the dendrites of other neurons. Then the post-synaptic neuron (receiving neuron)
receptors for this chemical release ions inside the cell that change the membrane polarisation, originating
a post-synaptic potential. Post-synaptic potentials are much lower in amplitude than the action potentials,
but they last much longer (15 - 200 ms or more) and the extracellular current flow associated with them
is much more widely distributed than that corresponding to action potentials. It has been estimated that
one neuron can influence up to 5000 of its neighbours. For these reasons it is believed that the EEG reflects the summation of post-synaptic potentials of the pyramidal cells rather than the spatial summation
of individual action potentials [135] [145] [125]. Pyramidal cells are neurons located very close and
perpendicularly to the cortex surface, so the ion current flow generates electrical potential changes that
are maximum in the plane parallel to the cortex [135].
If post-synaptic potentials coming from the dendrites of one neuron, summed in time and space, exceed
a certain threshold, the soma generates a new nerve impulse, an action potential, that is then transmitted to the neurons at the end of its axon, passing in this way the stimulus response from one neuron to
another [51] [135]. Post-synaptic potentials can be of varied peak amplitude but, in general, a single one is not enough to trigger an action potential [135]. Because of their chemical origin, the potentials generated in the brain are very limited in amplitude, and the ionic currents travel more slowly (1 ms per synapse) than currents in metals. The axon membrane is not a perfect insulator: some extracellular current flows
and diffuses the information around the neuron, speeding up the signal transmission [135] [110]. The
cerebro-spinal fluid and the dura membrane act as strong attenuators for the EEG, with the scalp itself
having less effect. EEG waves seen at the scalp, therefore, represent a kind of a “spatial average” of
electrical activity from a limited area of the cortex [125].
3.1.2 Description of the EEG
The EEG is a very complex quasi-rhythmical spatio-temporal signal within a time-frequency band of 0.1
- 100 Hz and an amplitude of the order of hundreds of microvolts at the scalp [135]. The effective
frequency range is 0.5 - 50 Hz and is divided for clinical reasons into the following main bands in which the power of the signal is concentrated [87] [69] [24]:
1. Delta (δ) activity: [0.5 - 3.5] Hz²
² The δ rhythm is limited to the [0.5 - 2) Hz range in sleep studies.
2. Theta (θ) activity: [4 - 8) Hz
3. Alpha (α) activity: [8 - 13] Hz
4. Beta (β) activity: [15 - 25] Hz
5. Gamma (γ) activity: [30-50] Hz
with “]” meaning “inclusive” and “)” meaning “exclusive”.
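To make the bracket convention concrete, the short sketch below (an illustration only; the band limits are those listed above, each stored with a flag recording whether its upper limit is inclusive) maps a frequency to its clinical band:

```python
from typing import Optional

# Illustrative sketch of the clinical bands listed above, each stored as
# (lower limit, upper limit, upper-limit-inclusive?). Lower limits are all
# inclusive ("["); upper limits are "]" (inclusive) or ")" (exclusive).
EEG_BANDS = {
    "delta": (0.5, 3.5, True),    # [0.5 - 3.5] Hz
    "theta": (4.0, 8.0, False),   # [4 - 8) Hz
    "alpha": (8.0, 13.0, True),   # [8 - 13] Hz
    "beta":  (15.0, 25.0, True),  # [15 - 25] Hz
    "gamma": (30.0, 50.0, True),  # [30 - 50] Hz
}

def band_of(freq_hz: float) -> Optional[str]:
    """Return the band containing freq_hz, or None if it falls in a gap."""
    for name, (lo, hi, hi_inclusive) in EEG_BANDS.items():
        if freq_hz >= lo and (freq_hz < hi or (hi_inclusive and freq_hz == hi)):
            return name
    return None

assert band_of(8.0) == "alpha"   # 8 Hz is excluded from theta, included in alpha
assert band_of(14.0) is None     # the gap between the alpha and beta bands
```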
EEG records are sometimes described as just “slow” or “fast” if the dominant frequency is below or above
the α band. The amplitude of the waves tends to drop as the frequency increases. Although there are
indications of several sources of rhythmical activity in the brain, their role in the generation of the EEG
rhythms is not yet fully understood [145] [125]. Clear oscillatory behaviours in the nervous system occur
in various situations, like in rhythmic motor functions (chewing, swimming) as well as in pathological
conditions (clonic muscular jerking, rhythmic eye blinks), but most of them serve unknown functions.
Some may be related to biological clocks or establishing windows of time during which information flows
[125]. The bands described above correspond to the main frequencies of these physiological pacemakers.
These frequencies do not tend to overlap with the frequency content of the neighbouring bands, hence
the gaps between some of the bands.
It has been suggested that the distributed, but related, cortical γ activity in the forebrain provides the
physiological basis for focused attention that links input to output, i.e. relating voluntary effort and/or
sensory input to a calling up and operation of a sequence of movements or thoughts. This form of
attention occurs normally during wakefulness, but can also be present during disordered sleep, in patients
who talk or walk during sleep [24]. Activity over 50 Hz is not considered of clinical value in scalp EEG
because it is mostly masked by background noise. Apart from the background rhythmical activity, there
are other components in the EEG of transient nature, usually described in terms of their duration and
waveform. For instance, a monophasic wave of less than 80ms duration is called a spike, while one of
80-200ms is called a sharp wave. Other transient forms are the spindles and K-complexes (see Fig. 3.4 later
in this chapter). All EEG components fluctuate spontaneously in response to stimuli or as a consequence
of changes in the subject’s state of mind (i.e. sleep/wake control and psychoaffective status) and brain
metabolic status. They can also be changed by the use of drugs or by traumas or pathological conditions
[87] [85].
EEG patterns are different from one individual to another. Factors like gender, early stimuli, minor or
major brain damage, etc. can affect the development of the EEG. Once a subject reaches adulthood, their
EEG characteristics “stabilize” over time. This means that the EEG patterns for different conditions such
as eyes open, eyes closed, auditory stimulation and task performance remain remarkably similar for the
same individual as their age increases [85].
3.1.3 Recording the EEG
The EEG is recorded by amplifying the potential differences between two electrodes located on the scalp.
An electrode is a liquid-metal junction used to make the connection between the conducting fluid of the
tissue in which the electrical activity is generated and the input circuit of the amplifier [135]. The most
commonly used system for the placement of the electrodes is the so-called “10-20 International System
of Electrode Placement”, [76], represented in Figure 3.2. An orderly array of EEG channels constitutes
a montage. When all the channels are referenced to the same electrode (usually mastoid processes A1
for the right side of the scalp, and A2 for the left side of the scalp, or a common site located at the nose
or at the chin) the montage is called “referential”. If all the channels represent the difference potential
between two consecutive electrodes on the scalp, the montage is said to be “bipolar” [145].
The EEG signal is traditionally recorded on paper, or, more commonly now, electronically. It is subsequently analysed in order to extract useful information about the physiology or pathology of the brain.
This analysis is usually done by an expert by visual inspection of the signal.
3.1.4 Extracerebral potentials
In addition to the EEG, the scalp electrodes can also pick up other signals whose sources are not in the
brain, but are near or strong enough to interfere with its electrical activity. These signals can totally
[Figure: schematic head viewed from above, nasion at the front, inion at the back, left ear A1, right ear A2, with electrode sites Fp1,2 (pre-frontal), F3,4 (frontal), F7,8 (anterior), Fz (frontal mid-line), C3,4 (central), Cz (central vertex), T3,4 (mid-temporal), T5,6 (posterior temporal), P3,4 (parietal), Pz (parietal mid-line), O1,2 (occipital) and A1,2 (mastoid).]
Figure 3.2: The 10-20 International System of Electrode Placement
obscure the EEG, making the recording uninterpretable. They can subtly mimic normal EEG activity or
distort normal activity, leading to misinterpretation [23]. Although called artefacts (or artifacts) they do
not always come from man-made devices. The main sources of artefacts are:
1. The recording instrument
2. The interface between the recording instruments and the scalp
3. Extraneous environmental sources
4. Other bio-electrical signals that do not originate from the brain and are not of interest in this
context, and can therefore be considered to be unwanted influences.
Muscle and heart activity as well as eye and tongue movements are among the bio-electrical signals
which, in this context, are considered artefacts because they obscure the EEG. They are classified as:
1. Electrocardiographic (ECG) signals and signals due to breathing
2. Electrooculographic (EOG) signals (signals due to eye movement)
3. Glossokinetic signals (signals from the movement of the tongue)
4. Electromyographic (EMG) signals (signals induced by muscle activity)
5. Electrodermal signals due to altered tissue impedance (see above).
Usually the above influences appear within the EEG frequency range and consequently cannot be eliminated by filtering. If the interference renders the EEG useless, then the affected sections of EEG are
ignored, unless the presence of the interfering signal gives important information about the brain status
as is sometimes the case in visual scoring to determine alertness or in visual sleep staging.
Artefacts during sleep
Blink artefacts can occur only during wakefulness and in combination with slow eye movements during
drowsiness. Rapid eye movements (REM) are seen during waking but are characteristic of the “dreaming”
sleep stage that was named after them. For EEG recorded with the reference electrode positioned on
the opposite side of the body, vertical eye movements affect mostly the frontopolar sites (Fp1 and Fp2
electrodes), with an exponential decrease of the effect towards the occipital sites, while, for horizontal
eye movements, the maximum effect is found at the frontotemporal sites. EMG artefacts are uniformly
distributed within REM sleep, but are concentrated at the beginning and the end of non-REM sleep
periods. As expected, the deeper the sleep stage the lower the EMG activity, although REM sleep is
marked by skeletal muscle atonia. ECG artefacts may or may not be present during sleep as they do not
depend on the non-REM sleep stage. Phasic electrodermal artefacts can occur upon sudden arousal from
light sleep stages. Chest movements due to respiration may induce head movements that compress some
of the electrodes against the pillow, resulting in slow potential shifts in them [9].
The best way of dealing with artefacts is avoiding or minimising their occurrence during the recording
[23] [9]. When this is not possible (e.g. after the recording) other alternatives like digital filtering may be
applied. However, digital low-pass filtering for reducing muscle and mains artefacts, or high-pass filtering
for reducing sweating and respiration artefacts may severely distort both the EEG and the artefact signal.
EMG artefacts may resemble cerebral activity after filtering (mostly β activity, but also epileptic spikes and
rhythmic α activity). The last alternative is to reject EEG segments contaminated with artefacts [9]. This
is performed in most sleep laboratories by visual inspection, but some automatic detection can also be
performed, like out-of-range checks, and lately using some more sophisticated methods of identification
based on artefact-free models.
3.2 Analysis of the EEG during sleep
The R & K [136] technique for sleep scoring has become the gold standard throughout the world since
its publication in 1968. The scoring is based on the recording of several physiological signals, called the
polysomnograph (PSG). Typically a PSG record consists of 5 to 11 signals, including 2 EEG channels, one
mentalis-submentalis (chin) EMG channel, 1 or 2 EOG channels and one ECG channel (see Fig. 3.3). In
a clinical study to detect sleep-related breathing disorders, special transducers are used to include nasal-oral airflow, respiratory effort recorded both at the level of the chest and the abdomen, and oximetry (oxygen saturation levels). When the number of channels is restricted to one, one of the EEG channels C4-A1, C3-A2 or Cz-Oz is recommended for the single-channel EEG recording. Paper or magnetic
tape used to be the outputs of a PSG device, but are nowadays replaced by digital storage and display of
the digital PSGs [84].
[Figure: electrode positions on the head for sleep monitoring, showing C4 with its EEG reference, the right and left EOG electrodes with a common EOG reference, and the chin EMG electrodes, feeding the EMG, EEG and EOG channels.]
Figure 3.3: Conventional electrode positions for monitoring sleep
A description of the R & K rules for sleep scoring is given briefly in Table 3.1 and Fig. 3.4, and in more
detail in the following section.
3.2.1 Changes in the EEG from alert wakefulness to deep sleep
During wakefulness, the EEG waves of an adult show a low-amplitude, high-frequency and apparently random character, generally contaminated with muscular activity from the temporal or other skeletal muscles.
When the subject closes his eyes and relaxes or when he becomes drowsy, there is usually a reduction
in any muscle and eye movement potentials, plus an increase in the EEG α activity. The slow (< 1 Hz)
rolling of the eyes upwards and the shutting and opening of the eyelids a few times are also signs of
drowsiness [69].
As the subject becomes more drowsy, the α rhythm may be interrupted by periods of relatively low voltage
during which slow lateral eye movements often occur. The slightest stimulus during these periods of
low voltage EEG activity will cause immediate reappearance of the α rhythm. Note that this indicates
not drowsiness but an increase in alertness, for which reason this is called a paradoxical α response.
Alternating periods of low voltage activity and of higher voltage α activity occur for a few minutes,
with the duration of the former progressively increasing until the latter no longer appears, along with a
progressive increase in θ activity indicating that the subject is lightly asleep. Stimuli insufficient to cause
arousal may be strong enough to produce an electronegative sharp wave at the top of the head or vertex
(V-wave). This is defined by R & K as Sleep Stage 1 [87].
Stage 2 is characterised by the appearance of sleep spindles, short (0.5-3s) bursts of 12-14Hz activity consisting of approximately 6-25 complete waves, as well as K-complexes. A K-complex is a large amplitude
biphasic wave of approximately one second duration, maximal at the vertex. K-complexes can have two
different origins. One is as a response to an external stimulus (e.g. a noise) and the other is as an early
manifestation of the slow waves typical of deeper sleep stages [155].
The other two stages, Sleep Stage 3 and Sleep Stage 4, are well distinguished from the rest by the appearance of high amplitude (≥ 75μVpp ) δ waves. This feature gives to these stages the name of slow wave
sleep (SWS) or δ sleep. The difference between the two stages lies in the percentage of this slow pattern
in the analysed segment, being from 20% to 50% for the third stage and greater than 50% for the fourth
[155]. Figure 3.4 shows typical EEG segments for each sleep stage.
Figure 3.4: Sleep EEG stages (taken from [69])
So-called REM sleep can be divided into three phases. The first phase is characterised by the decrease
or even total disappearance of the EMG activity, which has already experienced a decline in going from
wakefulness to deep NREM sleep. After a few minutes the slow waves, spindles and K-complexes in the
EEG are replaced by rapid, low-amplitude waves, as in wakefulness or in the first sleep stage, with the
exception that the α mode does not dominate the EEG. This is the second phase, which only lasts a few
minutes, giving way to the third phase that comes with a burst of rapid eye movements, spikes in EMG
and sometimes visible twitching of the limbs. When REM sleep has a high density of these bursts of eye
movements it is known as phasic REM, while a low density type of REM sleep has received the name of
tonic REM. Tonic REM typically occurs at the beginning of the night whilst phasic REM is usually found
late at night. REM and non-REM periods alternate on a 90-minute cycle through the night, although the
duration of REM increases across the night.
Sleep stage   | Characteristics
Wakefulness   | Low amplitude, high frequency EEG activity (β and α activity); EEG sometimes with EMG artefact
Sleep stage 1 | Increased θ activity; slow eye movements (SEM); vertex sharp waves; a transition stage that lasts only a few minutes
Sleep stage 2 | EEG presents spindles (bursts of α activity) and K-complexes
Sleep stage 3 | EEG with high amplitude, low frequency activity; δ activity appears
Sleep stage 4 | EEG is dominated by δ activity
REM sleep     | EEG presents high frequency, low amplitude waves; EMG generally inhibited; bursts of rapid eye movements (REM) appear, together with spikes in the EMG
Table 3.1: The Rechtschaffen and Kales standard for sleep scoring.
The hypnogram shown in Fig. 2.2, section 2.4.2, illustrates the transitions between sleep stages, the main
features of the NREM-REM cycle, and the proportion of each stage found in a young adult. Sleep stage 1,
being a transitional stage between wakefulness or drowsiness and true sleep (stage 2 or deeper), usually
occupies only 5% of the night. The bulk of human sleep, around 45% of it, is made up of stage 2. Stage
3, another transitional phase, constitutes only about 7% of the sleep, while stage 4 makes up about 13%.
The rest of the total sleep time (20%–30%) is taken up by REM sleep [69].
3.2.2 Visual scoring method
The transition from fully alert wakefulness to deep sleep is a gradual process and it would be very difficult
to determine what the level of sleep is at any moment without dividing the PSG record into epochs of
duration which may be anything from 10s to 2min. The standardised use of 15mm/s and 10mm/s as the
PSG paper speed made the use of 20s or 30s epochs quite convenient, as each epoch is then one page
long. The scorer uses the R & K set of rules to determine the sleep stage per epoch, regardless of the
level of sleep in the previous record or in subsequent records. Eight hours of sleep produce about 400m
of paper. If the record is segmented in 20-30s epochs, this gives approximately 1000-1400 epochs to be
scored visually, which takes an experienced technician over 2 hours, or more if the record has transient
pathological events [155].
Limitations of the R & K scoring rules
In spite of being widely used, the R & K rules have never been appropriately validated, and they were
never designed for scoring pathological sleep [66]. They suffer from major limitations like:
• the 6-value discrete scale to represent a process that is essentially continuous,
• the 20-30s time scale offering a very poor resolution so that transient events shorter than it are
missed,
• the bias introduced due to the failure to address non-sleep related individual variability of characteristics such as α rhythm,
• the failure to address important sleep/wake related physiological processes such as respiratory and
cardiovascular processes and corresponding disorders.
Obviously, the rules have to be adapted and extended. In the 30 years since their publication, the methods of analysis have changed with the advent of the personal computer. The task group on Signal Analysis of the European
Community Concerted Action “Methodology for the analysis of the sleep-wakefulness continuum” [12],
generated guidelines for a computer-based sleep analyser that would overcome the limitations of the
manual standard scoring and the R & K standard set of rules. They proposed a 1s time resolution, and
the tracking of the NREM sleep/wake process along a continuous scale with the 0% level corresponding to
wakefulness and the 100% level corresponding to the deepest SWS, as well as an on/off output indicating
REM sleep. They felt that quantification of REM/NREM should be based only on EEG, EOG and chin-EMG in order to avoid bias from inter-individual and intra-individual non-sleep-related differences
such as α rhythm, vertex and sawtooth³ waves and slow eye movements. They also considered additional
outputs to complement the REM/NREM sleep/wake process, such as a micro-arousal on/off output.
3.2.3 Computerised analysis of the sleep EEG
The need for automatic classification has been widely recognised [12]. Attempts to classify sleep EEG
automatically were made soon after the release of the R & K rules for manual scoring [150]. Different
approaches have been developed, most of them trying to emulate the R & K standard, with or without
overcoming its limitations. Many of them include the analysis of several PSG signals, like the EEG, EOG
and EMG [150] [151] [55] [133] [86] [152] [57] [132] [143] [68], and cardiorespiratory signals [54]
[92].
A computerised classification system is usually fed with artefact-free PSG signal segments of fixed or
adaptive length (typical range 1s-30s), or has an artefact marking/rejection procedure prior to the analysis block. Sometimes the EEG is the only input signal used [94] [72] [75] [78] [147] [160] [123] [16].
A number of features are then extracted using one or more of several methods (time domain, frequency
domain, non-linear dynamics, etc). A classification block combines the features and estimates the sleep
level using a set of rules (decision tree, linear discriminant, fuzzy logic), or an equivalent procedure
(neural networks).
Most of the approaches to classification in the past have used either period analysis (time domain) [22]
[94] [149] [133] [67] [13] [92] [48] [68] [167] or spectral analysis (frequency domain) [164] [57]
[132] [143] [78] [99] [16] [158]. Alternatively other techniques have been introduced, such as wavelet
analysis, autoregressive (AR) modelling [55] [75] [72] [86] [74] [123], principal component analysis
(PCA) [78] and more recently, nonlinear dynamic analysis [2] [52] [137]. The parameters, or features,
obtained have been combined in many different ways to yield classification. One of the most used is
the knowledge-based approach [150] [94] [55] [72] [132] [57] [92] [68]. A Markov-chain maximum
³ Normally related to visual activity while scanning a picture, these random electropositive waves of 20μV amplitude or less, sawtooth (or λ-like) waves, are sometimes seen while the subject is in REM/light sleep.
likelihood model was developed in 1987 by Kemp and collaborators [86], and cluster analysis has also
been investigated [75], with neural network techniques joining the list in the last decade [13] [143]
[147] [160] [123] [16] [158].
Most of these systems show a reasonable discrimination for sleep stage 2 and slow-wave sleep, but all of
them are poor at discriminating REM from wake and stage 1. Holzmann et al. found a high percentage
of disagreement in light sleep scoring for experts revisiting their scoring (intra-rater)[68]. Many authors
use EOG and/or EMG to help the identification of REM [126]. The percentage of agreement with visual
scorers varies between 67% and 90% with a typical value of 83% in artefact-free segments. Some studies
aggregate sleep stages 3 and 4 together and that elevates the percentage of agreement to over 90%
[72] [143]. Another problem, probably inherited from the R & K set of rules that most of the systems
try to emulate, is that classification systems work almost perfectly in healthy subjects but do not work
sufficiently well in sleep-disturbed patients [152] [155] [126]. If an automated system is able to give the
same level of intra-rater and inter-rater agreements as the clinical experts manage (usually about 86% for
the inter-rater agreement, and 91% for intra-rater agreement) then it can be said to be of use for clinical
purposes. Although several commercially available systems can perform sleep staging, visual scoring is
the only reliable method available at the moment when scoring disrupted sleep EEG [112]. It is clear
that computerised analysis cannot fully replace expert opinion, therefore results of an automatic scoring
system require inspection by a trained polysomnographer [126].
Time-domain analysis
Visual analysis of the EEG is based on the identification of patterns and the assessment of mean amplitude
and dominant frequency. Therefore, time-domain measures of the EEG have been among numerous
features used in computerised analysis. Zero-crossing and maximum peak-to-peak amplitude are the
most popular of this kind of descriptors [94] [133]. The zero-crossing count is the number of times
that the signal crosses the base-line and is related to the EEG mean frequency. In 1973, Hjorth [67]
presented several EEG descriptors calculated directly from the time series as an alternative to frequency
analysis. The descriptors are calculated from the derivatives of the EEG signal and have a correspondence
to spectral descriptors. Hjorth descriptors, which measure the standard deviation of the signal (activity),
and ratios of the standard deviations of the signal and its first two derivatives (mobility and complexity)
have often been used in the analysis of the sleep EEG [92] [47]. The signal is band-pass filtered prior
to the calculation of time-domain descriptors which are then used to detect particular types of sleep
patterns [133]. Bankman et al. [13] added measures of slope to the already mentioned time-domain
features for K-complex detection. More recently, Uchida and co-workers [168] have investigated the use
of histogram methods of waveform recognition in sleep EEG. The method, which measures the period
and amplitude of a wave, has the advantage of detecting the frequency, amplitude and duration of single
and superimposed waves.
Uchida and collaborators used period-amplitude analysis of the sleep EEG and compared their analysis
with spectral methods. They found that for some frequency bands the time-domain method does not
detect the waves while the Fourier transform methods perform very well over the entire EEG frequency
range [167]. In contrast, Holzmann et al. found that a zero-crossing strategy gives better results for
slow-wave (< 2Hz and > 75μVpp ) detection than Fourier transform methods [68].
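As an illustration of these time-domain descriptors, the sketch below computes the zero-crossing count and the three Hjorth descriptors from one epoch of samples. It is a minimal sketch, assuming a NumPy array holding artefact-free EEG; it follows the standard definitions rather than the exact implementations of the studies cited above.

```python
import numpy as np

def zero_crossings(x):
    """Count sign changes of the signal: an indirect measure of mean frequency."""
    signs = np.signbit(x).astype(np.int8)
    return int(np.count_nonzero(np.diff(signs)))

def hjorth_descriptors(x):
    """Hjorth's activity, mobility and complexity, with derivatives
    approximated by finite differences."""
    dx = np.diff(x)                       # first "derivative"
    ddx = np.diff(dx)                     # second "derivative"
    activity = np.var(x)                  # variance (standard deviation squared)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

# Example on 3 s of surrogate EEG at an assumed 128 Hz sampling rate
rng = np.random.default_rng(0)
epoch = rng.standard_normal(3 * 128)
print(zero_crossings(epoch), hjorth_descriptors(epoch))
```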
Frequency-domain analysis (bank of filters, FFT, AR modelling)
Given that the sleep process involves gradual shifts in the EEG dominant frequency (see section 3.2.1),
the power spectrum of the signal conveys useful information. There are several approaches to estimate
the power spectrum of a stationary signal ([63] pp.147-8). Although the EEG is non-stationary, it may
be considered piece-wise stationary (for a detailed description of this issue, see chapter 4). The most
popular methods for estimating the power spectrum are the Fourier transform and AR modelling. A bank
of band-pass filters is another popular approach to obtain the power of the EEG frequency bands (see
section 3.1.2) [150].
The Fourier transform has been in use for nearly six decades in the spectral analysis of sleep EEG [89].
Its use increased dramatically in the 60’s with the development of the Fast Fourier Transform algorithm
(FFT) [34], which speeds the calculation up by a factor of N²/(N log N), with optimum performance when N is a power of two⁴. It has a disadvantage when N is small, as the variance of the spectrum estimate is high
for low values of N , but this can be improved by the use of smoothing windows and averaging. Another
disadvantage is that the Fourier transform is only calculated for discrete values of frequency, multiples
of fs /N , where fs is the sampling frequency. Features can be taken directly from the power density
spectrum, as coefficients for frequencies with the highest variance in the sleep continuum [158], or as
peak-frequencies, but the practical norm is to calculate the power (absolute or relative) accumulated in
the EEG bands [57] [132] [143] [99] [16]. PCA has also been applied to the spectrum coefficients to
find out which are the most significant [78].
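A hedged sketch of that practical norm is given below: relative band powers computed from a windowed periodogram. The sampling rate, epoch length and band limits are illustrative assumptions, not the settings of any of the cited studies.

```python
import numpy as np

def relative_band_powers(epoch, fs, bands):
    """Relative power per EEG band from a Hann-windowed periodogram.
    epoch: 1-D array of samples; fs: sampling frequency (Hz);
    bands: {name: (low_hz, high_hz)}, low inclusive, high exclusive."""
    n = len(epoch)
    spectrum = np.abs(np.fft.rfft(epoch * np.hanning(n))) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)        # multiples of fs/N
    in_range = (freqs >= 0.5) & (freqs <= 50.0)   # effective EEG range
    total = spectrum[in_range].sum()
    return {name: spectrum[(freqs >= lo) & (freqs < hi)].sum() / total
            for name, (lo, hi) in bands.items()}

bands = {"delta": (0.5, 3.5), "theta": (4, 8), "alpha": (8, 13), "beta": (15, 25)}
fs = 128                                          # assumed sampling rate
epoch = np.random.default_rng(1).standard_normal(3 * fs)
print(relative_band_powers(epoch, fs, bands))
```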
AR modelling offers a more interesting alternative to FFT methods for power density spectrum estimation.
It yields a lower variance estimate if the model order is kept low, and is continuous in frequency. It
combines the versatility of picking up broad band signals and pure tones with relatively high accuracy,
which makes it suitable for the analysis of the EEG, a signal that may present bursts of waves as well
as background activity. Its relatively high computational complexity is not a problem anymore with the
state of the art in computing technology. Features can be extracted from the power spectrum estimate as
relative or absolute powers in EEG bands [55] [72], or directly from the model parameters [75] [152]
[147] [160] [123] [137]. Smoothing is usually applied to the coefficients to get a better estimate when
the number of samples has to be kept low (stationarity requirement). Chapter 4 will cover this method
in detail.
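For illustration only, the sketch below fits an AR model by solving the Yule-Walker equations with the Levinson-Durbin recursion and evaluates the model spectrum on an arbitrary frequency grid, showing the "continuous in frequency" property. The model order and sampling rate are assumptions for the example; the method as actually used in this work is described in chapter 4.

```python
import numpy as np

def yule_walker_ar(x, order):
    """Fit AR coefficients via the Yule-Walker equations and the
    Levinson-Durbin recursion. Returns (a, sigma2) for the prediction-error
    filter A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p and the noise variance."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for m in range(1, order + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / e   # reflection coefficient
        a[1:m] += k * a[m - 1:0:-1]
        a[m] = k
        e *= (1.0 - k * k)                                 # prediction error update
    return a, e

def ar_psd(a, sigma2, fs, freqs):
    """AR power spectrum, evaluated on any frequency grid (not just fs/N bins)."""
    z = np.exp(-2j * np.pi * np.outer(freqs, np.arange(len(a))) / fs)
    return sigma2 / np.abs(z @ a) ** 2

fs = 128
x = np.random.default_rng(2).standard_normal(3 * fs)      # 3 s epoch
a, sigma2 = yule_walker_ar(x, order=10)                    # illustrative order
freqs = np.linspace(0.5, 50, 500)
print(ar_psd(a, sigma2, fs, freqs)[:5])
```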
Non-linear analysis
Although some evidence has been found that mathematically the EEG signal resembles much more a
stochastic process with changing conditions than a non-linear deterministic process with a chaotic attractor [2], chaos theory offers ways to determine signal complexity.
Shaw et al. used an algorithmic complexity measure as an index of cortical function in rats [146]. Rezek
⁴ N is the number of signal samples.
and Roberts [137] compared four stochastic complexity measures for the EEG, namely AR model order,
spectral entropy, approximate entropy and fractional spectral radius, obtaining best results with the last
one when attempting to detect disturbed sleep with the central EEG channel.
Fell et al. found that non-linear measures discriminate better between sleep stages 1 and 2, while spectral
measures do so with sleep stage 2 and SWS. None of the investigated measures were able to discriminate
between REM sleep and sleep stage 1. The measures were relative δ power, spectral edge, spectral
entropy and first spectral moment (spectral measures), and correlation dimension D2, largest Lyapunov
exponent L2 and approximate Kolmogorov entropy K2 (non-linear methods).
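Of the measures listed, spectral entropy has the simplest definition: the Shannon entropy of the power spectrum after normalising it to a probability distribution. A minimal sketch follows, with illustrative parameters; values near 1 indicate a flat, noise-like spectrum, values near 0 a single dominant rhythm.

```python
import numpy as np

def spectral_entropy(x, fs, fmin=0.5, fmax=50.0):
    """Normalised Shannon entropy of the power spectrum in the
    effective EEG range."""
    X = np.fft.rfft(x * np.hanning(len(x)))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    psd = np.abs(X[(freqs >= fmin) & (freqs <= fmax)]) ** 2
    p = psd / psd.sum()                    # normalise to a distribution
    p = p[p > 0]                           # avoid log(0)
    return -np.sum(p * np.log(p)) / np.log(len(psd))  # scaled to [0, 1]

fs = 128
t = np.arange(3 * fs) / fs
alpha_like = np.sin(2 * np.pi * 10 * t)                 # pure 10 Hz rhythm
noise = np.random.default_rng(3).standard_normal(len(t))
print(spectral_entropy(alpha_like, fs), spectral_entropy(noise, fs))
```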
Classification techniques
The set of features extracted from either frequency-domain or time-domain analysis or a mixture of
both are usually combined in a deterministic way to determine the R & K sleep stage. However, other
approaches involving self-learning classifiers have also been investigated. In 1981, Jansen and co-workers
used AR features and cluster analysis for sleep staging [75]. Later, Kemp et al. developed a model based
maximum likelihood classifier [86]. Kubat and collaborators presented in 1994 an artificial intelligence
approach with automatic induction of decision trees [92]. At the same time, several investigators have
used probability-based approaches such as Bayesian classifiers [72] and neural network classifiers, with
the introduction of a new approach in tracking the sleep continuum in the work of Pardey et al. [123].
Knowledge-based methods We have already pointed out that many of the attempts to perform automatic sleep scoring emulate the visual scoring process of the R & K rules. As a result, numerous
knowledge-based classification systems have been developed. The implementation varies from hybrid
analog-digital logic arrays of “ANDs” and “ORs” [22] [150], or algorithmic “IF-ELSE” rules [132], to
fuzzy-logic systems [94] [55] [68]. Most of the systems extract additional information from the experts,
but heuristic approaches like the one developed by Smith and co-workers, who tried different adjustments
to increase the agreement with visual scoring [151], can be found in the literature.
Neural network methods Neural networks have been used in the detection of characteristic sleep
waves (i.e. spindles, K-complexes, etc) which can then be used to help automatic sleep staging. Shimada
et al. have trained a 3-layer neural network for the detection of these waves using a time-frequency 2D
array, consisting of 11 sets of 12 FFT coefficients in a 3.84s window [158]. Wu and co-workers developed
EEG artefact rejection by training a neural network to recognise the typical artefact patterns [73].
Neural networks have also been used to classify sleep according to the R & K scale. Baumgart-Schmitt
and collaborators [16] [15] used a mixture of experts to classify sleep using 31 power spectral features
and nine 3-layer neural networks, each one trained with data from a different healthy subject. They
obtained good discrimination of REM with respect to Wakefulness and Sleep Stage 1.
Previous work in the group in which the research described in this thesis has been carried out [123] has
used a neural network to track the dynamic development of sleep on a continuous scale from deep sleep (stage 4) to wakefulness on a second-by-second basis. The neural network output has the ability
to pinpoint short-time events and more cyclic events like Cheyne-Stokes respiration using only one EEG
channel and AR features.
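Purely to fix ideas, the sketch below shows the general shape of such a network: a one-hidden-layer perceptron mapping an AR feature vector from one second of EEG to a single output on a wakefulness-to-deep-sleep scale. The layer sizes, feature dimension and (omitted) training procedure are assumptions for the example, not the configuration used in [123] or in the work described in later chapters.

```python
import numpy as np

class SleepDepthMLP:
    """One-hidden-layer perceptron: AR features in, sleep depth (0-1) out.
    Weights here are random; training (e.g. backpropagation against
    expert-scored epochs) is omitted from this sketch."""
    def __init__(self, n_features=10, n_hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.5, (n_hidden, n_features))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, n_hidden)
        self.b2 = 0.0

    def forward(self, features):
        h = np.tanh(self.W1 @ features + self.b1)                # hidden layer
        return 1.0 / (1.0 + np.exp(-(self.W2 @ h + self.b2)))    # 0 = wake, 1 = SWS

# One output per second: feed the AR feature vector of each 1 s step.
net = SleepDepthMLP()
ar_features = np.random.default_rng(4).standard_normal(10)  # stand-in for AR features
print(net.forward(ar_features))   # a point on the wake-sleep continuum
```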
3.3 Analysis of the EEG for the detection of micro-arousals
3.3.1 Cortical arousals
There are several types of arousals. Some sleep disturbances do not reach the brain cortex; these are called "sub-cortical" or "autonomic" arousals, and they can be detected by monitoring the heart rate and
the beat-to-beat blood pressure, looking for an increase in the pulse rate and blood pressure along with
an increase in the respiratory effort, following an apnoea/hypopnoea event. There are also those which
arise in the brain cortex, the so-called “cortical” or EEG arousals, which as their name suggests, can be
detected by monitoring the EEG. Cortical arousals can have several causes, not all related to OSA, for
instance external noise, changes in light, snoring, leg movements, bowel disturbance, bladder distension
and gastroesophageal reflux to mention a few of them. Pain and some forms of insomnia can also be
causes of arousals, but the most common cause is OSA. Ageing is another strong factor in the tendency
to arousal [155].
During a PSG all the external variables like light and noise can be reasonably controlled, and monitoring
leg movements (by transducers located on the legs, or video recording) and snoring (by a microphone
taped on the neck) may help to discard those arousals produced by causes other than OSA [112].
Bennett and colleagues [18] found that detection of autonomic activation is as good as detecting cortical
arousal for predicting daytime sleepiness in OSA patients, but it does not convey any extra information.
As a rule patients with sleep disorders or excessive daytime sleepiness have normal electrophysiological
EEG characteristics both in frequency and amplitude [61]. The OSA sleep disorder does not alter the
physiology of sleep but has a pronounced effect on the sequence of states. Recent studies [123] [48]
have claimed that the EEG provides sufficient information to identify most micro-arousals. A cortical
arousal caused by an apnoeic/hypopnoeic event usually looks like an increase in the frequency in the
EEG (see Fig. 3.5). Note that we will discuss this in more detail in section 6.2.5.
Figure 3.5: Apnoeic event
R & K criteria for sleep scoring allow the scoring of arousals longer than 10s as well as the so-called
“movement arousals”, but the set of rules was not designed for the scoring of transient events (shorter
than 10s). If a 30s epoch has more than 15s of slow waves plus a short arousal, the epoch will be scored
as stage 3 or 4, ignoring the presence of the arousal. In this way, a night sleep record may look “normal”,
whereas the fact is that the subject has experienced hundreds of micro-arousals [155].
The American Sleep Disorders Association (ASDA) attempted to overcome the R & K deficiency in the
scoring of micro-arousals when it published a set of rules for EEG arousal scoring. The rules are independent of the R & K criteria, and are summarised in the next section.
3.3.2 ASDA rules for cortical arousals
Technically, ASDA defined an arousal as an abrupt shift in EEG frequency, which may include θ, α and/or
frequencies greater than 16Hz but not spindles, subject to the following summary of rules and conditions
[11]:
1. A minimum of 10 continuous seconds of sleep in any stage must occur prior to an EEG arousal for
it to be scored as an arousal. That is a consequence of the first and second rules of the EEG arousal
scoring set of rules. The first one establishes that the subject must be asleep. The second one is
such as to prevent the scoring of two related arousals as independent arousals.
2. The minimum duration is 3 seconds. There is both a physiological basis and a methodological
reason for this choice: reliable scoring of events shorter than this is difficult to achieve visually.
3. To score an arousal in REM sleep there must be a concurrent increase in submental EMG.
4. Artefacts (including pen blocking or saturations), K-complexes or δ waves are not scored as arousals
unless accompanied by a frequency shift in another EEG channel. If they precede the frequency
shift, they are not included in the three seconds criterion. Indeed, δ wave bursts are not necessarily
related to arousals and as a result, more evidence should be used, for example, respiratory tracing.
5. To score 3 seconds of α sleep as an arousal, it must be preceded by 10 or more seconds of α-free
sleep.
6. Transitions from one stage to another must meet the criteria indicated above to be scored as an
arousal.
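The two duration-related rules lend themselves to a simple illustration. The sketch below (with an invented event representation: candidate frequency shifts as onset/duration pairs, plus a per-second "asleep" flag) keeps only candidates satisfying rules 1 and 2; the remaining rules would require the EMG, artefact and α-context information described above.

```python
def score_arousals(candidates, asleep):
    """Apply ASDA rules 1 and 2 to candidate frequency-shift events.
    candidates: list of (onset_s, duration_s), sorted by onset;
    asleep: asleep[t] is True if the subject was asleep during second t
    (any sleep stage). Requiring 10 s of continuous prior sleep also
    prevents two closely spaced shifts being scored independently."""
    scored = []
    for onset, duration in candidates:
        if duration < 3:                    # rule 2: minimum 3 s duration
            continue
        start = int(onset)
        prior = asleep[max(0, start - 10):start]
        if len(prior) == 10 and all(prior): # rule 1: 10 s of prior sleep
            scored.append((onset, duration))
    return scored

# 1 h of "sleep" with a brief awakening around t = 1000 s
asleep = [True] * 3600
for t in range(1000, 1015):
    asleep[t] = False
candidates = [(500.0, 2.0),    # too short: rejected
              (800.0, 5.0),    # scored
              (1012.0, 4.0)]   # no 10 s of prior sleep: rejected
print(score_arousals(candidates, asleep))
```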
The time scale for arousal scoring is much shorter (changes of 3s or more) than the 20-30s for visual
sleep scoring, and therefore arousal visual scoring is more time-consuming than visual sleep staging, and
also more inaccurate. In spite of the efforts of ASDA, the scoring of micro-arousal events is still difficult
as inter-rater variability is very high, especially when the micro-arousal occurs during REM or light sleep [46]. Townsend and Tarassenko [165] evaluated the agreement between three scorers of EEG micro-arousals on an 11-patient database and found very little agreement (0-10%) over a mean of 70 arousals
per patient when counting the number of arousals scored. Indeed the figure got worse if the starting time
and duration of the arousals were also considered, as for some recordings none of the experts scored the
same event as an arousal.
3.3.3 Computerised micro-arousal scoring
As can be deduced from the above rules, the detection of arousals is not easy. Arousals may occur during
any sleep stage, and are particularly difficult to detect in REM sleep when the EEG is the only signal
used in the analysis. There is also some controversy concerning the comprehensiveness of the ASDA
rules. Townsend and Tarassenko [165] as well as Drinnan and co-workers [48] have questioned the
absence of a “gold-standard” definition for arousals. While the ASDA definition is in widespread use,
EEG changes which do not meet the criteria have been associated with daytime sleepiness [29]. Other
signals have been suggested as indices of arousals, like blood pressure [129] [39], but these indicators
correlate well with EEG arousal. Hypoxemia (reduced arterial oxygen saturation) has been found to
play a role in the capacity to stay awake, rather than in the propensity to fall asleep, while indices of
sleep disruption correlate with both [17]. Guilleminault et al. [58] evaluated the role of respiratory
disturbance, oxygen saturation, body mass and nocturnal sleep time in daytime sleepiness, but did not
find significant correlation between them. They concluded that the best predictor of the excessive daytime
sleepiness frequently found in OSA patients is the nocturnal PSG and the sleep structure abnormalities
found in the brain activity recording.
Stradling and collaborators [156] found that the relationship between the severity of the OSA measured
by a sleep study and the daytime sleepiness of the subject is poor. They suggested that the importance of
a micro-arousal is related to both its duration and the depth of sleep prior to the arousal. Accordingly, it
would be desirable to extract this information automatically from a computerised arousal scoring system.
3.3.4 Using physiological signals other than the EEG
Recently, Aguirre and co-workers [3] modelled blood oxygen saturation, heart rate and respiration signals from a patient with OSA, using a nonlinear AR moving average model with exogenous inputs in
which the blood oxygen saturation is the output of the model and the other two signals the inputs. They
successfully reconstructed the respiration signal from the other two, suggesting that the dynamics underlying these signals are nonlinear and deterministic. However, it seems that, while these signals are
very well correlated, there is not a unique relationship with the changes in the EEG following an apnoeic
event as was found by Townsend and Tarassenko [165] who investigated pulse transit time (a measure
of beat-to-beat blood pressure) and heart rate for micro-arousal detection. They found that increases in
heart rate and decreases in pulse transit time appear to occur many times during the night, relatively regularly and independently of the occurrence of micro-arousals.
Drinnan et al. [47] investigated the relation between movement or respiration signals (wrist movement,
ankle movement, left and right tibial electromyogram and phase change in ribcage-abdominal movement) and cortical arousals. Their conclusions were that arousal was accompanied by movement only
on a minority of occasions; in some subjects, the number of movement events exceeded the number of
arousals, and some arousals were accompanied by more than one movement. This may explain the poor
relationship that they found between movement signals and arousals. Ribcage-abdominal phase was the
only index which showed a significant relation with cortical arousals, but despite the high correlation, in
some obese subjects the sensitivity and the positive predictive accuracy for phase were as poor as for the other signals investigated, owing to the loose coupling between the sensors used and the diaphragmatic motion. Other subjects showed phase changes opposite to those expected. Macey and collaborators [100]
[101] found similar results when using time-domain features and neural network methods to detect
apnoea events from the abdominal breathing signal in infants with central apnoeas.
3.3.5 Using the EEG in arousal detection
Drinnan and collaborators [48] investigated 10 possible indices of arousal using the EEG derivation
Cz/Oz. Two of the indices were related to amplitude (Hjorth's activity and the CFM, cerebral function monitor) and eight to frequency (α power, zero-crossing rate, δ crossing rate, i.e. the zero-crossing rate of the EEG's first derivative, Hjorth's mobility, frequency peak, frequency mean, frequency mean of the CFM-filtered EEG and Hjorth's complexity). From all these they found that three offered good discrimination in
terms of identifying arousals: the zero crossing rate, Hjorth’s mobility and frequency mean.
Huupponen et al. [71] used a single channel of EEG and a neural network to detect arousals. Second-by-second changes in the power of the EEG bands relative to the average power in a 30s segment were chosen as input features for the neural network. Amongst the detected arousals, there were only 41% true positives and a very high number of false positives. A recent study, performed by Di Carli and co-workers [40], used two EEG channels and one EMG channel for the automatic detection of arousals in
a set of 11 patients with various pathologies, including OSA. They used wavelets to analyse the EEG
in the time-frequency domain, then measured the relative powers in the EEG bands and calculated the
ratios between these powers, computed for both short-term and long-term averages. These indices were
used along with the other measures from the EMG as the inputs to a linear discriminant function, whose
free parameters were set to maximise the sensitivity and selectivity for detecting arousals previously
scored by two experts. They made a distinction between “definite” arousals and “possible” arousals in
the visual scoring, and also post-analysed their results correlating the starting time and duration of the
micro-arousals detected by the computer with the ones detected by the experts. The automatic detection
yielded an average 57% agreement with the experts, while the agreement between experts reached 69%.
The percentage of man-machine agreement increased to approximately 75% when only definite arousals
were considered. Both Huupponen’s and Di Carli’s detectors took into account the context, mimicking
the visual scoring according to the ASDA rules.
3.4 Analysis of the EEG for vigilance monitoring
The study of the wake EEG is much more difficult than that of the sleep EEG as the signals are more
prone to artefacts and show subtle changes as the level of alertness of the individual varies. It is also
very difficult to validate the alertness measures derived from the EEG with others, like task performance,
because training and motivation play an important role in the ability of the subject to perform well while
drowsy.
3.4.1 Changes in the EEG from alertness to drowsiness
The fully awake, responsive state is associated in the EEG with the absence of any rhythmic activity.
The EEG has low amplitude and a random pattern. Also, multiple EMG artefacts are present. The
physiological explanation for this is that the responses to alerting stimuli are mediated by the ascending
reticular activating system of the brain stem [65] which also desynchronises the cortical activity [134].
As the individual relaxes, rhythmical activity appears, most commonly as α wave activity, the amplitude
of the EEG increases and the muscle activity diminishes. The α rhythm is almost always found in the
EEG of healthy, awake, unanaesthetised subjects. Its amplitude, however, is usually very low and it is
only picked up by recorders when it becomes strong as the person becomes drowsy or closes their eyes.
The relationship between the occurrence of an α wave and the brain status is intricate; most often, the α
rhythm appears in individuals relaxed and prone to sleepiness, i.e. drowsy. With further advance towards
drowsiness there is an α activity drop, the α sequences becoming less and less continuous, eventually
giving way to θ activity at the onset of sleep. θ activity is most commonly found in the 6-7 Hz band and
is stronger at the onset of drowsiness [85].
Spatially, the most important changes are in the amplitude of the α activity which occur predominantly
at the occipital sites, while an increase in the slow, mainly θ, activity is more diffuse [142]. EEG changes
do not appear until the subjective symptoms of sleepiness become manifest [5].
Slow eye movements (SEM) are probably the most sensitive variable for differentiating between
sleepiness and alertness [88] [118][142][170]. However, in practice it is very difficult to score SEM
since blinks and rapid eye movements interfere [5]. An increase in motor activity (EMG) is shown in
subjects struggling against imminent drops in alertness [32].
The appearance of α rhythm does not necessarily indicate complete eye closure or blurred vision; sometimes it may be associated with the perception of "being sleepy with open eyes" [88]. The α rhythm is
particularly problematic for reasons not totally clear yet. Conradt and co-workers found differences in
“fast” α, “low” α activities and reaction times, but the differences are very difficult to detect [33]. The
changes in α activity on a small time scale are somewhat different depending on whether the eyes were
initially open or closed [32].
Individual EEG differences
A particular problem for vigilance studies is the difference between individuals. Almost all the vigilance
studies using EEG report problems with a proportion of the subjects exhibiting abnormal EEG. Some
individuals are unable to maintain α activity for more than 30s with closed eyes while others show much
α activity with eyes open even when at maximum alertness. Moreover these “α-plus” subjects do not
experience the normal increase in α activity when losing alertness, instead their α waves decrease with
sleepiness. Sometimes their α activity amplitude spreads into the θ band. These observations suggest the
need for individual calibration of sleepiness effects on the EEG [5]. This will be discussed in more detail
in section 10.5.
3.4.2 EEG analysis in vigilance studies
The central referential electrode montage C3-A2 is widely used to record the EEG in vigilance studies
[35] [61] [166] [106] as it is recommended by the standard manual for sleep stage scoring [136]. The
manual also recommends an epoch length of 15-30s, and this has also been adopted for alertness scoring
[38] [8] [157]. However, episodes of stage 1 or “micro-sleep” periods as brief as 1 − 10s have been
identified [142] [130] [126].
Alford et al. developed a sleepiness scale based entirely on PSG measures using 15s-epochs. The scale
has 6 waking categories and one sleep category (see table 3.2) [8]. This scale will be considered in more
detail in chapter 7 of this thesis. Given that vigilance stages may change within seconds, the EEG in
a 30s window is not stationary in terms of vigilance [127], and a correct statement can no longer be
made with respect to any information averaged over 30s [93]. Penzel and Petzold [127] scored the EEG
in variable length segments according to the patterning or rhythmicity and Varri et al. used an adaptive
segmentation algorithm for the EEG prior to visual scoring, resulting in segments of 0.5s to 2s [170].
With their technique, a 90 min vigilance test may consume an entire day of work for a technician (2-3
hours from preparation to the removal of the electrodes, plus 5 hours scoring the PSG) [126]. As wake
EEG is more complex than sleep EEG, the inter-rater agreement for vigilance EEG scoring is usually lower
(≈72% [61]) than in the sleep case (86% according to [126]).
Vigilance sub-category                  | Description
Active Wakefulness (Active)             | active/alert pattern; more than 2 eye movements per epoch; increased/definite body movement
Quiet Wakefulness Plus (QWP)            | active/alert pattern; more than 2 eye movements per epoch; average/possible/no body movements
Quiet Wakefulness (QW)                  | alert pattern; less than 2 eye movements per epoch; average/reduced/definitely no body movements
Wakefulness with Intermittent α (WIα)   | definite burst of α rhythm for less than half of an epoch
Wakefulness with Continuous α (WCα)     | definite burst of α rhythm for more than half of an epoch
Wakefulness with Intermittent θ (WIθ)   | definite burst of θ rhythm for less than half of an epoch (plus α rhythm, if present)
Wakefulness with Continuous θ (WCθ)     | definite burst of θ rhythm for more than half of an epoch (stage 1 of sleep)
Table 3.2: The vigilance sub-categories and their definition
Spectral methods
Changes associated with sleepiness in the EEG are mainly in the patterns and rhythms of the signal.
Therefore it seems that the signal is better analysed in the frequency domain, either by power spectrum
estimation or by band-pass filtering, using the standard EEG frequency bands to define the filter boundaries. The rhythms most affected by drowsiness are θ, δ and α in that order. However, they do not change
in the same way, nor are all of the changes linear with respect to the decrease in performance. Late in the
60’s Daniel found that θ waves dropped significantly prior to failures in a detection task, and the occurrence of α waves was not necessarily correlated with errors [38]. Later on Lorenzo et al., using central
electrodes, found a linear increase in θ power as a result of sleep deprivation which was also linked to
deterioration in performance [97]. Da Rosa et al. modelled the awake and sleep EEG with sufficient accuracy using the linearisation and simplification of a nonlinear distributed parameter physiological model
[36]. Studies on a minute scale showed that α power declines with drowsiness, while θ power increases
linearly with the loss in performance.
Flight simulations and in-cockpit studies have found correlation between EEG power-spectrum and pilot
performance, except for the α band [153]. Makeig and Jung [104] found that the second eigenvector of
the normalised EEG log spectrum is highly correlated with variations in drowsiness and sleep onset.
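A sketch of that style of analysis is given below: stack the normalised log spectra of successive epochs, extract the principal spectral modes with a singular value decomposition, and project each epoch onto the second mode. All parameters are illustrative, and the preprocessing is simplified relative to Makeig and Jung's actual pipeline.

```python
import numpy as np

def normalised_log_spectra(epochs):
    """Row-wise log power spectra of a set of EEG epochs, with the
    mean spectrum removed (a simple form of normalisation)."""
    w = np.hanning(epochs.shape[1])
    psd = np.abs(np.fft.rfft(epochs * w, axis=1)) ** 2 + 1e-12
    logp = np.log(psd)
    return logp - logp.mean(axis=0)

fs, n_epochs = 128, 200                              # assumed values
rng = np.random.default_rng(5)
epochs = rng.standard_normal((n_epochs, 2 * fs))     # surrogate 2 s epochs
S = normalised_log_spectra(epochs)
# Rows of Vt are the eigenvectors (spectral "modes") of the data
U, s, Vt = np.linalg.svd(S, full_matrices=False)
second_mode_score = S @ Vt[1]   # projection onto the second eigenvector,
                                # the quantity reported to track drowsiness
print(second_mode_score[:5])
```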
3.4.3 Vigilance monitoring algorithms
Attempts to implement an alertness monitor follow two major trends in pattern classification: the rule-based type and the neural network approach. The signals most commonly used in these algorithms are the EEG, the EOG and the EMG, but one of the prototypes for driver performance monitoring uses a non-physiological signal, a measurement of the vehicle's steering (see sub-section Neural Network methods
below). Some of the prototypes have been used only on simulators, while others have also been tested
in real conditions. Ambiguous data and inter-subject variability seem to be a common problem in all of
them.
As in sleep, the existing systems for automatic vigilance scoring are not yet suitable for clinical work,
requiring supervision from a skilled technician. Results in patients with EEG alterations are not reliable
unless their abnormality has been taken into account when developing the system.
Rule-based algorithms
In 1989 Penzel and Petzold [127] developed a sub-vigil state rule-based classifier based on frequency
domain features extracted from 2s segments of EEG. They achieved 84.4% agreement with consensus-labelled data and noted that the inter-rater variability defines the limit of what can be achieved for man-machine agreement. The inter-rater variability was 76% and the intra-rater variability was 81.6% on
their data set. The algorithm was used on OSA data and yielded “good results” in detecting arousals.
Varri et al.'s [170] rule-based computerised system for alertness scoring used more inputs: two EEG, two EOG and one EMG channel. The system applied adaptive signal segmentation based on mean amplitude
and mean frequency measures, and a bank of filters provided the means of calculating the power within
each EEG band. A similar sub-system detected eye movements and EMG power. The effect of inter-subject
variability was reduced by recording 3 minutes of EEG with the eyes open in an alert condition and 3
minutes with the eyes closed in a quiet condition to provide reference values for the power in each EEG
band. They found that eye movement can play a very important role in alertness monitoring. The system
gave a 61.6% man-machine agreement. Hasan et al. [61] used the system with new data, having to
perform “prior minor adjustments” to compensate for the differences with the training data. They divided
the group into low/high α activity. They also found a value of 61.8% for the man-machine agreement, for
an inter-rater agreement of 71.9%, and noted that visual scorers had difficulties in correctly identifying
all the bursts of brain waves, especially θ.
Neural Network methods
As in many other classification problems, neural network methods have been applied to the problem of
alertness/drowsiness estimation. In 1992 Venturini et al. [171] attempted to perform real-time estimation of alertness on a minute scale using one EEG channel and a neural network. The powers at 5 significant frequencies were used as input features. The neural network had difficulty in achieving good generalisation due to the small size of the available data set; therefore the jack-knife method was used for training
(see chapter 8 for a description of this methodology). Results were “good”, reported as being better than
a linear discriminator on subjects who missed more than 40% of the target sounds on an auditory vigilance task. They also tried to develop a similar system based on event-related potentials (ERP), obtaining an accuracy of 96% on data averaged over 28 minutes and of 90% on data averaged over 2 minutes.
However, ERP has two great disadvantages: firstly, it requires the introduction of a distracting sound,
and secondly it cannot be performed on a second-by-second basis because an ERP requires averaging
of a series of repetitive stimuli over at least a 2-min long window to be extracted from the background
EEG. Jung and Makeig [80] refined the system by using 2 EEG channels and a neural network using the
power spectrum as features and PCA to reduce the dimensionality of the feature space. They obtained a
reasonable match with respect to the predictions made by an a priori model and using linear regression.
More recently, Roberts et al. [139] attempted to predict the level of vigilance using multivariate AR
modelling of 2 symmetric channels of EEG (T3 and T4) and the blink rate from 2 channels of EOG
as input features to a committee of neural networks known as Radial Basis Function (RBF) networks
using thin-plate splines as basis functions. They made a comparative study, training the neural networks
for regression and for classification, the latter using only extreme-value labels. They trained the neural
networks in a Bayesian framework that allows integration over the unknown parameters (see [102] and
[103] for more detail) and which provides error bars for the results of the neural network analysis. They
obtained “reasonable” correlation with the smoothed human-expert assessment.
Trutschel et al. [166] combined the neural network approach with fuzzy logic when developing a neuro-fuzzy hybrid system to detect micro-sleep events. The device consisted of 4 neural networks, one for each
of four EEG channels, and a fuzzy-logic combiner. They used the system to monitor alertness in a driving
simulation study, obtaining “high” correlation between the number of micro-sleeps detected per hour and
the accident statistics per hour during the night.
Physiological signals are not the only sources of information which can provide measures of alertness.
Performance measures give an indirect way of monitoring alertness. A vehicle-based signal, the steering
measure, has been used to track driver performance and alertness [157]. Power spectrum, mean and
variance were chosen as input features in a neural network. Θ-plus individuals (whose EEG displays θ waves while they are awake) were rejected from the
study. The system only worked with 75% of the drivers. Poor results may, however, have been due to
contradictory data. For instance, the experts who labelled the data using EEG, EOG and EMG channels,
scored one subject as being asleep for nearly two hours of driving. These results indicate that the steering measure and alertness are not 100% correlated.
Shortcomings
As mentioned above, EEG and SEM are the most significant physiological signals in alertness assessment.
However, SEM is very difficult to measure, and the EEG presents two disadvantages: firstly, the inherent complexity of the wake EEG, which is affected by many factors such as task characteristics, motivation and mood; and secondly, the inter-subject EEG variability.
The EEG shows a wide spread of characteristics across the population, even within groups with the same
gender and age range. Matsuura et al. [106] found a large inter-individual variability especially with
respect to age. The percentage of α time and α continuity were greater in males than in females after
adolescence, the percentage of θ time was greater in females than in males during childhood, and the
percentage of β time was higher in females than in males at all ages.
As we said in section 3.4.1, in about 10% of the population, visual inspection of the EEG shows α rhythm
during wakefulness, while for the other 90% the EEG only shows α rhythms when the subjects are in
eyes-shut wakefulness or in the first sleep stages. Another 10% of the population shows very low or no α
activity with eyes closed. The first group is known as α-plus (α+) or P-type while the other is the M-type,
P being used for persistent and M for minimal [87]. One of the vigilance studies found one α-plus subject
whose α activity decreased when becoming drowsy instead of the normal increase experienced by the
rest of the subjects [5].
A study of short-term EEG variability using the FFT suggests that interpretation of relative measures of
δ, θ and β in individual spectra may be dependent on absolute α power [120]. Varri et al. [170] divide
the data into low or high α to adapt their algorithm to the “normal” differences in α activity. As already
mentioned, Hasan et al. [61] had to perform “prior minor adjustments” to compensate for the differences
with the training data. They also found that subjects with poorly defined occipital α activity constitute a
special problem in the detection of drowsiness [61].
A third problem in alertness/drowsiness scoring using the EEG comes from the standard procedures
followed to score the sleep EEG. The standard set of rules for sleep scoring [136] recommends a length
of 15-30s for the EEG epochs. However, Kubicki et al. opine that it is often difficult to make a distinction
between an “α-sleep type” and pre-arousals (micro-arousals) on this time scale [93].
Portable devices
A few commercial alertness monitoring devices, based on one or several measures such as eye-tracking, pupillometry, eyelid closures, head motion detectors, electrophysiological and skin measures and performance deterioration, are currently available [31][116]. A specialized company [31] advertises a micro-sleep/fatigue detection algorithm that uses advanced neural network and fuzzy logic hybrid systems for
detecting and predicting the occurrence of micro-sleeps, a description that coincides with the system developed by Trutschel et al. [166]. The same company offers integrated systems with alertness monitoring
and alertness stimulation/ micro-sleep suppression technologies, i.e. vibration, aroma, lighting, sound
and interactive performance systems combined with automatic micro-sleep/fatigue detection.
Chapter 4
Parametric modelling and linear prediction
This chapter reviews the theories of auto-regressive (AR) modelling and linear prediction, after an introductory section on spectrum estimation. A more detailed review of AR modelling can be found in [63].
Noise classification can be found in [62], and filter structures in [122].
4.1 Spectrum estimation
4.1.1 Deterministic continuous in time signals
Let x(t) be a deterministic continuous signal with finite energy. Its Fourier transform Xc (f ) is given by
Eq. 4.1:
X_c(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \qquad (4.1)
where the subscript c is used to distinguish it from its counterpart in the discrete-time domain.
Given the Fourier transform Xc (f ), the signal x(t) can be recovered using the inverse Fourier transform:
x(t) = \int_{-\infty}^{\infty} X_c(f)\, e^{j 2\pi f t}\, df \qquad (4.2)
4.1.2 Stochastic signals
Many physical phenomena occur in such a complicated way that, even if they are governed by deterministic laws, the almost infinite number of interactions and the noise present in the sensors make the use
of a probabilistic model more sensible. Stochastic signals1 carry an infinite amount of energy, and the
Fourier transform integral as defined in Eq 4.1 normally does not exist. They are not periodic, so the
Fourier series expansion does not apply either. Instead of the energy content, we may be interested in
the power (time average of energy) distribution with frequency. If the generating process is stationary1 ,
second order averages like the autocorrelation and the autocovariance offer an alternative to performing
the time-frequency transform. Normally, the autocovariance tends to zero as the lag increases, but if the
process is zero-mean, the autocorrelation equals the autocovariance and therefore shows the same trend.
This is a sufficient condition for the existence of the Fourier transform of the autocorrelation, given by
Eq. 4.3:

R(f) = \int_{-\infty}^{\infty} r(\tau)\, e^{-j 2\pi f \tau}\, d\tau, \qquad R(\omega) = \int_{-\infty}^{\infty} r(\tau)\, e^{-j\omega\tau}\, d\tau \qquad (4.3)
The autocorrelation at lag zero, which is equal to the average power of the signal, is related to the Fourier
transform R(f ) by the Wiener-Khinchin theorem:
r(0) = E[x(t)^2] = \int_{-\infty}^{\infty} R(f)\, df \qquad (4.4)
Therefore, the function R(f ) represents the distribution of the power in the frequency domain, as a result
of which it has been named power spectral density (PSD) or power spectrum of the signal, often denoted
as S(f ):
S(f) = R(f), \qquad S(\omega) = R(\omega) \qquad (4.5)
The PSD has several properties which are reviewed in [62, pp.254-56].
Estimating the power spectrum
Autocorrelation function estimators  The autocorrelation function is an average over the ensemble x(t, ξ)¹. Usually only a single realisation x(t) (i.e. fixed ξ) of a given process x(t, ξ) is available, leaving us unable to estimate r(τ) unless ergodicity is assumed. If the process is ergodic, the autocorrelation function of the process equals the time average over a single realisation given by the left-hand side of Eq. 4.6:

r(\tau) = E[x(t)\, x(t+\tau)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} x(t)\, x(t+\tau)\, dt \qquad (4.6)

¹ For a definition and a review of stochastic processes, see Appendix A.
However, in most cases the signal x(t) is only available during a limited interval of time. Then the autocorrelation function can only be estimated. Denoting by x'(t) the signal x(t) truncated by a rectangular window of length 2T, we can estimate r(τ) as:

\hat{r}(\tau) = \frac{1}{2T - |\tau|} \int_{-T + |\tau|/2}^{T - |\tau|/2} x\left(t + \frac{|\tau|}{2}\right) x\left(t - \frac{|\tau|}{2}\right) dt \qquad (4.7)

Eq. 4.7 is valid for |τ| < 2T; for |τ| ≥ 2T the estimate r̂(τ) is set to zero. This is an unbiased estimator
(i.e. its mean value is the real value of r(τ )), but its variance increases as |τ | increases, because of the
factor 2T − |τ| in the denominator. Instead, the estimator r̂'(τ):

\hat{r}'(\tau) = \frac{2T - |\tau|}{2T}\, \hat{r}(\tau) \qquad (4.8)
(4.8)
has a smaller variance, and although it is a biased estimator, it is more commonly used because its Fourier transform is related to the energy density spectrum of the truncated signal x'(t). Indeed, r̂'(τ) is equal to:
\hat{r}'(\tau) = \frac{1}{2T}\, x'(\tau) * x'(-\tau) \qquad (4.9)
where the symbol ∗ represents convolution in τ .
The periodogram: Fourier estimate of the PSD  Invoking the Fourier transform property of convolution in time, and noting that the transform of x'(−τ) is X'(−f), the Fourier transform of r̂'(τ) is:
\hat{R}'(f) = \frac{1}{2T}\, X'(f)\, X'(-f) = \frac{1}{2T}\, |X'(f)|^2 \qquad (4.10)
Then the PSD estimate using the estimator r̂'(τ) for the autocorrelation is:

\hat{S}'(f) = \frac{1}{2T}\, |X'(f)|^2 = \frac{1}{2T} \left| \int_{-T}^{T} x(t)\, e^{-j 2\pi f t}\, dt \right|^2 = \int_{-2T}^{2T} \hat{r}'(\tau)\, e^{-j 2\pi f \tau}\, d\tau \qquad (4.11)
The function Ŝ'(f) is called the periodogram. It is an asymptotically unbiased estimator but its variance increases with T. This surprising result is due to the integral in τ of the estimator r̂'(τ), whose variance increases as |τ| approaches 2T. In the limit T → ∞ the periodogram tends to a white-noise process with mean S(f). Smoothing windows have been widely used as palliatives to overcome this behaviour, either applied to the autocorrelation estimate r̂'(τ) to de-emphasise the unreliable values at the borders, or convolved with the periodogram to reduce the variance directly.
Discrete-in-time stationary stochastic processes If the signal is sampled in time, the equations above
change accordingly. The autocorrelation function r(m) is now a function of an integer lag m. Its Fourier
transform R(ω) is periodic, as a result of the sampling in time, and the total power can be found simply
by integrating over a period of R(ω):
P = r(0) = \frac{1}{2\pi} \int_{-\pi}^{\pi} R(\omega)\, d\omega \qquad (4.12)

where R(ω) is:

R(e^{j\omega}) = \sum_{m=-\infty}^{\infty} r(m)\, e^{-j\omega m} \qquad (4.13)
If only N samples of the time series x(n) have been taken, the discrete version of the autocorrelation
estimator in Eq. 4.8 can be calculated as:
\hat{r}'(m) = \frac{1}{N} \sum_{n=0}^{N-|m|-1} x(n)\, x(n+|m|) \qquad (4.14)
This estimator presents the same characteristics as its continuous-time version. The expected value of r̂'(m) is (N − |m|)/N times r(m), so it is biased, but asymptotically unbiased, as the bias tends to zero as N increases. Also,
its variance increases as m approaches N . A full expression for this variance is very difficult to find
for non-Gaussian processes [122]. However, Jenkins and Watt [77] conjecture that, in many cases, the
mean-square error of r̂'(m) is less than that of the unbiased estimator.
Discrete-in-time periodogram
Based on the Jenkins and Watt conjecture, Eq. 4.14 is used in Eq 4.13:
R'(e^{j\omega}) = \frac{1}{N} \sum_{m=-N+1}^{N-1} \sum_{n=0}^{N-|m|-1} x(n)\, x(n+|m|)\, e^{-j\omega m} \qquad (4.15)
After some mathematical manipulation [122, pp.542-3]:
R'(e^{j\omega}) = \frac{1}{N} \sum_{k=0}^{N-1} \sum_{n=0}^{N-1} x(n)\, x(k)\, e^{-j\omega(k-n)} = \frac{1}{N} \left( \sum_{k=0}^{N-1} x(k)\, e^{-j\omega k} \right) \left( \sum_{n=0}^{N-1} x(n)\, e^{j\omega n} \right) = \frac{1}{N}\, X(e^{j\omega})\, X(e^{-j\omega}) = \frac{1}{N}\, |X(e^{j\omega})|^2 \qquad (4.16)
where X(e^{jω}) is the Fourier transform of the finite-length time series x(n). Note that R'(e^{jω}) is the PSD estimator S'(e^{jω}) known as the periodogram.
S'(e^{j\omega}) = \frac{1}{N}\, |X(e^{j\omega})|^2 \qquad (4.17)
It can be proved ([122] pp.542-3) that using either the unbiased estimator of the autocorrelation or the
biased estimator proposed by Jenkins and Watt, the periodogram for a discrete-in-time stationary process is a biased estimator of the PSD. As in the continuous-in-time case, its variance does not tend to zero
as N increases. Again, this result can be improved by smoothing techniques, one of which divides the
time series into smaller, overlapping segments to perform an average over the periodograms, but this
compromises the resolution in frequency.
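To make this concrete, a minimal Python/numpy sketch (function names are mine, not from the thesis) of the periodogram of Eq. 4.17 and of the segment-averaging idea just described might be:

    import numpy as np

    def periodogram(x):
        # S'(e^{jw}) = |X(e^{jw})|^2 / N  (Eq. 4.17), at the FFT frequencies.
        N = len(x)
        return np.abs(np.fft.rfft(x)) ** 2 / N

    def averaged_periodogram(x, seg_len, overlap=0):
        # Average periodograms over (possibly overlapping) segments to reduce
        # the variance, at the cost of frequency resolution.
        step = seg_len - overlap
        segments = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, step)]
        return np.mean([periodogram(s) for s in segments], axis=0)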
Parametric modelling methods A model is any attempt to describe the laws which yield a given phenomenon. Once a model is selected and its parameters estimated from the data, it can be used to generate
as many realisations as are needed to calculate the averages over the ensemble, or even better, it can be
used to calculate the PSD directly, without having to use the Fourier transform. The choices for a model are infinite, but using a priori knowledge about the data, the range can be reduced considerably. Assumptions like zero values outside the observation window can be avoided. However, some assumptions
always have to be made for the characterisation of the model.
Yule [174] proposed in 1927 the use of a deterministic linear filter to represent a stochastic process. The
filter is driven by a sequence of statistically independent random variables with a zero-mean, constant-variance Gaussian distribution². This purely random series is known as white Gaussian noise because
its autocorrelation function is zero for all lag except for the origin, where it equals the variance of the
Gaussian process. The corresponding power spectral density is therefore a constant for all frequencies,
like the optical spectrum of white light. The filter performs a linear transformation on this uncorrelated
sequence to generate a highly correlated series x̂ that statistically matches the data x from the process
under analysis, as is shown in Fig. 4.1. The modelling procedure consists of the calculation of the filter
parameters.
[Figure 4.1: Stochastic process model — white Gaussian noise v(n) drives a discrete-time linear filter whose output is y(n) = x̂(n).]
The input-output relation of the filter has this general form:

(present value of model output) = (present value of model input) + (linear combination of past values of model input) + (linear combination of past values of model output)
This can be written as a linear difference equation that relates the input driving sequence v(n) with the
output y(n) as:
y(n) = \sum_{m=0}^{q} b_m\, v(n-m) - \sum_{k=1}^{p} a_k\, y(n-k) \qquad (4.18)

² Being the most common distribution found in physical phenomena, and given that the output of a linear filter driven by a Gaussian random process is another Gaussian process, it is the most convenient distribution at the filter input for a vast range of applications.
Given that the proposed filter is time-invariant and linear, linear filter theory applies. Therefore, taking the z-transform³ of both sides of Eq. 4.18:
Y(z) = \sum_{m=0}^{q} b_m\, V(z)\, z^{-m} - \sum_{k=1}^{p} a_k\, Y(z)\, z^{-k} \qquad (4.19)
Rearranging Eq. 4.19 to leave only Y (z) on the left-hand side:
Y(z) = \frac{\sum_{m=0}^{q} b_m z^{-m}}{1 + \sum_{k=1}^{p} a_k z^{-k}}\; V(z) \qquad (4.20)
The z-transform of the unit-sample response of the filter h(n) can be found by making v(n) = δ(n), which
has a z-transform V (z) equal to 1:
H(z) = Y(z)\big|_{V(z)=1} = \frac{\sum_{m=0}^{q} b_m z^{-m}}{1 + \sum_{k=1}^{p} a_k z^{-k}} \qquad (4.21)
By using the substitution z = e^{jω}, the Fourier transform of the filter unit-sample response can be found from Eq. 4.21:
H(e^{j\omega}) = \frac{\sum_{m=0}^{q} b_m\, e^{-j\omega m}}{1 + \sum_{k=1}^{p} a_k\, e^{-j\omega k}} \qquad (4.22)
Let us now consider the input sequence as white Gaussian noise with variance σ_v². Its autocorrelation function is equal to σ_v² δ(n) and its Fourier transform is equal to σ_v² for all frequencies. Using the relation
between the input and output autocorrelation functions given in Eq. A.13:
r_y(m) = \sum_{i=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} h(i)\, h(k)\, \sigma_v^2\, \delta(k - i + m) \qquad (4.23)
and taking the Fourier transform of both sides:
S_y(e^{j\omega}) = \hat{S}_x(e^{j\omega}) = \sigma_v^2\, |H(e^{j\omega})|^2 \qquad (4.24)
it can be seen that the PSD of the output of the filter can be obtained from the filter parameters {b_i, a_i} and the input noise variance.
³ Defined as Z[g(n)] = G(z) = \sum_{n=-\infty}^{\infty} g(n)\, z^{-n}. See [122] for properties.
Whether the linear combination of the past output values or the linear combination of the past input
values or both are used in the input-output relation defines the following types of filter:
1. Autoregressive model (AR): No linear combination of past values of the inputs is used.
2. Moving average (MA) model: No linear combination of past values of the outputs is used.
3. Autoregressive-Moving average (ARMA) model: Includes all the terms shown in Eq. 4.18.
The use of one or other kind of model depends on the nature of the process. A description of the models
follows in the next section.
4.2 Autoregressive Models
Let the time series y(n), y(n−1), . . . , y(n−p) represent a realization of an autoregressive process of order
p. Then it satisfies the following difference equation:
y(n) + a_1 y(n-1) + a_2 y(n-2) + \ldots + a_p y(n-p) = v(n) \qquad (4.25)
where the constants a_1, a_2, \ldots, a_p are the parameters of the model (AR coefficients), and [v(n)] is a white Gaussian noise process.
The term “autoregressive” comes from the similarity between the AR model equation and the regression
model equation.
y = \sum_{k=1}^{p} w_k u_k + v \qquad (4.26)
The regression equation relates a dependent variable y to a set of independent variables u1 , u2 , . . . , up
plus an error term v. It is said that y is regressed on u1 , u2 , . . . , up . In a similar way the actual sample
of the AR process, y(n) is regressed on previous values of itself (auto) as is shown if we rewrite the AR
equation as:
y(n) = \sum_{k=1}^{p} w_k\, y(n-k) + v(n) \qquad (4.27)
where wk = −ak .
Transforming Eq. 4.25 to the z domain we get:
Y(z)\,[1 + a_1 z^{-1} + a_2 z^{-2} + \ldots + a_p z^{-p}] = V(z) \qquad (4.28)
Therefore the transfer function of an AR filter is:
H(z) = \frac{Y(z)}{V(z)} = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2} + \ldots + a_p z^{-p}} \qquad (4.29)
The use of previous samples of the output of the filter is depicted with feedback paths as shown in
Fig. 4.2. The AR filter is an Infinite Impulse Response (IIR) filter, or all-pole filter of order p. It can be
stable or unstable, depending on the location of its poles. If one or more poles lie outside the unit circle
the filter will be unstable. The p poles may be calculated from the characteristic equation of the filter:
1 + a_1 z^{-1} + a_2 z^{-2} + \ldots + a_p z^{-p} = 0 \qquad (4.30)

[Figure 4.2: Autoregressive filter — the delayed outputs y(n−1), …, y(n−p), weighted by the coefficients a_1, …, a_p, are fed back and summed with the white-noise input v(n) to produce the AR process y(n).]
Moving Average Models
Moving average filters are described by:
y(n) = v(n) + b_1 v(n-1) + b_2 v(n-2) + \ldots + b_q v(n-q) \qquad (4.31)
where the constants b1 , b2 , . . . , bq are the MA parameters of the model and [v(n)] is a white Gaussian
noise process.
This type of filter is an all-zero filter, inherently stable and with finite impulse response (FIR). For this
kind of discrete filter the order of the filter equals q, as it is the minimum number of delay units used to
implement it (see Fig. 4.3). The term “moving average” refers to the weighted average of the input time
series v(n).
[Figure 4.3: Moving Average filter — the white-noise input v(n) passes through a chain of delay units; the delayed samples, weighted by b_1, …, b_q, are summed to produce the MA process y(n).]
Moving Average Autoregressive Models
This combines the features of the AR and MA filters. The difference equation which describes them is:
y(n) + a_1 y(n-1) + a_2 y(n-2) + \ldots + a_p y(n-p) = v(n) + b_1 v(n-1) + b_2 v(n-2) + \ldots + b_q v(n-q) \qquad (4.32)
where the constants a1 , a2 , . . . , ap , b1 , b2 , . . . , bq are the ARMA parameters of the model and [v(n)] is a
white Gaussian noise process. For this kind of IIR filter with direct transmission from the input the order
is said to be the pair (p, q). AR and MA models are special cases of an ARMA model.
[Figure 4.4: Moving Average Autoregressive filter (b₀ = 1, q = p − 1) — combines the feed-forward b-coefficient taps of the MA filter with the feedback a-coefficient taps of the AR filter.]
Wold decomposition
Wold’s decomposition theorem states that any stationary discrete-time stochastic process [u(n)] may be
decomposed into the combination of a general linear process and a predictable process. These two
processes are uncorrelated. According to this, the process [u(n)] may be expressed as:
u(n) = y(n) + s(n) \qquad (4.33)
The term s(n) is the predictable process, i.e. the sample s(n) can be predicted from its own past values
with zero predictive variance. The term y(n) is the general linear process which may be represented by
the MA model:
y(n) = v(n) + \sum_{k=1}^{\infty} b_k\, v(n-k) \qquad (4.34)
where \sum_{k=1}^{\infty} |b_k|^2 < \infty.
The white noise term v(n) which drives the general linear process y(n) is uncorrelated with the predictable process s(n), i.e. E[v(n)s(k)] = 0 for all pairs (n, k). The general linear process may be an AR process as well; all we have to do is to ensure that the impulse response of the AR filter equals the
impulse response of the MA filter. That is:
h(n) = \sum_{k=0}^{\infty} b_k\, \delta(n-k) \qquad (4.35)
where b0 = 1.
AR models have gained more popularity than the MA and the ARMA models. The reason lies in the
computation of the filter parameters, which leads to a system of equations that is linear for AR filters and
nonlinear for MA and ARMA filters [81][105].
4.3 AR parameter estimation
4.3.1 Asymptotic stationarity of an AR process
The classical solution to the AR difference equation (Eq. 4.25) separates the homogeneous solution from
the particular solution. The particular solution is the AR model difference equation shown in Eq. 4.27.
But the homogeneous solution yh (n) is of the form:
y_h(n) = B_1 z_1^n + B_2 z_2^n + \ldots + B_p z_p^n \qquad (4.36)
where z1 , z2 , . . . , zp are roots of the characteristic equation (Eq. 4.30) of the filter. The constants B1 ,
B2 , . . . , Bp may be determined by the set of p initial conditions y(0), y(−1), . . . , y(−p + 1). For arbitrary
values of the constants Bk , it is clear from equation 4.36 that the homogeneous solution will decay to
zero as n approaches infinity if and only if:
|z_k| < 1, \quad \text{for all } k \qquad (4.37)
In other words, this means that all the poles of the AR filter lie inside the unit circle in the z-plane. A
system which is able to “forget” its initial values in this way is said to be asymptotic stationary.
The autocorrelation function of such a system satisfies the homogeneous difference equation of the model.
This may be found if we rewrite equation 4.27:
\sum_{k=0}^{p} a_k\, y(n-k) = v(n) \qquad (4.38)
where a0 = 1. Multiplying both sides by y(n − m) and taking the expectation we get:
E\left[\sum_{k=0}^{p} a_k\, y(n-k)\, y(n-m)\right] = E[v(n)\, y(n-m)] \qquad (4.39)
This may be simplified if we note that the expectation E[y(n − k)y(n − m)] equals the autocorrelation
function for a lag of (m − k), and the expectation of v(n)y(n − m) is zero for m > 0, since the sample
y(n − m) is only related to input samples up to time (n − m).
\sum_{k=0}^{p} a_k\, r(m-k) = 0, \quad \text{for } m > 0 \qquad (4.40)
Expanding the last equation gives the desired result:
r(m) = w_1 r(m-1) + w_2 r(m-2) + \ldots + w_p r(m-p), \quad \text{for } m > 0 \qquad (4.41)
where wk = −ak . We may express the general solution of this equation as:
r(m) = \sum_{k=1}^{p} C_k z_k^m \qquad (4.42)
where Ck are constants and zk are the roots of the characteristic equation (Eq. 4.30). As a result of this
we can say that the autocorrelation function of an asymptotically stationary AR process approaches zero as
the lag tends to infinity. This autocorrelation function will be damped exponentially if the dominant root
is real, alternating in sign if that root is negative, or will be a damped sine wave if the dominant roots are a complex conjugate pair.
4.3.2 Yule-Walker equations
Writing equation 4.41 for m = 1, 2, . . . , p yields a set of p simultaneous equations for the unknowns
a_1, a_2, \ldots, a_p, assuming that the autocorrelation function r(m) is known at least for the lags from 0 to p:
\begin{bmatrix} r(0) & r(1) & \cdots & r(p-1) \\ r(1) & r(0) & \cdots & r(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ r(p-1) & r(p-2) & \cdots & r(0) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_p \end{bmatrix} = \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{bmatrix} \qquad (4.43)
where wk = −ak . This set of equations is known as the Yule-Walker equations. In matrix form:
R\, w = r \qquad (4.44)
where R is the p × p autocorrelation matrix, w = [w1 , w2 , . . . , wp ]T and r = [r(1), r(2), . . . , r(p)]T . Its
solution is:
w = R^{-1} r \qquad (4.45)
It can be seen from Eq. 4.45 that the set of AR coefficients may be uniquely determined from the first p + 1
samples of the autocorrelation function of the process x(n) being modelled. If we evaluate Eq. 4.39 for
m = 0 and y(n) equal to the data time series x(n), we get:
E\left[\sum_{k=0}^{p} a_k\, x(n-k)\, x(n)\right] = E[v(n)\, x(n)] \qquad (4.46)
The right-hand side of Eq. 4.46 is:
E[v(n)\, x(n)] = E\left[v(n) \left( \sum_{m=1}^{p} w_m\, x(n-m) + v(n) \right)\right] = \sum_{m=1}^{p} w_m\, E[v(n)\, x(n-m)] + E[v(n)\, v(n)] = E[v(n)\, v(n)] \qquad (4.47)
The right-hand side of the equation is the variance of the input noise, σ_v². This variance may be determined
from the set of AR coefficients and the first p + 1 samples of the autocorrelation function.
\sigma_v^2 = \sum_{k=0}^{p} a_k\, r(k) \qquad (4.48)
Eq. 4.44 can be solved by Gaussian elimination. However, the Toeplitz structure of the matrix R can be exploited to find the parameters a_k more efficiently. In section 4.6.1 a recursive algorithm to solve the Yule-Walker equations will be presented.
4.3.3 Using an AR model
An AR model may be used for synthesis or for analysis. In synthesis, a stationary stochastic process y(n), characterised by its variance σ_y² and the parameters of its AR model (i.e. the AR filter coefficients), is given, and we want to generate a time series of the process. In analysis, we want to model a stochastic process
and we want to generate a time series of the process. In analysis, we want to model a stochastic process
given a time series x(n), by estimating the set of AR parameters for a model order p and the input noise
variance, assuming that p is the optimum model order⁴. Next, we will present a second-order example
of a synthesis problem and an analysis problem.
Second order AR process synthesis
Assume that we want to synthesise a real-valued, second-order stationary AR process y(n) with unit
variance. The difference equation of the AR model is:
y(n) + a_1 y(n-1) + a_2 y(n-2) = v(n) \qquad (4.49)
As a condition for asymptotic stationarity, we need to ensure that the roots of the characteristic equation
of the model lie inside the unit circle in the z-plane:
1 + a_1 z^{-1} + a_2 z^{-2} = 0 \qquad (4.50)

\Rightarrow\; z_{1,2} = \frac{-a_1 \pm \sqrt{a_1^2 - 4a_2}}{2} \qquad (4.51)
where z1 and z2 are the roots of equation 4.50. To satisfy the asymptotic stationarity condition:
|z_1| < 1 \quad \text{and} \quad |z_2| < 1 \qquad (4.52)
requires the following restrictions for the AR parameters:
-1 \le a_2 + a_1, \qquad -1 \le a_2 - a_1, \qquad -1 \le a_2 \le 1 \qquad (4.53)
which is satisfied by a triangular region in the (a_1, a_2) plane, with corners at (−2, 1), (0, −1) and (2, 1). Let
us choose arbitrarily the following values for a1 and a2 from this region:
a_1 = -0.1, \qquad a_2 = -0.8 \qquad (4.54)

We get roots at:

z_1 = 0.9458, \qquad z_2 = -0.8458 \qquad (4.55)
⁴ We will not cover in this section the problem of finding the optimum model order.
where the positive root z1 dominates the autocorrelation function. In order to calculate the input noise
variance we need to find the first 3 samples of the autocorrelation function r(m):
\sigma_v^2 = r(0) + a_1 r(1) + a_2 r(2) \qquad (4.56)
From the Yule-Walker equation:
\begin{bmatrix} r(0) & r(1) \\ r(1) & r(0) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} r(1) \\ r(2) \end{bmatrix} \qquad (4.57)
where w_1 = -a_1 and w_2 = -a_2. We know that r(0) = σ_y² = 1, and hence we can find the other 2 samples
of r(m), substituting in 4.57:
\begin{bmatrix} 1 & r(1) \\ r(1) & 1 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.8 \end{bmatrix} = \begin{bmatrix} r(1) \\ r(2) \end{bmatrix} \qquad (4.58)

\Rightarrow\; \begin{bmatrix} 0.2 & 0.0 \\ -0.1 & 1.0 \end{bmatrix} \begin{bmatrix} r(1) \\ r(2) \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.8 \end{bmatrix} \qquad (4.59)

\Rightarrow\; r(1) = 0.5, \qquad r(2) = 0.85, \qquad \sigma_v^2 = 0.27 \qquad (4.60)
To generate the time series, we substitute in Eq. 4.49 the values of the AR parameters and run the
difference equation with v(n) from N (0, 0.27), and zero initial values for y(n):
y(n) = 0.1\, y(n-1) + 0.8\, y(n-2) + v(n) \qquad (4.61)
A time series generated in this way is shown in Fig. 4.5. The autocorrelation function plotted in Fig. 4.6
has been calculated applying Eq. 4.41 with the initial set of values r(0), r(1) and r(2) found above:
r(m) = 0.1\, r(m-1) + 0.8\, r(m-2), \quad \text{for } m > 2 \qquad (4.62)
[Figure 4.5: Time series of the synthesised AR process.]

[Figure 4.6: Autocorrelation function of the synthesised AR process.]

[Figure 4.7: Second order AR process generator.]

Second order AR process analysis

Assume that we have got 128 samples of a time series from a stationary stochastic process [y(n)]. We will see if the given process can be modelled as an AR process of model order 2. The Yule-Walker equations may be used to estimate the AR parameters:
w = R^{-1} r \qquad (4.63)
but first we need to estimate the first 3 samples of the autocorrelation function from the available data.
Using the sample autocorrelation estimator (Eq. 4.14) for N = 128:
\hat{r}'(m) = \frac{1}{128} \sum_{n=0}^{127-|m|} y(n)\, y(n+|m|) \qquad (4.64)
we may estimate the first 3 samples of r(m), and express the matrix R:
\hat{R} = \begin{bmatrix} \hat{r}'(0) & \hat{r}'(1) \\ \hat{r}'(1) & \hat{r}'(0) \end{bmatrix} \qquad (4.65)
If the matrix R̂ is nonsingular, we may find a_1 and a_2 from the Yule-Walker matrix equation. The input noise variance may be estimated from the r̂'(m) sequence by Eq. 4.48. To test the model we may use the inverse filter to see if it is capable of “whitening” the given time series. If the model fits the data well, the output of this “whitening” filter will be white Gaussian noise with zero mean and variance σ_v². The direct filter will have the transfer function H(z) given by:
H(z) = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2}} \qquad (4.66)
then the “whitening” filter transfer function is:
H_W(z) = H^{-1}(z) = 1 + a_1 z^{-1} + a_2 z^{-2} \qquad (4.67)
The “whitening” filter is also called the AR process analyser and its impulse response has finite duration
(FIR). If the process is not truly autoregressive, or if the model order is not p, or if the error in the
estimation of the autocorrelation is high, then the output of the inverse filter will be coloured noise.
As an example, we may use the AR process generator found in the last section to generate a 128-sample
time series to feed into the AR analyser. For a time series generated in this way we estimated the first 3
values of r(k), obtaining 0.7045, 0.1963 and 0.5612. Therefore the matrix R̂ is:
\hat{R} = \begin{bmatrix} 0.7045 & 0.1963 \\ 0.1963 & 0.7045 \end{bmatrix}
[Figure 4.8: Second order AR process analyser.]
and the vector r̂' = [0.1963, 0.5612]^T. Applying Eq. 4.63 we get the estimate ŵ = [0.0614, 0.7795]^T. A better approximation to the true value w = [0.1, 0.8]^T can be obtained by increasing the number of samples, or by running the generator several times, collecting several time series of the same process (i.e.
an ensemble), analysing and averaging the results. Table 4.1 and Fig. 4.9 show the mean and variance of
the AR coefficients estimated using the procedure described in this section for an ensemble of 500 time
series and a number of samples per time series from 16 to 1024.
   N  | â₁ mean | â₁ variance | â₂ mean | â₂ variance
   16 | -0.1622 | 0.0700      | -0.5065 | 0.0408
   32 | -0.1311 | 0.0258      | -0.6448 | 0.0180
   64 | -0.1126 | 0.0094      | -0.7181 | 0.0084
  128 | -0.1045 | 0.0038      | -0.7588 | 0.0034
  256 | -0.1028 | 0.0018      | -0.7777 | 0.0016
  384 | -0.1039 | 0.0012      | -0.7827 | 0.0011
  512 | -0.1022 | 0.0009      | -0.7881 | 0.0008
 1024 | -0.1002 | 0.0004      | -0.7949 | 0.0004

Table 4.1: Mean and variance of the AR coefficient estimates
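A sketch of the ensemble experiment behind Table 4.1, reusing the yule_walker routine sketched in section 4.3.2 (all helper names are mine, not from the thesis):

    import numpy as np

    def autocorr_biased(x, p):
        # Biased sample autocorrelation r'(0..p) of Eq. 4.14.
        N = len(x)
        return np.array([x[:N - m] @ x[m:] for m in range(p + 1)]) / N

    def ar2_series(N, rng):
        # N samples of the AR(2) process of Eq. 4.61, zero initial values.
        v = rng.normal(0.0, np.sqrt(0.27), N)
        y = np.zeros(N)
        for n in range(N):
            y[n] = v[n] + (0.1 * y[n - 1] if n >= 1 else 0.0) \
                        + (0.8 * y[n - 2] if n >= 2 else 0.0)
        return y

    rng = np.random.default_rng(1)
    for N in (16, 32, 64, 128, 256, 384, 512, 1024):
        estimates = []
        for _ in range(500):                      # ensemble of 500 time series
            r = autocorr_biased(ar2_series(N, rng), 2)
            estimates.append(yule_walker(r, 2)[0])
        estimates = np.asarray(estimates)
        print(N, estimates.mean(axis=0), estimates.var(axis=0))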
4.4 Linear Prediction
4.4.1 Wiener Filters
A typical statistical linear filtering problem consists of an input time series x(n), a linear filter device
characterised by its impulse response b0 , b1 , b2 , . . . , and the output sequence y(n). This output is an estimate
of a desired response d(n) (Fig. 4.10).
Defining the estimation error as
e(n) = d(n) - y(n) \qquad (4.68)
[Figure 4.9: Mean and variance of the AR coefficient estimates as a function of the number of samples N in the time series (ensemble means and variances of â₁ and â₂).]

[Figure 4.10: Filter problem — the input x(n) passes through a linear, discrete-time filter b₀, b₁, b₂, …; the output y(n) is subtracted from the desired response d(n) to give the estimation error e(n).]
the filter can be optimised by minimising the cost function J:
J = E[e(n)\, e(n)] = E[|e(n)|^2] \qquad (4.69)
by setting its gradient in the space constituted by the filter coefficients equal to zero:
\nabla J = 0 \qquad (4.70)
Solving Eq. 4.70 yields the following result:
E[x(n-k)\, e_o(n)] = 0, \quad \text{for } k = 0, 1, 2, \ldots \qquad (4.71)
where e_o(n) denotes the estimation error of the filter operating in its optimum condition. Substituting Eq. 4.68 into Eq. 4.71 gives the following set of equations, known as the Wiener-Hopf
equations:
R\, b_o = c \qquad (4.72)
where the p × p correlation matrix R has been defined in Eq. A.15 and
b_o = [b_{o0}\; b_{o1}\; \ldots\; b_{o,p-1}]^T, \qquad c = [c(0)\; c(-1)\; \ldots\; c(1-p)]^T \qquad (4.73)
where c(−k) = E[x(n − k)d(n)].
4.4.2 Linear Prediction
One of the most common uses of Wiener filters is to predict a future sample of a stationary stochastic process, given a set of past samples of the process. The Wiener-Hopf equations may be used to optimise the
predictor in the mean-square sense. Assume that a time series of the process x(n−1), x(n−2), . . . , x(n−p)
is available. The estimation of the sample at time n, x̂(n) is a linear function of the previous samples:
\hat{x}(n) = \sum_{k=1}^{p} b_k\, x(n-k) \qquad (4.74)
The desired response is the true value of the sample x(n):
d(n) = x(n) \qquad (4.75)
Then, the prediction error for this filter e(n) is:
e(n) = x(n) - \hat{x}(n) \qquad (4.76)
The vector b_o in the Wiener-Hopf equations becomes b_o = [b_{o1}, b_{o2}, \ldots, b_{op}]^T. Note the difference in one of the indices of the coefficients b_{ok} with respect to Eq. 4.73, because the input sequence starts at sample
n − 1 instead of n. The input sequence provides the data for the estimation of the first p + 1 samples
of the autocorrelation function r(m), which may be used to find the p × p correlation matrix R and the
vector c. The latter is possible because the desired response is a sample of the input time series:
c = \begin{bmatrix} E[x(n-1)\, x(n)] \\ E[x(n-2)\, x(n)] \\ \vdots \\ E[x(n-p)\, x(n)] \end{bmatrix} = \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(p) \end{bmatrix} \qquad (4.77)
If the matrix R is nonsingular, the solution of the Wiener-Hopf set of equations gives the optimum linear
predictor, characterised by the set of parameters b_{oi} for i = 1, \ldots, p. Fig. 4.11 shows a linear predictor of
order p.
[Figure 4.11: Prediction filter of order p — the past samples x(n−1), …, x(n−p), obtained through a chain of delay units, are weighted by b₁, …, b_p and summed to form the prediction x̂(n).]
Note that the number of delay units is (p − 1) while the number of filter parameters remains at p. The
apparent incongruity between the number of delay units, the number of parameters and the model order
disappears when the linear predictor is related to the Wiener filter, as is shown in Fig. 4.12.
[Figure 4.12: Prediction-error filter of order p — the linear predictor is embedded in the Wiener filter structure, and its output x̂(n) is subtracted from x(n) to give the prediction error e(n).]
Relationship between AR models and Linear Prediction
Moreover, the filter of Fig. 4.12 may be rearranged to have the same structure as the AR analyser filter of
Fig. 4.8. The resulting filter is shown in Fig. 4.13.
[Figure 4.13: Prediction-error filter of order p rearranged to look like an AR analyser.]
The filters in Fig. 4.13 and Fig. 4.8 show the equivalence of the linear prediction-error filter and the AR
analyser. Both filters are fed with a time series from a stochastic process, and are expected to have an
uncorrelated random sequence at the output, e(n) or v(n), respectively. This random output has been
minimised for the linear predictor in the mean-square prediction-error sense, solving the Wiener-Hopf
equations:
R\, b_o = c \qquad (4.78)
The set of coefficients b_i of the linear predictor is related to the parameters a_i of the AR model through:
a_i = -b_i, \quad \text{for } i = 1, 2, \ldots, p \qquad (4.79)
The set of AR parameters may be calculated using the Yule-Walker equations:

R\, w = r \qquad (4.80)
with w = [−a1 , −a2 , . . . , −ap ]T and r = [r(1), r(2), . . . , r(p)]T . Therefore, the set of AR parameters
found by solving the Yule-Walker equations is optimum in the mean-square prediction-error sense.
4.5 Maximum entropy method (MEM) for power spectrum density estimation
The Yule-Walker equations for AR modelling (or linear prediction) can be used to find the parameters
of the filter that models the stochastic process x(n), and estimate its PSD by using Eq. 4.24. But the
goodness of the estimator Ŝ still depends on the statistical characteristics of the estimator for the autocorrelation function r(m). The periodogram in Eqs. 4.11 and 4.17 assumes that the unknown values of
the autocorrelation (for lags greater in modulus than the data length) are zero. This leads to smearing in
the PSD estimate. Burg [25] applied the principle of maximum entropy to the estimation of the unknown
autocorrelation lags of a Gaussian stochastic process. In this sense, the maximum entropy autocorrelation
estimate will be the one with the most random autocorrelation series, i.e. the maximum entropy estimator will not add any information to the estimate. The solution for a set of 2p + 1 known autocorrelation
lags is:
\hat{r}_{MEM}(m) = \begin{cases} r(m), & \text{for } |m| \le p \\ \sum_{k=1}^{p} b_{p,k}\, \hat{r}_{MEM}(m-k), & \text{for } |m| > p \end{cases} \qquad (4.81)
where the coefficients b_{p,k} are none other than the parameters of the p-order linear predictor, and therefore equal to minus the a_{p,k} parameters of a p-order AR filter for the known autocorrelation lags. The
MEM PSD estimate, obtained by the Fourier transform of r̂_{MEM}, yields:

\hat{S}_{MEM}(\omega) = \frac{P_{e_p}}{\left| 1 - \sum_{k=1}^{p} b_{p,k}\, e^{-j\omega k} \right|^2} \qquad (4.82)
where P_{e_p} denotes the average prediction error power E[|e_p(n)|^2] for the p-order linear predictor, which is equivalent to the input noise variance σ_{v,p}² of the p-order AR model. In terms of the AR model parameters, Eq. 4.82 is:
\hat{S}_{MEM}(\omega) = \frac{\sigma_{v,p}^2}{\left| 1 + \sum_{k=1}^{p} a_{p,k}\, e^{-j\omega k} \right|^2} \qquad (4.83)
4.6 Algorithms for AR modelling
4.6.1 Levinson-Durbin recursion to solve the Yule-Walker equation
The Levinson-Durbin algorithm [95][49] uses the symmetric and Toeplitz properties of the autocorrelation matrix R to provide an efficient solution of Eq. 4.44, requiring only p² operations for a model order p, instead of the p³ computations required for Gaussian elimination. Also, the algorithm reveals the fundamental properties of AR processes. It recursively computes the filter parameters and input variance {a_{m,k}, σ_m²} for model orders m = 1, 2, \ldots, p.
The algorithm proceeds as follows:
1. Initialisation:

a_{1,1} = -r(1)/r(0) \qquad (4.84)

\sigma_1^2 = (1 - |a_{1,1}|^2)\, r(0) \qquad (4.85)

2. Recursion for m = 2, 3, \ldots, p:

a_{m,m} = \frac{-\left( r(m) + \sum_{k=1}^{m-1} a_{m-1,k}\, r(m-k) \right)}{\sigma_{m-1}^2} \qquad (4.86)

a_{m,k} = a_{m-1,k} + a_{m,m}\, a_{m-1,m-k}, \quad \text{for } k = 1, 2, \ldots, m-1 \qquad (4.87)

\sigma_m^2 = (1 - |a_{m,m}|^2)\, \sigma_{m-1}^2 \qquad (4.88)
The solution {a_{p,k}, σ_p²} is the same as would be obtained using Eq. 4.44. The solution sets for lower model orders provide useful information. If the values r(m) used in the recursion represent a valid autocorrelation sequence, then it can be shown [10] that the last parameter for each model order satisfies⁵:
|a_{m,m}| \le 1 \qquad (4.89)
consequently, the input variance follows this property:
\sigma_m^2 \le \sigma_{m-1}^2 \qquad (4.90)
Using the analogy with linear predictors, Eq. 4.90 means that the prediction error decreases, or at least remains steady, as the model order increases. This represents an advantage if the model order is not known a priori. If the stochastic process x(n) is actually an AR process of order p with known autocorrelation function, then the Levinson-Durbin recursion will reproduce the set {a_{p,k}, σ_p²} for model orders greater than p. Under real conditions, where either the autocorrelation is unknown or the process is not truly AR, the input variance as a function of the model order will decrease monotonically. However, it will show a “knee” or turning point, beyond which further increments in the model order do not significantly improve the prediction error.
Lattice form of a linear predictor
The parameters a_{m,m} play an important role in the theory of linear prediction. To see how they are related to the linear predictor of order p, let us define two types of prediction errors:

Forward-prediction error: The prediction error shown in Eq. 4.76 for a p-order linear predictor, denoted by e_p(n), is:
e_p(n) = x(n) - \sum_{k=1}^{p} b_{p,k}\, x(n-k) = x(n) + \sum_{k=1}^{p} a_{p,k}\, x(n-k) \qquad (4.91)
⁵ In fact, the condition |a_{m,m}| ≤ 1 is necessary and sufficient for the values of r(m) to represent a valid autocorrelation function.
where b_{p,k} are the parameters of the p-order linear predictor, and a_{p,k} the p-order AR model parameters.
We will continue using the second form to keep consistency with the notation used in the Levinson-Durbin
algorithm.
Backward-prediction error: If the data time series were reversed and fed to the p-order linear predictor for the original sequence x(n), then the filter would sequentially predict the “past” samples of the original time series. Thus, the backward prediction error for the sample x(n − p), denoted by b_p(n)⁶, would be:
b_p(n) = x(n-p) + \sum_{k=1}^{p} a_{p,k}\, x(n-p+k) = x(n-p) + a_{p,1}\, x(n-p+1) + a_{p,2}\, x(n-p+2) + \ldots + a_{p,p}\, x(n) \qquad (4.92)
Using Eq. 4.87, a relationship between the forward and backward prediction errors can be found:
e_p(n) = x(n) + \sum_{k=1}^{p-1} a_{p,k}\, x(n-k) + a_{p,p}\, x(n-p)
       = x(n) + \sum_{k=1}^{p-1} \left( a_{p-1,k} + a_{p,p}\, a_{p-1,p-k} \right) x(n-k) + a_{p,p}\, x(n-p)
       = x(n) + \sum_{k=1}^{p-1} a_{p-1,k}\, x(n-k) + a_{p,p} \left( \sum_{k=1}^{p-1} a_{p-1,p-k}\, x(n-k) + x(n-p) \right) \qquad (4.93)
Noting that the terms within the brackets in Eq. 4.93 are related to the (p−1)-order linear predictor by:
e_{p-1}(n) = x(n) + \sum_{k=1}^{p-1} a_{p-1,k}\, x(n-k)

b_{p-1}(n-1) = x(n-1-(p-1)) + \sum_{k=1}^{p-1} a_{p-1,k}\, x(n-1-(p-1)+k)
             = x(n-p) + a_{p-1,1}\, x(n-p+1) + a_{p-1,2}\, x(n-p+2) + \ldots + a_{p-1,p-1}\, x(n-1)
             = x(n-p) + \sum_{k=1}^{p-1} a_{p-1,p-k}\, x(n-k) \qquad (4.94)
thus Eq. 4.93 can be written as:
e_p(n) = e_{p-1}(n) + a_{p,p}\, b_{p-1}(n-1) \qquad (4.95)

Similarly, it can be shown that:

b_p(n) = b_{p-1}(n-1) + a_{p,p}\, e_{p-1}(n) \qquad (4.96)
⁶ Note that from this point the symbol b will denote an error and not a filter coefficient.
Therefore Eqs. 4.95 and 4.96 relate the forward and backward prediction error of a given model order p
to the error for a linear predictor of model order p − 1. They can be used recursively to derive the lattice
form of a p-order linear predictor. Renaming the parameters a_{m,m} as:

a_{m,m} = \kappa_m \qquad (4.97)
and calculating the initial values for the forward and backward prediction errors:
e_0(n) = b_0(n) = x(n) \qquad (4.98)

we get:

e_1(n) = e_0(n) + \kappa_1\, b_0(n-1), \qquad b_1(n) = b_0(n-1) + \kappa_1\, e_0(n) \qquad (4.99)
Fig. 4.14 condenses the relationships shown by Eqs. 4.98 and 4.99. This structure resembles the basic
pattern of a lattice.
[Figure 4.14: Lattice filter of first order.]
To continue the lattice, Eqs. 4.95 and 4.96 can be generalised by substituting m for p:
e_m(n) = e_{m-1}(n) + \kappa_m\, b_{m-1}(n-1), \qquad b_m(n) = b_{m-1}(n-1) + \kappa_m\, e_{m-1}(n) \qquad (4.100)
and evaluating them for m = 2, 3, . . . , p. The complete filter will adopt the structure shown in Fig. 4.15.
Note that the transfer function of the lattice linear predictor filter is:
H_{LP}(z) = 1 + \sum_{k=1}^{p} a_{p,k}\, z^{-k} \qquad (4.101)
[Figure 4.15: Lattice filter of order p.]
which is the inverse of the transfer function of the corresponding AR filter. By analogy with transmission
line theory, the parameters κ_m in the lattice filter are called reflection coefficients⁷, while the parameters a_{p,k} could be referred to as the feedback coefficients, a term that is self-explanatory by looking at the AR
filter structure in Fig. 4.2.
The lattice structure has some advantages over the transversal filter structure shown in Fig. 4.13, or
the feedback form shown in Fig. 4.2. Not only does it generate both forward and backward prediction
sequences, but it also has modularity. The first “step”, with coefficient κ1 , in the lattice depicted (see
Fig. 4.14) represents the first order linear predictor. Adding a second step κ2 increments the order by
one, yielding a second order linear predictor without having to modify the first step, and so on to reach the
desired model order. Also, each “step” or module is “decoupled” from the others as it can be shown that
forward and backward prediction errors are orthogonal, i.e. uncorrelated with each other for stationary
input data.
Furthermore, the condition |κ_m| ≤ 1 for m = 1, 2, \ldots, p is necessary and sufficient to guarantee that all the poles of the AR filter lie within or on the unit circle, which is a condition for stability. If any of the reflection coefficients equals ±1, then the Levinson-Durbin recursion will terminate with σ_i² = 0,
where κi is the first reflection coefficient with unit modulus. The process in this case is purely harmonic,
consisting only of sinusoids.
It is important to note that the set of reflection coefficients κ_1, κ_2, \ldots, κ_p represents the p-order linear predictor just as the set of feedback coefficients a_{p,1}, a_{p,2}, \ldots, a_{p,p} does. Using the Levinson-Durbin recursion,
⁷ The parameters κ_m are also known as PARCOR coefficients, for partial correlation, in the statistics literature.
it is possible to calculate the a_{p,i}'s from the κ_i's, as is shown in Table 4.2 for the first three sets of feedback
coefficients.
Model order m | a_{m,1}                | a_{m,2}                  | a_{m,3}
      1       | κ₁                     |                          |
      2       | κ₁ + κ₁κ₂              | κ₂                       |
      3       | κ₁ + κ₁κ₂ + κ₂κ₃       | κ₂ + κ₁κ₃ + κ₁κ₂κ₃       | κ₃

Table 4.2: Feedback coefficients in terms of the reflection coefficients
The inverse Levinson-Durbin recursion in Eq. 4.102 provides the means to calculate the κ_i's as a function of the a_{p,i}'s. Again, the results for the first three model orders are shown in Table 4.3.

a_{m-1,i} = \frac{a_{m,i} - a_{m,m}\, a_{m,m-i}}{1 - |a_{m,m}|^2} \qquad (4.102)

Model order m | κ₁                                                                   | κ₂                                        | κ₃
      1       | a_{1,1}                                                              |                                           |
      2       | a_{2,1}/(1 + a_{2,2})                                                | a_{2,2}                                   |
      3       | (a_{3,1} − a_{3,2}a_{3,3})/(1 + a_{3,2} − a_{3,1}a_{3,3} − a_{3,3}²) | (a_{3,2} − a_{3,1}a_{3,3})/(1 − a_{3,3}²) | a_{3,3}

Table 4.3: Reflection coefficients in terms of the feedback coefficients
It is apparent from Tables 4.2 and 4.3 that the reflection coefficients are less correlated with each other
than the feedback coefficients. This could give another reason to choose the set of reflection coefficients
to represent an AR process over the set of feedback coefficients.
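A Python sketch of the inverse recursion of Eq. 4.102 (function name mine), stepping the model order down from p to 1 and collecting κ_m = a_{m,m} at each step:

    import numpy as np

    def reflection_from_feedback(a):
        # Inverse Levinson-Durbin (Eq. 4.102): reflection coefficients
        # kappa_1..kappa_p from the order-p feedback coefficients a_{p,1..p}.
        # Assumes |kappa_m| < 1 at every step.
        a = np.asarray(a, dtype=float)
        p = len(a)
        kappa = np.zeros(p)
        for m in range(p, 0, -1):
            k = kappa[m - 1] = a[m - 1]            # kappa_m = a_{m,m}
            if m > 1:
                a = (a[:m - 1] - k * a[m - 2::-1]) / (1.0 - k ** 2)
        return kappa

For example, feeding in a = [κ₁ + κ₁κ₂, κ₂] returns [κ₁, κ₂], in agreement with Table 4.2.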
4.6.2 Other algorithms for AR parameter estimation
For short data sets, the estimation of the first p lags of the autocorrelation sequence r(m) limits the
accuracy of the AR parameter estimation when using the Levinson-Durbin recursion. Other approaches
use standard statistical estimation directly from the data. The commonly used method of Maximum
Likelihood Estimation (MLE) is too difficult to apply [21], as it leads to a set of nonlinear equations [108].
Approximations to the exact MLE have been sought [26][82][172]. McWhorter and Scharf summarise the
work done in approximate MLE. Unfortunately, hardly any improvement is achieved by using approximate
MLE methods despite the high computational cost. Returning to the least squares (LS) approach used in
section 4.4.1, we will present three more methods for AR parameter estimation directly from the data.
LS of the forward prediction error
Given a data time series x(n) of length N , the forward prediction error ep (n) shown in Eq. 4.91 can be
written as:
e_p(n) = \sum_{k=0}^{p} a_{p,k}\, x(n-k), \quad \text{where } a_{p,0} = 1 \qquad (4.103)
Computing e_p(n) for all the data available, i.e. for n = 0, 1, \ldots, N+p-1, and assuming that the values
of x(n) for n < 0 and for n ≥ N are zero, we get:
e_p(0) = x(0)
e_p(1) = x(1) + a_{p,1}\, x(0)
e_p(2) = x(2) + a_{p,1}\, x(1) + a_{p,2}\, x(0)
\vdots
e_p(p) = x(p) + a_{p,1}\, x(p-1) + a_{p,2}\, x(p-2) + \ldots + a_{p,p}\, x(0)
\vdots
e_p(N-1) = x(N-1) + a_{p,1}\, x(N-2) + a_{p,2}\, x(N-3) + \ldots + a_{p,p}\, x(N-p-1)
e_p(N) = a_{p,1}\, x(N-1) + a_{p,2}\, x(N-2) + \ldots + a_{p,p}\, x(N-p)
\vdots
e_p(N+p-1) = a_{p,p}\, x(N-1) \qquad (4.104)
In matrix form, this can be written as:

\begin{bmatrix} e_p(0) \\ \vdots \\ e_p(p) \\ \vdots \\ e_p(N-1) \\ \vdots \\ e_p(N+p-1) \end{bmatrix} = \begin{bmatrix} x(0) & & \\ \vdots & \ddots & \\ x(p) & \cdots & x(0) \\ \vdots & & \vdots \\ x(N-1) & \cdots & x(N-p-1) \\ & \ddots & \vdots \\ & & x(N-1) \end{bmatrix} \begin{bmatrix} 1 \\ a_{p,1} \\ a_{p,2} \\ \vdots \\ a_{p,p} \end{bmatrix} \qquad (4.105)

e_p = X\, a \qquad (4.106)
The forward prediction error energy is the summation of |e_p(n)|² over the whole range:

E_p = \sum_n |e_p(n)|^2 = \sum_n \left| \sum_{k=0}^{p} a_{p,k}\, x(n-k) \right|^2 \qquad (4.107)
Minimising E_p with respect to the a_{p,k} results in a set of p equations:

\frac{\partial E_p}{\partial a_{p,i}} = 0 \;\Rightarrow\; \sum_{k=0}^{p} a_{p,k} \sum_n x(n-k)\, x(n-i) = 0, \quad \text{for } 1 \le i \le p \qquad (4.108)
The minimum error energy, denoted by E_{p,min}, is obtained by expanding Eq. 4.107 and substituting into
Eq. 4.108. The result can be shown to be [105]:
E_{p,min} = \sum_{k=0}^{p} a_{p,k} \sum_n x(n-k)\, x(n) \qquad (4.109)
Using matrix notation for Eqs. 4.108 and 4.109:
X^T X\, a = [E_{p,min}\; 0\; 0\; \ldots\; 0]^T \qquad (4.110)
The matrix X^T X has a Toeplitz structure. In fact, multiplying both sides of Eq. 4.110 by 1/N makes the equation equivalent to the Yule-Walker equations using the biased autocorrelation estimator given in Eq. 4.14, hence the name Yule-Walker estimator. The Levinson-Durbin recursion can be used to solve
Eq. 4.110.
However, if we avoid the assumption of zeros for the unknown values of x(n) and restrict the calculation of e_p(n) to n = p, \ldots, N-1, the matrix X in Eq. 4.110 will change to X_{cov}, defined as one of matrix X's partitions:
X = \begin{bmatrix} X_{pre} \\ X_{cov} \\ X_{post} \end{bmatrix}, \qquad X_{cov} = \begin{bmatrix} x(p) & \cdots & x(0) \\ \vdots & & \vdots \\ x(N-1) & \cdots & x(N-p-1) \end{bmatrix} \qquad (4.111)
Using the matrix X_{cov} is essentially equivalent to using the same number of products, N − p, for each lag in the estimation of the autocorrelation function. Unless the signal is periodic with a period which is a multiple of N − p, the matrix S = X_{cov}^T X_{cov} will not be Toeplitz. Instead it is the so-called sample covariance matrix. Hence, the resultant set of equations using Eq. 4.111 is called the covariance equations.
The sample covariance matrix is not always positive definite and instabilities of the AR filter may occur.
In the case of a periodic signal with a period which is a multiple of N − p, the sample covariance matrix
and the true autocorrelation matrix are identical. The covariance estimator is statistically closer to the
MLE estimator than the Yule-Walker estimator, although the latter has a lower variance, especially if data
windowing is applied. It is interesting to note that X_{cov} can be linked to non-linear dynamical systems theory if seen as a sequence of N − p delay-reconstructed vectors in a (p + 1)-dimensional embedding space when using a one-sample delay [1].
The use of either X_{pre} or X_{post} with X_{cov} leads to the pre-windowed or post-windowed equations respectively. Algorithms have been developed to solve the covariance equations, the pre-windowed and the post-windowed equations, but none of these perform significantly better than the Yule-Walker estimator. All
the LS methods based on forward prediction error present line spectral splitting in the PSD estimator.
This phenomenon consists of two peaks being generated in the PSD, very close to each other, where the
real PSD has only one peak. The method that uses the matrix X as in Eq. 4.111 gives the least spectral
resolution. Displacement of the peaks and great sensitivity to noise are other common problems of the
LS forward prediction approach. The next sections describe methods based on the backward prediction
error as well as the forward prediction error, in an attempt to improve the results described up to this
point.
Constrained LS of the forward and backward prediction errors
Provided that the process x(n) is stationary, the forward and the backward prediction errors are equal in
a statistical sense. Based on this property, Burg [25] proposed the use of both errors in the cost function
for the parameters ap,k , without making any other assumption about the data:
E_p = \sum_{n=p}^{N-1} \left( |e_p(n)|^2 + |b_p(n)|^2 \right) = \sum_{n=p}^{N-1} \left( \left| \sum_{k=0}^{p} a_{p,k}\, x(n-k) \right|^2 + \left| \sum_{k=0}^{p} a_{p,k}\, x(n-p+k) \right|^2 \right) \qquad (4.112)
Burg further proposed to minimise the error E_p subject to the constraint, over the a_{p,k} parameters, that
they satisfy the Levinson-Durbin recursion:
a_{m,k} = a_{m-1,k} + a_{m,m}\, a_{m-1,m-k} \qquad (4.113)
for all orders m from 1 to p.
This constraint ensures a stable AR filter. Substituting the recursive expressions for e_p(n) and b_p(n) shown in Eq. 4.100, the cost function E_p becomes a function of the reflection coefficient a_{p,p} = κ_p, defined in §4.6.1, and of the forward and backward prediction errors for the order immediately below, p − 1. Minimising with respect to a_{p,p} yields:
a_{p,p} = \kappa_p = \frac{-2 \sum_{k=p}^{N-1} b_{p-1}(k-1)\, e_{p-1}(k)}{\sum_{k=p}^{N-1} \left( |b_{p-1}(k-1)|^2 + |e_{p-1}(k)|^2 \right)} \qquad (4.114)
The denominator on the right-hand side of Eq. 4.114 can be found recursively using the relations given
in Eq. 4.100 as:
\text{DEN}_p = \sum_{k=p}^{N-1} \left( |b_{p-1}(k-1)|^2 + |e_{p-1}(k)|^2 \right) = \text{DEN}_{p-1}\, (1 - |\kappa_{p-1}|^2) - |b_{p-1}(N-p)|^2 - |e_{p-1}(p)|^2 \qquad (4.115)
The Burg algorithm is then implemented as follows:
1. Initialisation:

e_0(n) = b_0(n) = x(n)

\text{DEN}_0 = \sum_{k=0}^{N-1} \left( |x(k-1)|^2 + |x(k)|^2 \right)

2. Recursion for m = 1, 2, 3, \ldots, p:

\text{DEN}_m = \text{DEN}_{m-1}\, (1 - |\kappa_{m-1}|^2) - |b_{m-1}(N-m)|^2 - |e_{m-1}(m)|^2

a_{m,m} = \kappa_m = \frac{-2 \sum_{k=m}^{N-1} b_{m-1}(k-1)\, e_{m-1}(k)}{\text{DEN}_m}

a_{m,k} = a_{m-1,k} + a_{m,m}\, a_{m-1,m-k}, \quad \text{for } k = 1, 2, \ldots, m-1

e_m(n) = e_{m-1}(n) + \kappa_m\, b_{m-1}(n-1)

b_m(n) = b_{m-1}(n-1) + \kappa_m\, e_{m-1}(n)
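A compact Python sketch of the Burg recursion (function name mine; for clarity the denominator of Eq. 4.114 is recomputed directly at each order rather than updated recursively via Eq. 4.115):

    import numpy as np

    def burg(x, p):
        # Burg estimation of the AR parameters of x for model order p.
        x = np.asarray(x, dtype=float)
        a = np.array([1.0])          # [1, a_{m,1}, ..., a_{m,m}]
        ef = x.copy()                # forward errors  e_{m-1}(n)
        eb = x.copy()                # backward errors b_{m-1}(n)
        sigma2 = np.mean(x ** 2)
        for m in range(1, p + 1):
            f = ef[1:]               # e_{m-1}(k),   k = m .. N-1
            g = eb[:-1]              # b_{m-1}(k-1), k = m .. N-1
            k = -2.0 * (g @ f) / (g @ g + f @ f)    # Eq. 4.114
            a = np.concatenate([a, [0.0]])
            a = a + k * a[::-1]                     # Eq. 4.113
            ef, eb = f + k * g, g + k * f           # Eq. 4.100
            sigma2 *= (1.0 - k ** 2)
        return a[1:], sigma2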
As may be noted, the Burg method does not minimise the cost function for all the reflection coefficients at the same time, but rather minimises the error with respect to the last reflection coefficient for each
model order. This has been pointed out as being the cause for the line spectral splitting also observed in
PSD estimators obtained from Burg AR estimates [81]. Several researchers have tried to overcome this
effect by minimising the error function with respect to either all the reflection coefficients or the feedback
coefficients of the p-order AR filter at the same time (see [81] for a review). The latter approach removes
the constraint (Eq. 4.113) imposed by Burg, and is usually referred to as forward-backward LS. It requires
about 20% more computations than the Burg algorithm and although it is apparent that this method
removes the spectral line splitting [81], the stability of the AR filter is not guaranteed.
Modified Burg algorithm
Narayan and Burg [117] also proposed a solution to the spectrum line splitting problem for quasi-periodic
time series, in a method called covariance Burg. The method requires a priori knowledge of the period of
the signal to estimate the autocorrelation matrix from the sample covariance matrix.
The algorithm is very similar to the Burg method except for a trapezoidal weighting function applied in
the computation of the reflection coefficient:
\kappa_m = \frac{-2 \sum_{k=m}^{N-1} w_m(k)\, b_{m-1}(k-1)\, e_{m-1}(k)}{\sum_{k=m}^{N-1} w_m(k) \left( |b_{m-1}(k-1)|^2 + |e_{m-1}(k)|^2 \right)} \qquad (4.116)
where w_m(n) takes the value of the minimum of n − m + 1, p − m + 1 and N − n, denoted as:
w_m(n) = \min(n - m + 1,\; p - m + 1,\; N - n) \qquad (4.117)
For a high signal-to-noise ratio and periodic data, this estimator will be much closer to the optimum in the maximum likelihood sense⁸ than the one obtained by the covariance method. However, the
optimum as with periodic data. This surprising feature may be explained by the fact that the method
performs a kind of average over the sample covariance matrices (dimension (p + 1) × (p + 1)) along the
N -point long data segment.
⁸ Burg et al. [26] developed an algorithm to find the closest Toeplitz matrix to the sample covariance matrix in the maximum likelihood sense, using the normalised distance measure $D_n(S, R) = \mathrm{Trace}(R^{-1}S) - \ln(|R^{-1}S|) - (p+1)$. The algorithm, called "structured covariance matrices", is an approximation to the MLE and is computationally very expensive.
4.6.3 Sensitivity to additive noise of the AR model PSD estimator
One of the major problems in the use of AR parametric modelling for estimating PSD is the presence of
additive noise in the signal. Although noise has been considered in the model to represent the unpredictable nature of a stochastic signal, the addition of white or coloured noise to the signal will obscure
its spectrum in the PSD estimation of signal plus noise as is shown in Eqs. 4.118 and 4.119 below. Let us
suppose that x(n) is a p-order AR stochastic process contaminated with noise:
$$x_n(n) = x(n) + \nu(n) \qquad (4.118)$$
Assuming that $\nu(n)$ is white uncorrelated noise with variance $\sigma_\nu^2$, the PSD of $x_n(n)$ is:

$$S_n(\omega) = S_x(\omega) + \sigma_\nu^2 = \frac{\sigma_{v,p}^2}{\left|1 + \sum_{k=1}^{p} a_{p,k}\, e^{-j\omega k}\right|^2} + \sigma_\nu^2 = \frac{\sigma_{v,p}^2 + \sigma_\nu^2 \left|1 + \sum_{k=1}^{p} a_{p,k}\, e^{-j\omega k}\right|^2}{\left|1 + \sum_{k=1}^{p} a_{p,k}\, e^{-j\omega k}\right|^2} \qquad (4.119)$$
As can be seen in Eq. 4.119, the process which includes the additive noise is an ARMA process rather than the original AR process. This distorts the spectrum, with a loss of resolution in the detection of the peaks of the original process. Additive noise is ubiquitous in signal processing, as it is the standard model for sensor noise.
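The resolution loss can be made concrete with a short numerical sketch which evaluates Eq. 4.119; the AR(2) coefficients and noise variance below are illustrative, not taken from the EEG data:

import numpy as np

def ar_psd(a, sigma2, omega):
    """AR PSD: sigma2 / |1 + sum_k a_k exp(-j*omega*k)|^2, as in Eq. 4.119."""
    k = np.arange(1, len(a) + 1)
    H = 1.0 + np.exp(-1j * np.outer(omega, k)) @ a   # 1 + sum_k a_k e^{-j w k}
    return sigma2 / np.abs(H) ** 2

omega = np.linspace(0.0, np.pi, 512)
a = np.array([-1.6, 0.95])              # poles close to the unit circle: sharp peak
S_x = ar_psd(a, sigma2=1.0, omega=omega)
S_n = S_x + 0.5                         # additive white noise of variance 0.5
# The peak-to-average ratio drops once the noise floor is added,
# illustrating the loss of resolution described above.
print(S_x.max() / S_x.mean(), S_n.max() / S_n.mean())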
4.7 Modelling the EEG
AR modelling, just like FFT methods, assumes that the signal under analysis is stationary. The EEG can
be considered to be a stochastic process, which is the result of the summation of numerous sources of
electrical activity. These activities depend on external and internal variables, and on large numbers of
inhibitory and excitatory interactions that can be considered to be random. The central limit theorem states that the sum of many independent random variables tends to a random variable with a Gaussian distribution. The EEG sources are not independent, but we may assume them to be statistically independent in origin, modelling the numerous interactions between neurons as a shaping filter which transforms these sources into a correlated Gaussian process. Under stable external and internal conditions, we may consider the EEG process to be stationary and, even further, take it to be ergodic.
The question, then, is for how long during continuous recordings can we assume that the external and internal conditions remain stable? The answer depends on many factors, but for the kind of data analysed in this thesis the main ones are the subject's activity and the presence or absence of a pathological condition. If the subject is healthy and asleep, and if the external conditions are favourable to sleep, then we can assume that the changes in the statistical properties of the EEG will occur slowly. However, we know from section 3.1.2 that the sleep EEG may spontaneously present transient waves such as spindles and vertex waves, which affect the stationarity of the signal even if the brain stays in the same sleep stage. These events last for about a second. In fact, several authors [14][12] have recommended that the EEG should be analysed in segments no longer than 1 second to ensure stationarity. However, segments of 1-s duration may be too short to obtain an accurate autocorrelation estimate. If the subject is awake, many more factors can influence the patterns in the EEG, some of which are difficult to control under experimental conditions. For instance, eye closing/opening affects the alpha rhythm in the EEG. Transitions from alertness to drowsiness bring an important series of changes in the EEG, some of which may happen very quickly. Varri et al. [170] use EEG segments of variable length, from 0.5 s to 2 s, in vigilance studies.
Barlow [14] gives a review of methods suitable for detecting EEG non-stationarities. The methods can be
based on fixed intervals or adaptive intervals. EEG features from a fixed reference window are compared
with the features from the EEG in a moving “test” window, looking for significant changes in the features.
Once a change is detected, a boundary is set at the point where the change occurs in order to segment the
signal into stationary segments. A new reference window is then placed at the start of the new segment.
The features are usually based on time descriptors, FFT coefficients, or AR modelling, as in the sketch below.
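As a concrete illustration, the sketch below implements the reference/test window comparison in its simplest form; the scalar feature, window lengths and decision threshold are illustrative choices rather than values recommended by Barlow [14]:

import numpy as np

def segment_boundaries(x, fs, feature, win=1.0, step=0.25, rel_change=0.5):
    """Segment a signal into quasi-stationary pieces: slide a test window and
    set a boundary whenever its feature departs from the reference window's
    feature by more than the given relative change."""
    n_win, n_step = int(win * fs), int(step * fs)
    boundaries = []
    ref_val = feature(x[0:n_win])
    pos = n_step
    while pos + n_win <= len(x):
        test_val = feature(x[pos:pos + n_win])
        if abs(test_val - ref_val) > rel_change * abs(ref_val):
            boundaries.append(pos)                 # boundary at the change point
            ref_val = feature(x[pos:pos + n_win])  # new reference window
        pos += n_step
    return boundaries

rms = lambda w: np.sqrt(np.mean(w ** 2))           # a simple time-domain feature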
Another method for analysis of non-stationary EEG is time-varying AR modelling, better known as Kalman
filtering [148]. With this method, the AR parameters are estimated for a short segment at the beginning
of the signal by any conventional algorithm (Yule-Walker, Burg, etc.) and then updated every sample.
The non-stationarities can be tracked from the rate-of-change of the AR parameters. This method has
been reported to yield better results than FFT or conventional AR modelling in recognising rapid changes
in the frequency of oscillations [148], but the computational cost and the data expansion instead of
compression make the method unattractive [14]. Some researchers average the Kalman AR coefficients
over segments of 1s or more, for smoothing and data compression, but this has the same effect as using
a conventional AR method over the segment. Also, brief disturbances such as artefacts and transient waves have a lasting effect with Kalman filtering (depending on the gain of the filter), given that the technique employs information from the recent past to update the current estimate of the model coefficients. In
contrast, in conventional AR modelling, a brief disturbance will only affect the segment during which it
occurs [14].
Chapter 5
Neural network methods
We have already seen that the EEG signal can be described in terms of its power spectrum or as a set
of filter parameters. These values extracted from the PSD or the AR model are generally called features.
The next step is to find the relationship between the EEG features and the mental state, i.e. deep sleep,
wakefulness, drowsiness, etc. This constitutes a classification task which seeks to partition the input
space into 1-of-K classes. The input space is a mathematical abstraction such that a given set of features $x_{o1}, x_{o2}, \dots, x_{od}$ is assigned to a point $\mathbf{x}_o$ in a $d$-dimensional space with coordinates $x_1, x_2, \dots, x_d$. Classification involves dividing the input space into regions such that points taken from the same region belong
to the same class. The dividing line between two regions is known as a decision boundary.
Classifiers can be divided according to the type of mapping generated. In linear classifiers, the decision
boundaries are hyper-planes (for dimensions greater than three). However, it is well known that real-world problem datasets show considerable overlap between classes and hence require non-linear decision boundaries. For the same reason, it is helpful to adopt a probabilistic framework, placing the decision
boundary in the loci for which probabilities of belonging to either class are equal. There are several
methods for evaluating the posterior probability of belonging to a given class, using either parametric
or non-parametric techniques. With the non-parametric methods, no assumption is made regarding the
probability distribution of the data belonging to each class.
Neural networks are non-linear, non-parametric function approximators which can be used in regression
problems as well as in classification problems. Compared with other types of classifiers, neural networks
offer advantages in problems for which the classification rules are complex and difficult to specify. Provided that a sufficiently large set of input data is labelled by human experts (called the training set),
a neural network can “learn” the underlying generator of the data and for a given input, produce an
output in terms of the posterior probabilities of the classes. With new data (or test data), drawn from
the same distribution as the training set, the trained neural network should produce accurate posterior
probabilities of the data belonging to the classes, a property sometimes known as generalisation. The set
of posterior probabilities can be fed to a decision-making stage to assign the input to one of the classes.
Training a neural network is a time-consuming task, but once this task is performed, classification is fast,
and requires very little computational resources.
[Figure 5.1: The classification process. The EEG for epoch $n$ passes through a feature extractor, which produces the features $x_0^n, \dots, x_{d-1}^n$; these are fed to an MLP, which outputs the posterior probabilities $P(C_1\,|\,\mathbf{x}^n), \dots, P(C_K\,|\,\mathbf{x}^n)$; a decision-making stage then produces the classification $C_k$.]
5.1 Neural Networks
A neural network consists of arrays of interconnected "artificial neurons". The structure of an artificial neuron is shown in Fig. 5.2. The artificial neuron adds the weighted values of its $d$ inputs $x_i$, and applies a non-linear function to this summation in order to produce an output $y$, whose value is in the range from 0 to 1:

$$y = g\!\left(\sum_{i=0}^{d} w_i x_i\right) \qquad (5.1)$$
[Figure 5.2: An artificial neuron. The inputs $x_0 = 1, x_1, x_2, \dots, x_d$, weighted by $w_0, w_1, w_2, \dots, w_d$, are summed to give $a$, which passes through the activation function (here the hard-limiter $g_h$) to produce the output $y$.]
The nonlinear function $g(a)$ is known as the activation function. Several types of activation function can be used. One such function is the so-called hard-limiter $g_h(a)$, which produces a binary output, 1 or 0, depending on whether or not the summation exceeds a given threshold, and is defined as:

$$g_h(a) = \begin{cases} 0 & \text{for } a < 0 \\ 1 & \text{for } a \geq 0 \end{cases} \qquad (5.2)$$

where $a = \sum_{i=0}^{d} w_i x_i$.
The use of the hard-limiter has a physiological background, as it simulates the "all-or-nothing" rule of real neurons. The weight $w_0$ associated with the bias input $x_0$ represents the threshold when $g_h(a)$ is used, because it sets the minimum value of $a$ for the neuron "to fire". A minimal sketch of such a neuron is given below.
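The following sketch, with arbitrary illustrative weights, implements Eqs. 5.1 and 5.2:

import numpy as np

def neuron(x, w, g):
    """Artificial neuron of Eq. 5.1: prepend the bias input x_0 = 1, form the
    weighted sum a, and pass it through the activation function g."""
    a = float(np.dot(w, np.concatenate(([1.0], x))))
    return g(a)

hard_limiter = lambda a: 1 if a >= 0 else 0        # Eq. 5.2
y = neuron(np.array([0.2, 0.7]), np.array([-0.5, 1.0, 1.0]), hard_limiter)  # fires: a = 0.4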
Other activation functions have continuous outputs between 0 and 1, allowing the output to be interpreted as a probability. Examples of such functions are the sigmoid function $g_\sigma$ and the softmax function $g_{\mathrm{softmax}}$. The sigmoid function is the hyperbolic tangent function tanh, scaled to lie between saturation levels of 0 and 1:
$$g_\sigma(a) = \frac{1}{1 + e^{-a}} \qquad (5.3)$$
This mathematically simple function is widely used in two-class problems [19, pp.231]. The softmax
function, a more generalised form of the sigmoid function also known as the normalised exponential, is
better suited to multiple class problems and will be explained later.
[Figure 5.3: The hyperbolic tangent and sigmoid functions.]
The non-linear mapping performed by a neural network can be written as:
$$\mathbf{y} = \mathbf{y}(\mathbf{x}; \mathbf{w}) = G(\mathbf{x}; \mathbf{w}) \qquad (5.4)$$
where y represents the vector of outputs [y1 , . . . , yK ]T , generally representing the probabilities of belonging to class Ck in a classification problem; and w represents the vector of connection weights between the
input nodes and the neurons, between neurons, and between the neurons and the outputs. The process
of finding the weights to perform the mapping correctly is called learning or training the network. During
supervised learning the input patterns or vectors are presented repeatedly to the network, along with
the desired value for the outputs (the target value tk for the kth output). The weights are successively
adjusted in order to minimise a cost function, generally associated with the mean squared error. The
performance of the trained network is then tested on the test set, i.e. a set of patterns not included in
the training set. An over-trained network will fit the noise rather than the data and hence will generalise
poorly.
5.1.1 The error function
The general goal of the neural network is to make the best possible prediction of the target vector t =
[t1 , . . . , tK ]T when a new input vector value x is presented. The most general and complete description
of the data is in terms of the joint probability density p(x, t) given by:
$$p(\mathbf{x}, \mathbf{t}) = p(\mathbf{t}\,|\,\mathbf{x})\, p(\mathbf{x}) \qquad (5.5)$$
where p(t |x) represents the probability of t given a particular value of x, and p(x) is the unconditional
probability of x, given by:
$$p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{t})\, d\mathbf{t} \qquad (5.6)$$
The cost function to minimise during training can be arbitrarily defined. A good cost function can be
derived from the likelihood of the training set {xn , tn }, which can be written as:
$$L = \prod_n p(\mathbf{x}^n, \mathbf{t}^n) = \prod_n p(\mathbf{t}^n\,|\,\mathbf{x}^n)\, p(\mathbf{x}^n) \qquad (5.7)$$
For optimisation purposes, it is simpler to take the negative logarithm of the likelihood:
$$E = -\ln L = -\sum_n \ln p(\mathbf{t}^n\,|\,\mathbf{x}^n) - \sum_n \ln p(\mathbf{x}^n) \qquad (5.8)$$
where E defines a cost function, usually called the error function. The second term of the right hand side
of Eq. 5.8 can be omitted as it does not depend on the parameters of the neural network, so:
$$E = -\sum_n \ln p(\mathbf{t}^n\,|\,\mathbf{x}^n) \qquad (5.9)$$
Sum of squares function
If we assume the target variables tk to be continuous, with independent zero-mean Gaussian distributions,
we can write:
$$p(\mathbf{t}\,|\,\mathbf{x}) = \prod_{k=1}^{K} p(t_k\,|\,\mathbf{x}) \qquad (5.10)$$
Furthermore, let us assume that the $t_k$'s are given by some deterministic function of $\mathbf{x}$ with added Gaussian noise $\varepsilon_k$:

$$t_k = h_k(\mathbf{x}) + \varepsilon_k \qquad (5.11)$$
where the noise distribution is given by:

$$p(\varepsilon_k) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\varepsilon_k^2}{2\sigma^2}\right) \qquad (5.12)$$
As the training process seeks to model the functions hk (x) with yk (x; w), we can use the latter in Eq. 5.11
and substitute εk in Eq. 5.12 to give:
$$p(t_k\,|\,\mathbf{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\{y_k(\mathbf{x}; \mathbf{w}) - t_k\}^2}{2\sigma^2}\right) \qquad (5.13)$$
Combining Eq. 5.13 and Eq. 5.10 in the expression for the error in Eq. 5.9:

$$E = -\sum_n \ln \prod_{k=1}^{K} p(t_k\,|\,\mathbf{x}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \ln\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\{y_k(\mathbf{x}; \mathbf{w}) - t_k\}^2}{2\sigma^2}\right) \right] = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{K} \{y_k(\mathbf{x}; \mathbf{w}) - t_k\}^2 + NK \ln \sigma + \frac{NK}{2} \ln(2\pi) \qquad (5.14)$$
where $N$ is the number of input patterns used during the training process. Note that the last two terms of Eq. 5.14 do not depend on the weights $\mathbf{w}$, so they can be omitted, as can the dividing factor $\sigma^2$ in the first term. Thus the error function $E$ ends up as:

$$E = \frac{1}{2} \sum_n \sum_k \{y_k(\mathbf{x}^n; \mathbf{w}) - t_k^n\}^2 = \frac{1}{2} \sum_n \left\| \mathbf{y}(\mathbf{x}^n; \mathbf{w}) - \mathbf{t}^n \right\|^2 \qquad (5.15)$$
The error function in Eq. 5.15 is called the sum-of-squares function. It reduces the optimisation process
to a least-squares procedure. Its use is not restricted to Gaussian distributed target data, and although
the sum of the outputs equals unity (very convenient if we want to interpret the outputs as probabilities),
the results cannot distinguish between the true distribution and any other distribution having the same
mean and variance.
Cross-Entropy function
In a classification problem, the target data represent discrete class labels; therefore a more convenient code for the target data is the "1-of-K" scheme:

$$t_k^n = \delta_{k\ell}, \qquad \text{for } \mathbf{x}^n \in C_\ell \qquad (5.16)$$

where $\delta_{k\ell}$ is the Kronecker delta, which is 1 for $k = \ell$ and 0 otherwise.
The output is meant to represent the posterior probability of class membership:
$$y_\ell = P(C_\ell\,|\,\mathbf{x}) \qquad (5.17)$$

Therefore, we can write $p(\mathbf{t}\,|\,\mathbf{x}) = (y_\ell)^{t_\ell}$, and more generally, assuming that the distributions $p(\mathbf{t}^n\,|\,\mathbf{x}^n)$ are statistically independent:

$$p(\mathbf{t}^n\,|\,\mathbf{x}^n) = \prod_{k=1}^{K} (y_k^n)^{t_k^n} \qquad (5.18)$$
Substituting Eq. 5.18 in Eq. 5.9 for the log-likelihood error function:

$$E = -\sum_n \sum_{k=1}^{K} t_k^n \ln y_k^n \qquad (5.19)$$
This error function has an absolute minimum with respect to the yk ’s when ykn = tnk for all k and n. At the
minimum, E is:
$$E_{\min} = -\sum_n \sum_{k=1}^{K} t_k^n \ln t_k^n \qquad (5.20)$$
If $t_k$ takes only the values 0 or 1, this minimum is equal to zero, but if $t_k$ is a continuous variable in the range (0, 1) this minimum does not necessarily reach zero. In fact, it represents the cross-entropy [19, pp.244] between the distributions of the target and the output. Hence, this error function, derived from the maximum likelihood criterion for a 1-of-K target coding, is called the cross-entropy error function.
To ensure a zero value at the minimum, the value $E_{\min}$ is subtracted from the error function in Eq. 5.19, giving the modified error:

$$E = -\sum_n \sum_{k=1}^{K} t_k^n \ln \frac{y_k^n}{t_k^n} \qquad (5.21)$$

which is non-negative and equals zero when $y_k^n = t_k^n$ for all $k$ and $n$; the short sketch below computes this quantity directly.
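A direct implementation of Eq. 5.21 (the eps guard and the handling of zero targets are our own conventions):

import numpy as np

def cross_entropy(y, t, eps=1e-12):
    """Modified cross-entropy error of Eq. 5.21 for outputs y and targets t,
    both of shape (N, K). Terms with t = 0 contribute nothing, following the
    convention 0 ln 0 = 0; eps guards the logarithm."""
    y = np.clip(y, eps, 1.0)
    t_safe = np.where(t > 0, t, 1.0)      # avoid dividing by zero targets
    return -np.sum(t * np.log(y / t_safe))

t = np.array([[1.0, 0.0], [0.0, 1.0]])   # 1-of-K targets for two patterns
print(cross_entropy(t, t))               # 0.0: the minimum is reached when y = t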
The cross-entropy error function has some advantages over the sum-of-squares error function. Firstly, it
can be proved [19, pp.235-6] that for an infinitely large data set the outputs yk are exactly the posterior
probability $P(C_k\,|\,\mathbf{x})$, and therefore are limited to the (0, 1) range. Secondly, it performs better at estimating small probabilities. Indeed, if we denote the error at the output $y_k^n$ as $\varepsilon_k^n$, then the cross-entropy error is:

$$E = -\sum_n \sum_{k=1}^{K} t_k^n \ln \frac{t_k^n + \varepsilon_k^n}{t_k^n} = -\sum_n \sum_{k=1}^{K} t_k^n \ln\left(1 + \frac{\varepsilon_k^n}{t_k^n}\right) \qquad (5.22)$$
It is clear from Eq. 5.22 that the cross-entropy error function depends on the relative errors of the neural network outputs, in contrast with the sum-of-squares function, which depends on the squares of the absolute errors. Therefore, minimisation of the cross-entropy error will tend to give similar relative errors on both small and large probabilities, while sum-of-squares tends to give similar absolute errors for each pattern, resulting in large relative errors for small output values.
Cross-entropy error for a two-class problem For K > 2 it is desirable to have one output per class, so
that each output represents the posterior probability of belonging to one of the classes, but for a 2-class
problem, only one output representing one class is necessary as the probability for the other class can
be found by subtracting the output value from 1. This causes a few changes in the cross-entropy error
function, which will be reviewed here briefly.
Assigning $y = P(C_1\,|\,\mathbf{x})$, the conditional probability $p(t\,|\,\mathbf{x})$ can be written as:

$$p(t\,|\,\mathbf{x}) = y^t (1 - y)^{1-t} \qquad (5.23)$$

where $t$ takes the value 1 if $\mathbf{x} \in C_1$ and 0 if $\mathbf{x} \in C_2$. The cross-entropy error takes the form:

$$E = -\sum_n \left\{ t^n \ln y^n + (1 - t^n) \ln(1 - y^n) \right\} \qquad (5.24)$$
Differentiating with respect to $y^n$:

$$\frac{\partial E}{\partial y^n} = \frac{y^n - t^n}{y^n (1 - y^n)} \qquad (5.25)$$
It is easy to see from Eq. 5.25 that the cross-entropy function for a 2-class problem has an absolute
minimum at 0 when y n = tn for all n. Again, it will be convenient to subtract the minimum from the
expression in Eq. 5.24:
$$E = -\sum_n \left\{ t^n \ln \frac{y^n}{t^n} + (1 - t^n) \ln \frac{1 - y^n}{1 - t^n} \right\} \qquad (5.26)$$
5.1.2 The decision-making stage
To arrive at a classification from the posterior probabilities evaluated at the outputs of the neural network,
the minimum error-rate criterion is usually adopted. To minimise the probability of misclassification, a
new input should be assigned to the class having the largest posterior probability. Several aspects should
be taken into account when training in real-world problems. Firstly, the neural network is trained to
estimate the posterior probabilities of class membership based on the assumption that the input data
has been drawn from the same data distribution as the training set. Hence the output k represents
P (Ck | x, D) where D = {xn , tn } is the training set data. Secondly, the proportion of data from each
class in the training set reflects the prior probabilities of classes. From Bayes’ theorem:
$$P(C_k\,|\,\mathbf{x}) = \frac{p(\mathbf{x}\,|\,C_k)\, P(C_k)}{p(\mathbf{x})} \qquad (5.27)$$

$$\Rightarrow\quad p(\mathbf{x}\,|\,C_k)\, P(C_k) = P(C_k\,|\,\mathbf{x})\, p(\mathbf{x}) \qquad (5.28)$$
Integrating on both sides of Eq. 5.28 yields:

$$\int p(\mathbf{x}\,|\,C_k)\, P(C_k)\, d\mathbf{x} = \int P(C_k\,|\,\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \qquad (5.29)$$

$$\Rightarrow\quad P(C_k) = \int P(C_k\,|\,\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \qquad (5.30)$$

By assuming that all the values of $\mathbf{x}$ are equally probable in the training set, the right-hand side of Eq. 5.30 can be approximated as:

$$P(C_k) = \int P(C_k\,|\,\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \approx \frac{1}{N} \sum_{n=1}^{N} P(C_k\,|\,\mathbf{x}^n) \qquad (5.31)$$
Thus, the prior probabilities are approximated as the average of each neural network output over all the
patterns in the training set. Hence, the prior probability P (Ck ) should determine the proportion of the
patterns belonging to class Ck in the training set. In some cases, this can be problematic if the class with
the maximum risk of misclassification is very scarce (e.g. when diagnosing a fault or a disease). In these
cases, it would be desirable to include as many patterns from the high risk class as from the other classes
in the training set. Compensation for the different prior probabilities can be easily performed multiplying
each output by the ratio of the “true” prior probability with respect to the prior in the training set, and
normalising the corrected outputs so that they sum to unity.
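A minimal sketch of this correction (the function name and the example prevalences are hypothetical):

import numpy as np

def compensate_priors(y, train_priors, true_priors):
    """Scale each network output by the ratio of the 'true' prior to the
    training-set prior, then renormalise so the outputs sum to unity."""
    y_adj = np.asarray(y) * (np.asarray(true_priors) / np.asarray(train_priors))
    return y_adj / y_adj.sum(axis=-1, keepdims=True)

# A rare "disease" class trained on a balanced 50/50 set, true prevalence 5%:
print(compensate_priors([0.6, 0.4], [0.5, 0.5], [0.05, 0.95]))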
5.1.3 Multi-layer perceptrons
A neural network with its neurons arranged in layers is called a perceptron. A single-layer of neurons is
therefore called a single-layer perceptron. It has inputs whose values are x1 , x2 , . . . , xd written as a feature
vector $\mathbf{x}$. Given that there are no connections from the outputs back to the inputs, a single-layer perceptron is a feed-forward neural network. It is also a linear classifier, as it partitions the input space with hyper-planes. The perceptron learning rule [159, pp.11] is guaranteed to find a solution with a single layer if the input feature vectors are linearly separable.
If more complex decision boundaries are required, two or more layers of neurons should be used. Such
a network is known as a multi-layer perceptron (MLP). Fig. 5.4 shows an I − J − K (2-layer) MLP, with
I-dimensional input patterns, J hidden units zj , and K outputs yk . The neurons in the output layer are
simply called "outputs", while the neurons in the intermediate layer are called "hidden units". When necessary, the superscripts $z$ or $y$ will be used to distinguish the weights $w$, or the activation-function inputs $a$, of the hidden layer from those in the output layer.
It can be shown that a 2-layer perceptron with smooth nonlinearities is able to approximate any arbitrary
function [159, pp.16]. However, the decision boundaries will not be abrupt, as with the hard-limiter
perceptron, but smooth and continuous instead. The approximation accuracy will then depend on the
number of units in the hidden layer. A low number of hidden units will give an insufficiently complex
model for the given problem, while a large number of hidden units will result in an over-fitted model. An over-trained or over-fitted network is a disadvantage in real-world problems, since most real-world data is very noisy.

[Figure 5.4: An I−J−K neural network. The inputs $x_0 = 1, x_1, \dots, x_I$ feed the hidden units $z_0 = 1, z_1, \dots, z_J$ through the weights $w_{ij}^z$; the hidden units feed the outputs $y_1, \dots, y_K$ through the weights $w_{jk}^y$.]
Training an MLP: the error backpropagation algorithm
The training algorithm that underpins the use of multi-layer perceptrons is the so-called error backpropagation algorithm. It uses error gradient information to seek a minimum of the error function. In order to
apply this algorithm to an MLP it is necessary to use continuous and differentiable activation functions.
The activation function for the hidden units does not necessarily have to be the same as for the outputs.
Hyperbolic tangent or sigmoid functions are usually chosen as the non-linearity for the hidden units.
If probabilities are to be represented at the outputs, then these units have to be restricted to the [0,1]
range. Hence, the sigmoid function for a 2-class problem, or its generalisation, the softmax function for
a K-class problem (K > 2) are recommended for the outputs. The softmax function is defined as:
$$g_{\mathrm{softmax}}(a_k) = \frac{e^{a_k}}{\sum_{k'} e^{a_{k'}}} \qquad (5.32)$$

where $a_k = \sum_{j=1}^{J} w_{jk}\, y_j$ is the summation of the inputs to the $k$th neuron and $k'$ is the index of the summation over all the $K$ neurons in the output layer.
Let us assume that a neural network is to be trained to solve a classification problem for $K > 2$ mutually exclusive classes, with a training set of input patterns $\mathbf{x}^n$, $n = 1, \dots, N$, each represented by $I$ feature values, $\mathbf{x}^n = [x_1^n, \dots, x_I^n]^T$, and a class membership target vector $\mathbf{t}^n = [t_1^n, \dots, t_K^n]^T$ coded with a 1-of-K scheme. Assume that the network is a 2-layer MLP with $J$ hidden units $z_j$ with a sigmoidal activation function, and one output per class $y_k$ with a softmax activation function. We would like to find the values of all the weights in the neural network, the vector $\mathbf{w}$, that minimise the error function $E(\mathbf{x}^n; \mathbf{t}^n; \mathbf{w})$.
The gradient of the error, given by the vector $\nabla_{\mathbf{w}} E$, points in the direction opposite to that of the steepest descent of the error function in weight space. It can, therefore, be used in the search for the minimum of the error function in weight space, by recursive updating of the weights:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau)} - \eta\, \nabla_{\mathbf{w}} E^{(\tau)} \qquad (5.33)$$
where $\eta$ is called the learning rate and $\tau$ denotes the iteration number. Expressing Eq. 5.33 for each weight gives:

$$w_i^{(\tau+1)} = w_i^{(\tau)} + \Delta w_i^{(\tau)} = w_i^{(\tau)} - \eta \left.\frac{\partial E}{\partial w_i}\right|^{(\tau)} \qquad (5.34)$$
For the reasons given in section 5.1.1 the cross-entropy error function is chosen to optimise the network
parameters. This error function is now written as a function of the weight vector:
$$E(\mathbf{w}) = \sum_n E^n(\mathbf{w}) = -\sum_n \sum_{k=1}^{K} t_k^n \ln \frac{y_k^n(\mathbf{w})}{t_k^n}$$
The derivatives of the cross-entropy error function with respect to the weights of the neural network can
easily be found by propagating back the error at the outputs towards the hidden and input layers as will
be shown below.
Derivatives of E with respect to the hidden-to-output weights
The output units' activation function (the softmax $g(\cdot)$ of Eq. 5.32) includes in its denominator the inputs $a_{k'}^{y}$ for all the outputs $y_{k'}$, so the weight $w_{jk}^y$ affects all the outputs. Therefore, all of the outputs should be considered when differentiating the error for pattern $n$ with respect to the output weights $w_{jk}^y$:

$$\frac{\partial E^n}{\partial w_{jk}^y} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y_{k'}^n} \frac{\partial y_{k'}^n}{\partial a_k^{y^n}} \frac{\partial a_k^{y^n}}{\partial w_{jk}^y} \qquad (5.35)$$
The first partial derivative on the right-hand side of Eq. 5.35 is:

$$\frac{\partial E^n}{\partial y_{k'}^n} = -\frac{t_{k'}^n}{y_{k'}^n} \qquad (5.36)$$
The second partial derivative can be found from Eq. 5.32:

$$\frac{\partial y_{k'}^n}{\partial a_k^{y^n}} = y_{k'}^n\, \delta_{k'k} - y_{k'}^n\, y_k^n \qquad (5.37)$$
and the last derivative in the chain is:

$$\frac{\partial a_k^{y^n}}{\partial w_{jk}^y} = z_j^n \qquad (5.38)$$
which does not depend on $k'$. Combining the first two partial derivatives and summing over $k'$:

$$\frac{\partial E^n}{\partial a_k^{y^n}} = \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y_{k'}^n} \frac{\partial y_{k'}^n}{\partial a_k^{y^n}} = y_k^n - t_k^n \qquad (5.39)$$
Then, substituting Eqs. 5.39 and 5.38 into Eq. 5.35 gives:

$$\frac{\partial E^n}{\partial w_{jk}^y} = \check{\delta}_k^{y^n} z_j^n \qquad (5.40)$$

where $\check{\delta}_k^y = y_k - t_k$.
Derivatives of E with respect to the input-to-hidden weights
We follow a similar procedure to find the derivatives of the error function with respect to the hidden-layer weights $w_{ij}^z$, this time noting that the sigmoid activation function of the unit $z_j$ only depends on the inputs to this unit:

$$\frac{\partial E^n}{\partial w_{ij}^z} = \sum_{k=1}^{K} \sum_{k'=1}^{K} \frac{\partial E^n}{\partial y_{k'}^n} \frac{\partial y_{k'}^n}{\partial a_k^{y^n}} \frac{\partial a_k^{y^n}}{\partial z_j^n} \frac{\partial z_j^n}{\partial a_j^{z^n}} \frac{\partial a_j^{z^n}}{\partial w_{ij}^z} \qquad (5.41)$$
The first two derivatives in Eq. 5.41 have already been found above, and are denoted by $\check{\delta}_k^{y^n}$. The third derivative is:

$$\frac{\partial a_k^{y^n}}{\partial z_j^n} = w_{jk}^y \qquad (5.42)$$

Using Eq. 5.3:

$$\frac{\partial z_j^n}{\partial a_j^{z^n}} = z_j^n (1 - z_j^n) \qquad (5.43)$$

The last derivative of the chain is:

$$\frac{\partial a_j^{z^n}}{\partial w_{ij}^z} = x_i^n \qquad (5.44)$$
Combining all of these derivatives according to Eq. 5.41 gives:

$$\frac{\partial E^n}{\partial w_{ij}^z} = z_j^n (1 - z_j^n)\, x_i^n \sum_{k=1}^{K} \check{\delta}_k^{y^n} w_{jk}^y \qquad (5.45)$$

$$= \check{\delta}_j^{z^n} x_i^n \qquad (5.46)$$
As can be seen in the above equation, the weight update for the input-to-hidden weights depends on the weight update for the hidden-to-output weights. The weight errors (the $\check{\delta}$'s) are propagated backwards to the preceding layer, hence the name given to the algorithm. A compact sketch of one training step is given below.
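The backpropagation equations fit into a few lines of linear algebra. The following sketch performs one batch gradient-descent step for the 2-layer MLP considered here (sigmoid hidden units, softmax outputs, cross-entropy error); the array layout, function names and learning rate are our own illustrative choices:

import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)       # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def train_step(X, T, Wz, Wy, eta=0.1):
    """One batch gradient-descent step for an I-J-K MLP with sigmoid hidden
    units, softmax outputs and cross-entropy error. X is (N, I+1) with a
    leading column of ones (the bias input x_0); T is (N, K), 1-of-K coded;
    Wz is (I+1, J) and Wy is (J+1, K), each including the bias weights."""
    Z = 1.0 / (1.0 + np.exp(-(X @ Wz)))        # hidden activations z_j (Eq. 5.3)
    Zb = np.hstack([np.ones((len(Z), 1)), Z])  # prepend the bias unit z_0 = 1
    Y = softmax(Zb @ Wy)                       # output probabilities y_k (Eq. 5.32)
    d_y = Y - T                                # output errors, Eq. 5.39
    d_z = (d_y @ Wy[1:].T) * Z * (1.0 - Z)     # backpropagated errors, Eq. 5.45
    Wy = Wy - eta * Zb.T @ d_y                 # hidden-to-output update, Eq. 5.40
    Wz = Wz - eta * X.T @ d_z                  # input-to-hidden update, Eq. 5.46
    return Wz, Wy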
Backpropagation for the two-class problem
As we have seen in section 5.1.1, for a 2-class problem only one output is needed, in which case the cross-entropy error function is slightly different and gives rise to a modified expression for the backpropagation "errors". Also, as there is only one output, the sigmoid activation function is used for all units in the network.
The part of the error function that depends on the weights is:

$$E(\mathbf{x}) = -\sum_n \left\{ t^n \ln y^n(\mathbf{w}) + (1 - t^n) \ln(1 - y^n(\mathbf{w})) \right\}$$
Differentiating with respect to the hidden-to-output weights $w_j$:

$$\frac{\partial E^n}{\partial w_j} = \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y^n}} \frac{\partial a^{y^n}}{\partial w_j} \qquad (5.47)$$

we find that:

$$\frac{\partial E^n}{\partial y^n} = \frac{y^n - t^n}{y^n (1 - y^n)} \qquad (5.48)$$

$$\frac{\partial y^n}{\partial a^{y^n}} = y^n (1 - y^n) \qquad (5.49)$$

$$\frac{\partial a^{y^n}}{\partial w_j} = z_j^n \qquad (5.50)$$

Then we get:

$$\frac{\partial E^n}{\partial w_j} = \check{\delta}^{y^n} z_j^n \qquad (5.51)$$

where $\check{\delta}^y = y - t$. This result is exactly the same as for the multiple-class problem.
To find the derivatives of $E$ with respect to the input-to-hidden weights:

$$\frac{\partial E^n}{\partial w_{ij}} = \frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y^n}} \frac{\partial a^{y^n}}{\partial z_j^n} \frac{\partial z_j^n}{\partial a_j^{z^n}} \frac{\partial a_j^{z^n}}{\partial w_{ij}} \qquad (5.52)$$

we find that the partial derivatives in the chain are:

$$\frac{\partial E^n}{\partial y^n} \frac{\partial y^n}{\partial a^{y^n}} = y^n - t^n \qquad (5.53)$$

$$\frac{\partial a^{y^n}}{\partial z_j^n} = w_j \qquad (5.54)$$

$$\frac{\partial z_j^n}{\partial a_j^{z^n}} = z_j^n (1 - z_j^n) \qquad (5.55)$$

$$\frac{\partial a_j^{z^n}}{\partial w_{ij}} = x_i^n \qquad (5.56)$$

Substituting all of them into Eq. 5.52 gives:

$$\frac{\partial E^n}{\partial w_{ij}} = \check{\delta}^{y^n} w_j\, z_j^n (1 - z_j^n)\, x_i^n \qquad (5.57)$$

$$= \check{\delta}_j^{z^n} x_i^n \qquad (5.58)$$
where $\check{\delta}_j^{z^n} = \check{\delta}^{y^n} w_j\, z_j^n (1 - z_j^n)$. This result is also exactly the same as for the multiple-class problem with $K = 1$. This is very convenient because it makes the backpropagation of errors independent of the number of classes in the problem.
5.2 Optimisation algorithms
5.2.1 Gradient descent
As mentioned above, the training of an MLP is performed by minimising the error function $E(\mathbf{w})$ in the weight space $\mathcal{W}$, formed by all the weights in the network, using the steepest descent method, also called gradient descent. This algorithm can be applied in batch fashion or sequentially. The batch version averages the $\Delta\mathbf{w}^n$ over all the patterns and then updates the weights; the sequential version updates the weights after each pattern is presented to the network. In either version a suitable value for the learning rate $\eta$ needs to be selected. A range for $\eta$ can be found by using a quadratic approximation of the error function around the minimum at $\mathbf{w}^*$:
$$E(\mathbf{w}) \approx E(\mathbf{w}^*) + \frac{1}{2}(\mathbf{w} - \mathbf{w}^*)^T \mathbf{H} (\mathbf{w} - \mathbf{w}^*) \qquad (5.59)$$

where $\mathbf{H}$ is the Hessian matrix of the error function, with elements $H_{ij} = \left.\frac{\partial^2 E}{\partial w_i \partial w_j}\right|_{\mathbf{w}^*}$ for $i, j = 1, 2, \dots, W$, and $W$ is the total number of weights.
The gradient of this approximation is:

$$\nabla_{\mathbf{w}} E = \mathbf{H}(\mathbf{w} - \mathbf{w}^*) \qquad (5.60)$$

The eigenvalue equation for the Hessian matrix is:

$$\mathbf{H}\mathbf{u}_i = \lambda_i \mathbf{u}_i \qquad (5.61)$$

where the eigenvectors $\mathbf{u}_i$ can be used as a basis in $\mathcal{W}$, so we can write:

$$\mathbf{w} - \mathbf{w}^* = \sum_i \alpha_i \mathbf{u}_i \qquad (5.62)$$
where $\alpha_i$ can be interpreted as the distance from the minimum in the $\mathbf{u}_i$ direction. Then, the gradient approximation can be written in terms of the eigenvectors of $\mathbf{H}$:

$$\nabla_{\mathbf{w}} E = \sum_i \alpha_i \lambda_i \mathbf{u}_i \qquad (5.63)$$
and so can the difference between the weights at two consecutive iterations of the algorithm:

$$\mathbf{w}^{(\tau+1)} - \mathbf{w}^{(\tau)} = \sum_i \left( \alpha_i^{(\tau+1)} - \alpha_i^{(\tau)} \right) \mathbf{u}_i = \sum_i \Delta\alpha_i\, \mathbf{u}_i \qquad (5.64)$$
But since $\Delta\mathbf{w} = -\eta\, \nabla_{\mathbf{w}} E^{(\tau)}$, then:

$$\sum_i \Delta\alpha_i\, \mathbf{u}_i = -\eta \sum_i \alpha_i^{(\tau)} \lambda_i \mathbf{u}_i \quad\Rightarrow\quad \alpha_i^{(\tau+1)} - \alpha_i^{(\tau)} = -\eta\, \alpha_i^{(\tau)} \lambda_i \quad\Rightarrow\quad \alpha_i^{(\tau+1)} = (1 - \eta\lambda_i)\, \alpha_i^{(\tau)} \qquad (5.65)$$
After $\tau_f$ steps from a starting point $\mathbf{w}_0$, with initial components $\alpha_i^{(0)}$:

$$\alpha_i^{(\tau_f)} = (1 - \eta\lambda_i)^{\tau_f}\, \alpha_i^{(0)} \qquad (5.66)$$
To reach the minimum, $\alpha_i$ should tend to zero as $\tau_f$ increases. Then, the condition on $\eta$ and the $\lambda_i$'s is:

$$|1 - \eta\lambda_i| < 1 \quad\Rightarrow\quad 0 < \eta\lambda_i < 2 \qquad (5.67)$$

for $i = 1, 2, \dots, W$.
It can be proved that if $\lambda_i > 0$ for all $i$, the minimum at $\mathbf{w}^*$ is a global minimum. This is true for a positive definite Hessian matrix. In this case, the condition in Eq. 5.67 gives the following range for $\eta$:

$$0 < \eta < \frac{2}{\lambda_{\max}} \qquad (5.68)$$
Note that in Eq. 5.66 the step size is constant around the minimum, imposing linear convergence towards the minimum. The convergence speed will be dominated by the smallest eigenvalue, so taking the maximum value allowed for $\eta$, we find that the size of the smallest step is:

$$1 - 2\frac{\lambda_{\min}}{\lambda_{\max}} \qquad (5.69)$$
The ratio $\lambda_{\min}/\lambda_{\max}$ is called the condition number of the Hessian matrix. If this ratio is very small (i.e. the curvature of the error function around the minimum differs greatly between directions), the convergence will be extremely slow. This problem can be overcome by increasing the effective step size, which is achieved by adding an extra term to the weight update equation.
Momentum
Adding to the weight update equation a term proportional to the previous change in the weight vector may speed up the convergence of $\mathbf{w}$ and smooth out the oscillations:

$$\Delta\mathbf{w}^{(\tau)} = -\eta\, \nabla_{\mathbf{w}} E\,\big|_{\mathbf{w}^{(\tau)}} + \mu\, \Delta\mathbf{w}^{(\tau-1)} \qquad (5.70)$$

where $\mu$ is called the momentum parameter. If the momentum rate is in the open interval (0, 1), the effect of adding momentum to the weight update in low-curvature error surfaces is an increase in the effective learning rate by a factor of:

$$\frac{1}{1 - \mu} \qquad (5.71)$$
However, in regions of large curvature the momentum term loses its effectiveness, and oscillations around
the minimum generally occur. In fact, the gradient descent rule used in the backpropagation algorithm
makes a very inefficient search for the minimum because, in practice, the error gradient does not point
towards the minimum most of the time, causing oscillations in the search for the minimum. Another disadvantage of this method is the inclusion of two parameters, $\eta$ and $\mu$, whose values are unspecified, with no formal criteria for choosing them. A minimal sketch of the update rule is given below.
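A minimal sketch of the update of Eq. 5.70; the values of η and μ are illustrative, since, as just noted, there is no formal criterion for choosing them:

def momentum_update(w, grad, prev_dw, eta=0.01, mu=0.9):
    """One gradient descent step with momentum, Eq. 5.70."""
    dw = -eta * grad + mu * prev_dw     # current step depends on the previous one
    return w + dw, dw                   # carry dw forward for the next iteration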
5.2.2 Conjugate gradient
If instead of moving w a fixed distance along the negative gradient, we look in this direction until we
find the minimum of E(w) and then set the point as the new weight vector, the size of the step becomes
optimum in the search direction. At the new point the component in the search direction of the error
gradient vanishes. If additionally, we choose the new searching direction as one that does not “spoil” the
minimisation achieved in the previous direction, i.e. that keeps the projection of ∇w E in the previous
direction null, and minimise again in this new direction, and repeat the procedure successively, we will,
after W steps, reach the minimum w∗ . The set of non-interfering (or conjugated) directions can be found
without any need for extra parameters as is shown in Appendix B, where this method is described in
detail. This represents a definite improvement over the gradient descent method, even if in practice the
convergence is achieved in more than W steps for general non-linear error functions.
As already stated, this algorithm has no unspecified parameters and in general converges much faster than gradient descent, but it requires the computation of the first and second order partial derivatives of the error function with respect to the weights. The backpropagation formulae for the first order derivatives still apply. To avoid the use of the Hessian in the computation of $\alpha_j$, a numerical line-search procedure can be used, or central differences can be used as an approximation to the Hessian. The latter is used in a modification of this algorithm called the scaled conjugate gradient algorithm, which is described next.
Scaled conjugate gradient
Apart from using an approximation to avoid the calculation of the Hessian, this algorithm overcomes the other two major drawbacks of the conjugate gradient algorithm, which arise when the error is far from being quadratic. A technique called the model trust region, based on a quadratic approximation of $E$, can be applied to make sure that every step leads to a lower error when the Hessian matrix is not positive definite (otherwise the error may increase with the step). Also, the quality of the quadratic approximation is tested at every step in order to adjust the parameter of the model trust method. The scaled conjugate gradient method is described in Appendix B.
5.3 Model order selection and generalisation
As stated in section 5.1.3, the number of hidden units J in an MLP determines the accuracy and degree
of generalisation that the neural network can achieve. As with any regression problem, too few free
parameters will fail to fit the function properly, while too many parameters will over-fit the noisy data. A
compromise between accuracy and generalisation has to be found. This can be compared to the trade-off
between the bias and the variance of the network. A neural network with zero bias produces zero error
on average to all the possible sets of patterns drawn from the same distribution as the training data.
Even in this case, if the neural network has a marked sensitivity to a particular set of patterns, then the
variance of the network is high. Unfortunately there is no formal means of relating the number of hidden
units to the bias and variance of the network. Roughly, the number of hidden units should be close to the
geometric mean of the number of inputs $I$ and the number of outputs $K$:

$$J = \sqrt{IK} \qquad (5.72)$$
One of the simplest ways to find the optimum $J$, although very demanding in computational resources, is to train a set of neural networks with a range of numbers of hidden units around the estimate given in Eq. 5.72. Because the error-optimising algorithm can get stuck in a local minimum, several random weight initialisations should be tried in order to increase the probability of finding a good minimum. The optimum is then selected based on the performance on an independent set of labelled input patterns, known as the validation set.
Another way to “optimise” the number of free parameters in the network is to set a sensibly high value of
J and then penalise the least relevant weights in the network with an appropriate cost function during
training. This method is called regularisation and it will be explained next.
5.3.1 Regularisation
Regularisation is a common technique in regression theory which aims to encourage a smoother fit
through the inclusion of a penalty term Ω in the error function:
$$\hat{E} = E + \nu\Omega \qquad (5.73)$$
where ν is a control parameter for the penalty term Ω. The penalty function should be such that a good
fit will produce a small error E, while a smooth fit will produce a small value for Ω.
It is well known heuristically that an over-fitted mapping produces large values of weights, whereas small
values of weights will drive the activation units mostly in the linear region of the activation function,
producing an approximately linear mapping, which is the smoothest possible. Therefore, a good function
for Ω would be one that increases as the magnitude of the weights increases. The simplest of these is the
sum-of-the-squares, commonly called weight decay:
$$\Omega = \frac{1}{2}\sum_i w_i^2 \qquad (5.74)$$
If a gradient descent procedure is applied to optimise the modified error function $\hat{E}$, whose gradient gains a term proportional to the weights:

$$\nabla_{\mathbf{w}} \hat{E} = \nabla_{\mathbf{w}} E + \nu\mathbf{w} \qquad (5.75)$$

then the variation of the weights in "time" due to the penalty term can be seen as:

$$\frac{d\mathbf{w}}{d\tau} = -\eta\nu\mathbf{w} \quad\Rightarrow\quad \mathbf{w}(\tau) = \mathbf{w}(0)\, e^{-\eta\nu\tau} \qquad (5.76)$$

which shows how, as a result solely of the influence of the penalty term, all the weights "decay" exponentially towards zero during training. It can easily be shown, using a second-order approximation of the error function, that the components of the weight vector along the directions of lowest curvature of $E$ in weight space are the most penalised by the regularisation term. This can be expressed as:
$$\hat{w}_j = \frac{\lambda_j}{\lambda_j + \nu}\, w_j^* \qquad (5.77)$$

where $\hat{\mathbf{w}}$ is the minimum of the error function with weight decay $\hat{E}$, $\mathbf{w}^*$ is the minimum of the original error function $E$, and $\lambda_j$ is an eigenvalue of the Hessian matrix $\mathbf{H}$ evaluated at $\mathbf{w}^*$, in a weight space aligned with the eigenvectors of $\mathbf{H}$. Therefore, weight decay will tend to reduce the value of the weights with less influence on the error function. The final result will be a smoother fit than the one achieved with $\mathbf{w}^*$.
The weight decay function in Eq. 5.74 is not consistent with a linear transformation performed on the
input data as it treats weights and biases on equal grounds. A bias unit is an additional unit in the input
and hidden layers of an MLP, with a permanent input value of 1, placed to compensate for the differences
between the yk ’s mean and the tk ’s mean. Considering weights from different layers separately and
excluding the bias unit weights from the regularising term will solve this consistency problem as in:
$$\Omega = \frac{\nu_z}{2} \sum_{w \in \mathcal{W}_z} w^2 + \frac{\nu_y}{2} \sum_{w \in \mathcal{W}_y} w^2 \qquad (5.78)$$

where $\mathcal{W}_z$ are the input-to-hidden weights except for the bias weights $w_{0j}$, and $\mathcal{W}_y$ are the hidden-to-output weights except for the bias weights $w_{0k}$.
5.3.2 Early stopping
Another way to prevent an MLP with a relatively high number of hidden units from over-fitting the training data is to stop the training process at a premature stage. This method, called early stopping, makes use of a validation set: training is stopped when the error on the validation set reaches a minimum, as shown in Fig. 5.5.
[Figure 5.5: Early stopping. The training error decreases monotonically with the iteration number $\tau$, while the validation error reaches a minimum at $\tau_v$ and then rises.]
5.3.3 Performance of the network
To evaluate the performance of the network the error function can be used, or its gradient. However,
given that the goal of training is to learn to discriminate between the classes of the training set in order
to perform interpolation of the class membership probabilities on test data, a more suitable measure of
performance could be the percentage of correctly classified patterns (accuracy) in a given dataset. For a
K-class problem and a dataset with Nk patterns from each class, the accuracy of a classifier is defined as:
$$A = 100 \times \frac{\sum_{k=1}^{K} N_k^c}{\sum_{k=1}^{K} N_k} \qquad (5.79)$$
where Nkc is the number of correctly classified patterns from class k. A more common measure in pattern
recognition is the classification error rate, defined as the percentage of misclassified patterns:
$$E_{\mathrm{rate}} = 100 - A \qquad (5.80)$$
(5.80)
If several networks are being evaluated, the optimal network should be chosen according to the final
validation error, while the performance of the selected network should be measured on data never seen
before, that being the purpose of the test set.
If several networks trained on the same data are very close to each other in terms of validation error, a
committee of networks can be formed, the output of this association being the average of the individual
outputs. It can be proved that a committee of networks statistically performs better, or at least not worse,
than the individual networks [19, pp. 366].
The training, validation and test sets should ideally be of equal size. The number of input patterns in the training set should be at least 10 times greater than the number of free parameters in the network [159, pp.70], a requirement sometimes difficult to meet, all the more so if an equal number of patterns has to be reserved for the validation and test sets. In this case, cross-validation, a technique usually applied in statistics as part of "jack-knife" estimation [109], can be applied, such that the data is split into $S$ subsets, or partitions, of equal size. Each of these subsets is in turn the test set for the neural network, while the remaining $S - 1$ subsets are used to form the training and validation sets, as sketched below. The $S$ neural networks obtained in this way can then be combined in a committee of networks. If the data is not plentiful enough for division into $S$ subsets, then the leave-one-out method, a variant of the jack-knife method whereby every subset consists of only one sample, can be used.
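A minimal sketch of the S-fold partitioning (the generator interface is our own choice):

import numpy as np

def s_fold_partitions(n_patterns, S, seed=0):
    """Split the pattern indices into S subsets of (nearly) equal size; each
    subset serves once as the test set while the remaining S-1 subsets are
    available to form the training and validation sets."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_patterns), S)
    for s in range(S):
        rest = np.concatenate([folds[r] for r in range(S) if r != s])
        yield rest, folds[s]

for train_val, test in s_fold_partitions(100, S=5):
    pass  # train one network per split; the S networks can form a committee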
5.4 Radial basis function neural networks
Another kind of neural network, which can be used to estimate posterior probabilities, is the so-called
Radial Basis Function (RBF) network. Its architecture, shown in Fig. 5.6, is very similar in appearance
to the MLP but its operation is very different. Firstly, the activation function of the hidden units is not a
weighted summation followed by a non-linearity. Instead, it is a radially symmetric function φj (usually
a Gaussian) with a different mean vector μj for each unit. In addition, the output units only perform
a linear combination of the hidden unit outputs, without applying any nonlinear function to the result.
Also, RBF training is different from that of an MLP, since it is performed in two phases instead of one, and
a nonlinear optimisation process is not required, the equations for minimising the quadratic output error
over the second-layer weights being linear.
[Figure 5.6: A radial basis function network. The inputs $x_1, x_2, \dots, x_I$ feed the hidden units $\Phi_1, \dots, \Phi_J$ (the radial basis functions), whose outputs are combined linearly through the weights $w_{jk}$ to give the outputs $y_1, \dots, y_K$.]
Originally developed to perform exact function interpolation, early RBF networks made a non-linear
mapping of N input vectors in an I-dimensional space to K target points in a 1-dimensional space,
through N radial basis φj (·) functions. In order to obtain better generalisation when fitting noisy data,
the number of basis functions was reduced, to a number significantly lower than the number of input
vectors. The resulting RBF has been widely used not only in noisy interpolation, but also in optimal
classification theory.
The mapped points or RBF outputs for a K-class problem are given by:

$$y_k = \sum_{j=1}^{J} w_{jk}\, \phi_j + w_{0k}, \qquad \text{for } k = 1, 2, \dots, K \qquad (5.81)$$
For the case of a Gaussian basis function, $\phi_j$ is defined as:

$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)\right) \qquad (5.82)$$

where $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ represent the mean and covariance matrix respectively. In an RBF network, the covariance matrix of the Gaussian basis functions can be considered to be of the form $\sigma^2 \mathbf{I}$ (hyper-spherical Gaussians) without loss of generality. In this case, Eq. 5.82 takes the form:

$$\phi_j(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right) \qquad (5.83)$$
The Gaussian functions in an RBF are un-normalised since any multiplier factors are absorbed in the
weights wjk in Eq. 5.81.
5.4.1 Training an RBF network
The first training phase is used to estimate the parameters of the basis functions φj (·) and no class
information is required for it, hence this phase is unsupervised. Once this phase is completed, the kernel
function activation is determined only by the distance between the input vector x and the mean vector
μj , and the kernel width σj . It can be shown [19] that, after this training phase, the summation of all the
radial basis function outputs is an estimate of the unconditional probability of the data p(x). Posterior
probabilities for each class, $P(C_k\,|\,\mathbf{x})$, are estimated at the outputs of the RBF network after the second phase of training, which adjusts the second-layer weights $w_{jk}$, this time using the target values $t_k$ of each input vector $\mathbf{x}$ in the training set. For this reason, the second phase is called supervised.
Unsupervised phase: cluster analysis
Unsupervised training can be viewed as a clustering problem. Each Gaussian kernel represents a group
of similar vectors in the I-dimensional input space. Since the objective of the initial phase of learning
is to model the unconditional probability density function, the clusters do not necessarily separate data
from different classes. To summarise, the aim of this phase is to find the location of the cluster centres
and the distribution of data within them in order to determine the mean and variance of each Gaussian
(hidden unit of the RBF network). One of the most common clustering algorithms, the so-called K-means algorithm [159], can be used for this purpose. The number of means $K$ is chosen to be equal to the number of hidden units $J$.
The K-means Algorithm This algorithm seeks to find a partition of the input data set into K regions or
clusters. Usually, the similarity criterion that defines a cluster is the distance between data (Euclidean in
most cases). The algorithm determines, for each of the K clusters, Ck , the location of its centre mk , and
identifies the patterns xi that belong to this cluster. In an iterative optimisation process, the partition is
modified so that the distances between the patterns belonging to a cluster and its centre are minimised.
This can be expressed as the optimisation of a quadratic error function defined by:
$$E_K^2 = \sum_{k=1}^{K} \sum_{\mathbf{x} \in C_k} \|\mathbf{x} - \mathbf{m}_k\|^2 \qquad (5.84)$$
Random values are initially assigned to the centres mk , and each data point is assigned to the cluster
with the centre nearest to it. Then, each centre mk is changed to be the mean of the data belonging to
the cluster Ck , reducing in this way the value of the error function defined in Eq. 5.84. These two last
steps are repeated until no significant change in centre positions is detected.
The procedure described above is known as the “batch” version of the K-means algorithm, since, at
every step, the centres are modified once all the patterns have been assigned to the clusters. There is an
“adaptive” version whereby the nearest centre is modified each time a pattern is considered, so that the
distance between them is reduced.
$$\mathbf{m}_k^{(\tau+1)} = \mathbf{m}_k^{(\tau)} + \eta\, (\mathbf{x} - \mathbf{m}_k^{(\tau)}) \qquad (5.85)$$
where $\eta$ is a learning-rate parameter. The adaptive version is a stochastic procedure, because the patterns are chosen from the data set randomly, and the algorithm is more prone to becoming trapped in a local minimum. A minimal sketch of the batch version is given below.
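A minimal sketch of the batch K-means procedure described above (initialising the centres from randomly chosen patterns is our own convention; the text does not prescribe one):

import numpy as np

def kmeans_batch(X, K, n_iter=100, seed=0):
    """Batch K-means: assign patterns to the nearest centre, then move each
    centre to the mean of its cluster; repeat until the centres are stable."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), K, replace=False)]   # random initial centres
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)                       # nearest-centre assignment
        new_centres = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centres[k] for k in range(K)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels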
The value for $K$ can be found by running the algorithm for $k = 1, 2, 3, \dots$ until a knee in the curve of $E_K^2$ versus $k$ is obtained. Typically this curve decreases monotonically (reaching zero for $k = N$), but the "knee" indicates a substantial change in the rate of decrease of the error function, which is large for small values of $k$ and smaller for values of $k$ above the "knee" value. The value of $k$ at the knee can be taken as the optimum value [159, pp.23].
Normalisation
Since Euclidean distance is used to set the location of the Gaussian kernels, differences
in dynamic range between features will cause the smallest ones to be ignored by the clustering algorithm.
To avoid this, zero-mean, unit-variance normalisation is applied to the entire data set of N patterns before
the unsupervised phase:
$$\tilde{x}_i^n = \frac{x_i^n - \mu_i}{\sigma_i} \qquad (5.86)$$
where:

$$\mu_i = \frac{1}{N}\sum_{n=1}^{N} x_i^n \qquad (5.87)$$

$$\sigma_i^2 = \frac{1}{N-1}\sum_{n=1}^{N} (x_i^n - \mu_i)^2 \qquad (5.88)$$
Then, a clustering procedure like the one described in the previous section is performed to find a set of
J centres. Once the unsupervised phase of the RBF training is complete, the cluster variance σj2 is found
for each cluster Cj :
$$\sigma_j^2 = \frac{1}{N_j} \sum_{\tilde{\mathbf{x}} \in C_j} (\tilde{\mathbf{x}} - \boldsymbol{\mu}_j)^T (\tilde{\mathbf{x}} - \boldsymbol{\mu}_j) \qquad (5.89)$$
where Nj is the number of patterns that belong to cluster Cj .
Second phase: linear optimisation of weights
The second training phase uses the labelled patterns in supervised learning mode. The output layer
receives the information from the hidden-unit outputs and it can be trained with a data set smaller than
the one used for the first training stage. The LMS algorithm is used to minimise the error function:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=0}^{J} w_{jk}\, \phi_j^n - t_k^n \right)^2 \qquad (5.90)$$
Differentiating the error with respect to the weights $w_{jk}$ and setting the result to zero to find the minimum gives:

$$\sum_{n=1}^{N} \left( \sum_{j'=0}^{J} w_{j'k}\, \phi_{j'}^n - t_k^n \right) \phi_j^n = 0, \qquad \text{for } j = 1, 2, \dots, J \text{ and } k = 1, \dots, K \qquad (5.91)$$
These equations are known as the normal equations, and have an explicit solution. Using matrix notation:

$$\boldsymbol{\Phi}^T \boldsymbol{\Phi} \mathbf{W}^T = \boldsymbol{\Phi}^T \mathbf{T} \qquad (5.92)$$

with the elements of the matrices defined as $(\mathbf{T})_{kn} = t_k^n$, $(\mathbf{W})_{jk} = w_{jk}$ and $(\boldsymbol{\Phi})_{jn} = \phi_j(\mathbf{x}^n)$. The solution is:

$$\mathbf{W}^T = \boldsymbol{\Phi}^\dagger \mathbf{T} \qquad (5.93)$$

where $\boldsymbol{\Phi}^\dagger$ represents the pseudo-inverse of $\boldsymbol{\Phi}$, given by:

$$\boldsymbol{\Phi}^\dagger \equiv (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \qquad (5.94)$$
for which $\boldsymbol{\Phi}^\dagger \boldsymbol{\Phi} = \mathbf{I}$ always, although $\boldsymbol{\Phi}\boldsymbol{\Phi}^\dagger \neq \mathbf{I}$ in general. When the data is noisy, it is very common to find that the matrix $(\boldsymbol{\Phi}^T \boldsymbol{\Phi})$ is nearly singular. In this case, the singular value decomposition (SVD) algorithm can be used to avoid large values for the weights $w_{jk}$, since it sorts out the round-off error accumulation problem and chooses, from the set of possible solutions, the one that gives the smallest values [131]. A minimal sketch of this training phase is given below.
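A minimal sketch of this second phase, assuming the centres and widths have already been found in the unsupervised phase; np.linalg.lstsq solves the normal equations via an SVD-based routine, in line with the recommendation above for noisy data:

import numpy as np

def rbf_train_output_layer(X, T, centres, sigmas):
    """Second (supervised) phase of RBF training: evaluate the Gaussian basis
    functions (Eq. 5.83) and solve for the output weights by least squares."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))    # (N, J) design matrix
    Phi = np.hstack([np.ones((len(X), 1)), Phi])        # column for the bias w_0k
    W, *_ = np.linalg.lstsq(Phi, T, rcond=None)         # W minimises Eq. 5.90
    return W

# For new data, build Phi the same way (including the bias column) and
# compute the outputs as Phi @ W, i.e. Eq. 5.81.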
5.4.2 Comparison between an RBF and an MLP
In general, the performance of an MLP is slightly better than that of an RBF. The reason is that the MLP's fully supervised non-linear optimisation is in general better than the RBF's unsupervised non-linear clustering process followed by a linear optimisation [159]. Advantages of the RBF are the shorter training time, since training does not require non-linear optimisation, and the lack of need for a validation set. The hidden layer representation of an RBF is also more accessible. Since it represents the
unconditional probability of the training set, it can be used as a novelty detector on new data, when all the
hidden units show very low activation, indicating that the RBF network is extrapolating, and therefore,
no confidence should be given to the result [159].
5.5 Data visualisation
One of the first stages in the solution of a classification problem usually consists of getting more insight
into the data structure. If the features extracted do not reveal enough separation between the classes
in the feature space, a search for new features should be considered. It is also desirable to obtain more
details such as inter-subject variability and incidence of outliers. The visualisation of the data distribution
for a number of features L less than or equal to 3 is straightforward, otherwise more sophisticated
procedures are required.
The relations of proximity and organisation of the data in a feature space of dimensionality higher than 3 can be visualised through a non-linear projection from $\mathbb{R}^L$ to $\mathbb{R}^M$, with $M$ typically 2 or 3. This is the basis of the Sammon map, which is described next.
5.5.1 Sammon map
Sammon's algorithm seeks to create a mapping such that the distances between the image points in the projection plane are as close as possible to the corresponding distances between the original data points in feature space. The following error function at iteration number $\tau$ is defined:

$$E^{(\tau)} = \frac{1}{\sum_{i}^{N} \sum_{j=i+1}^{N} \delta_{ij}^{(\tau)}} \sum_{i}^{N} \sum_{j=i+1}^{N} \frac{\left[d_{ij} - \delta_{ij}^{(\tau)}\right]^2}{\delta_{ij}^{(\tau)}} \qquad (5.95)$$
where $N$ refers to the number of vectors to be mapped, $d_{ij}$ to the Euclidean distances between the vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ in $L$-space, and $\delta_{ij}^{(\tau)}$ to the Euclidean distances between the corresponding vectors (images or projections) $\mathbf{y}_i^{(\tau)}$ and $\mathbf{y}_j^{(\tau)}$ in $M$-space:

$$d_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\| \qquad (5.96)$$

$$\delta_{ij} = \|\mathbf{y}_i - \mathbf{y}_j\| \qquad (5.97)$$
Minimisation of this error function can be achieved, starting from random locations of the image points, by adjusting them in the direction which gives the maximum change in the error function (the gradient descent method), as shown in Eq. 5.98:

$$y_{im}^{(\tau+1)} = y_{im}^{(\tau)} - \alpha\, \Delta_{im}^{(\tau)}, \qquad m = 1, \dots, M \qquad (5.98)$$

where:

$$\Delta_{im} = \frac{\partial E}{\partial y_{im}} \bigg/ \frac{\partial^2 E}{\partial y_{im}^2}, \qquad m = 1, \dots, M \qquad (5.99)$$
and the gradient proportionality factor $\alpha$ is determined empirically to be between 0.3 and 0.4 [141]. The partial derivatives are given by:

$$\frac{\partial E}{\partial y_{im}} = \frac{-2}{\sum_{k=1}^{N}\sum_{j=k+1}^{N} d_{kj}} \sum_{\substack{j=1 \\ j \neq i}}^{N} \left(\frac{d_{ij} - \delta_{ij}}{d_{ij}\,\delta_{ij}}\right) (y_{im} - y_{jm}) \qquad (5.100)$$

$$\frac{\partial^2 E}{\partial y_{im}^2} = \frac{-2}{\sum_{k=1}^{N}\sum_{j=k+1}^{N} d_{kj}} \sum_{\substack{j=1 \\ j \neq i}}^{N} \frac{1}{d_{ij}\,\delta_{ij}} \left[ (d_{ij} - \delta_{ij}) - \frac{(y_{im} - y_{jm})^2}{\delta_{ij}} \left(1 + \frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\right) \right] \qquad (5.101)$$
A small number of representative vectors can be extracted (using K-means clustering, for example) to reduce the number of computations, of order $O(N^2)$, required to complete the mapping. A minimal sketch of the update is given below.
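A minimal gradient-descent sketch of the mapping; for brevity it uses only the first derivative of Eq. 5.100 with a plain step size, omitting the second-derivative scaling of Eq. 5.99:

import numpy as np

def sammon(X, M=2, n_iter=50, alpha=0.35, seed=0):
    """Iteratively move the image points Y against the gradient of Eq. 5.100."""
    rng = np.random.default_rng(seed)
    N = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # L-space distances d_ij
    c = d[np.triu_indices(N, 1)].sum()                    # normalisation sum of d_kj
    np.fill_diagonal(d, 1.0)                              # avoid 0/0 on the diagonal
    Y = 1e-2 * rng.standard_normal((N, M))                # random initial image points
    for _ in range(n_iter):
        delta = np.linalg.norm(Y[:, None] - Y[None, :], axis=2)
        np.fill_diagonal(delta, 1.0)
        w = (d - delta) / (d * delta)                     # weights of Eq. 5.100
        np.fill_diagonal(w, 0.0)                          # exclude the j = i term
        grad = (-2.0 / c) * np.einsum('ij,ijm->im', w, Y[:, None] - Y[None, :])
        Y -= alpha * grad                                 # simplified Eq. 5.98 step
    return Y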
5.5.2 NeuroScale
The Sammon map's main drawback is that it acts as a look-up table. Previously unseen data cannot be located in the projection map without re-running the optimisation procedure. A parameterised transformation $\mathbf{y}_i = G(\mathbf{x}_i; \mathbf{w})$, where $\mathbf{w}$ is the parameter vector, would, however, allow the desired interpolation.
This parametric transformation can be performed by a neural network. During training, this neural
network has no fixed targets, the outputs and weights being adjusted to minimise an error, or “stress”
measure, related to the Sammon map error function and given by:
$$E = \sum_{i}^{N} \sum_{j=i+1}^{N} [d_{ij} - \delta_{ij}]^2 \qquad (5.102)$$
where the terms dij and δij are given by Eqs. 5.96. and 5.97. The training of such a neural network
is said to be relatively supervised as there is no specific output target, but a relative measure of target
separation between each pair {yi , yj }.
For an RBF with $H$ basis functions, the square of the image-space distance $\delta_{ij}$ can be expressed as:

$$\delta_{ij}^2 = \sum_{m=1}^{M} \left[ \sum_{h=1}^{H} w_{hm} \left( \phi_h(\|\mathbf{x}_i - \boldsymbol{\mu}_h\|) - \phi_h(\|\mathbf{x}_j - \boldsymbol{\mu}_h\|) \right) \right]^2 \qquad (5.103)$$
Then, the derivatives of the stress function with respect to the weights for each data point $\mathbf{x}_i$ are given by:

$$\frac{\partial E^i}{\partial w_{hm}} = \frac{\partial E^i}{\partial \mathbf{y}_i} \frac{\partial \mathbf{y}_i}{\partial w_{hm}} \qquad (5.104)$$
where:

$$\frac{\partial E^i}{\partial \mathbf{y}_i} = -2 \sum_{\substack{j=1 \\ j \neq i}}^{N} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right) (\mathbf{y}_i - \mathbf{y}_j) \qquad (5.105)$$
Note the difference between the derivative in Eq. 5.105 and the corresponding term for a supervised problem with a sum-of-squares error (see Eq. 5.90), the latter being given by:

$$\frac{\partial E^i}{\partial \mathbf{y}_i} = \mathbf{y}_i - \mathbf{t}_i \qquad (5.106)$$
Thus the relatively supervised training procedure has an estimated target vector $\hat{\mathbf{t}}_i$ given by:

$$\hat{\mathbf{t}}_i = \mathbf{y}_i - \frac{\partial E^i}{\partial \mathbf{y}_i} = \mathbf{y}_i + 2 \sum_{\substack{j=1 \\ j \neq i}}^{N} \left( \frac{d_{ij} - \delta_{ij}}{\delta_{ij}} \right) (\mathbf{y}_i - \mathbf{y}_j) \qquad (5.107)$$
However, the minimisation of the stress measure cannot be performed in one step as in the linear phase
of an RBF training (Eq. 5.93) because the estimated targets are not fixed, but depend upon the current
outputs yi and weights. Instead, the minimum can be sought in an iterative approach with an EM1 -like
procedure, which is more efficient than backpropagation in an MLP [163].
¹ Expectation-Maximisation [19, pp.65] is a two-step procedure for solving the highly non-linear, coupled equations of maximum likelihood optimisation problems.
To prevent an increase in the stress during the early stages of the algorithm, when the estimate of the targets is poor, a learning rate $\eta^{(\tau)}$ control is introduced in Eq. 5.107:

$$\hat{\mathbf{t}}_i^{(\tau)} = \mathbf{y}_i^{(\tau)} - \eta^{(\tau)} \frac{\partial E^{i(\tau)}}{\partial \mathbf{y}_i^{(\tau)}} \qquad (5.108)$$

where $\eta^{(\tau)}$ is initially set to a small value and is progressively increased as the stress decreases during training.
The training algorithm then becomes:
1. Initialise the weights to small random values
2. Initialise η1 to some small value
3. Calculate the pseudo-inverse matrix Φ†
4. Initialise τ = 1
5. Calculate the target vectors t̂i(τ) (Eq. 5.108)
6. Solve W(τ) = Φ† T̂(τ)
7. Calculate the stress
8. Adjust the learning rate:
   • If the stress has increased, decrease η
   • If the stress has decreased, increase η
9. If the stopping criterion is not satisfied, return to step 5
The increase and decrease in η in step 8 is arbitrarily set to a range of 10-20% [163]. If the stress measure is comparable to the final stress calculated in a standard Sammon mapping procedure then the algorithm can be stopped. This optimising procedure is called the shadow targets algorithm and the RBF described in this section has been referred to as NeuroScale [163].
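The shadow targets loop might be sketched as follows (an illustrative Python version; the Gaussian basis functions, the random placement of their centres and the width sigma are our own simplifying choices, and the 15% learning-rate adjustment sits within the 10-20% range quoted above):

    import numpy as np

    def neuroscale(X, M=2, H=20, n_iter=100, eta=0.01, sigma=1.0):
        """Shadow targets training of an RBF for topographic projection."""
        N = X.shape[0]                               # assumes N >= H
        rng = np.random.default_rng(0)
        mu = X[rng.choice(N, H, replace=False)]      # basis-function centres
        Phi = np.exp(-((X[:, None, :] - mu[None, :, :])**2).sum(2) / (2 * sigma**2))
        Phi_pinv = np.linalg.pinv(Phi)               # step 3: pseudo-inverse
        W = rng.normal(scale=1e-3, size=(H, M))      # step 1: small random weights
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # Eq. 5.96
        prev_stress = np.inf
        for _ in range(n_iter):                      # steps 5-9
            Y = Phi @ W
            delta = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)   # Eq. 5.97
            stress = ((d - delta)[np.triu_indices(N, 1)]**2).sum()   # Eq. 5.102
            coef = (d - delta) / np.where(delta > 0, delta, 1.0)
            np.fill_diagonal(coef, 0.0)
            grad = -2 * (coef[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(1)   # Eq. 5.105
            T_hat = Y - eta * grad                   # step 5: targets of Eq. 5.108
            W = Phi_pinv @ T_hat                     # step 6: linear solve
            eta *= 0.85 if stress > prev_stress else 1.15   # step 8
            prev_stress = stress
        return Phi @ W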
A caveat for the use of the techniques described above is that the lower dimensional projection generated
may show data overlap, which may not be present (or at least not in the same proportion) in the high
dimensional feature space.
5.6 Discussion
So far we have introduced the neural network approach to classification as a non-parametric method
for the estimation of the posterior probabilities of class membership. Non-parametric methods are more
flexible than parametric approaches and are easier to apply than semi-parametric methods, such as Gaussian Mixture Models [19, pp.60]. The probabilistic nature of the neural network outputs gives them an
advantage over other classifiers, like linear discriminants and support vector machines to mention a few
[169].
We presented two kinds of neural networks, the MLP and the RBF network, and stated that the first tends
to outperform the second one for the reasons given in §5.4.2 (see [175] for a comparison between an
MLP and an RBF performance in disturbed sleep analysis). A balanced dataset should be used to assign
the same relevance to all the classes. For MLP training, a validation and a test set should be reserved
from the balanced dataset in order to avoid over-fitting the training data. If the amount of data is not
sufficient to allow this partitioning, the leave-one-out method should be used.
The aim of the work presented in this thesis is to estimate the state of the brain in the sleep context (for μ-arousal detection) and within the vigilance alertness-drowsiness continuum. Although the data
is labelled according to six or seven discrete classes (see sections 3.2.1 and 3.4.2), the neural network is
capable of performing interpolation between classes. The 1-of-K code is recommended when the targets
are discrete, as is the case in both the sleep and vigilance problems. The cost function associated with
this coding scheme is the cross-entropy error function. Minimisation of the cost function can be achieved
efficiently by using the scaled conjugate gradients algorithm. The performance of the network may be
evaluated using the misclassification error as the criterion.
The trade-off between bias and variance of the network suggests that the search for an optimum network
architecture can be carried out by training several networks with different initial values for the network
parameters. Regularisation techniques can be applied to achieve better generalisation, and although the
values of the regularisation parameters cannot be found analytically, they can be included in an extensive
search for the best generalisation, the latter being evaluated as the classification performance on the
validation set.
The techniques known as the Sammon map and NeuroScale, introduced in this chapter, can be used to visualise the relations of proximity between patterns from different classes in the feature space, providing hints as to which classes should be used for neural network training, and helping to establish what might be expected from the neural network performance. These visualisation techniques can also help to rule out outliers, i.e. data from a distribution different from that of the training data, the NeuroScale map being particularly useful when analysing new data.
Chapter 6
Sleep Studies
Prior to the analysis of the sleep of patients with OSA, a deeper understanding of normal sleep should
be acquired. Previous work has shown that neural network methods for data visualisation and classification provide useful information on data structure and clustering [123], and these methods have been
successfully applied to sleep staging and tracking [143][123][16][158]. The EEG is the most significant
and reliable physiological measure of sleep, and is relatively easy to acquire as a signal. As we have
seen in Chapter 3, the sleep EEG of OSA sufferers has the same characteristics as that of normal subjects,
the difference being in the higher number of rapid transitions from sleep to wakefulness in OSA patients.
Therefore, it should be possible to use a neural network trained with normal sleep EEG data with the EEG
recorded from OSA patients. In this chapter we report on the training of MLP networks using a database
of normal sleep EEG records and investigate their subsequent performance on OSA sleep EEG (test data).
6.1 Using neural networks with normal sleep data: benchmark experiments
6.1.1 Previous work on normal sleep
In previous work [123], 10th-order AR modelling of 1s EEG segments and a visualisation technique
known as the Kohonen map [90] were applied to give an overall view in 2-D of the AR coefficients for
normal sleep EEG. Kohonen’s map is a self-organising algorithm which projects an entire data set or
input vectors from an L-dimensional space into a relatively few cluster centres or “code-vectors” laid
out on a mesh in a lower, M -dimensional space (usually M = 2), in such a way that the relations of
proximity (topology) between the input vectors are preserved1 . This work showed that there were three
well differentiated groups or clusters of data in the sleep EEG database, corresponding to the stages of
wakefulness, REM/light sleep (stage 1) and deep sleep (stage 4). Intermediate stages 2 and 3 did not
form separate clusters, but transient events such as K-complexes and spindles were mapped onto different
regions of the map. This phase of learning is unsupervised since no labels are taken into account when
constructing the Kohonen map, although labels are used later to identify the clusters. Based on the results
obtained with the Kohonen map, a neural network was trained with the same sleep EEG database, the
aim being to classify the sleep EEG into the 3 categories identified in the Kohonen map, by estimation of
the posterior probabilities of class membership. Results on test data showed that the plot of the neural
network outputs over time “tracks” the sleep-wake continuum with a better resolution than the R&K
discrete stages (as the neural network outputs can take any value between 0 and 1) and with a better
resolution in time since 1-s epochs are used to segment the EEG rather than the 30 seconds of the R&K
hypnograms. Fig. 6.1 shows the time course of the three neural network outputs (P(W ) for wakefulness,
P(R) for REM/light sleep, and P(S) for stage 4) for a 7-hour sleep recording. The main features of
the normal sleep-wake cycle can be seen in these plots. The P(W ) output takes a value close to 1 at
the beginning of the night, followed by a rapid descent to zero and remains at this level for about 40
minutes, while the P(R) output rises from zero to a value higher than 0.5 at the same time as the P(W )
output decreases, indicating a transition from fully awake to the first stage of sleep (sleep onset). The
P(S) output, which starts at zero, rises steadily as the P(W) and P(R) outputs decrease, and stays high for
the remaining 40 minutes of the first hour of the night. For the rest of the night, the P(R) and the P(S)
outputs wax and wane alternately, an indication of the 90-minute REM and non-REM sleep cycle, with a
progressive lightening of sleep as the night advances. When P(W ) is high (subject awake), P(S) is low,
since it is not physiologically possible that these two probabilities can exhibit a similar value, except when
both are near zero. In such a case the value P(R) must be high since the sum of the three probabilities
1 The main disadvantage of Kohonen's map relative to Sammon's map is that the image points are constrained to lie on a rectangular grid.
must be equal to one, indicating that the subject is in REM/light sleep. Hence, the Wakefulness output
P(W ) and the deep sleep output P(S) were combined in a measure of “sleep depth” P(W )-P(S), in which
the values of 1, 0 and -1 indicate wakefulness, REM/light sleep and deep sleep respectively. The trace
P(W)-P(S), shown at the bottom of Fig. 6.1, is similar to the R & K hypnogram, but with a continuous resolution in amplitude and a time resolution 30 times finer.
Figure 6.1: The neural network's wakefulness P(W), REM/light sleep P(R) and deep sleep P(S) outputs, and the measure of sleep depth P(W)-P(S) (from Pardey et al. [123])
The above work established the feasibility of using AR coefficients as the inputs to a neural network with 3 outputs to describe the sleep-wake continuum. A previous investigation [176] compared the use of 5th-order AR parameters with the power in five EEG bands when these were used as inputs to a neural network, and showed that they contribute the same information to the analysis of normal and disturbed
sleep. In this chapter, we build on these results in order to detect μ-arousals in the sleep of OSA subjects.
A 10th-order model is used, as Pardey et al. [123] found that some EEG segments corresponding to
wakefulness may be under-fitted with a lower model order. Since the effect of OSA on the EEG is only
to change the sleep structure rather than the EEG itself, we can use a neural network trained on normal
subjects to analyse the sleep of OSA patients. We re-visit the choice of algorithm for extracting coefficients
and aim to minimise variance whilst ensuring stationarity. We go beyond the work described in [123]
by carrying out a thorough investigation of network architecture and free parameters (including weight
decay coefficients) in order to identify the optimal network. This involves training more than 2,000
networks. Finally, we use the optimal network in order to detect μ-arousals in sleep EEG recordings
acquired from seven subjects with severe OSA.
6.1.2 Data Extraction
The normal sleep EEG from nine healthy female adults, with no history of sleep disorders, aged between
21 and 36 years (average 27.4), was recorded with electrode pair C4/A1, and digitised with an 8-bit
A/D converter and a sampling rate of 128 Hz. Prior to digitisation, the analogue EEG was filtered with a
bandpass filter (0.5–40 Hz with a −40 dB/dec slope in the transition band). Two EOG channels and the
submental EMG were also recorded for the purpose of generating R & K hypnograms.
The length of every record was approximately 8 hours. Each record was divided into 30s-segments, and
classified separately by three human experts, trained in the same laboratory, according to the R&K rules.
The number of 30s-segments for which the three experts were in agreement in their classification varied
from 137 for stage 1, which is a transitional stage that only lasts a few minutes (see section 3.2.1), to 2,665
for stage 2, the most abundant and easiest to score of all sleep stages. These segments are referred to as
being consensus-scored.
6.1.3 Feature extraction
Pre-processing
The sampled EEG signal was also digitally filtered with a low-pass linear-phase filter, with a cutoff frequency at 30 Hz, a passband gain of 1.00 ± 0.01 and −50 dB attenuation at 50 Hz, using a zero-phase-distortion filtering technique. The mean of each EEG recording (calculated over the whole record) was
removed.
Autoregressive Analysis
To apply AR modelling to the EEG segments, an investigation of the algorithms described in chapter 4
was undertaken. The relationship between segment data length and the bias and variance of the AR
coefficients was also studied. 911 and 5,214 consensus-scored 4s-segments of Wakefulness and Sleep
Stage 4 respectively were selected from the database, and their reflection coefficients (for an AR model
order 10) were estimated using the Burg algorithm. The means of these estimates, μ̂W and μ̂S , were
then used to synthesise “typical” Wakefulness and Sleep Stage 4 EEG signals. The mean reflection coefficients were transformed to 10th-order mean feedback coefficients (for definition see §4.6.1) by using the
inverse Levinson-Durbin recursion (Eq. 4.102). Following the procedure described in section 4.3.3 for
AR synthesis, an ensemble of 500 time series with length N was generated using a white noise generator
with unit variance and the AR feedback coefficients, and four algorithms, namely Burg, Covariance (Cov),
Modified Burg (ModBurg) and Structure Covariance Matrices (SCM), were used in turn to estimate the
AR reflection coefficients for each time series. The Euclidean distance between the mean of the estimates
for the ensemble μ̃ and the value used to generate the ensemble μ̂ was calculated, as well as the trace of
the covariance matrix of the estimates TrS̃. The results for N , a power of 2 from 16 to 512, are shown
in Tables 6.1 and 6.2, and show that there is little difference between the various algorithms, at least for
data lengths N ≥ 128.
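The synthesis-and-re-estimation experiment might be sketched as follows (an illustrative Python version showing the Burg estimator only, the other three algorithms being substituted in the same way; the "true" reflection coefficients, derived in the thesis from the consensus-scored EEG, are here any hypothetical stable set with all |k| < 1):

    import numpy as np

    def burg_reflection(x, order=10):
        """Burg estimates of the reflection coefficients of the series x."""
        f = np.asarray(x, float).copy(); b = f.copy()
        k = np.empty(order)
        for m in range(order):
            f, b = f[1:], b[:-1]            # align forward/backward errors
            k[m] = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
            f, b = f + k[m] * b, b + k[m] * f
        return k

    def reflection_to_ar(k):
        """Inverse Levinson-Durbin step-up: reflection -> feedback coefficients."""
        a = np.zeros(0)
        for km in k:
            a = np.concatenate([a + km * a[::-1], [km]])
        return a

    def synthesise(a, N, rng, settle=200):
        """AR synthesis driven by unit-variance white noise (cf. §4.3.3)."""
        e = rng.standard_normal(N + settle)
        x = np.zeros_like(e)
        for n in range(len(e)):
            past = x[max(0, n - len(a)):n][::-1]   # x[n-1], x[n-2], ...
            x[n] = e[n] - np.dot(a[:len(past)], past)
        return x[settle:]

    def ensemble_stats(k_true, N, runs=500, seed=1):
        rng = np.random.default_rng(seed)
        a = reflection_to_ar(k_true)
        K = np.array([burg_reflection(synthesise(a, N, rng), len(k_true))
                      for _ in range(runs)])
        mean_err = np.linalg.norm(K.mean(axis=0) - k_true)   # distance to the mean
        tr_cov = np.trace(np.cov(K, rowvar=False))           # trace of covariance
        return mean_err, tr_cov

    # e.g. a hypothetical stable coefficient set, with N = 256:
    print(ensemble_stats(np.full(10, -0.2), 256, runs=50))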
mean error
N          16        32        64        128       256       384       512
Burg       0.3814    0.1714    0.0952    0.0517    0.0272    0.0204    0.0155
Cov        -(a)      0.6402    0.1260    0.0521    0.0270    0.0204    0.0154
ModBurg    3.9057    0.7073    0.0947    0.0520    0.0276    0.0204    0.0155
SCM        0.9083    0.1783    0.0990    0.0529    0.0277    0.0204    0.0156

covariance matrix trace
N          16        32        64        128       256       384       512
Burg       0.6267    0.2000    0.0929    0.0449    0.0207    0.0143    0.0105
Cov        -(a)      395       4.4268    0.0500    0.0219    0.0148    0.0107
ModBurg    7897      225       0.0958    0.0451    0.0209    0.0144    0.0106
SCM        1.5395    0.2175    0.0955    0.0455    0.0210    0.0144    0.0106

(a) The Covariance algorithm requires the data length to be at least twice the model order.

Table 6.1: Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (wakefulness)

mean error
N          16        32        64        128       256       384       512
Burg       0.3410    0.1407    0.0719    0.0405    0.0189    0.0128    0.0094
Cov        -(a)      7.2100    0.5122    0.0406    0.0187    0.0128    0.0093
ModBurg    2.1033    1.2879    0.0707    0.0399    0.0188    0.0128    0.0094
SCM        0.8055    0.1410    0.0729    0.0405    0.0189    0.0127    0.0094

covariance matrix trace
N          16        32        64        128       256       384       512
Burg       0.5482    0.1719    0.0809    0.0385    0.0189    0.0127    0.0096
Cov        -(a)      27595     68.6      0.0428    0.0195    0.0129    0.0097
ModBurg    2620      839       0.0837    0.0390    0.0190    0.0127    0.0096
SCM        1.5370    0.1861    0.0829    0.0391    0.0190    0.0127    0.0096

(a) The Covariance algorithm requires the data length to be at least twice the model order.

Table 6.2: Mean error and trace of covariance matrix for synthesised EEG reflection coefficients (stage 4)

The Burg algorithm has lower computational cost than the others and so was chosen to estimate the
reflection coefficients of a 10th-order AR model of the EEG data. Figure 6.2 illustrates the results of the
Burg algorithm on the synthesised EEG with N going from 16 to 1,024. It can be seen that the accuracy and the variance of the estimates improve significantly once N reaches 256.
As was discussed in §4.7, stationarity concerns suggest that segments should not be greater than 1 second
(128 samples) when analysing the EEG. But the results here show that the variance of the estimates can
be reduced by increasing the length of the time series segment to 256 or greater. A compromise was found
by using a 384-sample sliding window (corresponding to 3 seconds), which is advanced in one-second
steps (128 samples). Each set of reflection coefficients is then taken to represent the middle second of
the 3-second window.
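This windowing scheme might be sketched as follows (the Burg estimator is repeated from the earlier sketch so that the fragment is self-contained; function and parameter names are ours):

    import numpy as np

    def burg_reflection(x, order=10):
        f = np.asarray(x, float).copy(); b = f.copy()
        k = np.empty(order)
        for m in range(order):
            f, b = f[1:], b[:-1]
            k[m] = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
            f, b = f + k[m] * b, b + k[m] * f
        return k

    def eeg_features(eeg, fs=128, win_s=3, step_s=1, order=10):
        """One 10-coefficient vector per second; each vector represents the
        middle second of its 3-second (384-sample) analysis window."""
        win, step = win_s * fs, step_s * fs
        return np.array([burg_reflection(eeg[s:s + win], order)
                         for s in range(0, len(eeg) - win + 1, step)])

    # e.g. one minute of surrogate "EEG" sampled at 128 Hz:
    features = eeg_features(np.random.default_rng(0).standard_normal(128 * 60))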
Figure 6.2: Mean error and covariance matrix trace for reflection coefficients computed with the Burg algorithm (wakefulness and Sleep stage 4) vs data length N
6.1.4 Assembling a balanced database
The same number of segments for each of the categories (wakefulness (W), REM (R) and sleep stage 4
(S)) was randomly taken from the overall database to build a balanced data set using only consensus-
scored segments, the overall number being determined by the minimum number available in any one
class. In sleep studies, it is obviously the wakefulness set which will be the smallest, with only 164 30-s
segments, yielding 4,920 one-second segments; hence 4,920 sets of 10 reflection coefficients were
assembled for each class. From now on, this dataset will be referred to as the balanced sleep dataset.
6.1.5 Data visualisation
The Sammon map and NeuroScale visualisation techniques (see section 5.5) were applied in order to
gain insight into the clustering present in the data. The reflection coefficients were normalised to give
a zero mean and unity standard deviation in each axis of the feature space. This gives each coefficient
equal importance a priori.
K-means algorithm
Given that the amount of data (14,760 data points) is too large to be handled by the visualisation algorithm, per-class clustering using the K-means algorithm was applied with 60 means per class and an η
factor (see Eq. 5.85) of 0.02.
Sammon Map
A 2D Sammon map algorithm was applied to the 180 mean vectors generated by the K-means algorithm.
The gradient proportionality factor α (see Eq. 5.98) was adjusted to a value of 0.06. The Sammon map
for the three classes and for each class separately is shown in Fig. 6.3, with a circle around each centre
whose radius indicates the relative size of the cluster represented by the centre.
It can be seen from the map that the classes form well defined clusters with some overlap between them.
The Wakefulness cluster is the most sparse, whilst the REM/light sleep cluster lies between wakefulness
and deep sleep, as expected.
Figure 6.3: Sammon map for the balanced sleep dataset: (a) classes W, R and S together; (b) W class; (c) R class; (d) S class
NeuroScale
A NeuroScale neural network with 50 basis functions was trained with the same reduced data set for
comparison and also to explore the overall distribution of the data points, as the advantage introduced
by this visualisation technique is that data not seen before, but belonging to the training data distribution, can be mapped onto the trained visualisation map (see §5.5.2). The map of the centres and the
subsequent projection of all the data points in the balanced feature set are shown in Fig. 6.4.
6.1.6 Training a Multi-Layer Perceptron neural network
A multi-layer perceptron was chosen over a radial basis function neural network because MLPs tend to
perform slightly better than RBF networks, when the latter are trained using the two-phase process described in §5.4. The balanced dataset was divided into 3 balanced subsets of 4,920 data points each
(1,640 per class), namely the training set, validation set and test set (see introductory section and section 5.3 in chapter 5). Although all the inputs xi have absolute values equal to or lower than 1, they have
Figure 6.4: NeuroScale map for the balanced sleep dataset (classes W, R and S; 60 means per class, 50 basis functions, 500 iterations): (a) K-means centres only; (b) all patterns
different dynamic ranges. In theory, this does not affect the MLP training, since the weights are capable of
correcting the differences in dynamic ranges, but certain optimisation procedures, such as regularisation
(see Eq. 5.78), require equal range of variations at the inputs to the neural network. Hence, zero-mean
and unit-variance normalisation was performed on the three subsets, using the training set statistics (μ,
σ). Normalisation also helps to reduce neural network training time [159, pp.84]. The 3 outputs of the
MLPs represent the classes W, R or S (1-of-K coding with softmax activation post-processing). Cross-entropy error was selected as the cost function for the scaled conjugate gradients optimisation algorithm
for network training.
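As an illustration of this output coding and cost function (the scaled conjugate gradients optimiser itself is omitted, and the activations below are arbitrary stand-ins for real network outputs):

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max(axis=1, keepdims=True))    # numerically stabilised
        return e / e.sum(axis=1, keepdims=True)

    def cross_entropy(Y, T):
        """Y: softmax outputs (n x 3); T: 1-of-K targets (n x 3)."""
        return -np.sum(T * np.log(Y + 1e-12))

    labels = np.array([0, 2, 1])                        # e.g. classes W, S, R
    T = np.eye(3)[labels]                               # 1-of-K coding
    A = np.random.default_rng(0).normal(size=(3, 3))    # placeholder activations
    print(cross_entropy(softmax(A), T))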
Optimising the network architecture
As was explained in chapter 5 there is no analytical means of determining the optimal value of MLP
parameters such as the number of hidden units J or the weight decay factors νz and νy . Although we
know that the regularising parameters νz and νy penalise an “excessive” number of hidden units, the
limits for this number are unknown. Therefore, we evaluate the performance on the validation set of a
number of MLPs trained with values of these parameters varying over a given range in order to find the
“optimal” MLP architecture.
Equation 5.72 suggests that the number of hidden units J should be approximately the geometric mean of the number of inputs and the number of outputs, i.e. √(10 × 3) ≈ 5.5 here. J was therefore varied from 4 to 10. No guideline is available for the regularisation parameters, and so these were varied from 10−6 to 10−2 in powers of ten. To avoid being trapped in local minima, a stochastic optimum search was performed by shuffling the patterns using five random seeds when allocating them to the training, validation and test sets. In addition, three different random weight initialisations were employed. This yields the following total number of networks:

    5 shuffling seeds for the training, validation and test sets
    × 3 weight initialisations
    × 7 values of J
    × 5 values of νz
    × 5 values of νy
    = 2,625 networks
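The sweep itself amounts to an exhaustive enumeration, sketched below (train_and_validate is a hypothetical function standing for one training run returning the validation misclassification rate):

    import itertools

    shuffle_seeds = range(5)
    weight_seeds = range(3)
    hidden_units = range(4, 11)                      # J = 4, ..., 10
    decays = [10**-p for p in range(2, 7)]           # 1e-2, ..., 1e-6

    grid = list(itertools.product(shuffle_seeds, weight_seeds,
                                  hidden_units, decays, decays))
    assert len(grid) == 2625                         # 5 x 3 x 7 x 5 x 5 networks

    # results = {cfg: train_and_validate(*cfg) for cfg in grid}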
The results show little performance variation with respect to weight initialisation, no more than 0.9% in
the difference between the best and the worst classification error. The variation in the classification error
for the data shuffling into the three datasets is not greater than 1.5% for the training and test sets, and
less than 2% for the validation set. Figure 6.5 shows the relationship between the number of hidden units
and the average performance of the networks (averaged over all data partitioning seeds, weight seeds and regularisation terms).
Figure 6.5: Average performance (% misclassification on the training, validation and test sets) of the MLPs vs number of hidden units
It can be seen from the plot that the performance for the training set improves monotonically with an
increase in the number of hidden units. But the percentage of misclassifications in the validation set
6.1 Using neural networks with normal sleep data: benchmark experiments
126
(and also the test set) reaches a minimum for J = 6 and then shows a slight increasing trend for larger
numbers of hidden units. The three neural networks which produce the best classification performance
on the validation set were all generated using the same shuffling seed, but have different initial values of
weights. The smallest of the three has a 10-6-3 architecture (see Table 6.3). Therefore, the 10-6-3 MLP
with νz = 10−4 and νy = 10−5 was chosen as the optimal network. Incidentally, this 10-6-3 MLP has the
best performance on the test set of the three optimal MLPs.
J    νz       νy       training   validation   test
8    10−3     10−6     6.63%      5.75%        6.83%
7    10−3     10−6     6.54%      5.75%        6.79%
6    10−4     10−5     6.63%      5.75%        6.28%

Table 6.3: Misclassification error (expressed as a percentage) for the best three MLPs
Figure 6.6 shows the performance of the 10-6-3 MLP for the whole range of (νz , νy ) parameters. The
training set performance increases as the values of (νz , νy ) decrease. However, the validation set performance (and also that of the test set) shows an increase in the percentage of misclassifications as the
values of (νz , νy ) are simultaneously decreased, the minimum being located at the (10−4 , 10−5 ) point in
the (νz , νy ) plane. A similar trend is found for the rest of the trained MLPs, with the exception of three
10-8-3 MLPs (same set shuffling seed and weight initialisation seed), from all the 2,625 MLPs, which were
the only ones to get stuck in a local minimum, with a percentage of misclassification equal to 66.3%.
It is clear that MLP performance on the training set tends to improve as parameters are moved towards
their extreme values. But the validation set performance also reveals that the MLP is being over-trained
as the number of hidden units is increased, or as the amount of regularisation is decreased. These trends
are all related, as the regularisation parameters penalise the non-relevant weights, compensating for an
excessive amount of hidden units.
Figure 6.6: Performance (% misclassification) of the 10-6-3 MLP vs the regularisation parameters (νz, νy), for the training, validation and test sets

6.1.7 Sleep analysis using the trained neural networks

The misclassification error on the test set, in Table 6.3, only shows how well an MLP trained using "well-defined", consensus-scored EEG segments from the three main stages of the sleep-wakefulness continuum performs on data with the same "well-defined" characteristics. In order to test the performance
of the MLP with more general and “noisy” data (still drawn from the same distribution), the optimal
MLP was used to process an overnight record from one of the subjects in the sleep database (subject ID
9). The 10 reflection coefficients extracted from the EEG were presented to the MLP consecutively, on a
second-by-second basis (using a 3-second window with 2-second overlap). The results for the 3 outputs,
the probability estimates P(W ), P(R) and P(S) are shown in Fig. 6.7. As expected, the night starts with a
high value for P(W ), and then this value decreases progressively, while the P(S) value increases. When
the P(R) output rises, the P(S) value decreases, suggesting that the subject has a REM or light sleep
period2 .
Figure 6.7: MLP outputs P(W), P(R) and P(S) for subject 9's all-night record: (a) all-night time courses; (b) a 12-minute segment in detail
Using the representation of the sleep-wake continuum described in [123] and in section 6.1.1, we compare the “depth of sleep” [P(W )-P(S)] with the hypnogram generated by a human expert in Fig. 6.8(a)
and (c). The extreme values (-1,+1) indicate the deep sleep and fully awake states respectively, and the
middle value (0) indicates REM/light sleep.
The spikes in the [P(W)-P(S)] output have two different causes. In the first instance, the MLP output is generated on a second-by-second basis, while experts score sleep on a 30-s basis, and so some of the spikes in the MLP output show the short-time variations of the sleep-wake process. The second cause of
the spikes is the variability of the AR estimates and the possible overlap between the classes as shown by
the 2D projections in §6.1.5. To minimise the first of these effects when comparing with the 30-s epoch
2 It is not possible to distinguish between REM and light sleep on the basis of the EEG alone.
hypnogram, a 31-point median filter is applied to [P(W )-P(S)] for comparison with the hypnogram (see
Fig. 6.8(b)3 ).
Figure 6.8: Sleep database subject 9 P(W)-P(S), raw (a) and 31-pt median filtered (b), compared to the human expert scored hypnogram (c)
The correlation between these two plots is excellent. The [P(W )-P(S)] output shows an initial value of 1
for the first 20-minute interval, in agreement with the human expert, who scored it as wakefulness. Then,
the [P(W )-P(S)] output shows three slow oscillations between −1 and 0, which match the transitions
from deep sleep (stage 4) to REM/light sleep (stage 1) of the hypnogram and back. During the intervals
in which the [P(W )-P(S)] output has a well defined mean at a value of -1 the hypnogram indicates sleep
3 Label "M" in the hypnogram stands for movement.
stage 4. Also, the intervals scored by the human expert as REM/sleep stage 1 (or light sleep) correspond
closely to those in which the [P(W )-P(S)] output has a near-zero mean. It is interesting to note that
some of the remaining spikes in the filtered [P(W )-P(S)] correspond to periods of movement. Movement
generally induces high frequencies in the EEG, which can be indistinguishable from β rhythm once the
EEG has been low-pass filtered (see §3.1.4), and hence are categorised by the MLP as wakefulness. The
intervals corresponding to intermediate stages 2 and 3 in the hypnogram are not very stable, nor is the
[P(W )-P(S)] output, which shows the most pronounced local oscillations during these intervals.
6.2 Using the neural networks with OSA sleep data
6.2.1 Data description, pre-processing and feature extraction
Sleep EEG recordings from seven subjects with severe OSA (provided by the Osler Chest Unit, Churchill
Hospital, Oxford), with apnoea/hypopnoea index (AHI) higher than 30/h, were analysed in order to
detect the occurrence and length of the micro-arousals. The Fp1/A2 or Fp2/A1 electrode pair was used
instead of the C4/A1 montage to facilitate the recognition of the arousals by the human experts [156].
Other electrophysiological measures like EOG, chin EMG, nose and mouth airflow, ribcage and abdominal
movements, and oxygen saturation, were also taken to aid the experts in the identification of the breathing events. The length of the records varies from 32 to 61 minutes, but in all of them only 20 consecutive
minutes were scored according to standard American Sleep Disorders Association (ASDA) rules [11] (see
§3.3.2).
The OSA sleep EEG was sampled and pre-processed in the same way as the normal sleep data (see §6.1.3).
Autoregressive analysis with model order 10 was applied to the EEG recordings using the Burg algorithm
and a sliding window as described in §6.1.3. The patterns consisting of 10 reflection coefficients for each
second were stored as an OSA test set, with each recording being processed as a continuous sequence of
patterns.
6.2.2 MLP analysis
Normalisation was carried out on the OSA patterns using the normal sleep training set statistics. The
normalised OSA test set of patterns was then presented to the 10-6-3 MLP selected in §6.1.6, which had
been trained with the normal sleep data. Figure 6.9 shows the MLP outputs for 20 minutes of processed
EEG from two representative subjects in the OSA database, with ID number 3 and 8. The [P(W )-P(S)]
output is shown in Fig. 6.10 for each subject (upper and middle traces). Twenty minutes of [P(W )-P(S)]
for sleep subject 9 from the normal sleep database, chosen from her second hour of sleep, during the
transition from deep sleep to REM sleep, are shown at the bottom of Fig. 6.10 for reference. None of the
outputs shown in Figs. 6.9, 6.10 or in the subsequent figures in this chapter has been median-filtered.
Figure 6.9: OSA sleep MLP outputs for subjects 3 and 8
The oscillating nature of the [P(W )-P(S)] output shown in Fig. 6.10 compared with its counterpart in
normal sleep suggests that the sleep cycle in the OSA database is severely disrupted, with frequent (more
than 1 per minute) transitions from deep sleep to wakefulness for brief periods of time.
6.2.3 Detection of μ-arousals
According to the ASDA rules, a non-REM4 sleep μ-arousal is defined as an EEG shift in frequency lasting
3 seconds or more [11]. Given that the MLP has been trained to detect the changes in the EEG frequency
4 Submental (chin) EMG is necessary to score a μ-arousal in REM sleep. Given that we are only using the EEG, this study is restricted to non-REM sleep events.
Figure 6.10: [P(W)-P(S)] output for OSA sleep subjects 3 (top) and 8 (middle), and for normal sleep subject 9 (bottom)
associated with sleep, μ-arousals can be detected from the [P(W )-P(S)] output by applying a threshold
and discarding transitions which last for less than 3s. The ASDA rules also treat two consecutive μ-arousals separated by less than 10 seconds as the same event. Therefore, we can automate the μ-arousal
scoring process according to the ASDA rules by removing pulses (thresholded [P(W )-P(S)] output) whose
duration is less than 3s and by merging two pulses which are separated by less than 10s. The automated
μ-arousal detection procedure applied to 3 minutes of [P(W )-P(S)] output from OSA subject 3 is shown
in Fig. 6.11, with a threshold of 0.5.
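This post-processing might be sketched as follows (the relative order of the two operations is not fully specified above; this version merges close pulses first, which also fills in short negative-going dips, and then discards the remaining short pulses):

    import numpy as np

    def score_microarousals(pws, threshold=0.5, min_len=3, merge_gap=10):
        """Binary per-second μ-arousal score from the [P(W)-P(S)] output."""
        binary = (np.asarray(pws) > threshold).astype(int)
        edges = np.diff(np.concatenate([[0], binary, [0]]))
        starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
        merged = []
        for s, t in zip(starts, ends):       # merge pulses < 10 s apart
            if merged and s - merged[-1][1] < merge_gap:
                merged[-1][1] = t
            else:
                merged.append([s, t])
        out = np.zeros_like(binary)
        for s, t in merged:
            if t - s >= min_len:             # discard pulses shorter than 3 s
                out[s:t] = 1
        return out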
Events marked as “A” in Fig. 6.11 (middle trace) are pulses shorter than 3s, while those shown as “B” are
negative-going transitions also shorter than 3s. These two types of events have been removed from the
final output (lower trace). An event denoted by the letter “C” corresponds to two pulses separated by
less than 10s. These are considered to be the same event, according to the ASDA rules and they therefore
Figure 6.11: μ-arousal detection procedure. Upper trace: [P(W)-P(S)] and a 0.5 threshold; middle trace: thresholding result; lower trace: automatic μ-arousal score with the ASDA timing criteria
appear merged in the final μ-arousal output on the lower trace.
To evaluate the performance of this μ-arousal detector, the final output was compared with the μ-arousals
scored by the human expert (visual scoring). A true positive is found when both the visual and the
automatic scores agree on the occurrence of an event (logical AND equal to 1) as is shown in Fig. 6.12
for OSA subject 2. A false positive is an event only scored by the automated system (post-processed
[P(W )-P(S)] output). False negatives are the events missed by the automated system, scored only by the
expert using the visual method.
In the case of multiple detection of a single event only one true positive is counted, as can be seen in the
middle trace of Fig. 6.12 for the 3rd and 4th pulses. These two automated scored events match the second
visually scored μ-arousal but are considered as a single true positive. The dip between the two pulses is
Figure 6.12: μ-arousal validation. Upper trace: automated score for a 0.7 threshold; middle trace: automated score for a 0.8 threshold; lower trace: visually scored signal
not counted as a false negative. This is an arbitrary decision, introducing a bias in favour of the automated
system, but it is taken to facilitate comparison between different thresholds (see section 6.2.4).
The performance of the automated μ-arousal detector was assessed by estimation of the ratios known as
sensitivity (Se) and positive predictive accuracy (PPA) [138], given by:

    Se = P(\text{an event has been detected} \mid \text{an event has occurred}) \approx \frac{TP}{TP + FN}        (6.1)

    PPA = P(\text{an event has occurred} \mid \text{an event has been detected}) \approx \frac{TP}{TP + FP}        (6.2)

where TP is the number of true positives, FP the number of false positives and FN the number of false negatives.
Se indicates the ability of the method under test to detect events, while PPA represents the selectivity of the method, i.e. the ability to pin-point only the true events. A low value for the PPA indicates a large number of false detections. The ideal detector would have Se and PPA values equal to 1.0, since neither false negatives nor false positives would occur.
Although performance measures such as Se and PPA can give an idea of how many events are identified by the automated system, they do not provide any indication of the relative timing between the events scored by the automated system and those scored by the human expert. This is illustrated in Fig. 6.12, where two sets of scores generated from the [P(W)-P(S)] output (thresholds 0.7 and 0.8) with the same number of true positives (TP), false positives (FP) and false negatives (FN), and hence the same Se and PPA, are compared with the human expert scores. The thick grey lines under each signal indicate
the segments for which there is an exact match between the automated and the human expert scores.
The first true positive found by the automated system with a threshold of 0.7 (upper trace on Fig. 6.12)
has a similar duration and starting time as the visually scored event (lower trace). This is no longer
true when the threshold is given a value of 0.8 (middle trace). Other examples can be found later: see
the second, fourth and fifth events. For this reason, the correlation measure given below is used as an
additional indicator of the performance of the automated μ-arousal detector.
    Corr = 1 - \frac{1}{N} \sum_{i=1}^{N} \left( y_{ns}(i) \oplus y_{hs}(i) \right)        (6.3)

where y_ns(i) represents the [P(W)-P(S)] output at time i seconds, thresholded and with pulses shorter than 3s filtered out, y_hs(i) represents the human scores, the ⊕ sign denotes the binary "exclusive OR" operation and N is the duration in seconds of the two sequences.
For the two sequences shown in Fig. 6.12 (thresholds of 0.7 and 0.8) the correlation indices have values
of 0.83 and 0.75 respectively.
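For two binary per-second sequences these measures might be computed as follows (a simplified sketch: a human-scored event overlapping at least one automated event counts as a single true positive, as described above, but no further timing bookkeeping is attempted):

    import numpy as np

    def _events(score):
        """(start, end) indices of the runs of ones in a binary sequence."""
        edges = np.diff(np.concatenate([[0], score, [0]]))
        return list(zip(np.where(edges == 1)[0], np.where(edges == -1)[0]))

    def se_ppa_corr(auto, human):
        auto, human = np.asarray(auto).astype(int), np.asarray(human).astype(int)
        A, Hm = _events(auto), _events(human)
        overlap = lambda e, lst: any(s < e[1] and e[0] < t for s, t in lst)
        tp = sum(overlap(e, A) for e in Hm)   # one TP per human-scored event
        fn = len(Hm) - tp
        fp = sum(not overlap(e, Hm) for e in A)
        se = tp / (tp + fn)                   # Eq. 6.1
        ppa = tp / (tp + fp)                  # Eq. 6.2
        corr = 1 - np.mean(auto ^ human)      # Eq. 6.3
        return se, ppa, corr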
6.2.4 The choice of threshold
The shift in frequency that defines a μ-arousal can occur from any sleep stage to a lighter stage (sleep or wake). This poses a problem in the setting of the threshold, illustrated in Fig. 6.10, which shows subject 3's sleep-wake continuum going from a value near 0 (REM or light sleep) to a value near 1 (wakefulness), while subject 8's sleep is disrupted at a deeper level, going from near -1 (deep sleep or sleep stage 4) to near 1 (wakefulness). Several threshold values in the [0, 0.9] range were investigated, and the values of Se, PPA and Corr were calculated for each of these. The results are shown in Table 6.4 and
Fig. 6.13.
                                        Threshold
Subject          0     0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9
  2     Se     1.00  1.00  1.00  1.00  0.97  0.97  0.97  0.94  0.77  0.39
        PPA    1.00  1.00  1.00  1.00  0.97  0.94  0.94  0.91  0.89  0.92
        Corr   0.44  0.67  0.72  0.79  0.82  0.83  0.81  0.81  0.74  0.65
  3     Se     1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
        PPA    1.00  1.00  1.00  0.96  0.96  0.90  0.90  0.87  0.93  1.00
        Corr   0.37  0.45  0.57  0.68  0.74  0.81  0.85  0.89  0.93  0.90
  4     Se     1.00  1.00  1.00  1.00  1.00  0.96  0.96  0.96  0.85  0.50
        PPA    1.00  1.00  0.96  1.00  1.00  1.00  1.00  1.00  1.00  1.00
        Corr   0.40  0.76  0.90  0.92  0.90  0.87  0.85  0.83  0.77  0.68
  5     Se     0.88  0.62  0.47  0.41  0.32  0.26  0.26  0.18  0.15  0.06
        PPA    0.86  0.75  0.76  0.88  0.92  0.90  0.90  0.86  1.00  1.00
        Corr   0.40  0.65  0.73  0.75  0.74  0.73  0.72  0.71  0.70  0.69
  6     Se     1.00  0.93  0.83  0.76  0.72  0.72  0.66  0.66  0.55  0.45
        PPA    1.00  0.96  0.92  0.92  0.95  0.95  0.95  0.95  0.94  1.00
        Corr   0.46  0.76  0.84  0.83  0.82  0.82  0.80  0.76  0.72  0.64
  7     Se     1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.95  0.68  0.50
        PPA    1.00  1.00  0.96  0.96  0.92  0.88  0.88  0.91  1.00  1.00
        Corr   0.38  0.60  0.71  0.77  0.79  0.83  0.83  0.83  0.74  0.71
  8     Se     1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  0.91  0.56
        PPA    1.00  1.00  1.00  0.97  0.97  1.00  0.97  0.97  0.88  0.95
        Corr   0.50  0.51  0.52  0.54  0.55  0.56  0.59  0.66  0.72  0.70

Table 6.4: Se, PPA and Corr per subject for various threshold values
In the light of our discussion of Fig. 6.12, we would argue that correlation is the most relevant index with which to assess performance. The plots in Fig. 6.13 show that the Se and PPA indices are greater than 0.83 for all
subjects (except subject 5) at the point of maximum correlation. Table 6.5 shows the optimal threshold
from the results in Table 6.4 using the degree of correlation Corr as a criterion.
Figure 6.13: Se, PPA and Corr vs threshold for each OSA subject

Subject   Optimal threshold    Se     PPA    Corr
   2            0.5           0.97   0.94   0.83
   3            0.8           1.00   0.93   0.93
   4            0.3           1.00   1.00   0.92
   5            0.3           0.41   0.88   0.75
   6            0.2           0.83   0.92   0.84
   7            0.5           1.00   0.88   0.83
   8            0.8           0.91   0.88   0.72

Table 6.5: Optimal threshold per subject (chosen by maximising Corr)
Equi-distance to means (EDM) threshold
Two methods of finding the optimal threshold are considered, although this can only ever be done retrospectively. The first of these is to find the centres of the two main clusters of data points for the
[P(W )-P(S)] output, by running the K-means algorithm for K = 2 on the [P(W )-P(S)] output, and setting the threshold at the point x where the distances to the two centres, m1 and m2 , become equal. The
distance to each mean, d1 and d2 , is normalised with respect to the standard deviation, s1 and s2 , of the
corresponding mean to allow for the possibility of different data densities around each mean or cluster
centre, as shown below:
    d_1 = \frac{1}{s_1} | x - m_1 |        (6.4)

    d_2 = \frac{1}{s_2} | x - m_2 |        (6.5)
To find the threshold (x in Eq. 6.6 below) these two distances are made equal. Fig. 6.14 illustrates the
procedure to find the EDM threshold for OSA subject 2.
    d_1 = d_2

    \frac{1}{s_1} | x - m_1 | = \frac{1}{s_2} | x - m_2 |

    \Rightarrow \quad \frac{1}{s_1^2} (x - m_1)^2 = \frac{1}{s_2^2} (x - m_2)^2        (6.6)
Expanding the square binomial on both sides of Eq. 6.6:
    (s_2^2 - s_1^2) x^2 - 2 (m_1 s_2^2 - m_2 s_1^2) x + (m_1^2 s_2^2 - m_2^2 s_1^2) = 0        (6.7)
Solving the quadratic for x yields two possible solutions:
    x_1 = \frac{m_1 s_2 - m_2 s_1}{s_2 - s_1}        (6.8)

    x_2 = \frac{m_1 s_2 + m_2 s_1}{s_2 + s_1}        (6.9)
one of which is outside the range [m1 , m2 ] with m1 ≤ m2 , and is therefore discarded, while the other one
sets the EDM threshold.
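The EDM threshold might be computed as follows (a simple one-dimensional two-means fit stands in for the K-means step; the root of Eq. 6.7 lying inside [m1, m2] is kept, and Eq. 6.9 always provides such a root since it is a convex combination of the two means):

    import numpy as np

    def edm_threshold(pws, n_iter=50):
        pws = np.asarray(pws, float)
        m1, m2 = pws.min(), pws.max()            # initial cluster centres
        for _ in range(n_iter):                  # 1-D K-means with K = 2
            assign = np.abs(pws - m1) > np.abs(pws - m2)
            m1, m2 = pws[~assign].mean(), pws[assign].mean()
        s1, s2 = pws[~assign].std(), pws[assign].std()
        x1 = (m1 * s2 - m2 * s1) / (s2 - s1)     # Eq. 6.8
        x2 = (m1 * s2 + m2 * s1) / (s2 + s1)     # Eq. 6.9
        lo, hi = min(m1, m2), max(m1, m2)
        return x1 if lo <= x1 <= hi else x2      # keep the root inside [m1, m2]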
The new results for the automated system using the equi-distance to means threshold are presented in
Table 6.6.
Figure 6.14: [P(W)-P(S)] output for OSA sleep subject 2 (top) and its amplitude histogram, showing the two main clusters, each surrounded by a circle of one standard deviation, and the EDM threshold (bottom)
Subject   EDM threshold    Se     PPA    Corr
   2          0.47        0.97   0.94   0.83
   3          0.55        1.00   0.90   0.83
   4          0.47        1.00   1.00   0.88
   5         -0.27        0.97   1.00   0.34
   6          0.46        0.72   0.95   0.82
   7          0.47        1.00   0.88   0.82
   8          0.49        1.00   1.00   0.56

Table 6.6: Equi-distance to means (EDM) threshold
It can be noticed in Table 6.6 that the threshold for the majority of the subjects lies within 0.5 ± 0.05. Therefore, the simple approach of setting the threshold halfway between REM/light sleep and wakefulness (i.e. [P(W)-P(S)] = 0.5) was also tested. The results are shown in Table 6.7. Fig. 6.15 shows
the results using the two methods for setting the threshold compared with the results obtained with the
optimal threshold. Except for subject 5, the two methods can be seen to give very similar results.
Subject   Threshold    Se     PPA    Corr
   2         0.5      0.97   0.94   0.83
   3         0.5      1.00   0.90   0.81
   4         0.5      0.96   1.00   0.87
   5         0.5      0.26   0.90   0.73
   6         0.5      0.72   0.95   0.82
   7         0.5      1.00   0.88   0.83
   8         0.5      1.00   1.00   0.56

Table 6.7: Fixed (0.5) threshold
Figure 6.15: Se, PPA and Corr for the best threshold (blue), the EDM threshold (red), and a 0.5 fixed threshold (green)
6.2.5 Discussion
From the results obtained with the automated scoring system (shown in Fig. 6.15), two OSA subjects
stand out, subject 5 and subject 8, because of their low correlation values in relation to the rest of the
subjects. In order to investigate this, we examined the EEG and its power spectral density (PSD) for
these two subjects during the intervals which were scored as a μ-arousal by the human expert. The EEG
revealed that OSA subject 5 falls into a much deeper sleep than the other subjects before the onset of
a μ-arousal. Some of the deep sleep EEG is usually scored by the human expert as being part of the μ-arousal. Thus, this subject's μ-arousals are characterised by an increase in magnitude both for the lower
frequencies (which is unusual) and the higher frequencies (which is the expected EEG change during a
μ-arousal) during the first few seconds of the event. For the rest of the μ-arousal, the EEG is generally
dominated by α activity (see §3.4.3), which is often interpreted as light sleep by the neural network. This
is illustrated in Fig. 6.16, which shows 24 seconds of EEG and the corresponding [P(W )-P(S)] output
during a μ-arousal event for subject 5. The start and end of the event, as determined by the expert scorer,
are shown by the broken vertical lines. Fig. 6.17 shows the 1s resolution spectrogram (PSD vs time)
of the EEG segment shown in Fig. 6.16. Note the increase in magnitude of both the δ and α rhythms
during the first few seconds of the μ-arousal and also the prevalence of the peak at 10Hz, indicating the
presence of α rhythm during the whole event. The relatively high power in the lower frequency bands
for subject 5's EEG may be the reason why some events are missed entirely by the automated system,
as is shown in Fig. 6.18.
Figure 6.16: OSA subject 5 EEG and [P(W)-P(S)] output during a typical μ-arousal for this subject (24s)
Another subject with low correlation value (Corr=0.56 for the EDM threshold and the 0.5-threshold) is
subject 8, who has an EEG with high frequency content and shows a reduction in the higher frequencies
prior to the onset of the μ-arousals. The [P(W )-P(S)] output is near 1 (wakefulness) most of the time,
falling to low negative levels in the few seconds prior to the start of the μ-arousal, resulting in the μ-arousals identified by the automated system being longer than those scored by the expert. Fig. 6.19
shows a 2-minute long section of the [P(W )-P(S)] output and the corresponding scores from the human
expert.
Figure 6.17: Spectrogram of the EEG segment shown in Fig. 6.16, calculated with 1s resolution using 10th-order AR modelling
Comparing results with those using a 1-second analysis window
Previous work in the Neural Networks Research Group [123][175][176] used a 1-second window with no overlap for the EEG feature extraction, but we found that, with such a window length, the misclassification error of the MLP on the validation set in normal sleep is greater than 10%, compared with the 5.75% obtained with the 3-s window. Fig. 6.20 shows the [P(W)-P(S)] output using the 1-s window
for normal subject 9, compared with the [P(W )-P(S)] output using the 3-s window, together with the
corresponding expert scores. The “noisier” appearance of the output in relation to the 3-s case is likely to
be due to the higher variance of the AR estimates.
Also, the averaged sensitivity (median 0.77) and correlation (median 0.76) in μ-arousal detection are
lower using a 1-s window than using a 3-s window (Se median 0.97 and Corr median 0.82).
6.3 Summary
In this chapter two databases have been presented, corresponding to normal sleep and OSA sleep. The
normal sleep database consists of nine all-night EEG recordings using the central electrode montage. The
EEG is labelled independently, according to the R&K rules, by three human experts on a 30-second basis.
Figure 6.18: OSA subject 5 EEG and [P(W)-P(S)] output during a μ-arousal missed by the automated scoring system (24s)
The OSA sleep database has seven 20-minute frontal EEG records corresponding to seven subjects with
severe OSA. The records have been scored for μ-arousals by a human expert using the ASDA rules.
An investigation was made to select the algorithm for the estimation of the reflection coefficients, used
to represent the frequency content of the EEG, and also to select the number of samples in the analysis
window. The Burg algorithm was selected for its low computational cost and competitive performance. A
3-second window with 2-second overlap was chosen as a compromise between minimising the variance
of the AR coefficient estimates and the requirement to ensure stationarity of the EEG.
Based on previous work [123], three classes were chosen to describe normal sleep, namely Wakefulness,
REM/light sleep and Sleep stage 4, and a balanced feature set was formed to train a neural network to
estimate the posterior probabilities of class membership.
A 2-layer MLP with the softmax function for the output units was used. The backpropagation algorithm
for multiple classes and the scaled conjugate gradient optimisation algorithm were used to train the
network (cross-entropy error function). Optimisation of the MLP parameters, number of hidden units
and weight decay terms, was achieved using cross-validation. The optimal network (performance on the
validation set) is a 10-6-3 MLP with weight decay parameters (νz , νy ) values at 10−4 and 10−5 respectively.
Figure 6.19: OSA subject 8 [P(W)-P(S)] output and human expert scores (2 minutes)
The percentage of misclassification on the test set achieved with this network is 6.28%.
The optimal MLP was used to analyse the all-night EEG record of a subject from the normal sleep
database. The time courses of two of the three MLP outputs were combined to give a measure of sleep
depth [P(W )-P(S)], which shows a high correlation with the hypnogram generated by a human expert,
suggesting that the MLP is able to interpolate between classes for intermediate sleep stages 2 and 3.
The sleep EEG of OSA subjects was analysed using the optimal MLP. The time courses of the [P(W )-P(S)]
output show severe disruption in the sleep. A method for automated μ-arousal detection using thresholding of the [P(W )-P(S)] output was introduced. The output of the automated scores was post-processed
to follow ASDA rules for μ-arousal scoring as closely as possible. Sensitivity, positive predictive accuracy
and correlation were used to evaluate the performance of the automated detection system with respect
to the human expert scores. The correlation measure was used to choose the optimal threshold value
per subject, and two methods for setting the threshold, one of them subject-adaptive, were applied retrospectively. The results for five of the seven subjects show a high correlation value (greater than 0.8), with values of Se and PPA mostly over 0.9. The lower correlation values obtained with the other two subjects (0.56 and 0.34-0.73) may be explained by the fact that these two subjects have different types of μ-arousal.
Figure 6.20: Sleep database subject 9 raw P(W)-P(S) using a 1-s analysis window (a) and a 3-s analysis window (b), compared to the human expert scored hypnogram (c)
6.4 Conclusions
The neural network, trained with normal sleep data, is capable of following the abrupt transitions in
the sleep EEG of OSA patients. The methods introduced for automated μ-arousal detection were able
to identify a high percentage of the events scored by the human expert, giving the beginning and the
end times for the μ-arousal with relatively high accuracy (as measured by a simple correlation index)
for most of the OSA subjects in the database. The study of the subjects with low correlation levels in
the automated μ-arousal detection showed different changes in the EEG frequency content prior to and
during the μ-arousal.
The 3-second analysis window with a 2-second overlap for the AR modelling has yielded better results in
terms of MLP performance and in the sensitivity and correlation of the μ-arousal detection.
Chapter 7
Visualisation of the alertness-drowsiness continuum
Daytime drowsiness or sleepiness is a common complaint in patients with OSA. A full assessment of an
OSA case may include a vigilance test after a night-time sleep recording has been performed. In any case,
it would be very useful for clinicians to have a method of assessing the day-time performance of OSA
patients in relation to the severity of their sleep disorder.
Drowsiness is a state in which a person will easily fall asleep in the absence of external stimuli. It is quite
different from exhaustion as a result of physical activity. While drowsiness is a mental state which occurs
prior to sleep, its opposite, alertness, is a physiologically activated state of the human brain, characterised
by consciousness and awareness. Human beings experience fluctuations in their levels of alertness during
the day because of the circadian rhythm. These fluctuations can be affected by sleep deprivation or low
quality of sleep as is the case with OSA.
In this chapter we investigate changes in the level of alertness which may be gradual rather than abrupt,
like the short events (arousals during sleep) of the previous chapter. Two databases are considered:
1. The “sleep database”, previously used for training neural networks to track the sleep-wake continuum and hence detect arousals in test data. This has the three previously defined categories of
wakefulness, REM/light sleep and deep sleep.
2. The “vigilance database” described below in which eight sleep-deprived subjects perform vigilance
tasks while having their EEG monitored. For reasons which are explained below, there are two broad
categories in this database: alertness and drowsiness.
One important question is the inter-relationship and overlap between these five categories. For example,
wakefulness in the sleep database corresponds to a mental state in which the subjects lie in bed with their
eyes shut in a darkened room. On the other hand alertness in the vigilance database represents a state in
which the subjects are awake, with their eyes open, in a well-lit room in front of a computer screen. In
both instances, the subjects are awake but their EEG activity may be different.
In both the analysis of the sleep EEG and that of the vigilance EEG [153][127][170][171][104][50], it is the frequency content of the signal which is used to characterise it. Although a 5th-order AR model has been used previously [50] in the analysis of vigilance EEG, we decided that, in order to be able to compare EEG signals from both databases, the same parameterisation should be used in both cases, namely reflection coefficients from a 10th-order model. The inter-relationship between these coefficients for the different classes will be visualised in 2-D using both the Sammon map and the NeuroScale algorithm.
The rest of this chapter is organised as follows. Firstly, the vigilance database, used both in previous work [50] and in subsequent chapters, is introduced. Secondly, the Sammon map and the NeuroScale algorithm for visualising the high-dimensional data are applied to the vigilance database to investigate the separation (or overlap) between the two classes, alertness and drowsiness. Finally, the visualisation tools are used to study the inter-relationships between the EEG patterns of the five categories present in the two databases together.
7.1 The vigilance database
The Department of Psychology at the University of West England conducted a study in which eight healthy
young subjects performed various vigilance tests for approximately 2 hours (see Appendix C), after a
night of sleep deprivation and no stimulant consumption for 24 hours before or during the test. The
EEG was recorded from a number of sites on the scalp, but only the central (C4) site recordings, as in the sleep EEG studies, were used in the work described in this thesis. Expert scoring based on the visual
assessment of the EEG, EMG and EOG was undertaken on a 15-second basis, according to the Alford et
al. sub-categories of Table 3.2 [8]. A brief summary of the database is given in Appendix C and the Table
is reproduced in simpler format below:
Vigilance sub-category                  Description
Active Wakefulness (Active)             active/alert, > 2 eye mov/epoch, definite body mov.
Quiet Wakefulness Plus (QWP)            active/alert, > 2 eye mov/epoch, possibly body mov.
Quiet Wakefulness (QW)                  alert, < 2 eye mov/epoch, no body mov.
Wakefulness with Intermittent α (WIα)   burst of α < half of an epoch
Wakefulness with Continuous α (WCα)     burst of α > half of an epoch
Wakefulness with Intermittent θ (WIθ)   burst of θ < half of an epoch
Wakefulness with Continuous θ (WCθ)     burst of θ > half of an epoch
(mov. = movement)

Table 7.1: Alford et al. vigilance sub-categories
In previous work in the Neural Networks Research Group, Duta [50] investigated the tracking of fluctuations in vigilance using both the central and mastoid (behind the ears) EEG sites. In that work the EEG was divided into one-second segments. Since the expert scoring of the (central) EEG was undertaken using a 15-second timescale, a large number of 1-s segments are wrongly labelled: for instance, a 1-s segment from a 15-s epoch of vigilance category WIα may consist predominantly of α-wave activity, whereas another segment in the same epoch may correspond to Quiet Wakefulness (QW). Duta re-labelled the data using a combination of the expert scoring and Kohonen feature maps to visualise the cluster to which each one-second segment belonged. As a result of this, she defined two categories:
1. Alertness: one-second segments which are labelled by the expert as Active, QWP or QW, and have
corresponding feature vectors which are mapped onto the area of the Kohonen map mostly visited
by the Active, QWP and QW sub-categories and not visited by WIα, WCα and WIθ.
2. Drowsiness: one-second segments labelled WIα, WCα or WIθ whose feature vectors visit the area
of the Kohonen map mostly visited by the WIα, WCα and WIθ sub-categories and not visited by
Active, QWP and QW.
In addition, an extra class of Uncertain was defined as containing the one-second segments whose feature vectors are mapped onto an area of the Kohonen map visited by feature vectors extracted from
one-second segments from all vigilance sub-categories. There are approximately 8,000 and 20,000 1-s
segments which belong to the Drowsiness and Alertness classes respectively, although the distribution is
not uniform amongst the subjects. The distribution of patterns per subject per class is shown in Table 7.2.
Class          Subject:  1      2      3      4      5      6      7
Drowsiness              282   1541    413    804   1817   1218   2116
Intermediate           1220   2038   1978   2262   1181   2084   1749
Alertness              4802   1896   3280   3394   2591   2416   1368
Artefact               1151   2625   2204   2195   2286   4797   2717
Total                  7455   8100   7875   8655   7875  10515   7950

Table 7.2: Number of patterns per subject per class in the vigilance training database
7.1.1 Pre-processing
Although the data in this database was sampled at 256 Hz, it is down-sampled to 128 Hz in order to keep
the pre-processing filters and AR modelling consistent across all databases in this thesis. Ten reflection
coefficients per second are calculated using the Burg algorithm for each 3-s window with 2-s overlap, as
with the sleep database (see section 6.1.3).
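As an illustration of this step, a minimal Python/NumPy sketch of Burg's recursion and the sliding-window feature extraction is given below (the function and variable names are ours, not those of the software used in this work, and eeg is assumed to be the pre-processed single-channel recording):

    import numpy as np

    def burg_reflection_coefficients(x, order=10):
        """Burg's method: the `order` reflection coefficients of x."""
        f = np.asarray(x, dtype=float).copy()   # forward prediction errors
        b = f.copy()                            # backward prediction errors
        k = np.zeros(order)
        for m in range(order):
            fp, bp = f[1:], b[:-1]              # align f(n) with b(n-1)
            k[m] = -2.0 * np.dot(fp, bp) / (np.dot(fp, fp) + np.dot(bp, bp))
            f, b = fp + k[m] * bp, bp + k[m] * fp
        return k

    fs = 128                                    # sampling frequency [Hz]
    win, step = 3 * fs, fs                      # 3-s window, 2-s overlap: one vector per second
    features = np.array([burg_reflection_coefficients(eeg[s:s + win])
                         for s in range(0, len(eeg) - win + 1, step)])

Each row of features is then one 10-D input pattern for the analyses that follow.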
7.1.2 Visualising the vigilance database
Ideally we would take an equal number of Alertness (A) and Drowsiness (D) patterns per subject in order to have every subject equally represented when training the visualisation algorithm. Unfortunately, some subjects in the database have a very small number of patterns for the Drowsiness class. If we take 800 patterns per class per subject, 5 out of the 7 subjects can provide this number. A training set is then built by randomly selecting 800 patterns per class for each subject, or the maximum available when this is not possible (see Table 7.3).
The visualisation algorithms used in this thesis, the Sammon map and NeuroScale, require a small number of feature vectors for a reasonable convergence time. With approximately 5,000 patterns per class, a reduction in the size of the training set is needed. Using the K-means clustering algorithm, the number of patterns in the training set is reduced to about 200 mean patterns per class (by choosing 14 means per subject per class).

Class   Subject:  1     2     3     4     5     6     7   Total
D                282   800   412   800   800   800   800   4694
A                800   800   800   800   800   800   800   5600

(Only seven subjects are listed; the eighth subject was discarded for reasons explained later.)

Table 7.3: Number of patterns per subject per class in the K-means training set
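For illustration, this K-means reduction step can be sketched with scikit-learn's KMeans as a stand-in for the clustering implementation used here (the dictionary layout, keyed by (subject, class), is an assumption made for the example):

    from sklearn.cluster import KMeans

    def reduce_to_means(groups, n_means=14, seed=0):
        """Replace each (subject, class) group of 10-D feature vectors
        by n_means cluster centres, as for the visualisation training set."""
        return {key: KMeans(n_clusters=n_means, n_init=10,
                            random_state=seed).fit(X).cluster_centers_
                for key, X in groups.items()}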
The Sammon map and NeuroScale algorithms are run independently with the reduced dataset, using the same parameters as for the sleep database (for the Sammon map, gradient proportionality factor = 0.06; for NeuroScale, number of basis functions = 50). The projections of the means produced by both visualisation techniques, presented in Figs. 7.1 and 7.2, are very similar and show two partially overlapping clusters, representing the A and D classes respectively. Of course, the overlap does not necessarily occur in the 10-D space as it does in the 2-D projection, in the same way as the edges of a 3-D cube may touch each other in a 2-D projection.
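For reference, a minimal gradient-descent sketch of the Sammon stress minimisation follows; alpha plays the role of the gradient proportionality factor quoted above. This is an illustrative implementation, not the one used to produce the figures, and the NeuroScale RBF mapping is not reproduced here:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def sammon_map(X, n_iter=500, alpha=0.06, seed=0):
        """Project X (n x d) to 2-D by gradient descent on Sammon's stress
        E = (1/c) * sum_{i<j} (d*_ij - d_ij)**2 / d*_ij, c = sum_{i<j} d*_ij."""
        n = len(X)
        Dstar = squareform(pdist(X)) + np.eye(n)   # +I avoids division by zero
        c = Dstar[np.triu_indices(n, 1)].sum()
        Y = 0.01 * np.random.default_rng(seed).standard_normal((n, 2))
        for _ in range(n_iter):
            D = squareform(pdist(Y)) + np.eye(n)
            w = (Dstar - D) / (D * Dstar)          # pairwise stress weights
            grad = -(2.0 / c) * np.einsum('ij,ijk->ik', w,
                                          Y[:, None, :] - Y[None, :, :])
            Y -= alpha * grad
        return Y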
Visualising the feature vectors for each subject
The cluster size in the Sammon maps shown in Fig. 7.1, represented by the radius of the circles around the
cluster mean, is calculated by counting the number of feature vectors in the training set which “belong”
to that mean (as defined by the Euclidean distance in 10-D between the feature vector and the cluster
mean). The distribution of patterns per subject can also be investigated by considering only the feature vectors belonging to a specific subject. The results of using the Sammon and NeuroScale algorithms on each subject individually are shown in Figs. 7.3 and 7.4.
7.1.3 Discussion
The maps showing the distribution of the patterns per subject reveal some differences between subjects.
Figure 7.1: Vigilance Sammon map: (a) both classes; (b) Drowsiness; (c) Alertness

One of the subjects in the database, subject 8 (not shown in the tables), was discarded because she was identified by the expert who scored the records as belonging to the minority class α+ (see sections 3.4.1
and 3.4.3), a condition in which the subject’s EEG shows an α-rhythm during eyes-open wakefulness
[87]. Although alpha-plus people represent a significant fraction of the population, the lack of data
and subjects for this category in our database makes it difficult to include it in the rest of the analysis.
However, the data from subject 8 allows us to exploit the advantage that the NeuroScale algorithm has over the Sammon map algorithm. The trained NeuroScale network can be used on previously unseen data, provided that the new data is drawn from the same probability distribution as the training data. Thus, the NeuroScale network trained with the 7-subject training set described in Table 7.3 can be used with this α+ subject as input, in order to visualise the A and D patterns of this subject with respect to those from the rest of the subjects. Fig. 7.4h clearly shows that the D patterns for subject 8
lie mostly in the area where the A patterns from the other subjects are found.

Figure 7.2: Vigilance NeuroScale map (14 means per class, 50 basis functions, 500 iterations): (a) means only; (b) all patterns

Given that this subject's
EEG differs from the EEG of most of the population, it is very likely that the NeuroScale neural network is extrapolating when presented with this subject's patterns, as they are not represented in its training set. Another NeuroScale neural network is therefore trained, this time with subject 8's mean patterns added to the training set. The resulting 2-D plot for this 8-subject training set is shown in Fig. 7.5 and the projection of subject 8's patterns using this neural network is shown in Fig. 7.6h.

This figure reveals an interesting phenomenon which could not be seen in the 7-subject NeuroScale 2-D projection. In Fig. 7.6h, the D patterns from subject 8 lie in an area where there are no patterns from any other subject. Also, subject 8's A patterns overlap completely with her D patterns. This can be traced to the first five reflection coefficients for the D class, which for subject 8 have mean values different from those of the other subjects (see Fig. 7.7).
7.2 Visualising vigilance and sleep data together
To explore the relationship in feature space between the sleep and vigilance classes, a NeuroScale neural network is trained with means extracted both from the sleep database classes Wakefulness (W), REM/Light-sleep (R) and Deep-sleep (S), and from the vigilance categories Alertness (A) and Drowsiness (D). An equal number of means is extracted for each class from the databases, giving a total of 210 means.
The resulting NeuroScale plot of the means is shown in Fig. 7.8. The maps showing the projection of the feature vectors for each of the five classes can be seen in Fig. 7.9.

A Sammon map was also trained with the means from the combined sleep-vigilance databases. The results, shown in Fig. 7.10, are comparable to those obtained with NeuroScale (Figs. 7.8 and 7.9), but the relation between pairs of classes may be seen more clearly on the Sammon map, as shown in Fig. 7.11.
7.2.1 Discussion
It can be seen from Fig. 7.11b that the Wakefulness class from the sleep database is broader than the
Alertness category from the vigilance database. Although the Alertness patterns are mostly mapped
onto a region of the map covered by the Wakefulness class, it is not necessarily correct to say that the
Alertness category is a subset of the Wakefulness class. On the one hand, we have the Alert patterns of
sleep-deprived subjects performing a rather boring task (see Appendix C), fighting to remain awake. On
the other hand, we have the Wakefulness patterns from subjects lying in bed, ready to sleep, in a quiet,
dark and comfortable room. It is not known whether these subjects were relaxed or not, but it is very
likely that they were not concentrating their mind on anything in particular. The overlap between these
two classes is understandable but it was also expected that there would be a region for each class not
shared with the other one. It is possible that this region may be represented by the three dense Alertness clusters at the lower edge of this class on the Sammon map, a region not visited by any other class. The same region is seen in the NeuroScale plot as the right-hand side of the Alertness category in Fig. 7.9d. It is also encouraging to find a small area where the Wakefulness patterns on the Sammon map overlap the Drowsiness patterns but not the Alertness patterns (see Figs. 7.11b and 7.11c).
The spatial relationship between Alertness, Drowsiness and REM/Light Sleep is shown in Figs. 7.11d and
7.11e. There is a large area of overlap between REM/Light Sleep and Drowsiness, but the REM/Light
Sleep area only overlaps Alertness in the area where the latter overlaps Drowsiness. This is reasonable,
as the brain cortex, fully active when the subject is alert, is randomly stimulated during REM sleep. The
Drowsiness area extends onto the Wakefulness area towards the upper-centre border of the map, where
it becomes the dominant class. The centre-left region of the map is dominated by REM/Light Sleep.
Finally, Fig. 7.11f shows two well-defined, completely separated clusters representing the 2-D projections for Drowsiness and Deep Sleep. This is expected, as Drowsiness only includes short bursts of θ rhythm and no δ rhythm, while Deep Sleep patterns consist mainly of δ waves with some occasional θ rhythm.
From the visualisation maps, the following hypotheses can be formulated:
• A transition from an alert state of mind to sleep may progress from the area exclusive to Alertness, through the Drowsiness area shared by A, W and D, then the Light Sleep area shared by A, D and R, and finally into Deep Sleep.

• Another transition, from a relaxed state of Wakefulness to sleep, starts from the region of Wakefulness not shared with Alertness, moves towards the Drowsiness area shared by W and D, then into the Light Sleep area shared by R and D only, eventually reaching Deep Sleep.
7.3 Conclusions
In this chapter, we have analysed the EEG recordings from the vigilance database, which consists of 2-hour recordings from seven healthy sleep-deprived subjects performing vigilance tasks. Two vigilance categories were defined, namely Alertness and Drowsiness, and used to label 1-s EEG segments based on the scores from a human expert. The EEG was processed in the same way as for the sleep database. A near-balanced training set was built from the vigilance database by randomly selecting an equal number of patterns per subject and per class. Visualisation of the data distribution in the feature space
revealed inter-subject variability in both the Alertness and Drowsiness classes. An interesting example
was discussed, namely an α+ subject whose Drowsiness patterns seem to be different from the rest of the
feature vectors in the training set.
A further visualisation study was carried out integrating the sleep and vigilance categories in one training
set. From this analysis we may draw the following conclusions:
1. The Alertness and Drowsiness patterns give rise to two well-defined but partially overlapping clusters.
2. Wakefulness (from the sleep database) is a very broad class that includes some alert patterns as
well as some drowsy ones.
3. There is a small but relatively dense area beyond Wakefulness occupied by Alertness only.
4. The area shared by Wakefulness and Drowsiness patterns only may represent the sleep onset not
included in the REM/Light Sleep region.
5. The REM/Light Sleep and Drowsiness classes overlap significantly but not totally with obvious areas
not represented by any other class.
6. Deep Sleep is a separate class, of relatively low importance for the study of vigilance.
It is obvious that the vigilance categories, Alertness and Drowsiness, are not fully represented by any of
the sleep classes and therefore require a separate neural network analysis.
Figure 7.3: Vigilance Sammon map showing each subject's distribution (Alertness in red, Drowsiness in blue): (a) all subjects; (b)–(h) subjects 1–7
Figure 7.4: Vigilance NeuroScale map projections for each subject (Alertness in magenta, Drowsiness in blue): (a)–(h) subjects 1–8
Figure 7.5: Vigilance NeuroScale map trained with all subjects, including the α+ subject (192 means per class, 50 basis functions, 500 iterations): (a) means only; (b) all patterns
Figure 7.6: Vigilance NeuroScale map trained with all subjects, including the α+ subject (Alertness in magenta, Drowsiness in blue): (a)–(h) subjects 1–8
Figure 7.7: Subject 8 reflection coefficient histograms (coefficients 1–6, green) in relation to the rest of the subjects in the training set (magenta)
Figure 7.8: Vigilance and sleep NeuroScale map (42 means per class, 50 basis functions, 500 iterations; classes: Wakefulness, REM/light-sleep, Deep-sleep, Drowsiness, Alertness)
Figure 7.9: Vigilance and sleep NeuroScale projections for all the patterns in each class (colour code: W, cyan; R, red; S, green; A, magenta; D, blue): (a) Wakefulness; (b) REM/light sleep; (c) Deep Sleep; (d) Alertness; (e) Drowsiness
Figure 7.10: Vigilance and Sleep Sammon map (colour code: W, cyan; R, red; S, green; A, magenta; D, blue): (a) all classes; (b) Wakefulness; (c) REM/light sleep; (d) Deep Sleep; (e) Alertness; (f) Drowsiness
Figure 7.11: Vigilance and Sleep Sammon map, pairwise class comparisons (colour code as in Fig. 7.10): (a) all classes; (b) Alertness and Wakefulness; (c) Wakefulness and Drowsiness; (d) REM/light sleep and Drowsiness; (e) REM/light sleep and Alertness; (f) Deep Sleep and Drowsiness
Chapter 8
Training a neural network to track the
alertness-drowsiness continuum
At the end of the previous chapter, we showed that a neural network used to assess the level of drowsiness in OSA patients should be trained exclusively with vigilance-labelled patterns. In this chapter we train and test a neural network to track the alertness-drowsiness continuum, using single-channel EEG recordings from control subjects performing vigilance tests.
8.1 Neural Network training
The visualisation techniques applied to vigilance data in section 7.1.2, showed a high degree of overlap
between the A and D classes in the 2-D projection of the vigilance database feature vectors. Despite this
overlap, a neural network may be able to resolve the differences using 10-D feature vectors as inputs. We
expect that a two-class neural network trained exclusively with patterns from the extreme conditions of
fully alert (A) and fully drowsy (D), will be able to interpolate when a pattern belonging to an intermediate stage is presented at the input. In this way the vigilance continuum may be tracked by an output
fluctuating between the full alertness and the full drowsiness levels.
8.1.1 The training database
The 7-subject vigilance database described in chapter 7 is used for the neural network training and testing. The set of 10 reflection coefficients extracted from the A and D patterns used for visualisation in §7.1.2 is now used in this chapter for the training process.
8.1.2 The neural network architecture
An MLP neural network is selected for the same reasons as in chapter 6. As with the sleep-wake continuum network, the cross-entropy error function and the scaled conjugate gradient optimisation algorithm are used during the training process. Given that only one output is required in a two-class problem¹, the configuration for the MLP is 10-J-1, the output representing the posterior probability of the input vector belonging to the alertness class. The estimate of the number of hidden units J given by Eq. 5.72, i.e. the geometric mean of the numbers of inputs and outputs, is √(10 × 1) ≈ 3.16. Hence a search for the optimum J is done by training 10-J-1 MLPs with values of J from 2 to 15. As before, the problem of over-fitting the network is dealt with by introducing regularising terms νz and νy, one for each weight layer. Based on results from a preliminary investigation, the values of the regularisation parameters are varied between 10−3 and 1 for the input-to-hidden layer (νz) and from 10−7 to 10−5 for the hidden-to-output layer (νy), increasing in powers of ten. To avoid becoming trapped in a local minimum, three different random weight initialisations are used. Cross-validation is used, as before, to optimise the MLP architecture and regularisation parameters.
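For illustration only, the architecture and objective can be sketched as follows. This is a PyTorch sketch under stated assumptions: tanh hidden units and a logistic output are assumed, L-BFGS is substituted for the scaled conjugate gradient optimiser used in this work, and the per-layer regularisers νz and νy are added explicitly to the cross-entropy loss:

    import torch
    import torch.nn as nn

    def train_mlp(X, y, J=3, nu_z=1e-3, nu_y=1e-6, n_iter=200):
        """Train a 10-J-1 MLP with cross-entropy loss and per-layer
        weight penalties (a sketch, not the thesis implementation)."""
        hidden, out = nn.Linear(10, J), nn.Linear(J, 1)
        model = nn.Sequential(hidden, nn.Tanh(), out, nn.Sigmoid())
        opt = torch.optim.LBFGS(model.parameters(), max_iter=n_iter)
        bce = nn.BCELoss()

        def closure():
            opt.zero_grad()
            p = model(X).squeeze(1)                          # P(A | x)
            loss = (bce(p, y)
                    + nu_z * hidden.weight.pow(2).sum()      # input-to-hidden penalty
                    + nu_y * out.weight.pow(2).sum())        # hidden-to-output penalty
            loss.backward()
            return loss

        opt.step(closure)
        return model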
8.1.3 Choosing training, validation and test sets
Ideally, balanced training and validation sets should be assembled for the cross-validation tests, assuming equal prior probabilities for both classes. However, inter-subject differences were found when visualising the vigilance database (§7.1.2). All the subjects should be equally represented in the training and validation sets, but as shown in Table 7.2, the distribution of A and D patterns across the vigilance database is very uneven. Using the same criterion as for the NeuroScale training set, 800 patterns (or fewer, when 800 are not available) were drawn per class for each subject, yielding 5,494 patterns for Alertness and 5,822 for Drowsiness.
¹ P(D | x) = 1 − P(A | x)
Assigning these patterns to two equal-sized sets, we obtain approximately 2,800 patterns per class in each set. With as few as 7 subjects in our database and the high degree of inter-subject variability seen in the visualisation studies, the best strategy for training and testing the MLP is the leave-one-out method [159]. This requires leaving one subject out of the training and validation sets, so that it can be used as a test subject, and repeating this for each subject in turn. This method leads to 7 different partitions of the data, as shown in Table 8.1.
Partition   Training and validation subjects      A      D     Tr     Va   Test subject
1           2, 3, 4, 5, 6, 7                   4800   4412   4606   4606   1
2           1, 3, 4, 5, 6, 7                   4800   3894   4347   4347   2
3           1, 2, 4, 5, 6, 7                   4800   4282   4541   4541   3
4           1, 2, 3, 5, 6, 7                   4800   3894   4347   4347   4
5           1, 2, 3, 4, 6, 7                   4800   3894   4347   4347   5
6           1, 2, 3, 4, 5, 7                   4800   3894   4347   4347   6
7           1, 2, 3, 4, 5, 6                   4800   3894   4347   4347   7

Table 8.1: Partitions and distribution of patterns in training (Tr) and validation (Va) sets
The MLP training and parameter optimising process can be summarised as follows:

1. Build training and validation sets on (n − 1) subjects using 800 (or as many as are available) patterns per subject per class. Repeat this for each subject (here n = 7).

2. For each partition:

(a) Normalise training, validation and test sets with respect to the training set statistics.

(b) For each set of values of the network parameters (J, νz, νy) and weight initialisation seed, train a 10-J-1 MLP using the cross-entropy error function and the scaled conjugate gradient optimisation algorithm.

(c) Choose the optimal MLP based on the performance on the validation set.

(d) Test the optimal MLP on the nth subject. Compare with the expert assessment.
Hence, the MLP parameter optimisation involves training the following number of networks: (7 partitions) × (3 weight initialisations) × (14 values of J) × (4 values of νz) × (3 values of νy) = 3,528 networks.
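The overall search can then be outlined as a loop over partitions and parameter settings (an illustrative sketch, assuming a training routine such as the train_mlp sketch of §8.1.2; the loop sizes reproduce the count of 3,528 networks):

    import itertools

    subjects = range(1, 8)                    # 7 leave-one-out partitions
    Js     = range(2, 16)                     # 14 values of J
    nu_zs  = [1e-3, 1e-2, 1e-1, 1.0]          # 4 values of nu_z
    nu_ys  = [1e-7, 1e-6, 1e-5]               # 3 values of nu_y
    seeds  = [0, 1, 2]                        # 3 weight initialisations

    for test_subject in subjects:
        pool = [s for s in subjects if s != test_subject]
        # ... build balanced training/validation sets from `pool` ...
        for J, nu_z, nu_y, seed in itertools.product(Js, nu_zs, nu_ys, seeds):
            pass  # train a 10-J-1 MLP, record the validation error, and
                  # keep the best (J, nu_z, nu_y) per partition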
8.1.4 Optimal (n − 1)-subject MLP per partition
The optimisation of the MLP parameters yields the results shown in Table 8.2. Fig. 8.1 shows the average
variation in misclassification error for the validation set with respect to the number of hidden units J.
The optimum value for J is clearly between 3 and 4 for the majority of the partitions as estimated using
Eq. 5.72. Partition 5 is the only one which has a higher value for optimum J. The best 10-3-1 MLP for
partition 5 has a classification error of 18.79% on the validation set, but the optimum value of J = 13
was kept for this partition. Fig. 8.2 shows the average variation in misclassification error in the validation
set with respect to the regularisation parameters (νz ,νy ) for the 10-3-1 MLPs. Either from the plot or
from the table, it can be seen that the optimum regularisation parameters occur towards the end of the
ranges (10−3 , 10−7 ) in many of the partitions, suggesting that the search could have been continued in
that direction. However, a previous investigation found that network performance on the validation set
drops significantly for smaller values of νz and νy . This is expected since the regularisation terms become
negligible with a consequent loss in generalisation.
Partition    J     νz      νy     Tr error   Va error
1            3    10−2    10−6     21.02      20.73
2            3    10−3    10−7     20.54      19.65
3            3    10−3    10−6     19.69      19.22
4            3    10−3    10−6     19.37      19.39
5           13    10−2    10−5     18.36      18.73
6            3    10−3    10−7     22.15      22.22
7            4    10−3    10−7     19.62      20.31

Table 8.2: Optimum MLP parameters per partition and percentage classification error for the training (Tr) and validation (Va) sets
Figure 8.1: Average misclassification error for the validation set vs. number of hidden units J for the (n − 1)-subject MLP
Figure 8.2: Average misclassification error on the validation set with respect to the regularisation parameters (νz, νy) for the (n − 1)-subject MLP with J = 3 (linear interpolation used between the 12 values)
8.2 Testing on the nth subject
The optimal MLP for each partition is tested using the nth subject. Given that the main goal is not classification, but the tracking of the alertness-drowsiness continuum, the assessment of MLP performance on
test data is carried out on the time course of the MLP output instead of on a number of randomly selected
1-s segment feature vectors. The time course of the MLP output is compared with the expert assessment
of the subject’s vigilance according to the Alford et al. scale described in section 3.4.2. The time courses
of the MLP output and expert scores are shown in Figs. 8.3 to 8.9. Given that the expert scored the EEG
on a 15-s basis, the MLP output is filtered using a 15-pt median filter. This allows comparison with the
expert’s discretised representation of the alertness-drowsiness continuum.
8.2.1 Qualitative correlation with expert labels
A visual inspection of the time courses and corresponding expert labels reveals that, in all cases, the time
course of the MLP output follows the fluctuations in the vigilance scale fairly closely.
There is no difference between the MLP outputs corresponding to labels Active and QWP, for which it is
almost always 1.0. Subject 1’s time course shows that the MLP is not reaching the lower values associated
with the WIθ category. The 2-D projections of this subject’s feature vectors in Figs. 7.3 and 7.4 may give a
possible explanation. It can be seen in the figures that the D patterns of this subject lie in the overlapping
area between the A and D classes. The MLP is not always able to resolve the difference between the two
classes in this area, hence the posterior probabilities of belonging to either class are approximately equal
(MLP output ≈ 0.5). Subject 2 is not affected by this problem, the network performance being generally
as expected as the MLP output sweeps the [0-1] range in synchronism with the expert labels. Large
fluctuations remain, even after the filtering, but this is expected from an individual who goes from being
totally drowsy to being fully active several times during the recording. Subject 3 is similar to subject 1, as the MLP output does not reach the drowsiness levels. His A and D patterns in the 2-D feature space projection also lie in an area of high overlap. The performance for subject 4 is poor, the output remaining persistently high despite the multiple occurrences of the WIθ label. There are fewer problems with the label WIα, for which the MLP output reaches a value of around 0.5. This subject's A patterns seem to be divided into two clusters far apart in the 7-subject Sammon map (Fig. 7.3), and some of their means are not visited by the A patterns of other subjects. In contrast, the MLP analysis yields good results in general
for the next three subjects, as with subject 2. Subject 5’s MLP output matches the expert labels with only
two major exceptions, around times 00:42 and 01:57 (42 and 117 minutes), in which the MLP output
is low when the expert labels are WIα-QWP. Similar errors can be found in the time course of the MLP
output for subject 6, when for brief periods of time (around 00:25, 00:47 and 01:18), the MLP output
is high when the subject's labels are WIθ-WIα. Note that this subject's A and D patterns are the furthest apart in the 2-D projection of the feature space, lying in areas of little or no overlap between classes. One possible reason for the segments with poor correlation in the MLP output time course is the presence of artefacts, as occurs during the interval centred on 01:18. Subject 7's MLP performance also shows
a good correlation with the expert labels, with just two segments at times 0:17 and 1:20 for which the
output fails to indicate an intermediate to high level of alertness.
8.2.2 Quantitative correlation with expert labels
To give a more objective measure of MLP performance on each test subject in turn, the 15-pt median-filtered MLP output range was divided into three sub-intervals. Values between 0.0 and 0.3 are considered to match the drowsy labels WCα, WIθ and WCθ. The second interval, bounded between 0.3 and 0.7, represents the intermediate state WIα, and values between 0.7 and 1.0 correspond to the alert states Active, QWP and QW. Correlation of the median-filtered MLP output with the expert labels, on a
1-s basis, according to this assignment, reinforces the visual assessment (see Table 8.3). The gap between
the best and the worst values is as narrow as 16.4%, the worst correlation being found for subject 4, as
expected, and the best for subjects 1 and 6.
Partition      1       2       3       4       5       6       7
Correlation  60.93   53.04   50.82   44.56   47.45   58.38   49.13

Table 8.3: Percentage correlation between 1-s segments of the 15-pt median-filtered MLP output and the 15-s based expert labels
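This assignment and the resulting agreement measure can be sketched as follows (assuming smoothed from the median-filtering step above and an expert array holding 0 = drowsy, 1 = intermediate, 2 = alert for each second):

    import numpy as np

    # 0: [0, 0.3), 1: [0.3, 0.7), 2: [0.7, 1.0]
    bins = np.digitize(smoothed, [0.3, 0.7])
    correlation = 100.0 * np.mean(bins == expert)   # percentage agreement per 1-s segment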
8.3 Training an MLP with n subjects
The results in the last section show that an MLP trained with the vigilance database is able to track the
alertness-drowsiness continuum. The set of optimal MLPs for all the partitions could be used to analyse
new data as a committee of networks (see §5.3.3). The new data would be presented to all the networks
and the average of the outputs used as an estimate of the alertness posterior probability P(A | x).
However, this average may conceal rapid changes in the alertness-drowsiness continuum which may be
important in the assessment of sleepiness in OSA patients. An alternative and easier approach is to train
an MLP using all seven subjects in the vigilance database in order to analyse subsequent test data. This
neural network will be referred to as the 7-subject MLP in the sections and chapters which follow.
The sequence of steps is as follows:
1. Build training and validation sets on n subjects using 800 (or as many as are available) patterns per subject per class. This yields a total of 2,347 D and 2,800 A patterns (see Table 7.3) per set.
2. Normalise training and validation sets with respect to the training set statistics.
3. For each set of values of the network parameters (J, νz , νy ) and weight initialisation seed, train a
10-J-1 MLP using the cross-entropy error function and the scaled conjugate gradient optimisation
algorithm. Use three different random initialisations for the weights to increase the chance of
finding a better minimum for the error function during the training process.
4. Choose the optimal MLP based on the performance on the validation set.
The range for the MLP regularisation parameters is the same as for the MLP trained with (n − 1) subjects.
The number of hidden units J is varied from 2 to 10. Thus, the total number of 7-subject MLPs trained to find the optimum parameters is 324. Fig. 8.10 shows the average misclassification error for the validation
set against the number of hidden units J. The optimal MLP is found at J = 3, with regularisation
parameters (νz , νy ) optimal at (10−3 , 10−6 ). The best classification error on the validation set is 20.24%,
with a corresponding error of 20.67% on the training set.
8.4 Summary and conclusions
In this chapter, the vigilance database has been used to train a single-output MLP in order to track the
alertness-drowsiness continuum. Wakefulness EEG is more susceptible to artefacts and rapid changes
than sleep EEG. When the high degree of overlap between the Alertness and Drowsiness classes is also considered, this makes the analysis of the vigilance EEG a more difficult problem. As there are only 7 subjects available, and it was known from the visualisation studies that there was a large amount of inter-subject variability in the feature vectors, the leave-one-out method was used to train the MLP. For
a 7-subject database this method yields 7 data partitions, each with 6 subjects. Training and optimisation
of the MLP parameters was carried out for each partition, and the optimal network tested in each case
with the nth subject. The correlation between the MLP output and the expert labels varies from 44.6% to
60.9% across the subjects, showing that an optimal MLP trained with (n − 1) subjects from the vigilance
database is capable of tracking the variations in the level of alertness of the nth (test) subject. For further
use with unseen data, the MLP is re-trained using all the subjects in the 7-subject database. Its use in the
evaluation of test data acquired from other subjects is considered in the following chapter.
Figure 8.3: Time course of the MLP output for vigilance subject 1

Figure 8.4: Time course of the MLP output for vigilance subject 2

Figure 8.5: Time course of the MLP output for vigilance subject 3

Figure 8.6: Time course of the MLP output for vigilance subject 4

Figure 8.7: Time course of the MLP output for vigilance subject 5

Figure 8.8: Time course of the MLP output for vigilance subject 6

Figure 8.9: Time course of the MLP output for vigilance subject 7

Figure 8.10: Average misclassification error for the validation set vs. number of hidden units J for the 7-subject MLP
Chapter 9
Testing using the vigilance trained
network
The MLP trained with the vigilance database can now be used to track the vigilance continuum in new
OSA patients. This chapter presents the use of the 7-subject vigilance MLP with new data obtained during
a separate vigilance study in OSA patients.
9.1 Vigilance test database
A physiological vigilance study carried out by the Osler Chest Unit staff at the Churchill Hospital, Oxford, provides frontal EEG records from ten OSA subjects, with varying degrees of severity of the sleep
disorder. The EEG was recorded during a vigilance test which lasted for a maximum of 40 minutes, the
duration depending on the degree of sleepiness of the subject during the test. The test, performed in a sleep-promoting environment, requires the subject to respond (by pushing a button) each time he sees a light-emitting diode (LED) flash for about 1 s. The LED flashes every 3 seconds and the test finishes after the subject misses 7 consecutive stimuli. More details about the test and clinical details of
the patients’ sleep disorders can be found in Appendix D. No expert scores are provided for this database,
just the button signal for every test. This can be used as a performance measure to validate the analysis
of the EEG. A summary of this database follows:
Number of subjects        10
Condition                 diagnosed with OSA
Description               4 to 6 vigilance tests, denoted with the letters A to F in chronological order
Electrode montage         frontal
Sampling frequency        128 Hz
Number of expert scorers  none, but a performance measure is available
9.2 Running the 7-subject vigilance MLP with test data
9.2.1 Pre-processing
EEG signal: The EEG data was pre-processed with the 19-pt low-pass FIR filter and the mean removed, as described in previous chapters. Feature extraction, using 10th-order reflection coefficients calculated with Burg's algorithm within a sliding 3-s window with 2-s overlap, yields a 10-D vector for each second of EEG. The complete set of these feature vectors will be referred to as the LED test set from now on.
Visual identification of artefacts in the EEG was performed to mark and discard from the analysis those
segments contaminated with saturation and artefacts caused by poor electrode contact. Subject 10’s tests
B and E were excluded from the analysis that follows, due to artefacts or the lack of regular response to
the stimuli.
Button signal: The pulse signal from the button was filtered and used to extract a performance measure related to the number of missed stimuli. No trigger signal was provided, hence the start of each test was set to be the second at which the subject starts pressing the button with regularity, every 3 s, assuming that the LED flashed every 3 s from that moment on. A missed stimulus is then recorded as occurring when the button is found not to have been pressed during the three seconds between flashes. The number of consecutive missed stimuli is calculated on a 3-s basis, synchronised with the stimuli, i.e. if the subject has missed n consecutive hits at time ta seconds, then 1 missed hit was recorded at (ta − 3(n−1)) seconds, 2 missed hits at (ta − 3(n−2)) seconds, . . . , (n−1) missed hits at (ta − 3) seconds, and finally n missed hits at ta.
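This running count can be sketched as follows (an illustrative implementation; the one-boolean-per-3-s-stimulus-interval layout is our assumption):

    import numpy as np

    def consecutive_missed(pressed):
        """Running count of consecutive missed stimuli, reset on each hit."""
        missed = np.zeros(len(pressed), dtype=int)
        run = 0
        for i, hit in enumerate(pressed):
            run = 0 if hit else run + 1
            missed[i] = run
        return missed

    # e.g. consecutive_missed([True, False, False, True]) -> array([0, 1, 2, 0])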
9.2.2 MLP analysis
Normalisation of the reflection coefficients extracted from the EEG signals acquired during the LED tests
was performed, using the 7-subject training set statistics, and the normalised patterns were presented
to the 7-subject 10-3-1 vigilance MLP (see §8.3). Figures 9.1 to 9.22 show the MLP output time courses
along with the missed stimuli performance measure for each test for each patient. None of the MLP
outputs shown have been median-filtered. Note also the different time scales for each figure, depending
on the length of each test.
Visual inspection of the time courses does not reveal a consistent pattern of correlation between the MLP output and the performance measure across patients, nor even between different tests for the same patient. For instance, test A for subject 1 shows a paradoxically low value of the MLP output for the
first 3 minutes of the test, when the subject missed no more than 3 stimuli, and an increase in the MLP
output towards the second half of the test as the subject starts to miss more and more button hits. The
MLP output for the other two tests (C and D) suggests a drowsy subject struggling to keep himself awake
throughout the test, with little or no correlation with the actual performance, with the exception of the
last few seconds of test D, at which point the output goes close to zero while the missed stimuli measure
shows a severe decrease in the subject’s performance.
The next example, subject 2 test A, shows very good correlation between the MLP output and the performance measure. The MLP output, generally high during the first half of the test, when the performance is
good, suddenly decreases, just before the performance starts to deteriorate, and remains close to drowsiness levels towards the end of the test. However, the MLP output in subsequent tests for the same
subject suggests a drowsier subject, remaining under 0.5 even when the performance is good. Some isolated peaks and other oscillations close to intermediate values may indicate the subject's struggle against
drowsiness. Subject 3 is another example of good correlation in the first and fourth tests but not in the
other three tests, for which the MLP output is highly oscillatory with no apparent connection with the
button hits. The fourth test is similar to the first one, showing a decreasing trend as the performance
worsened.
Subject 4 endured the four tests with excellent performance and a consistently high MLP output, in agreement with the performance. Unfortunately, this subject provides no data for the "drowsy stages" since he never missed more than two hits. The flat output during the 33rd minute of test C for subject 4 is an artefact due to a loose electrode connection. Subject 5 has a medium-to-low MLP output, with a trend that tends to match the decrease in performance in all the tests.
The MLP output for subject 6 shows dramatic oscillations between the extreme values of drowsiness and
alertness (0 and 1) as the performance decreases. The period of the oscillations is of the order of minutes,
and for most of the cases when the number of missed hits rises above a value of two, a dip in the MLP
output which is approximately 20s-long precedes the increase in the number of missed stimuli. Although
this indicates a better correlation than for the subjects discussed up to now, a low value in the MLP output
at the beginning of tests A, B, C and E, when the performance is very good, spoils the overall correlation
between the MLP output and the performance measure for this subject.
Subject 7 starts with a very short test, during which the MLP output is constantly low notwithstanding
the first minute of perfect stimulus responses. His second test shows the expected correspondence between
the MLP output and the number of missed hits. His third test is similar to the second one, but longer,
suggesting that the subject was drowsier than in the previous tests, perhaps performing reasonably well
because he had learned to cope with the test in spite of his increasing drowsiness. His fourth test differs
from the rest in that he did not fall asleep. The MLP output suggests that he was drowsy at the start of
the test, progressively gaining a better degree of vigilance, until he starts missing several LED flashes, and
then struggles between drowsiness and alertness throughout the rest of the test, although maintaining
good performance until the end.
The 8th subject shows an MLP output close to 1.0 throughout, with a few dips, most of them matching the loss of ability to respond to the stimuli. The one exception to this pattern is the 3-minute period at the
beginning of test C, where the MLP output sweeps through a much wider range. It is worthwhile noting
that subject 8 has the lowest value for the subjective measure of sleepiness (ESS) of any patient in this
database (see Table D.2), and also one of the lowest oxygen saturation dip rates during the overnight
sleep study (see Table D.3), values comparable to those of subject 4, whose performance was the best
as he did not fall asleep during any of the tests. These two subjects could be considered to be at the
lower end of the spectrum of OSA severity, and the MLP output seems to corroborate the clinical results.
However, we shall see later that subject 8's EEG differs largely from the rest of the EEGs in the LED
database and in the original vigilance database of chapter 8, which explains why the output is almost
constant at the upper end of the scale.
The next subject, number 9 in the database, has the worst ESS value and a relatively high index of
night-time O2 de-saturations, and fell asleep very quickly in every test. The MLP output is generally well
correlated with the performance measure, in trend and locally, as for example, when the subject recovers
from a peak in the number of missed stimuli. The last subject in the database, subject 10, performed
very short tests, and there is a degree of correlation between the MLP output and the performance
measure in the first two tests. His third test shows a notch in the MLP output preceding a 9s-long lapse in
performance. His fourth test, however, does not show any significant decrease in the MLP output when
the performance drops towards the end.
To corroborate these comments, the MLP output values were plotted against the number of consecutive
missed stimuli for each LED subject. Only the values at the times at which the stimuli occur have been
considered, as we cannot make any assumption about the subject’s vigilance state in the absence of the
stimulus.
The scatter plots in Figs. 9.23 and 9.24 show that the MLP output tends to take on values below 0.5 as
the number of missed hits increases, especially in subjects 1, 2, 5 and, to some extent, in subjects 7 and
9. This cannot be said of subjects 3, 6, 8, and 10, however, although results for subject 8 can be discarded
9.2 Running the 7-subject vigilance MLP with test data
186
LED subject 1, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
time [minutes]
4
5
6
(a) test A
LED subject 1, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
time [minutes]
20
25
30
(b) test B
Figure 9.1: LED subject 1 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
187
LED subject 1, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
8
10
time [minutes]
12
14
16
(a) test C
LED subject 1, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
time [minutes]
20
25
(b) test D
Figure 9.2: LED subject 1 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
188
LED subject 2, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
time [minutes]
20
25
(a) test A
LED subject 2, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
8
10
time [minutes]
12
14
16
(b) test B
Figure 9.3: LED subject 2 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
189
LED subject 2, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
time [minutes]
(a) test C
LED subject 2, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
time [minutes]
8
10
12
(b) test D
Figure 9.4: LED subject 2 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
190
LED subject 3, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
0.5
1
1.5
2
time [minutes]
2.5
3
3.5
4
(a) test A
LED subject 3, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
5
time [minutes]
6
7
8
(b) test B
Figure 9.5: LED subject 3 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
191
LED subject 3, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
5
time [minutes]
6
7
8
(a) test C
LED subject 3, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
time [minutes]
5
6
7
(b) test D
Figure 9.6: LED subject 3 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
192
LED subject 3, test E
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
time [minutes]
4
5
6
(a) test E
Figure 9.7: LED subject 3 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
193
LED subject 4, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
time [minutes]
20
25
30
(a) test A
LED subject 4, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
20
time [minutes]
25
30
35
40
(b) test B
Figure 9.8: LED subject 4 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
194
LED subject 4, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
20
time [minutes]
25
30
35
40
25
30
35
40
(a) test C
LED subject 4, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
20
time [minutes]
(b) test D
Figure 9.9: LED subject 4 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
195
LED subject 5, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
8
10
12
time [minutes]
14
16
18
20
(a) test A
LED subject 5, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
time [minutes]
8
10
12
(b) test B
Figure 9.10: LED subject 5 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
196
LED subject 5, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
8
10
time [minutes]
12
14
16
(a) test C
LED subject 5, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
8
10
time [minutes]
12
14
16
18
20
(b) test D
Figure 9.11: LED subject 5 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
197
LED subject 6, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
8
10
12
14
time [minutes]
(a) test A
LED subject 6, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
5
6
time [minutes]
7
8
9
10
(b) test B
Figure 9.12: LED subject 6 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
198
LED subject 6, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
time [minutes]
8
10
12
(a) test C
LED subject 6, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
5
time [minutes]
6
7
8
9
(b) test D
Figure 9.13: LED subject 6 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
199
LED subject 6, test E
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
2
4
6
8
10
time [minutes]
12
14
16
18
(a) test E
Figure 9.14: LED subject 6 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
200
LED subject 7, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
0.5
1
1.5
2
time [minutes]
(a) test A
LED subject 7, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
5
time [minutes]
6
7
8
9
(b) test B
Figure 9.15: LED subject 7 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
201
LED subject 7, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
time [minutes]
5
6
7
8
(a) test C
LED subject 7, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
20
time [minutes]
25
30
35
40
(b) test D
Figure 9.16: LED subject 7 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
202
LED subject 8, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
time [minutes]
15
20
(a) test A
LED subject 8, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
time [minutes]
20
25
30
(b) test B
Figure 9.17: LED subject 8 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
203
LED subject 8, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
5
time [minutes]
6
7
8
9
(a) test C
LED subject 8, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
5
10
15
time [minutes]
20
25
30
(b) test D
Figure 9.18: LED subject 8 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
204
LED subject 9, test A
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
1
2
3
4
time [minutes]
5
6
7
(a) test A
LED subject 9, test B
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
0.5
1
1.5
2
2.5
time [minutes]
3
3.5
4
4.5
(b) test B
Figure 9.19: LED subject 9 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
205
LED subject 9, test C
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
0.1
0.2
0.3
time [minutes]
0.4
0.5
0.6
(a) test C
LED subject 9, test D
MLP output
1
0.5
0
7
6
missed hits
5
4
3
2
1
0
0
0.2
0.4
0.6
time [minutes]
0.8
1
1.2
(b) test D
Figure 9.20: LED subject 9 MLP output and missed hits time courses
9.2 Running the 7-subject vigilance MLP with test data
206
Figure 9.21: LED subject 10 MLP output and missed hits time courses; (a) test A, (b) test C.
Figure 9.22: LED subject 10 MLP output and missed hits time courses; (a) test D, (b) test F.
as we will see in the next section. Subject 4 lacks any data for more than 2 missed hits. It is important to
note that the reliability of the results for high values of missed hits is low, as not enough data points are
available, given that the test finishes whenever the subject fails to respond to 7 consecutive LED flashes.
Figure 9.23: LED subjects MLP output vs missed hits scatter plots; (a) subject 1, (b) subject 2, (c) subject 3, (d) subject 4, (e) subject 5. [MLP output (0-1) against missed hits (0-7).]
It is also clear that, when the subject responds to the stimuli (i.e. no missed hits), the MLP output can
take on any value over the whole range, with a distribution that varies from unimodal to bimodal to
uniform, as shown in Fig. 9.25. This figure is particularly puzzling for subjects 1 and 2, as well as subject 5, since all of these have a unimodal distribution centred at zero or at a very low value of the MLP output. Subjects 3, 7 and 9, on the other hand, present a very uniform distribution. This suggests that severe OSA patients may perform reasonably well for some time when their brain is in a "drowsy" state. However, they cannot maintain this level of performance indefinitely, and it drops off sooner or later depending on the severity of the disorder.
It can also be said, when reviewing the MLP outputs for all the subjects, that the transition to Drowsiness
can happen in a progressive manner as well as in sudden dips. Two examples of the former are shown
Figure 9.24: LED subjects MLP output vs missed hits scatter plots; (a) subject 6, (b) subject 7, (c) subject 8, (d) subject 9, (e) subject 10.
in Figs. 9.11 and 9.3, and two examples of the latter can be found in Figs. 9.12 and 9.6. It is not clear why this should be, but it is probably dependent on the subject rather than on the conditions under which the test is performed (e.g. the time at which the test takes place), as no subject was found to exhibit both types of behaviour in the LED tests. The time courses of the MLP outputs from the normal sleep-deprived subjects (Figs. 8.3 to 8.9 in the previous chapter) showed predominantly a gradual transition
to drowsiness, with a few occasional dips (for example, subject 2 at times 55 minutes and 100 minutes)
but the lack of a suitable performance measure for these records prevents us from being able to draw a
definite conclusion.
9.3 Visualisation analysis
In order to get a deeper insight into the EEG data corresponding to good performance (i.e. no missed hits, as for the histograms of Fig. 9.25), the distribution of the vectors in feature space is investigated using the 7-subject vigilance NeuroScale map of section 7.1.2.
Figure 9.25: LED subjects no-missed hits MLP output histogram; one panel per subject (1-10), with the MLP output (0-1) on the horizontal axis.
9.3.1 Projection onto the 7-subject vigilance NeuroScale map
The LED feature vectors are normalised using the mean and variance of the 7-subject vigilance database, and then presented to the NeuroScale map previously trained on the same vigilance database. Figs. 9.26 and 9.27 show the projection of the LED patterns (thick dots: grey for data points with 0 and 1 missed hits, yellow for data points with 2, 3 or 4 missed hits, and red for data points corresponding to 5, 6 or 7 missed hits) onto the 7-subject vigilance 2-D map (A means as magenta ×'s and D means as blue o's). The scale is the same for Figs. 9.26 to 9.28, slightly expanded in Fig. 9.29 and completely different for subject 8 in Fig. 9.30. In each case, the percentage of outliers not shown on the map is indicated in brackets.
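The normalisation step is a straightforward z-scoring of the new feature vectors with the training-set statistics. The following minimal Python sketch illustrates it; the randomly generated feature matrices are hypothetical stand-ins for the outputs of the EEG processing stage:

import numpy as np

# Hypothetical stand-ins for the two feature databases (one row per
# 10-D AR feature vector); in practice these come from the EEG analysis.
rng = np.random.default_rng(0)
vigilance_feats = rng.normal(0.0, 2.0, size=(5000, 10))   # training database
led_feats = rng.normal(0.5, 2.5, size=(3000, 10))         # new LED database

# Normalise the LED vectors with the TRAINING mean and standard deviation,
# so that both databases share the coordinate system the map was trained in.
mu = vigilance_feats.mean(axis=0)
sigma = vigilance_feats.std(axis=0)
led_normalised = (led_feats - mu) / sigma
# led_normalised would then be projected through the trained NeuroScale
# network to give 2-D maps such as Figs. 9.26 to 9.30.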
The 2-D projections show that, for more than half of the LED subjects, the patterns lie in the same area as the vigilance A and D means, with a few patterns lying outside, mainly in the middle-lower section of the plots and in the outer area around the A means. The LED database, in contrast with the vigilance database, which used the central electrode montage, was acquired with frontal electrodes, and it was found that
Figure 9.26: Patterns from LED subjects 1 and 2 projected onto the 7-subject vigilance NeuroScale map; panels for 0-1, 2-4 and 5-7 missed hits, with the Alert and Drowsy means marked (outliers: subject 1, 3.4%; subject 2, 0%).
EEG contaminated by blinking artefacts or movement artefacts produces patterns that lie on the periphery of the A means. EEG recorded with frontal electrodes is prone to these two kinds of artefact, usually absent in central EEG, and this could explain the outliers in the 2-D projections of the patterns for LED subjects 4, 5, 6, 7 and 10. In contrast, subjects 1, 2, 3 and 9 produced EEG patterns which are very likely to come from the same distribution as the vigilance database, given their projections in the 2-D NeuroScale map. Subject 8's patterns represent the other end of the range, his 2-D projection showing a high percentage of patterns lying far away from the A and D means, towards the lower-right corner of the map.
Qualitative correlation with the MLP results
Subject 1
Most of his 0-1 missed hits patterns lie in the region of overlap between alertness and
drowsiness. There is a tendency towards the area in the map which represents drowsiness, so that it can
Figure 9.27: Patterns from LED subjects 3 and 5 projected onto the 7-subject vigilance NeuroScale map; panels for 0-1, 2-4 and 5-7 missed hits (outliers: subject 3, 1.6%; subject 5, 0.62%).
be said that this subject was mainly drowsy during the tests. The histogram in Fig. 9.25 for the 0-1 missed
hits MLP output is unimodal, with a peak near zero, indicating that this subject performs well while drowsy, even though he is the second most severe case of OSA in the database according to the overnight sleep study (the severity of the disorder being assessed here mainly by the number of dips in oxygen saturation per hour; see Appendix D).
Subject 2
This subject has a similar distribution of data points over the map, with more vectors in the
alertness region than subject 1, corresponding to the second peak in the bimodal distribution of the 0-1
missed hits MLP output values in Fig. 9.25. From this, it can be said that this subject is generally drowsy
but sometimes alert. The MLP output time courses reveal that the subject only presents these “alert”
patterns during the first test; he then seems to have trained himself to perform in "automatic mode"¹, i.e. while being deeply drowsy. His overnight study suggests that he also has a severe case of OSA.
Subject 3
This subject's patterns are evenly distributed over the region of overlap on the map, correlating with the uniform distribution of the 0-1 missed hits MLP outputs in Fig. 9.25. These results and
the MLP output time courses suggest that the level of vigilance for this subject varies between drowsiness
and alertness. He had an average number of oxygen de-saturations the previous night.
Subject 4
This subject's patterns in general, and especially those for 0-1 missed hits, lie mostly within the alertness region and within the region of overlap (suggesting that he was alert during the tests),
with a significant number of outliers. The histogram for the MLP output in Fig. 9.25 shows a unimodal
distribution with a peak at 1.0. This peak is due not only to the patterns within the region of alertness
but also to the outliers. This subject is the second mildest case of OSA in the database.
Subject 5
His patterns are distributed in the region of overlap on the map, with a tendency
towards alertness, so that it can be said that he was slightly more alert than drowsy. The histogram in
Fig. 9.25 shows a unimodal distribution with a mean at around 0.3. This subject represents the mildest
degree of OSA in the database.
Subject 6
The patterns for this subject are unusual in that they lie mostly in the lower centre area
of the map (a region of overlap) with a large number of outliers. The distribution for the MLP output for
0-1 missed hits in Fig. 9.25 shows a peak which could be due to outliers, the distribution being uniform
otherwise (and suggesting a level of vigilance between drowsy and alert). Indeed, the MLP output time
courses show continuous fluctuations between drowsiness and alertness. The sleep study categorised this
subject as having a serious case of OSA.
¹ Automatic behaviour is a phenomenon reported by the sleep-deprived in which they perform relatively routine behaviour without having any memory of doing so [31].
Subject 7
Although they lie mostly in the region of overlap between drowsiness and alertness,
a proportion of the patterns for 0-1 missed hits lies in the alertness area, explaining the second peak in
the 0.9-1.0 bin of the histogram in Fig. 9.25. The first peak occurs around 0.25. This suggests that this
subject was more alert than drowsy, but is also able to perform well when drowsy, as his first and third
test reveal in the MLP time courses.
Subject 8
Note the completely different scale used for this subject, because a large proportion of his patterns are outliers, with only a few lying in the alertness-dominated region. The impulse-like histogram in
Fig. 9.25 probably owes its peak at 1.0 to the outliers.
A visual inspection of subject 8’s EEG reveals a signal rich in high frequencies. The raw signal was strongly
contaminated with mains interference, removed by the filtering process prior to analysis. Nevertheless,
the filtered signal still shows frequencies in the upper β band, i.e. as high as 25-30 Hz, characteristic of
a very alert state. The vigilance database was obtained from normal subjects who were sleep deprived,
and who were probably drowsy enough not to show up the higher β frequencies in their EEG. These are
therefore absent in the training database and appear as outliers in the N EURO S CALE map of Fig. 9.30.
Subject 9
The few patterns available from this subject lie in the region of overlap on the map, with some
of them spreading towards the alertness region. This correlates well with the nearly uniform distribution
for the histogram of MLP outputs for 0-1 missed hits in Fig. 9.25, and suggests that this subject’s vigilance
was somewhere between alertness and drowsiness. This subject, who fell asleep very quickly in every
test, rated his level of sleepiness as the worst possible (ESS in Table D.3, which also shows a large number of oxygen de-saturations during the night, i.e. severe OSA).
Subject 10
Almost all the patterns for this subject lie in the alertness area of the map, including those
for 5-7 missed hits. This suggests that while subject 10 was mostly alert during the tests, he failed to
respond to 5, 6 and 7 LED flashes when alert! Although his histogram of MLP outputs for 0-1 missed
hits is as expected, the scatter plot in Fig. 9.24 shows no correlation between the MLP output and the
performance measure. This corresponds to what is shown in the NeuroScale map. It is important to
note that this subject fell asleep very quickly each time and is the most severe case of OSA in the database,
with a rate of oxygen de-saturations more than double the second most severe case, and with the highest
number of movement arousals during the night.
9.4 Discussion
The results of the visualisation analysis have shown that for some subjects in the LED database, such as subject 8 and, to a lesser extent, subjects 4 and 6, the EEG does not have the same characteristics as the EEG in the training database, and hence the results from the MLP are not reliable. Also, the low average value of the MLP output for most of the subjects may be an influential factor in the decreasing exponential trend in the scatter plots of subjects 1, 2, 5, 7 and 9, as the statistical significance of the plot decreases (fewer data points) with an increase in the number of consecutive missed hits. Except for subject 6, the projection in the NeuroScale map shows little difference in the distribution of the feature
vectors for 0–1, 2–4 and 5–7 missed hits. Although this does not necessarily imply the same overlap in
the 10-D feature space, it is another factor to bear in mind in the interpretation of the MLP results and
when considering the correlation between the MLP output and the performance measure for the subjects
in the LED database.
9.5 Summary and conclusions
The MLP trained with the 7-subject vigilance database has been used to analyse new data from a vigilance
study in OSA patients. The study, consisting of 4 to 6 vigilance tests, provided frontal EEG recordings
and a performance measure at regular intervals. The MLP output, representing a continuum between
drowsiness (0) and alertness (1), was calculated for each test.
Visual inspection of the MLP output time courses does not show consistent correlation with the performance measure. Scatter plots of the MLP output against the performance measure reveal that the MLP
output is generally low when the performance measure indicates deep drowsiness, as expected, but takes
any value between 0 and 1 when the performance measure suggests alertness. Similar results have been
found by other researchers in a random visual stimulus response test [88][32]. OSA patients seem to
perform relatively well even when their electrophysiological signals indicate drowsiness or even light
sleep. Kecklund and Akerstedt have also found that lorry drivers seem to be able to drive in spite of the
appearance of alpha activity in their EEG [83]. However, the reduction in the statistical significance of
the correlation between MLP output and performance measure as the performance deteriorates prevents
us from making any strong statement about the “drowsy” EEG of OSA subjects.
The NeuroScale visualisation technique was also applied to the database tested in this chapter in order to validate the MLP results. The analysis strongly suggests that the EEG of one of the subjects in the database, subject 8, is very different from that in the vigilance database and therefore the MLP results for this subject should be discarded, as the MLP produces no reliable results when it extrapolates. Of the rest of the subjects in the database, 7 out of 9 seem to have EEG patterns belonging to the same distribution as that found in the vigilance database, and hence the MLP trained with normal subjects can be used in the study of these OSA patients.
Figure 9.28: Patterns from LED subjects 7, 9 and 10 projected onto the 7-subject vigilance NeuroScale map; panels for 0-1, 2-4 and 5-7 missed hits (outliers: subject 7, 1.1%; subject 9, 0.12%; subject 10, 2.1%).
Figure 9.29: Patterns from LED subjects 4 and 6 projected onto the 7-subject vigilance NeuroScale map; panels for 0-1, 2-4 and 5-7 missed hits (outliers: subject 4, 8.9%; subject 6, 0.61%).
Figure 9.30: Patterns from LED subject 8 projected onto the 7-subject vigilance NeuroScale map; panels for 0-1, 2-4 and 5-7 missed hits (outliers 0.54%).
Chapter 10
Conclusions and future work
10.1 Overview of the thesis
The main objective of the research described in this thesis has been to develop neural network methods
to study the sleep and wake EEG of subjects with the severe breathing disorder OSA. In chapter 6, which
describes the analysis of sleep studies, an MLP neural network was trained with AR model reflection
coefficients as inputs. These were extracted from a single channel of EEG recorded during the sleep of
normal subjects and the network was trained to track the sleep-wakefulness continuum. An automated
system based on this MLP output and a set of logical rules was developed and tested with OSA sleep EEG.
The results, validated against scores from a human expert, show that the automated system is able to
detect most of the μ-arousals in the EEG of these patients with accuracy, not only in occurrence (with a
median sensitivity of 0.97, and a median positive predictive accuracy of 0.94), but also in starting time
and duration (with a median correlation index of 0.82). In chapter 7, visualisation algorithms applied to the sleep EEG database and the wake EEG database (acquired from sleep-deprived normal subjects) showed the need for another MLP network to analyse the wake EEG, as its characteristics differ from
those of the EEG in the sleep database. The vigilance analysis, covered in chapters 8 and 9, was again
carried out by training MLP neural networks with AR model reflection coefficients at the input. This time,
these were extracted from a single channel of EEG recorded from normal sleep-deprived subjects and the
network was trained to track the alertness-drowsiness continuum. The trained MLPs were tested with
data from normal sleep-deprived subjects as well as from OSA patients performing a visual vigilance task.
The test on normal subjects was correlated with a human expert assessment of the EEG, and the mean
correlation was found to be 52.0% (sd 5.9%). A performance measure was used to evaluate the MLP
output on the EEG from OSA subjects. The results of this analysis, although not totally conclusive, have
raised important questions about the effect of OSA on the EEG.
10.2 Discussion of results
While the sleep studies yielded very good results in general, the correlation between the MLP output
and the performance measure in OSA subjects was highly variable. It is well known, however, that the
effectiveness of performance measures in the assessment of sleepiness depends largely on the task characteristics. There is no perfect task with which to evaluate a decrease in vigilance. The physiological-behavioural link is not straightforward, the task itself being intrusive in the natural process of drowsiness. Many factors, such as motivation, circadian rhythm and habituation, can make a very drowsy subject perform well, or even better than usual. Pivik [130] stressed the relevance of long-practice effects, which can improve
the performance on a given task without an improvement in the physiological condition. Also, not all
investigations of sleep loss have shown adverse effects on performance [139]. The effects of sleep loss are
similar to those of OSA, as the latter fragments the sleep and diminishes its total time. Dinges et al. review the literature in the area [42] and conclude that performance variance increases with sleep loss, that
habituation to a repetitive task is augmented in a sleepy brain, that performance depends non-linearly
on sleep loss and time of the day (related to the circadian rhythm), and that motivation or “willingness
to perform” may have a distinct effect on the capacity to perform. The attentional task used to build
the LED database is repetitive and could have caused habituation to the task after a few minutes of the
first test, or in subsequent tests, as is the case for subjects 1 and 2. Variance in performance as the subject
gets drowsy might explain the poor correlation found for subjects 3, 6 and 10 (see Figs. 9.23 and 9.24).
Circadian rhythm and/or motivation may explain why subject 7 fell asleep after two and a half minutes
in his first test at midmorning, and performed without falling asleep for 40 minutes in the last test, early
in the afternoon.
Inter-subject EEG variability was a problem encountered many times in this thesis. The μ-arousal automated system results were very satisfactory for 5 out of 7 subjects, but disappointing for two of the
subjects. One of these two subjects presents mixed frequency EEG at the time of the μ-arousal and the
other one shows an EEG with much higher content in the upper frequency bands than for the rest of the
database. In a recent paper [46] Drinnan et al. have found that μ-arousal inter-scorer agreement tends
to be poor when the μ-arousal occurs embedded in high-frequency EEG. Also, in the vigilance EEG study
(chapter 7), the case of an α+ subject highlighted the variation of wake EEG patterns in the general population, and showed the need for special consideration of these subjects, who represent a non-negligible fraction of the total population.
10.3 Main research results
As mentioned in section 1.1, there had been no prior work in the computerised analysis (using the same
framework) of both sleep disturbance and vigilance from the EEG before the research described in this
thesis. In the course of this research, several findings have been made. Amongst these are:
1. A compromise was found between the stationarity requirements of AR modelling and the variance of the AR reflection coefficients by using a 3-second analysis window with a 2-second overlap. The AR model is still able to follow rapid changes whose duration is 3 seconds or more. To our knowledge, this is the first time that AR modelling has been used in μ-arousal detection (a windowing sketch is given after this list).
2. A μ-arousal may cause an increase in the δ rhythms of the EEG at the same time as it causes an
increase in the amplitude of the higher frequencies (α and/or β bands). Hence a μ-arousal is
not necessarily just a “shift” in frequency, as often described in the literature related to μ-arousal
detection.
3. The visualisation analysis described in chapter 7 revealed that Alertness and Drowsiness in vigilance
tests are not the same as Wakefulness and REM/Light Sleep in a sleep-promoting environment.
4. MLP analysis and visualisation techniques applied to wake EEG in OSA subjects show that these
patients can present “drowsy” EEG while performing well during a visual vigilance test.
5. The MLP analysis of the wake EEG in OSA patients has shown that the transition to Drowsiness may
occur progressively as well as in sudden dips.
6. The Alertness EEG patterns of OSA subjects may not be the same as the Alertness patterns of normal sleep-deprived subjects. Instead, they seem to resemble more closely the Drowsiness patterns of these normal subjects.
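To make the windowing arithmetic of finding 1 concrete, here is a minimal Python sketch extracting one reflection-coefficient vector per second, using a 3-second window with a 1-second hop (i.e. a 2-second overlap). It assumes Burg's recursion as the reflection-coefficient estimator and an AR model order of 10, matching the 10-D feature space referred to in chapter 9; it illustrates the window arithmetic and is not a reproduction of the thesis software.

import numpy as np

def burg_reflection(x, order):
    """Reflection (partial correlation) coefficients by Burg's recursion."""
    f = np.asarray(x, dtype=float).copy()   # forward prediction errors
    b = f.copy()                            # backward prediction errors
    k = np.zeros(order)
    for m in range(order):
        fp, bp = f[1:], b[:-1]
        k[m] = -2.0 * np.dot(fp, bp) / (np.dot(fp, fp) + np.dot(bp, bp))
        f, b = fp + k[m] * bp, bp + k[m] * fp
    return k

def sliding_ar_features(eeg, fs, order=10, win_s=3.0, hop_s=1.0):
    """One feature vector per second: 3 s window, 2 s overlap."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    feats = [burg_reflection(eeg[s:s + win] - eeg[s:s + win].mean(), order)
             for s in range(0, len(eeg) - win + 1, hop)]
    return np.array(feats)

# Toy usage: 60 s of synthetic "EEG" at 128 Hz (the LED database rate).
rng = np.random.default_rng(0)
x = rng.normal(size=60 * 128)
print(sliding_ar_features(x, fs=128).shape)   # (58, 10)

The 1-second hop means that each new feature vector shares two thirds of its samples with the previous one, which is what allows the features to follow events as short as 3 seconds.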
10.4 Conclusions
From the sleep study results we conclude that the automatic scoring system, based upon a neural network
trained with normal sleep EEG, can be reliably used as a supporting diagnostic tool in the detection of μ-arousals in OSA patients.
As for the vigilance study, the neural network proved useful in the assessment of the EEG of these patients
in relation to their performance during the task. More work needs to be done to improve the statistical
significance of these results and to validate them against the scores from a human expert, as it is well
known that correlation of EEG with performance measures is not consistent, the task performance being
influenced by many factors other than physiological sleepiness.
In summary, the use of neural network methods, namely the NeuroScale algorithm for EEG data visualisation and the MLP network for description of the EEG state in sleep and in vigilance, has led to a better understanding of the effects of the OSA disorder on the sleep EEG, as we have obtained more insight into the changes during a μ-arousal. Also, we have found that the EEG alertness patterns of OSA
subjects may present similar characteristics to those of the drowsiness patterns in normal subjects (after
minor sleep deprivation).
10.5 Future work
Several issues are left open, which should be considered if further work is to be carried out in the
analysis of the EEG in OSA patients. The most important of these are:
1. There is a missing link between alertness in sleep-deprived normal subjects, alertness in OSA subjects and wakefulness in normal subjects prior to sleep onset. To fill this gap, a study should be
carried out to acquire EEG data from normal fully alert subjects performing vigilance tasks.
2. The various databases used in the work described in this thesis come from three different sleep
laboratories. This has some repercussions on the results as the databases differ in several aspects.
For instance, the wake EEG signals acquired with the frontal electrode montage show variations
in the α and θ content of the signal with respect to those acquired with the central electrode
montage. The vigilance training database was acquired using the central electrode montage while
the test database was recorded using frontal electrodes. This could have a significant effect on the
assessment of alertness by the neural network. Human expert scoring based on the same scale
as used for the training database is desirable on the LED database, in order to validate the MLP
results, given the controversy surrounding the reliability of performance measures in the evaluation
of drowsiness. Also, the task should be redesigned in order to control the habituation factor, and to
increase the amount of data for low performance.
3. The wake EEG from some OSA subjects presented patterns which differ to some degree from normality. Is the EEG of OSA patients when they are awake different from that of normal subjects? In
other words, is their alertness EEG more like the drowsiness EEG for normal individuals? Are they
constantly drowsy, behaving as if they were alert as a result of habituation? Little or no attention
has so far been given to this issue which has a major effect on the quality of life for many people.
4. There is a good deal of controversy surrounding the treatment of OSA [173]. Some evidence has
been found to support the use of nasal continuous positive airway pressure (nCPAP) therapy [59][32] as a
means of keeping the airways open during sleep. Pre- and post-treatment analysis of the EEG and
its relation to performance is suggested to validate nCPAP as an effective therapy for OSA. Once
this is done, the results could be used to find out if the EEG recovers after treatment (so that both
alertness and drowsiness patterns become similar to those of normal subjects) or whether there is
an irreversible long-term effect on the EEG.
5. The α content of the wake EEG has been described as distractive by clinicians [154], as it largely
differs across the population, and shows inconsistent variations from alertness to drowsiness. Some
of the work done in vigilance assessment [170] (and reviewed in section 3.4) uses α power with
eyes open and eyes closed, as a reference. We suggest that this procedure should be considered
(making sure that the subject is alert during this “calibration”), in order to pave the way for subjectadaptive vigilance analysis. This would imply re-training the neural network using eyes-open and
eyes-closed α power as reference in order to adjust the results for inter-subject α differences.
6. The work in this thesis has taken the application of linear models as far as possible, but the assumption of stationarity must break down regularly for wake EEG. Therefore, non-linear features, such
as complexity as in the work of Rezek et al. [137], time-delay embedding and ICA as in the work
of Lowe [98], should be investigated, as they have been reported as giving better discrimination
than linear methods in preliminary studies on the changes in the EEG of subjects either asleep or
performing vigilance tasks.
7. A more generalised approach in learning theory, called Support Vector Machine (SVM), has been
used for regression and classification [169], reporting better generalisation than neural networks
[144]. SVMs have been discarded for the work presented in this thesis as they lack probabilistic
outputs. However, a Bayesian framework has recently been developed for SVM, introducing the
Relevance Vector Machine (RVM) which does not suffer from the above disadvantage, and demonstrates comparable generalisation performance to SVM [162]. Future work should explore RVM as
an alternative to the use of neural networks in posterior probability estimation for the sleep and the
vigilance problems.
Appendix A
Discrete-time stochastic processes
A.1 Definitions
The definitions presented in this section have been taken from [63] and [62].
Stochastic Process: A statistical phenomenon that evolves in time according to probabilistic laws. From the definition of a stochastic process one may be tempted to interpret it as a function of the discrete time variable n¹, when in fact it represents an infinite number of different realisations ξ of the process u(n, ξ). An ensemble represents a set of realisations of the same process.
Time Series: A realisation ξ₀ of a discrete-time stochastic process is called a time series, u(n), consisting of a set of observations generated sequentially in time. A time series of interest is a sequence of observations u(n), u(n − 1), ..., u(n − M) generated at discrete and uniformly spaced instants of time n, n − 1, ..., n − M.
Statistical description of a discrete-time stochastic process: Consider a stochastic process represented by the ensemble shown in Fig. A.1. Each time series uᵢ(n) represents a random variable along the time axis, but the set of observations at a specific time n₁ represents a random variable as well, in this case across the ensemble.
First- and second-order moments may be defined across the process (ensemble). The mean-value function of the process is defined as:

μ(n) = E[u(n)]    (A.1)
The autocorrelation function of the process may be defined as:

r(n, n − k) = E[u(n) u(n − k)],  k = 0, ±1, ±2, ...    (A.2)
Another second-order moment, the autocovariance function, is defined as:

c(n, n − k) = E[(u(n) − μ(n))(u(n − k) − μ(n − k))],  for k = 0, ±1, ±2, ...    (A.3)
The autocorrelation and the autocovariance functions are related by:

c(n, n − k) = r(n, n − k) − μ(n) μ(n − k)    (A.4)

¹ For convenience, time is normalised with respect to the sampling period.
Figure A.1: Stochastic process ensemble (realisations u₁(n), ..., u₅(n) plotted against sample time n; the set of observations uᵢ(n₁) at a fixed instant n₁ is a random variable across the ensemble).
So, for partial characterisation of a stochastic process through its first and second moments it will be
sufficient to specify the mean value and either the autocorrelation or the autocovariance function.
Stationary process: A stochastic process will be stationary in the strict sense if all of its moments are constant. For example, the mean value will be:

μ(n) = μ    (A.5)

For such a process the autocorrelation and autocovariance functions depend only on the lag k:

r(n, n − k) = r(k),   c(n, n − k) = c(k)    (A.6)
Note that for a stationary process the autocorrelation function at k = 0 equals the mean-square value:

r(0) = E[|u(n)|²]    (A.7)

and the autocovariance at k = 0 equals the variance:

c(0) = E[|u(n) − μ|²] = σᵤ²    (A.8)
If the first and second moments of a process satisfy the conditions described above, it is at least stationary
to the second order, and if the variance is finite, wide-sense stationarity conditions will be satisfied.
Ergodicity: Consider a stationary (in the wide sense) process in which the time moments are constant as well, and equal to their equivalents across the process. This is very convenient because it allows us to characterise the process with suitable measurements of one of its time series.
We may estimate the mean of the process by computing the time average of one of its realisations using:

μ̃(N) = (1/N) Σ_{n=0}^{N−1} u(n)    (A.9)
where N is the number of observations or samples of the time series u(n). We expect this time average to converge to the ensemble mean as N increases. The mean-square error defines a criterion for this convergence:

lim_{N→∞} [(μ − μ̃(N))²] = 0    (A.10)

If we repeat the estimation for some more realisations and find the expected value of the square error, we may find that:

lim_{N→∞} E[|μ − μ̃(N)|²] = 0    (A.11)
In this case it can be said that the process is mean ergodic. In other words, a wide-sense stationary process
will be mean ergodic in the mean-square error sense if the mean-square value of the error between the
ensemble mean μ and the time average μ̃(N ) approaches zero as the number of samples N approaches
infinity.
This criterion may be extended to other time averages of the process. The estimate used for the autocorrelation function is:

r̃(k, N) = (1/N) Σ_{n=0}^{N−1} u(n) u(n − k)    (A.12)

for 0 ≤ k ≤ N − 1.
In this case, the process will be correlation ergodic in the mean-square error sense if the mean-square
value of the difference between the ensemble autocorrelation r(k) and the time estimate r̃(k, N ) approaches zero as the number of samples approaches infinity.
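As an illustration of these two ergodic time averages, the following minimal Python sketch computes Eqs. A.9 and A.12 for a single white-noise realisation, a process for which both estimates are known to converge (the biased sum in Eq. A.12 is taken over the valid samples n = k, ..., N−1):

import numpy as np

def time_average_mean(u):
    """Eq. A.9: time-average estimate of the ensemble mean."""
    return u.mean()

def time_average_autocorr(u, k):
    """Eq. A.12 (biased form): sum over the valid samples n = k..N-1."""
    N = len(u)
    return np.dot(u[k:], u[:N - k]) / N

rng = np.random.default_rng(1)
u = rng.normal(0.0, 1.0, 100_000)     # one white-noise realisation
print(time_average_mean(u))           # ~0  (the ensemble mean)
print(time_average_autocorr(u, 0))    # ~1  (r(0) = variance here, Eq. A.8)
print(time_average_autocorr(u, 5))    # ~0  (uncorrelated samples)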
Transmission of a discrete-time stationary process through a linear filter: Let the time series y(n) be the output of a discrete-time, shift-invariant linear filter with unit-sample response h(n) and input u(n). Assume that u(n) represents a single realisation of a wide-sense stationary discrete-time process. Then y(n) also represents a single realisation of a wide-sense stationary discrete-time process, with autocorrelation r_y(k) given by:

r_y(k) = Σ_{i=−∞}^{∞} Σ_{ℓ=−∞}^{∞} h(i) h(ℓ) r_u(ℓ − i + k)    (A.13)
Correlation matrix: If the M × 1 vector u(n) represents the time series as:

u(n) = [u(n), u(n − 1), ..., u(n − M + 1)]ᵀ    (A.14)
where the superscript T denotes transposition. The M × M correlation matrix R may be defined as:
R = E[u(n) uᵀ(n)]    (A.15)

Expanding this expression:

R(n) = ⎡ E[u(n)u(n)]        E[u(n)u(n−1)]        …  E[u(n)u(n−M+1)]      ⎤
       ⎢ E[u(n−1)u(n)]      E[u(n−1)u(n−1)]      …  E[u(n−1)u(n−M+1)]    ⎥
       ⎢        ⋮                   ⋮            ⋱         ⋮             ⎥
       ⎣ E[u(n−M+1)u(n)]    E[u(n−M+1)u(n−1)]    …  E[u(n−M+1)u(n−M+1)]  ⎦    (A.16)
If the process is stationary in the wide sense:

R = ⎡ r(0)       r(1)       …  r(M−1) ⎤
    ⎢ r(−1)      r(0)       …  r(M−2) ⎥
    ⎢    ⋮          ⋮       ⋱     ⋮   ⎥
    ⎣ r(−M+1)    r(−M+2)    …  r(0)   ⎦    (A.17)
From the property of the autocorrelation function of a wide-sense stationary process:

r(−k) = r(k)    (A.18)
we find that the matrix R is symmetric. According to this, only M values of the autocorrelation function
r(k) are needed to calculate the correlation matrix R.
R = ⎡ r(0)      r(1)      …  r(M−1) ⎤
    ⎢ r(1)      r(0)      …  r(M−2) ⎥
    ⎢    ⋮         ⋮      ⋱     ⋮   ⎥
    ⎣ r(M−1)    r(M−2)    …  r(0)   ⎦    (A.19)
As can be seen from Eq. A.19, the correlation matrix of a wide-sense stationary process is Toeplitz, i.e. all the elements along any given diagonal are equal. A Toeplitz correlation matrix guarantees wide-sense stationarity.
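A short sketch of this construction, assuming autocorrelation values estimated from a single realisation as in Eq. A.12 (scipy.linalg.toeplitz builds the symmetric Toeplitz matrix of Eq. A.19 from its first column):

import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
u = rng.normal(size=50_000)      # one realisation of a (white, WSS) process
M = 4                            # model/filter order

# Biased time-average estimates of r(0), ..., r(M-1), as in Eq. A.12.
r = np.array([np.dot(u[k:], u[:len(u) - k]) / len(u) for k in range(M)])

# Eq. A.19: for a real WSS process r(-k) = r(k), so only these M values
# are needed and the correlation matrix is symmetric Toeplitz.
R = toeplitz(r)
print(np.allclose(R, R.T))                   # True: symmetric
print(np.all(np.linalg.eigvalsh(R) > 0.0))   # almost always positive definite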
A general property, valid for all stochastic processes, is that the correlation matrix is always nonnegative definite and almost always positive definite. If it is positive definite, it will also be nonsingular. The rare condition of a singular correlation matrix represents linear dependency between the elements of the time series; this arises only when the process u(n) consists of a sum of K ≤ M sinusoids. Although this situation is almost impossible in practice, the correlation matrix may be ill-conditioned if its determinant is very close to zero.
Gaussian processes: A particular strictly stationary stochastic process, common in the physical sciences, is the Gaussian process, which has the property that it can be fully statistically characterised by only the first and second moments. We may call a process u(n) Gaussian if any linear functional of u(n) is a Gaussian-distributed random variable. A linear functional is defined by Eq. A.20 as:

Y = ∫₀ᵀ g(t) u(t) dt    (A.20)

where g(t) is a weighting function such that the mean-square value of the random variable Y is finite. For a discrete-time stochastic process, the linear functional becomes a linear function of all the samples of u(n) up to time n, Y = Σ_{i=0}^{n} gᵢ u(i). The Gaussian distribution probability density function fY(y)
is shown in Eq. A.21:

fY(y) = (1/√(2πσY²)) exp( −(y − μY)² / (2σY²) )    (A.21)

where μY is the mean and σY² is the variance of the random variable Y. Usually a Gaussian process will be denoted as N(μ, R). As the mean is a constant value that can be subtracted from the time series, we will consider only zero-mean Gaussian processes, N(0, R).
The joint probability density function of N samples of a Gaussian process is described by:

fU(u) = [1 / ((2π)^{N/2} det(R)^{1/2})] exp(−½ uᵀR⁻¹u)    (A.22)
Note that fU(u) is N-dimensional for a real-valued process. For the case N = 1, the matrix R becomes the variance of the process, σ². One particularly interesting property of a Gaussian process, derived from its definition, is that if a Gaussian process u(n) is applied to a stable linear filter, then the output of the filter is a Gaussian process as well.
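As a numerical illustration of Eq. A.22, the log-density of a zero-mean Gaussian vector can be evaluated through a Cholesky factorisation of R; the 3 × 3 matrix below is a toy example, not data from the thesis:

import numpy as np
from scipy.linalg import toeplitz, cholesky, solve_triangular

def gaussian_log_pdf(u, R):
    """Log of Eq. A.22 for u ~ N(0, R), via the Cholesky factor R = L L^T."""
    N = len(u)
    L = cholesky(R, lower=True)
    z = solve_triangular(L, u, lower=True)      # z = L^{-1} u, so z.z = u^T R^{-1} u
    log_det = 2.0 * np.log(np.diag(L)).sum()    # log det(R)
    return -0.5 * (N * np.log(2.0 * np.pi) + log_det + z @ z)

R = toeplitz([1.0, 0.5, 0.25])    # toy Toeplitz correlation matrix (Eq. A.19)
u = np.array([0.1, -0.3, 0.2])
print(gaussian_log_pdf(u, R))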
Appendix B
Conjugate gradient optimisation
algorithms
The description of the algorithms in this Appendix can be found in [19]. For further details the reader is
directed to [53] and [113].
B.1 The conjugate gradient directions
Let us assume line searching takes place along the direction d(τ). At the minimum, the derivative of E in the direction d(τ) vanishes:

(d/dλ) E(w(τ) + λd(τ)) = 0    (B.1)
Let us set the new weight vector w(τ +1) at this minimum in d(τ ) . Eq. B.1 implies that the gradient vector
in w(τ +1) is orthogonal to the searching direction. By adopting g ≡ ∇E as a short-hand notation for the
gradient of the error function, we can write the orthogonality property as:
g(τ+1)ᵀ d(τ) = 0    (B.2)
We would like to find a new searching direction d(τ +1) such that the property described in Eq. B.2 holds
for all the points in this new direction:
g(w(τ+1) + λd(τ+1))ᵀ d(τ) = 0    (B.3)
By using the first-order expansion of g around w(τ+1):

(g(τ+1) + λ g′(τ+1) d(τ+1))ᵀ d(τ) = 0  ⇒  g(τ+1)ᵀ d(τ) + λ d(τ+1)ᵀ g′(τ+1) d(τ) = 0    (B.4)

The first term on the left-hand side vanishes as a result of the property given in Eq. B.2, and g′ is none other than the Hessian matrix, so we can write Eq. B.4 as:

d(τ+1)ᵀ H d(τ) = 0    (B.5)
The directions d(τ+1) and d(τ) are said to be non-interfering or conjugate. Suppose that we can find a set of W vectors which are mutually conjugate with respect to H, so that:

djᵀ H di = 0,  i ≠ j    (B.6)
It can be shown [19, pp. 277] that these vectors are linearly independent if H is positive definite, and that they form a complete, non-orthogonal basis set in W. Starting at w1, we can write the difference between the minimum w∗ in W and the point w1 as:

w∗ − w1 = Σ_{i=1}^{W} αi di    (B.7)

If we define wj as:

wj = w1 + Σ_{i=1}^{j−1} αi di    (B.8)

an iterative equation can be written in the form:

wj+1 = wj + αj dj    (B.9)
Eq. B.9 represents a succession of line-searching steps in the conjugate directions, with the jth step length controlled by the parameter αj. To find the parameters αj, let us assume the quadratic form for the error function:

E(w) = EQ(w) = c + bᵀw + ½ wᵀHw    (B.10)

with constant parameters c, b and H, where the latter is a positive definite matrix; the gradient g(w) is given by:

g(w) = b + Hw    (B.11)
which vanishes at the minimum w∗. For this error function, let us pre-multiply Eq. B.7 by djᵀH:

djᵀHw∗ − djᵀHw1 = Σ_{i=1}^{W} αi djᵀHdi    (B.12)

Given that b + Hw∗ = 0, and by using the orthogonality property described in Eq. B.6, we can write Eq. B.12 as:

−djᵀ(b + Hw1) = αj djᵀHdj    (B.13)

from which we can express αj as:

αj = − djᵀ(b + Hw1) / (djᵀHdj)    (B.14)
By proceeding in a similar way with Eq. B.8 we find the relationship:

djᵀHwj = djᵀHw1    (B.15)

which can be used in the numerator of the expression for αj to yield:

αj = − djᵀ(b + Hwj) / (djᵀHdj) = − djᵀg(wj) / (djᵀHdj)    (B.16)
By noting that:

gj+1 − gj = H(wj+1 − wj) = αj Hdj    (B.17)

and substituting the value found in Eq. B.14 into Eq. B.17 and pre-multiplying by djᵀ, we find that:

djᵀgj+1 = 0    (B.18)
Similarly, if we pre-multiply Eq. B.17 by dkᵀ, with k < j ≤ W, we get:

dkᵀgj+1 = dkᵀgj,  for k < j ≤ W    (B.19)

It can be found easily by induction that:

dkᵀgj = 0,  for k < j ≤ W    (B.20)
Using the relationships found for this quadratic error function, we can find a set of mutually conjugate
directions by choosing the first one as the negative gradient:
d1 = −g1
(B.21)
Once the minimum w1 on d1 is found, the next direction can be chosen as a linear combination of the
previous one and the gradient at w1 :
dj+1 = −gj+1 + βj dj
(B.22)
The parameters βj can be found by pre-multiplying Eq. B.22 by dTj H:
βj =
gj+1 Hdj
dTj Hdj
(B.23)
To avoid the computation of the Hessian, we can use Eq. B.17 in the equation for βj, getting:

βj = gj+1ᵀ(gj+1 − gj) / (djᵀ(gj+1 − gj))    (B.24)

This expression can be simplified further by using Eq. B.21 and the orthogonality property in Eq. B.20:

βj = gj+1ᵀ(gj+1 − gj) / (gjᵀgj)    (B.25)
This last formula, known as the Polak-Ribiere form, gives better results than the alternatives because it tends to reset the conjugate direction to the (negative) gradient direction if the algorithm is making little progress (i.e. gj+1 ≈ gj), thereby restarting the conjugate gradient procedure. A caveat of this algorithm is that the Hessian matrix can be negative definite in some regions of weight space for a general non-linear error surface. In this case, a robust procedure should make sure that the error does not increase at any step.
B.1.1 The conjugate gradient algorithm
A description of the algorithm follows:
1. Choose an initial set of weights w1
2. Evaluate the gradient g1
3. Set d1 = −g1
4. Initialise j = 1
5. Find the minimum of the error function along dj and call this point wj+1
6. If E(wj+1) < ε, stop the procedure and set the neural network weights to wj+1,
otherwise continue.
7. Evaluate the gradient gj+1
8. If j is a multiple of W then reset the procedure by setting dj+1 = −gj+1 and
go to step 11.
9. Compute βj using the Polak-Ribiere formula (Eq. B.25)
10. Calculate the new direction as dj+1 = −gj+1 + βj dj
11. Increment j by one and go back to step 5.
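A minimal Python sketch of steps 1 to 11 is given below. It substitutes a numerical one-dimensional minimisation for the line search of step 5, and a gradient-norm test for the error threshold of step 6 (the quadratic test function used here is not bounded below by zero, so the E(wj+1) < ε test would not apply directly):

import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(E, grad, w, tol=1e-8, max_iter=100):
    """Polak-Ribiere conjugate gradients (steps 1-11 above, lightly adapted)."""
    W = len(w)                                   # number of weights
    g = grad(w)
    d = -g                                       # step 3: first direction
    for j in range(max_iter):
        alpha = minimize_scalar(lambda a: E(w + a * d)).x   # step 5: line search
        w = w + alpha * d
        g_new = grad(w)                          # step 7
        if np.linalg.norm(g_new) < tol:          # stop when the gradient vanishes
            break
        if (j + 1) % W == 0:                     # step 8: restart every W steps
            d = -g_new
        else:
            beta = g_new @ (g_new - g) / (g @ g)   # step 9: Polak-Ribiere (Eq. B.25)
            d = -g_new + beta * d                  # step 10: new direction
        g = g_new
    return w

# Quadratic test (Eq. B.10 with c = 0): the minimum is w* = -H^{-1} b.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 1.0])
E = lambda w: 0.5 * w @ H @ w + b @ w
grad = lambda w: H @ w + b
print(conjugate_gradient(E, grad, np.zeros(2)))
print(np.linalg.solve(H, -b))                    # analytic minimum, for comparison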
B.2 Scaled conjugate gradients
In Eq. B.14, the product of the vector dj with the Hessian matrix (defined as H ≡ ∇(∇E)) can be approximated by substituting v for dj in the following equation:

vᵀH = vᵀ∇(∇E) = [∇E(w + εv) − ∇E(w)] / ε + O(ε)    (B.26)

where O(ε) is a residual term of the order of ε. This residual term can be reduced by one order by using central differences:

vᵀH = vᵀ∇(∇E) = [∇E(w + εv) − ∇E(w − εv)] / (2ε) + O(ε²)    (B.27)
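Eq. B.27 is simple to realise in code: the Hessian is never formed, and only two extra gradient evaluations are needed. A minimal sketch follows (the step size ε = 10⁻⁵ is an arbitrary choice):

import numpy as np

def hessian_vector_product(grad, w, v, eps=1e-5):
    """Central-difference estimate of v^T H (equivalently H v, H symmetric)."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

# Check on E(w) = 0.5 w^T H w, whose gradient is H w:
H = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda w: H @ w
v = np.array([1.0, -1.0])
print(hessian_vector_product(grad, np.zeros(2), v))   # matches H @ v
print(H @ v)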
However, in the case of a non-quadratic error function, the conjugate gradient approach can lead to an increase in the error if the Hessian matrix is not positive definite. In such a case, the product vᵀHv will not be positive. To make sure that the denominator of Eq. B.14 remains positive for a negative definite Hessian, the matrix H can be replaced by:

Hmod = H + λI    (B.28)

where I is the identity matrix and λ is a scaling factor. The condition on λ is to make the denominator in Eq. B.14 positive:

djᵀHdj + λ‖dj‖² > 0    (B.29)
Since the size of the step αj depends inversely on the scaling factor, λ also controls the step size. If the
value of λ is too small, the searching region is large. This can be a problem if the error function is far
from being quadratic in the searching region. If the quadratic approximation is not valid, the conjugate
gradient formulae may not be effective in the search for the minimum. In such a case the step size
should be reduced. Conversely, if the approximation is good the step size can be safely increased. Hence,
the scaling factor will have two functions: to make sure that the error decreases when the Hessian is
negative definite, and to control the searching region based on a measure of the goodness of the quadratic
approximation. Its value will be adjusted at each iteration j.
Starting with λ1 = 0, the denominator of Eq. B.14 can be written as:

DENj = djᵀHj dj + λj ‖dj‖²    (B.30)
Note that the Hessian now carries the subscript j, indicating that in general it is not constant: its value may change at each step.
If DENj < 0, the value of λj should be increased. Denoting the new values for λj and the denominator with an overbar, we have:

DEN̄j = djᵀHj dj + λ̄j ‖dj‖² = DENj + (λ̄j − λj) ‖dj‖²    (B.31)

To make DEN̄j > 0, the new scaling factor should satisfy:

λ̄j > λj − DENj / ‖dj‖²    (B.32)
By choosing double the value of the right-hand side of inequality B.32 we get:

DEN̄j = −djᵀHj dj    (B.33)
Then, replacing the denominator in Eq. B.14 by the right-hand side of Eq. B.33, we can calculate the value of the step αj. To check whether the quadratic approximation is valid, the following index has been proposed [53]:

Δj = [E(wj) − E(wj + αj dj)] / [E(wj) − EQ(wj + αj dj)]    (B.34)
where EQ(w) is the local quadratic approximation of the error function in the neighbourhood of wj, given by:

EQ(wj + αj dj) = E(wj) + αj djᵀgj + ½ αj² djᵀHj dj    (B.35)
It is clear from the above equation that Δj will be close to 1 if the approximation is good, and close to zero if the error function differs largely from the quadratic assumption made. If the approximation is good then the value of the scaling factor can be decreased for the next iteration. On the contrary, if the value of Δj is very small, then the value of λ for the next iteration should be increased. A negative index Δj indicates that the step would move the weights to a point where the Hessian matrix is negative definite; therefore the weights should not be updated, and the value of λ should be increased accordingly¹ to re-calculate the step αj.
Recalling the definition of αj (Eq. B.14), the expression for EQ(wj + αj dj) can be written as:

EQ(wj + αj dj) = E(wj) + ½ αj djᵀgj    (B.36)

Substituting this expression in Eq. B.34 yields:

Δj = 2{E(wj) − E(wj + αj dj)} / (αj djᵀgj)    (B.37)

¹ An increase given by λj = λj + DENj (1 − Δj)/‖dj‖² has been suggested [113].
Lower and upper thresholds for Δj can be 0.25 and 0.75 respectively, for example [53]. The increase or decrease in λ is also arbitrarily chosen. An example of the quadratic approximation quality check could be:
• If Δj > 0.75, the approximation is good; decrease the scaling factor, λj+1 = λj/4
• If Δj < 0.25, the approximation is poor; increase the scaling factor, λj+1 = 4λj
• If 0.25 ≤ Δj ≤ 0.75, leave the scaling factor as it is, λj+1 = λj
• If Δj < 0, the Hessian has become negative definite with the step αj; increase the scaling factor as shown in footnote (1) above, then recalculate the modified Hessian and αj, and check Δj again.
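These rules translate directly into a small update function; the sketch below assumes the thresholds and factors quoted above (the Δj < 0 branch is handled separately in the main loop, as described):

def update_scale(lmbda, delta):
    """One possible coding of the quality-check rules above (thresholds from [53]).
    The delta < 0 case is handled separately in the main loop: lambda is
    increased as in footnote (1) and the step is recomputed."""
    if delta > 0.75:          # good quadratic approximation: shrink lambda
        return lmbda / 4.0
    if delta < 0.25:          # poor approximation: grow lambda
        return 4.0 * lmbda
    return lmbda              # 0.25 <= delta <= 0.75: leave unchanged

print(update_scale(1.0, 0.9), update_scale(1.0, 0.1), update_scale(1.0, 0.5))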
The scaling technique has been called the model trust region method because the model, in this case the
quadratic, is only trusted in a region defined by the scaling factor.
B.2.1 The scaled conjugate gradient algorithm
The scaled conjugate gradient algorithm can be summarised as follows:
1. Choose an initial set of weights w1
2. Set λ1 = 0
3. Choose a very small value for ε
4. Evaluate the gradient g1
5. Set d1 = −g1
6. Initialise j = 1
7. Estimate djᵀH by central differences
8. Evaluate the denominator DENj; if negative, increase λj to yield DEN̄j
9. Calculate αj
10. Check the quality of the quadratic approximation and modify λj correspondingly.
If Δj < 0 go back to step 8, otherwise continue.
11. If E(wj+1) < ε, stop the procedure and set the neural network weights to wj+1,
otherwise continue.
12. Evaluate the gradient gj+1
13. If j is a multiple of W then reset the procedure by setting dj+1 = −gj+1
and go to step 16.
14. Compute βj using the Polak-Ribiere formula (Eq. B.25)
15. Calculate the new direction as dj+1 = −gj+1 + βj dj
16. Increment j by one and go back to step 7.
Appendix C
Vigilance Database
The central-channel EEG from eight healthy young adults, performing various vigilance tasks for more than 2 hours, was recorded and digitised with 12-bit precision at a 256 Hz sampling rate. The subjects were asked to stay awake the night before, and to abstain from caffeine and any other stimulant substances for 24 hours before and during the tests. Subject age and gender are shown in Table C.1. The recording montage consisted of the electrode pairs C4−A1 (central right), C3−A2 (central left) and A1−A2 (mastoid), EOG left, EOG right and submental EMG.
Subject | ID number | Gender | Age [years]
1       | 3         | female | 20
2       | 4         | female | 19
3       | 6         | male   | 24
4       | 7         | male   | 18
5       | 9         | female | 23
6       | 10        | female | 21
7       | 11        | male   | 20
8       | 12        | female | 21
Table C.1: Bristol subjects
The test consisted of three different vigilance tasks, a tracking task, a reaction time task and a serial
attention task. In the tracking task the subject is asked to follow a rectangle on a computer screen by
moving a pointing device. The rectangle moves randomly. In the reaction time task, the subject has to
press the space bar of the computer keyboard every time a 3x3 mm red square appears on the screen.
The square appears at random intervals at an average rate of 18 times per minute. The serial attention
task consists of a digit display with values within the [−9, +9] interval. The value decreases or increases
at random times and the subject has to hit the left or right button of the mouse to keep it at zero value.
Performance indices taken for these tasks are:
• Tracking error for the tracking task, or deviation of the position indicator from the rectangle.
• Reaction time, the time interval in milliseconds between the appearance of the red square and the pressing of the space bar in the reaction time task.
• Missed stimuli, the number of times the subject did not react when the red square showed up in the reaction time task.
• Serial attention task error, the absolute value of the display in the serial attention task.
A previous study [50] found very little or no correlation between these performance indices and the expert scoring of the EEG. For instance, the reaction time remains almost constant across all the vigilance sub-categories, while the increase in the tracking error is not significant as the subject gets drowsy. Although it is well known that lapses of alertness due to sleepiness or fatigue lead to decreased performance, quantifying the loss of performance and correlating it with physiological measures of sleepiness have proved to be a difficult task [7][157]. Unrelated factors such as motivation and distractions may affect the results [121][130][35][42]. Therefore the performance indices in the vigilance database are not used in this thesis.
Appendix D
LED Database
D.1 Method
A frontal-channel EEG¹ was recorded from ten OSA patients performing a behavioural version of the Maintenance of Wakefulness Test (MWT), and digitised with 12-bit precision at a sampling rate of 128 Hz. Each subject performed at least four tests, on the same day at 9:00, 11:00, 13:00 and 15:00 hours, in a darkened room with the subject lying on a couch at 45 degrees. The subjects were asked to stay awake for as long as possible. Each test lasts for a maximum of 40 minutes. A light-emitting diode flashes a
red light which is displayed for approximately one second every three seconds throughout the test. The
subject is asked to touch a button on a hand-piece every time the light flashes. Each flash that a subject
fails to respond to is recorded. When seven flashes in succession are not responded to (total time is
21s) the test is terminated automatically and the subject is considered to have fallen asleep. The subject
wears headphones through which white noise is played on a pre-recorded tape. This is to reduce any
interference due to background noise. All subjects were asked to abstain from alcohol for 24 hours prior
to the study and coffee and tea for 12 hours prior to the study. Also, subjects were asked not to sleep
during the day of testing. Results from the tests are shown in Table D.1.
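The termination rule translates into a simple scan over the per-flash response record; the sketch below is an illustration of the 7-consecutive-miss rule (one flash every 3 seconds, so termination 21 seconds after the last response), not the clinical software used for the study:

FLASH_PERIOD_S = 3          # one flash every three seconds
MAX_CONSECUTIVE_MISSES = 7  # seven misses in a row (21 s) end the test

def termination_time_s(responses):
    """responses: one boolean per flash (True = button pressed).
    Returns the termination time in seconds, or None if the test ran to the end."""
    run = 0
    for i, hit in enumerate(responses):
        run = 0 if hit else run + 1
        if run == MAX_CONSECUTIVE_MISSES:
            return (i + 1) * FLASH_PERIOD_S
    return None

# 30 s of correct responses followed by seven misses: terminated at 51 s.
print(termination_time_s([True] * 10 + [False] * 7))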
D.2 Demographic data
The patients have a mean age of 50.4 years (standard deviation, sd, of 11.3 years) and an average body mass index (BMI)² of 41.1 (sd 8.6). All 10 subjects had been diagnosed with OSA, with an Epworth Sleepiness Scale (ESS) score greater than 10 indicating subjective daytime sleepiness (mean ESS 16.7, sd 4.6), and a positive overnight sleep study (performed the night before the vigilance tests) showing a number of oxygen saturation (SaO₂) dips greater than 4% per hour (mean SaO₂ dips/hour 30.4, sd 19.7) and a number of movements per hour of sleep with a mean of 76.9 (sd 40.4). Full data for each subject can be found in Tables D.2 and D.3.
¹ Other channels recorded are right and left mastoid, and reference on either mastoid.
² BMI is determined by dividing the weight in kilograms by the square of the height in metres.
Subject | 09:00     | 11:00     | 13:00                           | 15:00            | Comments
1       | 06:21 (A) | 29:12 (B) | 16:57 (C)                       | 26:00 (D)        |
2       | 25:57 (A) | 17:39 (B) | 15:00 (C)                       | 12:09 (D)        |
3       | 03:06 (A) | 10:00 (B) | 08:27 (C), 07:12 (D)            | 05:12 (E)        | Repeat as the patient said he didn't fall asleep
4       | 40:00 (A) | 40:00 (B) | 40:00 (C)                       | 40:00 (D)        |
5       | 20:03 (A) | 13:00 (B) | 17:12 (C)                       | 19:21 (D)        |
6       | 13:27 (A) | 10:24 (B) | 11:39 (C)                       | – (D), 10:45 (E) | Repeat as the patient fell asleep at start
7       | 02:33 (A) | 08:45 (B) | 06:03 (C)                       | 40:00 (D)        |
8       | 21:30 (A) | 32:30 (B) | 07:33 (C)                       | 31:30 (D)        |
9       | 07:51 (A) | –ᵃ (B)    | 09:03 (C)                       | 00:51 (D)        | Falling asleep all the time
10      | 05:18 (A) | 00:21 (B) | 00:21 (C), 02:57 (D), 01:33 (E) | 05:21 (F)        | Repeat as the patient said he didn't fall asleep
ᵃ too short
Table D.1: Time of falling asleep (in mm:ss) measured by the clinician from the start of the MWT test.
The letter used in this thesis to refer to a given test is shown in brackets
Subject  Age [years]  Height [m]  Weight [kg]  BMI [kg/m²]
1        57           1.75        152.40       50
2        33           1.85        112.94       33
3        56           1.78        118.40       37
4        55           1.83        159.00       48
5        41           1.73        111.00       37
6        52           1.78        115.20       36
7        53           1.83        150.32       45
8        72           1.75        87.90        29
9        37           1.70        127.50       44
10       48           1.75        110.00       36

Table D.2: Subject demographic details
Subject  O2 dip rate [hr−1]  Movement [hr−1]  ESS
1        55                  77               15
2        30.5                12               20
3        16.3                107              12
5        13.7                12               11
4        14.3                61               16
6        30.1                100              20
7        17.4                58               19
8        18.0                107              10
9        36.4                116              24
10       72.7                119              20

Table D.3: Overnight sleep study results and Epworth Sleepiness Scale (ESS) scores
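As a cross-check, the group statistics quoted in Section D.2 can be reproduced from the columns of
Table D.3. A minimal sketch, with the values transcribed from the table (statistics.stdev computes the
sample standard deviation):

    # Cross-check of the summary statistics in Section D.2 against Table D.3.
    import statistics

    o2_dips = [55, 30.5, 16.3, 13.7, 14.3, 30.1, 17.4, 18.0, 36.4, 72.7]
    movements = [77, 12, 107, 12, 61, 100, 58, 107, 116, 119]
    ess = [15, 20, 12, 11, 16, 20, 19, 10, 24, 20]

    for name, values in [("SaO2 dips/hour", o2_dips),
                         ("movements/hour", movements),
                         ("ESS", ess)]:
        print(f"{name}: mean {statistics.mean(values):.1f}, "
              f"sd {statistics.stdev(values):.1f}")
    # SaO2 dips/hour: mean 30.4, sd 19.7
    # movements/hour: mean 76.9, sd 40.4
    # ESS: mean 16.7, sd 4.6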
Bibliography
[1] H.D.I. Abarbanel, T.W. Frison, and L.Sh. Tsimring. Obtaining order in a world of chaos. IEEE Signal
Processing Magazine, pages 49–65, May 1998.
[2] P. Achermann, R. Hartmann, A. Gunzinger, W. Guggenbuhl, and A.A. Borbely. All-night sleep EEG
and artificial stochastic control signals have similar correlation dimensions. Electroencephalogr.
Clin. Neurophysiol., 90(5):384–7, May 1994.
[3] L.A. Aguirre, V.C. Barros, and A.V. Souza. Nonlinear multivariable modeling and analysis
of sleep apnea time series. Comput Biol Med, 29(3):207–28, 1999. Abstract available at:
http://www.websciences.org/cftemplate/NAPS/indiv.cfm?ID=19992088.
[4] T. Akerstedt. Work hours, sleepiness and the underlying mechanisms. J. Sleep Res., 4(Suppl.2):15–
22, Apr 1995.
[5] T. Akerstedt and M. Gillberg. Subjective and objective sleepiness in the active individual. Int. J.
Neurosci., 52(1-2):29–37, May 1990.
[6] C. Alford. EEG, performance and subjective sleep measures are not the same: implications for
assessment of daytime sleepiness. In Abstracts: British Sleep Society 4th Annual Meeting, page 10.
British Sleep Society, 1992.
[7] C. Alford, C. Idzikowski, and I. Hindmarch. Are electrophysiological measures of sleep tendency
related to subjective state and performance? In Abstracts: British Sleep Society 3rd Annual Meeting,
page 31. British Sleep Society, 1991.
[8] C. Alford, N. Rombaut, J. Jones, S. Foley, and C. Idzikowski. Acute effects of hydroxyzine on
nocturnal sleep and sleep tendency the following day: a C-EEG study. Human Psychopharmacology,
7, 1992.
[9] P. Anderer, S. Roberts, A. Schlogl, G. Gruber, G. Klosch, W. Herrmann, P. Rappelsberger, O. Filz,
M.J. Barbanoj, G. Dorffner, and B. Saletu. Artifact processing in computerized analysis of sleep
EEG - a review. Neuropsychobiology, 40(3):150–7, Sep 1999.
[10] N.O. Andersen. On the calculation of filter coefficients for maximum entropy spectral analysis.
Geophysics, 39(1):69–72, 1974.
[11] Atlas Task Force of the American Sleep Disorders Association. EEG arousals: scoring rules and
examples. Sleep, 15(2):174–184, 1992.
[12] B. Kemp. A proposal for computer-based sleep/wake analysis. J. Sleep Res., 2(3):179–85, 1993.
Consensus Report.
[13] I.N. Bankman, V.G. Sigillito, R.A. Wise, and P.L. Smith. Feature-based detection of the K-complex
wave in the human electroencephalogram using neural networks. IEEE Transactions on Biomedical
Engineering, 39(12):1305–10, Dec 1992.
[14] J. S. Barlow. Methods of analysis of nonstationary EEGs, with emphasis on segmentation techniques: A comparative review. Journal of Clinical Neurophysiology, 2(3):267–304, 1985.
[15] R. Baumgart-Schmitt, W.M. Herrmann, and R. Eilers. On the use of neural network techniques to
analyze sleep EEG data. third communication: robustification of the classificator by applying an
algorithm obtained from 9 different networks. Neuropsychobiology, 37(1):49–58, 1998.
[16] R. Baumgart-Schmitt, W.M. Herrmann, R. Eilers, and F. Bes. On the use of neural network techniques to analyse sleep EEG data. first communication: application of evolutionary and genetic
algorithms to reduce the feature space and to develop classification rules. Neuropsychobiology,
36(4):194–210, 1997.
[17] M.A. Bedard, J. Montplaisir, F. Richer, and J. Malo. Nocturnal hypoxemia as a determinant of
vigilance impairment in sleep apnea syndrome. Chest, 100(2):367–70, Aug 1991.
[18] L.S. Bennett, B.A. Langford, J.R. Stradling, and R.J.O. Davies. Sleep fragmentation indices as
predictors of daytime sleepiness and NCPAP response in OSA. The Osler Chest Unit, Churchill
Hospital, Headington, Oxford, England.
[19] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[20] M.H. Bonnet and D.L. Arand. We are chronically sleep deprived. Sleep, 18(10):908–11, Dec 1995.
[21] G. E. P. Box and G. M. Jenkins. Time series analysis : forecasting and control. Holden-Day series in
time series analysis. Holden-Day, San Francisco, rev. edition, 1976.
[22] G. Bremer, J.R. Smith, and I. Karacan. Automatic detection of the K-complex in sleep electroencephalograms. IEEE Transactions on Biomedical Engineering, 17(4):314–23, Oct 1970.
[23] D.M. Brittenham. Artifacts: Activities not arising from the brain. In Daly and Pedley [37].
[24] P. Brown and C.D. Marsden. What do the basal ganglia do? The Lancet, 351:1801–4, June 1998.
[25] J. P. Burg. Maximum entropy spectral analysis. PhD thesis, Stanford University, Stanford, California,
1975.
[26] J. P. Burg, D. G. Luenberger, and D. L. Wenger. Estimation of structured covariance matrices.
Proceedings of the IEEE, 70(9):963–974, Sep 1982.
[27] M.A. Carskadon and W.C. Dement. Daytime sleepiness: quantification of a behavioral state. Neurosci. Biobehav. Rev., 11(3):307–17, 1987.
[28] R. Caton. The electric currents of the brain. British Medical Journal, (2):278, 1875.
[29] K. Cheshire, H. Engleman, I. Deary, C. Shapiro, and N.J. Douglas. Factors impairing daytime
performance in patients with sleep apnea/hypopnea syndrome. Arch. Intern. Med., 152(3):538–
41, Mar 1992.
[30] S. Chokroverty, editor. Sleep disorders medicine: basic science, technical considerations, and clinical
aspects. Butterworth-Heinemann, Oxford, 2nd edition, 1999.
[31] Circadian Technologies Inc. Alertness Technologies, 2000. Available at: http://www.circadian.com.
[32] R. Conradt, U. Brandenburg, T. Penzel, J. Hasan, A. Varri, and J.H. Peter. Vigilance transitions in reaction time test: a method of describing the state of alertness more objectively. Clin. Neurophysiol.,
110(9):1499–509, Sep 1999.
[33] R. Conradt, T. Penzel, U. Brandenburg, and J.H. Peter. Description of vigilance in the EEG during reaction time test in patients with sleep apnea. In Proceeding of the European Medical and
Biological Engineering Conference EMBEC’99, volume 1, pages 414–15, Vienna, Austria, Nov 1999.
International Federation for Medical and Biological Engineering.
[34] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series.
Mathematics of Computation, 19(90):297–301, Apr 1965.
[35] M. Corsi-Cabrera, J. Ramos, C. Arce, M.A. Guevara, M. Ponce de Leon, and I. Lorenzo. Changes in
the waking EEG as a consequence of sleep and sleep deprivation. Sleep, 15(6):550–5, Dec 1992.
[36] A.C. da Rosa, A. L. N. Fred, and J. M. N. Leitao. Stochastic model of awake and sleep EEG.
In M. Holt, C. Cowan, P. Grant, and W. Sandham, editors, Signal Processing VII: Theories and
Applications. European Association for Signal Processing, 1994.
[37] D.D. Daly and T.A. Pedley, editors. Current practice of clinical electroencephalography. Raven Press,
1990.
[38] R.S. Daniel. Alpha and theta EEG in vigilance. Perceptual and Motor Skills, 25:697–703, 1967.
[39] R.J. Davies, P.J. Belt, S.J. Roberts, N.J. Ali, and J.R. Stradling. Arterial blood pressure responses
to graded transient arousal from sleep in normal humans. J. Appl. Physiol., 74(3):1123–30, Mar
1993.
[40] F. De Carli, L. Nobili, P. Gelcich, and F. Ferrillo. A method for the automatic detection of arousals
during sleep. Sleep, 22(5):561–72, Aug 1999.
[41] D. F. Dinges. An overview of sleepiness and accidents. J. Sleep Res., 4(Suppl. 2):4–14, 1995.
[42] D.F. Dinges and N. Barone-Kribbs. Performing while sleepy: effects of experimentally-induced
sleepiness. In Monk [114], pages 97–128.
[43] K. Doghramji. Maintenance of wakefulness test. In Chokroverty [30].
[44] N. J. Douglas. The sleep apnoea/hypopnoea syndrome and snoring. In C. M. Shapiro, editor, ABC
of Sleep Disorders. BMJ, 1993.
[45] N. J. Douglas. The sleep apnoea/hypopnoea syndrome. In R. Cooper, editor, Sleep. Chapman and
Hall Medical, 1994.
[46] M.J. Drinnan, A. Murray, G.J. Gibson, and C.J. Griffiths. Interobserver variability in recognizing
arousal in respiratory sleep disorders. Am. J. Respir. Crit. Care Med., 158(2):358–62, 1998.
[47] M.J. Drinnan, A. Murray, J.E. White, A.J. Smithson, G.J. Gibson, and C.J. Griffiths. Evaluation of
activity-based techniques to identify transient arousal in respiratory sleep disorders. J. Sleep Res.,
5:173–180, 1996.
[48] M.J. Drinnan, A. Murray, J.E. White, A.J. Smithson, C.J. Griffiths, and G.J. Gibson. Automated recognition of EEG changes accompanying arousal in respiratory sleep disorders. Sleep,
19(4):296–303, 1996.
[49] J. Durbin. The fitting of time series models. Revue de l’Institut international de statistique, 28:233–
44, 1960.
[50] M. Duta. The Study of Vigilance using Neural Networks Analysis of EEG. PhD thesis, University of
Oxford, 1998.
[51] Nervous system. In Encyclopædia Britannica Online,
<http://search.eb.com/bol/topic?eu=119939&sctn=1>. Encyclopædia Britannica, Inc., 1994–2000.
[Accessed 23 June 2000].
[52] J. Fell, J. Roschke, K. Mann, and C. Schaffner. Discrimination of sleep stages: a comparison
between spectral and nonlinear EEG measures. Electroencephalogr. Clin. Neurophysiol., 98(5):401–
10, May 1996.
[53] R. Fletcher. Practical methods of optimization. Wiley, Chichester, 2nd edition, 1987.
[54] J.M. Gaillard, M. Krassoievitch, and R. Tissot. Automatic analysis of sleep by a hybrid system: new
results. Electroencephalography and Clinical Neurophysiology, 33(4):403–10, Oct 1972.
[55] I. Gath and E. Bar-On. Computerized method for scoring of polygraphic sleep recordings. Comput.
Programs Biomed., 11(3):217–23, Jun 1980.
[56] C. F. George and A. Smiley. Sleep apnea and automobile crashes. Sleep, 22(6):790–5, 1999.
[57] C.J. Goeller and C.M. Sinton. A microcomputer-based sleep stage analyzer. Computer Methods and
Programs in Biomedicine, 29(1):31–6, May 1989.
[58] C. Guilleminault, M. Partinen, M.A. Quera-Salva, B. Hayes, W.C. Dement, and G. Nino-Murcia.
Determinants of daytime sleepiness in obstructive sleep apnea. Chest, 94(1):32–7, Jul 1988.
[59] M. Hack, R.J. Davies, R. Mullins, S.J. Choi, S. Ramdassingh-Dow, C. Jenkinson, and J.R. Stradling.
Randomised prospective parallel trial of therapeutic versus subtherapeutic nasal continuous positive airway pressure on simulated steering performance in patients with obstructive sleep apnoea.
Thorax, 55(3):224–31, 2000.
[60] P. Halasz, O. Kundra, P. Rajna, I. Pal, and M. Vargha. Micro-arousals during nocturnal sleep. Acta
Physiologica Academia Scientiarum Hungaricae, 54(1):1–12, 1979.
[61] J. Hasan, K. Hirvonen, A. Varri, V. Hakkinen, and P. Loula. Validation of computer analysed
polygraphic patterns during drowsiness and sleep onset. Electroencephalogr. Clin. Neurophysiol.,
87(3):117–27, Sep 1993.
[62] S. S. Haykin. Communication Systems. Wiley, New York, 3rd edition, 1994.
[63] S. S. Haykin. Adaptive Filter Theory. Information and systems sciences series. Prentice-Hall, New
Jersey, 3rd edition, 1996.
[64] H. Head. The conception of nervous and mental energy II. vigilance: a physiological state of the
nervous system. Br. J. Psychol., 14:125–147, 1923.
[65] R. Hess. The electroencephalogram in sleep. Electroenceph. clin. Neurophysiol., 16:44–55, 1964.
[66] S.L. Himanen and J. Hasan. Limitations of the Rechtschaffen and Kales. Sleep Medicine Reviews,
4(2):149–67, Apr 2000.
[67] B. Hjorth. EEG analysis based on time domain properties. Electroencephalography and Clinical
Neurophysiology, 29:306–310, 1970.
[68] C.A. Holzmann, C.A. Perez, C.M. Held, M. San Martin, F. Pizarro, J.P. Perez, M. Garrido, and
P. Peirano. Expert-system classification of sleep/waking states in infants. Medical and Biological
Engineering and Computing, 37(4):466–76, 1999.
[69] J. Horne. Why we sleep : the functions of sleep in humans and other mammals. Oxford University
Press, Oxford, 1988.
[70] J.A. Horne. Dimensions to sleepiness. In Monk [114], pages 169–96.
[71] E. Huupponen, A. Varri, J. Hasan, J. Saarinen, and K. Kaski. Sleep arousal detection with neural
network. Medical & Biological Engineering & Computing, 34(suppl.1):219–20, 1996.
[72] K. Inoue, K. Kumamaru, S. Sagara, and S. Matsuoka. Pattern recognition approach to human sleep
EEG analysis and determination of sleep stages. Memoirs of the Faculty of Engineering, Kyushu
University, 42(3):177–95, Sep 1982.
[73] J. Wu, E.C. Ifeachor, E.M. Allen, and N.R. Hudson. A neural network based artefact detection
system for EEG signal processing. In Proceedings of the International Conference on Neural Networks and Expert Systems in Medicine and Healthcare, pages 257–66, Plymouth, UK, 1994. Univ.
Plymouth.
[74] B. H. Jansen. Time series analysis by means of linear modelling. In R. Weitkunat, editor, Digital
Biosignal Processing. Elsevier Science Publishers, 1991.
[75] B.H. Jansen, A. Hasman, and R. Lenten. Piecewise analysis of EEGs using AR-modeling and clustering. Comput. Biomed. Res., 14(2):168–78, Apr 1981.
[76] H.H. Jasper. The 10-20 system of the international federation. Electroencephalography and Clinical
Neurophysiology, 10:371–5, 1958.
[77] G. M. Jenkins and D. G. Watts. Spectral analysis and its applications. Holden-Day series in time
series analysis. Holden-Day, San Francisco, 1968.
[78] M. Jobert, H. Escola, E. Poiseau, and P. Gaillard. Automatic analysis of sleep using two parameters based on principal component analysis of electroencephalography spectral data. Biological
Cybernetics, 71(3):197–207, 1994.
[79] T. Jokinen, T. Salmi, A. Ylikoski, and M. Partinen. Use of computerized visual performance test in
assessing day-time vigilance in patients with sleep apneas and restless sleep. Int. J. Clin. Monit.
Comput., 12(4):225–30, 1995.
[80] T.P. Jung, S. Makeig, M. Stensmo, and T.J. Sejnowski. Estimating alertness from the EEG power
spectrum. IEEE Transactions on Biomedical Engineering, 44(1):60–69, 1997.
[81] S. M. Kay and S. L. Marple. Spectrum analysis-a modern perspective. Proceedings of the IEEE,
69(11):1380–1419, November 1981.
[82] S.M. Kay. Recursive maximum likelihood estimation of autoregressive processes. IEEE Transactions
on Acoustics, Speech, and Signal Processing, 31(1):56–65, Feb 1983.
[83] G. Kecklund and T. Akerstedt. Sleepiness in long distance truck driving: an ambulatory EEG study
of night driving. Ergonomics, 36(9):1007–17, Sep 1993.
[84] S.A. Keenan. Polysomnographic technique: An overview. In Chokroverty [30].
[85] P. Kellaway. An orderly approach to visual analysis: characteristics of the normal EEG of adults and
children. In Daly and Pedley [37].
[86] B. Kemp, E. W. Gröneveld, A. J. M. W. Jansen, and J. M. Franzen. A model-based monitor of human
sleep stages. Biological Cybernetics, 57:365–378, 1987.
[87] L. G. Kiloh, A. G. McComas, and J. W. Osselton. Clinical Electroencephalography. Butterworths,
fourth edition, 1981.
[88] K. Kinnari, J.H. Peter, A. Pietarinen, L. Grote, T. Penzel, A. Varri, P. Laippala, A. Saastamoinen,
W. Cassel, and J. Hasan. Vigilance stages and performance in OSAS patients in a monotonous
reaction time task. Clinical Neurophysiology, 111(6):1130–6, 2000.
[89] J.R. Knott, F.A. Gibbs, and C.E. Henry. Fourier transform of the electroencephalogram during sleep.
J. Exp. Psychol., 31:465–77, 1942.
[90] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics,
43:59–69, 1982.
[91] M.H. Kryger and P.J. Hanly. Cheyne-Stokes respiration in cardiac failure. In Sleep and Respiration,
pages 215–26. Wiley-Liss, Inc., 1990.
[92] M. Kubat, G. Pfurtscheller, and D. Flotzinger. AI-based approach to automatic sleep classification.
Biol. Cybern., 70(5):443–8, 1994.
[93] St. Kubicki, W.M. Herrmann, and L. Höller. Critical comments on the rules by Rechtschaffen and
Kales concerning the visual evaluation of EEG sleep records. In St. Kubicki and W.M. Herrmann,
editors, Methods of sleep research, pages 19–35. Gustav Fischer Verlag, Stuttgart, 1985.
[94] A. Kumar. A real-time system for pattern recognition of human sleep stages by fuzzy system
analysis. Pattern Recognition, 9(1):43–6, Jan 1977.
[95] N. Levinson. The Wiener RMS (root-mean-square) error criterion in filter design and prediction.
Journal of Mathematics and Physics, 25:261–278, 1947.
[96] A.L. Loomis, E.N. Harvey, and G.A. Hobart III. Cerebral states during sleep, as studied by human
brain potentials. J. exp. Psychol., 21:127–144, 1937.
[97] I. Lorenzo, J. Ramos, C. Arce, M.A. Guevara, and M. Corsi-Cabrera. Effect of total sleep deprivation
on reaction time and waking EEG activity in man. Sleep, 18(5):346–54, Jun 1995.
[98] D. Lowe. Feature space embeddings for extracting structure from single channel wake EEG using
RBF networks. In Neural Networks for Signal Processing VIII. Proceedings of the 1998 IEEE Signal
Processing Society Workshop, pages 428–37, New York, 1998. IEEE.
[99] R. Luthringer, R. Minot, M. Toussaint, F. Calvi-Gries, N. Schaltenbrand, and J.P. Macher. All-night
EEG spectral analysis as a tool for the prediction of clinical response to antidepressant treatment.
Biol. Psychiatry, 38(2):98–104, Jul 1995.
[100] P.M. Macey, J.S. Li, and R.P. Ford. Deterministic properties of apnoeas in an abdominal breathing
signal. Med. Biol. Eng. Comput., 37(3):335–43, May 1999.
[101] P.M. Macey, J.S.J. Li, and R.P.K. Ford. Expert system for the detection of apnoea. Engineering
Applications of Artificial Intelligence, 11(3):425–38, Jun 1998.
[102] D.J.C. MacKay. The evidence framework applied to classification networks. Neural Computation,
4(5):720–36, Sep 1992.
[103] D.J.C. MacKay. A practical bayesian framework for backpropagation networks. Neural Computation, 4(3):448–72, May 1992.
[104] S. Makeig and T.P. Jung. Changes in alertness are a principal component of variance in the EEG
spectrum. NeuroReport, 7:213–216, 1995.
[105] J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561–580, 1975.
[106] M. Matsuura, K. Yamamoto, H. Fukuzawa, Y. Okubo, H. Uesugi, T. Kojima, M. Moriiwa, and Y. Shimazono. Age development and sex differences of various EEG elements in healthy children and
adults, quantification by a computerized wave form recognition method. Electroencephalogr. Clin.
Neurophysiol., 60(5):394–406, May 1985.
[107] W.T. McNicholas. Sleep apnoea and driving risk. European Respiratory Society Task Force on
”Public health and medicolegal implications of sleep apnoea” [editorial]. Eur. Respir. J., 13(6):1225–7,
Jun 1999.
[108] L. T. McWhorter and L. L. Scharf. Nonlinear maximum likelihood estimation of autoregressive
time series. IEEE Transactions on Signal Processing, 43(12):2909–2919, 1995.
[109] R.G. Miller. The jackknife, a review. Biometrika, 61(1):1–15, Apr 1974.
[110] A. Mitchell. Liquid genius. New Scientist, 13 March 1999.
[111] M.M. Mitler, K.S. Gujavarty, and C.P. Browman. Maintenance of wakefulness test: a polysomnographic technique for evaluation treatment efficacy in patients with excessive somnolence. Electroencephalogr. Clin. Neurophysiol., 53(6):658–61, 1982.
[112] M.M. Mitler, J.S. Poceta, and B.G. Bigby. Sleep scoring technique. In Chokroverty [30].
[113] M. Møller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks,
6(4):525–33, 1993.
[114] T.H. Monk, editor. Sleep, sleepiness and performance. Human performance and cognition. John
Wiley & Sons, Chichester, England, 1991.
[115] M. Moore-Ede. We have ways of keeping you alert. New Scientist, pages 30–5, Nov. 13th 1993.
[116] MTI Research’s Alertness Technology. Alertness Monitor Technical Summary. Available at:
http://www.mti.com.
[117] S. S. Narayan and J. P. Burg. Spectral estimation of quasi-periodic data. IEEE Transactions on
Acoustics, Speech, and Signal Processing, 38(3):512–518, March 1990.
[118] R.D. Ogilvie, D.M. McDonagh, S.N. Stone, and R.T. Wilkinson. Eye movements and the detection
of sleep onset. Psychophysiology, 25(1):81–91, Jan 1988.
[119] M.M. Ohayon and C. Guilleminault. Epidemiology of sleep disorders. In Chokroverty [30].
[120] B.S. Oken and K.H. Chiappa. Short-term variability in EEG frequency analysis. Electroencephalogr.
Clin. Neurophysiol., 69(3):191–8, Mar 1988.
[121] J. P. Howe on behalf of the Council on Scientific Affairs. Fatigue, sleep disorders, and motor vehicle
crashes. Technical Report CSA Report 1-A-96, American Sleep Disorders Association, 1996.
[122] A.V. Oppenheim and R.W. Schafer. Digital Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.,
1975.
[123] J. Pardey, S. J. Roberts, L. Tarassenko, and J. Stradling. A new approach to the analysis of the
human sleep-wakefulness continuum. Journal of Sleep Research, 5:201–210, 1996.
[124] B. Parks, M. Olsen, and P. Resnik. WordNet: A machine-readable lexical database organized by
meanings. Available at: http://work.ucsd.edu:5141/cgi-bin/http webster, 1991-98.
[125] T.A Pedley and R.D. Traub. Physiological basis of the EEG. In Daly and Pedley [37].
[126] T. Penzel and R. Conradt. Computer based sleep recording and analysis. Sleep Medicine Reviews,
4(2):131–48, Apr 2000.
[127] T. Penzel and J. Petzold. A new method for the classification of subvigil stages, using the Fourier
transform, and its application to sleep apnea. Comput. Biol. Med., 19(1):7–34, 1989.
[128] P. Philip, J. Taillard, C. Guilleminault, M.A. Quera-Salva, B. Bioulac, and M. Ohayon. Long distance
driving and self-induced sleep deprivation among automobile drivers. Sleep, 22(4):475–80, Jun
1999.
[129] D. Pitson, N. Chhina, S. Knijn, M. van Herwaaden, and J. Stradling. Changes in pulse transit time
and pulse rate as markers of arousal from sleep in normal subjects. Clin. Sci. Colch., 87(2):269–73,
Aug 1994.
[130] R.T. Pivik. The several qualities of sleepiness: psychophysiological considerations. In Monk [114],
pages 3–37.
[131] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of
Scientific Computing. Cambridge University Press, Cambridge, 2nd edition, 1994.
[132] J.C. Principe, S.K. Gala, and T.G. Chang. Sleep staging automaton based on the theory of evidence. IEEE Transactions on Biomedical Engineering, 36(5):503–9, May 1989.
[133] J.C. Principe and J.R. Smith. SAMICOS: a sleep analyzing microcomputer system. IEEE Transactions
on Biomedical Engineering, 33(10):935–41, Oct 1986.
[134] P.F. Prior and D.E. Maynard. Monitoring cerebral function: long-term monitoring of EEG and evoked
potentials. Elsevier, 1986.
[135] R. Cooper, C.D. Binnie, and C.J. Fowler. Origins and technique. In C. D. Binnie and J. W. Osselton,
editors, Clinical Neurophysiology: EMG, nerve conduction and evoked potentials / EEG technology.
Butterworth-Heinemann Ltd, Oxford, 1995.
[136] A. Rechtschaffen and A. Kales. A Manual of Standardized Terminology, Techniques and Scoring
System for Sleep Stages of Human Subjects. Public Health Service, U.S. Government Printing Office,
Washington D.C., 1968.
[137] I. A. Rezek and S. J. Roberts. Stochastic complexity measures for physiological signal analysis.
IEEE Transactions on Biomedical Engineering, 45(9):1186–91, 1998. Available at:
http://www.robots.ox.ac.uk/~sjrob/pubs.h.
[138] B. D. Ripley. Statistical theories of model fitting. In volume 168 of NATO ASI Series F: Computer
and Systems Sciences, Cambridge, U.K., August 1998. NATO Advanced Study Institute on Generalization in Neural Networks and Machine Learning, Springer.
[139] S. Roberts, I. Rezek, R. Everson, H. Stone, S. Wilson, and C. Alford. Automated assessment of
vigilance using committees of radial basis function analysers. IEE Proceedings Science, Technology
and Measurement, 147(6):333–338, 2000.
[140] T. Roth, T.A. Roehrs, and L. Rosenthal. Measurement of sleepiness and alertness: Multiple sleep
latency test. In Chokroverty [30].
[141] J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers,
C-18(5):401–409, 1969.
[142] J. Santamaria and K.H. Chiappa. The EEG of drowsiness in normal adults. J. Clin. Neurophysiol.,
4(4):327–82, Oct 1987.
[143] N. Schaltenbrand, R. Lengelle, and J.P. Macher. Neural network model: application to automatic
analysis of human sleep. Comput. Biomed. Res., 26(2):157–71, Apr 1993.
[144] B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support
vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Sign. Processing, 45:2758–65, 1997. Available at http://www.kernel-machines.org/papers/AIM-1599.ps.
[145] F.W. Sharbrough. Electrical fields and recording techniques. In Daly and Pedley [37].
[146] F.Z. Shaw, R.F. Chen, H.W. Tsao, and C.T. Yen. Algorithmic complexity as an index of cortical
function in awake and pentobarbital-anesthetized rats. J. Neurosci. Methods, 93(2):101–10, Nov
1999.
[147] D.K. Siegwart, L. Tarassenko, S.J. Roberts, J.R. Stradling, and J. Partlett. Sleep apnoea analysis
from neural network post-processing. In Proceedings of the Fourth International Conference on ‘Artificial
Neural Networks’, pages 427–32, London, UK, 1995. IEE.
[148] D.W. Skagen. Estimation of running frequency spectra using a Kalman filter algorithm. Journal of
Biomedical Engineering, 10(3):275–9, May 1988.
[149] J. R. Smith. Automated analysis of sleep EEG data. In F. H. Lopes da Silva, W. Storm van Leeuwen,
and A. Rémond, editors, Handbook of Electroencephalography and Clinical Neurophysiology, volume 2. Elsevier Science Publishers, 1986.
[150] J.R. Smith and I. Karacan. EEG sleep stage scoring by an automatic hybrid system. Electroencephalography and Clinical Neurophysiology, 31(3):231–7, Sep 1971.
[151] J.R. Smith, I. Karacan, and M. Yang. Automated analysis of the human sleep EEG. Waking and
Sleeping, 2:75–82, 1978.
[152] E. Stanus, B. Lacroix, M. Kerkhofs, and J. Mendlewicz. Automated sleep scoring: a comparative
reliability study of two algorithms. Electroencephalogr. Clin. Neurophysiol., 66(4):448–56, Apr
1987.
[153] M.B. Sterman, G.J. Schummer, T.W. Dushenko, and J.C. Smith. Electroencephalographic correlates
of pilot performance: simulation and in-flight studies. In Electric and Magnetic Activity of the
Central Nervous System: Research and Clinical Applications in Aerospace Medicine, pages 31/1–16,
Neuilly sur Seine, France, Feb 1988. AGARD.
[154] J.R. Stradling. Personal communication.
[155] J.R. Stradling. Handbook of Sleep-Related Breathing Disorders. Oxford University Press, Oxford,
1993.
[156] J.R. Stradling, D.J. Pitson, L. Bennett, C. Barbour, and R.J.O. Davies. Variation in the arousal pattern after obstructive events in obstructive sleep apnea. Am. J. Respir. Crit. Care. Med., 159(1):130–
6, Jan 1999.
[157] K. Swingler and L.S. Smith. Producing a neural network for monitoring driver awareness. Neural
Computing and Applications, 4:96–104, 1996.
[158] T. Shimada, T. Shiina, and Y. Saito. Detection of characteristic waves of sleep EEG by neural
network analysis. IEEE Transactions on Biomedical Engineering, 47(3):369–79, 2000.
[159] L. Tarassenko. A Guide to Neural Computing Applications. Arnold, London, 1998.
[160] L. Tarassenko, J. Pardey, S. Roberts, H. Chia, and M. Laister. Neural network analysis of sleep
disorders. In Proceedings of ICANN’95, Paris, Oct 1995. European Neural Network Society.
[161] J. Teran-Santos, A. Jimenez-Gomez, and J. Cordero-Guevara. The association between sleep apnea
and the risk of traffic accidents. Cooperative group Burgos-Santander. New England Journal of
Medicine, 340(11):847–51, 1999.
[162] M.E. Tipping. The relevance vector machine. In S.A. Solla, T.K. Leen, and K-R. Müller, editors,
Advances in Neural Information Processing Systems, volume 12. MIT Press, Cambridge, Mass, 2000.
Available at http://www.kernel-machines.org/papers/upload 10444 rvm nips.ps.
[163] M.E. Tipping and D. Lowe. Shadow targets: a novel algorithm for topographic projections by
radial basis functions. Neurocomputing, 19(1-3):211–22, Mar 1998.
[164] L. Torsvall and T. Akerstedt. Extreme sleepiness: Quantification of EOG and spectral EEG parameters. Intern J. Neuroscience, 38:435–441, 1988.
[165] N. Townsend and L. Tarassenko. Micro-arousals in human sleep: An initial evaluation of automatic detection. Robotics Research Group, Department of Engineering Science, Oxford University,
Oxford, 1996.
[166] U. Trutschel, R. Guttkuhn, C. Ramsthaler, M. Golz, and M. Moore-Ede. Automatic detection of
microsleep events using a neuro-fuzzy hybrid system. In 6th European Congress on Intelligent Techniques and Soft Computing. EUFIT’98, volume 3, pages 1762–6, Verlag Mainz, Aachen, Germany,
1998.
[167] S. Uchida, I. Feinberg, J.D. March, Y. Atsumi, and T. Maloney. A comparison of period amplitude
analysis and FFT power spectral analysis of all-night human sleep EEG. Physiol. Behav., 67(1):121–
31, Aug 1999.
[168] S. Uchida, M. Matsuura, S. Ogata, T. Yamamoto, and N. Aikawa. Computerization of Fujimori’s
method of waveform recognition. A review and methodological considerations for its application
to all-night sleep EEG. J. Neurosci. Methods, 64(1):1–12, Jan 1996.
[169] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances
in Neural Information Processing Systems, volume 9, pages 281–287. MIT Press, Cambridge, Mass,
1997. Available at http://www.kernel-machines.org/papers/vapgolsmo96.ps.
[170] A. Varri, K. Hirvonen, J. Hasan, P. Loula, and V. Hakkinen. A computerized analysis system for
vigilance studies. Comput. Methods Programs Biomed., 39(1-2):113–24, Sep-Oct 1992.
[171] R. Venturini, W.W. Lytton, and T.J. Sejnowski. Neural network analysis of event related potentials
and electroencephalogram predicts vigilance. In J.E. Moody, S.J. Hanson, and R.P. Lippmann,
editors, Advances in Neural Information Processing Systems 4, pages 651–658. Morgan Kaufmann
Publishers, San Mateo, CA, 1992.
[172] M.L. Vis and L.L. Scharf. A note on recursive maximum likelihood for autoregressive modeling.
IEEE Transactions on Signal Processing, 42(10):2881–3, Oct 1994.
[173] J. Wright, R. Johns, I. Watt, A. Melville, and T. Sheldon. Health effects of obstructive sleep apnoea
and the effectiveness of continuous positive airways pressure: a systematic review of the research
evidence. British medical journal, 314:851–60, Mar 1997.
[174] G.U. Yule. On a method of investigating periodicities in disturbed series, with special reference to
Wölfer’s sunspot numbers. Philosophical transactions of the Royal Society of London, A226:267–98,
1927.
[175] M. Zamora. How disturbed is your sleep? The study of arousals using neural networks. In Neural
Computing Application Forum Meeting, Oxford, England, Sep 1998. NCAF.
[176] M. Zamora and L. Tarassenko. The study of micro-arousals using neural network analysis of the EEG.
In IEE Ninth International Conference on Artificial Neural Networks, volume 2, pages 625–30, Edinburgh, Scotland, Sep 1999. IEE.