CONTROL AND COMMUNICATION FOR
PHYSICALLY DISABLED PEOPLE, BASED
ON VESTIGIAL SIGNALS FROM THE BODY
BY
YVONNE MAY NOLAN
B.E.
A thesis presented to
The National University of Ireland
in fulfilment of the
requirements for the degree of
PHILOSOPHIAE DOCTOR
in the
DEPARTMENT OF ELECTRONIC AND ELECTRICAL ENGINEERING
FACULTY OF ENGINEERING AND ARCHITECTURE
NATIONAL UNIVERSITY OF IRELAND, DUBLIN
SEPTEMBER 2005
Supervisor of Research:
Head of Department:
An tOllamh A.M. de Paor
Professor T. Brazil
Abstract
When people become disabled as a result of a road traffic accident, stroke or
another condition, they often lose the ability to control their environment and communicate with others by conventional means. This thesis investigates methods of harnessing vestigial body signals as channels of control and
communication for people with very severe disabilities, using advanced signal
acquisition and processing techniques. Bioelectrical, acoustic and movement
signals are among the signals investigated.
Some applications are presented that have been developed to assist environmental control and communication. These applications rely on a variety
of control signals for operation. Some applications may be controlled by a
simple binary switching action whereas others require user selection from a
wider range of possible options. A mechanical switch or adjustable knob may
be used to interact with these applications but this may not be an option for
people who are very severely disabled.
The remainder of the thesis focuses on alternative methods of enabling user
interaction with these and other applications. If a person who is physically
disabled is able to modify some body signal in such a way that two states can
be distinguished reliably and repeatedly, then this can be used to actuate a
switching action. Reliable detection of more than two states is necessary for
multiple-level switching control. As users' abilities, requirements and personal
preferences vary greatly, a wide range of body signals has been explored.
Bio-signals investigated include the electrooculogram (EOG), the electromyogram (EMG), the mechanomyogram (MMG) and the conductance of the skin.
The EOG is the electrical signal measurable around the eyes and can be used
to detect eye movements with careful signal processing. The EMG and the
MMG are the electrical and mechanical signals observable as a result of muscle contraction. The conductance of the skin varies as a person relaxes or
tenses and with practice it can be consciously controlled. These signals were
all explored as methods of communication and control. In addition, investigation of the underlying physical processes that generate these signals led to the development of a number of mathematical models, which are also presented here.
Small movements may be harnessed using computer vision techniques. This
has the advantage of being non-contact. Often people who have become disabled will still be capable of making flickers of movement, e.g. with a finger
or a toe. While these movements may be too weak to operate a mechanical
switch, if they are repeatable they may be used to provide a switching action
in software through detection with a video camera.
Phoneme recognition is explored as an alternative to speech recognition.
Physically disabled persons who have lost the ability to produce speech may
still be capable of making simple sounds such as single-phoneme utterances.
If these sounds are consistently repeatable then they may be used as the basis of a communication or control device. Phoneme recognition offers another
advantage over speech recognition in that it may provide a method of controlling a continuously varying parameter through varying the length of the
phoneme or the pitch of a vowel sound. Temporal and spectral features that
characterise different phonemes are explored to enable phoneme distinction.
Phoneme recognition devices developed in both hardware and software are
described.
ACKNOWLEDGEMENTS
I would firstly like to thank Harry, my supervisor, for all his support, encouragement and advice and for sacrificing his August bank holiday Monday to
help me get this thesis in on time!
Thanks also to all the postgrads who have been in the lab in the NRH with
me over the past three years - Deirdre, Claire, Catherine, Kieran, Ciaran and
Jane. Special thanks to Ted for all his assistance, support and friendship.
Thanks also to Emer for generating some of the graphs for this thesis.
Thanks to my parents for their patience and financial help and to my sisters
Tamara and Jill for keeping the house (relatively) quiet to enable me to get
some work done.
Thanks to all my friends for understanding my disappearance over the past
few months and giving me space to get this thesis finished.
Finally, a big thanks to Conor for being so supportive and patient with me
over the past few months, for giving me a quiet place to work and for helping
me with the pictures for this thesis!
LIST OF PUBLICATIONS ARISING FROM THIS
THESIS
“An Investigation into Non-Verbal Sound-Based Modes of Human-to-Computer
Communication with Rehabilitation Applications”, Edward Burke, Yvonne
Nolan & Annraoi de Paor, Adjunct Proceedings of 10th International Conference on Human-Computer Interaction, Crete, June 22-27 2003, pp. 241-2.
“The Mechanomyogram as a Tool of Communication and Control for the Disabled”, Yvonne Nolan & Annraoi de Paor, 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Francisco,
CA, September 1-5 2004, pp. 4928-4931.
“An Electrooculogram Based System for Communication and Control Using
Target Position Variation”, Edward Burke, Yvonne Nolan & Annraoi de
Paor, IEEE EMBSS UKRI Postgraduate Conference on Biomedical Engineering and Medical Physics, Reading, UK, July 18-20 2005, pp. 25-6.
“The human eye position control system in a rehabilitation setting”, Yvonne
Nolan, Edward Burke, Claire Boylan & Annraoi de Paor, International Conference on Trends in Biomedical Engineering, University of Zilina, Slovakia,
September 7-9 2005.
Accepted Paper: “Phoneme Recognition Based Software System for Computer
Interaction by Disabled People”, Yvonne Nolan & Annraoi de Paor, IEEE
EUROCON 2005 - International Conference on “Computers as a Tool”, University of Belgrade, Serbia and Montenegro, November 21-24 2005.
Contents
1 Introduction
  1.1 Assistive Technologies
  1.2 Thesis Layout

2 Assistive Technology
  2.1 Introduction
  2.2 Causes of Paralysis
    2.2.1 Neurological Damage
    2.2.2 Spinal Cord Injuries
    2.2.3 Diseases of the Nervous System
  2.3 Assistive Technology
    2.3.1 Importance of a Switching Action
    2.3.2 Switch Based Systems
    2.3.3 Brain Computer Interfaces
  2.4 Communication Device
    2.4.1 Technical Details
    2.4.2 The Natterbox Graphical User Interface
    2.4.3 Switch Interface Box
    2.4.4 Other Features
    2.4.5 Possible Future Developments of Natterbox
  2.5 Conclusions

3 Muscle Signals
  3.1 Introduction
  3.2 The Nervous System
    3.2.1 Nerves and the Nervous System
    3.2.2 Resting and Action Potentials
  3.3 Muscles
    3.3.1 Muscle Physiology
    3.3.2 Muscle Contraction
    3.3.3 Muscle Action in People with Physical Disabilities
  3.4 Electromyogram
    3.4.1 EMG Measurement
    3.4.2 EMG as a Control Signal
  3.5 Mechanomyogram
    3.5.1 MMG as a Control Signal
    3.5.2 MMG Application for Communication and Control
  3.6 Conclusions

4 Other Biosignals - Eye Movements and Skin Conductance
  4.1 Introduction
  4.2 The Electrooculogram
    4.2.1 Introduction
    4.2.2 Anatomy of the Eye
    4.2.3 Eye Tracking Methodologies
    4.2.4 The EOG as a Control Signal
    4.2.5 Target Position Variation
    4.2.6 Experimental Work
    4.2.7 TPV Based Menu Selection
    4.2.8 Limitations of Eyetracking for Cursor Control
    4.2.9 A Model of the Eye
  4.3 Electrodermal Activity as a Control Signal
    4.3.1 Introduction
    4.3.2 Anatomy and Physiology of the Skin
    4.3.3 Electrodermal Activity
    4.3.4 Skin Conductance as a Control Signal
    4.3.5 Non-invasive Measurement of the Sympathetic System Firing Rate
  4.4 Conclusions

5 Visual Techniques
  5.1 Introduction
  5.2 Visual Based Communication and Control Systems
    5.2.1 The Camera Mouse
    5.2.2 Reflected Laser Speckle Pattern
  5.3 Visual Technique for Switching Action
    5.3.1 Introduction
    5.3.2 Technical Details
    5.3.3 Frame Comparison Method
    5.3.4 Path Description Method
  5.4 Conclusions

6 Acoustic Body Signals
  6.1 Introduction
  6.2 Speech Recognition
    6.2.1 Speech Recognition: Techniques
    6.2.2 Speech Recognition: Limitations
  6.3 Anatomy, Physiology and Physics of Speech Production
    6.3.1 Respiration
    6.3.2 Phonation
    6.3.3 Resonance
    6.3.4 Articulation
  6.4 Types of Speech Sounds
    6.4.1 The Phoneme
    6.4.2 Types of Excitation
    6.4.3 Characteristics of Speech Sounds
    6.4.4 Proposal of a Phoneme Recognition Based System for Communication and Control
  6.5 Hardware Application
    6.5.1 Analogue Circuit
    6.5.2 Microcontroller Circuit
  6.6 Software Application
    6.6.1 Application for Linux
    6.6.2 Application for Windows
  6.7 Conclusions

7 Conclusions
  7.1 Introduction
  7.2 Resolution of the Aims of this Thesis
    7.2.1 Overview of Current Communication and Control Methods
    7.2.2 Identification of Signals
    7.2.3 Measurement Techniques
    7.2.4 Signal Processing Techniques and Working Systems Developed
    7.2.5 Patient Testing
    7.2.6 Biological Studies
  7.3 Future Work
    7.3.1 The Mechanomyogram
    7.3.2 Target Position Variation
    7.3.3 Visual Methods for Mouse Cursor Control
    7.3.4 Communication System Speed
    7.3.5 Multi-Modal Control Signals
    7.3.6 Other Vestigial Signals

A MMG Circuit

B Simulink Models

C MATLAB Code for TPV Fit Function

D Optimum Stability

E Circuit Diagram for Measuring Skin Conductance

F Phoneme Detection Circuit Diagrams and Circuit Analysis
  F.1 Analogue Circuit
    F.1.1 Pre-Amplifier
    F.1.2 Filtering
    F.1.3 Amplifier
    F.1.4 Rectifier
    F.1.5 Threshold
    F.1.6 Delay and Comparator
    F.1.7 Relays
  F.2 Microcontroller Circuit
    F.2.1 Microphone
    F.2.2 Amplifier
    F.2.3 Infinite Clipper
    F.2.4 Microcontroller
    F.2.5 Debouncing Circuit
    F.2.6 Current Amplifier and Relay Coils

G PIC 16F84 External Components and Pinout

H Phoneme Recognition Microcontroller Code and Flowchart

I Code for Programs
  I.1 Natterbox
  I.2 USB Switch
  I.3 MMG Detection Program
  I.4 Path Description Program
  I.5 Graphical Menu
  I.6 Spelling Bee
List of Figures
2.1 The Vertebral Column
2.2 The Spinal Nerves
2.3 Dasher program
2.4 Natterbox GUI
2.5 Natterbox Phrases Menu
3.1 The Nerve Cell
3.2 Classification of Nerve Fibre Types
3.3 Nerve Fibres
3.4 An Action Potential
3.5 Muscle Anatomy
3.6 The Muscle Fibre
3.7 Sarcomere
3.8 The Neck Muscles
3.9 EMG and frequency spectrum
3.10 EMG Differential Amplifier
3.11 Electrode Position
3.12 MMG showing Muscle Contraction
3.13 MMG Prosthesis Socket
3.14 Accelerometer
3.15 MMG Processing
4.1 The Outer Eye
4.2 Cross section of the eye
4.3 Pupil and Corneal Reflections
4.4 50 Hz Video Eyetracker
4.5 Scleral Search Coil
4.6 EOG Electrode Positions
4.7 EOG recordings
4.8 EOG controlled alphabet board
4.9 TPV Based Menu Selection Application
4.10 TPV Candidate Target Shapes
4.11 Results of TPV: Experiment 1
4.12 TPV Experiment 2 Screenshot
4.13 TPV Experiment 2
4.14 Fit Values
4.15 Eye feedback control loop
4.16 Step Response of Eye with Muscle Spindle Influence
4.17 Nuclear Bag Model
4.18 Unit step response and Bode magnitude diagrams of the muscle spindle controllers
4.19 Actual EOG and Simulated Saccadic Responses
4.20 Feedback Control Loop for Smooth Pursuit
4.21 Modified loop for Smooth Pursuit
4.22 Bode Plot for Gi(s)
4.23 Smooth Pursuit Model Graphs
4.24 Sweat Gland
4.25 Electrodermal response
4.26 Skin Conductance Model
4.27 Proposed Loop For Firing Rate Output
4.28 Measured and Modelled Skin Conductance
4.29 Measured Skin Conductance and Estimated Firing Rate
5.1 Camera Mouse Search Window
5.2 Speckle Pattern
5.3 Webcam
5.4 Filter Graph used for Video Data in application
5.5 Filtered Video Frames
5.6 Various Thresholding Methods
5.7 Video Frame Histogram
5.8 Path Description
5.9 Region Finding
5.10 Overlapping
6.1 The Vocal Organs
6.2 Waveform of Vowel Sounds
6.3 Spectrum of Vowel Sounds
6.4 Phoneme Waveforms and Spectra
6.5 Analogue Circuit Block Diagram
6.6 Audio signal pre-processing
6.7 AudioWidget GUI
6.8 Graphical Menu
6.9 The X10 Module
6.10 Phoneme Detection Program Signal and Spectrum
6.11 The Spelling Bee GUI
A.1 MMG Circuit
B.1 Simulink MMG Muscle Contraction Detection
B.2 Simulink Model for Eye System
B.3 Simulink Model for Smooth Pursuit
B.4 Simulink Model for Firing Rate
D.1 Root Locus Varying f0
D.2 Root Locus Varying f1
D.3 Root Locus Varying h0
D.4 Root Locus Varying h1
E.1 Skin Conductance Circuit Diagram
F.1 Circuit Diagram for Phoneme Detection
F.2 Electret Microphone Circuit
F.3 Circuit Diagram for PIC-Based Phoneme Detection
G.1 Pin-out Diagram for PIC
H.1 Microcontroller Flowchart
List of Tables
2.1 Cranial Nerve Damage
2.2 Incomplete Spinal Cord Injury Patterns
2.3 Spinal Cord Injuries Motor Classifications
2.4 Spinal Cord Injury Functional Abilities
3.1 MMG Experimental Results
4.1 Icon Parameters
4.2 TPV Experiment 2 Sequence
5.1 Program Steps
5.2 Video Capture Parameters
5.3 RGB24 format
6.1 The Phonemes of Hiberno-English
6.2 Classification of English Consonants
6.3 Spectral Peaks
6.4 Example Relative Harmonic Amplitudes
F.1 Component Values for Phoneme Detection Circuit
F.2 Component Values for PIC-Based Circuit
Chapter 1
Introduction
This thesis arises from work in the Engineering Research Laboratory in the National Rehabilitation Hospital (NRH), Rochestown Ave., Dun Laoghaire, Co. Dublin, Ireland. Typically, the patients in this hospital are people who have become disabled as a result of a stroke, disease
or accident. Advances in medical research are ensuring that more and more
people survive these disabling conditions. It is important that research keeps pace, not only keeping these people alive, but also enabling a fulfilling and worthwhile quality of life.
Loss of speech production abilities can be one of the most devastating
elements of severe physical disability. Without the means to communicate by
conventional methods, people may find themselves shut off from the outside
world. Communication with other people is one of the most important actions
that we as humans perform. It is important to be able to converse with loved
ones, and to have a means for expressing our emotions, needs and desires.
Communication with others allows us to build relationships, make requests,
reach our intellectual potential and lead a stimulating and participative life.
The independence of people with severe physical disabilities is also an important consideration. Results from the 2002 census from the Central Statistics
Office [1] indicate that there are 159,000 people in this country who provide
regular unpaid help for a friend or family member with a long-term illness,
health problem or disability. Frequent reliance on family and friends can be
frustrating for the disabled person, both for practical reasons and because it
can compromise a person’s feelings of dignity. As technology advances, it is
important to ensure that systems are developed which can provide disabled
people with the ability to control their living environment, without needing
assistance from others.
1.1 Assistive Technologies
For people who are unable to control their environment and communicate
with others by conventional means, there are various systems available which
provide alternative methods of performing these tasks. The term augmentative
and alternative communication is often used to describe a range of alternative
communication techniques, from the use of gestures, sign language and facial
expressions to the use of alphabet or picture symbol boards [2]. In order to be
able to make use of these systems it is necessary to be able to interact with the
system in some way. Perkins and Stenning [3] state that the main objective
for people who are unable to use a keyboard is to be able to identify a function
or movement over which they have some control and utilise that. This could
be from movement of the head, eyes, chin, arms, hands or feet, for example.
These movements can be converted into such electrical signals as “on” or “off”
switches, or, in the case of those with a little more control, variable voltages.
People with very severe physical disabilities may only be capable of making
very small movements to indicate intent that may be difficult to harness. The
focus of this thesis is on investigating advanced methods of signal acquisition
and signal processing to enable these signals to be captured and used to control
communication and control devices. The principal aims of this thesis may be
outlined as follows.
• Overview of current methods of providing communication and control
for disabled people.
• Identification of alternative signals from the body which may be harnessed for communication and control purposes for people
with very severe disabilities.
• Study of measurement techniques that may be used to acquire these
vestigial signals.
• Investigation of signal processing methods to enable these signals to be
correctly interpreted.
• Development of working systems that demonstrate the capabilities of
these techniques.
• Testing of these techniques and systems with people with severe disabilities.
• Development of some mathematical models that evolved as a result of
studying these body signals.
1.2 Thesis Layout
Some of the causes of paralysis and severe disability are outlined in Chapter 2.
An overview of assistive technology applications that may be of relevance to
people with very severe disabilities is given and the importance of identifying a
switching action is emphasised. An alphabet board based communication tool called the Natterbox was developed as part of this work. This is also described
in Chapter 2.
The nervous system and the structure of muscle are described in Chapter 3, and
the mechanism of muscle contraction is described. Often people who are disabled will retain some ability to contract certain muscles, but not to a sufficient
extent to enable a mechanical switch to be used. However, the muscle contraction may still be harnessed for communication and control purposes through
other means. The electromyogram is the electrical signal observable from the
surface of the skin due to action potentials which occur on contraction. The
electromyogram as a control signal for prosthetics and for communication and
control systems is described. An alternative method of measuring muscle contraction for communication and control purposes is proposed. This method
uses the mechanomyogram, which is the mechanical signal observable on the
skin surface due to muscle contraction. A mechanomyogram based system for
communication and control was developed and this is presented here. Some
experiments were also performed with this system to assess its efficacy in controlling an alphabet board. The results of these experiments are reported.
Two more biosignals are investigated in Chapter 4, the electrooculogram
and the electrical conductance of the skin. The electrooculogram is the electrical signal observable around the eyes which can be used to measure eye
movement. An overview of different eye movement measurement techniques
is given and the electrooculogram is described in more detail. Some limitations of the electrooculogram signal as a communication and control signal
are identified and a novel technique is presented that seeks to overcome these
limitations to allow the electrooculogram to be used as a control signal. Study
of movement of the eyes led to development of a mathematical model of the
eye, which is also presented in Chapter 4. This model incorporates the effect
of the muscle spindle on the eye’s torque and predicts saccadic and smooth
pursuit eye movements. The electrical conductance of the skin is also briefly
explored as a control signal. Electrical skin conductance is related to sweat
gland activity on the surface of the skin and may be modulated by tensing
or relaxing, as will be discussed. Resulting from this study, a technique for
measuring the firing rate of the sympathetic nervous system was developed
which uses measurement of the skin conductance as its input.
Visual techniques, which use a computer camera or another light sensitive device to measure movement, are discussed in Chapter 5. Often people who
have become disabled will retain the ability to make flickers of movement of
a certain body part, for example a finger or a thumb. If these movements are
repeatable then they may be used to indicate intent. A novel algorithm for
describing specific paths of motion is presented. This algorithm is incorporated
into a software program, which detects specific movements and uses them to
generate a switching action. This switching action can then be used to control
any communication and control application operable by one switch.
Acoustic methods of harnessing signals from the body are explored in Chapter 6. For people who have speech production abilities, there is a wide range
of speech recognition technologies available that allow environmental control
using the voice. For those who are unable to speak, there may still be ways
of harnessing acoustic signals from the body. Often people who have lost the
ability to produce speech will remain capable of producing non-verbal utterances. If these utterances are repeatable then they may be used as the basis
of a communication and control system. A number of acoustic based systems
were developed as part of the work described here and these are presented in
this chapter. A system for controlling a reading machine, an environmental
controller and an alphabet board based communication device are given.
The conclusions drawn from the research presented here are given in Chapter 7. Suggestions are made for future work in the area of communication and
control for disabled people.
Chapter 2
Assistive Technology
2.1 Introduction
Assistive technology is defined by Lazzaro [4] as any device that enables persons with disabilities to work, study, live, or play independently. Cook and
Hussey [5] describe it as any device or technology that increases or improves
the functional capabilities of individuals with disabilities. Assistive technology
may offer assistance to people with a wide range of disabilities including vision, hearing, motor, speech and learning impairments. Screen magnifiers and
braille are assistive technologies for blind or partially sighted persons. Hearing
aids and subtitled films may be classed as assistive technologies for the deaf.
This thesis focuses on assistive technologies for people who, for one reason or
another, require assistance to communicate with others and to control their
environment. A principal aim of this thesis is to explore ways in which signals
from the body may be harnessed so that people with extremely severe physical
disabilities can interact with control and communication devices.
In this chapter, some of the possible causes of paralysis are first described
in Section 2.2. Section 2.3 reviews some of the available assistive technology
devices that may be of benefit to such people. An application called the
Natterbox is presented in Section 2.4. This communication application was
developed as part of this work to act as a testing board for switching action
methods described in later chapters.
2.2 Causes of Paralysis
There are many different circumstances that will lead to a person requiring
the use of an assistive device to communicate with others or to control their
environment. Paralysis can result from spinal injury following a road traffic
accident or other trauma. It can be caused by damage to the brain due to a
brain haemorrhage or a tumour. Motor neurone diseases, which cause wasting
of the muscle tissue, may eventually lead to paralysis, and necessitate use of a
communication and control device.
Some of the reasons that may lead to a person becoming severely physically
disabled are discussed in this section although this review is by no means
exhaustive. A major focus of this thesis is on exploring a range of available
options, so that a suitable assistive technology system may be identified for
each individual user, based on their capabilities and requirements, rather than
offering one single solution that will allow all severely disabled people to use a
control and communication device. Similarly, it is impossible to state here the
exact group of people who might benefit from the methods described in this
thesis. Some of the more common causes of paralysis will now be discussed.
2.2.1 Neurological Damage
Neurological damage, or damage to the brain, can occur due to a number of
different circumstances. One of the most common causes is stroke. The Irish Health Website [6] estimates that 8500 people in this country suffer a stroke annually.
Stroke is not a disease in itself, but a syndrome of neurological damage
caused by cerebrovascular disease [7]. Although paralysis is the most commonly associated aspect of a stroke, the stroke syndrome consists of a number
of different aspects which also include spasticity, contractures, sensory disturbances, psychological impairments, emotional and personality changes and
apraxia (the loss of ability to carry out familiar purposeful movements in the
absence of paralysis [8]).
A stroke occurs when normal blood circulation in the brain is interrupted,
either due to occlusion caused by a blood clot (an ischemic stroke) or through
sudden bursting of blood vessels (a haemorrhagic stroke). Strokes due to
blood clots may be divided into two categories. Cerebral thrombosis occurs
due to a clot that develops in situ, and cerebral embolism is caused by a clot that
forms elsewhere in the body and travels up to the brain [7]. Paralysis can
result from damage to the frontal lobe and/or damage to the internal capsule
fibres. The frontal lobe of the brain contains the motor area, which connects
to the motor cranial nerve nuclei and the anterior horn cells. The internal
capsule of the brain is the narrow pathway for all motor and sensory fibres
ascending from lower levels to the cortex. Damage to one side of the motor
fibres or the frontal lobe leads to loss of power in the muscles on the side of
the body opposite the lesion [9], a paralysis known as hemiplegia [8].
While paralysis is the main symptom of a stroke relevant here, some other
symptoms caused by damage to the cranial nerves are summarised in Table
2.1. The cranial nerves exist in pairs and damage to one of the nerves may
result in the symptoms listed on the side of the lesion. Note that damage to
the tenth nerve is one of the causes of total or partial loss of speech production
abilities. Speech impairments will be discussed in more detail in Chapter 6.
Following a stroke, some voluntary movement may return within a few
weeks of the incident. This is usually due to a number of causes. Following
cerebral infarction and particularly in the case of a cerebral haemorrhage,
abnormally large amounts of fluid in the surrounding tissue can temporarily
Table 2.1: Signs and symptoms of cranial damage, adapted from [10], pg. 100

    Nerve  Name              Signs and Symptoms of Damage
    V      trigeminal        Pain and burning on outer and inner aspect of
                             cheek; loss of sensation over face and cheek
    VI     abducens          Diplopia, external rectus weakness, squint
    VII    facial            Weakness of face
    VIII   auditory          Vertigo, vomiting, nystagmus; deafness and tinnitus
    IX     glossopharyngeal  Loss of taste
    X      vagus             Dysphagia; paralysis of vocal cord and palate
disrupt neurological function. As the pressure subsides, the neurons in this
area may regain function. Motor function may also be restored due to central
nervous system reorganisation where other areas of the brain take on the role
of voluntary motor control [7]. This partial return of voluntary movement
following a stroke may be of enormous benefit when considering methods for
enabling stroke victims to interact with control and communication systems.
2.2.2 Spinal Cord Injuries
Spinal cord injuries usually occur as the result of a trauma, which is often
caused by a road traffic accident or a domestic, sporting or work-related injury.
The basic anatomical features of the spine and the innervation of the spinal
cord will first be discussed and the classifications of spinal cord injury will then
be described.
Structure of the Vertebral Column and the Spine
The spinal cord is protected by the vertebral column, a line of bony vertebrae
that runs down the middle of the back. The structure of the vertebral column
is shown in Figure 2.1. When viewed from the side, the vertebral column
displays five curves - an upper and a lower cervical curve, and one each in the thoracic, lumbar and sacral regions [11]. The sacral curve is not shown in Figure 2.1 but it is located at the very bottom of the vertebral column, from the lumbosacral
junction to the coccyx. The coccyx is better known as the tailbone, which is
made up of several fused vertebrae at the base of the spine [12]. The spinal
cord terminates before the end of the vertebral column, around the top of the
lumbar vertebrae in adults [13]. The lower tip of the spinal cord is called the
conus medullaris [8]. The area from the conus medullaris to the coccyx is
known as the cauda equina [13].
• The Cervical Spine
The purpose of the cervical spine is mobility. The two curves in the
cervical spine can be divided into upper and lower segments at the second
cervical vertebra. The first cervical vertebra (C1) is called the atlas and
the second cervical vertebra (C2) is called the axis. The upper cervical
muscles move the head and neck and are principally concerned with
positioning of the eyes and the line of vision, hence these muscles are
highly innervated to enable these movements to be made with a fine
degree of precision [11]. The axis provides a pivot about which the atlas
and head rotate. The lower cervical spine (C2-C7) also contributes to movement of the head and neck.
• The Thoracic Spine and Ribs
An important function of the thoracic spine and rib cage is to protect
the heart, lungs and major vessels from compression. Due to this, the
thoracic area is the least mobile region of the spine. The thoracic vertebrae are numbered T1-T12 and the ribs are numbered R1-R12 on each
side. The diaphragm muscle fibres are attached to ribs R7-R12.
Figure 2.1: The Vertebral Column, from pg. 2 in [11]
• The Lumbar Spine
The lumbar spine is made up of five vertebrae numbered L1-L5. The fifth
lumbar vertebra (L5) is the largest and its ligaments assist in stabilising
the lumbar spine to the pelvis.
There are 31 pairs of spinal nerves attached to the spinal column. Each
pair is named according to the vertebra to which it is related. The spinal
nerves are shown in Figure 2.2.
Classification of Injury
Injury of the spinal cord may produce damage that results in complete or
incomplete impairment of function. A complete lesion is one where motor and
sensory function are absent below the level of injury. A complete lesion may be
caused by a complete severance of the spinal cord, by nerve fibre breakage due
to stretching of the cord or due to a restriction of blood flow (ischaemia) to the
cord. An incomplete lesion will enable certain degrees of motor and/or sensory
function below the injury [14]. There are recognised patterns of incomplete
spinal cord injuries, which are summarised in Table 2.2.
A spinal cord injury may produce damage to upper motor neurons, lower
motor neurons or both. Upper motor neurons originate in the brain and are located within the spinal cord. An upper motor neuron injury will be located at
or above T12. Upper motor neuron injury produces spasticity of limbs below
the level of the lesion and spasticity of bowel and bladder functioning. Lower
motor neurons originate within the spinal cord where they receive nerve impulses from the upper motor neurons. These neurons transmit motor impulses
to specific muscle groups and receive sensory information which is transmitted
back to the upper motor neurons. Lower motor neuron injuries may occur at
the level of the upper neuron but more commonly are identified when occurring
at or below T12. Lower motor neuron injuries produce flaccidity of the legs,
decreased muscle tone, loss of reflexes and atonicity of bladder and bowel [14].
Figure 2.2: The Spinal Nerves, from pg. 208 in [11]
Table 2.2: Patterns of incomplete spinal cord injuries, from text in [14]

    Syndrome           Damaged Area          Common Cause    Characteristics
    Central Cord       Cervical region       Hyperextension  Flaccid arm weakness;
                                             injury          good leg function
    Brown-Séquard      Hemisection of        Stab wound      Injured side: loss of motor
                       spinal cord                           function. Uninjured side:
                                                             loss of temperature & pain
                                                             sensation
    Anterior Cord      Corticospinal &       Ischaemia &     Variable loss of motor
                       spinothalamic tracts  direct trauma   function; reduced sensitivity
                                                             to pain and temperature
    Conus medullaris/  Sacral cord or the                    Flaccid bladder and bowel;
    cauda equina       cauda equina nerves                   loss of leg motor function
Spinal cord injuries due to complete lesions are usually classified according
to the level of injury to the spine. Table 2.3 summarises the motor classification of spinal cord injury. The word paraplegia describes lower lesion spinal
cord injuries resulting in partial or total loss of the use of the legs. The words
tetraplegia and quadriplegia both describe high level spinal cord injuries, usually occurring due to injury of the cervical spine. Both terms mean “paralysis
of four limbs” and the injury causes the victim to lose total or partial use of
their arms and legs [15].
The main causes of spinal cord injury may be gauged from figures from the
Duke of Cornwall Spinal Treatment Centre, which are given in [16]. For the
new patient admissions with spinal injuries for the period 1993-1995, 36% were due to road traffic accidents, 6.5% to self harm and criminal assault, 37% to domestic and industrial accidents and 20.5% to injuries at sport. Until recently, spinal cord injury was regarded as a fatal condition.
Table 2.3: Motor classification of spinal cord injury, adapted from pg. 63 in [14]

    Level  Muscles            Level  Muscles
    C4     Deltoids           L2     Hip flexors
    C5     Elbow flexors      L3     Knee extensors
    C6     Wrist extensors    L4     Ankle dorsiflexors
    C7     Elbow extensors    L5     Long toe extensors
    C8     Finger flexors     S1     Ankle plantar flexors
    T1     Finger abductors   S4-S5  Anal contraction
In the First World War, 90% of patients who suffered a spinal cord injury died
within one year of wounding and only about 1% survived more than 20 years
[16]. The chances of survival from a spinal cord injury began to increase in the
1940s with the introduction of sulfanilamides and antibiotics [14]. Nowadays,
due to better understanding and management of spinal cord injury, the outlook
has greatly improved for people with spinal cord injuries.
There has been a gradual change in the pattern of survival: low-lesion paraplegia in the 1950s, high-lesion paraplegia in the 1960s and low-lesion quadriplegia in the 1970s. Finally, since the 1980s, people with spinal cord injuries at or above C4, resulting in high-lesion quadriplegia, have been surviving in significant numbers. It is estimated that each year in the USA, 166 people sustain injury at C1-C3 and 540 at C4 [14]. As medicine advances, such individuals will survive in increasing numbers and thus it is important to identify
methods for interaction with communication and control systems for this group
of severely disabled individuals.
The functional abilities of tetraplegic patients, based on the level of injury, are summarised in Table 2.4. In general, movements of the limbs suffer more severely than those of the head, neck and trunk. Movements of the lower face also tend to be more severely impaired than those of the upper face [10].
Table 2.4: Expected functional ability based on level of injury, constructed using information from [16].

    Level of Injury           Functional Ability
    Complete lesion below C3  Dependent on others for all care; chin and head
                              movement; can use breath controlled devices
    Complete lesion below C4  Dependent on others for all care; chin and head
                              movement; shoulder shrugging possible; can
                              type/use computer using a mouth stick
    Complete lesion below C5  Shoulder movement; elbow flexion
    Complete lesion below C6  Wrist extension
    Complete lesion below C7  Full wrist movement; some hand function
    Complete lesion below C8  All hand muscles except intrinsics preserved
    Complete lesion below T1  Complete innervation of arms
2.2.3 Diseases of the Nervous System
The terms motor neurone disease (MND) and amyotrophic lateral sclerosis (ALS) are often used interchangeably. However, amyotrophic lateral sclerosis is more accurately described as a type of motor neurone disease, and probably the best known. Motor neurone diseases affect the motor nerves in the brain and the spinal cord [17] and the term motor neurone disease may
be used to describe all the diseases of the anterior horn cells and motor system,
including ALS [18].
Motor neurone diseases may be divided into two categories - idiopathic motor neurone diseases and toxin-related motor neurone diseases. An idiopathic
disease is one of spontaneous origin [8]. The idiopathic motor neurone diseases
include both the familial and juvenile forms of amyotrophic lateral sclerosis.
Also included under this category are progressive bulbar palsy (PBP), progressive muscular atrophy (PMA), primary lateral sclerosis (PLS), Madras
motor neurone disease and monomelic motor neurone disease [18]. The toxin-related motor neurone diseases are suspected to be linked to environmental
factors [18]. These include Guamanian ALS (due to a high incidence of ALS
in Guam), lathyrism and Konzo.
The exact figure for the number of people diagnosed with ALS varies, but
it is thought to affect between 1 and 3 in every 100,000 of the population each year
[17, 18]. There are an estimated 300 people living with amyotrophic lateral
sclerosis at any one time in Ireland [17]. ALS is a progressive fatal disease of the
nervous system and the rate of progression depends on the individual [18]. The
muscles first affected by motor neurone diseases tend to be those in the hands,
feet or mouth and throat. As ALS progresses, the ability to walk, use the
upper limbs and feed orally is progressively reduced. In the terminal stage
of the disease, none of these functions can be independently performed and
respiratory functions become compromised [18]. At this stage of the disease, it
is as important as ever to give the person the best quality of life possible and
assistive technologies must be considered that can harness the vestigial signals
left to these people. Usually, the motor function of the eye muscles is spared
due to the calcium binding proteins in these nerve cells [18] and this feature
could be used to provide a method of control and communication, as will be
discussed in Chapter 4. Brain computer interface (BCI) technologies are also
often considered at the very latest stages of the disease; these will be briefly described in Section 2.3.3.
Paralysis can also occur due to demyelinating diseases such as multiple
sclerosis. A demyelinating disease causes impairment of conduction of signals
in nerves as it damages the myelin sheath of neurons. The structure of nerves will be described in more detail in Chapter 3. Neurological damage resulting in paralysis may also occur due to viral infections such as poliomyelitis (polio)
[10] or due to bacterial infections such as bacterial meningitis, which affects
the fluid in the spinal cord and the fluid surrounding the brain [19].
2.3 Assistive Technology
Assistive technologies can be of immense benefit to people with severe physical
disabilities such as those described above. As mentioned already, this thesis
focuses mainly on facilitating interaction with two types of assistive technology
applications - control and communication.
Communication applications are usually described in assistive technology
terms as augmentative and alternative communication (AAC) systems [2].
Augmentative and alternative communication systems refer to assistive technology systems designed for people who have limited or no speech production
abilities. Alternative communication systems usually consist of some sort of
alphabet board or symbolic board [4]. Some alternative communication systems display text to a computer screen, others output the text to a printer and
some work in conjunction with speech synthesis systems to “speak out” the intended message. Some are computer operated and some are handheld, such
as the LightWriter¹, a dual display keyboard based communication aid. Some, such as the Voicemate², allow the user to record phrases for digitised playback [4].

¹ Lightwriter, Zygo Industries, Inc., P.O. Box 1008, Portland, OR 97202 USA
² Voicemate, Tash Inc., Unit 1, 91 Station Street, Ajax, Ont. L1S 3H2, Canada.
Control applications refer to any system that can be operated automatically
using a control signal. For example, a control signal could be used to handle an
environmental control system to operate appliances in the user’s environment,
such as lights, fans or the television. The reading machine described in Chapter
6 is another example of a system that may be operated using a control signal.
Control signals can also be used to operate wheelchairs or electrically powered
prosthetics. The electromyogram muscle signal is often harnessed to replace
muscle function to control prosthetics for amputees, as described in Chapter
3.
2.3.1 Importance of a Switching Action
The simplest control signal is probably the switching action, which is any
action that allows the user to alternate between two possible states, “on” or
“off”. There are numerous systems in use today that may be operated by
pressing a single switch or multiple switches. Such systems are often called
switch-activated input systems [2]. A standard computer keyboard may be
described as a switch based system for interfacing with a computer. The
keyboard usually has around 100 keys or switches and each key press sends
a control signal to the processor which is recognised as a different letter or
symbol by the computer. The combination of two or more key presses may
also be used to increase the number of possible control signals [5].
There are many types of commercially available switches and a comprehensive guide to switches is given in [20]. The standard type of switch is the paddle
type switch. These mechanical switches have movement in one direction and
can be activated by the user by pressing on the switch with any part of the
body. For persons who do not have sufficient strength or ability to operate
these switches there are a number of other types of switches available. These
switches include suck-puff switches, wobble switches, leaf switches and lever
switches [5, 21]. The switch chosen for a particular individual will depend on
the capabilities of the user.
For people who are very severely physically disabled, performing a switching action using any of these physical switches may not be an option. In
these cases, other methods of harnessing signals from the body to provide a
switching signal must be explored. One of the main objectives in developing
alternative systems for communication and control is to be able to correctly
identify two or more distinct states that a user can voluntarily elicit. If these
states can be reliably distinguished, then transition from one state to another
can be harnessed as a means of effecting a switching action.
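This two-state principle can be illustrated with a minimal sketch in Python (illustrative only: the threshold values, and the assumption that the body signal has been normalised to the range 0-1, are not taken from any particular system). A single threshold would chatter whenever the signal hovers near it, so two thresholds with hysteresis are used: the switch turns “on” only above the upper level and re-arms only below the lower level.

    # Minimal sketch: deriving a switching action from a one-dimensional
    # body signal, assumed normalised to 0-1. Threshold values are
    # illustrative assumptions.
    def switch_events(samples, on_level=0.6, off_level=0.4):
        """Yield one event for each off-to-on transition of the signal."""
        on = False
        for x in samples:
            if not on and x > on_level:
                on = True
                yield "switch pressed"   # a switching action has occurred
            elif on and x < off_level:
                on = False               # re-arm for the next activation

    # e.g. list(switch_events([0.1, 0.7, 0.5, 0.3, 0.8])) gives two events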
2.3.2 Switch Based Systems
Switches are generally used in one of two ways - in a scanning system or in a
coding system. In a coding system, the user taps out a message on the switch using some scheme such as the famous Morse code. The Morse code
software functions like a translator, converting Morse code to text in real time
[4]. The coding can either be done using one switch with long switch presses for
the dash and short switch presses for the dots, or using two separate switches to
represent dots and dashes [2]. Morse code based systems have the disadvantage
that the code must first be learnt by the user.
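The decoding step itself is simple enough to sketch (Python; the code table is abbreviated, and the 0.3 s dot/dash boundary and the pause rule are illustrative assumptions, not the behaviour of any particular product). Press durations are classified as dots or dashes, and a sufficiently long pause marks the end of a letter.

    # Minimal sketch of one-switch Morse decoding from timed presses.
    MORSE = {".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E"}
    # ... remaining letters of the code table omitted for brevity

    def decode(presses, dash_boundary=0.3):
        """presses: a list of (press_duration, pause_after) pairs in seconds."""
        text, symbol = "", ""
        for duration, pause in presses:
            symbol += "-" if duration > dash_boundary else "."
            if pause > 3 * dash_boundary:       # a long pause ends the letter
                text += MORSE.get(symbol, "?")
                symbol = ""
        return text + (MORSE.get(symbol, "?") if symbol else "")

    # e.g. decode([(0.1, 0.1), (0.5, 1.0)]) returns "A" (dot then dash)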
A more popular type of switch-activated input system uses scanning based
selection. These systems are usually based on some variation of the row-scanning method described by Simpson and Koester [22]. The user is presented
with a screen of options, arranged in rows and columns. The program scans
through the rows and the user can select a particular row by pressing a switch.
The program then scans through each item on the selected row and the user
can select the desired item by pressing a switch again. Row scanning is often
used in software alphabet boards and can be used to spell out messages [2].
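The row-scanning principle just described can be summarised in a short sketch (Python; the highlight and wait_for_press callables, which would be supplied by the surrounding application's display and switch-input code, are hypothetical).

    # Minimal sketch of row-column scanning with a single switch.
    # board: a list of rows, each row a list of selectable items.
    # wait_for_press(delay): returns True if the switch is pressed within
    # the scan delay; highlight(x): marks x on the screen.
    def row_column_scan(board, wait_for_press, highlight, scan_delay=1.0):
        while True:
            for row in board:
                highlight(row)                   # highlight each row in turn
                if wait_for_press(scan_delay):   # first press selects the row
                    for item in row:
                        highlight(item)          # then scan along that row
                        if wait_for_press(scan_delay):
                            return item          # second press selects the item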
The idea of switch based menu selection has been around for years. The
personal computer became popular in the early 1980s and software based assistive technology systems soon followed. An independent living system known
as the ADAPTER program was developed around 20 years ago by a team in
Louisiana Tech University in the USA [23]. This program uses the row-scanning
method to allow the user to select one of several tasks from a menu. The five
options given are letters, words, codes, phone and environment. The program
is designed to be operated with a mechanical switch and the two examples
mentioned are a push-button switch and a bulb-pressure switch. If the user
selects the letter option on the main menu then they will be presented with a
second sub-menu with rows of letters and numbers which allows messages to be
spelled out. The word option provides quick access to a list of important words
e.g. light, water, bath etc. Selection of the code option allows communication
through Morse code by pressing the switch for long or short periods which is
then converted to text. The phone option displays a pre-programmed list of
names and phone numbers which may be dialled through the computer and
the environment option allows control of appliances in the user’s surroundings.
Another scanning based alphabet board system developed around this time
is described in [21], in which the scanning device is a hardware logic-based module that uses LEDs to highlight each character. This device can be connected
to the computer as a substitute for a manually operated keyboard. The system uses two switches to scan through the characters and enter the required
character into the computer.
Damper [24] estimates that a communication rate of 6-8 words per minute
is typically achieved using an alphabet board based communication system.
There have been a number of different methods suggested for increasing the
rate at which the user can select the letters. Perkins and Stenning [3] experimented with the idea of using two or five switches to operate an alphabet
board and also tested the communication rate with different menu layouts.
The two layouts tested each had 57 characters - one had each letter and each number once, and the second repeated characters according to frequency of use (e.g. the letter ’E’ appears on the board five times) but had no numbers. Simpson and
Koester [22] have proposed a method of increasing text entry rate using an
adaptive row-column scanning algorithm which increases or decreases the scan
delays according to user performance.
Although it is not yet implemented as a switch-based text entry system, the Dasher program by Ward [25] will briefly be described. Rates of 39 words per minute have been claimed for it when operated using a mouse and 25 words per minute when operated using eye tracking. It is a software
program which enables a person to spell out words by steering through a
continuously expanding two-dimensional scene containing alphabetical listings
of the letters [26]. A screenshot from this program is shown in Figure 2.3. The
line in the centre of the screen is the cursor. The user is initially presented
with an alphabetical list of letters and the user selects a letter by moving the
cursor inside the area of the letter. As the user approaches a letter the letter
grows in size. Once the letter is selected the user is again presented with another list of letters, but the relative size of each letter on the new list is based on the probability of that letter being the desired one, given the letters already selected. Dasher uses a language model to predict
this, and the model is trainable on example documents in almost any language
[26]. In the example shown in Figure 2.3, the user is spelling out the word
“demonstration” and has already selected “demonstrat”. As the user moves the
cursor closer towards the letter “i”, the letter grows in size until the user is inside
the box. The screenshot also illustrates alternative words that could instead
have been selected such as “demolished”, “demonstrated that”, “demoralise”
and “demonstrative”.
Figure 2.3: Dasher program - spelling out the word “demonstration”.
A number of different methods for interfacing with the Dasher program are suggested on the Dasher website [27], including a mouse,
a joystick, eye-tracking and head-tracking. Future possible developments of
Dasher are described in [26], and include a suggestion for a modified method
for operation using a single switch. This will allow the user to operate Dasher
using a switch that changes the direction of cursor movement on activation.
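In outline (this is a sketch of the principle rather than Dasher's exact arithmetic), if the user has already entered the string $s$, the height $h_i$ allotted to a candidate next letter $\ell_i$ in the new column is proportional to the conditional probability assigned by the language model, $h_i \propto P(\ell_i \mid s)$, so that likely continuations present larger, easier targets.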
2.3.3 Brain Computer Interfaces
Brain computer interfaces (BCI) may offer another method of providing switching actions. They are usually used in situations of very severe disability where no other method of communication and control is possible. These methods allow the user
to interact with the computer using some measurement of brain activity, such
as functional magnetic resonance imaging (fMRI) or the electroencephalogram,
the electrical signal measurable from the surface of the scalp. Correct interpretation of these signals can be used to convey user intention and thus actuate
a switching action. Brain computer interfaces for disabled people constitute a large and active research area, and the interested reader is referred to the IEEE review
of the first international BCI technology meeting [28] as a starting point for
more information.
2.4 Communication Device
A software communication device called Natterbox was developed as part of
this study, based on an alphabet board. The code for this program is included
in Appendix I. Although there are many similar communication programs
available commercially, this program was developed for two reasons. Firstly, it
was in response to a request made by one of the occupational therapists in the
hospital, who had been using a previous version of the same program, which
had been developed earlier in our laboratory in the NRH. She was attempting
to use the system with a male patient who had suffered from a brainstem
stroke. The patient had poor visual ability and was also very photosensitive.
This rendered him unable to see the letters of the alphabet board on screen.
She suggested making each of the rows of the alphabet board a different colour,
in accordance with the layout of physical alphabet boards used by occupational
therapists. An auditory facility was then added which speaks out the colours
on each of the different rows as they are highlighted. The patient was able
to learn which letters corresponded to which coloured row and hence could
perform a switching action when the program called out the name of the row
that was desired. The program then calls out each letter in that row in turn,
and the user can again select the desired letter when it is reached, thus enabling
the user to spell out messages.
The second benefit gained from development of the Natterbox program is that it served as a useful test bed for the different switching mechanisms
developed in the work presented here. Since the Natterbox allows the user to
spell out words and sentences simply by performing a single switching action,
it was an invaluable tool in demonstrating translation of different body signals
into communication. The Natterbox program as described here was used by
a number of different patients in the hospital. For each of these patients, a
reliable method of interfacing with the program had to be identified and some
of the techniques used are discussed in this thesis. As the program developed,
various features were added in response to therapist and patient requests. Some
of these will now be briefly outlined.
2.4.1 Technical Details
The Natterbox program was developed in C++ using the Fast Light Toolkit (FLTK)3 to build the graphical user interface. The sound feature was added using tools from the Simple DirectMedia Layer (SDL)4, a cross-platform multimedia library designed to provide access to audio devices. The primary advantage of using FLTK and SDL is that both are cross-platform, making the Natterbox program portable across different operating systems.
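To illustrate how a scanning interface of this kind can be assembled with FLTK, the following minimal C++ sketch cycles a highlight through four rows on a timer and treats an F2 keypress as the selection. This is not the Natterbox source (which is given in Appendix I); the row count, scan delay, colours and console output are arbitrary assumptions made for illustration.

// Minimal row-scanning sketch in FLTK: a timeout moves the highlight to
// the next row once per second, and pressing F2 selects the current row.
#include <cstdio>
#include <FL/Fl.H>
#include <FL/Fl_Window.H>
#include <FL/Fl_Box.H>

const int kRows = 4;
Fl_Box* rows[kRows];
int current = 0;

void scan_cb(void*) {
    rows[current]->color(FL_BACKGROUND_COLOR);   // un-highlight current row
    rows[current]->redraw();
    current = (current + 1) % kRows;
    rows[current]->color(FL_YELLOW);             // highlight the next row
    rows[current]->redraw();
    Fl::repeat_timeout(1.0, scan_cb);            // assumed scan delay of 1 s
}

class ScanWindow : public Fl_Window {
public:
    using Fl_Window::Fl_Window;
    int handle(int event) override {
        if (event == FL_KEYDOWN && Fl::event_key() == FL_F + 2) {
            // A full implementation would now scan the letters of this row.
            std::printf("Row %d selected\n", current);
            return 1;
        }
        return Fl_Window::handle(event);
    }
};

int main() {
    ScanWindow win(300, 200, "Row scanning sketch");
    for (int i = 0; i < kRows; ++i)
        rows[i] = new Fl_Box(FL_BORDER_BOX, 10, 10 + 45 * i, 280, 40, "row");
    win.end();
    win.show();
    Fl::add_timeout(1.0, scan_cb);
    return Fl::run();
}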
2.4.2 The Natterbox Graphical User Interface
The graphical user interface (GUI) of the Natterbox main menu is shown in
Figure 2.4, demonstrating a message being spelled out. In Figure 2.4(a), the
yellow row is highlighted. The user activates a switch to select this row and the
program begins scanning the letters on that row. In Figure 2.4(b), the symbol
“.” is highlighted. The user again activates a switch to select this symbol.
Figure 2.4(c) shows that the symbol has appeared on the message banner and
3 FLTK website: http://fltk.org
4 SDL website: http://www.libsdl.org
also on the history panel along the right-hand side of the screen.
2.4.3 Switch Interface Box
The switch input required by Natterbox was chosen to be an “F2” keypress.
Thus Natterbox can be used in one of three ways. Firstly it is operable by
simply pressing the physical key on the keyboard. Obviously this is not a very
useful interaction method for people with very severe disabilities. Secondly,
it may be used in conjunction with another program that is monitoring some
signal from the body and will simulate an “F2” keypress when it recognises
intention. Possible methods for harnessing body signals for these purposes form much of the remainder of this thesis.
Thirdly, it may be used with a switch interface box. Any two-way switch,
switch, such as those mentioned in Section 2.3.1, can be connected to this box.
The switch interface box is connected to a USB port of the computer and
a supplementary software application simulates an “F2” key press on detection
of a switching action. The supplementary program was called USB Switch and
the code is given in Appendix I.
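On Microsoft Windows, a keypress can be injected programmatically through the Win32 API. The following fragment is a hedged sketch of how a program in the role of USB Switch might simulate the “F2” keypress; the actual USB Switch source is given in Appendix I and may be implemented differently.

// Sketch: inject an F2 key press-and-release into the system input queue
// using the Win32 SendInput call, so that a program waiting for an "F2"
// keypress (such as Natterbox) sees a switching action.
#include <windows.h>

void simulate_f2() {
    INPUT inputs[2] = {};
    inputs[0].type = INPUT_KEYBOARD;
    inputs[0].ki.wVk = VK_F2;                 // key down
    inputs[1].type = INPUT_KEYBOARD;
    inputs[1].ki.wVk = VK_F2;
    inputs[1].ki.dwFlags = KEYEVENTF_KEYUP;   // key up
    SendInput(2, inputs, sizeof(INPUT));
}

This routine would be called whenever the supplementary program detects a switch closure on the interface box.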
2.4.4 Other Features
Phrases Menu
Due to requests from the occupational therapists in the hospital, the option
of a sub-menu was added to the Natterbox program. This sub-menu provides
quick access to a list of commonly used phrases. This menu may be opened
by selecting the last row in the main menu. The sub-menu screen is shown in
Figure 2.5(a). When the user selects the phrase “Turn on or off fan” it appears in the message banner back in the main screen. This phrase could be used by the user to request that the fan is turned on if it is currently off, or turned off if it is on.
Figure 2.4: The Natterbox program (a) The program is highlighting the second
(yellow) row. (b) When the user selects the second row the program begins scanning the
letters on this row. The “.” button is currently highlighted. (c) The user selects this
symbol and it appears above on the banner.
Printing Feature
An option to print the message to paper was added in response to a request
from a patient who wanted a facility for writing letters to her children. This
request was fulfilled by placing an option “Print” at the bottom of the phrases
menu. Selection of this option sends all the text in the history box to an
attached printer. This option could be of immense benefit to users since it
allows the user to prepare lengthy messages in advance.
Cancel Feature
A “cancel” option was added for people who are capable of actuating a second switching action. The second switch input cancels the effect of the last
input. Thus if the user has accidentally selected a letter they may delete this letter from the message bar by activating the second switch. If the user has accidentally selected the wrong row and the program is scanning through each
of the items on that row, the user may use the second switch to change back
to row scanning.
Three-Switch Mouse
A three-switch mouse was developed for one of the patients in the hospital who was particularly successful with the Natterbox program. The
patient used a push-button switch placed between his thumb and hand to
operate the program. He also had head movement on both sides so was able to
operate two head switches. The Natterbox program was modified to include a
mouse cursor control system using these three switching actions. The patient
could exit the alphabet board program by selecting an “Exit” option at the
end of the phrases menu. This switches the program into mouse cursor control
mode. The mouse cursor is controlled by the USB Switch program.
The head switches may be used to move the mouse cursor either up and
down, or left and right. Switching between these two directions is performed using the hand switch. Pressing the hand switch twice in succession actuates a mouse click.
Figure 2.5: The Natterbox Phrases Menu (a) The program is highlighting the second phrase “Turn on or off the fan”. (b) When the user selects this phrase it appears on the banner back in the main menu.
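The following fragment sketches one plausible arrangement of this three-switch logic, using the Win32 SendInput call for cursor movement and clicking. The switch-event source, the step size and the double-press time window are all assumptions made for illustration; the actual program (Appendix I) may behave differently.

// Sketch of three-switch mouse control: two head switches move the cursor
// along the current axis, a single hand-switch press toggles the axis, and
// two hand-switch presses in quick succession actuate a mouse click.
#include <windows.h>
#include <chrono>

enum Switch { HEAD_LEFT, HEAD_RIGHT, HAND };
static bool vertical = false;                          // current movement axis
static auto lastHand = std::chrono::steady_clock::time_point{};

void move_cursor(int dx, int dy) {
    INPUT in = {};
    in.type = INPUT_MOUSE;
    in.mi.dx = dx;
    in.mi.dy = dy;
    in.mi.dwFlags = MOUSEEVENTF_MOVE;                  // relative movement
    SendInput(1, &in, sizeof(INPUT));
}

void click() {
    INPUT in[2] = {};
    in[0].type = in[1].type = INPUT_MOUSE;
    in[0].mi.dwFlags = MOUSEEVENTF_LEFTDOWN;
    in[1].mi.dwFlags = MOUSEEVENTF_LEFTUP;
    SendInput(2, in, sizeof(INPUT));
}

void on_switch(Switch s) {
    using namespace std::chrono;
    const int step = 5;                                // pixels per activation (assumed)
    switch (s) {
    case HEAD_LEFT:
        vertical ? move_cursor(0, -step) : move_cursor(-step, 0);
        break;
    case HEAD_RIGHT:
        vertical ? move_cursor(0, step) : move_cursor(step, 0);
        break;
    case HAND: {
        auto now = steady_clock::now();
        if (now - lastHand < milliseconds(500))
            click();                                   // double press = click
        else
            vertical = !vertical;                      // single press toggles axis
        lastHand = now;
        break;
    }
    }
}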
2.4.5 Possible Future Developments of Natterbox
The addition of a submenu to Natterbox containing numbers and punctuation
marks could be of great benefit. In addition to adding to user dignity by making the messages look more presentable, they could also enable emoticons to
be used to add more meaning to messages. Emoticons have become increasingly popular due to email, instant messaging and text messaging. Emoticons (emotion icons) are a method of adding symbols to the end
of messages to represent different facial expressions. These can be used to
communicate more effectively what is meant by the message. For instance,
the simple term “It’s ok” could be interpreted in a number of different ways.
It can be intended straightforwardly and this can be emphasised by placing a
smiley face symbol at the end of the message i.e. “It’s ok :-)”. Conversely, if the person wishes to impart a sarcastic or unhappy tone to the message, they may express this by adding the sad smiley “It’s ok :-(” or the angry smiley symbol “It’s ok :-@”, depending on intent. These emoticon symbols are becoming increasingly integrated into casual everyday written communications and
could offer an immense benefit to people who are severely disabled and wish to
more effectively convey their emotions when writing messages. The addition
of a speech synthesiser to the complete program to allow the messages to be
spoken out loud is also being considered.
2.5 Conclusions
This chapter has outlined some of the diseases, conditions and circumstances
that may render a person severely physically disabled. A review of assistive
technology applications has been given and the importance of generating a
switching action has been emphasised. Now that these areas have been discussed,
the aims of this thesis may be more accurately defined. This thesis aims to
investigate alternative methods of harnessing vestigial signals from people who
have been severely paralysed and have very little motor function, such as those
with high-level lesions above C4. These people may be unable to operate a
mechanical switch and thus require a more complex technique to be identified
that will allow a switching action to be actuated. A large part of the remainder
of this thesis focuses on methods of harnessing these vestigial signals to provide
switching actions and other control signals.
Chapter 3
Muscle Signals
3.1 Introduction
This chapter and Chapter 4 investigate methods of harnessing bio-signals from
the body for control and communication purposes. The exact criteria required
to enable a particular body signal to be described as a bio-signal are not always
well defined. In the broadest sense of the word, a bio-signal may refer to any
signal from the body related to biological function. Under this definition, all of
the signals presented in this thesis would fall under the category of bio-signals,
including the signals obtained through video capture techniques, described
in Chapter 5, and speech signals obtained through audio signal processing
techniques, described in Chapter 6. A narrower definition of the term bio-signal is meant here. A bio-signal as discussed in this thesis refers to any signal
that is measurable directly from the surface of the skin. This includes signals
such as biopotentials, which are measured voltages from certain sites on the
body, but also other electrical signals, such as the electrical skin conductance,
and mechanical signals, such as the mechanomyogram.
This chapter discusses two bio-signals which may be used to detect muscle contraction. These are the electrical signal, the electromyogram (EMG),
and the mechanical signal, the mechanomyogram (MMG). Muscle signal based
switching systems may be an option for people who retain some ability to contract certain muscles but may not be able to operate a mechanical switch.
This may be because the particular muscle that can be contracted is not suitable for operating a switch or because the muscle contraction is not strong
enough to operate the switch. This chapter investigates how deliberate muscle
contraction can be used to effect a switching action to operate control and
communication systems.
The anatomy and physiology of the nerves and the nervous system are first
described in Section 3.2.1. Action potentials and the method of information
transfer in the body are described in Section 3.2.2. The anatomy of muscle and
the process of muscle contraction are discussed in Section 3.3. Some different
muscles that may be suitable for use in an EMG-based or MMG-based system are identified in Section 3.3.3. The electromyogram as a control signal is
discussed in Section 3.4. Finally the possibility of using the mechanomyogram
as a control signal is explored in Section 3.5.
3.2 The Nervous System
3.2.1 Nerves and the Nervous System
The Nerve Cell
The basic building block of the human body’s nervous system is the nerve cell,
or neuron. The neurons in the body are interconnected to form a network
which is responsible for transmitting information around the body. The spinal
cord, the brain and the sensory organs (such as the eyes and ears) all consist
largely of neurons.
The structure of a neuron is shown in Figure 3.1.
Figure 3.1: The Nerve Cell, from pg. 2 in [29]
The central part of the neuron is the cell body, or soma, which contains the nucleus. The cell
body has a number of branches leading from its centre, which can either be
dendrites or axons. The dendrites receive information and the axons transmit
information, both in the form of impulses, which will be described in more
detail later. There is generally only one axon per cell. The axon links the
nerve cell with other cells, which can be nerve cells, muscle cells or glandular
cells. In a peripheral nerve, the axon and its supporting tissue make up the
nerve fibre. A bundle of nerve fibres is known as a nerve.
Classification of Nerve Fibres
The peripheral nervous system refers to the neurons that reside outside the
central nervous system (CNS) and consists of the somatic nervous system and
the autonomic nervous system [30]. A nerve fibre may be classified as either an
afferent nerve fibre or an efferent nerve fibre. An afferent nerve fibre transmits
information to the neurons of the CNS and the efferent nerve fibre transmits
information from the CNS.
Afferent nerve fibres may further be divided into somatic nerve fibres and
visceral nerve fibres. Visceral afferents are nerve fibres from the viscera, which
are the major internal organs of the body. All other afferent nerve fibres in the
body are called somatic afferents. These come from the skeletal muscle, the
joints and the sensory organs such as the eyes and ears, and bring information
to the CNS.
Efferent nerve fibres can be categorised as either motor nerve fibres or
autonomic nerve fibres. Motor efferents control skeletal muscle and autonomic
efferents control the glands, smooth muscle and cardiac muscle. See Figure 3.2
for a summary of nerve fibre classifications.
The visceral afferent nerve fibres and the autonomic efferent nerve fibres
both belong to the autonomic nervous system. The autonomic nervous system
is responsible for controlling such functions as digestion, respiration, perspiration and metabolism, which are not normally under voluntary control. The function of perspiration, controlled by the autonomic nervous system, will be described in more detail in Chapter 4.
Figure 3.2: Classification of Nerve Fibre Types. Afferent fibres carry information to the central nervous system: somatic afferents come from the sensory organs, skeletal muscle and joints, and visceral afferents come from the viscera. Efferent fibres carry information from the central nervous system: motor efferents supply skeletal muscle, and autonomic efferents supply cardiac muscle, smooth muscle and the glands.
Supporting Tissue
Neurons are supported by a special type of tissue constructed of glial cells.
These cells perform a similar role to connective tissue in other organs of the
body. In a peripheral nerve, every axon lies within a sheath of cells known as Schwann cells, which are a type of glial cell.
Figure 3.3: (A) Myelinated Nerve Fibre (B) Unmyelinated Nerve Fibres, from pg. 8 in [29].
The Schwann cell and the axon
together make up the nerve fibre. A nerve fibre may be either a myelinated
nerve fibre or an unmyelinated nerve fibre depending on how the Schwann
cells are positioned around the axon. Myelinated nerve fibres have a higher
conduction velocity than unmyelinated nerve fibres. About two-thirds of the
nerve fibres in the body are unmyelinated fibres, including most of the fibres in
the autonomic nervous system, since these processes generally do not require
a fast reaction time.
In myelinated nerve fibres, the Schwann cell winds around the axon several
times as shown in Figure 3.3. A lipid-protein mixture known as myelin is
laid down in layers between the Schwann cell body, forming a myelin sheath.
This sheath insulates the nerve membrane from the conductive body fluids
surrounding the exterior of the nerve fibre. The myelin sheath is discontinuous
along the length of the axon. At regular intervals there are unmyelinated
sections which are called the Nodes of Ranvier. These nodes are essential in
enabling fast conduction in myelinated fibres [29].
As mentioned in Chapter 2, diseases such as multiple sclerosis damage the
myelin sheath of neurons, or demyelinate the fibres along the cerebrospinal
axis [10]. Paralysis occurs due to impairment of the conduction of signals in
demyelinated nerves.
3.2.2 Resting and Action Potentials
The Membrane Potential
A potential difference usually exists between the inside and outside of any cell
membrane, including the neuron. The membrane potential of a cell usually
refers to the potential of the inside of the cell relative to the outside of the cell
i.e. the extracellular fluid surrounding the cell is taken to be at zero potential.
When no external triggers are acting on a cell, the cell is described as being in
its resting state. A human nerve or skeletal muscle cell has a resting potential
of between -55mV and -100mV [29]. This potential difference arises from a
difference in concentration of the ions K+ and Na+ inside and outside the cell.
The selectively permeable cell membrane allows K+ ions to pass through but
blocks Na+ ions. A mechanism known as the Na+-K+ ATPase pump moves only two K+ ions into the cell for every three Na+ ions pumped out of the cell, resulting in the outside of the cell being more positive than the inside. The origin of
the resting potential is explained in further detail in [29].
The Action Potential
As mentioned already, the function of the nerve cell is to transmit information
throughout the body. A neuron is an excitable cell which may be activated by
a stimulus. The neuron’s dendrites are its stimulus receptors. If the stimulus
is sufficient to cause the cell membrane to be depolarised beyond the gate
threshold potential, then an electrical discharge of the cell will be triggered.
This produces an electrical pulse called the action potential or nerve impulse.
The action potential is a sequence of depolarisation and repolarisation of the
cell membrane generated by a Na+ current into the cell followed by a K+
current out of the cell. The stages of an action potential are shown in Figure
3.4.
Figure 3.4: An Action Potential. This graph shows the change in membrane potential as a function of time when an action potential is elicited by a stimulus. The
time duration varies between fibre types.
• Stage 1 - Activation
When the dendrites receive an “activation stimulus” the Na+ channels
begin to open and the Na+ concentration inside the cell increases, making
the inside of the cell more positive. Once the membrane potential is
raised past a threshold (typically around -50mV), an action potential
occurs.
• Stage 2 - Depolarisation
As more Na+ channels open, more Na+ ions enter the cell and the inside
of the cell membrane rapidly loses its negative charge. This stage is also
known as the rising phase of the action potential. It typically lasts 0.2-0.5ms.
• Stage 3 - Overshoot
The inside of the cell eventually becomes positive relative to the outside
of the cell. The positive portion of the action potential is known as the
overshoot.
• Stage 4 - Repolarisation
The Na+ channels close and the K+ channels open. The cell membrane
begins to repolarise towards the resting potential.
• Stage 5 - Hyperpolarisation
The membrane potential may temporarily become even more negative
than the resting potential. This is to prevent the neuron from responding
to another stimulus during this time, or at least to raise the threshold
for any new stimulus.
• Stage 6
The membrane returns to its resting potential.
Propagation of the Action Potential
An action potential in a cell membrane is triggered by an initial stimulus to
the neuron. That action potential provides the stimulus for a neighbouring
segment of cell membrane and so on until the neuron’s axon is reached. The
action potential then propagates down the axon, or nerve fibre, by successive
stimulation of sections of the axon membrane. Because an action potential is
an all-or-nothing reaction, once the gate threshold is reached, the amplitude
of the action potential will be constant along the path of propagation.
The speed, or conduction velocity, at which the action potential travels
down the nerve fibre depends on a number of factors, including the initial
resting potential of the cell, the nerve fibre diameter and also whether or not
the nerve fibre is myelinated. Myelinated nerve fibres have a faster conduction
velocity as the action potential jumps between the nodes of Ranvier. This
method of conduction is known as saltatory conduction and is described in
more detail in [29].
Synaptic Transmission
The action potential propagates along the axon until it reaches the axonal
ending. From there, the action potential is transmitted to another cell, which
may be another nerve cell, a glandular cell or a muscle cell. The junction of
the axonal ending with another cell is called a synapse. The action potential is
usually transmitted to the next cell through a chemical process at the synapse.
If the axon ends on a skeletal muscle cell then this is a specialised kind of
synapse known as a neuromuscular end plate. In this case, the action potential
will trigger the muscle to contract. The physical processes that must occur to
enable muscle contraction will be examined in more detail later, but first the
structure of the muscle is described.
3.3 Muscles
3.3.1 Muscle Physiology
There are three types of muscle present in the human body - smooth, skeletal
and cardiac. Smooth muscle is the muscle found in all hollow organs of the
body except the heart, and is generally not under voluntary control. Cardiac
muscle, the only type of muscle which does not experience fatigue, is found in the walls of the heart and continuously pumps blood around the body. Skeletal muscle is the muscle attached to the skeleton, and is the type
of muscle that will be described here. The main function of skeletal muscle
is to generate forces which move the skeletal bones in the body. The basic
structure of a skeletal muscle is shown in Figure 3.5.
Muscle is a long bundle of flesh which is attached to the bones at both ends
by tendons. The muscle is protected by an outer layer of tough tissue called
the epimysium. Inside the epimysium are fascicles, or bundles of muscle fibre cells. The fascicles are surrounded by another layer of connective tissue called the perimysium.
Figure 3.5: Muscle Anatomy, showing the tendon and bone, the epimysium (the outer layer of the muscle), the perimysium (surrounding each fascicle, or bundle of muscle cells), the muscle fibres (cells) and the endomysium (surrounding each cell).
The individual muscle fibre is surrounded by a layer of tissue
called the endomysium. The structure of the individual muscle fibre will now be described in more detail.
The Muscle Fibre
Each individual muscle fibre is a cell which may be as long as the entire muscle
and 10 to 100µm in diameter. The nuclei are positioned around the edge of
the fibre. The inside of the muscle fibres consists of closely packed protein
structures called myofibrils which are the seat of muscle contraction. The
myofibrils run along the length of the muscle fibre. These myofibrils exhibit a
cross striation pattern which is shown in Figure 3.6.
The myofibrils may be seen in detail using a technique known as polarised
light microscopy.
Under a microscope, the myofibrils exhibit a repeating
pattern of dark and light bands. The dark bands are termed A-bands or
anisotropic bands and the light bands are termed I-bands or isotropic bands.
Anisotropic and isotropic refer to how the bands transmit the polarised light
which is shone on them as part of the microscopy process. The isotropic bands
transmit incident polarised light at the same velocity regardless of the direction and so appear light coloured, while the anisotropic bands transmit the
light at different velocities depending on the direction of the incident light and
therefore appear dark coloured. In the middle of the I-band there is a thin dark strip known as the Z-disc.
Figure 3.6: The muscle fibre and the myofibril cross striation pattern, showing the A-bands, the I-bands, the Z-discs and the sarcomere.
The basic contractile element of muscle is known as the sarcomere and
is the region between two Z-discs. The sarcomere is about 2µm in length.
The myofibril is made up of a repeating chain of sarcomeres. A sarcomere
consists of one A-band and one I-band. The structure of the sarcomere is
shown in Figure 3.7(a). The Z-discs link adjacent thin myofilaments, the I-bands, which are about 5nm in diameter. These bands primarily consist of actin, but also contain tropomyosin and troponin [31]. The A-band in the
centre of the sarcomere contains thicker myofilaments made of myosin which
interlink the thin myofilaments [29]. These myosin filaments are about 11nm
in diameter [30]. When the muscle contracts the thin filaments are pulled
between the thick filaments. The positions of the actin and myosin filaments are shown before contraction in Figure 3.7(a) and during contraction in Figure
3.7(b). The importance of these bands and their role in muscle contraction
will be described in the next section.
Figure 3.7: (a) The sarcomere before contraction occurs. The A-band, containing
thick myosin filaments, and the I-band, containing the Z-disc and the thin actin
filaments are shown. (b) On contraction of the muscle, the thin actin filaments slide
between the myosin filaments.
3.3.2 Muscle Contraction
The Motor Unit
Each efferent motor nerve fibre, or α motor neuron as they are also known,
stimulates a number of muscle fibres. The nerve fibre, and the muscle fibres it
innervates, make up the smallest functional unit of muscle contraction known
as the motor unit. Each individual muscle fibre in a motor unit will be stimulated simultaneously by the nerve fibre, so they will each always contract and
relax in synchronisation. The force produced by a muscle can be increased by
increasing either of two parameters:
(i) The number of active motor units. The motor units are roughly arranged in parallel along the length of the muscle, so by activating more motor units, more muscle force can be produced. The forces produced by individual motor units sum algebraically to give the total muscle force.
(ii) The rate at which the nerve fibres activate the muscle fibres, or fire. This
rate is known as the firing frequency. When a single motor unit receives
a single stimulation, the response is a single twitch. The duration of
a single twitch varies depending on whether the muscle fibres are slow-twitch (Type 1) muscle fibres or fast-twitch (Type 2) muscle fibres. A
motor unit will usually be made up entirely of either fast-twitch muscle
fibres or slow-twitch muscle fibres. The slow motor units have a slower
speed of contraction but will take longer to fatigue. When a muscle
contracts, the slow motor units are recruited first, this principle is known
as the size principle of motor unit recruitment [31]. The duration of a
single twitch in a slow-twitch muscle fibre is about 200ms. The action
potential causing the single twitch is only about 0.5ms in duration so the
twitch goes on for a long time once it has been initiated.
If the length of a single twitch is 200ms and the firing frequency is less
than 5Hz, then the force response will show a series of individual twitches.
As the firing frequency of the motor unit increases, the second stimulus will begin to stimulate the muscle before the effects of the first stimulus have subsided. In this case the forces begin to accumulate. As the firing frequency increases, the force response becomes larger in magnitude. For relatively low frequencies (less than 20Hz for slow motor units and less than 50Hz for fast motor units) there will be some force relaxation between stimulation pulses. If the muscle force oscillates in this way, this is known as unfused tetanic contraction. At higher firing frequencies the force remains constant; this is known as fused tetanic contraction.
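As a purely numerical illustration of this summation, the short sketch below superimposes identical twitches at a chosen firing frequency. The idealised twitch shape h(t) = (t/τ)e^(1−t/τ) and the linear superposition of successive twitches are simplifying assumptions made for illustration, not a physiological model taken from this chapter.

// Twitch summation sketch: the motor unit force is approximated as the sum
// of one idealised twitch per stimulus, delivered at firing frequency f.
// At low f the printed force ripples (unfused tetanus); at high f the
// ripple disappears (fused tetanus).
#include <cmath>
#include <cstdio>

double twitch(double t, double tau) {              // normalised twitch shape
    return (t < 0.0) ? 0.0 : (t / tau) * std::exp(1.0 - t / tau);
}

int main() {
    const double tau = 0.04;                       // time to peak ~40 ms
    const double f = 10.0;                         // firing frequency in Hz
    for (double t = 0.0; t <= 1.0; t += 0.01) {    // one second of stimulation
        double force = 0.0;
        for (int k = 0; k / f <= t; ++k)
            force += twitch(t - k / f, tau);       // superpose each twitch
        std::printf("%.2f s: %.3f\n", t, force);
    }
    return 0;
}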
Types of Contraction
When a muscle is stimulated by a nerve impulse, it tends to shorten, provided
it can overcome the external resistance imposed on it. Shortening and force
production of muscle is referred to as contraction [31]. A shortening contraction
is called a concentric contraction. In certain instances the muscle is fixed so
it cannot shorten and the increase in muscle contraction is then measurable
as an increase in the force acting on the muscle. This type of contraction is
known as an isometric contraction. Each muscle has a maximum isometric
force capability which is the maximum amount of force that can be applied to
a muscle which is fixed at a certain length without forcible stretching. If the
muscle is subjected to an external force greater than its maximum isometric
force capability then the muscle is forcibly stretched. This is known as eccentric
contraction. These contractions can be measured in vivo - i.e. while the muscle
is still ‘living’ in the human body. Other muscle contractions are measurable by
severing a muscle at its tendons and placing it in a bath for experiments. These
types of measurements are known as in vitro measurements (literally meaning
in glass). In vitro experiments can be used to measure isotonic or isokinetic
contractions. Isotonic contraction occurs when the muscle is subjected to a
constant load and isokinetic contraction refers to contractions performed at a
constant speed. An in vivo contraction is rarely fully isometric or isotonic.
Molecular Mechanism of Contraction
During an isotonic contraction, it is observed that the width of the A-bands
stays constant but the width of the I-bands becomes narrower. However, the actin filaments in the I-band are found to stay the same length during the contraction. The I-band is thus shortened by the actin filaments
sliding in between the myosin filaments. The cross-bridge theory, which was
first postulated by Huxley in 1957 [32], is widely used to describe how the
actin filaments slide between the myosin filaments. When a muscle begins to
contract a cross-bridge is formed between the myosin and actin filaments. The
head of the cross-bridge rotates, which pulls the actin filament between the
myosin filaments. The bridge is then broken and reformed with the next part
of the actin filament and the cycle continues.
As described earlier, a muscle cell is stimulated to contract when it receives an action potential. It is thought that the depolarisation of the cell
that occurs during an action potential might cause an increase in the calcium
ion concentration inside the cell. The exterior of the myofibrils consists of a
network of tiny sacs, or vesicles, known as the sarcoplasmic reticulum. The vesicles provide calcium to the Z-discs when the cell is depolarised. The cross-bridge is formed by a binding of the actin and myosin molecules and requires
calcium ions to split the ATP and release energy for contraction. When the
muscle is in a relaxed state, the sarcomere contains a very low concentration
of calcium ions, so there is no interaction between the actin and myosin and
no ATP splitting. On activation the calcium ion concentration rises and so
cross-bridges are formed between the two sets of filaments, ATP is split and
sliding occurs [30].
3.3.3 Muscle Action in People with Physical Disabilities
Often, even people who have become severely paralysed will retain some level
of ability to contract certain muscles. For example, quadriplegic patients who
have been injured around the C5/C6 level usually retain the ability to move
their head to some extent. In some cases, this movement is sufficient to allow the person to communicate intent by operating head-switches, which are
usually affixed to their wheelchair. Unfortunately, although a person may still
be able to activate a muscle voluntarily, often the contractions may be too
weak to operate a conventional mechanical switch. This weakness is caused
largely by a loss of functional input from higher brain centres to the spinal
motor nerves, which leads to partial muscle paralysis and submaximal muscle
activation [33]. In these situations, the contraction must be detected by other
means.
The sternocleidomastoid muscle is one of the muscles which may often
still be under voluntary control in people with high-lesion quadriplegia. This
muscle is one of the muscles which flex the neck. The neck muscles are shown in Figure 3.8.
Figure 3.8: The Neck Muscles, showing the sternocleidomastoid, from pg. 97 in [11]
The sternocleidomastoid muscle receives motor supply from the
spinal part of the accessory nerve (the eleventh cranial nerve). It receives sensory fibres from the anterior rami of C2 and C3 [11] and thus may still be controlled by
people who still have these nerve fibres intact, which usually includes people
with spinal cord injuries lower than this level. Unilateral contraction of the
sternocleidomastoid laterally flexes the head on the neck, rotating it to the
opposite side, and laterally flexes the cervical spine. Bilateral contraction
draws the head forwards and assists in neck flexion.
Differentiation between muscle contraction and muscle relaxation can be
used to control a single switch system e.g. a communication program. There
are two methods considered here for measuring muscle contraction: it may be detected non-invasively by measuring either the electrical or
mechanical signal at the surface of the skin. The electrical signal is known as
the electromyogram and the mechanical signal is known as the mechanomyogram. These will now be described in more detail.
3.4 Electromyogram
The electromyogram or EMG is an electrical signal that can be used to observe
muscle contraction. It is measured either by using surface electrodes on the
skin (surface EMG) or by invasive needle electrodes which are inserted directly
into the muscle fibre (the invasive, needle or indwelling EMG). As mentioned
already, a muscle fibre contracts when it receives an action potential. The
electromyogram observed is the sum of all the action potentials that occur
around the electrode site. In almost all cases, muscle contraction causes an
increase in the overall amplitude of the EMG. Thus it is possible to determine
when a muscle is contracting by monitoring the EMG amplitude.
The EMG is a stochastic signal with most of its usable energy in the 0-500Hz frequency range, with its dominant energy in the 50-150Hz range.
The amplitude of the signal varies from 0-10mV (peak-to-peak) or 0-1.5mV
(rms) [34]. An example of an EMG and its frequency spectrum is shown in
Figure 3.9.
3.4.1 EMG Measurement
The EMG may be measured invasively or non-invasively. Clinical electromyography almost always uses invasive needle electrodes as it is concerned with
the study of individual muscle fibres [35]. It produces a higher frequency
spectrum than surface electromyography and allows localised measurement of
muscle fibre activity [36]. For simple detection of muscle contraction, it is
usually sufficient to measure the electromyogram non-invasively, using surface
electrodes.
The standard measurement technique for surface electromyography uses
three electrodes. A ground electrode is used to reduce extraneous noise and
interference, and is placed on a neutral part of the body such as the bony part
of the wrist. The two other electrodes are placed over the muscle.
Figure 3.9: EMG and frequency spectrum, from [34], measured from the tibialis anterior muscle during a constant force isometric contraction at 50% of voluntary maximum.
These two electrodes are often termed the pick-up or recording electrode (the negative
electrode) and the reference electrode (the positive electrode) [35]. The signal
from these two electrodes is differentially amplified to cancel the noise, as
shown in Figure 3.10.
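In symbols (an idealised statement assuming perfect common-mode rejection, with $m$ and $n$ as in Figure 3.10 denoting the EMG and noise signals), the two electrodes pick up $m_1 + n$ and $m_2 + n$, and a differential amplifier with gain $A$ yields
$$V_{out} = A\left[(m_1 + n) - (m_2 + n)\right] = A\,(m_1 - m_2),$$
so that noise common to both electrodes cancels while the difference between the two EMG signals is amplified.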
The surface electrodes used are usually silver (Ag) or silver-chloride (AgCl). Saline gel or paste is placed between the electrode and the skin to improve
the electrical contact [37]. Over the past 50 years it has been taught that the
electrode location should be on the motor point of a muscle, at the innervation
zone. According to De Luca [34], this is probably the worst location for detecting an EMG. The motor point is the point where the introduction of electrical
currents causes muscle twitches. Electrodes placed at this point tend to have
a wider frequency spectrum [36] due to the addition and subtraction of action
potentials with minor phase differences. The position now widely regarded as optimal is on the belly of the muscle,
midway between the motor point and the tendinous insertion, approximately
1cm apart [36]. The electrode position on the muscle is shown in Figure 3.11.
Figure 3.10: EMG differential amplifier configuration, from [34]. The EMG is
represented by m and the noise signal by n.
Figure 3.11: Preferred location for the electrodes, from [34]. The electrode shown
is a parallel bar electrode but two circular electrodes could also be used, placed
approximately 1cm apart.
3.4.2 EMG as a Control Signal
The electromyogram has been used as a control signal for assistive technologies
for a number of years. It is usually used for prosthesis control and is often
described by the term myoelectric control. Scott and Parker [38] estimate that
the use of the myoelectric signal to control powered prostheses had become
an important clinical alternative by 1980. Myoelectric controlled prostheses
are usually controlled by measuring the EMG on the muscle remnants in the
residual limb.
The first types of myoelectric controlled prostheses were hand prostheses
for below-elbow amputees [38]. These systems were generally based on two-site
two-state control. Electrodes are placed over two muscles such as the forearm
extensor muscles and the forearm flexor muscles. Contraction of one muscle
opens the hand and contraction of the other muscle closes the hand. This has
the advantage that it seems natural to use these two muscles to control hand
movement, but it may be difficult to learn to produce isolated contractions of
the two muscles.
For more complex prostheses, such as those that include the elbow joint,
simple on-off control is not sufficient. Hudgins, Parker and Scott [39] have
explored a multifunction myoelectric control strategy which extracts features
of the EMG in an attempt to identify four distinct types of muscle contraction
from measurements of the EMG. Different modes of muscle contraction will
result in different signal patterns due to different motor unit activation patterns
[40]. In the experiments that they have reported [39], one electrode is placed
over the biceps brachii and one over the triceps brachii to enable maximum
pickup of the EMG from the muscles in the upper arm. The four movements
tested were forearm supination, elbow extension, wrist flexion and forearm
pronation. If these four movements can be correctly recognised, then each
one can be used to generate a distinct signal to control a prosthetic limb.
The classifier for the system described is based on an artificial neural network classifier [41], which was found to correctly classify 70-98% of test patterns after initial training of the neural network.
The concept of myoelectric control for prostheses has been extended by exploring its suitability as a control signal for severely disabled people in other applications. Chang et al [40] have explored electromyogram pattern recognition techniques as a control command for man-machine interfaces. The onset
of muscle contraction is detected by counting the number of zero crossings
in the signal. The feature extraction stage uses a fourth order autoregressive
(AR) model and the classifier used is a modified maximum likelihood distance
classifier [42]. The EMG was recorded from the sternocleidomastoid and upper
trapezius muscles (see Figure 3.8) and discrimination of ten muscle motions of
the neck and shoulders was investigated. From these 10 motions, it was found
that five specific motions were almost perfectly recognisable. These are head
flexion, head right rotation, head left rotation, right shoulder elevation and
left shoulder elevation. The mean correct recognition rate reported was 95%, indicating that a system such as this could provide five-way control for operating control and communication applications for severely disabled people who retain control of their neck muscles.
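As an aside on the onset-detection step, the fragment below is a hedged sketch of zero-crossing counting over a sliding window: a contraction onset is flagged when the count exceeds a preset value. The window length, the count threshold and the exact counting rule used by Chang et al are not specified here and are assumptions for illustration.

// Count sign changes in a window of EMG samples; an onset is declared
// when the number of zero crossings reaches a preset minimum.
#include <cstddef>
#include <vector>

bool onset_detected(const std::vector<double>& window, std::size_t minCrossings) {
    std::size_t crossings = 0;
    for (std::size_t i = 1; i < window.size(); ++i)
        if ((window[i - 1] < 0.0) != (window[i] < 0.0))
            ++crossings;                       // sign change = one crossing
    return crossings >= minCrossings;
}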
The EMG was briefly explored as part of the work presented here, as a
control signal for a male patient in the NRH. This patient had sufficient hand
control to use a hand switch, which he initially used to operate the Natterbox.
He also had some degree of control of his neck muscles, most noticeably left and
right neck rotation. This led to an investigation of possible methods of eliciting
two switching actions corresponding to these two movements. The EMG was
one of the methods considered. Two pairs of electrodes were placed between
the sternocleidomastoid and upper trapezius muscles of the neck, one pair on
either side. Each pair of signals was then differentially amplified. Rotation of
the neck to the right was found to cause an increase in amplitude of the signal
between the left pair of electrodes and vice versa. This increase in amplitude
could be harnessed by thresholding the signal and thus used to actuate two
switching actions. This method was tested with him for operation of the Three-Switch Mouse described in Section 2.4.4, which allows a mouse cursor to be
controlled with three switches. The hand switch was used to switch between
left/right and up/down movement and the two head movements were used to
move the mouse either up and down or left and right depending on the mode of
operation. This enabled the patient to control a number of software programs,
including Windows Media Player, which he could use to select albums, play
songs and adjust the volume independently. The Three-Switch Mouse program
was combined with the Natterbox to allow the patient to easily switch between
the two. EMG-based detection of head movements was eventually replaced
with two mechanical switches placed at either side of the head. With careful
placement of the switches, the patient was able to actuate these switches with
slight head rotations to the left and the right. Nonetheless, this case highlights
the potential of the EMG for communication and control purposes.
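A hedged sketch of the thresholding logic just described is given below; the rectification-and-smoothing stage and the parameter values are assumptions, as the original processing chain is not reproduced here. Note that, because each sternocleidomastoid rotates the head to the opposite side, a rise on the left channel corresponds to rotation to the right.

// Two-channel EMG thresholding: each differentially amplified channel is
// rectified and smoothed, and a smoothed amplitude above threshold on the
// left or right channel actuates the corresponding switching action.
#include <cmath>

struct Smoother {
    double state = 0.0;
    double alpha;                              // smoothing factor, 0 < alpha < 1
    explicit Smoother(double a) : alpha(a) {}
    double update(double x) {                  // exponential average of |x|
        state += alpha * (std::fabs(x) - state);
        return state;
    }
};

enum class HeadAction { None, RotateLeft, RotateRight };

HeadAction classify(double leftSample, double rightSample,
                    Smoother& left, Smoother& right, double threshold) {
    double l = left.update(leftSample);
    double r = right.update(rightSample);
    // Right rotation raises the amplitude on the LEFT electrode pair,
    // and vice versa, as observed in the trials described above.
    if (l > threshold && l >= r) return HeadAction::RotateRight;
    if (r > threshold && r > l) return HeadAction::RotateLeft;
    return HeadAction::None;
}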
While the EMG offers much potential for use as a communication and
control signal, the work presented here investigates the possibility of using the
mechanical signal, the mechanomyogram or MMG, as an alternative method
of providing communication and control using muscle contraction. There may
be a number of advantages in using the MMG over the EMG.
• The MMG can be measured using a single small accelerometer attached
to the skin, as opposed to the three electrodes required for single-channel
EMG recordings. This may be more convenient and comfortable for the
user.
• Since the MMG is a mechanical signal, no skin preparation is required,
as opposed to EMG recordings, which typically require that the skin first be prepared with an alcohol swab to improve skin conductance.
• The MMG typically has a higher signal-to-noise ratio than the EMG,
which means that an MMG system has the potential to detect and make
use of weaker contractions than an EMG system does.
• The MMG bandwidth (typically 3-100Hz) is lower than that of the electromyogram,
so a lower sampling rate can be used.
• Mechanical vibrations can propagate through the fluid and tissue surrounding the contracting muscles, which means MMG sensors are capable of detecting contraction signals from virtually every muscle in the body, while the EMG is limited to the contraction of superficial muscles [43].
• Less precise sensor placement is necessary than with the EMG.
• EMG may vary due to changes in skin conductance caused by sweating.
This is not a factor when measuring the MMG.
3.5 Mechanomyogram
As well as the electrical signal, muscle contraction produces mechanical vibrations that may be detected at the surface of the skin. This mechanical signal
has been observed and detected since the beginning of the 1800s [44] but it is
only recently that it has become better understood. The mechanical signal
due to muscle contraction is described under a wide variety of names. As it is
a pressure signal with much of its frequency spectrum close to that of sound it
can be measured using a microphone. When measured in this way it is often
referred to as muscle sound, the phonomyogram, the acoustic myogram or the
soundmyogram. As some of the signal is below the audible range of the human
ear, Orizio [45] suggests these terms should be avoided. It is also sometimes
referred to as the vibromyogram or accelerometermyogram when measured using an accelerometer, as in the system described here, but these terms reflect
the nature of the measurement technique rather than the nature of the signal.
Orizio [45] proposes the term mechanomyogram to more accurately reflect the
nature of this signal.
The mechanomyogram (MMG) signal is observable at the surface of the
muscle due to the movement of the muscle fibres underneath. Orizio [45]
states that it is due to three things:
• Dimensional changes of the active fibres upon contraction.
• A gross lateral movement at the initiation of muscle contraction generated by the non-simultaneous activation of muscle fibres.
• Smaller subsequent lateral oscillations generated at the resonant frequency of the muscle.
The resonant frequency of a particular muscle is a function of several parameters including muscle mass, length, topology and stiffness. The exact
relationship between the peak MMG frequency components and the muscle
resonant frequencies has been investigated by Barry, and is discussed in [46].
The amplified MMG measured from the biceps brachii is shown in Figure
3.12. The measurement technique will be described in more detail shortly. The
MMG is a random, noise-like signal with an approximately Gaussian amplitude
distribution. The useful bandwidth of the signal is between approximately
3Hz and 100Hz, with a peak usually around 20-25Hz [47]. In the figure shown,
the subject contracts their muscle at approximately t=5s, which increases the
absolute amplitude of the signal. This feature can be used to detect muscle
contraction, and therefore provide a means of communication and control for
disabled people.
3.5.1 MMG as a Control Signal
Recently, the MMG has been explored by Silva, Heim and Chau as a control
signal for prosthetic limbs [43]. The system developed uses three microphone-accelerometer pairs as sensors, spaced evenly 120◦ apart inside a socket
1.5cm up from the stump, as shown in Figure 3.13.
Figure 3.12: The amplified MMG, showing the increase in amplitude when the muscle contracts at t ≃ 5s.
Figure 3.13: Side (left) and front (right) views of the soft silicone socket built for MMG recording, from [43]. Note the embedded multisensor array containing three coupled MMG sensors at equidistant angles around the end of the stump.
The mechanical muscle signal generated upon muscle contraction can propagate through fluid and tissue to the sensors, but is attenuated in proportion to the distance travelled. Thus signals from muscles far away from the sensors will be diminished compared to signals from muscles that are nearer the sensors. Hence
different signal amplitudes will be observed at different sensors depending on
which muscles are used in the movement and thus different muscle activities
can be classified. Results reported in [43] seem to indicate that muscle activity
can be tracked with MMG sensors in a way similar to that of EMG sensors.
3.5.2 MMG Application for Communication and Control
As part of the work presented here, an MMG based system was developed that
could be used to control any application operable by one switching action, such
as the Natterbox program described in Chapter 2. The signal acquisition technique is first explained, followed by the necessary signal processing steps and the resulting system.
Signal Acquisition
The MMG was detected using a dual-axis ADXL203E accelerometer (dimensions 5mm × 5mm × 2mm) from Analog Devices1, shown in Figure 3.14. Its
small size makes it an attractive method of monitoring muscle activity. This
sensor was affixed to the belly of the muscle using adhesive tape, oriented so
that one axis of the accelerometer was measuring the signal along the muscle
fibres and the other axis was measuring the signal perpendicular to the muscle fibres, tangential to the skin surface. When the accelerometer is powered by
a 5V supply, the two output signals have a DC offset of 2.5V and this needs
to be removed to allow greater amplification before the signal is converted to
a digital signal and read into the computer. The signal will also have some
DC component due to orientation in the earth’s gravitational field (1V/g).
The 2.5V component is subtracted using the circuit in Appendix A. This is
preferable to high pass filtering the signal as it may be necessary to know
accelerometer orientation in some applications. The bandwidth of the signal
from the accelerometer is limited by its output capacitors to 200Hz. Note that
the actual signal measured is acceleration. In order to get the displacement,
the signal should be double integrated. In this instance, since the objective is
control rather than signal analysis, that step is unnecessary.
Signal Processing
The acquired signals from the two channels of the accelerometer are read into
the computer using the NIDAQ PCI-6023E data acquisition card (sampling
rate 500Hz). The Real Time Workshop, which is part of Simulink for MATLAB,
was used to perform initial tests on this signal to determine the steps necessary
for detection of contraction from the MMG. The Simulink block diagram is
given in Appendix B. Once the required steps had been identified, the code
was converted to a stand-alone C++ program using the National Instruments
1 Analog Devices website: http://www.analog.com
Figure 3.14: The accelerometer used to measure MMG, compared with a one euro coin.
libraries. This program outputs a switching action on detection of muscle
contraction. Simulation of the F2 key press was chosen as the “switching”
action since this is the expected input for operation of the Natterbox.
The MMG signal shown in Figure 3.15(a) is an example of the raw signal
that the computer receives, for one axis. For the 100 second period shown,
the subject contracts their muscle three times which is clearly observable in
the recorded signal. The DC component of the signal is due to gravity, and
the magnitude of the DC component of each signal is dependent on the orientation of the accelerometer with respect to the earth’s gravitational field. There
is a pronounced change in the overall shape of the muscle between the relaxed and contracted states. This causes a relatively sudden change in the
displacement of the accelerometer at the onset of contraction, causing distinct
exaggerated peaks in each of the two accelerometer output signals at these
times, as shown for one axis in 3.15(a). Furthermore, small changes in the
orientation of accelerometer between the relaxed and contracted states cause
changes in the gravitational offset of its output signals. The two signals were
high passed filtered (cutoff 2Hz) to remove this DC component, as shown in
Figure 3.15(b).
The resulting signals were then full-wave rectified as shown in Figure 3.15(c)
and smoothed using a moving-average filter (N=100) [42]. The averaged signal
is shown in Figure 3.15(d). Finally, the processed signal from both channels is
compared to a threshold value and a decision is made as to whether or not the
muscle is contracted. The appropriate threshold value is dependent on which
muscle is being observed and on the user’s maximum voluntary contraction
of that muscle. Therefore, provision is made for this value to be determined
by the therapist. A value of 3.5V was chosen as an appropriate threshold for
the signal shown in Figure 3.15(d). If the averaged signal is higher than this
threshold then the output of this block is “1”, as shown in Figure 3.15(e). If the
output from either channel is “1” a software switching action is performed. In
the final software implementation, a 0.5s debounce time was added to ensure
that multiple peaks will not be translated into multiple switch actuations.
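The processing chain just described can be summarised in a short MATLAB sketch. This is an illustrative reconstruction rather than the original implementation: the text specifies only a 2Hz cutoff, so the high-pass filter type and order are assumptions, and raw denotes one channel of accelerometer samples.

    fs = 500;                                  % sampling rate (Hz)
    [b, a] = butter(2, 2/(fs/2), 'high');      % 2Hz high-pass (type/order assumed)
    filt = filter(b, a, raw);                  % remove the gravitational DC offset
    rect = abs(filt);                          % full-wave rectification
    avg  = filter(ones(1,100)/100, 1, rect);   % moving-average filter, N=100
    thresh = 3.5;                              % threshold, set per user/muscle
    state  = avg > thresh;                     % 1 while contraction detected

    % Debounce: translate onsets into at most one switch action per 0.5s
    lastTrig = -inf;
    for n = 2:length(state)
        if state(n) && ~state(n-1) && (n - lastTrig) > 0.5*fs
            lastTrig = n;
            % simulate the F2 key press here (platform-specific)
        end
    end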
Software Implementation
Simulink provides an option for converting a block diagram to C++ code and
these files were used as the basis of a stand-alone MMG muscle contraction
detection system. DirectX was used to provide a graphical user interface which
allowed the user or therapist to observe the signals at each of the stages mentioned. The graphical user interface also allowed the threshold to be changed.
The code for this program is on the included disk (see Appendix I).
Testing
Preliminary testing of this system was performed on four able-bodied persons
to test their ability to use muscle contraction to communicate using the Natterbox. The system was tested on both the biceps brachii muscle and the
sternocleidomastoid muscle of each subject. The subjects were asked to spell out the message "The quick brown fox jumps over the lazy dog" (9 words)
Figure 3.15: The MMG at various stages of processing, generated using the Simulink model in Appendix B. Panels: (a) raw, (b) filtered, (c) rectified, (d) averaged and (e) output MMG; each panel plots Amplitude (V) against Time (s).
Table 3.1: MMG Experimental results. *B=biceps brachii, S=sternocleidomastoid

User   Muscle*   Time taken (min)   Speed (words/min)   No. of errors
1      B         6:59               1.28                2
1      S         5:40               1.58                1
2      B         5:58               1.51                0
2      S         6:20               1.42                3
3      B         5:12               1.73                1
3      S         4:56               1.82                3
4      B         5:38               1.59                0
4      S         5:42               1.58                0
by contracting their muscle when the desired row or letter was highlighted, i.e. contracting the biceps brachii by clenching their fist, or moving their head to contract the sternocleidomastoid muscle.
The results of the experiments on the biceps brachii muscle and the sternocleidomastoid muscle for each of the four users are shown in Table 3.1. The
speed of the users was limited by the 0.5 second debounce time of the system
and also by the scanning speed of the alphabet board. The alphabet board scanned at 0.5s per row/column, which appeared to be a comfortable speed for the users. Therefore, the time taken to select a single letter was between 1s and 6s
depending on the position of the letter on the board. The average speed over
the four users and the two muscles was 1.56 words/min with an average of 1.25
errors.
Compared to natural speech, which has rates of up to a few hundred words
per minute, a speed of 1.56 words/min is very slow, and may be frustrating
for the user. However, for users who are completely unable to communicate by conventional means, any finite speed of communication is better than nothing.
Further Considerations
Unintentional movements of the person caused the system to incorrectly detect
contraction of the muscle on several occasions. Waiting a few hundred milliseconds after the initial peak due to muscle movement, and then determining
if the muscle is still contracted, could prevent this, although this might affect
user perception since it would introduce a delay between muscle contraction
and the system response.
Although the subjects were asked to trigger the sternocleidomastoid using
full head movements, facial movements also caused unintentional triggering.
These movements could be intentionally detected by an appropriately placed
accelerometer, and used as the basis of communication. The work described
here was presented at the 2004 IEEE EMBS conference in California.
3.6 Conclusions
The MMG appears to offer a promising alternative to EMG for control using
muscle contraction. Since it is capable of detecting very small contractions,
it may be useful in cases of very severe disability where the individual has
very limited muscle contracting abilities. Preliminary results indicate that the
MMG can provide a useful method of aiding communication by people who
cannot communicate using traditional means.
Further studies should be carried out on this system to assess its performance with disabled people and its ability to detect muscle contractions in
these subjects. Pattern recognition techniques that allow differentiation between different muscle actions should also be investigated in more detail for
the MMG, ultimately to provide a means of operating multiple-switch operated
systems.
Chapter 4
Other Biosignals - Eye Movements and Skin Conductance
4.1 Introduction
The previous chapter described two biosignals which relate to muscle contraction - the electromyogram and the mechanomyogram. Two further physiological signals from the body are dealt with in this chapter, both of which may be
harnessed to provide control and communication for people who are severely
disabled. These are the electrooculogram (EOG) and the electrical conductance
of the skin.
Section 4.2 describes the electrooculogram (EOG). This signal is a biopotential measurable around the eye - either between the top and bottom of the
eye (the vertical EOG), or between the two sides of the eye (the horizontal
EOG). The EOG amplitude varies as the eyeball rotates within the head, and
thus can be used to determine horizontal and vertical eye movements. These
movements can then be harnessed as a control signal for applications, as will
be discussed. In Section 4.3, a method for control based on the electrical
conductance of the skin is described. The conductance of the skin may be controlled by consciously relaxing or tensing the body, thus activating or relaxing
the sweat glands. This method is constrained by the length of time taken to
elicit a response, but it may be applicable in cases of very severe disability. A
method for measurement of the firing rate of the sympathetic nervous system
based on measurement of the skin conductance is also presented.
4.2 The Electrooculogram
4.2.1 Introduction
In this section, eye tracking is discussed as a means of control and communication for disabled people. Firstly, the anatomy of the eye is briefly described
in Section 4.2.2, and a review of eye tracking technologies is presented. The
eye tracking method chosen to study in more detail and implement as part
of the work presented here is based on measurement of a signal known as the
electrooculogram (EOG). A description of the physiological origin of the EOG
is given, with reference to the anatomy of the eye. The method employed to
measure the EOG is described, and some of the advantages and limitations of
using this signal to track eye movements are discussed. A novel method called
Target Position Variation (TPV) is presented in Section 4.2.5, which was developed as part of the work described here as a way of overcoming some of
the EOG’s limitations. This method allows for more accurate inference of absolute eye position from the EOG, enabling more robust control in EOG-based
applications. Based on these studies of eye movement for communication and
control purposes, a feedback model for eye movements was developed which
is given here. This model describes rotation of the eyeball in either the horizontal or vertical plane. It accurately predicts the measured response of the
eye when it makes a sudden saccadic movement, that is, when the eye’s focus suddenly moves from one target to another. The response of the eye to
smooth pursuit movements, where the eyes are following a moving target, is
Figure 4.1: Sections visible in the outer part of the eye (from [48])
also briefly explored using this model. Development of these models led to a
deeper understanding of the underlying processes involved during saccadic and
smooth pursuit eye movements. This knowledge was of immense benefit when
considering how eye movements could be used for communication and control
purposes.
4.2.2 Anatomy of the Eye
The main features visible at the front of the eye are shown in Figure 4.1. The
lens, directly behind the pupil, focuses light coming in through the opening in
the centre of the eye, the pupil, onto the light sensitive tissue at the back of
the eye, the retina. The iris is the coloured part of the eye and it controls the
amount of light that can enter the eye by changing the size of the pupil, contracting the pupil in bright light and expanding the pupil in darker conditions.
The pupil has very different reflectance properties than the surrounding iris
and usually appears black in normal lighting conditions. Light rays entering
through the pupil first pass through the cornea, the clear tissue covering the
front of the eye. The cornea and vitreous fluid in the eye bend and refract
this light. The conjunctiva is a membrane that lines the eyelids and covers the
sclera, the white part of the eye. The boundary between the iris and the sclera
is known as the limbus, and is often used in eye tracking.
A horizontal section through the right eye is shown in Figure 4.2, showing
Figure 4.2: Horizontal cross section of the eye, the symbol “K” is reflected onto the
retina (from [48])
how an image of the letter “K” is projected onto the retina at the back of the
eye. The crystalline lens located just behind the iris focuses the light rays onto
the retina. Note that the image on the retina is inverted. The brain is able
to process this image and invert it so we see the image in its original upright
form.
The light rays falling on the retina cause chemical changes in the photosensitive cells of the retina. These cells convert the light rays to electrical impulses
which are transmitted to the brain via the optic nerve. There are two types of
photosensitive cells in the retina, cones and rods [49]. The rods are extremely
sensitive to light allowing the eye to respond to light in dimly lit environments.
They do not distinguish between colours, however, and have low visual acuity,
or attention to detail. The cones are much less responsive to light but have a
much higher visual acuity. Different cones respond to different wavelengths of
light, enabling colour vision. The fovea is an area of the retina of particular
importance. It is a dip in the retina directly opposite the lens and is densely
packed with cone cells, allowing humans to see fine detail, such as small print.
The human eye is capable of moving in a number of different ways to observe, read or examine the world in front of it. Most types of eye
movements are conjugate, where both eyes move together in the same direction. A saccadic eye movement is a type of conjugate eye movement where
the eye suddenly changes fixation from one place to another voluntarily, such
as in reading, where the eye jumps back to the start of the next line when
it reaches the end of the previous line. Saccadic eye movements are characterised by a very high initial acceleration and deceleration. The purpose of
saccadic eye movement is to fix the new target image on the fovea. During
saccadic motion, the eye moves with an angular velocity in the range 30-900◦/s
[50]. When the eye is tracking a continuously moving object, the movement
is described as a smooth pursuit movement. Smooth pursuit movements have
an angular motion in the range 1-30◦/s [50]. Smooth pursuit movements cannot be produced voluntarily as they always require a moving stimulus [51].
Saccadic eye movements and smooth eye movements will be examined in more
detail in Section 4.2.9, where a model for both of these movements is proposed.
Compensatory eye movements are another type of smooth movements similar
to pursuit movements, which act to keep the eyes fixed on a target when the
head or trunk moves.
4.2.3 Eye Tracking Methodologies
The areas concerned with measurement of both relative eye movement and
absolute eye position are generally both included under the term eye tracking,
although it must be pointed out that some of the “eye tracking” methodologies
only measure the direction of eye movement from an initial position, rather
than continuously tracking the exact location of the eye. However, as it is the
most generally used term, eye tracking will be used here to describe measurement of both changes in eye position and absolute eye position.
Eye tracking has become an important research field due to its applicability
to a range of different disciplines. It is used in market research to provide a way
of assessing the effectiveness of different advertisement layouts on the viewer.
It is often used by psychologists during tests, to determine the patient’s focus
or interest level, for example. It is used clinically to determine illnesses such
as schizophrenia from unusual eye movements, and in developmental tests on
babies. For the work presented here, eye movement is important as it may
represent an individual’s only voluntary movement during some of the later
stages of motor neurone diseases such as Amyotrophic Lateral Sclerosis (ALS).
Unlike other motor neurons in the body which are affected by ALS, the motor
neurons in the eye are relatively spared even at the very terminal stage of this
disease, due to high levels of calcium-binding proteins in these cells [18]. If eye
movement can be correctly tracked in people with diseases such as this, then it
can provide a useful tool for enabling these people to communicate with others
and to control their environment.
There are many different systems available commercially that can be used
to track eye movement. Some of the different methods commonly employed
for eye tracking are discussed here by division into three categories - visual eye
tracking techniques, the magnetic search coil technique and the electrooculogram.
Visual Eye Tracking Techniques
Visual eye tracking techniques, which are sometimes referred to as video-oculographic methods, are based on observation of the position of the eye,
usually by using some sort of camera. Clearly, one can roughly determine the
direction of a person’s gaze by monitoring the location of the centre of their
eyes, since the eyeball rotates to place the object that the person is looking at
directly in line with the centre of the pupil, at the centre of vision. In automated visual eye tracking, computer vision techniques are employed to track
one or more pertinent features of the eye. In most of these methods, light is
shone directly at the eye. Infra-red light is often used, as it is invisible and
does not make the subject close their eyes. When light is shone directly at
the pupil so that it enters the eye along the optic axis, it is reflected back due
to the reflective properties of the retina and the pupil will appear as a bright
reflection. This accounts for the red-eye effect in photography and the phenomenon is known as the bright-eye effect. The cornea also reflects light, as can
be readily observed in a room with a window where the image of the window
can be seen to appear on the eye’s surface. A camera is used to detect these
reflections which can then be used to calculate eye position relative to some
reference point. Two commonly used methods using reflections are the Pupil
Centre/Corneal Reflection Technique and the Limbus Boundary Technique.
The pupil centre/corneal reflection technique, or simply corneal reflection
technique as it is often called, is credited to Kenneth Mason who formalised this
technique in the late 1960s [52]. His technique presents an automated procedure for eye tracking, based on observation of the eye with a camera, detection
of two reflections and calculation of the eye gaze position. The two reflections
used are the large reflection from the pupil due to the bright-eye effect and the
smaller reflection from the corneal bulge of the eye. A photograph of the eye
showing the bright pupil and the smaller corneal reflection is shown in Figure
4.3. The smaller reflection is often called the glint or first Purkinje image.
The position of the eye is determined based on the relative movement of the
pupil reflection with respect to the corneal reflection. The radius of curvature
of the cornea is less than that of the eye, so when the eye moves the corneal
reflection moves in the direction of eye movement but only about half as far
as the pupil moves. This can be used to calculate eye gaze position. Several
commercial systems are available that use this technique for eye gaze tracking,
for example, the 50Hz Video Eyetracker Toolbox, manufactured by Cambridge
Research Systems1 , and the Eyegaze Computer System, manufactured by LC
Technologies Incorporated2 . Another method often used tracks the limbus,
1 Cambridge Research Systems Ltd., 80 Riverside Estate, Sir Thomas Longley Rd., Rochester, Kent ME2 4BH, England. Website: http://www.crsltd.com
2 LC Technologies Inc., 3955 Pender Drive, Suite 120, Fairfax, Virginia 22030, USA. Website: http://www.eyegaze.com
Figure 4.3: Photograph of the reflections when light is shone upon the eye, from LC
Technologies Inc.2 , showing the large reflection of the pupil and the smaller corneal
reflection.
which is the iris-scleral edge, as the eye moves around. This boundary can be
readily detected due to large differences in colour intensity between the sclera
and the iris. Once the boundary is detected, the iris can be modelled as two
circles or ellipses. The position of the two eyes can be calculated based on the
centre of the two shapes which will change as the eyeball rotates away from
the central field of vision of the camera. This method has difficulties, particularly for tracking in the vertical direction, due to the eyelid occluding part
of the limbus when the eye is looking up or looking down. The commercially
available system IRIS, manufactured by Cambridge Research Systems1 , and
the Model 310 Limbus Tracker, from the Applied Science Laboratories3 , are
both based on limbus tracking.
Many of the systems described above require that the head be kept stationary, and some use a large, constrictive head-rest to keep the user’s head in
place. The head-rest used with the 50Hz Video Eyetracker is shown in Figure
4.4. There are many systems available that attempt to overcome this constraint by using computer vision techniques which also track the movement
of the head and incorporate this factor into calculation of the eye position.
One such system is FaceLab developed by Seeing Machines and Cambridge
Research Systems1 , although this system has quite a low recovery time (0.2s)
from a tracking failure that may occur when the head is moved suddenly.
3 Applied Science Laboratories, 175 Middlesex Turnpike, Bedford, MA 01730, USA. Website: http://www.a-s-l.com
Figure 4.4: 50Hz Video Eyetracker, from Cambridge Research Systems website 1 .
Magnetic Search Coil Technique
The Magnetic Search Coil technique was developed in the 1960s by Robinson
[53], and has been marketed as a method of eye tracking by Skalar Medical4
since 1975, under the name Scleral Search Coil (SSC). An induction coil, encased in a suction ring of silicone rubber, is affixed onto the eye’s limbus, the
boundary between the sclera and the iris. A high frequency horizontal and
vertical magnetic field is generated around the subject, which induces a high
frequency voltage in the induction coil. As the user moves their eye, the voltage
changes, and thus the sclera position, and therefore the eye, can be tracked.
A photograph of a subject wearing the scleral search coil is shown in Figure
4.5. This technique is not often used nowadays for a number of reasons. Initial
preparation is cumbersome - application of a local anesthetic is required before
inserting the coil into the eye. The subject can only wear the search coil continuously for 30 minutes before irritation begins to occur. The subject must
also stay in the centre of the magnetic field for the duration of the recordings.
Health issues with high frequency electromagnetic fields are not yet resolved.
The Electrooculogram
The EOG is probably the most commonly used non-visual method of eye
tracking. The EOG is a bio-electrical skin potential measured around the eyes.
As already described in Section 4.2.2, the photoreceptor cells in the retina are
4 Skalar Medical bv, Thorbeckestraat 18, 2613 BW DELFT, The Netherlands. Website: http://www.skalar.nl
Figure 4.5: User wearing the scleral search coil, from Skalar website 4 . The wire is
just visible coming down the right side of the picture.
excited by light rays falling on them. This causes increased negativity of the
membrane potential, due to ions being pumped out. Over time, a charge
separation occurs between the cornea and retina of the eyeball, which can
vary anywhere between 50-3500µV from peak to peak. In humans the cornea
is positive with respect to the retina. The eyeball can be thought of as a
dipole rotating in a conducting medium. The DC voltage generated by the
eye radiates into adjacent tissues, which produces a measurable electric field
in the vicinity of the eye, which rotates with the eye.
The EOG is measured by placing two electrodes at opposite sides of the eyes
and differentially amplifying the signal to obtain the DC voltage between two
sides of the eyeball. This measurement enables the direction of eye gaze to be
inferred since the potential field varies as the eyeball rotates towards or away
from each electrode. Conventionally, the vertical and horizontal directions
are used to measure eye positions (either individually or in conjunction with
each other depending on the application in question). Vertical movements
are detected by placing electrodes above and below the eye and horizontal
movements are detected by placing the electrodes to the left and right of the
eye (the outer canthi of the eye). Note that for horizontal recordings, the
electrodes are generally not placed on the outer and inner sides of the same
eyeball, as may be expected. This is due to practical difficulties in placing
the electrode at the inner edge of the eye, and for conjugate eye movements,
Figure 4.6: EOG Electrode Positions for (a) the vertical EOG and (b) the horizontal EOG. Eye movement towards the "+" electrode increases the EOG amplitude, movement towards the "-" electrode decreases its amplitude.
placing the electrodes at the two outer canthi should give an almost identical
recording. Care must be taken when placing the electrodes to choose a location
that will minimise EMG interference, which may occur when the subject frowns
or speaks. Typical electrode positions for the vertical and horizontal EOG are
shown in Figure 4.6.
The EOG recordings for saccadic eye movement and smooth pursuit eye
movement are shown in Figure 4.7, recorded as part of the work presented here.
These are both horizontal EOG recordings. The recording of the saccadic eye
movement was made by asking the subject to focus on the centre of a square
on the computer screen and then suddenly translating the square horizontally
by an optical angle of 15◦ at t = 0s. The recording of the smooth pursuit
movement was made by asking the subject to follow a square which is moving
on screen in a sinusoidal fashion, with frequency 0.4Hz. These signals have
been amplified by a factor of approximately 1000, sampled at 200Hz and low-pass filtered in MATLAB using a 25th-order FIR equiripple filter with $f_{pass}$ = 10Hz and $f_{stop}$ = 30Hz. The complete EOG measurement method will be
described in more detail in Section 4.2.6.
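For reference, the filtering step just described can be reproduced in a few lines of MATLAB. This is a sketch using firpm from the Signal Processing Toolbox; eogRaw denotes the amplified EOG samples and is an illustrative name.

    fs = 200;                                          % sampling rate (Hz)
    b = firpm(25, [0 10 30 fs/2]/(fs/2), [1 1 0 0]);   % 25th-order equiripple FIR,
                                                       % fpass = 10Hz, fstop = 30Hz
    eogFiltered = filter(b, 1, eogRaw);                % low-pass filter the EOG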
Figure 4.7: EOG recordings, recorded by the author; each panel plots Amplitude (V) against Time (s). (a) EOG during smooth pursuit of a target moving sinusoidally with frequency 0.4Hz, over a 20s time frame. Notice the baseline has a DC offset. (b) EOG during a saccade. The target moves at t = 0s, and the eye moves to the new position with a latency of just over 0.25s.
4.2.4 The EOG as a Control Signal
The EOG as a communication and control tool has been explored by a number
of research teams, some of whom are mentioned below. Some of the systems
developed are based on detection of a small number of eye movements which
may be translated into switching actions while others attempt to recognise
absolute eye position from the EOG signal. An EOG based alphabet was
developed in the lab [49] and uses the relative movements of the vertical and
horizontal EOG to select letters on the alphabet board shown in Figure 4.8.
The EagleEye system is another EOG based communication and control
tool, developed by a team in Boston College in the 1990s [54]. Absolute eye
position is used to control a cursor moving over an alphabet board based
communication system, which is described by Teece [55]. This system uses
the horizontal and vertical EOG to control a cursor moving over a software
alphabet board. If the user’s eye-gaze remains within the region of a certain
letter for more than 50% of a 833ms epoch, then that letter is selected and
appears below the alphabet board. The main limitation of this system is that
it requires frequent manual balancing of the amplifier voltages, to compensate
Figure 4.8: EOG controlled alphabet board, from [49]
for baseline drift. For people with very severe disabilities, this would require
the aid of a helper, and thus this restraint limits the independence of the user.
The problem of baseline drift in EOG recordings is further discussed below.
The EOG has also been investigated as a controller for a wheelchair by
Barea et al. [56], as part of a larger wheelchair project from the University of
Alcala in Madrid known as SIAMO [57] (a Spanish acronym for Integral System
for Assisted Mobility) which uses ultrasound, infrared sensors and cameras
to create information about an environment and facilitate safer wheelchair
navigation. In the system described in [56], the wheelchair user is presented
with a menu of different wheelchair commands on a computer laptop screen
(e.g. STOP, FORWARD, BACKWARD, LEFT and RIGHT). The user looks
at the desired word to select a command to operate the wheelchair. This
system uses an AC coupled amplifier to overcome problems with DC drift.
The EOG was used for communication and control purposes with a patient in hospital. The male patient had suffered a brainstem stroke and as a result was only capable of making small vertical eye movements. These movements were harnessed by recording the vertical EOG and actuating a switching action whenever a threshold was crossed. Thus the user was able to operate the Natterbox program and spell out messages.
Advantages of the EOG over other methods
The visual systems mentioned above in this section offer robust methods of
eye tracking, usually with very good accuracy. While in certain circumstances,
visual methods may be more appropriate, the electrooculogram offers a number
of advantages. Some of the reasons for favouring the EOG over other options
for measuring eye movements are presented here.
• Range
The EOG typically has a larger range than visual methods which are
constrained for large vertical rotations where the cornea and iris tend to
disappear behind the eyelid. Oster and Stern [50] estimate that since
visualisation of the eye is not necessary for EOG recordings, angular deviations of up to ±80◦ can be recorded along both the horizontal and
vertical planes of rotation using electrooculography. Visual-based systems often have a much more restricted range; for example, the 50Hz Video Eyetracker from Cambridge Research Systems1 has a horizontal range of only ±40◦ and a vertical range of just ±20◦.
• Linearity
The reflective properties of ocular structures used to calculate eye position in visual methods are linear only for a restricted range, compared
to the EOG where the voltage difference is essentially linearly related to
the angle of gaze for ±30◦ and to the sine of the angle for ±30◦ to ±60◦
[50].
• Head Movements are Permissible
The EOG has the advantage that the signal recorded is the actual eyeball
position with respect to the head. Thus for systems designed to measure
relative eyeball position to control switches (e.g. looking up, down, left
and right could translate to four separate switch presses) head movements will not hinder accurate recording. Devices for restraining the
head or sensing head movement are only necessary when the absolute
eye position is required. Conversely, visual methods such as the limbus
boundary technique require that the head be kept stationary so a head
movement will not be misinterpreted as a change in eye position, and
even slight head movements with respect to the light source can produce
disproportionately large calibration errors. Head-brackets or chin-rests
are often used to keep the head in place; these are often uncomfortable
and therefore impractical to use for any length of time.
Even visual methods that compensate for head movements by tracking relative movement of two points in the eye (as in the pupil boundary/corneal reflection technique) require that the eyes be kept within
the line of sight of the camera and thus often use a head rest anyway to
keep the head in position. The criterion that the head must be kept in
front of a camera may not be possible in certain circumstances where it
is conceivable that the user may not be in front of a computer screen or
in instances where the user has uncontrolled head spasms, as may be the
case for users with cerebral palsy.
• Non-invasive
Unlike techniques such as the magnetic search coil technique, EOG recordings do not require anything to be fixed to the eye which might cause
discomfort or interfere with normal vision. EOG recording only requires
three electrodes (for one channel recording), or five electrodes (for two
channel recording), which are affixed externally to the skin.
• Obstacles in front of the eye
In visual methods, measurements may be interfered with by scratches on
the cornea or by contact lenses. Bifocal glasses and hard contact lenses
seem to cause particular problems for these systems. EOG measurements
are not affected by these obstacles.
• Cost
EOG based recordings are typically cheaper than visual methods, as they
can be made with some relatively inexpensive electrodes, some form of
data acquisition card and appropriate software, unlike most of the visual systems described above, which require expensive equipment and
can cost around the €10,000 mark. Any method using infrared
light requires an infrared transmitter and camera for operation, plus expensive software to calculate the eye position from the captured image.
Software to convert EOG recordings into absolute eye position is considerably more straightforward than video based techniques that require
complicated computations to analyse video frames and convert this into
an estimate of eye position, and thus EOG software should be less expensive. In hospitals, the electrodes necessary to measure the EOG are
usually readily available.
• Lighting Conditions
Variable lighting conditions may make some of the visual systems unsuitable or at least require re-calibration when the user moves between
different environments. One such scenario which could pose problems is
where the eye tracking system is attached to a user’s wheelchair. As the
user moves between different environments the system needs to respond
accordingly. This could be achieved using some sort of photosensitive device to measure lighting conditions but this would need to be interfaced
with whatever system is used. Variable lighting conditions will cause
baseline drift of the EOG signal. For measurement of relative eye movements this should not be a problem, since a system could be developed
to respond only to sudden large changes in EOG amplitude, rather than
slow changes due to varying lighting conditions.
• Eye Closure is Permissible
The EOG is commonly used to record eye movement patterns when the
eye is closed, for example during sleep. Visual methods require the eye
to remain open to know where the eye is positioned relative to the head,
whereas an attenuated version of the EOG signal is still present when
the eye is closed.
• Real-Time
The EOG can be used in real-time as the EOG signal responds instantaneously to a change in eye position and the eye position can be quickly
inferred from the change. The EOG is linear up to 30◦ . The frequency response of visual methods is limited by the frame rate and the calculation
time of eye position from the frames.
Obviously there are also some disadvantages and these are discussed below.
Limitations of EOG-Based Eyetracking
The EOG recording technique requires electrodes to be placed on both sides
of the eyes, and this may cause some problems. Firstly, it requires that a
helper is present who has been taught how to correctly position the electrodes.
Secondly, electrodes placed around the eyes may draw attention to the user’s
disability and compromise the user’s feelings of dignity. For horizontal EOG
recordings, a possible solution is to use a pair of glasses or sunglasses. The
two electrodes are placed on the inside of the temple arm of the glasses so
that the electrodes make contact with the skin when the glasses are worn.
Many people who are disabled already wear sunglasses, even indoors, due to
photosensitivity.
Another large problem faced by EOG-based gaze tracking systems using
DC coupled amplifiers is the problem of baseline drift. This problem may be
circumvented by using an AC coupled amplifier but then the signal recorded
will only reflect changes in the eye position rather than expressing the absolute
eye position. If eye position is to be used for any sort of continuous control
(rather than one or more switching actions) then a DC coupled amplifier is
usually necessary. The measured EOG voltage varies for two reasons. Either
the eye moves (which we want to record), or baseline drift occurs (which we
want to ignore). Baseline drift occurs due to the following factors:-
• Lighting Conditions
The DC level of the EOG signal varies with lighting conditions over long
periods of time. When the source of the light entering the eye changes
from dark conditions to room lighting, Oster and Stern [50] state that
it can take anywhere from between 29-52 minutes for the measured potential to stabilise to within 10% of the baseline, and anywhere between
17-51 minutes when the transition is from room lighting to darkness.
• Electrode Contact
The baseline may vary due to the spontaneous movement of ions between
the skin and the electrode used to pick up the EOG voltage. The most commonly used electrode type is silver-silver chloride (Ag-AgCl). Large
DC potentials of up to 50mV can develop across a pair of Ag-AgCl electrodes in the absence of any bioelectric event, due to differences in the
properties of the two electrode surfaces with respect to the electrolytic
conduction gel [58]. The extent of the ion movement is related to a number of variables including the state of the electrode gel used, variables in
the subject’s skin and the strength of the contact between the skin and
the electrode. Proper preparation of the skin is necessary to maximise
conduction between the skin and the conduction gel, usually by brushing
the skin with alcohol to remove facial oils.
• Artifacts due to EMG or Changes in Skin Potential
The baseline signal may change due to interference from other bioelectrical signals in the body, such as the electromyogram (EMG) or the skin
potential. EMG activity arises from movement of the muscles close to the
eyes, for example if the subject frowns or speaks. These signals may be
effectively rejected by careful positioning of the electrodes and through
low pass filtering the signal. Skin potential changes due to sweating or
emotional anxiety pose a more serious problem.
• Age and Sex
Oster and Stern [50] report that age and sex have a significant effect
on baseline voltage levels, although this should not pose a problem if a
system is calibrated initially for each particular user.
• Diurnal Variations
The baseline potential possibly varies throughout the day.
Manual calibration is often used to compensate for DC drift - the subject
shifts his gaze between points of known visual angle and the amplifier is balanced until one achieves the desired relationship between voltage output and
degree of eye rotation. With frequent re-calibration, accuracies of up to ±30′
can be obtained [50]. While manual calibration may be acceptable practice in
clinical tests that use the EOG, this restriction hinders the EOG from being
used independently as a control and communication tool by people with disabilities. A technique called Target Position Variation is proposed here, which
enables the user to automatically re-calibrate their EOG whenever significant
baseline drift is perceived to have occurred.
4.2.5 Target Position Variation
Theory
Target Position Variation (TPV) was developed as part of the work presented
here as a way of improving EOG based communication and control. There are
two possible applications for TPV given here, although there may be others.
These are in menu selection or in automatic eye position re-calibration.
In TPV-based menu selection the user is presented with a screen containing
a number of different menu items. An example menu is shown in Figure
4.9, where four menu options are given - “Lights”, “Radio”, “Television” and
“Fan”. Underneath each of the menu options are icons which are each moving
sinusoidally at a unique frequency. The user chooses one of the four options
by looking at the appropriate icon and following its path of movement. This
type of eye movement is a smooth pursuit movement and will generate an EOG
similar to the one shown in Figure 4.7(a), where the EOG voltage value varies
in synchronisation with the sinusoidal movement of the icon on screen. Since
each of the icons is moving at a different frequency, spectral analysis of the
EOG signal can be used to determine which icon is being tracked by the user.
Thus the system can identify the required menu item. In the system shown
in Figure 4.9, the program could then issue a command to an environmental
control module to toggle the state of the appliance chosen. Phase differences
could also be used instead of frequency differences to individually recognise
each icon.
Target Position Variation can also be used for automatic eye position recalibration in applications where eye gaze is used for continuous control. An
example of such a system would be one where eye position is used to control the
position of a mouse cursor on screen. In EOG based systems for mouse cursor
control, the user may find that after a certain time the position of the cursor
begins to drift away from their centre of vision, due to the onset of baseline
Figure 4.9: TPV based menu selection application. The user selects a menu option
by tracking the moving icon below the desired item.
drift. Once this occurs, re-calibration is necessary. Usually this requires another person to manually balance the amplifier to compensate for drift. TPV
offers a means of automatic re-calibration without requiring another person
present, by positioning an icon at some corner of the screen which is varying
sinusoidally. Each time the user finds that there is a significant error between
their desired cursor position and the actual cursor, they move their eye gaze
towards the oscillating icon at the corner of the screen. This will generate a
sinusoidal wave in their EOG which can easily be detected. Since the position
of the icon on screen is known, the system can calculate the baseline offset
and thus re-calibrate the absolute eye position for subsequent mouse cursor
movements.
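A minimal sketch of this re-calibration step is given below. It assumes a linear EOG model (voltage = baseline + G x gaze angle), that the volts-per-degree gain G is known from an initial calibration, and that the oscillating icon's centre corresponds to a known visual angle thetaIcon; all names are illustrative.

    Tk = round(fs/fIcon);               % one period of the icon, in samples
    win = eog(end-Tk+1:end);            % EOG recorded while tracking the icon
    baseline = mean(win) - G*thetaIcon; % drift estimate: mean EOG minus the
                                        % voltage implied by the known angle
    theta = (eog - baseline)/G;         % re-calibrated absolute eye position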
4.2.6 Experimental Work
A number of different experiments were conducted with able-bodied subjects to
assess the suitability of TPV for human-computer interfaces, and to determine
suitable parameters that could be used to form a working system. Preliminary
testing was performed to select the following features used for the subsequent
experiments:
• Target Shape
A suitable icon shape is one with a well defined point that the subject
can be instructed to focus on. This is desirable if TPV is to be used to
re-calibrate eye position. Since the position of the focal point is known,
a more accurate estimate of absolute eye position can be made if it can
be assumed that the subject is focusing on this point. Two candidate
shapes are shown in Figure 4.10. Each has a clearly defined point in the
centre of the icon. The shape in Figure 4.10(a) was arbitrarily chosen
for the experiments described below.
Figure 4.10: TPV candidate target shapes: (a) Square, (b) Diamond.
• Target Size
If the icon is too small it will be difficult to follow, if the icon is too
large the user’s gaze may drift away from the centre of the icon. An icon
width of 60 pixels was chosen. The screen size used for the experiments
was 32cm and the horizontal pixel width was 1024, so this corresponds
to an actual width of 1.875cm.
• Oscillation Pattern
Two target oscillation patterns were initially tested - a triangular wave
pattern, where the magnitude of the speed of the icon’s motion is constant, and a sinusoidally varying pattern. Although it was observed that
both patterns were reflected equally well in the recorded EOG, the sinusoidally varying pattern was chosen as it was deemed more easily recognisable spectrally. A sinusoidal pattern will correspond to a single peak
in the frequency spectrum, whereas a triangular wave will also contain
odd-numbered harmonics which may complicate the detection process
and also might introduce the constraint that two different icons could
not have frequencies that are odd-numbered multiples of each other.
• Oscillation Direction
The maximum variation obtainable in the EOG when the eye is in smooth
pursuit of an object occurs when the object is moving along the line of
the EOG recording. Thus to obtain this maximum variation, there were
two possibilities available - either the target could be made to oscillate
vertically and the vertical EOG recorded, or the target could be made to
oscillate horizontally and the horizontal EOG recorded. Since the vertical
EOG may introduce artifacts due to blinking, the horizontal EOG was
chosen for experiments and thus a horizontal oscillation direction was
used.
Procedure
Gold cup electrodes filled with an electrolyte gel were used. Since the horizontal EOG was to be measured, the active and reference electrodes were fixed to the subjects' temples, beside the two eyes' outer canthi, the junction where the
upper and lower eyelids meet. The ground electrode was fixed to the subject’s
right earlobe. A DC custom-built EOG amplifier with an approximate voltage
gain of 1000 was used. The design is based on a classic instrumentation amplifier topology [59] that requires a single quad op-amp, along with a handful
of resistors and capacitors, which need not have particularly small tolerances.
The LTC1053 chopper-stabilised quad op-amp was chosen for its high input
impedance and low input offset voltage.
Two PCs were used for these experiments, a display PC to display the
moving icons and a recording PC to record the EOG data. Data was acquired
by the recording PC using a National Instruments NIDAQ PCI 6023E card.
The sampling frequency was 200Hz. Two channels were acquired, one for the
EOG and one to record synchronisation pulses from the corner of the display
PC’s screen, via a phototransistor. These pulses allow the recorded EOG data
to be accurately related to the oscillation of each icon’s position. The data
was acquired using MATLAB's Real-Time Windows Target in the Real-Time Workshop.
The user was positioned opposite the centre of the display PC screen, 50cm
from the screen. Since the screen width was 32cm, if the visual angle is taken
to be at 0◦ when the user is focusing on the centre of the screen, then the
maximum visual angle through which the user’s eye position will extend while
still looking at the screen is ±17◦ . This is within the linear range of EOG
measurement, which is approximately ±30◦ , so the EOG amplitude will be
directly proportional to the angle of gaze.
The experimental procedure consisted of two separate parts.
Experiment 1
This part of the experiment consisted of 12 separate EOG recordings, each
lasting 20s each. For each recording, a single target was presented moving
with a horizontal sinusoidal oscillation centred at the middle of the screen at
position $\{x_0, y_0\}$. A different combination of amplitude, $A$, and frequency, $f$, was used for each of the recordings. The horizontal position of the icon at any moment in time, $x_n$, may be calculated as:

$$x_n = x_0 + A\sin(2\pi f t) \qquad (4.1)$$
Three amplitudes of oscillation were used and for each amplitude setting, four
different frequencies were tested. The three amplitudes of oscillation used were
25, 50 and 100 pixels from the centre in both directions, corresponding to a
maximum visual angle from the centre line of ±0.89◦ , ±1.79◦ and ±3.59◦ . The
four frequencies of oscillation used were 0.2Hz, 0.4Hz, 0.8Hz and 1.6Hz. For
each icon movement, the subject was instructed to keep his or her head still
and follow the position of the icon on screen. The data was recorded over the
20s period for each icon movement. The twelve graphs obtained over each 20s
period for one subject are shown in Figure 4.11. Note that the y-axis extends
over a range of 0.25V for each of the twelve graphs, although each may have a
different DC offset.
Figure 4.11: EOG Recordings of One Subject for TPV: Experiment 1. Panels (a)-(l) show the EOG for each combination of amplitude A = 25, 50 and 100 pixels and frequency f = 0.2Hz, 0.4Hz, 0.8Hz and 1.6Hz (amplitude in V against sample number; the y-axis spans 0.25V in each panel).
Visual inspection of the graphs shown in Figure 4.11 reveals the following
observations:
1. The phenomenon of baseline drift is evident, especially in Figures 4.11(a),
4.11(h), 4.11(k) and 4.11(i).
2. The EOG recordings for A=25, which corresponds to movement through
an angle of ±0.89◦ , are strongly contaminated by noise. This is especially evident for f=0.8Hz and f=1.6Hz, where the oscillation is barely
discernible from the background noise. Low pass filtering could be used
to reduce the effect of noise in the signal, but this may introduce a phase
lag which would not be desirable if phase difference was chosen as the
parameter used to recognise which icon is being followed.
3. The EOG for A=50 and A=100, which correspond to movement through angles of ±1.79◦ and ±3.59◦ respectively, both look promising; the oscillation is clearly visible for each.
4. The four different frequencies (0.2Hz, 0.4Hz, 0.8Hz and 1.6Hz) appear
to all give satisfactory results, although the subjects reported that it
was beginning to become difficult to follow the icon when it was moving
through an amplitude of A=100 (±3.59◦) at 1.6Hz. The average eye speed, $v_e$, required to pursue a target moving with this frequency at this amplitude is:

$$v_e = (4 \times 3.59)\,({}^\circ/\text{cycle}) \times 1.6\,(\text{cycles/s}) = 23^\circ/\text{s} \qquad (4.2)$$
This is approaching the limit of eye movement for smooth pursuit, which
is around 30◦ /s.
5. For f=0.2Hz, it takes 5s for one cycle of the target oscillation to be
completed in the EOG. Depending on the method used to recognise an
oscillation, this may introduce a significant delay between when the user
begins to follow the icon and when the system begins to recognise this
pattern and respond to this action.
Table 4.1: Parameters of the four icons for each of the four parts of Experiment 2

          Fixed              Expt 2A       Expt 2B       Expt 2C       Expt 2D
     x0    y0    Colour      θ      f      θ      f      θ      f      θ      f
I1   150   350   red         0◦     0.4Hz  0◦     0.8Hz  0◦     1.6Hz  0◦     0.2Hz
I2   850   350   green       90◦    0.4Hz  90◦    0.8Hz  90◦    1.6Hz  0◦     0.4Hz
I3   500   150   blue        180◦   0.4Hz  180◦   0.8Hz  180◦   1.6Hz  0◦     0.8Hz
I4   500   550   black       270◦   0.4Hz  270◦   0.8Hz  270◦   1.6Hz  0◦     1.6Hz
These experiments were repeated on a number of different subjects and similar
results were obtained from each subject. Based on these results, A=100 was
used for all oscillations in the next experiment.
Experiment 2
In this experiment, 4 icons, which can be labelled I1 , I2 , I3 and I4 , were
presented to the subject. Each icon oscillates with an amplitude of 100 pixels.
The experiment consisted of four parts. In Part A, all four icons oscillate at
0.4Hz, but each with a different phase (I2 , I3 and I4 are 90◦ , 180◦ and 270◦
out of phase with I1 ). For Part B and Part C of the experiment, the frequency
of all four icons was changed to 0.8Hz and 1.6Hz respectively and the phase
differences remained the same. In Part D, all four icons oscillate at different
frequencies. The horizontal position of each of the four icons at any moment in time, $x_n$, may be calculated as:

$$x_n = x_0 + 100\sin(2\pi f_n t + \theta_n) \qquad (4.3)$$
The individual values of $f$ and $\theta$ used to calculate $x_1$, $x_2$, $x_3$ and $x_4$ for each of the four parts of the experiment are summarised in Table 4.1.
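As an illustration, the icon trajectories can be generated directly from Equation 4.3 with the Table 4.1 parameters; the sketch below uses the Experiment 2D values (pixel coordinates, t in seconds).

    x0 = [150 850 500 500];  y0 = [350 350 150 550];  % fixed centres (Table 4.1)
    f  = [0.2 0.4 0.8 1.6];  th = [0 0 0 0];          % Expt 2D: distinct frequencies
    x = @(n, t) x0(n) + 100*sin(2*pi*f(n)*t + th(n)); % horizontal position, Eq. 4.3

For example, x(2, 1.25) gives the horizontal position of the green icon I2 at t = 1.25s.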
The screen that the subject is presented with is shown in Figure 4.12. The
screen consists of a box in the centre of the screen, the control box, surrounded
by four moving icons. A box on the bottom left-hand corner of the screen
is used for synchronisation. Before the experiment starts, the experimental
Table 4.2: Sequence in which the subject follows each of the icons in Experiment 2D.

0-5s     White
5-10s    Red      subject should follow I1
10-15s   White
15-20s   Green    subject should follow I2
20-25s   White
25-30s   Blue     subject should follow I3
30-35s   White
35-40s   Black    subject should follow I4
40-45s   White
procedure is described to the subject. The subject should look at the control
box at the centre of the screen which will initially be white. Once the control
box changes colour, the subject should move their gaze to the icon with the
colour corresponding to the colour of the control box. The subject should
follow the moving icon until they can see in their peripheral vision that the
control box colour has returned to white. The subject should then move their
gaze onto the control box until it changes colour again. The subject should
familiarise himself or herself with the location of the four coloured icons before
the experiment starts.
The duration of each of the four experiments was 45 seconds and the centre
box colour in each followed the same sequence which is shown in Table 4.2.
The resulting EOG for one subject for each of the four 45s periods is shown in
Figure 4.13. As the red and green icons are to the left and right of the centre box, a DC offset in the EOG can be seen for the periods 5-10s and 15-20s, where the user moves their eyes to follow those icons. The blue and black
icons are moving directly above and below the centre box and hence do not
cause a change in the DC offset when the user moves their eyes to look at these
icons.
Figure 4.12: TPV Experiment 2: Screenshot of the scene presented to the subject.
4.2.7 TPV Based Menu Selection
Different signal processing techniques were examined to identify a robust method
of determining which icon the subject is looking at from the recorded data.
The method developed here is named target identification. The aim was that
the method identified could be used in a system for menu selection, that would
contain K different icons, each moving at a different frequency. The method
should enable automatic determination of which icon, if any, the subject is
looking at and hence enable the program to select the menu option corresponding to that particular icon.
The initial impulse might be to perform a full N-point FFT on a chunk
of the recorded EOG data, from some previous sample n − N to the current
sample n. Then this spectrum would be searched to find the frequency component with the maximum power, and a conclusion could be drawn that the
icon that the user is looking at is the icon with the closest frequency to this
maximum. However, this method is extremely computationally inefficient, and
Figure 4.13: TPV Experiment 2, panels (a) Expt 2A, (b) Expt 2B, (c) Expt 2C and (d) Expt 2D. The x-axis shows number of samples and the y-axis shows Volts (gain = 1000). In Expt 2A, 2B and 2C the frequencies of the icons are fixed to 0.4Hz, 0.8Hz and 1.6Hz respectively and the icons each move 90◦ out of phase with the last. In Expt 2D the icons move with frequencies 0.2Hz, 0.4Hz, 0.8Hz and 1.6Hz.
also introduces errors if there is some spurious frequency peak due to noise. In
reality, the only frequency components that need to be examined are those at
the same frequencies as each of the icons. If the power at any of these components is greater than a threshold, then the user is assumed to be following the
icon with the frequency in question, if not, no icon is being followed. Instead of
performing an FFT computation, which calculates the power in the spectrum
at every frequency interval, it is only necessary to calculate the components of
the Fourier Series corresponding to each of the K frequencies.
The EOG signal $s(t)$ is sampled by the computer at a fixed sampling rate, $f_s$. The sampling interval $\Delta t = f_s^{-1}$. The notation used to represent the sampled EOG signal is $s[n] = [s_i, s_{i-1}, s_{i-2}, \ldots, s_1, s_0]$, where $s_0$ is the first sample, sampled at t=0s (n=0), and $s_i$ is the current sample, sampled at $t = i\Delta t$ (n=i). There are $K$ Fourier Series component coefficients to be calculated at each sampling instant $i$, which we can number $k = 1, \ldots, K$. The current coefficient at sampling instant $i$ corresponding to the particular icon frequency, $f_k$, with period $T_k$ (in samples), is notated $c_{ik}$. This coefficient can be calculated using the last $T_k$ samples of recorded data, $s_{i-T_k+1} \rightarrow s_i$:

$$c_{ik} = \frac{2}{T_k} \sum_{m=i-T_k+1}^{i} s_m \exp(j 2\pi f_k m \Delta t) \qquad (4.4)$$

(The exponential is complex; its real and imaginary parts are used separately below.)
Based on this, a candidate reconstruction signal $r_{ik}[n]$, consisting of $T_k$ elements, can be calculated at each of the $K$ frequencies $f_k$:

$$r_{ik}[n] = \mathrm{Re}(c_{ik}) \cos(2\pi f_k T[n]) + \mathrm{Im}(c_{ik}) \sin(2\pi f_k T[n]) \qquad (4.5)$$

$$\text{where } r_{ik}[n] = [r_{i-T_k+1}, r_{i-T_k+2}, \cdots, r_{i-1}, r_i] \qquad (4.6)$$

$$\text{and } T[n] = [\Delta t(i-T_k+1), \Delta t(i-T_k+2), \cdots, \Delta t(i-1), \Delta t(i)] \qquad (4.7)$$
If the reconstructed signal over the period $[n - T_k + 1 : n]$ at any of the K frequencies is a close fit to the recorded EOG data over the preceding $T_k$ seconds, then it can be assumed that the user is tracking the icon moving with that frequency. The closeness of fit is quantified as follows. The EOG recorded
will have a DC offset which needs to be subtracted from the original signal before it can be compared to the reconstructed signal. The signal average, $\bar{c}_{ik}$, was calculated in each case over the previous $T_k$ samples:

$$\bar{c}_{ik} = \frac{1}{T_k} \sum_{m=i-T_k+1}^{i} s_m \qquad (4.8)$$
The error value $e_{ik}$ between each of the K reconstructed signals and the actual recorded signal over the previous $T_k$ samples is calculated for each candidate frequency as follows:

$$e_{ik} = \sum_{m=i-T_k+1}^{i} |s_m - \bar{c}_{ik} - r_m| \qquad (4.9)$$
The fit value for each frequency, $FV_{ik}$, is defined as:

$$FV_{ik} = \frac{|c_{ik}|}{e_{ik}} \qquad (4.10)$$
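To make the computation concrete, the following is a minimal sketch (in Python with numpy, not the thesis implementation) of Equations 4.4-4.10 evaluated at a single sampling instant, together with the threshold test described next. The function names and the idea of per-frequency thresholds passed in as a dictionary are illustrative assumptions.

```python
# Hedged sketch of the target identification method (Equations 4.4-4.10).
import numpy as np

def fit_values(s, fs, freqs):
    """s: 1-D array of EOG samples up to the current instant i = len(s)-1;
    fs: sampling rate (Hz); freqs: the K icon frequencies (Hz)."""
    i = len(s) - 1
    dt = 1.0 / fs
    fv = {}
    for fk in freqs:
        Tk = int(round(fs / fk))              # samples in one period of fk
        if i + 1 < Tk:
            continue                          # need Tk samples first
        m = np.arange(i - Tk + 1, i + 1)      # indices of the last Tk samples
        seg = s[m]
        c = (2.0 / Tk) * np.sum(seg * np.exp(1j * 2 * np.pi * fk * m * dt))  # Eq. 4.4
        t = m * dt
        r = c.real * np.cos(2 * np.pi * fk * t) + c.imag * np.sin(2 * np.pi * fk * t)  # Eq. 4.5
        c_bar = seg.mean()                    # Eq. 4.8: DC offset
        e = np.sum(np.abs(seg - c_bar - r))   # Eq. 4.9: absolute error
        fv[fk] = np.abs(c) / e                # Eq. 4.10: fit value
    return fv

def selected_icon(fv, thresholds):
    """Return the frequency whose fit value exceeds its threshold, if any."""
    above = {f: v for f, v in fv.items() if v > thresholds[f]}
    return max(above, key=above.get) if above else None
```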
At each point i, the user is assumed to be tracking an icon moving at frequency $f_k$ if the fit value $FV_{ik} > FV_{\mathrm{threshold}}$. The data from Experiment 2D was used to calculate the fit value at each sample interval for each of the four frequencies, $f_1$ = 0.2 Hz, $f_2$ = 0.4 Hz, $f_3$ = 0.8 Hz and $f_4$ = 1.6 Hz, called $FV_{0.2}[n]$, $FV_{0.4}[n]$, $FV_{0.8}[n]$ and $FV_{1.6}[n]$. Five decisions can be made by a system at each time interval for the data from Experiment 2D: either the user is looking at one of the four icons moving sinusoidally at 0.2 Hz, 0.4 Hz, 0.8 Hz or 1.6 Hz, or the user is not looking at any of the four icons. The fit value at each sample is plotted for each of the four frequencies in Figure 4.14. Note that in each case the fit value is only calculated once $T_k$ samples have elapsed. Each of the fit functions shows a peak when the user is tracking the corresponding icon, but not all peaks reach the same amplitude; therefore different threshold values should be used for each of the four fit functions. Also, $FV_{0.2}$ in Figure 4.14(b) only peaks at the end of the oscillation. Since in this experiment the user only tracked each icon for 5 s, the function does not work very well at identifying the icon at lower frequencies: it takes 5 s before one full period of oscillation of the icon moving at 0.2 Hz has occurred.
Figure 4.14: (a) Original EOG recording; (b) fit values for f = 0.2 Hz; (c) fit values for f = 0.4 Hz; (d) fit values for f = 0.8 Hz; (e) fit values for f = 1.6 Hz.
In the figure, the user is tracking the 0.2 Hz icon from approximately t = 9.5 s to t = 14.5 s, but a noticeable peak in the function $FV_{0.2}$ only appears at around t = 14 s, i.e. once the reconstructed signal covers one full period of recorded data at 0.2 Hz. Quicker recognition times for more slowly moving icons could perhaps be achieved by modifying the definition of the fit value function at lower frequencies, for example by calculating a reconstructed signal only a fraction of a period long. The best fits are for the f = 0.8 Hz and f = 1.6 Hz data.
A working system using this method could calculate each of the four fit
functions at regular intervals and compare each one to a suitable threshold.
When one of the four fit functions crosses its threshold, the system recognises
that the user is following the icon with the frequency in question, and performs
the appropriate action. The theory of target position variation, the experimental work presented here and the method of target identification formed
the basis of a paper presented at the conference PGBIOMED’05 in Reading
[60].
4.2.8 Limitations of Eyetracking for Cursor Control
Mouse cursor control has been suggested here as a suitable application for eye-gaze tracking, and using eye movements to navigate a pointing cursor on a computer screen initially seems ideal. However, this proposal is not as straightforward to implement as it may at first appear, for a number of reasons. Firstly, even when a user perceives that they are looking steadily at a single object, the eye may make slight continuous jerks. One cause of this jerking is microsaccades, which occur to ensure that individual photoreceptor cells of the retina are continuously stimulated [61]. Jerks may also occur due to the eye drifting around various points of an icon on screen. These jerks cause problems when using eye tracking for cursor control, since it is hard to keep the pointer fixed on a target. The system needs to recognise such situations and respond as if the subject were looking steadily at the desired target. Just and Carpenter [62] have developed an approach which attempts to overcome this problem: if the user makes several fixations around an object on screen, connected by small saccades, the motion is grouped together into a single "gaze" at the object.
The second problem with using eye movements continually to control a pointing cursor or other device is what Jacob [63] calls the Midas Touch problem:
“At first, it is empowering to be able simply to look at what you
want and have it happen, rather than having to look at it (as you
would anyway) and then point and click it with the mouse. Before
long, though, it becomes like the Midas Touch. Everywhere you
look, another command is activated; you cannot look anywhere
without issuing a command.”
4.2.9 A Model of the Eye
The investigation of eye movements for communication and control purposes motivated an in-depth study of the processes which cause horizontal and vertical rotation of the eyes. The effects of the eye's muscle spindle on the eyeball's torque were also considered. Based on this study, a feedback model
eyeball’s torque were also considered. Based on this study, a feedback model
of the eye has been developed. This system models rotation of the eye in
either the horizontal or the vertical plane. In either plane, rotation of the eye
is controlled by a pair of agonist-antagonist extraocular muscles. Contraction
of one muscle rotates the eye in the positive θ direction, and of the other in
the negative θ direction. In the model presented here, these two muscles are
condensed into a single equivalent bidirectional muscle, which can rotate the
eyeball in either direction. The influence of the muscle spindle is incorporated
into the model in an inner feedback loop. The model was developed initially
to model saccadic eye movements and it attempts to predict the eye response
to a saccade, such as the 15◦ jump shown in Figure 4.7(b). The original model
was later modified to include prediction of smooth pursuit movements. The
model of the eye for saccadic eye movements will first be described.
The structure of the control loop for this model is shown in Figure 4.15.
The model shown consists of two feedback loops. The outer loop consists of a
non-linear element in cascade with a controller in the forward path. The inner
loop, containing C1 (s) and G(s), models the effect of the muscle spindle on
the eye, which acts to speed up the response of the eye.
Inner Loop of Model
The eyeball muscle torque is controlled by the muscle spindle. The inner feedback loop, shown in Figure 4.15, represents the spindle feedback mechanism.
The muscle spindle essentially senses the error, el , between θrl , the locally generated reference value for θ and the current value of θ, and uses it to generate
the gross rotational torque Tg , on the eyeball. Drawing on Stark’s work [64]
on control of the hand, the transfer function relating $e_l(s)$ and $T_g(s)$ is taken to be of the form

$$\frac{T_g(s)}{e_l(s)} = C_1(s) = \frac{f_1 s + f_0}{s^2 + h_1 s + h_0} \qquad (4.11)$$
G(s), the transfer function of the eye dynamics, can be calculated using the model proposed by Westheimer [65], which gives the following equation relating the eye position θ to the net torque $T_n$ applied to the eyeball:

$$J\frac{d^2\theta}{dt^2} = T_n - F\frac{d\theta}{dt} - K\theta \qquad (4.12)$$
The gross rotational torque falls off with the velocity of contraction due to the effects of friction, usually nonlinearly. However, in the interests of obtaining a linear dynamic model, the fall-off is approximated as being linearly related to eye velocity. Thus the net torque applied to the eyeball, $T_n$, can be related to the friction factor f and the gross torque $T_g$:

$$T_n = T_g - f\frac{d\theta}{dt} \qquad (4.13)$$
Substituting Equation 4.13 into Equation 4.12 and taking the Laplace transform with zero initial conditions gives the transfer function G(s) of the eye dynamics:

$$G(s) = \frac{\theta(s)}{T_g(s)} = \frac{1/J}{s^2 + \frac{f + F}{J}s + \frac{K}{J}} \qquad (4.14)$$

where, according to Westheimer [65], the values of the constants in SI units are:

$$J = 2.2 \times 10^{-3}, \qquad \frac{F}{J} = 168, \qquad \frac{K}{J} = 14400 = 120^2 \qquad (4.15)$$
Figure 4.15: Proposed feedback control structure for the eye.
The standard form of the transfer function of a second-order linear system is

$$\frac{G_{dc}\,\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2} \qquad (4.16)$$

where $\omega_n$ is the undamped natural frequency, $G_{dc}$ is the DC gain of the system and ζ is the damping ratio. Comparing this to the transfer function G(s) with no friction added (f = 0),

$$G(s) = \frac{454.55}{s^2 + 168s + 120^2} \qquad (4.17)$$
gives a damping ratio ζ = 0.7 and undamped natural frequency $\omega_n$ = 120 rad/s. The literature gives no guidance as to what value to assign to f/J. This model proposes that the value of f is such that it increases the damping ratio to ζ = 1, the value at which critical damping occurs, i.e. f/J = 72, giving

$$G(s) = \frac{454.55}{s^2 + 240s + 120^2} = \frac{454.55}{(s + 120)^2} \qquad (4.18)$$
$G_i(s)$, the overall transfer function of the inner loop relating the eye position θ to the local reference value $\theta_{rl}$, can then be calculated as

$$G_i(s) = \frac{C_1(s)G(s)}{1 + C_1(s)G(s)} = \frac{\frac{1}{J}(f_1 s + f_0)}{(s^2 + h_1 s + h_0)(s + 120)^2 + \frac{1}{J}(f_1 s + f_0)} \qquad (4.19)$$
The denominator here is the characteristic polynomial of the inner loop, whose roots determine the nature of its dynamics. From Equation 4.18, it can be seen that the roots of the characteristic polynomial of the transfer function for the eyeball dynamics alone, without the influence of the muscle spindle, would be at s = −120. Assuming that the function of the muscle spindle is to speed up the eyeball response, the roots of the overall characteristic polynomial should lie further into the left half plane than this value. This model proposes to place all four roots at the same location, i.e. at s = −120b, where b > 1. This gives

$$(s + 120b)^4 = (s^2 + h_1 s + h_0)(s + 120)^2 + \frac{1}{J}(f_1 s + f_0) \qquad (4.20)$$
Multiplying out and comparing coefficients gives the following values:

$$\begin{aligned}
h_1 &= (4b - 2)(120) \\
h_0 &= (6b^2 - 8b + 3)(120)^2 \\
f_1 &= J(4b^3 - 12b^2 + 12b - 4)(120)^3 \\
f_0 &= J(b^4 - 6b^2 + 8b - 3)(120)^4
\end{aligned} \qquad (4.21)$$
Comparison with the recorded data of Figure 4.7(b) indicates that a value b = 2, in conjunction with tuning of the integral controller $C_2(s)$ in the outer loop, gives a very good match to the real response. Substituting these values into Equation 4.11 gives the expression

$$C_1(s) = \frac{15206.4s + 2280960}{s^2 + 720s + 158400} \qquad (4.22)$$

$$= \frac{15206.4(s + 150)}{(s + 360 - j169.7056)(s + 360 + j169.7056)} \qquad (4.23)$$
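As a check on the algebra, the following minimal sketch (plain Python, not part of the thesis software) evaluates the coefficient formulas of Equation 4.21 for b = 2 and J = 2.2 × 10⁻³ and prints the resulting controller numbers, which reproduce Equation 4.22.

```python
# Hedged numerical check of Equations 4.21-4.22.
J, b, w = 2.2e-3, 2.0, 120.0

h1 = (4*b - 2) * w                               # = 720
h0 = (6*b**2 - 8*b + 3) * w**2                   # = 158400
f1 = J * (4*b**3 - 12*b**2 + 12*b - 4) * w**3    # = 15206.4
f0 = J * (b**4 - 6*b**2 + 8*b - 3) * w**4        # = 2280960
print(h1, h0, f1, f0)   # C1(s) = (f1*s + f0)/(s^2 + h1*s + h0), Eq. 4.22
```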
As shown by Cogan and de Paor [66], assigning the four roots of the characteristic polynomial to the same location gives the system the interesting property of optimum stability: if all controller parameters but one are held at their nominal values then, as that one is varied through its nominal value, the right-most root is as deep in the left half plane as possible. The four parameters of this controller are $f_1$, $f_0$, $h_1$ and $h_0$. To demonstrate the principle of optimum stability, three of these parameters are held at their nominal values in turn and the fourth is varied. The root loci for the four parameters are included in Appendix D. It can be seen that in each case, as the parameter passes through its nominal value, the system is optimally stable in the sense that its right-most eigenvalue is as deep in the left half plane as possible.
The effect of the muscle spindle in speeding up the eyeball response can be observed by comparing the unit step responses of the overall transfer function of the inner loop, $G_i(s)$, and the transfer function of the eyeball dynamics alone, G(s). Note that these two transfer functions have different static gains.

$$G_i(s) = \frac{G(s)C_1(s)}{1 + G(s)C_1(s)} = \frac{454.55(16896s + 2304000)}{(s + 120)^2(s^2 + 800s + 160000) + 454.55(16896s + 2304000)} \qquad (4.24)$$

$$\Rightarrow G_i(0) = \frac{454.55(2304000)}{(120)^2(160000) + 454.55(2304000)} = 0.3125 \qquad (4.25)$$

$$\text{and } G(0) = \frac{454.55}{(120)^2} = 0.0316 \qquad (4.26)$$
For comparison purposes G(0) is scaled to have the same value as $G_i(0)$. The two unit step responses are shown in Figure 4.16. The effect of the muscle spindle in speeding up the step response is evident: with the influence of the muscle spindle, the system reaches its steady state value of 0.3125 after 0.0267 s, compared to 0.0701 s without the muscle spindle included.
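This comparison can be reproduced with standard tools. The following is a hedged sketch (Python with scipy; the thesis work itself used MATLAB) that forms $G_i(s)$ from Equations 4.18 and 4.35 and computes both step responses and static gains.

```python
# Hedged sketch reproducing the comparison shown in Figure 4.16.
import numpy as np
from scipy import signal

G = signal.TransferFunction([454.55], [1, 240, 14400])             # Eq. 4.18
C1 = signal.TransferFunction([16896, 2304000], [1, 800, 160000])   # Eq. 4.35

# Close the inner loop: Gi = C1*G / (1 + C1*G)
num = np.polymul(C1.num, G.num)
den = np.polyadd(np.polymul(C1.den, G.den), num)
Gi = signal.TransferFunction(num, den)

t = np.linspace(0, 0.12, 2000)
_, y_i = signal.step(Gi, T=t)     # with the muscle spindle
_, y_g = signal.step(G, T=t)      # eyeball dynamics alone
print(y_i[-1], y_g[-1])           # static gains: ~0.3125 and ~0.0316
```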
As seen in Equation 4.23, the poles of the transfer function $C_1(s)$ are complex. However, work by Stark [64] strongly suggests that both poles should be real. This can be demonstrated by considering the muscle spindle model in Figure 4.17. The firing rate of the muscle spindle, f, is controlled by $x_2$, the stretched length of the second spring in the model. B(s) is a biochemical length transducer with a transfer function of the form $\frac{k}{1 + s\tau}$, basically a low pass filter with gain k and time constant τ. The total stretch on the central part of the muscle spindle, which is called the nuclear bag, is the sum of the two individual spring stretches, $x_1 + x_2$.
Figure 4.16: Step response of the eye. The transfer function Gi(s) includes the influence of the muscle spindle; the transfer function G(s) is the eyeball dynamics alone, without this influence. The effect of the muscle spindle is to speed up the eyeball response.
Figure 4.17: Model of the nuclear bag, the central part of the muscle spindle. ($\lambda_1$, $\lambda_2$: spring stiffnesses; $x_1$, $x_2$: stretches of the springs; f: firing rate on the spindle nerve; D: damper; B(s): biochemical length transducer.)
In this model, $\frac{F(s)}{X_1(s) + X_2(s)}$ is equivalent to the muscle spindle controller described above, $C_1(s)$, so an expression for this ratio must be found that will demonstrate that the poles of the transfer function are real. From examination of Figure 4.17, an expression for F(s) in terms of $X_2(s)$ can be directly written down:

$$F(s) = \frac{k}{1 + s\tau}\,X_2(s) \;\Rightarrow\; X_2(s) = \frac{1 + s\tau}{k}\,F(s) \qquad (4.27)$$
The relationship between $X_2(s)$ and $X_1(s)$ can be found by balancing the forces at the point P and taking the Laplace transform with zero initial conditions:

$$\lambda_1 x_1 + D\frac{dx_1}{dt} = \lambda_2 x_2 \qquad (4.28)$$

$$\Rightarrow \lambda_1 X_1(s) + DsX_1(s) = \lambda_2 X_2(s) \;\Rightarrow\; X_1(s) = \frac{\lambda_2}{\lambda_1 + Ds}\,X_2(s) \qquad (4.29)$$
Using this equation and Equation 4.27, an expression for $X_1(s) + X_2(s)$ in terms of F(s) can be found:

$$X_1(s) + X_2(s) = \left(\frac{\lambda_2}{\lambda_1 + Ds} + 1\right)X_2(s) = \frac{\lambda_1 + \lambda_2 + Ds}{\lambda_1 + Ds}\,X_2(s) = \frac{\lambda_1 + \lambda_2 + Ds}{\lambda_1 + Ds} \cdot \frac{1 + s\tau}{k}\,F(s) \qquad (4.30)$$

$$\Rightarrow \frac{F(s)}{X_1(s) + X_2(s)} = \frac{k(\lambda_1 + Ds)}{(\lambda_1 + \lambda_2 + Ds)(1 + s\tau)} \qquad (4.31)$$
This transfer function is $C_1(s)$, the muscle spindle controller, which can be written in the following form:

$$C_1(s) = \frac{\frac{k}{\tau}\left(s + \frac{\lambda_1}{D}\right)}{\left(s + \frac{\lambda_1 + \lambda_2}{D}\right)\left(s + \frac{1}{\tau}\right)} \qquad (4.32)$$
Since λ1 , λ2 , D and τ are all real-valued, this transfer function has real poles,
unlike the muscle spindle transfer function estimate given in Equation 4.23.
To obey this restriction, the original transfer function was replaced by a close approximation which has real roots. The denominator of $C_1(s)$ was approximated as $s^2 + 800s + 160000 = (s + 400)^2$, which is close to the denominator in Equation 4.23 but has real roots. Then $f_0$ was scaled to preserve the static gain $C_1(0)$; the scaled value of $f_0$ is $\hat{f}_0$:

$$C_1(0) = \frac{f_0}{h_0}: \quad \frac{\hat{f}_0}{160000} = \frac{2280960}{158400} \;\Rightarrow\; \hat{f}_0 = 2304000 \qquad (4.33)$$

Similarly, $f_1$ is scaled to $\hat{f}_1$ to preserve the ratio $\frac{f_1}{h_1}$:

$$\frac{\hat{f}_1}{800} = \frac{15206.4}{720} \;\Rightarrow\; \hat{f}_1 = 16896 \qquad (4.34)$$

Therefore the expression for $C_1(s)$ actually used was

$$C_1(s) = \frac{16896s + 2304000}{s^2 + 800s + 160000} \qquad (4.35)$$
To give an idea of the accuracy of the approximation, a comparison of the unit step responses of the two controllers and of their Bode magnitude diagrams is shown in Figure 4.18.
Outer Loop of Model

As shown in Figure 4.15, the outer loop of the model consists of a non-linear element NL(e) followed by a controller $C_2(s)$. This is based on the model proposed by Stark [64] for saccadic eye movements, with several important differences. The controller in Stark's model is a pure integrator and this has been kept in the model presented here, so $C_2(s) = k_i/s$. The non-linear element in Stark's model has a dead-zone, i.e. NL(e) = 0 for |e| < 0.03. This feature is preserved, since it represents the fact that no correction need be applied if the image remains focused on the fovea.

Figure 4.18: (a) Unit step responses and (b) Bode magnitude diagrams of the muscle spindle controllers. Blue lines: original transfer function, Equation 4.23. Red lines: modified transfer function, Equation 4.35.

However, for errors outside this
region, Stark's model has a linear variation of NL(e) with e, and with this linear variation it does not appear possible to reproduce the linear transition of the saccadic response in Figure 4.7(b). Since $C_2(s)$ is an integrator, this suggests that NL(e) is saturated for |e| > 0.03, so that the input to the inner loop (the muscle spindle controller) ramps up linearly during the transition. Thus the function NL(e) used in this model is as follows:

$$NL(e) = \begin{cases} S\,\mathrm{sgn}(e) & \text{for } |e| > 0.03 \\ 0 & \text{otherwise} \end{cases} \qquad (4.36)$$
From inspection of Figure 4.7(b), it is clear that the rate of change of θ is governed by the product $Sk_i$. Therefore S = 1 can be chosen arbitrarily, and $k_i$ adjusted so that the slope of the model's response matches the empirical data. A value of $k_i$ = 19 was found to give a simulated response close to the measured response.
Stark’s model also has a pure time delay element of 0.235s within the outer
loop. However, in simulations, this delay leads to sustained oscillations of the
system. This may be prevented by placing the delay element outside the outer
loop, as shown in Figure 4.15. Increasing the delay to 0.267s gives the closest
match between the actual EOG response recorded in Figure 4.7(b) and the
simulated response. The two responses are shown superimposed on each other
in Figure 4.19. The real response is scaled vertically for comparison purposes.
In the simulated model, the input is 0.2618 radians (15◦ ). The simulation was
done in Simulink, part of the MATLAB environment. The Simulink model
used is given in Appendix B. This model is summarised in the paper [67].
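For readers without Simulink, the following is a rough forward-Euler sketch (Python with numpy; not the author's Simulink model of Appendix B) of the saccadic loop of Figure 4.15, using the parameter values derived above. The block ordering and the state-space realisations of $C_1(s)$ and G(s) are assumptions based on the description in the text.

```python
# Hedged Euler simulation of the saccadic loop: delayed step -> deadzone/
# saturation NL -> integrator C2 = ki/s -> inner loop [C1(s) -> G(s), with
# local feedback of theta into the spindle error el].
import numpy as np

dt, T_end, L = 1e-5, 0.5, 0.267        # step (s), duration (s), delay (s)
ki, S, dead = 19.0, 1.0, 0.03          # integrator gain, saturation, deadzone
theta_ref = 0.2618                     # 15 degree target jump (rad)

# C1(s) = (16896 s + 2304000)/(s^2 + 800 s + 160000), controllable canonical
A1 = np.array([[0.0, 1.0], [-160000.0, -800.0]]); B1 = np.array([0.0, 1.0])
Cc1 = np.array([2304000.0, 16896.0])
# G(s) = 454.55/(s^2 + 240 s + 14400)
A2 = np.array([[0.0, 1.0], [-14400.0, -240.0]]); B2 = np.array([0.0, 1.0])
Cg = np.array([454.55, 0.0])

x1, x2 = np.zeros(2), np.zeros(2)      # states of C1 and G
theta_rl = 0.0                         # integrator output (local reference)
n = int(T_end / dt)
theta = np.zeros(n)
for k in range(1, n):
    ref = theta_ref if k * dt >= L else 0.0          # delayed step input
    e = ref - theta[k - 1]
    nl = S * np.sign(e) if abs(e) > dead else 0.0    # deadzone + saturation
    theta_rl += ki * nl * dt                         # C2(s) = ki/s
    el = theta_rl - theta[k - 1]                     # muscle spindle error
    x1 += (A1 @ x1 + B1 * el) * dt                   # spindle controller C1(s)
    x2 += (A2 @ x2 + B2 * (Cc1 @ x1)) * dt           # eyeball dynamics G(s)
    theta[k] = Cg @ x2
print("final eye angle (rad):", theta[-1])  # settles within the deadzone of the target
```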
Model for Smooth Pursuit Movements
As shown, a Simulink simulation of the model described gives a response to
a 15◦ input which is very close to the measured EOG response to a saccadic
eye movement of 15°.

Figure 4.19: Actual EOG and simulated saccadic responses to a target that suddenly jumps by approximately 0.2618 rad.

However, for smooth pursuit movement, the system as described does not work so well. Figure 4.7(a) shows the measured EOG
response to a sinusoidally varying target. However, when a sinusoid is set as
the input to the control loop in Figure 4.15, simulation gives an output with
“jumpy” steps not apparent in the measured EOG response. Clearly, something
else is also happening to give the response measured in Figure 4.7(a). In the
model proposed here, when following predictable target movements such as
sinewaves, there is an extra input from the brain, along the path labelled
“possible signal from the brain” in Figure 4.15. This signal is synthesised in
such a way as to follow the predictable reference input with essentially zero
(or very small) error.
The model presented in Figure 4.15 is modified for smooth pursuit movements as shown in Figure 4.20. When the eye is moving in a saccadic fashion
this model is the same as the original model in Figure 4.15. However, when
the eye is in smooth pursuit of an object, this model proposes that the signal
labelled θref passes through two additional blocks. Firstly, it passes through a
“Predictor” block which effectively cancels out the effect of the time delay. For
a sinewave input this would be a phase advance network, to cancel the phase
lag introduced by the delay. The signal θref is adjusted by the brain in such a
way that the signal e is confined to the deadzone of the nonlinearity NL(e) i.e.
the image remains focused on the fovea. Therefore the output from the NL(e)
block is always zero and hence the portion consisting of NL(e) and C2 (s) is
effectively disconnected while the eye is in smooth pursuit of a target. The
system is therefore responding as if it were just the inner loop.
This model also proposes that the signal θref goes through another block
which is adjusted accordingly by the brain to make θref = θ. With the outer
loop disconnected the modified model is as shown in Figure 4.21. Therefore the
additional block H(s) is effectively an inverse model of the inner loop Gi (s).
Figure 4.20: Feedback control loop, modified to include the effects of smooth pursuit movements.

If $\theta_{ref}$ is a sinusoid,

$$\theta_{ref} = \theta_m \sin(\omega t) \qquad (4.37)$$

then at the frequency ω, H(s) behaves as if it had the transfer function $\frac{1}{G_i(\omega)}$, i.e. its gain is $\frac{1}{|G_i(\omega)|}$ and its phase is $-\angle G_i(\omega)$, which gives an output from the predictor block $\hat{\theta}$:
$$\hat{\theta} = \frac{\theta_m}{|G_i(\omega)|} \sin(\omega t - \angle G_i(\omega)) \qquad (4.38)$$
The Bode magnitude and phase of $G_i(\omega)$ were plotted in MATLAB and can be seen in Figure 4.22. At the frequencies in question for smooth pursuit (ω < 100 rad/s), the gain of this transfer function is constant and the phase difference is approximately zero. It can be seen from Figure 4.22 that the gain of $G_i$ at low frequencies is around −10 dB, i.e. $10^{-0.5} \approx 0.316$. This is approximately equal to the static gain figure of 0.3125 calculated in Equation 4.25.

For initial tests of this model, the input from the outer loop was disconnected and H(s) was modelled by a simple gain block. The Simulink model used is given in Appendix B, with H(s) replaced by a gain block of value $\frac{1}{0.3125}$. The input sinusoid was chosen to match as closely as possible the measured EOG signal in Figure 4.7(a), which is following a target oscillating at 0.8 Hz, or 5.027 rad/s. The results are plotted in Figure 4.23.
As can be seen from Figure 4.23(b), there is a very small offset between the input and output of this model. This is due to approximating H(s) by a gain block, which neglects any phase difference introduced by this block. The Bode phase plot in Figure 4.22 shows that $G_i(s)$ introduces a slight change in phase with frequency, and the inverse model H(s) should therefore also account for this. However, there is a problem with modelling H(s) as the inverse transfer function of $G_i(s)$. The inverse transfer function of $G_i(s)$ can be written by inverting Equation 4.24:

$$\frac{1}{G_i(s)} = \frac{(s + 120)^2(s^2 + 800s + 160000) + 454.55(16896s + 2304000)}{454.55(16896s + 2304000)} \qquad (4.39)$$
This is an "improper" transfer function, i.e. its numerator is of higher degree than its denominator.

Figure 4.21: Modified loop for smooth pursuit.

Figure 4.22: Magnitude and phase Bode plots for $G_i(s)$.

Figure 4.23: The input and output of the smooth pursuit model whose Simulink code is shown in Appendix B. The two follow each other very closely, although a very slight offset is present, observable in the close-up in (b).

A proper rational function, which would be characteristic of a realisable system, may be approximated by placing a fourth order low pass filter in cascade with the inverse transfer function:
$$H(s) = \frac{1}{G_i(s)(1 + s\tau_i)^4} \qquad (4.40)$$
$\tau_i$ may be chosen so that, for the frequencies of interest, $\omega\tau_i \ll 1$; $\tau_i$ = 0.0002 s was considered sufficient to pass eye movements due to smooth pursuit.
In simulation of the full model, the predictor was also modelled as a phase lead network, i.e. as a predictor P(s) with a transfer function of the form

$$P(s) = \frac{k_d s \tau_d}{1 + s\tau_d} \qquad (4.41)$$
At the frequency of the input, the predictor should cancel out the effect of the time delay element L, i.e. the phase lead of the predictor should equal the phase lag ωL of the delay:

$$\frac{\pi}{2} - \tan^{-1}(\omega\tau_d) = \omega L \qquad (4.42)$$

With ω = 5.027 rad/s and L = 0.267 s, $\tau_d$ = 0.0443 s. $k_d$ is then chosen so that the gain of the predictor at ω = 5.027 rad/s is 1:

$$\frac{k_d \times 5.027 \times 0.0443}{\sqrt{1 + (5.027)^2\tau_d^2}} = 1 \;\Rightarrow\; k_d = 4.6004 \qquad (4.43)$$
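The design procedure of Equations 4.42-4.43 is easy to reproduce. The sketch below (plain Python, an illustration rather than the thesis code) solves the phase condition for $\tau_d$ and the gain condition for $k_d$; exact outputs depend on the rounding used for ω and L, so they may differ slightly from the figures quoted above.

```python
# Hedged sketch: predictor parameters from Equations 4.42-4.43.
import math

def predictor_params(w: float, L: float):
    # Phase condition (4.42): pi/2 - atan(w*tau_d) = w*L
    tau_d = math.tan(math.pi / 2 - w * L) / w
    # Gain condition (4.43): kd*w*tau_d / sqrt(1 + (w*tau_d)^2) = 1
    kd = math.sqrt(1 + (w * tau_d) ** 2) / (w * tau_d)
    return tau_d, kd

print(predictor_params(5.027, 0.267))   # the values used in the text
```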
These values were used to model the predictor block in Simulink. It was found that the error e was not always exactly zero; however, it was always within the deadzone of the non-linear element (i.e. |e| < 0.03), thus the output of this block was zero and the portion consisting of NL(e) and $C_2(s)$ was effectively disconnected while the eye was in smooth pursuit, as hoped.
Development of these mathematical models which describe saccadic and
smooth pursuit eye movements led to a greater understanding of the physical
processes involved during motion of the eyes. This knowledge was invaluable
when considering how vestigial eye movements could be harnessed for communication and control purposes for physically disabled persons. In particular,
the study on smooth pursuit eye movements was beneficial when developing
the Target Position Variation algorithm described in Section 4.2.5.
4.3 Electrodermal Activity as a Control Signal

4.3.1 Introduction
The field of electrodermal activity (EDA) includes any electrical activity measurable on the surface of the skin. Electrical signals that are commonly measured are the skin resistance, the skin conductance, the skin impedance and the skin potential. These signals are modified by activation of the sweat glands, which changes the electrical properties of the skin. Although the sweat glands are part of the autonomic nervous system and therefore not usually considered to be under voluntary control, they can be consciously activated under certain circumstances.

The main function of the sweat glands is thermoregulation, but they are also activated by emotional stress. Thus they may be consciously activated by willing oneself into a tense state. When sweat glands are activated, the skin resistance decreases. The time taken for the skin resistance to decrease following a conscious effort to tense can be quite slow; it can take about 2-7 s for the response to reach its peak. Recovery time is considerably slower and varies greatly; it can take anywhere from 1-30 s to return to 50% of baseline [68]. The length of time taken to produce a response to a voluntary
action makes electrodermal activity an extremely slow method of control and
communication for severely disabled people. However, in some circumstances it may be the only feasible option. Since the time taken to return to baseline is quite long, it may be better suited to applications where many switching actions are not required in quick succession, such as environmental control.
The anatomy and physiology of the skin is briefly discussed before electrodermal activity is examined in more detail. The feasibility of using skin resistance as a control signal for people who are severely disabled is explored through some experiments in which voluntary responses are elicited. Finally, a non-invasive method of measuring the firing rate of the sympathetic nervous system from measurement of the skin conductance is proposed, based on a model developed by Burke [69] for the skin conductance.
4.3.2 Anatomy and Physiology of the Skin

The Autonomic Nervous System
The autonomic nervous system (ANS) has already been briefly described in the last chapter, where the different types of nerve fibres in the body were discussed. The visceral afferent nerve fibres and the autonomic efferent nerve fibres are the two types of nerve fibres of the ANS. The ANS controls automatic functions of the body such as arterial pressure, sweating and body temperature, and is not normally under voluntary control. The ANS is made up of three separate systems: the sympathetic system, the parasympathetic system and the enteric system. The enteric system, responsible for the gut region of the body, is usually considered separately. The sympathetic nervous system and the parasympathetic nervous system are both responsible for transmitting ANS signals through the body. The two systems operate in antagonism; for example, the sympathetic nervous system is responsible for pupil dilation while the parasympathetic nervous system controls pupil constriction. Sweat gland activity is controlled by the sympathetic nervous system.
The Sweat Glands
Skin all over the human body contains sweat glands, which are activated by the
sympathetic nervous system. There are two types of sweat glands in the body,
apocrine sweat glands and eccrine sweat glands. Apocrine sweat glands are
found in the underarms and on the genitals, around the nipples and the navel.
The sweat produced by apocrine glands is a thick viscous fluid. The exact
function of the apocrine sweat glands is not fully understood although they
are thought to play a role in producing sexual scent hormones (pheromones)
[70]. They do not play a role in electrodermal activity so are not of much
importance here.
Eccrine sweat glands are found on the surface of the skin over most of the
body, but are especially dense on the palms of the hands (palmar surfaces)
and the soles of the feet (plantar surfaces). The eccrine sweat glands secrete
a thinner, more watery fluid than the apocrine sweat glands. Their main
function is thermoregulation. The sweat glands secrete moisture onto the
surface of the skin, which is evaporated into the air. The process of evaporation
requires latent heat, which is taken from the skin thus cooling the skin’s surface.
Eccrine sweat glands are also activated in response to emotional arousal, such
as anxiety or rage. The reason for emotional eccrine sweating is thought to
be a product of the evolutionary “fight or flight” response [69]. Emotional
sweating improves grip, gives greater tactile sensitivity and increases the skin’s
resistance to cutting or abrasion of the skin. The eccrine sweat glands on the
palmar and plantar surfaces respond more strongly to emotional stimuli than
heat stimuli, whereas the opposite is true of the glands found on the forehead,
neck and back of the hands.
Figure 4.24 shows the main parts of an eccrine sweat gland through the
different layers of skin. Skin consists of three main layers, the dermis, the
epidermis and the subdermis. The secretory portion of a sweat gland is located
in the subdermis. When a sweat gland is activated, this portion fills up with
sweat, which is a mixture of salt, water and urea. This fluid travels up through
the ducts in the dermis layer and deposits moisture on the skin’s surface,
through the sweat pores in the epidermis.
4.3.3 Electrodermal Activity
The study of electrodermal activity (EDA), or the electrical responses measurable at the surface of the skin, is common in the field of psychophysiology [71].
Figure 4.24: Sweat gland. The stratum corneum acts as a variable resistor with decreased resistance due to sweat.
There are two main approaches to measuring EDA, the endosomatic method
and the exosomatic method [72]. Activation of the sweat glands, either by an
increase in body temperature or in response to an emotional stimulus, causes
both an increase in skin conductance (or a decrease in skin resistance) and a
change in the skin potential. Either of these features can be used to measure
the degree of sweat gland activation. The endosomatic method measures these
changes in potential at the skin surface. The exosomatic method measures
the skin resistance or conductance across an area of skin. The endosomatic
method typically uses invasive electrodes [73] so the exosomatic method is used
here. This will now be described in more detail.
The change in skin conductance due to sweat gland activity was first observed by Féré in 1888. It is now a well-observed phenomenon that skin conductance increases when a person becomes emotionally aroused [69]. As already mentioned, when a person becomes anxious or angry, the "fight or flight" response is invoked, which activates the eccrine sweat glands. The skin conductance increases when the sweat glands are activated due to a number of factors, but largely because sweat is the equivalent of a 0.3% NaCl solution,
and hence a weak electrolyte [74]. Skin is a good conductor of electricity and
skin conductance can be measured by passing a small current through an area
of the skin. The number of sweat glands that deposit sweat on the skin surface
is approximately proportional to the number of conductive pathways on the
skin’s surface and thus measuring the skin conductance gives some indication
of the number of sweat glands that are active [72].
Skin conductance is usually classified into two components: the tonic skin conductance level and the phasic skin conductance response. The tonic component is the baseline level of skin conductance, also called the Skin Conductance Level (SCL). The phasic component is the response of the skin conductance to an event, also called the Skin Conductance Response (SCR). SCRs may last 10-20 s before returning to the SCL.

A number of different models of the process of skin conductance exist; the most widely accepted of these is the Sweat Circuit Model (also known as the Poral Valve Model), developed by Edelberg in 1972 [68]. According to this model, when the sweat ducts begin to fill with sweat, the skin conductance increases, which causes the phasic skin conductance response. The skin conductance returns to its tonic level when the sweat is deposited on the skin or reabsorbed by the sweat glands.
4.3.4 Skin Conductance as a Control Signal
Some preliminary experimental data was obtained to explore the possibility of using conductance of the skin as a control signal. The circuit used is given in Appendix E. It uses two electrodes, which are attached to the medial phalanx of the index and middle fingers. The voltage output of the circuit is proportional to the skin conductance. The data is read into the computer using the NIDAQ 6023E data acquisition card and sampled in MATLAB at 200 Hz.
Figure 4.25: Electrodermal response. The subject is asked to tense up after each 50 s interval.

For the experimental setup used and the particular position of the variable
resistor, the skin conductance G, in Siemens, may be calculated from the voltage $e_0$ at the output of the circuit according to the following equation, which is obtained from analysis of the circuit in Appendix E:

$$G = \frac{1.405 - e_0}{5.252} \times 10^{-5} \ \mathrm{S} \qquad (4.44)$$
The measured data is shown in Figure 4.25. The subject was asked to tense up at t = 50 s, t = 100 s, t = 150 s, t = 200 s and t = 250 s. It appears that at t = 100 s the subject did not tense up, as it was perceived that the signal had not returned sufficiently to baseline at that point. At all of the other points where the subject tenses, a noticeable rise in the skin conductance is evident. This is a promising indication of the potential of skin conductance as a control signal for people who are very severely disabled.
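To illustrate how such a signal might drive a switch, the following is a hedged sketch (Python with numpy; not the thesis software) that applies Equation 4.44 to the sampled output voltage and flags a switching action when the conductance rises a margin above a running baseline. The margin and baseline window are illustrative assumptions.

```python
# Hedged sketch: skin conductance as a binary switching signal.
import numpy as np

fs = 200  # sampling rate used in the text (Hz)

def conductance(e0):
    """Equation 4.44: circuit output voltage (V) -> skin conductance (S)."""
    return (1.405 - np.asarray(e0)) / 5.252 * 1e-5

def detect_tense_events(g, margin=2e-7, baseline_win=10 * fs):
    """Indices where g first rises `margin` above a running median baseline.
    After an event, re-arm only once g falls back towards the baseline,
    reflecting the slow recovery of the response."""
    events, armed = [], True
    for i in range(baseline_win, len(g)):
        baseline = np.median(g[i - baseline_win:i])
        if armed and g[i] > baseline + margin:
            events.append(i)
            armed = False
        elif not armed and g[i] < baseline + margin / 2:
            armed = True
    return np.array(events)
```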
4.3.5 Non-invasive Measurement of the Sympathetic System Firing Rate
Arising from the work on electrodermal activity, a non-invasive measurement
technique for the firing rate of the sympathetic nervous system was developed.
While this measurement technique is not directly related to communication
and control for disabled people, it evolved as a side interest based on study
of skin conductance and will be described here. It uses a model for the skin
conductance developed by Burke [69], which is shown in Figure 4.26. This
model has as its input the firing rate of the sympathetic nervous system, f ,
and as its output, the skin conductance, g. The output of this system, g, is
observable through measurement techniques such as the one described above.
However, the firing rate f is usually unobservable unless invasive instrumentation is used. A novel technique is described here which will allow the firing
rate f to be observed based on the measured skin conductance g.
A block diagram of the proposed measurement technique is shown in Figure 4.27. The sympathetic nervous system firing rate f is required and the skin conductance g is directly measurable. The parameters of the controller C(s) were tuned to get $g_m$, the output of the skin conductance model, to follow the measured skin conductance g as closely as possible. If the controller can be adjusted so that

$$e = g - g_m = 0 \qquad (4.45)$$

then it is hypothesised that the output of the system, y, will be equal to the unobservable parameter f.
The controller C(s) used is a PID controller whose transfer function is of the form

$$C(s) = k_p + \frac{k_i}{s} + \frac{k_d s}{1 + s\tau} \qquad (4.46)$$
Figure 4.26: Model of skin conductance, from [69].

Figure 4.27: Proposed loop which allows skin conductance measurements to be used to observe the sympathetic system firing rate (f: firing rate, immeasurable non-invasively; g: skin conductance, measurable non-invasively).

To tune the controller, the loop in Figure 4.27 was set up with the real input block replaced by the skin conductance model, and an artificially synthesised firing rate was input into this block. It was hypothesised that if the controller could be tuned for this artificial firing rate, which had very severe variations, then it should also work with the measured input. The values that gave the best match are as follows:

$$k_i = 12, \quad k_p = 90, \quad k_d = 10, \quad \tau = 0.1 \qquad (4.47)$$
Using these parameters, the complete loop was simulated in Simulink; the full Simulink block diagram is given in Figure B.4 in Appendix B. The input to the system was the measured skin conductance of Figure 4.25. The model developed by Burke [69] expresses skin conductance in the standard unit of microsiemens (µS), so the measured data was scaled accordingly. The results of this simulation look very promising. A graph showing the measured skin conductance g and the output of the model $g_m$ is given in Figure 4.28; after a transient, the modelled value follows the measured value of skin conductance almost perfectly.

The firing rate y produced at the output of the model is shown in Figure 4.29 with the measured skin conductance. The measured skin conductance was from the experiment described in Section 4.3.4, where the subject was asked to tense up at 50 s intervals. The increases in firing rate at the times when the subject attempted to tense up are clearly visible.
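Burke's skin conductance model itself is not reproduced in this chapter, so the following sketch (Python with numpy) substitutes an assumed first-order lag purely to illustrate the observer loop structure of Figure 4.27; only the PID gains of Equation 4.47 come from the text, and everything else is a labelled stand-in.

```python
# Hedged sketch of the observer loop of Figure 4.27. The sweat-gland model
# below is NOT Burke's model [69]; it is an assumed first-order lag used only
# to show the loop structure: PID output y drives the model, and y estimates
# the firing rate once the model output g_m tracks the measured g.
import numpy as np

dt = 0.005
ki, kp, kd, tau = 12.0, 90.0, 10.0, 0.1          # Equation 4.47

def model_step(g_m, y, k=1.0, T=5.0):
    """Assumed stand-in dynamics: T*dg_m/dt = k*y - g_m."""
    return g_m + dt * (k * y - g_m) / T

t = np.arange(0, 300, dt)
g = 0.5 + 0.4 * (np.sin(2 * np.pi * t / 100.0) > 0)   # synthetic "measured" g

g_m, integ, x_d = 0.0, 0.0, 0.0
y_est = np.zeros(len(t))
for i in range(len(t)):
    e = g[i] - g_m                      # error between measured g and model output
    integ += ki * e * dt                # integral term ki/s
    x_d += dt * (e - x_d) / tau         # filter state for kd*s/(1 + s*tau)
    y = kp * e + integ + (kd / tau) * (e - x_d)   # PID output = estimated f
    g_m = model_step(g_m, y)
    y_est[i] = y
```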
Figure 4.28: The modelled value of skin conductance, $g_m$, and the measured value of skin conductance, g.
Figure 4.29: The input to the model, g, the skin conductance measured experimentally, and the output of the model, y, the sympathetic nervous system firing rate.
4.4 Conclusions
This chapter has described two biosignals, the electrooculogram and the conductance of the skin. In discussion of the electrooculogram, some alternatives
to the electrooculogram for measuring eye movements were described, including the magnetic search coil technique, corneal reflection technique and limbus
tracking. Some advantages of using the electrooculogram instead of these other
methods to measure eye movements were then given. The EOG may have a
larger range than visual methods and it is generally cheaper to implement. The
EOG amplitude is linearly related to the eye angle for small angles and may
be used in real-time applications. It is not affected by obstacles such as glasses
in front of the eye and it permits eye closure. Depending on the application,
it may permit head movements and may be used in variable lighting conditions. Some limitations of the EOG are also discussed, the main one being
the problem with baseline drift. This usually requires manual re-calibration
of the amplifiers when it occurs. A novel method called Target Position Variation is developed, which enables automatic software re-calibration of the eye
position when baseline drift is evidenced. Target Position Variation looks like
a promising approach that can be used with the EOG to provide automatic
re-calibration of the eye position or to use in a menu selection program. A
control model for the eye is also developed which models a saccadic and a
smooth pursuit eye movement. The saccadic model fits well with experimental
data.
Electrodermal activity is also briefly explored as a control signal. Conductance of the skin was chosen as the electrodermal phenomenon to measure.
The user can elicit a voluntary change in their skin conductance by tensing
up or imagining themselves in a state of stress or anger. The time taken for
the response to return to baseline is very slow, meaning it would only be a
feasible option in cases of very severe disabilities where there is no preferable
alternative. A method for measurement of the sympathetic nervous system
firing rate is also described. Results seem to indicate that this method could
be used to provide a low-cost, non-invasive tool for monitoring the firing rate,
possibly for clinical applications.
Chapter 5

Visual Techniques

5.1 Introduction
This chapter describes visual techniques for obtaining vestigial signals from the
body. These techniques detect movements using computer cameras or other
light detection devices. For various reasons, measurement of body signals by
the methods already discussed may not always be suitable or possible. For
example, a person requiring the use of a control or a communication system
may find it uncomfortable to have electrodes affixed to the skin. This may be
especially true when dealing with children. Also, disabled people often do not
like to use anything that visibly draws attention to their disability. Thirdly,
electrode based systems may be impractical if the person is prone to heavy
perspiration, as it may be difficult to keep the electrodes in place. Finally,
clothing may make it impractical to measure the movement that the person is
capable of making using the EMG or MMG.
Visual techniques can offer disabled people a non-contact solution for computer interaction. Often a person who has become disabled will retain the
ability to make slight movements or rotations of a finger or toe. If these
movements are repeatable then they may be used to indicate intent through
observation of these movements with a computer camera. This chapter firstly
describes some video analysis techniques that have been developed by others
in Section 5.2. A system developed as part of this thesis to investigate visual
methods of detecting vestigial flickers of movement is then described in Section
5.3.
5.2 Visual Based Communication and Control Systems
Two applications for communication and control for disabled people using a
computer camera developed by others will now be described. Both of these
are based on tracking movement of a body part to control a mouse cursor
on screen. The first system, the “Camera Mouse”, uses a template matching
technique to track a particular body part. The second system tracks motion
based on movement of the reflected laser speckle pattern of skin.
5.2.1 The Camera Mouse
The “Camera Mouse” is a visual based system developed by Betke, Gips and
Fleming [75] to provide computer access for people with severe disabilities.
The system developed tracks movements of a particular body feature and uses
this to control movement of a mouse cursor on screen. Various body features
are explored in [75], including the tip of the nose, the lips, the thumb and
the eyes. The system developed uses two computers, a vision computer (the
“tracker”) and a user computer (the “driver”). The vision computer receives
the video signals, interprets the data and sends the appropriate control signal
to the user computer. Initial setup of the system is performed on the vision
computer. The user or a helper clicks on the body feature in the image that
is to be tracked, and adjusts the camera pan and tilt angles and zoom so that the desired body feature is in the centre of the image.

Figure 5.1: Camera Mouse search window, from [75].

A template of the
body feature to be tracked is stored and subsequently used to determine the
co-ordinates of the body feature position. These co-ordinates are sent to the
user computer to control the mouse position. A search for a template match
is performed within a search window. The search area is centred around the
estimated position of the feature in the previous frame, as shown in Figure 5.1.
The template is shifted around the window and the correlation between the template and each test area is calculated. The location of best correlation becomes the new estimate of the feature position, which is used to determine the mouse cursor position and serves as the centre of the search window in the next frame. A mouse click is performed by "dwelling" on a region for a certain length of time.
The nose was found to be a very reliable feature to track, as it tends to be
brighter than the rest of the face and does not become occluded when the user
moves their head significantly. The eyes have a distinctive template but may
be difficult to move while simultaneously viewing the screen and they also may
often be blocked by the nose. The lower lip and lip cleft have a good brightness
difference and hence are a good feature to use, particularly for users who may
not have the ability to move their head. The thumb did not work very well,
as it was difficult to focus the camera on the centre of the thumb.
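The window-limited template search described above can be sketched compactly. The following (Python with numpy) is an illustrative sketch rather than the Camera Mouse implementation itself, whose details are in [75]; the window size, greyscale frames and normalised correlation score are assumptions.

```python
# Hedged sketch of window-limited template matching by normalised correlation.
import numpy as np

def track_feature(frame, template, prev_xy, search_radius=16):
    """frame: 2-D greyscale array; template: 2-D patch; prev_xy: top-left
    corner of the previous match. Returns the new top-left corner estimate."""
    th, tw = template.shape
    px, py = prev_xy
    tz = template - template.mean()               # zero-mean template
    tnorm = np.linalg.norm(tz)
    best_score, best_xy = -np.inf, prev_xy
    for dy in range(-search_radius, search_radius + 1):
        for dx in range(-search_radius, search_radius + 1):
            x, y = px + dx, py + dy
            if x < 0 or y < 0 or y + th > frame.shape[0] or x + tw > frame.shape[1]:
                continue                          # candidate falls off the frame
            patch = frame[y:y + th, x:x + tw].astype(float)
            pz = patch - patch.mean()
            denom = tnorm * np.linalg.norm(pz)
            score = (tz * pz).sum() / denom if denom > 0 else -np.inf
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy    # also the centre of the next frame's search window
```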
5.2.2 Reflected Laser Speckle Pattern
This system, developed by Reilly and O'Malley [76], detects motion for communication and control using the reflected laser speckle pattern. When a laser beam is shone on a scattering object, a speckled
pattern will be reflected due to the roughness of the surface, as shown for the
surface of the skin in Figure 5.2. If the surface moves, the speckle pattern will
also move proportionally. Based on this principle, movement of a body part
can be estimated by monitoring the skin’s reflected speckle. The movement
is converted into two dimensional cursor co-ordinates to move a mouse cursor
on a computer screen. Two techniques of generating mouse click actions are
considered with this system, intensity variation and dwell time.
The system uses two laser diodes as emitters, and two linear charge-coupled
device (CCD) arrays as detectors, one for each axis. Motion estimation is
achieved through correlation of two consecutive frames. One of the problems
with using human skin as the scattering surface for motion detection is that
in addition to changes in light diffusion due to movements of the skin surface,
there will also be small changes in light diffusion due to the constant flow of
blood under the skin. This undesirable speckle pattern variance is referred
to as sensor interference and is in the range 20-100Hz. Also tremors, which
are described in more detail below, may introduce noise in the range 1-15Hz,
but this noise may be removed by low-pass filtering the signal. Mouse clicks
may be generated by varying the intensity of the speckle pattern, by moving
towards or away from the sensor. Alternatively, dwell time may be used to
generate a click by holding still for a set length of time.
Figure 5.2: The reflected speckle generated from skin, from [76]
5.3 Visual Technique for Switching Action

5.3.1 Introduction
This section describes an application developed as part of the work presented here, which actuates a switching action upon detection of movement. Two different approaches were tested in developing this application; they are described here under the titles of frame comparison method and path description method. Program operation using the frame comparison method generates a switching action each time any arbitrary movement in front of the computer camera is recognised. This method has many problems, especially when the symptoms characteristic of a typical disabled user are considered. If head injuries include damage to the central grey matter of the brain (the basal ganglia), then the person may suffer from any of a class of movement disorders known as the dyskinesias [77]. These disorders cover a range of excessive abnormal involuntary movements. Tremor is one of the dyskinesias, characterised by a low frequency (1-15 Hz) rhythmic sinusoidal oscillatory movement of a body part, resulting from the involuntary alternating contraction and relaxation of opposing groups of skeletal muscles [76]. Chorea is another, consisting of irregular, unpredictable, brief jerky movements, and is common in hereditary Huntington's disease [77]. Obviously, if the user suffers from any of these dyskinesias, their movements could generate unintentional switching actions. Any large background movement may also trigger the program to detect movement incorrectly. However, the frame comparison method is presented here as it may be useful in a limited number of applications, and it has the advantage over the second approach that it does not require any initialisation before use.
The second approach, the path description method, aims to actuate a switching action only when one particular defined movement is recognised. This is designed for people who may only have a slight movement available, such as a flicker of a thumb. An initialisation procedure is necessary to define the "path description", which requires the aid of a therapist or a helper. At the beginning of each session, the user performs the voluntary action that they intend to use to actuate the switching action. The program records the movement and uses it to calculate the parameters that describe the path of motion, which allow subsequent movements by the user to be compared to this particular action. An original algorithm is described that uses a six-dimensional space based on the centres of brightness of the red, green and blue pixels in the horizontal and vertical directions. Two distinct regions within this space are defined that are entered during the path of motion. Movement between these two regions can then be used to control a two-way switch in software, which enables operation of any application requiring a switching action, such as the Natterbox program described in Chapter 2. This method works best if the finger or moving part is placed against a dark background. While this may seem a very restrictive requirement, it is envisaged that in a working system using this method the black background could be provided as part of the necessary application equipment, attached beneath the computer camera.
Figure 5.3: Webcam used with the system described in Section 5.3
5.3.2 Technical Details
The computer camera or web-cam used with this system is shown in Figure
5.3. It is an iBot2 USB web-cam made by Orange Micro, which has been
specifically modified for this application. The camera has been mounted on a
stiff stand with a moveable arm to enable it to be easily aimed at the moving
body part. This step was performed to enable the system to be easily tested,
however, the program should work with any computer camera, provided it can
be suitably mounted to point at the moving body part. This technique offers
an inexpensive way of interfacing a user with a computer. Computer cameras
are becoming increasingly low-cost and are often even integrated into modern
computers.
The software program was written using some of the DirectX 9.0 application programming interface (API) libraries. DirectX provides a method for software programs in Windows to interface directly with hardware devices connected to the computer. The graphical user interface was developed using Direct3D, and video capture and rendering were achieved using DirectShow.
Figure 5.4: Filter graph used for the video data in the application: capture filter (IBaseFilter) → transform filter (CMyTransformFilter) → video renderer filter (IBaseFilter).
A DirectShow application is based on the concept of the filter graph. A
filter graph is basically a chain of filter blocks connected together in software.
Different filter blocks may be added to the filter graph depending on the individual requirements of a particular program. For the application here, the
filter graph is made up of three filter blocks - a capture filter, a transform filter
and a renderer filter. The capture filter is required to pass data from the computer camera to the computer. The transform filter processes and interprets
this data. The renderer filter takes the data and outputs the processed video
frames on screen. Each filter block has its own set of specific properties which
can be modified through use of the appropriate interface. The filter graph
manager handles the flow of data through the filter graph. A block diagram
of the filter graph used in this application is shown in Figure 5.4.
The transform filter used in this filter graph was developed specifically for
this application by creating a class of filter called CMyTransformFilter, which
inherits from the standard DirectShow filter type CTransformFilter. Most
of the processing and analysis of the video data is done by the Transform
function of this filter. The steps involved in the program are described in
Table 5.1. The frame comparison method will first be described, and then the
path description method.
Table 5.1: Steps involved in the application, for both the Frame Comparison Method and the Path Description Method

Step (Path Description Method)   Operation                             Step (Frame Comparison Method)
1                                Video Capture                         1
2                                Low Pass Filtering                    2
3                                Reduction of Data                     3
4                                Centre of Brightness Calculations     -
5                                Path Description                      -
6                                Calculation of Regions                -
7                                Definition of Region Spaces           -
-                                Frame Comparison                      4
8                                Switch Actuation                      5
9                                Rendering                             6

5.3.3 Frame Comparison Method

Step 1: Video Capture

The capture filter used is of type IBaseFilter. The computer camera is accessed by the program by creating a list of video capture devices connected
to the computer. The program retrieves the moniker to the video capture
device at the top of the list, which should be the computer camera. This
moniker is then used to create the capture filter and associate the camera with
that filter block. The desired parameters for video capture are set through the
interface IAMVideoProcAmp. The camera parameters used are shown in Table
5.2. The automatic brightness control feature is turned off by setting the
camera exposure time to a constant, since the automatic brightness feature
interferes with analysis of the data later. The format for the data is set to
RGB24, which means each pixel is described using 24 bits of data, or 3 bytes
- one each to describe the intensity value of red (R), green (G) and blue (B)
in that pixel. Table 5.3 shows some sample pixel values for different colours,
and their hexadecimal values.
Step 2: Low Pass Filtering
The interpretation of the data received is performed by the transform filter, which was written specifically for this application. The handling of the data by the transform filter can be divided into two parts: image processing and image analysis.
Table 5.2: Video Capture Parameters

Parameter               | Value
Brightness              | Mid-range
Contrast                | High
Camera Exposure         | Constant
Frame Resolution Width  | 320 pixels
Frame Resolution Height | 240 pixels
Format                  | RGB24
Table 5.3: The RGB24 format for some sample colours

Colour  | Red | Green | Blue | Hexadecimal
Red     | 255 | 0     | 0    | FF0000
Green   | 0   | 255   | 0    | 00FF00
Blue    | 0   | 0     | 255  | 0000FF
Yellow  | 255 | 255   | 0    | FFFF00
Cyan    | 0   | 255   | 255  | 00FFFF
Magenta | 255 | 0     | 255  | FF00FF
White   | 255 | 255   | 255  | FFFFFF
Black   | 0   | 0     | 0    | 000000
Midgrey | 128 | 128   | 128  | 808080
Step 2 (filtering) and Step 3 (reduction of data) may be thought of as image processing stages. The image is analysed in Step 4 (frame comparison), and if sufficient movement has occurred, a switching action is performed. For the path description method, which is discussed later, the image analysis is more complicated and is described in Steps 4-7.
Filtering of the video frames is necessary to remove noise from the captured image, which is most typically evidenced by spurious pixels with values very different from those of their neighbouring pixels. A simple averaging filter was found to give sufficiently good results. More complex filtering methods are discussed in [78]. The filter size is parameterised by n; each sample is a (2n + 1) × (2n + 1) subset of the video frame centred on the pixel being filtered. This size is definable by the user since different lighting conditions and different computer cameras may necessitate different amounts of filtering.
The data received may be thought of as three two-dimensional arrays of size 320 × 240 called X_r, X_g and X_b, where x_r[i][j], x_g[i][j] and x_b[i][j] are the red, green and blue components respectively of the pixel value in the ith row and jth column. The filtered pixel values, x̂_r[i][j], x̂_g[i][j] and x̂_b[i][j], may be calculated for each colour using the following formula:

\hat{x}[i][j] = \frac{\sum_{k=i-n}^{i+n} \sum_{l=j-n}^{j+n} x[k][l]}{(2n+1)^2}   (5.1)
Some filtered video frames showing the effects of different filter sizes are shown
in Figure 5.5.
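In code, the averaging filter of Equation 5.1 for one colour channel might be written as the following sketch (border pixels are left unfiltered here for clarity; the actual application would need some border-handling policy):

    const int H = 240, W = 320;

    // Replace each pixel by the mean of its (2n+1) x (2n+1) neighbourhood.
    void AverageFilter(const unsigned char src[H][W], unsigned char dst[H][W], int n)
    {
        for (int i = n; i < H - n; ++i) {
            for (int j = n; j < W - n; ++j) {
                int sum = 0;
                for (int k = i - n; k <= i + n; ++k)
                    for (int l = j - n; l <= j + n; ++l)
                        sum += src[k][l];
                dst[i][j] = (unsigned char)(sum / ((2 * n + 1) * (2 * n + 1)));
            }
        }
    }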
Step 3: Reduction of Data
It was found that reducing the image to an 8-level image gave more reliable
results and also greatly reduced the computational complexity of the later
analysis. As mentioned already, each pixel of the video frame received consists
of three bytes (RGB24 format), one byte each for red, green and blue. Each of
[Figure 5.5: The effect of different filter sizes on a single video frame. (a) n = 0 (b) n = 1 (c) n = 3 (d) n = 4 (e) n = 6 (f) n = 9]
these bytes normally has a value in the range 0→255. Reduction of the image to an 8-level image is an extension of the concept of bi-level images, which are discussed for “black and white” images in Chapter 2 of Parker [79]. “Black and white” images may be more accurately termed grey-level images, since they are actually made up of a spectrum of grey levels ranging from black to white. Bi-level images are produced from grey-level images by compressing the range of greys until only two levels remain, a process known as thresholding the image.
This idea may be extended for colour images by translating the original colour image to an 8-level image. As mentioned before, each pixel in the original image is represented by 3 bytes, or 24 bits, of data. This may be reduced to 3 bits of data by assigning each of the bytes either the value 1 or the value 0, resulting in 2³ = 8 possible colour combinations. The value assigned to each byte is decided by thresholding the data. If a pixel has its red, green or blue component above the respective threshold, then the red, green or blue value of that pixel is turned “on”, i.e. given the value 1; if its original value is below the threshold then it is turned “off”, i.e.
given the value 0. Note that the renderer used to present the video image on screen requires RGB24 format, so a pixel value of 1 is remapped to a value of 255 in the final stage, but for the purposes here it is sufficient to think of the pixel values in 3-bit binary form.
The biggest problem lies in choosing an appropriate threshold value for each of the three colours red, green and blue. A number of different methods for calculating this threshold were considered, and these will now be discussed.
1. The threshold could be chosen midway between the two extremes, i.e. at 128. This method is not particularly effective as it is very sensitive to lighting conditions: if the room is bright then much of the picture will be white, and if the room is dark then much of the picture will be dark. The effect of this type of thresholding is shown in Figure 5.6(b) for the example frame in Figure 5.6(a). As the original image is quite dark, only the brightest line along the centre of the finger is visible in the final image.
2. Alternatively the median value can be used, which is the level with equal numbers of pixels above and below it. This method gives better results than the first, although it is sensitive to the relative sizes of the body part and the background. The effect of this type of thresholding is shown in Figure 5.6(c). In this particular image, as the finger is much smaller than the background, some of the background is artificially lightened.
3. Another option is to use the mean pixel value as the threshold for each
of the three colours. The mean value X̄ of each video frame may be
calculated as:
\bar{X} = \frac{\sum_{i=0}^{239} \sum_{j=0}^{319} x[i][j]}{240 \times 320}   (5.2)
The effect of using the mean value as the threshold is shown in Figure 5.6(d). Like the median, it is sensitive to the relative sizes of the
foreground and background.
[Figure 5.6: Various thresholding methods. (a) Original video frame showing a finger (b) threshold chosen at 128 (c) threshold chosen at the median (d) threshold chosen at the mean (e) threshold chosen by finding peaks in a bimodal distribution]
Figure 5.7: Typical histogram of one video frame showing the number of pixels with
each value from 0 to 255, for the red, green and blue channels.
4. The fourth method considered is based on calculating three histograms of the image for each video frame received, one for each of the red, green and blue channels. Each histogram is produced by counting the number of pixels in the frame with each possible value from 0→255 and creating a chart with each possible level on the horizontal axis and the number of pixels in the frame at each level on the vertical axis. A typical histogram distribution is shown in Figure 5.7 for a pale hand on a black background. The total area under each curve is 240 × 320 pixels. If the image is bimodal (e.g. an image with one clearly recognisable bright region and one clearly recognisable dark region, such as the finger on a dark background) then the histogram should have two distinct peaks, a bright peak and a dark peak. The histogram shown in Figure 5.7 can roughly be considered an example of a bimodal histogram with two distinct regions. The significance of the white and black lines will be explained shortly. Theoretically, the lowest point between these two peaks is the ideal threshold to use to reduce the data to an 8-level image, but the difficulty lies in accurately locating these peaks, as discussed in Chapter 5 of Parker [79]. The threshold can be chosen either midway between these two points or at the minimum value between them. For this system, the minimum value was chosen. For darker skin tones, the two peaks would
be closer together, but the black background should still be much darker than the darkest possible skin tone, and the distance between peaks can be increased if necessary by increasing the camera contrast. The effect of this type of thresholding is shown in Figure 5.6(e).
The two peaks in the bimodal histogram were found using the following method, for each of the three colours. The histogram may be written as a 256-element array Y = [y_0, y_1, ..., y_254, y_255].
A deadzone, w, is user-defined; it should represent the expected width of the peaks. The best value depends on the contrast of the two colours in the image, but a value of 30 was found to work well with the image of the hand on the black background above. The width is adjustable by the user or therapist, as different values may work better with different cameras, scene contrasts and lighting conditions. The first peak is found by searching the data for the level of maximum value. The first and last 10 levels are ignored since an abnormally large percentage of the pixels will have these values due to noise. The first maximum index is labelled i_1 where
y_{i_1} = \max_{10 < j < 245} y_j   (5.3)
The area around this peak is marked as \{p_1, p_2\} = \{(i_1 - w), (i_1 + w)\}. The second peak i_2 is then found by searching the histogram data for another maximum outside the area of the first peak:

y_{i_2} = \max\left( \max_{10 < j < p_1} y_j,\; \max_{p_2 < k < 245} y_k \right)   (5.4)
Due to noise, each video frame often has a number of pixels that change value sporadically from one frame to the next. This is undesirable since, when it occurs, the two maximum indices calculated, and thus the threshold level, will jump about from frame to frame. This creates “flickering” of the image which may be misinterpreted as actual pixel changes due to movement. This problem was overcome by weighting the two maximum values based on previous values
as follows. w_1 may be set by the user but a value of 0.5 seems to work well.

w_2 = 1 - w_1
i_{1w} = w_1 \times i_1 + w_2 \times i_{1wp}
i_{2w} = w_1 \times i_2 + w_2 \times i_{2wp}   (5.5)
i_{1w} and i_{2w} are the new weighted values of i_1 and i_2, and i_{1wp} and i_{2wp} are the weighted values of i_1 and i_2 calculated for the previous frame. The initial values of these indices are arbitrarily chosen at 50 and 200, but the influence of these two values should be negligible within a few seconds. For the frame shown in Figure 5.7, the two white lines represent the weighted values of the two peaks detected. The threshold point finally chosen is the index T, which is represented by the black line. This value is found by searching between the two indices i_1 and i_2 to find a minimum, and weighting this index by previous values as for the two maxima.
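A sketch of this peak-finding and weighting procedure (Equations 5.3-5.5) for one colour channel is given below. Function and variable names are illustrative, and clamping of the deadzone bounds to the valid 10-245 range is omitted.

    // Index of the largest histogram entry strictly between lo and hi.
    int FindMax(const int hist[256], int lo, int hi)
    {
        int best = lo + 1;
        for (int j = lo + 1; j < hi; ++j)
            if (hist[j] > hist[best]) best = j;
        return best;
    }

    // Find the two peaks and update the temporally weighted indices.
    void FindPeaks(const int hist[256], int w, double w1,
                   double &i1w, double &i2w)  // weighted values kept across frames
    {
        // Equation 5.3: first peak over levels 10..245.
        int i1 = FindMax(hist, 10, 245);

        // Equation 5.4: second peak outside the deadzone {i1 - w, i1 + w}.
        int left  = FindMax(hist, 10, i1 - w);
        int right = FindMax(hist, i1 + w, 245);
        int i2 = (hist[left] > hist[right]) ? left : right;

        // Equation 5.5: weight by the previous frame's values to stop the
        // threshold jumping about ("flickering") from frame to frame.
        double w2 = 1.0 - w1;
        i1w = w1 * i1 + w2 * i1w;
        i2w = w1 * i2 + w2 * i2w;
    }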
Based on this value T, three new two-dimensional arrays B_r, B_g and B_b are calculated from the filtered arrays X̂_r, X̂_g and X̂_b. These new arrays consist solely of boolean or binary numbers, i.e. numbers which can only have the value 0 or 1.
Each element of the new array is calculated based on the following rules for each of the three colour channels:

If x̂[i][j] < T ⇒ B[i][j] = 0
If x̂[i][j] ≥ T ⇒ B[i][j] = 1
Step 4: Frame Comparison
The previous frame may be described by three boolean arrays C_r, C_g and C_b, where c_r[i][j] is the pixel value in the ith row and jth column of C_r. The calculation of the frame difference ∆_BC between the current frame and the
previous frame may be written using the following equation:

\Delta_{BC} = \sum_{i=0}^{239} \sum_{j=0}^{319} \left( b[i][j] \veebar c[i][j] \right)   (5.6)
The ⊻ symbol comes from the Latin word aut, which means “or, but not both”, and is used here to represent the binary operator XOR, or Exclusive OR. The symbol ⊕ is also commonly used to denote this operator. This operation outputs a value of 1 if exactly one of the two operands is 1, and a value of 0 if both of them are 1 or both of them are 0. This process may also be described in terms of its software implementation: the program compares the value of each pixel in the current frame with the value of the pixel in the same position in the previous frame, and each time they differ, ∆_BC is incremented.
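For one colour channel, the comparison of Equation 5.6 reduces to a few lines; the sketch below counts differing boolean pixels between the current frame B and the previous frame C.

    const int H = 240, W = 320;

    // Number of pixels whose binary value differs between the two frames.
    int FrameDifference(const bool B[H][W], const bool C[H][W])
    {
        int delta = 0;
        for (int i = 0; i < H; ++i)
            for (int j = 0; j < W; ++j)
                if (B[i][j] != C[i][j])  // boolean exclusive OR
                    ++delta;
        return delta;
    }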
Step 5: Switch Actuation
The value of ∆BC represents the number of pixels that have changed between
the current frame and the previous frame. A switching action is actuated if
the value of ∆BC is greater than a threshold. The most appropriate threshold
to use depends on a number of factors so it is user definable. If the program is
only required to respond to large movements then a higher threshold should be
used. If the camera is very far away from the moving body part then perhaps
a lower threshold would be more suitable.
The switching action is actuated by simulation of an “F2” key press. This
particular key value was chosen as it is the expected input for the Natterbox
program described in Chapter 2. It is possible to change the simulated key
press to another keyboard value for operation with any other software program
requiring a different single key press.
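On Windows, such a key press can be simulated with the Win32 keybd_event function; a minimal sketch of the “F2” simulation described above is:

    #include <windows.h>

    void SimulateF2()
    {
        keybd_event(VK_F2, 0, 0, 0);                // key-down event
        keybd_event(VK_F2, 0, KEYEVENTF_KEYUP, 0);  // key-up event
    }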
Step 6: Rendering
The renderer filter block in the filter graph is also of type IBaseFilter. This filter block is used to display the video frames on screen. The type of renderer is a Video Mixing Renderer 9. The interface IAMStreamConfig is used to set the display size to height 240 and width 320. The renderer receives the processed video frames from the transform filter, with one modification: in order to display correctly on screen in RGB24 format, the pixels with value 1 are mapped back to a value of 255. Therefore each of the three colour channels of every pixel in the displayed image will have either the value 0 or the value 255, resulting in eight possible colour combinations in RGB24 format: red, green, blue, black, white, cyan, magenta and yellow. All of the eight-level pictures included in this chapter were produced by this renderer.
5.3.4 Path Description Method
Many of the steps in the path description method are identical to those described in the frame comparison method and so will not be covered here. These
steps are video capture, low pass filtering and reduction of data. However, the
image analysis process for this method is different, and this will now be described.
Step 4: Centre of Brightness Calculations
The centre of brightness for each of the three colours is calculated as follows. The width of the image, W, is 320 pixels and the height of the frame, H, is 240 pixels. As before, the image data are in three binary two-dimensional arrays called B_r, B_g and B_b; b_r[i][j] is the value of the red pixel in the ith row and the jth column and may be either 1 or 0. The sum of row i, x_i, is calculated by
x_i = \sum_{j=0}^{W} b[i][j]   (5.7)
from which the global total X is calculated as

X = \sum_{i=0}^{H} x_i   (5.8)
The centre of brightness for row i is COB_{x_i} and is calculated as follows:

COB_{x_i} = \frac{\sum_{j=0}^{W} (b[i][j] \times j)}{x_i}   (5.9)
The overall x-coordinate for the centre of brightness, COB_x, is then

COB_x = \frac{\sum_{i=0}^{H} COB_{x_i}}{H}   (5.10)
and for the y-coordinate, COB_y,

COB_y = \frac{\sum_{i=0}^{H} (x_i \times i)}{X}   (5.11)
The calculated centres of brightness for the little finger in Figure 5.8(a) are shown in Figure 5.8(c). Note that the actual centres of brightness are at the centres of the boxes; the boxes are enlarged for demonstration purposes. The centres of brightness of the red and the green values appear to be in exactly the same place, so the two boxes showing the centres overlap to give a yellow box (red + green = yellow).
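The calculations of Equations 5.7-5.11 for one colour channel can be summarised by the following sketch; the function name is illustrative, and rows containing no “on” pixels simply contribute zero to the row-wise sum, mirroring the equations.

    const int H = 240, W = 320;

    // Compute the x- and y-coordinates of the centre of brightness of B.
    void CentreOfBrightness(const bool B[H][W], double &cobX, double &cobY)
    {
        double X = 0.0;            // global total of "on" pixels (Eq. 5.8)
        double sumRowCOB = 0.0;    // accumulates COB of each row (Eq. 5.10)
        double sumRowWeighted = 0.0;
        for (int i = 0; i < H; ++i) {
            int xi = 0, colSum = 0;
            for (int j = 0; j < W; ++j)
                if (B[i][j]) { ++xi; colSum += j; }  // Eq. 5.7, numerator of 5.9
            if (xi > 0) sumRowCOB += (double)colSum / xi;  // COB of row i (Eq. 5.9)
            sumRowWeighted += (double)xi * i;
            X += xi;
        }
        cobX = sumRowCOB / H;                        // Eq. 5.10
        cobY = (X > 0) ? sumRowWeighted / X : 0.0;   // Eq. 5.11
    }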
Step 5: Path Description
As discussed before, the path description method is based on recording the
path of the movement. Now that the centres of brightness have been defined,
the exact method used to describe the path may be explained. The therapist or
helper must press a start button to begin recording the movement. The subject
then makes their movement and the program records the path P by recording
the six centres of brightness for each frame. At the end of the movement the
therapist presses a stop button. If the number of frames is N, then the size of
[Figure 5.8: Path Description Method; panels (a)-(f), see text in Section 5.3.4 for further explanation]
P will then be N × 6 and may be described by the matrix

P = \begin{bmatrix}
x_r[1] & y_r[1] & x_g[1] & y_g[1] & x_b[1] & y_b[1] \\
x_r[2] & y_r[2] & x_g[2] & y_g[2] & x_b[2] & y_b[2] \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_r[N] & y_r[N] & x_g[N] & y_g[N] & x_b[N] & y_b[N]
\end{bmatrix}   (5.12)

= \begin{bmatrix} V[1] \\ V[2] \\ \vdots \\ V[N] \end{bmatrix}   (5.13)
where x_r[i] and y_r[i] are the x- and y-coordinates of the centre of brightness of the colour red in frame i, and V[i] is the 1 × 6 vector whose six points may be referred to as:

V[i] = \{V[i][1], V[i][2], V[i][3], V[i][4], V[i][5], V[i][6]\}
     = \{x_r[i], y_r[i], x_g[i], y_g[i], x_b[i], y_b[i]\}   (5.14)
This process is shown in Figure 5.8 for movement of a little finger. The starting
and finishing positions of the little finger are shown in Figures 5.8(a) and 5.8(b).
The calculated centres of brightness are shown in Figure 5.8(c). Figure 5.8(d)
and Figure 5.8(e) show the path that is traced out as a little finger is moved.
Step 6: Calculation of Regions
The Euclidean distance between each set of six points is defined between frame i and frame j as D_{ij}:

D_{ij} = D_{ji} = \sqrt{\sum_{k=1}^{6} \left( V[i][k] - V[j][k] \right)^2}   (5.15)
for 0 < i, j < N. The Euclidean distances between each frame in the path and every other frame in the path are calculated by comparing each pair of rows in P. From this, the indices k and l are recorded, corresponding to the two frames with the maximum Euclidean distance between them. P_1 = V[k] and P_2 =
V[l] are then recorded as the two region points, which are used to define the two regions corresponding to “switch closed” and “switch open”. The two six-dimensional region points calculated for the example of a little finger moving are represented by two sets of three two-dimensional boxes in Figure 5.8(f). The centres of brightness are close to each other in this example so the boxes overlap and a white box is displayed, but this may not always be the case, as discussed below.
Step 7: Definition of Region Spaces
Since they are six-dimensional regions it is hard to visualise how the two regions are formed, but an attempt at a description is shown in Figure 5.8(f). The regions are enclosed by three boxes around the region points, one for each of the red, green and blue channels. A six-dimensional threshold T is defined with each element calculated as:

T_i = \frac{P_1[i] + P_2[i]}{2} \quad \text{for } 0 < i \le 6   (5.16)
Based on these values, a two-dimensional region is defined for each of the three
colours corresponding to the two different states. An example of this for one
colour is shown in Figure 5.9. The two crosses inside boxes represent P_1 (at {60, 180}) and P_2 (at {240, 40}) for that colour, and the two shaded areas represent the corresponding regions. The thresholds are calculated as:
T_1 = \frac{60 + 240}{2} = 150

T_2 = \frac{180 + 40}{2} = 110
from which the regions R_1 and R_2 may be defined as follows:

If x < 150 and y < 110 → \{x, y\} ∈ R_1
If x > 150 and y > 110 → \{x, y\} ∈ R_2   (5.17)
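Steps 6 and 7 can be summarised by the sketch below: an exhaustive pairwise search for the two furthest six-dimensional path points (Equation 5.15), followed by the midpoint thresholds of Equation 5.16. The names are illustrative; the actual application stores the path as the matrix P described above.

    #include <array>
    #include <cmath>
    #include <vector>

    typedef std::array<double, 6> Vec6;

    // Euclidean distance between two six-dimensional path points (Eq. 5.15).
    double Dist(const Vec6 &a, const Vec6 &b)
    {
        double s = 0.0;
        for (int k = 0; k < 6; ++k)
            s += (a[k] - b[k]) * (a[k] - b[k]);
        return std::sqrt(s);
    }

    // Find the two region points P1, P2 and the per-dimension thresholds T.
    void FindRegionPoints(const std::vector<Vec6> &P,
                          Vec6 &P1, Vec6 &P2, Vec6 &T)
    {
        double dmax = -1.0;
        for (size_t i = 0; i < P.size(); ++i)          // compare every pair
            for (size_t j = i + 1; j < P.size(); ++j)
                if (Dist(P[i], P[j]) > dmax) {
                    dmax = Dist(P[i], P[j]);
                    P1 = P[i];
                    P2 = P[j];
                }
        for (int k = 0; k < 6; ++k)                    // Eq. 5.16
            T[k] = (P1[k] + P2[k]) / 2.0;
    }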
[Figure 5.9: The two regions calculated from the two furthest points, indicated by crosses]
In many cases the centres of brightness for each of the three colours will be
in the same position, or close to the same position, as appears to be the case
in Figure 5.8(f). In this example, each of the three pairs of region spaces
appear to overlap each other closely and the six-dimensional region spaces are
effectively mapped onto a two-dimensional space. In cases like this, it may seem
redundant to use the three colour channels for the path and region description,
and it may appear that one colour should just be chosen. However, there are
certain instances where only one of the three colours will move or the colours
will move by different amounts and using all six parameters may allow for a
wider range of movements to be performed. Figure 5.10 shows an example where the three colours do not overlap.
In this example the movement performed by the user is the action of bending the thumb. This movement is complicated by the fact that the user is wearing red nail-polish, which shifts the coordinates of the red centre of brightness away from the other two in the initial position, as seen in Figure 5.10(a). However, when the thumb is bent, the nail area is not as large in the image as it is initially and the three centres of brightness overlap, as seen in Figure 5.10(b). Three two-dimensional representations of the six-dimensional regions calculated are shown in Figure 5.10(c). Figure 5.10(d) shows these more clearly. The centres of brightness surrounded by black and the corresponding lines drawn with black interleaving all represent the same region, i.e. P_1 and R_1, and the boxes surrounded in white show P_2 and the
[Figure 5.10: Overlapping; panels (a)-(f), see text]
lines interleaved with white are R_2. The box in Figure 5.10(e) is enlarged in Figure 5.10(f). Inspection of this figure reveals a few facts.
Firstly, in order to map the six-dimensional region to a two-dimensional region, it would be necessary to reduce the region sizes to the two regions that are shown surrounded in pink. Although the red part of P_1 and P_2 is inside this box, the blue and the green parts of P_1 and P_2 are not. Remembering that these points are the furthest points moved to during the thumb-bending, it is unlikely that these points would ever be inside this box, and therefore a two-dimensional region would not be satisfactory using this method.

Secondly, in the centre of Figure 5.10(f) it can be seen that R_1[3]-R_1[6] and R_2[1] and R_2[2] overlap, i.e. the initial position of the green and blue coordinates and the final position of the red coordinates are almost in the same place. This indicates that using all six dimensions will allow the movement to be more readily and accurately detected.
5.4 Conclusions
The generality of the application means that it can be adapted for use by
people with a range of different movement abilities. For people with a large
range of head movement it may be sufficient to use the frame difference method
with a large difference required between successive frames. The advantage of
using a system such as this is that it is easy to adapt it to the movement that
the disabled person is able to perform best i.e. the system is adaptable to suit
a particular user, rather than the user having to adapt to suit the system.
The centre of brightness method also offers potential for mouse cursor control. If the person has a greater range of movement, for example the ability to
fully move their hand, then they can direct the centres of brightness around
the screen by changing the position of their hand accordingly. While there are
many important potential applications for a visual based mouse cursor control
system, this idea was not explored in great detail as part of this thesis, because the patients we worked with did not have adequate hand mobility. However, it may offer a promising application for future development.
In summary, visual based methods offer a promising non-contact solution
for detecting flickers of movement which may be harnessed for communication
and control purposes.
Chapter 6
Acoustic Body Signals
6.1 Introduction
Acoustic body signals may be defined as any sound or noise that can be produced voluntarily by the body. In this chapter, we deal only with acoustic
body signals that have been created using the vocal organs of the human body
and neglect sounds that could be produced using other parts of the body, for
example, hand-clapping or foot-tapping. The primary acoustic signal produced
by the vocal organs is speech, which is one of the most important modes of
communication. People who have become disabled but retain the ability to
produce speech are at an immense advantage over those without full speech
production abilities. Recent advancements in speech recognition technologies
and the demand for “hands-free” operation of everyday appliances and gadgets
have made commercial speech-based environmental control systems readily
available to the public. However, for people who have lost the ability to speak
intelligibly, currently available speech recognition systems are generally not an
option. Nonetheless, acoustic signal based systems may provide a useful channel for environmental control and communication for this group of people in
certain cases. Often such people, while not able to coherently produce words or
sentences, will still remain capable of voluntary and reproducible utterances,
such as single phoneme utterances, grunts or whistles. This chapter investigates how these utterances may be harnessed for communication and control
purposes.
Speech recognition technologies are reviewed in Section 6.2, and some of
the common methods employed to recognise speech are described. In order
to understand the characteristic shapes of acoustic signals that make them
identifiable as speech, it is necessary to give a brief outline of how speech sounds
are created by the body. The process of speech production by the vocal organs
is discussed in Section 6.3. A speech signal may be thought of as a continuous
stream of phonemes, which can be defined as the basic sound units of speech
often used by speech therapists, linguists and speech recognition engineers.
The phoneme alphabet for the dialect of English used here in Ireland, Hiberno-English, is introduced in Section 6.4. The characteristics of a speech signal
may then be explored by considering the different types of phonemes and the
physical processes performed by the vocal organs in making each of the different
phoneme sounds. An attempt is made to define features of different phonemes
that will make them distinguishable from each other. Phoneme recognition
as an alternative acoustic signal to speech recognition technology is discussed
in Section 6.4.4 and some advantages of using this method are given. Some
applications of phoneme recognition that have been developed as part of this
work are presented, both in hardware (Section 6.5) and in software (Section
6.6).
6.2 Speech Recognition

6.2.1 Speech Recognition: Techniques
There are many and varied speech recognition based applications available
commercially. The speech recognition technology market is dominated by
Scansoft Inc.¹, a Belgian-based computer software technology company that manufactures the speech recognition suite Dragon NaturallySpeaking, a desktop dictation program with recognition rates of up to 99%. Embedded speech
recognition chips are becoming increasingly popular, and are frequently included in mobile phones enabling the user to record a person’s name and associate a particular phone number with it for automated voice dialling. Speech
recognition systems often have a large degree of variability in the methods used
to interact with them. Some systems require that the speaker chooses one-word
commands from a small vocabulary of words, while others attempt to recognise
continuous speech from a large or unlimited vocabulary. Some are tailored to
an individual user’s voice by training the system for a particular user. Other
systems are expected to work with a broad range of speakers and dialects or
adapt to the speaker’s voice over the time the application is in use. There
are a number of different approaches used in attempts to recognise speech correctly, and three are briefly described here: the acoustic-phonetic approach, the artificial intelligence approach and the pattern matching approach.
The Acoustic-Phonetic Approach
The acoustic-phonetic approach was common in early speech recognition systems. The speech stream is analysed in an attempt to break the continuous
stream of data into a series of individual units, called phonemes, within the
stream. The series of phonemes is then analysed in an attempt to recognise
words from the phoneme stream. There may be a large variability within each
phoneme definition, due to speaker pitch and accent, and transducer variability, and thus it may be difficult to set accurate boundaries for each phoneme
type. Co-articulation, where two phonemes spoken in quick succession “blend
into” each other, may make it difficult to split the stream up accurately.
¹ Website: http://www.scansoft.com
The Artificial Intelligence Approach
This approach attempts to mimic the process of speech recognition by the brain
by taking phonemic, lexical, syntactic, semantic and pragmatic knowledge and
using each of these to build up an artificial neural network which learns the
relationships between events and thus can make an “intelligent” guess at which
word was spoken based on this knowledge. More information on artificial
neural networks can be found in [41].
The Pattern Matching Approach
The pattern matching approach of speech recognition involves two stages: pattern training and pattern comparison. The system must be trained for each
word or phrase that needs to be recognised by the system. In the pattern
training stage, each word or phrase is spoken one or more times by the speaker
or speakers. The speech waveform is analysed and a template pattern of the
word or phrase is then built from the trainer data. Template patterns are built
using the feature extraction method or the statistical method.
The statistical method creates a statistical model of the behaviour of the
training model and uses this as a template to compare with the received speech
signal. The statistical model attempts to calculate the probability that a
certain phoneme or sequence of phonemes was uttered based on features of the
sounds. The most commonly used type of model in speech recognition is the Hidden Markov Model (HMM) [80, 81].
With the feature extraction method, the speech waveform is broken down
into short fragments of the order of tens of milliseconds, that may or may not
overlap in time. For each fragment a number of pertinent features are calculated. These features are used to build a parametric model of that speech
sound. The features calculated may be spectral, temporal or both. Spectral
parameters commonly used include the output from a filter bank, a Discrete
162
Fourier Transform (DFT) or linear predictive coding (LPC) analysis. Temporal
features include the locations of various zero or level crossing times in the signal. In the pattern matching stage, the same parameters are calculated for the
received sound wave and the calculated features are compared to the reference
template set for each of the different possible words in the system’s vocabulary.
There are several different classification methods - discriminant functions are
often used [70]. A classification score is recorded for each comparison and the
word is recognised as being the one with the highest score. Due to variations in
lengths of different segments of a word, the signal is often first time warped to
get the best fit [80]. One of the techniques used to recognise phonemes in the
work presented here is based on a similar approach for phoneme recognition.
6.2.2 Speech Recognition: Limitations
A number of different factors must be taken into consideration when choosing
a speech recognition system.
Vocabulary Size Required In almost all speech recognition systems, there is a trade-off between vocabulary size and the accuracy of the system.
As the vocabulary size increases, the probability of a mis-classification
also rises. The time taken by the system to recognise a particular word
usually increases too, since there are more words for comparison.
System Training Some systems require each user to train the system to
their individual voice. This means the user has to repeat each word in
the system’s vocabulary multiple times to enable the system to create an
individual template for each word. This is cumbersome for the user and
may prevent a system from being used by more than one user.
Gaps between Words Speech recognition systems based on word recognition from a continuous speech stream often expect users to leave a distinct
pause between words to enable accurate detection, as the system needs to
recognise when a word has been spoken and separate out distinct words
from each other. This can be awkward as it is not the way people speak
naturally.
Users with Communication Disorders Most speech recognition systems
currently available assume that the user is capable of producing good
quality, easily comprehensible speech. The speech signal produced by
users with speech disorders such as speech dysarthria, sigmatisms (lisps)
and stutters may generate problems in speech recognition, often rendering speech recognition systems unusable for speech-impaired people.
However, these are the people who would have the most to gain from
acoustic based environmental controllers. People who have suffered a
stroke may often be left with speech dysarthria as well as severe motor
impairment - and thus could benefit greatly from systems that enable
automatic control of appliances in their surroundings.
Some of these issues are discussed in more detail in [80].
6.3 Anatomy, Physiology and Physics of Speech Production
Human speech production is a complex process involving the careful interaction
of a number of different organs of the body. The formation of speech sounds can
be described as a process with four stages - these are respiration, phonation,
resonance and articulation. The organs relevant to speech production are
known as the vocal organs, shown in Figure 6.1. Note the location of the
important vocal organs which will be referred to now when discussing each
of these four stages. The lungs are responsible for respiration and phonation
occurs in the larynx, including the cartilages and the vocal cords. Resonance
occurs mainly in the vocal tract which is the pathway from the larynx to the
lips including the throat and the mouth. Articulation uses the group of organs
known as the articulators which include the tongue, the teeth, the nose, the
lips and the hard and soft palate.
6.3.1 Respiration
The first stage in the production of speech sounds is respiration, or breathing.
Sound waves propagate as pressure waves produced by causing air particles to
vibrate. In speech, this air originates as a continuous stream of air exhaled
from the lungs. When breathing normally, this exhalation of air is inaudible.
It is only when the stream of air is caused to vibrate that it can be detected as
sound by the human ear (or any other device capable of detecting sound such
as a microphone) and thus may be described as “speech”. Air which is exhaled
by the lungs travels up through the trachea (windpipe) and into the larynx.
6.3.2 Phonation
The larynx, or voice box, is the phonation mechanism of the speech production
system (“to phonate” means “to vocalise” or “to produce a sound”). The larynx
converts the stream of air from the lungs into an audible sound. The larynx
is located in the neck and is basically a stack of cartilages. There are nine
cartilages in the larynx in total. The thyroid cartilage is the most prominent
of these and is located at the front of the neck. It is also known as the Adam’s
Apple. (Both men and women have this cartilage but it is more prominent in
men as it is larger.) The arytenoid cartilages are a pair of cartilages located at
the back of the larynx. Two folds of ligament extend from the thyroid cartilage
at the front to the arytenoid cartilage at the back, known as the two vocal folds
or vocal cords.
The vocal cords act as an adjustable barrier across the air passage from the
lungs. When not speaking, the arytenoid cartilages remain apart from each
other and air is free to pass through the gap between the vocal cords (this
Figure 6.1: The Vocal Organs, from [82]
gap is known as the glottis). When the arytenoid cartilages push together,
the gap between the vocal cords closes, shutting off the air passage from the
lungs. During speech, the vocal cords open and close the glottis rapidly. This
chops up the continuous stream of air from the lungs into periodic puffs of air.
This series of puffs is heard as a buzz. The cycle of opening and closing of the
glottis is controlled by air pressure from the trachea. The process that enables
the glottis to open and close is described in [83].
The manner in which the vocal cords vibrate is a complicated process but
may be compared simplistically to a set of vibrating strings. The phenomenon
of vibrating a string to produce a sound is well understood and forms the basis
of sound production in many musical instruments such as a guitar or a piano. A
string vibrating as a whole will vibrate at its fundamental frequency. The string
can also vibrate in other modes at multiples (overtones or harmonics) of the
fundamental frequency. In accordance with this model, the muscle fibres of the
vocal cord muscles (the vocalis muscles) vibrate not only as a whole, but also in
localised groups. Thus, the sound coming through the glottis will be made up of
a number of frequency components, the fundamental frequency plus frequency
components at multiples of the fundamental frequency. Generally, the pitch
of the sound which is finally heard is equal to the fundamental frequency of the vocal cords. (Note that the words “pitch” and “fundamental frequency” are often used interchangeably but they may not always be equal. In telephone line transmissions, where the speech signal is bandpass filtered, the pitch, which may be thought of as the perceived fundamental frequency, may not be the same as the actual fundamental frequency of the sound transmitted.)
The frequency, f, of any wave in nature is related to its velocity, v, and wavelength, λ, by:

f = \frac{v}{\lambda}   (6.1)
For a vibrating string, the velocity of sound propagation along it, v, its tension, T, and its mass per unit length, µ, are related by:

v = \sqrt{\frac{T}{\mu}}   (6.2)
If the string is vibrating in its fundamental mode then the wave produced has a wavelength of twice the length, L, so λ = 2L. The relationship between a string's fundamental frequency, f, length, L, mass per unit length, µ, and tension, T, may be summarised by the following equation:

f = \frac{\sqrt{T/\mu}}{2L}   (6.3)
Thus the frequency of vocal cord vibration, and therefore the pitch of the
sound produced, may be determined by adjustment of these and other factors.
• Length of the vocal cords
The longer the vocal cords, the more slowly they will vibrate. The portion of the vocal cords which vibrates may also be constricted to produce
higher pitches. (Conversely, if the vocal cords are actively elongated they
will produce higher pitches due to thinning of the vocal cords.)
• Mass of the vocal cords
The chief mass of the vocal cords is due to the paired vocalis muscle.
The physical massiveness of the vocal cords sets the range of frequencies
achievable by any one person. In general, the male has heavier vocal
cords than the female and thus has a lower range of frequencies available
for speech.
• Tension in the vocal cords
Tension in the vocal cords may be altered by muscle action and thus can
be used to adjust the pitch of the sound produced.
• Subglottic air pressure
An increase in the subglottic air pressure raises the pitch (and also the
amplitude of the sound).
• Elasticity of the vocal cord margins
An increase in the elasticity of the vocal cord margins raises the pitch.
• Position and size of the larynx
It is a matter of some controversy whether the vertical position of the larynx influences pitch.
The pitch of the final sound produced is determined by the frequency of the vocal cord vibrations and is usually between 50-250 Hz for men and 120-500 Hz for women. The tone produced by the vocal cord vibration is known as the
glottal tone. It is a dull, monotonous sound that is unlike the final speech
sounds that are uttered. Speech sounds are given a more musical quality by
the effects of resonance.
6.3.3 Resonance
As stated above, the purpose of resonance is to improve the quality of the
speech sound. Some resonance of the sound occurs before the sound passes
through the larynx, in the trachea and thoracic (chest) cavities. The supraglottic resonators are the cavities of the larynx above the vocal cords, the
pharynx (throat), the oral cavity (the mouth) and the nasal cavity (the nose).
The effect of resonance is to alter the different frequency components from the
glottal tone, amplifying some and weakening others.
For the purposes of describing speech production, the throat and the mouth
are usually grouped into one unit referred to as the vocal tract. The vocal tract
extends from the output of the larynx to the lips. The phenomenon of vocal
tract resonance may best be described by approximately modelling the vocal
tract as a tube that is closed at one end (at the vocal cords). Resonance may be
defined as the property whereby a vibratory body will amplify an applied force
having a frequency close to or equal to its own natural frequency - and is seen
to occur in a tube closed at one end. Such a tube has a series of characteristic
frequencies associated with it known as the natural frequencies or resonant
frequencies. In a tube of uniform cross-sectional area which is closed at one end,
the lowest resonant frequency will have a wavelength, λ, of 4 times the length,
L, of the tube (λ = 4L). This frequency is known as the fundamental frequency
or first harmonic. The higher resonant frequencies are odd-numbered multiples
of the lowest one, and are the higher order harmonics. A vocal tract 17 cm long, assuming a uniform cross-sectional area for simplicity, has a fundamental frequency at 500 Hz, a third harmonic at 1500 Hz, a fifth at 2500 Hz, and so on.
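This figure follows directly from the quarter-wavelength relation; taking the speed of sound in air as approximately 340 m/s:

f_1 = \frac{v}{\lambda} = \frac{v}{4L} = \frac{340 \text{ m/s}}{4 \times 0.17 \text{ m}} = 500 \text{ Hz}

with the higher resonances at 3 × 500 = 1500 Hz, 5 × 500 = 2500 Hz, and so on.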
In speech production, the harmonic frequencies of the vocal tract are known
as formant frequencies since they tend to “form” the overall spectrum of the
sound. There are infinitely many formants for each sound. In digital speech
processing there are usually 3-5 left within the Nyquist band after sampling.
The formant frequencies may be altered by changing the shape of the vocal
tract.
When a sound wave that is made up of a number of different frequency
components enters a tube closed at one end, frequencies which are close to the
resonant frequencies of the tube will be amplified and frequencies which are
far away from the resonant frequencies of the tube will be attenuated. The
signal coming through the glottis and into the vocal tract is made up of several
frequency components which have been produced by vocal cord vibration. So,
the vocal tract will amplify frequency components that are close to its formant frequencies and attenuate the frequency components which are far away
from its formant frequencies. Resonance usually produces significant amplification of the signal overall. A sound which is lacking in resonance will sound
unpleasant to the ear. In music, the quality of a sound is called its timbre.
The timbre is a measure of the amount of harmonics in a sound. A musical
note of the same pitch will be distinguishable when played on a piano from
when it is played on a violin because of the presence of different proportions
of harmonics. The two notes have the same pitch but a different timbre. A
musical sound is described as having a good timbre if it has many harmonics.
Likewise, in speech, a sound with many harmonics will sound more musical
and pleasing to the ear than a monotone sound, which has one harmonic (the
fundamental) and has a flat sound [84].
The difference between the overtones of the vocal cord vibration signal and
the overtones of the vocal tract and how they influence the final speech signal
must be emphasised here. The overtones of the signal produced by the vocal
cords are the harmonic frequencies of the speech sound and they determine
the frequency components present in the final signal including the pitch. The
natural frequencies of the vocal tract are the formant frequencies of the speech
sound and they determine how the frequency components already present in
the signal entering the vocal tract are amplified. Resonance in the vocal tract
cannot add any extra frequency components to the signal, other than higher
order multiples of the frequency components already present in the signal,
which act to make the signal sound more pleasing to the ear. Therefore the
pitch of the sound can only be altered by altering the frequency of the vocal
cord vibration.
Resonance is important in production of different vowel sounds. Different
ratios of harmonic amplitudes are recognised as different vowels - thus a person
must change the configuration of their vocal tract to produce different vowel
sounds. Conversely, the pitch of a particular vowel sound is changed by altering
the speed of vibration of the vocal cords and keeping the same vocal tract
position. Since resonance adds higher frequency components to the sound, the
full range of speech sounds for all human voices is between about 50-2000Hz.
Resonance shapes the sound and gives it quality. Further sound shaping is
done by articulation, which also occurs in the vocal tract.
6.3.4 Articulation
Like resonance, the process of articulation is also a sound shaping one. Articulation shapes the sounds to make them acceptable to the listener and recognisable as speech. The articulators are valves which can stop the exhaled air
completely or narrow the space through which it passes. They separate the
sounds transmitted to them and are particularly important in the production
of consonant sounds. The articulators include the lips, teeth, hard and soft
palate, tongue, mandible (jaw), and posterior pharyngeal wall and probably
the inner edges of the vocal folds. The structures of the mouth articulate
recognisable sounds. The tongue, the palate, the lips and the teeth play a part
in articulation.
• The tongue is the most important of the articulators. It also acts as
a resonator by working with the mandible to modify the shape of the
mouth. Many of the consonants are produced by movements of the
tongue against the gums, palate and teeth to create friction or plosion
effects.
• The palate consists of a bony hard palate that forms the roof of the
mouth and a muscular soft palate at the back of the mouth. The velum,
the lower portion of the soft palate, is especially important in controlling
the pressure within the mouth. The velum helps to dam up the air by
aiding closure of the nasal passages.
• The structure of human teeth, and the fact that they are even in height
and width, is an important prerequisite for the production of fricative
sounds, which will be defined in the next section.
• The mandible, or lower jaw, is one of the primary articulators, and also
performs an important role in resonance. A “tight” jaw adds to tonal
flatness.
• The lips are important in the production of the labial consonants, which
are defined in the next section. They also form certain vowels and diphthongs.
• The cheeks are used like the lips to articulate the labial consonants.
• The tonsils occasionally grow large enough to have an effect on the air
flow, and can add an adenoidal quality to the voice.
More information on the anatomy of speech production may be found in
[83, 82, 85]. For intelligible speech production, there must be a clear movement between each sound formation. The clarity needed for intelligibility is
provided by the consonants while the musical quality is provided by the vowels between. The consonants are shaped largely by the articulators while the
vowels are primarily a product of resonance. The different types of sounds are
now described in more detail.
6.4 Types of Speech Sounds
In the previous section, the organs used in the process of speech production
were described. The vocal organs’ configuration for each element of a word
and the features of each particular sound in a speech stream which make words
identifiable have not yet been discussed properly. A speech waveform may be
thought of as a continuous stream of “speech units” known as phonemes. The
set of all possible phonemes for a language covers all the possible combinations
of sounds that may be necessary to create a word in that language. Before the
different arrangements of the vocal organs for each phoneme and the spectral
and temporal features of each phoneme can be discussed, a more complete
description of the definition of a phoneme is given, and the phoneme alphabet
for Hiberno-English is defined.
6.4.1 The Phoneme

Definition of Phoneme
The basic unit of speech is the phoneme. The phoneme is the smallest element
of a word that distinguishes it from another word. For example, the word “cat”
is made up of three phonemes - a vowel sound between two consonant sounds.
It is distinguishable from the word “chat” as the first phoneme is different,
although the second and third phonemes are identical. Two words that are
dissimilar by only one phoneme, such as these, are known as a minimal pair.
The English language is composed of between 39 and 49 phonemes, depending on the dialect. The phoneme can be thought of as a set of ideal sound units
which can be concatenated to produce speech as a stream of discrete codes.
In reality, each particular phoneme will have differences depending on accent,
gender and coarticulatory effects due to rapid transition from one phoneme to
the next. The different ways of pronouncing a particular phoneme are known
as allophones of that phoneme, and the decision whether to class two different
sounds as two allophones of one phoneme or two separate phonemes is not
always clear. For example, the “le” sound in “love” and the “el” sound in “cattle” are classified as two different phonemes by some phoneticians, and considered allophones of the same phoneme by others.
Phonetic Alphabet
A phonetic alphabet is an alphabet used to write down the correct pronunciation of words. There are several phonetic alphabets commonly used by
linguists and phoneticians to define the different sounds available. The Speech
Assessment Methods Phonetic Alphabet (SAMPA) and the Advanced Research
Projects Agency alphaBET (ARPABET) are two alphabets commonly used,
popular since they both consist solely of ASCII characters. The International
Phonetic Alphabet (IPA) is a more complete alphabet developed by the International Phonetics Association to standardise transcription. It is the official
language of linguists and the first version of the alphabet was developed in
1888. Most of its symbols are from the Roman alphabet, some are from the Greek alphabet and some are unrelated to any other alphabet. For this reason it may not be suitable for all computers, but it offers the greatest range of symbols for different phonemes. It also includes diacritic symbols which can be used to indicate slight phonetic variations. For example, the phoneme [p] has both aspirated and unaspirated allophones (aspirated in pin and unaspirated in spin). A superscript h is sometimes used to indicate an aspirated phoneme, i.e. [pʰ]. The International Phonetics Association recommends using square brackets (e.g. [word]) to enclose phonetic transcriptions that include diacritic markings (known as the narrow transcription). The broad transcription, which omits slight phonemic differences, may be enclosed in slashes (e.g. /word/). For our purposes, the broad transcription is sufficient and will be used here for all subsequent phoneme transcriptions. More on phoneme alphabets may be found in [83, 86].
As mentioned before, boundary definitions between different phoneme sounds are not always clear. For example, in Hiberno-English, the dialect of English used in Ireland, /w/ and /ʍ/ are regarded as two different phonemes (e.g. wine versus whine) whereas in other dialects the initial phoneme in these two words is indistinguishable. Even within Ireland, differences between these two phonemes may be more or less apparent depending on the speaker's region and pronunciation. Regional accents have always been a feature of Hiberno-English speech and one phonetic alphabet cannot represent all the possible combinations. For a more detailed discussion of phoneme differences between dialects, refer to [87, 88]. A list of the phonemes commonly regarded as making up the Hiberno-English dialect is shown in Table 6.1. Note that the glottal stop is the sound made when the vocal cords are pressed together, as in the middle of the word “uh-oh”.
Table 6.1: The Phonemes of Hiberno-English (based on phoneme definitions in [89])

Long Vowels:  iː (heat), eː (take), ɑː (father), oː (tone), ɔː (call), uː (cool)
Short Vowels: æ (had), ɛ (bed), ʌ (put), ɪ (hit), ɒ (not), ə (above)
Diphthongs:   ɔɪ (boy), aɪ (fine), aʊ (shout), iɛ (field), uɛ (tour)
Plosives:     p (pea), b (bee), t (tee), d (dawn), k (key), g (go), ʔ (glottal stop), „t (batter)
Affricates:   ʤ (just), ʧ (church)
Fricatives:   v (view), f (fee), θ (thin), ð (then), s (see), z (zoo), ʃ (shell), ʒ (measure), h (he), ʍ (when)
Nasals:       m (me), n (no), ŋ (sing)
Approximants: laterals: l (law); tremulants: r (red); semi-vowels: j (you), w (we)
6.4.2 Types of Excitation
Speech sounds may be categorised as being either voiced or unvoiced sounds.
Voiced sounds are sounds produced by vocal cord vibration and unvoiced
sounds are produced by constricting the vocal tract at a certain point by
the tongue, teeth and lips, and forcing air through the constriction causing air
turbulence. All vowel sounds are voiced sounds, as well as some of the consonants. Voiced sounds have a periodic time domain waveform and a frequency
spectrum that is a series of regularly spaced harmonics. Unvoiced sounds have
a noisy time domain signal and a noisy spectrum.
The vowel sounds are produced by air passing through the vocal tract.
Different vowel sounds are produced by varying the amplitudes of the formant
frequencies, which were discussed in Section 6.3.3. Vowel sounds have waveforms which repeat periodically for the duration of the sound. The waveform
was recorded by the author for nine different vowel sounds, and is shown in
Figure 6.2, exhibiting the periodic nature of these phoneme sounds. The sampling rate used was 22050Hz and the x-axis scale shows the number of samples.
A diphthong is a special type of vowel phoneme that occurs when two vowels
are produced in quick succession. The vocal tract shape moves from one configuration to another in such a way as to cause the two vowels to “run into”
each other which produces a different phoneme sound than would be produced
if the two vowel sounds were sounded separately with a pause between.
The consonants may be classed either by their place of articulation or by
their manner of articulation. The categories for places of articulation are
labial (lips), labio-dental (lips and teeth), dental (teeth), alveolar (gums),
palatal (roof of mouth), velar (part of soft palate) or glottal (gap between
vocal cords).
The categories used for manner of articulation are plosive, fricative, semivowel, liquids or nasal and these are described in more detail below.
Table 6.2: Classification of English Consonants by Place of Articulation and Manner of Articulation (reproduced from [82])

Place of Articulation | Plosive | Fricative | Semi-vowel | Liquids | Nasal
Labial                | p, b    |           | w          |         | m
Labio-Dental          |         | f, v      |            |         |
Dental                |         | θ, ð      |            |         |
Alveolar              | t, d    | s, z      | j          | l, r    | n
Palatal               |         | ʃ, ʒ      |            |         |
Velar                 | k, g    |           |            |         | ŋ
Glottal               |         | h         |            |         |
• Plosives or stops are made by completely blocking the air flow somewhere in the mouth and then suddenly releasing the built-up pressure. The air flow can be blocked by pressing the lips together (labial), pressing the tongue against the gums (alveolar) or by pressing the tongue against the soft palate (velar). /p/, /t/ and /k/ are unvoiced plosives; /b/, /d/ and /g/ are voiced.

• Fricatives are consonants made by constricting the air flow somewhere in the mouth to an extent that makes the air turbulent, producing a hissy sound. They may be unvoiced (e.g. /f/, /s/) or voiced (e.g. /v/, /z/).

• Nasal sounds are voiced consonants made by lowering the soft palate, coupling the nasal cavities to the pharynx, and blocking the mouth somewhere along its length.

• Semi-vowels are voiced consonants that are made by briefly keeping the vocal tract in a vowel-like position and then moving it rapidly to the next vowel sound in the syllable.

• Liquids are voiced consonants. Laterals are a type of liquid; the voiced consonant /l/ is made by putting the tip of the tongue against the gums and allowing air to pass on either side of the tongue.
[Figure 6.2: Waveform of 9 phonemes over a time interval of approx 0.0454 s. (a) /iː/ (b) /eː/ (c) /oː/ (d) /uː/ (e) /ɔː/ (f) /æ/ (g) /ɛ/ (h) /ʌ/ (i) /ɪ/]
6.4.3 Characteristics of Speech Sounds
Sound Intensity
Sound intensity is a measure of the amount of power in a sound wave. The unit of sound intensity is the decibel (dB), which expresses the ratio between the power in the sound wave and the power of the conventionally defined smallest sound intensity which is just about audible (10⁻¹⁶ W/cm²). A 10 dB increase in sound intensity corresponds to an increase of the power in the signal by a factor of 10. In normal conversational speech, the sound intensity three feet away from the speaker is around 65 dB.
There is an approximately 700-to-1 range of intensities between the weakest and strongest phoneme sounds in normal speech. The vowels produce the strongest sound intensities, but even among these there is a 3-to-1 difference. The strongest vowel sound is /O:/ and the weakest is /i:/, which has about the same intensity as the strongest consonant, /r/. The phoneme /r/ is two and a half times more intense than /S/, six times more intense than /n/ and 200 times more intense than the weakest sound, /T/ [82].
Spectrum of Speech
As mentioned before, each vowel is individually recognisable due to different
proportions of the formant frequencies for each sound. Different vowels can
be recognised based on the amplitudes of different formants in the spectrum.
The log frequency spectrum for the nine vowels in Figure 6.2, as computed
by the author, is shown in Figure 6.3. Usually, the first three or four formant
frequencies are adequate for recognition (although there is evidence that vowels
can still be recognised when the first two formants are absent and higher
formants are present).
Even for consonant sounds, spectral features still play an important role in their classification. The sounds /s/ and /S/ can be distinguished from other fricatives as they have larger sound intensities. They are distinguished from each other by spectral differences: /s/ has much of its energy above 4000Hz while /S/ has its energy concentrated in the 2000-3000Hz region.
Number of Zero Crossings
Another popular feature often used to distinguish between phonemes in the time domain is the number of zero crossings. A positive going zero crossing (PGZC) may be defined as the point where the signal changes from negative amplitude to positive amplitude, and a negative going zero crossing (NGZC) as the point where the signal changes from positive amplitude to negative amplitude. The number of PGZCs or NGZCs is counted for a fixed duration of the signal and used to characterise the sound. A signal resembling random noise, such as the /s/ phoneme, will typically have a much higher number of zero crossings than a periodic signal such as a vowel. A pure sinewave signal, such as a whistle, will have one PGZC and one NGZC per signal period.
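To make this feature concrete, the following is a minimal C++ sketch of counting PGZCs over a fixed-length fragment; the function name and sample type are illustrative and are not taken from the thesis code.

#include <cstddef>

// Count positive going zero crossings (PGZCs) in a signal fragment.
// A PGZC occurs where the signal passes from below the mean level to at
// or above it ('mean' is 0 for signed samples, 128 for 8-bit unsigned).
std::size_t count_pgzc(const short* x, std::size_t n, short mean)
{
    std::size_t p = 0;
    for (std::size_t i = 1; i < n; ++i)
        if (x[i - 1] < mean && x[i] >= mean)
            ++p;
    return p;
}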
Periodicity
A number of different features of a signal can be used to test for periodicity, and two are used here in different applications. In the application described in Section 6.5.2, the signal is tested for periodicity temporally, based on comparison of the intervals between successive positive going zero crossings. In the application described in Section 6.6.2, periodicity is determined by locating the three highest peaks in the frequency spectrum. If the signal is periodic, then these should be harmonics, and thus multiples of a common factor, the pitch.
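As an illustration of the first, temporal test, the sketch below accepts a fragment as periodic when the intervals between successive PGZCs are nearly equal; the 1.1 tolerance echoes the criterion used later in Section 6.6.1, and the function name is assumed.

#include <algorithm>
#include <cstddef>
#include <vector>

// Temporal periodicity test: given the sample positions of successive
// PGZCs, accept the fragment as periodic if the longest and shortest
// intervals between them differ by less than about 10%.
bool is_temporally_periodic(const std::vector<std::size_t>& pgzc_pos)
{
    if (pgzc_pos.size() < 3) return false;          // need two intervals
    std::vector<std::size_t> intervals;
    for (std::size_t i = 1; i < pgzc_pos.size(); ++i)
        intervals.push_back(pgzc_pos[i] - pgzc_pos[i - 1]);
    const auto mm = std::minmax_element(intervals.begin(), intervals.end());
    return static_cast<double>(*mm.second) / static_cast<double>(*mm.first) < 1.1;
}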
More information on characteristics of speech can be found in [82].
[Figure 6.3 appears here: nine spectrum panels (a)-(i), each plotting log-scale amplitude against frequency (Hz).]
Figure 6.3: Spectrum of 9 phonemes, shown in the time domain in Figure 6.2.
(a) /i:/ (b) /e:/ (c) /o:/ (d) /u:/ (e) /O:/ (f) /æ/ (g) /E/ (h) /2/ (i) /I/. Each of
the different vowel sounds has a number of spectral peaks, which are in general at
multiples of the lowest peak, the fundamental frequency component. These peaks
are known as harmonics. Different vowel sounds are distinguished from each other
based on the relative amplitudes of these harmonics.
6.4.4 Proposal of a Phoneme Recognition Based System for Communication and Control
There are some instances where using a full speech recognition system for
communication and control would be impossible or infeasible. Section 6.2.2 has
already discussed some of the limitations of speech recognition technologies. A
phoneme recognition based system is proposed as an alternative acoustic based
system that may prove to be preferable under certain circumstances. Some of
the factors influencing choice of a phoneme recognition system over a speech
recognition system are discussed below.
• Users with Physical Speech Disorders
As mentioned before, currently available speech recognition systems may
not be suitable for people with speech disorders. Current figures from the website Irish Health [6] state that 8,500 people in Ireland suffer a stroke annually, leaving an estimated 30,000 of the population with a residual disability, with 20% of these people unable to walk and 50% in need of day-to-day assistance. People who have suffered a stroke
may often exhibit speech disorders such as oral-verbal apraxia or speech
dysarthria.
Oral-verbal apraxia arises due to damage of the anterior portion of the
dominant cerebral hemisphere. It is characterised by an inability to
perform the voluntary motor sequences required for the production of
speech. When a person suffering from oral-verbal apraxia attempts to
speak, often they can only produce a fraction of all the phonemic utterances required for intelligible speech. While this disability usually
renders speech recognition systems unusable, the phonemic utterances
produced by people with this condition may be harnessed through a
phoneme recognition system, if these sounds are consistently and voluntarily repeatable.
Speech dysarthria results from damage to the brainstem or damage to
one or both of the motor strips located in the frontal portion of the
cerebral cortex, which affects the motor aspects of speech (respiration,
phonation, resonance and articulation). If the damage is unilateral then
speech dysarthria is evidenced only by a slight slurring of consonants, a
change in voice quality or a reduction in the rate of speech. However,
if both sides of the cerebral cortex or brainstem are damaged, moderate
to severe speech dysarthria usually occurs. This has a range of manifestations. If the respiratory system is affected, there may not be enough
air expelled to vibrate the vocal cords. If phonation can be produced, it
may be so brief that only one word can be uttered. Poor sound intensity,
monopitch, monoloudness, hypernasality and a slow rate of speaking are
other features common to speech dysarthria.
• Users with Verbal Intellectual Functioning Disorders
As well as communication disorders arising from a physical inability to
perform the necessary processes for forming speech (respiration, phonation, resonance and articulation), communication disorders due to problems with verbal intellectual functioning may also render speech recognition technologies impractical. Cerebrovascular diseases, such as those
that cause strokes, may often cause damage to the left cerebral hemisphere. This area of the brain is usually where the brain’s language centre is located, and damage may result in acquired language impairments
such as aphasia. Aphasia may be defined as a loss of ability to understand
language or to create speech [7]. Aphasic symptoms vary greatly - mild
aphasics may find themselves unable to recall certain words, while those
with very severe aphasia may completely lose their linguistic abilities,
including their ability to recognise nonverbal symbols such as pictures
and gestures. A person with aphasia may not be able to use speech
recognition systems due to an inability to associate verbal commands
with desired actions. However, such people may have more success with phoneme recognition systems, since these rely on a simpler vocabulary based on single phonemes. The reader is referred to [7] for more information on communication disorders.
• Control of a Continuously Varying Parameter
Phoneme based systems may provide a more intuitive way of controlling
a parameter which requires continuous rather than discrete control. A
“continuously varying” parameter as discussed here includes any parameter that may ordinarily be controlled by turning a knob rather than by
pressing a switch, such as the volume control on a radio. It is appreciated that in many modern appliances the act of turning a knob to increase a parameter in fact moves the parameter through a range of discrete values rather than allowing strictly continuous selection, where the parameter can be set to any value within a range; such instances are nonetheless included under the heading of a “continuously varying” parameter for the discussion here.
Two aspects of a phoneme may be used to control a continuously varying
parameter - pitch and duration. If a phoneme is a periodic sound, such as
any of the vowel phonemes in Table 6.1, then it will have a fundamental
frequency component, or pitch, associated with it. Variations in pitch
can be used to indicate the direction of change (e.g. for the radio volume
example, an increase in pitch could correspond to “volume up” and a
decrease in pitch could correspond to “volume down”). Pitch variation
can also be used to control the rate of change of the parameter, i.e.
how fast the volume is increased or decreased. Phoneme duration can
be used to indicate the extent of change (e.g. for the radio, the user
makes the required sound until the volume has reached an acceptable
level). Phoneme control of a continuous parameter is demonstrated in
the work presented here by the sample application Spelling Bee, described
in Section 6.6.2, in which a “bee” is moved around the computer screen
by varying the pitch and duration of a periodic sound.
• User Training
There is typically a much smaller degree of user variability when uttering a single phoneme than there would be when uttering a word or a
sentence, due to a more uniform method of pronunciation when uttering
single phonemes, which is largely independent of accent. The greatest
degree of variability for each particular phoneme probably lies in pitch differences, but even this can be taken into account by measuring the ratios of the harmonics rather than their exact locations, or by setting a range of
pitches characteristic of a phoneme. Because of this advantage, simple
phoneme recognition systems such as those described in Section 6.5 can
store a set of features common to a particular phoneme to enable recognition, rather than requiring the user to train the system to recognise
their individual voice. These systems may be suitable for environments
where the systems are required to respond to the voice input of a large
range of different users, such as in public buildings.
• Vocabulary Size
As the number of phonemes in the English language is only somewhere between 39 and 49, depending on dialect, the number of possible
commands for a phoneme based system is relatively small. While this
limitation does prevent phoneme recognition from being used in more
complex applications, a small vocabulary does offer the benefits of a
higher accuracy rate and a faster classification time.
6.5 Hardware Application
Two hardware based systems were developed as part of the work here. Both
are based on detection of the two phonemes /o:/ and /s/. These phonemes
were chosen since they have spectral and temporal features which make them
readily distinguishable from each other, enabling them to be used without
requiring training to a particular user’s voice. The waveforms of these two
phonemes and their spectra are shown in Figure 6.4.
The systems described here can be interfaced with any application requiring
operation by one or two switches but have been specifically developed to be
used with a reading machine, which was previously developed in the lab in
the National Rehabilitation Hospital. The reading machine was designed to
provide people who are severely disabled with an alternative to page turners for reading books, since page turners have a number of problems: they provide no means of turning back to a previous page, often turn more than one page at a time and sometimes tear the pages of the book. In order to read a book using this reading machine, the book must first be transferred page by page onto a roll of acetate, which is then affixed to a motor on the reading machine. The reading machine input is a 1/4 inch stereo phone plug
which can be thought of as two switches - Switch A closes when the tip of the
plug is connected to the plug’s sheath, and Switch B closes when the ring of
the plug is connected to the plug’s sheath. When Switch A closes the reading
machine will scroll the acetate onto the next page. When Switch B closes the
reading machine turns the entire roll around by 180◦ allowing the user to read
the other side of the page. To go back a page, Switch A must be held closed
for a fixed length of time. So to operate this machine using phonemic control,
the user makes a short /o:/ sound to scroll to the next page, an /s/ sound to
flip over to read the back of the page, and a long /o:/ sound to go back to the
previous page.
The system was designed first using analogue circuitry based on filters.
Circuitry based on a narrow-band, band-pass filter with a low frequency passband was designed to close Switch A if a low frequency signal was detected
(e.g. /o:/) and circuitry based on a wide-band, band-pass filter with a higher
frequency passband was designed to close Switch B if a high frequency signal
was detected (e.g. /s/). This circuit performed reasonably well but was eventually replaced by a microcontroller based circuit, which replaces the spectrally based classification criterion with a closely related temporally based criterion
based on the number of positive going zero crossings (PGZCs), and also adds
a requirement that the signal is periodic before it will be recognised as the
phoneme /o:/.
6.5.1 Analogue Circuit
As the analogue circuit was only briefly used, just a short description of the theory of its operation is given here. For those interested in the full technical details, the complete circuit diagram is given in Appendix F, along with the calculations of all the component values quoted here.
The operation of this circuit is described as a series of stages, which are shown
in the block diagram in Figure 6.5.
Pre-Amplifier
The pre-amplifier stage is necessary to boost the signal before it is passed to
the filter. The gain of this stage is 10.
Filtering
The output from the pre-amplifier is passed to two band-pass filters. One of
these is designed to pass signals with frequency components characteristic of
an utterance of the /o:/ phoneme, the other is designed to pass signals with
frequency components characteristic of an utterance of the /s/ phoneme.
The /o:/ utterance is a periodic signal and hence can be considered to have a
narrow-band spectrum, although it will have harmonics at higher frequencies.
The pitch of this vowel sound varies between users. Therefore, to pass this
signal successfully, the filter used was a narrow-band, band-pass filter with
adjustable centre frequency. The maximum gain of the filter is 25, and the
centre frequency is adjustable between approximately 200Hz - 1.6kHz.
(a) /o:/ Waveform
(b) /s/ Waveform
(c) /o:/ Spectrum
(d) /s/ Spectrum
Figure 6.4: Phoneme Waveforms and Spectra. The /o:/ waveform is periodic and
has more of its power at a lower frequency than the /s/ phoneme, which has frequency
components across the spectrum.
[Figure 6.5 appears here: a block diagram in which the microphone feeds a pre-amplifier whose output splits into two chains: one through a filter, rectifier, smoothing and buffer, and threshold stage to Switch 1; the other through a band-pass filter, amplifier, rectifier, buffer, threshold, and delay and comparator stages to Switch 2.]
Figure 6.5: Block diagram of Circuit Stages for Phoneme Recognition of a Low Pitched and High Pitched Sound
The /s/ utterance resembles noise with a wide-band spectrum with most of
its energy between 2-8kHz. The filter required to pass this sound successfully is
a wide-band, band-pass filter. The filter chosen has a centre frequency at 5kHz,
a bandwidth of 1.25kHz, and a maximum gain of 12.5. Since this phoneme is
typically of a lower intensity than the /o:/ phoneme (refer to Section 6.4.3),
further amplification is required in the next stage.
Amplifier This stage was only required in the part of the circuit for detection
of the utterance /s/. The signal is further amplified by a factor of 3.9.
Rectifier The signal that is passed through each of the two filters is rectified
using an envelope detector.
Buffer This stage increases the current gain of the circuit to enable sufficient
current to be supplied to close the relay coil.
Threshold Thresholding will output a signal large enough to close the switch
if the sound intensity received is large enough. This prevents components
of other noises that are at a similar frequency to the frequency desired
from accidentally closing the switch.
Delay and Comparator A delay was used for subcircuit B, since accidental noises (e.g. sudden door slams) can often have a frequency spectrum similar to this phoneme. The delay stage means that a signal of the correct spectrum must be sustained for 0.27s before the switch will close. The comparator then switches on a current (raising the voltage level) once the required delay time has elapsed.
Relays The relay coils will close a switch when 5V is dropped across them.
These switches can then be connected to any system that requires two
switching actions.
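For orientation, the selectivity implied by the figures in the Filtering stage can be expressed through the standard quality factor definition; this is a textbook relation rather than a calculation reproduced from Appendix F. For the /s/ filter:

$$Q = \frac{f_0}{\Delta f} = \frac{5\,\mathrm{kHz}}{1.25\,\mathrm{kHz}} = 4$$

A quality factor this low is consistent with describing that stage as wide-band; the narrow-band /o:/ stage, whose bandwidth is not quoted here, would have a substantially higher Q.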
6.5.2 Microcontroller Circuit
A system with similar functionality to the one described above was implemented using a PIC microcontroller, which meant that the condition of periodicity could be added to the criteria for recognition of the phoneme /o:/.
The criteria used in this case to recognise each of the two phonemes are:
• /o:/ Phoneme
(i) The time interval between successive positive going zero crossings
(PGZCs) must remain roughly constant.
(ii) The time interval must be within a range of thresholds.
• /s/ Phoneme
The average rate of PGZCs must be greater than a threshold (typically 2000 PGZCs per second).
The microcontroller based phoneme detection circuit is given in Appendix F. The system consists of seven stages: the input stage (microphone), an amplification stage, an infinite clipper, a microcontroller, a debouncing stage, a current amplification stage and an output stage (relay coils). The output of the amplification and infinite clipping stages is shown in Figure 6.6. The final signal is the signal received by the microcontroller. Two PIC16F84 microcontrollers were used, one to detect each phoneme. The pin-outs of the microcontrollers and the external components used are given in Appendix G. The code is given in Appendix H. The reader is referred to [90] for a comprehensive reference to the assembly language commands used.
The technique for detecting the /s/ phoneme uses the fact that it has
a higher number of zero-crossings per unit time than the /o:/ phoneme. The
microcontroller’s timer TMR0 is configured to count from the microcontroller’s
external clock input pin.

Figure 6.6: Pre-processing stages of the audio signal. The top graph shows the raw signal. The middle graph shows the signal amplified and shifted so it rides around 4.5V. The bottom graph shows the infinitely clipped signal that is input into the microcontroller.

The TMR0 is a one-byte register which will set a flag when it overflows. It is initialised to 236 before starting a 10.24ms dummy
loop. While this loop is running, the timer increments each time a rising edge
of the input signal occurs, i.e. each time a PGZC is detected. At the end of the
10.24ms period the timer overflow flag is checked. The timer will overflow if a
sufficient number of rising edges have occurred within the time period and the
timer overflow flag will be set. If this is the case, an /s/ phoneme is deemed
to have occurred and the appropriate action is taken.
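The arithmetic behind these figures is worth noting: preloading TMR0 with 236 means it overflows after 256 − 236 = 20 rising edges, and 20 edges in 10.24ms corresponds to roughly 1950 PGZCs per second, consistent with the threshold of about 2000 PGZCs per second quoted earlier. The following C++ restatement of the loop is purely illustrative; the helper functions stand in for direct register access and the real implementation is the PIC assembly in Appendix H.

// Hypothetical helpers standing in for PIC16F84 register access.
void write_tmr0(unsigned char v);   // preload the TMR0 counter
void clear_overflow_flag();         // clear the timer overflow flag
void run_dummy_loop_ms(double ms);  // busy-wait while TMR0 counts edges
bool overflow_flag_set();           // true once TMR0 has wrapped past 255

bool detect_s_phoneme()
{
    write_tmr0(236);             // overflow now needs 256 - 236 = 20 edges
    clear_overflow_flag();
    run_dummy_loop_ms(10.24);    // TMR0 increments on each PGZC meanwhile
    return overflow_flag_set();  // >= 20 PGZCs in 10.24ms implies /s/
}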
The technique for detecting the /o:/ phoneme looks for two features of
this phoneme: firstly, that it has a periodic waveform with one zero-crossing per period, and secondly, that it has a lower number of zero-crossings than
the /s/ phoneme. In this case, the microcontroller’s timer TMR0 is set to
count internally. It is pre-scaled so it will overflow if it counts uninterrupted
for approximately 66ms. If it overflows the interrupt service routine is called.
The signal is input on the microcontroller’s external interrupt pin, which will
also call the interrupt service routine when the value at this pin changes. The
main program runs a dummy loop that runs continuously until one of these
two interrupts calls the interrupt service routine. When this routine is called
the program runs through a loop of decision processes to decide what action
to take. The flow of control of the interrupt service routine is shown in the
flowchart in Figure H.1 in Appendix H.
6.6 Software Application
Two software phoneme recognition systems and two complementary sample
software applications using each of these recognition systems were developed
as part of the work described here. The system was first developed based on the
Linux operating system. A phoneme recognition tool named the AudioWidget
was created which can be included into more complex programs to provide
phoneme recognition capabilities. The AudioWidget operates in two modes.
The first of these modes is pitch detection and the second is phoneme detection.
The phoneme detection mode is again based on the two phonemes /o:/ and /s/.
An environmental control graphical menu system was developed incorporating
the AudioWidget to provide a sample application. This system is particularly
suited to use by aphasics since the menu items are pictures and the control
mechanism is non-verbal, meaning the whole system can be used completely
independently of words or text.
The final phoneme recognition system presented here has been developed
for the Windows operating system. All the methods described thus far are
based on recognition of the same two phonemes and may be thought of as the
phoneme recognition equivalent to the Acoustic Phonetic Approach of speech
recognition described in Section 6.2.1. These systems have limited applicability
and a more generic system configurable by the user’s therapist is described.
This system may be thought of as the phoneme recognition equivalent of the
Pattern Matching Approach discussed in Section 6.2.1. A sample application
for this system was also developed, which has been called the “Spelling Bee”
and is described below.
6.6.1 Application for Linux
The two software phoneme recognition systems presented here are divided into separate sections by operating system, because each operating system requires a different programming approach to enable data to be read into the program from the sound card. Linux probably provides an easier programming interface to the sound card and was thus chosen as the operating system on which to initially develop a system requiring access to the sound card. A separate system was later developed for the Win32 operating system.
The Linux application developed uses the Open Sound System (OSS) [91]
application programming interface for capturing sounds, which is defined by
including the header file linux/soundcard.h in programs requiring this in195
terface. The OSS is a device driver developed for UNIX and UNIX-compatible
operating systems, such as Linux, which supports most of the common sound
cards available. Any sound card will have a number of different devices on
it. The Digitised Voice Device is used for recording and playback of digitised
sound. A sound card may also have a Mixer Device to control various input
and output levels and a Synthesiser Device for playing music and generating
sound effects. The OSS supports a number of different device files which enable access to various devices on the sound card. The most important device
file for the purposes here is /dev/dsp. This device file can be treated like a
normal file stored on the hard drive and can be used with standard Linux
file control calls defined in fcntl.h such as open, close, read and write.
Reading this device file returns the audio data recorded from the current input
source, which is the microphone by default.
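A minimal sketch of this style of capture is given below, using only standard OSS calls; the parameters match those used by the AudioWidget in the next subsection, but the code itself is illustrative rather than the thesis source.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/soundcard.h>

// Open /dev/dsp, set 8-bit unsigned / mono / 8kHz, and read one
// 512-sample fragment from the microphone.
int main()
{
    int fd = open("/dev/dsp", O_RDONLY);
    if (fd < 0) return 1;

    int fmt = AFMT_U8, channels = 1, rate = 8000;
    ioctl(fd, SNDCTL_DSP_SETFMT, &fmt);        // 8-bit unsigned samples
    ioctl(fd, SNDCTL_DSP_CHANNELS, &channels); // mono
    ioctl(fd, SNDCTL_DSP_SPEED, &rate);        // 8kHz sampling rate

    unsigned char x[512];                      // one fragment (0.064s)
    ssize_t got = read(fd, x, sizeof x);
    // ... analyse the fragment (FFT, PGZC count, variance, etc.) ...
    close(fd);
    return got == sizeof x ? 0 : 1;
}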
AudioWidget
The program AudioWidget uses OSS and the standard file control calls to read
data from the microphone using /dev/dsp. The default audio data format when
reading from this device file is 8kHz/8-bit unsigned/mono but it is usually not
safe to assume that this will always be the case so the program explicitly sets
these parameters to these values. The fragment size is set to 512 samples
which are read into a 512 element array which we will call x. This corresponds
to a time of 0.064s, which is a reasonable time to take as the vocal tract
configuration and excitation will usually not vary significantly during this time.
A 512-sample Fast Fourier Transform (FFT) is performed on the 512-sample
fragment, using code taken from Dr. Dobb’s Journal [92] which is based on a
radix 2 algorithm for FFT computation. This returns an array, which we will
call y, with 256 elements, or N = 256. The program operates in two modes pitch tracking mode and phoneme recognition mode.
In pitch tracking mode the program responds only to signals with a single
frequency component, such as a whistle. The program continuously analyses each 512-sample fragment received to test for single frequency component
signals. The amplitude Amax of the maximum frequency component in the
spectrum is identified by looking within the y array in the range cutoff < i
< N , where cutoff is defined by the program. (This is usually around 10 and
is to ensure peaks due to low-frequency noise are eliminated from the search).
The index $n_{max}$ of this peak is recorded. The program then looks to see if there is another peak in the array y outside of this peak area, i.e. an $m_{max}$ within the array y such that:

$$y_{m_{max}} > \frac{A_{max}}{10.0} \quad \text{if } m_{max} < (n_{max} - 5) \text{ or } m_{max} > (n_{max} + 5) \tag{6.4}$$
If such a peak $m_{max}$ exists, then the signal at the microphone input is not a single frequency component signal and thus is ignored. If only one main frequency component exists, then the pitch, and thus the fundamental frequency $f_{max}$, of the signal may be calculated using N, $n_{max}$ and the sampling frequency $f_s$:
$$f_{max} = n_{max} \times \frac{f_s/2}{N} = n_{max} \times \frac{4000}{256} \tag{6.5}$$
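Equations 6.4 and 6.5 together amount to the following check, sketched here with assumed names (N = 256 and fs = 8000Hz as above); it returns the pitch in Hz, or -1 if the fragment is not a single-frequency signal.

#include <cstddef>

double pitch_of_fragment(const double* y, std::size_t N, std::size_t cutoff)
{
    std::size_t n_max = cutoff + 1;              // index of highest peak
    for (std::size_t i = cutoff + 1; i < N; ++i)
        if (y[i] > y[n_max]) n_max = i;
    const double A_max = y[n_max];

    // Equation 6.4: any other component outside +/-5 bins of n_max that
    // exceeds A_max/10 means this is not a single-frequency signal.
    for (std::size_t m = cutoff + 1; m < N; ++m)
        if ((m + 5 < n_max || m > n_max + 5) && y[m] > A_max / 10.0)
            return -1.0;

    return n_max * (8000.0 / 2.0) / N;           // Equation 6.5
}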
In phoneme recognition mode the program continuously classifies each fragment of data as one of three types:
1. NO PHONEME - neither phoneme sound was uttered
2. O PHONEME - the /o:/ phoneme was uttered
3. S PHONEME - the /s/ phoneme was uttered
Since the data in the x and y arrays are in 8-bit unsigned format, possible sample values are in the range [0, 255] and the mean values are x̄ = ȳ = 128. The “zero crossing” point for the spectrum may then be considered to be the point where:

$$y_{i-1} \le 127 \quad \text{and} \quad y_i \ge 128 \tag{6.6}$$
The first step in classifying the signal is to decide whether or not any sound occurred, by calculating the variance $S^2$ of the sampled time signal x. In this case x̄ = 128 and N = 512.

$$S^2 = \frac{\sum_{i=0}^{N-1} (x_i - \bar{x})^2}{N} \tag{6.7}$$
If $S^2$ is less than a threshold, then the fragment is classified as being of type NO PHONEME. Otherwise, the number of PGZCs, p, is counted according to the criteria in Equation 6.6. The maximum interval between PGZCs, $I_{max}$, and the minimum interval between PGZCs, $I_{min}$, are recorded. Phoneme classification is made based on the following criteria, which were set out based on experimental findings:
$$\begin{aligned}
\text{if } p < 8 \quad &\Rightarrow\quad \text{NO PHONEME} \\
\text{if } 8 < p < 20 \text{ and } I_{max}/I_{min} < 1.1 \quad &\Rightarrow\quad \text{O PHONEME} \\
\text{if } p > 40 \quad &\Rightarrow\quad \text{S PHONEME} \\
\text{otherwise} \quad &\Rightarrow\quad \text{NO PHONEME}
\end{aligned}$$
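A compact sketch of this classification, with an assumed variance threshold parameter, is given below; it follows the criteria above but is illustrative rather than the AudioWidget source.

#include <algorithm>
#include <cstddef>
#include <vector>

enum Phoneme { NO_PHONEME, O_PHONEME, S_PHONEME };

Phoneme classify_fragment(const std::vector<unsigned char>& x,
                          double var_threshold)
{
    double s2 = 0.0;                       // Equation 6.7, mean = 128
    for (unsigned char v : x) s2 += (v - 128.0) * (v - 128.0);
    s2 /= x.size();
    if (s2 < var_threshold) return NO_PHONEME;

    std::vector<std::size_t> pos;          // PGZC positions (Equation 6.6)
    for (std::size_t i = 1; i < x.size(); ++i)
        if (x[i - 1] <= 127 && x[i] >= 128) pos.push_back(i);

    const std::size_t p = pos.size();
    if (p > 40) return S_PHONEME;
    if (p > 8 && p < 20) {
        std::vector<std::size_t> iv;       // intervals between PGZCs
        for (std::size_t i = 1; i < pos.size(); ++i)
            iv.push_back(pos[i] - pos[i - 1]);
        auto mm = std::minmax_element(iv.begin(), iv.end());
        if (double(*mm.second) / double(*mm.first) < 1.1) return O_PHONEME;
    }
    return NO_PHONEME;                     // includes the p < 8 case
}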
The graphical user interface (GUI) for the AudioWidget is based on the cross-platform GUI toolkit for C++ called the Fast Light Toolkit (FLTK, http://www.fltk.org). The AudioWidget GUI is shown in Figure 6.7. Note that the graph on the left shows the time domain signal and the graph on the right shows the corresponding frequency spectrum for that fragment. The power in the spectrum is scaled to fit inside the box, which is why there appears to be a large spectrum even when no sound is uttered; the spectrum shown then just represents wide-band ambient noise from other sources. The red part of the spectrum is the part falling within the cutoff region, which is ignored by the program when looking for peaks. It can be seen that for a whistle there is only one
distinct peak, for the /o:/ phoneme there are a series of evenly spaced peaks
representing the harmonics, and for the /s/ phoneme the spectrum looks like
wide-band noise.
Graphical Menu
An environmental control program “Graphical Menu” was developed as part
of the work here to give an example of an application incorporating the AudioWidget described above. This program is configurable by a therapist or
helper and is designed to be operable solely by symbols and non-verbal utterances rather than words - enabling use by people regardless of their linguistic
abilities. The program is a list of menu items which are each individually created by the therapist. An arbitrary command is associated with each menu
item. There is a drawing facility for adding a symbolic representation to each menu item, and a textual description of the menu item may optionally be added. The graphical menu is shown in Figure 6.8 with three menu items added:
“Turn Radio On”, “Turn Light On” and “Turn Light Off”. The user scrolls
through menu items by making the sound /s/ to move to the next menu item.
A menu item is selected by making the sound /o:/.
A commercially available supplementary home automation module called
the X10 module (X10 Home Automation System: http://www.x10.co.uk) was used to control appliances. The X10 command signals
are transmitted over domestic power lines. The command signals can be transmitted via the computer’s power line through a building’s electrical wiring to
the appliance’s power line. The X10 system is shown in Figure 6.9.
6.6.2 Application for Windows
In the Windows application developed here, the sound card was accessed using
DirectSound, which is part of the DirectX suite of application programming interfaces.
(a) Whistle
(b) No input
(c) /o:/ phoneme
(d) /s/ phoneme
Figure 6.7: The AudioWidget responding to different signals (a) Pitch Detection
Mode - the tracker on the bottom varies according to pitch (b)(c)(d) Phoneme Detection Mode - the status bar at the bottom changes colour according to detected
phoneme.
Figure 6.8: Graphical Menu operated using the AudioWidget
Figure 6.9: The X10 module. A control signal from the computer travels down
through the computer’s electrical connection to the X10 receiver attached to the
lamp, enabling the computer to switch the lamp on.
DirectSound enables WAV sounds to be captured from a microphone.
DirectSound and the other DirectX components are based on the Windows Component Object Model (COM) technology.
The program called Phoneme Detection creates a DirectSound device which
is then used to create a capture buffer which captures the data from the microphone. In this application a 1s buffer was created with PCM wave format,
one channel, an 8kHz sampling rate and 16 bits per sample. This buffer is
continuously filled with data. Each time a 512-sample chunk is filled, it is read
into array x, with N = 512, and analysed. The spectrum of the fragment is
calculated, again by performing a 512-sample FFT based on code from Dr.
Dobb’s Journal [92] which is read into an array y with N = 256.
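For orientation, the buffer creation just described might look like the following sketch using the DirectSound capture API; error handling and COM initialisation are omitted, and this is an assumed reconstruction rather than the program's actual code.

#include <windows.h>
#include <dsound.h>

// Create a 1s capture buffer: PCM, one channel, 8kHz, 16 bits/sample.
bool create_capture(LPDIRECTSOUNDCAPTURE& dsc,
                    LPDIRECTSOUNDCAPTUREBUFFER& buf)
{
    if (FAILED(DirectSoundCaptureCreate(NULL, &dsc, NULL))) return false;

    WAVEFORMATEX wfx = {};
    wfx.wFormatTag      = WAVE_FORMAT_PCM;
    wfx.nChannels       = 1;
    wfx.nSamplesPerSec  = 8000;
    wfx.wBitsPerSample  = 16;
    wfx.nBlockAlign     = wfx.nChannels * wfx.wBitsPerSample / 8; // 2 bytes
    wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign;   // 16000

    DSCBUFFERDESC desc = {};
    desc.dwSize        = sizeof(desc);
    desc.dwBufferBytes = wfx.nAvgBytesPerSec;   // one second of audio
    desc.lpwfxFormat   = &wfx;

    return SUCCEEDED(dsc->CreateCaptureBuffer(&desc, &buf, NULL));
}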
Phoneme Detection Program
On each fragment of audio data received, the variance $S^2$ is calculated to check if a sound was uttered, using Equation 6.7 with N = 512 and x̄ = 0, since in
this case the data is signed and thus has zero mean. If the variance is greater
than a threshold a number of features are calculated to enable the sound to
be characterised.
Number of PGZCs The number of PGZCs, p, is incremented when

$$y_{i-1} < 0 \quad \text{and} \quad y_i \ge 0 \tag{6.8}$$
Spectral Peaks Spectral peaks were detected using the Slope Peak Detection Method, where w is the peak width, defined at the beginning of the program. A peak is detected at index i if and only if:

$$y_{i-w} < y_{i-w+1} < \cdots < y_{i-1} < y_i \quad \text{and} \quad y_i > y_{i+1} > \cdots > y_{i+w-1} > y_{i+w}$$

Each time a peak is detected its index is read into the next available location in an array peak_index. Once all the peaks have been detected, peak_index is searched and the indices of the three highest peaks are read into a three-element array peak_index_max. These three indices are then re-arranged in ascending order of index value. If the signal is periodic, the first element of this array will usually correspond to the pitch of the received signal. Peak detection for the signal in Figure 6.10(a) is shown in Figure 6.10(b). The circles represent all the detected peaks (the red line marks the highest peak, the green line the second highest and the blue line the third highest peak). The three highest peaks are shown in Table 6.3.

Table 6.3: Spectral Peaks for Signal in Figure 6.10

Peak              | Freq (Hz) | Harmonic     | Highest Peak Amplitude
------------------|-----------|--------------|-----------------------
peak_index_max[0] | 234.375   | Fundamental  | 2nd
peak_index_max[1] | 468.750   | 2nd harmonic | 1st
peak_index_max[2] | 937.500   | 4th harmonic | 3rd
Periodicity A truly periodic signal may be defined as a signal which exactly
repeats itself after every T seconds, where T is the period of the signal.
For the most part, “periodic” acoustic signals such as whistles and vowel
sounds will not be exactly periodic once received by the computer, due to a number of factors including ambient noise, recording errors and slight
movement of the vocal organs when attempting to produce a constant
periodic tone. For the purposes here, we define a signal as being “approximately periodic” if most of the power in its frequency spectrum lies
within the fundamental frequency peak plus the harmonic peaks. Since
the three largest peaks in the frequency spectrum have already been
identified, we can use these to define a test for approximate periodicity
of any arbitrary signal. If the lowest peak index is called P1 and the two
other main peaks are called P2 and P3 , then a signal is periodic if P2 and
P3 are multiples of P1 i.e. P1 is the fundamental frequency component
and P2 and P3 its harmonics (which are the two highest harmonics of
the signal other than the fundamental and will not necessarily be the 2nd
and 3rd harmonics). If this is the case, then the fragment is marked as
an “approximately periodic” signal.
P2 and P3 are tested to assess whether they are multiples of P1 by calculating the two ratios $r_1 = P_2/P_1$ and $r_2 = P_3/P_1$; $d_1$ and $d_2$ are then calculated as the portions of $r_1$ and $r_2$, respectively, after the decimal point. From these, the final variables $t_1$ and $t_2$ are calculated based on the following conditions:

$$t = \begin{cases} d & \text{if } d \le 0.5 \\ 1 - d & \text{if } d > 0.5 \end{cases} \tag{6.9}$$

P2 and P3 are multiples of P1, and thus the signal is marked as approximately periodic, if $t_1 < 0.15$ and $t_2 < 0.15$. See the program code in Appendix I for further details; a sketch combining peak detection with this periodicity test is also given after this list.
In the example signal in Figure 6.10 and in Table 6.3, the peak at 468.75Hz and the peak at 937.5Hz are exact multiples of the fundamental frequency peak at 234.375Hz; thus the signal is periodic. Inspection of
the time signal in Figure 6.10(a) indeed shows that this appears to be
the case.
Normalised Peak Values As mentioned before, different vowel sounds are identifiable based on the different relative amplitudes of their harmonics. The normalised amplitudes of the three peaks in the array peak_index_max were recorded and stored in an array called peak_ratio, normalised so the maximum peak amplitude had a value of 100. Typical values for the amplitudes of different vowel phonemes for a female are shown in Table 6.4. Note that the values shown are only those recorded for one fragment of data, and the values recorded will in general fluctuate from the values shown throughout the duration of the vowel utterance, causing considerable overlap between phoneme definitions. Thus, each of the 8 vowel sounds will not be readily distinguishable simultaneously, but it is hoped that for each individual user, a subset of these sounds will exhibit distinctive enough amplitudes to enable correct classification of some of these sounds at the same time.

Table 6.4: Example values for relative harmonic amplitudes of vowels calculated by the program Phoneme Detection.

Phoneme | peak_ratio[0] | peak_ratio[1] | peak_ratio[2]
--------|---------------|---------------|--------------
/i:/    | 100           | 94            | 77
/e:/    | 100           | 95            | 85
/o:/    | 100           | 94            | 90
/O:/    | 100           | 97            | 95
/æ/     | 100           | 91            | 83
/E/     | 100           | 95            | 91
/2/     | 100           | 91            | 87
/I/     | 100           | 91            | 90
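The sketch below, referred to above, combines the Slope Peak Detection Method with the harmonic-multiple periodicity test of Equation 6.9; array handling and names are assumptions rather than the Phoneme Detection source.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// True if index i is a peak of width w: y rises strictly up to i and
// falls strictly after it.
static bool is_peak(const std::vector<double>& y, std::size_t i, std::size_t w)
{
    for (std::size_t k = i - w; k < i; ++k)
        if (!(y[k] < y[k + 1])) return false;
    for (std::size_t k = i; k < i + w; ++k)
        if (!(y[k] > y[k + 1])) return false;
    return true;
}

// Detect all peaks, take the three largest, and apply the Equation 6.9
// test: the signal is approximately periodic if the two higher peaks sit
// within 0.15 of an integer multiple of the lowest one.
bool approximately_periodic(const std::vector<double>& y, std::size_t w)
{
    std::vector<std::size_t> peaks;
    for (std::size_t i = w; i + w < y.size(); ++i)
        if (is_peak(y, i, w)) peaks.push_back(i);
    if (peaks.size() < 3) return false;

    std::partial_sort(peaks.begin(), peaks.begin() + 3, peaks.end(),
        [&](std::size_t a, std::size_t b) { return y[a] > y[b]; });
    peaks.resize(3);
    std::sort(peaks.begin(), peaks.end());   // P1 < P2 < P3 by index
    const double P1 = double(peaks[0]), P2 = double(peaks[1]),
                 P3 = double(peaks[2]);

    auto t = [](double r) {                  // fractional distance from
        double d = r - std::floor(r);        // the nearest integer
        return d <= 0.5 ? d : 1.0 - d;
    };
    return t(P2 / P1) < 0.15 && t(P3 / P1) < 0.15;
}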
Like the AudioWidget program, this program can operate in two modes pitch tracking mode and phoneme detection mode. The pitch tracking mode
in this program is more adaptable than in the AudioWidget described in Section 6.6.1 since it also includes a facility to simultaneously detect non-periodic
phonemes while pitch tracking is running. If the audio signal received is periodic then the pitch of the signal is calculated and this can be associated with
any command requiring a continuous input (such as the volume on a radio
or moving a mouse up and down). Non-periodic utterances can be used to
create stored template feature sets, each of which can be associated with a
different command. When a non-periodic utterance is received its features are
compared to the stored feature sets. If a match is found then the associated
command is performed.
(a) Signal
(b) Spectrum
Figure 6.10: Signal and its spectrum, from phoneme detection program.
In pure phoneme detection mode, periodic utterances can also be used to
create stored template feature sets, allowing different vowel sounds to be used
to create different template feature sets. If two template feature sets are too
similar then the program generates a warning and the user is advised to re-record both sounds or to choose a different sound to associate with a command.
Spelling Bee
An example communication program called the Spelling Bee was developed,
incorporating the Phoneme Detection program described above. The GUI for
this program is shown in Figure 6.11, showing a bitmap of a “bee”. When no
input is received from the microphone, the bee drifts from left to right across
the middle of the alphabet board, at a user configurable speed. When the bee
reaches the end of the board it “wraps around” the alphabet board and will
reappear on the left side of the screen and drift over towards the right again.
The vertical direction of the bee’s drift is controlled by the user by making a
periodic sound, such as a vowel sound, and adjusting the pitch to move the
bee upwards or downwards. To direct the bee towards the top-right corner
the user needs to make a rising pitch sound, to direct the bee towards the
bottom-right corner the user needs to make a falling pitch sound.
Each time the program receives a periodic sound from the microphone, the
pitch of the sound received is calculated, using the pitch detection feature of
the Phoneme Detection program. The pitch difference, ∆f , is calculated by
subtracting the previous pitch from the current pitch. The vertical increment
or decrement of bee position, Ai , is then calculated using ∆f and the previous
value of Ai−1 , according to the following rules.
• if |∆f| > constant ⇒ Ai = 0. This ensures the program will only respond to pitch changes within a certain range, such as a rising or falling
vowel sound, and will not respond to sudden large pitch changes due to
arbitrary noises from other sources.

Figure 6.11: Spelling Bee. The bee is directed up and down by the user’s voice. In the current screenshot the pitch difference between the last pitch (29) and the current 2nd peak (30) is +1, so the bee will move upwards.
• if ∆f > 0 and if Ai−1 > 0 ⇒ Ai = Ai−1 + 5∆f . This enables the
program to increase the bee’s rate of movement upwards (accelerate) if
a continuously rising periodic sound is made.
• if ∆f < 0 and if Ai−1 < 0 ⇒ Ai = Ai−1 + 5∆f . This enables the
program to increase the bee’s acceleration downwards if a continuously
falling periodic sound is made.
• Otherwise Ai = 7∆f .
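These rules reduce to a small update function; in the sketch below, MAX_DELTA stands in for the unspecified constant in the first rule.

#include <cmath>

// Compute the next vertical increment A(i) of the bee from the pitch
// difference delta_f and the previous increment a_prev.
double next_increment(double delta_f, double a_prev, double MAX_DELTA)
{
    if (std::fabs(delta_f) > MAX_DELTA) return 0.0;  // ignore large jumps
    if (delta_f > 0.0 && a_prev > 0.0)               // accelerate upwards
        return a_prev + 5.0 * delta_f;
    if (delta_f < 0.0 && a_prev < 0.0)               // accelerate downwards
        return a_prev + 5.0 * delta_f;
    return 7.0 * delta_f;
}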
Note that in this program, the second peak (peak_index_max[1]) is chosen as
the peak to label as the pitch of the sound, although the true pitch will actually
be the first peak. The reasons for this choice are now explained. The program
is only looking at small windows of data at a time (512 samples per fragment at
8000 samples per second gives 0.064 seconds per fragment). This small window
was chosen so the assumption could be made that the vocal tract configuration
remains constant over the entire fragment and also to enable the program
to respond faster to user input. However, it does introduce the limitation
that each element of the frequency spectrum array spectrum corresponds to
a frequency step of approximately 16Hz. So, for the program to respond to an
increase in pitch using the fundamental frequency component, the user would
need to raise the pitch of their voice by 16Hz before any change in pitch would
be detected. By using the 2nd harmonic component, the user has twice the
degree of control over the pitch increment. Using the 3rd component of the
array peak_index_max would give an even greater degree of control, although
in practice this was found not to work very well - the third peak detected
seemed to jump position frequently between the 3rd and 4th harmonics of the
spectrum.
The Spelling Bee also requires that a non-periodic sound is recorded at the
start of the program which is called the “Selection Indicator”. The features
of this sound are stored and, each time a non-periodic sound is made, its features are compared to the stored features. If a match is found, this is indicated and the program waits for another match to certify that the expected phoneme was actually uttered. If another matching fragment is found within a short interval of time, then a “selection” is confirmed and the bee “lands” on whichever letter it is currently hovering over. The chosen letter
then appears in the edit box across the bottom of the screen. Thus the user
can spell out a message.
6.7 Conclusions
Phoneme detection has been discussed in this chapter as an alternative method
to speech recognition for providing communication and control for disabled
people. The advantages of phoneme recognition based systems have been
discussed. For users with communication disorders such as apraxia, speech
dysarthria or aphasia, speech recognition systems may not be suitable and
phoneme recognition based systems could offer many of the same benefits as
speech recognition systems provide, such as “hands-free” control of the user’s
environment. Phoneme recognition based systems may be preferable in situations where control of a continuously varying parameter is required, such as
in volume control of a radio or television. It may be more suitable to environments where the system cannot be trained to a particular user’s voice, such
as where there are many users or in a public building.
A number of phoneme recognition based systems have been developed and
a number of applications of these systems have been described in this chapter.
Phoneme recognition has been explored to assess how it may be used to control
a reading machine, an environmental control menu and an alphabet board
based communication tool. A system requiring training to a user’s voice and
a system designed to work with a broad range of different voices have been
developed, and both have advantages and limitations.
Chapter 7
Conclusions
7.1 Introduction
This thesis has explored a number of different signals from the body to investigate their potential as control signals for communication and control systems
for the severely disabled. It is impossible to choose any one of these signals
as the best signal for communication and control purposes from the signals
identified. For each person requiring the use of a communication and control
system, there will be a different set of variables which may determine the best
method of enabling interaction with a system. Firstly, the physical abilities
of the person must be identified to assess the range of options open to them.
Their individual motivations and practical requirements from an assistive technology system must also be taken into account. While some people may wish
to use a system that will allow more efficient communication and are willing
to spend some length of time mastering a technique for interaction, others
may prefer a simpler method such as a single switch operated system. Also,
a person may have different opinions about how an alternative control system
may draw attention to their disability. Obviously their personal preferences
should always be taken into consideration when choosing a method of control.
While some people may find that electrodes attached to the skin will make
their disability more conspicuous, others may feel uncomfortable making utterances if they are in an environment where others will be able to overhear
their commands.
7.2 Resolution of the Aims of this Thesis
The main aims of this thesis were outlined in Chapter 1 and methods used to
meet these aims will be discussed here. These were:
1. Overview of current methods of providing communication and control
for disabled people.
2. Identification of alternative signals from the body which may be harnessed for communication and control purposes for people
with very severe disabilities.
3. Study of measurement techniques that may be used to acquire these
vestigial signals.
4. Investigation of signal processing methods to enable these signals to be
correctly interpreted and development of working systems that demonstrate the capabilities of these techniques.
5. Testing of these techniques and systems with people with severe disabilities.
6. Development of some mathematical models that evolved as a result of
studying these body signals.
7.2.1 Overview of Current Communication and Control Methods
There are a wide range of communication and control systems currently available and many different methods have been considered to enable a disabled
person to interface with these systems. A number of systems which were considered of particular relevance to the thesis were described in the main body
of the thesis.
7.2.2 Identification of Signals
This thesis sought to identify vestigial signals left to severely disabled people that could be harnessed for communication and control purposes. Four
principal signals were investigated in this thesis.
Muscle contraction was one of the vestigial signals explored as a method
of communication and control for disabled people, in Chapters 3 and 5. If
people who are disabled have the ability to contract a muscle, then this may
be harnessed to provide a method of communication and control. Obviously
if the muscle contraction is strong enough, then it enables the person to use a
mechanical switch. If this is not the case, then the signal must be harnessed
by other means.
Eye movements were discussed in Chapter 4. As the muscles in the eye
often remain under voluntary control even in very severe cases of disability, eye
movements are an important signal to consider for communication and control
purposes.
Acoustic signals as a method of communication and control were discussed
in Chapter 6. Speech recognition technologies may be used to provide control
by people with full speech production abilities. For those who are only capable
of producing a subset of the utterances necessary to create intelligible speech,
other methods of harnessing these utterances must be considered.
Skin conductance was briefly explored in Chapter 4 as another method of
providing a switching action. Measurement of the electrical conductance of
the skin may serve as a method of monitoring the activity of the sweat glands
on the skin’s surface. Sweat gland activity may be consciously controlled by
tensing up or imagining oneself in a state of stress or anger. This causes emotional sweating to occur, which will increase the measured skin conductance.
One of the drawbacks with using this signal is that it is a very slow method of
control. However, in cases where there is no preferable alternative it may be
the only option.
It is recognised that there may be other signals from the body that have
not been explored in this thesis that could be harnessed for communication
and control for disabled people. Some other possible signals that could be
considered for future investigations are mentioned in Section 7.3. The measurement techniques and signal processing methods used to develop working
systems will now be described for each of the four signals.
7.2.3 Measurement Techniques
Muscle Contraction
Muscle contraction in physically disabled people will often be very weak. A
number of methods of detecting this muscle contraction were considered. Three methods of harnessing weak muscle contractions were explored: the electromyogram (EMG) and the mechanomyogram (MMG), discussed in Chapter 3, and visual methods, discussed in Chapter 5.
The EMG is typically measured using three electrodes. Two recording
electrodes are placed over the belly of the muscle and a ground electrode is
placed on a neutral part of the body, such as a wrist. The two recorded signals
are differentially amplified. Muscle contraction can be detected by harnessing
this signal since the amplitude of the EMG increases in almost all cases upon
contraction, due to the generation of action potentials.
In Chapter 3, the MMG was explored as an alternative to the EMG for
measuring muscle contraction to enable control of communication and control
systems for disabled people. The mechanomyogram may offer a number of
benefits over myoelectric control. The mechanomyogram is measurable using
a single small accelerometer, as opposed to the three electrodes which are
required for electromyographic recordings. The electromyogram requires skin
preparation to improve its conductance and this step is unnecessary for the
mechanomyogram as it is a mechanical signal. Skin conductance may also be
a problem for EMG recording since it can be affected by varying thermal
conditions or emotional anxiety. The MMG is capable of detecting weaker
contractions than the EMG, which makes it an attractive option as a controller
for disabled people who may have very weak muscle activity. Also the MMG
can measure activity from deeper muscles than can the surface EMG, which
typically only detects activity from the surface muscles.
Visual techniques were explored in Chapter 5 to investigate the possibility
of using a computer camera to measure observable flickers of movement. This
may be a preferable option in cases where the person does not want to have
anything attached to their skin, or where they are prone to heavy perspiration.
The visual method of movement measurement that was developed as part of
the work here has an added benefit in that it responds only to the particular movement that the user or therapist has chosen. Thus other movements,
whether voluntary or involuntary (such as muscle tremors or spasticity) will
not unintentionally trigger the program to respond.
Eye movements
A number of different methods exist for measuring eye movement. It is often
measured visually, using methods such as the corneal reflection technique [52],
which shines light (usually infrared) into the eye and detects the reflected
pupil and cornea. The electrooculogram was the method of eye movement
measurement used in the work presented in this thesis. Two electrodes are
placed at opposite sides of the eyeball and the electrical signal varies between
the electrodes as the eye moves. Despite offering a number of benefits over
other methods, it is often overlooked as an eye movement measurement technique
due to problems with baseline drift. Often EOG based systems require manual
re-calibration of the amplifier when baseline drift occurs and this is impractical
if the aim is to achieve user independence. Aside from this limitation, the EOG
may be an attractive option for communication and control as it can provide
an inexpensive method of interfacing a user with a computer. The EOG also
has other benefits. It may have a wider range than visual techniques and is
not subject to interference from spectacles worn in front of the eyes.
Acoustic Utterances
Speech recognition technologies may not be an option for people who have lost
their speech production abilities but still remain capable of making non-verbal
but repeatable utterances. Phoneme detection is explored as an alternative
acoustic signal. Phoneme recognition may be suitable for users with communication disorders such as apraxia, speech dysarthria or aphasia, who often are
unable to use speech recognition systems because of their speech impairment.
Phoneme recognition based systems may also be preferable in situations where
control of a continuously varying parameter is required, such as volume control
of a radio or television.
Skin Conductance
Skin conductance was measured using the circuit in Appendix E, which outputs
a voltage proportional to the conductance on the surface of the skin. Skin
conductance was measured on the palmar surface of the fingers, since this area
typically has a higher number of sweat glands than other skin surfaces.
7.2.4 Signal Processing Techniques and Working Systems Developed
Muscle Contraction
An MMG based communication and control application was developed, the
code for which is given in Appendix I. A number of signal processing steps
were performed which enable a switching action to be performed when the
MMG amplitude increases sufficiently.
The system was tested to assess the speed at which muscle contraction could
be used to spell out a 9-word message on a software alphabet board. The average speed over the four users and the two muscles was 1.56 words/min with
an average of 1.25 errors. While this is a very slow method of communication
compared to natural speech, for people who are severely disabled it could provide an invaluable tool to enable communication where they might otherwise
have none.
Two visual-based methods of measuring movements were also developed,
the Frame Comparison Method and the Path Description Method. These methods enable switching actions to be performed using flickers of movement which
are detected using a computer camera. The algorithms presented were incorporated into a software computer program which allows the person’s movement
to be recorded and will generate a switching action on repetition of that movement. The code for this program is given in Appendix I.
Eye movements
A novel technique of using the electrooculogram as a control signal for the
disabled was presented in Chapter 4, known as Target Position Variation.
Target Position Variation is based on the principle of monitoring the user’s
EOG to look for oscillations of known amplitude which identify when a user is
looking at a particular target moving on screen. Two possible applications for
Target Position Variation were described. It may be used in menu selection
to detect one of a number of options by tracking the target for that option
moving on screen. It may also be used as part of an eye-gaze controlled eye
cursor program to enable automatic software re-calibration of the eye position.
Acoustic Utterances
A number of phoneme recognition based systems were presented in Chapter
6. The spectral and temporal features of the two phonemes /o:/ and /s/
were explored to assess how they may be distinguished by a phoneme recognition system. Systems were developed in both hardware and software based
on recognition of these two phonemes to control a reading machine and an
environmental control menu. The microcontroller code for the hardware application is given in Appendix H. The C++ code used for the environmental
control menu is given in Appendix I.
A more flexible phoneme recognition approach was then investigated which
allows arbitrary utterances to be associated with switching actions. A pitch
detection algorithm was developed and used to control the vertical position of
a pointer (the “bee”) moving over an alphabet board. This may allow a person
to spell out messages using the pitch of their voice in conjunction with another
non-periodic utterance. The code is given in Appendix I.
7.2.5 Patient Testing
Many of the systems developed here were tested with patients in the National Rehabilitation Hospital (NRH) to assess their suitability for communication and control purposes.
The Natterbox was probably the most widely used program. This was modi-
218
fied to run directly from a compact disc which was given to the therapists in
the hospital. The therapists could then independently choose an appropriate
mechanical switch which could enable each particular patient to operate this
program.
For patients with a more severe level of disability, some of the methods
described in this thesis were used to discover the most suitable method of
harnessing a body signal to provide a switching action. Two cases will be
outlined here.
The first case was a male patient in his 50s who had suffered a brainstem stroke. The stroke had left him completely paralysed, to the point of being almost completely locked-in. In fact, it took some time after the stroke before it was realised that he had retained almost complete mental faculties. His only voluntary movement was an ability to move his eyes slightly upwards. This movement was almost imperceptible to the naked eye, but it was considered that the EOG might offer a suitable means of harnessing this action. The vertical EOG was acquired using a National Instruments data acquisition card and thresholded in Simulink. By choosing a suitable threshold each time, it was possible to detect when the patient moved his eyes upward and actuate a switching action. This allowed him to use the Natterbox to spell out messages.
The second case was a male patient in his 20s who had suffered a road traffic accident. This patient was almost completely paralysed from the neck down, although he did retain the ability to adduct and abduct his thumb, and thus was able to use this action to operate a mechanical switch placed between his thumb and hand. This allowed him to use Natterbox, and over time he became quite proficient at spelling out messages to his friends, family and workers in the hospital. The patient was also a great music lover, and methods of offering him some independence to choose and play different songs and albums were considered. His music albums were uploaded to Windows Media Player. The graphical user interface of this program allows a mouse to be used to navigate through different albums and choose particular songs to play. The volume may also be controlled through this program. As this patient was incapable of using a conventional mouse, an alternative method of providing mouse cursor control was necessary. The mouse cursor may be controlled using three switches, via a program developed as part of the work here called Three-Switch Mouse, described in Section 2.4.4. The patient was already capable of using a mechanical switch, so it was only necessary to identify two additional signals that could provide the second and third switching actions. The patient had the ability to make slight neck rotations to the left and right. As briefly described in Section 3.4.2, the EMG was recorded from his neck muscles to detect when he was moving his head in either direction. Thus he was able to use the Three-Switch Mouse to control the mouse cursor.
7.2.6 Biological Studies
During exploration of body signals that may be harnessed by disabled people
for communication and control purposes, extensive studies were undertaken
on the biological functions of the human body. Two distinct results of these
studies were presented. These are the control model that was developed to
model saccadic and smooth pursuit eye movements and a method for indirect
measurement of the firing rate of the sympathetic nervous system.
Eye Movement Model
A control model for the eye was developed which models rotation of the eye in either the horizontal or the vertical plane. In either plane, rotation of the eye is controlled by a pair of agonist-antagonist extraocular muscles. Contraction of one muscle rotates the eye in the positive θ direction, and of the other in the negative θ direction. In the model that was presented, these two muscles were condensed into a single equivalent bidirectional muscle, which can rotate the eyeball in either direction. The effects of the eye's muscle spindle on the torque of the eye were also incorporated into the model in an inner feedback loop.
The model was initially explored to study saccadic eye movements, which
are movements where the eye suddenly jumps from one location to another.
It was found that it was possible to use this model to simulate a saccadic movement that fits the measured EOG response very well. The model was then
extended to assess its ability to correctly model smooth pursuit movements,
which are the movements that occur when the eye is following a moving target,
such as in the Target Position Variation method. Initial results seem to indicate
that it is possible to predict smooth pursuit movements with this model.
Firing Rate Measurement Technique
An original method for measurement of the sympathetic nervous system firing rate was also described. To the best of the author's knowledge there is no existing method for observing this variable through non-invasive techniques. The measurement technique developed uses measurement of the conductance of the skin to observe the firing rate, through use of a skin conductance model in a feedback loop under PID control. Results show that the modelled and measured skin conductances seem to follow each other almost exactly, which suggests that this method could provide a low-cost, non-invasive tool for firing rate measurement.
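The feedback idea can be sketched as follows; the first-order conductance model and the PID gains below are placeholders chosen only to make the structure concrete, not the Chapter 4 model.

% Sketch of the firing-rate observer: a PID controller drives a skin
% conductance model so its output gm tracks the measured conductance g;
% the controller output y is read as the estimated firing rate. The
% first-order model (time constant tau) and all gains are assumptions.
dt = 0.01; tau = 2; k = 1;               % assumed model: tau*gm' = -gm + k*y
Kp = 5; Ki = 2; Kd = 0.1;                % assumed PID gains
gm = g(1); I = 0; ePrev = 0;             % g: measured conductance samples
y = zeros(size(g));
for n = 1:numel(g)
    e = g(n) - gm;                       % tracking error
    I = I + e*dt;
    y(n) = Kp*e + Ki*I + Kd*(e - ePrev)/dt;
    ePrev = e;
    gm = gm + dt*(k*y(n) - gm)/tau;      % forward-Euler step of the model
end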
7.3 Future Work

7.3.1 The Mechanomyogram
Further studies should be carried out on this signal to assess its ability to correctly detect weak muscle contractions in people with severe disabilities. Pattern recognition techniques that allow differentiation between different muscle actions should also be investigated in more detail for the MMG, ultimately to provide a means of operating multiple-switch operated systems.
7.3.2 Target Position Variation
Target Position Variation has been explored in principle, and it was found that it is readily possible to detect when a user is looking at an on-screen target from analysis of their EOG. The next step is to integrate this method into a working EOG based system. In particular, the application of TPV as an automatic re-calibration tool for an EOG based mouse cursor control system is of interest.
7.3.3 Visual Methods for Mouse Cursor Control
Mouse cursor control using the centre of brightness of the hand was mentioned
in Chapter 5. Movement of a person’s hand could be used to move the centre
of brightness around and thus be translated into mouse cursor movements. It
is important to identify intelligent ways of translating the centre of brightness
co-ordinates into mouse cursor co-ordinates, so as to provide an intuitive means
for cursor control.
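One simple mapping, given here only to make the idea concrete, is a linear scaling of the intensity centroid to screen coordinates; an intelligent scheme would add smoothing and gain shaping on top of this. The screen dimensions and rgb2gray (Image Processing Toolbox) are assumptions.

% Sketch of centre-of-brightness cursor control: compute the intensity
% centroid of a frame and scale it linearly to the screen. The linear
% map is one simple (assumed) choice, not the method to be investigated.
I = double(rgb2gray(frame));                    % current camera frame
[rows, cols] = size(I);
total = sum(I(:));
cx = sum((1:cols) .* sum(I,1)) / total;         % x centroid (over columns)
cy = sum((1:rows).' .* sum(I,2)) / total;       % y centroid (over rows)
cursorX = round(cx/cols * screenWidth);         % screenWidth/Height assumed
cursorY = round(cy/rows * screenHeight);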
The path description method developed in Chapter 5 may be sensitive to gross movements by the user, which can shift their starting and final positions away from those used in the path description definition. This problem should be addressed, perhaps by a technique that can detect when this has occurred, instruct the user to move to the starting position and re-record the new position. From there the new path of motion could be defined, taking into account that the plane may also have rotated from its original position.
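For example, the stored path could be re-anchored by a rigid transform once the new starting position (and, if detectable, a rotation angle) is known. This is a sketch of the geometry only; storedPath, newStart and theta are hypothetical quantities assumed to be available from the detector.

% Sketch: re-anchor a recorded path after gross user movement. The
% stored Nx2 path is translated to the newly recorded start position
% and, if a rotation angle theta can be estimated, rotated to match.
p = storedPath - repmat(storedPath(1,:), size(storedPath,1), 1); % origin at old start
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];             % rotation by theta
adjusted = (R*p.').' + repmat(newStart, size(p,1), 1);           % rotate, then translate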
7.3.4 Communication System Speed
Some possible future developments of the Natterbox have already been discussed in Chapter 2. The maximum communication rates achievable using any of the techniques described in this thesis are ultimately limited by the speed of the communication system used. It is important to investigate methods of increasing the speed of a communication system to enable faster communication rates, perhaps through some type of text prediction algorithm. Modification of the Dasher program developed by Ward [25] for single switch operation could offer faster communication rates.
7.3.5 Multi-Modal Control Signals
All the methods of providing control signals described in this thesis are based
on choosing one body signal which may be harnessed to provide a switching
action or other control signal. In theory, if a person has the ability to generate two or more signals consecutively then this may provide a more accurate
method of generating a control signal. For example, a system designed to actuate a switching action can monitor two or more signals from the body and
be designed to respond only when it detects that the person has made both of
these signals. Obviously, many of the patients encountered are so severely disabled that it may be difficult to recognise one action that may be voluntarily
repeatable, never mind two. Even if two such signals can be identified, it may
be difficult, if not impossible, for the patient to make both of these signals
consecutively.
7.3.6 Other Vestigial Signals
Several vestigial signals from the body have been investigated to discover their potential for use in communication and control by disabled people. Of course, there may be other signals from the body that may be harnessed for communication and control purposes. Two suggestions of signals that may be explored in future research on this topic are tongue movements and whistling.
People with injuries below C4 level usually have control of muscles in the
neck and above. This generally includes the muscles in the tongue. The tongue
has a large number of muscles which enable very precise movements to be
performed and this could be of use for communication and control purposes.
A possible method would be to use some type of mouthguard consisting of
a number of sensor pads. The user could move their tongue to press on a
particular sensor pad to perform a certain switching action associated with
that pad. Since tongue movements can usually be quite exact, this could
potentially offer a large number of different switching actions to be performed.
Whistling is another acoustic signal that has been explored by others in the
laboratory in the NRH. Whistle pitch may be used to control a continuously
varying parameter, such as the mouse cursor position. For those who are
unable to whistle unassisted, a whistle placed in the mouth could be used to
provide a switching action.
Bibliography
[1] Central Statistics Office. http://www.cso.ie, accessed 1st August 2005.
[2] S L Glennen and D C DeCoste. Handbook of Augmentative and Alternative Communication. Singular Publishing Group, 1997.
[3] W J Perkins and B F Stenning. Control units for operation of computers
by severely physically handicapped persons. Journal of Medical Engineering and Technology, 10(1):21–23, January/February 1986.
[4] Joseph J. Lazzaro. Adapting PCs for Disabilities. Addison-Wesley Publishing Company, 1996.
[5] A M Cook and S M Hussey. Assistive Technologies: Principles and Practice. Mosby, 1995.
[6] Irish Health Website. http://www.irishhealth.com, accessed 30th June
2005.
[7] J W Sharpless. Mossman’s A Problem-Oriented Approach to Stroke Rehabilitation. Charles C Thomas Publisher, 2nd edition, 1982.
[8] Dorland’s Illustrated Medical Dictionary. W. B. Saunders Company, 27th
edition, 1988.
[9] M Johnstone. Chapter 1: Controlled Movement. In Restoration of Motor
Function in the Stroke Patient. Churchill Livingstone, 2nd edition, 1983.
[10] F Walshe. Diseases of the Nervous System. E & S Livingstone, 11th
edition, 1970.
[11] J Oliver and A Middleditch. Functional Anatomy of the Spine. Butterworth-Heinemann, Reed Educational and Professional Publishing Ltd, 1991.
[12] Coccyx: Wikipedia Online Encyclopedia. http://en.wikipedia.org/wiki/Coccyx, accessed 27th July 2005. Wikipedia Modification Date: 12th June 2005 22:27.
[13] Spinal Cord: Wikipedia Online Encyclopedia. http://en.wikipedia.org/wiki/Spinal_cord, accessed 30th July 2005. Wikipedia Modification Date: 13th July 2005, 08:45.
[14] K Whalley Hammell. Spinal Cord Injury Rehabilitation. Chapman and
Hall, 1995.
[15] Quadriplegia: Wikipedia Online Encyclopedia. http://en.wikipedia.org/wiki/Quadriplegia, accessed 27th July 2005. Wikipedia Modification Date: 16th July 2005 02:33.
[16] D Grundy and A Swain. ABC of Spinal Cord Injury. BMJ Publishing
Group, 3rd edition, 1996.
[17] Irish Motor Neurone Disease Association. http://www.imnda.ie, accessed 29th July 2005.
[18] M Dunitz. Amyotrophic Lateral Sclerosis. Martin Dunitz Ltd, 2000.
[19] Directors of Health Promotion and Education. Bacterial meningitis facts. http://www.astdhpphe.org/infect/Bacmeningitis.html, accessed 30th July 2005.
[20] A F Bergen, J Presperin and T Tallman. Positioning for function: wheelchairs and other assistive technologies. Valhalla Rehabilitation Publications, 1990.
[21] J H Wells, S W Smye and A J Wilson. A microcomputer keyboard substitute for the disabled. Journal of Medical Engineering and Technology,
10(2):58–61, March/April 1986.
[22] R C Simpson and H H Koester. Adaptive one-switch row-column scanning. IEEE Transactions on Rehabilitation Engineering, 7(4):464–473,
December 1999.
[23] H S Ranu. Engineering aspects of rehabilitation for the handicapped.
Journal of Medical Engineering and Technology, 10(1):16–20, January/February 1986.
[24] R Damper. Text composition by the physically disabled: a rate prediction
model for scanning input. Applied Ergonomics, 15:289–296, 1984.
[25] D J Ward and D J C MacKay. Fast hands-free writing by gaze direction.
Nature, 418(6900):838, 2002.
[26] David MacKay. Dasher: an efficient keyboard alternative. Interfaces, 60,
Autumn 2004.
[27] Dasher website. www.inference.phy.cam.ac.uk/dasher/, accessed 30th
July 2005.
[28] J R Wolpaw, N Birbaumer, W J Heetderks, D J McFarland, P H Peckham,
G Schalk, E Donchin, L A Quatrano, C J Robinson and T M Vaughan.
Brain-computer interface technology: A review of the first international
meeting. IEEE Transactions on Rehabilitation Engineering, 8(2):164–173,
June 2000.
[29] R F Schmidt (editor). Fundamentals of Neurophysiology. Springer-Verlag,
3rd edition, 1985.
[30] R D Keynes and D J Aidley. Nerve and Muscle. Cambridge University
Press, 3rd edition, 2001.
[31] M Epstein and W Herzog. Theoretical Models of Skeletal Muscle. John
Wiley and Sons, 1998.
[32] A F Huxley. Muscle structure and theories of contraction. Progress in
Biophysics and Biophysical Chemistry, 7:255–318, 1957.
[33] C K Thomas, J G Broton and B Calancie. Motor unit forces and recruitment patterns after cervical spinal cord injury. Muscle and Nerve, pages
212–220, February 1997.
[34] C J De Luca. Surface Electromyography: Detection and Recording. Delsys Inc. e-book: http://www.delsys.com/library/papers/SEMGintro.pdf, 2002.
[35] J L Echternach. Introduction to Electromyography and Nerve Conduction
Testing. Slack Inc., 2nd edition, 2003.
[36] D Gordon and E Robertson. Electromyography: Recording. University of Ottawa, Canada, http://www.health.uottawa.ca/biomech/courses/apa4311/emg_c.pdf, accessed 31st July 2005.
[37] Jang-Zern Tsai. Chapter 7: Nervous system. In J G Webster, editor,
Bioinstrumentation. Wiley International, 2004.
[38] R N Scott and P A Parker. Myoelectric prostheses: state of the art. Journal of Medical Engineering and Technology, 12(4):143–151, July/August.
[39] B Hudgins, P Parker and R N Scott. A new strategy for multifunction myoelectric control. IEEE Transactions on Biomedical Engineering,
40(1):82–94, January 1993.
[40] G-C Chang, W-J Kang, J-J Luh, C-K Cheng, J-S Lai, J-J J Chen and TS Kuo. Real-time implementation of electromyogram pattern recognition
as a control command of man-machine interface. Medical Engineering &
Physics, 18(7):529–537, October 1996.
[41] S K Rogers and M Kabrisky. An Introduction to Biological and Artificial Neural Networks for Pattern Recognition. SPIE Optical Engineering
Press, 1991.
[42] M H Hayes. Statistical Digital Signal Processing and Modeling. Wiley,
1996.
[43] J Silva, W Heim and T Chau. MMG-Based classification of muscle activity
for prosthesis control. In Proceedings of the 26th Annual International
Conference of the IEEE EMBS, San Francisco, CA, USA, Sept 2004.
[44] G Oster. Early research on muscle sounds. In Proceedings of the 11th Annual International Conference of the Engineering in Medicine and Biology
Society, volume 3, page 1039, Seattle, WA, USA, November 1989.
[45] C Orizio. Muscle sound: bases for the introduction of a mechanomyographic signal in muscle studies. Critical Reviews in Biomedical Engineering, 21(3):201–243, 1993.
[46] D T Barry. IEEE Transactions on Biomedical Engineering, 37(5):525–531,
May 1990.
[47] M I A Harba and G E Chee. Muscle mechanomyographic and electromyographic signals compared with reference to action potential average
propagation velocity. In Proceedings of the 19th International Conference
EMBS, Chicago, IL, USA, Oct-Nov 1997.
[48] R F Schmidt, editor. Fundamentals of Sensory Physiology, pg. 129. Springer, New York, 2nd edition, 1981.
[49] C Boylan. An exploration of the electro-oculogram as a tool of communication and control for paralysed people. University College Dublin,
Ireland, Final Year Project Report, 2003.
[50] P J Oster and J A Stern. Chapter 5, “Measurement of Eye Movement”. In
Irene Martin and Peter H Venables, editors, Techniques in Psychophysiology. John Wiley and Sons, 1980.
[51] W Becker and A F Fuchs. Prediction in the oculomotor system: smooth
pursuit during transient disappearance of a visual target. Experimental
Brain Research, 57(3):562–575, 1985.
[52] K A Mason. Control Apparatus Sensitive to Eye Movement. US Patent
#3462604, August 1969.
[53] D A Robinson. A method of measuring eye movements using a scleral
search coil in a magnetic field. IEEE Transactions on Biomedical Engineering, 10:137–145, 1963.
[54] J Gips and P Olivieri. Eagle Eyes: An Eye Control System for People
with Disabilities. In Proceedings of the 11th International Conference on
Technology and Persons with Disabilities, March 1996.
[55] J J Teece, J Gips, C P Olivieri, L J Pok and M R Consiglio. Eye movement
control of computer functions. International Journal of Psychophysiology,
29:319–325, 1998.
[56] R Barea, L Boquete, M Mazo and E López. System for assisted mobility
using eye movements based on electrooculography. IEEE Transactions on
Neural Systems and Rehabilitation Engineering, 10(4):209–218, December
2002.
[57] M Mazo and the Research Group of the SIAMO Project. An integral
system for assisted mobility. IEEE Robotics and Automation Magazine,
pages 46–56, March 2001.
[58] Biocontrol Systems. EOG Biocontrol Technology and Applications. http://www.biocontrol.com/eog.html, accessed 14th July 2004.
[59] A S Sedra and K C Smith. Microelectronic Circuits. Oxford University
Press, 3rd edition, 1982.
[60] E Burke, Y Nolan and A de Paor. An electro-oculogram based system for communication and control using target position variation. In IEEE EMBS UK and RI Postgraduate Conference on Biomedical Engineering and Medical Physics, Reading, UK, July 2005.
[61] Eyes: Wikipedia Online Encyclopedia. http://en.wikipedia.org/wiki/Eyes, accessed 14th July 2005. Wikipedia Modification Date: 11:08, 14 July 2005.
[62] M A Just and P A Carpenter. A theory of reading: From eye fixations to
comprehension. Psychological Review, 87(4):329–354, 1980.
[63] R J K Jacob. Eye movement-based human-computer interaction techniques: Towards non-command interfaces. Advances in Human-Computer
Interaction, 4:151–190, 1993.
[64] L Stark. Neurological Control Systems: Studies in Bioengineering. Plenum
Press New York, 1968.
[65] G Westheimer. Mechanism of saccadic eye movements. AMA Archives
Ophthalmology, 52:710–724, 1954.
[66] B Cogan and A de Paor. Optimum stability and minimum complexity
as desiderata in feedback control system design. In IFAC Conference,
Control Systems Design, pages 51–53, Bratislava, Slovakia, June 2000.
[67] Y Nolan, E Burke, C Boylan and A de Paor. The human eye position
control system in a rehabilitation setting. In International Conference on
Trends in Biomedical Engineering, University of Zilina, Slovakia, September 7-9 2005.
[68] R Edelberg. Chapter 9: Electrical activity of the skin. In Handbook of
Psychophysiology. Holt, Rinehart and Winston Inc, 1972.
[69] D P Burke. Real-time Processing of Biological Signals to Provide Multimedia Biofeedback as an Aid to Relaxation Therapy. MEngSc Thesis,
University College Dublin, Ireland, 1998.
[70] W S T Hays. Human pheromones: have they been demonstrated? Behavioral Ecology and Sociobiology, 54(2):89–97, 2003.
[71] L E Lajos. The Relation Between Electrodermal Activity in Sleep, Negative Affect, and Stress in Patients Referred for Nocturnal Polysomnography. PhD thesis, Department of Psychology, Louisiana State University, 2002.
[72] L A Geddes and L E Baker. Principles of Applied Biomedical Instrumentation. Wiley, 3rd edition, 1989.
[73] Electrodermal Activity. http://www.paranormality.com/eda.shtml, accessed 31st May 2005.
[74] The Electrodermal Response. http://butler.cc.tut.fi/~malmivuo/bem/bembook/27/27.htm, accessed 31st May 2005.
[75] M Betke, J Gips and P Fleming. The Camera Mouse: Visual tracking of body features to provide computer access for people with severe
disabilities. IEEE Transactions on Neural Systems and Rehabilitation
Engineering, 10(1):1–10, March 2002.
[76] R B Reilly and M O’Malley. Adaptive noncontact gesture-based system
for augmentative communication. IEEE Transactions on Rehabilitation
Engineering, 7(2):174–182, June 1999.
[77] W J Weiner and A E Lang. Movement Disorders: A Comprehensive
Survey. Futura Publishing Company, Mount Kisco, New York, 1989.
[78] D A Forsyth and J Ponce. Chapter 7: Linear Filters. In Computer Vision: A Modern Approach. Pearson Education International, Prentice
Hall, 2003.
[79] J R Parker. Practical Computer Vision Using C. John Wiley & Sons,
Inc., 1994.
[80] L Rabiner and B Juang. Chapter 47, speech recognition by machine. In
The Digital Signal Processing Handbook. CRC Press LLC, 1998.
[81] S Young, G Evermann, D Kershaw et al. The HTK Book. E-book, Microsoft Corporation, 1995.
[82] P B Denes and E N Pinson. The Speech Chain: The Physics and Biology
of Spoken Language. Bell Telephone Laboratories, 1963.
[83] J R Deller, J G Proakis and J H L Hansen. Discrete-Time Processing of
Speech Signals. Macmillan, 1993.
[84] I Johnston. Measured tones: The interplay of physics and music. Institute
of Physics Publishing, 2nd edition, 2002.
[85] H M Kaplan. Anatomy and Physiology of Speech. McGraw-Hill, 2nd
edition, 1971.
[86] Summer Institute of Linguistics Website. http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/, accessed 12th April 2005.
[87] IPA Chart for English: Wikipedia Online Encyclopedia. http://en.wikipedia.org/wiki/IPA_chart_for_English, accessed 11th May 2005. Wikipedia Modification Date: 07:01, 5th May 2005.
[88] S Lemmetty. Chapter 3, Phonetics and Theory of Speech Production. In Review of Speech Synthesis Technology. http://www.acoustics.hut.fi/~slemmett/dippa/index.html, accessed 11th May 2005.
[89] A Hiberno-English Archive. http://www.hiberno-english.com, accessed 11th May 2005.
[90] R A Penfold. An Introduction to PIC Microcontrollers. Babani Electronics
Books, 1997.
[91] Open Sound System Documentation. http://opensound.com/pguide/,
accessed 27th June 2005.
[92] J G G Dobbe. Algorithm alley - fast Fourier transform. In Dr. Dobb's Journal, February 1995.
[93] R A Penfold. Electronic Hobbyist’s Data Book. Babani Electronics Books,
1996.
Appendix A
MMG Circuit
The circuit diagram for the MMG circuit in Section 3.5.2 is given in Figure
A.1. The purpose of this circuit is to remove the 2.5V offset from the signal, to
set the bandwidth and to amplify the signal. The bandwidth is set to 200Hz
by choosing Cx = Cy = 0.027µF , as described in the ADXL203E datasheet
from Analog Devices. The gain of the circuit is

\text{Gain} = \frac{2R}{R} = 2 \tag{A.1}

R is chosen to be 1MΩ so as not to load the accelerometer appreciably.
[Figure A.1: MMG Circuit. The ADXL203E accelerometer with op-amp stages that remove the 2.5V offset and amplify the xout and yout outputs; Cx and Cy set the bandwidth.]
Appendix B
Simulink Models
Four Simulink models are given here. The first is the model used to detect muscle contraction from the MMG. The second is the model used to simulate a saccadic jump. The third is the model used to simulate a smooth pursuit movement. The last is the model used to observe the firing rate of the sympathetic nervous system based on measurement of skin conductance.

[Figure B.1: Simulink block diagram used to detect muscle contraction from the MMG]

[Figure B.2: Simulink model used to simulate a saccadic jump of 15° or 0.2618 rad. The output is shown in the main text in Figure 4.19]

[Figure B.3: Modified Simulink model used to simulate a smooth pursuit movement. The output is shown in the main text in Figure 4.23]

[Figure B.4: Simulink model used to observe the firing rate of the sympathetic nervous system based on measurement of the skin conductance. The estimated firing rate y is shown in Figure 4.29 and g and gm are shown in Figure 4.28.]
Appendix C
MATLAB Code for TPV Fit Function
This is the MATLAB code used to generate the four fit function values in Figure 4.14. The data used was sampled at 200Hz and was 45s long, so contained 9000 samples overall. The name of the array holding the original data is EOG.signals.values. The four frequency components present in this signal, which the fit functions seek to identify, are 0.2Hz, 0.4Hz, 0.8Hz and 1.6Hz; the variable f is set to each of these values in turn and the following code is run to generate each of the fit function values.
SE = EOG.signals.values(1:9000);   % original EOG data (column vector)
SET = transpose(SE);               % row vector copy for elementwise products
t = (0:0.005:44.995);              % time base: 200Hz sampling, 45s record

p = SET.*exp(i*2*pi*f*t);          % demodulate the EOG at the target frequency f

% the averaging window is one period of the component being fitted
if (f==1.6)
period = 125;
end
if (f==0.8)
period = 250;
end
if (f==0.4)
period = 500;
end
if (f==0.2)
period = 1000;
end

for n=period:9000
c(n) = sum(p(n-period+1:n))/(0.5*period);   % complex amplitude estimate at f
avg(n) = sum(SE(n-period+1:n))/period;      % running mean over one period
end

for n=period:9000
% residual between the EOG and the fitted sinusoid: small values
% indicate the component at f is present with the estimated amplitude
r(n) = sum(abs(SET(n-period+1:n) - avg(n) - real(c(n))*cos(2*pi*f*t(n-period+1:n)) - imag(c(n))*sin(2*pi*f*t(n-period+1:n))));
end
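A compact way to run the same block for all four components would be to compute the window length from the frequency directly; this is a sketch of a driver loop (the thesis set f and period by hand for each run).

fs = 200;                       % sampling rate of the EOG record
for f = [0.2 0.4 0.8 1.6]
    period = round(fs/f);       % 1000, 500, 250 and 125 samples respectively
    % ... run the fit-function code above for this f and store r ...
end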
Appendix D
Optimum Stability
The characteristic polynomial of the muscle spindle controller is:

P(s) = (s^2 + h_1 s + h_0)(s + 120)^2 + \frac{1}{J}(f_1 s + f_0) \tag{D.1}

This fourth order equation has four roots. Placing all four at s = −240 gives a good match between the overall step response of the muscle spindle loop and a real step response, where the eye suddenly jumps (a saccadic movement). This placement gives the following values for the four free parameters:

h_1 = 720, \quad h_0 = 158400, \quad f_1 = 15206.4, \quad f_0 = 2280960 \tag{D.2}

Assigning all four roots of Equation D.1 to the same value gives the principle of optimum stability: if all the controller parameters but one are held at their nominal values, then, as that one is varied through its nominal value, the right-most root is as deep in the left half plane as possible [66]. This will be demonstrated below for each of the four parameters.
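Expanding P(s) and matching coefficients against (s+240)^4 shows where these values come from (a worked check; the inertia J is not quoted in this appendix, but it is implied by the quoted f_1 and f_0):

\begin{align*}
(s+240)^4 &= s^4 + 960s^3 + 345600s^2 + 55296000s + 3317760000\\
s^3:\ & 240 + h_1 = 960 \Rightarrow h_1 = 720\\
s^2:\ & 14400 + 240h_1 + h_0 = 345600 \Rightarrow h_0 = 158400\\
s^1:\ & 14400h_1 + 240h_0 + f_1/J = 55296000 \Rightarrow f_1/J = 6912000\\
s^0:\ & 14400h_0 + f_0/J = 3317760000 \Rightarrow f_0/J = 1036800000
\end{align*}

Both ratios are consistent with J = 0.0022: f_1 = 6912000 \times 0.0022 = 15206.4 and f_0 = 1036800000 \times 0.0022 = 2280960.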
[Figure D.1: Root locus of the controller with characteristic polynomial given by Equation D.1, with h1, h0 and f1 as given in Equation D.2, as f0 varies from 0 → ∞]

[Figure D.2: Root locus of the controller with characteristic polynomial given by Equation D.1, with h1, h0 and f0 as given in Equation D.2, as f1 varies from 0 → ∞]

[Figure D.3: Root locus of the controller with characteristic polynomial given by Equation D.1, with h1, f1 and f0 as given in Equation D.2, as h0 varies from 0 → ∞]

[Figure D.4: Root locus of the controller with characteristic polynomial given by Equation D.1, with h0, f1 and f0 as given in Equation D.2, as h1 varies from 0 → ∞]
Appendix E
Circuit Diagram for Measuring
Skin Conductance
This is the circuit diagram used to measure conductance. The output voltage e0 is proportional to the conductance of the skin. Each side of the "skin conductance" block in the diagram corresponds to one of the two electrodes used to measure skin conductance.
[Figure E.1: Circuit used to measure skin conductance. The output relation is e0 = −2(e1 + e2).]
Appendix F
Phoneme Detection Circuit
Diagrams and Circuit Analysis
The analogue circuit and microcontroller circuit used for phoneme detection
are given here.
F.1 Analogue Circuit

The analogue circuit diagram is given in Figure F.1. It consists of seven stages: pre-amplifier, filtering, amplifier, rectifier, threshold, delay/comparator and relays.

F.1.1 Pre-Amplifier

\text{Gain} = \frac{R_2}{R_1} = \frac{10 \times 10^3}{1 \times 10^3} = 10 \tag{F.1}
F.1.2 Filtering

Two band-pass filters were used, one to pass the low-frequency, narrow-band signal of an /o:/ sound (which we will call Filter A) and one to pass the high-frequency, wide-band signal of an /s/ sound (Filter B).

Filter A

The band-pass filter contains a potentiometer R4 which may be adjusted according to the user's pitch to set the centre frequency of the filter. The transfer function of the filter is:

T(s) = \frac{-\frac{1}{R_3 C_3}\,s}{s^2 + s\,\frac{1}{R_5 C_3}\left(1 + \frac{C_3}{C_2}\right) + \frac{1}{R_3 R_5 C_2 C_3}\left(1 + \frac{R_3}{R_4}\right)} \tag{F.2}

For any filter with a transfer function of the form

T(s) = \frac{b_0 s}{s^2 + a_1 s + a_0} \tag{F.3}

the maximum gain |T(j\omega)|_{peak} occurs at \omega = \sqrt{a_0} and has the value b_0/a_1. In this case, choosing R3 = 2kΩ, C2 = C3 = 0.1µF and R5 = 100kΩ gives a gain of:

\text{Gain} = \left|\frac{b_0}{a_1}\right| = \frac{\frac{1}{2\times10^3 \times 0.1\times10^{-6}}}{\frac{1}{0.1\times10^{-6}\times100\times10^3}\left(1+\frac{0.1\times10^{-6}}{0.1\times10^{-6}}\right)} = 25 \tag{F.4}

The centre frequency is given by:

\omega_c = \sqrt{\frac{1}{R_3 R_5 C_2 C_3}\left(1+\frac{R_3}{R_4}\right)} = \sqrt{(5\times10^5)\left(1+\frac{2\times10^3}{R_4}\right)} \tag{F.5}
As Equation F.5 shows, the centre frequency can be altered by adjusting
the value of R4 to suit different pitches. A 1kΩ potentiometer gives:
fc = 1.59kHz for R4 = 10Ω
fc = 515.72Hz for R4 = 100Ω
fc = 251.64Hz for R4 = 500Ω
fc = 194.92Hz for R4 = 1kΩ
Comparing this to the spectrum of the phoneme /o:/ back in Figure 6.5,
these values should be sufficient to pass the fundamental and/or first overtone
of the correct phoneme over a range of pitches.
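These settings can be checked numerically from Equation F.5; the following short MATLAB sketch uses the component values from Table F.1.

% Numerical check of Equation F.5 for the quoted potentiometer settings.
R3 = 2e3; R5 = 100e3; C2 = 0.1e-6; C3 = 0.1e-6;   % values from Table F.1
for R4 = [10 100 500 1000]
    wc = sqrt((1/(R3*R5*C2*C3))*(1 + R3/R4));     % rad/s, per Equation F.5
    fprintf('R4 = %4d ohm -> fc = %.1f Hz\n', R4, wc/(2*pi));
end
% prints approximately 1595.5, 515.7, 251.6 and 194.9 Hz, matching the text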
Filter B

The circuit for the band-pass filter was obtained from pg. 35 of [93]. It contains four resistors, R6, R7, R8 and R9, and two capacitors, C3 and C4. The component values are chosen to set the centre frequency according to the following formulae:

f_c = \frac{0.159}{R \times C} \tag{F.6}

C = C_3 = C_4 \tag{F.7}

R = \sqrt{R_6 R_7}, \quad R_8 = R_9 = 10\,\mathrm{k\Omega} \tag{F.8}

Choosing fc = 5kHz and C = 22nF gives

R = \frac{0.159}{5\times10^3 \times 22\times10^{-9}} = 1.445\,\mathrm{k\Omega}

The values of R6 and R7 also depend on the Q value required:
For Q = 1: R6 = 2R, R7 = 0.5R
For Q = 0.5: R6 = R7 = R
As Q increases, R7 decreases and R6 increases. We want a bandwidth of approximately 1.25kHz (Q = 4). Choosing R7 = 180Ω and R6 = 12kΩ fits the requirements.
F.1.3 Amplifier

The second amplifier stage is only needed for the /s/ circuit. This is because the /s/ phoneme is generally of much lower intensity than the /o:/ phoneme, and also because the maximum gain of Filter B is only 12.5 (it is 25 for Filter A). The signal needs to be further amplified to provide a high enough signal to control the relay. The gain of the amplifier stage is given by:

\text{Gain} = \frac{R_{12}}{R_{11}} = \frac{3.9\times10^3}{1\times10^3} = 3.9 \tag{F.9}

F.1.4 Rectifier

The rectifier stage of the circuit is necessary to make sure the signal stays above a threshold for a sufficient length of time (see Section F.1.5). It is basically an envelope detector with a slow time constant (at least 10 times the maximum period of the signal). The rectifier for the /o:/ phoneme we will call Rectifier A, and that for the /s/ phoneme, Rectifier B.

Rectifier A

The time constant of the circuit is given by:

\tau_1 = R_{10} \times C_8 \tag{F.10}

Choosing R10 = 270kΩ and C8 = 220nF gives τ1 = 0.0594s. The maximum period of the signal should be about 0.005s (200Hz).

Rectifier B

The time constant of the circuit is given by:

\tau_2 = R_{13} \times C_9 \tag{F.11}

Choosing R13 = 1MΩ and C9 = 100nF gives τ2 = 0.1s. The maximum period of the signal should be about 0.001s (1000Hz).
F.1.5 Threshold

The purpose of the threshold stage of the circuit is to ensure the signal is sufficiently large to turn on the switch. This is especially important in the case of an /s/ sound being made. Since this is a wide-band signal, it may in some cases contain a small amount of low-frequency components of similar frequency to that of the /o:/ phoneme. This could accidentally close Switch A as well as Switch B if the thresholding stage were not performed to ensure that the amount of that frequency is high enough. A comparator is used for thresholding: if the signal is larger than a reference voltage, the output will be high; if not, the output will be low.

Threshold A

The reference level is set using the following equation:

V_{ref} = \frac{R_{14}}{R_{14} + R_{15}} V_{cc} \tag{F.12}

Choosing R14 = 1kΩ, R15 = 390Ω and Vcc = 9V gives Vref = 6.47V.

Threshold B

V_{ref} = \frac{R_{16}}{R_{16} + R_{17}} V_{cc} \tag{F.13}

Choosing R16 = 4.7kΩ, R17 = 10kΩ and Vcc = 9V gives Vref = 2.88V.

F.1.6 Delay and Comparator

This stage is only required for the /s/ part of the circuit. The circuit is basically an integrator with an output given by:

V_{out} = \frac{V_{cc}}{(R_{18} + R_{19})C_{10}}\, t \tag{F.14}

The time constant of the circuit is

\tau_3 = (R_{18} + R_{19})C_{10} \tag{F.15}

When the input signal is high (Vcc), the output signal will initially also be at Vcc. As long as the input stays high, the output will begin to drop, and at t = τ3 the output will be at 0V. This output is connected to the inverting terminal of a comparator, with its non-inverting terminal connected to ground. As long as the input is higher than ground, or 0V, the comparator output will be low. When the signal reaches 0V at t = τ3, the output of the comparator suddenly changes to Vcc, causing the switch to close.

Choosing R18 = 10kΩ, R19 = 560kΩ and C10 = 470nF gives τ3 = 0.2679s.

F.1.7 Relays

The switching action is performed using relay coils. When sufficient voltage is dropped across the coil, the switch closes. The coil requires a voltage drop of about 5V across it, and has quite a low resistance of 83.3Ω. Therefore an output capable of supplying at least 60mA is required to close the switch.
[Figure F.1: Circuit Diagram for Phoneme Detection. Component values are given in Table F.1. A microphone and pre-amplifier feed two parallel chains: for /o:/ a filter, buffer, rectifier, threshold and relay; for /s/ a filter, buffer, amplifier, rectifier, threshold, delay and comparator, and relay.]
Table F.1: Component Values for circuit in Fig F.1

R1   1kΩ          C1   20µF
R2   10kΩ         C2   100nF
R3   2kΩ          C3   100nF
R4   1kΩ (pot)    C4   22nF
R5   100kΩ        C5   22nF
R6   12kΩ         C6   4.7µF
R7   180Ω         C7   10µF
R8   10kΩ         C8   220nF
R9   10kΩ         C9   200nF
R10  270kΩ        C10  470nF
R11  1kΩ          Diodes    1N4148
R12  3.9kΩ        Op-amps   741 or 3140
R13  1MΩ
R14  1kΩ
R15  390Ω
R16  4.7kΩ
R17  10kΩ
R18  10kΩ
R19  560kΩ
F.2 Microcontroller Circuit

The microcontroller circuit diagram is given in Figure F.3. It consists of six stages: microphone, amplifier, infinite clipper, debouncing circuit, microcontroller and current amplifier/relay coils.

F.2.1 Microphone
The microphone input stage was designed for use with an electret condenser microphone. Most computer microphones are electret microphones, and so a computer microphone may be connected to the circuit using a standard 2.5mm stereo jack. Electret condenser microphones exploit the phenomenon of capacitance changes due to mechanical vibrations (e.g. changes in air pressure due to sound), to produce a voltage signal proportional to the sound wave. The electret microphone already has a built-in charge, but a few volts are needed to power the built-in FET buffer. A circuit diagram of an electret microphone is shown in Figure F.2. The three connections are the power, signal and ground. The signal output by the electret microphone usually includes a DC bias of a few volts, so this needs to be taken into account by using a capacitor to block the DC component.
F.2.2 Amplifier

The amplifier stage has two purposes: amplification, and moving the reference level so the signal rides around 4.5V. The reference level is set using two equal resistors R5 and a variable resistor R6, to allow the user to manually compensate for any deviation from midlevel. The reference level Vref is given by:

V_{ref} = \frac{R_5}{2R_5 + R_6} \times V_{cc} \tag{F.16}

Vcc = 9V, R5 = 150kΩ and R6 = 0 → 10kΩ, allowing Vref to be adjusted between 4.5V and 4.645V.
[Figure F.2: Circuit Required for Electret Microphone. R1 and C1 are usually included within the microphone casing; R2 is a load resistor. Typical values: R1 = 2.2kΩ, C1 = 10µF and R2 = 10kΩ.]
Table F.2: Component Values for circuit in Fig F.3

R1   2.2kΩ        C1   10µF
R2   10kΩ         C2   22µF
R3   1MΩ (pot)    C3   22nF
R4   680kΩ        C4   100µF
R5   150kΩ
R6   10kΩ (pot)
R7   910Ω         XTL        4MHz
R8   2.7kΩ        Diodes     1N4148
R9   2kΩ          Op-amps    741
R10  15kΩ         Regulator  7806L
R11  2.2kΩ
R12  15kΩ
[Figure F.3: Circuit Diagram for PIC-Based Phoneme Detection. Component values are given in Table F.2. Stages: microphone, amplifier, infinite clipper, two PIC microcontrollers (each with a 4MHz crystal), debouncer and current amplifier driving the relays for Switch A and Switch B; a voltage regulator provides the +6V supply.]
The gain of this stage can be calculated using the following equation:

\text{Gain} = \frac{R_3 + R_4}{R_2} \tag{F.17}

R2 = 10kΩ, to match the output impedance of the microphone. R4 = 680kΩ and R3 = 0 → 1MΩ, allowing the gain to be adjusted between 68 and 168.
F.2.3 Infinite Clipper

This stage infinitely clips the signal, using a comparator. The signal received from the output of the last stage is compared to Vref. If the signal is higher than the reference level the comparator goes into positive saturation, and if it is lower the comparator goes into negative saturation. The comparator used is a 3140, which can be powered off 0V and 9V. This gives an output signal which switches between 0V and approximately 8V. The potential divider below is used to convert this signal to a level suitable for input to the PIC (6V).

V_{out} = \frac{R_8}{R_7 + R_8} \times V_{in} = \frac{2.7\times10^3}{910 + (2.7\times10^3)}(8) = 5.98\,\mathrm{V} \tag{F.18}

F.2.4 Microcontroller
The amplified, infinitely clipped signal is input into the microcontroller. On
detection of the correct sound, the microcontroller sends its output high for as
long as the correct sound is detected. The methods used to determine if the
correct phoneme was uttered are given in the main body of the text.
F.2.5 Debouncing Circuit

The debounce circuit prevents "flickering" by only allowing the switch to close once the output has remained high for a set length of time. The op-amp acts as a comparator. The inverting input of the op-amp at this stage is set to a reference level of V− = 4.5V using the two equal valued resistors, both labelled R11, to divide the 9V supply. When the output of the PIC, Vpic, is high (6V), the capacitor C4 begins to charge up with time constant τ = R10 × C4 = 150ms. When the capacitor is charged, the voltage V+ at the non-inverting terminal is

V_+ = \frac{R_{10}}{R_{10} + R_9}(V_{pic} - V_D) = 4.68\,\mathrm{V} \tag{F.19}

where VD is the voltage drop across the diode (0.7V). Hence the output of the comparator will only "turn on" (6V) after a time slightly less than τ.
F.2.6 Current Amplifier and Relay Coils

The maximum current available from the output of a 741 op-amp is only about 10mA. The relay coil has a resistance RCOIL = 83.3Ω which needs 5V dropped across it to close the switch, so requires an available current of about 60mA. Hence a current amplifier was needed. A BJT-based circuit was used in common collector configuration. A suitable value for the base resistor, R12, was calculated as 12kΩ (using BJT datasheet values β = 250 and VBEON = 0.8V).
Appendix G

PIC 16F84 External Components and Pinout

The microcontroller used was a PIC16F84, powered by a 9V supply. The pin-out for this microcontroller is given in Figure G.1. The only external components necessary are a crystal and two capacitors, which set the clock rate. A 4MHz crystal and 2 × 22nF capacitors were used, which gives a clock speed of one command execution per µs, a quarter of the crystal speed. The two phonemes are detected independently using two separate PIC16F84 microcontrollers, which both use a relay at their output to close a switch upon recognition of the appropriate phoneme. The relay-switched outputs can be connected to the switch inputs of the reading machine or those of another device. The code programmed onto each of the two microcontrollers is given in Appendix H, and described in Section 6.5.2: chkooo.asm is the code for detection of /o:/ and chksss.asm is the code for detection of /s/.
[Figure G.1: Pin-out for PIC 16F84 (see [90]). 18-pin package: RA0-RA3, RA4/TOCK1, RB0-RB7, OSC1/CLOCK IN, OSC2/CLOCK OUT, MCLR, V+ and GND.]
Appendix H

Phoneme Recognition Microcontroller Code and Flowchart

Code for Detection of /o:/ Phoneme (chkooo.asm)
STATUS equ 3
PORTA equ 5
PORTB equ 6
TMR0 equ 1
OPT equ 1
INTCON equ 0BH
#DEFINE OO_OUT PORTA,0
#DEFINE SS_OUT PORTA,1
#DEFINE ZERO STATUS,2
#DEFINE PGNO STATUS,5
CNTR1 equ 0CH
CNTR2 equ 0DH
INT_OLD equ 0EH
INT_NEW equ 0FH
OO_INTS equ 2AH
;start of code
org 0
goto init
org 4
goto isr
; Configure inputs and outputs
init bsf PGNO ;Select page 1
clrf PORTB
clrf PORTA
movlw 10H
movwf PORTA ;RA4 is input (TMR0 Clock Input)
movlw 01H
movwf PORTB ;RB0 is input (external interrupt)
bcf PGNO ; Select page 0
; Initialise values
bcf OO_OUT
bcf SS_OUT
;configure timer
bsf PGNO
bcf OPT,5 ;use internal clock for TMR0
bcf OPT,3 ;use prescaler with RTCC
bsf OPT,2 ;256
bsf OPT,1
bsf OPT,0
bsf OPT,6 ; external interrupt occurs on the rising edge of signal
bcf PGNO
clrf INTCON
bsf INTCON,7 ;enable interrupts
bsf INTCON,4 ;enable external interrupt
bsf INTCON,5 ;enable timer overflow interrupt
clrf TMR0
clrf INT_OLD
clrf OO_INTS
; infinite loop, can be interrupted by service routine
loop goto loop
;interrupt service routine
isr btfsc INTCON,2 ;check if interrupt caused by timer overflow
goto ovrflw
;check freq<1000Hz
movf TMR0,0
movwf INT_NEW
andlw b’11111100’ ;result is zero if timer is <4 (1ms) - freq too high
btfsc ZERO
goto set_low
goto compare
;compare two consecutive intervals
compare movf INT_NEW,0
subwf INT_OLD,0; subtracts W from INT_OLD
btfss STATUS,0; If result is negative then complement
xorlw b’11111111’; bitwise complement result of subtraction
andlw b’11111100’; bitwise AND the result with 11111100
btfss ZERO ;zero if two numbers are similar (< 1ms difference)
goto set_low
goto chk_int
set_low clrf OO_INTS;reset ooh intervals counter if intervals were different
bcf OO_OUT
movf INT_NEW,0
movwf INT_OLD; copy INT_NEW into INT_OLD
clrf TMR0
bcf INTCON,1
retfie
chk_int incf OO_INTS,1
movf INT_NEW,0
movwf INT_OLD ;copy INT_NEW into INT_OLD
;check if OO_INTS has reached 4
movf OO_INTS,0
sublw d’4’
btfsc ZERO ;zero if 4 consecutive intervals are similar
goto set_oo
clrf TMR0
bcf INTCON,1
retfie ;return from interrupt routine (back to dummy loop)
set_oo bsf OO_OUT
decf OO_INTS
clrf TMR0
bcf INTCON,1 ;reset external interrupt flag
retfie
ovrflw bcf OO_OUT
clrf OO_INTS
bcf INTCON,2 ;reset timer overflow flag
retfie
end
Code for Detection of /s/ Phoneme (chksss.asm)
; chksss.asm - written 16/1/2003
STATUS equ 3
PORTA equ 5
PORTB equ 6
TMR0 equ 1
OPT equ 1
INTCON equ 0BH
#DEFINE SS_OUT PORTA,1
#DEFINE OVRFLW INTCON,2
#DEFINE PGNO STATUS,5
CNTR1 equ 0CH
CNTR2 equ 0DH ;code starts here
org 0
goto init
; Configure inputs and outputs and timer
init bsf PGNO ;Select page 1
clrf PORTB
clrf PORTA
movlw 10H
movwf PORTA ;RA4 is input (TMR0 Clock Input)
bsf OPT,5 ; Use external clock for TMR0
bsf OPT,3 ; Don’t use prescaler with RTCC
bsf OPT,4
bcf PGNO ; Select page 0 ; Initialise values
bcf SS_OUT
chk_sss clrf INTCON ;disable interrupts and reset timer overflow flag
clrf CNTR1
movlw b’00001100’ ; 12 decimal
movwf CNTR2
movlw b’11101100’ ;236 decimal
movwf TMR0
loop1 decfsz CNTR1,1
goto loop1
decfsz CNTR2,1
goto loop1 ; 10.24ms loop
btfsc OVRFLW
goto set_op
bcf SS_OUT
goto chk_sss
set_op bsf SS_OUT
goto chk_sss
end
[Figure H.1: Flowchart for the Interrupt Service Routine in the PIC program to detect utterance of the phoneme /o:/. On entering the interrupt: if it was caused by a timer overflow, clear the output pin and the interval counter and return; otherwise, if the interval between this interrupt and the last is too small, reset; if the interval is of similar length to the last one, increment the counter of consecutive similar intervals; once four consecutive similar intervals have occurred, set the output pin high; clear the timer and return.]
Appendix I
Code for Programs
The code for the programs is on the included CD. The files included are as
follows:
I.1 Natterbox
Source code
main.cpp
main.h
Executable
natter.exe
I.2 USB Switch
Source Code
main.cpp
Icon files
These files are used to generate an iconic indicator on the system toolbar for
mouse cursor control:
icon1.ico
icon2.ico
resource.res
resource.h
I.3 MMG Detection Program
Source Files
main.cpp
mmg3.cpp*
msgproc.cpp
setup.cpp
Header Files
main.h
setup.h
mmg3.h*
mmg3_common.h*
mmg3_export.h*
mmg3_prm.h*
mmg3_reg.h*
*Modified Real-Time Workshop code generated for Simulink model “mmg3.mdl”
I.4 Path Description Program
Source Files
main.cpp
creategraph.cpp
setup.cpp
samplegrabber.cpp
render.cpp
Header Files
main.h
creategraph.h
setup.h
samplegrabber.h
render.h
I.5 Graphical Menu
Source Files
main.cpp
audio_widget.cpp
draw_window.cpp
fl_draw_button.cpp
pcfft.cpp
picture.cpp
Header Files
main.h
audio_widget.h
draw_window.h
fl_draw_button.h
pcfft.h
picture.h
The makefile is also included.
I.6 Spelling Bee
Source Files
main.cpp
audio_window.cpp
pcfft.cpp
render.cpp
setup.cpp
sound.cpp
Header Files
main.h
audio_window.h
pcfft.h
render.h
setup.h
sound.h