
EMBEDDED IMPLEMENTATION OF DISTRESS SITUATION IDENTIFICATION
THROUGH SOUND ANALYSIS
Dan Istrate, Assistant Professor, ESIGETEL, Avon, France
Michel Vacher, Research Scientist, LIG, Grenoble, France
Jean-François Serignat, Senior Lecturer, LIG, Grenoble, France
E-mail: [email protected], [email protected], [email protected]
ABSTRACT
The safety of elderly people living alone at home is a crucial problem because of the growing aging
population and the high risk of home accidents such as falls. Medical remote monitoring systems may
increase the safety of such people by detecting a state of emergency and announcing it quickly. We have
already proposed a sound-based medical remote monitoring system. Distress sounds such as glass breaking,
screams and falls, and distress expressions such as "Help" or "A doctor, please!", are detected and recognized
through a continuous analysis of the sound flow. When a distress situation is identified, the software
can send an alarm with the recognized data to a person close to the patient and/or to a medical center. In this paper, a
real-time implementation of this system is presented. The advantages of this implementation on an
Embedded PC, equipped with a standard sound card and a microphone, are its reduced dimensions, its
silence (fanless operation) and its low cost. At the same time, the implementation is flexible and can also be installed
on a desktop or laptop PC.
KEYWORDS: sound detection, sound/speech classification, sound recognition, signal processing,
embedded system, telemedicine.
INTRODUCTION
The number of elderly people who live alone in their own homes is increasing because of the aging of the
European population: in 2030, 37% of the European population will be over 60 years old, and the
elderly over 80, who represent 3% today, will represent 10% in 2050 [1]. In France as well,
persons older than 60 represent about 20% of the population today and will represent 33%
in 2050 (INSEE Première n°1089, July 2006). Elderly people living alone at home have an
increased risk of home accidents such as falls, due to cognitive or physical frailty. A
statistical study indicates that 7% of elderly people have a home accident related to everyday life
activities, and in 84% of these cases a fall occurs [2]. Practically all industrialized countries are
affected by this phenomenon.
E-Health systems, such as medical remote monitoring, reduce the consequences of home accidents
through distress situation detection and quick alarm transmission to the emergency services or to a
person close to the patient. Automatic monitoring of a person living alone can detect not only a distress
situation, like a fall or faintness, but also other pathologies (cold, pulmonary disease). Currently,
remote monitoring systems use several fixed sensors (infrared) and mobile sensors (fall detectors,
movement and pulse sensors) to detect a distress situation [3], [4]. We have already proposed a system which
extracts information on the patient's status through sound environment monitoring [5]. This
former system acquires and analyses data from 5 microphones. It detects everyday life sounds
like glass breaking, screams and falls, and distress expressions such as "Help" or "A doctor, please!".
The extracted sounds or sentences are not recorded, except in the case of an alarm situation, in
order to preserve the patient's privacy.
In this article, a new real-time implementation of the sound monitoring algorithms on an
embedded PC is proposed, using the standard sound card and a microphone. This real-time
implementation allows the sound modality to be used, coupled or not with other systems, in order
to increase the safety of elderly people living alone. The paper starts with a global description of
the system, followed by a succinct description of the algorithms. The real-time implementation is
preceded by a presentation of the sound corpus. Finally, the system evaluation and the
interpretation of the results are described in the last section of the paper.
SOUND REMOTE MONITORING
The proposed sound remote monitoring system analyzes the acoustical environment in real time
and is made up of four main modules, presented in Figure 1.
Figure 1 - Sound monitoring architecture. (Block diagram: the Sound Event Detection and Extraction Module (M1) feeds the Sound/Speech Classification Module (M2); sounds go to Sound Recognition (M3.1) and speech to Speech Recognition (M3.2); recognized events such as "Door slap", "Door lock", "Glass breaking", "Fall" or "Help!" may trigger an alarm.)
The signal extracted by the M1 module, which runs in real time, is classified as sound or speech
by the M2 module. If the signal is labelled as sound, the sound recognition module M3.1 assigns it
to one of eight predefined sound classes; if it is labelled as speech, the extracted
signal is analyzed by a speech recognition engine in order to detect distress sentences. In both
cases, if an alarm situation is identified (the sound or the sentence belongs to an alarm
class), an e-mail message or an SMS is sent to the close person and/or a message is sent to the medical
telemonitoring center. First, a local acoustic and/or visual alarm is generated; if the patient
does not cancel the alarm, the alarm message is sent.
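As an illustration of this alarm path, the Python sketch below sends an e-mail with the recorded sound attached; the function name, the SMTP host and the addresses are placeholders, not part of the actual implementation.

```python
import smtplib
from email.message import EmailMessage

def send_alarm_email(event_label, wav_path, smtp_host, sender, recipient):
    """Send an alarm e-mail with the recorded sound attached. The SMTP
    host and both addresses are placeholders to be configured."""
    msg = EmailMessage()
    msg["Subject"] = "ALARM: %s detected" % event_label
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("A distress event (%s) was identified." % event_label)
    with open(wav_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="audio",
                           subtype="wav", filename="event.wav")
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)
```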
Sound Event Detection Module (M1)
The sound flow is analyzed by a wavelet-based algorithm aiming at sound event detection.
This algorithm must be robust to noise such as environmental noise from the neighbourhood, water flow,
a ventilator or an electric shaver. An algorithm based on the energy of the wavelet
coefficients was therefore proposed and evaluated in [6]. Using properties of the wavelet transform,
this algorithm precisely detects the beginning and the end of the signal.
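The exact algorithm is described in [6]; the following Python sketch only illustrates the general idea of comparing the energy of the wavelet detail coefficients against an adaptive background noise estimate. The wavelet family, decomposition level and threshold factor are illustrative assumptions.

```python
import numpy as np
import pywt  # PyWavelets

def frame_wavelet_energy(frame, wavelet="db4", level=3):
    """Energy of the wavelet detail coefficients of one 2048-sample frame."""
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    return sum(np.sum(c ** 2) for c in coeffs[1:])  # skip the approximation

def detect_events(frames, factor=3.0):
    """Flag frames whose wavelet energy exceeds an adaptive estimate of
    the background noise energy; 'factor' is an illustrative threshold."""
    noise, events = None, []
    for i, frame in enumerate(frames):
        e = frame_wavelet_energy(frame)
        if noise is None:
            noise = e                        # initialise on the first frame
        if e > factor * noise:
            events.append(i)                 # sound event detected
        else:
            noise = 0.95 * noise + 0.05 * e  # track the background level
    return events
```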
Sound/Speech Classification Module (M2)
The method used by this module is based on Gaussian Mixture Models (GMM) [7] (K-means
initialization followed by 20 steps of Expectation Maximisation). Other possibilities exist for signal
classification: Hidden Markov Models (HMM), Bayesian methods, etc. Even though similar results have
been obtained with such methods, their high complexity and computation time preclude a
real-time implementation.
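A minimal sketch of this training and classification scheme, using scikit-learn's GaussianMixture with K-means initialization followed by at most 20 EM iterations, as described above. The diagonal covariance type is an assumption, not stated in the paper.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=24):
    """Train one GMM per class on its (n_frames, 24) LFCC matrix:
    K-means initialisation, then at most 20 EM iterations."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",   # assumption
                          init_params="kmeans", max_iter=20)
    return gmm.fit(features)

def sound_or_speech(features, gmm_sound, gmm_speech):
    """Label an extracted event; score() returns the mean per-frame
    log-likelihood, which suffices for comparing the two models."""
    return ("speech" if gmm_speech.score(features) > gmm_sound.score(features)
            else "sound")
```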
A preliminary step before signal classification is the extraction of acoustic parameters: LFCC
(Linear Frequency Cepstral Coefficients) with 24 filters. This type of parameter was chosen for its
properties: the bank of filters has a constant bandwidth, which gives equal resolution at the
high frequencies often encountered in everyday life sounds.
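The sketch below illustrates how such LFCC parameters can be computed: 24 linearly spaced triangular filters applied to the power spectrum of each frame, followed by a log and a DCT. The frame length and overlap match the values given later (16 ms windows, 8 ms overlap at 16 kHz); the Hamming window is an assumption.

```python
import numpy as np
from scipy.fftpack import dct

def lfcc(signal, fs=16000, frame_len=256, hop=128, n_filters=24):
    """Linear Frequency Cepstral Coefficients (illustrative sketch)."""
    n_fft = frame_len
    # Linearly spaced triangular filterbank over [0, fs/2]
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = edges[m - 1], edges[m], edges[m + 1]
        fbank[m - 1, l:c + 1] = np.linspace(0, 1, c - l + 1)  # rising edge
        fbank[m - 1, c:r + 1] = np.linspace(1, 0, r - c + 1)  # falling edge
    window = np.hamming(frame_len)
    coeffs = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        log_energy = np.log(fbank @ power + 1e-10)
        coeffs.append(dct(log_energy, norm="ortho"))
    return np.array(coeffs)  # shape (n_frames, 24)
```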
The Bayesian Information Criterion (BIC) is used to find the optimal number of
Gaussians [8]. The best performance was obtained with 24 Gaussians.
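A sketch of this BIC-based model selection; the candidate list of Gaussian counts is illustrative.

```python
from sklearn.mixture import GaussianMixture

def select_n_gaussians(features, candidates=(4, 8, 12, 16, 20, 24, 28, 32)):
    """Keep the number of Gaussians minimising the BIC."""
    best_n, best_bic = None, float("inf")
    for n in candidates:
        gmm = GaussianMixture(n_components=n, covariance_type="diag",
                              init_params="kmeans", max_iter=20).fit(features)
        bic = gmm.bic(features)
        if bic < best_bic:
            best_bic, best_n = bic, n
    return best_n  # 24 for sound/speech classification, 12 for sound classes
```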
Sound Recognition Module (M3.1)
This module is also based on a GMM algorithm. The LFCC acoustical parameters are
used for the same reasons as in the sound/speech module and with the same configuration: 24
filters. The BIC method was used to determine the optimal number of Gaussians:
12 in the case of sounds. A log-likelihood is computed for the unknown signal against each
predefined sound class; the class with the highest log-likelihood is the output of this
module.
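The decision rule can be sketched as follows, assuming one trained GMM per sound class; the class names in ALARM_CLASSES are illustrative labels for the abnormal category.

```python
ALARM_CLASSES = {"glass_breaking", "scream", "object_fall"}  # abnormal sounds

def recognize_sound(features, class_gmms):
    """class_gmms maps each of the 8 class names to its trained GMM.
    score() is the mean per-frame log-likelihood, so multiplying by the
    frame count gives the total log-likelihood of the event."""
    scores = {name: gmm.score(features) * len(features)
              for name, gmm in class_gmms.items()}
    return max(scores, key=scores.get)  # class with the highest log-likelihood
```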
Speech Recognition Module (M3.2)
For speech recognition, the autonomous system RAPHAEL is used [9]. The language model of
this system is a medium-vocabulary statistical model (around 11,000 words). This model was
obtained using textual information extracted from the Internet, as described in [10], and from
the "Le Monde" corpora, and was then optimized for the distress sentences of our corpus. In order to
ensure good speaker independence, the acoustic models of RAPHAEL were trained
on large corpora recorded by nearly 300 French speakers [11]: the BREF80, BREF120 and
BRAF100 corpora.
Sound Data Base
In order to train, test and validate each module and the global system, we
composed an everyday life sound database and recorded an adapted French speech corpus. From
these two databases we generated a noisy corpus with four signal-to-noise ratios (0
dB, +10 dB, +20 dB, +40 dB), which was used to evaluate the detection and classification
modules. The HIS ("Habitat Intelligent pour la Santé", Intelligent Habitat for Health) noise was recorded in an experimental test
apartment.
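Such a noisy corpus can be generated along the following lines, mixing each clean recording with the HIS noise at the target SNR; this is an illustrative sketch, not the exact corpus generation procedure.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Mix a clean sound with background noise at a target SNR (in dB)."""
    noise = np.resize(noise, len(signal))   # loop or crop the noise recording
    gain = np.sqrt(np.mean(signal ** 2) /
                   (np.mean(noise ** 2) * 10 ** (snr_db / 10.0)))
    return signal + gain * noise

# Four noisy versions of each file, as in the evaluation corpus:
# noisy = {snr: mix_at_snr(x, his_noise, snr) for snr in (0, 10, 20, 40)}
```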
The everyday life sounds are divided into 8 classes corresponding to 2 categories: normal sounds
related to the usual activities of the patient (door slap, phone ringing, step sounds, dishes sounds,
door lock) and abnormal sounds related to distress situations (breaking glass, screams,
object falls). This database contains recordings made at the LIG laboratory (66%), files from the "Sound
Scene Database in Real Acoustical Environment" [12] (13%), files from the Internet [13] (10%)
and files from a commercial CD (11%). An omnidirectional wireless microphone (Sennheiser
eW500) was used for the recordings made at LIG. The life sound database has a total duration
of 35 minutes and consists of 1,985 audio files.
The speech database was recorded at the LIG laboratory by 21 speakers (11 men and 10
women) between 20 and 65 years old. It is composed of 126 sentences in French: 66 are
characteristic of a normal situation for the patient, such as "Bonjour" (Hello) or "Où est le sel ?" (Where is the
salt?), and 60 are distress sentences, such as "Au secours" (Help) or "Un médecin vite" (A doctor, quickly).
The speech database has a total duration of 38 minutes and consists of 2,646 audio files.
REAL TIME IMPLEMENTATION
The sound telemonitoring system has been implemented on an Embedded PC using the
integrated sound card. The advantages of this implementation are its reduced dimensions, its
silence (fanless operation) and its low cost. At the same time, the implementation is flexible and can also be
installed on a desktop or laptop PC equipped with an internal or external sound card.
Figure 2 - Real-time architecture for sound monitoring and application front panel. (Block diagram: sound acquisition through the PC sound card fills a double buffer; a software interruption triggers the Sound Event Detection thread; detected events are posted through a safe communication queue to the Recognition Thread, which chains Signal Classification, Sound Recognition or Speech Recognition (RAPHAEL) and Event Analysis; alarms are sent by e-mail, SMS or to the monitoring center; a Graphical User Interface displays the classified sound event list and the alarm configuration.)
The system is divided into four parallel threads and implemented under LabWindows/CVI, as
shown in Figure 2. The sound signal acquisition is performed
through the sound card using low-level Win32 functions, which allow the use of a double buffer
processed via software interruptions. The sampling frequency is set to 16 kHz and the buffer
size to 2x2048 samples, corresponding to the algorithm constraints. Each time a sound
buffer is full, an interruption calls the detection algorithm. When a sound event is detected,
the signal is temporarily recorded on the hard disk as a wav file. Once the file is recorded, the
detection thread sends a message (the file name) through a safe communication queue to
the recognition thread.
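The original implementation is written in C under LabWindows/CVI with low-level Win32 calls; the following Python sketch only mirrors the design: a buffer-full callback runs the detector, stores the event as a wav file and posts the file name to a thread-safe queue. The detector.process API is a hypothetical wrapper around the wavelet detection algorithm.

```python
import itertools
import queue
import wave

msg_queue = queue.Queue()     # safe communication queue to the recognition thread
_event_id = itertools.count()

def on_buffer_full(samples, detector):
    """Called each time one 2048-sample acquisition buffer fills up (the
    software interruption of the Win32 implementation). 'detector.process'
    is a hypothetical hook assumed to return the extracted event as an
    int16-compatible NumPy array, or None if no event is present."""
    event = detector.process(samples)
    if event is not None:
        path = "event_%d.wav" % next(_event_id)
        with wave.open(path, "wb") as f:
            f.setnchannels(1)
            f.setsampwidth(2)          # 16-bit PCM
            f.setframerate(16000)      # 16 kHz sampling, as in the paper
            f.writeframes(event.astype("int16").tobytes())
        msg_queue.put(path)            # wake the recognition thread
```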
The recognition thread is started in parallel with the detection thread and waits for a message from
the detection task. As soon as a message is received, the sound/speech classification algorithm is
executed. If the signal is labelled as an everyday sound, the sound recognition algorithm is
started; if it is labelled as speech, the corresponding wav file is sent to the
speech recognition engine. In both cases, the Event Analysis sub-module decides which action
to start according to the recognized event: if an alarm sound or a distress sentence has
been detected, an alarm with the recorded sound is sent using the activated modality (e-mail,
SMS or TCP/IP to the remote monitoring center); if the processed event does not indicate an
alarm situation, the recorded file is deleted, but the type of event and the corresponding
time are written to a history file. The configurable choice of the action to carry out when a
distress event is detected allows an autonomous use of the remote monitoring system.
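The recognition side of this design can be sketched as a queue consumer; every callable passed to the loop is a hypothetical hook standing in for a module of Figure 2, and the recognizers are assumed to return an event object carrying an is_alarm flag.

```python
import os
import threading

def recognition_thread(msg_queue, classify, recognize_sound,
                       recognize_speech, send_alarm, log_event):
    """Consumer loop of the recognition thread (illustrative sketch)."""
    while True:
        path = msg_queue.get()              # blocks until M1 posts a file
        label = classify(path)              # sound/speech GMM (M2)
        if label == "sound":
            event = recognize_sound(path)   # 8-class GMM (M3.1)
        else:
            event = recognize_speech(path)  # RAPHAEL engine (M3.2)
        if event.is_alarm:
            send_alarm(event, path)         # e-mail, SMS or TCP/IP message
        else:
            os.remove(path)                 # delete the recording (privacy)
            log_event(event)                # keep only event type and time

# Started in parallel with the detection thread, e.g.:
# threading.Thread(target=recognition_thread, args=(...), daemon=True).start()
```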
The application front panel, presented in Figure 2, displays in real time the sound signal, a list of
previously detected events and a summary of the main alarm action parameters. A dedicated menu
allows the user to select the sound card to use (if more than one is available), to activate the action(s) to
carry out in the case of an alarm, and to configure the parameters of these actions (e-mail address of the
close person, SMTP e-mail server, IP address of the remote monitoring center).
RESULTS
Each module of the proposed sound telemonitoring system was validated separately, and a
first validation of the global implementation was performed. The error rates of each module, except
for the speech recognition module, are shown in Table 1.
Table 1 - Sound Telemonitoring Modules Evaluation (error rates as a function of SNR)

Module                               0 dB     10 dB    20 dB    40 dB
Detection (EER)                      3.7 %    0 %      0 %      0 %
Sound/Speech Classification (CER)    17.3 %   5.1 %    3.8 %    3.6 %
Sound Recognition (CER)              36.6 %   21.3 %   13 %     9.3 %
Detection
The detection module was evaluated via Receiver Operating Characteristic (ROC) curves, giving the missed
detection rate as a function of the false detection rate. The Equal Error Rate (EER) is 0% above +10
dB SNR and 3.7% at 0 dB (Table 1). The time precision of the detection is below 30 ms for the signal
beginning and below 100 ms for the signal end. The signal sampling rate was 16 kHz and the
analysis window 2048 samples (128 ms).
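For reference, the EER can be computed from the ROC curve as the operating point where the missed and false detection rates are equal; a possible computation (not the authors' code) is sketched below.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the ROC operating point where the missed detection rate
    equals the false detection rate."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = event, 0 = noise
    fnr = 1 - tpr                            # missed detection rate
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2
```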
Sound/Speech Classification
The analysis window was set to 16 ms (256 samples) with an overlap of 8 ms. The
sound/speech classification was evaluated using a cross-validation protocol: training is performed
on 80% of the database and the remaining 20% is used in the test stage (no test is done on a
model trained with the same speaker or the same sentence). Training is performed with clean
sounds and testing with sounds mixed with HIS noise at 0, +10, +20 and +40 dB SNR.
Speech/sound discrimination performance is evaluated through the Classification Error Rate
(CER). Table 1 presents the classification results for 24 LFCC parameters: the CER is about
4% above +10 dB and 17.3% at 0 dB.
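This protocol can be sketched as follows, with a k-fold split standing in for the 80%/20% partition (k=5); train_fn and predict_fn are hypothetical hooks around the GMM code sketched earlier.

```python
from sklearn.model_selection import KFold

def cross_validated_cer(clean_feats, noisy_feats, labels,
                        train_fn, predict_fn, k=5):
    """CER under a k-fold protocol: models are trained on clean features
    and tested on the noisy versions of the held-out files."""
    errors = total = 0
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(labels):
        model = train_fn([clean_feats[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in test_idx:
            errors += int(predict_fn(model, noisy_feats[i]) != labels[i])
            total += 1
    return errors / total
```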
Sound Recognition
The analysis window was set to 16 ms with an overlap of 8 ms, and 24 LFCC parameters were
used. The classification was evaluated with a cross-validation protocol: 90% of the database
for training and 10% for the test stage. The module is evaluated through the CER; the
performance is presented in Table 1. The CER is 21.3% at +10 dB and falls to 13% at +20 dB and 9.3% at +40 dB.
Speech Recognition
It is very important that the keywords related to a distress situation are well recognized. The
speech recognition system was evaluated on sentences pronounced by 5 speakers of our
corpus (630 tests). For normal sentences, an unexpected distress keyword is inserted by the
system in 6% of the cases, leading to a False Alarm Sentence. For distress sentences, the
distress keyword is missed in 16% of the cases, leading to a Missed Alarm
Sentence. This often occurs with isolated words like "Aïe" (Ouch) or "SOS", or with
syntactically incorrect French expressions like "Ça va pas bien" (I am not feeling well). The overall
speech recognition error rate is then 22%.
Real Time Evaluation
A first evaluation of the global sound remote monitoring system implemented in real time was
performed. The implementation was tested on an Embedded PC (AEON-6810) under Windows XP, equipped with a USB sound card (Creative, 24 bits) and a Sennheiser microphone (ME 104
ANT). These first results are encouraging and will be followed by systematic testing.
CONCLUSIONS
In this paper we have presented a real-time implementation of a sound remote monitoring system
on an Embedded PC. Through a continuous analysis of the sound environment, the implementation
recognizes everyday life sounds which indicate a distress situation, as well as distress
speech sentences. The advantages of this implementation come from the use of an
Embedded PC and from the flexibility of the software concerning alarm generation.
The system will be improved by adding a real-time SNR estimator that will allow the adaptation of
the GMM models. The information extracted from sound can be used alone in a lightweight
e-Health system, or as complementary information for other remote
monitoring systems. Future developments aim at combining this modality with the output of other
medical sensors in order to increase the reliability of the system.
REFERENCES
[1]. European Commission, “Europe’s response to World Ageing. Promoting economic and social
progress in an ageing world”, Second World Assembly on Ageing, 18 March 2002.
[2]. B. Thélot (dir.), "Résultats de l'Enquête Permanente sur les Accidents de la Vie Courante", Réseau
EPAC, Institut de Veille Sanitaire, Département maladies chroniques et traumatismes, June 2003.
[3]. G. L. Bellego, N. Noury, G. Virone, M. Mousseau, and J. Demongeot, “Measurement and model of
the activity of a patient in his hospital suite”, IEEE TITB, vol. 10, pp. 92–99, January 2006.
[4]. J. L. Baldinger, J. Boudy, et al., “Tele-Surveillance System for Patient at Home: the MEDIVILLE
System”, in ICCHP 2004, France, 2004.
[5]. M. Vacher, J.-F. Serignat, S. Chaillol, D. Istrate and V. Popescu, "Speech and Sound Use in a
Remote Monitoring System for Health Care", Lecture Notes in Computer Science, Artificial
Intelligence, Text Speech and Dialogue, vol. 4188, 2006, pp. 711-718, ISBN 978-3-540-39090-9.
[6]. D. Istrate, E. Castelli, M. Vacher, L. Besacier, and J. Serignat, “Information extraction from sound for
medical telemonitoring”, IEEE TITB, vol. 10, pp. 264–274, April 2006.
[7]. D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,”
Speech Comm., vol. 17, no. 1, pp. 91–108, January 1995.
[8]. G. Schwarz, “Estimating the dimension of a model”, Annals of Statistics, vol. 6, pp. 461–464, 1978.
[9]. M. Akbar and J. Caelen, "Parole et traduction automatique : le module de reconnaissance RAPHAEL",
COLING-ACL'98, Montréal, Quebec, vol. 2, pp. 36-40, 1998.
[10]. D. Vaufreydaz, J. Rouillard, M. Akbar, “Internet Documents: a Rich Source for Spoken Language
Modelling“, IEEE Workshop ASRU’99, Keystone-Colorado, USA, December 1999, pp. 277-281.
[11]. J.L. Gauvain, L.F. Lamel, M. Eskenazi, “Design considerations and text selection for BREF, a large
French read-speech corpus”, ICSLP ’90, Kobe, Japan (1990), pp. 1097–1100.
[12]. RealWorld Computing Partnership, “CD – Sound Scene Database in Real Acoustical Environments”
(1998–2001).
[13]. “Bruitage”, Bruitage Gratuits, http://www.sound-fishing.net/bruitages.htm, Nov. 2005.