EMBEDDED IMPLEMENTATION OF DISTRESS SITUATION IDENTIFICATION THROUGH SOUND ANALYSIS

Dan Istrate, Assistant Professor, ESIGETEL, Avon, France - [email protected]
Michel Vacher, Research Scientist, LIG, Grenoble, France - [email protected]
Jean-François Serignat, Senior Lecturer, LIG, Grenoble, France - [email protected]

ABSTRACT

The safety of elderly people living alone at home is a crucial problem because of the growing aging population and the high risk of home accidents such as falls. Medical remote monitoring systems may increase the safety of such people by detecting a state of emergency and announcing it quickly. We have already proposed a sound-based medical remote monitoring system. Distress sounds such as glass breaking, screams and falls, and distress expressions such as "Help" or "A doctor, please!", are detected and recognized through a continuous analysis of the sound flow. When a distress situation is identified, the software can send an alarm with the recognized data to a relative or close friend and/or to a medical center. In this paper, a real-time implementation of this system is presented. The advantages of this implementation on an embedded PC, equipped with a standard sound card and a microphone, are its reduced dimensions, its silent (fanless) operation and its low cost. At the same time, the implementation is flexible and can also be installed on a desktop or laptop PC.

KEYWORDS: sound detection, sound/speech classification, sound recognition, signal processing, embedded system, telemedicine.

INTRODUCTION

The number of elderly people living alone in their own homes is increasing because of the aging of the European population: in 2030, 37% of the European population will be over 60 years old, and people over 80, who represent 3% today, will represent 10% in 2050 [1]. In France as well, people older than 60 represent about 20% of the population today and will represent 33% in 2050 (INSEE Première n°1089, July 2006). Elderly people living alone at home have an increased risk of home accidents, such as falls, due to cognitive or physical frailty. A statistical study indicates that 7% of elderly people have a home accident during everyday life activities, and in 84% of these cases a fall occurs [2]. Practically all industrialized countries are affected by this phenomenon.

E-Health systems such as medical remote monitoring reduce the consequences of home accidents through distress situation detection and quick alarm transmission to the emergency services or to a relative. The automatic monitoring of a person living alone can detect not only a distress situation like a fall or faintness, but also other pathologies (cold, pulmonary disease). Current remote monitoring systems use several fixed sensors (infrared) and mobile sensors (fall detector, movement and pulse) to detect a distress situation [3], [4]. We have already proposed a system which extracts information about the patient's status through sound environment monitoring [5]. This former system acquires and analyses data from 5 microphones. It detects everyday life sounds such as glass breaking, screams and falls, and distress expressions such as "Help" or "A doctor, please!". The extracted sounds or sentences are not recorded, except in the case of an alarm situation, in order to preserve patient privacy. In this article, a new real-time implementation of the sound monitoring algorithms on an embedded PC is proposed, using the standard sound card and a microphone.
This real-time implementation allows the sound modality to be used, alone or coupled with other systems, in order to increase the safety of elderly people living alone.

This paper starts with a global description of the system, followed by a brief description of the algorithms. The real-time implementation is preceded by the presentation of the sound corpus. Finally, the system evaluation and the interpretation of the results are described in the last section of the paper.

SOUND REMOTE MONITORING

The proposed sound remote monitoring system analyzes the acoustic environment in real time and is made up of four main modules, which are presented in Figure 1.

[Figure 1 - Sound monitoring architecture: the Sound Event Detection and Extraction module (M1) feeds the Sound/Speech Classification module (M2), which routes the signal either to Sound Recognition (M3.1) or to Speech Recognition (M3.2) before the alarm decision.]

The signal extracted by the M1 module, which runs in real time, is classified as sound or speech by the M2 module. In the case of a sound label, the sound recognition module M3.1 classifies the signal among eight predefined sound classes, while in the case of a speech label, the extracted signal is analyzed by a speech recognition engine in order to detect distress sentences. In both cases, if an alarm situation has been identified (the sound or the sentence belongs to an alarm class), an email or SMS message is sent to a relative and/or a message is sent to the medical telemonitoring center. First, a local acoustic and/or visual alarm is generated, and if the patient does not respond by cancelling the alarm, the alarm message is sent.

Sound Event Detection Module (M1)

The sound flow is analyzed by a wavelet-based algorithm aiming at sound event detection. This algorithm must be robust to noise such as neighbourhood environmental noise, water flow, a ventilator or an electric shaver. Therefore, an algorithm based on the energy of wavelet coefficients was proposed and evaluated in [6]. This algorithm precisely detects the beginning and the end of the signal, using properties of the wavelet transform.

Sound/Speech Classification Module (M2)

The method used by this module is based on a Gaussian Mixture Model (GMM) [7] (K-means initialization followed by Expectation Maximisation in 20 steps). There are other possibilities for signal classification: Hidden Markov Models (HMM), Bayesian methods, etc. Even if similar results have been obtained with these other methods, their high complexity and computation time prevent a real-time implementation. A preliminary step before signal classification is the extraction of acoustic parameters: LFCC (Linear Frequency Cepstral Coefficients) with 24 filters. This type of parameter was chosen because it uses a bank of filters with constant bandwidth, which gives equal resolution at the high frequencies often encountered in everyday life sounds. The BIC (Bayesian Information Criterion) is used to find the optimal number of Gaussians [8]. The best performance has been obtained with 24 Gaussians.

Sound Recognition Module (M3.1)

This module is also based on a GMM algorithm. The LFCC acoustic parameters are used for the same reasons as for the sound/speech classification module and with the same configuration: 24 filters. The BIC method has been used to determine the optimal number of Gaussians: 12 in the case of sounds. A log-likelihood is computed for the unknown signal against each predefined sound class; the class with the largest log-likelihood is the output of this module.
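To make this classification back end more concrete, the sketch below shows how such per-class GMMs could be trained (K-means initialization followed by EM), how BIC can select the number of Gaussians, and how the maximum log-likelihood decision is taken. It is only a minimal illustration assuming scikit-learn; the class names, feature shapes, covariance type and BIC search range are placeholders rather than details taken from the paper, and the original modules are not implemented in Python.

```python
# Illustrative sketch only: per-class GMMs with K-means initialization, EM
# training, BIC model selection and a max log-likelihood decision.
# Assumptions: scikit-learn, diagonal covariances, and the candidate component
# counts below; none of these are specified as such in the paper.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_model(features, candidate_components=(4, 8, 12, 16, 24)):
    """Train one GMM per class; keep the number of Gaussians with the lowest BIC.

    features: array of shape (n_frames, n_lfcc) pooled over the training files
    of a single class (e.g. "door slap" or "glass breaking").
    """
    best_model, best_bic = None, np.inf
    for n in candidate_components:
        gmm = GaussianMixture(n_components=n, covariance_type="diag",
                              init_params="kmeans", max_iter=20, random_state=0)
        gmm.fit(features)
        bic = gmm.bic(features)          # Bayesian Information Criterion
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model

def classify(features, class_models):
    """Return the class whose GMM gives the highest log-likelihood for the signal."""
    scores = {name: gmm.score(features) for name, gmm in class_models.items()}
    return max(scores, key=scores.get)
```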
Speech Recognition Module (M3.2)

For speech recognition, the autonomous system RAPHAEL is used [9]. The language model of this system is a medium-vocabulary statistical model (around 11,000 words). This model is obtained using textual information extracted from the Internet, as described in [10], and from the "Le Monde" corpus. It is then optimized for the distress sentences of our corpus. In order to ensure good speaker independence, the acoustic models of RAPHAEL have been trained on large corpora recorded with nearly 300 French speakers [11]: the BREF80, BREF120 and BRAF100 corpora.

Sound Data Base

In order to train, test and validate the modules and the global system, we have composed an everyday life sound data base and recorded an adapted French speech corpus. From these two data bases we generated a noisy corpus with 4 signal-to-noise ratio levels (0 dB, +10 dB, +20 dB, +40 dB), which was used to evaluate the detection and classification modules. The HIS ("Habitat Intelligent pour la Santé") noise was recorded in an experimental test apartment.

The everyday life sounds are divided into 8 classes corresponding to 2 categories: normal sounds related to the usual activities of the patient (door slap, phone ringing, step sounds, dishes sounds, door lock) and abnormal sounds related to distress situations (glass breaking, screams, object falls). This data base contains recordings made at the LIG laboratory (66%), files from the "Sound Scene Database in Real Acoustical Environment" [12] (13%), files from the Internet [13] (10%) and files from a commercial CD (11%). An omni-directional wireless microphone (Sennheiser eW500) was used for the recordings made at LIG. The life sound data base has a total duration of 35 minutes and consists of 1,985 audio files.

The speech data base has been recorded at the LIG laboratory by 21 speakers (11 men and 10 women) between 20 and 65 years old. It is composed of 126 sentences in French: 66 are characteristic of a normal situation for the patient, e.g. "Bonjour" (Hello), "Où est le sel" (Where is the salt), and 60 are distress sentences, e.g. "Au secours" (Help), "Un médecin vite" (A doctor, quickly). The speech data base has a total duration of 38 minutes and consists of 2,646 audio files.

REAL TIME IMPLEMENTATION

The sound telemonitoring system has been implemented on an embedded PC using the integrated sound card. The advantages of this implementation are its reduced dimensions, its silent (fanless) operation and its low cost. At the same time, the implementation is flexible and can also be installed on a desktop or laptop PC equipped with an internal/external sound card.

[Figure 2 - Real-time architecture for sound monitoring and application front panel: sound acquisition through the PC sound card (double buffer, interruption), sound event detection, signal classification, sound recognition and speech recognition (RAPHAEL) threads, safe thread communication, event analysis, alarm outputs (e-mail, SMS, monitoring center), graphical user interface and alarm configuration.]

The system is divided into four parallel threads and implemented under LabWindows/CVI, as shown in Figure 2. The sound signal is acquired through the sound card using low-level Win32 functions, which allow the use of a double buffer processed via software interruptions. The sampling frequency is fixed to 16 kHz and the buffer size to 2x2048 samples, corresponding to the algorithm constraints. Each time a sound buffer is full, an interruption calls the detection algorithm; a sketch of such a double-buffered capture loop is given below.
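The original acquisition relies on the low-level Win32 audio API under LabWindows/CVI. The following is only a rough, hedged equivalent of the double-buffered, interruption-driven capture loop, assuming the third-party Python sounddevice library and a placeholder detector; it is not the paper's code.

```python
# Rough illustration of the double-buffered acquisition scheme, NOT the paper's
# Win32/LabWindows-CVI implementation. Assumes the third-party "sounddevice"
# library; detect_sound_event() is a hypothetical stand-in for the wavelet
# detector of module M1.
import queue
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000          # 16 kHz, as in the paper
BUFFER_SIZE = 2048           # one 2048-sample buffer of the double buffer

buffer_queue = queue.Queue() # filled by the callback, read by the detection loop

def detect_sound_event(buffer: np.ndarray) -> None:
    # Placeholder for the wavelet-energy detector (module M1), not reproduced here.
    pass

def audio_callback(indata, frames, time_info, status):
    """Called by the audio driver each time a buffer is full, playing the role
    of the software interruption described above."""
    if status:
        print(status)
    buffer_queue.put(indata[:, 0].copy())   # hand the full buffer to the detector

def detection_loop():
    """Consume full buffers and run the (placeholder) detection algorithm."""
    while True:
        detect_sound_event(buffer_queue.get())

with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                    blocksize=BUFFER_SIZE, callback=audio_callback):
    detection_loop()
```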
When a sound event is detected, the signal is temporarily recorded on the hard disk as a wav file. Once the file is recorded, the detection thread also sends a message (the file name) through a thread-safe communication queue to the recognition thread.

The recognition thread is started in parallel with the detection thread and waits for messages from the detection task. As soon as a message is received, the sound/speech classification algorithm is executed. Then, if the signal is labelled as an everyday sound, the sound recognition algorithm is started; otherwise, if the signal is labelled as speech, the corresponding wav file is sent to the speech recognition engine. In both cases, the Event Analysis sub-module decides which action to start according to the recognized event: if an alarm sound or a distress sentence has been detected, an alarm with the recorded sound is sent using the activated modality (email, SMS or TCP/IP to the remote monitoring center); if the processed event does not indicate an alarm situation, the recorded file is deleted, but the type of event and the corresponding time are nevertheless written to the history file. The possible choice of the action to carry out when a distress event is detected allows an autonomous use of the remote monitoring system.

The application front panel, presented in Figure 2, displays in real time the sound signal, a list of previously detected events and a summary of the main alarm action parameters. A dedicated menu allows the user to select the sound card to use (if there is more than one), to activate the action(s) to carry out in case of alarm and to configure the parameters of these actions (the contact person's email address, SMTP email server, IP address of the remote monitoring center).
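As an illustration of the detection-to-recognition hand-off and of the Event Analysis decision described above, the sketch below uses Python's standard threading and queue modules, with stub functions standing in for the M2, M3.1 and M3.2 modules. It is a hedged approximation of the scheme only; the real system implements it with LabWindows/CVI threads and a safe communication queue.

```python
# Minimal sketch of the detection -> recognition hand-off and event analysis.
# All helper functions are hypothetical stubs for the paper's modules; they do
# not reflect the actual LabWindows/CVI implementation.
import os
import queue
import threading

ALARM_SOUNDS = {"glass breaking", "scream", "object fall"}    # abnormal classes
message_queue = queue.Queue()        # file names posted by the detection thread

def classify_sound_or_speech(wav_path):   # stub for module M2 (GMM)
    return "sound"

def recognize_sound(wav_path):            # stub for module M3.1 (GMM, 8 classes)
    return "door slap"

def recognize_speech(wav_path):           # stub for module M3.2 (RAPHAEL)
    return "bonjour"

def is_distress_sentence(sentence):       # stub: distress keyword spotting
    return "au secours" in sentence

def send_alarm(event, wav_path):          # stub: email / SMS / TCP-IP message
    print("ALARM:", event, wav_path)

def log_history(event):                   # stub: keep only event type and time
    print("history:", event)

def recognition_worker():
    while True:
        wav_path = message_queue.get()                 # wait for a detected event
        if classify_sound_or_speech(wav_path) == "sound":
            event = recognize_sound(wav_path)
            alarm = event in ALARM_SOUNDS
        else:
            event = recognize_speech(wav_path)
            alarm = is_distress_sentence(event)
        if alarm:
            send_alarm(event, wav_path)                # keep the recording
        else:
            log_history(event)
            if os.path.exists(wav_path):
                os.remove(wav_path)                    # preserve patient privacy

threading.Thread(target=recognition_worker, daemon=True).start()
# The detection thread would post each extracted event with:
#   message_queue.put("event_0001.wav")
```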
RESULTS

Each module of the proposed sound telemonitoring system has been validated separately, and a first validation of the global implementation has been done. The results of each module, except for the speech recognition module, are given in Table 1.

Table 1 - Sound telemonitoring modules evaluation (error rates)

  SNR      Detection    Sound/Speech Classification    Sound Recognition
  0 dB     3.7 %        17.3 %                         36.6 %
  +10 dB   0 %          5.1 %                          21.3 %
  +20 dB   0 %          3.8 %                          13 %
  +40 dB   0 %          3.6 %                          9.3 %

Detection

The detection module was evaluated via Receiver Operating Characteristic (ROC) curves, giving the missed detection rate as a function of the false detection rate. The Equal Error Rate (EER) is 0% at SNR levels of +10 dB and above, and 3.7% at 0 dB (Table 1). The time precision of the detection is less than 30 ms for the signal beginning and below 100 ms for the signal end. The signal sample rate was 16 kHz and the analysis window 2048 samples (128 ms).

Sound/Speech Classification

The analysis window was set to 16 ms (256 samples) with an overlap of 8 ms. The sound/speech classification was evaluated using a cross-validation protocol: training is performed with 80% of the data base and the remaining 20% is used in the test stage (no test is done on a model trained with the same speaker or the same sentence). Training is performed with clean sounds and testing with sounds mixed with HIS noise at 0, +10, +20 and +40 dB. Speech/sound discrimination performance is evaluated through the Classification Error Rate (CER). Table 1 presents the classification results for 24 LFCC parameters: the CER is around 4% above +10 dB and 17.3% at 0 dB.

Sound Recognition

The analysis window was set to 16 ms with an overlap of 8 ms, and 24 LFCC parameters were used. The classification was evaluated with a cross-validation protocol: 90% of the data base for training and 10% for the test stage. The module is evaluated through the CER, and the performance is given in Table 1. The CER is 13% above +20 dB and 21.3% at +10 dB.

Speech Recognition

It is very important that the key words related to a distress situation are well recognized. The speech recognition system has been evaluated on sentences pronounced by 5 speakers of our corpus (630 tests). For normal sentences, an unexpected distress key word is introduced by the system in 6% of the cases, which leads to a False Alarm Sentence. For distress sentences, the distress key word is missed in 16% of the cases, which leads to a Missed Alarm Sentence. This often occurs with isolated words like "Aïe" (Ouch) or "SOS", or with syntactically incorrect French expressions like "Ça va pas bien" (I am not feeling very well). The resulting speech recognition error rate is 22%.

Real Time Evaluation

A first evaluation of the global sound remote monitoring system implemented in real time was performed. The implementation was tested on an embedded PC (AEON-6810) under Windows XP, equipped with a USB sound card (Creative, 24-bit) and a Sennheiser microphone (ME 104 ANT). These first results are encouraging and will be followed by systematic tests.

CONCLUSIONS

In this paper we have proposed a real-time implementation of a sound remote monitoring system on an embedded PC. Through a continuous analysis of the sound environment, the implementation recognizes everyday life sounds which indicate a distress situation, as well as distress speech sentences. The advantages of this implementation come from the use of an embedded PC and from the flexibility of the software concerning alarm generation. The system will be improved by adding a real-time SNR estimator that will allow the adaptation of the GMM models. The information extracted from sound can be used alone in a lightweight e-Health system, but also as complementary information for other remote monitoring systems. Future developments aim at combining this modality with the output of other medical sensors in order to increase the system reliability.
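As a pointer to this planned extension only, one simple way to estimate the SNR of a detected event in real time is to compare the energy of the extracted event with the energy of the background noise measured just before its detected beginning. The sketch below is a hypothetical illustration of that idea, not the estimator the authors intend to use.

```python
# Hedged illustration of a possible frame-level SNR estimator for the planned
# GMM adaptation; this is NOT the paper's estimator, only one simple option.
# Assumes NumPy and that the event and preceding-noise segments are available.
import numpy as np

def estimate_snr_db(event_samples: np.ndarray, noise_samples: np.ndarray) -> float:
    """Estimate the SNR (dB) of a detected event against the preceding noise.

    event_samples: samples of the extracted sound event (module M1 output)
    noise_samples: samples taken just before the detected beginning of the event
    """
    noise_power = np.mean(noise_samples.astype(float) ** 2) + 1e-12
    event_power = np.mean(event_samples.astype(float) ** 2) + 1e-12
    signal_power = max(event_power - noise_power, 1e-12)   # remove the noise floor
    return 10.0 * np.log10(signal_power / noise_power)
```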
REFERENCES

[1] European Commission, "Europe's response to World Ageing. Promoting economic and social progress in an ageing world", Second World Assembly on Ageing, 18 March 2002.
[2] B. Thélot (dir.), "Résultats de l'Enquête Permanente sur les Accidents de la Vie Courante", Réseau EPAC, Institut de Veille Sanitaire, Département maladies chroniques et traumatismes, June 2003.
[3] G. L. Bellego, N. Noury, G. Virone, M. Mousseau and J. Demongeot, "Measurement and model of the activity of a patient in his hospital suite", IEEE Transactions on Information Technology in Biomedicine, vol. 10, pp. 92-99, January 2006.
[4] J. L. Baldinger, J. Boudy et al., "Tele-Surveillance System for Patient at Home: the MEDIVILLE System", ICCHP 2004, France, 2004.
[5] M. Vacher, J.-F. Serignat, S. Chaillol, D. Istrate and V. Popescu, "Speech and Sound Use in a Remote Monitoring System for Health Care", Text, Speech and Dialogue, Lecture Notes in Computer Science (Artificial Intelligence), vol. 4188, pp. 711-718, 2006, ISBN 978-3-540-39090-9.
[6] D. Istrate, E. Castelli, M. Vacher, L. Besacier and J. Serignat, "Information extraction from sound for medical telemonitoring", IEEE Transactions on Information Technology in Biomedicine, vol. 10, pp. 264-274, April 2006.
[7] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, vol. 17, no. 1, pp. 91-108, January 1995.
[8] G. Schwarz, "Estimating the dimension of a model", Annals of Statistics, vol. 6, pp. 461-464, 1978.
[9] M. Akbar and J. Caelen, "Parole et traduction automatique : le module de reconnaissance RAPHAEL", COLING-ACL'98, Montréal, Quebec, vol. 2, pp. 36-40, 1998.
[10] D. Vaufreydaz, J. Rouillard and M. Akbar, "Internet Documents: a Rich Source for Spoken Language Modelling", IEEE Workshop ASRU'99, Keystone, Colorado, USA, pp. 277-281, December 1999.
[11] J.-L. Gauvain, L. F. Lamel and M. Eskenazi, "Design considerations and text selection for BREF, a large French read-speech corpus", ICSLP'90, Kobe, Japan, pp. 1097-1100, 1990.
[12] Real World Computing Partnership, "CD - Sound Scene Database in Real Acoustical Environments", 1998-2001.
[13] "Bruitages gratuits", http://www.sound-fishing.net/bruitages.htm, November 2005.