Analyzing and Modeling Human Cognitive Approaches for Spoken Language Understanding
Sakriani Sakti
Nara Institute of Science and Technology (NAIST)
[email protected]
http://isw3.naist.jp/~ssakti/project.html

1. Project Goal
Today, the realization of human-machine interfaces via speech is becoming increasingly important, and one of the fundamental technologies is automatic speech recognition (ASR) and understanding. In most cases, the quality of speech recognition is evaluated solely by aiming at perfect transcription: systems are trained to minimize the word error rate, all words (including function words and fillers) are treated uniformly, and all errors are considered equally harmful. Despite rapid progress, however, performance still falls far below that of humans. In human-human interaction, on the other hand, the essence of communication is transmitting meaningful messages and having them received the way the speaker intended, even when not all words are correctly recognized. Conversely, even when every word is correctly recognized, communication may still fail because some words are unknown to the listener. Since the cognition process is closely tied to human memory, existing studies have shown that incongruity can be detected from brain signals even before the listener reacts vocally. In this study, we take a step forward and investigate how the brain of the receiver perceives the system output, based on event-related brain potentials (ERPs). Specifically, as shown in Fig. 1, we attempt to: (1) analyze how the human brain processes the impact of communication failure, including (a) when speech recognition errors occur and (b) when there are no recognition errors but misunderstanding or unknown words occur; and (2) model a classifier based on ERP wave patterns for automatic detection of communication failure.
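Goal (2) can be illustrated with a minimal sketch: a classifier over a scalar feature extracted from each ERP epoch. The feature window (a late positive, P600-like shift), the nearest-centroid model, and all names below are illustrative assumptions for exposition, not the project's actual classifier.

```python
import numpy as np

FS = 1000  # sampling rate in Hz; epochs are assumed time-locked to word onset

def p600_feature(epoch, fs=FS):
    """Mean amplitude in the 500-700 ms window, where a late positive shift may appear."""
    return epoch[int(0.5 * fs):int(0.7 * fs)].mean()

def fit_centroids(epochs, labels):
    """Fit a nearest-centroid classifier on the scalar feature, one centroid per class."""
    feats = np.array([p600_feature(e) for e in epochs])
    return {c: feats[labels == c].mean() for c in np.unique(labels)}

def predict(centroids, epoch):
    """Assign the class whose centroid is closest to the epoch's feature value."""
    f = p600_feature(epoch)
    return min(centroids, key=lambda c: abs(centroids[c] - f))
```

On synthetic epochs where one class carries an added late positive shift, this separates the two classes well above chance; a real system would of course use recorded EEG and a richer feature set.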
The aim of this project is to gain insights into how human cognition processes messages during communication, and hopefully to improve the capability of speech recognition and understanding systems with high-level concepts of human language understanding.

Figure 1. Analyze how the human brain processes the impact of communication failure

IJARC CORE10 project summary booklet

2. Technical breakthrough
EEG is an electrophysiological measurement that records the electrical signals generated by the brain through electrodes placed at different points on the scalp, and ERPs are signal-averaged EEG epochs that are time-locked to the presentation of an external event. EEG/ERP can image brain activity online (i.e., immediately at the time of stimulus processing) with high temporal resolution in the millisecond range, reflecting rapidly occurring cognitive processes. Over the last two decades, at least two well-known language-related ERP signatures have been identified and analyzed: the N400 for semantically incongruent words and the P600/SPS for syntactic anomalies. Here, we presented the system output as visual stimuli (see Figures 2(a) and (b)). We recorded EEG from 29 scalp sites using a BrainAmp amplifier (Brain Products GmbH). The ground electrodes were placed on both earlobes and the reference electrode on the tip of the nose. To improve the signal-to-noise ratio, the impedance of each electrode was reduced to less than 5 kΩ using electrode paste. EEG data were recorded at a sampling frequency of 1,000 Hz, and high-frequency components such as muscle artifacts were removed with a low-pass filter below 40 Hz. From the 1,024 ms of EEG following the onset of the visual stimuli, the signals to be analyzed were extracted into successive 256 ms (256-point) time segments (windows or epochs) with 50% overlap as the target data. A Hamming window was applied to each time segment to attenuate the leakage effect.
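The segmentation, windowing, and spectral-power steps above, together with the ERD measure ERD = ((reference power) − (target power)) / (reference power), can be sketched as follows. This is a minimal single-channel illustration; the frequency band, function names, and the use of a plain FFT power estimate are assumptions for exposition, not the project's exact implementation.

```python
import numpy as np

FS = 1000    # sampling frequency (Hz), as in the recording setup
WIN = 256    # 256 ms = 256 samples at 1 kHz
HOP = WIN // 2  # 50% overlap between successive windows

def band_power(segment, fs=FS, band=(8.0, 13.0)):
    """Mean spectral power of one segment in `band` (Hz), from a Hamming-windowed FFT."""
    w = np.hamming(len(segment))
    spec = np.abs(np.fft.rfft(segment * w)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return spec[mask].mean()

def epoch_powers(eeg, fs=FS, win=WIN, hop=HOP, band=(8.0, 13.0)):
    """Slice a 1-D EEG stretch into 50%-overlapping windows and return each band power."""
    return np.array([band_power(eeg[i:i + win], fs, band)
                     for i in range(0, len(eeg) - win + 1, hop)])

def erd(reference_power, target_power):
    """Event-related desynchronization: (reference - target) / reference."""
    return (reference_power - target_power) / reference_power
```

With a 1,024 ms stretch at 1 kHz, this yields seven overlapping 256 ms windows; in the analysis described here, the target powers come from the post-stimulus data and the reference powers from the data after the warning stimuli.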
The power density of the spectral components was then calculated using the fast Fourier transform. Furthermore, to calculate the power change, or event-related desynchronization (ERD), the same processing was carried out for the 1,024 ms of EEG data following the presentation of the warning stimuli, and the mean power of each frequency band was taken as the ERD reference. The ERD value was calculated with the simple equation: ERD = ((reference power) − (target power)) / (reference power).

The ERP results on speech recognition errors reveal that a positive shift (the P600 ERP component) appeared around 600 ms after the error words were presented. The amplitudes of the positive shift after substitution and deletion violations were much larger than after insertion violations (shown in Fig. 2(c)). For unknown-word perception (shown in Fig. 3), the P600 amplitude increased significantly at the time of known-word perception, while the N400 component increased significantly at the time of unknown-word perception. The classifier achieves significantly better accuracy than the chance rate. Thus, it was confirmed that EEG signals differ when miscommunication factors are perceived.

Figure 2. (a) Examples of speech recognition errors, including substitution, deletion, and insertion errors; (b) Presentation of visual stimuli with a substitution error; (c) The resulting ERP waveforms for the correct condition and three violation conditions: substitutions, deletions, and insertions.

Figure 3. (a) ERP waveforms for known and unknown words; (b) The accuracy of the automatic classifier for known and unknown words based on brain signals. The bar marked ** differs significantly from chance-rate accuracy (p < 0.05, binomial test).

3. Innovative Applications
One innovative application is online incongruity detection during human-machine interaction, as illustrated in Fig. 4. When miscommunication due to speech recognition errors is detected from the listener's brain signal, the system can give feedback to the speech recognizer and request a corrected result; when there are no recognition errors but misunderstanding or unknown words are detected, the system can give feedback to the speaker and request rephrasing with more easily understandable words. This could be done in real time, before the listener reacts vocally. Another direction is that, by gaining a deeper understanding of human cognitive processes, we could improve ASR capability with high-level concepts of human language understanding. This in turn would enable more flexible and natural speech recognition and understanding systems that provide better outputs for human communication. However, developing such applications requires further investigation of various aspects. For example, we still need to investigate the impact of errors on different types of words: content words such as nouns and verbs, function words such as particles, and fillers. We also need to investigate whether similar effects appear in other languages. We will therefore continue our research on human cognitive approaches for spoken language understanding.

Figure 4. An example of an innovative application, in which online incongruity detection from brain signals is applied during human-machine interaction

4. Academic Achievement
We successfully published several papers at both domestic and international conferences.
Regarding the EEG/ERP research for detecting communication mismatch, the results of the ERP study on ASR errors were published at APSIPA 2014 [paper (1) below], and the results of the ERP study on unknown words were published at IWSDS 2015 [papers (2) and (5) below]. Paper (2) was also selected for inclusion in the post-conference book publication in the Springer Lecture Notes in Electrical Engineering (LNEE) series. The research on enhancing EEG signals was published at ICASSP 2015 [paper (3) below], and additional experiments on improving an ASR system from a biologically inspired perspective were also published at ICASSP 2015 [paper (4) below]. ICASSP is one of the top conferences in the speech community.

5. Achievement in Talent Fostering
This project involved one principal investigator and four students as members: (1) Mr. Yu Odagaki, a PhD student, conducts EEG/ERP research on natural language processing in general and is currently writing a journal paper. (2) Mr. Takafumi Sasakura, an MSc student, has been very active in this project, especially in the EEG/ERP research for detecting communication mismatch, which is the main theme of his MSc dissertation. He worked very hard, successfully published at both international and domestic conferences, and graduates in March 2015. (3) Mr. Hayato Maki, an MSc student, focuses on EEG signal enhancement; his paper was accepted at ICASSP 2015. He also graduates this March and will continue to his PhD in our laboratory next year. (4) Mr. Andros Tjandra, an intern from the University of Indonesia, is a very talented student who worked specifically on the ASR side. He spent a three-month summer internship at our lab improving our ASR system, and his paper was accepted at ICASSP 2015. Although he has returned to Indonesia, our collaboration continues.

6. Collaboration with Microsoft Research
We are grateful to have had fruitful discussions with Prof. Junichi Tsujii and MSRA researchers during their visit to NAIST, with Dr. Frank Soong during Interspeech 2014 in Singapore, and with various researchers, including Prof. Sadaoki Furui, during last year's CORE 9 review meeting. As we have just started research in this area, most of this one-year project was spent on preliminary studies and experiments, so deep collaboration with MSRA has not yet been possible. However, since cognitive studies of communication touch many aspects of human-machine interfaces, we hope to have opportunities to continue and expand this study with MSRA researchers.

7. Project Development
Thanks to the Microsoft CORE project, we have had the opportunity to undertake a new and very challenging line of research. This project is part of our laboratory's long-term research plan, covering the study of human cognitive processes in order to support human-human and human-machine communication. It is also related to other ongoing projects supported by the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan, and JSPS KAKENHI Grant Number 26870371.

8. Publications
International Conferences
1) Sakriani Sakti, Yu Odagaki, Takafumi Sasakura, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "An Event-Related Brain Potential Study on the Impact of Speech Recognition Errors," Proceedings of the Asia Pacific Signal and Information Processing Association (APSIPA), Siem Reap, Cambodia, December 2014. [*]
2) Takafumi Sasakura, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "Unknown Word Detection Based on Event-Related Brain Desynchronization Responses," Proceedings of the 6th International Workshop on Spoken Dialog Systems (IWSDS), Busan, Korea, January 2015.
[*]
3) Hayato Maki, Tomoki Toda, Sakriani Sakti, Graham Neubig, Satoshi Nakamura, "EEG Signal Enhancement Using Multi-channel Wiener Filter with a Spatial Correlation Prior," 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2015), Brisbane, Australia, April 2015 (to appear).
4) Andros Tjandra, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, Satoshi Nakamura, "Combination of Two-dimensional Cochleogram and Spectrogram Features for Deep Learning-based ASR," 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2015), Brisbane, Australia, April 2015 (to appear). [*]
Domestic Conferences
5) [In Japanese] Takafumi Sasakura, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "Unknown Word Perception Detection Using EEG Signals during Visual Word Recognition," SIG-SLUD-B402, December 2014, pp. 57-62. [*]
6) [In Japanese] Yu Odagaki, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "On the Effect of Incongruity on Event-Related Potentials," Neuroscience 2014, Poster P3-258, September 2014. [*]
For the papers marked [*], we clearly mentioned that part of this work was supported by the Microsoft CORE 10 Project.

Other Publications
Paper (2) above, "Unknown Word Detection Based on Event-Related Brain Desynchronization Responses," has been selected for inclusion in the post-conference book publication in the Springer Lecture Notes in Electrical Engineering (LNEE) series. LNEE (http://www.springer.com/series/7818) is indexed in SCOPUS.