
Analyzing and Modeling Human
Cognitive Approaches for Spoken
Language Understanding
Sakriani Sakti
Nara Institute of Science and Technology (NAIST)
[email protected]
http://isw3.naist.jp/~ssakti/project.html
1. Project Goal
Today, realizing human-machine interfaces via speech is becoming increasingly important, and one of the fundamental technologies is automatic speech recognition and understanding. In most cases, the quality of speech recognition is evaluated solely by aiming for perfect transcription: systems are trained to minimize the word error rate, all words, including function words and fillers, are treated uniformly, and all errors are considered equally deleterious. However, despite rapid progress, performance still falls far below human performance. On the other hand, the essence of communication in human-human interaction is transmitting meaningful messages and having those messages received exactly as the speaker intended, even when not all words are correctly recognized. Conversely, even when all words are correctly recognized, communication failure may still occur because some words are unknown to the listener.
As the cognition process is related to human memory, existing studies have shown that incongruity can be detected from brain signals even before the listener reacts vocally. In this study, we take a step forward and study how the receiver's brain perceives the system output based on event-related brain potentials (ERPs). Specifically, as shown in Fig. 1, we attempt to:
(1) Analyze how the human brain processes the impact of communication failure, including (a) when speech recognition errors occur, and (b) when there are no recognition errors but misunderstanding or unknown words occur.
(2) Model a classifier based on ERP wave patterns for automatic detection of communication failure.
The aim of this project is to gain insight into how human cognition processes messages during communication and, hopefully, to improve the capability of speech recognition and understanding systems with high-level concepts of human language understanding.
Figure 1. Analyzing how the human brain processes the impact of communication failure
IJARC CORE10 project summary booklet
2. Technical breakthrough
EEG is an electrophysiological measurement that records the electrical signals generated by the brain through electrodes placed at different points on the scalp, and ERPs are signal-averaged EEG epochs that are time-locked to the presentation of an external event. EEG/ERP can image brain activity online (i.e., immediately at the time of stimulus processing) with high temporal resolution in the millisecond range, reflecting rapidly occurring cognitive processes. Over the last two decades, at least two well-known language-related ERP signatures have been identified and analyzed: the N400 for semantically incongruent words and the P600/SPS for syntactic anomalies.
Here, we presented the system output as visual stimuli (see Figure 2(a) and (b)). We recorded EEG from 29 scalp sites using a BrainAmp amplifier (Brain Products GmbH). The ground electrodes were placed on both earlobes and the reference electrode at the apex of the nose. To improve the signal-to-noise ratio, the impedance of each electrode was reduced to less than 5 kΩ using conductive paste. EEG data were recorded at a sampling frequency of 1,000 Hz and low-pass filtered below 40 Hz to remove high-frequency components such as muscle artifacts. From the 1,024 ms interval following the presentation of the visual stimuli, the EEG signals to be analyzed were extracted into successive 256 ms (256-point) time segments (windows, or epochs) with 50% overlap as the target data. A Hamming window was applied to each time segment to attenuate spectral leakage, and the power density of the spectral components was calculated with the fast Fourier transform. Furthermore, to calculate the power change, or event-related desynchronization (ERD), the same processing was carried out on the 1,024 ms of EEG data following the presentation of the warning stimuli, and the mean power of each frequency band was taken as the reference for ERD. The ERD value was calculated with the simple equation: ERD = ((reference power) - (target power)) / (reference power).
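As a concrete illustration, the segmentation, windowing, and ERD computation described above can be sketched as follows. This is a minimal single-channel NumPy sketch; the function names, the example band limits, and the array layout are illustrative assumptions, not the actual analysis code.

```python
import numpy as np

FS = 1000        # sampling frequency (Hz)
WIN = 256        # 256-point (256 ms) analysis window
HOP = WIN // 2   # 50% overlap between successive windows

def band_power(epoch, lo_hz, hi_hz):
    """Mean spectral power of a 1-D epoch in [lo_hz, hi_hz], averaged over
    successive Hamming-windowed 256-point segments with 50% overlap."""
    freqs = np.fft.rfftfreq(WIN, d=1.0 / FS)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    ham = np.hamming(WIN)
    powers = []
    for start in range(0, len(epoch) - WIN + 1, HOP):
        seg = epoch[start:start + WIN] * ham   # attenuate leakage
        spec = np.abs(np.fft.rfft(seg)) ** 2   # FFT power of the segment
        powers.append(spec[band].mean())
    return float(np.mean(powers))

def erd(reference_epoch, target_epoch, lo_hz, hi_hz):
    """ERD = ((reference power) - (target power)) / (reference power)."""
    ref = band_power(reference_epoch, lo_hz, hi_hz)
    tgt = band_power(target_epoch, lo_hz, hi_hz)
    return (ref - tgt) / ref
```

A positive ERD value indicates that band power decreased (desynchronized) in the target interval relative to the reference interval.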
The ERP results on speech recognition errors reveal that a positive shift (the P600 ERP component) appeared around 600 ms after the error words were presented. The amplitudes of the positive shift after substitution and deletion violations were much larger than after insertion violations (shown in Fig. 2(c)). For the case of unknown-word perception (shown in Fig. 3), the amplitude of the P600 significantly increased at the time of known-word perception, and the N400 component significantly increased at the time of unknown-word perception. The classifier achieved significantly better accuracy than the chance rate. Thus, it was confirmed that differences appear in the EEG signals when miscommunication factors are perceived.
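The significance test used for this comparison, a one-sided binomial test of classifier accuracy against the chance rate, can be sketched in a few lines. The trial counts below (70 correct out of 100, 50% chance rate) are hypothetical numbers for illustration, not figures from the actual experiment.

```python
from math import comb

def binom_p_at_least(k, n, p):
    """One-sided binomial test: P(X >= k successes in n trials
    when the per-trial success probability is p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical numbers: 70 of 100 epochs classified correctly,
# tested against the 50% chance rate of a two-class task.
p_value = binom_p_at_least(70, 100, 0.5)
print(p_value < 0.05)  # True: accuracy is significantly above chance
```

If the p-value falls below 0.05, the classifier's accuracy is judged significantly better than chance, as marked by ** in Fig. 3(b).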
Figure 2. (a) Examples of speech recognition errors including substitution, deletion, and insertion errors; (b) Presentation of visual
stimuli with substitution error; (c) The resulting ERP waveforms for correct and three violation conditions: substitutions, deletions,
and insertions.
Figure 3. (a) ERP waveforms for known and unknown words; (b) The accuracy of the automatic classifier for known and unknown words based on brain signals. The bar marked ** differed significantly from the chance rate (p < 0.05, binomial test).
3. Innovative Applications
One innovative application is online incongruity detection during human-machine interaction, as illustrated in Fig. 4. When miscommunication due to speech recognition errors is detected from the listener's brain signal, the system would be able to give feedback to the speech recognizer and request a corrected result; when there are no recognition errors but misunderstanding or unknown words are detected, the system would be able to give feedback to the speaker and request rephrasing with more easily understandable words. This could be done in real time, before the listener reacts vocally. Another direction is that, by gaining a deeper understanding of the human cognitive process, we could improve ASR capability with high-level concepts of human language understanding. This in turn would enable us to develop more flexible and natural speech recognition and understanding systems with respect to how they provide better outputs for human communication.
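The envisioned feedback loop amounts to a simple dispatch on the detected ERP signature. The sketch below shows the intended control flow only; the response classes and action strings are illustrative assumptions, not an implemented interface.

```python
from enum import Enum, auto

class BrainResponse(Enum):
    """Illustrative outcomes of an ERP-based incongruity detector."""
    NONE = auto()  # no incongruity detected
    P600 = auto()  # response observed here after speech recognition errors
    N400 = auto()  # response observed here after unknown-word perception

def feedback_action(response):
    """Map a detected brain response to a system action."""
    if response is BrainResponse.P600:
        # Recognition error suspected: ask the recognizer for a new result.
        return "request a corrected result from the ASR"
    if response is BrainResponse.N400:
        # Unknown word suspected: ask the speaker to rephrase.
        return "ask the speaker to rephrase with easier words"
    return "continue the dialog"
```

In a real system, the detector would run continuously on incoming EEG epochs and trigger the corresponding action before the listener reacts vocally.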
However, to develop such applications, further investigation of various aspects is necessary. For example, we still need to investigate the impact of errors on different types of words, such as content words like nouns and verbs, function words like particles, and fillers. We also need to investigate whether similar effects appear in different languages. We will therefore continue our research on human cognitive approaches for spoken language understanding.
Figure 4. An example of innovative applications, in which online incongruity detection on brain signals is applied during human-machine interaction
4. Academic Achievement
We successfully published several papers at both domestic and international conferences. Regarding the EEG/ERP research for detecting communication mismatch, the results of the ERP studies on ASR errors were published at APSIPA 2014 [paper (1) below], and the results of the ERP studies on unknown words were published at IWSDS 2015 [papers (2) and (5) below]. Paper (2) was also selected for inclusion in the post-conference book, published in the Springer Lecture Notes in Electrical Engineering (LNEE) series. The research on enhancing EEG signals was published at ICASSP 2015 [paper (3) below], and additional experiments on improving the ASR system from a biologically inspired perspective were also published at ICASSP 2015 [paper (4) below]. ICASSP is one of the top conferences in the speech community.
6. Collaboration with Microsoft Research
We are truly grateful to have had fruitful discussions with Prof. Junichi Tsuji and MSRA researchers during their visit to NAIST, with Dr. Frank Soong during Interspeech 2014 in Singapore, and with various researchers including Prof. Sadaoki Furui during last year's CORE 9 review meeting. As we have only just started our research in this area, most of this one-year project was spent on preliminary studies and experiments, so deep collaboration with MSRA has not yet been possible. However, as cognitive studies of communication are relevant to many aspects of human-machine interfaces, we hope to have the opportunity to continue and expand this study with MSRA researchers.
5. Achievement in Talent Fostering
This project involved one principal investigator and four students: (1) Mr. Yu Odagaki, a PhD student, conducts EEG/ERP research on natural language processing in general and is currently writing a journal paper; (2) Mr. Takafumi Sasakura, an MSc student, was very active in this project, especially in the EEG/ERP research for detecting communication mismatch, which is the main theme of his MSc dissertation. He worked very hard, published at both international and domestic conferences, and graduates in March 2015; (3) Mr. Hayato Maki, an MSc student, focuses on EEG signal enhancement, and his paper was accepted at ICASSP 2015. He also graduates this March and will continue to a PhD in our laboratory next year; (4) Mr. Andros Tjandra, an intern from the University of Indonesia, is a very talented student who worked specifically on the ASR side. He spent a three-month summer internship at our lab improving our ASR system, and his paper was accepted at ICASSP 2015. Although he has returned to Indonesia, our collaboration continues.
7. Project Development
Thanks to Microsoft CORE project, we have the opportunities
to do a new study and very challenging research. This
project is part of our long-term researches plan in our
laboratories, covering a study on understanding human
cognitive process, in order to support human-human and
human-machine communication. It is also related to other
ongoing project supported by the Commissioned Research
of National Institute of Information and Communications
Technology (NICT) Japan and JSPS KAKENHI Grant
Number 26870371.
8. Publications
Paper publication
International Conferences
1) Sakriani Sakti, Yu Odagaki, Takafumi Sasakura, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "An Event-Related Brain Potential Study on the Impact of Speech Recognition Errors," Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Siem Reap, Cambodia, December 2014. [*]
2) Takafumi Sasakura, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "Unknown Word Detection Based on Event-Related Brain Desynchronization Responses," Proceedings of the 6th International Workshop on Spoken Dialog Systems (IWSDS), Busan, Korea, January 2015. [*]
3) Hayato Maki, Tomoki Toda, Sakriani Sakti, Graham Neubig, Satoshi Nakamura, "EEG Signal Enhancement Using Multi-channel Wiener Filter with a Spatial Correlation Prior," 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2015), Brisbane, Australia, April 2015 (to appear).
4) Andros Tjandra, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, Satoshi Nakamura, "Combination of Two-dimensional Cochleogram and Spectrogram Features for Deep Learning-based ASR," 2015 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2015), Brisbane, Australia, April 2015 (to appear). [*]
Domestic Conferences
5) [In Japanese] Takafumi Sasakura, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "Detection of Unknown-Word Perception Using EEG Signals during Visual Word Recognition," SIG-SLUD-B402, December 2014, pp. 57-62. [*]
6) [In Japanese] Yu Odagaki, Sakriani Sakti, Graham Neubig, Tomoki Toda, Satoshi Nakamura, "On the Influence of Incongruity on Event-Related Potentials," Neuroscience, Poster P3-258, September 2014.
[*] For these papers, we clearly stated that part of this work was supported by the Microsoft CORE 10 Project.
Other Publication
Paper (2) above, entitled "Unknown Word Detection Based on Event-Related Brain Desynchronization Responses," has been selected for inclusion in the post-conference book, published in the Springer Lecture Notes in Electrical Engineering (LNEE) series. LNEE (http://www.springer.com/series/7818) is indexed in SCOPUS.