Multimodal Dialog System Using Hidden Information State Dialog Manager
Kyungduk Kim and Gary Geunbae Lee
Department of Computer Science and Engineering
Pohang University of Science & Technology (POSTECH)
San 31, Hyoja-Dong, Pohang, 790-784, Republic of Korea
+82-54-279-5581
{getta, gblee}@postech.ac.kr
ABSTRACT
This demo presents a multimodal dialog system that uses the Hidden Information State (HIS) approach to manage human-machine multimodal interaction. The HIS dialog manager is a variant of the classic Partially Observable Markov Decision Process (POMDP), which provides a statistical framework for dialog modeling. However, it has been difficult to apply POMDP to real dialog system domains, since POMDP-based dialog management requires a very large state space. With the HIS approach, the system groups belief states to reduce the size of the state space, so that the HIS dialog manager can be used in practical dialog systems. We adopted the HIS dialog manager in our smart-home control domain multimodal dialog system, which supports speech and pen gesture as input channels.
Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User
Interfaces – Graphical user interfaces (GUI), Natural language,
User-centered design
General Terms
Design, Human Factors
Keywords
Multimodal dialog system, hidden information state
1. INTRODUCTION
Multimodal dialog systems allow natural and effective human-computer interaction by providing multiple input and output channels. They are also robust to recognition errors, because users can choose the input mode that is expected to minimize errors. However, multimodal dialog systems are not immune to speech recognition errors, since speech is one of the most important input modes and carries a large portion of the user's intention. Therefore, many studies have focused on robust dialog management and have considered the Partially Observable Markov Decision Process (POMDP) a suitable statistical framework, because POMDP can model the uncertainty inherent in human-machine dialog. However, applying the classic POMDP to a practical dialog system domain incurs a scalability problem.
The computational complexity of monitoring belief states and optimizing the policy increases rapidly with the size of the state space. The Hidden Information State (HIS) model [1] is a variant of the classic POMDP that addresses this complexity problem. It adopts ontology rules as a task model of the dialog scenario and represents a group of similar belief states as a partition. We applied the HIS dialog manager to our home network domain multimodal dialog system.
2. SYSTEM DESCRIPTION
With our multimodal dialog system, users can interact with an intelligent application by using speech and pen gestures. Fig. 1 shows the overall structure of the system. It comprises three modules: a speech recognizer, a multimodal understanding module, and the HIS dialog manager. We assume that users interact using speech and pen gestures together. First, the system recognizes an utterance from the user's speech and an x-y position on the screen from the user's pen input. Then the spoken language understanding (SLU) module maps the recognized words to semantic concepts, while the pen gesture interpreter resolves the candidate referents. As a result of this understanding process, the meaning of each uni-modal input is captured in a uni-modal interpretation frame. Using these meanings and contextual information, the multimodal fuser fills a frame that captures the overall meaning of the multimodal inputs. This fused frame is passed to the HIS dialog manager, which responds to the user's request and generates system actions. The HIS dialog manager maintains and updates the partition beliefs at each turn and searches for the optimal system action given the current partition beliefs.
Figure 1. Overall structure of the system
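As an illustration, the fusion step described above might combine a speech frame and a gesture frame roughly as in the following sketch. All slot and frame field names here are hypothetical stand-ins, not the system's actual frame format:

```python
# Minimal sketch of multimodal fusion: the SLU frame carries the action and
# an unresolved referent ("this"), while the pen gesture interpreter supplies
# scored candidate referents for the touched screen position. Slot names are
# illustrative, not the paper's actual frame format.

def fuse(slu_frame, gesture_frame):
    """Fill an unresolved referent in the SLU frame with the gesture's
    top-ranked candidate referent."""
    fused = dict(slu_frame)
    if fused.get("object") in (None, "this", "that") and gesture_frame["candidates"]:
        # Pick the highest-scoring candidate referent from the pen gesture.
        fused["object"] = max(gesture_frame["candidates"],
                              key=lambda c: c["score"])["name"]
    return fused

slu = {"act": "turn_on", "object": "this"}          # from "turn this on"
gesture = {"x": 120, "y": 80,
           "candidates": [{"name": "lamp", "score": 0.9},
                          {"name": "tv", "score": 0.4}]}
print(fuse(slu, gesture))   # {'act': 'turn_on', 'object': 'lamp'}
```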
2.1 Speech Recognizer
The Korean speech recognizer in our multimodal dialog system was developed with the Hidden Markov Model Toolkit (HTK) [2]. HTK is a well-known large-vocabulary, speaker-independent continuous speech recognition toolkit in which the acoustic model is a shared-state hidden Markov model. We used a bi-gram language model to create a word lattice.
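For illustration, a bi-gram language model scores a word sequence as a product of conditional word-pair probabilities. The sketch below uses maximum-likelihood estimates over a toy corpus, not the recognizer's actual model:

```python
from collections import Counter

# Toy bi-gram language model: P(w_i | w_{i-1}) estimated by maximum
# likelihood from corpus counts. The corpus is illustrative only.
corpus = ["<s> turn on the light </s>",
          "<s> turn off the light </s>",
          "<s> close the curtain </s>"]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words[:-1])          # history words only
    bigrams.update(zip(words, words[1:]))

def bigram_prob(sentence):
    """Maximum-likelihood bi-gram probability of a sentence."""
    words = sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(bigram_prob("<s> turn on the light </s>"))   # 2/3 * 1/2 * 1 * 2/3 * 1
```

In a real recognizer these probabilities are combined with acoustic scores over the word lattice, and smoothing is applied to unseen bi-grams.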
2.2 Multimodal Understanding Module
The multimodal understanding module interprets each uni-modal input and fuses their meanings. It consists of three components: the SLU module, the pen gesture interpreter, and the multimodal fuser. The pen gesture interpreter is a simple geometry-based gesture recognition and understanding module. The SLU module transforms the recognized utterances into a frame that the dialog manager can understand. We employed concept-spotting SLU and implemented this module with a conditional random fields (CRF) model [3].
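The concept-spotting idea can be sketched as follows: the CRF tags each word with a concept label in BIO style, and the tagged spans are collected into slot-value pairs. The utterance, tags, and slot names below are made-up examples, not the system's actual concept set:

```python
def spots_to_frame(tokens, tags):
    """Collect BIO-tagged concept spans into a slot-value frame.
    'B-x' opens a concept span, 'I-x' continues it, 'O' is outside."""
    frame, slot, words = {}, None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if slot:                          # close the previous span
                frame[slot] = " ".join(words)
            slot, words = tag[2:], [tok]
        elif tag.startswith("I-") and slot == tag[2:]:
            words.append(tok)
        else:                                 # 'O' closes any open span
            if slot:
                frame[slot] = " ".join(words)
            slot, words = None, []
    if slot:                                  # flush the final span
        frame[slot] = " ".join(words)
    return frame

# "turn on the living room light", tagged as a trained CRF might tag it
tokens = ["turn", "on", "the", "living", "room", "light"]
tags = ["B-action", "I-action", "O", "B-object", "I-object", "I-object"]
print(spots_to_frame(tokens, tags))
# {'action': 'turn on', 'object': 'living room light'}
```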
2.3 Hidden Information State Dialog Manager
To solve the complexity problem in POMDP, the HIS dialog manager groups equivalent states into a partition and computes partition beliefs. At the start of the dialog, all states belong to the root partition p0. As the dialog progresses, this root partition is split into smaller partitions by applying the ontology rules, each split occurring with a probability P(p'|p) specified in the rules. The splitting process is driven by the items in the user input and the system action. Because the system updates the beliefs of the partitions rather than the whole set of belief states, the computational complexity remains relatively small.
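The partition mechanism above can be sketched as follows: splitting a partition p into a child p' divides p's belief mass according to P(p'|p), so the total mass is conserved while only a handful of partitions ever need updating. The partition names and rule probabilities below are invented for illustration:

```python
# Sketch of HIS-style partition splitting: all belief mass starts in the
# root partition p0; applying an ontology rule splits a partition into a
# child with probability P(p'|p), dividing the parent's belief accordingly.
# Partition names and probabilities are illustrative, not from the paper.

def split(partitions, parent, child, p_child_given_parent):
    """Split `parent` into `child` and a remainder, reallocating belief."""
    b = partitions.pop(parent)
    partitions[child] = b * p_child_given_parent
    partitions[parent + "-rest"] = b * (1.0 - p_child_given_parent)
    return partitions

beliefs = {"p0": 1.0}                                   # root partition
beliefs = split(beliefs, "p0", "device=light", 0.6)     # user input item
beliefs = split(beliefs, "device=light", "action=turn_on", 0.5)
print(beliefs)
print(sum(beliefs.values()))   # belief mass is conserved (sums to 1.0)
```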
The dialog manager then generates optimized system action hypotheses for each partition. In this optimization process, the computational complexity problem arises again, because there are too many system actions for which to compute the expected reward. To reduce this complexity, one can either 1) compress the states and actions or 2) map the original states and actions into a summarized space [4]. However, these algorithms tend to approximate the state space excessively, so we adopted a heuristic utility function rather than computing the expected reward over an infinite horizon. The dialog manager computes a utility value for each system action hypothesis and selects the hypothesis with the highest utility as the final system action.
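The selection step amounts to an argmax over hypotheses, as in the sketch below. The utility terms (a partition-belief term and a hand-tuned penalty for re-asking the user) are hypothetical stand-ins for the paper's heuristic function:

```python
# Sketch of heuristic action selection: score each (partition, action)
# hypothesis with a hand-crafted utility instead of an expected reward over
# an infinite horizon. The utility terms are hypothetical stand-ins for the
# paper's actual heuristic function.

def utility(hyp):
    """Favor actions backed by high partition belief; apply a small
    hand-tuned cost to asking the user again."""
    ask_penalty = 0.2 if hyp["action"].startswith("ask") else 0.0
    return hyp["belief"] - ask_penalty

hypotheses = [
    {"action": "execute(turn_on, lamp)", "belief": 0.7},
    {"action": "ask(which_device)",      "belief": 0.8},
    {"action": "execute(turn_on, tv)",   "belief": 0.2},
]

best = max(hypotheses, key=utility)   # argmax over all hypotheses
print(best["action"])                 # execute(turn_on, lamp)
```

Note that the ask-hypothesis has the highest belief but loses after the penalty, which is exactly the kind of trade-off a hand-crafted utility encodes.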
2.4 Application Environment
We developed a GUI simulation tool for the smart-home control task of the multimodal dialog system. Using the graphic simulator, users can test the system, which performs smart-home control functions such as "turning on the light," "closing the curtain," and "locking the door." It also simulates an intelligent robot that users can order to run errands such as "moving the cup" and "cleaning the room." Fig. 2 shows the GUI simulator for smart-home control.
Figure 2. GUI Simulator for smart-home control
3. CONCLUSION AND FUTURE WORK
We developed a smart-home control domain multimodal dialog system that supports the user's speech and pen gestures as input channels. We adopted the HIS approach to dialog management, so the system is not only robust to speech recognition errors but also able to mitigate the complexity problem.
We plan to continue this research by addressing the weak points of the HIS approach. With MDP- or POMDP-based dialog managers, it is hard to find a proper reward function, since the dialog flow is very sensitive to it. Moreover, we cannot always guarantee that human-machine dialog is an optimization problem, and solving it as one has high computational complexity. We are therefore interested in an example-based dialog management framework [5] to address these problems. Our future work will focus on designing a dialog management method that combines the merits of the example-based approach and the HIS approach.
4. ACKNOWLEDGMENTS
This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) (IITA-2006-C1090-0603-0045).
5. REFERENCES
[1] Young, S., Schatzmann, J., Weilhammer, K. and Ye, H. The Hidden Information State Approach to Dialog Management. In Proceedings of ICASSP 2007.
[2] Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P. The HTK Book: version 3.3. Technical Report, Cambridge University, UK. http://htk.eng.cam.ac.uk, 2005.
[3] Lee, C., Eun, J., Jeong, M., Lee, G.G., Hwang, Y. and Jang, M. A multi-strategic concept-spotting approach for robust understanding of spoken Korean. ETRI Journal, Vol. 29, No. 2 (Apr. 2007), pp. 179-188.
[4] Williams, J. and Young, S. Scaling POMDPs for dialog management with composite summary point-based value iteration (CSPBVI). In Proceedings of the AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems, 2006.
[5] Lee, C., Jung, S., Eun, J., Jeong, M. and Lee, G.G. A Situation-based Dialogue Management using Dialogue Examples. In Proceedings of ICASSP 2006.