Dialog State Tracking and Action Selection Using Deep Learning Mechanism for
Interview Coaching
Ming-Hsiang Su, Kun-Yi Huang, Tsung-Hsien Yang, Kuan-Jung Lai and Chung-Hsien Wu
Department of Computer Science and Information Engineering
National Cheng Kung University
Tainan, Taiwan
{huntfox.su, iamkyh77, yasamyang, vicvicvic47, chunghsienwu}@gmail.com
Abstract—The best way to prepare for an interview is to review the different types of possible interview questions you may be asked during an interview and to practice responding to them. An interview coaching system tries to simulate an interviewer to provide mock interview practice sessions for users. Traditional interview coaching systems provide feedback, including facial preference, head nodding, response time, speaking rate, and volume, to let users know their own performance in the mock interview. However, most of these systems are trained with insufficient dialog data and provide only pre-designed interview questions. In this study, we propose an approach to dialog state tracking and action selection based on deep learning methods. First, the interview corpus in this study is collected from 12 participants and is annotated with dialog states and actions. Next, a long short-term memory (LSTM) network and an artificial neural network (ANN) are employed to predict dialog states, and deep reinforcement learning (RL) is adopted to learn the relation between dialog states and actions. Finally, the selected action is used to generate the interview question for interview practice. To evaluate the proposed method in action selection, an interview coaching system is constructed. Experimental results show the effectiveness of the proposed method for dialog state tracking and action selection.

Keywords- interview coaching system; deep reinforcement learning; LSTM

I. INTRODUCTION

Recently, college graduates often have the opportunity to participate in interviews when they try to pursue further studies or find a job. The best way to master all possible questions in an interview is to know what kinds of questions may be asked and to practice responding to them. In general, college students rarely have the opportunity to practice interviews during school. In order to increase opportunities for people to practice social skills, such as admission and job interviews, many scholars have engaged in the design and development of social skill training systems [1-2]. Hoque et al. [1] presented a novel technology, My Automated Conversation coach (MACH), that includes behavior analysis, conversational analysis, and data visualization to help people improve their conversational skills. In order to design and develop an effective social skill training system, several functions, including social skill training, real-time feedback, information visualization, and intelligent virtual agents, should be integrated. Most current studies focused on real-time/summary feedback during social interactions by considering verbal/nonverbal sensing, such as smiling, speaking rate, and head nodding. The above systems, however, lacked the ability to provide user response analysis, dialog state tracking, and action selection.

Dialog state tracking (DST) is one of the key sub-tasks of dialog management, which updates the dialog state at each moment of a dialog conversation [3]. Typical examples of proposed approaches include handcrafted rule-based methods [4], Bayesian networks [5], discriminative models [6], and LSTM neural networks [7]. To provide a common test bed for the DST task, a series of Dialog State Tracking Challenges (DSTC1~4) has been successfully organized and completed in the past years. This challenge series has spurred significant work on dialog state tracking, yielding numerous new techniques as well as a standard set of evaluation metrics [3].

In action selection research [8-12], three classes of methods have been explored in the literature: hand-crafting [10], supervised learning [11], and reinforcement learning. Scholars have adopted the reinforcement learning (RL) [8] model to select actions because RL offers the possibility of treating dialogue design as an optimization problem and can improve performance over time with experience.

In this study, we apply the deep RL method to the interview coaching system to select the appropriate action. For this purpose, we first track the interviewee's dialog state and accordingly select the best action to respond to the interviewer's question. Figure 1 shows the block diagram of the proposed approach to dialog state tracking and action selection. In the training phase, the interview dialog corpus, collected by the MHMC Lab, Taiwan, is used as the interview conversation corpus for training and evaluation. A word embedding technique is first employed to obtain distributed word representations using the word2vec model [13]. Then, each answer vector is used to train the state tracking model using the LSTM [14]. Finally, the deep RL-based action selection model is constructed by considering the dialog states.

In the test phase, when a new answer from the interviewee enters the coaching system, the answer vector is generated using the word2vec model. Then, the dialog state is obtained by feeding the answer vector to the LSTM-based state tracking model. The action is then obtained by feeding the dialog state to the deep RL-based action selection model. Finally, we consider the interviewer question generation templates, constructed from dialog states and actions, to generate the interview question.

The rest of this paper is organized as follows. Section II presents the state-of-the-art approaches. Next, the data collection and annotation are described in Section III. Section IV presents the proposed method based on the LSTM,
ANN, and deep RL models. Section V presents the experimental setup and results. Finally, conclusions and future work are described in Section VI.

Figure 1. Schematic diagram of the proposed method for action selection and interview question generation.
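As an illustration of the test-phase flow in Figure 1, the following minimal Python sketch chains the three stages (answer vectorization, state tracking, action selection) into question generation; the function interfaces, the averaging of word vectors, and the template lookup are assumptions for illustration, not the authors' implementation.

# Minimal sketch of the test-phase flow in Figure 1 (assumed interfaces, not
# the authors' code): answer -> word2vec vector -> LSTM state tracker ->
# deep-RL action selector -> interview question template.
import numpy as np

def answer_to_vector(answer, w2v, dim=30):
    """Average the word2vec vectors of the answer's words (one simple choice).
    `w2v` is any mapping from word to vector, e.g. a loaded word2vec model."""
    vecs = [w2v[w] for w in answer.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def next_question(answer, w2v, state_tracker, q_network, templates):
    x = answer_to_vector(answer, w2v)                        # word embedding
    slot_probs = state_tracker.predict(x[None, None, :])[0]  # LSTM-based DST
    state = (np.asarray(slot_probs) > 0.5).astype(int)       # 10 binary slots
    q_values = q_network.predict(state[None, :])[0]          # deep RL policy
    return templates[int(np.argmax(q_values))]               # question template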
II. STATE-OF-THE-ART APPROACHES
In a spoken dialog system, the task of DST is to detect
the current dialog state to indicate the user’s intention. The
DST is crucial because the dialog policy needs the DST to
detect the correct dialog state in order to choose an
appropriate action. In the recent literature, researchers have proposed numerous methods for DST, which fall into three common classes: hand-crafted rules, generative models, and discriminative models [15]. Sun et al. [16] proposed a new general framework, referred to as Markov Bayesian Polynomial (MBP), to formulate rule-based
models. They showed that the proposed approach not only
has competitive performance, but also has relatively good
generalization ability. Serban et al. [17] investigated the
task of building an open domain, conversational dialog
system based on large dialog corpora using generative
models. They defined the generative dialog problem as
modeling the utterances and interactive structure of the
dialog. Finally, they found their proposed method
outperformed both n-gram-based models and baseline
neural networks. Zilka et al. [18] described a
discriminative dialog state tracker and a generative dialog
state tracker; both were evaluated in the DSTC. Their
results show that the discriminative model consistently
outperformed the baseline tracker and the discriminative
model achieved comparable or better results than the
generative model.
III. DATA COLLECTION AND ANNOTATION
In order to implement an interview coaching dialog
system, we invited 12 participants to record the interview
corpus. During corpus collection, two participants, one
serving as the interviewer and the other as the interviewee,
had the freedom to complete the interview without
predesigning the questions and answers. In this study, we
used only manually transcribed texts of the interview
dialogs. Finally, a total of 109 dialogs with 394 question
and answer (QA) pairs were collected. Table I lists the
detailed information of the corpus.
For the content of the answers in this study, six categories and their ten corresponding semantic slots, illustrated in Table II, are defined for the implementation of the interview coaching system.
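For illustration, each dialog state can be viewed as a binary vector over the ten semantic slots in Table II. The short sketch below shows one possible encoding; the slot identifiers and the example are hypothetical and do not reproduce the corpus annotation format.

# Illustrative encoding of a dialog state over the ten semantic slots of
# Table II (slot order and the example values are assumptions, not the
# authors' annotation format).
SLOTS = [
    "association_and_cadre_leadership", "performance_and_achievement",
    "leisure_interest", "pros_and_cons", "motivation", "study_and_future_plan",
    "curriculum_areas", "programming_language_and_professional_skills",
    "personality_trait", "others",
]

def encode_state(present_slots):
    """Map the set of slots mentioned in an answer to a 10-dim 0/1 vector."""
    return [1 if s in present_slots else 0 for s in SLOTS]

# e.g., an answer mentioning leisure interests and future study plans
print(encode_state({"leisure_interest", "study_and_future_plan"}))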
IV. PROPOSED MODEL
A. Dialog state tracking
In this study, the dialog state at time t is defined as a vector $s_t \in SL_1 \times \cdots \times SL_K$ of K semantic slots (K = 10 in this study). Each semantic slot $sl_k \in SL_k = \{v_0, v_1\}$ takes either the value $v_0$ or $v_1$, where $sl_k = v_0$ represents the absence of the k-th semantic slot and $sl_k = v_1$ denotes its presence. Our dialog state tracker estimates a probability distribution over $s_t$ factorized by the semantic slots:

$$P(s_t \mid w_1, \ldots, w_t) = \prod_k p(sl_k \mid w_1, \ldots, w_t; \Omega) \qquad (1)$$

where the detection models $p(sl_k \mid \cdot)$ share a substantial portion of their parameters. Although the detection models are factorized and assumed independent, they are optimized to minimize a joint objective function and thus model the dependence between the semantic slots.
Based on the LecTrack dialog state tracking model proposed in [19], this study predicts dialog states using an LSTM together with an ANN.
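A minimal sketch of such a factorized tracker is given below, using the Keras API the authors mention later: a shared LSTM encodes the sequence of word vectors and one sigmoid output per slot approximates $p(sl_k \mid w_1, \ldots, w_t; \Omega)$ in Eq. (1). The layer sizes follow Section V.A, but the exact architecture (e.g., using Dense heads rather than a separate ANN) is an assumption.

# Sketch of a LecTrack-style factorized tracker (assumed configuration): a
# shared LSTM encodes the answer word vectors and K = 10 sigmoid heads model
# p(sl_k | w_1..w_t), one per semantic slot, as in Eq. (1).
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

WORD_DIM, HIDDEN, K_SLOTS = 30, 32, 10     # dimensions taken from Section V.A

words = Input(shape=(None, WORD_DIM))      # sequence of word2vec vectors
encoded = LSTM(HIDDEN)(words)              # shared parameters (Omega)
slot_probs = [Dense(1, activation="sigmoid", name=f"slot_{k}")(encoded)
              for k in range(K_SLOTS)]     # factorized per-slot detectors

tracker = Model(inputs=words, outputs=slot_probs)
tracker.compile(optimizer="adam",
                loss=["binary_crossentropy"] * K_SLOTS)  # joint objective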
B. Deep RL
RL is a learning model concerned with learning to
control a system so as to maximize the long-term reward
signal [20]. Figure 2 shows the architecture of the deep RL
model which consists of a fully-connected multilayer
neural network. At the bottom, the environment receives an action from the deep RL model and outputs the next dialog state, the next action, and a reward. To do this, we first look up the most probable dialog state and action in the interview dialog corpus given that the interview dialog system takes this action, and then calculate the reward using the reward function. At the top of the diagram, the deep RL model
receives the dialog state and reward, updates its policy
during learning, and outputs an action according to its
learnt policy.
In this study, we implement deep Q-learning with experience replay (ER) [21], where the experiences $e_t = \{s_t, a_t, r_t, s_{t+1}\}$, collected over many episodes, are stored in a data set $D = \{e_1, \ldots, e_T\}$ called the replay memory. In deep Q-learning, the reward function at time t is defined as

$$R_t(s_t, a_t, s_{t+1}) = \{r_t^1, r_t^2, \ldots, r_t^N\} \qquad (2)$$
TABLE I. CORPUS INFORMATION

                                              Total
Number of subjects                            12
Number of dialogs                             109
Number of QA pairs                            394
Avg. number of sentences in each answer       3.95
TABLE II. THE DESCRIPTION OF CATEGORIES AND SLOTS DEFINED BASED ON THE INTERVIEW CORPUS

Category                                     Semantic Slot
Studying experiences                         Association and cadre leadership;
                                             Performance and achievement
Interests, strengths and weaknesses          Leisure interest; Pros and cons
Learning motivation and future prospects     Motivation; Study and future plan
Domain knowledge                             Curriculum areas; Programming
                                             language and professional skills
Personality trait                            Personality trait
Others                                       Others
where $r_t^i$ denotes the reward of the i-th action at time t. To avoid repeatedly selecting the same action more than twice, the reward function of the i-th action at time t, $r_t^i$, is defined as follows:

$$r_t^i = \begin{cases} 1, & \text{if } N_{a_t^i} = 0 \text{ or } 1 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

In order to reflect a real interview situation, we define $N_{a_t^i}$ as the number of times the i-th action has been selected and confirmed by time t. According to this equation, if the action has already been selected twice or more by time t, the reward is set to 0 (i.e., no reward) to avoid asking the same type of question repeatedly. On the contrary, if the action has been selected fewer than two times by time t, the reward is set to 1, which means the action has a higher priority to be selected.
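Read concretely, Eq. (3) can be implemented as a short reward function; the bookkeeping of how often each action has been selected is an assumption about the implementation, not code from the paper.

# Reward rule of Eq. (3) (a sketch): an action earns reward 1 only if it has
# been selected at most once so far, and 0 otherwise, which discourages
# asking the same type of question more than twice.
from collections import Counter

def reward(action_counts: Counter, action: int) -> int:
    """action_counts[action] plays the role of N_{a_t^i} in Eq. (3)."""
    return 1 if action_counts[action] <= 1 else 0

counts = Counter()
counts[3] += 1            # the 3rd action has been selected once
print(reward(counts, 3))  # 1: selecting it a second time is still rewarded
counts[3] += 1            # selected a second time
print(reward(counts, 3))  # 0: no reward for a third selection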
In Figure 2, the deep RL model consists of up to 28
nodes in the input layer, 100 nodes in the first hidden layer,
100 nodes in the second hidden layer, and 8 nodes (actions)
in the output layer. The learning parameters are as follows.
Experience replay size is 10000, discount factor is 0.5,
minimum epsilon is 0.01, and batch size is 64.
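The architecture and learning parameters quoted above can be written down directly. The sketch below uses the tf.keras API for readability, whereas the paper's implementation is based on a modified SimpleDS [8]; the hidden-layer activation and optimizer are assumptions.

# Q-network with the sizes stated above (28 -> 100 -> 100 -> 8) and the quoted
# learning parameters; a Keras sketch, not the modified SimpleDS implementation.
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

STATE_DIM, N_ACTIONS = 28, 8
REPLAY_SIZE, GAMMA, MIN_EPSILON, BATCH_SIZE = 10000, 0.5, 0.01, 64

state_in = Input(shape=(STATE_DIM,))
h = Dense(100, activation="relu")(state_in)          # first hidden layer
h = Dense(100, activation="relu")(h)                 # second hidden layer
q_values = Dense(N_ACTIONS, activation="linear")(h)  # one Q-value per action

q_network = Model(state_in, q_values)
q_network.compile(optimizer="adam", loss="mse")      # squared TD error, cf. Eq. (5)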
Figure 2. The architecture of the deep RL model.

The RL algorithm is characterized by a set of states $S = \{s_t^1, s_t^2, \ldots, s_t^M\}$, a set of actions $A = \{a_t^1, a_t^2, \ldots, a_t^N\}$, a state transition function $F(s_t, a_t, s_{t+1})$, a reward function $R(s_t, a_t, s_{t+1})$, and a policy $\pi: S \rightarrow A$ that defines a mapping from states to actions, where M is the number of states and N is the number of actions.
The goal of an RL agent is to select actions that maximize its cumulative discounted reward:

$$Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t \right] \qquad (4)$$

where $Q^*$ represents the maximum sum of rewards $r_t$ discounted by the factor $\gamma$ at each time step. While the deep RL method takes actions with probability $P(a_t \mid s_t)$ during training, it takes the best action $\max_{a_t} P(a_t \mid s_t)$ at test time. In practice, this approach is impractical, because the goal is estimated separately for each sequence, without any generalization [22]. A multilayer neural network serving as the function approximator with weights $\theta$ is regarded as a Q-network, and this Q-network can be trained with the loss function $L(\theta_i)$ that changes at each iteration i,

$$L(\theta_i) = \mathbb{E}_{s_{t+1}}\left[ \left( y_i - Q(s_t, a_t; \theta_i) \right)^2 \right] \qquad (5)$$

where $y_i$ is the target output of the Q-network, calculated as follows:

$$y_i = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_{i-1}) \,\middle|\, s_t, a_t \right] \qquad (6)$$
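Combining Eqs. (5) and (6) with the experience replay described earlier, one training iteration can be sketched as follows; the use of a frozen copy of the Q-network for the older parameters $\theta_{i-1}$ and the omission of terminal-state handling are simplifications, not the authors' exact procedure.

# One deep Q-learning update with experience replay (sketch of Eqs. (5)-(6)):
# sample a minibatch of experiences e_t = (s, a, r, s') from the replay memory,
# build targets y_i with the older parameters theta_{i-1} (a frozen copy of the
# network), and regress Q(s, a; theta_i) toward y_i with a squared loss.
import random
import numpy as np

def dqn_update(q_network, target_network, replay_memory,
               batch_size=64, gamma=0.5):
    batch = random.sample(replay_memory, batch_size)   # replay_memory: a list
    states = np.array([e[0] for e in batch])
    next_states = np.array([e[3] for e in batch])

    q = q_network.predict(states, verbose=0)                  # Q(s, a; theta_i)
    q_next = target_network.predict(next_states, verbose=0)   # uses theta_{i-1}

    for row, (s, a, r, s_next) in enumerate(batch):
        q[row, a] = r + gamma * np.max(q_next[row])    # target y_i, Eq. (6)

    q_network.fit(states, q, verbose=0)                # minimize Eq. (5)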
V. EXPERIMENTAL SETUP AND RESULTS
A. Experimental setup
For the construction of the dialog state tracking model, the LSTM network was built with the Keras framework, and Scikit-learn 0.18 was used to construct the ANN model. Scikit-learn 0.18 is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. In addition, parameter tuning was conducted to obtain the best performance of the LSTM+ANN model. According to the results, the word vector dimension was set to 30 and the number of hidden nodes in the LSTM was set to 32. For the deep RL-based action
prediction model, the SimpleDS [8] was adopted and
modified to construct the deep RL-based action selection
model.
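To make the setup concrete, the sketch below pairs a Keras LSTM encoder (30-dimensional word vectors, 32 hidden nodes, as stated above) with a scikit-learn MLPClassifier acting as the ANN on top of the LSTM hidden vectors; the random toy data, the 1024 hidden nodes (the best setting in Table III), and the remaining MLP options are assumptions.

# Sketch of the LSTM+ANN setup described above: a Keras LSTM turns each answer
# (a sequence of 30-dim word vectors) into a 32-dim hidden vector, and a
# scikit-learn MLPClassifier maps that vector to the ten binary slot labels.
# The toy data and the MLP settings are placeholders, not the authors' corpus.
import numpy as np
from sklearn.neural_network import MLPClassifier
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

WORD_DIM, HIDDEN, K_SLOTS = 30, 32, 10

inp = Input(shape=(None, WORD_DIM))
encoder = Model(inp, LSTM(HIDDEN)(inp))            # LSTM hidden-vector extractor

X = np.random.rand(394, 40, WORD_DIM).astype("float32")    # 394 answers (toy)
Y = np.random.randint(0, 2, size=(394, K_SLOTS))            # 10 binary slots (toy)

H = encoder.predict(X, verbose=0)                  # LSTM hidden vectors
ann = MLPClassifier(hidden_layer_sizes=(1024,), max_iter=300)
ann.fit(H, Y)                                      # multi-label slot detection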
B. Results
We evaluated the performance of dialog state detection
on the LSTM+ANN-based model. The experimental setup
was based on the 5-fold cross-validation method. In 5-fold
cross-validation, the 394 answers were randomly
partitioned into 5 subsets of equal size, in which the answers from the same subject were grouped together into one subset for subject-independent evaluation. Of the 5 subsets, one subset was retained as the validation data for testing, and the remaining 4 subsets were used for training. Table III shows the average detection hit ratio and F-measure for each dialog state. In this experiment, the LSTM hidden vectors were used as the ANN input at time t. In the experimental results, the accuracy of each slot was considered rather than the average accuracy over all slots. The best slot accuracy was 77.96%, and the best F-measure was 0.147.
Figure 3 illustrates the learning process of the deep RL model. The pre-training Q-network means that the corpus data were used to pre-train the Q-network, while the no pre-training Q-network means that the weights of the Q-network were randomly initialized. The result in this figure shows that the average predicted T using deep RL with a pre-trained Q-network increases more smoothly than that without a pre-trained Q-network.
VI. CONCLUSIONS AND FUTURE WORK
In this study, an interview coaching system is
proposed and constructed for dialog state tracking and
action selection. LSTM and ANN are employed to predict
dialog states and the deep RL is used to learn the relation
between dialog states and actions. Finally, the predicted
dialog state and action pair are used to generate an
interview question. For performance evaluation of the proposed method in dialog state tracking and action
selection, an interview coaching system was constructed,
and an encouraging result was obtained for dialog state
tracking, action selection and interview question
generation.
Several issues still require further investigation. First, a richer interview corpus is required for the DST system to handle more interview situations.
Second, a robust Q learning model is beneficial for policy
control so as to construct a useful interview coaching
system.
TABLE III. EXPERIMENTAL RESULTS FOR DST

ANN Hidden Nodes    Accuracy    F-measure
8                   28.93%      0.147
32                  43.25%      0.126
128                 39.09%      0.138
256                 57.53%      0.073
512                 72.76%      0.041
1024                77.96%      0.037
2048                46.37%      0.123
Figure 3. Learning curve of the deep RL with and without pre-training of the Q-network.
REFERENCES

[1] M. E. Hoque, M. Courgeon, J. C. Martin, B. Mutlu and R. W. Picard, "MACH: My automated conversation coach," in Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2013, pp. 697-706.
[2] M. R. Ali, D. Crasta, L. Jin, A. Baretto, J. Pachter, R. D. Rogge and M. E. Hoque, "LISSA - Live Interactive Social Skill Assistance," in Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), 2015, pp. 173-179.
[3] S. Kim, L. F. D'Haro, R. E. Banchs, J. D. Williams, M. Henderson and J. Williams, "The fourth dialog state tracking challenge," in Proceedings of the 7th International Workshop on Spoken Dialog Systems (IWSDS), 2016, pp. 1-14.
[4] D. Bohus and A. I. Rudnicky, "RavenClaw: Dialog management using hierarchical task decomposition and an expectation agenda," in Proceedings of the 8th European Conference on Speech Communication and Technology, 2003.
[5] J. D. Williams and S. Young, "Partially observable Markov decision processes for spoken dialog systems," Computer Speech and Language, vol. 21, no. 2, pp. 393-422, 2007.
[6] D. Bohus and A. Rudnicky, "A 'K hypotheses + other' belief updating model," in Proceedings of the AAAI Workshop on Statistical and Empirical Methods in Spoken Dialogue Systems, vol. 62, 2006.
[7] K. Yoshino, T. Hiraoka, G. Neubig and S. Nakamura, "Dialog state tracking using long short term memory neural networks," in Proceedings of the Seventh International Workshop on Spoken Dialog Systems (IWSDS), 2016, pp. 1-8.
[8] H. Cuayáhuitl, "SimpleDS: A simple deep reinforcement learning dialogue system," arXiv preprint arXiv:1601.04574, 2016.
[9] S. Larsson and D. R. Traum, "Information state and dialogue management in the TRINDI dialogue move engine toolkit," Natural Language Engineering, vol. 6, no. 3-4, pp. 323-340, 2000.
[10] T. Hiraoka, G. Neubig, K. Yoshino, T. Toda, and S. Nakamura, "Active learning for example-based dialog systems," in Proceedings of the International Workshop on Spoken Dialog Systems (IWSDS), Saariselkä, Finland, 2016.
[11] C. P. Chen, C. H. Wu, and W. B. Liang, "Robust dialogue act detection based on partial sentence tree, derivation rule, and spectral clustering algorithm," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2012, no. 1, pp. 1-9, 2012.
[12] C. H. Wu and G. L. Yan, "Speech act modeling and verification of spontaneous speech with disfluency in a spoken dialogue system," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 3, pp. 330-344, 2005.
[13] T. Mikolov, word2vec, https://code.google.com/p/word2vec/, accessed 2016-03-20.
[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[15] J. Williams, A. Raux and M. Henderson, "The dialog state tracking challenge series: A review," Dialogue and Discourse, vol. 7, no. 3, pp. 4-33, 2016.
[16] K. Sun, L. Chen, S. Zhu and K. Yu, "A generalized rule based tracker for dialog state tracking," in Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 330-335.
[17] I. V. Serban, A. Sordoni, Y. Bengio, A. Courville and J. Pineau, "Building end-to-end dialog systems using generative hierarchical neural network models," in Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16), 2016, pp. 1-8.
[18] L. Zilka, D. Marek, M. Korvas and F. Jurcicek, "Comparison of Bayesian discriminative and generative models for dialog state tracking," in Proceedings of the 14th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), 2013, pp. 452-456.
[19] L. Žilka and F. Jurčíček, "LecTrack: Incremental dialog state tracking with long short-term memory networks," in Proceedings of the International Conference on Text, Speech, and Dialogue, 2015, pp. 174-182.
[20] C. Szepesvári, "Algorithms for reinforcement learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1-103, 2010.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[22] K. Chen, Y. Zhou and F. Dai, "A LSTM-based method for stock returns prediction: A case study of China stock market," in Proceedings of the 2015 IEEE International Conference on Big Data (Big Data), 2015, pp. 2823-2824.