Input Recognition in Voice Control Interfaces to Three-Tier Autonomous Agents
Vladimir Kulyukin
Computer Science Department
Utah State University
Logan, UT, 84322
Adam Steele
School of Computer Science
DePaul University
Chicago, IL, 60604-2301
Abstract

Voice control interfaces are based on the assumption that the difficult problem of understanding spoken utterances can be sufficiently constrained if it is reduced to the problem of mapping voice inputs to knowledge and control structures. Thus, a spoken input is recognized if and only if it references an appropriate structure. Context-free command and control grammars are used in speech recognition to constrain voice inputs and improve recognition accuracy. We show how such grammars can be automatically constructed from the knowledge structures used in three-tier autonomous agents.

1. Introduction

The purpose of voice control interfaces (VCI's) is to map human voice inputs to the appropriate knowledge and control structures in autonomous robots or other computational devices. VCI's combine three aspects of natural language processing (NLP): speech recognition, syntax, and semantics, each of which uses different formalisms. Speech recognition has traditionally relied on Hidden Markov Models (HMM's), while syntax and semantics have made heavy use of context-free grammars (CFG's), semantic networks, and first-order predicate calculus [6]. A direct consequence of this formalism divide is that intelligent VCI's to three-tier (3T) autonomous agents operate in two completely separate phases: speech recognition and symbol interpretation. During speech recognition, voice inputs are mapped to symbols; during symbol interpretation, the symbols obtained from voice inputs are used to identify appropriate knowledge structures [5, 4].

However, symbol interpretation can be eliminated altogether if the recognition of knowledge structures occurs as a natural by-product of speech recognition. We show that this is possible due to the partial input recognition equivalence between CFG's and Direct Memory Access Parsing semantic networks (DMAP-Nets) [12, 10], which are knowledge structures used in many 3T agents [4, 7]. 3T agents are viewed as consisting of three tiers of functionality: deliberation, execution, and sensory-motor skills. The deliberation tier plans and solves problems; the execution tier translates goals into task networks and executes them; the sensory-motor skills interact with the world. The execution tier of the 3T architecture is implemented using Reactive Action Packages (RAPs) [3]. 3T architectures are featured on diverse robotic platforms to solve a variety of problems [1, 9, 11, 8, 7].

In this paper, we show how VCI's to 3T agents can benefit from the CFG formalism. Our approach also utilizes recent advances in speech recognition that enhance HMM-based voice input recognition with context-free command and control grammars (CFCG's), i.e., CFG's with action directives. In particular, we argue that the input recognition capacity of CFG's is partially equivalent to the input recognition capacity of DMAP-Nets. We proceed to use this theoretical result to construct VCI's to two autonomous agents. The first agent is Merlin, a Microsoft software agent that acts as a desktop assistant (see Figure 2). The second agent is a Pioneer 2DX mobile robot assembled from the robotic toolkit from ActivMedia, Inc. (www.activmedia.com) (see Figure 3). The robot patrols an office area looking for soda cans, coffee cups, and crumpled pieces of paper.

Figure 4 shows the hardware components of the Pioneer 2DX robot we used in our experiments. The robot has a three-wheel mobile base with two sonar rings, front and rear. The base has an onboard x86 computer with 32MB of RAM running Windows NT 4.0. The base also has an EVI-D30 camera mounted on it. The camera can pan, tilt, and zoom. It has a horizontal angle of view of 48.8 degrees and a vertical angle of view of 37.6 degrees. The video feed between an offboard client computer and the robotic base goes through a CCTV-900 wireless AV receiver and switcher and a Winnov video capture card (www.winnov.com). The commands from the client computer to the robot base are sent via an InfoWave Radio Modem manufactured by the InnoMedia Wireless Group (www.innomedia.com). The modem operates in the frequency band of 902-928 MHz with an air data rate of 85 Kbps. The robot has three onboard batteries that are periodically recharged with a PSC-124000 Automatic Battery Charger.
Figure 1: A DMAP-Net.
The paper is organized as follows. In section 2, we investigate the input recognition capacities of DMAP-Nets with
respect to context-free languages (CFLs). In section 3, we
use the construction inherent in the analysis from section 2
to build voice control interfaces to two autonomous agents.
We show how the voice inputs are mapped to the agents’
goals that, in turn, enable and disable the agents’ behaviors.
Section 4 offers implementation details. Section 5 outlines
future work. Section 6 offers conclusions.
2 Input Recognition Analysis
A DMAP-Net is a directed graph of nodes whose edges
have two types of labels: abstraction and packaging. If
two nodes are connected through an edge with an isa label, the node that receives the edge is an abstraction of the
node that emits it. For example, in Figure 1, M-COMMAND
is an abstraction of M-TURN-COMMAND. If two nodes are
connected through an edge with a label other than isa, the
receiving node is a frame and the emitting node is a slot filler of the slot whose name is the edge's label. For example, M-TURN-COMMAND is a frame with two slots: angle and direction. M-ANGLE is the filler of the angle slot, while M-DIRECTION is the filler of the direction slot. The frame name starts with an "M-" prefix to indicate that each node stands for a memory organization package (MOP), a term introduced by Schank [13] to refer to frames.
Frames are activated through recognition sequences associated with them. In Figure 1, the dashed box connected to M-TURN-COMMAND from below via a dashed arrow contains two recognition sequences, at least one of which must be completed by the input for M-TURN-COMMAND to be activated. Recognition sequences simulate spreading activation [10]: if a spreading activation function is known and is provably deterministic, one can effectively generate all of the recognition sequences necessary to activate a given frame.
DMAP-Nets connect to other modules through callbacks. A callback is an arbitrary piece of code that runs
as soon as the frame it is associated with is activated. In
Figure 1, the dotted box to the right of M-TURN-COMMAND
and connected to it with a dotted arrow denotes a callback
that installs an appropriate goal on the RAP executive’s
agenda and asks the executive to execute it.
Let $D = \langle \Phi, T, I, R, X, E \rangle$ be a DMAP-Net, where $\Phi$ is the set of frames, $T$ is the set of tokens, $I$ is the set of frame ids, $R \subseteq [T \cup I]^+$ is the set of r-sequences, $X$ is the set of edge labels, and $E$ is the set of labelled edges, i.e., $E = \{(M_i, M_j, x) \mid M_i \in I, M_j \in I, x \in X\}$. Note that $\Phi$ is defined by $I$, $X$, and $E$. Let $T^+$ be the set of t-sequences. Let $T \cap I = \emptyset$ so that there is no confusion between tokens and frame ids. Since $\Phi$ and $I$ are isomorphic, i.e., every frame has a unique id, frames and frame ids are used interchangeably. Let $\Gamma : I \rightarrow 2^R$ be a function that associates frames with sets of r-sequences. In the discussion below, it is assumed that t-sequences are non-empty.
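To make these definitions concrete, the fragment of the DMAP-Net in Figure 1 can be written down in Common Lisp, the language the system of Section 4 runs in. The listing below is only an illustrative sketch of ours, not the system's actual data structures; the frame, slot, and token names are taken from Figure 1 and Section 3, and the struct and variable names are our own.

;; A toy DMAP-Net representation; struct and variable names are ours.
(defstruct frame
  id             ; a frame id from I, e.g., M-TURN-COMMAND
  r-sequences    ; Gamma(M): a set of r-sequences over tokens and frame ids
  callbacks)     ; code to run when the frame is activated (left empty here)

(defparameter *frames* (make-hash-table))  ; maps frame ids to frames
(defparameter *edges* '())                 ; E: a list of (emitter receiver label)

(defun add-frame (id r-sequences &optional callbacks)
  (setf (gethash id *frames*)
        (make-frame :id id :r-sequences r-sequences :callbacks callbacks)))

(defun add-edge (from to label)
  (push (list from to label) *edges*))

;; The fragment shown in Figure 1.
(add-frame 'm-command '())
(add-frame 'm-turn-command '((turn m-angle m-direction)
                             (turn m-direction m-angle)))
(add-frame 'm-angle '((m-number) (m-number degrees)))
(add-frame 'm-direction '((left) (right)))
(add-frame 'm-number '((ten) (twenty) (thirty)))
(add-edge 'm-turn-command 'm-command 'isa)
(add-edge 'm-angle 'm-turn-command 'angle)
(add-edge 'm-direction 'm-turn-command 'direction)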
A frame can be activated by a t-sequence directly or indirectly. Let $A_d(M, t)$ denote that a frame $M$ is directly activated by a t-sequence $t$ and let $A_i(M, t)$ denote that $M$ is indirectly activated by $t$. Let $A(M, t)$ denote that $M$ is activated by $t$ either directly or indirectly.

A frame $M_i$ is directly activated by a t-sequence $t = t_1 t_2 \ldots t_n$, $n \geq 1$, denoted by $A_d(M_i, t)$, iff there exists an r-sequence $r = r_1 r_2 \ldots r_n \in \Gamma(M_i)$ such that $\forall i, 1 \leq i \leq n$, one of the following conditions holds:

1. If $r_i \in T^+$, then $r_i$ and $t_i$ are identical;

2. If $r_i \in I$, then $A(r_i, t_i)$.

A frame $M_i$ is indirectly activated by a t-sequence $t = t_1 t_2 \ldots t_n$, $n \geq 1$, denoted by $A_i(M_i, t)$, iff there exists $M_j \neq M_i$ such that $A(M_j, t)$ and $(M_j, M_i, isa) \in E$. In other words, a frame is indirectly activated by a token sequence if the frame is an abstraction of another frame activated by that sequence.
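Using the toy representation above, the relations $A_d$, $A_i$, and $A$ can be checked with a naive recursive matcher that tries every way of splitting a token sequence. The sketch below is ours and is meant only to make the definitions executable; it is not the DMAP parser of [12, 10].

;; A(M, t): is frame M activated, directly or indirectly, by TOKENS?
(defun frame-id-p (x)
  (and (symbolp x) (gethash x *frames*)))

(defun activated-p (m tokens)
  (or (directly-activated-p m tokens)
      ;; indirect activation: M is an abstraction of an activated frame
      (some (lambda (edge)
              (and (eq (third edge) 'isa)
                   (eq (second edge) m)
                   (activated-p (first edge) tokens)))
            *edges*)))

(defun directly-activated-p (m tokens)
  ;; A_d(M, t): some r-sequence in Gamma(M) matches TOKENS
  (some (lambda (r) (match-p r tokens))
        (frame-r-sequences (gethash m *frames*))))

(defun match-p (r tokens)
  ;; Match the r-sequence R against the token sequence TOKENS.
  (cond ((and (null r) (null tokens)) t)
        ((or (null r) (null tokens)) nil)
        ((frame-id-p (first r))
         ;; try every non-empty prefix of TOKENS for the embedded frame id
         (loop for i from 1 to (length tokens)
               thereis (and (activated-p (first r) (subseq tokens 0 i))
                            (match-p (rest r) (subseq tokens i)))))
        ((eq (first r) (first tokens))
         (match-p (rest r) (rest tokens)))))

;; (activated-p 'm-turn-command '(turn left twenty degrees)) => T
;; (activated-p 'm-command '(turn left twenty degrees))      => T (indirectly)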
Let $L(D) = \{t \mid t \in T^+ \wedge \forall M \in I' \subseteq I,\ A(M, t)\}$ be the language of $D$. In other words, a token sequence is in the language of $D$ if it activates every frame in a designated subset $I'$ of frames. Note that the exact definition of $I'$ will vary for different DMAP-Nets. For example, one can define $I'$ to be a singleton and accept only those t-sequences that activate the only frame in the singleton.
Lemma 2.1. Let $D = \langle \Phi, T, I, R, X, E \rangle$ be a DMAP-Net; then there exists a CFG $G$ such that $L(D) = L(G)$.
Proof: Let $G = \langle \Sigma, N, P, S \rangle$ be such that $\Sigma = T$, $N = I$, and $P = P_1 \cup P_2 \cup P_3$, where

1. $P_1 = \{M_i \rightarrow M_j \mid (M_j, M_i, isa) \in E\}$;

2. $P_2 = \{M \rightarrow r \mid M \in I \wedge r \in \Gamma(M)\}$;

3. $P_3 = \{S \rightarrow M_1 | M_2 | \ldots | M_n,\ 1 \leq n \leq |I|\}$.
Let $t$ be a t-sequence such that $t \in L(D)$. Let $M \in I'$ be a frame activated by $t$. If $A_d(M, t)$ holds, then there exists an r-sequence $r \in \Gamma(M)$ such that $r$ and $t$ satisfy the two conditions of direct activation. Since, by construction, $M \rightarrow r \in P$, $M$ derives $t$. Since, by construction, $S \rightarrow M \in P$, $S$ derives $t$, i.e., $t \in L(G)$. If $A_i(M, t)$ holds, $t$ activates a frame $N$ such that $M$ is one of its abstractions. Without loss of generality, assume that $A_d(N, t)$ holds. For if $M$ is indirectly activated, there must be a frame $N$ such that $A_d(N, t)$ holds and $M$ is an abstraction of $N$. If $A_d(N, t)$ holds, there exists an r-sequence $r \in \Gamma(N)$ such that $r$ and $t$ satisfy the two conditions of direct activation. Since, by construction, both $N \rightarrow r$ and $M \rightarrow N$ are in $P$, $M$ derives $t$. Since, by construction, $S \rightarrow M \in P$, $S$ derives $t$, i.e., $t \in L(G)$.

Conversely, let $t \in L(G)$. Then $S$ derives $t$ in one of two ways: either $S \Rightarrow M \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $M \in I$ and $r \in \Gamma(M)$, or $S \Rightarrow M \Rightarrow N \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $M, N \in I$ and $r \in \Gamma(N)$. In the former case, since $M$ derives $t$ via $r$, by reading the yield of the derivation tree rooted at $M$, one can find a strictly increasing sequence of indices $1$ through $n$, $1 \leq n$, such that $r = r_1 r_2 \ldots r_n$, $t = t_1 t_2 \ldots t_n$, and $\forall i, 1 \leq i \leq n$, $t_i$ is identical with $r_i$ or $r_i$ derives $t_i$. Since, by construction, $r \in \Gamma(M)$, $A_d(M, t)$ holds. In the latter case, $S$ derives $t$ via $M$ and $N$, and $N$ derives $t$ via $r$. By reading the yield of the derivation tree rooted at $N$, one can similarly find a strictly increasing sequence of indices that make $t$ and $r$ satisfy the two conditions of direct activation. Since, by construction, $M$ is an abstraction of $N$ and $A_d(N, t)$ holds, $A_i(M, t)$ holds as well. Thus, in either case $t \in L(D)$. $\Box$
The proof of Lemma 2.1 offers an algorithm for constructing CFCG's from DMAP-Nets. Given a DMAP-Net, the algorithm automatically generates an equivalent CFCG for speech and frame recognition. Specifically, for each frame in the DMAP-Net and for each recognition sequence associated with the frame, a CFCG production is constructed such that the frame name is the production's left-hand side and the recognition sequence is its right-hand side. If the frame has callbacks, each callback becomes an action specification. If two frames are connected via an abstraction edge, the abstraction frame becomes the left-hand side and the more specific frame becomes the right-hand side.
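This construction is mechanical enough to state as code. The sketch below reuses the toy DMAP-Net representation from earlier in this section and is our own illustration: it emits the production sets $P_1$, $P_2$, and $P_3$ of Lemma 2.1 and attaches each frame's callbacks to its productions as action specifications. The symbol start and the function name dmap-net->cfcg are ours, not the system's.

;; Our sketch of the CFCG construction of Lemma 2.1.
;; A production is a triple (lhs rhs action), with action possibly NIL.
(defun dmap-net->cfcg (accepting-frames)
  (let ((productions '()))
    ;; P2: one production per recognition sequence; callbacks become actions
    (maphash (lambda (id frame)
               (dolist (r (frame-r-sequences frame))
                 (push (list id r (frame-callbacks frame)) productions)))
             *frames*)
    ;; P1: one unit production per isa edge, with the abstraction on the left
    (dolist (edge *edges*)
      (when (eq (third edge) 'isa)
        (push (list (second edge) (list (first edge)) nil) productions)))
    ;; P3: the start symbol derives each accepting frame in I'
    (dolist (m accepting-frames)
      (push (list 'start (list m) nil) productions))
    (nreverse productions)))

;; (dmap-net->cfcg '(m-command)) contains, among others,
;; (M-TURN-COMMAND (TURN M-DIRECTION M-ANGLE) NIL) and (M-COMMAND (M-TURN-COMMAND) NIL).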
Lemma 2.1 leads to the following theorem.
Theorem 2.1. Let $DMAPL$ be the set of DMAP-Net languages and $CFL$ be the set of context-free languages; then $DMAPL \subseteq CFL$.

Proof: Let $L \in DMAPL$. Then there exists a DMAP-Net $D$ such that $L(D) = L$. By Lemma 2.1, there exists a CFG $G$ such that $L(G) = L$. Hence, $DMAPL \subseteq CFL$. $\Box$
The following lemma covers the construction of DMAP-Nets from CFG's.

Lemma 2.2. Let $G = \langle \Sigma, N, P, S \rangle$ be a CFG; then there exists a DMAP-Net $D$ such that $L(D) \supseteq L(G)$.
Proof: Let $D = \langle \Phi, T, I, R, X, E \rangle$ be a DMAP-Net defined as follows:

1. $T = \Sigma$;

2. $I = N$;

3. $I' = \{S\}$;

4. $X = \{isa, partof\}$;

5. $R = \bigcup_{N_i \in N} \Gamma(N_i)$, where $\Gamma(N_i) = \{\beta_j\}_{j=1}^{k}$ such that $N_i \rightarrow \beta_j \in P$ and $1 \leq k$;

6. $E = E_1 \cup E_2$, where $E_1 = \{(N_i, N_j, isa) \mid N_j \rightarrow N_i \in P\}$ and $E_2 = \{(N_i, N_j, partof) \mid N_j \rightarrow \alpha N_i \beta \in P\}$, where $\alpha, \beta \in [\Sigma \cup N]^*$.
Let $t \in \Sigma^+$ and let $t \in L(G)$. Then $S$ derives $t$ in one of two ways: either $S \Rightarrow t$ or $S \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $r \in [\Sigma \cup N]^+$. In the former case, $S \rightarrow t \in P$ and, by construction, $t \in \Gamma(S)$. Thus, $t \in L(D)$. In the latter case, since $r$ derives $t$, by reading the yield of the derivation tree rooted at $S$, one can find a strictly increasing sequence of indices to make $r$ and $t$ satisfy the two conditions of direct activation, as was done in Lemma 2.1. By construction, $r \in \Gamma(S)$. Hence, $A(S, t)$ holds and $t \in L(D)$. $\Box$
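The construction of Lemma 2.2 can be written down just as directly. The sketch below again uses our toy representation from the beginning of the section; the function name and the production format (a list (lhs rhs), with rhs a list of terminals and nonterminals) are our own.

;; Our sketch of the CFG-to-DMAP-Net construction of Lemma 2.2.
(defun cfg->dmap-net (productions nonterminals)
  (clrhash *frames*)
  (setf *edges* '())
  (dolist (nt nonterminals)
    (add-frame nt '()))
  (dolist (p productions)
    (destructuring-bind (lhs rhs) p
      ;; Gamma(lhs) collects the right-hand sides of lhs's productions
      (push rhs (frame-r-sequences (gethash lhs *frames*)))
      (cond ((and (= (length rhs) 1) (member (first rhs) nonterminals))
             ;; E1: a unit production N_j -> N_i yields the isa edge (N_i, N_j, isa)
             (add-edge (first rhs) lhs 'isa))
            (t
             ;; E2: every nonterminal in the right-hand side is part of the left-hand side
             (dolist (sym rhs)
               (when (member sym nonterminals)
                 (add-edge sym lhs 'partof))))))))

;; (cfg->dmap-net '((s (a b)) (s (a s b))) '(s)) builds the DMAP-Net of Lemma 2.3,
;; with Gamma(S) = {(a b), (a s b)} and a partof edge from S to itself.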
The question arises why the construction offered in
Lemma 2.1 has the equality sign between the two languages
while the construction offered in Lemma 2.2 has the subset
sign. It turns out that the construction of Lemma 2.2 can
produce a DMAP-Net that recognizes a language strictly
larger than the language recognized by the corresponding
CFG. The following lemma formalizes this observation.
Lemma 2.3. Let the construction algorithm $C$ that generates DMAP-Nets from CFG's be as specified in Lemma 2.2. Let $C(G) = D$, where $G$ is a CFG and $D$ is a DMAP-Net. Then there exists a CFG $G'$ such that $L(G') \subset L(C(G'))$.

Proof: Let $G'$ have the following productions: $S \rightarrow ab$ and $S \rightarrow aSb$, i.e., $L(G') = \{a^n b^n\}$. Let $D' = C(G')$. By the definition of activation, $L(D') = \{a^n b^n\} \cup \{a^n b\}$, $1 \leq n$. $\Box$
Figure 2: Merlin.

Figure 3: Pioneer 2DX robot.

Figure 4: Pioneer 2DX Hardware.

3 Mapping Inputs to Knowledge Structures

Now we consider how CFCG's can be used in VCI's to map inputs to knowledge structures. Suppose that we want to build a VCI to a 3T mobile robot. One of the robot's physical abilities that the VCI needs to reference is turning a certain number of degrees left or right. A standard VCI carries out the reference in two steps [4]. First, an audio stream uttered by the user is mapped into a symbolic representation of the user's utterance, e.g., a set of symbols or a string. Second, the symbolic representation is used to activate the goals in the agent's internal representation, e.g., a DMAP-Net.

For example, if the agent uses the DMAP-Net given in Figure 1, M-TURN-COMMAND is activated on such inputs as "turn left twenty," "turn left twenty degrees," "turn right thirty," or "turn right thirty degrees." Once M-TURN-COMMAND is activated, a callback associated with that node installs an appropriate goal on the RAP sequencer's agenda.

Given the recognition equivalence of DMAP-Nets and CFG's, we can construct a VCI that uses a CFCG to do goal identification as a by-product of speech recognition. Thus, only the appropriate goal is sent to the RAP sequencer. The productions of the CFCG are as follows:

M-TURN-COMMAND =>
    turn M-ANGLE M-DIRECTION |
    turn M-DIRECTION M-ANGLE
    :: execute-goal(turn,
                    M-ANGLE,
                    M-DIRECTION)

M-ANGLE => M-NUMBER |
           M-NUMBER degrees

M-NUMBER => ten | twenty ...

M-COMMAND => M-TURN-COMMAND ...
In the above CFCG, the nonterminals are capitalized; the
terminals are in lower-case letters. The double colon sign in
the first production separates the right-hand side of the production from an action specification. In this case, the action specification denotes a goal that will be installed on the
RAP executive’s agenda should the rule recognize the voice
input. For example, if the voice input is "turn left twenty degrees," the RAP executive receives the following goal: (turn -20 100), which means that the robot
should turn left with a speed of 100 mm/sec. The key point
here is that the symbol interpretation that typically occurs
through the DMAP-Net is bypassed because it is no longer
necessary. In effect, the agent’s conceptual memory now
consists of a set of context-free command and control productions.
4 Implementation
Our VCI uses the Microsoft Speech API (SAPI) 5.1, freely available from www.microsoft.com/speech. SAPI couples an HMM-based recognition engine with a system for constraining recognized speech with a CFCG. It provides speaker-independent, relatively noise-robust speech recognition.

The grammar to be recognized is defined by an XML Document Type Definition (DTD). Here are three rules from the
XML grammar used in the VCI to the Pioneer robot.
<RULE NAME="M-TURN-COMMAND"
TOPLEVEL="ACTIVE">
<P>turn</P>
<RULEREF NAME="M-DIRECTION"/>
<RULEREF NAME="M-ANGLE"/>
</RULE>
Figure 5: VCI to the Pioneer 2DX mobile robot.
<RULE NAME="M-DIRECTION">
<L PROPNAME="direction">
<P VALSTR="left" >left</P>
<P VALSTR="right">right</P>
</L>
</RULE>
<RULE NAME="M-ANGLE">
<L PROPNAME="angle">
<P VALSTR="10">ten</P>
<P VALSTR="20">twenty</P>
<P VALSTR="30">thirty</P>
<O>degrees</O>
</L>
</RULE>
Given the above XML CFCG, we can extract semantic information from the parsing process: the values of the properties associated with the production rules are the values that populate the slots in the associated frame.
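These property values, a direction and an angle, are what the execute-goal action directive of Section 3 turns into a RAP goal. The function below is a minimal sketch of ours, not the system's actual code; it assumes the VALSTR strings have already been converted to a Lisp symbol and a number, and it takes the sign convention (negative angles for left turns) and the fixed speed argument of 100 from the (turn -20 100) example.

;; Our sketch of the mapping performed by the execute-goal action directive.
(defun make-turn-goal (direction angle &optional (speed 100))
  ;; DIRECTION is LEFT or RIGHT; ANGLE is in degrees; left turns map to negative angles.
  (list 'turn
        (if (eq direction 'left) (- angle) angle)
        speed))

;; (make-turn-goal 'left 20)  => (TURN -20 100)
;; (make-turn-goal 'right 30) => (TURN 30 100)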
The VCI to the Pioneer robot is shown in Figure 5. The
VCI runs in Allegro Common Lisp (ACL) 5.0.1. The top
window is the ACL menu bar. The bottom window is the
ACL’s interpreter (Debug Window). The leftmost window
in the middle is the GUI to the Saphira library freely available from www.activmedia.com. Saphira is a collection of C routines that directly interface to the robot’s hardware. The window to the right of the Saphira window is
our current implementation of the speech engine interface.
It is implemented as a palm pilot interface because of our
hope that eventually the human operator will be able to use
a handheld device to interact with the robot by voice. The
window to the left of the speech engine interface is the video
feed from the robot’s camera that allows the user to see what
the robot is seeing.
Figure 6: A RAP for turning the Pioneer 2DX robot.
The operator enters voice inputs through a push-to-talk
event, i.e., by clicking a button in the palm pilot interface.
After a voice input is recognized by a rule, a Visual Basic function implemented as part of the palm pilot interface extracts the attributes from the derivation tree whose
leaves are the input words. For example, if the input is “turn
left twenty,” the value of the attribute Direction is “left”
while the value of the attribute Angle is “20.”
Once this information is extracted from the derivation
tree, the voice recognition module of the VCI packs this
information into a message and sends it to a voice skill
server through a TCP/IP socket. The voice skill server is
part of the skill system of the RAP executive and is started
when the robot wakes up. The messages from the VCI
are automatically queued on the server’s message queue.
The server checks the message queue at regular intervals
of adjustable frequency. When a message is dequeued, the
voice skill maps the message into a goal and then installs
the goal on the RAP executive’s agenda. Thus, the executive can integrate the goal into the robot’s current task networks unless, of course, the goal tells it to stop all activities.
For example, when the operator utters “turn left twenty,”
the goal that is installed on the RAP executive’s agenda is
(turn -20 100), which leads to the execution of the
RAP given in Figure 6.
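The dequeue-and-install step can be pictured with the following sketch. It is our own simplification: the list-based queue and the message->goal and install-goal parameters are hypothetical stand-ins for the TCP/IP message queue, the message-to-goal mapping, and the RAP executive's agenda interface described above.

;; Our sketch of the voice skill's polling loop; the names are hypothetical.
(defparameter *message-queue* '())   ; messages such as (:direction left :angle 20)

(defun poll-voice-messages (message->goal install-goal)
  ;; Drain the queue, mapping each VCI message to a goal and installing it.
  (loop while *message-queue*
        do (funcall install-goal
                    (funcall message->goal (pop *message-queue*)))))

;; Example, reusing MAKE-TURN-GOAL from the sketch in Section 4:
;; (push '(:direction left :angle 20) *message-queue*)
;; (poll-voice-messages (lambda (m) (make-turn-goal (getf m :direction) (getf m :angle)))
;;                      #'print)
;; prints (TURN -20 100)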
The VCI to the Merlin software agent looks almost exactly like the Pioneer’s VCI except that the Saphira window
and the video feed window are not shown because they are
not necessary. Everything else works exactly like it does in
the Pioneer’s VCI.
5 Future Work
We believe that with the appropriate restriction on the definition of activation for DMAP-Nets we will be able to show
that the two language classes DMAPL and CFL coincide.
We also wish to show that under meaningful assumptions
we can reduce an arbitrary semantic network to a DMAP-Net; this would argue for the adoption of DMAP-Nets as a
canonical representation of semantic networks. This, coupled with the equivalence results, will serve as an important means of bridging the formalism divide between speech
recognition and symbol interpretation. Furthermore, this
equivalence will also allow us to use the rich theory of CFLs
to characterize more fully the world of semantic networks.
Finally, from an HCI perspective, we wish to examine how
the grammars produced by the construction described in this
paper actually function in real-world VCIs.
Although, unlike other VCI solutions to autonomous
robots [2], our solution addresses the task interruption problem and does not ignore any user commands, it still has the
push-to-talk problem. The human operator must explicitly
notify the voice control interface before a voice command is
given, which makes communication unnatural and uncomfortable. Our future research effort will focus on eliminating the push-to-talk event altogether. The speech recognition engine will be operating continuously. If the operator’s
input is not understood, it will be either ignored or the system will notify the operator that the input was not recognized.
Another limitation of our VCI is that the messages that it
sends to the voice skill server do not directly encode goals
that can be readily installed on the RAP executive’s agenda.
Therefore, some interpretation must still be done on the
RAP executive’s side. A future release of the system will
eliminate that step so that the voice control server will not
have to translate messages into goals.
6 Conclusion
We investigated the input recognition capacities of DMAP-Nets with respect to CFL's. We argued that the input recognition capacity of CFG's is partially equivalent to the input
recognition capacity of DMAP-Nets. We used the construction inherent in our argument to build voice control interfaces to two autonomous agents. Our approach utilized the
recent advances in speech recognition that enhance HMM-based voice input recognition with CFCG's. We showed
how the voice inputs are mapped to the agents’ goals that,
in turn, enable and disable the agents’ behaviors.
References

[1] Bonasso, R. P., Firby, R. J., Gat, E., Kortenkamp, D., and Slack, M. "Experiences with an Architecture for Intelligent, Reactive Agents," Journal of Experimental and Theoretical Artificial Intelligence, Vol. 9(2), pp. 237-256, 1997.

[2] Choset, H. and Kortenkamp, D. "Path Planning and Control for AERCam, a Free-Flying Inspection Robot in Space," ASCE Journal of Aerospace Engineering, 1999.

[3] Firby, R. J. "Adaptive Execution in Complex Dynamic Worlds," PhD thesis, Yale University, Department of Computer Science, 1989.

[4] Fitzgerald, W. and Firby, R. J. "Dialogue Systems Require a Reactive Task Architecture," Proceedings of the 2000 AAAI Spring Symposium, AAAI Press, 2000.

[5] Fitzgerald, W. and Wiseman, J. "Approaches to Integrating Natural Language Understanding and Task Execution in the DPMA AERCam Testbed," White paper, Neodesic, Inc., 1997.

[6] Jurafsky, D. and Martin, J. Speech and Language Processing, Prentice Hall, New Jersey, 2000.

[7] Kulyukin, V. and Steele, A. "Instruction and Action in the Three-Tier Robot Architecture," In submission to the International Symposium on Robotics and Automation (ISRA-2002), Toluca, Mexico, 2002.

[8] Kulyukin, V. and Morley, N. "Integrated Object Recognition in the Three-Tiered Robot Architecture," Proceedings of the International Conference on Artificial Intelligence (IC-AI 2002), IEEE Computer Society, 2002.

[9] Kulyukin, V. and Zoko, A. "Hamming Distance for Object Recognition by Mobile Robots," Proceedings of CTIRS 2000, the Research Symposium sponsored by the DePaul University School of Computer Science, Telecommunications, and Information Systems, DePaul University Press, 2000.

[10] Kulyukin, V. and Settle, A. "Ranked Retrieval with Semantic Networks and Vector Spaces," Journal of the American Society for Information Science and Technology, 52(4), pp. 1124-1233, 2001.

[11] Kulyukin, V. and Bookstein, A. "Integrated Object Recognition with Extended Hamming Distance," Technical Report, School of Computer Science, Telecommunications and Information Systems, DePaul University, 2001.

[12] Martin, C. "Direct Memory Access Parsing," Technical Report CS93-07, Department of Computer Science, The University of Chicago, 1993.

[13] Schank, R. Dynamic Memory, Cambridge University Press, New York, 1980.