Input Recognition in Voice Control Interfaces to Three-Tier Autonomous Agents

Vladimir Kulyukin, Computer Science Department, Utah State University, Logan, UT 84322
Adam Steele, School of Computer Science, DePaul University, Chicago, IL 60604-2301

Abstract

Voice control interfaces are based on the assumption that the difficult problem of understanding spoken utterances can be sufficiently constrained if reduced to the problem of mapping voice inputs to knowledge and control structures. Thus, a spoken input is recognized if and only if it references an appropriate structure. Context-free command and control grammars are used in speech recognition to constrain voice inputs and improve recognition accuracy. We show how such grammars can be automatically constructed from the knowledge structures used in three-tier autonomous agents.

1. Introduction

The purpose of voice control interfaces (VCI's) is to map human voice inputs to the appropriate knowledge and control structures in autonomous robots or other computational devices. VCI's combine three aspects of natural language processing (NLP): speech recognition, syntax, and semantics, each of which uses different formalisms.
Speech recognition has traditionally relied on Hidden Markov Models (HMM's), while syntax and semantics have made heavy use of context-free grammars (CFG's), semantic networks, and first-order predicate calculus [6]. A direct consequence of this formalism divide is that intelligent VCI's to three-tier (3T) autonomous agents operate in two completely separate phases: speech recognition and symbol interpretation. During speech recognition, voice inputs are mapped to symbols; during symbol interpretation, symbols obtained from voice inputs are used to identify appropriate knowledge structures [5, 4]. However, symbol interpretation can be eliminated altogether if the recognition of knowledge structures occurs as a natural by-product of speech recognition. We show that this is possible due to the partial input recognition equivalence between CFG's and Direct Memory Access Parsing semantic networks (DMAP-Nets) [12, 10], which are knowledge structures used in many 3T agents [4, 7].

3T agents are viewed as consisting of three tiers of functionality: deliberation, execution, and sensory-motor skills. The deliberation tier plans and solves problems; the execution tier translates goals into task networks and executes them; the sensory-motor skills interact with the world. The execution tier of the 3T architecture is implemented using Reactive Action Packages (RAPs) [3]. 3T architectures are featured on diverse robotic platforms to solve a variety of problems [1, 9, 11, 8, 7].

In this paper, we show how VCI's to 3T agents can benefit from the CFG formalism. Our approach also utilizes the recent advances in speech recognition that enhance HMM-based voice input recognition with context-free command and control grammars (CFCG's), i.e., CFG's with action directives. In particular, we argue that the input recognition capacity of CFG's is partially equivalent to the input recognition capacity of DMAP-Nets. We proceed to use this theoretical result to construct the VCI's to two autonomous agents. The first agent is Merlin, a Microsoft software agent that acts as a desktop assistant (see Figure 2). The second agent is a Pioneer 2DX mobile robot assembled from the robotic toolkit from ActivMedia, Inc. (www.activmedia.com) (see Figure 3). The robot patrols an office area looking for soda cans, coffee cups, and crumpled pieces of paper.

Figure 4 shows the hardware components of the Pioneer 2DX robot we used in our experiments. The robot has a three-wheel mobile base with two sonar rings, front and rear. The base has an onboard x86 computer with 32MB of RAM running Windows NT 4.0. The base also has an EVI-D30 camera mounted on it. The camera can pan, tilt, and zoom. It has a horizontal angle of view of 48.8 degrees and a vertical angle of view of 37.6 degrees. The video feed between an offboard client computer and the robotic base goes through a CCTV-900 wireless AV receiver and switcher and a Winnov video capture card (www.winnov.com). The commands from the client computer to the robot base are sent via an InfoWave Radio Modem manufactured by the InnoMedia Wireless Group (www.innomedia.com). The modem operates in the frequency band of 902-928 MHz with an air data rate of 85 Kbps. The robot has three on-board batteries that are periodically recharged with a PSC-124000 Automatic Battery Charger.

The paper is organized as follows. In section 2, we investigate the input recognition capacities of DMAP-Nets with respect to context-free languages (CFLs). In section 3, we use the construction inherent in the analysis from section 2 to build voice control interfaces to two autonomous agents. We show how the voice inputs are mapped to the agents' goals that, in turn, enable and disable the agents' behaviors. Section 4 offers implementation details. Section 5 outlines future work. Section 6 offers conclusions.

Figure 1: A DMAP-Net.

2 Input Recognition Analysis

A DMAP-Net is a directed graph of nodes whose edges have two types of labels: abstraction and packaging. If two nodes are connected through an edge with an isa label, the node that receives the edge is an abstraction of the node that emits it. For example, in Figure 1, M-COMMAND is an abstraction of M-TURN-COMMAND. If two nodes are connected through an edge with a label other than isa, the receiving node is a frame and the emitting node is a filler of the slot whose name is the edge's label. For example, M-TURN-COMMAND is a frame with two slots: angle and direction. M-ANGLE is the filler of the angle slot while M-DIRECTION is the filler of the direction slot. Each frame name starts with an "M-" prefix to indicate that the node stands for a memory organization package (MOP), a term introduced by Schank [13] to refer to frames.

Frames are activated through recognition sequences associated with them. In Figure 1, the dashed box connected to M-TURN-COMMAND from below via a dashed arrow contains two recognition sequences, at least one of which must be completed by the input for M-TURN-COMMAND to be activated. Recognition sequences simulate spreading activation [10]: if a spreading activation function is known and is provably deterministic, one can effectively generate all of the recognition sequences necessary to activate a given frame. DMAP-Nets connect to other modules through callbacks. A callback is an arbitrary piece of code that runs as soon as the frame it is associated with is activated. In Figure 1, the dotted box to the right of M-TURN-COMMAND and connected to it with a dotted arrow denotes a callback that installs an appropriate goal on the RAP executive's agenda and asks the executive to execute it.
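To make the representation concrete, the M-TURN-COMMAND fragment of Figure 1 could be encoded as follows. This is a minimal Python sketch: the frame and slot names follow the figure, but the recognition sequences and the callback body are illustrative assumptions (the actual system, described in section 4, runs in Allegro Common Lisp).

# A hypothetical Python encoding of the M-TURN-COMMAND fragment of Figure 1.
DMAP_NET = {
    "M-COMMAND": {
        "isa": [],                        # no abstraction above M-COMMAND in this fragment
        "slots": {},
        "r_sequences": [],
        "callback": None,
    },
    "M-TURN-COMMAND": {
        "isa": ["M-COMMAND"],             # M-COMMAND is an abstraction of this frame
        "slots": {"angle": "M-ANGLE", "direction": "M-DIRECTION"},
        "r_sequences": [["turn", "M-ANGLE", "M-DIRECTION"],
                        ["turn", "M-DIRECTION", "M-ANGLE"]],
        # runs as soon as the frame is activated; stands in for the code that
        # installs a goal on the RAP executive's agenda
        "callback": lambda slots: print("install goal on RAP agenda:", slots),
    },
}

# Once M-TURN-COMMAND is activated with its slots filled, its callback fires:
DMAP_NET["M-TURN-COMMAND"]["callback"]({"direction": "left", "angle": 20})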
Let D =< ; T; I; R; X; E > be a DMAP-net, where is the set of frames, T is the set of tokens, I is the set of frame ids, R [T [ I ]+ is the set of r-sequences, X is the set of edge labels, and E is the set of labelled edges, i.e., E = f(Mi ; Mj ; x)jMi 2 I; Mj 2 I; x 2 X g. Note that is defined by I , X , and E . Let T + be the set of t-sequences. Let T \ I = ; so that there is no confusion between tokens and frame ids. Since and I are isomorphic, i.e., every frame has a unique id, frames and frame ids are used interchangeably. Let : I ! 2 I be a function that associates frames with sets of r-sequences. In the discussion below, it is assumed that t-sequences are non-empty. A frame can be activated by a t-sequence directly or indirectly. Let Ad (M; t) denote that a frame M is directly activated by a t-sequence t and let A i (M; t) denote that M is indirectly activated by t. Let A(M; t) denote that M is activated by t either directly or indirectly. A frame Mi is directly activated by a t-sequence t = t1 t2 :::tn , n 1, denoted by A d (Mi ; t), iff there exists a r-sequence r = r1 r2 :::rn 2 (Mi ) such that 8i; 1 i n, one of the following conditions hold: 2 T + , then ri and ti are identical; If ri 2 I , then A(ri ; ti ). 1. If ri 2. A frame Mi is indirectly activated by a t-sequence t = t1 t2 :::tn , n 1, Ai (Mi ; t), iff there exists Mj 6= Mi such that A(Mj ; t) and (Mj ; Mi ; isa) 2 E . In other words, a frame is indirectly activated by a token sequence if the frame is an abstraction of another frame activated by that sequence. Let L(D) = ftjt 2 T + ^ 8M 2 I 0 I; A(M; t)g be the language of D. In other words, a token sequence is in the language of D if it activates a subset of frames. Note that the exact definition of I 0 will vary for different DMAPnets. For example, one can define I 0 to be a singleton and accept only those t-sequences that activate the only frame in the singleton. Lemma 2.1 Let be D =< ; T; I; R; X; E > a DMAPnet, then there exists a CFG G such that L(D) = L(G). Proof: Let G =< ; N; P; S > such that = T , N = I , P = P1 [ P2 [ P3 , where 1. P1 = fMi ! Mj j(Mj ; Mi ; isa) 2 E g; 2. P2 = fM ! rjM 2 I ^ r 2 (M )g; 3. P3 = fS ! M1 jM2 j:::jMn ; 1 n jI jg. Let t be a t-sequence such that t 2 L(D). Let M 2 I 0 be a frame activated by t. If A d (M; t) holds, then there exists an r-sequence r 2 (M ) such that r and t satisfy the two conditions of direct activation. Since, by construction, M ! r 2 P , M derives t. Since, by construction, S ! r 2 P , S derives t, i.e., t 2 L(G). If A i (M; t) holds, t activates a frame N such that M is one of its abstractions. Without loss of generality, assume that A d (N; t) holds. For, if M is indirectly activated, there must be a frame N such that Ad(N; t) holds, and M is an abstraction of N . If A d (N; t) holds, there exists an r-sequence r 2 (N ) such that r and t satisfy the two conditions of direct activation. Since, by construction, both N ! r and M ! N are in P , M derives t. Since, by construction, S ! M 2 P , S derives t, i.e., t 2 L(G). Conversely, let t 2 L(G). Then S derives t in one of two ways. Either S ) M ) r ) ::: ) t, where M 2 I and r 2 (M ), or S ) M ) N ) r ) ::: ) t, where M; N 2 I and r 2 (N ). In the former case, since M derives t via r, by reading the yield of the derivation tree rooted at M , one can find a strictly increasing sequence of indices 1 through n, 1 n, such that r = r 1 r2 :::rn and t = t1 t2 :::tn and 81 i n, ti is identical with ri or ri derives ti . Since, by construction, r 2 (M ), A d (M; t) holds. 
Lemma 2.1 Let $D = \langle F, T, I, R, X, E \rangle$ be a DMAP-Net. Then there exists a CFG $G$ such that $L(D) = L(G)$.

Proof: Let $G = \langle \Sigma, N, P, S \rangle$ such that $\Sigma = T$, $N = I \cup \{S\}$, and $P = P_1 \cup P_2 \cup P_3$, where

1. $P_1 = \{M_i \rightarrow M_j \mid (M_j, M_i, isa) \in E\}$;
2. $P_2 = \{M \rightarrow r \mid M \in I \wedge r \in \rho(M)\}$;
3. $P_3 = \{S \rightarrow M_1 \mid M_2 \mid \ldots \mid M_n,\ 1 \leq n \leq |I|\}$.

Let $t$ be a t-sequence such that $t \in L(D)$. Let $M \in I'$ be a frame activated by $t$. If $A_d(M, t)$ holds, then there exists an r-sequence $r \in \rho(M)$ such that $r$ and $t$ satisfy the two conditions of direct activation. Since, by construction, $M \rightarrow r \in P$, $M$ derives $t$. Since, by construction, $S \rightarrow M \in P$, $S$ derives $t$, i.e., $t \in L(G)$. If $A_i(M, t)$ holds, $t$ activates a frame $N$ such that $M$ is one of its abstractions. Without loss of generality, assume that $A_d(N, t)$ holds; for, if $M$ is indirectly activated, there must be a frame $N$ such that $A_d(N, t)$ holds and $M$ is an abstraction of $N$. If $A_d(N, t)$ holds, there exists an r-sequence $r \in \rho(N)$ such that $r$ and $t$ satisfy the two conditions of direct activation. Since, by construction, both $N \rightarrow r$ and $M \rightarrow N$ are in $P$, $M$ derives $t$. Since, by construction, $S \rightarrow M \in P$, $S$ derives $t$, i.e., $t \in L(G)$.

Conversely, let $t \in L(G)$. Then $S$ derives $t$ in one of two ways: either $S \Rightarrow M \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $M \in I$ and $r \in \rho(M)$, or $S \Rightarrow M \Rightarrow N \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $M, N \in I$ and $r \in \rho(N)$. In the former case, since $M$ derives $t$ via $r$, by reading the yield of the derivation tree rooted at $M$, one can find a strictly increasing sequence of indices $1$ through $n$, $1 \leq n$, such that $r = r_1 r_2 \ldots r_n$ and $t = t_1 t_2 \ldots t_n$ and $\forall\, 1 \leq i \leq n$, $t_i$ is identical with $r_i$ or $r_i$ derives $t_i$. Since, by construction, $r \in \rho(M)$, $A_d(M, t)$ holds. In the latter case, $S$ derives $t$ via $M$ and $N$, and $N$ derives $t$ via $r$. By reading the yield of the derivation tree rooted at $N$, one can similarly find a strictly increasing sequence of indices that make $t$ and $r$ satisfy the two conditions of direct activation. Since, by construction, $M$ is an abstraction of $N$ and $A_d(N, t)$ holds, $A_i(M, t)$ holds as well. Thus, in either case $t \in L(D)$. $\Box$

The proof of Lemma 2.1 offers an algorithm for constructing CFCG's from DMAP-Nets. Given a DMAP-Net, the algorithm automatically generates an equivalent CFCG for speech and frame recognition. Specifically, for each frame in the DMAP-Net and for each recognition sequence associated with the frame, a CFCG production is constructed such that the frame name is the production's left-hand side and the recognition sequence is its right-hand side. If the frame has callbacks, each callback becomes an action specification. If two frames are connected via an abstraction edge, the abstraction frame becomes the left-hand side and the specialization frame becomes the right-hand side. Lemma 2.1 leads to the following theorem.

Theorem 2.1 Let $DMAPL$ be the set of DMAP-Net languages and $CFL$ be the set of context-free languages. Then $DMAPL \subseteq CFL$.

Proof: Let $L \in DMAPL$. Then there exists a DMAP-Net $D$ such that $L(D) = L$. By Lemma 2.1, there exists a CFG $G$ such that $L(G) = L$. Hence, $DMAPL \subseteq CFL$. $\Box$

The following lemma covers the construction of DMAP-Nets from CFG's.

Lemma 2.2 Let $G = \langle \Sigma, N, P, S \rangle$ be a CFG. Then there exists a DMAP-Net $D$ such that $L(G) \subseteq L(D)$.

Proof: Let $D = \langle F, T, I, R, X, E \rangle$ be a DMAP-Net defined as follows:

1. $T = \Sigma$;
2. $I = N$;
3. $I' = \{S\}$;
4. $X = \{isa, partof\}$;
5. $R = \bigcup_{N_i \in N} \rho(N_i)$, where $\rho(N_i) = \{\alpha_j\}_{j=1}^{k}$ such that $N_i \rightarrow \alpha_j \in P$ and $1 \leq k \leq |P|$;
6. $E = E_1 \cup E_2$, where $E_1 = \{(N_i, N_j, isa) \mid N_j \rightarrow N_i \in P\}$ and $E_2 = \{(N_i, N_j, partof) \mid N_i \rightarrow \alpha N_j \beta \in P\}$, where $\alpha, \beta \in [\Sigma \cup N]^{*}$.

Let $t \in \Sigma^{+}$ and let $t \in L(G)$. Then $S$ derives $t$ in one of two ways: either $S \Rightarrow t$ or $S \Rightarrow r \Rightarrow \ldots \Rightarrow t$, where $r \in [\Sigma \cup N]^{+}$. In the former case, $S \rightarrow t \in P$ and, by construction, $t \in \rho(S)$. Thus, $t \in L(D)$. In the latter case, since $r$ derives $t$, by reading the yield of the derivation tree rooted at $S$, one can find a strictly increasing sequence of indices that make $r$ and $t$ satisfy the two conditions of direct activation, as was done in Lemma 2.1. By construction, $r \in \rho(S)$. Hence, $A(S, t)$ holds and $t \in L(D)$. $\Box$

The question arises why the construction offered in Lemma 2.1 yields equality between the two languages while the construction offered in Lemma 2.2 yields only containment. It turns out that the construction of Lemma 2.2 can produce a DMAP-Net that recognizes a language strictly larger than the language recognized by the corresponding CFG. The following lemma formalizes this observation.

Lemma 2.3 Let the construction algorithm $C$ that generates DMAP-Nets from CFG's be as specified in Lemma 2.2. Let $C(G) = D$, where $G$ is a CFG and $D$ is a DMAP-Net. Then there exists a CFG $G'$ such that $L(G') \subset L(C(G'))$.

Proof: Let $G'$ have the following productions: $S \rightarrow ab$ and $S \rightarrow aSb$, i.e., $L(G') = \{a^n b^n \mid n \geq 1\}$. Let $D' = C(G')$. By the definition of activation, $L(D') = \{a^n b^n \mid n \geq 1\} \cup \{a^n b \mid n \geq 1\}$. $\Box$
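The construction in the proof of Lemma 2.1 is easy to mechanize. The following Python sketch emits the productions of $P_1$, $P_2$, and $P_3$ for a DMAP-Net given as recognition sequences, isa edges, and an accepting set $I'$; the input encoding is the same hypothetical one used in the earlier sketches, and the action specifications derived from callbacks are omitted.

def dmap_net_to_cfg(r_sequences, isa_edges, accepting_frames, start="S"):
    """Build CFG productions (lhs, rhs) from a DMAP-Net, per Lemma 2.1."""
    productions = []
    for spec, abstraction in isa_edges:            # P1: abstraction -> specialization
        productions.append((abstraction, [spec]))
    for frame, sequences in r_sequences.items():   # P2: frame -> each r-sequence
        for r in sequences:
            productions.append((frame, list(r)))
    for frame in accepting_frames:                 # P3: S -> each frame in I'
        productions.append((start, [frame]))
    return productions

# Tiny example: a turn-command fragment of Figure 1 (hypothetical encoding).
prods = dmap_net_to_cfg(
    r_sequences={"M-TURN-COMMAND": [["turn", "M-DIRECTION", "M-ANGLE"]],
                 "M-DIRECTION": [["left"], ["right"]],
                 "M-ANGLE": [["twenty"], ["twenty", "degrees"]]},
    isa_edges=[("M-TURN-COMMAND", "M-COMMAND")],
    accepting_frames=["M-COMMAND"])
for lhs, rhs in prods:
    print(lhs, "->", " ".join(rhs))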
Figure 2: Merlin.
Figure 3: Pioneer 2DX robot.
Figure 4: Pioneer 2DX Hardware.

3 Mapping Inputs to Knowledge Structures

Now we consider how CFCG's can be used in VCI's to map inputs to knowledge structures. Suppose that we want to build a VCI to a 3T mobile robot. One of the robot's physical abilities that the VCI needs to reference is turning a certain number of degrees left or right. A standard VCI carries out the reference in two steps [4]. First, an audio stream uttered by the user is mapped into a symbolic representation of the user's utterance, e.g., a set of symbols or a string. Second, the symbolic representation is used to activate the goals in the agent's internal representation, e.g., a DMAP-Net. For example, if the agent uses the DMAP-Net given in Figure 1, M-TURN-COMMAND is activated on such inputs as "turn left twenty," "turn left twenty degrees," "turn right thirty," or "turn right thirty degrees." Once M-TURN-COMMAND is activated, a callback associated with that node installs an appropriate goal on the RAP sequencer's agenda. Given the recognition equivalence of DMAP-Nets and CFG's, we can construct a VCI that uses a CFCG to do goal identification as a by-product of speech recognition. Thus, only the appropriate goal is sent to the RAP sequencer. The productions of the CFCG are as follows:

M-TURN-COMMAND => turn M-ANGLE M-DIRECTION | turn M-DIRECTION M-ANGLE
    :: execute-goal(turn, M-ANGLE, M-DIRECTION)
M-ANGLE => M-NUMBER | M-NUMBER degrees
M-NUMBER => ten | twenty ...
M-COMMAND => M-TURN-COMMAND
...

In the above CFCG, the nonterminals are capitalized and the terminals are in lower-case letters. The double colon in the first production separates the right-hand side of the production from an action specification. In this case, the action specification denotes a goal that will be installed on the RAP executive's agenda should the rule recognize the voice input. For example, if the voice input is "turn left twenty degrees," the RAP executive receives the following goal: (turn -20 100), which means that the robot should turn left twenty degrees with a speed of 100 mm/sec. The key point here is that the symbol interpretation that typically occurs through the DMAP-Net is bypassed because it is no longer necessary. In effect, the agent's conceptual memory now consists of a set of context-free command and control productions.

4 Implementation

Our VCI uses the Microsoft Speech API (SAPI) 5.1 freely available from www.microsoft.com/speech. SAPI couples an HMM-based recognition engine with a system for constraining recognized speech with a CFCG. It provides speaker-independent, relatively noise-robust speech recognition. The grammar to be recognized is defined by an XML Document Type Definition (DTD). Here are three rules from the XML grammar used in the VCI to the Pioneer robot.

<RULE NAME="M-TURN-COMMAND" TOPLEVEL="ACTIVE">
  <P>turn</P>
  <RULEREF NAME="M-DIRECTION"/>
  <RULEREF NAME="M-ANGLE"/>
</RULE>

<RULE NAME="M-DIRECTION">
  <L PROPNAME="direction">
    <P VALSTR="left">left</P>
    <P VALSTR="right">right</P>
  </L>
</RULE>

<RULE NAME="M-ANGLE">
  <L PROPNAME="angle">
    <P VALSTR="10">ten</P>
    <P VALSTR="20">twenty</P>
    <P VALSTR="30">thirty</P>
  </L>
  <O>degrees</O>
</RULE>

Given the above XML CFCG, we can extract semantic information from the parsing process: the values of the properties associated with the production rules are the values that populate the slots in the associated frame.
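As an illustration of this step, the sketch below builds the goal from the extracted property values. It is a hypothetical Python fragment, not the paper's implementation: the property names mirror the PROPNAME values in the XML rules above, the left-negative angle convention and the default speed of 100 mm/sec follow the (turn -20 100) example in section 3, and the SAPI and socket plumbing is omitted.

def turn_properties_to_goal(props, speed=100):
    """Map extracted rule properties, e.g. {"direction": "left", "angle": "20"},
    to the goal tuple handed to the executive."""
    angle = int(props["angle"])
    if props["direction"] == "left":   # left turns are encoded as negative angles
        angle = -angle
    return ("turn", angle, speed)

print(turn_properties_to_goal({"direction": "left", "angle": "20"}))
# -> ('turn', -20, 100)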
Figure 5: VCI to the Pioneer 2DX mobile robot.

The VCI to the Pioneer robot is shown in Figure 5. The VCI runs in Allegro Common Lisp (ACL) 5.0.1. The top window is the ACL menu bar. The bottom window is the ACL interpreter (Debug Window). The leftmost window in the middle is the GUI to the Saphira library freely available from www.activmedia.com. Saphira is a collection of C routines that directly interface to the robot's hardware. The window to the right of the Saphira window is our current implementation of the speech engine interface. It is implemented as a palm pilot interface because of our hope that eventually the human operator will be able to use a handheld device to interact with the robot by voice. The window to the left of the speech engine interface is the video feed from the robot's camera, which allows the user to see what the robot is seeing.

Figure 6: A RAP for turning the Pioneer 2DX robot.

The operator enters voice inputs through a push-to-talk event, i.e., by clicking a button in the palm pilot interface. After a voice input is recognized by a rule, a Visual Basic function implemented as part of the palm pilot interface extracts the attributes from the derivation tree whose leaves are the input words. For example, if the input is "turn left twenty," the value of the attribute Direction is "left" while the value of the attribute Angle is "20." Once this information is extracted from the derivation tree, the voice recognition module of the VCI packs it into a message and sends it to a voice skill server through a TCP/IP socket.

The voice skill server is part of the skill system of the RAP executive and is started when the robot wakes up. The messages from the VCI are automatically queued on the server's message queue. The server checks the message queue at regular intervals of adjustable frequency. When a message is dequeued, the voice skill maps the message into a goal and then installs the goal on the RAP executive's agenda. Thus, the executive can integrate the goal into the robot's current task networks unless, of course, the goal tells it to stop all activities. For example, when the operator utters "turn left twenty," the goal that is installed on the RAP executive's agenda is (turn -20 100), which leads to the execution of the RAP given in Figure 6.
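The queue-polling behavior described above can be sketched as follows. This is a hypothetical Python fragment; the real voice skill server is part of the RAP executive's skill system, and the message format, the install_goal callback, and the stop sentinel are assumptions made only for the sketch.

import queue
import time

def run_voice_skill_server(messages, install_goal, poll_interval=0.5):
    """Poll the message queue at an adjustable interval, map each dequeued
    message to a goal, and hand the goal to the RAP executive."""
    while True:
        try:
            msg = messages.get_nowait()   # e.g. {"command": "turn", "direction": "left", "angle": 20}
        except queue.Empty:
            time.sleep(poll_interval)     # adjustable checking frequency
            continue
        if msg is None:                   # sentinel used here to stop the sketch
            return
        angle = -msg["angle"] if msg["direction"] == "left" else msg["angle"]
        install_goal((msg["command"], angle, 100))   # e.g. (turn -20 100)

q = queue.Queue()
q.put({"command": "turn", "direction": "left", "angle": 20})
q.put(None)
run_voice_skill_server(q, install_goal=print)   # prints ('turn', -20, 100)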
The VCI to the Merlin software agent looks almost exactly like the Pioneer's VCI except that the Saphira window and the video feed window are not shown because they are not necessary. Everything else works exactly like it does in the Pioneer's VCI.

5 Future Work

We believe that with the appropriate restriction on the definition of activation for DMAP-Nets we will be able to show that the two language classes DMAPL and CFL coincide. We also wish to show that under meaningful assumptions we can reduce an arbitrary semantic network to a DMAP-Net; this would argue for the adoption of DMAP-Nets as a canonical representation of semantic networks. This, coupled with the equivalence results, will serve as an important means of bridging the formalism divide between speech recognition and symbol interpretation. Furthermore, this equivalence will also allow us to use the rich theory of CFLs to characterize more fully the world of semantic networks. Finally, from an HCI perspective, we wish to examine how the grammars produced by the construction described in this paper actually function in real-world VCIs.

Although, unlike other VCI solutions to autonomous robots [2], our solution addresses the task interruption problem and does not ignore any user commands, it still has the push-to-talk problem. The human operator must explicitly notify the voice control interface before a voice command is given, which makes communication unnatural and uncomfortable. Our future research effort will focus on eliminating the push-to-talk event altogether. The speech recognition engine will be operating continuously. If the operator's input is not understood, it will either be ignored or the system will notify the operator that the input was not recognized.

Another limitation of our VCI is that the messages it sends to the voice skill server do not directly encode goals that can be readily installed on the RAP executive's agenda. Therefore, some interpretation must still be done on the RAP executive's side. A future release of the system will eliminate that step so that the voice skill server will not have to translate messages into goals.

6 Conclusion

We investigated the input recognition capacities of DMAP-Nets with respect to CFLs. We argued that the input recognition capacity of CFG's is partially equivalent to the input recognition capacity of DMAP-Nets. We used the construction inherent in our argument to build voice control interfaces to two autonomous agents. Our approach utilized the recent advances in speech recognition that enhance HMM-based voice input recognition with CFCG's. We showed how the voice inputs are mapped to the agents' goals that, in turn, enable and disable the agents' behaviors.

References

[1] Bonasso, R.P., Firby, R.J., Gat, E., Kortenkamp, D., and Slack, M. "Experiences with an Architecture for Intelligent, Reactive Agents," Journal of Experimental and Theoretical Artificial Intelligence, Vol. 9(2), pp. 237-256, 1997.

[2] Choset, H. and Kortenkamp, D. "Path Planning and Control for AERCam, a Free-Flying Inspection Robot in Space," ASCE Journal of Aerospace Engineering, 1999.

[3] Firby, R. J. "Adaptive Execution in Complex Dynamic Worlds," PhD thesis, Yale University, Department of Computer Science, 1989.

[4] Fitzgerald, W. and Firby, R. J. "Dialogue Systems Require a Reactive Task Architecture," Proceedings of the 2000 AAAI Spring Symposium, AAAI Press, 2000.

[5] Fitzgerald, W. and Wiseman, J. "Approaches to Integrating Natural Language Understanding and Task Execution in the DPMA AERCam Testbed," White paper, Neodesic, Inc., 1997.

[6] Jurafsky, D. and Martin, J. Speech and Language Processing, Prentice Hall, New Jersey, 2000.

[7] Kulyukin, V. and Steele, A. "Instruction and Action in the Three-Tier Robot Architecture," In submission to the International Symposium on Robotics and Automation (ISRA-2002), Toluca, Mexico, 2002.

[8] Kulyukin, V. and Morley, N. "Integrated Object Recognition in the Three-Tiered Robot Architecture," Proceedings of the International Conference on Artificial Intelligence (IC-AI 2002), IEEE Computer Society, 2002.

[9] Kulyukin, V. and Zoko, A. "Hamming Distance for Object Recognition by Mobile Robots," Proceedings of CTIRS 2000, the Research Symposium sponsored by the DePaul University School of Computer Science, Telecommunications, and Information Systems, DePaul University Press, 2000.

[10] Kulyukin, V. and Settle, A. "Ranked Retrieval with Semantic Networks and Vector Spaces," Journal of the American Society for Information Science and Technology, 52(4): pp. 1124-1233, 2001.

[11] Kulyukin, V. and Bookstein, A. "Integrated Object Recognition with Extended Hamming Distance," Technical Report, School of Computer Science, Telecommunications and Information Systems, DePaul University, 2001.

[12] Martin, C. "Direct Memory Access Parsing," Technical Report CS93-07, The University of Chicago, Department of Computer Science, 1993.

[13] Schank, R. Dynamic Memory, Cambridge University Press, New York, 1980.