Multimodal Interaction for
Distributed Interactive Simulation
Philip R. Cohen, Michael Johnston, David McGee, Sharon Oviatt,
Jay Pittman, Ira Smith, Liang Chen and Josh Clow
Center for Human Computer Communication
Oregon Graduate Institute of Science and Technology
http://www.cse.ogi.edu/CHCC
Presenter: Keita Fujii
Overview
• Background: Military simulation
– LeatherNet
• QuickSet: Multimodal interface for simulation
– Architecture
– Technical Issues
• Gesture recognition
• Multimodal integration
• Agent infrastructure
– Lessons learned
Background
• U.S. government is developing large-scale military simulation
capabilities
– >50,000 entities (e.g., a vehicle or a person) in a simulation
• LeatherNet
– Virtual simulation system for training platoon leaders and company
commanders
– Based on ModSAF (Modular Semi-Automated Forces) simulator
– Supports CommandVu
• Wall-sized virtual reality display
Required Interface
• Simulation interface should provide the following
operations
– Create entities
– Supply their initial behavior
– Interact with the entities
– Review the results
• Simulation interface should be
– Multimodal: because the number of entities is large
– On a portable-size device: for mobility and affordability
QuickSet
• QuickSet
– Multimodal interface for LeatherNet
– Offers speech and pen-based gesture input
– Runs on a 3-lb hand-held PC
– Based on the Open Agent Architecture
Architecture
[Architecture diagram: the QuickSet interface, speech recognition, gesture recognition, natural language, multimodal integration, Web display, simulation, CommandVu, application bridge, and CORBA bridge agents communicate through the Open Agent Architecture; the CORBA bridge agent connects to a CORBA architecture, and the simulation agent connects to the ModSAF simulator.]
Architecture
• QuickSet interface
– Draws the map, icons, and entities
– Activates the speech and gesture recognition agents when the pen is placed on the screen
• Speech recognition agent
– IBM's VoiceType Application Factory
• Gesture recognition agent
– Analyzes pen input and gives an N-best list of possible interpretations
Architecture
• Natural language agent
– Analyzes the natural language input from the speech recognition agent and produces typed feature structures
• Multimodal integration agent
– Accepts typed feature structures from the language agent and the gesture agent, unites those structures, and produces a multimodal interpretation
Architecture
• Simulation agent
– Serves as the communication channel between OAA agents and the ModSAF simulation system
• CommandVu agent
– CommandVu is also implemented as an agent so that the same multimodal interface (speech and gesture) can be used in CommandVu
Architecture
• Application bridge agent
– Bridges the APIs of the various applications, such as ModSAF and CommandVu
• Web display agent
– Allows a user to manipulate the ModSAF simulation through a Java applet in a WWW browser
• CORBA bridge agent
– Converts OAA messages to CORBA IIOP/GIOP
Gesture recognition
• QuickSet’s pen-based gesture recognizer
– Consists of a neural network and hidden Markov models
– Combines the results from the two recognizers
• To yield probabilities for each of the possible interpretations
[Example: for one pen gesture, the neural net yields Route 0.4, Area 0.2, Tank 0.01; the hidden Markov models yield Route 0.7, Area 0.1, Tank 0.1; the combined result is Route 0.6, Area 0.1, Tank 0.01. A minimal combination sketch follows.]
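A minimal sketch (in Python) of merging the two recognizers' scores into a single N-best list. The slides do not give the actual combination formula, so the weighted average and the label names below are assumptions, not QuickSet's method.

```python
# Minimal sketch: combine two recognizers' per-label scores into one N-best list.
# The weighting scheme (simple weighted average) is an assumption; the slides
# only say the two result lists are combined into per-interpretation probabilities.

def combine_nbest(nn_scores, hmm_scores, w_nn=0.5):
    """Merge per-label scores from the neural net and the HMM recognizer."""
    labels = set(nn_scores) | set(hmm_scores)
    combined = {
        label: w_nn * nn_scores.get(label, 0.0) + (1 - w_nn) * hmm_scores.get(label, 0.0)
        for label in labels
    }
    # Return interpretations ranked best-first (an N-best list).
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

nn = {"route": 0.4, "area": 0.2, "tank": 0.01}    # neural-net scores from the slide
hmm = {"route": 0.7, "area": 0.1, "tank": 0.1}    # HMM scores from the slide
print(combine_nbest(nn, hmm))                     # "route" ranked first
```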
Multimodal integration
• Based on a unification operation over typed feature structures
– If two pieces of partial information can be combined without losing their consistency, combine them into a single result
[Example: speech recognition yields "operation: draw_line" with the line unspecified; gesture recognition yields a point (10,10) and a line (10,10)-(20,20); unification produces "operation: draw_line, line: (10,10)-(20,20)". A minimal unification sketch follows.]
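A minimal sketch of unification over dict-based feature structures, illustrating the draw_line example above; QuickSet's actual typed feature structures and type hierarchy are richer than this.

```python
# Minimal sketch of unification over (untyped) dict-based feature structures.
# QuickSet uses typed feature structures; the typing is omitted here.

class UnificationFailure(Exception):
    pass

def unify(a, b):
    """Combine two partial descriptions; fail if they are inconsistent."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            result[key] = unify(result.get(key), value)
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")

speech = {"operation": "draw_line", "line": None}            # from the slide example
gesture = {"line": {"start": (10, 10), "end": (20, 20)}}     # from the slide example
print(unify(speech, gesture))   # a single multimodal interpretation
```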
Agent infrastructure
• Open Agent Architecture
– All communication among the agents takes place through the facilitator agent
• When an agent registers with the facilitator agent, it supplies a list of goals it can
solve
• Agents post goals to be solved to the facilitator agent
• The facilitator agent forwards the goals to the agents that can solve them
– Uses ICL (Interagent Communication Language)
• Similar to KQML (Knowledge Query and Manipulation Language) and KIF
(Knowledge Interchange Format)
[Diagram: agents register the goals they can solve with the facilitator; agents post goal requests to the facilitator, which forwards them to the agents that can solve them. A minimal facilitator sketch follows.]
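A minimal sketch of the facilitator pattern described above, assuming goals are plain strings; OAA's ICL goals, Prolog-style matching, and distributed messaging are not modeled here.

```python
# Minimal sketch of a facilitator-style dispatcher: agents register the goals
# they can solve, and posted goals are forwarded to matching agents.

from collections import defaultdict

class Facilitator:
    def __init__(self):
        self._solvers = defaultdict(list)   # goal name -> registered agents

    def register(self, agent, goals):
        """An agent declares the goals it can solve."""
        for goal in goals:
            self._solvers[goal].append(agent)

    def post(self, goal, data):
        """Forward a posted goal to every agent that registered for it."""
        return [agent.solve(goal, data) for agent in self._solvers[goal]]

class GestureAgent:
    def solve(self, goal, data):
        return f"gesture interpretation of {data!r}"

facilitator = Facilitator()
facilitator.register(GestureAgent(), ["interpret_gesture"])
print(facilitator.post("interpret_gesture", "pen stroke"))
```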
Lessons learned
• Open Agent Architecture
– Does not provide features for authentication or locking
• Locking would prevent one user's speech from being combined with another user's gesture
– Does not support multithreading
• Cannot support a large number of users
– The facilitator agent is a centralized architecture
• Not scalable
• Multimodal interface
– QuickSet shows that multimodal interaction offers the possibility of more robust recognition
ANIMATED CONVERSATION:
Rule-based Generation of Facial Expression, Gesture &
Spoken Intonation for Multiple Conversational Agents
Justine Cassell, Catherine Pelachaud, Norman Badler, Mark
Steedman, Brett Achorn, Tripp Becket, Brett Douville, Scott
Prevost, Matthew Stone
Department of Computer & Information Science
University of Pennsylvania
Presenter: Keita Fujii
Overview
• Introduction
• Background
– Facial expression
– Hand gesture
• System Architecture
– Speech generation
– Gesture Generation
– Facial Expression Generation
Introduction
• This paper presents "automatically animating conversations between multiple human-like agents"
– With speech, intonation, facial expressions, and hand gestures
– Those expressions are synthesized to make the agents look more realistic
Facial expression
• Facial expressions can perform
– Syntactic functions
• Accompanies the flow of speech
– E.g., nodding the head, blinking
– Semantic functions
• Emphasizes a word
• Substitutes for a word
• Refers to an emotion
– E.g., smiling and say “it is a NICE DAY.”
– Dialogic functions
• Regulate the flow of speech
– Mutual gaze for smooth conversation turns
Hand gesture
• Hand gestures can be categorized as
– Iconics
• Represent some feature of the word
– E.g., trace a rectangular shape while saying "a CHECK"
– Metaphorics
• Represent an abstract feature/concept
– E.g., form a jaw-like shape with a hand and pull it while saying "I can WITHDRAW fifty dollars"
– Deictics
• Indicate a point in space
– E.g., point to the ground and say "THIS bank"
– Beats
• Hand waves that occur with emphasized words, etc.
– E.g., wave a hand while saying "all right"
• Hand gestures, facial expressions, eye gaze and speech
need to be synchronized
System Architecture
[Pipeline diagram: the Dialog Planner, drawing on the World and Agent Model, produces symbolic gesture and intonation specifications; the Speech Synthesizer returns phoneme timings; gesture and utterance synchronization drives the Gesture PaT-Net and Facial PaT-Net, whose movement specifications feed the Animation System, producing sound and graphic output.]
Speech Generation
• Dialog planner
– Generates dialogs
• Based on common knowledge, the agent's goals, and its beliefs
– Dialog includes
• The timing of the phonemes and pauses
• The type and place of the accents
• The type and place of the gestures
• Speech Synthesizer
– Generates sound data from the dialogs
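A minimal sketch of the kind of annotated utterance a dialog planner might hand to the synthesizer and gesture modules; the field names here are assumptions, since the slides only list phoneme/pause timings, accents, and gestures.

```python
# Minimal sketch of an annotated utterance carrying phoneme timings, accents,
# and gesture annotations per word. Field names are illustrative assumptions.

from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class WordAnnotation:
    word: str
    accent: str | None = None        # e.g. a pitch accent on an emphasized word
    gesture: str | None = None       # e.g. "iconic", "metaphoric", "deictic", "beat"
    phoneme_timings: list[tuple[str, float]] = field(default_factory=list)

@dataclass
class Utterance:
    speaker: str
    words: list[WordAnnotation]

utterance = Utterance(
    speaker="agent_A",
    words=[
        WordAnnotation("I", phoneme_timings=[("AY", 0.12)]),
        WordAnnotation("can"),
        WordAnnotation("WITHDRAW", accent="pitch_accent", gesture="metaphoric"),
        WordAnnotation("fifty"),
        WordAnnotation("dollars"),
    ],
)
```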
Gesture generation
• Gesture is generated through three steps
A) Symbolic Gesture Specification
– Decides what type of gesture to use for each word
B) PaT-Nets (Parallel Transition Networks)
– Determine shape, position, transition, and timing of gestures
C) Gesture Generator
– Generates actual motion from the information sent by the PaT-Nets
Symbolic Gesture Specification
• Determines the type of gesture
– Words with literally spatial content ("check") → iconic
– Words with metaphorically spatial content ("account") → metaphoric
– Words with physically spatializable content ("this bank") → deictic
– Other new references → beat
– Also based on the annotations from the dialog planner and the classification of the reference (new to speaker and listener, new to speaker but not to listener, or old)
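A minimal rule-table sketch of the content-to-gesture-type mapping above, assuming each word is already tagged with a content class; the real system also uses the dialog planner's annotations and the newness of the reference.

```python
# Minimal sketch of the content-to-gesture-type rules from this slide.
# The content-class tags are illustrative assumptions.

GESTURE_TYPE = {
    "literal_spatial":    "iconic",      # e.g. "check"
    "metaphoric_spatial": "metaphoric",  # e.g. "account"
    "spatializable":      "deictic",     # e.g. "this bank"
}

def choose_gesture(content_class, is_new_reference):
    """Map a word's content class to a gesture type (None means no gesture)."""
    if content_class in GESTURE_TYPE:
        return GESTURE_TYPE[content_class]
    return "beat" if is_new_reference else None

print(choose_gesture("literal_spatial", True))   # iconic
print(choose_gesture("other", True))             # beat
```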
PaT-Nets
• A PaT-Net is a finite state machine
– Each state represents an action to be invoked
– State transitions are made either conditionally or probabilistically
– Thus, traversing the network generates a sequence of actions
• The Gesture PaT-Net generates gestures; the Facial PaT-Net generates facial expressions
[Diagram: from a parsing state, a "beat signaled" event sends the beat to the beat PaT-Net; when gesture info is found it is collected, and once complete it is sent to the gesture PaT-Net. A minimal state-machine sketch follows.]
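A minimal sketch of a PaT-Net-style state machine whose transitions fire either on a condition or with a probability; this is an illustration, not the paper's PaT-Net implementation, and the state and context names are assumptions.

```python
# Minimal sketch of a finite state machine with conditional and probabilistic
# transitions, each state carrying an action to invoke when entered.

import random

class PaTNetNode:
    def __init__(self, name, action=None):
        self.name = name
        self.action = action            # callable invoked when the state is entered
        self.transitions = []           # (condition, probability, target) triples

    def add_transition(self, target, condition=None, probability=None):
        self.transitions.append((condition, probability, target))

    def step(self, context):
        if self.action:
            self.action(context)
        for condition, probability, target in self.transitions:
            if condition is not None and condition(context):
                return target
            if probability is not None and random.random() < probability:
                return target
        return self                     # stay in the current state

# Example: a "parsing" state that hands gesture info to the gesture PaT-Net.
parsing = PaTNetNode("parsing")
send_gesture = PaTNetNode(
    "send_gesture",
    action=lambda ctx: print("send gesture info:", ctx["gesture"]),
)
parsing.add_transition(send_gesture, condition=lambda ctx: ctx.get("gesture") is not None)

state = parsing.step({"gesture": "iconic"})   # transitions to send_gesture
state.step({"gesture": "iconic"})             # invokes its action
```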
Coarticulation
• The structure of the PaT-Net allows coarticulation
– Two gestures occur without an intermediary relaxation
• I.e., the next gesture starts without waiting for the first one to finish
– Coarticulation occurs when there is not sufficient time to finish a gesture
[Diagram: from a pausing state, gesture A starts; gesture B starts before gesture A finishes, skipping the relaxation in between. A minimal timing sketch follows.]
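A minimal sketch of the coarticulation decision: if there is not enough time to relax before the next gesture starts, skip the relaxation and blend straight into the next gesture. The timing threshold is an assumption.

```python
# Minimal sketch of deciding between relaxing the arm and coarticulating.
# The relaxation duration is an assumed constant, not a value from the paper.

RELAX_DURATION = 0.4   # assumed seconds needed to return to a rest pose

def plan_transition(current_gesture_end, next_gesture_start):
    gap = next_gesture_start - current_gesture_end
    if gap < RELAX_DURATION:
        return "coarticulate"           # start the next gesture without relaxing
    return "relax_then_gesture"

print(plan_transition(1.8, 2.0))   # coarticulate
print(plan_transition(1.8, 3.0))   # relax_then_gesture
```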
Gesture Generator
• The animation of a gesture is created as a combination of
– Hand shape
– Wrist control
– Arm positioning
• The system tries to get as close as possible to the gesture goals, but may fail because of coarticulation effects
Facial expression generation
• Facial expression is generated through the same steps as gesture
A) Symbolic Facial Expression/Gaze Specification
– Decides what type of expression to use for each word
B) Facial/Gaze PaT-Nets
– Determine shape, position, transition, and timing of the expressions
C) Facial Expression/Gaze Generator
– Generates actual motion from the information sent by the PaT-Nets
Symbolic Facial Expression/Gaze
Specification
• Symbolic Facial Expression Specification
– Generates facial expressions connected to intonation
• Symbolic Gaze Specification
– Generates the following types of gaze expression
• Planning
– E.g., look away while organizing thought
• Comment
– E.g., look toward the listener when asking a question
• Control
– E.g., gaze at the listener when ending speech
• Feedback
– E.g., look toward the listener to obtain feedback
PaT-Nets
• Facial expression PaT-Nets
– No information in the paper
• Gaze PaT-Net
– Each node is characterized by a probability
• A node's action is invoked probabilistically
[Diagram: gaze PaT-Net with planning, comment, control, and feedback nodes, triggered by dialog events such as beginning of turn, within turn, short turn, accent, utterance of a question or an answer, end of turn, turn request, back channel, and configuration signal.]
Facial Expression/Gaze Generator
• Facial expression generator
– Classifies an expression into functional groups
• Lip shape, conversational signal, punctuator, manipulator and
emblem
– Uses FACS (Facial Action Coding System)
• Represents an expression as a pair of timing and type
• Gaze and head motion generator
– Generates motion of eye and head
• Based on the direction of gaze, timing, and duration
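A minimal sketch of representing facial-expression events as (type, timing) pairs grouped into the functional classes listed above; the data layout and field names are assumptions.

```python
# Minimal sketch of facial-expression events as (type, timing) pairs, grouped
# by the functional classes named on the slide. The layout is illustrative.

from dataclasses import dataclass

@dataclass
class ExpressionEvent:
    functional_group: str   # "lip_shape", "conversational_signal", "punctuator",
                            # "manipulator", or "emblem"
    expression_type: str    # e.g. a FACS action-unit label such as "AU1+AU2"
    start: float            # seconds, aligned to the phoneme timings
    duration: float

events = [
    ExpressionEvent("conversational_signal", "eyebrow_raise", start=0.8, duration=0.3),
    ExpressionEvent("punctuator", "blink", start=1.4, duration=0.1),
]
```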
Direct Manipulation
vs
Interface Agents
Ben Shneiderman and Pattie Maes
Interactions, Nov. and Dec. 1997
Presenter: Keita Fujii
Introduction
• This article is about a debate session in IUI* 97 and CHI** 97
• Topic
– Direct Manipulation vs Interface Agent
• Speakers
– Ben Shneiderman
• From University of Maryland, Human-Computer Interaction Lab
• Proponent of Direct Manipulation
– Pattie Maes
• MIT Media Laboratory
• Proponent of Intelligent Agent
* Intelligent User Interface Workshop **Conference on Human Factors in Computing Systems
Overview
• Direct Manipulation
• Software Agent
– Benefits
– Criticisms
– Misconceptions
• Objections to agent system
• Agreement
• Q&A
Direct Manipulation
(Ben Shneiderman)
• User interface using information visualization
techniques that provides
– Overview
• How much /what kind of information is in the system
– Great control
• E.g., zoom in, scroll, filter out
– Predictability
• The user can expect what's happening next
– Detail-on-demand
• Benefits
– Reduces errors and encourages exploration
Examples of Direct Manipulation
• FilmFinder
– Organizes movies in a 2D plane by year and popularity
• Lifeline
– Shows a case history graphically
• Visible Human Explorer
– Displays coronal section and
cross sections of a human body
Software agent
(Pattie Maes)
• Software agent is the program that is
– Personalized
• Knows the individual user’s habits, preferences, and interests
– Proactive
• Provides or suggests information to user before being requested
– Long-lived
• Keeps running autonomously
– Adaptive
• Monitors the user's interests as they change over time
– Delegate
• User can delegate some task to the agent
• Agent acts on the user’s behalf
Examples of Software Agent
• Letizia
– Pre-loads web pages that the user may be interested in
• Remembrance Agent
– Remembers who sent an email and whether it has been replied to
• Firefly
– Personal filters / personal critics
• Yenta
– Matchmaking agent
– Introduces another user who shares the same interests
Benefits of software agent
(Pattie Maes)
• Software agents are necessary because
– The computer system is getting more complex,
unstructured, and dynamic
• E.g., WWW
– The users are becoming more naïve
• End users are not trained to use computers
– The number of tasks to be managed with computers is increasing
• Some tasks need to be delegated to somebody
Criticisms of agents
(Pattie Maes)
• Well-designed interfaces are better
– Even if the interface is perfect, you may just not want to do some tasks yourself and would rather delegate them to somebody
• Agents make the user dumb
– Yes, it’s true. But as long as there’s always an agent available, it’s not a
problem
• Using agents implies giving up all control
– You don’t have to have full control. As long as your task is
satisfactorily done, that’s fine
– However, the system must allow the user to choose between direct manipulation and delegating the task to the agent
Misconceptions about agent
(Pattie Maes)
• Agents replace the user interface
• Agents need to be personified or anthropomorphized
• Agents need to rely on traditional AI
→ None of these are true
Objections to Agent System
and Responses to the Objections
(Both)
• “Agent” is not a realistic solution for making a good user
interface because
– Agents cannot be smart and fast enough to make intelligent decisions for the human user
• Direct manipulation is for
– Professional users → not for end users
– Very well-structured and organized domains → not for ill-structured and dynamic domains
• Agent system can cooperate with Direct Manipulation
– E.g., FilmFinder with agent making movie suggestions
• Anthropomorphic interfaces/representation are not
appropriate
– Agents do not have to be visible
• There are no visible "agents" on the Firefly Web site
– “Agent” has a broader meaning than “software agent,” so
you need to distinguish different types of “agents”
• Autonomous robots, synthetic characters, software agents etc
Direct manipulation & agent system
(Both)
• Agent is NOT an alternative but a complementary technique to
direct manipulation (interface)
– Agent system needs a good user interface that provides good
understanding (overview) and control
– Agent designer must pay attention to user-interface issues such as
understanding and control
• Two layer model
– The user interface level
• Predictable and controllable
– The agent level
• Adaptive, proactive system to increase
usability
[Diagram: the visualization / user-interface layer sits on top of the agent system layer.]
Q&A
• Q. How do speech technologies affect direct
manipulation and agent system?
– A. Speech won’t be a generally usable tool because
• It disrupts the cognitive process
• It is a low-bandwidth communication channel
• It is ambiguous
– A. Speech can be used as a supportive medium
• Q. How can user interface and/or agent system
support time-critical decision-support
environment where mistakes are critical?
– A. An agent system is not suitable for such a system because it is very hard to make agents that never make mistakes
• Q. How can we build a direct manipulation system
for vision challenged or blind users?
– A. Direct manipulation can be used to make an interface for such users because it depends on spatial relationships, and blind users are often strong at spatial processing
• Q. What is it about agents that you dislike? (to Ben
Shneiderman)
– A. The "intelligent agent" notion avoids dealing with interface issues, but this will change
So, where did they cheat???