Abstracting and Composing High-Fidelity Cognitive Models
of Multi-Agent Interaction
MURI Kick-Off Meeting
August 2008
Christian Lebiere
David Reitter
Psychology Department
Carnegie Mellon University
Main Issues
• Understand scaling properties of cognitive
performance
– Most experiments look at a single performance point rather
than as a function of problem complexity, time pressure, etc.
– Key component in abstracting performance at higher levels
• Understand interaction between humans and machines
– Most experiments study and model human performance under
a fixed scenario that misses key dynamics of interaction
– Key aspect of both system robustness and vulnerabilities
• Understand generality and composability of behavior
– Almost all models are developed for specific tasks rather than
assembling larger pieces of functionality from basic pieces
– Key enabler of scaling models and abstracting their properties
Cognitive Architectures
• What is a cognitive architecture?
– Invariant mechanisms to capture generality of cognition (Newell)
– Aims for both breadth (Newell Test) and depth (quantitative data)
• How are they used?
– Develop model of a task (declarative knowledge, procedural
strategies, architectural parameters)
– Limits of model fitting (learning mechanisms, architectural
constraints, reuse of model and parameters)
• ACT-R
– Modular organization, communication
bottlenecks, mapping to brain regions
– Mix of symbolic production system and
subsymbolic statistical mechanisms
ACT-R Cognitive Architecture
[Architecture diagram: Goal/Intentions, Retrieval/Memory, Vision/Visual, and Motor/Manual modules communicating through Productions with the World]
• Subsymbolic equations:
– Activation: A_i = B_i + Σ_j W_j · S_ji
– Base-level learning: B_i = ln(Σ_j t_j^(−d))
– Retrieval latency: T_i = F · e^(−A_i)
– Utility: U_i = P_i · G − C_i + ε (expected gain with noise)
– Utility learning: P_i = Succ_i / (Succ_i + Fail_i)
• Sample production:
IF the goal is to categorize a new stimulus
and visual holds stimulus info S, F, T
THEN start retrieval of chunk S, F, T
and start manual mouse movement
[Sample chunks: stimulus attributes Size, Fuel, Turb paired with a Decision]
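To make these equations concrete, here is a minimal Python sketch of the subsymbolic computations above (not the actual ACT-R implementation, which is written in Lisp; the noise model and default parameter values are assumptions):

import math
import random

def base_level(lags, d=0.5):
    # Base-level learning: B_i = ln(sum_j t_j^-d), over the time lags t_j
    # since each past presentation of chunk i
    return math.log(sum(t ** -d for t in lags))

def activation(B_i, sources, noise_sd=0.25):
    # Activation: A_i = B_i + sum_j W_j * S_ji, plus transient noise
    # (sources is a list of (W_j, S_ji) pairs from the goal context)
    return B_i + sum(W * S for W, S in sources) + random.gauss(0, noise_sd)

def retrieval_latency(A_i, F=1.0):
    # Latency: T_i = F * e^(-A_i); higher activation -> faster retrieval
    return F * math.exp(-A_i)

def utility(successes, failures, G=20.0, C_i=1.0):
    # Utility learning: P_i = Succ_i / (Succ_i + Fail_i)
    # Utility: U_i = P_i * G - C_i (plus noise during production selection)
    P_i = successes / (successes + failures)
    return P_i * G - C_i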
Sample Task: AMBR Synthetic ATC
Model - Methodology
• Model designed to solve task simply and effectively
– Not engineered to reproduce any specific effects
• Reuse of common design patterns
– Makes modeling easier and faster
– Reduces degrees of freedom
• No fine-tuning of parameters
– Left at default values or roughly estimated from data (2 parameters)
• Architecture provides automatic learning of
situation
– Position & status of AC naturally learned from interaction
Model - Methodology II
• As many model runs as subject runs
– Performance variability is an essential part of the task!
– Model speed is essential (5 times real-time in this case)
– Stochasticity is a fundamental feature of the architecture
• Production selection
• Declarative retrieval
• Perception and actions
– Stochasticity amplified by interaction with environment
– Model captures most of the variance of human performance
– No individual variations factored in the model (W, efforts)
Model - Overview
• 5 (simple) declarative chunks encoding instructions
– Associate color to action and penalty
• 36 (simple) productions organized in 5 unit tasks
– Color-Goal (5): top-level goal to pick next color target
– Text-Goal (4): top-level goal to pick next area to scan
– Scan-Text (7): goal to scan text window for new messages
– Scan-Screen (8): goal to scan screen area for exiting AC
– Process (12): processes a target with 3 or 4 mouse clicks
• Unit tasks map naturally to ACT-R goal types and
production matching - a natural design pattern (sketched below)
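As an illustration of that design pattern, here is a minimal Python sketch of goal-type dispatch, where the current goal type restricts production matching to its own unit task; the handler stubs are placeholders, not the model's actual 36 productions:

def make_stub(description):
    # Placeholder standing in for the unit task's productions
    return lambda goal: f"{description}: {goal}"

# Each goal type owns a small set of productions (counts from the model)
UNIT_TASKS = {
    "color-goal":  make_stub("pick next color target"),      # 5 productions
    "text-goal":   make_stub("pick next area to scan"),      # 4 productions
    "scan-text":   make_stub("scan text window"),            # 7 productions
    "scan-screen": make_stub("scan screen for exiting AC"),  # 8 productions
    "process":     make_stub("process target"),              # 12 productions
}

def cycle(goal_type, goal):
    # Production matching is naturally restricted by the current goal type
    return UNIT_TASKS[goal_type](goal)

print(cycle("scan-text", "message window"))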
Flyoff - Performance
[Bar chart: penalty points (0-400) by condition (Color/Text x Low/Mid/High), subjects mean vs. model mean]
• Performance is much better in the color than text condition
• Performance degrades sharply with time pressure for text
• Good fit except for text-high: huge variation with tuneup too
Flyoff - Distribution
[Chart: penalty point distributions (0-700) for Tuneup, Flyoff, and Model in the Mid and High conditions]
• The model can yield a wide range of performances through
retrieval and effort stochasticity and dynamic interaction
• Model variability always tends to be lower than the subjects'
Flyoff - Penalty Profile
[Bar chart: penalty points (0-150) by penalty category (HD, SE, SD, WD, DM, CE, IM, TH) for subjects and model, Mid and High conditions]
• Errors: no speed-change or click errors, but incorrect and
duplicated messages occurring during the handling of holds
• Delays: more holds for high but fewer welcome and speed delays
Flyoff - Latency
[Log-scale plots: RT (sec) vs. number of intervening events (0-6) for subjects and model, Text and Color conditions at Low/Mid/High]
• Response times increase exponentially with the number of
intervening events, and faster in the text than the color condition
• Model is slightly faster in the color but slower in the text condition
Flyoff - Selection
[Plots: number of selections (0-300) vs. number of intervening events (0-6) for subjects and model, Text and Color conditions at Low/Mid/High]
• The number of selections decreases roughly exponentially,
with text starting lower but trailing off longer with a final spike
• Ceiling effect in color condition (mid & high): see workload
Flyoff - Workload
[Bar chart: workload rating (1-10) by condition (Color/Text x Low/Mid/High), subjects vs. model]
• Workload is higher in text condition and increases faster
• Model reproduces both effects but misses the ceiling effect in the
color condition even though it gets it for the selection measure!
Learning Categories
[Learning curves: percent correct (0.0-0.6) across trials 1-8 for humans and model, categories 1, 3, and 6]
• Model learns responses through instance-based categorization
• Learning curve and level of performance reflect the degree of
complexity of the function mapping aircraft characteristics to response
Transfer Errors
• Transfer performance is defined by (linear) similarities between
stimuli values along each dimension (size, fuel, turb.)
• Excellent match to trained instances (better than trial 8!).
• Extrapolated: syntactic priming or non-linear similarities?
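A minimal Python sketch of instance-based categorization with linear similarities, in the spirit of the model described above; the dimension ranges, mismatch penalty, noise level, and example instances here are assumptions, not the model's actual values:

import random

def similarity(a, b, value_range):
    # Linear similarity: 0 for identical values, -1 at maximum distance
    return -abs(a - b) / value_range

def match_score(stimulus, instance, ranges, mismatch_penalty=1.0):
    # Partial matching: penalize mismatches along size, fuel, turbulence
    return mismatch_penalty * sum(
        similarity(stimulus[d], instance[d], ranges[d]) for d in ranges)

def categorize(stimulus, instances, ranges, noise_sd=0.25):
    # Retrieve the (noisily) best-matching stored instance; transfer to
    # new stimuli falls off with similarity along each dimension
    best = max(instances, key=lambda inst: match_score(stimulus, inst, ranges)
                                           + random.gauss(0, noise_sd))
    return best["response"]

# Hypothetical trained instances and a transfer stimulus
ranges = {"size": 2, "fuel": 30, "turb": 2}
instances = [
    {"size": 0, "fuel": 20, "turb": 0, "response": "accept"},
    {"size": 2, "fuel": 20, "turb": 2, "response": "reject"},
]
print(categorize({"size": 1, "fuel": 25, "turb": 1}, instances, ranges))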
Individual Stimuli Predictions
• Good match to probability of accepting individual stimuli for each category
[Scatter plot: model vs. human acceptance probabilities (0.0-1.0) per stimulus for categories 1, 3, and 6]
– Linear fits: y = 0.052 + 0.809x (R^2 = 0.890); y = 0.025 + 0.925x (R^2 = 0.750); y = -0.018 + 1.017x (R^2 = 0.485)
– RMSE: Cat. 1 = 14.1%, Cat. 3 = 13.4%, Cat. 6 = 12.5%
Task Approach
• Use a task similar to AMBR - an AMBR variant, Team
Argus, CMU-ASP (Aegis) - for exploration
• Introduce the team aspect that is implicit in the task by
interchangeably replacing controllers with humans,
models or agents
• Right properties, tractable, scalable even though
somewhat abstract
• Scale model to other domains (UAV control,
Urban Search and Rescue) and environments
(DDD, NeoCities)
• Force model generalization across environments
• Explore fidelity/tractability tradeoffs
Issue 1: Scaling Properties
• Cognitive science is usually concerned with absolute
performance (e.g. latency) at fixed complexity points
– Often less discriminative than scaling properties
• Study human performance at multiple complexity
points to understand scaling and robustness issues
– Scaling provides strong constraints on algorithms and
representations
– Robustness is a key issue in extrapolating individual
performance to multi-agent interaction and overall network
performance, reliability and fault-tolerance
• Quantify impact on all measures of performance
– Converging measures of performance provide stronger
evidence than separate measures susceptible to parametric
manipulation
• Understanding of scaling key to enabling abstraction
Constraints and Analyses
• AMBR illustrated strong cognitive constraints put on the
scaling of performance as a function of task complexity
• Past analyses have shown the impact of:
– Architectural component interactions (Wray et al., 2007)
– Representational choices (Lebiere & Wallach, 2001)
– Parameter settings on dynamic processes (Lebiere, 1998)
[Charts: "Focus Slope = 0.1" and "Matching Retrievals by Focus": chunk retrievals (0-20) and total retrievals (0-200) vs. log chunks, with polynomial, linear, logarithmic, and exponential trend fits across series]
Scaling Experiments
• Study human performance at multiple complexity
points to understand scaling and robustness
issues
– Vary task complexity (e.g. level of aircraft autonomy)
– Vary problem complexity (e.g. number of aircraft)
– Vary information complexity (e.g. aircraft characteristics)
– Vary network topology (e.g. number of controllers)
– Vary rate of change of environment (e.g. appearance or
disappearance of aircraft, weather, network topology)
• Quantify impact on all measures of performance
– Direct performance (number of targets handled, etc)
– Situation awareness (levels, memory-based measures)
– Workload (both self-reporting and physiological
measures)
Issue 2: Dynamic Interaction
• The main problem in developing high-fidelity cognitive
models of multi-agent interaction is the increased
degrees of freedom of open-ended agent interaction
• A methodology has been developed to model multi-agent
interactions in games and logistics (supply chain)
problems (West & Lebiere, 2001; Martin et al., 2004)
– Develop baseline model to capture first-order dynamics
– Replace most HITL with baseline model(s) to reduce DOF
– Refine model based on greater data accuracy and revalidate
• Methodology can be extended to multiple levels of
our hierarchy, each time abstracting to next level
• Also extends to heterogeneous simulations with
mixed levels including HITL, models and agents
Results: Model against Model
[Plots: score differential over plays (0-100) for Lag2 model against itself and Lag2 model against Lag1 model]
• Performance resembles a random walk with widely varying outcomes
• Distribution of streaks hints at fractal properties
• The model with the larger lag will always win in the long run
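A minimal Python sketch of the lag-N idea behind these models, applied to a paper-rock-scissors-style game: predict the opponent's next move from their last N moves, then play the counter. Frequency counts stand in here for chunk activation, which is a simplification of the ACT-R model:

import random
from collections import defaultdict

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

class LagModel:
    def __init__(self, lag):
        self.lag = lag                                   # context length: 1 or 2
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self, history):
        # Most frequent continuation of the opponent's last `lag` moves
        seen = self.counts[tuple(history[-self.lag:])]
        if not seen:
            return random.choice(list(BEATS))
        return max(seen, key=seen.get)

    def play(self, history):
        # Counter the move the opponent is predicted to make next
        return BEATS[self.predict(history)]

    def learn(self, history, actual_move):
        # Store the observed sequence (the analogue of an ACT-R chunk)
        self.counts[tuple(history[-self.lag:])][actual_move] += 1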
Results: Model against Human
[Plots: score differential over plays for human against Lag1 model, and human against Lag1 and Lag2 models]
• Performance of the human against the lag1 model is similar to that against the lag2 model
• Lag2 model takes time to get started because of longer chunks whereas lag1
model starts faster because it uses fewer, shorter chunks
Results: Effects of Noise
[Plots: score difference vs. noise level (0.0-1.0) for Lag2 against Lag2 (high minus low noise) and Lag2 against Lag1, at Lag2 noise = 0, 0.1, 0.25]
• Performance improves sharply with noise, then gradually decreases
• Noise fundamentally alters the dynamic interaction between players
• Noise is essential to adaptation in changing real-world environments
Interactive Alignment
• Tendency of interacting agents to align
communicative means at different levels
(Pickering & Garrod 2004)
• Task success is correlated with alignment
(Reitter & Moore 2007)
• More alignment if interlocutors are perceived to
be non-human
(Branigan et al. 2003)
Micro-Evolution
• Communities will evolve communicative standards
– e.g., Reference to Landmarks,
identification strategies for locations
(e.g., Garrod & Doherty 1994, Fay et al. in press)
[Figure: Garrod & Doherty 1994 - location identification strategies: counting boxes vs. connections]
Micro-Evolution
• Evolutionary dynamics apply
• How do cognitive agents enable and influence
evolution? (Pressure? Heat?)
Autonomous agents
• Can autonomous agents support alignment
and communicative evolution?
• Interaction of humanoid cognitive models
with autonomous agents
– as a testbed before testing with humans.
– How can communicative behavior of UAVs
be adapted to take limitations of human
cognition into account?
Interaction Experiments
• Impact of evolving, interactive communication
– Vary constraints on evolution of communication (e.g. fixed
vs. adaptive communication channel)
– Vary constraints on sharing of communication (e.g. pairwise vs. community communication development)
• Impact of fixed, flexible or emergent network
organization
– Vary network flexibility (e.g. communication beyond grid)
– Vary level of information sharing (e.g. information filters)
• Accurate cognitive models for human-machine
interaction
– Adaptive interfaces (e.g. to predicted model workload)
– Model-based autonomy (e.g. handle monitoring, routine
decision-making)
Issue 3: Behavior Abstraction
• First two issues build solutions toward this one
• Study of scaling properties helps capture response
function for all aspects of target behavior
• Abstraction methodology helps iterate and test
models at various levels of abstraction to maximize
retention
• Issues:
– Grain scale of components (generic tasks, unit tasks?)
– Attainable degree of fidelity at each level?
– Capture individual differences or average, normative
behavior?
• The latter may miss key interaction aspects (outliers)
• Individual differences as architectural parameters (WM, speed)
• Use cognitive model to generate data to train machine
learning agent tailored to individual decision makers
ACT-R vs. Neural Network Model
Neural network model based on same principles (West, 1998; 1999)
[Diagram: Lag 1 and Lag 2 networks jointly producing an answer]
• Simple 2-layer neural network
• Localist representation
• Linear output units
• Fixed lag of 1 or 2
• Dynamics arise from the interaction of the two networks
• Network structure (fields) can be mapped to chunk structure (slots)
• ACT-R and network both store game instances (move sequences)
• ACT-R and network are similarly sensitive to game statistics
• Noise plays a more deliberate role in ACT-R than the neural network
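A minimal Python sketch of such a localist 2-layer network at lag 1: a one-hot input codes the opponent's previous move, and linear output units score each candidate next move. The Hebbian-style increment used as the update rule is an assumption:

import numpy as np

MOVES = ["rock", "paper", "scissors"]
IDX = {m: i for i, m in enumerate(MOVES)}

# Weights from previous move (rows) to predicted next move (columns)
W = np.zeros((3, 3))

def predict(prev_move):
    # Localist input selects one row; linear outputs are just its weights
    return MOVES[int(np.argmax(W[IDX[prev_move]]))]

def learn(prev_move, next_move, rate=1.0):
    # Strengthen the observed (previous -> next) transition
    W[IDX[prev_move], IDX[next_move]] += rate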
Individual vs Group Models
• Model of sequence expectation applied to baseball batting
• Key representation and procedures general, not domain-specific
• Cognitive architecture constrains performance to reproduce all
main effects: recency, length of sequence and sequence ordering
• Variation in performance between subjects can be captured using
straightforward parameterization of perceptual-motor skills
[Bar charts: mean temporal error (msec) by pitch sequence (F/S combinations up to length 3): Subject 1 vs. Model (left), All Subjects vs. Model Scaled (right)]
Markov Model (Gray, 2001)
Basic Markov assumption:
Current state determines future
• 2 states: expecting fast or slow pitch
• Probabilities of switching state (a_s, a_f) and
temporal errors when expecting fast and slow
pitch (T_f, T_s) need to be estimated
• 2 more transition rules and associated
parameters (a_k, a_b) to handle pitch count
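A minimal Python sketch of this two-state model; the parameter values are placeholders rather than Gray's estimates, and the extra pitch-count rules (a_k, a_b) are omitted:

import random

def simulate(n_pitches, a_f=0.3, a_s=0.3, T_f=20.0, T_s=60.0):
    # Predicted temporal error on each pitch under the two-state model
    state, errors = "fast", []
    for _ in range(n_pitches):
        # Error depends only on the current expectation state
        errors.append(T_f if state == "fast" else T_s)
        # Markov assumption: the current state alone determines the next
        p_switch = a_f if state == "fast" else a_s
        if random.random() < p_switch:
            state = "slow" if state == "fast" else "fast"
    return errors

print(simulate(10))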
Markov vs. ACT-R
• State representation
– Markov has discrete states that represent decisions
– ACT-R has graded states that reflect the state of memory
• Transition probabilities
– Markov needs to estimate state transition probabilities
– ACT-R predicts state change based on theory of memory
• Pitch count
– Markov has to adopt additional rules and parameters
– ACT-R generalizes using established representation
• ACT-R is more constrained than Markov model
• Similar results for backgammon domain:
– Comparable results to NN and TD-learning with orders of
magnitude fewer training instances
Abstraction Experiments
• Impact of Representation Fidelity
– Vary degree of model fidelity to determine impact on
network dynamics (e.g. high- vs. low-fidelity nodes for
specialists vs. generalists)
– Determine which model aspects are critical to performance
• Impact of Skill Compositionality
– Enforce skill composition through standard, common
interface and determine impact on performance
– Evaluate impact of architectural constructs including
working memory support for multi-tasking
• Relevant computer science concepts
– Abstract Behavior Types
• Generalization of abstract data types to temporal streams
– Aspect-Oriented Programming
• Generalization to allow more complex procedural interaction
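As a hedged illustration of the first of these concepts, here is a minimal Python sketch of an abstract behavior type: where an abstract data type hides a value behind operations, a behavior type hides a process behind a typed interface over temporal streams. The names and interface shape are illustrative assumptions, not a definitive formulation:

from typing import Iterator, Protocol, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")

class Behavior(Protocol[Obs, Act]):
    # The abstract interface: consume one observation, emit one action
    def step(self, observation: Obs) -> Act: ...

def run(behavior, observations: Iterator[Obs]) -> Iterator[Act]:
    # Composing a behavior with an observation stream yields an action
    # stream, so behaviors compose the way abstract data types do over values
    for obs in observations:
        yield behavior.step(obs)

class Echo:
    # A trivial concrete behavior satisfying the interface
    def step(self, observation: str) -> str:
        return observation

print(list(run(Echo(), iter(["ping", "pong"]))))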