Bayesian models of human inference
Josh Tenenbaum
MIT
The Bayesian revolution in AI
• Principled and effective solutions for inductive inference from ambiguous data:
– Vision
– Robotics
– Machine learning
– Expert systems / reasoning
– Natural language processing
• Standard view in AI: no necessary connection to
how the human brain solves these problems.
– Heuristics & Biases program in the background (“We
know people aren’t Bayesian, but…”).
Bayesian models of cognition
Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman,
Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...]
Language acquisition and processing [Brent, de Marcken, Niyogi, Klein,
Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …]
Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording,
Kawato, Doya, Todorov, Shadmehr, …]
Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …]
Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …]
Attention [Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …]
Categorization and concept learning [Anderson, Nosofsky, Rehder, Navarro,
Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …]
Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …]
Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …]
Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker,
Goodman, Tenenbaum, …]
How to meet up with mainstream JDM
research (i.e., heuristics & biases)?
1. How to reconcile apparently contradictory
messages of H&B and Bayesian models?
Are people Bayesian or aren’t they? When are they,
when aren’t they, and why?
2. How to integrate the H&B and Bayesian
research approaches?
When are people Bayesian, and why?
• Low level hypothesis (Shiffrin, Maloney, etc.)
– People are Bayesian in low-level input or output processes
that have a long evolutionary history shared with other
species, e.g. vision, motor control, memory retrieval.
When are people Bayesian, and why?
• Low level hypothesis (Shiffrin, Maloney, etc.)
• Information format hypothesis (Gigerenzer)
– Higher-level cognition can be Bayesian when information is presented in formats that we have evolved to process, and that support simple heuristic algorithms, e.g., base-rate neglect disappears with “natural frequencies” (worked example below).
[Figure: accuracy with explicit probabilities vs. natural frequencies]
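To see why format matters, here is a small numerical sketch (my own illustration, using textbook mammography-style numbers rather than anything from the slide): the same posterior comes out of Bayes' rule applied to explicit probabilities and of simple counting with natural frequencies; the second route is the one people find easy.

```python
# Hypothetical diagnosis problem: base rate 1%, hit rate 80%, false-positive rate 9.6%.
base_rate = 0.01      # P(disease)
hit_rate = 0.80       # P(positive | disease)
false_alarm = 0.096   # P(positive | no disease)

# Explicit-probability format: apply Bayes' rule directly.
posterior = (hit_rate * base_rate) / (
    hit_rate * base_rate + false_alarm * (1 - base_rate))

# Natural-frequency format: imagine 1000 people and count.
n = 1000
sick = round(n * base_rate)                    # 10 people have the disease
sick_pos = round(sick * hit_rate)              # 8 of them test positive
healthy_pos = round((n - sick) * false_alarm)  # 95 healthy people also test positive
posterior_freq = sick_pos / (sick_pos + healthy_pos)

print(round(posterior, 3), round(posterior_freq, 3))  # both ~0.078
```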
When are people Bayesian, and why?
• Low level hypothesis (Shiffrin, Maloney, etc.)
• Information format hypothesis (Gigerenzer)
• Core capacities hypothesis
– Bayes can illuminate distinctively human cognitive
capacities for inductive inference – learning words and
concepts, projecting properties of objects, causal inference,
or action understanding: problems we solve effortlessly,
unconsciously, and successfully in natural contexts, which
a five-year-old solves better than any animal or computer.
When are people Bayesian, and why?
• Low level hypothesis (Shiffrin, Maloney, etc.)
• Information format hypothesis (Gigerenzer)
• Core capacities hypothesis
– Causal induction
[Figure: procedure used in Sobel et al. (2002), Experiment 2. One-Cause Condition: both objects activate the detector (AB trial); object A does not activate the detector by itself (A trial); children are asked if each object is a blicket, then asked to make the machine go. Backward Blocking Condition: both objects activate the detector; object A activates the detector by itself; children are asked if each object is a blicket, then asked to make the machine go.]
(Sobel, Griffiths, Tenenbaum, & Gopnik)
When are people Bayesian, and why?
• Low level hypothesis (Shiffrin, Maloney, etc.)
• Information format hypothesis (Gigerenzer)
• Core capacities hypothesis
– Word learning
[Figure: hypothesis space of candidate word meanings and observed examples (data)]
(Tenenbaum & Xu)
When are people Bayesian, and why?
• Low level hypothesis (Shiffrin, Maloney, etc.)
• Information format hypothesis (Gigerenzer)
• Core capacities hypothesis
– Bayes can illuminate distinctively human cognitive
capacities for inductive inference – learning words and
concepts, projecting properties of objects, causal inference,
or action understanding: problems we solve effortlessly,
unconsciously, and successfully in natural contexts, which
a five-year-old solves better than any animal or computer.
– The mind is not good at explicit Bayesian reasoning about
verbally or symbolically presented statistics, unless core
capacities can be engaged.
When are people Bayesian, and why?
• Low level hypothesis (Shiffrin, Maloney, etc.)
• Information format hypothesis (Gigerenzer)
• Core capacities hypothesis
[Figure: statistical vs. causal versions of the diagnosis problem; responses scored as correct vs. base-rate neglect]
(Krynski & Tenenbaum)
How to meet up with mainstream JDM
research (i.e., heuristics & biases)?
1. How to reconcile apparently contradictory
messages of H&B and Bayesian models?
Are people Bayesian or aren’t they? When are they,
when aren’t they, and why?
2. How to integrate the H&B and Bayesian
research approaches?
Reverse engineering
• Goal is to reverse-engineer human inference.
– A computational understanding of how the mind works and why it works the way it does.
• Even for core inferential capacities, we are
likely to observe behavior that deviates from
any ideal Bayesian analysis.
• These deviations are likely to be informative
about how the mind works.
Analogy to visual illusions
[Illusion figures (Adelson; Shepard)]
• Highlight the problems the visual system is designed to
solve: inferring world structure from images, not judging
properties of the images themselves.
• Reveal the visual system’s implicit assumptions about the physical world and the processes of image formation that are needed to solve these problems.
How do we interpret deviations from a
Bayesian analysis?
• H&B: People aren’t Bayesian, but use some other means of inference.
– Base-rate neglect: representativeness heuristic
– Recency bias: availability heuristic
– Order of evidence effects: anchoring and adjustment
– …
• Not so compelling as reverse engineering.
– What engineer would want to design a system based on “representativeness”, without knowing how it is computed, why it is computed that way, what problem it attempts to solve, when it works, or how its accuracy and efficiency compare to some ideal computation or other heuristics?
How do we interpret deviations from a
Bayesian analysis?
Multiple levels of analysis (Marr)
• Computational theory
– What is the goal of the computation – the outputs and
available inputs? What is the logic by which the inference
can be performed? What constraints (prior knowledge) do
people assume to make the solution well-posed?
• Representation and algorithm
– How is the information represented? How is the computation
carried out algorithmically, approximating the ideal
computational theory with realistic time & space resources?
• Hardware implementation
How do we interpret deviations from a
Bayesian analysis?
Multiple levels of analysis (Marr)
• Computational theory ← Bayes
– What is the goal of the computation – the outputs and
available inputs? What is the logic by which the inference
can be performed? What constraints (prior knowledge) do
people assume to make the solution well-posed?
• Representation and algorithm
– How is the information represented? How is the computation
carried out algorithmically, approximating the ideal
computational theory with realistic time & space resources?
• Hardware implementation
Different philosophies
• H&B
– One canonical Bayesian analysis of any given task, and we
know what it is.
– Ideal Bayesian solution can be computed.
– The question “Are people Bayesian?” is empirically
meaningful on any given task.
• Bayes+Marr
– Many possible Bayesian analyses of any given task, and we
need to discover which best characterize cognition.
– Ideal Bayesian solution can only be approximately computed.
– The question “Are people Bayesian?” is not an empirical one, at least not for an individual task. Bayes is a framework-level assumption, like distributed representations in connectionism or condition-action rules in ACT-R.
How do we interpret deviations from a
Bayesian analysis?
Multiple levels of analysis (Marr)
• Computational theory
– What is the goal of the computation – the outputs and
available inputs? What is the logic by which the inference
can be performed? What constraints (prior knowledge) do
people assume to make the solution well-posed?
• Representation and algorithm
– How is the information represented? How is the computation
carried out algorithmically, approximating the ideal
computational theory with realistic time & space resources?
• Hardware implementation
The centrality of causal inference
• In visual perception:
– Judge P(scene|image features) rather than P(image
features|scene) or P(image features|other image features).
• Coin-flipping: Which sequence is more likely to come from flipping a fair coin, HHTHT or HHHHH?
• Coincidences: How likely is it that 2 people in a random party of 25 have the same birthday? 3 in a party of 10?
(Griffiths & Tenenbaum)
Judgments of randomness:
P(random | data) ∝ P(data | random) / P(data | regular)
Judgments of coincidence:
P(regular | data) ∝ P(data | regular) / P(data | random)
Rational measure of evidential support:
P(data | h1) / P(data | h0)
(Griffiths & Tenenbaum)
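The coin-flipping example can be made concrete with a minimal sketch (my own, simpler than the models in Griffiths & Tenenbaum's papers): take “regular” to mean a coin of unknown bias with the bias integrated out under a uniform prior, and score each sequence by the log likelihood ratio above.

```python
from math import comb, log

def likelihood_random(seq):
    # P(sequence | fair coin): every length-n sequence has probability 0.5^n.
    return 0.5 ** len(seq)

def likelihood_regular(seq):
    # P(sequence | coin of unknown bias), integrating the bias out under a
    # uniform prior: k heads in n flips gives 1 / ((n + 1) * C(n, k)).
    n, k = len(seq), seq.count("H")
    return 1.0 / ((n + 1) * comb(n, k))

for seq in ["HHTHT", "HHHHH"]:
    support_for_random = log(likelihood_random(seq) / likelihood_regular(seq))
    print(seq, round(support_for_random, 2))
# HHTHT:  0.63 -> the evidence favors "random"
# HHHHH: -1.67 -> the evidence favors "regular"
```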
How do we interpret deviations from a
Bayesian analysis?
Multiple levels of analysis (Marr)
• Computational theory
– What is the goal of the computation – the outputs and
available inputs? What is the logic by which the inference
can be performed? What constraints (prior knowledge) do
people assume to make the solution well-posed?
• Representation and algorithm
– How is the information represented? How is the computation
carried out algorithmically, approximating the ideal
computational theory with realistic time & space resources?
• Hardware implementation
Assuming the world is simple
• In visual perception:
– “Slow and smooth” prior on visual motion
• Causal induction:
– P(blicket) = 1/6, “Activation law”
[Figure: procedure used in Sobel et al. (2002), Experiment 2 – One-Cause Condition (both objects activate the detector; object A does not activate the detector by itself) and Backward Blocking Condition (both objects activate the detector; object A activates the detector by itself); children are asked if each object is a blicket, then asked to make the machine go]
P(A is a blicket | data) = 1, P(B is a blicket | data) ~ 1/6
P(A is a blicket | data) ~ 3/4, P(B is a blicket | data) ~ 1/4
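A minimal enumeration sketch of the model named above (my own code, not the authors'): objects are independently blickets with prior 1/6, the detector fires exactly when at least one blicket is on it, and conditioning on the two trial sequences yields the P(A) = 1, P(B) ≈ 1/6 backward-blocking pattern quoted on the slide.

```python
from itertools import product

PRIOR = 1 / 6  # prior probability that any given object is a blicket

def detector_fires(objects_on_detector, blickets):
    # Deterministic "activation law": the detector fires iff a blicket is on it.
    return any(o in blickets for o in objects_on_detector)

def posterior_blicket(objects, trials):
    # Enumerate every blicket/non-blicket assignment, keep the ones consistent
    # with the observed trials, and return P(each object is a blicket | data).
    posterior = {o: 0.0 for o in objects}
    total = 0.0
    for assignment in product([False, True], repeat=len(objects)):
        blickets = {o for o, is_blicket in zip(objects, assignment) if is_blicket}
        weight = 1.0
        for is_blicket in assignment:
            weight *= PRIOR if is_blicket else 1 - PRIOR
        if all(detector_fires(on, blickets) == fired for on, fired in trials):
            total += weight
            for o in blickets:
                posterior[o] += weight
    return {o: p / total for o, p in posterior.items()}

# Backward blocking: A and B together activate the detector; A alone also activates it.
print(posterior_blicket("AB", [("AB", True), ("A", True)]))   # A: 1.0, B: ~0.167
# One-cause: A and B together activate the detector; A alone does not.
print(posterior_blicket("AB", [("AB", True), ("A", False)]))  # A: 0.0, B: 1.0
```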
Recognizing the world is complex
• In visual perception:
– Need uncertainty about the coherence ratio and velocity of coherent motion. (Lu & Yuille)
• Property induction:
– Properties should be distributed stochastically over the tree structure, not just focused on single branches.
Gorillas have T9 cells.
Seals have T9 cells.
Horses have T9 cells.
Bayes, single-branch prior: r = 0.50
(Kemp & Tenenbaum)
Recognizing the world is complex
• In visual perception:
– Need uncertainty about the coherence ratio and velocity of coherent motion. (Lu & Yuille)
• Property induction:
– Properties should be distributed stochastically over the tree structure, not just focused on single branches.
Gorillas have T9 cells.
Seals have T9 cells.
Horses have T9 cells.
Bayes, “mutation” prior: r = 0.92
(Kemp & Tenenbaum)
[Figure: example properties – “has T9 hormones”, “can bite through wire”, “is found near Minneapolis”, “carry E. Spirus bacteria”]
(Kemp & Tenenbaum)
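A toy sketch of what a “mutation” prior means (my own illustration; the tree, branch lengths, and rate are made up, and the real Kemp & Tenenbaum model is richer): a binary property spreads down a taxonomy, occasionally switching state along a branch, so it usually clusters on subtrees but can also span distant branches.

```python
import math
import random

random.seed(0)

# A toy taxonomy: (name, branch_length, children); leaves have no children.
TREE = ("root", 0.0, [
    ("primates", 1.0, [("gorilla", 0.5, []), ("chimp", 0.5, [])]),
    ("non-primates", 1.0, [("seal", 1.5, []), ("horse", 1.5, [])]),
])

RATE = 0.3  # mutation rate: expected switches per unit of branch length

def flip_prob(length):
    # Probability that a two-state mutation process ends a branch of this
    # length in the opposite state from the one it started in.
    return 0.5 * (1.0 - math.exp(-2.0 * RATE * length))

def sample_property(node, has_it, leaves):
    # Propagate a binary property down the tree, flipping along each branch.
    name, length, children = node
    if random.random() < flip_prob(length):
        has_it = not has_it
    if not children:
        leaves[name] = has_it
    for child in children:
        sample_property(child, has_it, leaves)
    return leaves

# Sampled properties tend to cover whole subtrees, but occasionally cross them --
# unlike a prior that only allows properties confined to a single branch.
for _ in range(5):
    print(sample_property(TREE, random.random() < 0.5, {}))
```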
How do we interpret deviations from a
Bayesian analysis?
Multiple levels of analysis (Marr)
• Computational theory
– What is the goal of the computation – the outputs and
available inputs? What is the logic by which the inference
can be performed? What constraints (prior knowledge) do
people assume to make the solution well-posed?
• Representation and algorithm
– How is the information represented? How is the computation
carried out algorithmically, approximating the ideal
computational theory with realistic time & space resources?
• Hardware implementation
Sampling-based approximate inference
• In visual perception:
– Temporal dynamics of bi-stability due to fast sampling-based approximation of a bimodal posterior (Schrater & Sundareswara).
• Order effects in category learning
– Particle filter (sequential Monte Carlo), an online
approximate inference algorithm assuming stationarity.
• Probability matching in classification decisions
– Sampling-based approximations with guarantees of near
optimal generalization performance.
(Griffiths et al., Goodman et al.)
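A toy sketch of the probability-matching idea (my own illustration, not the authors' analysis): an agent that answers by drawing a single sample from its posterior produces response proportions that match the posterior rather than always picking the most probable category.

```python
import random

random.seed(1)

# Suppose the learner's posterior over two category labels for a test item is:
posterior = {"A": 0.7, "B": 0.3}

def sample_response(posterior):
    # Respond with one sample from the posterior instead of the MAP choice.
    r, cumulative = random.random(), 0.0
    for label, p in posterior.items():
        cumulative += p
        if r < cumulative:
            return label
    return label

responses = [sample_response(posterior) for _ in range(10000)]
print(responses.count("A") / len(responses))  # ~0.7: matching, not maximizing
```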
Conclusions
• “Are people Bayesian?”, “When are they Bayesian?”
– Maybe not the most interesting questions in the long run….
• What is the best way to reverse engineer cognition at
multiple levels of analysis? Assuming core inductive
capacities are approximately Bayesian at the
computational-theory level offers several benefits:
– Explanatory power: why does cognition work?
– Fewer degrees of freedom in modeling
– A bridge to state-of-the-art AI and machine learning
– Tools to study the big questions: What are the goals of cognition? What does the mind know about the world? How is that knowledge represented? What are the processing mechanisms and why do they work as they do?
Coincidences
(Griffiths & Tenenbaum, in press)
• The birthday problem
– How many people do you need to have in the room before the probability exceeds 50% that two of them have the same birthday? Answer: 23 (see the quick check after this slide).
• The bombing of London
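A quick check of the birthday answer (a standard calculation, not from the slides), assuming 365 equally likely birthdays and no leap years:

```python
def p_shared_birthday(n):
    # P(at least two of n people share a birthday) = 1 - P(all distinct).
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (365 - i) / 365
    return 1 - p_all_distinct

print(round(p_shared_birthday(22), 3))  # 0.476
print(round(p_shared_birthday(23), 3))  # 0.507 -- the first n above 50%
```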
How much of a coincidence?
Bayesian coincidence factor: log [ P(d | latent) / P(d | random) ]
[Figure: two generative models for a set of birthdays x – “Chance”: each x generated independently; “Latent common cause”: a common cause C generates the x’s, e.g., dates clustered in August]
Alternative hypotheses: proximity in date, matching days of the month, matching month, ....
How much of a coincidence?
Bayesian coincidence factor: log [ P(d | latent) / P(d | random) ]
[Figure: “Chance” – dates drawn from a uniform distribution; “Latent common cause” – dates drawn from a uniform distribution plus a regularity]
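To make the coincidence factor concrete, here is a toy sketch (my own construction, far simpler than Griffiths & Tenenbaum's model): under “chance”, birthdays are uniform over the year; under a “latent common cause”, each date is a mixture of that uniform distribution and a regularity, here a 30-day window whose location is unknown and averaged over. Clustered dates get a large positive factor; scattered dates a negative one.

```python
import math

DAYS = 365
WINDOW_STARTS = range(0, 360, 30)  # twelve crude 30-day "months"

def log_p_random(dates):
    # Chance model: every date independent and uniform over the year.
    return len(dates) * math.log(1 / DAYS)

def log_p_latent(dates, mix=0.5):
    # Latent-common-cause model: each date is drawn from a 50/50 mixture of
    # the uniform distribution and a 30-day regularity; the window's location
    # is unknown, so average over the twelve candidates.
    total = 0.0
    for start in WINDOW_STARTS:
        p = 1.0
        for d in dates:
            in_window = start <= d < start + 30
            p *= mix * (1 / 30 if in_window else 0.0) + (1 - mix) * (1 / DAYS)
        total += p / len(WINDOW_STARTS)
    return math.log(total)

scattered = [3, 40, 77, 110, 150, 190, 230, 260, 300, 340]
clustered = [213, 215, 218, 220, 222, 225, 228, 230, 233, 236]  # all in "August"

for name, dates in [("scattered", scattered), ("clustered", clustered)]:
    factor = log_p_latent(dates) - log_p_random(dates)
    print(name, round(factor, 1))  # negative: no coincidence; large positive: a big one
```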