Probabilistic Models in Cognitive Science and Artificial Intelligence

Very Brief History of Cog Sci and AI
1950's-1980's
●  Symbolic models of cognition
●  von Neumann computer architecture as metaphor
1980's-1990's
●  Connectionist models of cognition
●  Massively parallel neuron-like networks of simple processors as metaphor
Late 1990's-?
●  Probabilistic / statistical models of cognition
●  Formalizes the best of connectionist (subsymbolic) ideas
Relation of Probabilistic Models to
Connectionist and Symbolic Models
Probabilistic models lie between connectionist and symbolic models on a spectrum:
●  Connectionist models: weak (unknown) bias; ad hoc, implicit incorporation of prior knowledge & assumptions; statistical learning (large # of examples); vector representations
●  Symbolic models: strong bias; incorporation of prior knowledge & assumptions via predicate calculus; rule learning (small # of examples); structured representations
●  Probabilistic models: principled, elegant incorporation of prior knowledge & assumptions
Two Notions of Probability
Frequentist notion
●  Relative frequency obtained if event were observed many times (e.g., coin flip)
Subjective notion
●  Degree of belief in some hypothesis
●  Analogous to connectionist activation
Long philosophical battle between these two views
●  Subjective notion makes sense for cog sci and AI given that probabilities represent mental states
Why Probability?
Randomness in the brain and world introduces uncertainty, and uncertainty is well described in the language of random events.
The currency of probability provides strong constraints (vs. neural net activation).
It's the optimal thing to compute, in the sense that any other strategy will lead to lower expected returns.
●  E.g., "I bet you $1 that a roll of a die will produce a number < 3. How much are you willing to wager?"
Why Probability?
Allows elegant theories to be based on the premise that human performance is optimal
●  Rational theories, ideal observer theories
●  Probably true in some areas of cognition (e.g., vision)
More interesting: bounded rationality
●  Optimality is assumed to be subject to limitations on processing hardware and capacity, representation, and experience with the world.
●  Explicit qualitative assumptions
Basic Probability
(most slides borrowed with permission
from Andrew Moore of CMU and Google)
http://www.cs.cmu.edu/~awm/tutorials
Notation Digression
•  P(A) is shorthand for P(A=true)
•  P(~A) is shorthand for P(A=false)
•  Similar notation applies to other binary RVs: P(Gender=M), P(Gender=F)
•  Same notation applies to multivalued RVs: P(Major=history), P(Age=19), P(Q=c)
•  Note: upper case letters/names for variables, lower case letters/names for values
•  For RVs that have values other than true and false, P(Q) is shorthand for P(Q=q) for some unknown q
[Venn diagram of events F and H: Q = probability mass in F only, R = mass in F ∩ H, S = mass in H only]
P(H|F) = R/(Q+R)
P(F|H) = R/(S+R)
If you have the joint distribution, you can perform any inference in the domain using simpler probabilistic facts and some algebra.
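As a minimal sketch of that claim, the snippet below reads marginals and conditionals directly off a joint distribution over two binary events F and H; the counts are invented for illustration.

```python
import numpy as np

# Hypothetical joint distribution over two binary events F and H.
# Rows index F (0 = false, 1 = true), columns index H (0 = false, 1 = true).
# The numbers are made up for illustration; they must sum to 1.
joint = np.array([[0.40, 0.15],
                  [0.25, 0.20]])

p_F = joint[1, :].sum()          # marginal P(F)
p_H = joint[:, 1].sum()          # marginal P(H)
p_H_given_F = joint[1, 1] / p_F  # conditional P(H | F)
p_F_given_H = joint[1, 1] / p_H  # conditional P(F | H)

print(f"P(F) = {p_F:.3f}, P(H) = {p_H:.3f}")
print(f"P(H|F) = {p_H_given_F:.3f}, P(F|H) = {p_F_given_H:.3f}")
```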
What Is A Bayes Net?
Directed Graphical Model
[Example network: Burglary and Earthquake are parents of Alarm; Earthquake is the parent of Radio; Alarm is the parent of Call]
A node is conditionally independent of its ancestors given its parents.
E.g., C (Call) is conditionally independent of R, E, and B given A
Notation: C ⊥ R, B, E | A
From 2^5 - 1 = 31 parameters for the full joint to 1+1+2+4+2 = 10 parameters
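A minimal sketch of the factorization that gives the 1+1+2+4+2 = 10 parameter count. The network structure and count come from the slide; the conditional probability values below are invented placeholders.

```python
# Burglary / Earthquake / Alarm / Radio / Call network.
# Joint factorizes as:
#   P(B, E, A, R, C) = P(B) P(E) P(R | E) P(A | B, E) P(C | A)
# Parameter count: 1 + 1 + 2 + 4 + 2 = 10 (vs. 2^5 - 1 = 31 for the full joint).

# Invented placeholder probabilities for illustration.
P_B = {True: 0.01, False: 0.99}
P_E = {True: 0.02, False: 0.98}
P_R_given_E = {True: 0.9, False: 0.001}                       # P(Radio=true | E)
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}   # P(Alarm=true | B, E)
P_C_given_A = {True: 0.9, False: 0.05}                        # P(Call=true | A)

def joint(b, e, a, r, c):
    """Probability of one full assignment, using the Bayes-net factorization."""
    pa = P_A_given_BE[(b, e)]
    pr = P_R_given_E[e]
    pc = P_C_given_A[a]
    return (P_B[b] * P_E[e]
            * (pr if r else 1 - pr)
            * (pa if a else 1 - pa)
            * (pc if c else 1 - pc))

# E.g., probability of a burglary with the alarm sounding and the neighbor calling,
# but no earthquake and no radio report:
print(joint(b=True, e=False, a=True, r=False, c=True))
```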
Tenenbaum (1999)
Concept learning
E.g., glorch vs. not glorch
E.g., word meanings
E.g., edible food
Focus on
●  Learning concepts from positive examples
●  Learning from a small number of examples
Contrast with machine learning approaches and psychological models at the time
Domain
Two-dimensional continuous feature space
Categories defined by axis-parallel rectangles
E.g., feature dimensions: cholesterol level (x1), insulin level (x2)
E.g., concept: healthy (C)
Hypothesis (Model) Space
H: all rectangles on the plane, parameterized by (l1, l2, s1, s2)
h: one particular hypothesis
Consider all hypotheses in parallel
●  In contrast to the non-Bayesian approach of maintaining only the best hypothesis at any point in time
Prediction via Model Averaging
Generalization function for unknown input Y given a set of n examples X = {x1, x2, x3, …, xn}
●  p(Y | X) = ∫h p(Y & h | X) dh    [marginalization over hypotheses]
●  p(Y & h | X) = p(Y | h, X) p(h | X)    [chain rule]
●  p(Y | h, X) = p(Y | h) = 1 if Y falls inside h, 0 otherwise
●  p(h | X) ∝ p(X | h) p(h)    [likelihood × prior]
Priors and Likelihood Functions
Priors, p(h)
●  Location invariant
●  Uninformative prior (prior depends only on area of rectangle)
●  Expected-size prior
Likelihood function, p(X | h)
●  X = set of n examples
●  Size principle: p(X | h) = 1/|h|^n if all n examples fall inside h, 0 otherwise (smaller consistent hypotheses receive higher likelihood)
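A minimal sketch combining the model-averaging equations above with the size-principle likelihood and an assumed expected-size (exponential) prior on area, on a discretized hypothesis grid. The example points, grid resolution, and prior scale are arbitrary illustrative choices, not values from Tenenbaum (1999).

```python
import numpy as np
from itertools import product

# Observed positive examples X (made-up points in a 2-D feature space).
X = np.array([[0.45, 0.50], [0.55, 0.60], [0.50, 0.55]])
n = len(X)

# Hypothesis space: axis-parallel rectangles (x1_lo, x1_hi, x2_lo, x2_hi)
# on a coarse grid over the unit square (arbitrary discretization).
grid = np.linspace(0.0, 1.0, 11)
sigma = 0.2  # scale of the assumed expected-size prior (arbitrary)

def posterior_weight(h):
    x1_lo, x1_hi, x2_lo, x2_hi = h
    area = (x1_hi - x1_lo) * (x2_hi - x2_lo)
    contains_all = np.all((X[:, 0] >= x1_lo) & (X[:, 0] <= x1_hi) &
                          (X[:, 1] >= x2_lo) & (X[:, 1] <= x2_hi))
    if not contains_all:
        return 0.0                      # likelihood is 0 unless all examples fall in h
    likelihood = (1.0 / area) ** n      # size principle
    prior = np.exp(-area / sigma)       # expected-size prior (assumed form)
    return likelihood * prior

hypotheses = [(a, b, c, d) for a, b, c, d in product(grid, grid, grid, grid)
              if a < b and c < d]
weights = np.array([posterior_weight(h) for h in hypotheses])
weights /= weights.sum()

def p_in_concept(y):
    """p(y in C | X): posterior-weighted average over hypotheses containing y."""
    inside = np.array([(h[0] <= y[0] <= h[1]) and (h[2] <= y[1] <= h[3])
                       for h in hypotheses])
    return float(np.dot(weights, inside))

print(p_in_concept([0.50, 0.55]))  # near the observed examples -> high probability
print(p_in_concept([0.90, 0.10]))  # far from the examples -> low probability
```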
Generalization Gradients
MIN: smallest hypothesis consistent with the data
Weak Bayes: instead of using the size principle, assumes examples are produced by a process independent of the true class
[Figure: generalization gradients; dark line = 50% probability contour]
Experimental Design
Subjects shown n dots on screen that are "randomly chosen examples from some rectangle of healthy levels"
●  n ∈ {2, 3, 4, 6, 10, 50}
Dots varied in horizontal and vertical range
●  r ∈ {.25, .5, 1, 2, 4, 8} units in a 24-unit window
Task: draw the 'true' rectangle around the dots
Experimental Results
Summary of Tenenbaum (1999)
Method
●  Pick prior distribution (includes hypothesis space)
●  Pick likelihood function
●  Leads to predictions for generalization as a function of r (range) and n (number of examples)
Claims people generalize optimally given assumptions about priors and likelihood
●  Bayesian approach provides the best description of how people generalize on the rectangle task.
●  Explains how people can learn from a small number of examples, and from only positive examples.
Important Ideas in Bayesian Models
Generative models
●  Likelihood function, prior distribution
Consideration of multiple models in parallel
●  Potentially infinite model space
Inference
●  Prediction via model averaging
●  Role of priors diminishes with amount of evidence
Learning
●  Just another form of inference
●  Bayesian Occam's razor: trade-off between model simplicity and fit to data
Important Technical Issues
Representing structured data
●  Grammars
●  Relational schemas (e.g., paper authors and topics)
Hierarchical models
●  Allow for weaker assumptions at the cost of more complex inference
Nonparametric models
●  Flexible models that grow in complexity as the data justifies
Approximate inference
●  Markov chain Monte Carlo, particle filters, variational approximations
Griffiths and Tenenbaum (2006)
Optimal Predictions in Everyday Cognition
If you were assessing an insurance case for an 18-year-old man, what would you predict for his lifespan?
If you phoned a box office to book tickets and had been on hold for 3 minutes, what would you predict for the total time you would be on hold?
If your friend read you her favorite line of poetry, and told you it was line 5 of a poem, what would you predict for the total length of the poem?
If you opened a book about the history of ancient Egypt to a page listing the reigns of the pharaohs, and noticed that in 4000 BC a particular pharaoh had been ruling for 11 years, what would you predict for the total duration of his reign?
Griffiths and Tenenbaum Conclusion
Average responses reveal a "close correspondence between people's implicit probabilistic models and the statistics of the world."
People show a statistical sophistication and optimality of reasoning generally assumed to be absent in the domain of higher-order cognition.
Griffiths and Tenenbaum Bayesian Model
If an individual has lived for t_cur = 50 years, how many years t_total do you expect them to live?
●  Bayesian prediction: p(t_total | t_cur) ∝ p(t_cur | t_total) p(t_total)
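A minimal sketch of this kind of prediction, assuming (purely for illustration, not the paper's estimated parameters) a roughly Gaussian prior over lifespans and a likelihood that treats the current age as a uniform sample from the lifespan; the prediction is the posterior median.

```python
import numpy as np

t_cur = 50  # current age (years)

# Grid over possible total lifespans.
t_total = np.arange(1, 121)

# Prior over lifespans: assumed Gaussian for illustration
# (mean and sd are made up, not values estimated in the paper).
prior = np.exp(-0.5 * ((t_total - 75) / 16) ** 2)

# Likelihood: the current age is assumed to be sampled uniformly from the
# person's lifespan, so p(t_cur | t_total) = 1/t_total for t_total >= t_cur.
likelihood = np.where(t_total >= t_cur, 1.0 / t_total, 0.0)

posterior = prior * likelihood
posterior /= posterior.sum()

# Predict with the posterior median.
cdf = np.cumsum(posterior)
prediction = t_total[np.searchsorted(cdf, 0.5)]
print(f"Predicted total lifespan given t_cur = {t_cur}: about {prediction} years")
```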
What Does Optimality Entail?
Individuals have complete, accurate knowledge
about the domain priors.
Fairly sophisticated computation involving a Bayesian integral
From The Economist (1/5/2006)
"[Griffiths and Tenenbaum]…put the idea of a Bayesian brain to a quotidian test. They found that it passed with flying colors."
“The key to successful Bayesian reasoning is … in
having an appropriate prior… With the correct
prior, even a single piece of data can be used to
make meaningful Bayesian predictions.”
My Caution
Bayesian formalism is sufficiently broad that nearly any theory can be cast in Bayesian terms
●  E.g., adding two numbers as Bayesian inference
Emphasis on how cognition conforms to Bayesian principles often directs attention away from important memory and processing limitations.
Latent Dirichlet Allocation
(a.k.a. Topic Model)
Problem
●  Given a set of text documents, can we infer the topics that are covered by the set, and can we assign topics to individual documents?
●  Unsupervised learning problem
Technique
●  Exploit statistical regularities in data
●  E.g., documents that are on the topic of education will likely contain a set of words such as 'teacher', 'student', 'lesson', etc.
Generative Model of Text
Each document is a mixture of topics (e.g., education, finance, the arts)
Each topic is characterized by a set of words that are likely to appear
The string of words in a document is generated by repeating, for each word position:
1)  Draw a topic from the probability distribution associated with the document
2)  Draw a word from the probability distribution associated with that topic
Bag-of-words approach (word order is ignored)
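A minimal sketch of this generative process in numpy; the vocabulary, number of topics, and Dirichlet hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["teacher", "student", "lesson", "bank", "loan", "rate", "paint", "museum"]
n_topics = 3
n_docs = 4
doc_length = 12

# Each topic is a distribution over words; each document is a distribution over topics.
topic_word = rng.dirichlet(alpha=[0.5] * len(vocab), size=n_topics)  # per-topic word dist.
doc_topic = rng.dirichlet(alpha=[0.1] * n_topics, size=n_docs)       # per-document topic dist.

documents = []
for d in range(n_docs):
    words = []
    for _ in range(doc_length):
        z = rng.choice(n_topics, p=doc_topic[d])      # 1) draw a topic for this word position
        w = rng.choice(len(vocab), p=topic_word[z])   # 2) draw a word from that topic
        words.append(vocab[w])
    documents.append(" ".join(words))

for d, doc in enumerate(documents):
    print(f"doc {d}: {doc}")
```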
Inferring (Learning) Topics
Input: set of unlabeled documents
Learning task
●  Infer distribution over topics for each document
●  Infer distribution over words for each topic
Distribution over topics can be helpful for classifying or clustering documents
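As one concrete way to carry out this learning step, a sketch using scikit-learn's LatentDirichletAllocation; the tiny corpus and the number of topics are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus for illustration.
docs = [
    "the teacher gave the student a lesson",
    "the student studied the lesson with the teacher",
    "the bank raised the loan rate",
    "the loan rate at the bank increased",
]

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)  # bag-of-words count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Distribution over words for each topic (show the top words).
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])

# Distribution over topics for each document.
print(lda.transform(X))
```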
Dan Knights and Rob Lindsey’s work at JDPA
Rob’s Work: Phrase Discovery
[Four example topics, each listed by its most probable words/phrases in decreasing order of weight (weights omitted):]
●  Topic 1: new york, new, ny, vegas, strip, york, coaster, nyny, roller, las, it's, bars, las vegas, fun, drinks, mgm grand, you're, mgm, arcade, chin, italian, city, island, skyline, big apple, luxor
●  Topic 2: shuttle, lax, flight, early, sheraton, sheraton gateway, proximity, flights, catch, morning, bus, pick, shuttles, terminal, layover, international, driver, closeness, minutes, pickup, drop, ride, marriott, terminals, convenience, to/from
●  Topic 3: non, requested, smoke, room, given, smelled, reserved, change, told, cigarette, assigned, request, called, asked, reservation, advance, resolve, cigarette smoke, guaranteed, smokers, prior, upgrade, ended, checked, smell, asking
●  Topic 4: minutes, waited, 30, 20, 15, 45, check, min, waiting, arrived, wait, late, 10, arrival, bell, late night, pm, luggage, took forever, told, called, took care, 40, cleaned, checkout, took long
Bayesian Analysis
Make inferences from data using probability models about quantities we want to predict
●  E.g., expected age of death given a 51-year-old
●  E.g., latent topics in a document
1.  Set up a full probability model that characterizes the distribution over all quantities (observed and unobserved)
2.  Condition the model on observed data to compute the posterior distribution
3.  Evaluate the fit of the model to the data
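A minimal end-to-end illustration of these three steps on a toy problem (estimating a coin's bias with a Beta prior); all numbers are invented.

```python
import numpy as np
from scipy import stats

# 1) Full probability model: theta ~ Beta(2, 2) prior over the coin's bias,
#    flips ~ Bernoulli(theta).
a_prior, b_prior = 2, 2

# Observed data (invented): 9 heads out of 12 flips.
heads, flips = 9, 12

# 2) Condition on the data: the Beta prior is conjugate, so the posterior is Beta.
a_post = a_prior + heads
b_post = b_prior + (flips - heads)
posterior = stats.beta(a_post, b_post)
print("posterior mean:", posterior.mean())
print("95% interval:", posterior.interval(0.95))

# 3) Evaluate fit: posterior predictive check -- simulate replicated datasets
#    and compare the simulated number of heads to the observed count.
rng = np.random.default_rng(0)
theta_samples = rng.beta(a_post, b_post, size=5000)
replicated_heads = rng.binomial(flips, theta_samples)
print("P(replicated heads >= observed):", np.mean(replicated_heads >= heads))
```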