
MATH 3104: NEURAL DECODING AND BAYES THEOREM
A/Prof Geoffrey Goodhill, Semester 1, 2009
Taking the organism’s point of view
Spikes are the language of the brain. So far we have just discussed the “dictionary” stimulus →
response. However, from the organism’s point of view, what’s needed is the dictionary
response → stimulus. In general, we would like to construct the complete 2-way dictionary.
But what does this mean when encoding is probabilistic? It means we would like to know the joint
probability distribution of the stimulus and the response.
Notation
P(r) = prior distribution of spike trains r.
P(s) = prior distribution of stimuli s.
P(r|s) = probability of spike train r given stimulus s, a conditional distribution.
P(s|r) = probability of stimulus s given spike train r, the "response-conditional ensemble".
P(r, s) = joint distribution of all stimuli and spike trains.
If the stimulus and response are statistically independent, this is simply P(r) × P(s).
Identities between distributions
$$P(r) = \sum_s P(r|s)\,P(s)$$
$$P(s) = \sum_r P(s|r)\,P(r)$$
$$P(r, s) = P(r|s)\,P(s)$$
$$P(r, s) = P(s|r)\,P(r)$$
From these we can derive Bayes’ rule:
$$P(s|r) = \frac{P(r|s)\,P(s)}{P(r)}$$
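Since these identities are easy to mix up, here is a minimal numerical sketch in Python (the joint-distribution values below are invented purely for illustration) that checks the marginalization identities and Bayes' rule:

```python
import numpy as np

# Hypothetical toy joint distribution P(r, s) over 3 spike-train "words" r
# and 2 stimuli s (rows = r, columns = s). Entries sum to 1.
P_rs = np.array([[0.10, 0.30],
                 [0.15, 0.15],
                 [0.25, 0.05]])

P_r = P_rs.sum(axis=1)              # marginal P(r) = sum_s P(r, s)
P_s = P_rs.sum(axis=0)              # marginal P(s) = sum_r P(r, s)
P_r_given_s = P_rs / P_s            # conditional P(r|s): columns normalized
P_s_given_r = P_rs / P_r[:, None]   # conditional P(s|r): rows normalized

# Marginalization identities: P(r) = sum_s P(r|s) P(s), P(s) = sum_r P(s|r) P(r)
assert np.allclose(P_r, (P_r_given_s * P_s).sum(axis=1))
assert np.allclose(P_s, (P_s_given_r * P_r[:, None]).sum(axis=0))

# Bayes' rule: P(s|r) = P(r|s) P(s) / P(r)
assert np.allclose(P_s_given_r, P_r_given_s * P_s / P_r[:, None])

print("P(s|r) =\n", P_s_given_r)
```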
Example: fly data
Fig 1 illustrates the above probabilities for data from an experiment on a motion sensitive cell (H1)
in the blowfly. The fly viewed a spatial pattern displayed on an oscilloscope screen, and this pattern
moved randomly, diffusing across the screen. At the same time, spikes from H1 were recorded. See
the figure caption and the Rieke et al book for more details.
A simple example of the application of Bayes' theorem
Unfortunately, the true significance of Bayes' theorem is not easy to grasp. We will therefore
briefly digress from its applications in neuroscience and consider instead how it can be applied in
some "real-world" circumstances.
FACT: Most people who are Australian citizens speak English. That is, P(SE|A) is close to 1. But
it is clearly wrong to conclude from this that most English speakers are Australian citizens. That is,
P(A|SE) ≠ P(SE|A). In fact the two are related by

$$P(A|SE) = P(SE|A)\,\frac{P(A)}{P(SE)} \approx \frac{P(A)}{P(SE)}$$

where the approximation holds because P(SE|A) ≈ 1.
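To put rough numbers on this (the population figures below are order-of-magnitude assumptions for illustration, not part of the lecture data):

```python
# Rough, assumed numbers: ~26 million Australian citizens, ~1.5 billion
# English speakers, out of ~8 billion people worldwide.
P_SE_given_A = 0.95          # most Australian citizens speak English
P_A = 26e6 / 8e9             # fraction of the world who are Australian citizens
P_SE = 1.5e9 / 8e9           # fraction of the world who speak English

P_A_given_SE = P_SE_given_A * P_A / P_SE
print(f"P(A|SE) = {P_A_given_SE:.4f}")   # ~0.016: few English speakers are Australian
```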
This appears straightforward. However, we will now consider the use of Bayes’ theorem in legal
arguments regarding the guilt or innocence of someone accused of a crime, where issues analogous
to the Australian/English example above are often grossly misunderstood.
The Prosecutor’s fallacy
The so-called "Prosecutor's fallacy" is the claim that because the story before the court is highly
improbable, the defendant's innocence must be equally improbable.
Imagine the facts of a particular case are as follows:
• The defendant's DNA matched that found at the scene.
• The probability of a match between a person chosen at random and the DNA found at the scene
is 1 in a million.
The “Prosecutor’s fallacy” is to conclude from this that
• The likelihood of this DNA being from any person other than the defendant is 1 in a million.
That this is a fallacy is straightforwardly revealed by Bayes' theorem. Letting M mean "DNA match"
and G mean "guilty", we want to know how we should update our estimate of the probability the
defendant is guilty, P(G), given the evidence of the DNA match: we want to know P(G|M). Bayes'
theorem in this case says

\begin{align}
P(G|M) &= \frac{P(M|G)\,P(G)}{P(M)} \tag{1}\\
&= \frac{P(M|G)\,P(G)}{P(M|G)\,P(G) + P(M|NG)\,P(NG)} \tag{2}\\
&= \frac{P(M|G)\,P(G)}{P(M|G)\,P(G) + P(M|NG)\,(1 - P(G))} \tag{3}
\end{align}
where P(M|NG) is the probability of the DNA match when not guilty (and P(NG) = 1 − P(G)). If we
now assume that P(M|G) = 1, i.e., that if the defendant really was guilty their DNA would definitely
match, and divide numerator and denominator by P(G), we get

$$P(G|M) = \frac{1}{1 + P(M|NG)\,\frac{1 - P(G)}{P(G)}}$$
For P(M|NG) = 10⁻⁶, look at how P(G|M) varies with P(G):

P(G)     P(G|M)
10⁻⁹     0.001
10⁻⁸     0.01
10⁻⁷     0.09
10⁻⁶     0.5
10⁻⁵     0.9
10⁻⁴     0.99
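The table can be reproduced directly from the formula above; a minimal sketch:

```python
# Reproduce the table from P(G|M) = 1 / (1 + P(M|NG) * (1 - P(G)) / P(G)).
P_M_given_NG = 1e-6  # probability of a match for an innocent person

for exponent in range(-9, -3):
    P_G = 10.0 ** exponent
    P_G_given_M = 1.0 / (1.0 + P_M_given_NG * (1.0 - P_G) / P_G)
    print(f"P(G) = 1e{exponent}:  P(G|M) = {P_G_given_M:.3f}")
```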
The key point here is that, although in all cases the existence of the match dramatically increases
the probability of guilt, whether it proves guilt beyond reasonable doubt depends strongly on the prior
probability of guilt, P(G). Put another way, if there's no other evidence except the DNA match, and it's
a crime one believes the defendant is a priori very unlikely to have committed, the match should not
be enough to convict. Unfortunately, however, people who are probably innocent have been sent to
jail because this statistical point is not well understood by judges, juries, or lawyers.
The defence attorney’s fallacy
The "Defence attorney's fallacy" is the claim that because several innocent people will have false
matches, the defendant's match tells us little.
Consider some facts from the well-known OJ Simpson murder trial:
• OJ’s DNA matched that found at the scene.
• The probability of a match between a person chosen at random and the DNA found at the scene
was 1 in 4 million.
The fallacy (in this case articulated by Johnnie Cochran) is that
• In a city of 20 million people one would expect 5 false DNA matches. Therefore the defendant's
match only means that the probability of guilt is 1/5.
This is a fallacy for the same reason as the previous example: it fails to take into account the prior
probability of the defendant's guilt.
Cochran asserted OJ's prior P(G) to be the same as that of any person selected at random from the city
of Los Angeles. However, crime statistics show that if a woman is murdered, the probability that it was
her husband/partner is about 0.25.¹ Therefore, a much more reasonable starting estimate for OJ's P(G)
might be 0.25 rather than 1 in 20 million. With P(M|NG) = 1 in 4 million this gives P(G|M) ≈ 0.999999.
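Plugging these numbers into the earlier formula confirms the figure (a quick check):

```python
# Check with P(G|M) = 1 / (1 + P(M|NG) * (1 - P(G)) / P(G)).
P_M_given_NG = 1.0 / 4e6   # random-match probability: 1 in 4 million
P_G = 0.25                 # prior based on husband/partner crime statistics

P_G_given_M = 1.0 / (1.0 + P_M_given_NG * (1.0 - P_G) / P_G)
print(f"P(G|M) = {P_G_given_M:.6f}")   # prints 0.999999
```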
A note about Bayesian versus frequentist statistics
All statisticians agree that Bayes' theorem is mathematically true. However, some statisticians do not
believe it is very useful, because of the difficulty of choosing values for the prior probabilities ("priors").
In contrast to the "Bayesian" school, the "frequentist" school argues that probabilities should only be
assigned on the basis of direct measurements.
As a simple example, if you were handed a random coin to toss you might assign P(heads) = 0.5;
however, a strict frequentist would argue that a number for P(heads) should only be assigned after
gathering data by tossing that coin a large number of times.
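To make the contrast concrete, here is a small sketch (with made-up toss data) comparing a Bayesian update of a Beta prior over P(heads) with the frequentist estimate:

```python
# Bayesian updating of P(heads) with a Beta prior, vs. the frequentist
# estimate (observed fraction of heads). Toss data is invented.
tosses = [1, 0, 1, 1, 0, 1, 1, 1]   # 1 = heads, 0 = tails

alpha, beta = 1.0, 1.0              # Beta(1, 1) = uniform prior over P(heads)
for t in tosses:
    alpha += t                      # one more head observed
    beta += 1 - t                   # one more tail observed

heads = sum(tosses)
print(f"Bayesian posterior mean: {alpha / (alpha + beta):.3f}")   # (heads+1)/(n+2) = 0.700
print(f"Frequentist estimate:    {heads / len(tosses):.3f}")      # heads/n = 0.750
```

With little data the prior pulls the Bayesian estimate towards 0.5; as the number of tosses grows, the two estimates converge.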
Most computational neuroscientists are Bayesians, as this approach has turned out to be very fruitful
for understanding various aspects of how the brain works.
Suggested reading
Dayan, P. & Abbott, L.F. (2001). Theoretical Neuroscience. MIT Press (pp 87-89).
Rieke, F., Warland, D., de Ruyter van Steveninck, R. & Bialek, W. (1999). Spikes: Exploring the
Neural Code. MIT Press (chapter 2).
Doya, K. Ishii, S., Pouget, A. & Rao, R.P.N. (2007). Bayesian Brain: Probabilistic Approaches to
Neural Coding. MIT Press (chapter 1).
There are also many good sources to be found online by searching for "Bayes' theorem", "Bayesian
statistics", "Bayesian inference", etc.
¹ Women may be interested to apply Bayes' theorem to reassure themselves that this does NOT mean there's a 25%
chance they will be murdered by their partner.
Figure 1: Statistical relations between the stimulus velocity v and a spike count n for a fly neuron.
v is the value of the stimulus velocity averaged over a 200 ms time window, measured in ommatidia
per second. n is the number of spikes counted in this time window. A. Probability density P(v)
for all the 200 ms windows in the experiment. B. Probability P(n) of finding n spikes in a 200 ms
window. C. Joint probability density P(n, v) for n and v; P(v) and P(n) are the two marginal
distributions of P(n, v). As can be seen, P(n, v) ≠ P(n)P(v), which means there is indeed a correlation
between stimulus and response. We can look at this correlation in two ways, either forward or reverse.
The reverse description is summarized in P(v|n), shown in D, while the forward description is
summarized in P(n|v), shown in E. The white lines in panels F and G, replotted in more standard
format in H and I, show the average values of v given n, and of n given v, respectively (i.e. the
Bayesian estimator with a squared loss function). The average value v_av in F and H gives the best
estimate of the stimulus given that a response n is observed; this is akin to the problem an observer
of the spike train must solve. The average n_av in G and I gives the average response as a function of
the stimulus, corresponding to the forward description. Notice that the reverse estimator can be quite
linear, even when the forward description is clearly nonlinear.