Quantitative Biology Lecture 1 (Introduction + Probability)

21st Sep 2015
Gurinder Singh Mickey Atwal
Center for Quantitative Biology
Why Quantitative Biology ?
(1. Models)
•  Galileo Galilei: “The Book of Nature is written in the
language of mathematics.”
•  Charles Darwin: “I have deeply regretted that I did
not proceed far enough at least to
understand something of the great
leading principles of mathematics;
for men thus endowed seem to
have an extra sense.”
Why Quantitative Biology ?
(2. Data Analysis)
Explosion of biological data by recent developments in high throughput biotechnology.
Moore's Law: doubling of transistors on a chip every 18 months.
Quantitative Biology at CSHL
[Diagram: QB linked with Systems Neuroscience, Developmental Biology, and Genetics and Genomics]
Why Statistics ?
•  Molecular biology is noisy
–  Biological molecules diffuse
–  Small number fluctuations
•  Genetics is stochastic
–  Genetic variation mainly due to random
events (drift)
•  Large data analysis
–  Experimental measurements are noisy
All of these require probabilistic models
Key Numbers in Biology
•  See handout
•  Database of useful biological numbers
(http://bionumbers.hms.harvard.edu/)
•  Quantified biological properties can
–  Test our understanding of the biology
–  Aid in experimental design
Key Numbers Example 1
(Class Exercise)
Q: Is there enough time to replicate the E.
coli genome ?
Bionumbers
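A back-of-envelope sketch of how this exercise might be set up. None of the numbers below come from the slide: the genome size, fork speed, number of forks, and division time are commonly quoted order-of-magnitude values, used here as assumptions that should be checked against the handout or BioNumbers.

```python
# Rough estimate of the time needed to replicate the E. coli genome.
# All constants are assumed order-of-magnitude values, not taken from the slide.
GENOME_BP = 4.6e6            # assumed genome size in base pairs
FORK_SPEED_BP_PER_S = 1.0e3  # assumed speed of a single replication fork
N_FORKS = 2                  # assumed bidirectional replication from one origin
DIVISION_TIME_MIN = 20.0     # assumed fastest doubling time in rich medium


def replication_time_min(genome_bp, fork_speed_bp_per_s, n_forks):
    """Minutes needed to copy the genome once with n_forks working in parallel."""
    seconds = genome_bp / (fork_speed_bp_per_s * n_forks)
    return seconds / 60.0


t_rep = replication_time_min(GENOME_BP, FORK_SPEED_BP_PER_S, N_FORKS)
print(f"replication time ~ {t_rep:.0f} min")
print(f"division time    ~ {DIVISION_TIME_MIN:.0f} min")
print("replication slower than division:", t_rep > DIVISION_TIME_MIN)
```

Under these assumed values one round of replication takes longer than one division, which is the tension the question points at; the usual resolution is that a new round of replication starts before the previous one has finished.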
Key Numbers Example 2
(Class Exercise)
Q: How far can macromolecules move by
diffusion?
It takes about 10 seconds on average for a protein to
traverse a HeLa cell. An axon 1 mm long is about 100
times longer than a HeLa cell, and since diffusion time
scales as the square of the distance, it would take about $10^5$
seconds (roughly a day) for a molecule to travel this distance by
diffusion. This demonstrates the necessity of mechanisms
other than diffusion for moving molecules over long distances. A
molecular motor moving at a rate of 1 µm/s will take a
“reasonable” time (about 15 min) to traverse an axon 1 mm long.
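The scaling argument above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the paragraph; the function and variable names are made up for illustration.

```python
# Diffusion time grows with the square of the distance,
# while transport by a molecular motor grows linearly with distance.
HELA_CROSSING_TIME_S = 10.0  # from the text: ~10 s to diffuse across a HeLa cell
LENGTH_RATIO = 100.0         # from the text: a 1 mm axon is ~100 cell lengths
AXON_LENGTH_UM = 1000.0      # 1 mm expressed in micrometres
MOTOR_SPEED_UM_PER_S = 1.0   # from the text: motor speed of 1 um/s


def diffusion_time_s(reference_time_s, length_ratio):
    """Scale a known diffusion time to a distance length_ratio times longer."""
    return reference_time_s * length_ratio ** 2


def motor_time_s(length_um, speed_um_per_s):
    """Time for directed transport at constant speed."""
    return length_um / speed_um_per_s


t_diff = diffusion_time_s(HELA_CROSSING_TIME_S, LENGTH_RATIO)
t_motor = motor_time_s(AXON_LENGTH_UM, MOTOR_SPEED_UM_PER_S)
print(f"diffusion: {t_diff:.0e} s = {t_diff / 3600:.1f} h")
print(f"motor:     {t_motor:.0f} s = {t_motor / 60:.1f} min")
```

The motor wins by roughly two orders of magnitude over 1 mm, and the gap widens with distance because one time grows quadratically and the other linearly.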
Classic Examples of Quantitative
Reasoning in Biology
•  Two-hit Tumor Suppressor Hypothesis
•  Tumor Subtyping from Expression Profiles
•  Delbruck-Luria Experiment (next lecture)
Case study 1
•  Tumor suppressor gene hypothesis
Case Study 1
Incidence of childhood
retinoblastoma
•  Retinoblastoma : childhood
form of retinal cancer
•  Alfred Knudson (geneticist)
studied 48 patients and
tabulated age of diagnoses
•  Tumor occurred either in one
eye (unilateral) or both eyes
(bilateral)
Knudson 1971
Case Study 1
Incidence of childhood
retinoblastoma
Probability distribution of retinoblastoma cases by type and laterality

                Bilateral    Unilateral   Total
Hereditary      0.25-0.30    0.10-0.15    0.35-0.45
Nonhereditary   0            0.55-0.65    0.55-0.65
Total           0.25-0.30    0.70-0.75    1.00
Knudson observed that bilateral hereditary cases of
retinoblastoma occurred at a younger age and showed a
curve (previous slide) consistent with a single-mutation process.
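One way to picture what "consistent with a single-mutation process" means, as a sketch under simple assumptions: if one rate-limiting event occurring at a constant rate is enough, the fraction of cases not yet diagnosed decays exponentially with age, so its logarithm is a straight line in age. If two such events are needed, the logarithm falls off roughly as the square of age. The rate constants below are hypothetical values chosen only for illustration, not fitted to the data.

```python
import math

# Hypothetical rate constants for illustration only (age t in months).
K_ONE_HIT = 1.0 / 30.0  # assumed rate for a one-event process
K_TWO_HIT = 4.0e-5      # assumed rate for a two-event process


def undiagnosed_one_hit(t_months):
    """Fraction not yet diagnosed if a single event is rate limiting."""
    return math.exp(-K_ONE_HIT * t_months)


def undiagnosed_two_hit(t_months):
    """Fraction not yet diagnosed if two events are required."""
    return math.exp(-K_TWO_HIT * t_months ** 2)


for t in (0, 12, 24, 36, 48, 60):
    one = undiagnosed_one_hit(t)
    two = undiagnosed_two_hit(t)
    print(f"t={t:2d}  one-hit={one:.3f} (log={math.log(one):+.2f})  "
          f"two-hit={two:.3f} (log={math.log(two):+.2f})")
```

Plotting the logarithm against age separates the two cases by eye: a straight line for the one-event model, a downward-curving parabola for the two-event model.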
Case Study 1
Tumor Suppressor Genes
•  Statistical analyses by Knudson (1971) led to
the concept of tumor suppressor genes and the
two-hit hypothesis
•  The gene’s normal function is to regulate cell
division. Both alleles need to be mutated or
removed in order to lose the gene activity.
•  The first mutation may be inherited or somatic.
•  The second mutation will often be a gross event
leading to loss of heterozygosity in the
surrounding area.
Case Study 1
Knudson’s “two hit” hypothesis
Case Study 2
•  Cancer subtyping by genome-wide
expression analysis
Case Study 2
Cancer Classification
Leukemia: AML, ALL
Unusual morphology of lymphoblasts in ALL
Case Study 2
Microarray Clustering
Clustering of genome-wide expression data revealed a new subtype
of ALL with poor prognosis under conventional therapy
Armstrong et al, Nature Genetics
ALL and MLL are indistinguishable from a pathologist’s perspective
Case Study 2
MLL translocations specify a distinct
gene expression profile that
distinguishes a unique leukemia
a, Principal component analysis (PCA) plot of ALL (red), MLL (blue) and AML (yellow)
carried out using 8,700 genes that passed filtering. b, PCA plot comparing ALL (red),
MLL (blue) and AML (yellow) using the 500 genes that best distinguished ALL from AML.
Now let’s dive into some
mathematics:
Probability theory
Probabilities and Ensembles
•  An ensemble is a set of probabilities describing a
particular random process. An ensemble is defined by
–  $x$, the value of a random variable, taken from the alphabet
–  $A_X = \{a_1, a_2, a_3, \ldots, a_I\}$, the alphabet
–  $P_X = \{p_1, p_2, p_3, \ldots, p_I\}$, the probabilities
All the probabilities have to be non-negative and add up to one:
$P(x = a_i) = p_i, \quad p_i \ge 0, \quad \sum_{a_i \in A_X} P(x = a_i) = 1$
Example: Probability of a single nucleotide in the human genome
pA ≈0.29 pC ≈0.21 pG ≈ 0.21 pT ≈0.29
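As a minimal sketch, the two conditions on an ensemble (no negative probabilities, total of one) can be checked directly for the nucleotide example. The helper name `is_valid_ensemble` is made up for illustration.

```python
import math

# Nucleotide ensemble from the example (approximate human genome values).
alphabet = ("A", "C", "G", "T")
probabilities = (0.29, 0.21, 0.21, 0.29)


def is_valid_ensemble(probs, tol=1e-9):
    """True if every probability is non-negative and they sum to one."""
    non_negative = all(p >= 0 for p in probs)
    normalized = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return non_negative and normalized


ensemble = dict(zip(alphabet, probabilities))
print(ensemble)
print("valid ensemble:", is_valid_ensemble(probabilities))
```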
Joint Probabilities
•  JOINT PROBABILITY: $P(x, y)$
$P(x, y) = P(x)\,P(y)$ if X and Y are INDEPENDENT
•  MARGINAL PROBABILITY: $P(x) = \sum_{y} P(x, y)$
Joint Probability: Dinucleotides
Single-strand frequencies of human genes
(rows: 1st position, columns: 2nd position)

            A       C       G       T       MARGINAL
A           0.069   0.057   0.077   0.051   0.25
C           0.078   0.077   0.035   0.071   0.26
G           0.078   0.070   0.075   0.047   0.27
T           0.029   0.058   0.081   0.048   0.21
MARGINAL    0.25    0.26    0.27    0.21
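The dinucleotide table can be used to illustrate both marginalization and the independence test. The sketch below assumes the rows of the table index the 1st position and the columns the 2nd position; that reading is an assumption about the slide layout. It sums the joint probabilities to recover the marginals, then compares the observed CG frequency with the product of the marginals expected under independence.

```python
BASES = "ACGT"

# JOINT[first][second] = P(1st = first, 2nd = second), values from the table.
# Assumption: table rows are the 1st position, columns the 2nd position.
JOINT = {
    "A": {"A": 0.069, "C": 0.057, "G": 0.077, "T": 0.051},
    "C": {"A": 0.078, "C": 0.077, "G": 0.035, "T": 0.071},
    "G": {"A": 0.078, "C": 0.070, "G": 0.075, "T": 0.047},
    "T": {"A": 0.029, "C": 0.058, "G": 0.081, "T": 0.048},
}


def marginal_first(joint):
    """P(1st = a), obtained by summing over the 2nd position."""
    return {a: sum(joint[a].values()) for a in joint}


def marginal_second(joint):
    """P(2nd = b), obtained by summing over the 1st position."""
    return {b: sum(joint[a][b] for a in joint) for b in BASES}


p_first = marginal_first(JOINT)
p_second = marginal_second(JOINT)

# If the two positions were independent, P(C, G) would equal P(C) * P(G).
observed_cg = JOINT["C"]["G"]
expected_cg = p_first["C"] * p_second["G"]
print(f"observed P(C,G) = {observed_cg:.3f}")
print(f"expected under independence = {expected_cg:.3f}")
```

The observed CG frequency is about half of what independence would predict, so the two positions are not independent in this table.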
Conditional Probability
$P(x \mid y) = \frac{P(x, y)}{P(y)}$
Probability of x given y
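As a small worked sketch of this definition, take the dinucleotide joint probabilities for a first nucleotide C (read here as one row of the dinucleotide table, which is an assumption about its layout) and divide by their sum, the marginal P(1st = C). The result is the distribution of the second nucleotide given that the first is C.

```python
# Joint probabilities P(1st = C, 2nd = b) for b in A, C, G, T.
# Assumption: these are the entries of the table row for 1st position = C.
joint_c = {"A": 0.078, "C": 0.077, "G": 0.035, "T": 0.071}


def condition(joint_row):
    """Divide joint probabilities by their marginal to get a conditional distribution."""
    marginal = sum(joint_row.values())
    return {b: p / marginal for b, p in joint_row.items()}


p_second_given_c = condition(joint_c)
for base, p in p_second_given_c.items():
    print(f"P(2nd = {base} | 1st = C) = {p:.3f}")
```

Unlike the joint probabilities in the row, the conditional probabilities sum to one, because conditioning renormalizes by P(1st = C).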
Probability Rules
•  Product Rule (Chain Rule)
$P(x, y) = P(x \mid y)\,P(y)$
•  Sum Rule
$P(x) = \sum_{y} P(x, y) = \sum_{y} P(x \mid y)\,P(y)$
Bayes’ Theorem
$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$
$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}$
Bayes’ Theorem tells you how to switch the
order of variables in a conditional probability,
i.e., how to convert P(x|y) into P(y|x).
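The theorem follows in one step from the product rule written both ways round, with the sum rule supplying the denominator. Spelled out as a short derivation (the primed index $y'$ is introduced here only to keep the summation variable distinct from $y$):

```latex
\begin{align}
P(x, y) &= P(x \mid y)\,P(y) = P(y \mid x)\,P(x) \\
\Rightarrow\quad P(y \mid x) &= \frac{P(x \mid y)\,P(y)}{P(x)}
  = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}
\end{align}
```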
Thomas Bayes (1701-1761),
English Presbyterian minister
Class Exercise: Am I diseased?
•  There is a rare disease that affects 0.1% of the
population.
•  You take a test, which is marketed to be 99% correct,
i.e., in 99/100 cases test shows positive when user
actually has disease, and in 99/100 cases test shows
negative when user actually does not have disease.
•  Test shows positive for you. What is the probability
that you have the disease given the test result?
Class Exercise continued
•  Write down all the probabilities
i)  P(disease)=0.001
ii)  P(test positive | disease)=0.99
iii)  P(test negative | no disease)=0.99
•  We want to evaluate P(disease | test positive).
Use Bayes’ theorem:
$P(\text{disease} \mid \text{test positive}) = \frac{P(\text{test positive} \mid \text{disease})\,P(\text{disease})}{P(\text{test positive})}$
Class Exercise continued
•  Use sum rule to obtain P(test positive):
$P(\text{test positive}) = P(\text{test positive} \mid \text{disease})\,P(\text{disease}) + P(\text{test positive} \mid \text{no disease})\,P(\text{no disease})$
$= (0.99 \times 0.001) + (0.01 \times 0.999) = 0.01098$
•  Plugging this into Bayes’ theorem:
$P(\text{disease} \mid \text{test positive}) = \frac{0.99 \times 0.001}{0.01098} \approx 0.09$
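The same calculation as a minimal sketch in code. The parameter names `sensitivity` and `specificity` are not used on the slides; they are the standard terms for P(test positive | disease) and P(test negative | no disease), adopted here as an assumption for readability.

```python
def posterior_positive(prior, sensitivity, specificity):
    """P(disease | test positive) from Bayes' theorem and the sum rule."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Sum rule: total probability of a positive test.
    p_positive = (p_pos_given_disease * prior
                  + p_pos_given_healthy * (1.0 - prior))
    # Bayes' theorem.
    return p_pos_given_disease * prior / p_positive


p = posterior_positive(prior=0.001, sensitivity=0.99, specificity=0.99)
print(f"P(disease | test positive) = {p:.3f}")

# The answer is dominated by the prior: try a few prevalences.
for prior in (0.001, 0.01, 0.1):
    print(prior, round(posterior_positive(prior, 0.99, 0.99), 3))
```

With a prevalence of 1% instead of 0.1%, the same test gives a posterior of exactly one half, which shows how strongly the result depends on how rare the disease is.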
Class Exercise continued
•  Therefore the actual probability of you having
the disease given the test result is only 9%, i.e.
it is unlikely that you are diseased.
•  Note that the p-value of getting the test result,
with the null hypothesis that you are disease-free,
is 0.01. A frequentist statistician might therefore
conclude that the null hypothesis can be safely
rejected and that you are likely diseased. This is
wrong!
Class Exercise continued
This picture may help in understanding the
“counterintuitive” Bayesian result
[Figure: a sample of 1000 typical individuals, each marked as one of]
–  true positive: has the disease and tests positive
–  false positive: does not have the disease and tests positive
–  true negative: does not have the disease and tests negative
Class Exercise continued
•  In a sample of 1000 typical individuals you may
have one individual who likely has the disease
[P(disease)=0.001] and likely tests positive
[P(positive|disease)=0.99].
•  Of the remaining 999 disease-free individuals,
approximately ten will likely test as a false
positive [P(positive|no disease)=0.01].
•  Therefore, since only about 11 individuals are
expected to test positive and only one of them has the disease,
$P(\text{disease} \mid \text{test positive}) \approx \frac{1}{11} \approx 9\%$
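The counting argument can be sketched as expected head counts in a population of 1000. The function name and the rounding to whole individuals are choices made here for illustration; the probabilities are those of the exercise.

```python
def expected_counts(population, prior, sensitivity, specificity):
    """Expected numbers of true positives and false positives in a population."""
    diseased = population * prior
    healthy = population - diseased
    true_positives = diseased * sensitivity
    false_positives = healthy * (1.0 - specificity)
    return true_positives, false_positives


tp, fp = expected_counts(population=1000, prior=0.001,
                         sensitivity=0.99, specificity=0.99)

# Exact ratio of expected counts (same number as Bayes' theorem gives).
exact = tp / (tp + fp)
# Ratio after rounding to whole individuals, as in the picture.
rounded = round(tp) / (round(tp) + round(fp))

print(f"expected true positives:  {tp:.2f}")
print(f"expected false positives: {fp:.2f}")
print(f"exact ratio   = {exact:.4f}")
print(f"rounded ratio = {rounded:.4f}")
```

Rounding to one true positive and ten false positives gives 1/11, which agrees with the exact Bayesian result to within a fraction of a percentage point.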