Mixtures of Distributions

A Quantitative Overview to Gene Expression Profiling in Animal Genetics
Analysis of (cDNA) Microarray Data:
Part V. Mixtures of Distributions
Model-Based Clustering via Mixtures of Distributions
Armidale Animal Breeding Summer Course, UNE, Feb. 2006
Definition
• The mixture model assumes that each cluster (or component) of
the data is generated by an underlying normal distribution.
• Each observation in y is assumed to be an independent draw
from a mixture density with k (possibly unknown but finite)
components and with probability density function:
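$$f(y; \Psi_k) = \sum_{i=1}^{k} \pi_i \, \phi(y; \mu_i, V_i)$$
where \phi(y; \mu_i, V_i) is the normal density function with mean \mu_i and variance V_i, the \pi_i are the mixing proportions (they add to 1), and \Psi_k collects all the parameters of the k-component mixture.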
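As a minimal sketch of the definition above, the mixture density can be evaluated as a weighted sum of normal densities. The function names and the parameter values below are illustrative assumptions (they are not taken from the course material), and N(mean, variance) is assumed as the parameterisation.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(y, proportions, means, variances):
    """f(y) = sum_i pi_i * phi(y; mu_i, V_i)."""
    return sum(p * normal_pdf(y, m, v)
               for p, m, v in zip(proportions, means, variances))

# Hypothetical two-component mixture (values chosen only for illustration).
proportions = [0.6, 0.4]
means = [0.0, 4.0]
variances = [1.0, 2.0]

# Mixing proportions must add to 1.
assert abs(sum(proportions) - 1.0) < 1e-12

# Crude Riemann sum over [-20, 20] to check that the mixture is a proper density.
step = 0.01
n_steps = 4000
grid = [-20.0 + step * i for i in range(n_steps + 1)]
area = sum(mixture_pdf(y, proportions, means, variances) for y in grid) * step
print("approximate area under the mixture density:", round(area, 4))
```

Because the proportions add to 1 and each component integrates to 1, the area under the mixture should come out very close to 1.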
Introduction
$$f(y; \Psi_k) = \sum_{j=1}^{k} \pi_j \, \phi(y; \mu_j, V_j)$$
The Guru
http://www.maths.uq.edu.au/~gjm
Software and Resources
EM Algorithm
$$f(y; \Psi_k) = \sum_{i=1}^{k} \pi_i \, \phi(y; \mu_i, V_i)$$
The EM algorithm obtains the maximum likelihood estimate of \Psi by
iteration. In the (m+1)th iteration, the estimates of the parameters
of interest are updated by:
$$\pi_i^{(m+1)} = \frac{1}{n} \sum_{j=1}^{n} \tau_{ij}^{(m)}$$
$$\mu_i^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(m)} \, y_j}{\sum_{j=1}^{n} \tau_{ij}^{(m)}}$$
$$V_i^{(m+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(m)} \, (y_j - \mu_i^{(m+1)})(y_j - \mu_i^{(m+1)})^T}{\sum_{j=1}^{n} \tau_{ij}^{(m)}}$$
where
$$\tau_{ij}^{(m)} = \frac{\pi_i^{(m)} \, \phi(y_j; \mu_i^{(m)}, V_i^{(m)})}{f(y_j; \Psi^{(m)})}$$
is the posterior probability that y_j belongs to the i-th component of
the mixture (with a very elegant link to the False Discovery Rate).
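The update equations above can be sketched for the univariate case as follows. This is an illustration under stated assumptions, not the EMMIX implementation: the function names, the starting values, the stopping rule and the simulated test data (700 draws from N(0,1) and 300 from N(6,4)) are all hypothetical choices made for the example.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(y, pi, mu, var):
    """One EM iteration; also returns the log-likelihood at the input parameters."""
    k, n = len(pi), len(y)
    tau = [[0.0] * n for _ in range(k)]
    loglik = 0.0
    # E-step: posterior probability tau[i][j] that y[j] belongs to component i.
    for j, yj in enumerate(y):
        weighted = [pi[i] * normal_pdf(yj, mu[i], var[i]) for i in range(k)]
        f = sum(weighted)
        loglik += math.log(f)
        for i in range(k):
            tau[i][j] = weighted[i] / f
    # M-step: weighted proportions, means and variances.
    new_pi, new_mu, new_var = [], [], []
    for i in range(k):
        total = sum(tau[i])
        m = sum(t * yj for t, yj in zip(tau[i], y)) / total
        v = sum(t * (yj - m) ** 2 for t, yj in zip(tau[i], y)) / total
        new_pi.append(total / n)
        new_mu.append(m)
        new_var.append(v)
    return new_pi, new_mu, new_var, loglik

def fit_em(y, pi, mu, var, max_iter=500, tol=1e-6):
    """Iterate EM until the log-likelihood gain drops below tol (assumed stopping rule)."""
    previous = -math.inf
    loglik = previous
    for _ in range(max_iter):
        pi, mu, var, loglik = em_step(y, pi, mu, var)
        if loglik - previous < tol:
            break
        previous = loglik
    return pi, mu, var, loglik

# Hypothetical test data: 0.7 N(0, 1) + 0.3 N(6, 4).
random.seed(1)
data = ([random.gauss(0.0, 1.0) for _ in range(700)]
        + [random.gauss(6.0, 2.0) for _ in range(300)])

# Crude starting values: equal proportions, means at the extremes, pooled variance.
mean_y = sum(data) / len(data)
var_y = sum((v - mean_y) ** 2 for v in data) / len(data)
pi_hat, mu_hat, var_hat, loglik = fit_em(
    data, [0.5, 0.5], [min(data), max(data)], [var_y, var_y])

print("proportions:", [round(p, 3) for p in pi_hat])
print("means:      ", [round(m, 3) for m in mu_hat])
print("variances:  ", [round(v, 3) for v in var_hat])
```

With well-separated components such as these, the estimates should land near the simulating values; with heavily overlapping components EM can be sensitive to the starting values, so several starts are usually worth trying.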
EM Algorithm
$$f(y; \Psi_k) = \sum_{i=1}^{k} \pi_i \, \phi(y; \mu_i, V_i)$$
• We fit mixtures with k = 1, 2, 3, … components in turn.
• Criteria for model selection include the Akaike
Information Criterion (AIC) and the Bayesian Information
Criterion (BIC):
$$\mathrm{AIC} = -2 \log L(\hat{\Psi}_k) + 2 \nu_k$$
$$\mathrm{BIC} = -2 \log L(\hat{\Psi}_k) + \nu_k \log(n)$$
where \nu_k = 3k - 1 is the number of independent parameters in the
mixture (k means, k variances and k - 1 free mixing proportions in
the univariate case).
• Alternatively, the distribution of the likelihood ratio test
(LRT) can be estimated by bootstrapping and P-values
obtained to contrast a model with k components against a
model with k + 1 components.
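The AIC and BIC criteria above are simple functions of the maximised log-likelihood, as the sketch below shows. The function name is an assumption, and the log-likelihood values are made-up numbers used only to show the comparison across k; they do not come from any fitted model.

```python
import math

def information_criteria(loglik, k, n):
    """AIC and BIC for a univariate k-component normal mixture fitted to n records."""
    nu = 3 * k - 1                       # independent parameters in the mixture
    aic = -2.0 * loglik + 2.0 * nu
    bic = -2.0 * loglik + nu * math.log(n)
    return aic, bic

# HYPOTHETICAL maximised log-likelihoods for k = 1, 2, 3 (illustration only).
n_records = 1000
loglik_by_k = {1: -2100.0, 2: -1985.0, 3: -1983.5}

results = {k: information_criteria(ll, k, n_records) for k, ll in loglik_by_k.items()}
for k, (aic, bic) in sorted(results.items()):
    print(f"k={k}  AIC={aic:.1f}  BIC={bic:.1f}")

# Smaller is better for both criteria.
best_k_aic = min(results, key=lambda k: results[k][0])
best_k_bic = min(results, key=lambda k: results[k][1])
print("preferred k by AIC:", best_k_aic, " by BIC:", best_k_bic)
```

In these made-up numbers the third component improves the log-likelihood only slightly, so both criteria penalise it and prefer k = 2. BIC penalises extra parameters more heavily than AIC whenever log(n) > 2.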
Simulation 1
Consider these distributions and simulate:

Distribution    Records
N(1,5)           10,000
N(5,10)           5,000

The mixture becomes:
$$f(y; \hat{\Psi}) = \tfrac{2}{3} \, N(1, 5) + \tfrac{1}{3} \, N(5, 10)$$
Posterior probability:
$$\tau_{ij} = \frac{\pi_i \, \phi(y_j; \mu_i, V_i)}{f(y_j; \Psi)}$$
The denominator f(y_j; \Psi) is the weighted average of the component likelihoods (weighted by the mixing proportions).

Likelihood
y            -1       0       1       5       7
N(1,5)     0.120   0.161   0.178   0.036   0.005
N(5,10)    0.021   0.036   0.056   0.126   0.103
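The likelihood table for Simulation 1 and the resulting posterior probabilities can be reproduced with a few lines of Python. This is a sketch assuming N(mean, variance) as the parameterisation; the function and variable names are illustrative.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# (mixing proportion, mean, variance) for N(1,5) and N(5,10).
components = [(2.0 / 3.0, 1.0, 5.0), (1.0 / 3.0, 5.0, 10.0)]

def likelihoods(y):
    """Component likelihoods phi(y; mu_i, V_i)."""
    return [normal_pdf(y, m, v) for _, m, v in components]

def posterior(y):
    """Posterior probability of each component given y."""
    weighted = [p * normal_pdf(y, m, v) for p, m, v in components]
    total = sum(weighted)            # weighted average of the likelihoods
    return [w / total for w in weighted]

print("    y   N(1,5)  N(5,10)  P(comp 1)  P(comp 2)")
for y in (-1, 0, 1, 5, 7):
    lik = likelihoods(y)
    post = posterior(y)
    print(f"{y:5d}  {lik[0]:7.3f}  {lik[1]:7.3f}  {post[0]:9.3f}  {post[1]:9.3f}")
```

The first two numeric columns should match the likelihood table up to rounding; the last two show how the weighted likelihoods translate into cluster membership probabilities.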
Simulation 2
Consider these distributions and simulate:

Distribution    Records    Microarray
N(0,1)            9,000    Non-DE Genes
N(0,10)           1,000    DE Genes

The mixture becomes:
$$f(y; \hat{\Psi}) = 0.9 \, N(0, 1) + 0.1 \, N(0, 10)$$
Simulation 2
1. Simulate: $f(y; \hat{\Psi}) = 0.9 \, N(0, 1) + 0.1 \, N(0, 10)$
2. Ask EMMIX to fit mixtures with up to 5 components and…
3. EMMIX model of best fit:
$f(y; \hat{\Psi}) = 0.903 \, N(0.006, 0.993) + 0.097 \, N(0.010, 10.805)$
Simulation 2
1. Simulate: $f(y; \hat{\Psi}) = 0.9 \, N(0, 1) + 0.1 \, N(0, 10)$
3. EMMIX best fit: $f(y; \hat{\Psi}) = 0.903 \, N(0.006, 0.993) + 0.097 \, N(0.010, 10.805)$
[Figure: axes labelled Frequency and Post Prob]
The posterior probabilities act as a “decision function” that switches at y ≈ ±2.75.
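The switching point of the decision function can be checked numerically by finding where the posterior probability of the wide (DE) component crosses 0.5. The sketch below uses the EMMIX best-fit values as quoted on the slide; the bisection helper and its search intervals are assumptions made for the illustration.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# EMMIX best fit as quoted: (proportion, mean, variance).
# Component 0 is the narrow (non-DE) cluster, component 1 the wide (DE) cluster.
FIT = [(0.903, 0.006, 0.993), (0.097, 0.010, 10.805)]

def post_de(y):
    """Posterior probability that y belongs to the wide (DE) component."""
    weighted = [p * normal_pdf(y, m, v) for p, m, v in FIT]
    return weighted[1] / sum(weighted)

def boundary(lo, hi, tol=1e-10):
    """Bisection for post_de(y) = 0.5; assumes exactly one crossing in [lo, hi]."""
    f_lo = post_de(lo) - 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = post_de(mid) - 0.5
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # crossing lies in the upper half
        else:
            hi = mid                # crossing lies in the lower half
    return 0.5 * (lo + hi)

upper = boundary(0.0, 10.0)
lower = boundary(-10.0, 0.0)
print("decision function switches near", round(lower, 2), "and", round(upper, 2))
```

Genes with y between the two boundaries are more likely non-DE than DE; genes outside them are more likely DE. The boundaries should fall close to the ±2.75 quoted above.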
Linking Posterior Probabilities with False Discovery Rate
[Figure labels: Not-DE, DE]
Select the N most extreme genes; the FDR is then the average
posterior probability of not being in the cluster of extremes.
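The rule above can be sketched with simulated data, where the true DE status is known and the estimated FDR can be compared with the realised proportion of false discoveries. This is an illustration under assumptions: it simulates 9,000 non-DE genes from N(0,1) and 1,000 DE genes from N(0,10), uses the simulating parameters (rather than fitted ones) to compute posterior probabilities, and ranks genes by |y|. All names and the seed are hypothetical.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

PI = (0.9, 0.1)      # mixing proportions: non-DE, DE
VAR = (1.0, 10.0)    # variances: non-DE, DE (both means are 0)

def post_not_de(y):
    """Posterior probability that a gene with value y is NOT in the extreme cluster."""
    a = PI[0] * normal_pdf(y, 0.0, VAR[0])
    b = PI[1] * normal_pdf(y, 0.0, VAR[1])
    return a / (a + b)

random.seed(2006)
genes = ([(random.gauss(0.0, 1.0), False) for _ in range(9000)]
         + [(random.gauss(0.0, math.sqrt(10.0)), True) for _ in range(1000)])

# Most extreme genes first.
ranked = sorted(genes, key=lambda g: abs(g[0]), reverse=True)

def fdr_top(n):
    """Estimated FDR (mean posterior of non-DE) and realised false proportion in the top n."""
    top = ranked[:n]
    estimated = sum(post_not_de(y) for y, _ in top) / n
    realised = sum(1 for _, is_de in top if not is_de) / n
    return estimated, realised

for n in (50, 100, 500, 1000, 2000):
    est, real = fdr_top(n)
    print(f"N={n:5d}  estimated FDR={est:.3f}  realised false proportion={real:.3f}")
```

The estimated FDR rises as N grows, because each additional gene is less extreme and therefore more likely to belong to the non-DE cluster.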
Simulation 2
1. Simulate: $f(y; \hat{\Psi}) = 0.9 \, N(0, 1) + 0.1 \, N(0, 10)$
3. EMMIX best fit: $f(y; \hat{\Psi}) = 0.903 \, N(0.006, 0.993) + 0.097 \, N(0.010, 10.805)$
Select the N most extreme genes; the FDR is then the average Post Prob of
not being in the cluster of extremes.
[Figure: FDR by N Genes, with Post Prob]
Example
“Diets”
(only REFERENCE components of the design)
[Equation: y_i^{HvL}, expressed in terms of g_i, r_i and the constant 8]
$$f(y; \Psi_k) = \sum_{i=1}^{k} \pi_i \, \phi(y; \mu_i, V_i)$$
$$f(y; \hat{\Psi}) = 0.044 \, N(0.87, 67.46) + 0.590 \, N(2.30, 10.42) + 0.366 \, N(2.41, 2.32)$$
[Figure: FDR by N Genes]
In Reverter et al. ‘03 (JAS 81:1900), 27 genes were reported as having a
PP > 0.95 of being in the extreme cluster. We can now assess that these
27 genes carry an FDR < 10%.