Use of Mixture Model in a genome-wide DNA

A Microarray-Based Screening
Procedure for Detecting Differentially
Represented Yeast Mutants
Rafael A. Irizarry
Department of Biostatistics, JHU
[email protected]
http://biostat.jhsph.edu/~ririzarr
[Figure: experimental design. Panel A: deletion cassette (UPTAG, kanR, DOWNTAG). Panel B: circular pRS416 and EcoRI-linearized pRS416 (both CEN/ARS); transformation into deletion pool; select for Ura+ transformants; genomic DNA preparation; PCR; Cy5- and Cy3-labeled PCR products; oligonucleotide array hybridization; NHEJ defective]
Which mutants are NHEJ
defective?
• Find mutants defective for
transformation with linear DNA
• Dead in linear transformation (green)
• Alive in circular transformation (red)
• Look for spots with large log(R/G)
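A minimal sketch of the log(R/G) screen described above. The intensities, mutant names, and the cutoff of 2 on the log2 scale are hypothetical; the slides do not give a threshold.

```python
import math

def log_ratio(red, green):
    """Return log2(R/G) for one spot; both intensities must be positive."""
    return math.log2(red / green)

# Hypothetical (mutant, R, G) intensities, not taken from the talk
spots = [
    ("mutant_a", 1200.0, 1100.0),  # alive in both transformations
    ("mutant_b", 1500.0, 150.0),   # alive in circular (R), dead in linear (G)
    ("mutant_c", 90.0, 100.0),     # dead in both
]

threshold = 2.0  # assumed cutoff on the log2 scale
candidates = [name for name, r, g in spots if log_ratio(r, g) > threshold]
print(candidates)
```

Note that mutant_a and mutant_c have nearly the same ratio even though one is alive and the other dead, which is the information the ratio throws away.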
[Figure: spots labeled YKU70, NEJ1, YKU80]
Data
• 5718 mutants
• 3 replicates on each slide
• 5 haploid slides, 4 diploid slides
• Arrays are divided into 2 downtag and 3 uptag (2 of which are replicate uptags)
Average Red and Green Scatter Plot
Average Red and Green MVA plot
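For reference, the coordinates of an MVA plot can be computed as below. This is a sketch using the usual convention M = log2 R - log2 G and A = (log2 R + log2 G)/2; the function name `mva` and the example intensities are illustrative.

```python
import math

def mva(red, green):
    """Return (A, M): A is the average log2 intensity, M the log2 ratio."""
    log_r, log_g = math.log2(red), math.log2(green)
    return (log_r + log_g) / 2.0, log_r - log_g

# Hypothetical averaged intensities for one mutant
a, m = mva(1024.0, 256.0)
print(a, m)
```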
Improvement to usual
approach
• Take into account that some mutants
are dead and some alive
• Use a statistical model to represent this
• Mixture model?
• With ratios we lose information about R
and G separately
• Look at them separately (absolute
analysis)
Histograms
Using the model we can attach
uncertainty to tests
For example, a posterior z-test: a
weighted average of z-tests, with weights
given by the posterior probabilities
(obtained from EM)
Is Normal(0,1)
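One way to read the posterior z-test is sketched below. The slides do not define the individual z-tests, so this assumes one z-score per class, each using a class-specific standard error, averaged with the posterior class probabilities as weights. All numbers and the name `posterior_z` are hypothetical.

```python
def posterior_z(diff, class_se, posterior):
    """Average the class-specific z-scores diff / se_k, weighted by posterior[k]."""
    total = sum(posterior.values())
    return sum(posterior[k] * diff / class_se[k] for k in posterior) / total

# Hypothetical values for one mutant (not from the talk)
diff = 1.2                                                # mean log R minus mean log G
class_se = {"dead": 0.6, "marginal": 0.4, "alive": 0.3}   # assumed standard errors
posterior = {"dead": 0.1, "marginal": 0.2, "alive": 0.7}  # e.g. from EM
z = posterior_z(diff, class_se, posterior)
print(round(z, 3))
```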
QQ-Plot
Uptag/Downtag Z-Scores
Average Red and Green MVA Plot
Average Red and Green Scatter Plot
Results Table
 1  YMR106C   9.5  47    69.2  a  a  100
 2  YOR005C  19.7  35    44.9  a  d  100
 3  YLR265C   6.1  32    35.8  a  m  100
 4  YDL041W  10.4  32    35.6  a  m  100
 5  YIL012W  12.2  31    21.7  a  a  100
 6  YIL093C   4.8  29    30.8  a  a  100
 7  YIL009W   5.6  29   -23.5  a  a  100
 8  YDL042C  12.9  29    32.1  a  d  100
 9  YIL154C   1.8  28    91.3  m  m   82
10  YNL149C   1.7  27    93.4  m  d   71
11  YBR085W   2.5  26   -15.8  a  a   84
12  YBR234C   1.7  26    87.5  m  d   75
13  YLR442C   6.1  26  -100.0  a  a  100
Acknowledgements
• Siew Loon Ooi
• Jef Boeke
• Forrest Spencer
• Jean Yang
END
Summary
• Simple data exploration useful tool for
quality assessment
• Statistical thinking helpful for interpretation
• Statistical models may help find signals in
noise
Acknowledgements
Biostatistics: Karl Broman, Leslie Cope, Carlo Colantuoni, Giovanni Parmigiani, Scott Zeger
MBG (SOM): Jef Boeke, Siew-Loon Ooi, Marina Lee, Forrest Spencer
UC Berkeley Stat: Ben Bolstad, Sandrine Dudoit, Terry Speed, Jean Yang
Gene Logic: Francois Colin, Uwe Scherf’s Group
WEHI: Bridget Hobbs, Natalie Thorne
PGA: Tom Cappola, Skip Garcia, Joshua Hare
Warning
• Absolute analyses can be dangerous for
competitive hybridization slides
• We must be careful about “spot effect”
• Big R or G may only mean the spot they
were on had large amounts of cDNA
• Look at some facts that make us feel
safer
Correlation between replicates
R1 R2 R3 G1 G2 G3
R1 1.00 0.95 0.95 0.94 0.90 0.90
R2 0.95 1.00 0.96 0.90 0.95 0.91
R3 0.95 0.96 1.00 0.91 0.92 0.95
G1 0.94 0.90 0.91 1.00 0.96 0.96
G2 0.90 0.95 0.92 0.96 1.00 0.97
G3 0.90 0.91 0.95 0.96 0.97 1.00
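A correlation table of this kind can be computed as sketched below, using Pearson correlation between columns of log intensities. The column values are made up for illustration; only the column names follow the table.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical log intensities for five mutants (not the talk's data)
columns = {
    "R1": [7.1, 9.4, 8.2, 10.5, 6.8],
    "R2": [7.3, 9.1, 8.4, 10.2, 7.0],
    "G1": [7.0, 9.6, 7.5, 10.1, 6.1],
}
corr = correlation_matrix(columns)
for name in columns:
    print(name, " ".join(f"{corr[name][other]:.2f}" for other in columns))
```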
Correlation between red, green,
haploid, diploid, uptag, downtag
(e.g. RHD = red, haploid, downtag)
RHD RHU RDD RDU GHD GHU GDD GDU
RHD 1.00 0.59 0.56 0.32 0.95 0.58 0.54 0.37
RHU 0.59 1.00 0.38 0.56 0.58 0.95 0.40 0.58
RDD 0.56 0.38 1.00 0.58 0.54 0.39 0.92 0.64
RDU 0.32 0.56 0.58 1.00 0.33 0.53 0.58 0.89
GHD 0.95 0.58 0.54 0.33 1.00 0.62 0.56 0.39
GHU 0.58 0.95 0.39 0.53 0.62 1.00 0.41 0.58
GDD 0.54 0.40 0.92 0.58 0.56 0.41 1.00 0.73
GDU 0.37 0.58 0.64 0.89 0.39 0.58 0.73 1.00
BTW
The mean squared error across slides is
about 3 times bigger than the mean
squared error within slides
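The slides do not say how the two mean squared errors were computed. One plausible version, shown here as a sketch on made-up numbers for a single mutant, compares the spread of replicates around their slide mean with the spread of slide means around the grand mean.

```python
def mean(values):
    return sum(values) / len(values)

def within_and_across_mse(slides):
    """slides[i] holds the replicate values of one mutant on slide i.

    Returns (within, across): the mean squared deviation of replicates from
    their slide mean, and of slide means from the grand mean.
    """
    slide_means = [mean(s) for s in slides]
    grand = mean(slide_means)
    within = mean([(x - m) ** 2 for s, m in zip(slides, slide_means) for x in s])
    across = mean([(m - grand) ** 2 for m in slide_means])
    return within, across

# Hypothetical log intensities: 3 slides, 3 replicates each
slides = [[8.0, 8.2, 7.8], [9.0, 9.2, 8.8], [7.0, 7.2, 6.8]]
within, across = within_and_across_mse(slides)
print(within, across)
```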
Mixture Model
We use a mixture model that assumes:
• There are three classes:
– Dead
– Marginal
– Alive
• Within each class, measurements are normally distributed, with the same
correlation structure from gene to gene
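To illustrate the idea, here is a simplified sketch of EM for a three-class mixture. It is not the model from the talk: it is one-dimensional with a single shared standard deviation, whereas the talk's model is multivariate with a shared correlation structure. The data, starting values, and function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_mixture(data, means, sigma=1.0, n_iter=50):
    """Fit a 1-D Gaussian mixture with one shared sigma by EM.

    Returns (weights, means, sigma, posteriors); posteriors[i][j] is the
    probability that data[i] belongs to class j, from the final E-step.
    """
    means = list(means)
    k, n = len(means), len(data)
    weights = [1.0 / k] * k
    post = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for every observation
        post = []
        for x in data:
            dens = [w * normal_pdf(x, m, sigma) for w, m in zip(weights, means)]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: update class weights, class means, and the shared sigma
        for j in range(k):
            nj = sum(p[j] for p in post)
            weights[j] = nj / n
            means[j] = sum(p[j] * x for p, x in zip(post, data)) / nj
        ss = sum(p[j] * (x - means[j]) ** 2
                 for p, x in zip(post, data) for j in range(k))
        sigma = math.sqrt(ss / n)
    return weights, means, sigma, post

# Hypothetical log intensities clustered near 6 (dead), 8 (marginal), 10 (alive)
grid = [i / 10.0 for i in range(-5, 6)]
data = ([6.0 + d for d in grid] + [8.0 + d for d in grid]
        + [10.0 + d for d in grid] * 2)
weights, means, sigma, post = em_mixture(data, means=[5.0, 8.0, 11.0])
labels = ["dead", "marginal", "alive"]
calls = [labels[max(range(3), key=lambda j: p[j])] for p in post]
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

The posterior probabilities in `post` are what allow a mutant to be labeled dead, marginal, or alive with an attached uncertainty.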
Random effect justification
Each x = (r1,…,r5,g1,…,g5) will have the
following effects:
• Individual effect: same mutant same
expression (replicates are alike)
• Genetic effect: same genetics same
expression
• PCR effect: expect differences between uptag and
downtag
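Shared random effects are what produce a correlation structure: two measurements are correlated to the extent that they share effects. The sketch below builds a covariance matrix from that rule. The effect names, variance values, and the four-measurement layout are hypothetical and not taken from the talk.

```python
def covariance_matrix(labels, variances, noise_var):
    """Covariance implied by additive, independent random effects.

    labels[i] maps each effect name to its level for measurement i. Two
    measurements share an effect's variance when their levels match;
    independent noise adds noise_var on the diagonal only.
    """
    n = len(labels)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(v for name, v in variances.items()
                         if labels[i][name] == labels[j][name])
            cov[i][j] = shared + (noise_var if i == j else 0.0)
    return cov

# Hypothetical layout: four measurements of one mutant, two per tag type
labels = [
    {"mutant": "m", "tag": "down"},
    {"mutant": "m", "tag": "down"},
    {"mutant": "m", "tag": "up"},
    {"mutant": "m", "tag": "up"},
]
variances = {"mutant": 1.0, "tag": 0.5}  # assumed variance components
cov = covariance_matrix(labels, variances, noise_var=0.25)
for row in cov:
    print(row)
```

Measurements with the same tag type come out more strongly correlated than measurements across tag types, which matches the qualitative pattern in the correlation tables.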
Does it fit?
What can we do now that we
couldn’t do before?
• Define a t-test that takes into account if
mutants are dead or not when
computing variance
• For each gene compute likelihood ratios
comparing two hypotheses:
alive/dead vs. dead/dead or
alive/alive
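A simplified sketch of such a likelihood ratio follows. It assumes R and G are independent given the class, with known class means and standard deviations, and it compares alive/dead against the better-fitting of dead/dead and alive/alive. The talk's model includes correlation between measurements, so this is an illustration only; all parameter values are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of Normal(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_likelihood_ratio(r, g, dead, alive):
    """Log LR of (R alive, G dead) against the better of (dead, dead) and (alive, alive).

    dead and alive are (mean, sd) pairs for the log intensity in each class.
    """
    alt = log_normal_pdf(r, *alive) + log_normal_pdf(g, *dead)
    null_dd = log_normal_pdf(r, *dead) + log_normal_pdf(g, *dead)
    null_aa = log_normal_pdf(r, *alive) + log_normal_pdf(g, *alive)
    return alt - max(null_dd, null_aa)

# Hypothetical class parameters on the log scale: (mean, sd)
dead, alive = (6.0, 0.5), (10.0, 0.5)
llr_hit = log_likelihood_ratio(10.0, 6.0, dead, alive)    # R alive, G dead
llr_null = log_likelihood_ratio(10.0, 10.0, dead, alive)  # both alive
print(llr_hit, llr_null)
```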
QQ-plot for new t-test
Better looking than others