Quality control and normalization
Wolfgang Huber
European Bioinformatics Institute
Acknowledgements
Anja von Heydebreck (Darmstadt)
Robert Gentleman (Seattle)
Günther Sawitzki (Heidelberg)
Martin Vingron (Berlin)
Annemarie Poustka, Holger Sültmann, Andreas Buness, Markus Ruschhaupt (Heidelberg)
Rafael Irizarry (Baltimore)
Judith Boer (Leiden)
Anke Schroth (Heidelberg)
Friederike Wilmer (Hilden)
X Which genes are differentially transcribed?
[Figure: log-ratio; panels: same-same, tumor-normal]
Statistics 101:
← precision
variance→
←bias
accuracy→
Basic dogma of data analysis:
Can always increase sensitivity at the cost of specificity,
or vice versa; the art is to find the best trade-off.
(It can also be possible to increase both by a
better choice of method / model.)
X ratios and fold changes
Fold changes are useful to describe
continuous changes in expression
[Figure: bar chart for conditions A, B, C; levels 3000, 1500, 1000; fold changes x3 and x1.5]
But what if the gene is “off” (below
detection limit) in one condition?
[Figure: bar chart for conditions A, B, C; levels 3000, 200, 0; fold changes marked "?"]
X ratios and fold changes
The idea of the log-ratio (base 2)
0: no change
+1: up by factor of 2^1 = 2
+2: up by factor of 2^2 = 4
-1: down by factor of 2^-1 = 1/2
-2: down by factor of 2^-2 = 1/4
A unit for measuring changes in expression: assumes that
a change from 1000 to 2000 units has a similar biological
meaning to one from 5000 to 10000.
What about a change from 0 to 500?
- conceptually
- noise, measurement precision
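The log-ratio scale above can be tried out with a few lines of Python. This is only an illustrative sketch; the function name log2_ratio is made up for this example. It shows that equal fold changes map to equal log-ratios and that a change from 0 to 500 has no finite log-ratio.

```python
import math

def log2_ratio(x1, x2):
    """Base-2 log-ratio of two intensities (hypothetical helper)."""
    return math.log2(x1 / x2)

# Equal fold changes give equal log-ratios, whatever the absolute level.
print(log2_ratio(2000, 1000))    # 1.0
print(log2_ratio(10000, 5000))   # 1.0
print(log2_ratio(500, 1000))     # -1.0

# A change from 0 to 500 cannot be expressed as a finite log-ratio.
try:
    log2_ratio(500, 0)
except ZeroDivisionError:
    print("log-ratio undefined when one intensity is 0")
```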
A complex measurement process lies between
mRNA concentrations and intensities
o tissue contamination
o RNA degradation
o amplification efficiency
o reverse transcription efficiency
o hybridization efficiency and specificity
o clone identification and mapping
o PCR yield, contamination
o spotting efficiency
o DNA-support binding
o other array manufacturing-related issues
o image segmentation
o signal quantification
o ‘background’ correction
The problem is less that these steps are ‘not perfect’; it is that
they vary from array to array, experiment to experiment.
X Questions
♦ How to compare microarray
intensities with each other?
♦ How to address measurement
uncertainty (“variance”)?
♦ How to calibrate (“normalize”) for
biases between samples?
X Sources of variation
amount of RNA in the biopsy
efficiencies of
-RNA extraction
-reverse transcription
-labeling
-fluorescent detection
Systematic
o similar effect on many measurements
o corrections can be estimated from data
⇒ Calibration
probe purity and length distribution
spotting efficiency, spot size
cross-/unspecific hybridization
stray signal
Stochastic
o too random to be explicitly accounted for
o remain as “noise”
⇒ Error model
X Error models
describe the possible outcomes of a set of
measurements
Outcomes depend on:
-true value of the measured quantity
(abundances of specific molecules in biological sample)
-measurement apparatus
(cascade of biochemical reactions, optical detection
system with laser scanner or CCD camera)
X Error models
Purpose:
1. Data compression: summary statistic instead of
full empirical distribution
2. Quality control
3. Statistical inference: appropriate parametric
methods have better power than non-parametric
(this has practical, financial, and ethical
aspects)
X The two component model
measured intensity = offset + gain × true abundance
$y_{ik} = a_{ik} + b_{ik} x_k$
$a_{ik} = a_i + \varepsilon_{ik}$
a_i: per-sample offset
$\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”
$b_{ik} = b_i b_k \exp(\eta_{ik})$
b_i: per-sample normalization factor
b_k: sequence-wise probe efficiency
$\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”
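A small simulation can make the two noise components tangible. The sketch below draws intensities from the model above for a single sample and probe, with the probe efficiency b_k set to 1; all parameter values (a = 50, b = 1, s1 = 20, s2 = 0.1) and function names are assumptions chosen for illustration, not values from the slides.

```python
import math
import random
import statistics

def simulate_intensity(x, a=50.0, b=1.0, s1=20.0, s2=0.1, rng=random):
    """Draw one measured intensity y = a + eps + b * x * exp(eta)."""
    eps = rng.gauss(0.0, b * s1)   # additive noise, variance b^2 * s1^2
    eta = rng.gauss(0.0, s2)       # multiplicative noise, variance s2^2
    return a + eps + b * x * math.exp(eta)

def sd_of_draws(x, n=5000, seed=1):
    """Empirical standard deviation of n simulated intensities at abundance x."""
    rng = random.Random(seed)
    draws = [simulate_intensity(x, rng=rng) for _ in range(n)]
    return statistics.stdev(draws)

for x in (0.0, 100.0, 10000.0):
    print(x, round(sd_of_draws(x), 1))
```

With these settings the standard deviation stays near s1 at low abundance and grows roughly in proportion to x at high abundance.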
X The two-component model
[Figure: panels on raw scale and log scale, annotated with “multiplicative” noise and “additive” noise]
B. Durbin, D. Rocke, JCB 2001
X Parameterization
$y = a + \varepsilon + b \cdot x \cdot (1 + \eta)$
$y = a + \varepsilon + b \cdot x \cdot e^{\eta}$
two practically equivalent forms (η << 1)
a: systematic background. Same for all probes (per array x color), or per array x color x print-tip group.
ε: random background. iid in whole experiment, or iid per array.
b: systematic gain factor. Per array x color, or per array x color x print-tip group.
η: random gain fluctuations. iid in whole experiment, or iid per array.
X variance stabilizing transformations
X_u a family of random variables with
E X_u = u, Var X_u = v(u). Define
$f(x) = \int^x \frac{du}{\sqrt{v(u)}}$
⇒ Var f(X_u) ≈ independent of u
derivation: linear approximation
[Figure: f(x) on the transformed scale (about 8.0 to 11.0) versus x on the raw scale (0 to 60000)]
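The linear approximation mentioned above can be spelled out as follows. This is a sketch of the standard first-order (delta method) argument, not a reproduction of the author's own derivation.

```latex
% First-order Taylor expansion of f around the mean u of X_u:
f(X_u) \approx f(u) + f'(u)\,(X_u - u)
% Taking variances on both sides:
\operatorname{Var} f(X_u) \approx f'(u)^2 \, \operatorname{Var} X_u = f'(u)^2 \, v(u)
% Requiring the right-hand side to be constant (here 1) for all u:
f'(u) = \frac{1}{\sqrt{v(u)}}
\quad\Longrightarrow\quad
f(x) = \int^x \frac{du}{\sqrt{v(u)}}
```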
X variance stabilizing transformations
$f(x) = \int^x \frac{du}{\sqrt{v(u)}}$
1.) constant variance (‘additive’): $v(u) = s^2 \Rightarrow f \propto u$
2.) constant CV (‘multiplicative’): $v(u) \propto u^2 \Rightarrow f \propto \log u$
3.) offset: $v(u) \propto (u + u_0)^2 \Rightarrow f \propto \log(u + u_0)$
4.) additive and multiplicative: $v(u) \propto (u + u_0)^2 + s^2 \Rightarrow f \propto \operatorname{arsinh}\frac{u + u_0}{s}$
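Case 4 can be checked numerically. The sketch below assumes u0 = 0 and made-up noise levels (S_ADD = 20, S_MULT = 0.1), so that v(u) = S_MULT^2 (u^2 + S^2) with S = S_ADD / S_MULT. It simulates data at several mean levels and reports the standard deviation after the arsinh transformation, which should stay close to S_MULT at every level.

```python
import math
import random
import statistics

S_ADD, S_MULT = 20.0, 0.1     # assumed additive and multiplicative noise levels
S = S_ADD / S_MULT            # so that v(u) = S_MULT**2 * (u**2 + S**2), u0 = 0

def transformed_sd(u, n=10000, seed=0):
    """SD of arsinh(X / S) for simulated X with mean u and variance v(u)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        x = u + rng.gauss(0.0, S_ADD) + u * rng.gauss(0.0, S_MULT)
        values.append(math.asinh(x / S))
    return statistics.stdev(values)

for u in (0.0, 1000.0, 50000.0):
    print(u, round(transformed_sd(u), 3))
```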
X the “glog” transformation
[Figure: f(x) = log(x) (dashed) and h_s(x) = asinh(x/s) (solid) versus intensity, from -200 to 1000]
$\operatorname{arsinh}(x) = \log\left(x + \sqrt{x^2 + 1}\right)$
$\lim_{x \to \infty} \left(\operatorname{arsinh} x - \log x - \log 2\right) = 0$
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
W. Huber et al., ISMB 2002
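Both statements about arsinh are easy to verify numerically. The following sketch compares the closed form with Python's built-in math.asinh and evaluates the gap to log(x) + log(2) for growing x; the helper names are made up for this example.

```python
import math

def arsinh(x):
    """Closed form arsinh(x) = log(x + sqrt(x^2 + 1))."""
    return math.log(x + math.sqrt(x * x + 1.0))

def gap_to_log(x):
    """arsinh(x) - log(x) - log(2), which tends to 0 as x grows."""
    return math.asinh(x) - math.log(x) - math.log(2.0)

for x in (-200.0, 0.0, 1.0, 1000.0):
    print(x, arsinh(x), math.asinh(x))   # the two columns agree

for x in (1.0, 10.0, 1000.0, 1e6):
    print(x, gap_to_log(x))              # shrinks towards 0
```

Unlike log(x), arsinh is defined at 0 and for negative intensities, which is what the figure above illustrates.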
X glog
[Figure: panels raw scale, log, glog; annotations: difference, log-ratio, generalized log-ratio; variance: constant part, proportional part]
X the transformed model
$\operatorname{arsinh}\left(\frac{Y_{ki} - a_{si}}{b_{si}}\right) = \mu_k + \varepsilon_{ki}, \quad \varepsilon_{ki} \sim N(0, c^2)$
i: arrays
k: probes
s: probe strata (e.g. print-tip, region)
“usual” log-ratio:
$\log\frac{x_1}{x_2}$
'glog' (generalized log-ratio):
$\log\frac{x_1 + \sqrt{x_1^2 + c_1^2}}{x_2 + \sqrt{x_2^2 + c_2^2}}$
c1, c2 are experiment specific parameters (~level
of background noise)
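A direct transcription of the generalized log-ratio into Python may help to see how it behaves. This is a minimal sketch; the function name glog_ratio and the example values c1 = c2 = 100 are assumptions made for illustration, and the natural logarithm is used.

```python
import math

def glog_ratio(x1, x2, c1, c2):
    """Generalized log-ratio of intensities x1, x2 with parameters c1, c2."""
    num = x1 + math.sqrt(x1 * x1 + c1 * c1)
    den = x2 + math.sqrt(x2 * x2 + c2 * c2)
    return math.log(num / den)

# At high intensities it is close to the usual log-ratio log(x1 / x2).
print(glog_ratio(10000.0, 5000.0, 100.0, 100.0), math.log(2.0))

# It stays finite when one intensity is 0, where log(x1 / x2) is undefined.
print(glog_ratio(500.0, 0.0, 100.0, 100.0))
```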
X Variance Bias Trade-Off
[Figure: estimated log-fold-change versus signal intensity, for log and glog]
X Variance-bias trade-off and shrinkage estimators
Shrinkage estimators:
pay a small price in bias for a large decrease of variance, so
overall the mean-squared-error (MSE) is reduced.
Particularly useful if you have few replicates.
Generalized log-ratio:
= a shrinkage estimator for fold change
There are many possible choices; we chose “variance-stabilization”:
+ interpretable even in cases where genes are off in some
conditions
+ can subsequently use standard statistical methods
(hypothesis testing, ANOVA, clustering, classification…)
without the worries about low-level variability that are often
warranted on the log-scale
evaluation: effects of different data transformations
[Figure: difference red-green versus rank(average)]
X Normality: QQ-plot
X “Single color normalization”
n red-green arrays (R1, G1, R2, G2, …, Rn, Gn)
within/between slides:
  for i = 1…n:
    calculate Mi = log(Ri/Gi), Ai = ½ log(Ri·Gi)
    normalize Mi vs Ai
  normalize M1…Mn
all at once:
  normalize the matrix of (R, G),
  then calculate log-ratios or any other
  contrast you like
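The M and A values of the within/between-slides scheme can be computed as below. This is a sketch only: base-2 logarithms are assumed, and the "normalize Mi vs Ai" step is replaced by a plain median centering of M as the simplest possible stand-in (an intensity-dependent fit such as loess would be a different technique). All function names are made up for this example.

```python
import math
import statistics

def ma_values(red, green):
    """Per-spot M = log2(R/G) and A = 0.5 * log2(R*G) for one array."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def median_center(values):
    """Stand-in normalization: subtract the median so that it becomes 0."""
    med = statistics.median(values)
    return [v - med for v in values]

red = [2000.0, 800.0, 1500.0, 12000.0]
green = [1000.0, 400.0, 1500.0, 3000.0]
m, a = ma_values(red, green)
print(m)                  # [1.0, 1.0, 0.0, 2.0]
print(median_center(m))   # [0.0, 0.0, -1.0, 1.0]
```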
X What about non-linear effects
o Microarrays can be operated in a linear regime,
where fluorescence intensity increases proportionally
to target abundance (see e.g. Affymetrix dilution
series)
Two reasons for non-linearity:
o At the high intensity end: saturation/quenching. This
can and should be avoided experimentally - loss of
data!
o At the low intensity end: background offsets, instead
of y=k·x we have y=k·x+x0, and in the log-log plot this
can look curvilinear. But this is an affine-linear effect
and can be corrected by affine normalization.
Non-parametric methods (e.g. loess) risk overfitting and
loss of power.
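The curvilinear appearance of an affine relationship in the log-log plot can be reproduced numerically. The sketch below uses made-up values k = 2 and x0 = 100 and computes the local slope of log y against log x: for y = k·x + x0 the slope is far below 1 at low x and close to 1 at high x, while after subtracting the offset it is 1 everywhere.

```python
import math

K, X0 = 2.0, 100.0   # assumed gain and background offset

def affine(x):
    """Measured signal with a background offset: y = k*x + x0."""
    return K * x + X0

def corrected(x):
    """Signal after subtracting the offset (affine correction)."""
    return affine(x) - X0

def loglog_slope(f, x, h=1e-4):
    """Numerical slope of log f(x) against log x (central difference)."""
    lo, hi = x * math.exp(-h), x * math.exp(h)
    return (math.log(f(hi)) - math.log(f(lo))) / (2.0 * h)

for x in (10.0, 1000.0, 100000.0):
    print(x, round(loglog_slope(affine, x), 4), round(loglog_slope(corrected, x), 4))
```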
X Non-linear or affine linear?
X Definitions
linear: y = k·x
affine linear: y = k·x + x0
genuinely non-linear: neither of the above
X How to compare and assess different
‘normalization’ methods?
Normalization :=
1. correction for systematic experimental biases
2. provision of expression values that can subsequently be
used for testing, clustering, classification, modelling…
3. provision of a measure of measurement uncertainty
Quality trade-off: the better the measurements, the less
need for normalization. Need for “too much” normalization
relates to a quality problem.
Variance-Bias trade-off: how do you weigh measurements
that have low signal-noise ratio?
- just use anyway
- ignore
- shrink
X How to compare and assess different
‘normalization’ methods?
Aesthetic criteria
Logarithm is more beautiful than arsinh
Practical criteria
It takes forever to run method XX. Referees will only
accept my paper if it uses the original MAS5.
Silly criteria
The best method is the one that makes all my scatterplots
look like straight, slim cigars
Physical criteria
Normalization calculations should be based on
physical/chemical model
Economical/political criteria
Life would be so much easier if everybody were just using
the same method, who cares which one
X How to compare and assess different
‘normalization’ methods?
Comparison against a ground truth
But you have millions of numbers – need to choose the
metric that measures deviation from truth.
FN/FP: do you find all the differentially expressed genes,
and do you not find non-d.e. genes?
qualitative/quantitative: how well do you estimate
abundance, fold-change?
Spike-In and Dilution series
… great, but how representative are they of other data?
Implicitly, from resampling / cross-validating with the
actual experiment of interest
… but isn’t that too much like Münchhausen’s bootstrap?
X evaluation: a benchmark for Affymetrix
genechip expression measures
o Data:
Spike-in series: from Affymetrix 59 x HGU95A,
16 genes, 14 concentrations, complex background
Dilution series: from GeneLogic 60 x HGU95Av2,
liver & CNS cRNA in different proportions and amounts
o Benchmark:
15 quality measures regarding
-reproducibility
-sensitivity
-specificity
Put together by Rafael Irizarry (Johns Hopkins)
http://affycomp.biostat.jhsph.edu
Quality assessment and control:
an overview of some diagnostic plots and common artifacts
PCR plates
Scatterplot, colored by PCR-plate
Two RZPD Unigene II filters (cDNA nylon membranes)
PCR plates
PCR plates: boxplots
array batches
print-tip effects
[Figure: empirical distribution functions F̂(q) of the log-ratio q = log(fg.green/fg.red), one curve per spotting pin (1:1 to 4:4); array 41 (a42-u07639vene.txt)]
spotting pin quality decline
after delivery of 5x10^5 spots
after delivery of 3x10^5 spots
H. Sueltmann DKFZ/MGA
spatial effects
[Figures: spotted cDNA arrays, Stanford-type. Panels R, Rb, R-Rb with color scale by rank; another array showing print-tip effects, with color scale ~ log(G) and color scale ~ rank(G)]
Batches: array to array differences d_ij = mad_k(h_ik - h_jk)
[Figure: matrix of d_ij for all pairs of arrays (axes: 1:nrhyb; color scale about 0.6 to 1.8)]
arrays i=1…63; roughly sorted by time
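The array-to-array distance d_ij can be computed with a few lines of Python. In this sketch h is assumed to be a list of arrays, each holding the transformed intensities h_ik of the same probes k; the MAD is taken as the raw median absolute deviation without any scaling constant, which is an assumption, and the function names are made up.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def array_distances(h):
    """Matrix d[i][j] = mad over probes k of (h[i][k] - h[j][k])."""
    n = len(h)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diffs = [x - y for x, y in zip(h[i], h[j])]
            d[i][j] = d[j][i] = mad(diffs)
    return d

h = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.5, 2.5, 3.5, 4.5, 5.5],   # array 0 plus a constant shift
    [2.0, 1.0, 5.0, 2.0, 8.0],   # a genuinely different array
]
for row in array_distances(h):
    print(row)
```

A constant shift between two arrays gives a distance of 0, so the measure picks up differences in shape rather than in overall level.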
Density representation of the scatterplot
(76,000 clones, RZPD Unigene-II filters)
See: package hexbin; also, smoothscatter in package prada
Oligonucleotide chips
Affymetrix files
Main software from Affymetrix:
MAS - MicroArray Suite.
DAT file: Image file, ~10^8 pixels, ~200 MB.
CEL file: probe intensities, ~10^6 numbers
CDF file: Chip Description File. Describes
which probes go in which probe sets
(genes, gene fragments, ESTs).
1LQ file: Probe sequences and intended
targets in the transcriptome
Image analysis
DAT image files → CEL files
Each probe cell: 10x10 pixels.
Gridding: estimate location of probe cell centers.
Signal:
– Remove outer 36 pixels → 8x8 pixels.
– The probe cell signal, PM or MM, is the 75th
percentile of the 8x8 pixel values.
Background: Average of the lowest 2% probe cells
is taken as the background value and
subtracted.
Compute also quality values.
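The signal step can be sketched as follows. The code assumes that a probe cell is given as a 10x10 nested list of pixel values; the percentile convention (linear interpolation via the "inclusive" method of statistics.quantiles) is an assumption, since the slide does not say which definition the Affymetrix software uses, and the function name is made up.

```python
import statistics

def probe_cell_signal(cell):
    """75th percentile of the inner 8x8 pixels of a 10x10 probe cell."""
    inner_rows = [row[1:-1] for row in cell[1:-1]]     # drop the outer ring
    pixels = [p for row in inner_rows for p in row]    # 64 values
    # quantiles(..., n=4) returns the three quartiles; index 2 is the 75th percentile
    return statistics.quantiles(pixels, n=4, method="inclusive")[2]

# Example: bright border pixels (e.g. spill-over) do not affect the signal.
border = [9999] * 10
cell = [border] + [[9999] + [100] * 8 + [9999] for _ in range(8)] + [border]
print(probe_cell_signal(cell))   # 100.0
```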
Data and notation
PM_ijg, MM_ijg = intensities for perfect match and
mismatch probe j for gene g in chip i
i = 1,…, n: one to hundreds of chips
j = 1,…, J: usually 11 or 16 probe pairs
g = 1,…, G: 6…30,000 probe sets
Tasks:
o calibrate (normalize) the measurements from different
chips (samples)
o summarize for each probe set the probe level data, i.e.,
11 PM and MM pairs, into a single expression
measure.
o compare between chips (samples) for detecting
differential expression.
Expression measures: MAS 4.0
Affymetrix GeneChip MAS 4.0
software uses AvDiff, a trimmed
mean:
$\mathrm{AvDiff} = \frac{1}{\#J} \sum_{j \in J} (PM_j - MM_j)$
o sort d_j = PM_j - MM_j
o exclude highest and lowest value
o J := those pairs within 3 standard
deviations of the average
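The three steps can be written out as a short function. This is a sketch under one explicit assumption that the slide leaves open: the average and standard deviation are computed after dropping the highest and lowest difference, and J is then selected from all pairs. It is not a reproduction of the MAS 4.0 code, and the function name is made up.

```python
import statistics

def avdiff(pm, mm):
    """Trimmed mean of PM - MM differences for one probe set (sketch)."""
    d = sorted(p - m for p, m in zip(pm, mm))
    trimmed = d[1:-1]                       # exclude highest and lowest value
    centre = statistics.mean(trimmed)
    sd = statistics.stdev(trimmed)
    # J: pairs whose difference lies within 3 standard deviations of the average
    kept = [v for v in d if abs(v - centre) <= 3 * sd]
    return statistics.mean(kept)

pm = [110, 120, 130, 140, 150, 1000]
mm = [10, 10, 10, 10, 10, 10]
print(avdiff(pm, mm))   # 120; the outlying pair (1000, 10) is left out
```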
Expression measures
MAS 5.0
Instead of MM, use "repaired" version CT:
CT = MM if MM < PM
CT = PM / "typical log-ratio" if MM >= PM
"Signal" = Tukey.Biweight(log(PM - CT))
(… ≈ median)
Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise
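The weight function B can be turned into a robust average in one step. The sketch below centres the values at their median, scales them by the median absolute deviation, and averages with weights B; the tuning constant c = 5 and the small eps that guards against division by zero are assumptions for illustration, and the exact scaling used by MAS 5.0 may differ.

```python
import statistics

def biweight(x, c):
    """Tukey biweight B(x) = (1 - (x/c)^2)^2 for |x| < c, else 0."""
    return (1.0 - (x / c) ** 2) ** 2 if abs(x) < c else 0.0

def tukey_biweight_mean(values, c=5.0, eps=1e-4):
    """One-step biweight average: robust to outliers, close to the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    weights = [biweight((v - med) / (mad + eps), c) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

values = [1.0, 2.0, 3.0, 4.0, 100.0]
print(statistics.mean(values), tukey_biweight_mean(values))
```

The outlier 100 receives weight 0, so the result stays near the bulk of the data while the ordinary mean does not.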
Expression measures: Li & Wong
dChip fits a model for each gene
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$
where
– θi: expression index of the gene in chip i
– φj: probe sensitivity
The maximum likelihood estimate of θi, the MBEI (model-based
expression index), is used as expression measure of the gene in chip i.
Need at least 10 or 20 chips.
Current version works with PMs only.
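With Gaussian errors, maximum likelihood for this multiplicative model amounts to least squares, and one simple way to compute it is to alternate between the two sets of parameters. The sketch below does this for one gene; it is not dChip's implementation (which also handles outliers), the constraint sum(phi_j^2) = number of probes is taken from the Li & Wong paper rather than from the slide, and the function name is made up.

```python
import math

def fit_rank_one(d, n_iter=20):
    """Least-squares fit of d[i][j] ~ theta[i] * phi[j] (chips i, probes j)."""
    n_chips, n_probes = len(d), len(d[0])
    theta = [sum(row) / n_probes for row in d]       # start from per-chip means
    phi = [1.0] * n_probes
    for _ in range(n_iter):
        # Given theta, the least-squares phi_j is a regression through the origin.
        tt = sum(t * t for t in theta)
        phi = [sum(d[i][j] * theta[i] for i in range(n_chips)) / tt
               for j in range(n_probes)]
        # Identifiability constraint (assumption): sum(phi^2) = number of probes.
        scale = math.sqrt(sum(p * p for p in phi) / n_probes)
        phi = [p / scale for p in phi]
        # Given phi, update theta_i in the same way.
        pp = sum(p * p for p in phi)
        theta = [sum(d[i][j] * phi[j] for j in range(n_probes)) / pp
                 for i in range(n_chips)]
    return theta, phi

# Noise-free toy data: three chips, three probes.
d = [[t * p for p in (0.5, 1.0, 1.5)] for t in (100.0, 200.0, 400.0)]
theta, phi = fit_rank_one(d)
print([round(t, 2) for t in theta], [round(p, 3) for p in phi])
```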
Expression measures
RMA: Irizarry et al. (2002)
o Estimate one global background value
b = mode(MM). No probe-specific
background!
o Assume: PM = s_true + b
Estimate s ≥ 0 from PM and b as a
conditional expectation E[s_true | PM, b].
o Use log2(s).
o Nonparametric nonlinear calibration
('quantile normalization') across a set of
chips.
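Quantile normalization, the last step above, can be sketched in plain Python. The version below assumes that every chip has the same number of probes and breaks ties by position rather than averaging them, which is a simplification; the function name is made up for this example.

```python
def quantile_normalize(chips):
    """Give every chip the same empirical distribution (mean of sorted values)."""
    n = len(chips)
    sorted_chips = [sorted(c) for c in chips]
    # Reference distribution: mean of the r-th smallest values across chips.
    reference = [sum(column) / n for column in zip(*sorted_chips)]
    result = []
    for c in chips:
        order = sorted(range(len(c)), key=lambda k: c[k])   # probe indices by rank
        out = [0.0] * len(c)
        for rank, k in enumerate(order):
            out[k] = reference[rank]
        result.append(out)
    return result

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]
print(quantile_normalize(chips))   # [[7.0, 1.5, 3.5], [3.5, 1.5, 7.0]]
```

After normalization each chip contains exactly the same set of values; only their assignment to probes differs.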
Expression measures
RMA: Irizarry et al. (2002)
AvDiff-like:
$\mathrm{RMA} = \frac{1}{\#A} \sum_{j \in A} \log_2(PM_j - BG_j)$
with A a set of “suitable” pairs.
Li-Wong-like: additive model
$\log_2(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$
Estimate RMA = a_i for chip i using robust method
median polish (successively remove row and
column medians, accumulate terms, until
convergence). Works with d>=2
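Median polish as described in parentheses above can be written compactly. The sketch below takes a table with chips as rows and probes as columns, for example values of log2(PM - BG), and returns an overall level, row effects, column effects and residuals; the chip-level estimate a_i then corresponds to overall + row[i]. The iteration limit, tolerance and function name are assumptions.

```python
import statistics

def median_polish(table, max_iter=10, tol=1e-6):
    """Fit table[i][j] ~ overall + row[i] + col[j] by sweeping out medians."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(r) for r in table]
    overall, row, col = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Remove row medians and accumulate them in the row effects.
        rmed = [statistics.median(r) for r in resid]
        for i in range(n_rows):
            row[i] += rmed[i]
            resid[i] = [v - rmed[i] for v in resid[i]]
        shift = statistics.median(col)
        overall += shift
        col = [c - shift for c in col]
        # Remove column medians and accumulate them in the column effects.
        cmed = [statistics.median([resid[i][j] for i in range(n_rows)])
                for j in range(n_cols)]
        for j in range(n_cols):
            col[j] += cmed[j]
            for i in range(n_rows):
                resid[i][j] -= cmed[j]
        shift = statistics.median(row)
        overall += shift
        row = [r - shift for r in row]
        if max(abs(m) for m in rmed + cmed) < tol:   # nothing left to remove
            break
    return overall, row, col, resid

# Toy table that is exactly additive: a_i in (8, 9, 10), b_j in (-1, 0, 1).
table = [[a + b for b in (-1, 0, 1)] for a in (8, 9, 10)]
overall, row, col, resid = median_polish(table)
print([overall + r for r in row])   # [8.0, 9.0, 10.0]
```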
Affymetrix: I_PM = I_MM + I_specific ?
[Figure: log(PM/MM)]
From: R. Irizarry et al., Biostatistics 2002
X Sequence-dependent preprocessing
position- and sequence-specific effects w_i(s):
$\log Y = \log x + \sum_{i=1}^{25} w_i(s_i) + \varepsilon$
Naef et al., Phys Rev E 68 (2003)
[Figure: w_i versus position i]
Software for pre-processing Affymetrix data
• Bioconductor R package affy.
• Background estimation.
• Probe-level normalization.
• Expression measures
• Two main functions: ReadAffy, expresso.
• See also: gcrma, tilingArray, vsn.
X References
Bioinformatics and computational biology solutions using R and Bioconductor,
R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).
Variance stabilization applied to microarray data calibration and to the
quantification of differential expression. W. Huber, A. von Heydebreck, H.
Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.
Exploration, Normalization, and Summaries of High Density Oligonucleotide
Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed.
Biostatistics 4 (2003) 249-264.
Error models for microarray intensities. W. Huber, A. von Heydebreck, and M.
Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley
& sons (2005).
Differential Expression with the Bioconductor Project. A. von Heydebreck, W.
Huber, and R. Gentleman. Encyclopedia of Genomics, Proteomics and
Bioinformatics. John Wiley & sons (2005).