From reference genes to global mean normalization

From reference genes to global mean normalization
Jo Vandesompele
professor, Ghent University
co-founder and CEO, Biogazelle
EMBL Master Class - Following MIQE recommendation
Heidelberg, July 4, 2012
Critical elements contributing to successful qPCR results
“normalization is the single most important factor contributing to (more)
accurate qPCR results”
Derveaux et al., Methods, 2010
Three types of normalization will be discussed
n  reference genes: gold standard for normalization
n  global mean normalization and selection of stable references
n  expressed repeat elements
http://genomebiology.com/2002/3/7/research/0034.1
Open Access
et al.
Mestdagh
2009
Volume
10, Issue 6, Article R64
Method
Research
!"#$%&'()"*+(,(-#.%/,((&#0(#12(/(2-#34,4+#1%//5&-#6278(#1"++(9%'4&( $%&#:"5-#;&&(#0(#1%(+(#%&'#32%&<#=+(,(*%&
comment
Accurate normalization of real-time quantitative RT-PCR data by
geometric averaging of multiple internal control genes
A novel and universal method for microRNA RT-qPCR data
normalization
Pieter Mestdagh*, Pieter Van Vlierberghe*, An De Weer*, Daniel Muth†,
Frank Westermann†, Frank Speleman* and Jo Vandesompele*
Addresses: *Center for Medical Genetics, Ghent University Hospital, De Pintelaan 185, Ghent, Belgium. †Department of Tumour Genetics,
German Cancer Center, Im Neuenheimer Feld 280, Heidelberg, Germany.
?"22()+"&'(&8(>#32%&<#=+(,(*%&N#OJ*%4,>#@2%&<4N)+(,(*%&P27MN%8NQ(
Correspondence: Jo Vandesompele. Email: [email protected]
Published: 18 June 2002
Genome Biology 2002, 3(7):research0034.1–0034.11
Received: 20 December 2001
Revised: 10 April 2002
Accepted: 7 May 2002
reviews
;''2())>#?(&/(2#@"2#A('48%,#B(&(/48)-#BC(&/#D&4E(2)4/5#F")+4/%,#G.H-#0(#14&/(,%%&#GIH-#6JKLLL#BC(&/-#6(,M47*N
Published: 16 June 2009
Genome Biology 2009, 10:R64 (doi:10.1186/gb-2009-10-6-r64)
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2002/3/7/research/0034
reports
© 2002 Vandesompele et al., licensee BioMed Central Ltd
(Print ISSN 1465-6906; Online ISSN 1465-6914)
Abstract
%&%,5)4)# "@# /C"7)%&')# "@# M(&()# 4&# /S"# '4@@(2(&/4%,,5# ,%Q(,('
:9;# +"+7,%/4"&)# VGW-# SC4,(# 2(%,J/4*(# :TJ1?:# +2"E4'()# /C(
)4*7,/%&("7)#*(%)72(*(&/#"@#M(&(#(R+2())4"&#4&#*%&5#'4@J
@(2(&/# )%*+,()# @"2# %# ,4*4/('# &7*Q(2# "@# M(&()-# %&'# 4)# ()+(J
84%,,5# )74/%Q,(# SC(&# "&,5# %# )*%,,# &7*Q(2# "@# 8(,,)# %2(
%E%4,%Q,(# VXJYWN# 6"/C# /(8C&4U7()# C%E(# /C(# %'E%&/%M(# "@
)+(('-#/C2"7MC+7/#%&'#%#C4MC#'(M2((#"@#+"/(&/4%,#%7/"*%/4"&
8"*+%2('# /"# 8"&E(&/4"&%,# U7%&/4@48%/4"&# *(/C"')-# )78C# %)
&"2/C(2&JQ,"/# %&%,5)4)-# 24Q"&78,(%)(# +2"/(8/4"&# %))%5-# "2
information
B(&(J(R+2())4"&#%&%,5)4)#4)#4&82(%)4&M,5#4*+"2/%&/#4&#*%&5
@4(,')# "@# Q4","M48%,# 2()(%28CN# D&'(2)/%&'4&M# +%//(2&)# "@
(R+2())('#M(&()#4)#(R+(8/('#/"#+2"E4'(#4&)4MC/#4&/"#8"*+,(R
2(M7,%/"25#&(/S"2<)#%&'#S4,,#*")/#+2"Q%Q,5#,(%'#/"#/C(#4'(&J
/4@48%/4"&# "@# M(&()# 2(,(E%&/# /"# &(S# Q4","M48%,# +2"8())()-# "2
4*+,48%/('# 4&# '4)(%)(N# TS"# 2(8(&/,5# '(E(,"+('# *(/C"')# /"
*(%)72(#/2%&)824+/#%Q7&'%&8(#C%E(#M%4&('#*78C#+"+7,%24/5
%&'# %2(# @2(U7(&/,5# %++,4('N# A482"%22%5)# %,,"S# /C(# +%2%,,(,
interactions
Conclusions: The normalization strategy presented here is a prerequisite for accurate RT-PCR
expression profiling, which, among other things, opens up the possibility of studying the biological
relevance of small expression differences.
refereed research
Results: We outline a robust and innovative strategy to identify the most stably expressed
control genes in a given set of tissues, and to determine the minimum number of genes required to
calculate a reliable normalization factor. We have evaluated ten housekeeping genes from different
abundance and functional classes in various human tissues, and demonstrated that the conventional
use of a single gene for normalization leads to relatively large errors in a significant proportion of
samples tested. The geometric mean of multiple carefully selected housekeeping genes was
validated as an accurate normalization factor by analyzing publicly available microarray data.
deposited research
Background: Gene-expression analysis is increasingly important in biological research, with realtime reverse transcription PCR (RT-PCR) becoming the method of choice for high-throughput
and accurate expression profiling of selected genes. Given the increased sensitivity,
reproducibility and large dynamic range of this methodology, the requirements for a proper
internal control gene for normalization have become increasingly stringent. Although
housekeeping gene expression has been reported to vary considerably, no systematic survey has
properly determined the errors related to the common practice of using only one control gene,
nor presented an adequate way of working around this problem.
Background
Received: 2 April 2009
Revised: 2 April 2009
Accepted: 16 June 2009
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2009/10/6/R64
© 2009 Mestdagh et al.; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
<p>The
Normalization
iments.</p>
mean expression
of microRNA
value:
RT-qPCR
a new method for accurate and reliable normalization of microRNA expression data from RT-qPCR exper-
Abstract
Gene expression analysis of microRNA molecules is becoming increasingly important. In this study
we assess the use of the mean expression value of all expressed microRNAs in a given sample as a
normalization factor for microRNA real-time quantitative PCR data and compare its performance
to the currently adopted approach. We demonstrate that the mean expression value outperforms
the current normalization strategy in terms of better reduction of technical variation and more
accurate appreciation of biological changes.
Background
MicroRNAs (miRNAs) are an important class of gene regulators, acting on several aspects of cellular function such as differentiation, cell cycle control and stemness. Not
surprisingly, deregulated miRNA expression has been implicated in a wide variety of diseases, including cancer [1]. Moreover, miRNA expression profiling of different tumor entities
resulted in the identification of miRNA signatures correlating
with patient diagnosis, prognosis and response to treatment
[2]. Despite the small size of miRNA molecules, several technologies have been developed that enable high-throughput
and sensitive miRNA profiling, such as microarrays [3-8],
real-time quantitative PCR (RT-qPCR) [9,10] and bead-based
flow cytometry [2]. In terms of accuracy and specificity, RTqPCR has become the method of choice for measuring gene
expression levels, both for coding and non-coding RNAs.
However, the accuracy of the results is largely dependent on
proper data normalization. As numerous variables inherent
to an RT-qPCR experiment need to be controlled for in order
to differentiate experimentally induced variation from true
biological changes, the use of multiple reference genes is generally accepted as the gold standard for RT-qPCR data normalization [11]. Typically, a set of candidate reference genes is
evaluated in a pilot experiment with representative samples
from the experimental condition(s). Ideally these candidate
reference genes belong to different functional classes, significantly reducing the possibility of confounding co-regulation.
In case of miRNA profiling, only few candidate reference
miRNAs have been reported [12]. Generally, other small noncoding RNAs are used for normalization. These include both
small nuclear RNAs (for example, U6) and small nucleolar
RNAs (for example, U24, U26).
Strategies for normalization of high-dimensional expression
profiling experiments (using, for example, microarray technology, but recently also transcriptome sequencing) generally
take advantage of the huge amount of data generated and
often use (almost) all available data points. These strategies
range from a straightforward approach based on the mean or
median expression value to more complex algorithms such as
Genome Biology 2009, 10:R64
Why do we need normalization?
n  2 sources of variation in gene expression results
n  biological variation (true fold changes)
n  experimentally induced variation (noise and bias)
n  purpose of normalization is removal or reduction of the experimental
variation
n  input quantity: RNA quantity, cDNA synthesis efficiency, …
n  (input quality: RNA integrity, RNA purity, …)
Various normalisation strategies have been proposed
Huggett et al., Genes and Immunity, 2005
Various normalisation strategies have been proposed
n  sample size or volume
n  total RNA
n  rRNA genes (e.g. 18S rRNA)
n  spike-in molecules
n  reference genes (mRNA) (‘housekeeping genes’)
The use of reference genes (housekeeping genes) is the most
universal and most appropriate method for normalization.
The problem of using a single non-validated reference gene
Cq values
GOI
ACTB
GAPDH
T
21.0
U
23.0
T
18.0
U
19.0
T
19.0
U
22.0
normalized relative quantities
GOIACTB
GOIGAPDH
T
2
U
1
T
1
U
2
4-fold difference
different directions
the geNorm solution to the normalization problem
n  framework for qPCR gene expression normalisation using the reference
gene concept
n  quantified errors related to the use of a single reference gene
(> 3 fold in 25% of the cases; > 6 fold in 10% of the cases)
n  developed a robust algorithm for assessment of expression stability of
candidate reference genes
n  proposed the geometric mean of at least 3 reference genes for
accurate and reliable normalisation
n  Vandesompele et al., Genome Biology, 2002
Candidate reference genes in normal blood samples
n  RT-qPCR analysis of 5 candidate reference genes (belonging to different
functional and abundance classes) on 7 normal blood samples
4
3
ACTB
HMBS
2
HPRT1
TBP
1
UBC
0
A
B
C
D
E
F
G
15 fold difference between A and B if normalized
by only one gene (ACTB or HMBS)
geNorm expression stability parameter
n  pairwise variation V (between any 2 candidate reference genes)
gene A
gene B
sample 1
a1
b1
log2(a1/b1)
sample 2
a2
b2
log2(a2/b2)
sample 3
a3
b3
log2(a3/b3)
…
…
…
…
sample n
an
bn
log2(an/bn)
standard deviation = V
n  gene stability measure M: average pairwise variation V of a gene with all
other tested candidate reference genes
n  iterative procedure of removing the worst reference gene followed by
recalculation of M-values
geNorm analysis is straightforward and helps you on 2 levels
n  ranking of candidate reference genes according to their stability
n  determination of how many genes are required for reliable normalization
http://www.genorm.info
Calculation of a sample specific normalization factor as the
geometric mean of the reference genes
n  geometric mean of 3 reference gene expression levels
geometric mean = (a x b x c)
arithmetic mean =
1/3
a+b+c
3
n  controls for outliers
n  compensates for differences in expression level between the reference
genes
Geometric averaging is robust (insensitive to outliers)
ACTB
HMBS
HPRT1
TBP
UBC
NF
The use of multiple reference genes for normalization gives
statistically more significant results
Kaplan-Meier cancer patient survival curve
log rank statistics
NF4
0.003
NF1
0.006
0.021
0.023
0.056
Hoebeeck et al., Int J Cancer, 2006
The use of multiple reference genes for normalization enables
accurate measurement of small expression differences
LEMD3 levels drop to 50% in patient with NMD
patient / control
n  3 independent experiments
n  95% confidence intervals
n 
Hellemans et al., Nature Genetics, 2004
geNorm is the de facto standard for reference gene validation
and normalization
n  > 4,500 citations of the geNorm technology
n  > 15,000 geNorm software downloads in 100 countries
Large and active geNorm discussion community
> 1000 members, almost 2000 posts
http://tech.groups.yahoo.com/group/genorm/
genormPLUS is a much improved version built into qbasePLUS
classic
geNorm
improved
geNorm
Excel
Windows
qbasePLUS
Win, Mac, Linux
1x
20x
expert interpretation + report
-
+
ranking best 2 genes
-
+
handling missing data
-
+
raw data (Cq) as input
-
+
platform
speed
genormPLUS result interpretation is easy
n  expert report without need to understand formulas
n  time saver
n  higher confidence in the results
other
methods
for selection
stable
reference
genes reference
Alternative
approaches
tooffind
stably
expressed
genes
n 
n 
n 
n 
n 
n 
n 
Global Pattern Recognition (Akilesh et al., Genome Research, 2003)
BestKeeper (Pfaffl et al., Biotechnology Letters, 2004)
Equivalence test (Haller et al., Analytical Biochemistry, 2004)
ANOVA test (Brunner et al., BMC Plant Biology, 2004)
Normfinder (Andersen et al., Cancer Research, 2004)
Szabo et al., Genome Biology, 2004
Abruzzo et al., Biotechniques, 2005
present mathematical (linear mixed-effects) models to further analyze
candidate reference genes
log yij = µ + Ti + Gj + εij
n  Vandesompele, Kubista & Pfaffl
Reference gene validation software for improved normalization
book chapter in “Real-time PCR: an essential guide”,
Horizon Bioscience, 2nd edition (2009)
Misconceptions and tips for normalization
n  expression levels of reference gene and gene of interest do not need to be
similar
n  no need to measure reference genes in the same run (plate) as genes of
interest
n  geNorm pilot experiment can fit in a single 96-well plates
n  10 representative samples x 8 candidate reference genes
(no replicates, no controls, no standard curve)
n  results that can be used in the future if experimental conditions do not
change
n  at least 2 reference genes are needed
n  quality control of the expression stability and normalization factor
n  more accurate results
Large scale gene expression studies benefit from different
normalization strategy
n  service applications at Biogazelle
n  755 microRNAs (OpenArray)
n  1718 long non-coding RNAs (SmartChip)
n  gene panels (96 or 384-well plates)
A new normalization method: global mean normalization
n  hypothesis: when a large set of genes are measured, the average
expression level reflects the input amount and could be used for
normalization
n  microarray normalization (lowess, mean ratio, …)
n  RNA-sequencing read counts
n  the set of genes must be sufficiently large and unbiased
n  we test this hypothesis using genome-wide microRNA data from
experiments in which Biogazelle quantified a large number of miRNAs in
different studies
n  cancer biopsies & serum
o  neuroblastoma, T-ALL, EVI1 leukemia, retinoblastoma
n  pool of normal tissues, normal bone marrow set
n  induced sputum of smokers vs. non-smokers
How to validate a new normalization method?
n 
n 
n 
n 
geNorm ranking global mean vs. candidate reference genes
reduction of experimental noise
balancing of expression differences (up vs. down)
identification of truly differentially expressed genes
n  original global mean (Mestdagh et al., 2009)
n  improved global mean (D’haene et al., 2012)
n  mean center the data > equal weight to each gene
n  allow PCR efficiency correction
n  improved global mean on common targets (D’haene et al., 2012)
n  improved global mean
n  average only genes that are expressed in all samples
MicroRNA normalization strategies
n  small-RNA controls
n  frequently used normalization strategy
n  small nuclear RNAs, small nucleolar RNAs
n  18 available from Applied Biosystems
n  global mean normalization
n  method applied for microarray data
n  universal: applicable for all large miRNA datasets
n  many datapoints needed (megaplex vs. multiplex)
n  miRNAs/small RNA controls that resemble the mean
n  minimal standard deviation when comparing miRNA expression with
mean ( geNorm V value, standard deviation of log transformed ratios)
n  compatible with multiplex assays
n  need to determine mean expression level first
Frequently used small RNA controls for miR normalization
n  How ‘stable’ is the global mean compared to controls?
n  geNorm analysis using controls and mean as input variables
n  exclusion of potentially co-regulated controls
HY3
RNU19
RNU24
RNU38B
RNU43
RNU44
RNU48
RNU49
RNU58A
RNU58B
RNU66
RNU6B
U18
U47
U54
U75
Z30
RPL21
7q36
5q31.2
9q34
1p34.1-p32
22q13
1q25.1
6p21.32
17p11.2
18q21
18q21
1p22.1
10p13
15q22
1q25.1
8q12
1q25.1
17q12
13q12.2
T-ALL geNorm ranking
1,8
expression stability
1,6
1,4
1,2
1
0,8
0,6
0,4
0,2
0
geNorm ranking
neuroblastoma
leukemia EVI1 overexpression
bone marrow
pool normal tissues
Cumulative variance plot demonstrates most pronounced
removal of variation when using global mean (neuroblastoma)
120
100
80
not normalised
stable controls
60
mean
miRNAs
40
20
0
0
50
100
150
200
250
300
Removal of variation in other datasets
T-ALL
bone marrow
leukemia EVI1 overexpression
pool normal tissues
Biological validation of normalization strategy
n  choice of normalization strategy influences differential miRNA expression
n  miR-17-92 expression in 2 subgroups of neuroblastoma
3,5
3
2,5
2
1,5
1
0,5
0
stable controls
mean
miRNAs
Global mean trategy also works for microarray data
n  each sample is measured by
RT-qPCR and microarray
n  global mean normalization
n  standardization per method
n  hierarchical clustering
n  samples cluster by sample
(and NOT by method)
Global mean strategy also works for mRNA data
n  4 MAQC samples
n  201 MAQC consensus genes are measured
n  geNorm analysis
n  9 classic reference genes
n  1 Alu-Sq assay
n  global mean of 201 mRNAs
0.7
expression stability
0.6
0.5
0.4
0.3
0.2
0.1
0
Global mean normalization method is preferred when many
genes are simultaneously measured
n  novel and powerful normalization strategy
n  only useful when measuring a large and unbiased set of genes
n  e.g. microRNA gene expression profiling of 755 human miRs
n  e.g. measuring panels of pathway genes
n  maximal reduction of technical noise
n  improved identification of differentially expressed genes
n  Mestdagh et al., Genome Biology, 2009 (original global mean)
n  D’haene et al., Methods Mol Biol, 2012 (improved global mean)
n  integrated in qbasePLUS
Novel strategy for normalization
n  need for something new
n  reference gene validation requires (extensive) experimental work
n  sometimes not possible (lack of sample material, funding, time or
devotion)
n  there’s a need for an alternative
n  EAR normalization (Expressed Alu Repeat)
using a repetitive sequence in the human transcriptome as a measure
for the mRNA fraction
EAR normalization - principle
rationale: repeat sequences are present in the UTR of many genes, and
the differential expression of a small number of genes won’t influence
the overall repeat abundance in the transcriptome
Alu repeat elements
n  by far the most abundant repeats in the human genome
n  1 million copies (10% of the genome), 31 subfamilies (well conserved)
n  short interspersed elements (SINE) replicating via retrotransposition
n  ~280 bp long, followed by a variable poly-A tail
n  no known biological function
n  implicated in human disease (unequal recombination)
n  roughly 1500 human genes contain one or more Alu repeats
geNorm analysis of a very heterogeneous experiment confirm
Alu content as good normalizer
3rd EMBO qPCR course, Heidelberg, 2006
Good correlation between geometric mean of 4 stably
expressed reference genes (NF) and Alu-Sq signal
Alu repeat normalization results in more significant genes
n  59 prognostic marker genes
n  WT-Ovation amplified cDNA from 30 primary neuroblastoma tumors
belonging to 2 different risk groups (high risk vs. low risk)
(Vermeulen et al., Lancet Oncology, 2009)
n  multiple reference gene normalization (geometric mean of 4 stably
WSB1
ULK2
TYMS
TNFRSF25
SNAPC1
SLC25A5
SLC6A8
SCG2
QPCT
PTPRN2
PTPRH
PTPRF
PTN
PRKCZ
PRKACB
PRDM2
PRAME
PMP22
PLAT
PLAGL1
PIK3R1
PDE4DIP
PAICS
ODC1
NTRK1
NRCAM
NM23A
NHLH2
MYCN
MTSS1
MRPL3
MCM2
MAPT
MAP7
MAP2K4
INPP1
IGSF4
HIVEP2
GNB1
FYN
EPN2
EPHA5
EPB41L3
ELAVL4
ECEL1
DPYSL3
DDC
CPSG3
CLSTN1
CDKN3
CDH5
CDCA5
CD44
CAMTA2
CAMTA1
BIRC5
ARHGEF7
AKR1C1
AHCY
Alu
NF
NF Alu
expressed reference genes) (NF)
n  expressed Alu repeat normalization
n  more significant genes (45 vs. 35)
n  lower p-values (41 out of 59 have lower p-value)
Good correlation between arithmetic mean Cq of 4 stably
expressed reference genes (X) and Alu repeat signal (Y)
18
16
14
12
ALUsq
20
22
n  cancer cell lines treated with therapeutic compounds
n  Spearman r = 0.849, p = 2.2E-16
22
24
26
28
NF4
30
Normalization in practice
n  most powerful, flexible and user-friendly real-time PCR data-analysis
software
n  based on Ghent University’s geNorm and qBase technology
n  state of the art normalization procedures
o  one or more classic reference genes
o  global mean normalization
o  expressed repeat normalization
o  no normalization
n  detection and correction of inter-run variation
n  dedicated error propagation
n  fully automated analysis; no manual interaction required
http://www.qbaseplus.com
qbasePLUS is the only 3rd party qPCR data-analysis software
that is MIQE and RDML compliant
n  RDML universal file format compatible
n  peer-reviewed data-analysis algorithms
n  multiple reference gene normalization
n  global mean normalization
n  inter-run calibration
n  error propagation
n  assay information
n  data quality control
n  reference gene stability analysis
n  NTC analysis
n  PCR efficiency calculation
n  replicate variability assessment
Conclusions
n  proper normalization has a major impact on your results
n  provides statistically more significant results
n  enables accurate assessment of small expression differences
n  gold standard for RT-qPCR gene expression analysis
n  (geNorm) evaluation of candidate reference genes
n  geometric mean of multiple stably expressed reference genes
n  global mean normalization and subsequent geNorm based selection of
reference genes that resemble the mean is a valid option when measuring
a large and unbiased set of genes (e.g. all miRNAs)
n  expressed Alu repeat elements emerges as an alternative
normalization strategy (Witt, Pattyn, et al., in preparation)
Weekly qPCR tips and tricks via Twitter
https://twitter.com/#!/Biogazelle
@Biogazelle
Acknowledgments
n  Ghent University
n  Pieter Mestdagh
n  Filip Pattyn
n  Steve Lefever
n  Jan Hellemans
n  Stefaan Derveaux
n  Katleen De Preter
n  Sarah De Brouwer
n  Frank Speleman
n  Vladimir Benes, Nina Witt, Tania Nolan, Jim Huggett, …