
Briefings in Bioinformatics. Vol 12, No 2. 176–186
Advance Access published on 9 June 2010
doi:10.1093/bib/bbq019
Protein mass spectra data analysis
for clinical biomarker discovery:
a global review
Pascal Roy, Caroline Truntzer, Delphine Maucort-Boulch, Thomas Jouve and Nicolas Molinari
Submitted: 12th February 2010; Received (in revised form): 9th May 2010
Abstract
The identification of new diagnostic or prognostic biomarkers is one of the main aims of clinical cancer research. In
recent years there has been growing interest in using high-throughput technologies for the detection of such biomarkers. In particular, mass spectrometry appears as an exciting tool with great potential. However, to extract any benefit from clinical proteomic studies, appropriate methods, improvements and validation are required. To clarify the key statistical points involved in such studies, this review presents the main steps of protein mass spectra data analysis, from the pre-processing of the data to the identification and validation of biomarkers.
Keywords: clinical proteomics; statistics; pre-processing; biomarker identification; validation
INTRODUCTION
The identification of new diagnostic or prognostic
biomarkers is one of the main aims of clinical research. A biomarker is defined as a biological constituent whose value differs between groups.
Depending on the study design, these groups can
be either diagnostic (diseased or healthy subjects) or
prognostic (relapse or event free cases) groups.
Currently, clinical research is making use of new
high throughput technologies, like transcriptomics
or proteomics. Proteomics is a rapidly developing
field among other omics disciplines, focusing on
large biological datasets. While transcriptomics
focuses on RNA-level data, proteomics is concerned with the next biological level in the central dogma of molecular biology, namely proteins. Proteins are the actual
effectors of biological functions, and the measurement of their expression levels is directly connected with
their activity. Proteins are less influenced by down- and up-regulation than RNA and might offer a better proxy for biological activity. Furthermore,
proteins are exported out of the cell and can be detected in various biological fluids that are easily
sampled, like blood. Nevertheless, it should also be
pointed out that protein expression levels and activities are not perfectly correlated.
Mass spectrometry (MS) is a technology recently
used for the separation and large-scale detection of
proteins present in a complex biological mixture.
This technology is increasingly being used in proteomic clinical research for the identification of new
biomarkers and offers interesting insights into biological samples containing large numbers of proteins, such as plasma or tumour extracts.
Corresponding author. Nicolas Molinari, Laboratoire de Biostatistique, IURC, 641 avenue Gaston Giraud, 34093 Montpellier, France;
BESPIM, CHU de Nîmes, France. Tel: +33467415921; Fax: +33467542731; E-mail: [email protected]
Pascal Roy, MD, PhD, is Professor of Biostatistics, Hospices civils de Lyon, Service de Biostatistique, Lyon, F-69003, France;
Université de Lyon, Lyon; CNRS, UMR 5558, Pierre-Bénite, F-69310, France. He is interested in diagnostic and prognostic modelling
in clinical research.
Caroline Truntzer works as Research Engineer at the Clinical Proteomic Platform (CLiPP), CHU Dijon, where she manages the statistical team. She holds a PhD in Biostatistics and is interested in specific issues related to the statistical analysis of high-dimensional data.
Delphine Maucort-Boulch is Associate Professor in the ‘Equipe Biostatistique Santé’. She is MD, PhD in Biostatistics, and mainly interested in modelling and in the predictive properties of models.
Thomas Jouve received a PhD in Biostatistics from the University of Lyon, France, and works on power issues in biomarker detection with high-throughput technologies.
Nicolas Molinari is Associate Professor of Biostatistics, EA 2415 Université Montpellier 1, University Hospital of Nı̂mes. He is
interested in specific issues related to the statistical analysis in clinical research.
© The Author 2010. Published by Oxford University Press. For Permissions, please email: [email protected]
Initially, surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF), and thereafter matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF), have been the most commonly used MS instruments for these clinical objectives.
MS analysis relies on identifying proteins by their time-of-flight (TOF), which is related to their mass (m) to charge (z) ratio, usually written m/z. MS
output is a mass spectrum, associating TOF with
signal intensities. These intensities are related to the
concentrations of proteins in the processed sample.
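As an idealized illustration (not stated in the article), for a linear TOF analyser with drift length $d$ and accelerating voltage $U$, an ion of mass $m$ and charge $z$ (with elementary charge $e$) satisfies

$$t = d\,\sqrt{\frac{m}{2\,z\,e\,U}} \quad\Longleftrightarrow\quad \frac{m}{z} = \frac{2\,e\,U}{d^{2}}\,t^{2},$$

so the observed TOF grows with the square root of m/z and the recorded time axis can be calibrated to the m/z axis.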
The MS signal resulting from SELDI-TOF or
MALDI-TOF measurements is contaminated by different sources of technical variations that can be
removed by a prior pre-processing step. Whether the technique is quantitative is still questioned by some authors [1]; this aspect deserves further investigation before the potential of MS for quantitative analysis can be judged [2, 3]. MS nevertheless appears
as an exciting tool with great potential [4]. Despite
numerous studies stating that MS methods are a
powerful approach for detecting diseases in medicine
and biology, they still require appropriate methods,
improvement and validation [5–7].
This review presents the data analysis steps of MS
experiments in the context of high throughput identification studies. Section 2 presents the specificity of
high throughput proteomics and the pre-processing
steps. After the signal is thus cleaned, Section 3 deals
with biomarker identification and validation. A discussion is proposed in Section 4.
PRE-PROCESSING STEPS
Data acquisition introduces various sources of experimental noise. As a consequence, the observed signal I for sample i can be decomposed into several components, described in the following equation: the biological signal of interest (S), weighted by a normalization factor (k); a baseline (B), corresponding to systematic artifacts usually attributed to clusters of ionized matrix molecules hitting the detector or to detector overload; and random noise (N) from the electronic measuring system or of chemical origin:

$$I_i(t) = k \cdot S_i(t) + B_i(t) + N_i(t),$$

where t refers to TOF values (related to m/z). The aim of the pre-processing steps is to isolate the true signal S. The order in which these steps should be performed remains an open question, and different positions have been taken by different authors. The order of pretreatment steps and the interactions between methods were investigated by Arneberg et al. [8] through an original modelling approach; the interested reader is referred to this article for more details. A summary of the entire analytical workflow is presented in Figure 1.
Currently, there is no consensus as to which algorithms to use during the pre-processing steps; this is still an active field of research and a wide variety of algorithms have been proposed. Morris et al. [9] provided early, well-established guidelines for the pre-processing of MS data. More recently, Cruz-Marcelo et al. [10] and Yang et al. [11] proposed extensive comparisons of popular pre-processing methods. Most of these methods are available as dedicated packages, e.g. from the Bioconductor repository, for the open-source R software (http://www.R-project.org). A succinct description of the main and most recent methods for each pre-processing step is provided below.

Figure 1: Workflow for MS data analysis: from raw spectra to identification of biomarkers and classification.
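To make the additive model above concrete, here is a minimal simulation sketch; all peak positions, widths and scales are hypothetical, chosen only for illustration:

```python
# A minimal simulation of the additive model I_i(t) = k*S_i(t) + B_i(t) + N_i(t).
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(5000)                        # clock ticks, a proxy for TOF

# True signal S: a few Gaussian peaks at hypothetical positions
peak_pos = [800, 1500, 2600, 4000]
peak_height = [40.0, 25.0, 60.0, 15.0]
S = sum(h * np.exp(-0.5 * ((t - p) / 12.0) ** 2)
        for p, h in zip(peak_pos, peak_height))

k = 1.3                                    # sample-specific normalization factor
B = 80.0 * np.exp(-t / 1500.0)             # slowly decaying baseline
N = rng.normal(scale=2.0, size=t.size)     # electronic/chemical noise

I = k * S + B + N                          # the observed spectrum
```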
Noise filtering
The most commonly adopted method is the use of
the wavelet transform [12, 13]. For this purpose,
spectra are first decomposed into wavelet bases,
usually using an undecimated discrete wavelet transform (UDWT), and in this way expressed through
detail and approximation coefficients. Noise is then removed from the original spectrum by shrinking small detail coefficients to zero. Two thresholding options are available: (i) hard thresholding replaces coefficients below a given threshold with zeroes; (ii) soft thresholding additionally shrinks the remaining coefficients towards zero. Hard thresholding might distort the signal more than soft thresholding, but it remains a simpler approach with good performance.
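A minimal sketch of this thresholding scheme, assuming the PyWavelets package and using a decimated DWT for brevity (the undecimated variant would use pywt.swt/pywt.iswt instead); the universal threshold and the MAD noise estimate are standard generic choices, not necessarily those of the cited papers:

```python
# Sketch of wavelet denoising by thresholding detail coefficients.
# `I` is an observed spectrum (1-D array), e.g. from the simulation above.
import numpy as np
import pywt

coeffs = pywt.wavedec(I, 'db8', level=6)           # approximation + details
sigma = np.median(np.abs(coeffs[-1])) / 0.6745     # noise scale from finest details
thr = sigma * np.sqrt(2 * np.log(I.size))          # universal threshold

# Soft thresholding shrinks the detail coefficients towards zero;
# mode='hard' would instead zero the small ones and keep the rest unchanged.
shrunk = [coeffs[0]] + [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]
I_denoised = pywt.waverec(shrunk, 'db8')[:I.size]
```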
Du et al. [14] proposed the use of continuous
wavelet transforms that simultaneously filter out
noise and baseline. Later, Kwon et al. [15] proposed
a novel wavelet strategy that accommodates errors
that vary across the mass spectrum. Recently,
Mostacci et al. [16] proposed an effective multivariate
denoising method combining wavelets and principal
component analysis (PCA), which takes into account
common structures within the signals.
Baseline correction
Two main approaches can be distinguished for removing the baseline: the baseline is considered either (i) as the part of the signal remaining after the features of interest have been removed [6], or (ii) as some kind of smooth curve underlying the spectrum [12, 17, 18]. Malyarenko et al. [19] offered a time-series perspective on the problem: in their article, the baseline arises from a constant offset and a slowly decaying charge, plus some shift after detector overload events. Another original approach was developed by Dijkstra et al. [20], who set up a mixture model to deconvolve each signal component of a spectrum; a dedicated component of the mixture represents the baseline.
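As a hedged illustration of approach (ii), a smooth underlying curve can be estimated with a simple morphological opening; this is a generic technique, not one of the cited algorithms, and the window size is a hypothetical tuning parameter:

```python
# Sketch of a smooth-underlying-curve baseline estimate via morphological
# opening. The window must be wider than the widest peak (hypothetical here).
from scipy.ndimage import grey_opening

baseline = grey_opening(I_denoised, size=300)
I_corrected = I_denoised - baseline
```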
Alignment of spectra
Due to physical principles underlying MS, a shift on
the m/z axis can appear. Alignment corresponds to
an adjustment of the time of flight axes of the
observed spectra so that the features of the spectra
are aligned with one another. In fact, for a reliable
identification of local features, one needs to carefully associate each feature of interest with a specific
m/z. The easiest way to perform alignment would
be to allow spectra to shift a few time-steps to the
left or to the right so that their correlation with
other spectra is maximized. Nevertheless, this
method may be too simplistic as it does not take
into account the differences in shift that may occur
along the m/z axis. More elaborate algorithms were
thus developed to handle stronger misalignment.
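The simple global-shift strategy described above can be sketched as follows (np.roll wraps around at the edges, so this is only a crude illustration of the idea):

```python
# Sketch of global alignment by maximizing cross-correlation with a reference.
import numpy as np

def align_by_shift(spectrum, reference, max_lag=50):
    """Shift `spectrum` by the lag in [-max_lag, max_lag] that best
    matches `reference` (both 1-D arrays of equal length)."""
    lags = range(-max_lag, max_lag + 1)
    scores = [np.dot(np.roll(spectrum, lag), reference) for lag in lags]
    return np.roll(spectrum, lags[int(np.argmax(scores))])
```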
Based on a set of peaks in a reference spectrum, Jeffries [21] proposed stretching each spectrum so that the distances between peaks from the reference and from the other spectra are minimized. The same
year, Wong et al. [22] developed a strategy that
also aligns spectra using selected local features (usually peaks from the average spectrum) as anchors. In
the proposed method, spectra are locally shifted by
inserting or deleting some points on the m/z axis. In
2006, Pratapa et al. [23] proposed an original method for the alignment of repeated mass spectra, i.e. multiple spectra from the same sample. All spectra are considered as variations of one and the same latent spectrum, which is inferred with a Hidden Markov Model (HMM). Later, Antoniadis et al. [12] proposed a landmark-based alignment in the wavelet framework. More recently, Kong et al. [24] proposed a Bayesian approach that uses the expectation-maximization algorithm to find the posterior mode of the set of alignment functions and the mean spectrum for a population of patients, and Feng et al. [25] modelled m/z shifts with an integrated Markov chain shifting (IMS) method.
Normalization
A normalization step is necessary to make spectra comparable on the intensity scale. The underlying assumption is that the total amount of protein is roughly the same across samples; in other words, only a small proportion of all proteins in a sample are differentially expressed. Total ion current (TIC), which is related to the total number of ion collisions with the detector and output by the mass spectrometer, is a useful proxy for the total amount of protein. Each intensity in the spectrum is divided by the TIC. This method is debatable [26], and more robust but less intuitive approaches are being studied.
A good comparison study of these methods is available [27].
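A minimal sketch of TIC normalization for a matrix of spectra; rescaling by the mean TIC is a common convention, assumed here for illustration:

```python
# Sketch of TIC normalization.
# `spectra` is an (n_samples, n_points) array of baseline-corrected spectra.
import numpy as np

tic = spectra.sum(axis=1, keepdims=True)   # one total ion current per spectrum
spectra_norm = spectra / tic * tic.mean()  # comparable intensity scales
```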
Peak detection
Once the true signal is isolated, peak detection aims to identify the locations in the m/z range that correspond to peptides and to quantify the corresponding intensities. Several approaches have been proposed for this purpose [6]; only the main strategies are described here.
The most intuitive strategy, initially developed by Yasui et al. [28], relies on local maxima, that is to say points with higher intensity than the other points in their neighbourhood. Once identified, these maxima may be filtered so as to keep only peaks that emerge from the noise and (ideally) really correspond to a peptide. For example, a local maximum may be retained as a meaningful peak when one or all of the following criteria exceed a user-defined threshold: signal-to-noise ratio, intensity and area [29]. Other authors have proposed to move the peak-finding problem to the wavelet coefficient space; the goal here is to find high coefficients that cluster at a similar position across different scales and thus correspond to peaks [12, 14, 30]. Another interesting strategy is to use model-based criteria that fit model functions to the peaks [31].
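A sketch of the local-maximum strategy with simple filtering criteria, using SciPy; the 3x and 5x multipliers of a crude noise scale are hypothetical choices:

```python
# Sketch of local-maximum peak detection with filtering criteria.
import numpy as np
from scipy.signal import find_peaks

noise = np.median(np.abs(I_corrected - np.median(I_corrected)))
peaks, props = find_peaks(I_corrected,
                          height=3 * noise,        # intensity filter
                          prominence=5 * noise)    # must emerge from the noise
peak_positions = peaks                             # indices on the t (m/z) axis
peak_heights = props['peak_heights']
```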
The above methods subsequently require peak
alignment to decide which peaks in different samples
correspond to the same biological input [29].
Moreover, only peaks that appear in a large enough
proportion of collected samples will be considered
in the further analysis, leading to the choice of an
arbitrary threshold proportion. To avoid such arbitrary decision making, Morris et al. [9] proposed identifying peaks on the mean spectrum. This has additional advantages: it filters out some instrument noise, as well as noisy peaks that appear in too few spectra.
Finding features of interest
Using peaks as features of interest requires estimating
the intensities (a function of the number of molecules hitting the detector) associated with these peaks.
Peaks are actually two-dimensional (2D) features
with a height and a width. The simplest approach is to use peak heights as intensities. However, as can easily be verified on a spectrum, peaks are narrow at low m/z but get broader with increasing m/z values. The area under the peak is therefore another proxy for intensity. For a given peak, computing this area is not necessarily complicated, but it requires evaluating the peak width, which demands some care. For close or overlapping peaks (on the m/z axis) corresponding to different peptides, estimating the intensity can be a really hard step. If such overlapping peaks are visible in the spectrum, several relatively recent methods [20, 32] using distribution mixtures offer the possibility to deconvolve features, thus enhancing the resolution of each peak and enabling a better intensity reading [33].
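A sketch of area-based intensities, assuming the peaks found in the previous sketch; rel_height controls how far down the flanks the width is measured and is a hypothetical choice:

```python
# Sketch of area-based intensities: estimate each peak's width,
# then integrate the signal across it.
import numpy as np
from scipy.signal import peak_widths

widths, _, left, right = peak_widths(I_corrected, peaks, rel_height=0.9)
areas = [np.trapz(I_corrected[int(l):int(r) + 1])
         for l, r in zip(left, right)]
```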
At this step, each spectrum is characterized by a
finite, common number of peaks. Using adapted
statistical methods, biomarker discovery analysis
then aims to detect which of these peaks are associated with the factors of interest. Another approach
considers spectra as functional data and analyses
wavelet coefficients rather than detected peaks
[9, 12, 18]. Although promising, this latter approach seems little used and merits further investigation.
BIOMARKER IDENTIFICATION
AND VALIDATION
Biological constituents associated with either diagnostic or prognostic groups are tested in diagnostic
or prognostic studies, respectively. These studies correspond to the first phases of biomarker development
formalized by Pepe et al. [34] in the context of early
detection of cancer. Guidelines have also been proposed to encourage transparent and complete reporting in the context of prognostic studies, so that the contribution of their conclusions to scientific knowledge can be better understood [35].
Generally, the value distributions of constituents overlap between the groups. In classical clinical studies, one or a few biological constituents are tested; these candidate biomarkers correspond to a priori hypotheses derived from a biological pathway.
Recent developments in molecular biology have led to high-throughput technologies, and consequently to the question of how best to design such studies. A sequential approach has been
adopted to discover new biomarkers. The first step
corresponds to identification studies that aim to select
a list of candidate biomarkers among a large number
of biological constituents, and to estimate the
strength of association between those candidate biomarkers and disease status or outcome. The second
step corresponds to validation studies designed to
retain, among previously selected candidates, the
confirmed biomarkers, and to re-estimate the
strength of association between those biomarkers
and disease status or outcome. Classification refers to the assignment of each spectrum, i.e. of each sample or, equivalently, each individual, to a group. Given a serum sample, the aim of classification is to allocate the patient to a specific diagnostic or prognostic group. Though tightly related to biomarker discovery, profiling, similar to the signature concept in transcriptomics, has appeared as a new strategy for the classification of mass spectra. The idea is to use a protein profile, defined as a list of intensities at different m/z positions, instead of a unique protein concentration. Such a profile can be used in two different ways: (i) as a set of proteins that must (or could later) be individually identified, or (ii) as a set of distinguishing features without searching for their biological support. The latter can be an efficient approach in clinical applications.
Identification studies
The aim of identification studies is the selection of candidate biomarkers from a large number of biological constituents. In the proteomic context, a biomarker can be defined as a protein that differentiates samples from different groups. Several methods have been used or developed to identify candidate biomarkers, and these methods have been widely discussed in the transcriptomic field. The search for biomarkers is tightly linked to the well-known curse of dimensionality that affects all high-throughput methods, where the number of variables (peak intensities) exceeds the number of samples. Intuitively, the dimension of the regression space must be reduced, and two approaches have been proposed to do this. In the first approach, the subspace is defined by selecting certain variables within the matrix of features of interest, whereas in the second, the subspace is defined by components. While the former directly focuses on the identification of features that exhibit intensity differences between the groups, the latter evaluates the importance of features embedded in a classification algorithm.

Feature selection
When two groups are compared, simple approaches consider each feature individually. The aim is to compare intensities between the groups: each test tells whether the two groups can be distinguished based solely on the information from one particular feature. This corresponds to univariate statistical tests, such as the classical Student t-test. The observed test value is compared to the distribution of the test statistic under the null hypothesis H0 of equal mean expression levels in the two groups, and the significance of the test is calculated from the cumulative distribution function. Since features are never considered simultaneously, correlations cannot be taken into account and the information carried by a set of features might be redundant. Biomarker identification is performed with control of the type-1 and type-2 error rates, adapted to the large number of statistical tests performed (Table 1). The false discovery rate (FDR), introduced by Benjamini and Hochberg [36] as

$$\mathrm{FDR} = E(Q), \qquad Q = \begin{cases} V/R, & \text{if } R > 0 \\ 0, & \text{if } R = 0, \end{cases}$$

where V is the number of false positive tests and R the total number of rejected null hypotheses, is frequently used to control the type-1 error [37]. Storey and Tibshirani [38] proposed the q-value, i.e. the expected proportion of false positives among all features as or more extreme than the observed one, as an FDR-based measure of significance. They also provided an estimator $\hat{\pi}_0$ of the proportion of features following the null hypothesis. To control the type-2 error, the proportion of detected features of interest,

$$\mathrm{Power} = \frac{E(S)}{m_1},$$

has been proposed as a definition of power in the context of multiple testing [39, 40].
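A minimal sketch of this testing scheme on fully synthetic data: one t-test per peak, then the Benjamini-Hochberg step-up rule at FDR level alpha (the sample sizes, peak counts and effect sizes are all hypothetical):

```python
# Sketch of per-peak t-tests with Benjamini-Hochberg FDR control.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_samples, n_peaks = 40, 200
X = rng.normal(size=(n_samples, n_peaks))     # peak-intensity matrix
y = np.repeat([0, 1], n_samples // 2)         # two groups of 20 samples
X[y == 1, :5] += 1.0                          # inject 5 truly differential peaks

pvals = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(n_peaks)])

def benjamini_hochberg(p, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (step-up rule)."""
    order = np.argsort(p)
    ranks = np.arange(1, p.size + 1)
    below = p[order] <= alpha * ranks / p.size
    if not below.any():
        return np.array([], dtype=int)
    return order[:np.where(below)[0].max() + 1]

candidates = benjamini_hochberg(pvals)
```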
Beyond biomarker identification, clinical studies aim to estimate the strength of the association between the candidate biomarker level and disease status or disease outcome. For parametric models, variable selection results from parameter estimation. Because of the selection mechanism involved in identification studies, the strength of association is commonly over-estimated. This optimistic bias is easily understood: only variables whose test values exceed an a priori defined threshold are selected. Over-estimation of the strength of association and false discoveries are both explained by the well-known regression-to-the-mean phenomenon [41]. Optimism increases when the proportion of features of interest decreases, and is reduced by increasing the number of samples [42]. Whereas multivariate modelling is useful for identifying redundancy in the information provided by the selected features while adjusting for confounders, it does not correct the optimism bias linked to the selection process.

Table 1: Classical outcome for a binary decision

Truth       Accept H0   Reject H0   Total
H0 true     U           V           m0
H0 wrong    T           S           m1
Total       m - R       R           m
The parameter estimation bias results in an
over-estimation of the predictive ability of statistical
models when applied to new samples. More sophisticated methods for the selection of predictors exist,
jointly known as penalized regression. The idea
behind these techniques is to constrain the regression
coefficients. Several constraints have been proposed
to shrink the parameter estimators, known as L1 or L2 penalizations [43–45]. These penalizations filter out weak predictors and often improve prediction accuracy. The extension of the threshold gradient descent (TGD) method [46] to the survival model [47] follows a similar logic.
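A hedged sketch of an L1-penalized (lasso-type) classifier with scikit-learn; the penalty strength C is a hypothetical value that would normally be tuned by cross-validation, and X, y are as in the earlier testing sketch:

```python
# Sketch of an L1-penalized classifier: the penalty drives weak
# predictors' coefficients to exactly zero, performing selection.
from sklearn.linear_model import LogisticRegression

lasso_clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
lasso_clf.fit(X, y)
selected = lasso_clf.coef_.ravel() != 0     # mask of surviving peaks
```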
Classification algorithms
Given Y, the vector of disease status or outcome, dimension reduction techniques aim to explain Y with information from X, the matrix of features of interest. X-based reduction methods, i.e. unsupervised methods, only use information from X, while X- and Y-based methods, i.e. supervised methods, also use information from Y. A typical example of X-based methods is PCA, where the components define the subspace of X that maximizes the projected X-variability. Supervised X- and Y-based methods define the subspace that maximizes the projected X- and Y-covariability. Several methods can be used, and a priori knowledge of the data structure may guide the choice of the analysis method [48]. Partial least squares (PLS) regression is an example of a dimension reduction technique in which the components are built to be directly associated with Y [49]. Linear discriminant analysis (LDA) first requires a dimension reduction step, such as PCA (PCA+LDA) or PLS (PLS+LDA). A subset of the first components is usually sufficient to capture most of the data covariance, and the optimal number of components can be chosen by cross-validation, as proposed by Boulesteix in the case of PLS+DA [49].
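A sketch of the PCA+LDA strategy with the number of components chosen by cross-validation, assuming scikit-learn and the X, y of the earlier sketches; the candidate component counts are arbitrary:

```python
# Sketch of PCA+LDA with cross-validated choice of n_components.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

scores = {}
for n in (2, 5, 10, 20):
    pipe = Pipeline([('pca', PCA(n_components=n)),
                     ('lda', LinearDiscriminantAnalysis())])
    scores[n] = cross_val_score(pipe, X, y, cv=5).mean()
best_n = max(scores, key=scores.get)   # components kept for the final model
```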
Wu et al. [50] proposed a comparison of certain
classification algorithms for MS data. They compared
LDA and quadratic discriminant analysis (QDA) to
other supervised learning algorithms: k-nearest
neighbors (KNN), bagging, boosting, random
forest (RF) and support vector machines (SVM).
KNN uses a vote of nearest samples (in a space
defined by intensities for different features) to
choose a class for a new sample. Bagging, boosting and RF are methods that aggregate classifiers such as classification and regression trees (CART). Hastie et al. [45] present these learning algorithms and their properties in detail.
Combined methods
These methods combine feature selection and classification algorithms. Classification algorithms involve a large set of potential biomarkers in classifier building; in combined methods, the biomarkers that most help the classification are retained and used in the classifier, while the others are discarded. Classification and regression trees are an example, as well as their combination with bagging, boosting or random forests. Sparse PLS combines PLS regression and variable selection into a one-step procedure [51]. Reynes et al. [52] used a genetic algorithm (GA) to select the features that contribute to a voting rule, whereas Koomen et al. [53] combined a GA, selecting features that yield a large Mahalanobis distance between groups, with an FDR-controlled Student's t-test.
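As an illustration of this combined philosophy (not the GA approaches of Reynes et al. or Koomen et al.), one can keep the peaks a random forest finds most useful and refit the classifier on that subset:

```python
# Sketch of a combined approach: rank peaks by random-forest importance,
# keep the top ones, and refit on that subset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:20]   # 20 most helpful peaks
rf_small = RandomForestClassifier(n_estimators=500,
                                  random_state=0).fit(X[:, top], y)
```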
Non-peak-based classification
Some classification methods were developed that do
not rely on the peak concept. Instead, these methods
use every single point of the spectra to look for regions that can distinguish different groups of samples.
Indeed, the search for part of the spectrum containing useful information can be performed under the
guidance of variables related to clinical events. The
pioneering study by Petricoin et al. [54] is a typical
example of this strategy using every point of the
spectrum as a potential predictor. They developed
an algorithm that combined a genetic algorithm
with a self-organizing map to select a few points on
the m/z axis as the best set of group predictors.
Li et al. [55] combined GA for feature selection and
KNN to perform classification in a restricted subspace. Tong et al. [56] used the same initial idea: they consider each point of a spectrum as a potential feature of interest, in the same way as Petricoin et al., but select m/z positions using CART combined into a decision forest (a specific tree aggregator with constraints on tree heterogeneity and quality).
Validation studies
Data splitting of the study sample into a learning set
and a test set has been proposed as a process of internal validation. The classifier is trained on the
learning set and later validated on the test set. Such
an internal validation is a required evaluation procedure to avoid over-fitting, i.e. avoid the use of
characteristics of the learning set that are not of interest but rather specific to this particular set. The reliance on one particular learning set among the 2^n possible splits, and the power limitation of the procedure, result in a preference for leave-K-out or bootstrap methods
[57–60]. When the statistical analysis of the identification study first requires a selection of differential
biological constituents and then a classification algorithm, the test set must be excluded from all
pre-processing and biomarker identification steps.
Hilario et al. [61] drew attention to the potential misuse of cross-validation techniques: they advise against performing the pre-processing steps on the full dataset before the split between training and test sets is decided, as this can result in artificially over-estimated biomarker performances. Even if an
internal validation has been performed, external validation is a necessary step of biomarker validation.
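A sketch of the recommendation of Hilario et al.: wrapping feature selection and the classifier in a single pipeline ensures the selection is re-run inside each cross-validation fold, so the test folds never leak into it (the component choices are illustrative; X, y as in the earlier sketches):

```python
# Sketch of feature selection performed inside cross-validation.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([('select', SelectKBest(f_classif, k=10)),
                 ('clf', LogisticRegression(max_iter=1000))])
honest_score = cross_val_score(pipe, X, y, cv=5).mean()
```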
Pepe et al. [62] underlined the difference between
the strength of association between a biomarker
level and disease status or outcome estimated by
the odds-ratio, and its ability to classify subjects.
They proposed to use receiver operating characteristic (ROC) curves to estimate the classification
performance of a biomarker or its incremental
value in addition to the usual diagnostic or prognostic clinical and biological factors. This approach has
been extended to the analysis of positive and negative predictive values using predictive receiver operating characteristic (PROC) curves [63].
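A minimal sketch of this ROC-based evaluation for a single candidate biomarker, assuming its measured levels and the group labels y of the earlier sketches:

```python
# Sketch of ROC-based evaluation of one candidate biomarker.
from sklearn.metrics import roc_curve, roc_auc_score

marker = X[:, 0]                        # e.g. one peak's intensities
fpr, tpr, thresholds = roc_curve(y, marker)
auc = roc_auc_score(y, marker)          # 0.5 = uninformative, 1.0 = perfect
```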
Statistical power
The power of transcriptomic studies has been shown to decrease as the number of non-differentially expressed genes considered in the study increases [39]. This theoretical result applies to proteomics, although the smaller number of tests performed in proteomics leads to a smaller power decrease. In addition to power calculations
developed in the context of transcriptomic analysis
[39, 40], Jouve et al. [64] have underlined that the
high instrumental variability encountered in MS,
together with FDR control, builds a detrimental
synergy leading to a low statistical power for usual
MS study sample sizes. Larger identification studies and reduced measurement variability are both needed.
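As a rough, hedged illustration of such a calculation, one can size a two-sample t-test at a multiplicity-adjusted alpha; a Bonferroni adjustment is used here as a crude stand-in for FDR-based calculations, and all numbers are hypothetical:

```python
# Sketch of a per-feature sample-size calculation under multiple testing.
from statsmodels.stats.power import TTestIndPower

m_tests, effect = 200, 0.8              # hypothetical test count, effect size
n_per_group = TTestIndPower().solve_power(effect_size=effect,
                                          alpha=0.05 / m_tests,
                                          power=0.8)
```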
DISCUSSION
Proteomic technologies have been recently adapted
to the new field of clinical proteomics. Our article
deals with the different steps involved in proteomic
data analysis. These methods assume that the biological sample is of high quality. It is therefore
quite useful to add a preanalytical step concerning
sample preparation. The pivotal article by Petricoin
et al. [54] emphasized the impact of preanalytical factors in proteomic studies and the predominant role
of good study design towards better reproducibility
and hence, biologically meaningful results [5]. The
various origins of errors and biases have been well identified in the preanalytical and analytical steps, leading to a thorough discussion of the conditions necessary to avoid them (temperature handling, time delays or freeze/thaw cycles). The quality of the
pre-analytical steps, as well as the number of samples
available, is crucial, regardless of the technology
involved later on.
Inadequate sample quality will impact the fractionation steps, such as major-protein depletion, that allow the investigation of low-concentration constituents, as well as the MS analysis itself. Albrethsen [65] made
an interesting review about reproducibility in several
protein profiling studies using MALDI-TOF instruments and highlighted several sources of technical variation. Villanueva et al. [66] and Callesen et al. [67] showed that an optimized serum sample preparation method makes it possible to overcome the challenge of reproducibility. It was also demonstrated
that high throughput MS may be a promising biomarker discovery tool as long as analytical sources of
variation are identified and controlled [68].
In addition to these sample quality considerations
that could benefit in the future from good quality
control, experiments must be designed so as to avoid
potential sources of bias and confounding factors;
inadequate study designs may lead to a confounding
effect between technical factors and the biological
factor of interest. Two strategies can be employed to this end: blocking and randomization. The first avoids systematic bias due to a block effect, by balancing the samples within each block (plate) with respect to the study factors of interest. The second avoids unknown or unexpected biases, such as position on the plate or tip differences. The addition of technical replicates to biological replicates in study designs has recently been discussed [69]. Technical replicates allow controlling the technical variability, and knowledge about biological and technical variability can then be used for sample size calculations [70].
Today, cancer clinicians would like to combine
proteomic profiles and classical clinical variables in
the same models to improve assessment of cancer
prognosis. The first intuitive way to do this is to
involve classical clinical and ‘omic’ variables in the
same models [71–73]. However, this approach is not satisfying since it does not take into account that the effects of ‘omic’ variables are overestimated relative to the classical clinical variables [42, 74].
In the very recent literature, certain authors have
proposed models that take this optimism into account. These models simultaneously construct a classifier combining both types of variables and
determine whether ‘omic’ data represent additional
predictive value. In 2008, Binder and Schumacher
[75] proposed an offset-based boosting approach in
the context of survival data. Shortly after, Boulesteix et al. [76] and Le Cao et al. [77] proposed more general approaches, in the sense that they are not limited to survival data; these approaches are based on PLS, RF and the pre-validation strategy suggested by Tibshirani and Efron [74].
Boulesteix et al. [78] proposed a permutation-based
procedure to test the additional predictive value of
high-dimensional molecular data. In addition, Le
Cao et al. [79] recently proposed a joint analysis of
multiple datasets from different technology platforms. More precisely, they proposed a regularized
version of canonical correlation analysis to analyse
correlations between two data sets, and a sparse version of PLS regression that includes simultaneous
variable selection in the two types of data sets.
In this work, we decided to focus on the analysis
of proteomic data from MS, which provides an efficient tool for discovering new biomarkers in a large
number of samples. One potential drawback of MS is
that the data generated are not directly quantitative.
In fact, samples analysed with MS are highly complex, thus limiting the use of MS as a quantification
tool. Once selected and statistically validated, candidate markers have to be identified and quantitatively
validated. Tandem MS (MS/MS) completes this type
of work. This technique involves multiple steps of MS selection in which each peptide of interest (candidate marker) is isolated and fragmented. A sequence search engine, like MASCOT for example, then tries to match the observed spectra with known peptides whose sequences are stored in dedicated databases. This label-free technique also provides a direct mass spectrometric signal intensity for the identified peptides. To reduce the sample
complexity and thus allow a better quantification,
proteins are often first isolated through separation
techniques like liquid chromatography (LC). Such
label-free techniques allow a direct comparative
quantification of proteins [80, 81]. These methods require heavy analytical processes and for this reason are not yet extensively used on large-sample datasets. In addition to this biological aspect, label-free proteomics generates larger amounts of data than standard MS, leading to additional statistical questions for which the statistical analytical tools are still immature [82–84]. However, label-free proteomics appears promising, and current work is evaluating its use in large clinical proteomics studies.
Some alternative methods to standard MS for differential proteomics have appeared in recent years, such as isobaric tags for relative and absolute quantification (iTRAQ) or difference gel electrophoresis
(2D-DIGE). These methods aim at both selecting and quantifying candidate biomarkers. Briefly, the iTRAQ technique uses four isobaric amine-specific tags to determine relative protein levels in up to four samples simultaneously. 2D-DIGE is a 2D gel separation technology for proteins in which the samples to be compared are labelled with different dyes (Cy3 and Cy5), enabling signal detection at different emission wavelengths.
Interested readers may consult the work of Wu et al. [85], who provided a comparative study of these
methods. Although interesting, these two latter techniques cannot be used in large-scale studies at the moment: both are limited by the low number of samples that can be analysed simultaneously. Many hundreds of samples can be analysed simultaneously with MS, while fewer than 10 can be analysed using 2D-DIGE or iTRAQ. This limitation results in a lack of reproducibility for the former, and in material and time costs that are too high for well-powered discovery studies for the latter. This is
a drawback for biomarker discovery in large-scale
studies.
Proteomics represents, in our opinion, a key field
for the future of clinical research. However, improvements can be made in the methodologies
involved. This improvement can only result from a
tight collaboration between biostatisticians, computer scientists, biologists and clinicians.
Key Points
- Efficient pre-processing is an essential prerequisite for retrieving meaningful biological information from raw spectra and reaching meaningful clinical conclusions.
- The identification and validation of pertinent biomarkers requires large, well-designed studies.
- Methodology improvement would benefit from a tight collaboration between biostatisticians, computer scientists, biologists and clinicians.
References
1. Diamandis EP. How are we going to discover new cancer biomarkers? A proteomic approach for bladder cancer. Clin Chem 2004;50:793–5.
2. Semmes OJ, Feng Z, Adam BL, et al. Evaluation of serum
protein profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry for the detection of
prostate cancer: I assessment of platform reproducibility.
Clin Chem 2005;51:102–12.
3. West-Nørager M, Kelstrup CD, Schou C, et al. Unravelling
in vitro variables of major importance for the outcome of
mass spectrometry-based serum proteomics. J Chromatogr B
Analyt Technol Biomed Life Sci 2007;847:30–7.
4. Lin Z, Jenson SD, Lim MS, et al. Application of SELDI-TOF mass spectrometry for the identification of differentially
expressed proteins in transformed follicular lymphoma. Mod
Pathol 2004;17:670–8.
5. Baggerly KA, Morris JS, Coombes KR. Reproducibility of
SELDI-TOF protein patterns in serum: comparing datasets from
different experiments. Bioinformatics 2004;20:777–85.
6. Coombes KR, Baggerly KR, Morris JS. Pre-Processing Mass
Spectrometry Data, Fundamentals of Data Mining in Genomics
and Proteomics. Boston: Kluwer, 2007.
7. Hu J, Coombes KR, Morris JS, et al. The importance of
experimental design in proteomic mass spectrometry
experiments: Some cautionary tales. Brief Funct Genomic
Proteomic 2005;3:322–31.
8. Arneberg R, Rajalahti T, Flikka K, et al. Pretreatment of
mass spectral profiles: application to proteomic data. Anal
Chem 2007;79:7014–26.
9. Morris JS, Coombes KR, Koomen J, et al. Feature extraction and quantification for mass spectrometry in biomedical
applications using the mean spectrum. Bioinformatics 2005;
21:1764–75.
10. Cruz-Marcelo A, Guerra R, Vannucci M, et al. Comparison
of algorithms for pre-processing of SELDI-TOF mass spectrometry data. Bioinformatics 2008;24:2129–36.
11. Yang C, He Z, Yu W. Comparison of public peak detection algorithms for MALDI mass spectrometry data analysis.
BMC Bioinformatics 2009;10:4.
12. Antoniadis A, Bigot J, Lambert-Lacroix S, et al.
Nonparametric pre-processing methods and inference
tools for analyzing time-of-flight mass spectrometry data.
Curr Anal Chem 2007;3:127–47.
13. Coombes KR, Tsavachidis S, Morris JS, et al. Improved
peak detection and quantification of mass spectrometry
data acquired from surface-enhanced laser desorption and
ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics 2005;5:4107–17.
14. Du P, Kibbe WA, Lin SM. Improved peak detection
in mass spectrum by incorporating continuous wavelet
transform-based pattern matching. Bioinformatics 2006;22:
2059–65.
15. Kwon D, Vannucci M, Song JJ, et al. A novel wavelet-based
thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise. Proteomics
2008;8:3019–29.
16. Mostacci E, Truntzer C, Cardot H, et al. Multivariate
denoising methods combining Wavelets and PCA for
mass spectrometry data. Proteomics, in press.
17. Fung ET, Enderwick C. ProteinChip clinical proteomics:
computational challenges and solutions. Biotechniques 2002;
34(Suppl 8):40–1.
18. Tan CS, Ploner A, Quandt A, et al. Finding regions of
significance in SELDI measurements for identifying protein
biomarkers. Bioinformatics 2006;22:1515–23.
19. Malyarenko DI, Cooke WE, Adam BL, et al. Enhancement
of sensitivity and resolution of surface-enhanced laser desorption/ionization time-of-flight mass spectrometric records
for serum peptides using time-series analysis techniques.
Clin Chem 2005;51:65–74.
20. Dijkstra M, Roelofsen H, Vonk RJ, et al. Peak
quantification in surface-enhanced laser desorption/
ionization by using mixture models. Proteomics 2006;6:
5106–16.
21. Jeffries N. Algorithms for alignment of mass spectrometry
proteomic data. Bioinformatics 2005;21:3066–73.
22. Wong JWH, Cagney G, Cartwright HM. SpecAlign: processing and alignment of mass spectra datasets. Bioinformatics 2005;21:2088–90.
23. Pratapa PN, Patz EF, Hartemink AJ. Finding diagnostic
biomarkers in proteomic spectra. Pac Symp Biocomput. New
Jersey: World Scientific, 2006;279–90.
24. Kong X, Reilly CA. Bayesian approach to the alignment of
mass spectra. Bioinformatics 2009;25:3213–20.
25. Feng Y, Ma W, Wang Z, et al. Alignment of protein mass
spectrometry data by integrated Markov chain shifting
method. Statistics and Its Interface 2009;2:329–40.
26. Cairns DA, Thompson D, Perkins DN, et al. Proteomic
profiling using mass spectrometry - does normalising by
total ion current potentially mask some biological differences? Proteomics 2008;8:21–7.
27. Meuleman W, Engwegen J, Gast M, et al. Comparison of
normalisation methods for surface-enhanced laser desorption and ionisation (seldi) time-of-flight (tof) mass spectrometry data. BMC Bioinformatics 2008;9:88.
28. Yasui Y, Pepe M, Thompson ML, et al. A data analytic
strategy for protein biomarker discovery: profiling of
high-dimensional proteomic data for cancer detection.
Biostatistics 2003;4:449–63.
29. Coombes KR, Fritsche J, Clarke C, et al. Quality control
and peak finding for proteomics data collected from nipple
aspirate fluid by surface-enhanced laser desorption and ionization. Clin Chem 2003;49:1615–23.
30. Randolph TW, Mitchell BL, McLerran DF, et al.
Quantifying peptide signal in MALDI-TOF mass spectrometry data. Mol Cell Proteomics 2005;4:1990–9.
31. Lange E, Gropl C, Reinert K, et al. High accuracy peak
picking of proteomics data using wavelet techniques.
Pacific Symposium on Biocomputing, Maui, Hawaii, USA. New Jersey: World Scientific, 2006;243–54.
32. Noy K, Fasulo D. Improved model-based, platform-independent feature extraction for mass spectrometry.
Bioinformatics 2007;23:2528–35.
33. Du P, Angeletti RH. Automatic deconvolution of isotope-resolved mass spectra using variable selection and quantized
peptide mass distribution. Anal Chem 2006;78:3385–92.
34. Pepe MS, Etzioni R, Feng Z, et al. Phases of biomarker
development for early detection of cancer. JNCI 2001;93:
1054–61.
35. McShane LM, Altman DG, Sauerbrei W, et al. REporting
recommendations for tumor MARKer prognostic studies
(REMARK). JNCI 2005;97:1180–4.
36. Benjamini Y, Hochberg Y. Controlling the false discovery
rate: a practical and powerful approach to multiple testing.
JRSS B 1995;57:289–300.
37. Dudoit S, Shaffer J, Boldrick J. Multiple hypothesis testing
in microarray experiments. Stat Sci 2003;18:71–103.
38. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc Natl Acad Sci USA 2003;100:9440–5.
39. Lee M-LT, Whitmore GA. Power and sample size for DNA
microarray studies. Stat Med 2002;21:3543–70.
40. Pawitan Y, Murthy KR, Michiels S, et al. Bias in the estimation of false discovery rate in microarray studies.
Bioinformatics 2005;21(20):3865–72.
41. Truntzer C, Maucort-Boulch D, Roy P. Consequences of regression to the mean in the high-dimensional setting. Submitted manuscript, 2010.
42. Truntzer C, Maucort-Boulch D, Roy P. Comparative optimism in models involving both classical clinical and gene
expression information. BMC Bioinformatics 2008;9:434.
43. Efron B, Hastie T, Johnstone I, et al. Least angle regression.
Ann Stat 2004;32:407–51.
44. Tibshirani R. Regression shrinkage and selection via the
Lasso. J R Stat Soc B 1996;58:267–88.
45. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical
Learning. 2nd edn. New York: Springer, 2009.
46. Friedman J, Popescu B. Gradient directed regularization for
linear regression and classification. Technical Report,
Stanford University, Department of Statistics, 2004.
47. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications
to microarray gene expression data. Bioinformatics 2005;21:
3001–8.
48. Truntzer C, Mercier C, Estève J, et al. Importance of data
structure in comparing two dimension reduction methods
for classification of microarray gene expression data. BMC
Bioinformatics 2007;8:90.
49. Boulesteix AL. PLS dimension reduction for classification
with microarray data. Stat Appl Genet Mol Biol 2004;3:
Article33.
50. Wu B, Abbott T, Fishman D, et al. Comparison of statistical
methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003;19:1636–43.
51. Lê Cao KA, Rossouw D, Robert-Granié C, et al. A sparse
PLS for variable selection when integrating omics data. Stat
Appl Genet Mol Biol 2008;7:Article 35.
52. Reynes C, Sabatier R, Molinari N, et al. A new genetic
algorithm in proteomics: feature selection for SELDI-TOF
data. Comput Stat Data Anal 2008;52:4380–94.
53. Koomen JM, Shih LN, Coombes KR, et al. Plasma protein
profiling for diagnosis of pancreatic cancer reveals the presence of host response proteins. Clin Cancer Res 2005;11:
1110–8.
54. Petricoin EF, Ardekani AM, Hitt BA, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet
2002;359:572–7.
55. Li L, Umbach DM, Terry P, et al. Application of the
GA/KNN method to SELDI proteomics data.
Bioinformatics 2004;20:1638–40.
56. Tong W, Xie Q, Hong H, et al. Using decision forest
to classify prostate cancer samples on the basis of SELDITOF MS data: assessing chance correlation and prediction
confidence. Environ Health Perspect 2004;112:1622–7.
57. Efron B, Tibshirani R. An introduction to the bootstrap.
Monographs on Statistics & Applied Probability. New York:
Chapman & Hall/CRC, 1993.
58. Shao J, Tu D. The Jackknife and Bootstrap. New York:
Springer, 1995.
59. Altman DG, Royston P. What do we mean by validating a
prognostic model? Stat Med 2000;19:453–73.
60. Harrell F. Regression Modelling Strategy with Applications in
Linear Models, Logistic Regression and Survival Model. New
York: Springer, 2002.
61. Hilario M, Kalousis A, Pellegrini C, et al. Processing and
classification of protein mass spectra. Mass Spectrom Rev
2006;25:409–49.
62. Pepe MS, Janes H, Longton G, et al. Limitations of the odds
ratio in gauging the performance of a diagnostic, prognostic,
or screening marker. Am J Epidemiol 2004;159:882–90.
63. Shiu SY, Gatsonis C. The predictive receiver operating
characteristic curve for the joint assessment of the positive
and negative predictive values. Phil Trans R Soc A 2008;366:
2313–33.
64. Jouve T, Maucort-Boulch D, Ducoroy P, et al. Statistical power in mass-spectrometry proteomic studies. Submitted manuscript.
65. Albrethsen J. Reproducibility in protein profiling by MALDI-TOF mass spectrometry. Clin Chem 2007;53:852–8.
66. Villanueva J, Lawlor K, Toledo-Crow R, et al. Automated
serum peptide profiling. Nat Protoc 2006;1:880–91.
67. Callesen AK, Vach W, Jørgensen PE, et al. Reproducibility
of mass spectrometry based protein profiles for diagnosis of
breast cancer across clinical studies: a systematic review.
J Proteome Res 2008;7:1395–402.
68. Hu J, Coombes KR, Morris JS, et al. The importance of
experimental design in proteomic mass spectrometry
experiments: some cautionary tales. Brief Funct Genomic
Proteomic 2005;3:322–31.
69. Oberg A, Vitek O. Statistical design of quantitative mass
spectrometry-based proteomic profiling experiments.
J Proteome Res 2009;8:2144–56.
70. Cairns DA, Barrett JH, Billingham LJ, et al. Sample size
determination in clinical proteomic profiling experiments
using mass spectrometry for class comparison. Proteomics
2009;9:74–86.
71. Dettling M, Bühlmann P. Finding predictive gene groups
from microarray data. J Multivar Anal 2004;90(1):106–31.
72. Gevaert O, Smet FD, Timmerman D, et al. Predicting the
prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics 2006;22:
184–90.
73. Li L. Survival prediction of diffuse large-B-cell lymphoma
based on both clinical and gene expression information.
Bioinformatics 2006;22:466–71.
74. Tibshirani R, Efron B. Pre-validation and inference in microarrays. Stat Appl Genet Mol Biol 2002;1:Article 1.
75. Binder H, Schumacher M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics 2008;9:14.
76. Boulesteix A, Porzelius C, Daumer M. Microarray-based
classification and clinical predictors: on combined classifiers
and additional predictive value. Bioinformatics 2008;24:
1698–706.
77. Le Cao K, Meugnier E, McLachlan GJ. Integrative mixture of experts to combine clinical factors and gene markers. Bioinformatics 2010, to appear.
78. Boulesteix A, Hothorn T. Testing the additional predictive
value of high-dimensional molecular data. BMC
Bioinformatics 2010;11:78.
79. Le Cao K, Gonzalez I, Dejean S. IntegrOmics: an R package to unravel relationships between two omics data sets.
Bioinformatics 2009;25:2855–6.
80. Wang M, You J, Bemis KG, et al. Label-free mass
spectrometry-based protein quantification technologies in
proteomic analysis. Brief Funct Genomics Proteomics 2008;7:
329–39.
81. Wong JWH, Sullivan MJ, Cagney G. Computational
methods for the comparative quantification of proteins
in label-free LCn-MS experiments. Brief Bioinform 2008;9:
156–65.
82. Karpievitch YV, Taverner T, Adkins JN, et al.
Normalization of peak intensities in bottom-up MS-based
proteomics using singular value decomposition.
Bioinformatics 2009;25:2573–80.
83. Yu T, Park Y, Johnson JM, Jones DP. apLCMS: adaptive
processing of high-resolution LC/MS data. Bioinformatics
2009;25:1930–6.
84. Pham TV, Piersma SR, Warmoes M, et al. On the beta-binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics.
Bioinformatics 2010;26:363–9.
85. Wu WW, Wang G, Baek SJ, Shen RF. Comparative study
of three proteomic quantitative methods, DIGE, cICAT,
and iTRAQ, using 2D gel- or LC-MALDI TOF/TOF. J
Proteome Res 2006;5:651–8.