BIOINFORMATICS APPLICATIONS NOTE
Vol. 24 no. 6 2008, pages 882–884
doi:10.1093/bioinformatics/btn012
Data and text mining
OutlierD: an R package for outlier detection using quantile
regression on mass spectrometry data
HyungJun Cho1,2, Yang-jin Kim3, Hee Jung Jung4, Sang-Won Lee4 and Jae Won Lee1,*
1Department of Statistics, 2Department of Biostatistics, 3Institute of Statistics and 4Department of Chemistry,
Korea University, Seoul, Korea
Received on August 10, 2007; revised and accepted on January 4, 2008
Advance Access publication January 10, 2008
Associate Editor: Limsoon Wong
ABSTRACT
Summary: It is important to preprocess high-throughput data
generated from mass spectrometry experiments in order to obtain
a successful proteomics analysis. Outlier detection is an important
preprocessing step. A naive outlier detection approach may miss
many true outliers and instead select many non-outliers because
of the heterogeneity of the variability observed commonly in high-throughput data. Because of this issue, we developed an outlier
detection software program accounting for the heterogeneous
variability by utilizing linear, non-linear and non-parametric quantile
regression techniques. Our program was developed using the R
computer language. As a consequence, it can be used interactively
and conveniently in the R environment.
Availability: An R package, OutlierD, is available at the
Bioconductor project at http://www.bioconductor.org
Contact: [email protected]
Supplementary information: Supplementary Data are available at
Bioinformatics online.
1 INTRODUCTION
Mass spectrometry (MS), in conjunction with liquid chromatography (LC), is a powerful analytical technique used to
identify and quantify complex proteome samples. In label-free approaches for biomarker discovery efforts, a number of
peptide IDs and their corresponding intensity values are
generated from multiple LC/MS experiments. For reliable
discovery by quantitative analysis, it is essential to obtain MS
data with high quality, followed by careful quality control.
Prior to comparative quantitative analyses, it is necessary to
detect and investigate outliers contained in the replicated
LC/MS data. The detected outliers can then be re-examined
carefully and replaced by more precise values.
*To whom correspondence should be addressed.
For outlier detection, the upper and lower fences (Q3 + 1.5 IQR
and Q1 - 1.5 IQR) of the differences are often used in statistics,
where Q1 is the lower 25% quantile, Q3 is the upper 25%
quantile (the 75th percentile) and IQR = Q3 - Q1. However,
heterogeneous variability is often observed in high-throughput data from MS.
By ignoring heterogeneous variability and using fences
that are constant over all values, we may often miss true
outliers and detect false outliers. Therefore, a more elaborate
approach accounting for heterogeneous variability is needed for
outlier detection with low false positive and false negative rates.
Here, we introduce our software program (called
OutlierD) for outlier detection in heterogeneous MS data
using quantile regression (Koenker and Bassett, 1978).
OutlierD can be used interactively and conveniently in an R
environment.
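The constant fences described above can be illustrated with a minimal base R sketch. The helper name `tukey_fences` and the toy vector `m` are hypothetical and are not part of OutlierD; the quantiles use R's default definition.

```r
# Constant fences Q1 - k*IQR and Q3 + k*IQR on a vector of differences.
# `tukey_fences` is a hypothetical helper name, not an OutlierD function.
tukey_fences <- function(m, k = 1.5) {
  q <- quantile(m, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Toy differences: nine ordinary values and one extreme value.
m <- c(-0.4, -0.3, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 3.0)
f <- tukey_fences(m)

# A value is flagged when it falls outside the two fences.
is_outlier <- m < f[["lower"]] | m > f[["upper"]]
print(f)
print(which(is_outlier))
```

For this toy vector the fences are -0.75 and 1.05, so only the value 3.0 is flagged. Because the fences do not depend on the intensity level, the same cut-offs apply to every observation, which is the limitation the quantile regression approach is meant to address.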
2 DATA
For an illustration, we use a real MS dataset obtained from
analyzing human sera. To obtain the MS data, human sera
were collected from 15 healthy females and enrichment of
N-glycosylated peptides was performed three times for each
serum sample by using hydrazide resin (Zhang et al., 2003). The
resultant glycopeptide-enriched samples (a total of 45 samples)
were pooled into a single sample right before the LC/MS/MS
experiments. The LC/MS/MS experiments were performed on a
home-built ultrahigh-pressure dual on-line solid phase extraction/capillary
reverse-phase liquid chromatography (DO-SPE/cRPLC) system
(Min et al., 2007) that was coupled to a Fourier-transform ion
cyclotron resonance (FT-ICR, LTQ-FT, Thermo, San Jose) mass
spectrometer by means of a home-built nanoESI interface.
The LC/MS/MS experiment was run twice on the pooled
sample. The mass spectrometer was operated in data-dependent
mode, where each of the three most abundant ions detected in the
precursor mass spectrum was dynamically selected for dissociation.
The resultant mass spectrometric data were subjected to
mass transformation by using ICR2LS (http://ncrr.pnl.gov/).
They then underwent unique mass classification,
monoisotopic mass filtering, Sequest protein database search
and statistical evaluation of the identified peptide sequences
by the target-decoy method (Peng et al., 2003). The sum of the
individual peak intensities corresponding to the same
peptide was assigned to a peptide ID by means of home-built
software (see Supplementary Data for a detailed description).
All peptide IDs having intensity information from the
two sets of LC/MS/MS data were compared. This comparison
yielded 501 common peptides out of 796 peptides in total. We then
manually inspected the MS/MS spectra of these 501 peptides
and obtained 437 peptides with high confidence in their
identification.
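The two bookkeeping steps in this paragraph (summing peak intensities per peptide ID, then keeping the peptides seen in both runs) can be sketched in base R. This is not the home-built software mentioned above; the peptide labels and intensities are invented for illustration only.

```r
# Hypothetical peak tables for two replicate runs (labels and values invented).
run1 <- data.frame(peptide = c("PEPA", "PEPA", "PEPB", "PEPC"),
                   intensity = c(1.0e6, 2.0e6, 5.0e7, 3.0e6))
run2 <- data.frame(peptide = c("PEPA", "PEPB", "PEPB", "PEPD"),
                   intensity = c(2.5e6, 2.0e7, 2.5e7, 9.0e5))

# Sum the peak intensities that belong to the same peptide ID.
s1 <- aggregate(intensity ~ peptide, data = run1, FUN = sum)
s2 <- aggregate(intensity ~ peptide, data = run2, FUN = sum)

# Keep only peptides with intensity information in both runs.
common <- merge(s1, s2, by = "peptide", suffixes = c(".1", ".2"))
print(common)
```

The columns `intensity.1` and `intensity.2` of `common` then play the role of the duplicate vectors passed to an outlier detection step.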
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
[Figure 1 image: four MA-plot panels titled Constant, Linear, Non-linear and Non-parametric; horizontal axis A (about 20 to 30), vertical axis M (about -4 to 4).]
Fig. 1. Outlier detection using quantile regression. The dotted lines
represent Q1(A) and Q3(A) and the solid lines represent lower and upper
bounds classifying outliers and non-outliers.
3 SOFTWARE DESCRIPTION
For description, we use the peptide intensity ratios from the
two LC/MS/MS experiments above, replicated under the same
biological and experimental conditions. The duplicates for
peptide i are denoted by $X_{1i}$ and $X_{2i}$, where $i = 1, 2, \ldots, n$ and
n is the number of peptides. Replicates are theoretically
identical, but in practice they show variability, which is particularly
heterogeneous. The heterogeneity of variability can be seen
in an MA plot (Fig. 1). In the MA plot, $M_i$ is the difference
between the duplicates of peptide i and $A_i$ is their average
on the log scale, i.e. $M_i = \log(X_{1i}/X_{2i}) = \log(X_{1i}) - \log(X_{2i})$
and $A_i = (1/2)\log(X_{1i}X_{2i}) = (\log(X_{1i}) + \log(X_{2i}))/2$.
As shown in the first panel of Figure 1, constant fences are not
enough to detect outliers correctly. Using the constant fences
Q3 + 1.5 IQR and Q1 - 1.5 IQR for the differences, we can miss
many true outliers at high levels and select many false outliers
at low levels.
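The M and A values defined above can be computed directly from the two duplicate vectors. The sketch below uses base 2 logarithms, which is an assumption (the text only writes "log"), and the intensities are hypothetical.

```r
# M and A values from duplicate intensities.
# Assumption: base 2 logarithm; the text does not state the base.
x1 <- c(1.2e6, 3.4e7, 5.0e8, 8.1e6)   # hypothetical run-1 intensities
x2 <- c(1.0e6, 3.9e7, 4.6e8, 2.0e6)   # hypothetical run-2 intensities

M <- log2(x1) - log2(x2)          # difference of the log intensities
A <- (log2(x1) + log2(x2)) / 2    # average of the log intensities

# The two equivalent forms given in the text agree numerically.
stopifnot(isTRUE(all.equal(M, log2(x1 / x2))))
stopifnot(isTRUE(all.equal(A, 0.5 * log2(x1 * x2))))

print(round(cbind(A, M), 3))
```

Plotting M against A gives the MA plot; a spread of M that changes along A is the heterogeneity of variability discussed in the text.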
To account for the heterogeneity of variability, we utilize
quantile regression (Koenker and Bassett, 1978) on an MA plot.
The q-quantile linear regression with $\{(A_i, M_i), i = 1, \ldots, n\}$
finds the parameters minimizing

$$\sum_{\{i:\, M_i \ge \theta_i\}} q\,|M_i - \theta_i| \;+ \sum_{\{i:\, M_i < \theta_i\}} (1 - q)\,|M_i - \theta_i|,$$

where $0 < q < 1$ and $\theta_i = \beta_0 + \beta_1 A_i$. By applying the regression,
we compute the 0.25 and 0.75 quantile estimates, Q1(A) and
Q3(A), of the differences, M, depending on the levels, A. Then
we construct the lower and upper fences: Q1(A) - 1.5 IQR(A)
and Q3(A) + 1.5 IQR(A), where IQR(A) = Q3(A) - Q1(A). To
obtain the quantile estimates depending on the levels more
flexibly, we can also utilize non-linear and non-parametric
quantile regression approaches (Koenker, 2005). Thus, our
developed software OutlierD provides intensity-level adaptive
fences, which are built from four quantile regression
approaches, including constant quantile regression.
To detect outliers in high-throughput MS data using
OutlierD, we run the following commands:
    fit1 <- OutlierD(x1, x2, k = 1.5, method = "constant")
    fit2 <- OutlierD(x1, x2, k = 1.5, method = "linear")
    fit3 <- OutlierD(x1, x2, k = 1.5, method = "nonlin")
    fit4 <- OutlierD(x1, x2, k = 1.5, method = "nonpar")
In the commands, the arguments x1 and x2 are vectors of
duplicate data and k is a parameter determining whether we
select outliers conservatively or liberally. With a very large
k value, only very extreme observations are selected, which may
result in missing true outliers. In contrast, with a very small k
value, a large number of observations are selected, many of which
may not be outliers. k = 1.5 is a commonly used value. One of the
constant, linear, non-linear and non-parametric quantile
regression techniques can be chosen by the argument method
("constant", "linear", "nonlin" or "nonpar")
in the command. Imposing the quantile regression fence lines
on an MA plot (Fig. 1), we can classify data points into outliers
and non-outliers.
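To make the linear case concrete, the following base R sketch minimizes the check loss with a general-purpose optimizer and builds level-dependent fences from the fitted 0.25 and 0.75 quantile lines. It is only an illustration under stated assumptions and is not OutlierD's implementation: the function names are hypothetical, the data are simulated, and `optim` is a crude stand-in for a dedicated quantile regression solver such as `rq()` in the quantreg package.

```r
# Check loss for the q-quantile line theta = b0 + b1 * A.
check_loss <- function(beta, A, M, q) {
  r <- M - (beta[1] + beta[2] * A)
  sum(abs(r) * ifelse(r >= 0, q, 1 - q))
}

# Fit one quantile line by direct numerical minimisation (illustrative only).
fit_quantile_line <- function(A, M, q) {
  start <- c(quantile(M, q, names = FALSE), 0)   # start from the constant quantile
  optim(start, function(b) check_loss(b, A, M, q),
        control = list(maxit = 5000, reltol = 1e-12))$par
}

# Level-dependent fences Q1(A) - k*IQR(A) and Q3(A) + k*IQR(A).
linear_fences <- function(A, M, k = 1.5) {
  Ac <- A - mean(A)   # centring only helps the optimiser; the lines are equivalent
  b1 <- fit_quantile_line(Ac, M, 0.25)
  b3 <- fit_quantile_line(Ac, M, 0.75)
  q1 <- b1[1] + b1[2] * Ac
  q3 <- b3[1] + b3[2] * Ac
  iqr <- q3 - q1
  data.frame(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

set.seed(1)
A <- runif(400, 20, 30)
M <- rnorm(400, sd = 2 - 0.06 * A)   # toy data: spread shrinks as A grows

f <- linear_fences(A, M)
outlier <- M < f$lower | M > f$upper
cat("flagged:", sum(outlier), "of", length(M), "\n")
```

Increasing `k` widens both fences at every level, so fewer points are flagged; decreasing it has the opposite effect, which matches the role of k described above.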
By applying OutlierD to the 437 peptide intensity ratios with
a high level of confidence in the identification, we detected
47 ratios as outliers by at least one of the four methods. The
constant, linear, non-linear and non-parametric methods
resulted in 32, 21, 21 and 18 outliers, respectively. The
intensities of these 47 peptides were manually inspected using
the original LC/MS/MS data. This manual inspection confirmed
that 10 of them had incorrect intensity ratios due to
computational errors. The four methods identified 8, 7, 6
and 5 of these 10 true outliers; thus, the numbers of false
negatives were 2, 3, 4 and 5, whereas the numbers of false
positives were 24, 14, 15 and 13, respectively. The positive
predictive values were 8/32 (= 25.0%), 7/21 (= 33.3%),
6/21 (= 28.6%) and 5/18 (= 27.8%) and the negative predictive
values were 403/405 (= 99.5%), 413/416 (= 99.3%),
412/416 (= 99.0%) and 414/419 (= 98.8%), respectively.
However, our limited manual inspection may have missed true
experimental outliers, so these numbers may change. The details
can be found in the Supplementary Data.
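The predictive values quoted above follow from the reported counts alone. The short base R calculation below reproduces them; the variable names are chosen here for illustration.

```r
# Counts reported in the text for the 437 inspected ratios.
n_total  <- 437
n_true   <- 10                                  # confirmed by manual inspection
detected <- c(constant = 32, linear = 21, nonlin = 21, nonpar = 18)
hits     <- c(constant = 8,  linear = 7,  nonlin = 6,  nonpar = 5)

false_pos <- detected - hits                    # flagged but not confirmed
false_neg <- n_true - hits                      # confirmed but not flagged
true_neg  <- n_total - detected - false_neg     # neither flagged nor confirmed

ppv <- hits / detected                          # positive predictive value
npv <- true_neg / (n_total - detected)          # negative predictive value
print(round(100 * rbind(ppv, npv), 1))
```

For example, the constant method leaves 437 - 32 = 405 unflagged ratios, of which 2 are confirmed outliers, giving 403/405.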
To investigate the performance of the methods by simulation,
we first removed all the outliers detected by each of the four
methods from the above raw data with all the peptides before
manual inspection and assumed that the rest of the data were
non-outliers. Then we partitioned the A values into 50 intervals by
quantiles and computed the standard deviation of the M values in each
interval. The mean of the A values and the standard deviation of the M
values in interval l are denoted by $\mu_l$ and $\sigma_l$, respectively. By
letting $X_{1l} = \mu_l + Z_l$ and $X_{2l} = \mu_l - Z_l$, with $Z_l \sim (2B - 1)\,U\,\sigma_l$,
$l = 1, \ldots, 50$, where $B \sim \mathrm{Bernoulli}(1/2)$ and $U \sim \mathrm{Uniform}(3, 5)$,
we generated artificial outliers, which were combined with the
non-outliers to construct datasets. We applied each method to each
dataset and computed the sensitivity and specificity. This
procedure was repeated 100 times to obtain the averaged
sensitivities and specificities in Table 1. It was found that the
constant method had the lowest sensitivity because it missed
many true outliers at high intensity levels. The results of the other
methods were similar. As the linear method is the simplest, it
can be the best choice. If the magnitude of heterogeneous variability
decreases non-linearly or fluctuates, the non-linear or
non-parametric method might give better results.

Table 1. Simulation results: (sensitivity, specificity)

Simulated by      Detection method
                  Constant      Linear        Non-linear    Non-parametric
Constant          (0.84, 0.99)  (0.89, 0.99)  (0.87, 0.99)  (0.88, 0.99)
Linear            (0.80, 0.97)  (0.93, 1.00)  (0.91, 0.99)  (0.92, 0.99)
Non-linear        (0.78, 0.97)  (0.91, 0.99)  (0.90, 1.00)  (0.92, 0.99)
Non-parametric    (0.79, 0.97)  (0.90, 0.99)  (0.87, 0.99)  (0.92, 1.00)
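One replicate of the simulation procedure described above might look like the following base R sketch. Several points are assumptions made for illustration: the non-outlier data are simulated rather than taken from the cleaned real data, the duplicates are read as log-scale values (so that M is their difference), one artificial outlier is generated per interval, and constant fences stand in for whichever detection method is being evaluated.

```r
set.seed(42)
# Stand-in non-outlier data; the paper uses the cleaned real data instead.
A <- runif(1000, 20, 30)
M <- rnorm(1000, sd = 2 - 0.06 * A)

# 50 intervals of A by quantiles, then the mean of A and SD of M in each.
breaks <- quantile(A, probs = seq(0, 1, length.out = 51))
bin <- cut(A, breaks = breaks, include.lowest = TRUE, labels = FALSE)
mu <- as.numeric(tapply(A, bin, mean))
sigma <- as.numeric(tapply(M, bin, sd))

# One artificial outlier per interval: Z = (2B - 1) * U * sigma.
B <- rbinom(50, 1, 0.5)
U <- runif(50, 3, 5)
Z <- (2 * B - 1) * U * sigma
x1 <- mu + Z                     # log-scale duplicates (assumed reading)
x2 <- mu - Z
A_out <- (x1 + x2) / 2           # equals mu
M_out <- x1 - x2                 # equals 2 * Z

A_all <- c(A, A_out)
M_all <- c(M, M_out)
truth <- rep(c(FALSE, TRUE), c(length(M), 50))

# Detector: constant fences (any of the four methods could be plugged in).
q <- quantile(M_all, c(0.25, 0.75), names = FALSE)
flag <- M_all < q[1] - 1.5 * diff(q) | M_all > q[2] + 1.5 * diff(q)

sensitivity <- sum(flag & truth) / sum(truth)
specificity <- sum(!flag & !truth) / sum(!truth)
cat(sprintf("sensitivity %.2f, specificity %.2f\n", sensitivity, specificity))
```

Repeating this with fresh random draws and averaging the two rates over the repetitions corresponds to one cell of Table 1.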
ACKNOWLEDGEMENTS
This work was supported by the Korea Research Foundation
Grant (R14-2003-002-01002-0) for J.W. Lee. It was also
supported by the Korea Research Foundation Grant
(KRF-2007-331-C00065) for H. Cho and by 21C Frontier
Project for Functional Proteomics (FPR05A1-400) for
S.W. Lee.
Conflict of Interest: none declared.
REFERENCES
Koenker,R. and Bassett,G. (1978) Regression quantiles. Econometrica, 46, 33–50.
Koenker,R. (2005) Quantile Regression. Cambridge University Press, New York.
Min,H.-K. et al. (2007) Ultrahigh-pressure dual online solid phase extraction/
capillary reverse-phase liquid chromatography/tandem mass spectrometry
(DO-SPE/cRPLC/MS/MS): a versatile separation platform for high-throughput and highly sensitive proteomic analyses. Electrophoresis, 28,
1012–1021.
Peng,J. et al. (2003) Evaluation of multidimensional chromatography coupled
with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein
analysis: the yeast proteome. J. Proteome Res., 2, 43–50.
Zhang,H. et al. (2003) Identification and quantification of N-linked glycoproteins
using hydrazide chemistry, stable isotope labeling and mass spectrometry.
Nat. Biotechnol., 21, 660–666.