BIOINFORMATICS APPLICATIONS NOTE
Vol. 24 no. 6 2008, pages 882–884
doi:10.1093/bioinformatics/btn012

Data and text mining

OutlierD: an R package for outlier detection using quantile regression on mass spectrometry data

HyungJun Cho1,2, Yang-jin Kim3, Hee Jung Jung4, Sang-Won Lee4 and Jae Won Lee1,*
1Department of Statistics, 2Department of Biostatistics, 3Institute of Statistics and 4Department of Chemistry, Korea University, Seoul, Korea

Received on August 10, 2007; revised and accepted on January 4, 2008
Advance Access publication January 10, 2008
Associate Editor: Limsoon Wong

ABSTRACT
Summary: High-throughput data generated from mass spectrometry experiments must be preprocessed before a proteomics analysis can succeed, and outlier detection is an important preprocessing step. Because the variability commonly observed in high-throughput data is heterogeneous, a naive outlier detection approach may miss many true outliers and instead select many non-outliers. To address this issue, we developed an outlier detection software program that accounts for the heterogeneous variability by utilizing linear, non-linear and non-parametric quantile regression techniques. Our program was developed in the R language, so it can be used interactively and conveniently in the R environment.
Availability: An R package, OutlierD, is available at the Bioconductor project at http://www.bioconductor.org
Contact: [email protected]
Supplementary information: Supplementary Data are available at Bioinformatics online.

1 INTRODUCTION
Mass spectrometry (MS), in conjunction with liquid chromatography (LC), is a powerful analytical technique used to identify and quantify complex proteome samples. In label-free approaches to biomarker discovery, a number of peptide IDs and their corresponding intensity values are generated from multiple LC/MS experiments.
For reliable discovery by quantitative analysis, it is essential to obtain high-quality MS data and to follow up with careful quality control. Prior to comparative quantitative analyses, it is necessary to detect and investigate outliers in the replicated LC/MS data. The detected outliers can then be re-investigated carefully and replaced by more precise values. For outlier detection, the upper and lower fences (Q3 + 1.5 IQR and Q1 − 1.5 IQR) of the differences are often used in statistics, where Q1 is the lower 25% quantile, Q3 is the upper 25% quantile and IQR = Q3 − Q1. However, heterogeneous variability is often observed in high-throughput MS data. By ignoring this heterogeneity and using fences that are constant over all values, we may often miss true outliers and detect false outliers. Therefore, a more elaborate approach that accounts for heterogeneous variability is needed for outlier detection with low false positive and false negative rates. Here we introduce OutlierD, a software program we developed for outlier detection in heterogeneous MS data using quantile regression (Koenker and Bassett, 1978). OutlierD can be used interactively and conveniently in the R environment.

*To whom correspondence should be addressed.

2 DATA
For illustration, we use a real MS dataset obtained from analyzing human sera. To obtain the MS data, human sera were collected from 15 healthy females, and enrichment of N-glycosylated peptides was performed three times for each serum sample using hydrazide resin (Zhang et al., 2003). The resultant glycopeptide-enriched samples (45 in total) were pooled into a single sample right before the LC/MS/MS experiments.
LC/MS/MS experiments were performed on a home-built ultrahigh-pressure dual on-line solid phase extraction/capillary reverse-phase liquid chromatography (DO-SPE/cRPLC) system (Min et al., 2007) coupled to a Fourier-transform ion cyclotron resonance (FT-ICR, LTQ-FT, Thermo, San Jose) mass spectrometer by means of a home-built nanoESI interface. LC/MS/MS experiments were replicated on the pooled sample. The mass spectrometer was operated in data-dependent mode, in which each of the three most abundant ions detected in the precursor mass spectrum was dynamically selected for dissociation. The resultant mass spectrometric data were subjected to mass transformation using ICR2LS (http://ncrr.pnl.gov/). They then underwent unique mass classification, monoisotopic mass filtering, Sequest protein database search and statistical evaluation of the identified peptide sequences by the target-decoy method (Peng et al., 2003). The sum of the individual peak intensities corresponding to the same peptide was assigned to a peptide ID by means of home-built software (see Supplemental Information for a detailed description). All peptide IDs having intensity information in both sets of LC/MS/MS data were compared. This comparison yielded 501 common peptides out of 796 peptides in total. We then manually inspected the MS/MS spectra of these 501 peptides and obtained 437 peptides with high confidence in their identification.

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

[Figure 1: four MA plots (M versus A) with panels Constant, Linear, Non-linear and Non-parametric.]
Fig. 1. Outlier detection using quantile regression.
The dotted lines represent Q1(A) and Q3(A), and the solid lines represent the lower and upper bounds separating outliers from non-outliers.

3 SOFTWARE DESCRIPTION
For description, we use the peptide intensity ratios from the above two LC/MS/MS experiments, replicated under the same biological and experimental conditions. The duplicates for peptide i are denoted by $X_{1i}$ and $X_{2i}$, where i = 1, 2, ..., n and n is the number of peptides. Replicates are theoretically identical, but in practice they vary, and this variability is notably heterogeneous. The heterogeneity of variability can be seen in an MA plot (Fig. 1). In the MA plot, $M_i$ is the difference between the duplicates of peptide i and $A_i$ is their average, i.e. $M_i = \log(X_{1i}/X_{2i}) = \log(X_{1i}) - \log(X_{2i})$ and $A_i = \frac{1}{2}\log(X_{1i} X_{2i}) = (\log(X_{1i}) + \log(X_{2i}))/2$. As shown in the first panel of Figure 1, the constant fences are not enough to detect outliers correctly. Using the constant fences Q3 + 1.5 IQR and Q1 − 1.5 IQR for the differences, we can miss many true outliers at high levels and select many false outliers at low levels.

To account for the heterogeneity of variability, we utilize quantile regression (Koenker and Bassett, 1978) on an MA plot. The q-quantile linear regression with $\{(A_i, M_i),\ i = 1, \ldots, n\}$ finds the parameters minimizing

$$\sum_{\{i:\, M_i \ge \theta_i\}} q\,|M_i - \theta_i| \;+\; \sum_{\{i:\, M_i < \theta_i\}} (1 - q)\,|M_i - \theta_i|,$$

where $0 < q < 1$ and $\theta_i = \beta_0 + \beta_1 A_i$. By applying the regression, we compute the 0.25 and 0.75 quantile estimates, Q1(A) and Q3(A), of the differences, M, depending on the levels, A. Then we construct the lower and upper fences: Q1(A) − 1.5 IQR(A) and Q3(A) + 1.5 IQR(A), where IQR(A) = Q3(A) − Q1(A). To obtain quantile estimates that depend on the levels more flexibly, we can also utilize non-linear and non-parametric quantile regression approaches (Koenker, 2005). Thus, OutlierD provides intensity-level adaptive fences, which are built from four quantile regression approaches, including constant quantile regression. To detect outliers in high-throughput MS data using OutlierD, we run the following commands:

    fit1 <- OutlierD(x1, x2, k = 1.5, method = "constant")
    fit2 <- OutlierD(x1, x2, k = 1.5, method = "linear")
    fit3 <- OutlierD(x1, x2, k = 1.5, method = "nonlin")
    fit4 <- OutlierD(x1, x2, k = 1.5, method = "nonpar")

In the commands, the arguments x1 and x2 are vectors of duplicate data, and k is a parameter determining whether outliers are selected conservatively or liberally. With a very large k, only very extreme observations are selected, so true outliers may be missed. In contrast, with a very small k, many observations are selected, many of which may not be outliers. k = 1.5 is a commonly used value. One of the constant, linear, non-linear and non-parametric quantile regression techniques can be chosen by the argument method = "" in the command.

By imposing the quantile regression fence lines on an MA plot (Fig. 1), we can classify data points into outliers and non-outliers. By applying OutlierD to the 437 peptide intensity ratios with a high level of confidence in the identification, we detected 47 ratios as outliers by at least one of the four methods. The constant, linear, non-linear and non-parametric methods resulted in 32, 21, 21 and 18 outliers, respectively. The intensities of these 47 peptides were manually inspected using the original LC/MS/MS data. This manual inspection confirmed that 10 of them had incorrect intensity ratios due to computational errors. The four methods identified 8, 7, 6 and 5 of these 10 true outliers, respectively; thus, the numbers of false negatives were 2, 3, 4 and 5, whereas the numbers of false positives were 24, 14, 15 and 13, respectively.
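The bookkeeping behind these counts can be reproduced with a few lines of R. This is only an arithmetic check on the numbers reported in the text; the variable names are illustrative and are not part of OutlierD.

```r
# Counts reported in the text, in the order:
# constant, linear, non-linear, non-parametric.
n_total  <- 437                      # peptide ratios examined
n_true   <- 10                       # outliers confirmed by manual inspection
flagged  <- c(constant = 32, linear = 21, nonlin = 21, nonpar = 18)
true_pos <- c(8, 7, 6, 5)            # flagged ratios among the confirmed outliers

false_neg <- n_true - true_pos                 # confirmed outliers not flagged
false_pos <- flagged - true_pos                # flagged but not confirmed
true_neg  <- n_total - flagged - false_neg     # neither flagged nor confirmed

ppv <- true_pos / flagged                      # positive predictive value
npv <- true_neg / (n_total - flagged)          # negative predictive value

print(rbind(false_neg, false_pos, true_neg))
print(round(100 * rbind(ppv, npv), 1))         # percentages
```

Under these counts, the constant method flags the most ratios and also has the most false positives, which is why its positive predictive value is the lowest of the four.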
The positive predictive values were 8/32 (= 25.0%), 7/21 (= 33.3%), 6/21 (= 28.6%) and 5/18 (= 27.8%), and the negative predictive values were 403/405 (= 99.5%), 413/416 (= 99.3%), 412/416 (= 99.0%) and 414/419 (= 98.8%), respectively. However, our limited manual inspection may have missed true experimental outliers, so these numbers may change. The details can be found in the Supplementary Data.

To investigate the performance of the methods by simulation, we first removed all the outliers detected by each of the four methods from the raw data above, which contained all the peptides before manual inspection, and assumed that the remaining data were non-outliers. Then we partitioned the A values into 50 intervals by quantiles and computed the standard deviation of the M values in each interval. The mean of the A values and the standard deviation of the M values in interval l are denoted by $\mu_l$ and $\sigma_l$, respectively. By letting $X_{1l} = \mu_l + Z_l$ and $X_{2l} = \mu_l - Z_l$, with $Z_l \sim (2B - 1)\,U\,\sigma_l$, $l = 1, \ldots, 50$, where $B \sim \mathrm{Bernoulli}(1/2)$ and $U \sim \mathrm{Uniform}(3, 5)$, we generated artificial outliers, which were combined with the non-outliers to construct datasets. We applied each method to each dataset and computed the sensitivity and specificity. This procedure was repeated 100 times to obtain the averaged sensitivities and specificities in Table 1.

Table 1. Simulation results, given as (sensitivity, specificity)

                    Detection method
Simulated by        Constant        Linear          Non-linear      Non-parametric
Constant            (0.84, 0.99)    (0.89, 0.99)    (0.87, 0.99)    (0.88, 0.99)
Linear              (0.80, 0.97)    (0.93, 1.00)    (0.91, 0.99)    (0.92, 0.99)
Non-linear          (0.78, 0.97)    (0.91, 0.99)    (0.90, 1.00)    (0.92, 0.99)
Non-parametric      (0.79, 0.97)    (0.90, 0.99)    (0.87, 0.99)    (0.92, 1.00)

It was found that the constant method had the lowest sensitivity because it missed many true outliers at high intensity levels. The results of the other methods were similar. As the linear method is the simplest, it can be the best choice.
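To make the comparison between constant and linear fences concrete, the following R sketch builds both on simulated data and reports sensitivity and specificity. It is a minimal illustration under stated assumptions, not the OutlierD implementation: the data are synthetic, the spread of M is assumed to shrink linearly with A, all variable and function names are hypothetical, and the linear quartiles are fitted with `rq` from the quantreg package (skipped if that package is not installed). The text does not state how OutlierD computes its fences internally.

```r
# Illustrative sketch on simulated data; names and settings are assumptions.
set.seed(1)
n     <- 400
k     <- 1.5
A     <- runif(n, 18, 32)            # average log-intensity levels (assumed range)
sigma <- 1.6 - 0.1 * (A - 18)        # spread of M shrinks linearly as A grows
M     <- rnorm(n, sd = sigma)        # log-ratios of clean duplicates

# Turn 20 random points into artificial outliers with Z = (2B - 1) * U * sigma;
# X1 = A + Z and X2 = A - Z give a difference of M = 2 * Z.
is_out <- seq_len(n) %in% sample(n, 20)
B <- rbinom(n, 1, 0.5)
U <- runif(n, 3, 5)
M[is_out] <- (2 * (2 * B - 1) * U * sigma)[is_out]

# Flag points outside [q1 - k * IQR, q3 + k * IQR]; q1 and q3 may vary with A.
flag_outside <- function(M, q1, q3, k) {
  iqr <- q3 - q1
  M < q1 - k * iqr | M > q3 + k * iqr
}
sensitivity <- function(flag) mean(flag[is_out])     # share of outliers flagged
specificity <- function(flag) mean(!flag[!is_out])   # share of clean points kept

# Constant fences: one pair of quartiles for all intensity levels.
q <- unname(quantile(M, c(0.25, 0.75)))
flag_const <- flag_outside(M, q[1], q[2], k)
cat("constant:", sensitivity(flag_const), specificity(flag_const), "\n")

# Linear fences: each quartile modelled as intercept + slope * A.
if (requireNamespace("quantreg", quietly = TRUE)) {
  b1 <- coef(quantreg::rq(M ~ A, tau = 0.25))
  b3 <- coef(quantreg::rq(M ~ A, tau = 0.75))
  flag_lin <- flag_outside(M, b1[1] + b1[2] * A, b3[1] + b3[2] * A, k)
  cat("linear:  ", sensitivity(flag_lin), specificity(flag_lin), "\n")
}
```

The helper `flag_outside` accepts either scalar quartiles or quartiles evaluated at each A, so the same fence rule serves both methods and only the quartile estimates differ.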
If the magnitude of heterogeneous variability decreases non-linearly or fluctuates, the non-linear or non-parametric method might give better results.

ACKNOWLEDGEMENTS
This work was supported by the Korea Research Foundation Grant (R14-2003-002-01002-0) for J.W. Lee. It was also supported by the Korea Research Foundation Grant (KRF-2007-331-C00065) for H. Cho and by the 21C Frontier Project for Functional Proteomics (FPR05A1-400) for S.W. Lee.

Conflict of Interest: none declared.

REFERENCES
Koenker,R. and Bassett,G. (1978) Regression quantiles. Econometrica, 46, 33–50.
Koenker,R. (2005) Quantile Regression. Cambridge University Press, New York.
Min,H.-K. et al. (2007) Ultrahigh-pressure dual online solid phase extraction/capillary reverse-phase liquid chromatography/tandem mass spectrometry (DO-SPE/cRPLC/MS/MS): a versatile separation platform for high-throughput and highly sensitive proteomic analyses. Electrophoresis, 28, 1012–1021.
Peng,J. et al. (2003) Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res., 2, 43–50.
Zhang,H. et al. (2003) Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nat. Biotechnol., 21, 660–666.