Bioinformatics

Chapter 1

WAPDAP-Software that Automates Proteomics Data Analysis Pipeline

Adnan Ahmed, Masoud Zabet-Moghaddam and Chiquito Crasto

Center for Biotechnology and Genomics, Texas Tech University, USA

Corresponding Author: Chiquito Crasto, Center for Biotechnology and Genomics, #115 Experimental Sciences Building, Texas Tech University, Lubbock, TX 79409, USA, Tel: 806-8345448; Email: [email protected]

First Published March 20, 2017

Copyright: © 2017 Adnan Ahmed, Masoud Zabet-Moghaddam and Chiquito Crasto. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source.

www.avidscience.com

Keywords

WAPDAP; Perseus; MaxQuant; Python Software Wrapper; Proteomics Analysis

Abstract

Post-experiment data analysis is critical to mass spectrometry-based proteomics, especially in assessing the large amounts of data that are produced. “BIG DATA” analyses are multi-step, time-consuming processes; they often involve the sequential use of several suites of software, with manual intervention at each step of the analysis. Proteomics analyses are also iterative and repetitive. Some steps require several hours to complete, burdening users with time constraints. To address this, we have designed software called WAPDAP (A Wrapper for an Automated Proteomics Data Analysis Pipeline). Developed in the Python scripting language, WAPDAP sequentially and automatically executes two key programs for the analysis of proteomics mass spectrometry data, MaxQuant and Perseus. WAPDAP is accessible via the World Wide Web. A web interface allows the user to input MaxQuant and Perseus parameters. These parameters are then incorporated into the software. Output results are also available online through a browser.
A demonstration of WAPDAP can be found on YouTube at: https://www.youtube.com/watch?v=ikvCyU8mswg

Introduction

Large amounts of data result from mass spectrometry experiments, which typify BIG DATA in the domain of proteomics. Commercial [1, 2] and free [3] software suites have been developed to process this data. These software suites (often several programs working in sequence) address different issues in proteomics data-processing. Processing mass spectrometry data involves the following steps: a) identifying and quantifying peptides and proteins; b) statistically analysing the results from (a); and c) sorting and presenting the data in a human-readable format. Traditionally, manual intervention is necessary at each step. The results of one step are often the input for the next step. Manually going through these steps for several (often hundreds of) files is repetitive and time consuming.

In this chapter, we discuss the creation of a “wrapper” software, WAPDAP. WAPDAP automates and streamlines the proteomics analysis of the results of a mass spectrometry experiment for two popular and freely-available software systems: MaxQuant, which identifies and quantifies proteins from mass spectrometry data; and Perseus, which performs statistical analyses on identified and quantified proteins. WAPDAP leverages the World Wide Web for its execution and provides two major advantages: 1) it can be executed remotely by a user who only has to input a few parameters related to his or her experiment via a web-based input form; and 2) it mitigates the need for users to have any expertise in the use of MaxQuant or Perseus, because these programs are executed on the server side.

Background

Interaction proteomics is an important tool that enhances our understanding of the composition, topology and structure of specific complex macromolecular assemblies.
The workflow for interaction proteomics involves an assay of digested peptides of a “bait” protein and its binding partner, followed by mass-spectrometric analysis. A quantitative comparison of this test assay with a control assay allows a differential view of the results [4]. There are two approaches to quantitative comparison: 1) stable isotope labeled quantification, where the labeled isotope can be introduced into amino acids as an internal or external standard [5]; and 2) Label Free Quantification (LFQ), a method that was designed to be an improvement over stable isotope quantification [6], which requires extra preparation steps and is expensive. Also, not every material (clinical samples, for example) can be metabolically labeled [6]. Label free quantification can be done by spectral counting or by comparing the direct mass spectrometric signal intensities for any given peptide [5]. Assessing signal intensities has been identified as slightly more accurate than spectral counting [7].

MaxQuant and Perseus

Of the commercial and academic software currently in use, almost all can assess and process the large amounts of data that arise from the label free quantification method. Of these, MaxQuant is a free, widely-used, open source software solution for fast analysis of data from high resolution mass spectrometers [7]. MaxQuant’s performance is comparable with that of commercial software. MaxQuant focuses on intensity-based quantitation rather than spectral counting [3].

To identify proteins, MaxQuant uses its own probabilistic search engine, ‘Andromeda’ [8]. First, the theoretical fragments are calculated from the given FASTA-formatted protein-sequence file, based on the digestion rule, the combinatorics of modifications, and the fragmentation method (CID/HCD or ETD) [8]. Then, the MS/MS spectra are processed by “centroiding”, “deisotoping”, and transferring to charge state 1 [8].
Next, the processed and measured spectra are filtered by dividing the m/z range into windows of 100 Thomson (Th) and taking only the top q peaks (default q = 12) from each window [8]. These peaks are then matched with the theoretical peaks and a score is calculated from the probability of a chance match between the experimental and theoretical peaks [8]. The score is calculated iteratively for every value of q, from the top q = 12 peaks down to the top q = 4 peaks (q is the number of peaks allowed in each 100 Th window). The best score is then used in the next step. This score is based on the P-value for the probability of having k or more matches among n theoretical masses, where the null hypothesis is that there is no similarity between the raw spectrum and the theoretical masses [8]. When all the peptides have been identified and assigned a score, a one percent FDR (False Discovery Rate) cut-off is applied to the identified peptides by using the Posterior Error Probability (PEP) [8]. Another one percent FDR is applied at the protein level [8]. Because the one percent FDR is applied twice in sequence, the ultimate FDR applied to peptides becomes less than one percent [8].

Peak detection in MaxQuant is done by fitting a Gaussian peak shape to the three central raw data points in each MS scan; the peaks are then assembled into three dimensional (3D) peak hills over the m/z-retention time plane [3]. By bootstrap replication, an individual mass precision is calculated for each 3D peak [3]. The peak intensity and mass together are termed a ‘feature’, which is the basis of data analysis in MaxQuant [3]. The intensities for Label Free Quantification are determined as the intensity maximum/retention time profile [6]. MaxQuant offers the option of increasing the number of peptides used for quantification by transferring peptide identifications to unsequenced or unidentified peptides, matching their masses and retention times (the ‘match-between-runs’ feature in MaxQuant) [6].
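The scoring idea described above can be sketched in a few lines of Python. This is only an illustration, not MaxQuant's code: it assumes the binomial form of the score reported for Andromeda [8], in which keeping q peaks per 100 Th window gives a chance-match probability of q/100 per theoretical mass, and the score is -10 log10 of the probability of at least k matches among n theoretical masses. The function names and the example numbers are hypothetical.

```python
from math import comb, log10


def chance_match_pvalue(n, k, q):
    """Probability of k or more chance matches among n theoretical masses.

    Assumes each theoretical mass matches by chance with probability q/100,
    where q is the number of peaks kept per 100 Th window.
    """
    p = q / 100.0
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def andromeda_like_score(n, k, q):
    """Score as -10 * log10(P-value); a higher score means a less likely chance match."""
    return -10.0 * log10(chance_match_pvalue(n, k, q))


def best_score(n, matches_by_q):
    """Return the highest score over the tested values of q.

    matches_by_q maps q (peaks kept per window) to k (matches observed
    when only the top q peaks per window are retained).
    """
    return max(andromeda_like_score(n, k, q) for q, k in matches_by_q.items())
```

For example, `best_score(20, {12: 8, 8: 7, 4: 5})` scores a hypothetical spectrum with 20 theoretical masses at three values of q and keeps the best result, mirroring the iteration over q described in the text.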
Perseus is statistical analysis software that performs statistical tests on MaxQuant results. Perseus also provides normalization and visualization of the data analysis results [9]. It was created by the creators of MaxQuant.

For a comprehensive analysis of the data using MaxQuant and Perseus, the raw files from the mass spectrometry experiment are first loaded into MaxQuant. The proteins from the data files are identified and quantified. One of the many MaxQuant output files is ‘proteinGroups’, which is loaded into Perseus. In Perseus, statistical tests are performed to mark the significance of each protein.

A proteomics data analysis pipeline involving MaxQuant and Perseus, however, raises a few procedural issues: i) a sequential run of MaxQuant and Perseus involves waiting for MaxQuant (which often takes more than 24 hours) to complete its analysis; ii) it is a repetitive process; iii) the results that come out of Perseus are complex and need additional processing before they can be made available to the end user; and iv) very often, proteomics experimental scientists are not conversant with the use of MaxQuant and/or Perseus. This results in several learning-curve issues, which take the focus of the practitioner away from his or her bench expertise.

WAPDAP

To address all these issues, there was a need for a software solution that automates the MaxQuant and Perseus software and does the final processing to generate the results desired by the user in an easily-understood format. To this end, we developed software called WAPDAP, written in the Python programming language. Procedurally, our software automatically: i) accepts input through a web-based form that only requires the use of an Internet browser; ii) acquires the raw files that result from the mass spectrometry experiment, which are then loaded into MaxQuant; iii) acquires the output from MaxQuant and feeds it into Perseus; and iv) takes the Perseus output and further processes it to generate results that are accessible in an Internet browser.

A publication by Pendarvis et al. (2009) [10] describes the generation of probabilities and differential expression analysis by streamlining the software Bioworks 3.2, Sequest and ProtQuant. Sequest, however, is not an open source software solution. Aiche et al. (2015) [11] published the results of automation using an open source software solution called OpenMS; the results, however, are only applicable to metabolites. WAPDAP does not process data itself; it leverages the best features of MaxQuant and Perseus in an automated, user-friendly system.

Methods

Creating and Automating the Pipeline

The pipeline includes a sequential run of MaxQuant and Perseus. The pipeline, illustrated in Figure 1, can be divided into three distinct parts:
• Processing the raw data with MaxQuant
• Statistical processing of the MaxQuant output with Perseus
• Processing the Perseus output for the final result generation

Processing Raw Data with MaxQuant

The pipeline has been automated using the Python programming language. Python is an interpreted, interactive, object-oriented programming language [12]. It is a high-level, general-purpose programming language [12]. The raw files, which are the output of a mass spectrometry experiment, are loaded into MaxQuant. To process these files, their parameters have to be specified.
These parameters are specified in an MS Excel file named ‘inputfile’ and are read into the code by the Python module ‘openpyxl’, which reads data from an MS Excel sheet into the Python environment [13]. The software then instructs the process to execute the MaxQuant software. All the instructions to the MaxQuant and Perseus software are given through a module called ‘PyAutoit’, which is designed for automating the Windows GUI and for general scripting [14].

After beginning the execution of MaxQuant, the software loads the raw files into the system. Then, in the ‘Group-specific parameter’ tab of MaxQuant, Label Free Quantification (LFQ) is selected and the minimum ratio count for LFQ is set to 1. The minimum ratio count for LFQ corresponds to the number of peptides that are considered in the pair-wise comparison that yields the LFQ intensity ratio across samples. Next, in the ‘Global parameters’ tab, the path is specified for the FASTA-formatted peptide-sequence file against which MaxQuant’s search engine searches for a match. In the same tab, the ‘match between runs’ option is activated. This option significantly increases the number of peptides for label free quantification by transferring peptide identifications to unsequenced or unidentified peptides. All other parameters are retained at their MaxQuant defaults. These include the mass spectrometry instrument parameters, where the default instrument is Orbitrap and the peptide tolerance is 4.5 ppm. The default peptide and protein FDR is one percent.

Subsequently, the identification and quantification process is initiated. MaxQuant outputs several result files. One of these is the ‘proteinGroups.txt’ file, which is the input file for the statistical analysis in Perseus. WAPDAP searches for this file every 10 minutes. When WAPDAP identifies that the proteinGroups.txt file has been generated, the execution of Perseus is initiated.
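The waiting step described above amounts to polling the file system. The following is a minimal sketch of such a loop, not WAPDAP's actual code; the function name, the optional timeout, and the injectable `sleep` argument (useful for testing) are assumptions made for illustration. Only the 10-minute interval and the file name come from the text.

```python
import os
import time


def wait_for_file(path, poll_seconds=600, timeout_seconds=None, sleep=time.sleep):
    """Block until `path` exists, checking every `poll_seconds` seconds.

    Returns True once the file is found, or False if `timeout_seconds`
    (hypothetical safeguard, not described in the chapter) elapses first.
    """
    waited = 0
    while not os.path.isfile(path):
        if timeout_seconds is not None and waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True
```

A call such as `wait_for_file("proteinGroups.txt")` would return once the file appears, after which the Perseus step could be started. A file can appear before it has been completely written, so a more careful version might also check that the file size has stopped changing.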
Statistical Processing of the MaxQuant Output with Perseus

The ‘proteinGroups.txt’ file is loaded into Perseus and only the LFQ intensities are considered for statistical analysis. Proteins that are potential contaminants, that are reversed, or that are only identified by site are filtered out. Proteins identified by site are those that meet the one percent peptide FDR cut-off but fail to meet the one percent protein FDR cut-off. After that, the groups (e.g., control group, treated group) are categorically annotated and the LFQ intensities are log base 2 transformed to facilitate the calculation of the protein expression fold change. A histogram that shows all the different groups is produced and exported to a newly and automatically created folder named ‘perseus_results’. All the results produced by this process are placed in that folder.

Following the log base 2 transformation, a NaN (not a number) notation is assigned wherever there were no identifications (intensities are zero). The next step filters out those proteins that do not satisfy the filtering criterion. The default criterion for three control replicates and three treated replicates is that a protein must be identified in at least two replicates of at least one group (e.g., the control group). Users can control this filtering. Only the proteins that satisfy the filtering criterion remain. While NaN notations exist, the t-test cannot be performed. These NaN notations are therefore replaced by values imputed from a normal distribution. Then, a histogram showing the imputed values, a volcano plot, a heat map, and a multi scatter plot are produced and exported to the ‘perseus_results’ folder. Next, a Student’s t-test is performed on the log base 2 transformed, filtered LFQ intensities. This produces a column showing all the proteins whose changes are statistically significant.
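In WAPDAP these steps are carried out inside Perseus through GUI automation. Purely to illustrate the arithmetic, the sketch below shows the log base 2 transformation with NaN for zero intensities, the default valid-value filter (at least two identified replicates in at least one group), and a log2 fold change. All function names and the example values are hypothetical, and the imputation and t-test steps are omitted.

```python
import math


def log2_transform(intensities):
    """Log2-transform LFQ intensities; zero (no identification) becomes NaN."""
    return [math.log2(x) if x > 0 else math.nan for x in intensities]


def passes_valid_value_filter(groups, min_valid=2):
    """Keep a protein if at least one group has `min_valid` non-NaN values.

    `groups` maps a group name (e.g. "control", "treated") to that
    protein's log2 intensities across the replicates of the group.
    """
    return any(
        sum(not math.isnan(v) for v in values) >= min_valid
        for values in groups.values()
    )


def log2_fold_change(treated, control):
    """Difference of group means on the log2 scale (inputs must not contain NaN)."""
    return sum(treated) / len(treated) - sum(control) / len(control)
```

Because the values are already on the log2 scale, the fold change is a simple difference of means: a result of 1.0 corresponds to a two-fold increase in the treated group.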
An additional step then generates a column that contains only the numerical part of the first gi number (the unique identifier of NCBI entries) of each protein group. This is done with regular expressions; the number is used later to extract protein names from the FASTA file. Lastly, the resulting matrix containing all this information is exported to the ‘perseus_results’ folder in a file named ‘stock.txt’, and the Perseus session is saved and closed. The software then starts the final result generation process.

Processing the Perseus Output for the Final Result Generation

The final result is generated with a Python module called ‘PypeR’, which allows commands in the R programming and scripting language to be executed within Python by the “pipe” communication method [15]. The final result generation process starts by reading in the Perseus output file ‘stock.txt’. Then, the proteins are filtered to retain only the statistically significant ones. After that, the FASTA-formatted sequence file is searched for the numerical part of each ‘gi number’ to get the corresponding protein names. The proteins are then separated into four different groups, written to four different text files. Typically, these are the final pieces of information sought by the end user. The four groups of proteins are:
• Significant up-regulated proteins
• Significant unique up-regulated proteins
• Significant down-regulated proteins
• Significant unique down-regulated proteins

Web User Interface

The web user interface was created with the Django web framework. It serves as an interface for the end user, who can access MaxQuant and Perseus processing and the results without having to learn about MaxQuant and/or Perseus. Django is one of many Web development frameworks available for Python [16]. It is a fast, secure, and scalable high-level Web framework [17].
The website serves two primary purposes:
• Taking in the parameters for running the process through a web form
• Serving the results that come out of the process

Django commands are given through the Windows command prompt. First, a project called ‘mysite’ is created. Then two apps, named ‘personal’ and ‘results’, are created inside the project. The ‘personal’ app provides a form that takes in the input parameters, including an email address so that the user can be contacted when the final results become available. The ‘results’ app provides the results. In the ‘personal’ app, a class is developed for all of the necessary parameters using the form functionality of Django. A separate package called ‘django-bootstrap-form’ has been used to make the webpage more accessible and easy to read. The whole website uses Twitter Bootstrap, a front-end toolkit for faster development of web applications [18]. It is a collection of CSS and HTML conventions for styling typography, forms, buttons, tables, grids, navigation, etc. [18].

For each app to work properly, it is first listed as an installed app in the ‘settings.py’ file of the ‘mysite’ directory. Then, in the ‘urls.py’ file of the same directory, the URLs for these apps are specified; these point to the ‘urls.py’ files inside the corresponding app directories. After that, the URLs in the ‘urls.py’ file of each app directory are specified. Ultimately, the URLs lead to functions in the ‘views.py’ file in the same directory, which specify how to customize the results on the web page.

The form is presented on the home page of the website. On clicking the submit button of the form, the information from the form populates a database. SQLite3 is used as the database.
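As an illustration of how submitted form values might be stored and later retrieved by job ID, the sketch below uses Python's built-in `sqlite3` module directly. This is an assumption made for brevity: the chapter's site is built on Django, and the table name, column names, and functions shown here are hypothetical rather than taken from WAPDAP.

```python
import sqlite3


def create_jobs_table(conn):
    """Create a hypothetical table holding one row per submitted job."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id INTEGER PRIMARY KEY AUTOINCREMENT,
               email TEXT NOT NULL,
               fasta_path TEXT,
               min_ratio_count INTEGER
           )"""
    )


def submit_job(conn, email, fasta_path, min_ratio_count=1):
    """Store the form values and return the newly assigned job ID."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO jobs (email, fasta_path, min_ratio_count) VALUES (?, ?, ?)",
            (email, fasta_path, min_ratio_count),
        )
    return cur.lastrowid


def get_job(conn, job_id):
    """Return (email, fasta_path, min_ratio_count) for a job, or None if absent."""
    return conn.execute(
        "SELECT email, fasta_path, min_ratio_count FROM jobs WHERE job_id = ?",
        (job_id,),
    ).fetchone()
```

Parameterized queries (the `?` placeholders) keep user-supplied form values from being interpreted as SQL.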
The submit button function then sends one email to the client and another to the website’s administrator, indicating the job ID of the newly created job. The client can access the results of the job once it is complete. The administrator annotates the job ID in the database with the file path that indicates the location of the raw files for the specified person and job. Once the mass spectrometer completes its job, the administrator runs the data analysis process by indicating only the job ID. Using that job ID, the data analysis software automatically retrieves from the database the necessary parameters that were specified by the client. After the data analysis process is done, the software sends an email to the client indicating that the job is complete and that the results are available to be downloaded from the website.

The YouTube video at https://www.youtube.com/watch?v=ikvCyU8mswg illustrates the use of every aspect of WAPDAP, from the completion of the input form, through a truncated run of MaxQuant and Perseus, to the output of results to a web page.

Results and Discussions

The data analysis process generates the following picture files:
• Nan_Histogram.png – this histogram is generated after the log base 2 transformation and provides information about the distribution of all the groups (control and treated).
• RMV_Histogram.png – this histogram is created after replacing the missing values (RMV, hence the name) and shows the imputed values.
• Volcano_plot.png – this image shows the volcano plot, which is a visual representation of the Student’s two-sample t-test and shows the specified fold-change threshold. This figure provides a view of the changes that have occurred between the compared groups (i.e., whether the proteins are up-regulated or down-regulated).
• Heatmap.png – this image provides the heat map for the experiment, which gives information about the overall behavior of the individual replicates of all groups.
• Multi_scatter.png – this image provides the scatter plots of all replicates in all groups against all replicates in all groups, together with the Pearson correlation among them.
• Significant up regulated.txt, Significant unique up regulated.txt, Significant down regulated.txt, and Significant unique down regulated.txt – these files contain the detailed results of the data analysis process for the four classes of proteins, and include the following columns:
  • Protein name – provides the name of the first member of each protein group.
  • Protein groups – provides the gi numbers of all the members of each protein group.
  • p-value – provides the p-values of the corresponding proteins.
  • Molecular weight – provides the molecular weight of the corresponding proteins in Da.
  • Sequence coverage – provides the sequence coverage of the corresponding protein in percent.
  • Fold change – provides the fold change for the corresponding proteins.
  • Inverse ratio – the inverse of the fold change.
• Stock.txt – this text file contains all the columns that come out of Perseus. All of the text files are in a tab-separated format. These are very easy to visualize; the user can drag and drop them into an MS Excel file.
• The Perseus session that contains all of the steps is also saved, but it is not served with the default results on the website. If the client requests it, the saved Perseus session will be provided.
• All of the files stated above, except the saved Perseus session, are available through the website as downloadable links.

Conclusion

The iterative and time-consuming data analysis pipeline for label free quantitative interaction proteomics has been automated through WAPDAP.
The pipeline involves the sequential use of the MaxQuant and Perseus software, followed by additional result processing to generate simpler results. This was accomplished mainly by using the ‘PyAutoit’ and ‘PypeR’ modules of the Python programming language. The Django Web framework was used to develop the web interface that links the process.

References

1. Hirosawa M, Hoshida M, Ishikawa M, Toya T. MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. Comput Appl Biosci. 1993; 9: 161-167.
2. Searle BC. Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics. 2010; 10: 1265-1269.
3. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 2008; 26: 1367-1372.
4. Aebersold R, Mann M. Mass-spectrometric exploration of proteome structure and function. Nature. 2016; 537: 347-355.
5. Bantscheff M, Schirle M, Sweetman G, Rick J, Kuster B. Quantitative mass spectrometry in proteomics: a critical review. Anal Bioanal Chem. 2007; 389: 1017-1031.
6. Cox J, Hein MY, Luber CA, Paron I, Nagaraj N, et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics. 2014; 13: 2513-2526.
7. Nahnsen S, Bielow C, Reinert K, Kohlbacher O. Tools for label-free peptide quantification. Mol Cell Proteomics. 2013; 12: 549-556.
8. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, et al. Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res. 2011; 10: 1794-1805.
9. Tyanova S, Temu T, Sinitcyn P, Carlson A, Hein MY, et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Meth. 2016.
10. Pendarvis K, Kumar R, Burgess SC, Nanduri B. An automated proteomic data analysis workflow for mass spectrometry. BMC Bioinformatics. 2009; 10: S17.
11. Aiche S, Sachsenberg T, Kenar E, Walzer M, Wiswedel B, et al. Workflows for automated downstream data analysis and visualization in large-scale computational mass spectrometry. Proteomics. 2015; 15: 1443-1447.
12. Python 3.5.2 documentation. Python. https://docs.python.org/3/faq/general.html.
13. Gazoni E, Clark C. A Python library to read/write Excel 2010 xlsx/xlsm files. OpenPyXL. https://openpyxl.readthedocs.io/en/default/.
14. AutoIt. AUTOIT. https://www.autoitscript.com/site/autoit/.
15. Xia M, McClelland M, Wang Y. PypeR, A Python Package for Using R in Python. Journal of Statistical Software. 2010; 35: 1-8.
16. Riti P. What You Need to Know about Python. Birmingham: Packt Publishing Ltd. 2016; 52.
17. Django makes it easier to build better Web apps more quickly and with less code. Django. https://www.djangoproject.com/.
18. Otto M. Bootstrap from Twitter. Twitter. 2011. https://blog.twitter.com/2011/bootstrap-from-twitter.