Bioinformatics
Chapter 1
WAPDAP-Software that Automates
Proteomics Data Analysis Pipeline
Adnan Ahmed, Masoud Zabet-Moghaddam and Chiquito Crasto*
Center for Biotechnology and Genomics, Texas Tech University, USA
*Corresponding Author: Chiquito Crasto, Center for Biotechnology and Genomics, #115 Experimental Sciences Building, Texas Tech University, Lubbock, TX 79409, USA, Tel: 806-8345448; Email: [email protected]
First Published March 20, 2017
Copyright: © 2017 Adnan Ahmed, Masoud Zabet-Moghaddam and Chiquito Crasto.
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits
unrestricted use, distribution, and reproduction in any medium,
provided you give appropriate credit to the original author(s)
and the source.
www.avidscience.com
Keywords
WAPDAP; Perseus; MaxQuant; Python Software Wrapper; Proteomics Analysis
Abstract
Post-experiment data analysis is critical to mass spectrometry-based proteomics, especially in assessing the large amounts of data that are produced. “BIG DATA” analyses are multi-step, time-consuming processes; they often involve the sequential use of several software suites, with manual intervention at each step of the analysis. Proteomics analysis processes are also iterative and repetitive. Some steps require several hours to complete, burdening users with time constraints. To address this, we have designed software called WAPDAP (A Wrapper for an Automated Proteomics Data Analysis Pipeline). Developed in the Python scripting language, WAPDAP sequentially and automatically executes two key software packages for mass spectrometry-based proteomics data analysis, MaxQuant and Perseus. WAPDAP is accessible via the World Wide Web. A web interface allows the user to input MaxQuant and Perseus parameters, which are then passed to the software. Output results are also available online through a browser. A demonstration of WAPDAP can be found on YouTube at: https://www.youtube.com/watch?v=ikvCyU8mswg
Introduction
Mass spectrometry experiments produce large amounts of data and typify BIG DATA in the domain of proteomics. Commercial [1, 2] and free [3] software suites have been developed to process these data. These suites (often several programs working in sequence) address different issues in proteomics data processing. Processing mass spectrometry data involves the following steps: a) identifying and quantifying peptides and proteins; b) statistically analysing the results from (a); and c) sorting and presenting the data in a human-readable format. Traditionally, manual intervention is necessary at each step, and the results of one step are often the input for the next. Manually going through these steps for several (often hundreds of) files is repetitive and time consuming.
In this chapter, we discuss the creation of a “wrapper” software, WAPDAP. WAPDAP automates and streamlines the proteomics analysis of the results of a mass spectrometry experiment for two popular and freely available software systems: MaxQuant, which identifies and quantifies proteins from mass spectrometry data; and Perseus, which performs statistical analyses on the identified and quantified proteins. WAPDAP leverages the World Wide Web for its execution and provides two major advantages: 1) it can be executed remotely by a user who only has to input a few parameters related to his or her experiment via a web-based input form; 2) it removes the need for users to have any expertise in the use of MaxQuant or Perseus, because both programs are executed on the server side.
Background
Interaction proteomics is an important tool that enhances our understanding of the composition, topology and structure of specific complex macromolecular assemblies. The workflow for interaction proteomics involves an assay of digested peptides of a “bait” protein and its binding partner, followed by mass-spectrometric analysis. A quantitative comparison of this test assay with a control assay allows a differential view of the results [4]. There are two approaches to quantitative comparison: 1) stable isotope labeled quantification, where the labeled isotope can be introduced into amino acids as an internal or external standard [5]; and 2) Label Free Quantification (LFQ), which was designed as an improvement over stable isotope labeled quantification [6]. Stable isotope labeling requires extra preparation steps and is expensive; in addition, not every material (e.g., clinical samples) can be metabolically labeled [6]. Label free quantification can be done by spectral counting or by comparing the direct mass spectrometric signal intensities for any given peptide [5]. Assessing signal intensities has been reported to be slightly more accurate than spectral counting [7].
MaxQuant and Perseus
Of the commercial and academic software currently in use, almost all can assess and process the large amounts of data that arise from the label free quantification method. Of these, MaxQuant is a free, widely used, open source software solution for fast analysis of data from high-resolution mass spectrometers [7]. MaxQuant’s performance is comparable with that of commercial software. MaxQuant focuses on intensity-based quantitation rather than spectral counting [3]. To identify proteins, MaxQuant uses its own probabilistic search engine, ‘Andromeda’ [8]. First, the theoretical fragments are calculated from the given FASTA-formatted protein-sequence file, based on the digestion rule, the combinatorics of modifications, and the fragmentation method (CID/HCD or ETD) [8]. Then, the MS/MS spectra are processed by “centroiding”, “deisotoping”, and conversion to charge state 1 [8]. Next, the processed spectra are filtered by dividing the m/z axis into windows of 100 Thomson (Th) and keeping only the top q (default q = 12) peaks in each window [8]. These peaks are then matched with the theoretical peaks and a score is calculated, which reflects the probability of a chance match between the experimental and theoretical peaks [8]. The score is calculated iteratively, from the top q = 12 peaks down to the top q = 4 peaks (q is the number of peaks allowed per 100 Th window), and the best score is used in the next step. This score is the P-value for the probability of having k or more matches among n theoretical masses, where the null hypothesis is that there is no similarity between the raw spectrum and the theoretical masses [8]. When all the peptides have been identified and assigned a score, a one percent False Discovery Rate (FDR) cut-off is applied to the identified peptides using the Posterior Error Probability (PEP) [8]. Another one percent FDR is applied at the protein level [8]. Because the one percent FDR is applied twice in sequence, the ultimate FDR for peptides becomes less than one percent [8].
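The peak filtering and chance-match probability described above can be sketched in a few lines of Python. This is only an illustration, not MaxQuant's code: the function names are hypothetical, and the sketch assumes a binomial model in which each of the n theoretical masses matches by chance with probability q/100, which is one reading of the description in [8].

```python
from math import comb


def top_q_peaks(peaks, q=12, window=100.0):
    """Keep the q most intense (m/z, intensity) peaks in each m/z window."""
    bins = {}
    for mz, intensity in peaks:
        bins.setdefault(int(mz // window), []).append((mz, intensity))
    kept = []
    for members in bins.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:q])
    return sorted(kept)


def chance_match_pvalue(n, k, q, window=100.0):
    """Probability of k or more chance matches among n theoretical masses.

    Assumption: each theoretical mass matches by chance with
    probability q / window (binomial model).
    """
    p = q / window
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def best_pvalue(n, matches_by_q):
    """Smallest p-value over the tried q values; matches_by_q maps q -> k."""
    return min(chance_match_pvalue(n, k, q) for q, k in matches_by_q.items())
```

For example, `best_pvalue(20, {12: 9, 8: 8, 4: 6})` evaluates the three hypothetical (q, k) pairs and returns the smallest probability, mirroring the iteration from q = 12 down to q = 4.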
Peak detection in MaxQuant is done by fitting a Gaussian peak shape to the three central raw data points in each MS scan; the resulting peaks are assembled into three-dimensional (3D) peak hills over the m/z-retention time plane [3]. By bootstrap replication, an individual mass precision is calculated for each 3D peak [3]. The peak intensity and mass together are termed a ‘feature’, which is the basis of data analysis in MaxQuant [3]. The intensities for Label Free Quantification are determined as the intensity maximum over the retention time profile [6]. MaxQuant offers the option of increasing the number of peptides used for quantification by transferring peptide identifications to unsequenced or unidentified peptides by matching their mass and retention times (the ‘match-between-runs’ feature in MaxQuant) [6].
Perseus is a statistical analysis software package that performs statistical tests on MaxQuant results. Perseus also provides normalization and visualization of the data analysis results [9]. It was created by the developers of MaxQuant.
For a comprehensive analysis of the data using MaxQuant and Perseus, the experimental mass spectrometry raw files are first loaded into MaxQuant. The proteins in the data files are identified and quantified. One of the many MaxQuant output files is ‘proteinGroups’, which is loaded into Perseus. In Perseus, statistical tests are performed to mark the significance of each protein.
A proteomics data analysis pipeline involving MaxQuant and Perseus, however, raises a few procedural issues: i) a sequential run of MaxQuant and Perseus involves waiting for MaxQuant (which often takes more than 24 hours) to complete its analysis; ii) it is a repetitive process; iii) the results that come out of Perseus are complex and need additional processing before they can be made available to the end user; iv) very often, proteomics experimental scientists are not conversant with the use of MaxQuant and/or Perseus.
This results in several learning-curve issues, which take the focus of the practitioner away from his or her bench expertise.
WAPDAP
To address all these issues, there was a need for a software solution that automates the MaxQuant and Perseus software and does the final processing to generate the results desired by the user in an easily understood format. To this end, we developed software called WAPDAP, written in the Python programming language. Procedurally, our software automatically: i) accepts input through a web-based form that only requires the use of an Internet browser; ii) acquires the mass spectrometry experiment results (the raw files), which are then loaded into MaxQuant; iii) acquires the output from MaxQuant and feeds it into Perseus; and iv) takes the Perseus output and further processes it to generate results that are accessible in an Internet browser.
A publication by Pendarvis et al. (2009) [10] describes the generation of probabilities and differential expression analysis by streamlining the software Bioworks 3.2, Sequest and ProtQuant. Sequest, however, is not an open source software solution. Aiche et al. (2015) [11] reported automation using an open source software solution called OpenMS; the results, however, are only applicable to metabolites. WAPDAP does not process data; it leverages all the best features of MaxQuant and Perseus into an automated, user-friendly system.
Methods
Creating and Automating the Pipeline
The pipeline includes a sequential run of MaxQuant
and Perseus. The pipeline, illustrated in Figure 1, can be
divided into three distinct parts:
• Processing the raw data with MaxQuant
• Statistical processing of the MaxQuant output
with Perseus
• Processing the Perseus output for the final result
generation
Processing Raw Data with MaxQuant
The pipeline has been automated using the Python programming language. Python is an interpreted, interactive, object-oriented, high-level, general-purpose programming language [12]. The raw files, which are the output of a mass spectrometry experiment, are loaded into MaxQuant. To process these files, their parameters have to be specified. These parameters are specified in an MS Excel file named ‘inputfile’ and are read into the code by the Python module ‘openpyxl’, which reads data from an MS Excel sheet into the Python environment [13]. WAPDAP then launches MaxQuant. All instructions to MaxQuant and Perseus are issued through a module called ‘PyAutoit’, which is designed for automating the Windows GUI and for general scripting [14]. After starting MaxQuant, WAPDAP loads the raw files into it. Then, in the ‘Group-specific parameter’ tab of MaxQuant, Label Free Quantification (LFQ) is selected and the minimum ratio count for LFQ is set to 1. The minimum ratio count for LFQ corresponds to the number of peptides that are considered for pair-wise comparison to obtain the LFQ intensity ratio across samples. Next, in the ‘Global parameters’ tab, the path to the FASTA-formatted sequence file against which MaxQuant’s search engine searches for matches is specified. In the same tab, the ‘match between runs’ option is activated. This option significantly increases the number of peptides for label free quantification by transferring peptide identifications to unsequenced or unidentified peptides. All other parameters retain their MaxQuant defaults. These include the mass spectrometry instrument parameters, where the default instrument is Orbitrap and the peptide tolerance is 4.5 ppm. The default peptide and protein FDR is one percent. Subsequently, the identification and quantification process is initiated. MaxQuant outputs several result files. One of these is the ‘proteinGroups.txt’ file, which is the input file for statistical analysis in Perseus. WAPDAP checks for this file every 10 minutes. When WAPDAP finds that the proteinGroups.txt file has been generated, the execution of Perseus is initiated.
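The 10-minute check for ‘proteinGroups.txt’ amounts to a simple polling loop. The following is a minimal sketch using only the Python standard library; the function name, the optional timeout and the injectable `sleep` argument are assumptions made for illustration and are not taken from the WAPDAP source.

```python
import time
from pathlib import Path


def wait_for_file(path, interval_s=600, timeout_s=None, sleep=time.sleep):
    """Poll until `path` exists, checking every `interval_s` seconds.

    Returns True once the file is found, or False if `timeout_s`
    (optional, hypothetical safeguard) elapses first.
    """
    target = Path(path)
    waited = 0
    while not target.exists():
        if timeout_s is not None and waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
    return True
```

A call such as `wait_for_file(results_dir / "proteinGroups.txt")`, where `results_dir` is wherever MaxQuant writes its text output, would block until the file appears. Note that the existence of a file does not guarantee that it has been completely written, so a production version might also check that the file size has stopped changing.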
Statistical Processing of the MaxQuant Output with Perseus
The ‘proteinGroups.txt’ file is loaded into Perseus, and only the LFQ intensities are considered for statistical analysis. Proteins that are potential contaminants, reversed, or only identified by site are filtered out. Proteins identified by site are those that meet the one percent peptide FDR cut-off but fail to meet the one percent protein FDR cut-off. After that, the groups (e.g., control group, treated group) are categorically annotated and the LFQ intensities are log base 2 transformed to facilitate the calculation of protein expression fold change. A histogram showing all the groups is produced and exported to an automatically created folder named ‘perseus_results’. All the results produced by this process are placed in that folder. Following the log base 2 transformation, proteins with no identification (intensity of zero) are assigned NaN (not a number).
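The transformation step can be illustrated with a short sketch. The function name is hypothetical; the behaviour (log base 2 of positive intensities, NaN for zero intensities) follows the description above.

```python
import math


def log2_or_nan(intensity):
    """Log base 2 of an LFQ intensity; zero (no identification) becomes NaN."""
    return math.log2(intensity) if intensity > 0 else math.nan


# Hypothetical LFQ intensities for one protein across three replicates.
transformed = [log2_or_nan(value) for value in (1024.0, 0.0, 4096.0)]
```

On the log scale, a fold change between two groups becomes a simple difference of means, which is why the transformation is applied before the statistical test.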
The next step filters out proteins that do not satisfy the following criterion. By default, for three control replicates and three treated replicates, a protein must be identified in at least two replicates of at least one group (e.g., the control group). Users can control this filtering. Only the proteins that satisfy the filtering criterion remain. While NaN values exist, the t-test cannot be performed, so the NaN values are replaced by values imputed from a normal distribution. Then, a histogram showing the imputed values, a volcano plot, a heat map, and a multi scatter plot are produced and exported to the ‘perseus_results’ folder.
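The filtering rule and the imputation can be sketched as follows. This is an illustration under stated assumptions rather than a description of Perseus internals: the function names are hypothetical, the imputation is applied to one column of log-transformed intensities at a time, and the `width` and `down_shift` values (expressed in units of the observed standard deviation) are illustrative, since the chapter does not state which settings were used.

```python
import math
import random
from statistics import mean, stdev


def passes_filter(groups, min_valid=2):
    """True if at least one group has `min_valid` or more non-NaN values."""
    return any(
        sum(not math.isnan(v) for v in values) >= min_valid
        for values in groups.values()
    )


def impute_column(values, width=0.3, down_shift=1.8, rng=random):
    """Replace NaN entries with draws from a narrowed, down-shifted normal.

    Assumption: the normal distribution is centred `down_shift` standard
    deviations below the observed mean, with `width` times the observed
    standard deviation. Needs at least two observed values.
    """
    observed = [v for v in values if not math.isnan(v)]
    mu, sigma = mean(observed), stdev(observed)
    return [
        rng.gauss(mu - down_shift * sigma, width * sigma) if math.isnan(v) else v
        for v in values
    ]
```

Passing `rng=random.Random(0)` makes the imputed values reproducible, which is useful when results need to be compared between runs.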
Next, a Student’s t-test is performed on the log base 2 transformed, filtered LFQ intensities. This produces a column marking all the proteins whose changes are statistically significant. An additional step then generates a column containing only the numerical part of the first gi number (the unique identifier of NCBI entries) of each protein group. This is done with regular expressions, and the column is used later to extract protein names from the FASTA file. Lastly, the resulting matrix containing all this information is exported to the ‘perseus_results’ folder in a file named ‘stock.txt’, and the Perseus session is saved and closed. The software then starts the final result generation process.
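Both steps in this paragraph can be illustrated briefly. The sketch below computes the pooled-variance (Student’s) two-sample t statistic with the standard library and extracts the first gi number with a regular expression. The function names and the example identifier string are hypothetical, and the exact pattern used by WAPDAP is not given in the chapter; turning the statistic into a p-value additionally requires the t distribution, which is omitted here.

```python
import re
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


# Assumed identifier layout: "gi|<number>|..." entries separated by ";".
GI_PATTERN = re.compile(r"gi\|(\d+)")


def first_gi(protein_ids):
    """Numerical part of the first gi number in a protein group, or None."""
    match = GI_PATTERN.search(protein_ids)
    return match.group(1) if match else None
```

For a made-up protein group such as `"gi|12345|ref|NP_000001.1|;gi|67890|ref|NP_000002.1|"`, `first_gi` returns the string `"12345"`.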
Processing the Perseus Output for the Final Result Generation
The final result is generated using a Python module called ‘PypeR’, which allows R commands to be executed within Python through the “pipe” communication method [15]. The final result generation process starts by reading in the Perseus output file ‘stock.txt’. Then, the proteins are filtered to retain only the statistically significant ones. After that, the FASTA-formatted sequence file is searched for the numerical part of each ‘gi number’ to obtain the corresponding protein names. The proteins are then divided into four groups, which are written to four separate text files. Typically, these are the final pieces of information sought by the end user. The four groups of proteins are:
• Significant up-regulated proteins
• Significant unique up-regulated proteins
• Significant down-regulated proteins
• Significant unique down-regulated proteins
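The name lookup and the direction of regulation can be sketched in plain Python (WAPDAP itself performs this stage in R through PypeR). Everything here is illustrative: the function names are hypothetical, the FASTA header layout is assumed to follow the NCBI style `>gi|<number>|<db>|<accession>| <description>`, and the fold change is assumed to be treated over control. The criterion that makes a protein “unique” is not defined in the chapter, so it is not modelled.

```python
import re

# Assumed header layout: ">gi|<number>|<db>|<accession>| <description>"
HEADER = re.compile(r"^>gi\|(\d+)\|\S*\s+(.*)$")


def names_by_gi(fasta_lines):
    """Map the numerical gi identifier to the description in each header."""
    names = {}
    for line in fasta_lines:
        match = HEADER.match(line.strip())
        if match:
            names[match.group(1)] = match.group(2)
    return names


def fold_change_columns(log2_treated, log2_control):
    """Fold change (assumed treated / control) and its inverse ratio."""
    fold = 2 ** (log2_treated - log2_control)
    return fold, 1 / fold


def direction(log2_treated, log2_control):
    """Label a significant protein as up- or down-regulated."""
    return "up" if log2_treated > log2_control else "down"
```

With mean log2 intensities of 12.0 (treated) and 10.0 (control), the fold change is 4.0 and the inverse ratio is 0.25.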
Web User Interface
The web user interface was created using the Django web framework. It serves as an interface for the end user, who can access MaxQuant and Perseus processing and the results without having to learn MaxQuant and/or Perseus. Django is one of many Web development frameworks available for Python [16]. It is a fast, secure, and scalable high-level Web framework [17].
The website serves two primary purposes:
• Taking in the parameters for running the process through a web form
• Serving the results that come out of the process
Django commands are given through the Windows command prompt. First, a project called ‘mysite’ is created. Then two apps, named ‘personal’ and ‘results’, are created inside the project. The ‘personal’ app provides a form that takes in the input parameters, including an email address so that the user can be contacted when the final results become available. The ‘results’ app provides the results. In the ‘personal’ app, a class covering all of the necessary parameters is defined using Django’s form functionality. A separate package called ‘django-bootstrap-form’ has been used to make the webpage more accessible and easy to read. The website as a whole uses Twitter Bootstrap, a front-end toolkit for faster development of web applications [18]. It is a collection of CSS and HTML conventions for styling typography, forms, buttons, tables, grids, navigation, etc. [18]. For each app to work properly, it is first registered in the ‘settings.py’ file of the ‘mysite’ directory. Then, in the ‘urls.py’ file in the same directory, URLs are specified that point to the ‘urls.py’ file inside each app’s directory. After that, the URLs in the ‘urls.py’ file of each app’s directory are specified. Ultimately, the URLs lead to functions in the ‘views.py’ file in the same directory, which determine how the results are presented on the web page.
The form is presented on the home page of the website. When the form’s submit button is clicked, the information from the form populates a database; SQLite3 is used as the database. The submit action then sends one email to the client and another to the website’s administrator, indicating the job ID of the newly created job. The client can access the results of the job once it is complete. The administrator annotates the job ID in the database with the file path that indicates the location of the raw files for the specified person and job. Once the mass spectrometer completes its run, the administrator starts the data analysis process by indicating only the job ID. The data analysis software automatically retrieves from the database the parameters that the client specified for that job ID. After the data analysis process is done, the software sends an email to the client indicating that the job is complete and that the results are available for download from the website.
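The job bookkeeping described above can be sketched with Python’s built-in `sqlite3` module. In WAPDAP the database is managed through Django, so this is only an illustration of the data flow; the table name, column names and function names are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per submitted job.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT NOT NULL,
    min_ratio_count INTEGER NOT NULL,
    raw_path TEXT
)
"""


def open_db(path=":memory:"):
    """Open the database and make sure the jobs table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def submit_job(conn, email, min_ratio_count):
    """Store the form input and return the new job ID."""
    cur = conn.execute(
        "INSERT INTO jobs (email, min_ratio_count) VALUES (?, ?)",
        (email, min_ratio_count),
    )
    conn.commit()
    return cur.lastrowid


def annotate_raw_path(conn, job_id, raw_path):
    """Administrator step: record where the raw files for a job are stored."""
    conn.execute("UPDATE jobs SET raw_path = ? WHERE job_id = ?", (raw_path, job_id))
    conn.commit()


def load_job(conn, job_id):
    """Fetch all stored parameters for a job ID, or None if it is unknown."""
    row = conn.execute("SELECT * FROM jobs WHERE job_id = ?", (job_id,)).fetchone()
    return dict(row) if row else None
```

The analysis run then needs only the job ID: `load_job(conn, job_id)` returns the parameters entered by the client together with the raw file location added by the administrator.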
The YouTube video at https://www.youtube.com/watch?v=ikvCyU8mswg illustrates every aspect of the use of WAPDAP, from the completion of the input form, through a truncated run of MaxQuant and Perseus, to the output of results to a web page.
Results and Discussions
The data analysis process generates the following files:
• Nan_Histogram.png – this histogram is generated after the log base 2 transformation and provides information about the distribution of all the groups (control and treated).
• RMV_Histogram.png – this histogram is created after replacing the missing values (RMV, hence the name) and shows the imputed values.
• Volcano_plot.png – this image shows the volcano plot, a visual representation of Student’s two-sample t-test with the specified fold change threshold. It shows the changes that have occurred between the compared groups (e.g., whether proteins are up-regulated or down-regulated).
• Heatmap.png – this image provides the heat map for the experiment, which gives information about the overall behavior of the individual replicates of all groups.
• Multi_scatter.png – this image provides scatter plots of all replicates in all groups against one another, together with the Pearson correlation between them.
• Significant up regulated.txt, Significant unique up regulated.txt, Significant down regulated.txt, and Significant unique down regulated.txt – these files contain the detailed results of the data analysis process for the four classes of proteins, with the following columns:
    • Protein name – the name of the first member of each protein group.
    • Protein groups – the gi numbers of all the members of each protein group.
    • p-value – the p-value of the corresponding protein.
    • Molecular weight – the molecular weight of the corresponding protein in Da.
    • Sequence coverage – the sequence coverage of the corresponding protein in percent.
    • Fold change – the fold change of the corresponding protein.
    • Inverse ratio – the inverse of the fold change.
• Stock.txt – this text file contains all the columns that come out of Perseus.
All of the text files are in tab-separated format. These are easy to view; the user can drag and drop them into an MS Excel file.
The Perseus session that contains all of the steps is also saved, but it is not served with the default results on the website; it is provided if the client requests it. All of the files listed above, except the saved Perseus session, are available through the website as downloadable links.
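Because the text files are tab-separated, they can also be read programmatically. The following minimal sketch uses Python’s `csv` module; the function name is hypothetical, and the sample content is invented purely to show the layout, using column names from the list above.

```python
import csv
import io


def read_results(handle):
    """Read a tab-separated result file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Invented sample with three of the columns listed above.
sample = io.StringIO(
    "Protein name\tp-value\tFold change\n"
    "Example protein A\t0.01\t4.0\n"
)
rows = read_results(sample)
```

Each row is a dictionary keyed by column header, so `rows[0]["Fold change"]` gives the value as text, which can be converted with `float` where a number is needed.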
Conclusion
The iterative and time-consuming data analysis pipeline for label free quantitative interaction proteomics has been automated through WAPDAP. The pipeline involves the sequential use of the MaxQuant and Perseus software, followed by additional result processing to generate simpler results. This was accomplished mainly by using the ‘PyAutoit’ and ‘PypeR’ modules for the Python programming language. The Django Web framework was used to develop the web interface that links the process.
References
1. Hirosawa M, Hoshida M, Ishikawa M, Toya T. MASCOT: multiple alignment system for protein sequences based on three-way dynamic programming. Comput Appl Biosci. 1993; 9: 161-167.
2. Searle BC. Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics. 2010; 10: 1265-1269.
3. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol. 2008; 26: 1367-1372.
4. Aebersold R, Mann M. Mass-spectrometric exploration of proteome structure and function. Nature. 2016; 537: 347-355.
5. Bantscheff M, Schirle M, Sweetman G, Rick J, Kuster B. Quantitative mass spectrometry in proteomics: a critical review. Anal Bioanal Chem. 2007; 389: 1017-1031.
6. Cox J, Hein MY, Luber CA, Paron I, Nagaraj N, et al. Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol Cell Proteomics. 2014; 13: 2513-2526.
7. Nahnsen S, Bielow C, Reinert K, Kohlbacher O. Tools for label-free peptide quantification. Mol Cell Proteomics. 2013; 12: 549-556.
8. Cox J, Neuhauser N, Michalski A, Scheltema RA, Olsen JV, et al. Andromeda: a peptide search engine integrated into the MaxQuant environment. J Proteome Res. 2011; 10: 1794-1805.
9. Tyanova S, Temu T, Sinitcyn P, Carlson A, Hein MY, et al. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat Meth. 2016.
10. Pendarvis K, Kumar R, Burgess SC, Nanduri B. An automated proteomic data analysis workflow for mass spectrometry. BMC Bioinformatics. 2009; 10: S17.
11. Aiche S, Sachsenberg T, Kenar E, Walzer M, Wiswedel B, et al. Workflows for automated downstream data analysis and visualization in large-scale computational mass spectrometry. Proteomics. 2015; 15: 1443-1447.
12. Python 3.5.2 documentation. Python. https://docs.python.org/3/faq/general.html.
13. Gazoni E, Clark C. A Python library to read/write Excel 2010 xlsx/xlsm files. OpenPyXL. https://openpyxl.readthedocs.io/en/default/.
14. AutoIt. AUTOIT. https://www.autoitscript.com/site/autoit/.
15. Xia XQ, McClelland M, Wang Y. PypeR, A Python Package for Using R in Python. Journal of Statistical Software. 2010; 35: 1-8.
16. Riti P. What You Need to Know about Python. Birmingham: Packt Publishing Ltd. 2016; 52.
17. Django makes it easier to build better Web apps more quickly and with less code. Django. https://www.djangoproject.com/.
18. Otto M. Bootstrap from Twitter. Twitter. 2011. https://blog.twitter.com/2011/bootstrap-from-twitter.