Statistical Analysis

Bioinformatics for Proteomics Tutorial
4.5 - Statistical Analysis
The result of a proteomic quantitative analysis is a list of peptide and protein abundances for every
protein in different samples, or abundance ratios between the samples. The downstream interpretation
of these data varies according to the experimental design. In this chapter we will describe different
generic methods for the interpretation of quantitative datasets.
For this, we will use Perseus, a graphical user interface for the interpretation of quantitative biology
data that is part of the MaxQuant environment [1]. The dataset used here for illustrative purposes is
freely available through the ProteomeXchange [2] consortium via the PRIDE [3] partner repository under the
accession number PXD000441, and is also available in the resources folder under Results_MQ_5cell-line-mix.
It consists of five cell lines derived from acute myeloid leukemia (AML) patients, measured with a spiked-in
internal standard (IS) obtained by combining the same five AML cell lines after metabolic labeling with
heavy isotopes [4], and analyzed using MaxQuant version 1.4.1.2. The workflow presented here can also be
followed in [5].
Perseus allows the design and navigation of data interpretation workflows. It offers numerous
possibilities for data interpretation and thus it will require time to get familiar with all the available
features. But with a small effort you will be able to design complex workflows intuitively. Importantly,
you will be able to visualize the data at any point, and design branches in your workflow for testing
different possibilities.
This chapter introduces the basic functions, and by no means aims at covering all features or
presenting a reference workflow. Please continue exploring the software on your own and critically adapt
the interpretation workflow to your experiment!
Note:
MaxQuant and Perseus only run on 64-bit Windows.
Refer to the online instructions to make sure your system is compatible:
http://coxdocs.org/doku.php?id=maxquant:common:download_and_installation
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 License.
Elise Aasebø ([email protected]), Harald Barsnes ([email protected]) and Marc Vaudel
([email protected])
4.5.1 - Installation
You can download the Perseus software from the MaxQuant website:
http://coxdocs.org/doku.php?id=maxquant:common:download_and_installation.
Upon registration, a link to a zip file will be sent to you. Extract the archive and start Perseus. You
should then see the following startup screen:
4.5.2 - File Import
Next, we will import the MaxQuant output files. Click the green arrow at the top left:
Next, use the Select button to select proteinGroups_5cell-line-mix.txt in the MaxQuant result folder.
Note that Perseus sorts the MaxQuant columns into different categories. Select the columns Ratio H/L
normalized E1 to E5, and add them to Main using the '>' symbol.
When would you use the other output files? The other categories in this file? [4.5a]
You should see the following screen:
To the left, you can see a table with your data, called a matrix, in the middle your workflow, consisting
of a list of matrices, and to the right metrics about your data.
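For readers who want to inspect the same file outside Perseus, the import step can be sketched in Python. This is only an illustration under stated assumptions, not what Perseus does internally: it assumes the MaxQuant table is tab-separated with a single header row, and the function name read_main_columns as well as the example values are hypothetical.

```python
import csv
import io

# Main columns named in the tutorial; E1 to E5 are the five experiments.
MAIN_COLUMNS = [f"Ratio H/L normalized E{i}" for i in range(1, 6)]


def read_main_columns(handle, main_columns=MAIN_COLUMNS):
    """Read a tab-separated table and convert the main columns to floats.

    Empty, absent, or non-numeric cells become float('nan').
    """
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        for name in main_columns:
            try:
                record[name] = float(record.get(name) or "nan")
            except ValueError:
                record[name] = float("nan")
        rows.append(record)
    return rows


# Tiny in-memory stand-in for proteinGroups_5cell-line-mix.txt (made-up values).
example = io.StringIO(
    "Protein IDs\t" + "\t".join(MAIN_COLUMNS) + "\n"
    "P1\t1.2\t0.8\tNaN\t1.0\t\n"
)
table = read_main_columns(example)
```

To read the real file, the in-memory example would be replaced by an open file handle, for instance open(path, newline="").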
4.5.2 - Filtering and Rearranging Columns
Prior to further analysis, we will now filter out the rows corresponding to decoy hits, contaminants, and
proteins only identified by site.
How would these proteins affect the statistical analysis? [4.5b]
Go to Filter rows → Filter rows based on categorical column. Make sure that Only identified by site is
selected at the top and click OK. Repeat the operation for Contaminant and Reverse. You should have
the following screen:
Note how the workflow expands after each operation.
Note:
You can navigate the matrices, go back in your workflow and branch to try different data
interpretation strategies.
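The categorical filtering above can be sketched as follows. The sketch assumes that flagged rows carry a "+" in the corresponding column, as is common in MaxQuant output; the function name remove_flagged_rows and the example rows are hypothetical.

```python
# Categorical columns named in the tutorial.
FLAG_COLUMNS = ["Only identified by site", "Contaminant", "Reverse"]


def remove_flagged_rows(rows, flag_columns=FLAG_COLUMNS, flag="+"):
    """Keep only the rows that carry no flag in any of the given columns."""
    kept = []
    for row in rows:
        flagged = any(
            (row.get(column) or "").strip() == flag for column in flag_columns
        )
        if not flagged:
            kept.append(row)
    return kept


# Made-up rows: one regular protein, one contaminant, one decoy hit.
rows = [
    {"Protein IDs": "P1", "Only identified by site": "", "Contaminant": "", "Reverse": ""},
    {"Protein IDs": "CON__P2", "Only identified by site": "", "Contaminant": "+", "Reverse": ""},
    {"Protein IDs": "REV__P3", "Only identified by site": "", "Contaminant": "", "Reverse": "+"},
]
filtered = remove_flagged_rows(rows)
```

Perseus applies the three filters one after the other, which adds three matrices to the workflow; the sketch applies them in a single pass, with the same rows remaining at the end.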
Remove the empty columns by selecting Rearrange → Remove empty columns. Rename the columns via
Rearrange → Rename columns according to the table below. The column names correspond to the
different cell lines used:
Original Name              Sample Name   Condition
Ratio H/L normalized E1    Molm-13       Relapse
Ratio H/L normalized E2    MV4-11        Diagnosis
Ratio H/L normalized E3    NB4           Relapse
Ratio H/L normalized E4    OCI-AML3      Diagnosis
Ratio H/L normalized E5    THP-1         Relapse
As indicated in the table, the cell lines were derived from patients either at diagnosis or at relapse. We
are going to compare these two conditions, and to do this we need to categorize the samples
accordingly. Go to Annot. rows → Categorical annotations rows and assign every cell line to the group
Diagnosis or Relapse as indicated below.
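The renaming and grouping can also be written down as a small lookup structure, which is handy when repeating the analysis in a script. The mapping below is copied from the table above; the names ANNOTATION, GROUPS and rename_columns are hypothetical.

```python
# Original column -> (sample name, condition), as listed in the table.
ANNOTATION = {
    "Ratio H/L normalized E1": ("Molm-13", "Relapse"),
    "Ratio H/L normalized E2": ("MV4-11", "Diagnosis"),
    "Ratio H/L normalized E3": ("NB4", "Relapse"),
    "Ratio H/L normalized E4": ("OCI-AML3", "Diagnosis"),
    "Ratio H/L normalized E5": ("THP-1", "Relapse"),
}

# Condition -> list of sample names, the equivalent of the categorical annotation row.
GROUPS = {}
for sample, condition in ANNOTATION.values():
    GROUPS.setdefault(condition, []).append(sample)


def rename_columns(row):
    """Return the ratio columns of one row keyed by sample name."""
    return {sample: row[column] for column, (sample, _) in ANNOTATION.items()}
```

With this mapping, the Diagnosis group contains two cell lines and the Relapse group three, which matters for the tests performed later in the chapter.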
Look at the row corresponding to the protein group of A0FGR8 at the top.
What are the ratios between the different cell lines? [4.5c]
Look at the row corresponding to the protein group of A0JLT2 at the top.
What are the ratios between the different cell lines? How can this be explained? [4.5c]
We are now going to filter out proteins with missing values. Use Filter rows → Filter rows based on valid
values. Select a minimum number of two values in each group.
Where do missing values come from? How do they affect the results? How do you set the number of
allowed missing values? [4.5d]
Use the default in this case; it will filter out proteins with missing values in at least three cell lines.
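The effect of the default filter can be sketched as follows, assuming that missing values are stored as NaN and that a protein is kept when at least three of the five cell lines have a valid value. The function names and the example values are hypothetical.

```python
import math


def count_valid(values):
    """Count the values that are neither None nor NaN."""
    return sum(1 for v in values if v is not None and not math.isnan(v))


def filter_by_valid_values(rows, columns, min_valid=3):
    """Keep the rows with at least min_valid valid values across the given columns."""
    return [
        row for row in rows
        if count_valid(row[column] for column in columns) >= min_valid
    ]


nan = float("nan")
samples = ["Molm-13", "MV4-11", "NB4", "OCI-AML3", "THP-1"]
proteins = [
    dict(zip(["id"] + samples, ["P1", 1.0, 1.1, 0.9, 1.2, 1.0])),  # 5 valid values
    dict(zip(["id"] + samples, ["P2", 1.0, nan, 0.9, nan, 1.0])),  # 3 valid values
    dict(zip(["id"] + samples, ["P3", nan, nan, 0.9, nan, 1.0])),  # 2 valid values
]
kept = filter_by_valid_values(proteins, samples)
```

In this example P1 and P2 are kept, while P3 is removed because it is missing in three cell lines.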
The MaxQuant ratios are displayed as Heavy/Light; we are going to invert these ratios in order to have
Light/Heavy. For this, use Basic → Transform, and write 1/(x) in the Transformation field as displayed
below.
What do Heavy and Light correspond to? [4.5e]
Finally, go to Basic → Transform and type log2(x) as displayed below.
Why do we log transform the ratios before further processing? [4.5f]
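The two transformations can be chained in a few lines of Python. This is a minimal sketch with a hypothetical function name; it assumes that missing or non-positive ratios should stay missing.

```python
import math


def to_log2_light_over_heavy(ratio_heavy_over_light):
    """Invert a Heavy/Light ratio and take the base-2 logarithm of the result."""
    x = ratio_heavy_over_light
    if math.isnan(x) or x <= 0:
        return float("nan")  # no valid ratio, keep the value missing
    return math.log2(1.0 / x)


# A Heavy/Light ratio of 2 becomes a Light/Heavy ratio of 0.5, i.e. -1 in log2.
examples = [to_log2_light_over_heavy(x) for x in (2.0, 1.0, 0.5, 0.25)]
```

Applying 1/(x) and then log2(x) gives the same value as computing -log2(x) directly on the original Heavy/Light ratio.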
4.5.3 - Normalization
Go to Visualization → Histogram, and leave the settings at their defaults. Next, in Data, select Histogram:
What do you think of the distributions of ratios? Would you perform additional normalization? [4.5g]
Go to Normalization → Subtract. Choose Columns in the Matrix access field.
Leave the other settings as the defaults and click OK. Redraw the histograms:
How did the distributions of ratios change? Why did we subtract and not divide? [4.5h]
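Column-wise subtraction can be sketched as below. The sketch assumes that the median of each column is the value being subtracted; check the statistic selected in the Perseus dialog, since other choices are possible. The function name is hypothetical.

```python
import math
import statistics


def subtract_column_median(column):
    """Center one sample (column of log2 ratios) on zero, ignoring missing values."""
    valid = [v for v in column if not math.isnan(v)]
    center = statistics.median(valid)
    # NaN minus a number is still NaN, so missing values stay missing.
    return [v - center for v in column]


centered = subtract_column_median([1.0, 2.0, float("nan"), 6.0])
```

After this step, every sample has a median log2 ratio of zero.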
4.5.4 - Student's t-test
The Student's t-test can be used to determine whether two sets of data are significantly different. Here
we will perform the test for every protein and compare the ratios at diagnosis versus relapse. Select
Tests → Two-sample tests.
What are the available tests? Why select both sides? What do the truncation parameters refer to? [4.5i]
Leave the settings at the default; you will see the results of the test to the right of the table.
How many times did we perform the test? What is the lowest p-value? How many significant proteins
would you retain? [4.5j]
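To make the test less of a black box, here is a sketch of a two-sided Student's t-test for a single protein using only the Python standard library. It assumes the equal-variance form of the test and obtains the p-value by numerically integrating the t density with Simpson's rule; Perseus may compute it differently. All function names and example values are hypothetical.

```python
import math
import statistics


def t_density(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """p = 1 - 2 * integral of the density from 0 to |t| (Simpson's rule, even steps)."""
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


def student_t_test(group_a, group_b):
    """Equal-variance two-sample t-test; returns (t, degrees of freedom, p-value).

    Each group needs at least two values, and the pooled variance must be non-zero.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / df
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    t = difference / math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return t, df, two_sided_p(t, df)


# Made-up log2 ratios for one protein: two Diagnosis and three Relapse cell lines.
t, df, p = student_t_test([0.9, 1.1], [-0.2, 0.1, 0.0])
```

In Perseus, this computation is repeated for every row of the matrix, with the two groups taken from the categorical annotation row.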
4.5.5 - Volcano Plots
A volcano plot displays the significance of each protein against its ratio, making it easier to spot the
most promising significant proteins. Go to Misc. → Volcano plot.
Note that the volcano plot appears under the Test tab of the current matrix.
How do you interpret this plot? How many significant proteins would you retain? [4.5k]
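The coordinates of one point in a volcano plot can be sketched as follows, assuming the usual convention of plotting the difference between the group means of the log2 ratios on the horizontal axis and -log10 of the p-value on the vertical axis. The function name and values are hypothetical, and the p-value is taken as given.

```python
import math
import statistics


def volcano_point(group_a, group_b, p_value):
    """Return (difference of mean log2 ratios, -log10 p-value) for one protein."""
    difference = statistics.mean(group_a) - statistics.mean(group_b)
    return difference, -math.log10(p_value)


# Made-up example: one log2 unit of difference with a p-value of 0.01.
x, y = volcano_point([1.0, 1.0], [0.0, 0.0, 0.0], 0.01)
```

Proteins with a large difference and a low p-value therefore end up in the upper left and upper right corners of the plot.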
4.5.6 - Value Imputation
Some data interpretation methods are not tolerant towards missing values. We are now going to
assume that the missing values are due to the protein amounts being below the limit of detection and
replace the missing values with artificially low values [6, 7]. Go to Imputation → Replace missing values
from normal distribution. Leave the values at the defaults and draw the histograms again:
Above the table in the Points tab, change Selection from table to Selection from imputation. The
imputed values are now displayed in red.
How are the imputed values distributed? Would you trust a protein assumed regulated on the basis of
imputation? [4.5l]
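The idea behind this imputation can be sketched for a single sample as follows. The sketch assumes that the replacement values are drawn from a normal distribution that is narrower than the distribution of the measured values and shifted towards low values; the parameter values width=0.3 and down_shift=1.8 (both in units of the standard deviation of the valid values) are assumptions to be checked against the Perseus dialog. The function name is hypothetical.

```python
import math
import random
import statistics


def impute_from_normal(column, width=0.3, down_shift=1.8, rng=None):
    """Replace the NaN entries of one column with draws from a down-shifted normal.

    The column needs at least two valid values to estimate a standard deviation.
    """
    rng = rng or random.Random(0)  # fixed seed by default, for reproducibility
    valid = [v for v in column if not math.isnan(v)]
    mean = statistics.mean(valid)
    sd = statistics.stdev(valid)
    return [
        rng.gauss(mean - down_shift * sd, width * sd) if math.isnan(v) else v
        for v in column
    ]


imputed = impute_from_normal([1.0, 2.0, float("nan"), 3.0, float("nan")])
```

The measured values are left untouched; only the missing entries receive a random low value, which is why results depending on imputed values should be interpreted with care.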
4.5.7 - Principal Component Analysis (PCA)
Principal component analysis (PCA) projects the data onto a few axes so that the samples with the highest
similarity group together. This method requires no missing values, so it is important to conduct filtering
or imputation beforehand.
Select Principal component analysis under Clustering/PCA.
What are the most similar cell lines in terms of protein expression? In which situation would such plots be
useful? [4.5m]
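As an illustration of what is plotted, the position of each sample along the first principal component can be sketched with the standard library only. The sketch assumes that samples are the points and proteins the features, centers every protein, and uses power iteration on the sample-by-sample matrix of inner products; it is not the Perseus implementation, and the function name is hypothetical. The sign of a principal component is arbitrary.

```python
import math


def first_pc_scores(samples, iterations=200):
    """Scores of each sample on the first principal component.

    samples: one list of values per sample, all of the same length and without
    missing values. The samples must not all be identical.
    """
    n, p = len(samples), len(samples[0])
    feature_means = [sum(s[j] for s in samples) / n for j in range(p)]
    centered = [[s[j] - feature_means[j] for j in range(p)] for s in samples]
    # Inner products between centered samples (n x n, small even with many proteins).
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    vector = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        product = [sum(g * v for g, v in zip(row, vector)) for row in gram]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient: the leading eigenvalue, i.e. the squared singular value.
    eigenvalue = sum(
        v * sum(g * w for g, w in zip(row, vector)) for v, row in zip(vector, gram)
    )
    return [v * math.sqrt(eigenvalue) for v in vector]


# Three made-up samples lying on a straight line in a two-protein space.
scores = first_pc_scores([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
```

Samples with similar scores on the first components have similar profiles and appear close to each other in the PCA plot.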
4.5.8 - Hierarchical Clustering
Another useful way to identify clusters in complex datasets is to conduct hierarchical clustering. Click on
Clustering/PCA and Hierarchical Clustering.
What are the most similar cell lines in terms of protein expression? Is this in agreement with the PCA
plot? [4.5m]
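The principle of hierarchical clustering can be sketched on a handful of profiles. The sketch assumes Euclidean distances and average linkage; the settings actually used are shown in the Perseus dialog. Function names and example values are hypothetical, and this naive implementation is only meant for small inputs such as the five cell lines.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges in the order they happen.

    profiles: dict mapping a name to a list of values without missing values.
    Each merge is (cluster_a, cluster_b, distance), clusters being tuples of names.
    """

    def linkage(pair):
        a, b = pair
        total = sum(euclidean(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up profiles: A and B are close to each other, C is far away.
merges = average_linkage({"A": [0.0, 0.0], "B": [0.0, 1.0], "C": [5.0, 5.0]})
```

The order and the distances of the merges correspond to the branches of the dendrogram: the earlier two clusters merge, the more similar they are.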
Note that you can select clusters of interest and visualize the associated profiles.
In order to better cluster proteins with similar profiles, you can normalize the expression values for each
line. Select Z-score under Normalization and perform hierarchical clustering again:
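The Z-score normalization can be sketched for one protein as follows. The sketch assumes that the normalization is applied per row, that is to the values of one protein across the cell lines, and that the sample standard deviation is used; both are assumptions to be checked in the Perseus dialog. The function name is hypothetical.

```python
import statistics


def z_score(values):
    """Center the values on zero and scale them to unit standard deviation.

    Needs at least two values that are not all identical.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


normalized = z_score([1.0, 2.0, 3.0])
```

After this step, two proteins with the same profile shape but different amplitudes have the same values, so they end up in the same cluster.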
References
1. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range
   mass accuracies and proteome-wide protein quantification. Nature Biotechnology 26, 1367-1372 (2008).
2. Vizcaino, J.A. et al. ProteomeXchange provides globally coordinated proteomics data submission
   and dissemination. Nature Biotechnology 32, 223-226 (2014).
3. Martens, L. et al. PRIDE: the proteomics identifications database. Proteomics 5, 3537-3545 (2005).
4. Aasebø, E. et al. Performance of super-SILAC based quantitative proteomics for comparison of
   different acute myeloid leukemia (AML) cell lines. Proteomics 14, 1971-1976 (2014).
5. Aasebø, E., Berven, F.S., Selheim, F., Barsnes, H. & Vaudel, M. Interpretation of Quantitative
   Shotgun Proteomic Data. Methods in Molecular Biology 1394, 261-273 (2016).
6. Webb-Robertson, B.J. et al. Review, evaluation, and discussion of the challenges of missing value
   imputation for mass spectrometry-based label-free global proteomics. Journal of Proteome
   Research 14, 1993-2001 (2015).
7. Lazar, C., Gatto, L., Ferro, M., Bruley, C. & Burger, T. Accounting for the Multiple Natures of
   Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation
   Strategies. Journal of Proteome Research 15, 1116-1125 (2016).