Bioinformatics for Proteomics Tutorial 4.5 - Statistical Analysis

Statistical Analysis

The result of a proteomic quantitative analysis is a list of peptide and protein abundances in the different samples, or of abundance ratios between the samples. The downstream interpretation of these data varies according to the experimental design. In this chapter we will describe different generic methods for the interpretation of quantitative datasets. For this, we will use Perseus, a graphical user interface for the interpretation of quantitative biology data (and a part of the MaxQuant environment [1]).

The dataset used here for illustrative purposes is freely available through the ProteomeXchange [2] consortium via the PRIDE [3] partner repository under the accession number PXD000441, and available in the resources folder under Results_MQ_5cell-line-mix. It consists of five cell lines derived from acute myeloid leukemia (AML) patients, measured with a spiked-in internal standard (IS) obtained by combining the same five AML cell lines after metabolic labeling with heavy isotopes [4], and analyzed using MaxQuant version 1.4.1.2. The workflow presented here can also be followed in [5].

Perseus allows the design and navigation of data interpretation workflows. It offers numerous possibilities for data interpretation, so it takes time to become familiar with all the available features. With a small effort, however, you will be able to design complex workflows intuitively. Importantly, you will be able to visualize the data at any point, and design branches in your workflow for testing different possibilities. This chapter introduces the basic functions; it does not aim to cover all features or to present a reference workflow. Please continue exploring the software on your own and critically adapt the interpretation workflow to your experiment!

Note: MaxQuant and Perseus work only on 64-bit Windows.
Refer to the online instructions to make sure your system is compatible: http://coxdocs.org/doku.php?id=maxquant:common:download_and_installation

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 License. Elise Aasebø, Harald Barsnes and Marc Vaudel.

4.5.1 - Installation

You can download the Perseus software from the MaxQuant website: http://coxdocs.org/doku.php?id=maxquant:common:download_and_installation. Upon registration, a link to a zip file will be sent to you. Unzip the file and start Perseus. You should then see the following startup screen:

4.5.2 - File Import

Next, we will import the MaxQuant output files. Click the green arrow at the top left:

Next, use the Select button to select proteinGroups_5cell-line-mix.txt in the MaxQuant result folder. Note that Perseus assigns the MaxQuant data into different categories. Select the columns Ratio H/L normalized E1 to E5, and add them to Main using the '>' symbol.

When would you use the other output files? The other categories in this file? [4.5a]

You should see the following screen:

To the left, you can see a table with your data, called a matrix; in the middle, your workflow, consisting of a list of matrices; and to the right, metrics about your data.
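For readers who want to inspect the same columns outside Perseus, the import step can be sketched in a few lines of Python. This is only an illustration, not what Perseus does internally. It assumes that the MaxQuant output is a tab-separated text file, that protein groups are identified by a Protein IDs column, and that missing ratios are written as NaN or left empty. The function name read_main_columns and the two example rows are made up for this sketch.

```python
import csv
import io
import math

# Column names as listed in the import step of this tutorial.
RATIO_COLUMNS = [f"Ratio H/L normalized E{i}" for i in range(1, 6)]


def read_main_columns(handle, id_column="Protein IDs"):
    """Return {protein group id: [ratio E1..E5]}, using NaN for missing values.

    `id_column` is an assumption about the file layout; adapt it if needed.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        values = []
        for name in RATIO_COLUMNS:
            text = (row.get(name) or "").strip()
            # float("NaN") yields a NaN, empty cells are mapped to NaN as well.
            values.append(float(text) if text else math.nan)
        table[row[id_column]] = values
    return table


# Made-up excerpt standing in for proteinGroups_5cell-line-mix.txt.
example_lines = [
    "\t".join(["Protein IDs"] + RATIO_COLUMNS),
    "\t".join(["P00001", "0.5", "1.0", "2.0", "NaN", "4.0"]),
    "\t".join(["P00002", "1.0", "", "1.0", "1.0", "1.0"]),
]
main = read_main_columns(io.StringIO("\n".join(example_lines) + "\n"))
print(main["P00001"])
```

With the real file, the same function could be called on open(path, newline="", encoding="utf-8"), where path points to your copy of the MaxQuant result file.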
4.5.2 - Filtering and Rearranging Columns

Prior to further analysis, we will filter out the rows corresponding to decoy hits, contaminants, and proteins only identified by site.

How would these proteins affect the statistical analysis? [4.5b]

Go to Filter rows → Filter rows based on categorical column. Make sure that Only identified by site is selected at the top and click OK. Repeat the operation for Contaminant and Reverse. You should have the following screen:

Note how the workflow expands after each operation.

Note: You can navigate the matrices, go back in your workflow and branch to try different data interpretation strategies.

Remove the empty columns by selecting Rearrange → Remove empty columns. Rename the columns via Rearrange → Rename columns according to the table below. The new column names correspond to the different cell lines used:

Original Name              Sample Name   Condition
Ratio H/L normalized E1    Molm-13       Relapse
Ratio H/L normalized E2    MV4-11        Diagnosis
Ratio H/L normalized E3    NB4           Relapse
Ratio H/L normalized E4    OCI-AML3      Diagnosis
Ratio H/L normalized E5    THP-1         Relapse

As indicated in the table, some of the cell lines were derived from patients at diagnosis and others at relapse.
We are going to compare these two conditions, and to do this we need to categorize the samples accordingly. Go to Annot. rows → Categorical annotations rows and set the group of every cell line to Diagnosis or Relapse as indicated below.

Look at the row corresponding to the protein group of A0FGR8 at the top. What are the ratios between the different cell lines? [4.5c]

Look at the row corresponding to the protein group of A0JLT2 at the top. What are the ratios between the different cell lines? How can this be explained? [4.5c]

We are now going to filter out proteins with missing values. Use Filter rows → Filter rows based on valid values. Select a minimum number of two values in each group.

Where do missing values come from? How do they affect the results? How do you set the number of allowed missing values? [4.5d]

Use the default in this case; it will filter out proteins with missing values in at least three cell lines.

The MaxQuant ratios are displayed as Heavy/Light. We are going to invert these ratios in order to have Light/Heavy. For this, use Basic → Transform, and write 1/(x) in the Transformation field as displayed below.

What do Heavy and Light correspond to? [4.5e]

Finally, go to Basic → Transform and type log2(x) as displayed below.

Why do we log transform the ratios before further processing? [4.5f]
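The valid-value filter and the two transformations above can be sketched as follows. This is a minimal illustration and not the Perseus implementation. It assumes that each row holds the five ratios in the order E1 to E5, that missing values are represented as NaN, and that the groups follow the sample table given earlier. The names GROUPS, enough_valid and to_log2_light_over_heavy, as well as the example rows, are hypothetical.

```python
import math

# Column indices (E1..E5 -> 0..4) per condition, following the sample table.
GROUPS = {"Diagnosis": [1, 3], "Relapse": [0, 2, 4]}


def enough_valid(values, min_valid=2):
    """True if every group has at least `min_valid` non-missing ratios."""
    return all(
        sum(not math.isnan(values[i]) for i in indices) >= min_valid
        for indices in GROUPS.values()
    )


def to_log2_light_over_heavy(ratio_hl):
    """Invert a Heavy/Light ratio (1/x) and log2-transform it; NaN stays NaN."""
    if math.isnan(ratio_hl) or ratio_hl <= 0:
        return math.nan
    return math.log2(1.0 / ratio_hl)


# Made-up Heavy/Light ratios for two protein groups.
rows = {
    "A": [0.5, 1.0, 2.0, 4.0, math.nan],  # one missing value in Relapse
    "B": [0.5, math.nan, 2.0, 4.0, 1.0],  # one missing value in Diagnosis
}

processed = {
    name: [to_log2_light_over_heavy(v) for v in values]
    for name, values in rows.items()
    if enough_valid(values)
}
print(processed)
```

In this example row B is dropped, because only one of its two Diagnosis values is present, while row A is kept. After the transformation, a two-fold change in either direction is the same distance from zero (+1 and -1).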
4.5.3 - Normalization

Go to Visualization → Histogram, and leave the settings at the defaults. Next, in Data, select Histogram:

What do you think of the distributions of ratios? Would you perform additional normalization? [4.5g]

Go to Normalization → Subtract. Choose Columns in the Matrix access field. Leave the other settings at the defaults and click OK. Redraw the histograms:

How did the distributions of ratios change? Why did we subtract and not divide? [4.5h]

4.5.4 - Student's t-test

The Student's t-test can be used to determine whether the means of two sets of data differ significantly. Here we will perform the test for every protein and compare the ratios at diagnosis versus relapse. Select Tests → Two-sample tests.

What are the available tests? Why select both sides? What do the truncation parameters refer to? [4.5i]

Leave the settings at the defaults. You will see the results of the test to the right of the table.

How many times did we perform the test? What is the lowest p-value? How many significant proteins would you retain? [4.5j]

4.5.5 - Volcano Plots

A volcano plot displays the significance of every protein against its ratio, which helps you spot the most promising significant proteins. Go to Misc. → Volcano plot. Note that the volcano plot appears under the Test tab of the current matrix.
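To make the two axes of a volcano plot concrete, the sketch below computes one point for a single protein from its log2 ratios. It is an illustration under several assumptions and not the Perseus implementation: it uses the pooled-variance Student's t statistic, it only handles a protein without missing values (two Diagnosis and three Relapse values, hence three degrees of freedom, for which the t distribution has a simple closed-form cumulative distribution), and it defines the difference as Diagnosis minus Relapse. All function names and example values are hypothetical.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def two_sided_p_df3(t):
    """Two-sided p-value of a t statistic with exactly 3 degrees of freedom.

    Uses the closed-form CDF for df = 3; other df need a statistics library.
    """
    u = abs(t) / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)


def volcano_point(diagnosis, relapse):
    """Return (difference of mean log2 ratios, -log10 p) for one protein."""
    t, df = student_t(diagnosis, relapse)
    if df != 3:
        raise ValueError("this sketch only covers 2 + 3 complete values")
    return mean(diagnosis) - mean(relapse), -math.log10(two_sided_p_df3(t))


# Made-up log2 ratios for one protein.
x, y = volcano_point(diagnosis=[1.0, 1.2], relapse=[-0.1, 0.0, 0.1])
print(round(x, 3), round(y, 3))
```

The horizontal axis is the size of the change and the vertical axis its significance, so a point far from zero on the horizontal axis and high on the vertical axis is both large and unlikely under the null hypothesis. The sketch applies no correction for testing many proteins at once.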
How do you interpret this plot? How many significant proteins would you retain? [4.5k]

4.5.6 - Value Imputation

Some data interpretation methods do not tolerate missing values. We are now going to assume that the missing values are due to the protein amounts being below the limit of detection, and replace them with artificial low values [6, 7]. Go to Imputation → Replace missing values from normal distribution. Leave the values at the defaults and draw the histograms again:

Above the table in the Points tab, change Selection from table to Selection from imputation. The imputed values are now displayed in red.

How are the imputed values distributed? Would you trust a protein assumed regulated on the basis of imputation? [4.5l]

4.5.7 - Principal Component Analysis (PCA)

Principal component analysis (PCA) projects the samples onto a few axes that capture most of the variance, so that samples with similar profiles group together. This method requires a dataset without missing values, so it is important to conduct filtering or imputation beforehand. Select Principal component analysis under Clustering/PCA.

What are the most similar cell lines in terms of protein expression? In which situation would such plots be useful? [4.5m]

4.5.8 - Hierarchical Clustering

Another useful way to identify clusters in complex datasets is to conduct hierarchical clustering. Click on Clustering/PCA and Hierarchical Clustering.

What are the most similar cell lines in terms of protein expression?
Is this in agreement with the PCA plot? [4.5m]

Note that you can select clusters of interest and visualize the associated profiles.

In order to better cluster proteins with similar profiles, you can normalize the expression values within each row. Select Z-score under Normalization and perform hierarchical clustering again:

References

1. Cox, J. & Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology 26, 1367-1372 (2008).
2. Vizcaino, J.A. et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nature Biotechnology 32, 223-226 (2014).
3. Martens, L. et al. PRIDE: the proteomics identifications database. Proteomics 5, 3537-3545 (2005).
4. Aasebø, E. et al. Performance of super-SILAC based quantitative proteomics for comparison of different acute myeloid leukemia (AML) cell lines. Proteomics 14, 1971-1976 (2014).
5. Aasebø, E., Berven, F.S., Selheim, F., Barsnes, H. & Vaudel, M. Interpretation of Quantitative Shotgun Proteomic Data. Methods in Molecular Biology 1394, 261-273 (2016).
6. Webb-Robertson, B.J. et al. Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics. Journal of Proteome Research 14, 1993-2001 (2015).
7. Lazar, C., Gatto, L., Ferro, M., Bruley, C. & Burger, T. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies. Journal of Proteome Research 15, 1116-1125 (2016).