Gene expression analysis
Curtis Huttenhower
Slides courtesy of: Amy Caudy (Princeton), Gavin Sherlock (Stanford), Matt Hibbs (Jackson Labs), Florian Markowetz (Cancer Research UK), Olga Troyanskaya (Princeton)
Harvard School of Public Health, Department of Biostatistics
03-26-14

Gene expression analyses
• Unsupervised
– Clustering (class discovery)
– Ordination
– Coexpression (network construction)
• Supervised
– Differential expression (class comparison)
– Biomarker discovery (class prediction)

Supervised analysis = learning from examples (classification)
– We have already seen groups of healthy and sick people. Now let's diagnose the next person walking into the hospital.
– We know that these genes have function X (and these others don't). Let's find more genes with function X.
– We know many gene pairs that are functionally related (and many more that are not). Let's extend the number of known related gene pairs.
Known structure in the data needs to be generalized to new data.

Unsupervised analysis = pattern finding
– Are there groups of genes that behave similarly in all conditions?
– Disease X is very heterogeneous. Can we identify more specific subclasses for more targeted treatment?
– What are the major patterns of variation (genes, pathways) active under these conditions?
No structure is known; we first need to find it. Exploratory analysis.

Supervised analysis
[Cartoon: "Calvin, I still don't know the difference between cats and dogs…" "Don't worry! I'll show you once more: Class 1: cats. Class 2: dogs." "Oh, now I get it!!"]

Unsupervised analysis
[Cartoon: "Calvin, I still don't know the difference between cats and dogs…" "I don't know it either. Let's try to figure it out together…"]

Unsupervised analysis: clustering

Visualizing Data
[Figure: raw and log-ratio expression profiles of MAK16 (YAL025C) and ACH1 (YBL015W) across culture densities (OD 0.26–7.30), alongside a clustered gene list including HSP26, HSP30, and NHP2]

Visualizing Data (cont.)
[Figure: "Expression During Sporulation": log ratio vs. time (hours) for ~50 gene profiles plotted together]

What is clustering?
• Reordering of the gene (or experiment) expression vectors in a dataset so that similar patterns are next to each other (or in separate groups)
• Identifying subsets of genes (or experiments) that are related by some measure

Why cluster?
• Dimensionality reduction: genes-by-conditions datasets are too large to extract information from without reorganizing the data
• "Guilt by association": if unknown gene i is similar in expression to known gene j, maybe they are involved in the same or a related pathway

Clustering Techniques
• Algorithm (Method)
– Hierarchical
– K-means
– Self-Organizing Maps
– QT-Clustering
– NNN
– …
• Distance Metric
– Euclidean (L2)
– Pearson Correlation
– Spearman Correlation
– Manhattan (L1)
– Kendall's τ
– …

Distance Measures
• The choice of distance measure is important for most clustering techniques
• Pairwise measures compare vectors of numbers, e.g. genes x and y, each with n measurements
• Common choices: Euclidean distance, Pearson correlation, Spearman correlation (sketched in code below)
[Figure: example profiles illustrating how Euclidean distance, Pearson correlation, and Spearman correlation each judge similarity]
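To make the three common measures concrete, here is a minimal Python sketch (not from the slides; the two expression vectors are made-up examples) using NumPy and SciPy:

```python
# Minimal sketch of the three pairwise distance measures listed above;
# the expression vectors x and y are made-up examples.
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr, spearmanr

x = np.array([0.5, 1.2, -0.3, 2.1, 0.8])   # expression profile of gene x
y = np.array([0.6, 1.0, -0.1, 1.9, 1.1])   # expression profile of gene y

# Euclidean (L2) distance: sensitive to both shape and magnitude
d_euc = euclidean(x, y)

# Pearson correlation: linear similarity of shape, ignoring scale;
# 1 - r converts the similarity into a distance
r, _ = pearsonr(x, y)
d_pearson = 1 - r

# Spearman correlation: Pearson on ranks, robust to outliers and to any
# monotone transformation of the data
rho, _ = spearmanr(x, y)
d_spearman = 1 - rho

print(f"Euclidean: {d_euc:.3f}  Pearson dist: {d_pearson:.3f}  Spearman dist: {d_spearman:.3f}")
```

Note how the two correlation-based distances are small for profiles with the same shape even when their magnitudes differ, whereas Euclidean distance penalizes magnitude differences directly.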
Hierarchical clustering
• Imposes a (pairwise) hierarchical structure on all of the data
• Often good for visualization
• Basic method (agglomerative):
1. Calculate all pairwise distances
2. Join the closest pair
3. Calculate the new pair's distance to all others
4. Repeat from step 2 until everything is joined
[Figure: successive slides animating the agglomerative joining steps]

Single Linkage Clustering
• Uses the nearest neighbor: inter-cluster distance is the distance between the two closest points
• This method produces long chains, which form straggly clusters

Complete Linkage Clustering
• Uses the furthest neighbor: inter-cluster distance is the distance between the two most distant points
• This method tends to produce very tight clusters of similar patterns

Average Linkage Clustering
• Uses the average of the pairwise distances between clusters (only shown for two cases in the slide figure)

Centroid Linkage Clustering
• Uses the distance between cluster centroids
[Figure: in the linkage illustrations, the red and blue '+' signs mark the centroids of the two clusters]

Hierarchical clustering: problems
• Highly sensitive to the similarity measure and algorithm
• Clusters everything, making distinct clusters hard to define
• Genes are assigned to clusters on the basis of all experiments
• Optimizing the node ordering is hard (finding the optimal solution is NP-hard)
• Can be driven by one strong cluster, a problem for gene expression because data in row space are often highly correlated

K-means Clustering
• Groups genes into a pre-defined number of independent clusters
• Basic algorithm:
1. Define k = number of clusters
2. Randomly initialize each cluster with a seed (often a random gene)
3. Assign each gene to the cluster with the most similar seed
4. Recalculate all cluster seeds as the means (or medians) of the genes assigned to each cluster
5. Repeat steps 3 & 4 until convergence (e.g. no genes move, means don't change much)
[Figure: successive slides animating K-means assignment and seed updates]

K-means: problems
• Have to set k ahead of time
– Ways to choose an "optimal" k: minimize within-cluster variation compared to random or held-out data
• You'll get k clusters whether they exist or not
• Each gene belongs to exactly one cluster
• One cluster has no influence on the others (one-dimensional clustering)
• Genes are assigned to clusters on the basis of all experiments

Can a gene belong to N clusters?
• Fuzzy clustering: each gene's relationship to a cluster is probabilistic (e.g. membership 0.85 in one cluster and 0.15 in another)
• A gene can belong to many clusters
• More biologically realistic, but harder to get to work well and fast
• Harder to interpret

Advanced clustering methods
• Fuzzy clustering
• Clustering with resampling
• Biclustering
• Clustering based on physical properties (spring models, "attraction of points")
• Dimensionality reduction

Clustering Tools
• TIGR MeV
– http://www.tm4.org/mev.html
• Sleipnir
– http://huttenhower.sph.harvard.edu/sleipnir
• Cluster & JavaTreeView
– http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
– http://jtreeview.sourceforge.net/
• CLICK & EXPANDER
– http://www.cs.tau.ac.il/~rshamir/expander/expander.html
Never underestimate the power of Excel in conjunction with Python!
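In that spirit, a minimal Python sketch of the two workhorse algorithms above: agglomerative hierarchical clustering via SciPy and k-means via scikit-learn. The toy data, the choice of average linkage with correlation distance, and k = 2 are illustrative assumptions, not settings from the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy expression matrix: 20 "genes" (rows) x 6 "conditions" (columns),
# built from two planted groups so the clusters are recoverable
data = np.vstack([rng.normal(0, 1, (10, 6)), rng.normal(3, 1, (10, 6))])

# Agglomerative hierarchical clustering: average linkage on a
# correlation-based distance (1 - Pearson r), then cut into 2 clusters
tree = linkage(data, method="average", metric="correlation")
hier_labels = fcluster(tree, t=2, criterion="maxclust")

# K-means with k fixed ahead of time (one of its main drawbacks)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print("hierarchical:", hier_labels)
print("k-means:     ", km.labels_)
```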
Exercise Caution
• Typically, when constructing a microarray dataset, filters are applied to retain only the "interesting" genes.
• Clustering imposes an ordering/cluster structure on the genes whether one exists or not.
• This is accentuated by filtering out genes.

An Example
[Figure: clustering of a complete, unfiltered dataset (Bryan, 2004)]

The Result of Filtering
Groups in the data, which didn't exist in the full dataset, suddenly appear! (Bryan, 2004)

Cluster Evaluation
• Mathematical consistency
– Compare the coherency of clusters to background
• Look for functional consistency in clusters
– Requires a gold standard, often based on GO, KEGG, MSigDB, etc.
– ROC curves/AUC or precision/recall
• Evaluate the likelihood of enrichment in clusters
– Hypergeometric distribution, etc.
– Several tools available (DAVID, GSEA, others)

Gene set overlap
• Inputs:
– One (or more) result set(s) of discrete ("hard") clusters
– One (or more) characterized gene sets to test
• Probability of observing x or more genes with a common annotation in a cluster of n genes
– Hypergeometric or Fisher's exact tests are appropriate
– N = total number of genes in the genome
– M = number of genes with the annotation
– n = number of genes in the cluster
– x = number of genes in the cluster with the annotation

p = \sum_{j=x}^{n} \frac{\binom{M}{j}\,\binom{N-M}{n-j}}{\binom{N}{n}}

• Multiple hypothesis correction is required if testing multiple functions (Bonferroni, FDR, etc.)

Gene set enrichment
• Inputs:
– One (or more) result set(s) of genes in rank order
– One (or more) characterized gene sets to test
• Probability of observing x or more genes from a pathway among the top/bottom n ranked result genes
• Multiple hypothesis correction is still required over all characterized gene sets tested (see the sketch below).
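The overlap p-value above can be computed directly; a minimal sketch (made-up counts, not from the slides) using SciPy's hypergeometric distribution, with a Bonferroni correction for multiple tested gene sets:

```python
# Minimal sketch of the hypergeometric overlap test above;
# all counts and the number of tested gene sets are made-up examples.
from scipy.stats import hypergeom

N = 6000   # total genes in the genome
M = 40     # genes carrying the annotation
n = 100    # genes in the cluster
x = 8      # annotated genes observed in the cluster

# P(X >= x) = sum_{j=x}^{n} C(M,j) C(N-M,n-j) / C(N,n)
# hypergeom.sf(k, ...) gives P(X > k), so pass x - 1
p = hypergeom.sf(x - 1, N, M, n)

# Bonferroni correction if, say, 1000 characterized gene sets were tested
n_sets = 1000
p_bonferroni = min(1.0, p * n_sets)
print(f"raw p = {p:.2e}, Bonferroni-corrected p = {p_bonferroni:.2e}")
```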
GO term Enrichment Tools
• GSEA (http://www.broadinstitute.org/gsea/)
• GOrilla (http://cbl-gorilla.cs.technion.ac.il/)
• DAVID (http://david.abcc.ncifcrf.gov/)
• AmiGO & Princeton's GoTermFinder
– http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
– http://go.princeton.edu
• GOLEM (http://function.princeton.edu/GOLEM; Sealfon et al., 2006)

Advanced analysis methods: a very brief overview

More Unsupervised Methods
• Search-based approaches
– Start with a query gene/condition and find similar ones
– Also referred to as "prioritization"
• Singular Value Decomposition (SVD) & Principal Component Analysis (PCA)
– Decompose the data matrix into "patterns", "weights", and "contributions"
– Their real names are "principal components", "singular values", and "left/right eigenvectors"
– Used to remove noise, reduce dimensionality, and identify common/dominant signals

SVD (& PCA)
• SVD is a general method for matrix decomposition; PCA is SVD performed on centered data
• Projects the data into another orthonormal basis
• The new basis is ordered by variance explained

X = U \, S \, V^{t}

Here X is the original data matrix, the columns of U are the "eigen-conditions", S holds the singular values, and the rows of V^t are the "eigen-genes".

PCA
[Figure: example PCA projections of an expression dataset]

Supervised methods: a very brief introduction

Supervised vs. Unsupervised
• Unsupervised methods can find novel profile groupings
• Supervised methods take known groupings and create rules for reliably assigning genes or conditions to those groups
[Figures: hierarchical clustering of lung cancers; microbiome ordination]
Only use unsupervised methods (a) as a qualitative, visual guide or (b) when you don't know what you're looking for.

Supervised analysis: setup
• Training set
– Data: samples (conditions) or genes
– Labels: classes of interest (e.g. case/control for conditions, function annotations for genes)
• Test set
– Data: as above, but without labels (e.g. genes without known function)
• Goal: generalization
– Build a classifier from the training data that is good at predicting the right class for new data

Learning to classify expression profiles
Think of a space with #genes dimensions (yes, it's hard to picture for more than 3).
Each sample corresponds to a point in this space, the same principle as in PCA and ordination.
If gene expression is similar under some conditions, the points will be close to each other; if gene expression overall is very different, the points will be far apart.
[Figure: samples plotted by expression of gene 1 vs. expression of gene 2]

Which line separates best?
[Figure: four candidate separating lines, A–D]
No sharp knife, but a …

Support Vector Machines
• Maximal-margin separating hyperplane
• Data points closest to the separating hyperplane = support vectors

How well did we do?
• Training error: how well do we do on the data we trained the classifier on?
• But how well will we do in the future, on new data?
• Test error: how well does the classifier generalize?
– Same classifier (= line), new data from the same classes
– The classifier will usually perform worse than before: test error > training error

Cross-validation
• Split the data: train the classifier on one part (giving the training error) and test it on the held-out part (giving the test error)

K-fold Cross-validation (here for K = 3; see the sketch below)
Step 1: Train | Train | Test
Step 2: Train | Test | Train
Step 3: Test | Train | Train
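A minimal sketch of this workflow, a linear SVM evaluated by 3-fold cross-validation, using scikit-learn; the expression matrix and class labels are made-up examples, not data from the slides:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
# 60 samples x 50 genes; the first 30 samples are class 0 and the rest
# class 1, with the classes shifted apart in a handful of genes
X = rng.normal(0, 1, (60, 50))
X[30:, :5] += 2.0
y = np.array([0] * 30 + [1] * 30)

# Linear SVM: a maximal-margin separating hyperplane (soft-margin via C)
clf = SVC(kernel="linear", C=1.0)

# K-fold cross-validation (K = 3, as in the slides): each fold is held
# out once as the test set while the classifier trains on the rest
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=0))
print("per-fold accuracy:", np.round(scores, 3))
```

The per-fold accuracies estimate the test error; accuracy on the training folds themselves would be optimistically high, which is exactly why held-out evaluation is needed.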
Other supervised methods
• Bayesian networks
• Neural networks
• Linear discriminant analysis (LDA)
• Logistic regression
• Boosting
• Decision trees

Summary II
• Supervised and unsupervised learning are needed everywhere in biology and medicine
• Microarrays = points in high-dimensional spaces
• Classifiers = surfaces (hyperplanes) in these spaces
• Support Vector Machines use maximal-margin hyperplanes as classifiers
• Classifier performance: test error > training error
• Cross-validation is the right way to evaluate classifier performance

Identifying biomarkers
• Class discovery, class comparison, class prediction

The problem
• We have samples in two groups, A and B
• We want to identify biomarker genes between A and B
• Challenges:
– Data are not normally distributed
– Expression data are often noisy
• Goal: robust and reliable methods for the identification of biomarker genes

Hierarchical clustering of lung cancers
[Figure: patient survival for adenocarcinoma subgroups: cumulative survival vs. time (months) for Groups 1–3; p = 0.002 for Group 1 vs. Group 3]

Nonparametric t-test
• Want to pick genes with:
– Maximal difference in mean expression between sample groups A and B
– Minimal variance of expression within each sample group

Nonparametric t-test
Group 1: n_1 samples with average expression \bar{X}_1; group 2: n_2 samples with average expression \bar{X}_2.

t statistic:

t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\bar{X}_1 - \bar{X}_2}}, \qquad S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

p-value (from column permutations):

p_j = \frac{\mathrm{count}\left(\left|t_j^{perm}\right| \ge \left|t_j^{obs}\right|\right)}{\mathrm{count}(\mathrm{permutations})}

[Figure: heatmap of selected genes, colored from low to high expression]

Wilcoxon rank-sum test
• Tests for equality of location (medians) of two samples
• Uses rank data, so it is good for non-normal data
• Identifies genes with a skewed distribution of ranks

Original data                 Ranks
gene 1: 2 | 0 | 3 | 5 | 9     2 | 1 | 3 | 4 | 5
gene 2: 0 | 1 | 2 | 3 | 5     1 | 2 | 3 | 4 | 5
gene 3: 7 | 4 | 3 | 2 | 1     5 | 4 | 3 | 2 | 1

Microarrays: MeV
www.tm4.org/mev
[Figures: MeV screenshots]

Analysis summary
• Unsupervised methods
– Clustering
• Hierarchical clustering
• K-means clustering
– Decomposition
• SVD/PCA
• Supervised methods
– Require examples with known answers (need both positive and negative examples)
– Support Vector Machines
• Biomarker identification (see the closing sketch below)
– Nonparametric t-test
– Rank-sum test
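As a closing illustration of the two biomarker tests summarized above, here is a minimal sketch of a permutation (nonparametric) t-test, shuffling group labels as a stand-in for the slides' column permutations, and the Wilcoxon rank-sum test via SciPy. The data are made-up examples.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 8)   # expression of one gene in group A
b = rng.normal(1.5, 1.0, 8)   # expression of the same gene in group B

def t_stat(a, b):
    """t = (mean_A - mean_B) / sqrt(s_A^2/n_A + s_B^2/n_B)"""
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

# Permutation p-value: shuffle the group labels and count how often the
# permuted |t| meets or exceeds the observed |t|
obs = abs(t_stat(a, b))
pooled = np.concatenate([a, b])
n_perm = 10000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    count += abs(t_stat(perm[:len(a)], perm[len(a):])) >= obs
print(f"permutation t-test p = {count / n_perm:.4f}")

# Wilcoxon rank-sum test on the same data
stat, p = ranksums(a, b)
print(f"rank-sum p = {p:.4f}")
```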