Linking Genetic Profiles to Biological Outcome

Paul Fogel
Consultant, Paris

S. Stanley Young
National Institute of Statistical Sciences

NISS, NMF Workshop
February 23, '07

Scotch whisky database

[Figure: data matrix, rows 1 to 86 by flavor (Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral), shown with its factorization into components 1 to 4.]

Original matrix = Prototypical flavor patterns × Mixing levels (weights) + Residual

How many flavor patterns?
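One way to approach the question "How many flavor patterns?" is the profile likelihood criterion of Zhu and Ghodsi applied to the ordered eigenvalues. The following is a minimal sketch, not the authors' code: the function names and the example values are hypothetical, and the eigenvalues are synthetic rather than taken from the whisky data. For each candidate cut q, the values are split into a leading and a trailing group, each modeled as Gaussian with its own mean and a shared variance, and the q with the highest log-likelihood is kept.

```python
import numpy as np


def profile_log_likelihood(values):
    """Profile log-likelihood for each candidate cut q = 1 .. p-1."""
    d = np.sort(np.asarray(values, dtype=float))[::-1]  # decreasing order
    p = d.size
    if p < 3:
        raise ValueError("need at least three values")
    loglik = np.full(p - 1, -np.inf)
    for q in range(1, p):
        head, tail = d[:q], d[q:]
        # Sum of squares around the two group means.
        ss = ((head - head.mean()) ** 2).sum() + ((tail - tail.mean()) ** 2).sum()
        var = ss / (p - 2)  # pooled variance shared by both groups
        if var > 0:  # a zero variance leaves the entry at -inf
            loglik[q - 1] = -0.5 * p * np.log(2 * np.pi * var) - ss / (2 * var)
    return loglik


def choose_number_of_components(values):
    """Return the cut q that maximizes the profile log-likelihood."""
    return int(np.argmax(profile_log_likelihood(values))) + 1


# Synthetic eigenvalues for illustration only (assumption, not the whisky data).
example = [200.0, 150.0, 120.0, 100.0, 5.0, 4.0, 3.0, 2.0, 1.0, 1.0, 0.5, 0.2]
print(choose_number_of_components(example))
```

The criterion is meant as a complement to a visual scree inspection: it replaces the search for an "elbow" with an explicit maximization.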
[Figure: three diagnostics plotted against Rows (0 to 13): profile likelihood of the eigenvalues (Zhu and Ghodsi); scree plot of the eigenvalues; scree plot of the determinant (volume filled).]

[Figure: data matrix and components 1 to 4, highlighting AnCnoc: Floral, Sweetness, Fruity, Malty, Nutty.]

[Figure: data matrix and components 1 to 4, highlighting Balmenach: Winey, Body, Honey, Sweetness, Nutty, Malty.]
[Figure: data matrix and components 1 to 4, highlighting GlenGarioch: Spicy, Fruity, Sweetness, Body, Malty.]

[Figure: data matrix and components 1 to 4, highlighting Lagavulin & Laphroaig: Medicinal, Smoky, Body.]

Statistical Issues
1. Massive testing: hundreds of "omic" predictors and several questions per sample.
2. Family-wise error rate versus false discovery rate.
3. Missing data, outliers.
Don't fool yourself.

Matrix Factorization Methods
1. Principal component analysis.
2. Singular value decomposition.
3. Non-negative matrix factorization.
4. Independent component analysis.
5. Robust MF. Area of active research.

Key Papers
1. Good (1969) Technometrics – SVD.
2. Liu et al. (2003) PNAS – rSVD.
3. Lee and Seung (1999) Nature – NMF.
4. Kim and Tidor (2003) Genome Research.
5. Brunet et al. (2004) PNAS – Microarray.
SVD eigenvectors come from a composite of mechanisms.
NMF commits one vector to each mechanism.

NMF Algorithm
[Figure: matrix A, samples by genes or compounds, with its factors shown in red and green.]
Start with random elements in red and green.
Optimize so that A = WH + E.
Green are the "spectra". Red are the "weights".
$\sum_{i,j} \left(a_{ij} - (WH)_{ij}\right)^2$ is minimized.

Inference
• Test each variable sequentially within an ordered set. Each set corresponds to a particular eigenvector, which has been ordered by decreasing values.
Increase in statistical power.
Genomic example.
Simulation.

Microarray Example
• Group AML: patients with acute myeloid leukemia
• Group ALL: patients with acute lymphoblastic leukemia
  – Subgroup ALL-T: T cell subtypes
  – Subgroup ALL-B: B cell subtypes
Golub, T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Clustering
NMF clusters samples correctly.
Additional subgroup of ALL-B.
Brunet et al. (2004). PNAS vol. 101 no. 12 4164–4169
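The NMF Algorithm slide above (random start, then optimize A = WH + E by minimizing the sum of squared residuals) can be illustrated with the multiplicative update rules of Lee and Seung. This is a minimal sketch under stated assumptions: the slides do not say which update rule the authors use, the function name `nmf` and its parameters are hypothetical, and the demonstration matrix is synthetic.

```python
import numpy as np


def nmf(A, k, n_iter=500, eps=1e-9, seed=0):
    """Factor a non-negative matrix A (n x m) as W (n x k) times H (k x m).

    Uses Lee-Seung multiplicative updates for the squared-error objective.
    """
    A = np.asarray(A, dtype=float)
    if (A < 0).any():
        raise ValueError("A must be non-negative")
    rng = np.random.default_rng(seed)
    n, m = A.shape
    # Start with random non-negative elements, as on the slide.
    W = rng.random((n, k)) + eps  # weights
    H = rng.random((k, m)) + eps  # spectra
    for _ in range(n_iter):
        # Each update keeps the factors non-negative; eps guards against
        # division by zero.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic rank-2 example (assumption, for illustration only).
rng = np.random.default_rng(1)
A = rng.random((20, 2)) @ rng.random((2, 10))
W, H = nmf(A, k=2)
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))  # relative residual
```

Because the result depends on the random start, it is common to rerun with several seeds and compare residuals before interpreting the components.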
Sequential testing
[Figure: functional annotation of the gene clusters. Labels in extracted order:]
Cluster 1, ALL-B1 (33 genes): Immune Response; MHC class II; 10 genes, P = 0.00019; 5 genes; Proteasome; 7 genes, P = 0.00054; Immune Response; 28 genes, P = 0.00047; MHC class I & II; 6 genes, P = 0.00018.
Upregulation in ALL-B2 genes. Higher rate of transcription and replication processes. More: RNA Processing.
Cluster 3, ALL-B2 (169 genes): 11 genes, P = 0.00260; DNA Repair and Replication; Cell Growth and Proliferation; 11 genes, P = 0.01519; 61 genes; Cell Cycle; 12 genes; Transcription; 16 genes.
Proliferative nature compared with ALL-B1. Proteasomal activity. Energy production.

Simulation
[Figure: three panels of Y by Group (N, T1, T2):]
• Genes 1-5: upregulated by T1
• Genes 6-10: upregulated by T2
• Genes 11-20: upregulated by T1 and T2
Intragroup correlation structure

Simulation results
Increased power
Same level of FDR
For more details see paper

Summary
• The strategy is conceptually simple:
  – Non-negative matrix factorization is used to create groups of genes that are moving together in the dataset.
  – The error rate to be controlled is allocated over these groups.
  – Within each group, genes are tested sequentially.
• The strategy should be effective if there are sets of genes moving together so that group formation reflects biological reality.

Areas of research:
Robust algorithms
Speed
Multiblock NMF (e.g. relate active motifs with differentially expressed genes)

Contact Information
Paul Fogel
[email protected]
+33 1 43 26 16 86
Independent consultant

Stan Young
National Institute of Statistical Sciences
[email protected]
919 685 9328

Literature
www.niss.org/irMF
Software
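The three-step strategy in the Summary (group genes, allocate the error rate over the groups, test sequentially within each group) can be sketched as follows. The slides do not specify how the error rate is allocated or when testing stops, so this sketch makes two labeled assumptions: the overall level is split equally across groups, and within a group testing proceeds in the given order and stops at the first p-value above the group's level. The function name, gene names, and p-values are hypothetical.

```python
def sequential_tests(groups, alpha=0.05):
    """Fixed-sequence testing within groups.

    groups: dict mapping group name -> list of (gene, p_value) pairs,
            already ordered (for example by decreasing NMF loading).
    Returns a dict mapping group name -> list of rejected genes.
    """
    if not groups:
        return {}
    # Assumption: equal split of the overall level across groups.
    alpha_per_group = alpha / len(groups)
    rejected = {}
    for name, ordered in groups.items():
        hits = []
        for gene, p_value in ordered:
            # Assumption: stop at the first non-significant gene.
            if p_value > alpha_per_group:
                break
            hits.append(gene)
        rejected[name] = hits
    return rejected


# Made-up p-values for illustration only.
demo = {
    "group_1": [("gene_a", 0.001), ("gene_b", 0.010), ("gene_c", 0.200), ("gene_d", 0.0001)],
    "group_2": [("gene_e", 0.500), ("gene_f", 0.001)],
}
print(sequential_tests(demo))
```

Under these assumptions, genes tested early in a group are compared against the group's full allocated level rather than a level divided by the number of genes, which is where a gain in power could come from when the ordering is informative.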