Analysing Microarray Data Using Bayesian Network Learning

Name: Phirun Son
Supervisor: Dr. Lin Liu

Contents

Aims
Microarrays
Bayesian Networks
Classification
Methodology
Results

Aims and Goals

Investigate the suitability of Bayesian networks for the analysis of microarray data
Apply Bayesian learning to microarray data for classification
Compare with other classification techniques

Microarrays

An array of microscopic dots representing gene expression levels
Gene expression is the process by which DNA genes are transcribed into RNA
Short sections of genes are attached to a surface such as glass or silicon
Treated with dyes to obtain expression levels

Challenges of Microarray Data

Very large number of variables, low number of samples
Data is noisy and incomplete
Standardisation of data formats:
◦ MGED – MIAME, MAGE-ML, MAGE-TAB
◦ ArrayExpress, GEO, CIBEX

Bayesian Networks

Represent the conditional independencies of a set of random variables
Two components:
◦ Directed Acyclic Graph (DAG)
◦ Conditional probability tables

Methodology

Create a program to test classification accuracy (sketches of the main steps follow the synthetic-data results below)
◦ Written in MATLAB using the Bayes Net Toolbox (Murphy, 2001) and the Structure Learning Package (Leray, 2004)
◦ Uses a naive network structure, K2 structure learning, and a predetermined structure
Test the program on synthetic data
Test the program on real data
Compare Bayesian networks with decision trees

Synthetic Data

Data created from well-known example Bayesian networks:
◦ Asia network, Car network, and ALARM network
Samples generated from each network
Tested with a naive structure, the pre-known structure, and structure learning

Synthetic Data - Results: Asia Network

Class node: Dyspnoea

  Setting                                 Naive    K2 Learning    Known Graph
  50 samples, 10 folds, 100 iterations    81.0%    83.4%          85.0%
  100 samples, 10 folds, 50 iterations    83.1%    84.3%          85.1%

Lauritzen and Spiegelhalter, 'Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems', 1988, p. 164.

Synthetic Data - Results: Car Network

Class node: Engine Starts

  Setting                                 Naive    K2 Learning    Known Graph
  50 samples, 10 folds, 100 iterations    53.5%    58.3%          62.4%
  100 samples, 10 folds, 50 iterations    56.5%    58.7%          61.2%

Heckerman et al., 'Troubleshooting under Uncertainty', 1994, p. 13.

Synthetic Data - Results: ALARM Network

37 nodes, 46 connections; 50 samples, 10 folds, 10 iterations

  Class node     Naive    K2 Learning    Known Graph
  InsufAnesth    72.4%    78.7%          89.6%
  Hypovolemia    69.0%    77.8%          93.6%

Beinlich et al., 'The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks', 1989.
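To make the methodology concrete, the following is a minimal sketch of how a network is defined and sampled with the Bayes Net Toolbox. The topology matches the Asia network above, but the CPTs are left at BNT's random defaults rather than the published parameters, so it illustrates the mechanics only:

```matlab
% Define the 8-node Asia network (Lauritzen and Spiegelhalter, 1988).
% Nodes are numbered topologically, as BNT requires:
% 1 Smoking, 2 VisitAsia, 3 LungCancer, 4 Tuberculosis,
% 5 Bronchitis, 6 TBorCancer, 7 XRay, 8 Dyspnoea
N = 8;
dag = zeros(N, N);
dag(1, [3 5]) = 1;            % Smoking -> LungCancer, Bronchitis
dag(2, 4)     = 1;            % VisitAsia -> Tuberculosis
dag(3, 6)     = 1;            % LungCancer -> TBorCancer
dag(4, 6)     = 1;            % Tuberculosis -> TBorCancer
dag(5, 8)     = 1;            % Bronchitis -> Dyspnoea
dag(6, [7 8]) = 1;            % TBorCancer -> XRay, Dyspnoea

ns   = 2 * ones(1, N);        % every node is binary
bnet = mk_bnet(dag, ns);
for i = 1:N
    % Random CPTs for illustration; substitute the published values
    % (via the 'CPT' argument) to reproduce the experiments.
    bnet.CPD{i} = tabular_CPD(bnet, i);
end

% Draw 50 fully observed cases; sample_bnet returns one case per call.
ncases = 50;
data = zeros(N, ncases);
for m = 1:ncases
    data(:, m) = cell2num(sample_bnet(bnet));
end
```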
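Classification then reduces to fitting a model on the training folds and querying the posterior of the class node for each test case. Below is a sketch of the K2 variant, reusing `data` and `ns` from the block above; the topological node ordering supplied to K2 and the choice of junction-tree inference are assumptions of this sketch:

```matlab
class = 8;                          % class node: Dyspnoea
order = 1:N;                        % K2 needs a node ordering

% Learn a structure with K2, then fit CPTs from the training data.
dag_k2  = learn_struct_K2(data, ns, order);
bnet_k2 = mk_bnet(dag_k2, ns);
for i = 1:N
    bnet_k2.CPD{i} = tabular_CPD(bnet_k2, i);
end
bnet_k2 = learn_params(bnet_k2, data);

% Classify a held-out case: hide the class node, enter the remaining
% values as evidence, and take the most probable class label.
engine   = jtree_inf_engine(bnet_k2);
testcase = num2cell(data(:, 1));    % stand-in for one test sample
testcase{class} = [];               % hide the class value
engine = enter_evidence(engine, testcase);
marg   = marginal_nodes(engine, class);
[best, predicted] = max(marg.T);    % MAP estimate of the class
```

The naive and known-graph variants differ only in where `dag` comes from: a star graph with the class node as the sole parent of every feature, or the generating structure itself.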
Lung Cancer Data Set

Publicly available data sets:
◦ Harvard: Bhattacharjee et al., 'Classification of Human Lung Carcinomas by mRNA Expression Profiling Reveals Distinct Adenocarcinoma Subclasses', 2001; 11,657 attributes, 156 instances, Affymetrix
◦ Michigan: Beer et al., 'Gene-Expression Profiles Predict Survival of Patients with Lung Adenocarcinoma', 2002; 6,357 attributes, 96 instances, Affymetrix
◦ Stanford: Garber et al., 'Diversity of Gene Expression in Adenocarcinoma of the Lung', 2001; 11,985 attributes, 46 instances, cDNA
The data contain missing values

Feature Selection

Li (2009) provides a feature-selected set of 90 attributes
◦ Selected using WEKA's feature selection
◦ Also allows comparison with decision-tree classification
Data discretised in three forms (an illustrative sketch of this step appears after the Future Work slide):
◦ Undetermined values left unknown
◦ Undetermined values put into either existing category (two-category)
◦ Undetermined values put into a separate category (three-category)

WEKA: Ian H. Witten and Eibe Frank, 'Data Mining: Practical Machine Learning Tools and Techniques', 2005.

Harvard Set

In the tables below, entries are the number of correctly classified test instances, with accuracy in parentheses.

Harvard training, tested on Michigan:

  Scheme               MATLAB        WEKA          DT
  2-Cat -> 2-Cat NF    95 (99.0%)    95 (99.0%)    95 (99.0%)
  2-Cat -> 2-Cat F     94 (97.9%)    93 (96.9%)    92 (95.8%)
  3-Cat -> 3-Cat NF    94 (97.9%)    95 (99.0%)    94 (97.9%)
  3-Cat -> 3-Cat F     88 (91.7%)    95 (99.0%)    94 (97.9%)

Harvard training, tested on Stanford:

  Scheme               MATLAB        WEKA          DT
  2-Cat -> 2-Cat NF    41 (89.1%)    46 (100%)     43 (93.5%)
  2-Cat -> 2-Cat F     41 (89.1%)    45 (97.8%)    36 (78.3%)
  3-Cat -> 3-Cat NF    41 (89.1%)    46 (100%)     42 (91.3%)
  3-Cat -> 3-Cat F     41 (89.1%)    46 (100%)     42 (91.3%)

Michigan Set

Michigan training, tested on Harvard:

  Scheme               MATLAB         WEKA           DT
  2-Cat -> 2-Cat NF    150 (96.2%)    154 (98.7%)    153 (98.1%)
  2-Cat -> 2-Cat F     144 (92.3%)    153 (98.1%)    150 (96.2%)
  3-Cat -> 3-Cat NF    145 (92.9%)    153 (98.1%)    153 (98.1%)
  3-Cat -> 3-Cat F     140 (89.7%)    152 (97.4%)    153 (98.1%)

Michigan training, tested on Stanford:

  Scheme               MATLAB        WEKA          DT
  2-Cat -> 2-Cat NF    41 (89.1%)    46 (100%)     41 (89.1%)
  2-Cat -> 2-Cat F     41 (89.1%)    46 (100%)     40 (87.0%)
  3-Cat -> 3-Cat NF    41 (89.1%)    45 (97.8%)    39 (84.8%)
  3-Cat -> 3-Cat F     41 (89.1%)    46 (100%)     39 (84.8%)

Stanford Set

Stanford training, tested on Harvard:

  Scheme               MATLAB         WEKA           DT
  2-Cat -> 2-Cat NF    139 (89.1%)    153 (98.1%)    139 (89.1%)
  2-Cat -> 2-Cat F     139 (89.1%)    150 (96.2%)    124 (79.5%)
  3-Cat -> 3-Cat NF    139 (89.1%)    150 (96.2%)    154 (98.7%)
  3-Cat -> 3-Cat F     139 (89.1%)    150 (96.2%)    152 (97.4%)

Stanford training, tested on Michigan:

  Scheme               MATLAB        WEKA          DT
  2-Cat -> 2-Cat NF    86 (89.6%)    95 (99.0%)    86 (89.6%)
  2-Cat -> 2-Cat F     86 (89.6%)    92 (95.8%)    72 (75.0%)
  3-Cat -> 3-Cat NF    86 (89.6%)    95 (99.0%)    94 (97.9%)
  3-Cat -> 3-Cat F     86 (89.6%)    95 (99.0%)    91 (94.8%)

Future Work

Use structure learning for Bayesian classifiers
Obtain larger, more homogeneous data sets
Explore other methods of classification
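Finally, the discretisation schemes compared above can be made concrete with a short sketch. The z-score normalisation and the +/-1 cutoff below are illustrative assumptions, not the thresholds actually used in the study:

```matlab
function d = discretise(x, scheme)
% Map normalised expression values to category labels.
% Values at or beyond +/-1 are treated as determined; how the
% remaining (undetermined) values are handled depends on the scheme:
%   'unknown' - leave them as NaN, i.e. missing
%   'two'     - fold them into the nearer of the two categories
%   'three'   - give them their own third category
% The +/-1 cutoff is an illustrative choice, not the study's threshold.
d = nan(size(x));
d(x <= -1) = 1;                  % under-expressed
d(x >= +1) = 2;                  % over-expressed
mid = isnan(d);                  % undetermined values
switch scheme
    case 'unknown'               % leave as NaN (missing)
    case 'two'
        d(mid & x <  0) = 1;     % nearer category below zero
        d(mid & x >= 0) = 2;     % nearer category at or above zero
    case 'three'
        d(mid) = 3;              % explicit third category
end
end
```

For example, `discretise(z, 'three')` produces the three-category form, after which the resulting matrix can be passed to the classifiers sketched earlier.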