Exploring Disease Data
CS 640 Bioinformatics

Some Definitions
• Data mining: extracting hidden patterns and useful information from large data sets. Ex: clustering, machine learning. Should not be: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything" - Jeff Jonas, IBM
• Machine learning: the ability of a program to learn from experience. Ex: neural networks, decision trees, rule-based methods, MDR.

Methods
• Regression methods: model the relationship between a dependent variable and one or more independent variables.
• Data mining methods: search the space of possible models efficiently. Better suited to non-linear and high-dimensional data, or data with many potential interactions.
• Exhaustive search: search all possible models for the best one.

Linear regression
• Models the outcome as a linear combination of the parameters (but not necessarily of the independent variables).
• Ex: Let y = incidence of disease, with n data points and independent variables A, B.
  1) $y_i = b_0 + b_1 A_i + \varepsilon_i$, i = 1,…,n
  2) $y_i = b_0 + b_2 B_i^2 + \varepsilon_i$, i = 1,…,n
  where $b_0$, $b_1$, $b_2$ are parameters and $\varepsilon_i$ is the error term. In both examples the disease incidence is modeled as linear in the parameters, although model 2 is quadratic in the variable B.

Linear regression
• Given a sample, we estimate the parameters (ex: by least squares) to arrive at the linear regression model [1]:
  $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i$

Multiple regression
• Relates the outcome to a linear combination of several predictor variables.
• Ex: Let y = incidence of disease, with n data points and independent variables $x_1, \dots, x_p$.
  $y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + \varepsilon_i$, i = 1,…,n
• Best-fit model: $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i1} + \hat{b}_2 x_{i2} + \dots + \hat{b}_p x_{ip}$
• For each unit increase in $x_{ip}$, with the other predictors held fixed, $\hat{y}_i$ is expected to change by $\hat{b}_p$.
• With 2 input variables the fitted surface is a plane: $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i1} + \hat{b}_2 x_{i2}$

Logistic regression [1]
• Often used when the outcome is binary; relates the log-odds of an event to a linear combination of predictor variables.
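The log-odds link in the logistic regression definition above can be made concrete with a minimal Python sketch. The function names and the coefficient values (-2.0 and 1.5) are hypothetical, chosen only for illustration; they do not come from the slides or from [1].

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def event_probability(intercept, coefficients, predictors):
    """Probability implied by a linear predictor on the log-odds scale."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return inverse_logit(eta)

# Hypothetical intercept and coefficient for one binary predictor.
p_unexposed = event_probability(-2.0, [1.5], [0])
p_exposed = event_probability(-2.0, [1.5], [1])

# The difference in log-odds between the two groups equals the coefficient.
log_odds_difference = logit(p_exposed) - logit(p_unexposed)
print(round(p_unexposed, 3), round(p_exposed, 3), round(log_odds_difference, 3))
```

Because the model is linear on the log-odds scale, a coefficient is a difference in log-odds between groups, not a difference in probability.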
• Ex: $\ln(p/(1 - p)) = \alpha + \beta x_B + \gamma x_C + i x_B x_C$, where $x_B$ and $x_C$ are measured binary indicator variables, the regression coefficients $\beta$ and $\gamma$ represent main effects, and $i$ represents the interaction.

Other statistical methods [1]
• Bayesian model selection: a statistical approach incorporating both prior distributions for parameters and observed data into the model.
• Maximum likelihood: a statistical method used to make inferences about the combination of parameter values resulting in the highest probability of obtaining the observed data.

Modeling Terminology [1]
• Saturated: a statistical model that is as full as possible (saturated) with parameters.
• Marginal effects: the effects of one parameter averaged over the possible values taken by other parameters.
• Entropy: a measure of the uncertainty associated with a random variable.

Modeling Terminology [1]
• Cross-validation: partitioning a data set into n subsets, then using each subset in turn as the test set while training on the other n-1.
• Overfitting: a model that provides a good fit to a specific data set but generalizes poorly.

Marginal Effects [2]
• Marginal penetrance. Ex: the probability P(D | A = Aa), irrespective of the value of B.

Table II. Penetrance values for combinations of genotypes from two single nucleotide polymorphisms exhibiting interactions in the absence of independent main effects. Genotype frequencies are given in parentheses.

Genotype               AA (0.25)  Aa (0.50)  aa (0.25)  Marginal penetrance B
BB (0.25)              0          1          0          0.5
Bb (0.50)              1          0          1          0.5
bb (0.25)              0          1          0          0.5
Marginal penetrance A  0.5        0.5        0.5

The last row and column give the marginal penetrance values for the A and B genotypes.

Weka [3]
• A collection of visualization tools and algorithms for data analysis and predictive modeling.
• Preprocessing tools for reading data in a variety of formats and transforming it.
• Classification algorithms include regression, neural networks, support vector machines, and decision trees. Displays include ROC curves.
• Clustering: k-means, expectation maximization.
• Visualization includes scatter plots and bar graphs.

References
[1] Cordell, 2009, Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics
[2] McKinney et al., 2006, Machine Learning for Detecting Gene-Gene Interactions: A Review. Biomedical Genomics and Proteomics
[3] Weka site: http://www.cs.waikato.ac.nz/ml/weka
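
Worked example: marginal penetrance. As a check on Table II in the Marginal Effects slide, the marginal penetrance values can be recomputed by averaging each row or column over the genotype frequencies of the other locus. This is a minimal Python sketch; it assumes the genotypes at the two loci are independent (so each cell is weighted by the marginal frequency of the other locus), and the variable and function names are illustrative only.

```python
# Genotype frequencies from Table II.
a_freq = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
b_freq = {"BB": 0.25, "Bb": 0.50, "bb": 0.25}

# Penetrance P(D | A, B): outer key is the B genotype, inner key the A genotype.
penetrance = {
    "BB": {"AA": 0, "Aa": 1, "aa": 0},
    "Bb": {"AA": 1, "Aa": 0, "aa": 1},
    "bb": {"AA": 0, "Aa": 1, "aa": 0},
}

def marginal_penetrance_a(a):
    """P(D | A = a), averaging over B genotypes (assumes independent loci)."""
    return sum(b_freq[b] * penetrance[b][a] for b in b_freq)

def marginal_penetrance_b(b):
    """P(D | B = b), averaging over A genotypes (assumes independent loci)."""
    return sum(a_freq[a] * penetrance[b][a] for a in a_freq)

marginal_a = {a: marginal_penetrance_a(a) for a in a_freq}
marginal_b = {b: marginal_penetrance_b(b) for b in b_freq}

print(marginal_a)  # {'AA': 0.5, 'Aa': 0.5, 'aa': 0.5}
print(marginal_b)  # {'BB': 0.5, 'Bb': 0.5, 'bb': 0.5}
```

Every marginal value is 0.5, so neither locus shows a main effect on its own even though the joint genotype fully determines the penetrance.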