Modelling skin/ageing phenotypes with latent variable models in Infer.NET 1 David Knowles Leopold Parts University of Cambridge Daniel Glass Wellcome Trust Sanger Institute John M. Winn King’s College London 1 Microsoft Research Cambridge [email protected] Abstract Models Methods We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET [1] for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study [2]. Factor graphs for the three proposed models. We use Variational Message Passing under the Infer.NET framework. To support these models various factors were added to the framework: e.g. logistic regression, ordinal regression, “sum where”. Shared precision matrix Λ N(0,Λ-1) Factor loadings S W K * Data characteristics Factor analysis Conjugate prior Mean Like many biomedical applications: β Mixture parameters Latent factors z Gate Latent factors m G K Y Noise parameter β 4. High dimensional. 6000 phenotype and exposure variables, measured at multiple time points: use dimensionality reduction Observations D Exposure indicators Spectrophotometry Diet questions Vitamin D level ? Lifetime sun exposure Cell generation number Cumulative skin damage due to external causes Skin Type (I&II or III&IV) Cumulative skin damage due to internal causes Telomere length DNA repair fidelity Explanatory variables Life smoke exposure 2. Generalised factor analysis model. Allows different observed data types using various likelihood functions 3. Combined regression and factor analysis model. Provides the expressive power of FA and interpretability of regression. Menopause/ Hysterectomy Conclusions 1. Using appropriate likelihood models allows optimal integration of different data types 2. FA models have superior predictive performance to mixture models in this setting 3. Combining regression and FA components eases interpretability but at some cost to predictive performance (this may be due to scheduling problems or local minima) 4. Infer.NET allows us to use complex models Future work 1. Time series. Multiple asynchronous visits, different phenotypes recorded each time. 2. Scalability. Although our message passing algorithms are efficient, scaling modern healthcare size datasets remains a challenge. Parallelization is a potential solution. 1 Age Alcohol consumption Symptoms N Results N Exposure indicators Regression Y 1. Generalised mixture model. Clusters individuals. Suitable conjugate prior for each data type. Medical expertise: prior knowledge Quantity of food intake Factor analysis Mixture model Observations N correlation between true and inferred regression coefficients 3. Multiple observations. Combine into a single phenotype: aids interpretability, improves statistical power and helps with missingness. Auxiliary variables likelihood models D 2. Heterogeneous data. Continuous, categorical (including binary), ordinal and count data: using appropriate likelihood functions for each of these data types improves statistical power. Explanatory variables + p Likelihood function 1. High missingness. Many variables have up to 80% missing: Bayesian methods are able to naturally deal with missingness Exposure indicators Latent factors 0.8 0.6 EP Ordinal Probit VMP Ordinal Logistic EP Linear 0.4 3. Online learning. This would allow new data could be incorporated as it is recorded. 0.2 4. Nonlinearities. We are currently experimenting with Gaussian Process and Mixture of Experts models to accommodate nonlinearity. 0 0 20 40 -0.2 60 80 100 sample size, N Hormones Total cumulative skin damage SCC BCC Solar keratosis Dryness Solar elastosis Wrinkles Skin thickness Leg ulcers Hair loss Key processes involved in skin and ageing, devised in collaboration with an experienced dermatologist. We use this prior knowledge in a very crude way at the moment (separating explanatory variables and symptoms) but we intend to use such knowledge to incorporate more structure into our models. Imputation performance (real data). For a random 10% of individuals treat symptoms (e.g. skin cancer, wrinkles) as missing, but leave the explanatory variables (e.g. age, smoking, sun exposure), and infer the predictive posterior over the held out values. −1.4 Malignant melanoma −1.6 Grey hair Correlation under the model. The fitted FA model implies a particular covariance structure for the variables of interest. −1.8 Naevi −2.0 ?inverse ?inverse relationship Synthetic data test. Ordinal regression with 5 output values, P = 20 observed explanatory variables and varying sample size. log pred performance per held out value Langerhans Cell function Symptoms Fibroblast function Keratinocyte function Latent cell function Melanocyte function HRT 1 2 3 4 5 Regression−FA model Factor analysis model Mixture model 6 7 1 2 3 4 5 6 7 num factors/components 8 1 2 3 4 5 6 7 8 References [1] T. Minka, J.M. Winn, J.P. Guiver, and D.A. Knowles. Infer.NET 2.4, 2010. Microsoft Research Cambridge. http://research.microsoft.com/infernet. [2] Tim D. Spector and Alex J. MacGregor. The St. Thomas’ UK Adult Twin Registry. Twin Research, 5:440–443(4), 1 October 2002. Funding DK was supported by Microsoft Research through the Roger Needham Scholarship at Wolfson College, Cambridge.
© Copyright 2026 Paperzz