poster

Modelling skin/ageing phenotypes with latent variable models in Infer.NET
David Knowles (University of Cambridge)
Leopold Parts (Wellcome Trust Sanger Institute)
Daniel Glass (King’s College London)
John M. Winn (Microsoft Research Cambridge)
[email protected]
Abstract
We demonstrate and compare three unsupervised Bayesian latent variable models implemented in Infer.NET [1] for biomedical data modeling of 42 skin and ageing phenotypes measured on the 12,000 female twins in the Twins UK study [2].
Models
Factor graphs for the three proposed models.
Methods
We use Variational Message Passing under the Infer.NET framework. To support these models various factors were added to the framework: e.g. logistic regression, ordinal regression, “sum where”.
[Figure labels: Shared precision matrix Λ; N(0, Λ⁻¹); Factor loadings; S; W; K; *]
Data characteristics
[Figure labels: Factor analysis; Conjugate prior; Mean]
Like many biomedical applications:
[Figure labels: β; Mixture parameters; Latent factors; z; Gate; Latent factors; m; G; K; Y; Noise parameter; β]
4. High dimensional. 6000 phenotype and exposure variables, measured at multiple time points: use dimensionality reduction.
[Figure labels: Observations; D; Exposure indicators; Spectrophotometry; Diet questions; Vitamin D level; ?; Lifetime sun exposure; Cell generation number; Cumulative skin damage due to external causes; Skin Type (I&II or III&IV); Cumulative skin damage due to internal causes; Telomere length; DNA repair fidelity; Explanatory variables; Life smoke exposure]
2. Generalised factor analysis model. Allows different observed data types using various likelihood functions.
3. Combined regression and factor analysis model. Provides the expressive power of FA and interpretability of regression.
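To make the combined regression and factor analysis idea concrete, the following is a minimal generative sketch, not the Infer.NET model used on the poster. It assumes a purely linear-Gaussian form, y = B x + W z + noise, where x holds the observed explanatory variables and z the latent factors. The names `sample_regression_fa`, `B`, `W` and `noise_sd` are hypothetical and chosen only for illustration.

```python
import random

def sample_regression_fa(exposures, B, W, noise_sd, rng):
    """Draw one observation y = B x + W z + noise, with z ~ N(0, I).

    Assumption: linear-Gaussian likelihood; the poster's model also
    supports other likelihood functions.
    """
    K = len(W[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(K)]  # latent factors
    y = []
    for d in range(len(W)):
        # Interpretable part: regression on observed explanatory variables.
        regression_part = sum(B[d][p] * exposures[p] for p in range(len(exposures)))
        # Expressive part: shared latent factors.
        factor_part = sum(W[d][k] * z[k] for k in range(K))
        y.append(regression_part + factor_part + rng.gauss(0.0, noise_sd))
    return y, z

rng = random.Random(0)
B = [[0.5, -1.0], [0.0, 2.0], [1.0, 1.0]]  # D = 3 outputs, P = 2 explanatory variables
W = [[1.0], [0.5], [-0.3]]                 # K = 1 latent factor
y, z = sample_regression_fa([1.0, 0.0], B, W, noise_sd=0.1, rng=rng)
```

In this sketch the regression coefficients `B` can be read directly as effects of the explanatory variables, while `W` captures residual correlation between outputs.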
[Figure label: Menopause/Hysterectomy]
Conclusions
1. Using appropriate likelihood models allows optimal integration of different data types.
2. FA models have superior predictive performance to mixture models in this setting.
3. Combining regression and FA components eases interpretability but at some cost to predictive performance (this may be due to scheduling problems or local minima).
4. Infer.NET allows us to use complex models.
Future work
1. Time series. Multiple asynchronous visits, different phenotypes recorded each time.
2. Scalability. Although our message passing algorithms are efficient, scaling to modern healthcare-sized datasets remains a challenge. Parallelization is a potential solution.
[Figure labels: Age; Alcohol consumption; Symptoms; N]
Results
[Figure labels: N; Exposure indicators; Regression; Y]
1. Generalised mixture model. Clusters individuals. Suitable conjugate prior for each data type.
Medical expertise: prior knowledge
[Figure labels: Quantity of food intake; Factor analysis; Mixture model; Observations; N]
[Plot axis label: correlation between true and inferred regression coefficients]
3. Multiple observations. Combine into a single phenotype: aids interpretability, improves statistical power and helps with missingness.
[Figure labels: Auxiliary variables; likelihood models; D]
2. Heterogeneous data. Continuous, categorical (including binary), ordinal and count data: using appropriate likelihood functions for each of these data types improves statistical power.
[Figure labels: Explanatory variables; +; p; Likelihood function]
1. High missingness. Many variables have up to 80% missing: Bayesian methods are able to naturally deal with missingness.
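The points above on data types and missingness can be illustrated with a small sketch. It is not the Infer.NET implementation: it assumes each observed variable has a linear predictor `eta`, picks a textbook likelihood per data type (unit-variance Gaussian, logistic Bernoulli, log-link Poisson), and treats a missing entry as contributing no likelihood term, which is what marginalising it out amounts to under a factorised likelihood. The function names and the `kind` strings are hypothetical; categorical and ordinal types are omitted for brevity.

```python
import math

def log_likelihood(value, kind, eta):
    """Log-likelihood of one observation given its linear predictor eta.

    A missing value (None) contributes 0.0: it is simply marginalised out.
    """
    if value is None:
        return 0.0
    if kind == "continuous":  # Gaussian with unit variance (assumed)
        return -0.5 * math.log(2 * math.pi) - 0.5 * (value - eta) ** 2
    if kind == "binary":      # Bernoulli with logistic link
        p = 1.0 / (1.0 + math.exp(-eta))
        return math.log(p) if value == 1 else math.log(1.0 - p)
    if kind == "count":       # Poisson with log link
        return value * eta - math.exp(eta) - math.lgamma(value + 1)
    raise ValueError(f"unknown data type: {kind}")

def row_log_likelihood(row, kinds, etas):
    """Sum the per-variable terms for one individual, skipping missing entries."""
    return sum(log_likelihood(v, k, e) for v, k, e in zip(row, kinds, etas))

# One individual with a missing continuous measurement.
total = row_log_likelihood([None, 1, 3], ["continuous", "binary", "count"], [0.2, 0.0, 1.0])
```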
[Figure labels: Exposure indicators; Latent factors]
[Plot legend: EP Ordinal Probit; VMP Ordinal Logistic; EP Linear]
3. Online learning. This would allow new data to be incorporated as it is recorded.
4. Nonlinearities. We are currently experimenting with Gaussian Process and Mixture of Experts models to accommodate nonlinearity.
[Plot x-axis: sample size, N]
[Figure labels: Hormones; Total cumulative skin damage; SCC; BCC; Solar keratosis; Dryness; Solar elastosis; Wrinkles; Skin thickness; Leg ulcers; Hair loss]
Key processes involved in skin and ageing, devised in collaboration with an experienced dermatologist. We use this prior knowledge in a very crude way at the moment (separating explanatory variables and symptoms) but we intend to use such knowledge to incorporate more structure into our models.
Imputation performance (real data). For a random 10% of individuals, treat symptoms (e.g. skin cancer, wrinkles) as missing but leave the explanatory variables (e.g. age, smoking, sun exposure) observed, and infer the predictive posterior over the held-out values.
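The evaluation protocol described here can be sketched as follows. This is an illustrative outline only: it assumes binary symptoms and that some fitted model has already produced a predictive probability for each held-out value. The helper names `choose_held_out` and `mean_log_predictive` are hypothetical.

```python
import math
import random

def choose_held_out(n_individuals, fraction, rng):
    """Pick a random subset of individuals whose symptoms will be hidden."""
    k = round(fraction * n_individuals)
    return sorted(rng.sample(range(n_individuals), k))

def mean_log_predictive(pred_probs, truths):
    """Average log probability assigned to the true held-out binary values.

    pred_probs[i] is the predicted probability that truths[i] == 1.
    Higher (closer to 0) is better.
    """
    total = 0.0
    for p, t in zip(pred_probs, truths):
        total += math.log(p if t == 1 else 1.0 - p)
    return total / len(truths)

held_out = choose_held_out(n_individuals=100, fraction=0.1, rng=random.Random(0))
score = mean_log_predictive([0.9, 0.2, 0.7], [1, 0, 1])
```

Averaging per held-out value keeps scores comparable between models that are evaluated on the same hidden entries.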
[Figure labels: Malignant melanoma; Grey hair]
Correlation under the model. The fitted FA model implies a particular covariance structure for the variables of interest.
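For intuition about the implied covariance, consider the standard linear-Gaussian factor analysis case, which is an assumption here since the poster's model is generalised to other likelihoods. With y = W z + noise, z ~ N(0, I) and independent noise with variances ψ, the covariance is Cov(y) = W Wᵀ + diag(ψ). A minimal sketch with hypothetical function names:

```python
import math

def implied_covariance(W, noise_var):
    """Cov(y) = W W^T + diag(noise_var) for y = W z + noise, z ~ N(0, I)."""
    D = len(W)
    cov = [[sum(wi * wj for wi, wj in zip(W[i], W[j])) for j in range(D)]
           for i in range(D)]
    for i in range(D):
        cov[i][i] += noise_var[i]  # independent per-variable noise
    return cov

def implied_correlation(W, noise_var):
    """Rescale the implied covariance to a correlation matrix."""
    cov = implied_covariance(W, noise_var)
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Two variables loading equally on one shared factor.
corr = implied_correlation([[1.0], [1.0]], [1.0, 1.0])
```

Two variables that load on the same factor are correlated under the model even if they were never observed together for any individual.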
[Figure labels: Naevi; ?inverse relationship]
Synthetic data test. Ordinal regression with 5 output values, P = 20 observed explanatory variables and varying sample size.
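A synthetic test of this kind could be set up roughly as below. This is a sketch under stated assumptions, not the poster's experiment: it uses a cumulative-logit (ordinal logistic) model with hand-picked cutpoints, standard normal coefficients and inputs, and it only generates data and defines the correlation metric. Inference (EP or VMP in Infer.NET) is not shown. All function names are hypothetical.

```python
import math
import random

def ordinal_probs(eta, cutpoints):
    """Category probabilities under a cumulative-logit model.

    cutpoints must be increasing; len(cutpoints) + 1 categories result.
    """
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    cdf = [sigmoid(c - eta) for c in cutpoints] + [1.0]
    return [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]

def make_dataset(n, p, cutpoints, rng):
    """Sample true coefficients, explanatory variables and ordinal outputs."""
    w = [rng.gauss(0.0, 1.0) for _ in range(p)]
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        eta = sum(wi * xi for wi, xi in zip(w, x))
        probs = ordinal_probs(eta, cutpoints)
        y.append(rng.choices(range(len(probs)), weights=probs)[0])
        X.append(x)
    return w, X, y

def pearson(a, b):
    """Pearson correlation, e.g. between true and inferred coefficients."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

# 5 output values (4 cutpoints), P = 20, one choice of sample size.
w_true, X, y = make_dataset(n=100, p=20, cutpoints=[-2.0, -0.7, 0.7, 2.0],
                            rng=random.Random(0))
```

Repeating `make_dataset` over a range of `n` and comparing `pearson(w_true, w_inferred)` for each inference method would give a curve of recovery quality against sample size.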
[Plot: y-axis: log pred performance per held out value; x-axis: num factors/components; series: Regression−FA model, Factor analysis model, Mixture model]
[Figure labels: Langerhans Cell function; Symptoms; Fibroblast function; Keratinocyte function; Latent cell function; Melanocyte function; HRT]
References
[1] T. Minka, J.M. Winn, J.P. Guiver, and D.A. Knowles. Infer.NET 2.4, 2010. Microsoft Research Cambridge. http://research.microsoft.com/infernet.
[2] Tim D. Spector and Alex J. MacGregor. The St. Thomas’ UK Adult Twin Registry. Twin Research, 5:440–443(4), 1 October 2002.
Funding
DK was supported by Microsoft Research through the Roger Needham Scholarship at Wolfson College, Cambridge.