A Bayesian machine learning approach to latent variable modelling

MRC Career Development
Award in Biostatistics
A Bayesian machine learning approach to latent variable
modelling to accelerate endotype discovery
Dr Danielle Belgrave, PhD
Research Fellow in Biomedical Modelling
15th July 2016
Outline
• Introduction: What is Endotype Discovery?
• Motivating Example: The Allergic March
• Probabilistic Programming in Infer.NET
• Modelling Strategies for Analysing Latent Trajectories of Comorbidities
• Concluding Remarks
Endotype Discovery: The Grand Challenge
To identify subgroups of complex disease risk or treatment outcome
explained by a distinctive underlying mechanism
(“endotypes”)
Foundation of Stratified Medicine,
seeking better-targeted interventions
M.C. Escher
Order and Chaos, 1950
Asthma genetics: Low Yield
• Legacy of non-replicated genetic epidemiology,
typical of most common chronic disorders
Linkage in 1 study only
Linkage in >1 study
Big Data: Feast or Fog?
Problem Space
Observation Space
Data Space
y = b1x1 + b2x2 + b3x3 + c
...Health is measured with error and missingness:
Endo-phenotypes are resolved as if the researcher
were looking through a doily and prism at the problem
STELAR e-Lab
[Figure: the STELAR e-Lab platform connecting five birth cohorts (ALSPAC, MAAS, SEATON, Ashford, Isle of Wight) and STELAR researchers. Labelled flows: data collection, on-going data collection, requests for data, data exports, methods and results, comments and questions.]
Motivating Example
Latent Trajectory Models to Understand Disease Comorbidities of Eczema, Asthma
and Rhinitis
Received Wisdom: Allergic March
• Progression of allergy
Eczema → Asthma → Rhinitis
From: Spergel & Paller, 2003
• Cross-sectional, population-level →
• Used to explain mechanisms
leading to allergy
• Eczema causally linked
to asthma and rhinitis
• Prevention strategy:
target children with eczema
to reduce progression to asthma
From: World Allergy Organization, 2014
What should General Practice trainees
learn about atopic eczema? (1)
• Effective atopic eczema (AE) control not only improves quality of life
but may also prevent the atopic march (1).
• “Atopic dermatitis … is often the first step in the atopic march leading
to the development of asthma or allergic rhinitis (2).”
• “Early and effective treatment of atopic dermatitis could theoretically
interrupt the atopic march and decrease the risk of asthma (3).”
1. Munidasa D et al. J Clin Med. 2015 Feb 12;4(2):360-8.
2. Spergel JM. Ann Allergy Asthma Immunol. 2010;105(2):99-106
3. Gali et al. Allergy Asthma Proc 2007;28:540-3
Stopping the Atopic March
Millions Spent, complete failure…
Warner JO. J Allergy Clin Immunol. 2001;108(6):929-37
Aim: The goal of this study was to determine whether
early intervention with pimecrolimus limits the atopic
march in infants with AD.
Conclusion: This longitudinal observation of infants
with AD provides evidence of the atopic march.
Pimecrolimus was safe and effective in infants with
mild to moderate AD.
Schneider L et al. Pediatr Dermatol. 2016 Jun 7.
doi: 10.1111/pde.12867
Data Description
• STELAR eLab: Population-based birth cohort studies
• Manchester Asthma and Allergy Study (MAAS)
• 1184 subjects recruited prenatally
• Avon Longitudinal Study of Parents and Children (ALSPAC)
• 8665 subjects recruited prenatally
• Question 1: Has Your Child Had Eczema in the Past 12 Months?
• Question 2: Has Your Child Had Wheeze in the Past 12 Months?
• Question 3: Has Your Child Had Rhinitis in the Past 12 Months?
• Assessed at ages 1, 3, 5, 8 and 11 years
A Tale of 2 Cohorts: Cross-Sectional
Patterns Show the Allergic March
Data from two population-based
birth cohort studies:
Manchester Asthma and Allergy
Study (MAAS)
1184 subjects recruited prenatally
Avon Longitudinal Study of Parents
and Children (ALSPAC)
8665 subjects recruited prenatally
From: Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis: Two Population-Based Birth Cohort Studies. PLoS Medicine 2014
Does the atopic march exist at the individual level?
• Alternative Hypothesis: The allergic march is an imposed paradigm which does
not adequately describe the natural history of Eczema → Asthma → Rhinitis
• To capture disease heterogeneity and encapsulate different patterns of
symptom progression in individual children in the first 11 years of life
• To use probabilistic machine learning techniques to model the development of
Eczema → Asthma → Rhinitis
• To identify homogeneous groups of children with similar profiles of Eczema →
Asthma → Rhinitis and understand the characteristics of different groups
• Develop a Bayesian network of probabilistic dependencies in order to identify
meaningful latent structure and dependencies
Probabilistic Model to Capture Disease
Profile Heterogeneity
• Bayesian Machine Learning modelling framework to identify distinct latent classes
based on individual changing disease profiles of Eczema → Asthma → Rhinitis
• Assumption: each child belongs to one of N latent classes
• Children in the same class have similar Eczema → Asthma → Rhinitis transitions over
time
• Number and size of classes not known a priori
• Calculated each child’s posterior probability of belonging to each of the latent classes,
and assigned children to the class with the greatest probability
• Compare models using model evidence
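The class-assignment step in the bullets above can be sketched in a few lines. This is only an illustration, assuming the posterior class probabilities have already been inferred: the `posteriors` values are made up and `assign_classes` is a hypothetical helper, not code from the study.

```python
def assign_classes(posteriors):
    """Assign each child to the latent class with the greatest posterior probability.

    posteriors: one list per child, holding one probability per latent class.
    Returns a list of 0-based class indices, one per child.
    """
    assignments = []
    for probs in posteriors:
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("posterior probabilities must sum to 1")
        # argmax over class indices
        assignments.append(max(range(len(probs)), key=lambda c: probs[c]))
    return assignments


# Made-up posteriors for three children and N = 3 latent classes (assumption).
posteriors = [
    [0.80, 0.15, 0.05],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]
assignments = assign_classes(posteriors)
```

Hard assignment discards the uncertainty in the posterior, so a child with probabilities close to equal across classes is labelled as confidently as one with a clear majority class.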
Probabilistic Programming in
Infer.NET
What is Probabilistic Programming?
Pfeffer, Avi. "Practical probabilistic programming." International Conference on Inductive Logic Programming. Springer Berlin Heidelberg, 2010.
Infer.NET
• Compiles probabilistic programs into inference code using expectation propagation (EP), variational message passing (VMP) or Gibbs sampling
• Supports many (but not all) probabilistic program elements
• Extensible – distribution channel for new machine learning research
• Consists of a chain of code transformations:
Infer.NET Architecture
Modelling Strategies for
Analysing Latent Transition
Profiles
Disambiguating Longitudinal Comorbidities of Eczema, Asthma and Rhinitis
Generalised Framework: Hidden Markov Model
for Approximating Disease Transition States
• An HMM consists of 5 elements:
1. A finite set of M observation states: Y1, Y2, Y3,…, Ym
2. A finite set of K latent states: S1, S2, S3,…, Sk
3. Initial probabilities P(S) for each S defining the probability of starting in state S
4. Transition probabilities between latent states P(St|St-1)
5. Emission probabilities of the observed states given the latent state at each time step P(Yi|Sj)
• HMM: A tool for representing probability distributions over sequences of observations. Assumptions:
1. The observations are sampled at discrete, equally-spaced time intervals, so t can be some integer-valued time index
2. The observation at time t was generated by some process whose state St is hidden from the observer
3. The state of this hidden process satisfies the Markov property
   • Given the value of St-1 the current state St is independent of all the states prior to t-1
4. The hidden state variable is discrete: St can take on K (integer) values
• Three Fundamental Problems for HMMs
1. Evaluation: Given the model parameters and observed data, calculate the likelihood of the data
2. Decoding: Given the model parameters and observed data, estimate the optimal sequence of hidden states
3. Learning: Given the observed data, estimate the model parameters
• Joint distribution of a sequence of states and observations:
$$P(S_{1:T}, Y_{1:T}) = P(S_1)\,P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\,P(Y_t \mid S_t)$$
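The evaluation problem listed above (the likelihood of the observed data given the model parameters) is usually solved with the forward recursion. The sketch below is a minimal illustration, not the Infer.NET model from this talk: the two-state parameters in `init`, `trans` and `emit` are made-up values, and the brute-force sum over all hidden state sequences is included only to check the recursion against the joint distribution.

```python
from itertools import product


def forward_likelihood(obs, init, trans, emit):
    """Likelihood P(Y_1..Y_T) of an observation sequence via the forward recursion.

    init[s]        : probability of starting in latent state s
    trans[p][s]    : probability of moving from latent state p to s
    emit[s][y]     : probability of observing y in latent state s
    """
    k = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(k)]
    for y in obs[1:]:
        alpha = [
            emit[s][y] * sum(alpha[p] * trans[p][s] for p in range(k))
            for s in range(k)
        ]
    return sum(alpha)


def joint_probability(states, obs, init, trans, emit):
    """Joint probability of one hidden state sequence and the observations."""
    p = init[states[0]] * emit[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[states[t - 1]][states[t]] * emit[states[t]][obs[t]]
    return p


# Hypothetical parameters: 2 latent states, binary symptom (0 = absent, 1 = present).
init = [0.7, 0.3]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.95, 0.05], [0.3, 0.7]]
obs = [0, 1, 1, 0, 0]  # one child's answers at five assessments (made up)

likelihood = forward_likelihood(obs, init, trans, emit)
brute = sum(
    joint_probability(s, obs, init, trans, emit)
    for s in product(range(2), repeat=len(obs))
)
```

The recursion costs on the order of T·K² operations, whereas the brute-force sum enumerates all K^T hidden sequences, which is why it is only practical as a check on short sequences.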
Generalised Framework: Hidden Markov Model for
Approximating Transition States of the Allergic March
• Bayesian Approach: Learning starts with some
a priori knowledge about the model structure
and model parameters
• Represented in the form of a prior probability
distribution over model structures and
parameters
• Updated using the data to obtain a posterior
probability distribution over models and
parameters
• Assume a single model structure M and
estimate the parameter θ that maximises the
likelihood P(D|θ,M) under that model
• Assume a symmetric Dirichlet prior distribution
with equal probability assignment to each
latent class
• Assume symmetric Dirichlet prior:
Dir(1/m,…,1/m) for the observation matrix
• Assume symmetric Dirichlet prior: Dir(1/k,…,1/k)
for the latent state membership
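One way to read a symmetric Dirichlet prior is as pseudo-counts added to the observed counts: for categorical data the standard conjugate update gives posterior parameters equal to prior parameters plus counts. The sketch below illustrates that reading only; the `counts` are made up and `dirichlet_posterior_mean` is a hypothetical helper, not part of the model code in this talk.

```python
def dirichlet_posterior_mean(counts, pseudo_count):
    """Posterior mean of category probabilities under a symmetric Dirichlet prior.

    Uses the conjugate update: posterior parameter = pseudo_count + observed count.
    """
    alpha_post = [pseudo_count + c for c in counts]
    total = sum(alpha_post)
    return [a / total for a in alpha_post]


# Made-up counts of children falling into k = 3 categories (assumption).
counts = [6, 3, 1]
k = len(counts)

mean_sparse = dirichlet_posterior_mean(counts, 1.0 / k)  # Dir(1/k, ..., 1/k)
mean_flat = dirichlet_posterior_mean(counts, 1.0)        # Dir(1, ..., 1)
```

With 1/k pseudo-counts the prior contributes one pseudo-observation in total, so the posterior mean stays closer to the observed proportions than it does under Dir(1,…,1).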
public ClusterSimpleChain(int numYears)
{
    …
    // Initial-state probabilities, one per latent class k, each drawn from its own Beta prior
    probState0 = Variable.Array<double>(k).Named("probState0");
    probState0Prior = Variable.Array<Beta>(k).Named("probState0Prior");
    probState0[k] = Variable<double>.Random(probState0Prior[k]);
    for (int y = 0; y < numYears; y++)
    {
        ...
#if clusterQ
        // Class-specific version: Q_T and Q_F are indexed by class k, then state s
        Q_T[y] = Variable.Array(Variable.Array<double>(s), k).Named("Q_T" + y);
        Q_F[y] = Variable.Array(Variable.Array<double>(s), k).Named("Q_F" + y);
        QTPriorArr[y] = Variable.Array(Variable.Array<Beta>(s), k).Named("QTPriorArr" + y);
        QFPriorArr[y] = Variable.Array(Variable.Array<Beta>(s), k).Named("QFPriorArr" + y);
        // Draw from the prior arrays declared above (the slide wrote QTPrior / QFPrior here)
        Q_T[y][k][s] = Variable<double>.Random(QTPriorArr[y][k][s]);
        Q_F[y][k][s] = Variable<double>.Random(QFPriorArr[y][k][s]);
#else
        // Shared version: Q_T and Q_F are indexed by state s only
        Q_T[y] = Variable.Array<double>(s).Named("Q_T" + y);
        Q_F[y] = Variable.Array<double>(s).Named("Q_F" + y);
        QTPriorArr[y] = Variable.Array<Beta>(s).Named("QTPriorArr" + y);
        QFPriorArr[y] = Variable.Array<Beta>(s).Named("QFPriorArr" + y);
        Q_T[y][s] = Variable<double>.Random(QTPriorArr[y][s]);
        Q_F[y][s] = Variable<double>.Random(QFPriorArr[y][s]);
#endif
        …
    }
Hidden Markov Model 1: Eczema, Wheeze
and Rhinitis Independent Disease Profiles
[Figure: graphical model. Nodes: a separate class node for each of eczema, wheeze and rhinitis, and that disease's state at ages 1, 3, 5, 8 and 11. Plate over children (n=9801). Legend: latent class, disease profile.]
Hidden Markov Model 2: Eczema, Wheeze
and Rhinitis follow an “Allergic March” Profile
[Figure: graphical model. Nodes: a class node for each of eczema, wheeze and rhinitis, and that disease's state at ages 1, 3, 5, 8 and 11. Plate over children (n=9801). Legend: latent class, disease profile.]
Model 3: Individual-level Longitudinal
Latent Disease Profile
[Figure: graphical model. Nodes: class (1,…,k); latent state at ages 1, 3, 5, 8 and 11; eczema, wheeze and rhinitis at each of those ages. Plate over children (n=9801). Legend: latent class, disease profile.]
Myth Bust by Learning from Data
MRC STELAR consortium
working at scale across
MAAS and ALSPAC cohorts
From: Belgrave et al. Developmental Profiles of Eczema, Wheeze, and Rhinitis:
Two Population-Based Birth Cohort Studies. PLoS Medicine 2014
Investigating Model Sensitivity to Priors
Model evidence for different numbers of inferred latent classes with different priors.
The number of pseudo-counts in the prior was varied from 1/k up to 2 to investigate whether the choice of prior
influences model evidence.
Table of Model Evidence
Rows: prior on the number of pseudo-counts. Columns: number of inferred classes.

Prior        2        3        4        5        6        7        8        9
1/k     -50177   -49030   -48297   -47774   -47367   -47130   -46989   -47109*
2/k     -50200   -49104   -48310   -47797   -47357   -47143   -46994   -47334*
1       -49920   -48448   -47506   -46930   -46845   -46658   -46503   -46424*
2       -49920   -48448   -47506   -46930   -46845   -46733   -46596   -46431*
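As a reading aid, the comparison in the table can be reproduced by picking, for each prior, the number of classes with the largest evidence value. This sketch assumes that larger (less negative) values indicate better-supported models; the asterisks on the 9-class column are dropped because their meaning is not stated on the slide, and `best_class_count` is a hypothetical helper.

```python
# Model evidence values copied from the table; columns are 2 to 9 inferred classes.
evidence = {
    "1/k": [-50177, -49030, -48297, -47774, -47367, -47130, -46989, -47109],
    "2/k": [-50200, -49104, -48310, -47797, -47357, -47143, -46994, -47334],
    "1":   [-49920, -48448, -47506, -46930, -46845, -46658, -46503, -46424],
    "2":   [-49920, -48448, -47506, -46930, -46845, -46733, -46596, -46431],
}
class_counts = list(range(2, 10))


def best_class_count(values):
    """Return (number of classes, evidence) for the largest evidence value."""
    best_index = max(range(len(values)), key=lambda i: values[i])
    return class_counts[best_index], values[best_index]


best = {prior: best_class_count(values) for prior, values in evidence.items()}
for prior, (n_classes, value) in best.items():
    print(f"prior {prior}: best at {n_classes} classes (evidence {value})")
```

Under this assumption the preferred number of classes shifts from 8 (priors 1/k and 2/k) to 9 (priors 1 and 2), which is the kind of prior sensitivity the slide sets out to investigate.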
Limitations of HMMs
• Exact inference in these generalisations is usually intractable
• Can use approximate inference algorithms such as Markov chain sampling and
variational methods
• With M time points and K outcomes, there are K^M possible disease states of the
system
• An HMM would require K^M distinct states to model this system, with K^(2M)
parameters in the transition matrix
• Need to constrain parameters to avoid overfitting
• Can be computationally expensive
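The growth in model size described above is easy to check numerically: with K outcomes at each of M time points the system has K to the power M states, and a full transition matrix over those states has K to the power 2M entries. The sketch below is illustrative only; treating the 8 joint presence/absence patterns of three symptoms as K = 8 outcomes is an assumption made for the example, and `hmm_size` is a hypothetical helper.

```python
def hmm_size(num_outcomes, num_time_points):
    """Number of joint states and transition-matrix entries for an unconstrained HMM."""
    num_states = num_outcomes ** num_time_points
    num_transition_params = num_states ** 2
    return num_states, num_transition_params


# One binary symptom assessed at five ages.
states_single, params_single = hmm_size(2, 5)

# Three binary symptoms treated as 8 joint patterns, assessed at five ages (assumption).
states_joint, params_joint = hmm_size(8, 5)

print(states_single, params_single)
print(states_joint, params_joint)
```

Even this small example moves from about a thousand to about a billion transition entries, which is why the parameters need to be constrained.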
Conclusions
1. Discoveries about asthma and allergic disease, not hypothesised a priori,
have been made by experts explaining structure
learned from data by algorithms tuned by those experts
2. Heuristic blend of biostatistics and machine-learning reveals more than
either method individually
3. Data-complete epidemiology requires persistent integration of
• Data
• Methods
• Expertise
Thank You
Professor Adnan Custovic
Professor Angela Simpson
Professor Iain Buchan
Professor Christopher Bishop
Professor John Henderson
Dr Raquel Granell
Dr John Guiver
Dr John Winn
Dr Phillip Couch
Mike Porter
The eLab Dev Team