Transmission tree reconstruction by augmentation of internal

Transmission tree reconstruction by augmentation of
internal phylogeny nodes
Matthew Hall
Li Ka Shing Institute for Health Information and Discovery, University of Oxford
February 2017
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
1 / 29
The relationship of the phylogeny to the transmission tree
Let T be a time-tree (rooted, with branch lengths in units of time).
Let V be its node set of size n.
Suppose the isolates at the tips of T come from a set of H of hosts.
Initial assumptions:
Complete sampling of the epidemic since the TMRCA
No superinfection or reinfection
Transmission is a complete bottleneck
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
2 / 29
The relationship of the phylogeny to the transmission tree
The transmission tree N (a
DAG whose nodes are the
members of H, depicting which
host infected which other) can
be represented by a map
d : V → H taking each node to
a host (tips to the host they
were sampled from).
Visualised by collapsing the
nodes in the preimage of each
h ∈ H under d to a single node.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
3 / 29
The relationship of the phylogeny to the transmission tree
The transmission tree N (a
DAG whose nodes are the
members of H, depicting which
host infected which other) can
be represented by a map
d : V → H taking each node to
a host (tips to the host they
were sampled from).
Visualised by collapsing the
nodes in the preimage of each
h ∈ H under d to a single node.
Matthew Hall (Oxford)
Transmission tree reconstruction
A
H
F
D
B
E
G
C
I
J
February 2017
3 / 29
The simplest version
Assume that the phylogeny and
transmission tree coincide;
internal nodes are transmission
events.
This implies no within-host
diversity and necessitates no
more than one tip per host.
If n is internal with children nC1
and nC2 , then either
d(n) = d(nC1 ) or
d(n) = d(nC2 ).
Trivially 2n−1 transmission trees
for a fixed T .
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
4 / 29
Within-host diversity
If within-host diversity is
assumed then internal nodes are
coalescences of two lineages
within a host.
The subgraph induced by the
preimage of d for any host must
be connected.
An extra set of parameters q
represent the infection times.
Question: How many
transmission trees for a fixed T ?
(Depends on the topology.)
With one tip per host?
With ≥ 1 tip per host?
(Sometimes 0.)
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
5 / 29
Simultaneous MCMC reconstruction of phylogeny and
transmission tree
In either case we get an (injective but not surjective) map z from the
set of possible ds to the space of transmission trees.
Thus an MCMC method that samples from the posterior distribution
of phylogenies with internal node augmentation obeying either set of
rules simultaneously samples from the posterior distribution of
transmission trees.
Not only a method for reconstructing N , but a population model
(tree prior) for reconstruction of T that is more realistic for an
outbreak than the standard unstructured coalescent models.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
6 / 29
Decomposition
Let S be the sequence data and φ the various model parameters.
Without within-host diversity:
p(T , d, φ|S) =
p(S|T )p(T , d|φ)p(φ)
p(S)
p(S|T ) is the standard phylogenetic likelihood and p(T , d|φ) the
probability of observing the augmented tree under a transmission
model.
With within-host diversity:
p(T , d, q, φ|S) =
p(S|T )p(T |N , q, φ)p(N , q|φ)p(φ)
p(S)
p(N , q|φ) is the probability of the transmission tree and its timings as
above; p(T |N , q, φ) is the probability of the within-host
mini-phylogenies under a coalescent process.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
7 / 29
MCMC implementation
Hall et al., 2015 implemented
simultaneous reconstruction of
both trees in BEAST, with
MCMC proposals that respect
the rules of node augmentation.
Several other approaches (e.g.
Didelot et al., 2014, Morelli et
al., 2012; Ypma et al., 2013;
Klinkenberg et al., 2017) with
recent work on the incomplete
sampling problem (Didelot et
al., 2016; Lau et al., 2016).
Matthew Hall (Oxford)
i)
i)
ii)
iii)
ii)
iv)
iii)
Exchange
Subtree slide
Wilson-Balding
Transmission tree reconstruction
Exchange
Subtree slide
50%
50%
Wilson-Balding
50%
50%
February 2017
8 / 29
43617 tips
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
9 / 29
The BEEHIVE study
NGS short-read sequence data acquired from samples taken from
European (and one African) HIV cohort studies.
Some cohorts go back to the early epidemic in the 1980s
Current data from 3138 individuals
Epidemiology: age, gender, date of first positive test, countries of
origin and infection, risk group, ART dates, etc.
Sequences from one time point only (with a few exceptions)
Rather than making a consensus sequence from each host’s reads, we
want to use everything.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
10 / 29
Phyloscanner: phylogenetic analysis of NGS pathogen data
sequencing
mapping to
references
Idea: align all short reads from all hosts to a reference genome and
slide a window across the genome, building a phylogeny for the reads
overlapping each window.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
11 / 29
Phyloscanner: phylogenetic analysis of NGS pathogen data
Identical reads from a single host are merged but the duplicate counts
kept as tip traits
We use RAxML for reconstruction
Tips are not associated with each other across different windows, but
hosts are.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
12 / 29
The topological signal of transmission
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Once we have many tips from
each host, transmission has a
topological signal.
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Direct transmission is suggested
when the clade from the
infectee is not monophyletic
(Romero-Severson et al., 2016)
but in general we only see the
direction of transmission from
the topology.
●
Matthew Hall (Oxford)
●
Starts to look like a parsimony
problem.
Transmission tree reconstruction
February 2017
13 / 29
The topological signal of transmission
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Once we have many tips from
each host, transmission has a
topological signal.
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Direct transmission is suggested
when the clade from the
infectee is not monophyletic
(Romero-Severson et al., 2016)
but in general we only see the
direction of transmission from
the topology.
●
Matthew Hall (Oxford)
●
Starts to look like a parsimony
problem.
Transmission tree reconstruction
February 2017
13 / 29
The topological signal of transmission
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Once we have many tips from
each host, transmission has a
topological signal.
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Direct transmission is suggested
when the clade from the
infectee is not monophyletic
(Romero-Severson et al., 2016)
but in general we only see the
direction of transmission from
the topology.
●
Matthew Hall (Oxford)
●
Starts to look like a parsimony
problem.
Transmission tree reconstruction
February 2017
13 / 29
Challenges reconstructing transmission from this data
Datasets:
Enormous size
Contamination
present
Coverage is uneven
Epidemiology:
Sampling is
incomplete
Multiple infections
present
Bottleneck at
transmission may be
wide (IDUs)
43617 tips
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
14 / 29
Transmission tree reconstruction using parsimony
For a fixed tree, we aim to:
Reconstruct hosts from those represented in the tips to internal nodes
in the tree
But also allow reconstruction to “a host outside the dataset’ as
required by incomplete sampling
Minimise the number of infection events amongst hosts in the
dataset. . .
. . . except, penalise reconstructions which suggest an unreasonable
amount of genetic diversity stemming from a single infection event.
Identify multiple infections and contaminations
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
15 / 29
Transmission tree reconstruction using parsimony
Suppose the T has nodes V and we are reconstructing characters
from the set of states S.
The cost function c(p, q; i, j) determines the cost of transitioning
from state i to state j along the branch from p to q (in the direction
away from the root).
We take a node- and edge- dependent c on a known tree; costs for
transitions vary depending on the states involved and the branch on
which they occur.
The lowest cost reconstruction is found with the Sankhoff algorithm.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
16 / 29
Transmission tree reconstruction using parsimony
If reads are taken from n hosts h1 , . . . , hn making up the study
population, we use the hi as states along with an “unsampled” state
u.
Assume that the root of the tree was in the unsampled state (using
an outgroup if required). Tips from outside the study population (the
outgroup, other reference sequences, contaminants) are assigned u as
a state.
We are interested in minimising the cost of infections of members of
the study population, but not hosts outside that population, so
c(p, q; h, u) = 0 for all p, q, h.
Reconstruction starts by putting a u on the root. (Otherwise it will
always be cheap to reconstruct some hi to the root, plausible or no.)
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
17 / 29
Transmission tree reconstruction using parsimony
Let l(p, i) be the sum of the
branch lengths in the subtree
rooted at p pruned so that only
tips from i remain, or ∞ if there
are no such tips. Then we take
c(p, q; i, j) = 1 + kl(p, j) with
k ∈ R+ a tunable parameter.
This penalises reconstructions
with unreasonable amounts of
within-host diversity (right).
Matthew Hall (Oxford)
l1
●
●
●
q
p
●
●
l2
r
●
o
●
s
p
●
●
h
u
q
r
●
s
o
Left: C = 1 + k(l1 + l2 )
Right: C = 2
Transmission tree reconstruction
February 2017
18 / 29
Transmission tree reconstruction using parsimony
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
c(p, q; h, u) = c(p, q, h, u) for all p, q, h. Above trees have equal C .
Choose to break ties always towards h (for dense sampling), always
towards u (conservative), or based on branch lengths.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
19 / 29
Parsimony for detection of contaminants
Small numbers of contaminant reads are frequent, where a virus from
the wrong host has been sequenced
The parsimony reconstruction does double duty to detect these
Find “multiple introductions” where the tips in one split have very
low read counts, and ignore those tips
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
20 / 29
Augmented phylogeny
"Transmission tree"
●
●
●
●
Without the assumptions of a
single infection event per host
and complete sampling, the
“transmission tree” has multiple
nodes per host and unsampled
nodes.
B1
●
●
●
A
●
●
●
●
B2
●
●
●
●
●
●
●
●
●
●
A
●
●
●
●
●
U
●
●
●
●
●
●
B
●
●
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
21 / 29
Example
Known HIV transmission chain
(Lemey et al., 2005, Vrancken
et al., 2014).
?
B
A
Reconstruction on RAxML trees
for env and pol genes
Host
A
B
C
D
E
F
G
H
I
K
L
True
B?
A?
B
C
C
A
F
B
B
E
C
Matthew Hall (Oxford)
env
U
U
B
D
C
U
U
B
B
E
C
pol
U
A
B
D
C
A
F
B
B
E
C
E
C
H
L
D
I
F
G
K
Transmission tree reconstruction
February 2017
22 / 29
Full phyloscanner results
Full output from phyloscanner is a separate phylogeny for the reads in
each genome window
We would like to use these in a manner similar to bootstrapping, to
indicate support for topological relationships
The phylogenies are not trivially comparable across windows as the
tips are not the same
The hosts are the same, so the trasmission trees are more
comparable, but:
Some hosts are absent from some windows due to sequencing problems
One or more nodes for unsampled regions
Potentially multiple nodes per host
Question: Can we nonetheless give a summary or median
transmission tree?
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
23 / 29
Classification of relationships
For the time being, concentrate on classifying pairwise relationships
between hosts on each window
Contiguity: is the subgraph induced by the nodes from the pair, with
perhaps some unsampled nodes, connected?
Descent: Are all the nodes from one patient ancestral to those from
the other?
Mean patristic distance in the phylogeny
Then count relationships across windows
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
24 / 29
Improved HIV cluster detection
By setting a threshold on patristic distance we can refine the procedures
for identifying HIV clusters with likely directionality.
BEE0221-1
13
BEE0239-1
7
12
5
3
5
1
3
2
11
1
BEE0260-1
BEE0220-1
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
25 / 29
Improved HIV cluster detection
5
9
BEE0247-1
1
BEE0018-1
BEE0117-1
BEE0529-1
8
3
BEE0259-1
BEE0042-1
2
2
13
8
BEE0406-1
6
4
BEE0062-1
1
5
5
13
1
4
4
7
1
BEE0096-1
10
BEE0735-1
3
11
9
3
5
BEE0426-1
BEE0809-1
2
2
BEE0109-1
3
4
1
1
13
9
BEE0287-1
1
6
BEE0212-1
1
BEE0843-1
BEE0738-1
2
12
1
6
6
BEE0095-1
9
1
6
4
1
2
5
3
9
5
7
BEE0726-1
6
7
6
6
2
BEE0197-1
14
BEE0728-1
3
1 32
BEE0198-1
BEE0098-1
1
BEE0527-1
4
16
BEE0840-1
1
3
9
7
3
2
3
2
5
11
BEE0442-1
BEE0331-1
3
BEE0240-1
1
BEE0409-1
1
19
BEE0826-1
1
1
3
BEE0036-1
6
325
8
1
14
BEE0321-1
1
11
11
6
6
2
2
6
8
BEE0201-1
1
BEE0386-1
BEE0563-1
1
2
BEE0037-1
7
8
2
BEE0023-1
8
BEE0427-1
9
7
BEE0112-1
9
8
BEE0885-1
7
3 3 2
4
BEE0351-1
2
7
7
BEE0564-1
3
10
6
BEE0854-1
12
BEE0865-1
1
2
BEE0205-1
6
3
BEE0886-1
BEE0381-1
BEE0241-1
11
BEE0234-1
3
6
7
4
483
6
BEE0102-1
4
3
3
4
2
5
3
BEE0243-1
2
BEE0100-1
12
BEE0043-1
BEE0174-1
BEE0347-1
7
5
5
15
6
3
2
BEE0255-1
7
BEE0115-16BEE0146-1 1 0
2
7
10
BEE0085-1
1
5BEE0428-1
2 5
2
3
5
2
18
4
1
3
BEE0193-1
6
BEE0086-1
4
4
1
1
BEE0200-1
9
2
BEE0233-1 1
1
7
BEE0110-1
2
BEE0153-1
11
BEE0142-1
BEE0227-1
7
BEE0265-1
1
20
6
BEE0248-1
13
1
1
10
8
BEE0175-1
BEE0196-1
BEE0741-1
6
14
3
BEE0327-1
BEE0215-1
7
BEE0804-1
1
5
BEE0394-1
BEE0093-1
2
1
24
BEE0254-1
1
6
BEE0795-1
13
4
3
6
BEE0577-1
1
6
1
3
BEE0815-1
BEE0323-1
7
BEE0811-1
2
1
2
BEE0821-1
5
BEE0437-1
4
BEE0015-1
14
6
BEE0009-1
1
BEE0464-1
BEE0474-1
1
3
BEE0767-1
4
BEE0108-1
BEE0057-1
2
2
6
7
1
BEE0014-1
14
10
16
BEE0032-1
9
9
8
3
1
BEE0803-1
4
BEE0141-1
9
BEE0122-1
9
1
2
6
1
BEE0114-1
6
16
BEE0567-1
10
BEE0542-1
BEE0107-1
BEE0341-1
BEE0541-1
2
BEE0019-1
1
4
20
1
BEE0161-1
5
3
BEE0154-1
5
1
6
BEE0736-1
BEE0290-1
BEE0547-1
3
BEE0301-1
2
BEE0353-1
BEE0696-1
BEE0031-1
6
BEE0237-1
BEE0758-1
1
6
7
BEE0181-1
BEE0662-1
2
2
1
BEE0794-1
4
BEE0555-1
4
2
BEE0592-1
18
14
BEE0789-1
BEE0808-1
2
BEE0305-1
5
2
BEE0709-1
6
BEE0273-1
5
7
2
BEE0053-1
BEE0791-1
1
2
BEE0723-1
1
10
9
BEE0260-1
7
BEE0139-1
16
6
BEE0213-1
BEE0322-1
BEE0148-1
BEE0017-1
BEE0012-1
BEE0799-1
13
BEE0470-1
6
BEE0532-1
7
BEE0206-1
7
6
11
BEE0573-1
12
1
2
2
7
1
14
9
BEE0130-1
14
6
2
1
2
BEE0223-1
BEE0395-1
BEE0164-1
BEE0168-1
8
BEE0368-1
BEE0807-1
8
13
1
1
1
10
7
6
BEE0160-1
27
BEE0345-1
1
BEE0454-1
3
3
BEE0131-1
5
21
BEE0534-1
BEE0092-1
BEE0097-1
3
BEE0545-1
4
BEE0222-1
BEE0231-1
BEE0385-1
2
5
BEE0657-1
1
5
12
BEE0230-1
BEE0710-1
11
7
BEE0851-1
BEE0219-1
1
BEE0065-1
3
7
5
18
BEE0572-1
3
2
BEE0225-1
BEE0473-1
BEE0877-1
7
BEE0797-1
3
BEE0162-1
4
BEE0539-1
BEE0398-1
2
4
BEE0330-1
BEE0798-1
4
BEE0120-1
3
5
BEE0780-1
BEE0121-1
3
BEE0557-1
3
5
6
BEE0538-1
BEE0584-1
1
12
BEE0182-1
12
BEE0543-1
BEE0284-1
BEE0686-1
BEE0560-1
1
4
BEE0867-1
4
BEE0049-1
BEE0669-1
BEE0335-1
1
2
BEE0253-1
4
7
10
24
2
BEE0599-1
15
BEE0147-1
BEE0034-1
Matthew Hall (Oxford)
12
10
BEE0552-1
BEE0178-1
4
11
10
BEE0220-1
BEE0849-1
8
8
7
5
5
7
BEE0258-1
1
4
BEE0733-1
4
6
11
BEE0465-1
BEE0221-1
3
2
BEE0734-1
BEE0725-1
BEE0214-1
2
15
4
BEE0040-1
2
12
BEE0059-1
2
BEE0050-1
1
3
7
BEE0016-1
4
BEE0285-1
BEE0144-1
14
6
BEE0574-1
5
Transmission tree reconstruction
7
BEE0263-1
13
BEE0318-1
February 2017
26 / 29
Conclusions
The transmission tree can be viewed as an augmentation of the
internal nodes of a phylogeny with host information, subject to
various sets of rules.
Considerable recent work in reconstructing it for smaller datasets,
usually using MCMC.
Phyloscanner allows reconstruction with big, NGS datasets.
Work remains to be done for rigorous treatment of the output.
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
27 / 29
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
28 / 29
Acknowledgements
Oxford
Christophe Fraser
Chris Wymant
Imperial
Oliver Ratmann
Edinburgh
Andrew Rambaut
Mark Woolhouse
Matthew Hall (Oxford)
Transmission tree reconstruction
February 2017
29 / 29