Phylogenetic Analysis of HIV Samples from a Single Host

Master Thesis
Rounak Vyas
November 20, 2011
Advisors: Prof. Niko Beerenwinkel, Dr. Osvaldo Zagordi
Computational Biology Group, ETH Zürich
Contents
1 Introduction
  1.1 AIDS
  1.2 HIV
  1.3 Longitudinal Studies
  1.4 HIV Sequencing
  1.5 Recent Studies of HIV-1
2 Materials and Methods
  2.1 Patient History
  2.2 Data Pre-processing
  2.3 Entropy Analysis
  2.4 Recombination Analysis
  2.5 Molecular Clock Estimation
  2.6 Poisson Fitter
  2.7 Sliding MinPD
  2.8 Rate of synonymous and nonsynonymous substitutions
3 Results
  3.1 Entropy Analysis
  3.2 Molecular Clock Rate Estimation and Phylogenetic Tree Construction
  3.3 Founder Virus Analysis
  3.4 Demographic Reconstruction
  3.5 Sliding MinPD
  3.6 Conclusions
Bibliography
Chapter 1
Introduction
The earliest well-documented cases of AIDS date back to the 1980s.
Since then, more than 25 million people have died from AIDS [4]. The
World Health Organization has now declared AIDS a pandemic and sincere
efforts are underway around the world to identify an effective cure for this
condition.
1.1 AIDS
As the name suggests, Acquired Immunodeficiency Syndrome (AIDS) is a
condition wherein the patient’s immune system becomes severely compromised, enabling opportunistic infections such as pneumonia, tuberculosis,
herpes, and others. These infections ultimately result in the death of the
patient. Clinically, AIDS is described as an advanced stage in an infection
caused by Human Immunodeficiency Virus (HIV) wherein the CD4+ cell
count drops below the critical level. CD4+ cells are a special class of white
blood cells which play a major role in recognizing foreign antigens (like bacteria and viruses) within the body [12]. In their absence, the immune system
is not able to recognize and clear these foreign agents leading to sustained
infections.
HIV infection can only be contracted from another infected individual through
the exchange of body fluids such as blood, genital fluids and breast milk. Such
exchange typically takes place when individuals share injection needles or engage in unprotected sexual intercourse. The infection cannot be
acquired by ingestion of the virus.
An individual may remain HIV positive for several years before becoming
an AIDS patient. After the onset of AIDS, the patient’s life span shortens to 8
to 12 months [19]. Current therapies are able to significantly increase the life
span of the infected individuals by delaying the onset of AIDS. Most of these
therapies act by interfering with one of the crucial steps in replication, entry
or release of the virus from the infected cell. However, due to the unusually
high rate of mutation in HIV, it is able to quickly develop resistance against
these therapies and flourish again. Hence there is presently no cure for AIDS
[29].
AIDS is not a disease that targets a specific organ, but a condition characterized by progressive immune failure leading to infections in several organs. To develop an effective therapy, it is imperative to gain insight into
the processes through which the virus evolves to establish a persistent
population within the host. Of particular interest are the evolutionary patterns the virus exhibits while subjected to the selective pressures of the host
immune system, and the development of viral drug resistance. A detailed
understanding of these processes may offer insight into the development of
effective treatment strategies.
1.2 HIV
The Human Immunodeficiency Virus belongs to the family of retroviruses
[3], RNA viruses that use reverse transcriptase to encode their genetic material into DNA only within a host cell. It is known to cause Acquired
Immunodeficiency Syndrome [34]. There are two prominent types known
as HIV-1 and HIV-2, differing in their virulence, infectivity, and prevalence
[15]. They originated from the Simian Immunodeficiency Virus subtypes cpz and smm, which infect chimpanzees and Old World monkeys respectively. In this report we focus on HIV-1, as this is the virus type with which
both patients were infected.
Figure 1.1: Diagram of Human Immunodeficiency Virus [1]
The virus carries two identical copies of its complete genome on
two positive-sense single-stranded RNA molecules. The genome consists of nine genes encoding 19 viral proteins. The viral core is composed of the Capsid Protein (CA, p24),
Matrix protein (MA, p17) and P6. Following reverse transcription of the viral genome by reverse transcriptase subsequent to host infection, the newly
produced DNA is incorporated into the host genome by viral integrase [25].
The precursor proteins encoded by the viral genome are cleaved into fully functional HIV proteins by the viral protease. Following host infection, RNase H degrades
the viral RNA genome once it has been copied into DNA.
The HIV genome codes for a series of proteins serving structural and regulatory functions. Structural proteins include gp120 which lies outside the
virus particle and gp41 just inside the membrane, with gp41 serving as a
membrane anchor for gp120. Tat (transactivator) is a regulatory gene that
accelerates the production of viral progenies and is known to be a crucial
protein for HIV replication. Rev stimulates the production of HIV proteins
but suppresses the expression of other regulatory genes of HIV. Nef (negative regulatory factor) encodes proteins that are exposed to the
cytoplasm of the host cell and are necessary for viral spread and disease
progression by down-regulating the CD4 count. Vif encodes the Viral Infectivity Factor found inside the virus and is responsible for the rapid spread of
the virus. Vpr (Viral protein R) accelerates the production of HIV proteins
and interferes with the host cell cycle thus inhibiting the cell division. Vpu
(Viral protein U) helps in assembling new virus particles, budding out from
the host cell, and accelerates the degradation of CD4 proteins.
Figure 1.2: HIV-1 genome HXB2 strain [17]
HIV Life Cycle
HIV enters lymphocytes by binding to the chemokine and CD4 receptors
present on the cell surface [10, 40]. This binding is facilitated by viral gp160
protein (gp120 and gp41 proteins) [10, 40]. Following binding, the viral
envelope fuses with the cell membrane and releases the HIV capsid into the
cell. Besides CD4+ cells, HIV can also infect macrophages and dendritic
cells [10, 40].
Once the viral capsid enters the cell, the viral RNA is reverse transcribed
to a cDNA molecule [25]. This process is carried out by the reverse transcriptase enzyme, which is extremely error-prone and lacks proofreading
capacity, leading to a misincorporation rate of $10^{-4}$ to $10^{-5}$ per base, or approximately one misincorporation per genome per replication cycle [38].
The cDNA and its complement form a double stranded viral DNA which
is then transported into the cell nucleus where it is subsequently integrated
into the host genome with the help of integrase viral enzyme [25].
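The step from a per-base error rate to roughly one error per genome can be checked with a short calculation. The sketch below is only illustrative: the genome length of about 9,700 nucleotides (the approximate size of the HXB2 reference) and the function name are assumptions, not values taken from this thesis.

```python
# Assumed genome length in nucleotides (HXB2 is roughly 9.7 kb).
GENOME_LENGTH = 9_700


def expected_errors_per_genome(rate_per_base, genome_length=GENOME_LENGTH):
    """Expected number of misincorporations in one reverse-transcription pass."""
    return rate_per_base * genome_length


if __name__ == "__main__":
    # The two ends of the quoted misincorporation range.
    for rate in (1e-4, 1e-5):
        print(rate, expected_errors_per_genome(rate))
```

At the upper end of the quoted range the expectation is close to one misincorporation per genome per cycle; at the lower end it is about one in every ten cycles.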
Once incorporated, viral DNA requires cellular transcription factors to encode for viral proteins. During viral replication, the pro-viral DNA is transcribed into mRNA which is spliced and then transported to the cytoplasm
where it is translated into viral proteins (mainly Tat and Rev Proteins). Rev
protein accumulates in the nucleus and inhibits mRNA splicing, so that the unspliced full-length mRNA leaves the nucleus and enters the cytoplasm [32]. The
full-length mRNA is in fact the viral genome, which binds to the Gag protein
and is packaged into new viral particles. After processing by the host endoplasmic reticulum, gp160 is transported to the plasma membrane where
gp41 anchors gp120 to the membrane of the infected host cell. The viral
capsid then assembles and buds out of the cell to infect other cells [25].
The high rate of mutation in HIV is due to the high error rate of the reverse
transcriptase enzyme while transcribing the viral RNA genome into a DNA
sequence that can be incorporated in the host cell genome for coding viral
proteins [5]. Along with a high misincorporation rate of approximately one
base per replication cycle, reverse transcriptase also lacks proofreading activity,
rendering it unable to check and rectify copy mistakes. This often results in
several slightly different copies of HIV within a single patient. These heterogeneous viral
populations are referred to as "quasispecies", and each genetically distinct
variant is referred to as a haplotype.
1.3 Longitudinal Studies
Due to the prohibitively expensive nature of prospective HIV screening, HIV
studies are generally only performed on high risk population groups, such
as within a prison. Combined with additional ethical considerations, the
result is that most studies enroll patients who are already symptomatic. In
such populations the viral load is already established, so these
studies offer little insight into the dynamics of the pre-seroconversion phase, i.e.
the phase before any antibody production against the virus has taken place.
To develop insight into the temporal evolution of the virus within a host,
longitudinal studies of patient populations are of critical importance. Longitudinal studies combine data collected at multiple examinations at intervals
between minutes and years to afford a more comprehensive insight into the
viral dynamics than is possible through examinations at a single time point
alone.
Figure 1.3: HIV Life Cycle [6]
While longitudinal studies are clearly desirable, they also present technical
challenges such as censoring of events due to the relocation, death or disenrollment of cohort members. Additionally, changing patient habits and
lifestyle choices can complicate analysis. Lastly, longitudinal studies are necessarily more involved and therefore more expensive than single time-point
studies.
1.4 HIV Sequencing
Traditional Sanger-based sequencing methods read only the consensus genomic sequence of heterogeneous viral populations [36], obscuring the
genomic variability present in the population which is of potential importance for identifying gradually-fixating resistance mutations. Next Generation Sequencing (NGS) technologies represent an improvement over Sanger
sequencing and facilitate the sequencing of distinct haplotypes within a sample [30]. However, NGS reads are error-prone and require sophisticated
processing techniques to create error-free haplotype reconstruction and frequency estimation. Currently, haplotypes with frequencies as low as 0.05%
can be estimated with 99% confidence [24].
1.5 Recent Studies of HIV-1
In 1999 R. Shankarappa et al. conducted a landmark study investigating the
evolution of HIV within an infected individual prior to the onset of AIDS
[22]. They studied the evolution of C2-V5, a high mutation region of the
HIV-1 env gene, in nine patients over six to twelve years. They estimated
the viral diversity within and between time points, identified mutations conferring a fitness advantage on the viral strain, and characterized the existence
of three distinct phases: the early phase with linear increase in diversity and
divergence from the founder virus strain, an intermediate phase with linear
increase in divergence but stabilization or decline in diversity, and a late
phase with stabilization in the divergence and continued decrease in diversity.
More recently, Poon et al. used longitudinal deep sequencing data with coalescent analysis to estimate the date of HIV infection [26]. Time of infection
was estimated using the time to most recent common ancestor (TMRCA) of
a time-calibrated phylogenetic tree relating sequences from all time points.
This is justified by the argument that most HIV infections are established by
a single viral strain due to bottlenecks during transmission [39, 13]. Nineteen HIV-positive
individuals were followed and seven genomic regions were analyzed.
The authors compared the estimated time since infection from experimental methods to TMRCA estimates obtained with the BEAST software library.
They observed a stronger correlation between clinical and computational estimates for TMRCA in highly variable regions of HIV genome (such as env)
relative to that in conserved regions such as pol. The reduced correlation in
the conserved regions is thought to be due to a possible overestimation of
time scales due to the increased sensitivity of the coalescent based methods
towards the sampled genetic variation. Consequently, sequences with high
divergence were found to be ideal for calibrating the evolutionary clock. In
the case of a multiple founder virus infection, this method was found to
overestimate the infection time.
In the same month, another study was published by
Suzanne English et al. [23]. It discussed the reconstruction of the
transmission history of HIV-1 infected individuals using phylogenetic methods. It showed that the diversity is fairly limited in the early phase of the
infection and is even compatible with the transmission of a single viral variant. It also provided evidence to support the idea that a single donor can
in principle transmit two distinct variants to two different individuals within a
time span of a few hours. The transmission history was constructed
using Bayesian and maximum likelihood approaches. The env, gag and pol regions were used for this analysis. The inter-host genetic diversity predictions
proportionately varied depending on the extent of conservation observed in
the region used for its prediction. Highest diversity was predicted using
env gene (least conserved) followed by pol gene and then gag gene. Transmission history was constructed using the inter-host variation observed in
these three regions. BEAST software was used for carrying out this analysis.
A temporal study on HIV-1 was undertaken by G. Achaz et al and published in 2004 [11]. This study was conducted using gag-pol sequence data
collected over time from two chronically infected individuals to estimate the
population structure and the neutral mutation rate of this region per site per
generation. Neutral coalescent models were used for the analysis. For the
genealogy construction, a coalescent approach identical to the one proposed
by Felsenstein (1999) was used. Nineteen time points collected over a period of 4
years were used for the analysis. This compensated for the low mutation
rate in the sequences.
A longitudinal study to understand viral evolution in early acute Hepatitis C Virus (HCV) infection was carried out by Bull et al.
and published in 2011 [9]. We discuss this paper
in greater detail because of its parallels with our study. The study aimed
to identify genetic variants with frequencies as low as 0.1% and to
quantify them over the course of infection. The authors also identified two sequential bottlenecks that occurred early in infection. BEAST software was used to
estimate the changes in the effective population size of the virus population
over time. It was also used to construct ancestor descendant relationships
with the viral samples from different time points. The rate of evolution for
the virus was also estimated during this analysis. In depth nonsynonymous
and synonymous substitution analysis was carried out to identify any visible pattern of change. Entropy changes were measured across the whole
genome and across patients which indicated non-uniform evolution of HCV
across the genome and over time. The single founder virus hypothesis was
tested using Poisson Fitter, a freely available tool on
the HIV database.
Several other time series analyses of data from HIV-positive patients have been
carried out to identify the order in which resistance mutations accumulate when the patient is placed under drug therapy.
However, we do not discuss these since our patients did not show any drug
resistance even after the therapy was discontinued.
Chapter 2
Materials and Methods
2.1 Patient History
HIV samples were collected from two patients enrolled at the department of
infectious diseases at Universitätsspital Zürich. The protease-coding region
of HIV was deep-sequenced and analyzed. This region was chosen for the
study since both the patients were treated with a protease inhibitor drug.
Patient I.D.123
Figure 2.1: Viral load in patient I.D. 123
This patient was part of the Primary HIV Infection study, which emphasizes beginning treatment in the early phase of infection and then
discontinuing it. It is based on the assumption that the patient is likely to
control the virus when treatment is started early. However, most patients
suffer from a viral rebound, as did patient 123. As can be seen in Figure 2.1,
four samples over a period of 3.74 years were sequenced from the patient
after testing HIV positive. These samples are marked in red.
The regions marked as ART in Figure 2.1 show the periods of treatment with
Lopinavir, an antiretroviral drug. The exact dates of sample collection are
listed in Table 2.1.

Table 2.1: Sample collection time points, patient I.D.123
Sr.No.   Sample Name   Sample Collection Date
1        PR1           12.12.2003
2        PR 2          15.12.2005
3        PR 28         30.05.2006
4        PR 3          14.08.2007
Patient I.D.181
Figure 2.2: Viral load in patient I.D.181
The patient remained untreated until almost a year after testing HIV
positive. During this phase, the viral load in blood plasma was regularly
monitored. Samples from three time points, marked in red in Figure 2.2,
were deep-sequenced and analyzed. The exact dates of sample collection
are listed in Table 2.2.
Table 2.2: Sample collection time points, patient I.D.181
Sr.No.   Sample Name   Sample Collection Date
1        PR4           28.09.2005
2        PR 5          15.03.2006
3        PR 6          08.09.2006

2.2 Data Pre-processing
Haplotype reconstruction and error correction were performed using ShoRAH
[24]. The output file contained haplotype sequences in FASTA format. The
header of each haplotype contained two numbers. The first gave our
confidence in the haplotype sequence on a scale of 0 to 1. The second, known as the
average read number of the haplotype, gave the number of times the sequences constituting the
haplotype were sequenced, and could be used to calculate the frequency of the haplotype
in the sample population.
These files often contained over a hundred sequences with only a few having a high read count and confidence. For a meaningful analysis, these files
were filtered to select sequences with a confidence of over 0.9. This reduced
the number of haplotypes to one quarter or less. The threshold was chosen
to optimize the number of sequences for analysis. Too few sequences would
not contain enough information for the analysis and too many would certainly add noise to the result. This cutoff returned a reasonable number of
haplotypes.
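The filtering step can be sketched in Python as follows. This is a minimal sketch rather than the script used in this work: the header layout (`posterior=` for the confidence and `ave_reads=` for the average read number) and all function names are assumptions made for illustration.

```python
import re

# Assumed header layout, e.g. ">hap_0|posterior=0.98 ave_reads=153.2".
HEADER = re.compile(r"posterior=([0-9.eE+-]+)\s+ave_reads=([0-9.eE+-]+)")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_haplotypes(records, min_confidence=0.9):
    """Keep haplotypes whose confidence is strictly above the threshold."""
    kept = []
    for header, seq in records:
        match = HEADER.search(header)
        if match is None:
            continue  # header does not follow the assumed layout
        confidence, reads = float(match.group(1)), float(match.group(2))
        if confidence > min_confidence:
            kept.append((header, seq, confidence, reads))
    return kept


def frequencies(kept):
    """Relative frequency of each kept haplotype from its average read number."""
    total = sum(reads for _, _, _, reads in kept)
    return [reads / total for _, _, _, reads in kept]
```

Frequencies computed this way are relative to the retained haplotypes only; normalizing by the read total of the unfiltered file would give frequencies relative to the full sample.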
Since a functional protease is fundamental for HIV, any gaps present in the
haplotype sequences were assumed to be sequencing errors. The haplotypes
from a single run were used to first construct a consensus sequence using
the EMBOSS tool “cons” (run through Biopython). Any gaps present in the consensus sequence were filled using the HXB2 protease reference sequence. The
consensus sequence was then used to fill the gaps present in the haplotype
sequences. The reading frame of every haplotype was also checked to start
at the first nucleotide position. A Python script was written to perform
all the above tasks.
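The two-stage gap filling described above can be sketched as follows. This is an illustrative sketch, not the original script: it assumes that the haplotypes, the consensus and the reference are already aligned to the same coordinates and have equal length, and the function names are hypothetical.

```python
def fill_gaps(sequence, template, gap="-"):
    """Replace every gap character in `sequence` with the base at the same
    position in `template`. Both must be aligned and of equal length."""
    if len(sequence) != len(template):
        raise ValueError("sequence and template must have the same length")
    return "".join(t if s == gap else s for s, t in zip(sequence, template))


def repair_haplotypes(haplotypes, consensus, reference):
    """Fill gaps in the consensus from the reference, then fill gaps in
    each haplotype from the repaired consensus."""
    filled_consensus = fill_gaps(consensus, reference)
    return [fill_gaps(h, filled_consensus) for h in haplotypes]
```

For example, with the reference `ACGT` and the consensus `AC-T`, the haplotype `A--T` is repaired to `ACGT`, while a gap-free haplotype such as `ACCT` is returned unchanged.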
2.3 Entropy Analysis
HIV constantly accumulates mutations to cope with the selective forces being exerted by the immune system and drug treatments. The nucleotide
sites that accumulate these mutations are mainly responsible for rendering
the virus resistant to different therapies. In order to improve our current
methods, it would be fruitful to identify these sites and to gain insight
into how they maintain diversity in the viral population. For this purpose we calculated the entropy for the dataset of every time point and looked for
any visible spatial or temporal patterns.
Entropy of a position in a sequence depicts the uncertainty associated with
the nucleotide present at the site. High entropy indicates that the site can
have variable nucleotides.
Let $X$ be a discrete random variable (bases when considering nucleotides,
amino acids when considering proteins), taking a finite number of possible
values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$ such that $p_i \geq 0$, $i = 1, 2, \ldots, n$, and $\sum_{i=1}^{n} p_i = 1$. The entropy is then given by

$$H_n(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_b p_i$$
Here $b$ is the base of the logarithm; we used the natural logarithm for our
calculations. A simple Python script was written
to calculate and plot the entropy at every position in the alignment. When deep-sequencing data
was submitted as input, the script could take into account the average
number of reads while calculating the entropy at every position.
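A per-position entropy calculation of this kind can be sketched as below. This is a minimal sketch and not the original script: weighting each haplotype by its average read number, skipping gap characters, and the function names are assumptions. The natural logarithm is used, as in the text.

```python
import math
from collections import defaultdict


def column_entropy(column, weights=None, gap="-"):
    """Shannon entropy (natural log) of one alignment column.

    `weights` holds one non-negative weight per sequence, for example the
    average read number of each haplotype; unweighted if omitted.
    """
    if weights is None:
        weights = [1.0] * len(column)
    counts = defaultdict(float)
    for symbol, weight in zip(column, weights):
        if symbol != gap:
            counts[symbol] += weight
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Terms with zero probability contribute nothing to the sum.
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)


def alignment_entropy(sequences, weights=None):
    """Entropy at every position of a list of equal-length aligned sequences."""
    return [column_entropy(col, weights) for col in zip(*sequences)]
```

A fully conserved column has entropy 0, and a column split evenly between two bases has entropy $\ln 2$.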
2.4 Recombination Analysis
Recombination plays a crucial part in the evolution of retroviruses and is
more prevalent in conserved regions [7]. Since we use a fairly conserved
HIV region for our analysis, we performed a recombination detection study.
If an alignment contains recombinant sequences then relationships between
different segments of the alignment cannot be described using a single phylogenetic tree. To unfold the true evolutionary relationships, it is imperative
to identify the recombination break points and partition the alignment into
the number of observed recombinant sets and then depict the evolutionary
relationships in each of these partitions using a separate phylogenetic tree.
If recombination events are not taken into account during a phylogenetic
analysis, then the results are most likely to be meaningless.
We used the Recombinant Identification Program (RIP) [37], which has been developed
to specifically detect recombinants in HIV-1 nucleotide sequences. It accepts
a set of nucleotide sequences from a single viral genomic region collected
from a single patient as an input. The program requires a background sequence which is essentially the consensus sequence of the genomic region
that is to be analyzed. This can be selected from the available list in the
program; alternatively the user is free to submit a consensus sequence with
the nucleotide data. In the latter case, the consensus should be aligned to
the rest of the sequences.
This program detects recombinants by sliding a window of pre-specified
length along the alignment and calculating the Hamming distance of the
query sequence from all other sequences. The best match within every window is identified and the confidence in each match is calculated using a z-test.
If two neighboring windows on the same sequence have best matches with
different sequences, the query sequence is considered a recombinant. The program
implicitly assumes each site to be evolving independently but according to
the same process. It also approximates the binomial distribution of the Hamming distances by a normal distribution.
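The windowed best-match search can be sketched as follows. This is a simplified illustration and not the program's implementation: it omits the z-test for match confidence, resolves ties by taking the first sequence, and uses hypothetical function names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_matches(query, backgrounds, window, step=1):
    """For each window, return (start, name of closest sequence, distance).

    `backgrounds` maps sequence names to sequences aligned to `query`.
    """
    results = []
    for start in range(0, len(query) - window + 1, step):
        segment = query[start:start + window]
        name, dist = min(
            ((n, hamming(segment, s[start:start + window]))
             for n, s in backgrounds.items()),
            key=lambda item: item[1],
        )
        results.append((start, name, dist))
    return results


def breakpoints(matches):
    """Window starts at which the best-matching sequence changes."""
    return [cur[0] for prev, cur in zip(matches, matches[1:])
            if prev[1] != cur[1]]
```

A query whose first half is closest to one sequence and whose second half is closest to another yields a single candidate breakpoint at the window where the best match switches.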
2.5 Molecular Clock Estimation
This section closely follows “The Evolutionary Analysis of Measurably Evolving Populations using Serially Sampled Gene Sequences” by Allen Rodrigo et al. [21] and “Estimating Divergence Times” by J. L.
Thorne and H. Kishino [28].
Our interest in estimating the rate of evolution comes from our desire to construct rooted, time scaled phylogenetic trees using serially sampled data. A
phylogenetic tree using only contemporary sequences can be constructed using standard approaches such as maximum parsimony and the neighbor-joining (NJ) method, which
assume that all the input sequences belong to a single time point and are
therefore equally distant from the root of the tree [35]. This is not the
case with serially sampled sequences and care must be taken to scale the
branches according to their time of sampling. Rate of mutation is required
for this scaling of branches. Unlike the standard tree construction techniques
where the branch lengths are calculated using a composite parameter µt,
where µ is the substitution rate and t is the sampling time, with serially sampled data these two parameters can be decoupled into time and substitution
rate and the tree branches can be expressed in units of either.
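To illustrate how known sampling times allow the rate and time to be separated, the sketch below fits a straight line to genetic distance from the root against sampling time. This root-to-tip regression is a simple stand-in chosen for illustration, not the Bayesian estimation used in this work; the function names and the example values are hypothetical.

```python
def estimate_rate(times, distances):
    """Least-squares fit of distance = rate * time + intercept.

    `times` are sampling times (e.g. in years) and `distances` are genetic
    distances from the root in substitutions per site.
    Returns (rate, intercept).
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    rate = sxy / sxx
    return rate, mean_d - rate * mean_t


def root_time(rate, intercept):
    """Time at which the fitted line reaches zero distance (the root age)."""
    return -intercept / rate
```

With sampling times of 0, 1 and 2 years and distances of 0.010, 0.012 and 0.014 substitutions per site, the fitted rate is 0.002 substitutions per site per year, and the line reaches zero distance five years before the first sample.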
The rate of molecular evolution is an outcome of a complex interplay between the biological systems and their surroundings. Since these systems
and their surroundings change over time, it is inherent that their evolutionary rates would also fluctuate. These fluctuations in rates over different periods are best described as the relaxed molecular clock. In the case of HIV, the
rate of evolution is influenced by the rate of mutation, the generation length
as well as the probability of fixation of the mutation in the viral population.
All these factors depend intricately on the biology as well as the population
size of HIV. When the population size fluctuates, so does the fixation probability of a mutation, which changes the selection pressure on the
virus. Changes in the population size must therefore be taken into
account when deciphering phylogenetic relationships. This is done using coalescent-based
models that use sequence data to determine the population
genetic parameters (e.g. population size), which in turn determine the
shape of the genealogy. Coalescent theory describes the dependence of a
phylogenetic tree that represents the shared ancestry of sampled genes (i.e.
genealogies) on the change in population size and structure [33]. BEAST
implements a variable population size coalescent model which allows the past population dynamics to be determined. This option is known as the Bayesian
skyline plot. It is a non-parametric model which uses time-calibrated sequence data to estimate demographic model parameters with
Bayesian methods [14]. It can estimate the evolutionary rate, substitution
model parameters, phylogeny and ancestral population dynamics within a
single run. It then plots the population size over time, beginning
from the estimated root age of the phylogenetic tree.
It can also be argued that depending on the period of observation, the evolutionary rates can be assumed to be approximately constant implying a
strict molecular clock. Such an assumption facilitates evolutionary studies
but one should always keep in mind the scenarios in which the weakness of
this assumption outweighs its convenience.
The molecular clock model selection and rate estimation were performed
using a Java-based tool, BEAST [27]. It was the natural choice for the analysis since it implements substitution models, insertion-deletion
models and demographic models for performing a series of coherent analyses.
It can also explicitly model the rate of molecular evolution on every branch
of the phylogenetic tree. This rate can be constrained to be constant over
all branches or can be allowed to freely vary along different lineages. This
molecular clock model can be readily combined with other models that allow the rate of substitution to vary along the alignment while sharing some
common parameters such as the rate of transition or transversion. Since several models can be combined, many unnecessary simplifying assumptions
can be avoided.
BEAST provides a Bayesian framework for testing hypotheses on biological
data. Its three main types of analysis are “constructing rooted and time
measured phylogenies”, “estimating population change over time using coalescent
based models” and “demo-geographic sequence analysis”. We constructed time-calibrated
phylogenies after estimating the clock rate, as well as population evolution plots, hence we discuss these two methods in detail. Demo-geographic analysis uses the location of sample collection and includes this
information while drawing statistical inferences.
BEAST is one of the few available platforms which can deal with time
stamped data and make use of relaxed or strict molecular clock models
to construct rooted trees and calibrate internal node ages in absolute time
scales. It makes use of the Metropolis-Hastings Markov Chain Monte Carlo
algorithm to provide sample based estimates of the posterior distributions
of the evolutionary parameters given a set of sequence data. It facilitates
analysis of multi-locus data since the data can be appropriately partitioned
and the evolutionary parameters can be linked/unlinked between partitions.
This feature can be extremely helpful when dealing with viral genes
such as pol and env, which have different mutation rates. In such
a situation, the demographic model parameters can be shared between partitions assuming exponential or logistic growth, while the substitution model
parameters can be unlinked across different partitions.
Model Summary
The model first estimates a phylogenetic tree to explain the relationship between $n$ contemporaneous sequences. This is the genealogy, denoted by $g$.
The coalescent events are then assumed to occur only on internal nodes of
the tree, i.e. there can be a maximum of $n - 1$ coalescent events on
the tree. The population size may change or remain the same after a coalescent event. The indicator function $I_c(i)$ denotes
whether the $i$th event is a coalescent. The times at which the coalescent
events occur are denoted by the vector $u = (u_1, u_2, \ldots, u_{n-1})$. A period
in which the population size remains unchanged is called an interval, and
the number of coalescent events in each interval is denoted by the vector
$A = (a_1, a_2, \ldots, a_m)$. Here $m$ is the total number of such intervals, with
$1 < m < n - 1$. The time at which each grouped interval ends is denoted
by $w = (w_1, \ldots, w_m)$, and the vector of effective population sizes is denoted
by $\Theta = (\theta_1, \theta_2, \ldots, \theta_m)$.
The vector of effective population sizes $\Theta$, together with the genealogy $g$ and the vector $A$ of the number of coalescent events in each interval,
constitute the demographic and coalescent time parameters. The probability of the genealogy can be easily calculated and is denoted by $f_G(g \mid \Theta, A)$.
BEAST uses a fixed number of intervals $m$, since the resulting posterior demographic function is consistent for a large range of its values. The
vector of effective population sizes is sampled using an MCMC algorithm.
Each new population size is sampled from an exponential distribution with
a mean equal to the previous population size. This formulation represents
our belief that the population size is autocorrelated through time.
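The autocorrelated exponential prior described above can be sketched as follows. This draws population sizes directly from the prior and evaluates its log density; it is not the MCMC sampler in BEAST, it leaves out the prior on the first population size, and the function names are hypothetical.

```python
import math
import random


def sample_population_sizes(first_size, m, rng=None):
    """Draw m population sizes, each exponentially distributed with mean
    equal to the previous size."""
    rng = rng or random.Random()
    sizes = [first_size]
    for _ in range(m - 1):
        # random.expovariate takes the rate, i.e. 1 / mean.
        sizes.append(rng.expovariate(1.0 / sizes[-1]))
    return sizes


def log_prior(sizes):
    """Log density of sizes[1:] given sizes[0] under the exponential prior:
    each term is log((1 / prev) * exp(-cur / prev))."""
    return sum(-math.log(prev) - cur / prev
               for prev, cur in zip(sizes, sizes[1:]))
```

Because each draw is centred on the previous value, successive population sizes tend to stay on the same scale, which is the autocorrelation the prior is meant to express.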
The posterior distribution sampled is the product of the likelihood of the piecewise demographic model and the priors:
\[
f_{het}(\Theta, A, \Omega, g, \mu \mid D) = \frac{1}{Z} \Pr(D \mid \mu, g)\, f_G(g \mid \Theta, A)\, f_\Theta(\Theta_1) \times f_A(A)\, f_\Omega(\Omega)\, f_\mu(\mu)
\]
where $f_\Theta(\Theta_1)$ is the scale-invariant prior for the first effective population size, and the remaining sizes are drawn from an exponential distribution centered on the size of the previous population. $\Omega$ contains the parameters of the substitution model and $\mu$ is the mutation rate that scales the genealogies (phylogenetic trees) from units of mutations per site to units of time.
If the sampled substitution model parameters and mutation rates are ignored, we obtain a list of states, each associated with a genealogy and demographic parameters. The demographic history can then be constructed as a piecewise function of time for each of the states. The marginal posterior distribution of the population size is calculated for each time point back to the time of the most recent common ancestor, along with the 95% confidence interval that accounts for phylogenetic and demographic uncertainty. The population estimates are usually smooth due to the averaging effect of the sampling procedure in use.
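As a rough illustration of this summary step, the sketch below evaluates a piecewise-constant demographic function for each sampled state and reports the median and a crude empirical 95% interval at one time point. The representation of a state as a pair (interval end times, population sizes) and all numbers are assumptions made for the example; BEAST and Tracer perform this step internally.

```python
from bisect import bisect_left
from math import ceil, floor
from statistics import median


def population_at(t, interval_ends, sizes):
    """Piecewise-constant N(t): sizes[j] applies up to interval_ends[j]."""
    j = bisect_left(interval_ends, t)
    j = min(j, len(sizes) - 1)  # past the last interval, reuse the last size
    return sizes[j]


def pointwise_summary(t, states):
    """Median and crude 95% interval of N(t) over all sampled states."""
    values = sorted(population_at(t, w, theta) for w, theta in states)
    last = len(values) - 1
    lower = values[floor(0.025 * last)]
    upper = values[ceil(0.975 * last)]
    return median(values), lower, upper


# Three hypothetical sampled states: (interval end times w, sizes Theta)
states = [
    ([1.0, 2.0], [10.0, 20.0]),
    ([0.5, 2.0], [12.0, 30.0]),
    ([1.5, 2.0], [8.0, 16.0]),
]
summary = pointwise_summary(1.2, states)
print(summary)
```

Repeating the summary over a grid of time points gives the kind of curve with an uncertainty band that is shown in a demographic reconstruction plot.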
2.6 Poisson Fitter
Freely available at http://www.hiv.lanl.gov
Studies in [39, 13] showed that HIV undergoes genetic bottlenecks when the mode of transmission is sexual (horizontal transfer) or mother to child (vertical transfer). As a result, new infections are primarily initiated by homogeneous viral strains. Once the infection is established by a single viral strain, the population is expected to grow exponentially until the host immune system initiates a response. This is a case of neutral evolution, in which the mutation counts are expected to follow a Poisson process. Once the host immune system responds to the infection, or once the patient is placed under therapy, the virus population no longer grows exponentially and the accumulated mutations are no longer random. The Poisson distribution can then no longer be used to describe the frequency distribution of the pairwise Hamming distances.
Poisson Fitter [16] analyzes a set of HIV sequences assumed to have been collected close to the time of infection and estimates whether the infection was initiated by a single or by multiple founder viruses. It is based on a maximum likelihood approach that first tests the hypothesis that a single founder virus strain initiated the infection. If this condition is met, the time of infection is estimated with a 95% confidence interval, provided the sample was drawn before the virus population was subject to any selection pressure. Poisson Fitter can read deep-sequencing datasets and so was the natural choice for this analysis. Another reason for selecting this tool was that it is specially designed for HIV and Hepatitis C virus datasets and makes use of their default substitution rates. It has been used in some of the other longitudinal studies discussed in the literature review section.
This tool compares the sample genetic diversity with the diversity expected under the neutral growth model, i.e. infection by a single viral strain accumulating random mutations. It performs statistical tests on the pairwise Hamming distances and fits a Poisson distribution to them using the maximum likelihood method. It also tests whether the phylogenetic tree for the sequences shows a star topology. The Poisson distribution parameter is then found to be
\[
\lambda = \frac{\sum_{i=0}^{n} i\, Y_i}{\sum_{i=0}^{n} Y_i} = E(Y)
\]
where $Y = (Y_0, \ldots, Y_n)$ are the numbers of pairs of sequences whose Hamming distance equals the subscript i. The model assumes a generation time of 2 days and a mutation rate of $2.16 \times 10^{-5}$ per site per replication with a basic reproductive ratio $R_0 = 6$, based on the findings from [39, 13, 18].
When the sequence data shows a star phylogeny, one finds that $E(Y_i) = Y_i$. Once this condition is satisfied, the age of the root of the tree is the same as the time of HIV transmission to the patient.
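The estimate of λ can be sketched in a few lines. The code below is only a toy illustration of the formula, with made-up sequences and hypothetical function names; it does not reproduce the statistical tests or the timing model of Poisson Fitter.

```python
from collections import Counter
from itertools import combinations
from math import exp, factorial


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def hamming_counts(seqs):
    """Y_i: number of sequence pairs with Hamming distance i."""
    return Counter(hamming(a, b) for a, b in combinations(seqs, 2))


def poisson_lambda(counts):
    """Maximum likelihood estimate: the mean pairwise Hamming distance."""
    total_pairs = sum(counts.values())
    return sum(i * y for i, y in counts.items()) / total_pairs


def expected_counts(lam, total_pairs, max_distance):
    """Pair counts expected under a Poisson distribution with mean lam."""
    return [total_pairs * exp(-lam) * lam ** i / factorial(i)
            for i in range(max_distance + 1)]


# Toy alignment, invented for illustration only
seqs = ["ACGT", "ACGT", "ACGA", "ACTT"]
counts = hamming_counts(seqs)
lam = poisson_lambda(counts)
print(counts, lam, expected_counts(lam, sum(counts.values()), max(counts)))
```

Comparing the observed counts with the expected ones is the basis of a goodness-of-fit check. Note that pairwise distances from the same sample are not independent, which is one reason the real tool does more than this sketch.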
When the goodness of fit is low, it might indicate that the sample was collected after the onset of selection pressure or that the infection was initiated by multiple founder viruses. Deep sequencing data can also be analyzed using Poisson Fitter. The plots are then drawn on a log scale, since identical sequences far outnumber the ones that differ and this information is masked on a linear scale.
2.7 Sliding MinPD
This section describes the methods in [8].
Traditional phylogenetic approaches treat all sequence data as contemporaneous and deal with serially sampled data merely by scaling the tips of the leaves. These methods are also unable to account for recombination events. Furthermore, when the data is collected from quickly evolving viruses that exhibit complex substitution patterns, phylogenetic trees cannot depict all the information. In such a situation, an evolutionary network can be used to depict the ancestor-descendant relationships and recombination events.
Sliding MinPD constructs an evolutionary network using serially sampled
data and detects recombination events using a sliding window approach. It
is based on the minimum pairwise distance approach combined with the
sliding window method and recombination detection techniques.
The algorithm consists of three phases. In the first phase, every sequence that does not belong to the first time point is treated as a query sequence, and its pairwise distance is calculated against every sequence from the previous time point. In the second phase, the breakpoints in the recombinant sequences and their donor sequences are identified using the sliding window approach, where the best match is identified for every window along the alignment. In the third phase, potential ancestors from previous time points are identified. For the non-recombinant sequences these are the ones with the shortest distance calculated in the first phase.
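The first two phases can be sketched as follows. This is a simplified illustration and not the Sliding MinPD implementation: it uses a plain proportion of differing sites as the distance, omits the bootscan and significance steps, and all function names and sequences are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_ancestor(query, previous):
    """Phase 1: index of the closest sequence from the previous time point."""
    return min(range(len(previous)),
               key=lambda i: p_distance(query, previous[i]))


def window_best_matches(query, previous, window, step):
    """Phase 2 idea: best-matching earlier sequence for every window."""
    matches = []
    for start in range(0, len(query) - window + 1, step):
        end = start + window
        best = min(range(len(previous)),
                   key=lambda i: p_distance(query[start:end],
                                            previous[i][start:end]))
        matches.append((start, best))
    return matches


# Toy data: the query looks like a mosaic of the two earlier sequences
previous = ["AAAAAAAA", "CCCCCCCC"]
query = "AAAACCCC"
matches = window_best_matches(query, previous, window=4, step=4)
print(closest_ancestor(query, previous), matches)
```

A change of the best-matching sequence between neighbouring windows marks a candidate breakpoint. The sketch also makes it plausible why the outcome depends on the window length and step: both decide which sites are compared together.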
The results of this program were found to be extremely sensitive to the
specified window length. Hence the analysis was carried out for only a
single patient.
2.8 Rate of synonymous and nonsynonymous substitutions
This section closely follows [31], the chapter "Neutral and adaptive protein evolution" by Ziheng Yang in [41] and the Hypothesis Testing for Phylogenies manual [20].
The rates of nonsynonymous and synonymous substitution provide insight into the type of selection pressure acting on the viral population. When the ratio of the nonsynonymous to the synonymous substitution rate is greater than one for a genomic region, that region is said to be under positive selection. For example, when an HIV patient is placed under drug therapy, the virus shows concerted substitutions towards acquiring a particular residue, which eventually becomes fixed in the population and makes the virus drug resistant. This type of evolution is known as positive directional selection. Another kind of positive selection maintains amino acid diversity at certain sites that are potential targets of the host immune system. This is commonly known as diversifying positive selection.
When the genomic region accumulates synonymous and nonsynonymous
substitutions at the same rate, then it is said to be under neutral evolution.
In negative selection the rate of nonsynonymous substitutions is much lower
than that of synonymous substitutions causing selective removal of alleles
that are deleterious. It is also commonly known as purifying selection.
Whether a substitution is synonymous or nonsynonymous depends on the codon in which it occurs and on the position within the codon. For example, GGX → GGY is always a synonymous substitution, whereas CAX → CAY is synonymous if X → Y is a transition and nonsynonymous otherwise. Hence, when dealing with coding sequences, it is meaningful to use codons as the units for selection analysis.
We used the MEGA software to calculate the rates of synonymous and nonsynonymous substitutions. It is based on the method described by M. Nei and T. Gojobori in 1986, which we describe here in detail. First, the number of synonymous and nonsynonymous sites is computed for each codon present in the sequence.
Let S be the number of synonymous sites of a codon, $S = \sum_{i=1}^{3} f_i$, where $f_i$ is the fraction of synonymous changes at the i-th position in the codon. The number of nonsynonymous sites N of the codon can then be calculated as
\[
N = 3 - S
\]
This can be understood by a simple example. In the case of TTA, which codes for leucine,
\[
f_1 = \tfrac{1}{3}\ (T \to C), \quad f_2 = 0, \quad f_3 = \tfrac{1}{3}\ (A \to G), \quad \text{and so} \quad S = \tfrac{2}{3}, \quad N = \tfrac{7}{3}
\]
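The site count for a single codon can be checked with a short sketch. It builds the standard genetic code and counts, for each codon position, the fraction of the three possible changes that leave the amino acid unchanged. Treating changes to a stop codon as nonsynonymous is a simplification assumed here, and the function name is hypothetical; MEGA's own handling may differ.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def synonymous_sites(codon):
    """S = f_1 + f_2 + f_3 for one codon."""
    s = 0.0
    for pos in range(3):
        synonymous = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # Assumption: a change to a stop codon counts as nonsynonymous
            if CODE[mutant] == CODE[codon]:
                synonymous += 1
        s += synonymous / 3
    return s


s_tta = synonymous_sites("TTA")
n_tta = 3 - s_tta
print(s_tta, n_tta)  # approximately 2/3 and 7/3
```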
The total numbers of synonymous and nonsynonymous sites in a sequence of r codons are therefore given by $S = \sum_{j=1}^{r} S_j$ and $N = 3r - S$, where $S_j$ is the number of synonymous sites of the j-th codon.
The numbers of nonsynonymous and synonymous nucleotide differences between a pair of sequences are calculated by comparing the sequences codon by codon and counting the synonymous and nonsynonymous nucleotide differences for each pair of compared codons. This is straightforward when the codons differ at only a single position. When they differ at two nucleotide positions, there are two possible pathways through which this difference could have arisen. Both pathways are then considered with equal probability, the synonymous and nonsynonymous substitutions along them are counted, and the numbers of synonymous and nonsynonymous differences, $S_d$ and $N_d$, are updated. For example, if the codon TTT is compared against GTA, the two pathways are
1. TTT(Phe)→GTT(Val)→GTA(Val), one synonymous and one nonsynonymous substitution
2. TTT(Phe)→TTA(Leu)→GTA(Val), two nonsynonymous substitutions
Averaging over the two pathways, $S_d$ becomes 0.5 and $N_d$ becomes 1.5. Similarly, when there are three nucleotide differences, six possible pathways between the codons are considered, each with three mutational steps.
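The pathway averaging can be sketched as below. The code enumerates every order in which the differing positions can be changed and averages the synonymous and nonsynonymous step counts over the pathways. Skipping pathways that pass through a stop codon is an assumption of this sketch, and the function name is hypothetical.

```python
from itertools import permutations

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def count_differences(codon1, codon2):
    """Return (S_d, N_d) for one codon pair, averaged over all pathways."""
    diff_positions = [i for i in range(3) if codon1[i] != codon2[i]]
    if not diff_positions:
        return 0.0, 0.0
    syn_total, non_total, n_paths = 0, 0, 0
    for order in permutations(diff_positions):
        current, syn, non, valid = codon1, 0, 0, True
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODE[nxt] == "*":  # assumption: skip paths through stops
                valid = False
                break
            if CODE[nxt] == CODE[current]:
                syn += 1
            else:
                non += 1
            current = nxt
        if valid:
            syn_total += syn
            non_total += non
            n_paths += 1
    if n_paths == 0:
        raise ValueError("every pathway passes through a stop codon")
    return syn_total / n_paths, non_total / n_paths


s_d, n_d = count_differences("TTT", "GTA")
print(s_d, n_d)  # 0.5 1.5
```

Summing these values over all codon pairs of two aligned sequences gives the totals $S_d$ and $N_d$ for the pair.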
The proportions of synonymous and nonsynonymous differences are then calculated using the equations $p_s = S_d / S$ and $p_n = N_d / N$, where S and N are the average numbers of synonymous and nonsynonymous sites for the two compared sequences. The number of substitutions per site is then calculated using the Jukes and Cantor (1969) formula [31]:
\[
d = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\, p\right)
\]
where p is $p_s$ or $p_n$ for synonymous and nonsynonymous substitutions respectively. This method gives approximate estimates, and the formula is not applicable to two- and threefold degenerate sites. The program we used for this analysis applies this method to calculate the rates of synonymous and nonsynonymous changes.
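The Jukes and Cantor correction is easy to evaluate directly. The sketch below is a plain transcription of the formula; the function name and the example proportion are chosen for illustration only.

```python
import math


def jukes_cantor(p):
    """Substitutions per site from a proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


d_example = jukes_cantor(0.1)
print(d_example)  # about 0.1073, slightly above the raw proportion 0.1
```

The corrected distance exceeds the observed proportion because the formula accounts for multiple substitutions at the same site, and the difference grows as p approaches 0.75.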
Chapter 3
Results
3.1 Entropy Analysis
Patient I.D. 123
The entropy was calculated and plotted for every position in the alignment for all four datasets, as shown in figures 3.1, 3.2, 3.3 and 3.4.
A set of constant peaks can be observed around positions 50 and 290 in all four plots. The sequence regions around these sites were explored and are summarized in table 3.1. The base number gives the nucleotide position whose neighboring sequence is being viewed. The following four columns show the entropy of the base at the different time points. The last column shows the neighboring sequence of the base. Constantly high entropies were found in the homopolymeric regions. These were most likely sequencing errors, since the 454 sequencing technique (used in our analysis) is known to suffer from a high base mis-incorporation rate in homopolymeric regions. These regions of the alignment were manually curated to remove the anomalies.
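A per-position entropy of this kind can be computed as in the following sketch. It assumes the Shannon entropy with the natural logarithm, equal weight for every sequence and gaps treated as an ordinary symbol; these choices and the function names are assumptions of the example rather than a description of the script used for the plots.

```python
from collections import Counter
from math import log


def column_entropy(column):
    """Shannon entropy (natural log) of one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log(c / n) for c in counts.values())


def alignment_entropy(seqs):
    """Entropy for every position of a list of aligned sequences."""
    return [column_entropy(col) for col in zip(*seqs)]


# Toy alignment: position 0 is conserved, position 3 is split evenly
toy = ["ACGA", "ACGA", "ACGT", "ACGT"]
entropies = alignment_entropy(toy)
print(entropies)
```

With this definition a fully conserved column has entropy 0 and a column split evenly between two bases has entropy ln 2, roughly 0.693.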
Figure 3.1: Patient I.D. 123: Entropy plot for samples collected in 2003
Figure 3.2: Patient I.D. 123: Entropy plot for samples collected in 2005
Figure 3.3: Patient I.D. 123: Entropy plot for samples collected in 2006
Figure 3.4: Patient I.D. 123: Entropy plot for samples collected in 2007
Table 3.1: Patient I.D. 123: Neighboring sequences of high entropy sites
Base No.  Time1  Time2  Time3  Time4  Sequence preceding the site
48        0.377  0.305  0.305  0.562  GGGGGG (43-48)
49        0.377  0.305  0.474  0.693  GGGGGGC (43-49)
294       0.377  0.655  0.586  0.562  TTTAAATTTT (285-294)
In general, sequences from the first time point showed the highest entropy
measure which gradually decreased over time.
Patient I.D. 181
High entropy was measured in several regions, including a few homopolymeric regions. Regions suspected of containing sequencing errors are listed in table 3.2. These regions were manually corrected to remove the sequencing errors. No spatial or temporal trend in entropy was observed.
Table 3.2: Patient I.D. 181: Neighboring sequences of high entropy sites
Base No.  Time1  Time2  Time3  Sequence preceding the site
129       0.271  0.224  0.113  AAACCAAAAA (124-133)
132       0.271  0.224  0.113  AAACCAAAAA (124-133)
294       0.271  0.606  0.549  TTTAAATTTT (285-293)
3.2 Molecular Clock Rate Estimation and Phylogenetic Tree Construction
Sequences from all time points were used to simultaneously estimate the substitution rate and construct the phylogenetic tree depicting the ancestor-descendant relationships between sequences from all time points. The BEAST software was used for this analysis. After a number of test runs to understand the effect of every parameter on the run, the following settings were found to give the best results in terms of a high log-likelihood of the estimated parameters, fast convergence of the MCMC chain and low standard deviation in the distribution of the parameters. The phylogenetic tree construction runs for both patients were performed with the settings specified in table 3.3.
The priors specified for the BEAST run are summarized in table 3.4. The operator settings used to explore the sample space of the parameters are summarized in table 3.5.
Table 3.3: Patient I.D. 123 and 181: Analysis settings. Results are shown in figures 3.7 and 3.9
Setting                           Parameter
Site Models
1. Substitution Model             HKY
2. Base Frequencies               Estimated
3. Site Heterogeneity Model       Gamma+Invariant Sites
4. Partition in codon position    None
Clock Model
1. Model                          Strict Clock
2. Estimate Rate                  Yes
Demographic model
1. Tree Prior                     Constant size
2. Starting Tree                  Randomly Generated
Table 3.4: Patient I.D. 123 and 181: Priors for the BEAST run
Parameter      Prior              Bound        Description
Kappa          Lognormal[1,1.25]  [0,inf]      HKY transition transversion parameter
Frequencies    uniform[0,1]       [0,1]        base frequencies
alpha          uniform[0,10]      [0,1000]     gamma shape parameter
pInv           uniform[0,1]       [0,1]        proportion of invariant sites parameter
clock.rate     uniform[5.4E-5,1]  [0,inf]      substitution rate
rootHeight     Using Tree Prior   [3.761,inf]  root height of the tree
const.popsize  1/x                [0,inf]      coalescent population size parameter
Let us now briefly discuss how the current choice of parameters was arrived at. A series of runs was set up with parameter-rich substitution models such as the general time reversible model. These initial runs took long to converge. As a result, a simpler substitution model, the HKY model, was selected for the analysis. This significantly decreased the convergence time of the chain. All initial test runs were made with the uncorrelated lognormal clock, which draws the rate of each branch from an underlying lognormal distribution. The estimated standard deviation of the clock rate (i.e. the ucld.stdev parameter) was close to zero for most of these runs. A value close to zero for this parameter indicates clock-like behavior of the dataset [2]. Thereafter a set of runs was made with different coalescent models: the expansion growth model, the exponential growth model and the constant size
model. The Bayes factor was used to select the best-fitting model. For both patients, the constant population size model could not be rejected. Table 3.6 shows the clock rate statistics for patient I.D. 123.
Table 3.7 summarizes the statistics for patient I.D. 181. The substitution rate in the HIV protease coding region was higher for this patient (mean clock rate 4.9113E-3, compared with 1.5085E-3 for patient I.D. 123). The shape of the distribution was similar to that of patient I.D. 123.
Table 3.5: Choice of operator values for the rate estimation and phylogenetic tree construction, Patient I.D. 123 and 181
Operates on                    Type            Tuning  Weight  Description
Kappa                          scale           0.75    0.1     HKY transition-transversion parameter of partition
Frequencies                    deltaExchange   0.01    0.1     frequencies
Alpha                          scale           0.75    0.1     gamma shape parameter of partition
Clock.rate                     scale           0.75    3.0     substitution rate of partition
Tree                           subtreeSlide    0.0035  15.0    Performs the subtree-slide rearrangement of the tree
Tree                           wideExchange    n/a     3.0     Performs global rearrangements of the tree
Tree                           wilsonBalding   n/a     3.0     Performs the Wilson-Balding rearrangement of the tree
Tree                           narrowExchange  n/a     15.0    Performs local rearrangements of the tree
rootHeight                     scale           0.75    3.0     root height of the tree of partition
Internal node heights          uniform         n/a     30.0    Draws new internal node heights uniformly
popSize                        scale           0.75    3.0     coalescent population size parameter of partition
growthRate                     randomWalk      1.0     3.0     exponential.growthRate
Substitution rate and heights  upDown          0.75    3.0     Scales substitution rates inversely to node heights of the tree
The phylogenetic tree topologies and parameters were also sampled over the MCMC chain. Tree and parameter values were logged once every 10,000 steps and the chain was run until the effective sample size, i.e. the number of independent draws from the posterior distribution, exceeded 250 [2]. In our case the effective sample sizes were well over 1000 for the relevant parameters.
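The effective sample size is commonly reported as the length of the chain after burn-in divided by the auto-correlation time (ACT). The sketch below applies this relation to the clock rate statistics reported in tables 3.6 and 3.7. The chain lengths it prints are values implied under that assumption, not figures taken from the run logs, and the function names are hypothetical.

```python
def effective_sample_size(chain_length, act):
    """ESS under the assumption ESS = chain length / ACT."""
    return chain_length / act


def implied_chain_length(ess, act):
    """Chain length (in MCMC steps) implied by a reported ESS and ACT."""
    return ess * act


# ESS and ACT of clock.rate as reported in tables 3.6 and 3.7
length_123 = implied_chain_length(ess=1261.3426, act=16204.9553)
length_181 = implied_chain_length(ess=1847.5869, act=16264.458)
print(round(length_123), round(length_181))  # roughly 2.04e7 and 3.00e7 steps
```

With one sample logged every 10,000 steps, an ACT of about 16,000 steps means that only roughly every second logged sample is effectively independent.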
Table 3.6: Patient I.D. 123: Clock rate statistics
mean                          1.5085E-3
stderr of mean                1.922E-5
median                        1.4137E-3
geometric mean                1.3607E-3
95% HPD lower                 3.3823E-4
95% HPD upper                 2.8556E-3
auto-correlation time (ACT)   16204.9553
effective sample size (ESS)   1261.3426
Figure 3.5: Patient I.D. 123: Substitution rate distribution. The 95% confidence interval has been marked in blue. The red region shows the rate sampling outside the interval. [Histogram; x-axis: clock.rate, y-axis: Frequency]
Figure 3.6: Patient I.D. 181: Substitution rate distribution. The 95% confidence interval has been marked in blue. The red region shows the rate sampling outside the interval. [Histogram; x-axis: clock.rate, y-axis: Frequency]
Table 3.7: Patient I.D. 181: Clock rate statistics
mean                          4.9113E-3
stderr of mean                6.0229E-5
median                        4.4662E-3
geometric mean                4.2646E-3
95% HPD lower                 8.9876E-4
95% HPD upper                 1.0203E-2
auto-correlation time (ACT)   16264.458
effective sample size (ESS)   1847.5869
Figure 3.7 shows the ancestor-descendant relationships between sequences collected at different times from patient I.D. 123. Most of the sequences
from the first time point share nodes with the sequences from the second and the third time points. However, the sequences from the final time point (i.e. the ones collected in 2007) appear as a clade segregated from the rest of the tree, showing only a faint relationship with low-frequency haplotypes from the first time point. A more rigorous analysis found that haplotype number 25 from the first sampling point (not shown in this tree, as its frequency was 0.02%) was identical to the low-frequency haplotype number 5 from the final time point. No other sequence from the earlier time points was 100% identical to any of the haplotypes from the final time point.
A series of identical haplotypes was found over time. Starting from one side of the tree, we see that haplotype 6 from 2003 branches out to haplotype 2 of 2005, which further branches to haplotype 3 from 2006. These three sequences were identical, and their frequencies stayed between 0.8% and 1%, after which the haplotype was no longer detected. There were 9 such haplotypes from the first time point that appeared again in the later stages. These are shown in figure 3.8. We see in this figure that the frequency of the haplotypes decreases from the first to the second time point but increases again at the next time point. The sequences from the final time point show little similarity with the ones from previous time points.
A pairwise sequence alignment was performed between the majority haplotype from the third time point, with a frequency of 65%, and the haplotype from the final time point with a frequency of 86%. These two sequences showed 97% similarity. Another tree, incorporating haplotypes with frequencies as low as 0.2%, was constructed to trace any relationship of the sequences from the final time point with low-frequency variants from the second and the third time points. However, no variants from 2005 and 2006 shared a node with the sequences sampled in 2007.
Of a total of eight haplotypes with frequency greater than 0.5% in the first time point, four were found to be present in the second time point. Note that the second sample was collected after almost a year of anti-retroviral therapy. This leads us to conclude that the viral haplotypes were latent during the therapy period but were quick to rebound when the therapy was stopped. So, even though a year passed in terms of absolute time, not much happened on the evolutionary time scale.
For patient I.D. 181, the number of haplotypes with frequency greater than 1% increased progressively over time. While there were only 3 haplotypes with frequency greater than 1% at the first time point, the number rose to 23 at the final time point. All the sequences from the first time point could be traced over the three time points. The haplotype with the highest frequency at the first time point remained the most frequent over the next two time points as well, but its frequency decreased over time from 87% to 38%. This can be clearly seen in figure 3.10.
3.3 Founder Virus Analysis
This analysis was carried out to detect whether the infection was initiated by a single founder virus haplotype. This knowledge is often necessary for tracing the source of infection. It is also useful for explaining the pattern of genetic diversity observed in the patient. If the phylogenetic tree constructed using sequences from only the first time point shows a star-like phylogeny, then the infection is likely to have been initiated by a single virus. When we see distinct clades in the tree, the infection can be assumed to have been initiated by multiple founder viruses. Another reason for the phylogenetic tree not showing a star-like topology can be that the sequences are from a time when immune selection had already started shaping the viral evolution. Care must be taken to use samples collected very close to the time of infection, so that the intra-sample diversity is not higher than 10%. When the sample shows a star-like phylogeny with low intra-sample diversity, the time of infection can be estimated using the Poisson Fitter tool.
Figures 3.11 and 3.12 show the phylogenetic trees constructed using the samples from the first time points for patients I.D. 123 and 181 respectively. These trees clearly do not show a star topology that would indicate a single founder virus infection. Since the exact dates of infection are unknown, it might be the case that the samples used for the analysis were from viruses that were already evolving under selective pressure exerted by the immune system. The sample from patient 181 that was used for the analysis was collected a few months after infection (this can be seen in the patient infection time line shown in figure 2.2), hence this result might be misleading.
We see a single virus with frequency equal to 86% in the first time point. It is likely that the infection was started by a single founder virus but, due to the selective immune pressure, the founder virus showed a divergent evolutionary pattern. Poisson Fitter analysis showed that the Hamming distance distribution of the first time point sample did not conform to the distribution expected under neutral evolution from a single founder virus. Another explanation for the observation could be that the samples were not from a time point close to the initiation of infection.
Figure 3.7: Patient I.D. 123: Phylogenetic tree showing the relationship between sequences with frequency greater than 0.5%. Blue marks sequences from 2003, green marks sequences from 2005, while red and orange show sequences from 2006 and 2007 respectively
Figure 3.8: Patient I.D. 123: Haplotypes traced over time with their measured frequencies
Figure 3.9: Patient I.D. 181: Phylogenetic tree showing the relationship between sequences with frequency greater than 0.5%. Blue marks sequences sampled on 28.09.2005, green marks sequences from 15.03.2006, while orange shows sequences sampled on 08.09.2006
Figure 3.10: Patient I.D. 181: Haplotypes traced over time with their measured frequencies
Figure 3.11: Patient I.D. 123: Phylogenetic tree with sequences from 2003 with frequency greater than 0.5%
Figure 3.12: Patient I.D. 181: Phylogenetic tree with sequences from 2005 with frequency greater than 0.5%
Figure 3.13: Patient I.D. 123: Demographic construction of the viral population with frequency greater than 0.5%
3.4 Demographic Reconstruction
The effective population size of the virus was plotted over the period of infection for patient I.D. 123 (figure 3.13) and patient I.D. 181 (figure 3.14). The time points at which the sequences were collected are marked in the plots. We see a slight increase in the viral population over the course of infection for patient I.D. 123. The results of this analysis were robust to the use of different model settings. This indicated that the sequence data was informative and not too sensitive to slight model mis-specification. However, since the model could not be informed about the anti-retroviral therapy period, we do not know whether the results would be sensitive to this information. Since we concluded from figure 3.8 that the virus was latent during the therapy, one would assume that the demographic reconstruction is not affected by the therapy phase.
For patient I.D. 181, we see no change in the effective population size of the virus over the first year of infection. The times of sample collection are marked in figure 3.14.
As a proof of principle for the coalescent model used in our study, another analysis was performed on a sequence dataset collected over a period of 3 years from an influenza epidemic. The resulting demographic plot was in agreement with the variation observed during the epidemic. The data contained sequences of length 1700 base pairs, with samples collected every few months, giving high coverage over the three years. The results are not shown since they are freely available on the BEAST website.
Figure 3.14: Patient I.D. 181: Demographic construction of the viral population with frequency greater than 0.5%
3.5 Sliding MinPD
The evolutionary network was constructed for patient I.D. 123 using the Sliding MinPD software. Sequences from 2005, 2006 and 2007 were used in the analysis. The samples from the first time point were ignored since the majority haplotypes were the same at the first and the second time points. A set of runs was set up with differing window sizes and sliding window sizes. The results of the runs were found to be sensitive to slight changes in the window length. Due to the lack of confidence in the results of this analysis, it was not carried out for patient I.D. 181. The results of the runs are shown in figures 3.15 and 3.16. The parameters for the runs, except the window and the sliding window sizes, are listed in table 3.8.
Table 3.8: Parameters used for constructing the evolutionary network using
Sliding MinPD
Active Recombination Detection
Recombination Detection Option
Crossover option
PCC threshold
bootstrap recomb. tiebreaker option
bootscan seed
bootscan threshold
substitution model
gamma shape - rate heterogeneity
show bootstrap values
markers for clustering
clustering distance threshold
clustering option
Yes
Bootscan RIP
Many
p0.4
Yes
E-3
TN93
1847.5869
alpha 0.5
No
Yes
T0.001
by bases but post
In figures 3.15 and 3.16 the sequences in a single column belong to the same time point. The ancestor-descendant relationships are denoted by black dotted lines. The sequences marked in red are the recombinants, and the red lines, when traced back in time, show the two donor haplotypes.
It can be clearly seen that the number of recombinants changes when the window size is changed. Hence this analysis was not carried out for the second patient.
Figure 3.15: Patient I.D. 123: Evolutionary Network constructed using window size 50 and the sliding window size of 15
Figure 3.16: Patient I.D. 123: Evolutionary Network constructed using window size 50 and the sliding window size of 30
3.6 Conclusions
A series of downstream sequence analyses was carried out on the time-stamped data collected from two patients. We formulated an ordered sequence in which these standardized tests can be applied to any set of deep-sequenced, single-gene datasets. These tests returned information regarding the evolutionary pattern followed by the virus and provided estimates for clinically important parameters such as the time of HIV transmission and the population structure of the virus over the infection period. While no special tools were used to detect the effect of anti-retroviral therapy, the entropy measure and the change in haplotype frequency over time were used to identify any of its visible effects. Since we used coalescent-based models for estimating these parameters, the estimates were unique for every patient. As this method is sensitive to the amount of genetic variation present in the dataset, the estimated parameters had large confidence intervals. The overestimation of the time of HIV transmission to the host could be attributed to the phylogenetic tree structure, which did not support the single founder virus hypothesis. This implied that the age of the root of the tree reflected the time to the common ancestor of the multiple viruses that infected the patient rather than the time of HIV transmission to the host.
The relatively conserved protease coding sequences were successful in constructing robust phylogenetic trees showing the relationship between the
haplotypes sequenced from samples collected at different time points. The
pre and post anti-retroviral therapy samples from the patient 123 showed
high similarity in terms of haplotype sequences and their frequencies which
led us to conclude that the virus went under hibernation during the therapy
but was quick to rebound once it was discontinued. There was no evidence
of recombination found in the sample sequences. The viral population plots
were fairly smooth indicating that the effective population size remained almost constant over the course of infection. Even though the number of haplotypes with frequency greater than 1% increased over time for patient 181
indicating increased intra-population diversity, this was not followed with
a increase in the effective population size. A closer look at the sequence
identity revealed that these haplotypes were on an average 99% identical.
The conserved regions of the virus yield stable phylogenies,
but they should be used in combination with data sets from more variable
regions to provide population estimates with narrower confidence intervals.
Another option would be to increase the number of sampled time points
to improve the coalescent-based estimates.
Bibliography
[1] http://www.niaid.nih.gov/factsheets/howhiv.htm.
[2] A rough guide to BEAST 1.4.
[3] 61.0.6. Lentivirus. International Committee on Taxonomy of Viruses, 2002.
[4] Overview of the global AIDS epidemic, 2006.
[5] D Baltimore. Viral RNA-dependent DNA Polymerase: RNA-dependent
DNA polymerase in virions of RNA tumour viruses. Nature,
226(5252):1209–1211, 1970.
[6] Daniel Beyer. Multiplication of HIV. 1997.
[7] Bretscher M T et al. Recombination in HIV and the evolution of drug
resistance: for the better or for worse? BioEssays, 26:180–188.
[8] P. Buendia and G. Narasimhan. Sliding MinPD: Building evolutionary
networks of serial samples via an automated recombination detection
approach. Bioinformatics, 2007.
[9] Bull RA, Luciani F, McElroy K, Gaudieri S, Pham ST et al. Sequential bottlenecks drive viral evolution in early acute Hepatitis C Virus infection.
PLoS Pathog, 7(9), 2011.
[10] Chan D C and Kim P S. HIV entry and its inhibition. Cell, 93(5):681–684, 1998.
[11] Achaz G. et al. Mol Biol Evol., 21(10), 2004.
[12] Cunningham et al. Manipulation of dendritic cell function by viruses.
Current Opinion in Microbiology, 13(4):524–529, 2010.
[13] Cynthia A. Derdeyn et al. Envelope-constrained neutralization-sensitive
HIV-1 after heterosexual transmission. Science, pages 1134–1137, 2004.
[14] Drummond A. J. et al. Bayesian coalescent inference of past population
dynamics from molecular sequences. Molecular Biology and Evolution,
7(214), 2007.
[15] Gilbert P B et al. Comparison of HIV-1 and HIV-2 infectivity from a
prospective cohort study in Senegal. Statistics in Medicine, 22(4), 2003.
[16] Giorgi EE et al. Estimating time since infection in early homogeneous
HIV-1 samples using a Poisson model. BMC Bioinformatics, 25(11), 2010.
[17] Korber et al. Numbering positions in HIV relative to HXB2CG. Human
Retroviruses and AIDS, 1998.
[18] Lee et al. Modelling sequence evolution in acute HIV-1 infection. J Theor
Biol., 2009.
[19] Morgan D et al. HIV-1 infection in rural Africa: is there a difference in
median time to AIDS and survival compared with that in industrialized
countries? AIDS, 16(4):597–632, 2002.
[20] Pond S.L. et al. Estimating selection pressures on alignments of coding
sequences, 2007.
[21] Rodrigo A. et al. The Evolutionary Analysis of Measurably Evolving
Populations using Serially Sampled Gene Sequences. In Mathematics of Evolution and Phylogenetics. Oxford University Press, 2007.
[22] Shankarappa et al. Consistent viral evolutionary changes associated
with the progression of Human Immunodeficiency Virus Type 1 infection. Journal of Virology, 73(12):10489–10502, 1999.
[23] Suzanne English et al. Phylogenetic analysis consistent with a clinical
history of sexual transmission of HIV-1 from a single donor reveals
transmission of highly distinct variants. Retrovirology, 8(54), 2011.
[24] Zagordi et al. ShoRAH: Estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics, 12(119),
2011.
[25] Zheng Y H et al. Newly identified host factors modulate HIV replication. Immunol Lett, pages 225–234, 2005.
[26] Poon Art F.Y. Dates of HIV infection can be estimated for seroprevalent patients by coalescent analysis of serial next-generation sequencing
data. AIDS, 25(16):2019–2026, 2011.
[27] Drummond A J and Rambaut A. BEAST: Bayesian evolutionary analysis
by sampling trees. BMC Evolutionary Biology, 7(214), 2007.
[28] J. L. Thorne and H. Kishino. Estimation of Divergence Times from Molecular Sequence Data.
In Statistical Methods in Molecular Evolution. Springer, 2005.
[29] Martinez-Picado J., DePasquale M. P., Kartsonis N., Hanna G. J., Wong J., Finzi D., Rosenberg
E., Gunthard H. F., Sutton L., Savara A., Petropoulos C. J., Hellmann N.,
Walker B. D., Richman D. D., Siliciano R., and D’Aquila R. T.
Antiretroviral resistance during successful therapy of Human Immunodeficiency Virus type 1 infection. Proc. Natl. Acad. Sci.
U. S. A., 97(20):10948–10953, 2000.
[30] Hall N. Advanced sequencing technologies and their wider impact in
microbiology. J of Exp. Biol., 210(Pt 9):1518–1525, 2007.
[31] Nei M and Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol.,
3(5), 1986.
[32] V. W. Pollard and M. H. Malim. The HIV-1 Rev protein. Annu. Rev.
Microbiol., 52:491–532, 1998.
[33] Oliver G Pybus and Andrew Rambaut. Evolutionary analysis of the
dynamics of viral infectious diseases. The Nature Reviews, 10:540–549,
2009.
[34] Weiss RA. How does HIV cause AIDS? Science, 260(4):1273–1279, 1993.
[35] N Saitou and M Nei. The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Mol Biol Evol, 4(4), 1987.
[36] Sanger F and Coulson AR. A rapid method for determining sequences
in DNA by primed synthesis with DNA polymerase. J of Mol Biol,
94(3):441–448, 1975.
[37] A. Siepel and B. Korber. Scanning the database for recombinant HIV-1
genomes. Los Alamos National Laboratory, pages 35–60.
[38] Dan Stowell. The molecules of HIV. 2006.
[39] Wolinsky SM et al. Selective transmission of Human Immunodeficiency Virus type-1 variants from mothers to infants. Science, 255:1134–
1137, 1992.
[40] Wyatt R and Sodroski J. The HIV-1 envelope glycoproteins: fusogens, antigens, and immunogens. Science, 280(5371):1884–1888, 1998.
[41] Ziheng Yang. Computational Molecular Evolution. Oxford, 2006.