
Predicting Human Immunodeficiency Virus Type 1
Drug Resistance From Genotype
Using Machine Learning.
Robert James Murray
Master of Science
School of Informatics
University Of Edinburgh
2004
ABSTRACT: Drug resistance testing has been increasingly incorporated into the clinical
management of human immunodeficiency virus type 1 (HIV-1) infection. At present, there are two
ways to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs – namely phenotyping
and genotyping. Although phenotyping is recognised as providing a quantified measurement of
drug resistance, it involves a complex procedure and is time consuming. On the other hand,
genotyping involves a relatively simple procedure and can be done quickly. However, the
interpretation of drug resistance from genotype information alone is challenging. A number of
machine-learning methods have now been used to automatically relate HIV-1 genotype with
phenotype. However, the predictive quality of these models has been mixed. This study
investigates the nature of these computational models and their predictive merit. Using the
complete Stanford dataset of matched phenotype-genotype pairs, two contrasting machine-learning
approaches were implemented to analyse the significance of sequence mutations in the protease
and reverse transcriptase genes of HIV-1 for 14 antiretroviral drugs. Both decision tree and
nearest-neighbour classifiers were generated and compared with previously published
classification models. I found prediction errors of 6.3–24.7% for decision tree models and
18.0–46.2% for nearest-neighbour classifiers. This compares with prediction errors of
8.1–51.0% for previously published decision tree models and a correlation coefficient of
0.88 for a neural network lopinavir classification model.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is my own
except where explicitly stated otherwise in the text, and that this work has not been
submitted for any other degree or professional qualification except as specified.
(Robert James Murray)
Table of Contents
1 Introduction
1.1 Overview
1.2 HIV Infection
1.3 Resistance Testing
1.4 Phenotype & Genotype Resistance Tests
1.5 Literature Review
2 Machine Learning
2.1 Overview & The Phenotype Prediction Problem
2.2 The Training Experience, E.
2.3 Learning Mechanisms
2.3.1 Decision Trees
2.3.2 Artificial Neural Networks
2.3.3 Instance Based
3 Materials & Methods
3.1 Data Set
3.2 Decision Trees
3.3 Nearest-Neighbour
4 Results
4.1 Classification Models
4.1.1 Reverse Transcriptase Inhibitors
4.1.2 Protease Inhibitors
4.2 Prediction Quality
5 Conclusion
5.1 Concluding Remarks & Observations
5.1.1 Decision Tree Models
5.1.2 Neural Network Models
5.1.3 Nearest-Neighbour Models
5.2 Suggestions For Further Work
5.2.1 Handling Ambiguity Codes
5.2.2 Using A Different Attribute Selection Measure
5.2.4 Using A Different Distance Metric
5.2.5 Receiver Operating Characteristic Curve
5.2.6 Other Machine Learning Approaches
A Pre-Processing Software
B Cultivated Phenotype
C The Complete Datasets
D Original Decision Trees
Bibliography
Chapter 1
Introduction
1.1 Overview
Drug resistance testing has been increasingly incorporated into the clinical management of
human immunodeficiency virus type 1 (HIV-1) infection. Resistance tests can show whether
a particular HIV-1 strain is likely to be suppressed by a drug. At present, there are two ways
to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs – namely phenotyping
and genotyping. Whereas phenotypic assays provide a direct quantification of drug
resistance, genotypic assays only provide ‘clues’ towards drug resistance. In particular,
genotyping attempts to establish the presence or absence of genetic mutations in the protease
and reverse transcriptase genes of HIV-1 that have been previously associated with drug
resistance. Although phenotyping is recognised as providing a quantified measurement of
drug resistance, it involves a complex procedure and is time consuming. On the other hand,
genotyping involves a relatively simple procedure and can be done quickly. However, the
interpretation of drug resistance from genotype information alone is still challenging, and
often requires expert analysis.
A number of machine-learning methods have now been used to automatically relate HIV-1
genotype with phenotype [1,2,10,9]. Using datasets of matched phenotype-genotype pairs,
machine-learning methods can be used to derive computational models that predict
phenotypic resistance from genotypic information. However, the predictive quality of these
models has been mixed. For some drugs, these models offer reasonable prediction rates but
for others the results are less useful for managing HIV-1 infection.
This study attempts to investigate the nature of these computational models and their
predictive merit. Specifically, I had the following initial goals: in relation to previous work [1],
generate decision tree classifiers to recognise genotypic patterns attributed to drug resistance; evaluate the predictive quality and nature of these models in retrospect; and consider
the application of other machine learning approaches.
Using the complete Stanford dataset of matched phenotype-genotype pairs, two contrasting
machine-learning approaches were implemented, in Java, to analyse the significance of
sequence mutations in the protease and reverse transcriptase genes of HIV-1 for 14
antiretroviral drugs. Specifically, decision tree classifiers were generated to predict
drug-susceptibilities from genotypic information and nearest neighbour classifiers were built
to identify genotypes with similar mutational patterns. Also evaluated in the study was the
possible use of artificial neural networks as proposed in [2].
The predictive quality of the classification models was analysed with regard to an independent
testing set. For decision trees, I found prediction errors of 6.3–24.7% across all drugs.
This was compared with the performance of the decision trees presented in [1]; specifically,
these models achieved prediction errors in the range 8.1–51.0% over the same testing set.
Nearest-neighbour classifiers exhibited poorer performance, with prediction errors of
18.0–46.2%.
1.2 HIV Infection
Human immunodeficiency virus (HIV) is a superbly adapted human pathogen, and as of the
end of 2003, an estimated 37.8 million people worldwide – 35.7 million adults and 2.1
million children younger than 15 years – were living with HIV/AIDS. Furthermore, an
estimated 4.8 million of these were new HIV infections.
Once HIV enters the human bloodstream – through sexual contact, exchange of blood
or breast milk – it seeks a host cell in order to reproduce. Although HIV can infect a number
of cells in the body, the main target is an immune cell called the T-lymphocyte, or T-cell.
Once HIV comes in contact with a T-cell, it attaches itself to it and hijacks its cellular
machinery to reproduce thousands of new copies of HIV, see figure 1.1. T-cells are an
important part of the immune system because they help facilitate the body’s response to
many common but potentially fatal infections. Without enough T-cells the body’s immune
system is unable to defend itself against infections.
image from http://www.aidsmeds.com/lessons/LifeCycle1.htm
Fig. 1.1
The HIV life-cycle. In (1) HIV encounters a T-cell and gp120 (on the surface of HIV)
binds to the T-cell's CD4 molecule. The membranes of the HIV particle and T-cell fuse and the
contents of the HIV particle are released into the T-cell. In (2) reverse transcription creates a DNA
copy of the virus's RNA. In (3) the HIV DNA is transported to the T-cell's nucleus, where another
viral enzyme called integrase integrates the proviral DNA into the cell's DNA. In (4) HIV's genetic
material directs the T-cell to produce new HIV. In (5) and (6) a new HIV particle is assembled.
Being HIV-positive, or being infected with HIV, is not the same as having acquired immune
deficiency syndrome (AIDS). Someone with HIV can live for many years with few ill
effects. Of course, this is provided that their bodies replace the T-cells destroyed by the
virus. However, once the number of T-cells diminishes below a certain threshold, infected
individuals will start to display symptoms of AIDS. Such individuals will have a lowered
immune response and are highly susceptible to a wide range of infections that are harmless
to healthy people but may ultimately prove fatal. Indeed, in 2003, HIV/AIDS-associated
illnesses caused the deaths of approximately 2.9 million people worldwide, including an
estimated 490,000 children younger than 15 years, and since the first AIDS cases were
identified in 1981, over 20 million people with HIV/AIDS have died.
1.3 Resistance Testing
There are now a number of antiretroviral drugs approved for treating HIV-1 infection.
Treatment with combinations of these drugs can offer individuals prolonged virus
suppression and a chance for immunologic reconstitution. However, unless therapy
suppresses virus replication completely, the “selective pressure” of antiretroviral treatment
enhances the emergence of drug-resistant variants, see figure 1.2.
image from http://www.vircolab.com/bgdisplay.jhtml?itemname=understandinghiv
Fig. 1.2
The development of drug resistance.
The emergence of these variants depends on the development of genetic variations in the
virus that allow it to escape from the inhibitory effects of a drug. We say that a drug-resistant
variant is identifiable from a reference virus by the manifestation of mutations that contribute
to reduced susceptibility. This occurs through natural mechanisms. In particular, HIV-1
genetic variability results from the inability of the HIV-1 reverse transcriptase to proofread
nucleotide sequences during replication [3]. This is compounded by the high rate of HIV-1
replication (approximately 10 billion particles/day), a spontaneous mutation rate
(approximately 1 mutation/copy) and genetic recombination when viruses of different
sequence infect the same cell.
Once a drug-resistant variant has emerged, greater levels of the same antiretroviral drug are
required to suppress virus replication. However, with greater levels of drug we increase the
risk of adverse side effects and harm to an individual. Therefore when resistance occurs,
patients often need to change to a new drug regimen.
To help, we can use resistance tests to show whether a particular HIV-1 strain is likely to be
suppressed by a drug or not. Nevertheless, until recently, resistance testing was used solely
as a research tool, to investigate the mechanisms of drug failure. In July 1998, the idea of
extending the methodology of resistance testing to routine clinical management, although
logical, could not be recommended due to lack of validation, standardisation and a concrete
definition of the role of the testing [4]. Nevertheless, since then, a number of studies have
emerged indicating the worth of resistance testing for clinical management, and the lack of
standardisation was addressed with the development of commercial resistance tests. Both of
these factors contributed to a second statement, published in early 2000, which explicitly
recognised the value of HIV-1 drug resistance testing [5] and finally, in May 2000, the
International AIDS Society recommended the incorporation of drug-resistance testing in the
clinical management of patients with HIV-1 infection [6]. Furthermore, considerable data
supporting the use of drug-resistance testing has now been published or presented at
international conferences [7].
1.4 Phenotype And Genotype Resistance Tests
At present, there are two ways to assess the susceptibility of an HIV-1 isolate to
antiretroviral drugs in vitro – namely phenotyping and genotyping. Phenotyping directly
measures the susceptibility of HIV-1 strains to particular drugs, whereas genotyping
establishes the absence or presence of specific mutations in HIV-1 that have been previously
associated with drug resistance.
Phenotype resistance tests involve direct quantification of drug sensitivity. Viral replication
is measured in cell cultures under the selective pressure of increasing concentrations of
antiretroviral drugs. Specifically, a sample of blood is taken from a patient and the HIV is
isolated. At this stage the reverse transcriptase and protease genes are recognised and
amplified. The amplified genes are then inserted into a laboratory strain of HIV, which the
scientists are able to grow (a recombinant virus). The ability of the virus to grow in the
presence of each antiretroviral drug is evaluated. In particular, the virus is grown in the
presence of varying concentrations of antiretroviral drugs and its ability to grow in the
presence of these drugs is compared to that of a reference strain.
The outcome of a phenotypic test may be expressed as an IC50, IC90 or IC95 value, where the
IC value expresses the concentration of a particular drug required to inhibit the growth of the
virus by 50%, 90% or 95%, respectively, see figure 1.3. The level of drug resistance is
typically reported by comparing the IC50 value of the patient's HIV with that of a reference
strain. In particular, a degree of fold-change is reported, where fold-change indicates the
increase in the amount of drug needed to stop viral replication compared to the reference
strain. In other words, a phenotypic resistance test gives a clear indication of the capability of
an HIV-1 strain to grow in the presence of a particular concentration of an antiretroviral
drug. The results are easily interpreted, but the tests themselves often prove time consuming,
expensive and labour intensive.
Fig. 1.3 Comparing the IC50 values of a reference and a resistant strain. (The figure plots
antiviral effect (%) against drug concentration; the resistant strain requires a higher drug
concentration to reach 50% inhibition, i.e. a higher IC50, than the reference strain.)
Genotypic resistance tests are based on the analysis of mutations associated with resistance.
In a genotypic test the nucleotide sequence of the virus genome is determined by direct
sequencing of PCR products. This sequence is then aligned with a reference strain and the
differences are recorded. In particular, the results of a genotypic test are given as a list of
point mutations with respect to a reference strain. Such mutations are expressed by the
position they have in a certain gene, preceded by the letter corresponding to the amino-acid
seen in the reference virus, and followed by the mutated amino acid. For example, M184V
corresponds to the substitution of methionine by valine at codon 184.
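This notation is simple enough to decompose programmatically. Below is a minimal sketch in
Java (the language used for the implementations in this study); the class name and the
restriction to single-substitution strings such as "M184V" are assumptions of the sketch, not
part of any published tool.

    // Sketch: decomposing a mutation string such as "M184V" into the
    // reference amino acid, the codon position, and the mutant amino acid.
    public class Mutation {
        final char reference;  // amino acid in the reference strain, e.g. 'M'
        final int position;    // codon position within the gene, e.g. 184
        final char mutant;     // amino acid observed in the sample, e.g. 'V'

        Mutation(String s) {
            reference = s.charAt(0);
            position = Integer.parseInt(s.substring(1, s.length() - 1));
            mutant = s.charAt(s.length() - 1);
        }
    }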
In contrast to phenotypic tests, genotypic tests usually provide results in a few days and are
less expensive. However, genotyping can be viewed as primarily measuring the likelihood of
reduced susceptibility, the major challenge being the correct interpretation of the results in
order to assign a realistic level of drug resistance. The
specific type and placement of the mutations determines which drugs the virus may be
resistant to. For example, if the “M184V” mutation is discovered in a patient’s HIV, the
virus is probably resistant to the reverse transcriptase inhibitor lamivudine. In this respect,
clinicians often utilise tables of known mutations attributed to drug-resistance, see figure 1.4.
However, this is not a simple task, because the clinician cannot consider
each mutation independently of the others. Specifically, the influence of a mutation on drug
resistance must be considered as a part of a global interaction [8]. In addition it is not viable
to continue to use such tables of mutations because their complexity must grow as the
number of drugs, and especially drug combinations, increases.
image from http://hivinsite.ucsf.edu/InSite?page=kb-authors&doc=kb-03-02-07
Fig. 1.4.
Mutations in the protease associated with reduced susceptibility to protease inhibitors.
1.5 Literature Review
A number of statistical methods have been used to investigate the relationship between the
results of HIV-1 genotypic and phenotypic assays. Cluster analysis, linear discriminant
analysis, heuristic scoring metrics, nearest neighbour, neural networks and recursive
partitioning have been used to correlate drug-susceptibility with genotype. However, the
problem of relating the results of genotypic and phenotypic assays provides several statistical
challenges, and the success of these methods has been mixed. Firstly, phenotype must be
considered as a consequence of a large number of possible mutations. As mentioned
previously, this is compounded by the fact that the effect of mutations at any given position
is influenced by the presence of mutations at other positions and it is therefore necessary to
detect global interactions.
The use of cluster analysis and linear discriminant analysis is described in [9]. Investigating
drug-resistance mutations of two protease inhibitors, saquinavir (SQV) and indinavir (IDV),
the results of these analyses were comparable. In particular, both analyses were able to
identify the association of mutations at amino acid positions 10, 63, 71 and 90 with in vitro
resistance to SQV and IDV.
In more detail, cluster analysis requires a notion of distance between any two amino acid
sequences. Typically a set of indicator variables is created for each amino acid position and a
vector of indicator variables is then used to represent a sequence. The distance between two
sequences can then be defined as the standard Euclidean distance between the two
corresponding vectors. Such a measure can then be used to create groups of amino acid
sequences with similar genotypes. Furthermore, the distance between any two groups can be
defined as the average of the individual distances between all pairs of members of the two
groups. Creating hierarchies of such groups or clusters then facilitates the investigation of
the degree to which similar genotypes have similar phenotypes.
On the other hand, linear discriminant analysis can be used to determine which mutations
best predict sensitivity to drugs, as defined by the phenotypic IC50 value. A dataset consisting
of matched phenotype genotype pairs is split into two groups, labelled as either resistant or
susceptible on the basis of an IC50 cut-off. A linear discriminant function is then
used to predict which group an unknown genotype belongs to, where the linear discriminant
function is simply a linear combination of predictors or indicator variables (the choice here
varies).
In contrast, a simple scoring metric is used in [10] to predict HIV-1 protease inhibitor
resistance. Here a database of genotype-phenotype pairs was analysed. It was found that
samples with one or two mutations in the protease gene were phenotypically susceptible and
samples with five or more mutations were resistant to all protease inhibitors. A list of
all the mutations present in the database was compiled and split into two groups: one in
which mutations were frequent in both susceptible and resistant samples, and one in which
mutations were predominantly present in resistant samples. A scoring system using the
presence or absence of any single mutation compiled by Schinazi et al [11] was then used as
a criterion for predicting phenotypic resistance from genotype. This was enhanced by the
incorporation of a secondary score that simply takes into account the total number of
resistance-associated mutations in the protease. This achieved high sensitivity (92.5–98.5%)
but lower specificity (57.9–77.3%) on unseen cases.
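The flavour of such a mutation-count rule can be sketched in a few lines of Java. This is an
illustrative reconstruction only: the mutation list and the cut-off of five below are
placeholders standing in for the published list of Schinazi et al, not the list itself.

    import java.util.Set;

    // Sketch of a mutation-count score: count the resistance-associated
    // protease mutations present in a sample and call the sample resistant
    // once the count reaches a cut-off (echoing the "five or more
    // mutations" observation above). The mutation set is illustrative only.
    public class MutationScore {
        static final Set<String> RESISTANCE_MUTATIONS =
            Set.of("10I", "46I", "48V", "54V", "82A", "84V", "90M"); // placeholder list
        static final int CUTOFF = 5;

        static boolean predictResistant(Set<String> sampleMutations) {
            long score = sampleMutations.stream()
                    .filter(RESISTANCE_MUTATIONS::contains)
                    .count();
            return score >= CUTOFF;
        }
    }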
Commercially, the Virco-Tibotec company has, over time, accumulated a database of
around 100,000 genotypes and phenotypes. Using this dataset a virtualPhenotype™ is
generated from genotype by means of a nearest neighbour strategy. In particular, their
system begins by identifying all the mutations in the genotype that can affect resistance to
each drug. It uses this ‘profile’ to interrogate the dataset for previously seen genotypes that
have similar profiles. When all the possible matches are identified, the phenotypes for these
samples are retrieved and for each drug the data is averaged. This generates a report of
virtual IC50 values for each drug.
In contrast, in [2] artificial neural networks were used to predict lopinavir resistance from
genotype. In brief, a neural network is a two-stage regression or classification model that
represents real-valued, discrete-valued and vector-valued functions. Using a set of examples,
algorithms such as backpropagation tune the parameters of a fixed network of units and
interconnections. For example, backpropagation employs gradient descent to attempt to
minimise the error between the network outputs and target values. In [2] two neural network
models were developed. The first was based on changes at only 11 amino acid positions in
the protease, as described in the literature, and the second was based on 28 amino acid
positions resulting from category prevalence analysis. A set of 1322 clinical samples was
utilised to train, validate and test the models. Results were expressed in terms of the
correlation coefficient R2. In simple terms, the correlation coefficient indicates the extent to
which the predicted and true values lie on a straight line. It was found that the 28-mutation
model proved to be more accurate than the 11-mutation model at predicting lopinavir drug
resistance (R2 = 0.88 against R2 = 0.84).
Alternatively, decision tree classifiers were generated by means of recursive partitioning in
[1]. These models were then used to identify genotypic patterns characteristic of resistance to
14 antiretroviral drugs. In brief, recursive partitioning describes the iterative technique used
to construct a decision tree. Recursive partitioning algorithms begin by splitting a population
of data into subpopulations by determining an attribute that best splits the initial population.
It continues by repeated identification of attributes that best splits each resulting
subpopulation. Once a subpopulation contains individuals of the same type no more splits
are made and a classification is assigned to that population. An unknown case is then given a
classification by sorting it down the tree from root to a leaf using the attributes as specific
tests. Here the initial population was a dataset of matched phenotype genotype pairs,
consisting of 471 clinical samples. For each drug in the study a separate decision tree
classifier was constructed to predict phenotypic resistance from genotype using an
implementation of recursive partitioning. These models were then assessed using leave-one-out experiments, and prediction errors were found in the range 9.6–32%.
Chapter 2
Machine Learning
2.1 Overview & The Phenotype Prediction Problem
In general, the field of machine learning is concerned with the question of how to construct
computer programs that automatically improve with experience. Typically, we have a
classification, either quantitative or categorical, that we wish to predict based on a set of
input characteristics. We have a training set consisting of matched classifications with
characteristics. Then using this dataset we strive to build a classification model that will
enable us to predict the appropriate outcome for unseen cases. Consider the following
definition of a well-posed learning problem, according to Mitchell [12]:
Definition: A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks
in T, as measured by P, improves with experience E.
The first step in constructing such a program is to consider the type of training experience
from which the system will learn. This is important because the type and availability of
training material has a significant impact on the success or failure of the learner. An
important consideration (especially applicable to clinical data) is how well the training
material represents the entire population. In general, machine learning is most effective when
the training material is distributed similarly to that of the remaining population. However, in
a clinical setting, this is almost never entirely the case. When developing systems that learn
from clinical data it is often necessary to learn from a distribution of examples that may be
fairly different from those in the remaining population. Such situations are challenging
because the success over one distribution will not necessarily lead to a strong performance
over some other distribution, since most machine learning theory rests on the assumption
that the distribution of training material is exactly representative of the entire population.
However, although not ideal, this situation is often the case when developing practical
machine learning systems.
The next step is to determine the type of knowledge to be learned and how this will be used
by the performance measure. In particular, we seek to define a target function
F: C → O, where F accepts as input a set of characteristics C and produces as output some
classification O that is true for C. The problem of improving performance P at tasks T is then
reducible to finding a function F2 that performs better than a function F1 at tasks T, as
measured by P.
The final step in building a learning system is to choose an appropriate learning mechanism
to derive the target function F. This includes determining the representation of the function
F, whether it should be a decision tree, neural network, instance-based, concept-based, linear
function etc. Once a representation is decided upon an appropriate algorithm must then be
employed to generate such representations from the training experience E.
Many machine learning problems can be specified using this framework. In particular, we can
define a machine learning problem by specifying the task T, performance measure P, training
experience E and target function F. In this way, the generalised phenotype prediction
problem can be defined as follows:
• Task T: The correct interpretation of genotype sequence information with respect to drug resistance.
• Performance Measure P: The percentage of sequences correctly classified as resistant or susceptible to a particular drug.
• Training Experience E: A dataset of matched phenotype-genotype pairs.
• Target Function: Resistant: Genotype → Drug_Susceptibility_Classification, where Resistant accepts as input the result of a genotype resistance test, Genotype, and outputs a drug susceptibility classification, Drug_Susceptibility_Classification, indicating the susceptibility of a genotype to a particular antiretroviral drug.
2.2 The Training Experience, E.
The first step towards addressing the phenotype prediction problem is to acquire a suitable
dataset of matched phenotype-genotype pairs.
In [1], 471 clinical samples from 397 patients were analysed first-hand to provide both
phenotypic and genotypic information for six nucleoside inhibitors of the reverse
transcriptase, zidovudine (ZDV), zalcitabine (ddC), didanosine (ddI), stavudine (d4T),
lamivudine (3TC) and abacavir (ABC); three nonnucleoside inhibitors of the reverse
transcriptase, nevirapine (NVP), delavirdine (DLV) and efavirenz (EFV); and five protease
inhibitors, saquinavir (SQV), indinavir (IDV), ritonavir (RTV), nelfinavir (NFV) and
amprenavir (APV). This resulted in 443-469 phenotype-genotype pairs for each drug, except
for APV.
In detail, genotype results were obtained through direct sequencing of the patients' HIV
containing the complete protease and the first 650-750 nucleotides of the reverse
transcriptase. These sequences were then aligned to the reference strain HXB2 to identify
differences.
Each genotype was then modelled in a computationally useable manner using one attribute
for each amino acid position, allowing as a value a single-letter amino acid code or
“unknown” for positions in which ambiguous or no sequence information was available, see
figure 2.1.
Fig. 2.1 Modelling genotype. A sequencing result is aligned to the reference sequence, and
each sequence position is represented by either the amino acid present at that position or
“unknown” if no sequence information is available.
Phenotyping was performed using a recombinant virus assay. Recombinant viruses were
cultivated in the presence of increasing amounts of antiretroviral drugs and fold-change
values were calculated by dividing the IC50 value of the relevant recombinant virus by the
IC50 value of a reference strain (NL4-3). Fold-change values were distributed as in figure
2.2.
image from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=123057&rendertype=figure&id=F1
Fig 2.2 Frequency distribution of fold-change values for a subset of 271 samples for which data
was available for all 14 drugs.
Using the fold-change values, the dataset of genotypes was grouped into two classes for
each antiretroviral drug. In particular, a genotype was labelled as either “resistant” or
“susceptible”, using a drug-specific fold-change threshold, see figure 2.3. These thresholds
were obtained from previously published work: 8.5 for ZDV, 3TC, NVP, DLV, EFV; 2.5 for
ddC, ddI, d4T and ABC; and 3.5 for SQV, IDV, RTV, NFV, APV.
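Labelling is then a direct threshold comparison, as in the following Java sketch. Treating a
fold-change exactly on the cut-off as resistant is an assumption here, since the source does
not state whether the boundary is inclusive; the remaining protease inhibitors (IDV, RTV,
NFV, APV) also use 3.5 and are omitted from the map only for brevity.

    import java.util.Map;

    // Sketch: assigning a class label from a fold-change value using the
    // drug-specific cut-offs quoted above.
    public class PhenotypeLabel {
        static final Map<String, Double> CUTOFF = Map.of(
            "ZDV", 8.5, "3TC", 8.5, "NVP", 8.5, "DLV", 8.5, "EFV", 8.5,
            "ddC", 2.5, "ddI", 2.5, "d4T", 2.5, "ABC", 2.5, "SQV", 3.5);

        static String label(String drug, double foldChange) {
            return foldChange >= CUTOFF.get(drug) ? "resistant" : "susceptible";
        }
    }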
Drug   No. of pheno-geno pairs   Percentage of examples classified resistant
ZDV    456                       58.1
ddC    456                       43.0
ddI    456                       49.1
d4T    456                       38.6
3TC    452                       54.4
ABC    445                       66.3
NVP    457                       45.1
DLV    455                       36.5
EFV    443                       35.9
SQV    465                       46.7
IDV    469                       48.8
RTV    469                       50.1
NFV    468                       53.6
APV    277                       32.9

Fig. 2.3 Characteristics of the dataset.
In contrast to the more direct approach presented above, one may obtain both phenotypic
and genotypic information via online databases. In recent years, databases, consisting of
thousands of matched phenotype genotype pairs, have emerged in response to a concern
about the lack of publicly available HIV reverse transcriptase and protease sequences. One
such database, the Stanford HIV-1 drug resistance database, available at
http://hivdb.stanford.edu/, strives towards linking HIV-1 reverse transcriptase and protease
sequence data, drug treatment histories and drug susceptibilities to allow researchers to
analyse the extent of clinical cross-resistance among current and experimental antiretroviral
drugs.
Other online databases that provide access to HIV reverse transcriptase and protease
sequences with drug-susceptibility data, include:
• http://hivinsite.ucsf.edu/
• http://www.resistanceweb.com/
• http://jama.ama-assn.org
• http://home.ncifcrf.gov/hivdrp
2.3 Learning Mechanisms.
The final step in addressing the phenotype prediction problem is to choose an appropriate
learning mechanism. This includes choosing an appropriate representation of the function
Resistant: Genotype → Drug_Susceptibility_Classification, and an algorithm to
automatically derive such a representation from the training experience. There are many
choices available; decision trees, artificial neural networks and instance-based learning are
just a few.
2.3.1 Decision Trees.
In [1] decision trees were used to represent the target function
Resistant: Genotype → Drug_Susceptibility_Classification for 14 antiretroviral drugs.
Specifically, decision trees were generated from a set of phenotype-genotype pairs (described
previously in section 2.2) using the C4.5 software package and a statistical measure
indicating the amount of information that a particular sequence position provides about
differentiating resistant from susceptible samples.
A decision tree is an acyclic graph with interior vertices symbolizing tests to be carried out
on a characteristic or attribute and leaves indicating classifications. Decision trees then
classify unseen cases by sorting them through the tree from the root to some leaf, which
provides a classification. In other words, each node in the graph symbolizes a test of some
attribute, and each edge departing from that node corresponds to one of the possible values
for the attribute. An unseen case is then classified by starting at the root node of the graph,
testing the attribute specified by this node and traversing the edge corresponding to the value
of the attribute in the given case. This process is repeated for the sub-graph rooted at the new
node until a leaf node is encountered, at which point a classification is assigned to the case.
Specifically, in [1] the classification of a genotype was achieved by traversing a decision
tree, associated with an antiretroviral drug, from the root to a leaf according to the values of
amino acids at specific sequence positions, see figure 2.4.
Fig. 2.4 Decision tree for the protease inhibitor SQV as presented in [1]. Internal nodes
represent tests at specific amino acid positions (here 90, 48, 54, 72 and 84), edges represent
amino acid values and leaves represent drug-susceptibility classifications (R = resistant,
S = susceptible).
In more general terms, a decision tree represents a disjunction of conjunctions of constraints
on the attribute values. Each path from the root of the graph to a leaf translates to a
conjunction of attribute tests, and the graph itself translates to a disjunction of these
conjunctions. For example the decision tree in figure 2.4 can be translated into the
expression
( 90 = F ) \/ (90 = M) \/ ( 90 = L /\ 48 = V ) \/ ( 90 = L /\ 48 = G /\ 54 = L ) \/
( 90 = L /\ 48 = G /\ 54 = I /\ 84 = V ) \/ ( 90 = L /\ 48 = G /\ 54 = V /\ 72 = I )
representing the knowledge that the decision tree uses to determine resistance.
Most algorithms that have been developed for generating decision trees offer slight
variation on a core methodology that uses a greedy search through a space of possible
decision trees. In this way we can think of decision tree learning as involving the
search of a very large space of possible decision tree classifiers to determine the one
that best fits the training data. Both the ID3 and C4.5 algorithms typify this approach.
The basic decision tree-learning algorithm, ID3, performs a greedy search for a
decision tree that fits the training data. In summary, it begins by asking, “Which
attribute should be tested at the root of the tree?” Each possible attribute is evaluated
using a statistical test to determine how well it alone classifies the complete set of
training data. A descendent of the root is created for each possible value of the
attribute, that is chosen, and the training data is sorted to the appropriate descendent
nodes. The procedure is then repeated for each of the descendent nodes until all the
training data is correctly classified. The ID3 algorithm is given in table 2.1.
The central choice in the ID3 algorithm is which attribute to test at each node. In order
to select an appropriate attribute we can employ a statistical measure to quantify how
well an attribute separates a set of data according to a finite alphabet of possible
outcomes. A popular measure is called information gain.
The information gain of an attribute A, relative to a dataset S, is defined as
ig(S, A) = entropy(S) − Σ_{v ∈ values(A)} ( |Sv| / |S| ) · entropy(Sv)
ID3(Examples, Target_attribute, Attributes)
Examples are the training examples. Target_attribute is the attribute whose value is to be
predicted by the tree. Attributes is a list of other attributes that may be tested by the learned
decision tree. Returns a decision tree that correctly classifies the given Examples.
- Create a Root node for the tree.
- If all Examples are positive, Return the single-node tree Root, with label = +
- If all Examples are negative, Return the single-node tree Root, with label = −
- If Attributes is empty, Return the single-node tree Root, with label = most common value of
  Target_attribute in Examples.
- Otherwise Begin
  o A ← the attribute from Attributes that best classifies Examples
  o The decision attribute for Root ← A
  o For each possible value, vi, of A,
    - Add a new tree branch below Root, corresponding to the test A = vi
    - Let Examples_vi be the subset of Examples that have value vi for A
    - If Examples_vi is empty
      • Then below this new branch add a leaf node with label = the most common
        value of Target_attribute in Examples
      • Else below this new branch add the subtree
        ID3(Examples_vi, Target_attribute, Attributes − {A})
- Return Root
Algorithm taken from Mitchell's book [12]
Table 2.1 The ID3 algorithm
where values(A) is the set of all possible values for attribute A, Sv is the subset of S for which
attribute A has value v and entropy(S) measures the impurity of the set S, such that if all the
members of S belong to the same class then entropy is 0 and conversely the entropy is 1 if
members of S are equally distributed, see figure 2.5. Specifically, if a set S, contains a
number of examples belonging to either a positive or negative class, the entropy of S relative
to these two classes is defined as:
entropy(S) = − p(+) log2 p(+) − p(−) log2 p(−)
where p(+) is the proportion of examples in S belonging to the positive class and p(-) is
the proportion of examples in S belonging to the negative class. In this respect,
ig(S, A) measures the expected reduction in the impurity of a set of examples S
according to the attribute A. In other words, we wish to select attributes with high
values for information gain.
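The two definitions translate directly into code. The following Java sketch computes the
entropy of a two-class (resistant/susceptible) set and the information gain of one attribute;
the representation of an example as a value-label string pair is an assumption made only to
keep the sketch self-contained.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Sketch: entropy and information gain for a two-class dataset.
    // Each example is reduced to {attributeValue, classLabel}.
    public class InfoGain {

        // entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)
        static double entropy(long positives, long total) {
            if (total == 0 || positives == 0 || positives == total) return 0.0;
            double p = (double) positives / total;
            return -p * log2(p) - (1 - p) * log2(1 - p);
        }

        // ig(S, A) = entropy(S) - sum over values v of (|Sv|/|S|) * entropy(Sv)
        static double informationGain(List<String[]> examples) {
            long total = examples.size();
            long pos = examples.stream().filter(e -> e[1].equals("resistant")).count();
            double gain = entropy(pos, total);
            Map<String, List<String[]>> byValue =
                examples.stream().collect(Collectors.groupingBy(e -> e[0]));
            for (List<String[]> subset : byValue.values()) {
                long subPos = subset.stream().filter(e -> e[1].equals("resistant")).count();
                gain -= ((double) subset.size() / total) * entropy(subPos, subset.size());
            }
            return gain;
        }

        static double log2(double x) { return Math.log(x) / Math.log(2); }
    }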
Fig. 2.5 The entropy function relative to a boolean classification: entropy(S) plotted against
p(+), rising from 0 at p(+) = 0 to 1 at p(+) = 0.5 and falling back to 0 at p(+) = 1.
A major drawback of the ID3 algorithm is that it continues to grow a decision tree
until all the training examples are perfectly classified. This approach, although
reasonable, can lead to difficulties either when there is erroneous data in the training
set or the distribution of the training examples is not representative of the entire
population. In these cases, ID3 produces decision trees that overfit the training
examples. A decision tree is said to overfit the training examples if there exists some
other decision tree that classifies the training examples less accurately but nevertheless
performs better over the entire population. Figure 2.6 illustrates the impact of
overfitting.
Fig. 2.6 Overfitting in decision tree learning: accuracy on the training data and on the
validation data, plotted against the size of the tree (no. of nodes). When ID3 creates a new
node, the accuracy of the tree measured on the training examples increases; however, measured
on the validation examples, the accuracy of the tree decreases as its size increases.
Overfitting is a significant problem for decision tree learning and indeed machine learning as
a whole. Specifically, overfitting has been found to decrease the accuracy of learned decision
trees by 10-25% [13]. However, there are a number of techniques available that minimise the
effects of overfitting in decision tree learning. Typically, we begin by randomly partitioning
the training data into two subsets, one for training and one for validation. Using the training
set we grow an overfitted decision tree and then post-prune using the validation set. Post-pruning, in this respect, has the effect that any nodes added due to coincidental regularities in
the training set are likely to be removed because the same regularities are unlikely to occur
in the validation set.
Reduced error pruning is one post-pruning strategy that uses a validation set to minimise the
effects of overfitting. Here each node in the decision tree is considered as a candidate for
pruning. The removal of a node is determined by how well the reduced tree classifies the
examples in the validation set. In particular, if the reduced tree performs no worse than the
original over the validation set then the node is removed, making it a leaf and assigning it the
most common classification of the training examples associated with that node. Figure 2.7
illustrates the impact of reduced error pruning.
Fig. 2.7 The impact of reduced-error pruning: accuracy on the training data, on the validation
data, and on the validation data during pruning, plotted against the size of the tree (no. of
nodes).
The C4.5 algorithm is a successor of the basic ID3 algorithm. C4.5 behaves in the same way
as ID3 but addresses a number of additional issues. In particular, ID3 was criticised for
not offering support for continuous attributes and for not handling training data with
missing attribute values.
Where the basic ID3 algorithm is restricted to attributes that take on a discrete set of values,
C4.5 converts continuous values into a set of discrete values by separating the data into a
number of bins and attaching a label to each bin. Considering the dataset presented in section
2.2, the ability to handle continuous values isn’t required. In particular, each genotype is
modelled using a number of attributes (amino acid positions) that have discrete values
(single letter amino acid codes).
C4.5 handles attributes with missing values by assigning a probability to each of the possible
values of an attribute, based on the observed frequencies of the various values. These
fractional proportions are then used to both grow the tree and classify unseen cases with
missing attribute values. Again, considering the dataset presented in section 2.2, the ability
to handle attributes with missing values isn't required. In particular, the dataset presented
in section 2.2 explicitly models missing attribute values using a value of “unknown”.
Once continuous values and missing attribute values are handled, the C4.5 algorithm uses a
greedy search (similar to ID3) to find a decision tree that exactly conforms to the training
data, and uses a post-pruning strategy, called rule post-pruning, to minimise the effects of
overfitting. In particular, rule post-pruning involves: inferring an overfitted decision tree from
the training data, converting the decision tree into a set of classification rules (conjunctions
of constraints on attribute values) and removing any constraints on attribute values whose
removal improves a rule's estimated accuracy.
2.3.2 Artificial Neural Networks
In [2] two artificial neural networks were independently used to represent the target function
Resistant: Genotype → Drug_Susceptibility_Classification for the protease inhibitor
lopinavir. Specifically, two artificial neural networks were generated from a set of 1322
phenotype-genotype pairs using the backpropagation algorithm to learn the parameters of a
single hidden layer network predicting the susceptibility of lopinavir from genotype. The
first network was based on mutations at 11 amino acid positions that were previously
recognised as being attributed to drug-resistance and the second was based on mutations at
28 amino acid positions as identified through statistical analysis.
The study of artificial neural networks was initially inspired by the observation that
biological learning systems are built from very complex webs of simple computational units
called neurons. Analogously, artificial neural networks are built from densely interconnected
sets of units called sigmoid perceptrons, see figure 2.8.
Fig. 2.8 A sigmoid perceptron. The inputs v1, v2, …, vn are weighted by w1, w2, …, wn (with a
bias weight w0), summed as inputs = w0 + w1v1 + w2v2 + … + wnvn, and passed through the
sigmoid function 1 / (1 + e^−inputs).
A sigmoid perceptron is a simple computational unit that takes as input a vector of numerical
values and outputs a continuous function of its inputs, called the sigmoid function.
Specifically, given a vector of inputs {v1, v2, …, vn} a linear combination of these inputs is
calculated as w0 + w1v1 + w2v2 + … + wnvn. Each wi is a numerical constant that weights the
contribution of an input vi. The output of the sigmoid perceptron is then obtained using the
sigmoid function 1 / (1 + e^−inputs), where inputs is the result of w0 + w1v1 + w2v2 + … + wnvn.
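As a concrete illustration, the output of a single sigmoid perceptron can be computed as
follows; the array layout, with the bias weight stored in weights[0], is a convention of this
sketch rather than anything prescribed by the source.

    // Sketch: output of a sigmoid perceptron. weights[0] holds the bias w0;
    // weights[i + 1] holds wi for input vi.
    static double sigmoidUnit(double[] weights, double[] v) {
        double inputs = weights[0];                  // w0
        for (int i = 0; i < v.length; i++)
            inputs += weights[i + 1] * v[i];         // + w1*v1 + ... + wn*vn
        return 1.0 / (1.0 + Math.exp(-inputs));      // sigmoid of the sum
    }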
The backpropagation algorithm, given in table 2.2, learns appropriate values for each weight
wi in a multilayer network with a fixed number of sigmoid perceptrons and interconnections
in order to give a correct output (classification) for a set of inputs (characteristics). Figure 2.9
illustrates a single-hidden layer network. Backpropagation uses a training set of matched
inputs and outputs and employs gradient descent to minimise the error between the network
outputs and the actual outputs of the training set.
Fig. 2.9 A single hidden layer network: input values vi are connected, via weights wi, to
hidden units hi, which are connected in turn to output units ki.
To learn the appropriate weights of a network consisting of a single sigmoid perceptron,
gradient descent uses a training set of matched input-output pairs of the form
({v1, v2, …, vn}, t), where {v1, v2, …, vn} is a vector of input values and t is the target output.
Gradient descent begins by choosing small arbitrary values for the weights. Weights are then
updated for each training example that is misclassified until all the training examples are
correctly classified. A learning rate η is used to determine the extent to which a weight is
updated. Specifically, each training example is classified by the perceptron to obtain an
output o, and each weight is updated by the rule wi ← wi + η(t − o)vi. This process is then
repeated until the perceptron makes no classification errors in the training data.
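The update rule itself is only a few lines of code. The sketch below applies one update for
one training example; iterating it over the training set until no errors remain gives the
procedure just described. Following the formula above, the bias input is taken as 1.

    // Sketch: one application of the rule wi <- wi + eta * (t - o) * vi
    // for a single sigmoid unit, with the bias weight held in weights[0].
    static void updateWeights(double[] weights, double[] v, double t, double eta) {
        double inputs = weights[0];
        for (int i = 0; i < v.length; i++)
            inputs += weights[i + 1] * v[i];
        double o = 1.0 / (1.0 + Math.exp(-inputs));  // current output
        weights[0] += eta * (t - o);                 // bias input taken as 1
        for (int i = 0; i < v.length; i++)
            weights[i + 1] += eta * (t - o) * v[i];
    }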
In considering networks with multiple sigmoid units and multiple outputs, we again begin by
choosing small arbitrary values for the weights in the network (typically between −0.05 and
0.05) and update them according to the weight update rule wij ← wij + η δj xij, where wij
denotes the weight from unit i to unit j, xij denotes the input from unit i into unit j and δ is a
term representing the misclassification error for each unit in the network. For an output unit
k the term δk is computed as okv(1 − okv)(tkv − okv), where okv is the output value associated
with the kth output unit and training example v. For a hidden unit h the term δh is computed as

δh = ohv (1 − ohv) Σ_{k ∈ outputs} wkh δk

where ohv is the output value associated with the hidden unit h and training example v.
BACKPROPAGATION(training_examples, η, nin, nout, nhidden)
- Create a feed-forward network with nin inputs, nhidden hidden units and nout output units.
- Initialise all network weights to small random numbers.
- Until the termination condition is met, Do
  o For each <x, t> in training_examples, Do
    Propagate the input forward through the network:
    - Input the instance x to the network and compute the output ou of every unit u in
      the network.
    Propagate the errors backward through the network:
    - For each network output unit k, calculate its error term δk ← ok(1 − ok)(tk − ok)
    - For each hidden unit h, calculate its error term δh ← oh(1 − oh) Σ_{k ∈ outputs} wkh δk
    - Update each network weight wij ← wij + η δj xij
Algorithm taken from Mitchell's book [12]
Table 2.2 The backpropagation algorithm
Backpropagation continues to update the weights of a multilayer network, in this fashion,
until all the training examples are correctly classified or once the error on the training
examples falls below some threshold.
In the context of the phenotype prediction problem, the target function
Resistant: Genotype → Drug_Susceptibility_Classification can be represented using an
artificial neural network derived using a set of training patterns ({v1, v2, …, vn}, t) and the
backpropagation algorithm. Here {v1, v2, …, vn} represents the result of a genotype resistance
test and t is the corresponding target phenotype. An unseen genotype {v1, v2, …, vn}q is then
classified using the network by propagating the values v1, v2, …, vn through it. In other
words, given the input values v1, v2, …, vn compute the output of every unit in the network.
The final output of the network is then an estimate of phenotype.
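Propagating an encoded genotype through a trained single-hidden-layer network then amounts
to two rounds of the same weighted-sum-and-sigmoid computation, as in this sketch; the
weight-array shapes (a bias in slot 0 of each row) are assumptions of the sketch.

    // Sketch: forward propagation through a single-hidden-layer network of
    // sigmoid units. hiddenW[j] holds the weights of hidden unit j (bias in
    // slot 0); outW holds the weights of the single output unit.
    static double classify(double[] v, double[][] hiddenW, double[] outW) {
        double[] h = new double[hiddenW.length];
        for (int j = 0; j < hiddenW.length; j++) {
            double in = hiddenW[j][0];                           // bias
            for (int i = 0; i < v.length; i++) in += hiddenW[j][i + 1] * v[i];
            h[j] = 1.0 / (1.0 + Math.exp(-in));
        }
        double in = outW[0];
        for (int j = 0; j < h.length; j++) in += outW[j + 1] * h[j];
        return 1.0 / (1.0 + Math.exp(-in));                      // estimate of phenotype
    }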
2.3.3 Instance Based
Nearest neighbour learning contrasts with learning decision trees and artificial neural
networks in that nearest neighbour learning simply stores the training examples and doesn’t
attempt to extract an explicit description of the target function. In this way generalisation
beyond the training examples is postponed until a new case must be classified. Specifically,
when a new case is presented to a nearest neighbour classifier, a set of similar cases is
retrieved from the training set (using a similarity measure) and used to classify the new case.
In this way, a nearest neighbour classifier constructs a different approximation to the target
function for each query, based on a collection of local approximations.
In regards to the phenotype prediction problem, this is a significant advantage, as the target
function Resistant: Genotype → Drug_Susceptibility_Classification may be very complex,
due to the fact that each mutation may be part of a global interaction exhibiting large
interdependence. However, nearest neighbour classifiers typically consider all the attributes
of each case when making predictions and if the target function actually depends only on a
small subset of these attributes, then cases that are truly most similar may well be deemed to
be unrelated. In the context of the phenotype prediction problem this is problematic and extra
care should be given to the design of a similarity measure. Another limitation of nearest
neighbour methods is that nearly all computation is left until a new classification is required
and the total computation cost of classifying a new case can be high.
The k-nearest neighbour methodology is the most basic nearest neighbour algorithm
available. It assumes that all n-attribute cases correspond to locations in an n-dimensional
space. This is called the feature space. The nearest neighbours of an unseen case are then
retrieved using the standard Euclidean distance. Specifically, each case is described using a
vector of features and the distance between two cases or the similarity of two cases ci and cj
is defined to be
d(ci, cj) = √( Σ_{r=1}^{n} ( ar(ci) − ar(cj) )² )
where ar(c) denotes the value of the rth feature of case c. Using such a measure, k-nearest
neighbour approximates the classification of an unseen case, cq, by retrieving the k cases c1,
c2,…, ck that are closest to cq and assigning it the most common classification associated with
the cases c1, c2,…, ck. For example, if k = 1 then 1-nearest neighbour assigns to cq the
classification associated with ci, where ci is the training case closest to cq. Figure 2.10 illustrates the
operation of the k-nearest neighbour algorithm.
Fig. 2.10 k-nearest neighbour. Each case is represented using a 2-dimensional feature
vector and is classified as positive or negative; the query case cq lies among both positive and
negative cases. 1-nearest neighbour classifies cq as positive, whereas 5-nearest neighbour
classifies cq as negative.
In terms of the phenotype prediction problem, k-nearest neighbour would approximate the
phenotype of an unseen genotype gq, by retrieving the k genotypes g1, g2,…, gk that are
closest to gq and assigning it a classification of drug-susceptibility using the phenotypes
associated with the genotypes g1, g2,…, gk, with the similarity or distance between two
genotypes being determined using an appropriate distance measure.
One possibility for this distance measure is to simply apply the Euclidean distance as
described above. Here we model each genotype as a vector of single-letter amino acid codes.
In this way, every sequence position is considered when determining the distance between
two genotypes. As mentioned, this may be problematic because it treats each sequence
position as being equally important; in this way, we neglect to highlight the importance of
amino acid changes at specific sequence positions.
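A minimal sketch of this plain Euclidean approach follows. With categorical amino acid
values, the squared difference term is taken here to be 0 for a match and 1 for a mismatch
(an assumption; the source does not spell out the encoding), so the measure reduces to the
square root of a mismatch count, and 1-nearest neighbour returns the phenotype of the
closest training genotype.

    // Sketch: 1-nearest neighbour over genotypes modelled as arrays of
    // single-letter amino acid codes, every position weighted equally.
    public class NearestNeighbour {
        static double distance(char[] g1, char[] g2) {
            int mismatches = 0;
            for (int r = 0; r < g1.length; r++)
                if (g1[r] != g2[r]) mismatches++;   // (ar(ci) - ar(cj))^2 taken as 0/1
            return Math.sqrt(mismatches);
        }

        static String classify(char[] query, char[][] genotypes, String[] labels) {
            int best = 0;
            for (int i = 1; i < genotypes.length; i++)
                if (distance(query, genotypes[i]) < distance(query, genotypes[best]))
                    best = i;
            return labels[best];  // phenotype of the closest training genotype
        }
    }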
Commercially, the Virco-Tibotec company employs a k-nearest neighbour approach to
predict drug-susceptibility from genotype, see figure 2.11. Here the distance between two
genotypes is based on the comparison of their ‘profiles’, where a profile can be thought of as
a feature vector containing all the mutations present in a genotype previously associated with
drug-resistance. Here we do not consider every sequence position in the distance measure, as
above, but rather only a subset of sequence positions. However, this method relies on
previous knowledge of mutations associated with drug-resistance and fails to consider amino
acid changes beyond this set.
image from http://www.vircolab.com/bgdisplay.jhtml?itemname=howitworks_virtualphenotype&product=virtualphenotype
Fig. 2.11
How Virco generates a virtualPhenotype™
An alternative measure that doesn’t presume any previous knowledge of drug-resistant
mutations constructs a feature vector using a reference strain. Specifically, a feature vector
for a genotype is constructed by comparing its complete sequence with a reference sequence.
For positions in which there is no change in amino acid the feature vector is augmented to
contain a dummy value, β, and for other positions the feature vector is augmented to contain
the amino acid present in the genotype sequence. In other words, the feature vector of a
genotype represents a pattern of deviations from a reference sequence. We then compute a
similarity score based on how well two feature vectors conform. For positions in which both
vectors contain non-dummy values we compute the percentage of these that are different.
Also, an additional factor is included that represents the percentage of the remaining
positions that are different. Figure 2.12 illustrates this approach.
Another possible similarity measure that again doesn’t presume any previous knowledge of
drug-resistance mutations is derived through the comparison of two dot-plots. A dot plot is a
visual representation of the similarities between two sequences. Each axis of a rectangular
array represents one of the two sequences to be compared and at every point in the array
where the two sequences are identical a dot is placed (i.e. at the intersection of every row and
column that have the same amino acid in both sequences). A diagonal stretch of dots
indicates regions where the two sequences are similar. Using such a representation of
sequence similarity, we can construct a dot plot for each genotype in the training set in
relation to a reference sequence. Similarly a dot-plot can be constructed for a query genotype
using the same reference sequence. The distance between two genotypes could then be
estimated by comparing the similarity of two dot-plots. Figure 2.13 illustrates this approach.
Fig. 2.12 Comparing two feature vectors. Each genotype g1, g2 is represented as its pattern of
deviations from the reference strain: positions with no amino acid change hold a dummy value
β, and mutated positions hold the observed amino acid. The distance is then

distance(g1, g2) = mutAgreement(g1, g2) + diffs(g1, g2)

where mutAgreement(g1, g2) = no. of shared mutations that are different / total no. of shared
mutations, and diffs(g1, g2) = no. of unshared mutations / length of sequence.
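This distance translates into code once each genotype is held as a map from mutated position
to observed amino acid; that map representation is an assumption of the sketch, which
otherwise follows the definitions in figure 2.12.

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Sketch: the distance of figure 2.12. Positions mutated in both
    // genotypes contribute to mutAgreement; positions mutated in only one
    // contribute to diffs.
    public class ProfileDistance {
        static double distance(Map<Integer, Character> g1,
                               Map<Integer, Character> g2, int seqLength) {
            Set<Integer> shared = new HashSet<>(g1.keySet());
            shared.retainAll(g2.keySet());
            int differing = 0;
            for (int pos : shared)
                if (!g1.get(pos).equals(g2.get(pos))) differing++;
            double mutAgreement = shared.isEmpty() ? 0.0
                                : (double) differing / shared.size();
            int unshared = (g1.size() - shared.size()) + (g2.size() - shared.size());
            double diffs = (double) unshared / seqLength;
            return mutAgreement + diffs;
        }
    }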
Fig. 2.13 Comparing dot-plots. (i) A dot plot obtained by comparing the reference sequence,
A, with itself. (ii) A dot plot obtained by comparing the reference sequence with a sequence,
B, carrying a small subset of mutations. (iii) A dot plot obtained by comparing the reference
sequence with a sequence, C, carrying a large number of mutations. The similarity of B and C
is then determined by how well (ii) and (iii) conform.
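A dot plot is itself a simple structure to build: a boolean matrix with a true entry wherever
the two sequences share an amino acid. The sketch below constructs one; how two such plots are
then scored against each other is left open in the text, so the comment suggests only one
possible (assumed) conformity score.

    // Sketch: building a dot plot as a boolean matrix. dots[i][j] is true
    // where residue i of sequence a equals residue j of sequence b. Two
    // plots built against the same reference could then be compared cell by
    // cell, e.g. by the fraction of agreeing cells (one possible score).
    static boolean[][] dotPlot(String a, String b) {
        boolean[][] dots = new boolean[a.length()][b.length()];
        for (int i = 0; i < a.length(); i++)
            for (int j = 0; j < b.length(); j++)
                dots[i][j] = a.charAt(i) == b.charAt(j);
        return dots;
    }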
Chapter 3
Materials & Methods
3.1 Data Set
The complete HIV-1 reverse transcriptase and protease drug susceptibility data sets from the
Stanford HIV-1 drug resistance database were utilised to determine viral genotype and drug
susceptibility to five nucleoside inhibitors of the reverse transcriptase, zalcitabine (ddC),
didanosine (ddI), stavudine (d4T), lamivudine (3TC) and abacavir (ABC); three
nonnucleoside reverse transcriptase inhibitors, nevirapine (NVP), delavirdine (DLV) and
efavirenz (EFV); and six protease inhibitors, saquinavir (SQV), lopinavir (LPV), indinavir
(IDV), ritonavir (RTV), nelfinavir (NFV) and amprenavir (APV). Using a simple text parser
(implementation described in Appendix A), I obtained 381–855 phenotype-genotype pairs for
each of these drugs, see table 3.2. In addition, I would have liked to reuse the dataset of
phenotype-genotype pairs used in [1] but although the results of genotyping were deposited
in GenBank (accession numbers AF347117 to AF347605) no drug susceptibility information
was attached to them. Furthermore, the Stanford dataset contained no drug-susceptibility
information for the reverse transcriptase inhibitor zidovudine, which was also included in
[1].
The Stanford HIV-1 reverse transcriptase and protease drug susceptibility data sets are
available at http://hivdb.stanford.edu/cgi-bin/GenoPhenoDS.cgi and can be downloaded in a
plain text format, see figure 3.1.
SeqID  SubType  Method     …  SQVFold  SQVFoldMatch  …  P1  P2  P3  …  MutList …
7439   B        Virologic  …  47.4     =             …  -   -   -   …  10I, 24FL, 37D, 46I, 53L, 60E, 63P, 71IV, 73S, 77I, 90M, 93L …
7443   B        Virologic  …  574.2    =             …  -   -   -   …  10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A …
7459   B        Virco      …  15.0     =             …  -   -   -   …  10I, 19Q, 35D, 48V, 63P, 69Y, 71T, 90M, 93L …
...
45121  B        Virologic  …  na       na            …  -   -   -   …  10I, 13V, 41K …
45122  B        Virologic  …  na       na            …  -   -   -   …  50L …
...
7430   B        Virologic  …  121.7    =             …  -   -   -   …  10I, 15V, 20M, 35D, 36I, 54V, 57K, 62V, 63P, 71V, 73S …

Fig. 3.1  The HIV-1 protease drug-susceptibility dataset obtained from the Stanford HIV resistance database.
For each sample presented in the Stanford dataset we are given a fold-change value
(phenotype), for each drug, and a list of mutations identified in comparison to a reference
strain (genotype). See Appendix C for a list of the sequences included in the protease and
reverse transcriptase datasets.
Using this information, I was able to model each sample in a similar way to that presented in
section 2.2. Specifically, I modelled each protease sample using one attribute for each of its
99 amino acids, allowing as a value one of the 20 naturally occurring amino acids. Similarly,
each reverse transcriptase sample was modelled using 440 attributes representing the first
440 amino acids of the reverse transcriptase.
To acquire the values for each attribute, the list of mutations was used in conjunction with
the reference strain HXB2. In particular, for each sample, the original sequence was
reconstructed by substituting each mutation into the protease or reverse transcriptase genes
of HXB2 (GenBank accession number K03455). In this way, each sequence that was
constructed contained no gaps or areas with loss of sequence information and so an attribute
value of “unknown” was not required. Figure 3.2 illustrates this process.
SeqID  SubType  Method     …  SQVFold  SQVFoldMatch  …  P1  P2  P3  …  MutList …
…
7443   B        Virologic  …  574.2    =             …  -   -   -   …  10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A …
7459   B        Virco      …  15.0     =             …  -   -   -   …  10I, 19Q, 35D, 48V, 63P, 69Y, 71T, 90M, 93L …

[The mutation list (10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A) and the fold-change value for saquinavir (SQV) are extracted from a row; substituting the mutations into the reference sequence yields the reconstructed genotype sequence (e.g. position 17 = E), from which the attribute values are read.]

Fig. 3.2  Modelling the data. A list of mutations in the protease gene is used in conjunction with the reference sequence HXB2 to obtain the values of each attribute.
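A sketch of this reconstruction step in Python (the mutation format follows the lists in figure 3.1, e.g. "48V" or the ambiguous "24FL"; HXB2_PROTEASE is a placeholder for the 99-residue reference sequence):

import re

def reconstruct(reference, mut_list):
    # substitute each mutation into a copy of the reference sequence
    seq = list(reference)
    for mut in mut_list:
        match = re.match(r"(\d+)([A-Z]+)", mut.strip())
        position, codes = int(match.group(1)), match.group(2)
        # for ambiguity codes such as "24FL" only the first amino acid is kept
        # (see section 5.2.1 for the shortcomings of this policy)
        seq[position - 1] = codes[0]
    return "".join(seq)

# e.g. reconstruct(HXB2_PROTEASE, ["10I", "17E", "37D", "43T", "48V"])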
Using the drug-specific fold-change values associated with each sample, the dataset of genotypes was grouped into two classes. Here, fold-change values were distributed as in table 3.1. In particular, a genotype was labelled as either "resistant" or "susceptible" according to a drug-specific fold-change threshold. An important consideration here is how the choice of thresholds affects the distribution of resistant and susceptible samples: by varying the thresholds we include or exclude certain samples from either the resistant or susceptible classes. In this study, the same thresholds were used as described in section 2.2. By grouping the dataset of genotypes using these thresholds I obtained the datasets described in table 3.2.
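The grouping step itself then amounts to no more than the following (a sketch; whether a fold-change exactly at the threshold counts as resistant is an assumption, as the boundary convention is not spelt out here):

def label_dataset(samples, threshold):
    # samples: (genotype, fold_change) pairs for one drug; fold changes at or
    # above the drug-specific threshold are labelled resistant
    return [(genotype, "resistant" if fold >= threshold else "susceptible")
            for genotype, fold in samples]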
Drug   0-1   2-3   4-7   8-15  16-31  32-63  64-127  128-255  256-511  512-1023  1024-2047  >2047
ddC    22%   57%   16%   2%    2%     0.8%   0.2%    0%       0%       0%        0%         0%
ddI    22%   61%   10%   4%    3%     0%     0%      0%       0%       0%        0%         0%
d4T    32%   49%   13%   3%    1%     0.2%   0%      0%       0%       0%        0%         0%
3TC    11%   18%   8%    5%    6%     16%    18%     16%      0%       0.1%      0.1%       0%
ABC    21%   30%   31%   13%   4%     1%     0%      0%       0%       0%        0%         0%
NVP    32%   27%   11%   3%    2%     6%     6%      4%       5%       4%        0%         0%
DLV    33%   26%   12%   5%    3%     5%     5%      4%       3%       0.5%      0%         0%
EFV    41%   23%   7%    4%    4%     5%     5%      4%       4%       2%        0.2%       0.4%
SQV    32%   26%   11%   7%    7%     8%     4%      2%       2%       1%        0%         0%
LPV    24%   21%   13%   13%   10%    10%    8%      1%       0%       0%        0%         0%
IDV    20%   28%   14%   14%   13%    7%     2%      0.7%     0.5%     0%        0%         0.1%
RTV    24%   24%   12%   9%    9%     9%     9%      4%       0%       0%        0%         0%
NFV    11%   22%   11%   12%   15%    15%    7%      3%       0.8%     2%        0.1%       0%
APV    34%   29%   19%   9%    5%     2%     0.7%    0.3%     0.3%     0%        0%         0%

Table 3.1  Frequency distribution of fold-change values. Resistance factors are grouped into bins of doubling width.
Drug  No. of phenotype-genotype pairs  Percentage of examples classified resistant
ZDV   -                                -
ddC   646                              29.2%
ddI   673                              20.9%
d4T   631                              22.5%
3TC   749                              59.4%
ABC   582                              53.9%
NVP   706                              29.0%
DLV   601                              28.3%
EFV   557                              27.1%
SQV   830                              33.9%
LPV   381                              58.5%
IDV   822                              48.6%
RTV   773                              49.1%
NFV   855                              64.3%
APV   701                              33.3%

Table 3.2  Number of phenotype-genotype pairs and percentage of resistant examples for each drug.
3.2
Decision Trees
For each drug I derived a decision tree classifier to identify genotypic patterns characteristic of resistance (see Appendix B for a description of the implementation). Each dataset of phenotype-genotype pairs was partitioned into a training, validation and test set: 56% of the complete dataset was randomly selected for training, 14% was randomly selected for validation, and what remained formed the testing set. Genotypes were labelled as either resistant or susceptible according to a drug-specific threshold, as defined in [1].
Decision trees were generated using the ID3 algorithm, as described in section 2.3.1. The training set was recursively split according to attribute values (the amino acids observed at each sequence position) until all the examples in the training set were perfectly classified. Attributes were selected based on maximal information gain, representing the amount of information that an amino acid position provides for differentiating resistant from susceptible genotypes.
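The selection measure can be sketched in Python as follows (a minimal illustration; each example is assumed to be a (values, label) pair, where values maps a sequence position to its amino acid):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, position):
    # entropy of the whole set minus the weighted entropy of the subsets
    # obtained by splitting on the amino acid observed at this position
    base = entropy([label for _, label in examples])
    partitions = {}
    for values, label in examples:
        partitions.setdefault(values[position], []).append(label)
    remainder = sum(len(subset) / len(examples) * entropy(subset)
                    for subset in partitions.values())
    return base - remainder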
Trees were pruned using reduced error pruning to minimise the effects of overfitting. This gave rise to confidence factors associated with certain leaves, estimated by the fraction of training examples incorrectly classified by the pruned tree.
Classification of an unseen genotype (list of mutations) was achieved by sorting it through a
drug-specific decision tree from the root to a leaf according to the values (single-letter amino
acid codes) of attributes (amino acid positions). If a genotype was unable to be completely
sorted through the tree in this manner (i.e. it contains an attribute value that is not recognised
by the tree) then it was classified as “unknown”.
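The traversal can be sketched as follows (the Node structure is hypothetical: a leaf carries a label, an interior node carries the tested position and one child per amino acid seen during training; the recorded path later forms the basis of the explanations shown in figure 3.4):

from dataclasses import dataclass, field

@dataclass
class Node:
    position: int = None                           # sequence position tested (interior nodes)
    children: dict = field(default_factory=dict)   # amino acid -> subtree
    label: str = None                              # "resistant"/"susceptible" (leaves only)

def classify(root, genotype):
    # genotype: mapping from sequence position to single-letter amino acid code
    node, path = root, []
    while node.label is None:
        amino_acid = genotype[node.position]
        path.append((node.position, amino_acid))
        node = node.children.get(amino_acid)
        if node is None:
            return "unknown", path  # value never seen at this test during training
    return node.label, path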
In addition, the classification of a nucleotide sequence was achieved through a two-step process: each codon in the sequence was first translated into a single-letter amino acid code using the standard genetic code (see figure 3.3), thereby constructing a set of attribute values, which was then classified as above.
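For illustration, the translation step can be written with Biopython's standard-table translation standing in for a hand-built lookup of figure 3.3 (a sketch; the input is assumed to be a gap-free coding sequence whose length is a multiple of three):

from Bio.Seq import Seq

def nucleotides_to_attributes(nucleotide_sequence):
    # translate codon by codon using the standard genetic code, yielding one
    # single-letter attribute value per (1-indexed) amino acid position
    protein = str(Seq(nucleotide_sequence).translate())
    return {position + 1: amino_acid
            for position, amino_acid in enumerate(protein)}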
For each classification an explanation was generated by recording the path followed through the decision tree during classification. This was translated into natural language for easy readability; see figure 3.4 for an example.
Fig. 3.3
The genetic code
List of mutations
Found I at position 10,
Found A at position 71,
Found V at position 48,
** Reached decision: example is resistant. (8[0%]) **
Fig. 3.4
Generating an explanation.
3.3
Nearest-Neighbour
By means of the same training, validation and test sets used to create the decision trees, I
created a k-nearest neighbour classifier to approximate drug-susceptibility from k similar
genotypes in the training + validation datasets of each antiretroviral drug.
A feature vector for each genotype in the training + validation datasets was constructed by
comparing each attribute value with a reference sequence. For positions in which there was
no change in amino acid the feature vector was augmented to include a dummy value, β, and
for other positions the feature vector was augmented to include the amino acid present in the
genotype sequence.
To classify the drug-susceptibility of an unseen genotype, the 3 most similar genotypes (guaranteeing a majority between the two classes) were retrieved from the training + validation datasets using a similarity measure. The drug-susceptibility of the unseen genotype was then approximated as the majority drug-susceptibility classification of the retrieved genotypes. For each classification an explanation that consisted of the three closest genotypes was generated.
The similarity of two genotypes was defined by how well their feature vectors conform. In particular, for positions in which both feature vectors contained non-dummy values, I used the percentage of these that differed as an indicator of similarity. To this score I added the percentage of the remaining positions that differed.
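Putting these pieces together, the classification step can be sketched as follows (dist is assumed to be the feature-vector distance sketched after figure 2.12; training holds (feature_vector, label) pairs from the training + validation datasets):

from collections import Counter

def knn_classify(query, training, dist, k=3):
    # retrieve the k most similar genotypes; k = 3 guarantees a majority
    # between the two classes
    neighbours = sorted(training, key=lambda pair: dist(query, pair[0]))[:k]
    prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
    return prediction, neighbours  # the neighbours double as the explanation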
Chapter 4
Results
4.1
Classification Models
I obtained both decision tree and nearest-neighbour classifiers that predict drug-susceptibility
from genotype, for 8 reverse transcriptase and 6 protease inhibitors. Decision tree learning
generated classification models with varying complexity, ranging from only 5-9 interior
attribute tests to 10, 11, 12, 16 and 19 interior attribute tests for zalcitabine, indinavir,
abacavir, amprenavir and saquinavir, respectively. In contrast, the nearest-neighbour classifiers generated no explicit model but simply stored the training and validation datasets used to generate the decision trees.
4.1.1
Reverse Transcriptase Inhibitors
I obtained decision tree classifiers for each of the reverse transcriptase inhibitors: zalcitabine, didanosine, stavudine, lamivudine, nevirapine, abacavir, delavirdine and efavirenz; see figures 4.9-4.16, respectively. The decision trees for these drugs varied in complexity. In particular, I found rather simple models for didanosine, stavudine, nevirapine and efavirenz, with these trees having only 5-6 interior attribute tests. On the other hand, I found more complex models for zalcitabine, abacavir and delavirdine, with these trees having 9-12 interior attribute tests.
Training and validation datasets were randomly created from the entire dataset of applicable
phenotype genotype pairs and were used to derive each decision tree and nearest-neighbour
classifier, see Appendix C for details. Within these datasets, fold-change values were
distributed as in figures 4.1 – 4.8.
Fig. 4.1  Frequency distribution of fold-change values in the training and validation datasets for zalcitabine.
Fig. 4.2  Frequency distribution of fold-change values in the training and validation datasets for didanosine.
Fig. 4.3  Frequency distribution of fold-change values in the training and validation datasets for stavudine.
Fig. 4.4  Frequency distribution of fold-change values in the training and validation datasets for lamivudine.
Fig. 4.5  Frequency distribution of fold-change values in the training and validation datasets for abacavir.
Fig. 4.6  Frequency distribution of fold-change values in the training and validation datasets for nevirapine.
Fig. 4.7  Frequency distribution of fold-change values in the training and validation datasets for delavirdine.
Fig. 4.8  Frequency distribution of fold-change values in the training and validation datasets for efavirenz.
[P75] = <M>, <T> then: resistant (15[8%])
[P75] = <A> then: susceptible (3[0%])
[P75] = <V> and:
[P184] = <M> then: susceptible (182[17%])
[P184] = <V> and:
[P210] = <L> then: susceptible (121[33%])
[P210] = <W> and:
[P177] = <E>, <G> then: resistant (9[0%])
[P177] = <D> and:
[P35] = <I> then: resistant (3[0%])
[P35] = <T> then: susceptible (2[0%])
[P35] = <V> and:
[P74] = <V>, <I> then: resistant (4[0%])
[P74] = <L> and:
[P118] = <V> then: susceptible (15[0%])
[P118] = <I> then: resistant (3[0%])
[P35] = <M> and:
[P41] = <M> then: resistant (1[0%])
[P41] = <L> then: susceptible (3[0%])
[P75] = <I> and:
[P41] = <M> then: resistant (10[0%])
[P41] = <L> then: susceptible (1[0%])
[P75] = <L> and:
[P32] = <K> then: susceptible (1[0%])
[P32] = <H> then: resistant (1[0%])
Fig. 4.9
Decision tree classifier for zalcitabine. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P151] = <M> then: resistant (15[0%])
[P151] = <Q> and:
[P74] = <L>, <S> then: susceptible (333[12%])
[P74] = <V> and:
[P211] = <R>, <A> then: resistant (18[0%])
[P211] = <T> then: susceptible (1[0%])
[P211] = <K> and:
[P297] = <E>, <K> then: susceptible (7[0%])
[P297] = <A>, <R> then: resistant (2[0%])
[P74] = <I> and:
[P35] = <V> then: resistant (3[0%])
[P35] = <L> then: susceptible (1[0%])
Fig. 4.10
Decision tree classifier for didanosine. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P210] = <L> and:
[P75] = <V> then: susceptible (257[8%])
[P75] = <A>, <T>, <I>, <S> then: resistant (17[0%])
[P75] = <M> and:
[P3] = <S> then: resistant (1[0%])
[P3] = <C> then: susceptible (1[0%])
[P210] = <W> and:
[P69] = <N>, <D>, <S> then: resistant (15[0%])
[P69] = <T> and:
[P67] = <D>, <G>, <E> then: susceptible (22[0%])
[P67] = <N> then: resistant (38[50%])
Fig. 4.11
Decision tree classifier for stavudine. (n[e]) at the leaves denotes the number of
examples n and the estimated error e. Leafs with an error of 50% represent an inability to make a
definite classification.
[P184] = <V>, <I> then: resistant (236[0%])
[P184] = <M> and:
[P69] = <T>, <N>, <A> then: susceptible (165[6%])
[P69] = <D> and:
[P245] = <T> then: resistant (1[0%])
[P245] = <E>, <K>, <A> then: susceptible (3[0%])
[P245] = <V> and:
[P118] = <V> then: susceptible (5[0%])
[P118] = <I> then: resistant (3[0%])
[P69] = <S> and:
[P41] = <L> then: resistant (2[0%])
[P41] = <M> then: susceptible (1[0%])
[P69] = <I> and:
[P21] = <V> then: resistant (1[0%])
[P21] = <I> then: susceptible (1[0%])
Fig. 4.12
Decision tree classifier for lamivudine. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P103] = <R> then: susceptible (5[0%])
[P103] = <N>, <T> then: resistant (60[0%])
[P103] = <K> and:
[P181] = <C>, <I> then: resistant (28[15%])
[P181] = <Y> and:
[P190] = <A>, <S>, <T>, <Q>, <C>, <E>, <V> then: resistant (20[0%])
[P190] = <G> and:
[P106] = <I>, <L> then: susceptible (6[0%])
[P106] = <A>, <M> then: resistant (9[0%])
[P106] = <V> and:
[P188] = <Y>, <H> then: susceptible (274[3%])
[P188] = <L>, <C> then: resistant (5[0%])
[P103] = <S> and:
[P35] = <V>, <M> then: resistant (2[0%])
[P35] = <I> then: susceptible (1[0%])
Fig. 4.13
Decision tree classifier for nevirapine. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P184] = <V>, <I> then: resistant (149[24%])
[P184] = <M> and:
[P41] = <M> and:
[P151] = <M> then: resistant (11[0%])
[P151] = <Q> and:
[P65] = <K> and:
[P211] = <K>, <A>, <S>, <T>, <G> then: susceptible (54[0%])
[P211] = <R> and:
[P178] = <M>, <V> then: susceptible (7[0%])
[P178] = <L> then: resistant (2[0%])
[P178] = <I> and:
[P69] = <T>, <N> then: susceptible (50[0%])
[P69] = <S> then: resistant (1[0%])
[P65] = <R> and:
[P35] = <I> then: susceptible (1[0%])
[P35] = <M> then: resistant (1[0%])
[P35] = <V> and:
[P162] = <C> then: resistant (1[0%])
[P162] = <A> then: susceptible (1[0%])
[P162] = <S> and:
[P62] = <A> then: resistant (4[0%])
[P62] = <V> then: susceptible (1[0%])
[P41] = <L> and:
[P69] = <N>, <D>, <S>, <E> then: resistant (11[0%])
[P69] = <A> then: susceptible (1[0%])
[P69] = <T> and:
[P44] = <E> then: susceptible (26[22%])
[P44] = <D>, <E> then: resistant (7[0%])
Fig. 4.14
Decision tree classifier for abacavir. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P103] = <N>, <S>, <T> then: resistant (66[8%])
[P103] = <R>, <Q> then: susceptible (6[0%])
[P103] = <K> and:
[P181] = <C>, <I> then: resistant (17[0%])
[P181] = <Y> and:
[P245] = <Q>, <T>, <E>, <M>, <K>, <I>, <S>, <L>, <A> then: susceptible (81[0%])
[P245] = <R> then: resistant (1[0%])
[P245] = <V> and:
[P102] = <K>, <R> then: susceptible (155[3%])
[P102] = <E> then: resistant (1[0%])
[P102] = <Q> and:
[P100] = <I> then: resistant (2[0%])
[P100] = <L> and:
[P20] = <R> then: resistant (1[0%])
[P20] = <K> and:
[P230] = <L> then: resistant (1[0%])
[P230] = <M> and:
[P188] = <Y>, <C> then: susceptible (17[0%])
[P188] = <L> and:
[P35] = <V> then: resistant (1[0%])
[P35] = <T> then: susceptible (1[0%])
Fig. 4.15
Decision tree classifier for delavirdine. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P103] = <R>, <T>, <Q> then: susceptible (5[0%])
[P103] = <N> then: resistant (66[17%])
[P103] = <K> and:
[P190] = <S>, <T>, <Q>, <C>, <E> then: resistant (10[0%])
[P190] = <G> and:
[P188] = <Y>, <C>, <H> then: susceptible (211[2%])
[P188] = <L> then: resistant (7[0%])
[P190] = <A> and:
[P101] = <E>, <D> then: resistant (6[0%])
[P101] = <K> and:
[P135] = <T>, <M> then: resistant (3[0%])
[P135] = <I> then: susceptible (5[0%])
[P103] = <S> and:
[P35] = <V> then: susceptible (2[0%])
[P35] = <M> then: resistant (1[0%])
Fig. 4.16
Decision tree classifier for efavirenz. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
4.1.2
Protease Inhibitors
I obtained decision tree classifiers for each of the protease inhibitors: lopinavir, saquinavir, indinavir, ritonavir, amprenavir and nelfinavir; see figures 4.23-4.28, respectively. Here, I found rather more complex models than for the reverse transcriptase inhibitors. In particular, I found decision trees with 6-9 interior attribute tests for the drugs ritonavir, lopinavir and nelfinavir. On the other hand, the decision trees for indinavir, amprenavir and saquinavir were even more complex, with 11, 16 and 19 interior attribute tests, respectively. This complexity indicates that the genetic basis of drug resistance is more complicated for protease inhibitors than for reverse transcriptase inhibitors. However, this conjecture is by no means conclusive, as complex decision trees can also stem from noisy training data. It may simply be the case that the training data for these drugs contains errors and that these errors are distributed evenly between the training and validation datasets.
Training and validation datasets were randomly created from the entire dataset of applicable
phenotype genotype pairs and were used to derive each decision tree and nearest-neighbour
classifier, see Appendix C for details. Within these datasets, fold-change values were
distributed as in figures 4.17 – 4.22.
Fig. 4.17  Frequency distribution of fold-change values in the training and validation datasets for saquinavir.
Fig. 4.18  Frequency distribution of fold-change values in the training and validation datasets for lopinavir.
Fig. 4.19  Frequency distribution of fold-change values in the training and validation datasets for indinavir.
Fig. 4.20  Frequency distribution of fold-change values in the training and validation datasets for ritonavir.
Fig. 4.21  Frequency distribution of fold-change values in the training and validation datasets for nelfinavir.
Fig. 4.22  Frequency distribution of fold-change values in the training and validation datasets for amprenavir.
[P10] = <L>, <H>, <R>, <M> then: susceptible (92[10%])
[P10] = <V>, <F>, <Z> then: resistant (55[12%])
[P10] = <I> and:
[P82] = <A>, <T>, <I>, <F>, <S> then: resistant (37[0%])
[P82] = <V> and:
[P71] = <V>, <L> then: resistant (16[19%])
[P71] = <I> then: susceptible (2[0%])
[P72] = <I>, <V> then: resistant (10[0%])
[P72] = <M>, <E> then: susceptible (2[0%])
[P71] = <A> and:
[P46] = <M>, <L> then: susceptible (9[0%])
[P46] = <I> and:
[P72] = <T> then: resistant (2[0%])
[P72] = <L> then: susceptible (1[0%])
[P72] = <I> and:
[P93] = <L> then: susceptible (3[0%])
[P93] = <I> then: resistant (3[0%])
Fig. 4.23
Decision tree classifier for lopinavir. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P10] = <R>, <Z> then: resistant (26[0%])
[P10] = <I> and:
[P71] = <I>, <T>, <V>, <Z> then: resistant (105[18%])
[P71] = <A> and:
[P48] = <V>, <S> then: resistant (8[0%])
[P48] = <G> and:
[P37] = <D>, <Z> then: resistant (3[0%])
[P37] = <E>, <T>, <N> then: susceptible (4[0%])
[P37] = <S> and:
[P84] = <V>, <C> then: resistant (6[0%])
[P84] = <I> and:
[P73] = <S>, <G>, <C> then: susceptible (32[0%])
[P73] = <T> then: resistant (2[0%])
[P71] = <L> and:
[P13] = <I> then: resistant (2[0%])
[P13] = <V> then: susceptible (2[0%])
[P10] = <L> and:
[P90] = <L> then: susceptible (173[0%])
[P90] = <M> and:
[P88] = <S>, <D> then: resistant (14[0%])
[P88] = <N> and:
[P48] = <V> then: resistant (4[0%])
[P48] = <G> and:
[P84] = <V> then: resistant (4[0%])
[P84] = <I> and:
[P60] = <E> and:
[P64] = <I> then: resistant (4[0%])
[P64] = <V> then: susceptible (4[0%])
[P60] = <D> and:
[P14] = <K> then: susceptible (14[0%])
[P14] = <R> then: resistant (1[0%])
[P10] = <V> and:
[P71] = <V> then: resistant (10[0%])
[P71] = <A> then: susceptible (8[0%])
[P71] = <T> and:
[P12] = <Z> then: resistant (1[0%])
[P12] = <T> and:
[P30] = <D> then: susceptible (1[0%])
[P30] = <N> then: resistant (1[0%])
[P10] = <F> and:
[P84] = <I> then: susceptible (20[33%])
[P84] = <L>, <C>, <A> then: resistant (3[0%])
[P84] = <V> and:
[P63] = <P>, <A> then: resistant (7[0%])
Fig. 4.24
Decision tree classifier for saquinavir. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P54] = <V>, <T>, <A>, <Z> then: resistant (125[0%])
[P54] = <M> then: susceptible (2[0%])
[P54] = <I> and:
[P46] = <Z> then: resistant (7[0%])
[P46] = <I> and:
[P63] = <P>, <A> then: resistant (46[7%])
[P63] = <V>, <R>, <Q> then: susceptible (5[0%])
[P63] = <L> and:
[P77] = <I> then: susceptible (3[0%])
[P77] = <V> and:
[P69] = <Y>, <Q> then: resistant (2[0%])
[P69] = <K> then: susceptible (1[0%])
[P69] = <H> and:
[P10] = <I>, <L>, <V> then: resistant (4[0%])
[P10] = <F> then: susceptible (1[0%])
[P46] = <M> and:
[P90] = <M> then: resistant (40[27%])
[P90] = <L> then: susceptible (200[6%])
[P46] = <L> and:
[P10] = <I>, <R> then: resistant (5[0%])
[P10] = <L> then: susceptible (10[0%])
[P10] = <F> and:
[P62] = <I> then: resistant (2[0%])
[P62] = <V> then: susceptible (1[0%])
[P54] = <L> and:
[P10] = <I> then: resistant (2[0%])
[P10] = <L> then: susceptible (2[0%])
[P20] = <K> then: susceptible (2[0%])
[P20] = <T> then: resistant (1[0%])
Fig. 4.25
Decision tree classifier for indinavir. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P82] = <A>, <T>, <F>, <L>, <S> then: resistant (109[4%])
[P82] = <V> and:
[P90] = <M> then: resistant (79[28%])
[P90] = <L> and:
[P84] = <I> then: susceptible (205[6%])
[P84] = <V>, <L>, <A> then: resistant (26[0%])
[P84] = <C> and:
[P10] = <I>, <F> then: resistant (3[0%])
[P10] = <L> then: susceptible (1[0%])
[P82] = <I> and:
[P46] = <I> then: resistant (3[0%])
[P46] = <L> then: susceptible (1[0%])
[P46] = <M> and:
[P84] = <I> then: susceptible (4[0%])
[P84] = <V>, <C> then: resistant (2[0%])
Fig. 4.26
Decision tree classifier for ritonavir. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P10] = <R> then: susceptible (3[0%])
[P10] = <Z> then: resistant (28[0%])
[P10] = <I> and:
[P84] = <I> then: susceptible (90[22%])
[P84] = <V>, <C>, <A> then: resistant (41[37%])
[P10] = <L> and:
[P84] = <I> and:
[P50] = <V> then: resistant (3[0%])
[P50] = <L> then: susceptible (5[0%])
[P50] = <I> and:
[P47] = <I> then: susceptible (156[2%])
[P47] = <A> then: resistant (2[0%])
[P47] = <V> and:
[P12] = <T> then: susceptible (1[0%])
[P12] = <S> then: resistant (1[0%])
[P84] = <V> and:
[P73] = <S> then: susceptible (1[0%])
[P73] = <T> then: resistant (2[0%])
[P73] = <G> and:
[P46] = <I> then: susceptible (1[0%])
[P46] = <L> then: resistant (2[0%])
[P46] = <M> and:
[P12] = <P> then: resistant (1[0%])
[P12] = <T> and:
[P36] = <M> then: susceptible (1[0%])
[P36] = <Z> then: resistant (1[0%])
[P84] = <C> and:
[P20] = <K> then: susceptible (1[0%])
[P20] = <I> then: resistant (1[0%])
[P10] = <V> and:
[P46] = <I> then: resistant (6[0%])
[P46] = <M> then: susceptible (1[0%])
[P46] = <L> and:
[P61] = <Q> then: resistant (3[0%])
[P61] = <E> then: susceptible (1[0%])
[P10] = <F> and:
[P84] = <V>, <A> then: resistant (8[0%])
[P84] = <I> and:
[P46] = <I> then: resistant (3[0%])
[P46] = <M> then: susceptible (13[50%])
[P46] = <L> and:
[P15] = <I> then: susceptible (3[0%])
[P15] = <V> then: resistant (3[0%])
Fig. 4.27
Decision tree classifier for amprenavir. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
[P54] = <V>, <L>, <T>, <Z> then: resistant (128[0%])
[P54] = <M> then: susceptible (1[0%])
[P54] = <I> and:
[P90] = <M> then: resistant (94[8%])
[P90] = <L> and:
[P30] = <N> then: resistant (37[0%])
[P30] = <Y> then: susceptible (1[0%])
[P30] = <D> and:
[P46] = <I>, <Z> then: resistant (31[28%])
[P46] = <M> and:
[P88] = <S> then: resistant (8[0%])
[P88] = <D> then: susceptible (1[0%])
[P88] = <N> and:
[P82] = <V>, <S> then: susceptible (155[4%])
[P82] = <F> then: resistant (3[0%])
[P82] = <I> and:
[P20] = <K> then: susceptible (1[0%])
[P20] = <I> then: resistant (1[0%])
[P46] = <L> and:
[P10] = <I> then: resistant (5[0%])
[P10] = <F> then: susceptible (1[0%])
[P10] = <L> and:
[P82] = <A>, <L> then: susceptible (2[0%])
[P82] = <V> then: resistant (5[0%])
Fig. 4.28
Decision tree classifier for nelfinavir. (n[e]) at the leaves denotes the number of
examples n and the estimated error e.
4.2
Prediction Quality
To assess the predictive quality of the classifiers on unseen genotypes 30% of the entire
dataset (applicable to each drug) was randomly selected for testing. These cases were then
queried using either the appropriate decision tree or nearest-neighbour classifier to obtain a
predicted drug-susceptibility. For each genotype in the testing set the predicted classification
was compared with its true classification to obtain both an indication of predictive error and
an estimate of how well a classifier is able to generalise beyond the training population. In
particular, I determined for each classifier its prediction error across the testing set (the percentage of misclassified cases), its sensitivity and its specificity. The sensitivity of a classifier is the probability that it predicts drug resistance given that a case is truly resistant. Conversely, the specificity is the probability that it predicts drug susceptibility given that a case is truly susceptible.
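In terms of counts over a testing set, these quantities are (a standard formulation; TP and FN denote truly resistant cases predicted resistant and susceptible respectively, TN and FP truly susceptible cases predicted susceptible and resistant respectively):

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)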
To assess how these newly generated classifiers fared against the decision tree classifiers originally presented in [1], I hand-implemented each of those classifiers and tested it against the same testing set as used above; see appendix D for details of these trees.
An important consideration when testing such classification models is how well the
distribution of cases in the testing set concurs with the distribution of cases in the
training/validation sets. This is important because when distributions differ greatly we can
expect a loss of predictive quality. This stems from the fact that learning is most reliable
when the cases in the training set follow a distribution similar to that of the entire population.
Table 4.1 gives the distribution of fold-change values in each testing set.
Drug   0-1   2-3   4-7   8-15  16-31  32-63  64-127  128-255  256-511  512-1023  1024-2047  >2047
ddC    18%   55%   21%   2%    3%     0      0       0        0        0         0          0
ddI    16%   66%   12%   3%    3%     0      0       0        0        0         0          0
d4T    33%   47%   15%   4%    1%     0      0       0        0        0         0          0
3TC    12%   18%   9%    7%    7%     15%    16%     14%      0        0         0          0
ABC    21%   28%   34%   11%   4%     2%     0       0        0        0         0          0
NVP    33%   27%   11%   3%    3%     6%     5%      2%       4%       4%        1%         0
DLV    31%   22%   12%   6%    2%     6%     5%      8%       5%       0         0          0
EFV    43%   21%   7%    5%    2%     9%     4%      3%       5%       0         0          0
SQV    33%   29%   12%   5%    4%     8%     4%      2%       0        2%        0          0
LPV    28%   21%   4%    13%   12%    16%    5%      1%       0        0         0          0
IDV    20%   26%   13%   16%   12%    9%     3%      1%       0        0         0          0
RTV    24%   29%   11%   7%    10%    10%    7%      3%       0        0         0          0
NFV    9%    18%   13%   13%   18%    16%    8%      3%       1%       0         0          0
APV    37%   27%   17%   9%    6%     2%     0       0        0        0         0          0

Table 4.1  Frequency distribution of fold-change values within the testing datasets. Resistance factors are grouped into bins of doubling width.
Here the distribution of test cases is relatively similar to that of the training experience. Therefore any models with low predictive quality cannot be explained away by differences between the training and test set distributions.
Testing the newly constructed decision trees resulted in prediction errors in the range 6.3 - 14.3% for the protease inhibitors, except for amprenavir, which had a prediction error of 24.2%; 6.0 - 7.8% for the nonnucleoside reverse transcriptase inhibitors; and 6.8%, 14.6%, 14.8% and 16.6% for 3TC, d4T, ddI and ABC, respectively. The error rate for ddC was the poorest at 24.7%. Using the same test cases, the previously published decision trees resulted in prediction errors in the range 8.2 - 20.0% for the protease inhibitors, except for amprenavir, which again had a prediction error of 24.2%; 4.8 - 9.5% for the nonnucleoside reverse transcriptase inhibitors; and 8.1%, 20.0%, 51.0% and 19.0% for 3TC, d4T, ddI and ABC,
respectively. The error rate for ddC was 29.7%. Table 4.2 gives the details of the prediction
errors for each drug.
Using the nearest-neighbour classifiers I found relatively poor predictive quality, with prediction errors in the range 18.0 - 19.1% for the protease inhibitors, except for nelfinavir, which had a prediction error of 26.4%; 22.4 - 36.3% for the nonnucleoside reverse transcriptase inhibitors; and 32.7%, 18.9%, 21.9% and 28.7% for 3TC, d4T, ddI and ABC, respectively. The error rate for ddC was 46.2%.
From these results it is clear that the newly constructed decision trees outperform the previously published decision trees over this dataset. The only drug for which the original decision tree outperforms the newly constructed one is efavirenz. However, this score is not a true indication of the predictive quality of the original tree, because it fails to return a classification for 72% of the cases in the testing set. In addition, although the nearest-neighbour classifiers fare poorly compared to the newly constructed decision trees, they outperform the original decision trees for some drugs.
Drug  No. of test cases  New D-Tree         Original D-Tree     Nearest-Neighbour
                         prediction error   prediction error    prediction error
ddC   182                24.7% [0%]         29.7% [0%]          46.2% [0%]
ddI   196                14.8% [0%]         51.0% [0%]          21.9% [0%]
d4T   185                14.6% [0%]         20.0% [0%]          18.9% [0%]
3TC   220                6.8% [0%]          8.1% [0%]           32.7% [0%]
ABC   174                16.6% [0%]         19.0% [1%]          28.7% [0%]
NVP   205                7.3% [0%]          7.3% [0.9%]         22.4% [0%]
DLV   179                7.8% [0%]          9.5% [0%]           36.3% [0%]
EFV   168                6.0% [0%]          4.8% [72%]          24.4% [0%]
SQV   251                14.3% [0%]         20.0% [0%]          18.3% [0%]
LPV   95                 10.5% [0%]         No tree available.  18.9% [0%]
IDV   247                14.5% [0%]         15.0% [3%]          19.4% [0%]
RTV   228                10.5% [0%]         11.0% [0.8%]        18.0% [0%]
NFV   254                6.3% [0%]          8.2% [4%]           26.4% [0%]
APV   219                24.2% [0%]         24.2% [8.7%]        19.1% [0%]

Table 4.2  Prediction errors. Figures in square brackets give the percentage of test cases for which a classifier returned no classification.
Considering the sensitivity and specificity of the classification models (Table 4.3), the newly
constructed decision trees achieved sensitivities in the range of 0.79 – 0.96 for the protease
inhibitors, except for amprenavir that had a sensitivity of 0.67, 0.82 – 0.89 for the
nonnucleoside reverse transcriptase inhibitors, 0.88 – 0.9 for 3TC and ABC. The sensitivities
for ddC, ddI and d4T were poorest with values 0.39, 0.32 and 0.67, respectively.
Specificities were in the range of 0.81 – 0.89 for the protease inhibitors, 0.94 - 0.97 for the
nonnucleoside reverse transcriptase inhibitors, 0.91 – 0.98 for ddC, ddI, d4T, 3TC. The
specificity for ABC was poorest with a value of 0.77.
For the previously published decision trees sensitivities were in the range of 0.92 – 0.96 for
the protease inhibitors, except for amprenavir that had a sensitivity of 0.76, 0.78 – 1 for the
nonnucleoside reverse transcriptase inhibitors, 0.93 and 0.88 for 3TC and ABC. The
sensitivities for ddC, ddI and d4T were poorest with values 0.59, 0.64 and 0.72, respectively.
Specificities were in the range of 0.71 – 0.87 for the protease inhibitors, 0.94 - 0.97 for the
nonnucleoside reverse transcriptase inhibitors, 0.64 – 0.97 for ABC, ddC, ddI, d4T, 3TC.
The specificity for EFV was poorest with a value of 0.27.
It is clear from these results that the newly constructed decision trees fare better at predicting susceptibility than resistance, and vice versa for the previously published decision trees.
For the nearest neighbour classifiers sensitivities were in the range of 0.64 – 0.69 for the
protease inhibitors, except for amprenavir that had a sensitivity of 0.48, 0.42 – 0.7 for the
nonnucleoside reverse transcriptase inhibitors, 0.69 for 3TC and ABC. The sensitivities for
ddC, ddI and d4T were 0.51, 0.36 and 0.47, respectively. Specificities were in the range of
0.91 – 0.98 for the protease inhibitors, 0.61 - 0.89 for the nonnucleoside reverse transcriptase
inhibitors, 0.55 – 0.92 for ABC, ddC, ddI, d4T, 3TC.
      New D-Tree               Original D-Tree          Nearest-Neighbour
Drug  sensitivity  specificity sensitivity  specificity sensitivity  specificity
ddC   0.39         0.93        0.59         0.75        0.51         0.55
ddI   0.32         0.98        0.64         0.45        0.36         0.88
d4T   0.67         0.91        0.72         0.82        0.47         0.92
3TC   0.90         0.98        0.88         0.97        0.69         0.65
ABC   0.88         0.77        0.93         0.64        0.69         0.73
NVP   0.89         0.94        0.89         0.94        0.47         0.89
DLV   0.82         0.97        0.78         0.97        0.70         0.61
EFV   0.85         0.97        1            0.27        0.42         0.87
SQV   0.79         0.89        0.96         0.71        0.63         0.91
LPV   0.92         0.86        -            -           0.73         0.91
IDV   0.90         0.81        0.94         0.75        0.64         0.98
RTV   0.93         0.87        0.92         0.86        0.69         0.93
NFV   0.96         0.88        0.93         0.87        0.65         0.96
APV   0.67         0.93        0.76         0.72        0.48         0.95

Table 4.3  Sensitivities and specificities.
Chapter 5
Conclusion
5.1
Concluding Remarks and Observations
Using a dataset of HIV-1 reverse transcriptase and protease genotypes with matched drug-susceptibilities I was able to construct decision tree classifiers to recognise genotypic patterns characteristic of drug-resistance for 14 antiretroviral drugs. No prior knowledge about drug-resistance associated mutations was used and mutations at every sequence position were treated equally. Using an independent testing set I was able to judge each decision tree's predictive quality compared with a number of similarly derived decision trees presented in the literature. I also constructed a novel nearest-neighbour classifier to predict drug-susceptibility from genotype for each drug. Each nearest-neighbour classifier used a database of matched phenotype-genotype pairs but, in contrast to decision tree learning, nearest-neighbour learning did not attempt to extract a generalised classification function. In order to investigate the possible advantage of neural network learning over decision tree learning, I derived a decision tree classifier for the protease inhibitor lopinavir and compared it with the neural network classifier for lopinavir presented in [2].
The predictive quality of the decision tree classifiers was mixed. For the decision trees, I found prediction errors between 6.0 - 24.7% for all drugs. These results offered an improvement over the performance of previously published decision trees, which had prediction errors between 8.1 - 51.0% over the same testing set. Nearest-neighbour classifiers exhibited
poorer performance with prediction errors between 18.0 – 46.2%, but still outperformed the
previously published decision trees on some drugs.
5.1.1
Decision Tree Models
The decision trees generated in the scope of this study varied in complexity. In particular, I found rather compact classification models (5-7 interior attribute tests) for the drugs didanosine, stavudine, lamivudine, nevirapine, efavirenz, ritonavir and lopinavir, and more complex models (9-19 interior attribute tests) for the drugs delavirdine, nelfinavir, zalcitabine, indinavir, abacavir, amprenavir and saquinavir. This is in contrast to the decision tree classifiers presented in [1], which had between 4 and 12 interior attribute tests for all drugs.
This increase in complexity may stem from several factors: the genotypes in the training data may exhibit large mutational interdependence; I used a larger training set; the training data may be distributed differently; pruning may have been ineffective; or the training data may contain noise that gives rise to overly specific classification functions. In the most extreme case I can compare the complexity of the decision trees for the protease inhibitor saquinavir. In [1] the decision tree for saquinavir had only 5 interior attribute tests and achieved a prediction error of 12.5% in leave-one-out experiments. A leave-one-out experiment predicts a classification for a case in the training data by constructing a decision tree on the remaining data and then using that tree to give a classification. This rather compact tree is in contrast to the decision tree for saquinavir presented in this study, which has 19 interior attribute tests and achieved a prediction error of 14.3% over an independent testing set consisting of 251 cases. In this case the tree appears to be overgrown and the effects of reduced error pruning are minimal. In particular, only two leaf nodes contain a classification error (an indication of where subtrees have been removed) and 54% of leaves have only 1-4 training cases associated with them. In this respect the tree appears to be overly specific; pruning was not applied aggressively enough to force generalisation.
This is true for a number of the other decision tree classifiers that I generated. In particular, the decision trees for amprenavir, abacavir and indinavir appear to be overly specific, and the effects of reduced error pruning, again, appear to be minimal. The remaining classification models are more similar in structure and complexity to the ones previously published. For models in which overfitting appears to have occurred, a basic reduced error pruning strategy appears not to introduce enough bias to create shorter, more general trees. This is problematic
because shorter trees are more desirable than larger ones: by Occam's razor, shorter trees form a sounder basis for generalisation beyond a set of training data than larger ones. This is particularly important if we wish to use decision tree classifiers as a future tool to help select drug regimens for treating HIV-1 infection. We seek a classification model that generalises well to the entire population of HIV-1 cases.
Looking again at the performance of the compact saquinavir classification model using an independent testing set (the one used to test the complex model), we see a dramatic decrease in performance compared with what was previously published (a prediction error of 20.0%). Similarly, the performance of the other models also decreased, except for d4T, 3TC, NVP and NFV. Here I may argue that leave-one-out experiments are not sufficient to judge the performance of these models, because a single case held out from the training data is unlikely to contain information absent from the remaining data. This is borne out by the fact that some of the models fail to return a classification because they fail to recognise the presence of certain amino acids at certain sequence positions. For example, the previously published decision tree for efavirenz fails to return a classification for 72% of the testing examples.
The relative similarity of performance also makes a strong case for the ID3 algorithm. In
particular, the ID3 algorithm is able to generate decision trees with a similar performance to
the decision trees generated using the C4.5 algorithm. In this setting, many of the C4.5
extensions and features are not required.
Furthermore, the performance of the compact saquinavir classification model over this testing set suggests that the compact model is overly general. In other words, the compact classification model is less able to differentiate between resistant and susceptible cases than the more complex model. In this respect the complex saquinavir classification model may indeed satisfy Occam's razor for this particular dataset, and may even be the smallest possible decision tree that can be generated from the training data. The compact and complex saquinavir classification models therefore each satisfy Occam's razor, in contradictory ways. This presents us with a difficult choice as to which tree would best generalise to the entire population of HIV-1 cases.
Statistically, we can compare the sensitivities and specificities of the models as an indication of how well each decision tree is able to generalise beyond the training population.
Considering the sensitivities (the ability of a model to predict drug-resistance when a case is
truly resistant) of the newly constructed decision tree models with the previously published
models, the results were comparable for most drugs except for zalcitabine, didanosine,
amprenavir and saquinavir. Thus, by using a larger dataset for learning I found classification models with relatively similar sensitivities; the ability of a decision tree to predict resistance seems not to depend greatly on the size of the training dataset.
Considering the specificities (the ability of a model to predict drug-susceptibility when a
case is truly susceptible) of the newly constructed decision tree models with the previously
published models, the new models offer an improvement across the board. Therefore, by
using a larger training dataset I have found classification models that offer an improved
ability to predict susceptibility.
Returning to the decision tree for saquinavir, by inspecting the attribute tests present in the compact and complex models we can get a picture of the genetic basis of saquinavir drug-resistance. The compact model represents saquinavir resistance as being determined by mutational changes at sequence positions 90, 48, 54, 84 and 72. These positions have all previously been described as being associated with either high-level or intermediate-level resistance [14], except position 72. Here, positions associated with high-level resistance are placed closer to the root of the tree. In contrast, the complex model represents saquinavir drug-resistance as being determined by mutational changes at sequence positions 10, 71, 48, 37, 84, 73, 13, 90, 88, 60, 64, 14, 12, 30 and 63. Only positions 10, 71, 48, 84, 73, 90 and 63 have previously been described as being associated with saquinavir resistance. Here, positions 10, 71, 90, 48 and 84 are placed closer to the root of the tree and are associated with high-level resistance, except for positions 10 and 71, which are regarded as only accessory resistance mutations. The other positions are not listed in [14] as being associated with saquinavir resistance. This may imply either that the decision tree has been able to identify as yet unknown resistance-associated mutations from the training data, or that the decision tree-learning algorithm has been fooled by noise in the training data.
This situation is similar for the reverse transcriptase inhibitor classification models. In other
words these models contain a mixture of known high-level, intermediate-level and accessory
resistance associated mutations. They also contain a number of mutations not listed in [15].
However, this situation does not extend to the other protease inhibitor classification models.
For ritonavir, amprenavir and nelfinavir their models tended to identify only previously
known high-level, intermediate-level and accessory resistance associated mutations.
Furthermore, higher-level resistance associated mutations tended to be tested closer to the
roots of their respective trees. Only the classification models for lopinavir and indinavir
followed a similar pattern to that of saquinavir and tended to identify intermediate-level,
accessory-level and previously unknown mutations.
These observations raise three important considerations when applying decision tree learning to the phenotype-prediction problem: choosing an appropriate attribute selection measure, the quality of the training data, and how deeply to grow a tree.
In this study, amino acid positions were selected based on maximal information gain.
However, judging from the types of mutations that are placed closer to the roots of the trees
(in some cases accessory rather than high-level resistance associated mutations) I may have
obtained better classification models if I had used a different attribute selection measure.
Specifically, information gain has a natural bias to prefer attributes with many values to
those with fewer values. In the context of the phenotype prediction problem this could prove
problematic because a certain amino acid position could exhibit a variety of mutations but
nevertheless play no role in drug-resistance. This amino acid position may then be selected,
accidentally, as a good indicator of resistance due to coincidences in the training data.
The size and quality of the training data heavily dictate the quality of the eventual decision tree classifier. Of course, to obtain a decision tree classifier that generalises well to the entire population of HIV-1 cases, the training data should be distributed in such a way that it is representative of the entire population. This is extremely difficult in practice. However, with larger and larger datasets of matched phenotype-genotype pairs becoming available, it may become possible to probabilistically model the distribution of cases within the entire population. Training sets constructed to follow such distributions should therefore generate decision trees with stronger predictive merit. In this way the size of the training set is less important once a minimal number of examples has been reached; what matters is the examples themselves. A better decision tree will be grown if the data it is derived from is of good quality and varied. By good quality I mean that the data is reliable (minimises error), and by varied I mean that for each phenotype we have a wide selection of genotypes.
In this study I simply used the complete HIV-1 protease and reverse transcriptase drug-susceptibility datasets to generate decision tree classifiers with relatively low prediction
errors. However, since no attention was paid to the quality and variety of examples in the training, validation and test sets, these same decision trees may falter when applied to genotypes drawn randomly from the entire population of HIV-1 cases, much as the performance of the previously published decision trees faltered when applied to a different testing set.
The C4.5 and ID3 algorithms grow decision trees just deeply enough to perfectly classify the training examples, in accordance with Occam's razor. Whilst this is sometimes a reasonable strategy, it can lead to difficulties. In particular, for the phenotype prediction problem we can only ever obtain a relatively small subset of training examples compared with the entire population, and this may lead to a classification model that does not generalise well to the true target function. There comes a point during decision tree learning when a tree starts to become specific to the target function represented by the training examples and fails to generalise to the remaining population. In other words, the tree has become overfitted. As previously mentioned, the effects of overfitting can be minimised by using a number of post-pruning strategies. However, as has been shown, the effects of basic reduced error pruning were minimal for some trees. We should therefore introduce a suitable 'weighting' to force pruning to consider even smaller decision trees that may not perform as well against the training set but nevertheless fare better over the remaining population.
5.1.2
Neural Network Models
The use of neural networks may be particularly suited to the phenotype prediction problem for drugs such as lopinavir, where resistance is dictated by complex combinations of multiple mutations [2]. In particular, by representing the classification function Resistant: Genotype → Drug_Susceptibility_Classification using a neural network we are able to represent nonlinear functions of many variables, such as multiple mutations exhibiting large interdependence.
Indeed, by using neural network learning we do not have to make any prior assumptions about the form of the target function. For example, feedforward networks containing three layers of sigmoid perceptrons are able to approximate any function to arbitrary accuracy, given a sufficient (potentially very large) number of units in each layer [16]. This is in contrast to decision tree learning, where we must make some judgement as to what size of trees should be preferred to others, i.e. what bias we should introduce.
Furthermore, representing the target function using a neural network has the advantage over decision tree learning that the predicted drug-susceptibility is quantitative. In other words, such a neural network can return a predicted fold-change value rather than simply a discrete classification such as 'resistant' or 'susceptible'.
In [2] a neural network classification model was constructed from a dataset of phenotype-genotype pairs to predict resistance to the protease inhibitor lopinavir. The performance of the model was determined using a testing set of 177 examples and the results were expressed using the linear correlation coefficient. The linear correlation coefficient is a number between -1 and 1 that measures how close a set of points lies to a straight line; a value of 1 indicates that all the points fall on a straight line. In other words, it was used in this case to determine how well the predicted and actual fold-change values agree. For their 'best' neural network, a correlation coefficient of 0.88 was obtained. In order to compare the performance of this network against decision tree learning I obtained an equivalent decision tree classifier to predict lopinavir resistance using a dataset of phenotype-genotype pairs obtained from the Stanford database.
I obtained a relatively simple decision tree classification model that determines lopinavir
resistance according to mutations at the positions 10, 82, 71, 72, 46 and 93. With the
exception of position 72 all these positions have been previously recognised as being
significant for lopinavir resistance [2]. Using an independent testing set, consisting of 95
cases, the decision tree had a prediction error of 10.5% and a sensitivity and specificity of
0.92 and 0.86, respectively. In other words, for 89.5% of the testing cases the predicted
classification agreed with the true classification. This result is comparable to the correlation
coefficient 0.88 obtained for the neural network. However, some allowance should be made
for differences in the size of the testing sets. In particular, the neural network model was
tested against 177 examples, which is nearly double that for the decision tree.
However, the decision tree model has the advantage over the neural network model in that it
is easily interpreted. In other words, experts can easily understand and examine the
knowledge portrayed by the decision tree. Using decision trees it is easy to derive a set of corresponding rules: tracing out a path from the root of the tree to a leaf yields a single rule, where the internal nodes of the tree induce the premises of the rule and the leaf determines its conclusion. Such rules can be presented as evidence for a classification.
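For example, tracing one path through the lopinavir tree of figure 4.23 yields the rule:

IF [P10] = <I> AND [P82] = <V> AND [P71] = <A> AND [P46] = <M> THEN susceptible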
This is in contrast to neural network models that act as a black box. A genotype is given as
input and a fold-change value is given as output but no clue as to how such a prediction was
made is available. One may wish to look inside the black box but all that will be found is a
number of connected units with no real meaningful interpretation. An analogy can be made
with looking inside at the workings of the human brain that is in itself made up from millions
of similar interconnected units. It is not enough to simply understand how each unit
processes signals rather we wish to know the knowledge that they portray.
In addition, like decision tree learning, neural network learning is prone to overfitting. This
occurs as training proceeds because some weights will continue to be updated in order to
reduce the error over the training data. However, there are a number of methods available to
help minimise the effects of overfitting. Most methods introduce a bias to prefer neural
networks with small weight values, i.e. to avoid learning complex decision surfaces. But
compared to decision trees these solutions are not as aesthetically pleasing.
Also, in practical terms, neural network training algorithms typically require longer training times than decision tree learning: training times range from a few seconds to many hours, depending on the number of weights to be learned in the network and the number of training examples.
5.1.3
Nearest-Neighbour Models
Nearest-neighbour models have an advantage over decision tree and neural network models
in that no explicit representation of the target function needs to be made. In particular, a
nearest-neighbour method simply stores the set of training examples and postpones
generalising beyond these examples until a new case must be classified. Here we avoid the
problems of overfitting and estimating a one-time target function that embodies the entire
population. In other words, a nearest-neighbour classifier represents the target function by a
combination of many local approximations, whereas decision tree and neural network
learning must commit at training time to a single global approximation. In this respect, a
nearest-neighbour classifier effectively uses a richer hypothesis space than both decision tree
and neural network learning. In addition, because all the training data is stored and reused in its entirety, the information it contains is never lost. The main difficulty in
defining a nearest-neighbour classifier for the phenotype-prediction problem lies in determining an appropriate distance metric for retrieving 'similar' genotypes.
In the scope of this study I derived a nearest-neighbour classifier to predict drug-susceptibility from genotype, based on a novel distance metric. The performance of the nearest-neighbour classifiers was assessed, in the same way as the decision tree classification models, using a randomly selected independent test set of genotype cases. In comparison to the decision tree models created in this study, these classifiers fared poorly. In particular, I found prediction errors in the range of 18.0 - 46.2%, compared to prediction errors in the range of 6.3 - 24.7% for the decision trees.
However, given these results we cannot conjecture that all nearest-neighbour methods will
fare poorly in the context of the phenotype prediction problem. This is clear when we
consider the commercial success of Virco’s virtualPhenotype™, which employs a
nearest-neighbour classification scheme. Indeed, some studies have even shown that Virco’s
prediction scheme is useful as an independent predictor of the clinical outcome of
antiretroviral therapy [17].
It is clear from these results that the distance metric used in this study is too naive and is not
able to fully capture the genetic basis of drug-resistance when comparing genotypes. In
particular, it does not take into consideration any details of the mutations that are present;
rather, it only considers the percentage of differences between two sequences. However, for
the protease inhibitors this novel distance metric produced reasonable results. Specifically, I
found prediction errors in the range of 18.0 – 19.1%, except for amprenavir, which had a
prediction error of 26.4%. This suggests that for these drugs, drug-resistance can be
characterised in some way by the amount of shared genetic mutations.
A practical disadvantage of nearest-neighbour classifiers is that they are inefficient when
classifying a new case, because all the processing is performed at query time rather than in
advance, as with decision tree and neural network learners.
5.2 Suggestions For Further Work
5.2.1 Handling Ambiguity Codes
Within this study, ambiguity codes are not handled in an effective manner. An ambiguity
code may occur during genotyping when a sample containing a population of HIV-1
sequences is found to contain a number of possible amino acids at a specific sequence
position. When this occurs, some mutations may be represented by multiple amino acid
codes, reflecting the detection of more than one amino acid at that position. In this case it is
ambiguous which amino acid should be used when modelling the genotype sequence.
Within this study, the first amino acid code encountered is used for modelling. This is
wholly inadequate and would have been improved upon had more time permitted.
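As a minimal sketch of the problem, assuming a mixture is encoded as a string of candidate amino-acid codes (e.g. "KR" when both K and R are detected), the study's first-code policy can be contrasted with an alternative that enumerates every unambiguous sequence the sample may represent:

import java.util.ArrayList;
import java.util.List;

public class AmbiguityDemo {
    // Policy used in this study: keep only the first amino acid listed.
    static char firstCode(String mixture) {
        return mixture.charAt(0);
    }

    // A possible improvement: expand a genotype with mixtures into every
    // unambiguous sequence it may represent, classify each expansion and
    // combine the results (e.g. resistant if any expansion is resistant).
    static List<String> expand(String[] positions) {
        List<String> sequences = new ArrayList<>();
        sequences.add("");
        for (String mixture : positions) {
            List<String> next = new ArrayList<>();
            for (String prefix : sequences)
                for (char aa : mixture.toCharArray())
                    next.add(prefix + aa);
            sequences = next;
        }
        return sequences;
    }

    public static void main(String[] args) {
        String[] genotype = {"P", "Q", "KR", "V"};   // mixture at the third position
        System.out.println(firstCode(genotype[2]));  // K
        System.out.println(expand(genotype));        // [PQKV, PQRV]
    }
}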
5.2.2 Using A Different Attribute Selection Measure
As was mentioned earlier, decision attributes were selected based on maximal information
gain. Here, I may have obtained better classification models had I used a different attribute
selection measure that does not favour attributes with large numbers of values. A way to
avoid this bias is to select decision attributes based on a measure called gain ratio. The gain
ratio penalises attributes with a large number of values by incorporating a term called split
information, which is sensitive to how uniformly a decision attribute splits the training data:
split_information(S, A) = − p(+) log2 p(+) − p(-) log2 p(-)
where p(+) and p(-) are the proportions of positive and negative examples, respectively,
resulting from partitioning S by the attribute A. Using this measure, the gain ratio is then
calculated as:
gain_ratio(S, A) = ig(S, A) / split_information(S, A)
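As a minimal sketch of these two formulas, following the two-class form of split information given above (the information gain ig(S, A) is assumed to have been computed already):

public class GainRatio {
    // Two-class split information: pPos and pNeg are the proportions of
    // positive and negative examples after partitioning S by A.
    static double splitInformation(double pPos, double pNeg) {
        return -term(pPos) - term(pNeg);
    }

    private static double term(double p) {          // p * log2(p), with 0 log 0 = 0
        return p == 0.0 ? 0.0 : p * (Math.log(p) / Math.log(2.0));
    }

    // gain_ratio(S, A) = ig(S, A) / split_information(S, A)
    static double gainRatio(double informationGain, double pPos, double pNeg) {
        double si = splitInformation(pPos, pNeg);
        return si == 0.0 ? 0.0 : informationGain / si;  // guard a degenerate split
    }

    public static void main(String[] args) {
        System.out.println(gainRatio(0.40, 0.75, 0.25));  // 0.40 / 0.811 ≈ 0.49
    }
}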
Had more time been permitted, I would have experimented with this measure and
other possible selection measures. Would this have made a big impact on the complexity
and/or structure of the decision tree models?
5.2.3 Handling Missing Information
Missing information was also handled poorly in this study. A training set may exhibit only a
subset of the possible mutations that can occur at a particular sequence position. Therefore,
when growing the decision tree using this dataset, only this subset of amino acids will be
considered valid at that position. This is problematic because the decision tree will fail to
recognise the presence of other amino acids at that position. Within this study, if this
situation arises at a decision node n, the classification process is halted and a classification
of ‘unknown’ is returned. This is not ideal; we would at least like to take into account the
decision knowledge already portrayed by the decision tree up to this point. Returning a
classification of ‘unknown’ is worthless.
One possible strategy for dealing with this problem is to assign a classification based on the
most common classification of the training examples associated with the decision node n.
A second, more complex procedure begins by assigning a probability to each of the values of
an attribute. These probabilities are based on the observed frequencies of the various values
of the attribute among the training examples at node n. Now, when we encounter the node n,
instead of stopping we continue down the most probable path, and then proceed as normal
until we reach a classification.
Had more time been permitted, I would have implemented the second of these strategies in
order to maximise the information that a decision tree uses when classifying examples taken
from the entire population of HIV-1 cases.
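As a minimal sketch of this second strategy (the node structure and field names are illustrative assumptions, not the thesis Tree class):

import java.util.HashMap;
import java.util.Map;

// Each decision node tests one sequence position; each branch records how
// many training examples followed it, so that an amino acid never seen at
// this position during training can be routed down the most probable path
// instead of yielding a classification of 'unknown'.
public class DecisionNode {
    int position;
    String classification;                           // non-null at leaves only
    Map<Character, DecisionNode> branches = new HashMap<>();
    Map<Character, Integer> trainingCounts = new HashMap<>();

    public String classify(char[] sequence) {
        if (classification != null) return classification;
        DecisionNode next = branches.get(sequence[position]);
        if (next == null) {
            char best = 0;
            int bestCount = -1;
            for (Map.Entry<Character, Integer> e : trainingCounts.entrySet())
                if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
            next = branches.get(best);               // most probable branch
        }
        return next.classify(sequence);
    }
}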
5.2.4 Using A Different Distance Metric
As has already been highlighted, the distance metric used for nearest-neighbour classification
was not strong enough to truly judge the similarity of two genotype sequences. Had more
time been permitted, I would have experimented with different distance measures. In
particular, I would have investigated statistical measures based on the comparison of two
dot-plots.
5.2.5 Receiver Operating Characteristic Curve
A good way to analyse the performance of a classification model is to plot a receiver
operating characteristic (ROC) curve. A ROC curve is commonly used within clinical
research to investigate the trade-off between the sensitivity and specificity of a classification
model. The x-axis of a ROC graph corresponds to the specificity of a model, i.e. the ability
of the model to identify true negatives. Conversely, the y-axis corresponds to the sensitivity
of a model, i.e. how well the model is able to predict true positives. In this way we are able
to look more closely at the ability of a classification model to discriminate between
drug-resistant and drug-susceptible genotypes: the greater the sensitivity at high specificity
values, the better the model. Such an analysis also facilitates the comparison of two or more
classification models.
In order to plot a ROC curve for each of the decision tree models that I created, I would
have to introduce a weighting parameter, α, to determine whether an example should be
classified as resistant or not. In more detail, for each leaf node at which there is a prediction
error (i.e. a leaf created as a result of pruning) we have a probability of an example being
classified as resistant. We introduce α at these leaves and say that an example is resistant if
and only if the probability of the example being resistant is greater than α. The ROC curve
can then be created by varying α from 1 down to 0 and computing the sensitivity and
specificity of the model for each value of α.
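As a minimal sketch of this procedure, assuming each test example has been assigned a leaf probability of resistance together with its true phenotype label:

public class RocSketch {
    // probabilities[i]: the leaf probability that example i is resistant;
    // resistant[i]: the true phenotype of example i.
    static void rocPoints(double[] probabilities, boolean[] resistant) {
        for (int step = 20; step >= 0; step--) {     // alpha from 1 down to 0
            double alpha = step / 20.0;
            int tp = 0, fp = 0, tn = 0, fn = 0;
            for (int i = 0; i < probabilities.length; i++) {
                boolean predicted = probabilities[i] > alpha;  // resistant iff p > alpha
                if (predicted && resistant[i]) tp++;
                else if (predicted) fp++;
                else if (resistant[i]) fn++;
                else tn++;
            }
            double sensitivity = (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
            double specificity = (fp + tn) == 0 ? 0.0 : (double) tn / (fp + tn);
            System.out.printf("alpha=%.2f sensitivity=%.2f specificity=%.2f%n",
                    alpha, sensitivity, specificity);
        }
    }

    public static void main(String[] args) {
        rocPoints(new double[]{0.9, 0.8, 0.4, 0.1},
                  new boolean[]{true, true, false, false});
    }
}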
During the course of this project I had hoped to compare the performance of the
classification models by comparing the ROC curves that they produce. However, I was
unable to complete this work through lack of time.
5.2.6 Other Machine Learning Approaches
There are many other machine-learning strategies available. For example, by using genetic
algorithms we can generate a set of classification rules, similar to decision tree learning.
Investigating the possible use of these methods for the phenotype prediction problem
presents a wide scope for future research. In particular, can we obtain better classification
models by using different machine-learning approaches, or can we obtain even better results
through the hybridisation of different approaches?
Appendix A
Pre-Processing Software
I developed a simple text parser, using Java, that takes as input a data file from the Stanford
HIV Resistance Database and outputs a new data file containing the modelled samples, as
previously described. This can be run from the command line using the command
> java DataParser -i inputFile.txt -o outputFile.txt -g gene
where inputFile.txt is a data file obtained from the Stanford Database and gene is one
of reverse transcriptase (rt) or protease (p).
The parser reads each line in the original data file separately and interprets the information
present in specific columns. This is a reasonable strategy considering the format of the file
(tab-delimited); furthermore, future updates to this information will only ever alter the
amount of data, not the way it is presented. On interpretation of each line in the original
file, a new instance is created that has a unique identification, the fold-change values for each
drug and a set of attributes as described previously. Individual instances are then written to
the program's output, as shown below.
A data file from the Stanford HIV Resistance Database is fed through the simple text parser to produce instances:

SeqId  APV_Fold  ATV_Fold  NFV_Fold  …  P1  P2  P3  …  P99
7439   4.3       -         46.1      …  P   Q   V   …  F
7443   2.3       -         11.0      …  P   Q   V   …  F
…
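As a minimal sketch of this kind of line-oriented parsing (the column layout used here is illustrative; the Stanford file's actual columns differ):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DataParserSketch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] columns = line.split("\t");       // tab-delimited file
                // Illustrative layout: sequence id in column 0, one fold-change
                // value in column 1, amino-acid attributes from column 2 onwards.
                String seqId = columns[0];
                String fold = columns[1];
                StringBuilder attributes = new StringBuilder();
                for (int i = 2; i < columns.length; i++)
                    attributes.append(columns[i]).append(' ');
                System.out.println(seqId + " " + fold + " " + attributes.toString().trim());
            }
        }
    }
}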
Appendix B
Cultivated Phenotype
I developed a generic machine-learning program, called Cultivated Phenotype, that (1)
represents and manipulates a dataset of phenotype-genotype pairs as obtained from the
Stanford HIV-1 reverse transcriptase and protease drug-susceptibility datasets; (2)
characterises a dataset; (3) constructs a decision tree according to a training dataset; (4)
prunes a decision tree according to a validation dataset; (5) predicts drug-susceptibility from
genotype using either a decision tree or a 3-nearest-neighbour classifier; (6) displays
performance statistics of a single decision tree according to a testing dataset; and (7)
compares the performance of newly constructed decision trees, hand-coded decision trees
and 3-nearest-neighbour classifiers according to a testing dataset.
The learning component was developed using Java because it was readily available and did
not require any licences. Furthermore, developing the program in Java allows for easy
conversion into a form that could be made accessible from the World Wide Web. This is
advantageous because it promotes the distribution of the knowledge the program encodes,
and this knowledge could ultimately be harnessed by clinicians to help manage HIV-1
infection.
In detail, the learning component was developed using an Object-Oriented methodology and
the following classes were defined: CultivatedPhenotype, Example, Attribute, Reference,
DrugWindow, AlterExperience, DataCharacteristics, Tree, BeerenwinkelModels,
NearestNeighbour, QueryWindow, PerformanceWindow, Data, and Graph. Summarised as
follows:
• CultivatedPhenotype: This is the main program thread and all subsequent functionality
stems from it. It defines four tables: the first containing a complete dataset of
phenotype-genotype examples; the second containing a subset of examples to be used for
training; the third containing a subset of examples to be used for validation and the fourth
containing a subset of examples to be used for testing.
Functionality realised by the procedures: void loadFile(string filename).
• DrugWindow: Defines a number of antiretroviral drugs and associated fold-change
thresholds. Provides functionality to filter a dataset according to a particular drug,
allowing only information related to that drug.
Functionality realised by the procedures: void filterDrug(string antiretroviral),
void setThreshold(double foldValue).
• Example: Defines a single phenotype-genotype pair. In particular, an Example contains a
fold-change value, a sequence identification and a set of Attributes. In addition, each
Example contains a feature vector constructed from its set of Attributes and a Reference.
Includes functionality to compute the similarity of this example (feature vector) with
another.
Functionality realised by the procedures: boolean isResistant(),
string getAttribute(int index), void computeComparisionArray() and
int distanceFromThisSequence(char[] otherComparisionArray).
• AlterExperience: Defines the training, validation and test sets. Provides functionality to
randomly set aside a percentage of the entire dataset for testing.
Functionality realised by the procedures:
void setAndDisplayTrainingTestSets(int percentage).
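As a minimal sketch of this sampling scheme (Java's Collections.shuffle stands in here for the random-number selection; shuffling once and cutting is equivalent to repeated random selection without replacement):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitSketch {
    // [0, testSize) -> testing; the next 20% of the remainder -> validation;
    // everything else -> training.
    static <T> List<List<T>> split(List<T> data, int testPercentage) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled);
        int testSize = data.size() * testPercentage / 100;
        int validationSize = (data.size() - testSize) / 5;
        List<List<T>> parts = new ArrayList<>();
        parts.add(shuffled.subList(0, testSize));
        parts.add(shuffled.subList(testSize, testSize + validationSize));
        parts.add(shuffled.subList(testSize + validationSize, shuffled.size()));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(i);
        List<List<Integer>> parts = split(data, 20);
        System.out.println(parts.get(0).size() + " testing, "
                + parts.get(1).size() + " validation, "
                + parts.get(2).size() + " training");   // 20, 16, 64
    }
}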
• DataCharacteristics: Describes a number of properties of the training, validation and
testing datasets. For example, determines the total number of examples, the number of
examples classified as resistant and the distribution of fold-change values.
Functionality realised by the procedures: string getDataCharacteristics().
• Tree: Defines the gross structure of a decision tree. Includes functionality to grow a new
decision tree using the ID3 algorithm, a set of examples and a set of attributes; to
self-prune using reduced-error pruning and a set of validation examples; and to return a
drug-susceptibility classification given a genotype sequence.
Functionality realised by the procedures:
void addBranch(Tree subtree, string label),
Tree id3(object[] examples, vector attributes), Attribute getBestAttributeGain(),
double getEntropy(object[] examples), void prune(object[] examples),
string queryTree(char[] sequence).
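As a minimal sketch of the reduced-error pruning step (the node structure, field names and bottom-up traversal are illustrative assumptions, not the thesis code):

import java.util.HashMap;
import java.util.Map;

// Illustrative node: majorityClass records the most common classification
// among the training examples that reached this node.
class PruneNode {
    int position;
    String classification;                       // non-null at leaves only
    String majorityClass;
    Map<Character, PruneNode> branches = new HashMap<>();

    boolean isLeaf() { return classification != null; }

    String classify(char[] sequence) {
        if (isLeaf()) return classification;
        PruneNode next = branches.get(sequence[position]);
        return next == null ? majorityClass : next.classify(sequence);
    }
}

public class ReducedErrorPruning {
    // Visit nodes bottom-up; replace a subtree by a leaf predicting the
    // node's majority class whenever this does not hurt validation accuracy.
    static void prune(PruneNode root, PruneNode node, char[][] valSeqs, String[] valLabels) {
        if (node.isLeaf()) return;
        for (PruneNode child : node.branches.values())
            prune(root, child, valSeqs, valLabels);
        double before = accuracy(root, valSeqs, valLabels);
        Map<Character, PruneNode> savedBranches = node.branches;
        String savedClassification = node.classification;
        node.branches = new HashMap<>();
        node.classification = node.majorityClass;           // tentatively prune
        if (accuracy(root, valSeqs, valLabels) < before) {  // pruning hurt: undo
            node.branches = savedBranches;
            node.classification = savedClassification;
        }
    }

    static double accuracy(PruneNode root, char[][] seqs, String[] labels) {
        int correct = 0;
        for (int i = 0; i < seqs.length; i++)
            if (root.classify(seqs[i]).equals(labels[i])) correct++;
        return seqs.length == 0 ? 0.0 : (double) correct / seqs.length;
    }
}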
• BeerenwinkelModels: Defines a number of Tree objects corresponding to the decision
tree classifiers presented in [1].
• NearestNeighbour: Defines a 3-nearest-neighbour classifier that imposes an ordering on
the training and validation datasets according to a similarity measure defined on Example.
Includes functionality to return a drug-susceptibility classification given a genotype
sequence.
Functionality realised by the procedures:
string queryNearestNeighbour(Example queryExample), string getClassification().
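As a minimal sketch of the 3-nearest-neighbour vote (class names and the stand-in distance are illustrative; note that with three voters and two classes a majority always exists, which is one convenience of k = 3):

import java.util.Arrays;
import java.util.Comparator;

public class ThreeNearestNeighbour {
    static class Labelled {
        char[] features;
        boolean resistant;
        Labelled(char[] features, boolean resistant) {
            this.features = features;
            this.resistant = resistant;
        }
    }

    // Order the stored examples by distance to the query and take a
    // majority vote over the three closest.
    static boolean classify(Labelled[] training, char[] query) {
        Labelled[] sorted = training.clone();
        Arrays.sort(sorted, Comparator.comparingInt((Labelled e) -> distance(e.features, query)));
        int votes = 0;
        for (int i = 0; i < 3 && i < sorted.length; i++)
            if (sorted[i].resistant) votes++;
        return votes >= 2;
    }

    // Stand-in distance: the number of differing positions. The thesis
    // metric is percentage based and is sketched at the end of this appendix.
    static int distance(char[] a, char[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++)
            if (a[i] != b[i]) d++;
        return d;
    }
}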
• QueryWindow: Provides the ability to query a decision tree or nearest-neighbour
classifier using either a mutation list or a nucleotide sequence. Includes functionality to
obtain a classification, output a drug-susceptibility classification and give an explanation
of a classification.
Functionality realised by the procedures: char[] getQuerySequence(),
char translateCodon(string codon).
• Data: Defines a number of properties regarding the performance of a classifier with
respect to a testing dataset. For example, a Data object stores the total number of
examples in a testing set; the number of examples correctly classified as resistant; the
number incorrectly classified as resistant; the number correctly classified as susceptible;
and the number incorrectly classified as susceptible. Includes functionality to compute the
sensitivity, specificity, positive prediction value, negative prediction value, positive
likelihood ratio and negative likelihood ratio.
Functionality realised by the procedures: double getSensitivity(),
double getSpecificity(), double getPositiveLikelihoodRatio(),
double getNegativeLikelihoodRatio(), double getPositivePredictionValue(),
double getNegativePredictionValue(), double getPercentageCorrectlyClassified().
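As a minimal sketch of the performance measures the Data class computes, using the standard confusion-matrix definitions (the class and field names are illustrative; zero denominators are not guarded):

public class PerformanceData {
    int tp, fp, tn, fn;   // confusion-matrix counts over a testing dataset

    double sensitivity()             { return (double) tp / (tp + fn); }
    double specificity()             { return (double) tn / (fp + tn); }
    double positivePredictionValue() { return (double) tp / (tp + fp); }
    double negativePredictionValue() { return (double) tn / (tn + fn); }
    double positiveLikelihoodRatio() { return sensitivity() / (1.0 - specificity()); }
    double negativeLikelihoodRatio() { return (1.0 - sensitivity()) / specificity(); }
    double percentageCorrectlyClassified() {
        return 100.0 * (tp + tn) / (tp + fp + tn + fn);
    }

    public static void main(String[] args) {
        PerformanceData d = new PerformanceData();
        d.tp = 40; d.fn = 10; d.tn = 45; d.fp = 5;
        System.out.println(d.sensitivity());                   // 0.8
        System.out.println(d.percentageCorrectlyClassified()); // 85.0
    }
}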
• PerformanceWindow: Displays the information stored in a Data object.
• Graph: Plots the results of a number of experimental runs. Includes functionality to
create an independent testing dataset. The remaining examples are then sampled to create
a variety of learning experiences and, for each learning experience, a new decision tree
and a nearest-neighbour classifier are constructed. For each new decision tree and
nearest-neighbour classifier, their performance with respect to the testing dataset is
recorded. In addition, the performance of a BeerenwinkelModel is computed with respect
to the same testing dataset. The performance of each model is plotted for each training
experience.
Functionality realised by the procedure: vector testDTree().
Above, each class encompasses a number of important procedures. The role of these
procedures is described below.
void loadFile(string filename): Reads a data file as constructed from the text parser described in Appendix A. Creates a table of phenotype-genotype pairs.

void filterDrug(string antiretroviral): Removes from the entire dataset of phenotype-genotype pairs the fold-change values associated with every drug except antiretroviral.

void setThreshold(double foldValue): Initialises the fold-change threshold to be used for discriminating resistant from susceptible samples.

boolean isResistant(): Compares the fold-change value of the Example with the fold-change threshold. Returns true (resistant) if the fold-change value exceeds the threshold and false (susceptible) otherwise.

string getAttribute(int i): Given an index i, returns the value of the ith attribute; in other words, returns the amino acid present at position i.

void computeComparisionArray(): Creates a feature vector for the Example. In particular, retrieves a reference sequence and compares it to the attribute values of the Example. For positions in which there was no change in amino acid, the feature vector is augmented with a dummy value; for all other positions it is augmented with the attribute value.

int distanceFromThisSequence(char[] ot): Given a feature vector ot, returns a score representing the distance between the feature vector of this Example and ot. The distance is computed as the sum of two factors: for positions in which the two feature vectors contain non-dummy values, the percentage of these that are different; to this is added the percentage of the remaining positions that are different.

void setAndDisplayTrainingTestSets(int i): Given a percentage of the entire dataset to allocate for testing, i, randomly samples the entire dataset to construct a training and a testing dataset. Furthermore, 20% of the training set is then randomly sampled to create a validation dataset. Examples are selected using a random number generator (without replacement), so that each Example is equally likely to be picked.

string getDataCharacteristics(): Given training, validation and testing datasets, outputs a description of their characteristics. In particular, it computes the total number of Examples in each dataset; the number of Examples currently classified as resistant; the sequences in each dataset with their phenotypes; and the distribution of phenotype values within the dataset.

void addBranch(Tree subtree, string label): Creates a new branch from a tree node (amino-acid position) with a label (amino acid) and a subtree.

Tree id3(object[] examples, vector ats): Given a set of phenotype-genotype pairs, examples, and a set of attributes, ats, constructs a decision tree according to the ID3 algorithm.

Attribute getBestAttributeGain(): Using a set of phenotype-genotype pairs and a set of attributes, returns the attribute with maximal information gain.

double getEntropy(object[] examples): Given a set of phenotype-genotype pairs, examples, computes the entropy of the dataset, a measure of its (im)purity.

void prune(object[] examples): Given a dataset of phenotype-genotype pairs, examples, considers each node in the tree for pruning. In particular, implements the reduced-error pruning strategy.

string queryTree(char[] sequence): Given an unseen genotype sequence, sequence, obtains a drug-susceptibility classification by sorting the example down through the tree. In particular, each node in the tree tests a specific amino-acid position for certain amino acids.

string queryNearestNeighbour(Example q): Given an unseen genotype Example, imposes an ordering on the Examples in the training and validation datasets, as dictated by the distance measure computed by int distanceFromThisSequence(char[] ot).

string getClassification(): Provided that the Examples in the training and validation datasets are ordered, selects the three Examples closest to an unseen genotype and returns the majority drug-susceptibility classification.

char[] getQuerySequence(): Constructs a genotype sequence from a set of mutations.

char translateCodon(string codon): Given a sequence of three nucleotides, codon, uses the genetic code to translate this codon into an amino acid.

double getSensitivity(): Given the number of true positives, tp, and false negatives, fn, returns tp / (tp + fn).

double getSpecificity(): Given the number of false positives, fp, and true negatives, tn, returns tn / (fp + tn).

double getPositiveLikelihoodRatio(): Returns sensitivity / (1 – specificity).

double getNegativeLikelihoodRatio(): Returns (1 – sensitivity) / specificity.

double getPositivePredictionValue(): Given the number of true positives, tp, and false positives, fp, returns tp / (tp + fp).

double getNegativePredictionValue(): Given the number of true negatives, tn, and false negatives, fn, returns tn / (tn + fn).

double getPercentageCorrectlyClassified(): Given the total number of examples tested, tot, the number of true positives, tp, and the number of true negatives, tn, returns ((tp + tn) / tot) * 100.

vector testDTree(): Creates a testing set of phenotype-genotype pairs using 20% of the entire dataset, then constructs 20 different training experiences by randomly sampling the remaining data. A new decision tree classifier is created for each learning experience, and for each training experience the performance over the entire testing set of each newly constructed decision tree, a Beerenwinkel model and the k-nearest-neighbour classifier is plotted.
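As a minimal sketch of the distance computation described for int distanceFromThisSequence(char[] ot) above (the DUMMY marker and class name are illustrative, not the thesis code):

public class DistanceSketch {
    static final char DUMMY = '.';   // marks positions identical to the reference

    // The percentage of mutually non-dummy positions that differ, plus the
    // percentage of the remaining positions that differ.
    static double distance(char[] a, char[] b) {
        int shared = 0, sharedDiff = 0, rest = 0, restDiff = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != DUMMY && b[i] != DUMMY) {
                shared++;
                if (a[i] != b[i]) sharedDiff++;
            } else {
                rest++;
                if (a[i] != b[i]) restDiff++;
            }
        }
        double first = shared == 0 ? 0.0 : 100.0 * sharedDiff / shared;
        double second = rest == 0 ? 0.0 : 100.0 * restDiff / rest;
        return first + second;
    }

    public static void main(String[] args) {
        // Five positions; '.' means "same as the reference sequence".
        System.out.println(distance("..KV.".toCharArray(), "..RV.".toCharArray()));  // 50.0
    }
}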
Appendix C
The Complete Datasets
Given below is a list of the sequences (given as the sequence ids used in the Stanford HIV
resistance database) along with the fold-change values that were associated with each drug.
I used the following sequences from the Stanford HIV-1 protease drug susceptibility data set:
Instance_Id  Seq_Id  APV_Fold  ATV_Fold  NFV_Fold  RTV_Fold  SQV_Fold  LPV_Fold  IDV_Fold
[Instances i1 to i911: each row pairs a Stanford sequence id with the fold-change value measured for each of the seven protease inhibitors, with '-' where no value was recorded.]
I used the following sequences from the Stanford HIV-1 reverse transcriptase drug
susceptibility data set:
Id  Seq_Id  3TC_Fold  ABC_Fold  D4T_Fold  DDC_Fold  DDI_Fold  DLV_Fold  EFV_Fold  NVP_Fold
[Instances i1 to i324: each row pairs a Stanford sequence id with the fold-change value measured for each of the eight reverse transcriptase inhibitors, with '-' where no value was recorded.]
38773
38775
38777
40.0
50.0
80.0
80.0
80.0
2.6
80.0
80.0
80.0
1.6
116.0
4.7
8.6
20.0
116.0
166.7
90.9
155.8
90.9
7.9
155.8
2.2
90.9
77.1
7.1
4.9
90.9
90.9
166.7
90.9
90.9
8.8
166.7
1.0
1.0
155.8
166.7
4.8
60.6
166.7
166.7
90.9
90.9
5.4
13.0
5.9
8.8
4.4
3.2
3.7
8.2
3.1
1.1
19.0
19.0
6.1
5.6
19.0
6.6
3.3
11.0
1.6
6.6
6.7
3.2
4.2
0.8
0.9
3.3
5.1
3.6
4.0
3.8
-
1.5
3.7
2.9
3.4
0.9
1.8
0.8
1.7
1.0
1.0
5.2
3.9
4.3
4.2
1.7
8.8
0.5
0.9
0.6
3.8
4.9
1.7
2.4
0.9
4.3
1.2
0.7
1.4
3.3
0.7
0.7
1.6
1.7
1.0
1.1
0.8
3.1
5.1
0.9
1.1
1.4
1.3
0.6
2.1
2.1
2.2
4.3
1.8
1.2
1.9
2.4
1.7
1.0
10.0
2.9
4.1
4.4
3.7
4.8
0.8
2.3
2.6
5.4
3.0
1.3
2.7
3.6
1.5
0.8
1.7
3.2
3.1
1.0
4.6
3.3
2.3
1.1
1.1
1.9
1.9
1.4
1.6
1.9
1.9
2.2
1.2
1.5
3.3
2.1
3.1
1.7
1.1
1.7
1.9
1.3
1.0
25.0
25.0
25.0
25.0
25.0
3.7
0.8
1.6
0.7
1.8
2.0
1.2
6.3
4.0
1.9
1.8
1.3
3.9
2.3
0.7
2.2
2.3
1.8
1.1
1.1
1.7
2.1
1.6
1.0
1.4
1.6
3.2
1.2
86
38.0
80.0
7.5
100.0
1.4
0.4
46.0
2.2
30.0
3.5
6.0
24.0
68.0
1.0
13.0
1.3
0.9
33.0
5.0
17.0
2.3
0.2
0.7
0.4
66.0
0.8
870.0
130.0
36.0
39.0
270.0
170.0
97.0
11.0
2.0
12.0
1.8
1.2
0.5
3.6
2.8
0.2
1.1
3.5
2.6
1.3
8.8
2.1
4.5
1.1
6.5
0.2
4.8
11.7
5.0
1.6
0.4
1.3
2.1
1.3
1.5
0.2
0.1
1.1
0.7
0.1
5.6
3.6
57.0
80.0
44.0
100.0
0.9
0.5
20.0
1.2
5.2
24.0
6.9
5.6
19.0
36.0
0.6
3.8
1.1
1.6
1.1
1.1
3.8
140.0
4.0
4.6
97.0
1.2
0.9
0.6
110.0
2400.0
210.0
250.0
84.0
32.0
404.0
1400.0
3800.0
100.0
57.0
625.0
0.8
0.4
0.3
0.4
2.0
20.0
0.3
0.8
0.3
0.8
1.0
0.8
0.2
0.5
0.4
0.2
-
195.0
207.0
100.0
100.0
0.8
0.6
43.0
1.3
6.2
5.3
14.0
3.2
44.0
50.0
0.9
120.0
1.9
2.8
100.0
1.9
1500.0
44.0
41.0
290.0
1.7
3.1
1.7
590.0
110.0
310.0
136.0
270.0
1600.0
2600.0
1100.0
140.0
260.0
268.0
2.8
2.6
133.0
4.3
1.3
2.0
0.6
0.6
1.9
3.7
0.5
0.4
4.3
2.2
3.6
114.0
1.7
1.5
1.8
5.6
0.3
1.9
3.0
2.0
1.5
0.3
0.9
1.1
4.4
0.8
0.2
1.2
1.0
0.4
0.1
3.3
3.2
Appendix C. The Complete Datasets
i325
i326
i327
i328
i329
i330
i331
i332
i333
i334
i335
i336
i337
i338
i339
i340
i341
i342
i343
i344
i345
i346
i347
i348
i349
i350
i351
i352
i353
i354
i355
i356
i357
i358
i359
i360
i361
i362
i363
i364
i365
i366
i367
i368
i369
i370
i371
i372
i373
i374
i375
i376
i377
i378
i379
i380
i381
i382
i383
i384
i385
i386
i387
i388
i389
i390
i391
i392
i393
i394
i395
i396
i397
i398
i399
i400
i401
i402
i403
i404
i405
i406
i407
38779
38781
38783
38785
38787
38789
38791
38793
38795
38797
38799
38801
38803
38805
38807
38809
38811
38813
38815
38817
38819
38821
38823
38825
38827
38829
38831
38833
38835
38837
38839
38841
38843
38845
38847
38849
38851
38853
38855
38857
38859
38861
38863
38865
38867
38869
38871
38873
38875
38877
38879
38881
38883
38885
38887
38889
38891
38893
38895
38897
38899
38901
38903
38905
38907
38909
38911
38913
38915
38917
38919
38921
38923
38925
38927
38929
39552
42158
2930
2905
2997
39448
43993
3.2
90.9
155.8
166.7
3.1
3.5
90.9
90.9
3.3
90.9
90.9
155.8
166.7
3.1
90.9
90.9
90.9
155.8
90.9
166.7
90.9
90.9
90.9
90.9
90.9
90.9
5.4
166.7
7.0
90.9
1.9
155.8
3.9
166.7
196.7
155.8
2.0
90.9
166.7
0.9
84.2
9.0
81.4
4.4
3.0
155.8
1.3
90.9
166.7
0.8
90.9
90.9
166.7
90.9
90.9
0.8
155.8
196.7
90.9
90.9
81.4
90.9
90.9
155.8
166.7
3.9
80.0
78.4
2.1
100.0
65.4
1.1
2.5
2.1
6.3
3.0
3.5
6.7
7.0
1.0
9.0
4.9
1.3
3.7
4.6
2.0
5.0
5.9
6.0
4.4
5.1
0.7
6.6
2.3
3.1
7.9
1.2
7.5
9.2
4.6
7.7
6.7
6.6
2.1
5.3
4.0
1.4
4.3
2.9
1.1
3.0
0.9
0.8
1.7
4.4
2.2
0.5
0.5
2.3
4.7
0.5
2.1
1.9
1.7
1.3
0.6
1.9
1.1
1.7
0.9
0.8
1.0
1.3
0.8
0.5
1.9
1.8
12.3
1.4
2.0
2.4
2.7
3.0
0.9
0.5
0.9
1.8
0.8
3.2
8.8
0.5
3.6
2.6
2.2
1.6
2.8
3.5
0.7
17.3
0.6
3.9
0.6
0.5
0.9
0.9
1.4
0.5
0.8
0.8
1.0
0.8
1.6
3.2
1.7
2.3
0.9
1.5
1.4
0.7
1.0
1.7
2.7
0.6
1.5
4.5
1.2
0.8
0.5
1.4
3.5
2.3
2.3
2.3
1.9
2.5
2.3
1.4
2.3
7.4
2.2
2.3
1.9
1.9
1.6
2.5
0.5
1.7
2.1
2.9
2.9
1.5
2.7
3.0
2.5
2.6
2.4
1.3
0.7
2.1
1.0
1.4
7.0
2.3
1.4
1.5
2.9
1.0
1.9
3.7
0.8
2.6
1.7
2.8
1.3
0.6
0.5
2.5
2.5
1.4
1.5
2.4
0.7
1.2
2.5
2.6
1.9
1.4
3.0
1.5
3.6
1.1
1.9
1.5
1.2
1.5
1.8
1.5
0.7
0.7
1.3
2.0
1.2
2.5
1.9
2.0
1.0
0.8
1.3
1.7
3.4
2.2
1.6
1.6
2.1
11.4
1.7
1.5
2.0
1.5
2.1
2.4
1.5
2.6
2.7
2.3
2.2
1.7
0.8
0.8
1.8
1.1
2.3
5.4
0.9
1.6
1.7
1.8
0.9
1.5
2.4
0.8
3.9
0.7
2.3
1.6
1.0
0.7
1.6
2.1
0.9
2.0
1.5
0.9
3.5
2.3
2.5
0.7
1.7
0.8
1.5
1.2
1.3
1.0
87
2.3
0.6
4.3
0.8
13.0
1.3
1.2
0.3
3.4
3.5
0.6
4.1
1.1
0.4
0.7
0.9
0.3
0.6
3.7
0.3
6.9
0.8
2.4
0.4
0.5
3.1
4.6
4.3
0.2
23.5
2.6
4.2
0.1
0.2
4.1
4.1
0.4
0.6
1.5
3.9
0.2
1.6
1.0
11.4
5.4
0.2
2.3
3.8
0.5
1.4
0.1
0.4
0.2
0.3
4.2
0.5
5.5
2.1
1.9
0.2
6.5
21.4
4.5
1.3
0.2
2.2
0.5
1.9
1.0
0.5
0.9
0.5
0.5
0.6
0.5
0.4
4.0
1.0
0.6
2.1
0.4
0.8
4.7
0.9
0.3
0.7
0.5
0.8
0.3
0.2
1.5
0.6
0.4
0.5
0.2
0.3
0.3
0.8
0.4
0.2
1.6
0.2
0.8
0.9
0.3
0.9
0.7
2.2
0.1
0.4
0.2
0.2
0.6
0.2
0.7
0.1
0.3
0.3
0.4
0.2
0.4
33.6
0.3
2.5
1.2
1.0
1.2
1.9
0.6
1.0
0.5
11.0
1.8
0.6
0.6
1.4
2.2
0.7
7.0
1.3
0.9
2.7
0.6
0.4
0.2
3.5
0.2
2.4
1.7
4.0
0.5
0.7
1.3
2.5
9.2
0.4
12.3
1.5
0.7
1.1
0.4
0.9
8.9
0.6
0.2
3.1
3.5
0.2
0.7
0.9
4.1
3.1
0.6
1.0
1.7
0.6
4.5
0.3
0.3
0.4
0.2
1.3
0.7
1.1
7.6
1.5
0.4
3.6
2.7
1.3
0.8
0.2
0.8
0.5
0.6
0.3
1.5
0.3
2.3
1.3
0.4
0.6
0.4
3.8
2.1
1.1
1.6
Appendix C. The Complete Datasets
i408
i409
i410
i411
i412
i413
i414
i415
i416
i417
i418
i419
i420
i421
i422
i423
i424
i425
i426
i427
i428
i429
i430
i431
i432
i433
i434
i435
i436
i437
i438
i439
i440
i441
i442
i443
i444
i445
i446
i447
i448
i449
i450
i451
i452
i453
i454
i455
i456
i457
i458
i459
i460
i461
i462
i463
i464
i465
i466
i467
i468
i469
i470
i471
i472
i473
i474
i475
i476
i477
i478
i479
i480
i481
i482
i483
i484
i485
i486
i487
i488
i489
i490
43994
43995
43996
43997
2986
43998
44000
44001
44002
44003
44004
44005
44006
44007
44008
44009
44010
44011
44012
44013
44014
44015
44016
44018
44019
44020
44021
44022
44023
44024
44025
44026
44027
44028
44029
44031
44033
44034
44035
44037
44037
44039
44039
44041
44041
44043
44043
44045
44045
44047
44047
44049
44049
44051
44051
44053
44053
44055
44055
44057
44057
44059
44059
44061
44061
44063
44063
44065
44065
44067
44067
44069
44069
44071
44071
44073
44073
44075
44075
44077
44077
44079
44079
4.1
2.0
3.6
1.7
65.4
2.5
200.0
200.0
200.0
200.0
200.0
200.0
200.0
200.0
200.0
7.5
200.0
200.0
7.2
1.7
1.4
2.4
1.0
200.0
4.3
5.9
5.4
3.0
200.0
200.0
200.0
0.9
0.8
200.0
89.0
200.0
200.0
200.0
200.0
2.1
1.4
0.5
1.1
0.7
0.8
1.4
1.2
0.5
1.1
0.8
1.1
2.5
4.7
0.5
1.1
1.4
1.0
2.7
2.0
0.8
0.9
8.5
2.0
3.1
1.2
1.1
1.3
1.0
1.3
7.3
4.7
1.5
1.1
0.3
1.1
4.2
2.3
3.8
1.3
1.5
2.5
4.6
3.4
4.8
1.6
1.5
70.0
6.0
6.4
7.7
8.4
7.7
7.7
6.6
2.4
1.2
2.9
1.2
4.9
2.8
3.4
7.1
5.0
8.4
7.3
22.0
0.9
0.8
6.5
70.0
5.0
6.1
5.5
6.8
0.7
0.8
0.4
1.0
0.8
0.9
1.4
1.1
0.4
0.8
1.3
1.0
2.1
5.1
2.2
1.4
0.4
0.8
1.0
0.7
3.0
0.8
2.0
1.7
2.4
1.6
0.5
1.0
0.5
0.7
2.7
1.9
0.4
0.9
0.3
0.7
2.6
2.2
2.3
1.3
0.9
1.5
2.3
2.5
2.0
0.6
0.9
11.0
2.2
2.2
3.9
6.7
2.9
3.0
4.2
1.7
1.5
1.8
1.2
2.5
2.0
2.4
11.0
7.1
5.4
2.3
9.0
1.1
1.4
1.5
20.0
1.6
1.9
1.8
2.3
1.3
1.2
0.9
1.0
0.4
0.8
1.1
1.0
0.3
0.8
0.8
1.1
0.8
2.6
2.6
1.1
0.3
1.2
0.9
1.2
1.9
0.7
1.1
2.1
1.7
1.1
1.2
1.1
0.9
1.0
5.9
1.7
0.8
0.8
0.3
1.2
1.0
0.9
1.4
0.9
2.4
1.3
2.2
2.3
2.2
1.1
1.0
47.0
2.0
2.6
3.2
1.8
2.3
2.4
1.8
1.1
0.7
1.3
0.7
2.1
1.2
1.4
17.0
10.0
3.8
2.4
4.1
0.5
0.5
2.4
28.0
2.5
2.3
2.1
2.3
2.1
1.3
1.2
1.0
0.7
1.0
2.0
0.9
1.7
1.0
0.7
1.1
0.6
1.4
3.1
1.0
0.3
0.5
2.1
1.0
1.2
0.9
1.4
1.2
1.9
1.0
1.4
1.2
2.5
0.9
1.7
1.1
0.3
0.9
1.2
1.3
1.7
1.2
1.5
1.1
1.1
1.6
1.4
1.8
1.7
1.4
0.7
23.0
1.7
1.6
2.4
2.3
1.8
1.9
1.6
1.4
0.9
1.5
0.9
1.7
1.2
1.4
11.0
8.1
3.3
2.3
3.3
0.7
0.8
1.7
28.0
1.6
1.9
1.8
1.9
0.5
1.2
0.1
1.0
2.0
0.8
0.6
0.9
0.1
1.0
0.5
1.0
0.6
1.3
1.1
0.9
0.3
1.2
1.9
1.1
1.7
1.0
1.6
1.2
2.2
0.9
3.0
1.1
0.4
0.9
1.7
1.8
0.9
0.9
0.2
1.3
88
1.7
49.0
3.0
6.1
88.7
6.4
62.0
60.0
15.0
250.0
250.0
2.3
9.2
0.4
0.4
9.2
0.7
0.6
41.0
250.0
250.0
250.0
250.0
2.1
0.4
0.6
1.2
0.1
0.3
250.0
250.0
0.8
250.0
0.4
0.7
0.3
0.5
2.1
1.0
0.8
1.9
3.5
0.9
1.6
0.9
0.2
2.1
1.1
0.8
1.2
1.6
3.7
2.0
0.4
1.0
1.2
0.3
1.0
3.3
4.1
0.5
2.8
1.5
1.8
1.8
0.1
4.4
1.5
4.0
1.3
4.5
1.2
5.3
1.2
110.6
19.7
1.9
0.5
6.4
2.9
1.6
2.8
2.8
1.0
45.0
0.8
15.0
15.0
3.7
46.0
84.0
0.9
9.6
0.4
0.5
5.1
0.6
0.5
15.0
700.0
700.0
132.0
700.0
0.1
76.0
4.3
0.4
1.1
0.1
0.4
700.0
700.0
700.0
1.0
700.0
0.4
0.5
0.2
0.3
0.9
1.1
1.2
0.8
2.8
0.8
1.0
0.7
0.7
1.2
2.4
0.9
0.4
0.8
2.4
7.1
1.8
4.5
2.7
0.4
2.0
1.8
1.0
0.4
1.0
0.6
3.3
1.0
1.2
0.2
2.5
0.8
3.5
0.9
1.7
0.8
2.5
0.9
283.3
101.4
1.2
0.5
1.7
1.4
3.6
400.0
5.3
85.0
79.4
77.0
400.0
400.0
280.0
400.0
400.0
1.0
26.0
0.5
0.7
26.0
0.7
0.6
55.0
400.0
400.0
400.0
400.0
0.3
400.0
85.0
0.4
1.3
0.1
0.7
400.0
29.0
33.0
2.0
400.0
0.6
0.5
0.3
0.5
1.5
1.0
2.1
1.1
4.0
1.1
1.4
0.8
1.8
1.2
2.4
0.9
0.3
0.7
4.0
2.9
52.3
625.0
3.4
0.5
1.3
2.4
1.4
0.3
1.8
0.8
5.3
1.4
2.5
0.1
2.7
0.8
2.1
0.9
1.3
0.8
2.1
0.8
45.0
893.0
8.4
0.4
1.0
0.7
Appendix C. The Complete Datasets
i491
i492
i493
i494
i495
i496
i497
i498
i499
i500
i501
i502
i503
i504
i505
i506
i507
i508
i509
i510
i511
i512
i513
i514
i515
i516
i517
i518
i519
i520
i521
i522
i523
i524
i525
i526
i527
i528
i529
i530
i531
i532
i533
i534
i535
i536
i537
i538
i539
i540
i541
i542
i543
i544
i545
i546
i547
i548
i549
i550
i551
i552
i553
i554
i555
i556
i557
i558
i559
i560
i561
i562
i563
i564
i565
i566
i567
i568
i569
i570
i571
i572
i573
44081
44081
44083
44083
44085
44085
44087
44087
44089
44089
44091
44091
44093
44093
44095
44095
44097
44097
44099
44099
44101
44101
44103
44103
44105
44105
44107
44107
44109
44109
44111
44111
45037
45039
45041
45043
45045
45126
45130
45132
45134
45136
45138
45140
45142
45144
45146
45148
45150
45152
45154
45156
45158
45160
45162
45164
45166
45168
45170
45172
45174
45176
45178
45180
45182
45184
45186
45188
45190
45192
45194
45196
45198
45200
45202
45204
45206
45208
45210
45212
45214
45216
45218
39.8
113.0
39.8
183.0
0.4
1.7
38.3
183.0
0.3
1.0
0.6
1.2
0.8
1.3
2.6
1.1
1.2
0.9
1.0
1.2
0.7
1.0
1.2
1.2
0.5
0.9
0.5
2.1
0.9
1.3
100.0
100.0
100.0
0.7
100.0
32.4
31.2
42.4
31.2
1.2
25.1
42.9
0.7
13.2
31.2
34.6
1.6
20.5
33.9
42.9
134.5
7.5
0.8
34.6
23.4
1.0
31.3
2.0
23.4
41.5
50.3
39.4
21.5
42.4
32.8
1.2
18.9
31.2
30.1
3.3
32.4
32.4
25.6
42.9
32.4
31.2
50.3
25.1
31.2
29.5
2.0
2.7
1.9
0.6
2.5
1.0
0.9
1.4
2.9
0.7
0.8
0.7
0.9
1.3
1.1
2.4
1.1
0.4
0.8
0.5
0.8
1.2
0.7
0.4
0.8
0.4
0.7
0.5
0.6
0.3
1.4
2.4
2.5
1.9
0.9
5.7
4.0
0.7
2.2
14.9
0.4
2.7
0.5
0.4
2.2
0.6
8.0
11.5
1.5
3.1
1.4
0.4
1.8
1.0
1.3
0.8
2.1
1.4
0.4
0.7
0.6
0.9
0.9
0.9
1.5
1.0
2.0
0.9
1.0
0.7
1.0
0.7
1.3
1.0
0.7
1.2
0.3
0.8
0.6
1.3
0.4
1.1
0.7
0.8
0.7
1.0
1.5
2.2
0.7
0.8
0.1
0.6
0.5
0.8
0.3
1.2
1.4
2.1
1.1
2.7
0.6
3.0
3.6
0.5
0.5
0.2
1.0
3.3
1.3
3.9
2.0
3.3
1.7
1.3
1.8
0.3
0.9
0.8
0.9
1.2
1.1
4.2
0.9
1.6
0.8
0.7
0.8
1.7
1.0
0.6
1.1
0.7
0.6
2.8
1.3
0.7
1.1
1.5
2.4
2.4
0.9
2.1
3.3
2.2
0.3
0.5
1.0
0.4
0.5
0.9
0.5
1.3
0.6
4.7
2.9
0.3
3.6
2.1
1.0
0.8
0.3
1.6
1.7
0.7
1.3
0.9
1.2
1.0
1.5
0.2
1.0
0.4
0.9
0.6
1.1
2.4
1.0
1.0
0.8
0.4
0.7
0.4
1.1
0.3
1.1
0.2
0.7
1.6
1.2
0.5
1.3
1.2
2.0
1.7
1.1
1.8
2.5
208.4
2.8
1.5
0.8
0.1
0.5
0.7
2.4
4.2
1.3
0.4
5.9
1.7
0.7
144.7
0.2
0.7
89
2.5
1.4
0.5
0.6
0.7
1.0
2.7
18.5
160.3
94.3
2.2
1.3
1.4
0.4
2.6
2.1
4.7
0.6
9.5
0.8
0.7
0.4
3.8
2.4
0.4
0.3
1.4
0.2
0.6
111.0
1.1
100.0
2.4
42.0
3.8
80.5
685.3
144.7
482.3
57.9
4.3
17.2
384.4
276.0
0.2
390.6
63.2
81.8
0.9
107.1
191.3
321.4
165.9
0.9
42.7
0.4
2.6
3.7
0.8
0.7
0.7
1.1
0.6
2.9
5.2
322.8
24.4
0.6
0.9
1.3
0.5
1.3
1.4
2.7
0.7
2.0
0.6
0.6
0.6
2.2
1.8
1.0
0.3
0.6
1.2
0.4
1.0
31.0
0.7
100.0
1.1
87.0
1.4
2.8
42.7
11.2
0.2
1.7
7.1
0.5
3.1
0.5
0.5
88.3
121.9
1.2
172.4
541.6
209.3
100.9
0.6
1.7
5.1
1.4
1.2
0.9
1.4
0.5
0.5
7.8
75.3
45.9
0.4
1.1
1.0
0.4
1.0
1.9
1.7
0.5
4.0
0.5
0.9
0.6
7.5
3.5
0.5
0.2
2.0
0.2
0.6
110.0
0.8
100.0
1.5
230.0
0.9
1.1
1.1
1.1
0.4
57.9
1.3
1.5
3.3
1.2
0.5
103.6
115.3
1.0
0.6
0.4
0.8
46.6
Appendix C. The Complete Datasets
i574
i575
i576
i577
i578
i579
i580
i581
i582
i583
i584
i585
i586
i587
i588
i589
i590
i591
i592
i593
i594
i595
i596
i597
i598
i599
i600
i601
i602
i603
i604
i605
i606
i607
i608
i609
i610
i611
i612
i613
i614
i615
i616
i617
i618
i619
i620
i621
i622
i623
i624
i625
i626
i627
i628
i629
i630
i631
i632
i633
i634
i635
i636
i637
i638
i639
i640
i641
i642
i643
i644
i645
i646
i647
i648
i649
i650
i651
i652
i653
i654
i655
i656
45220
45222
45224
45226
45228
45230
45232
45234
45236
45238
45240
45242
45244
45246
45248
45250
45252
45254
45256
45258
45260
45262
45264
45266
45268
45270
45272
45274
45276
45278
45280
45282
45284
45286
45288
45290
45292
45294
45296
45298
45300
45302
45304
45306
45308
45310
45312
45314
45316
45318
45320
45322
45324
45326
45328
45330
45332
45334
45336
45338
45340
45342
45344
45346
45348
45350
45352
45354
45356
45358
45360
45362
45364
45366
45368
45370
45372
45374
45376
45378
45380
45382
45384
1.8
0.2
4.4
1.4
32.4
30.4
4.1
0.3
32.4
3.5
55.0
68.5
32.4
34.6
38.6
88.7
1.3
1.0
44.1
42.9
2.3
0.2
31.2
31.2
32.4
1.2
42.9
42.9
30.4
30.1
31.3
30.1
2.0
32.4
4.8
38.4
31.2
55.4
42.4
37.1
2.2
2.6
1.8
41.5
2.1
2.8
47.7
50.3
30.4
2.3
31.2
31.2
44.1
0.5
32.4
30.1
31.2
32.4
0.7
1.2
25.1
31.2
47.7
2.7
38.4
31.2
4.2
38.4
3.3
42.9
32.4
37.5
30.4
0.3
25.1
65.6
47.7
42.9
31.2
47.7
1.3
0.3
0.9
1.3
0.8
2.1
0.2
1.6
0.8
0.9
1.6
1.1
0.7
1.3
3.3
1.0
1.1
2.2
0.6
0.6
0.3
0.3
1.2
2.1
1.6
5.5
0.7
0.2
3.8
38.1
0.4
0.3
1.4
0.5
3.1
0.5
1.2
0.6
1.7
0.6
2.0
1.2
2.1
1.2
0.7
0.7
1.7
1.4
0.5
0.5
0.7
1.2
0.4
4.9
1.4
2.3
3.5
0.2
1.1
0.5
0.6
1.0
0.5
0.8
0.7
0.6
1.6
0.8
0.6
0.5
0.7
1.1
0.1
1.5
1.3
0.3
1.1
0.1
0.4
0.7
1.2
0.4
2.7
7.2
0.1
1.0
1.1
1.0
1.1
1.0
1.3
1.6
0.4
0.7
0.5
1.3
1.2
1.3
0.7
1.0
1.2
1.1
0.9
1.0
0.2
0.7
0.7
1.0
1.0
0.8
2.0
90
173.6
7.4
0.4
171.3
161.7
59.4
384.4
6.3
59.1
230.0
9.4
0.9
0.6
0.9
19.8
482.3
0.5
1.1
170.9
0.7
1.6
0.8
18.0
31.6
144.7
2.1
208.4
321.4
365.5
455.5
176.2
0.4
45.2
0.2
91.7
7.1
92.7
2.9
0.6
415.0
57.9
398.4
0.9
421.2
62.7
48.3
98.8
4.5
0.5
1.0
0.6
209.3
1.9
0.5
90.4
2.6
0.8
103.6
6.4
1.2
0.6
121.9
384.4
80.5
1.1
42.7
0.2
209.3
0.7
0.3
0.5
57.9
0.4
0.7
1.3
209.3
4.6
1.8
0.4
77.8
49.5
0.6
1.4
2.6
1.2
49.5
1.3
2.2
1.0
1.6
0.9
1.1
1.1
1.5
32.1
1.9
32.1
0.5
1.9
0.8
1.9
2.5
2.5
4.1
0.9
0.1
3.1
4.1
Appendix C. The Complete Datasets
i657
i658
i659
i660
i661
i662
i663
i664
i665
i666
i667
i668
i669
i670
i671
i672
i673
i674
i675
i676
i677
i678
i679
i680
i681
i682
i683
i684
i685
i686
i687
i688
i689
i690
i691
i692
i693
i694
i695
i696
i697
i698
i699
i700
i701
i702
i703
i704
i705
i706
i707
i708
i709
i710
i711
i712
i713
i714
i715
i716
i717
i718
i719
i720
i721
i722
i723
i724
i725
i726
i727
i728
i729
i730
i731
i732
i733
i734
i735
i736
i737
i738
i739
45386
45388
45390
45392
45394
45396
45398
45400
45402
45404
45406
45408
45410
45414
45416
45418
45420
45422
45424
45426
45428
45430
45434
45436
45438
45440
46680
46682
46684
46686
46688
46690
46692
46694
46696
46698
46700
46702
46704
46706
46708
46710
46715
46717
46718
46719
46720
46721
46726
46728
46729
46730
46731
46732
46736
46737
46739
46740
46743
46744
46746
46748
46749
46757
46762
46763
46764
46765
46766
46771
46773
46774
46775
46778
46786
46787
46788
46789
46794
46799
46803
46804
46805
32.4
1.0
0.5
0.5
46.3
5.7
31.3
32.4
3.8
41.5
32.4
8.4
2.6
32.4
32.4
30.1
42.9
32.4
1.3
32.4
62.1
0.8
32.4
32.4
17.4
1.0
2.3
0.7
174.8
174.8
0.9
2.4
15.3
174.8
174.8
174.8
174.8
10.6
1.2
1.3
49.4
1.7
0.7
34.4
0.5
0.5
34.4
0.8
0.1
38.1
0.1
30.2
0.8
19.6
0.4
27.9
0.4
0.9
54.9
45.0
45.0
1.2
53.1
1.1
1.1
0.1
30.2
7.2
12.6
1.0
1.7
34.4
0.2
74.0
30.2
0.5
0.4
24.0
1.7
0.6
1.2
0.4
6.9
0.9
1.4
3.7
5.4
0.5
1.2
4.5
31.3
0.7
1.3
0.7
4.9
4.0
0.7
2.3
4.0
7.7
13.9
5.1
8.4
5.8
1.1
1.2
6.9
0.5
0.9
2.8
0.1
0.9
1.8
0.4
0.1
1.1
0.2
1.8
1.1
1.0
2.0
0.7
0.3
1.3
0.8
4.2
3.3
1.0
2.7
1.0
0.8
0.4
5.8
2.8
3.8
0.8
0.7
2.6
0.5
0.5
2.1
0.8
0.3
1.9
0.5
0.8
1.4
0.3
0.8
0.7
3.7
1.8
0.2
0.4
2.1
2.2
1.0
15.2
1.1
1.7
1.0
1.8
1.4
0.8
1.7
2.0
1.1
3.8
1.4
2.5
6.1
0.7
1.4
2.6
1.0
1.4
0.8
0.7
1.6
0.8
0.3
0.1
0.2
0.2
0.6
1.0
0.3
1.2
0.2
0.6
1.3
0.2
1.8
0.8
1.0
0.8
1.6
0.8
0.2
3.5
2.7
5.6
1.0
0.6
0.9
0.5
0.1
1.4
1.2
0.5
0.7
0.6
0.5
0.9
0.6
3.1
0.5
3.3
1.3
0.7
0.1
0.1
0.8
0.4
3.3
1.1
1.6
0.9
2.4
2.2
0.9
1.1
1.8
2.0
4.3
2.1
2.9
3.2
1.8
1.0
7.3
1.1
1.0
5.1
0.7
0.9
1.1
0.6
0.7
0.1
1.2
1.2
0.5
1.1
0.9
0.3
0.9
0.3
5.1
1.8
1.1
1.6
1.1
0.8
0.7
3.4
0.8
1.3
0.8
1.5
1.1
0.2
0.5
1.4
0.6
0.8
1.3
1.0
0.5
0.4
0.5
1.8
0.7
1.8
1.9
0.3
1.0
0.4
4.0
1.1
1.3
0.8
1.6
1.2
0.6
0.8
1.3
1.3
2.9
1.5
1.6
2.4
0.9
1.0
2.4
0.8
0.9
1.7
0.3
0.5
1.1
0.4
1.1
0.6
0.2
1.2
0.9
0.3
0.7
0.7
0.8
1.0
0.3
4.5
1.9
0.2
3.8
0.8
1.1
0.4
3.7
0.9
2.1
1.0
1.1
1.2
0.4
1.0
1.4
0.6
0.4
0.8
1.6
91
1.0
114.2
10.8
265.2
114.1
456.2
685.3
0.4
357.1
73.1
1.4
357.1
502.4
1.4
0.3
2.8
36.5
321.4
0.5
5.8
10.1
2.1
24.0
107.5
0.2
0.1
2.8
1.3
0.1
0.2
0.6
1.6
0.2
64.0
-
0.8
16.7
58.7
0.8
7.0
0.4
0.3
621.3
0.9
0.3
2.0
0.6
1.5
42.7
0.5
32.1
365.0
1.3
20.7
31.6
0.2
0.2
1.5
31.6
0.5
1.5
0.2
0.4
1.0
0.3
4.0
-
2.5
1.7
49.5
0.3
44.5
4.0
0.6
0.7
6.0
2.9
0.3
820.2
1.6
48.6
94.2
0.5
0.3
1.4
820.2
0.8
20.3
0.4
0.8
1.1
1.3
381.9
1.3
1.2
2.1
2.7
0.7
0.8
0.9
2.4
4.1
0.4
0.6
0.8
0.4
3.2
3.9
1.2
4.4
0.8
4.5
0.7
0.5
2.4
3.4
1.4
2.1
0.5
1.2
1.4
4.3
4.6
2.0
0.4
4.4
1.6
0.9
0.9
2.0
Appendix C. The Complete Datasets
i740
i741
i742
i743
i744
i745
i746
i747
i748
i749
i750
i751
i752
i753
i754
i755
i756
i757
i758
i759
i760
i761
i762
i763
i764
i765
i766
i767
i768
i769
i770
i771
i772
i773
i774
i775
i776
i777
i778
i779
i780
i781
i782
i783
i784
i785
i786
i787
i788
i789
i790
i791
i792
i793
i794
i795
i796
i797
i798
i799
i800
i801
i802
i803
i804
i805
i806
i807
i808
i809
i810
i811
i812
i813
i814
i815
i816
i817
i818
i819
i820
i821
i822
46811
46812
46813
46819
46822
46823
46824
46826
46827
46828
46829
46833
46834
46835
46842
46844
46845
46846
46847
46848
46849
46850
46852
46854
46858
3819
3900
3831
3827
3894
3884
3811
3817
3904
3959
40016
3929
3927
3964
3838
40561
3967
3892
3890
39476
3886
3925
3923
4471
3919
3849
3851
3855
3853
41683
41620
3868
3866
3870
40564
3896
3917
50506
50508
50510
50512
50514
50598
52823
52825
52827
52829
52831
52833
52835
52837
52839
52841
52892
52894
52898
52906
52908
0.8
0.5
10.4
1.5
0.7
46.5
69.9
63.9
0.3
49.6
0.3
49.1
0.9
45.0
0.5
1.8
1.2
0.8
0.5
27.9
0.1
0.4
30.2
1.6
65.4
65.4
66.8
1.4
65.4
4.0
78.4
78.4
78.4
65.4
69.0
44.0
78.4
78.4
78.4
69.0
65.4
78.4
44.0
78.4
68.8
68.8
71.8
78.4
4.0
78.4
33.6
65.4
1.0
200.0
1.2
200.0
200.0
200.0
7.0
7.8
2.8
200.0
200.0
4.2
200.0
1.0
143.0
79.2
123.2
0.9
0.4
0.5
1.3
0.5
1.2
4.4
0.9
0.3
1.5
0.6
1.8
1.0
2.8
0.5
1.5
1.1
0.7
0.4
1.7
0.3
0.4
1.4
1.1
2.4
2.8
3.7
1.9
3.9
3.9
12.0
12.3
25.3
2.8
3.5
7.9
19.8
2.6
1.9
12.7
5.3
4.0
4.6
5.7
3.7
5.9
9.5
8.2
2.6
0.9
6.1
1.0
3.8
0.8
23.9
15.1
3.2
4.1
6.6
1.6
6.1
2.7
2.3
2.5
1.5
25.0
6.6
6.4
0.9
0.4
0.1
3.2
0.3
0.3
0.8
0.5
0.6
0.2
0.5
0.6
0.3
1.1
0.6
0.6
1.6
1.0
0.5
0.8
0.8
0.3
0.4
0.5
0.8
3.0
0.6
2.9
0.8
1.0
7.9
2.0
6.4
1.2
0.9
4.7
8.0
1.4
1.8
3.3
0.3
2.2
4.8
3.7
0.7
6.1
4.0
6.0
1.5
1.7
0.6
1.3
0.9
0.8
1.0
9.0
10.9
1.0
3.3
6.9
3.2
1.7
1.5
2.3
1.3
1.3
11.0
1.3
2.7
0.5
0.8
0.9
1.1
0.8
3.6
2.8
0.1
1.0
0.2
1.0
0.9
0.5
0.7
1.2
0.4
1.0
1.7
0.9
1.0
0.8
0.2
0.7
1.1
1.9
2.5
2.4
2.7
0.4
3.6
1.5
3.3
13.0
4.6
2.1
1.7
8.8
1.4
5.4
5.7
1.3
2.6
1.5
3.6
2.8
2.9
2.4
1.9
1.5
3.0
1.5
3.1
0.8
2.2
0.9
5.3
45.2
1.4
1.4
2.2
0.8
1.8
1.5
0.9
1.3
1.0
35.0
2.6
2.3
2.4
0.7
0.7
1.1
0.4
1.5
1.6
0.1
1.6
0.2
1.0
0.7
0.6
1.6
1.0
0.3
1.1
0.8
0.8
0.6
1.2
0.1
0.4
0.8
1.2
0.6
1.7
1.5
0.6
2.4
3.5
3.7
2.9
27.3
1.7
2.2
9.6
6.8
2.2
7.0
4.2
0.7
1.9
6.4
2.6
3.3
2.9
2.5
6.9
1.8
1.2
1.0
2.3
1.1
1.5
1.0
4.4
19.7
1.2
1.7
2.3
0.8
1.6
1.3
1.1
1.2
1.2
15.0
1.6
2.6
92
1.3
0.3
1.4
0.9
0.1
3.6
1.2
1.1
2.8
0.5
2.8
0.9
1.2
0.6
133.0
0.6
0.8
0.2
1.1
59.0
0.6
1.0
1.8
0.4
1.1
1.4
14.8
8.0
0.6
1.1
0.4
0.8
1.0
5.6
0.8
2.7
1.3
0.4
3.0
0.3
15.3
0.1
1.0
30.8
63.1
0.3
0.1
0.2
103.4
2.0
0.1
0.3
1.1
0.6
17.7
0.7
0.5
1.2
0.5
10.7
2.8
1.2
2.2
0.6
0.7
0.8
126.0
0.7
0.9
0.5
1.2
363.4
0.8
1.0
31.5
0.5
1.3
2.0
2.2
2.6
0.6
0.6
1.7
0.8
0.6
2.9
270.0
270.0
0.7
1.5
0.9
270.0
0.3
1.3
0.3
201.3
0.1
10.0
19.5
30.4
0.3
0.2
0.3
273.5
1.0
0.2
0.5
2.1
0.7
0.6
5.3
1.3
2.1
3.1
0.7
2.0
1.2
1.4
0.9
0.6
2.0
0.8
0.8
1.4
2.0
3.2
10.4
1.0
0.8
2.6
4.1
1.1
0.8
105.9
0.5
0.7
2.4
2.6
107.8
8.4
1.7
1.3
0.7
0.5
0.5
107.8
1.7
3.3
0.5
2.1
109.7
3.1
0.4
107.8
0.5
2.5
3.6
3.0
2.7
0.5
0.3
6.0
0.9
0.6
0.9
600.0
600.0
0.9
2.3
0.8
600.0
0.7
1.3
0.5
400.0
0.3
400.0
42.6
51.8
0.5
0.1
0.4
22.1
1.0
0.2
0.6
Appendix C. The Complete Datasets
i823
i824
i825
i826
i827
i828
i829
i830
i831
i832
i833
i834
i835
i836
i837
i838
i839
i840
i841
i842
i843
i844
i845
i846
i847
i848
i849
i850
i851
i852
i853
i854
i855
i856
i857
i858
i859
i860
i861
i862
i863
i864
i865
i866
i867
i868
i869
i870
i871
i872
i873
i874
i875
i876
i877
i878
i879
i880
i881
i882
i883
i884
i885
i886
i887
i888
i889
i890
i891
i892
i893
i894
i895
i896
i897
i898
i899
i900
i901
i902
i903
i904
i905
52910
52912
52914
52916
53157
53158
53159
53160
53162
53163
53166
53167
53168
53174
53175
53589
53590
53591
53592
53593
53594
53595
53852
53856
53861
53865
53867
53880
53884
53888
53890
53908
53914
53921
53923
53931
53933
53950
54190
54191
54192
54269
54270
54271
54272
54273
54274
54275
54276
54277
54278
54279
54280
54286
54287
54288
54289
54290
54291
54292
54357
54358
54359
54360
54361
54362
54363
54403
54404
54405
54601
54602
54603
54604
54605
54617
54618
54619
54620
54621
54622
54623
54624
109.1
111.6
121.5
87.6
0.5
20.3
0.7
8.0
0.9
15.2
1.5
13.3
1.8
2.4
5.9
0.3
0.3
0.9
1.6
0.7
0.6
0.7
0.5
1.1
0.6
0.4
0.7
1.4
0.7
0.7
0.6
9.7
9.0
6.3
1.3
1.6
2.7
3.1
22.0
8.0
6.6
32.0
14.0
15.0
62.0
62.0
120.0
120.0
8.2
0.8
200.0
200.0
64.0
10.0
-
7.2
5.5
5.8
4.7
0.5
5.4
0.3
1.5
1.0
6.9
0.2
2.6
0.9
0.5
4.6
0.7
0.4
1.8
1.9
1.3
0.7
0.8
1.3
0.7
1.0
0.8
2.1
1.5
0.4
1.3
0.6
2.3
2.4
1.6
2.4
1.6
2.0
0.9
4.8
5.0
3.2
14.0
7.4
7.5
6.3
10.0
5.5
4.5
3.6
0.6
4.6
3.7
11.0
7.3
-
5.3
2.1
2.0
1.1
1.1
2.3
1.2
0.6
2.1
1.0
2.4
2.6
1.7
0.5
5.5
1.4
0.5
1.1
1.9
1.2
0.7
0.5
1.3
0.9
0.8
0.9
3.1
2.2
0.6
1.2
0.5
1.6
1.8
1.4
2.3
1.6
1.4
1.6
3.2
5.1
3.0
5.7
4.9
7.5
3.0
4.4
1.3
1.5
1.6
0.8
1.0
1.3
20.0
7.1
-
3.8
2.6
2.2
2.0
1.2
0.8
1.4
4.0
1.3
0.8
1.1
1.0
0.4
2.6
1.1
1.2
2.4
0.8
0.7
0.7
2.9
1.2
1.4
0.8
2.7
1.9
1.8
1.8
3.1
2.5
3.4
3.2
1.9
2.2
1.9
0.8
2.3
2.2
28.0
2.2
-
2.7
1.6
2.0
1.6
0.2
2.2
1.2
4.0
1.4
4.3
1.2
2.3
1.8
0.7
10.1
1.3
0.7
0.7
4.2
4.0
1.1
1.3
1.0
0.3
1.8
1.2
1.1
6.5
1.2
1.0
1.0
1.6
1.6
1.3
1.8
0.8
1.5
0.9
2.1
1.4
1.3
4.8
2.2
2.8
4.3
2.2
2.5
1.5
1.5
0.9
1.9
1.4
22.0
2.4
-
93
0.4
27.4
0.7
1.4
56.0
36.0
42.0
100.0
100.0
100.0
100.0
2.2
2.9
2.0
3.5
3.6
2.5
2.9
5.3
1.2
6.6
1.3
3.9
4.3
2.1
9.3
2.9
1.0
1.0
1.0
2.0
5.0
21.0
35.0
1.2
14.0
10.8
12.9
9.5
2.2
2.9
180.0
180.0
0.6
0.3
50.0
11.0
1.5
9.2
-
0.2
1.7
0.7
0.7
15.0
1.1
1.1
23.0
43.0
3.3
84.0
1.2
0.8
1.1
3.7
3.9
0.8
1.3
1.8
0.8
0.6
2.0
2.1
3.9
0.6
2.9
1.2
8.0
47.0
1.0
11.0
2.0
1.0
3.0
1.1
7.5
6.5
1.3
1.0
1.3
1.5
180.0
180.0
3.5
0.4
1.0
10.0
0.9
5.6
-
0.2
483.4
0.9
0.8
60.0
60.0
3.0
60.0
60.0
60.0
60.0
1.0
1.1
3.1
0.7
3.2
1.8
4.3
11.9
1.5
3.3
0.6
3.3
1.8
0.5
4.4
3.5
75.0
300.0
1.0
19.0
64.0
744.0
161.0
2.4
29.1
35.0
8.5
1.9
1.6
98.0
78.0
180.0
0.6
123.0
24.0
1.1
29.0
-
Appendix C. The Complete Datasets
i906
i907
i908
i909
i910
i911
i912
i913
i914
i915
i916
i917
i918
i919
i920
i921
i922
i923
i924
i925
i926
i927
i928
i929
i930
i931
i932
i933
i934
i935
i936
i937
i938
i939
i940
i941
i942
i943
i944
i945
i946
i947
i948
i949
i950
i951
i952
i953
i954
i955
i956
i957
i958
i959
i960
i961
i962
i963
i964
i965
i966
i967
i968
i969
i970
i971
i972
i973
i974
i975
i976
i977
i978
i979
i980
i981
i982
i983
i984
i985
i986
i987
i988
54625
54626
54627
54641
4433
4483
41178
4388
4479
4539
4663
4697
5222
63
64
65
71
5465
5682
6025
6029
6139
6138
6137
2378
6149
6148
6147
6146
6145
6144
6143
6142
6141
6140
6312
6311
6313
6314
6315
6316
6317
6318
6319
6320
6321
6412
6413
6414
6415
6416
6417
6418
6419
6420
6421
6422
6423
6424
6425
6426
6427
6428
6429
6430
6431
6432
6461
6462
6463
6464
6465
6466
6467
6468
6469
6470
6471
6472
6474
6475
6476
6477
100.0
3.0
100.0
0.8
200.0
100.0
100.0
200.0
109.1
7.3
2.4
5.2
0.6
70.0
12.0
70.0
70.0
70.0
1.3
70.0
70.0
1.0
2000.0
-
8.8
2.2
8.1
0.9
3.7
7.1
5.9
5.7
7.0
18.0
3.8
3.0
2.1
2.9
3.6
6.9
10.0
8.5
4.0
6.8
11.0
1.0
-
1.9
2.1
1.8
0.7
2.0
1.3
1.3
1.2
1.2
0.8
1.0
2.7
16.0
1.0
2.0
1.0
2.0
1.0
1.0
1.0
2.0
1.0
2.0
2.0
1.0
1.7
7.6
3.3
1.1
2.2
-
2.9
1.0
2.8
0.4
1.8
2.8
1.9
3.1
3.0
1.0
1.0
2.3
4.1
4.0
1.0
15.0
4.0
2.0
1.0
3.0
4.0
8.0
7.0
5.0
2.0
3.0
4.0
1.0
18.0
19.0
15.0
15.0
30.0
1.0
3.0
4.0
3.0
6.0
4.0
2.0
4.0
4.0
2.1
1.3
1.7
1.3
1.3
1.9
1.7
2.0
2.1
4.8
7.5
1.3
8.6
1.0
3.0
1.0
2.0
4.0
5.0
6.0
4.0
4.0
5.0
5.0
8.0
1.0
0.9
6.4
7.1
7.1
1.1
0.4
1.4
1.0
1.0
6.0
5.5
1.0
2.8
2.4
3.9
2.1
1.9
2.4
1.6
94
1.1
0.2
21.2
0.5
3.1
0.8
0.5
50.6
11.0
0.2
250.0
-
0.7
0.3
5.8
0.5
119.8
0.5
0.3
17.4
154.0
0.2
700.0
-
0.7
0.4
563.5
0.6
98.4
0.5
0.3
42.0
400.0
0.3
79.0
1.0
1.0
240.0
80.0
1.0
60.0
1.0
200.0
1.0
112.0
100.0
1.0
1.0
150.0
126.0
1.0
1.0
80.0
1.0
46.0
Appendix C. The Complete Datasets
i989
i990
i991
i992
i993
i994
i995
i996
i997
i998
i999
i1000
i1001
i1002
i1003
i1004
i1005
i1006
i1007
i1008
i1009
i1010
i1011
i1012
i1013
i1014
i1015
i1016
i1017
i1018
6519
6569
40816
41011
6796
6820
6859
6876
7297
7298
7299
7305
7304
7303
7302
7301
7300
7311
7314
7318
7328
7344
7347
7348
7350
40017
7360
7364
7376
7377
200.0
3.9
87.6
0.8
7.1
4.8
1.4
100.0
3.6
100.0
100.0
100.0
100.0
100.0
100.0
100.0
100.0
2.7
5.3
5.5
0.8
7.0
1.6
2.3
9.6
1.7
18.5
9.9
7.7
4.6
20.0
6.7
8.5
29.3
0.8
4.2
2.0
1.0
4.4
1.8
2.0
3.3
1.5
11.2
3.9
2.3
1.0
11.3
1.6
1.7
10.1
1.6
1.7
1.9
0.8
1.7
1.1
0.8
3.1
1.0
5.4
4.0
1.4
4.0
1.3
2.0
3.6
1.0
0.9
6.3
4.0
2.3
2.2
65.4
2.2
2.6
9.0
1.3
1.8
1.5
0.9
2.4
1.4
1.0
2.1
1.0
1.1
1.0
1.0
1.0
1.3
1.0
1.5
0.8
1.1
3.5
2.1
1.8
1.6
20.1
1.4
1.7
4.1
95
34.0
176.0
0.9
2.7
0.1
1.9
0.9
0.3
5.4
0.7
0.7
1.0
1.4
0.6
0.4
-
18.0
45.6
13.2
1.9
0.2
4.3
0.4
0.3
24.0
0.5
0.4
1.0
0.8
0.6
0.4
-
25.0
480.2
416.5
3.2
0.3
400.0
0.6
0.5
400.0
0.6
0.3
1.1
0.6
0.9
0.6
-
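As a rough illustration of how records of this shape might be consumed, the sketch below loads one drug's fold-change column and binarises it into resistant/susceptible class labels. It is a minimal sketch only: the CSV export, the file name appendix_c.csv, the column names isolate and NVP_Fold, and the single 4-fold cutoff are all illustrative assumptions, not the thesis's actual pre-processing software (which is listed in Appendix A), and published resistance cutoffs in fact differ from drug to drug.

import csv

CUTOFF = 4.0  # illustrative resistance cutoff; real cutoffs vary by drug

def load_fold_changes(path, drug_column):
    """Yield (isolate, fold-change) pairs for one drug, skipping '-' entries."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            value = row[drug_column]
            if value != "-":  # '-' marks a missing phenotype measurement
                yield row["isolate"], float(value)

def label(fold, cutoff=CUTOFF):
    """Binarise a fold-change into the two-class target used for training."""
    return "resistant" if fold >= cutoff else "susceptible"

if __name__ == "__main__":
    # Hypothetical file and column names, for illustration only.
    for isolate, fold in load_fold_changes("appendix_c.csv", "NVP_Fold"):
        print(isolate, fold, label(fold))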
Appendix D
Original Decision Trees
Given below are the decision trees for ZDV (a), ddC (b), ddI (c), d4T (d), 3TC (e), ABC (f),
NVP (g), DLV (h), EFV (i), SQV (j), IDV (k), NFV (m) and APV (n), as presented in [1].
[Image: decision trees (a) to (n), taken from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=123057&rendertype=figure&id=F3]
Note that tree (a) is unable to offer a classification because, apart from a single leaf, no class labels are attached to its leaves.
Bibliography
[1] Niko Beerenwinkel, Barbara Schmidt, Hauke Walter et al. Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype. PNAS 2002; 99: 8271–8276.
[2] Dechao Wang, Brendan Larder. Enhanced prediction of lopinavir resistance from genotype by use of artificial neural networks. The Journal of Infectious Diseases 2003; 188: 653–660.
[3] Robert W. Shafer, MD. Genotypic testing for HIV-1 drug resistance. HIV InSite Knowledge Base Chapter, 2004.
[4] Hirsch MS, Conway B, D'Aquila RT et al. Antiretroviral drug resistance testing in adults with HIV infection: implications for clinical management. International AIDS Society – USA Panel. JAMA 1998; 279: 1984–1991.
[5] Carpenter CCJ, Cooper DA, Fischl MA et al. Antiretroviral therapy in adults: updated recommendations. International AIDS Society – USA Panel. JAMA 2000; 283: 381–390.
[6] Hirsch MS, Brun-Vezinet F, D'Aquila RT et al. Antiretroviral drug resistance testing in adult HIV-1 infection: recommendations. International AIDS Society – USA Panel. JAMA 2000; 283: 2417–2426.
[7] P. Richard Harrigan et al. Clinical utility of testing human immunodeficiency virus for drug resistance. Clinical Infectious Diseases 2000; 30(Suppl 2): S177–22.
[8] Tisdale M, Kemp SD, Parry NR, Larder BA. Proc. Natl. Acad. Sci. USA 1993; 90: 5653–5656.
[9] Anne D. Sevin, Victor DeGruttola, Monique Nijhuis et al. Methods for investigation of the relationship between drug susceptibility phenotype and human immunodeficiency virus type 1 genotype with applications to AIDS clinical trials. The Journal of Infectious Diseases 2000; 182: 59–67.
[10] Barbara Schmidt, Hauke Walter, Brigitte Moschik et al. Simple algorithm derived from a geno-/phenotypic database to predict HIV-1 protease inhibitor resistance. AIDS 2000; 14: 1731–1738.
[11] Schinazi RF, Larder BA, Mellors JW. Mutations in retroviral genes associated with drug resistance. Int Antiviral News 1999; 7: 46–69.
[12] Tom M. Mitchell. Machine Learning. McGraw-Hill International Editions, 1997. ISBN 0-07-115467-1.
[13] Mingers J. An empirical comparison of pruning methods for decision tree induction. Machine Learning 1989; 4: 227–243.
[14] Stanford HIV Resistance Database, Protease Notes. http://hivdb.stanford.edu/cgi-bin/PIResiNote.cgi
[15] Stanford HIV Resistance Database, Reverse Transcriptase Notes. http://hivdb.stanford.edu/cgi-bin/NRTIResiNote.cgi & http://hivdb.stanford.edu/cgi-bin/NNRTIResiNote.cgi
[16] Cybenko G. Continuous valued neural networks with two hidden layers are sufficient. Department of Computer Science, Tufts University, Medford, MA, 1988.
[17] Graham et al. 8th Conference on Retroviruses and Opportunistic Infections 2001, abstract 524.