BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 10 2005, pages 2191–2199
doi:10.1093/bioinformatics/bti368
Genome analysis
Classification of bacterial species from proteomic data using
combinatorial approaches incorporating artificial neural
networks, cluster analysis and principal components
analysis
L. Lancashire1,2 , O. Schmid3 , H. Shah3 and G. Ball1,2,∗
1 The Nottingham Trent University, 2 Loreus Ltd., School of Biomedical and Natural Sciences, Clifton Campus,
Clifton Lane, Nottingham, NG11 8NS, UK and 3 Molecular Identification Services Unit, Central Public Health
Laboratory, Health Protection Agency, 61 Colindale Avenue, London, NW9 5HT, UK
Received October 29, 2004; revised on February 23, 2005; accepted on February 28, 2005
Advance Access publication March 3, 2005
ABSTRACT
Motivation: Robust computer algorithms are required to interpret the
vast amounts of proteomic data currently being produced and to generate generalized models which are applicable to ‘real world’ scenarios.
One such scenario is the classification of bacterial species. These
vary immensely, some remaining remarkably stable whereas others
are extremely labile showing rapid mutation and change. Such variation makes clinical diagnosis difficult and pathogens may be easily
misidentified.
Results: We applied artificial neural networks (Neuroshell 2) in parallel with cluster analysis and principal components analysis to surface
enhanced laser desorption/ionization (SELDI)-TOF mass spectrometry data with the aim of accurately identifying the bacterium
Neisseria meningitidis from species within this genus and other closely
related taxa. A subset of ions was identified that allowed for the
consistent identification of species, classifying >97% of a separate
validation subset of samples into their respective groups.
Availability: Neuroshell 2 is commercially available from Ward
Systems.
Contact: [email protected]
INTRODUCTION
The increase in the current application of proteomic technologies
in biological research has led to the generation of vast amounts of
data, the majority of which are seemingly impossible to analyze
by traditional means, especially with respect to biomarker identification. For example, a typical mass spectrometry experiment
utilizing Matrix Assisted Laser Desorption/Ionisation Time of Flight
Mass Spectrometry (MALDI-TOF-MS) with ProteinChip technology (SELDI: surface enhanced laser desorption/ionization) yields
in excess of 34 000 individual data points per sample, per analysis.
Therefore in studies aimed at analyzing datasets containing large
sample sizes and high dimensionality, more advanced computer
algorithms must be utilized if biomarkers that explain the variation
within the data and determine population characteristics are to be
identified.
∗ To whom correspondence should be addressed.
In some cases, traditional microbiological tests (such as those outlined by Reddick, 1975; D’Amato et al., 1978; Craven et al., 1978;
Robinson and Oberhofer, 1983; Janda et al., 1984) have performed
poorly in the classification of bacterial species and strains. For
example it is often difficult to determine a correct classification
of a field isolate based on tests created for reference strains (type
strains), because in many instances the field isolates having counterpart reference strains have undergone evolutionary changes resulting
in intermediaries within the population. This variation makes clinical diagnosis difficult and pathogens may be misidentified when
diagnosis is based purely on traditional microbiological tests. If
computer algorithms are capable of identifying phenotypes through
expression profile analysis, and ultimately candidate biomarkers
which correlate strongly to a pre-determined observation, then this
would provide the potential for the development of decision support
tools, which may be utilized to supplement human judgement and
potentially reduce the turnaround time of identification or diagnosis
compared with existing clinical and microbiological methods.
Artificial neural networks (ANNs) are a form of machine learning
capable of accurately modelling proteomic data and identifying biomarkers which are capable of discriminating between one class and
another, for example different tumour states (Lancashire et al., 2003;
Mian et al., 2003; Ball et al., 2002), whilst at the same time providing
a measure of certainty to each prediction given for each individual
sample. To determine if these ANN approaches could be applied
to bacterial data, members of the genus Neisseria and other closely
related species were utilized with the aim of accurately identifying
Neisseria meningitidis and other species which are closely related.
Artificial neural networks are capable of providing a robust
approach to outcome approximation, and are being used with great
effect in the biological sciences. Their uses range from the prediction of outcome in acute lower-gastrointestinal haemorrhages (Das
et al., 2003), the diagnosis of cancers (Khan et al., 2001; Batuello
et al., 2001; Anagnostou et al., 2003; Peterson and Ringnér, 2003), to
the identification of environmental influences on plant species (Ball
et al., 2000). This study utilized a multi-layer perceptron (MLP)
ANN with a back propagation (BP) algorithm. It is well documented
that MLPs are commonly used for the data mining of complex data
(Wei et al., 1998; Balls et al., 1996; Desilva et al., 1994), as they
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
are able to cope with data containing high levels of background
noise and can be utilized both in identifying specific markers which
may be responsible for the classification of certain outcomes, such
as different bacterial species, and in identifying the influence of
interacting factors between these markers. Another advantage of
ANNs is that they do not rely on any pre-determined relationships
between variables, meaning that each individual input is not initially
assumed to be interacting in a biological manner with any other.
Artificial neural networks identify patterns in the data which correspond to a specific class by acquiring knowledge through a learning
process, in which interneuron connection weights (analogous to synaptic activity in the human brain) are used to store this knowledge.
Curved decision surfaces are then generated by the iterative modification of multiple sigmoidal transfer functions (Rumelhart and
McClelland, 1986), and it is these which introduce non-linearity
into the system, allowing the ANNs to generate non-linear decision
boundaries based upon the dataset in question. Artificial neural networks using the BP algorithm learn (or train) in the following
manner. Firstly, input data are presented to the network. (For example,
each input may correspond to a mass-to-charge ratio from the mass
spectrometer with its respective intensity value. A mass spectrometer
does not actually measure mass directly but rather the mass-to-charge
ratio, or m : z of ions formed.) The network will then compute an output based upon the data it has seen, and this output is then compared
to the actual known output (for example an output of 1 would represent N.meningitidis and an output of 2 ‘other’ strains), and an error
value is calculated between the two. The interconnecting network
weights are then passed through a training algorithm and then modified so that the next time the inputs are presented to the system,
the network output is closer to the known, actual output. This is an
iterative process which continues repeatedly until a pre-determined
condition is met, which may be a target error value or a failure of the
network to improve for a certain number of learning cycles (epochs).
Once the ANN is suitably trained, it can then be validated by introducing additional (previously unseen) samples to the model, which
then computes an output based on the new data. Artificial neural networks may also be validated as part of the training process, where
the dataset is split into separate training and test cases. The ANN will
then train using the samples chosen in the training set, and is then
validated on the separate set of test cases whilst training is occurring.
A second validation set is used in addition to a test set because a test
set used in the optimization process cannot be said to be truly blind
as it has had influence on model development and optimization.
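The three-way partition described above can be sketched as follows. This is a minimal illustration only: the 60/20/20 proportions, the function name `split_indices` and the fixed seed are assumptions made for the example and are not taken from the study.

```python
import random

def split_indices(n_samples, frac_train=0.6, frac_test=0.2, seed=0):
    """Randomly partition sample indices into training, test and validation sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and test sets have been filled becomes the blind validation set.
    """
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    n_train = round(frac_train * n_samples)
    n_test = round(frac_test * n_samples)
    train = order[:n_train]                    # used to adjust the weights
    test = order[n_train:n_train + n_test]     # monitored during training
    validation = order[n_train + n_test:]      # never seen until training ends
    return train, test, validation

train_idx, test_idx, val_idx = split_indices(206)
```

Changing `seed` gives a different random partition, so that a given sample falls into the validation set in some partitions and not in others.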
Training during BP is a supervised process, meaning that before
training occurs, the identification of each sample is already known.
In this study, the means by which the bacterial samples were identified into their respective species was 16S rDNA sequence analysis.
In molecular microbiology, hierarchical clustering is a method which
is commonly used to study and determine the relationships between
different bacterial strains, and is often used to place new strains into
a particular taxon. It works by organizing the data into meaningful
tree-like structures using distance measures to establish how closely
related one sample is to another. Bearing this in mind, it seemed
appropriate that this study should incorporate clustering into the analysis in order to study whether any possible outliers misclassified by
the ANN within the two populations were more related to their identification by 16S rDNA analysis or by their actual ANN classification.
The aims of this study were to utilize ANN methodologies in order
to identify potential biological markers among well-characterized
strains of N.meningitidis which could then be used to create a model
to distinguish from closely related species. The genus Neisseria contains a number of species and these may be both normal flora and
pathogens of humans and animals. Data used were generated from
standard NCTC (National Collection of Type Cultures) strains. Of
these species, N.meningitidis and N.gonorrhoeae have been studied widely because of the severity of the infections they cause.
N.meningitidis and N.gonorrhoeae exhibit over 90% homology
between their genomes; however, their respective site of infection,
disease picture and antibiotic therapy vary markedly, highlighting
the need for new, more accurate, rapid diagnostic tests to aid in the
identification of these pathogens.
METHODS
Culturing and storage
Strains of N.meningitidis and N.gonorrhoeae were grown for 24 and 48 h
respectively on Chocolate Blood Agar (Media Dept., CPHL) in 5% CO2 at
37 °C. All were harvested and applied to Microbank™ plastic storage beads
(Pro-Lab Diagnostics) and stored at −80 °C for long term storage.
Protein assay
Protein estimations were performed according to the Bradford method using
bovine serum albumin (Sigma) as a standard. Absorbance measurements
were taken using an Ultrospec 4300 spectrophotometer (Amersham) with the
SWIFT II software package (v.2.01) (Biochrom).
SELDI-TOF-MS sample preparation
Cells from one plate of growth were harvested and re-suspended in 1 ml of prechilled distilled water (Media Dept., CPHL) in 2 ml microtubes (Eppendorf)
to which glass beads (Sigma) were added to approximately one quarter of
the total volume. Cells were disrupted mechanically by vortexing them for
4 × 1 min in a Mickle (Mickle Laboratory Engineering Co Ltd), with 1 min
on ice in between intervals. Tubes were left on ice for 20 min for the beads
to settle to the bottom. The supernatant was transferred into fresh 1.5 ml
Eppendorf tubes and spun at 21 000 RCF at 4 °C for 40 min. The supernatant
was transferred into fresh 1.5 ml Eppendorf tubes, ready for analysis or frozen
at −80 °C.
Chip preparation
H50 ProteinChips were bulk washed twice in 50% methanol (BDH) for 5 min
and then left to air-dry for 1 h. Each spot was pre-activated with 5 µl of
0.1% trifluoroacetic acid (TFA) (Sigma) for 5 min and the remaining solution
removed. Ten microlitres of sample were diluted in 10 µl of ammonium acetate (Sigma) with 25% acetonitrile at pH 4 (BDH). The samples were pulsed
down to ensure proper mixing. Four microlitres of sample-ammonium acetate mix were applied to individual spots and incubated in a humidity chamber
for 1 h. Excess samples were pipetted off and each spot was washed twice
with 5 µl of 50% methanol for ∼2 min and then air-dried for 10 min. The
matrix was prepared by dissolving 35 mg/ml of sinapic acid in 50% methanol and 50% TFA (1%), and 1 µl of sinapic acid matrix was applied to
each spot and left to air-dry for 10 min. The samples were analyzed using the
SELDI-TOF-MS Ciphergen (Series PBS II, Fremont, CA, USA) according
to an automated data collection protocol. The instrument was operated in
the positive ion mode and a nitrogen laser emitting at 337 nm was used. The
data were electronically exported from the ProteinChip software for further
analysis (below).
All samples were run in duplicate and all spectra were calibrated for
molecular mass with Ciphergen's All-in-One Protein Mix TLF15kDa as
well as for total ion current. Previous tests have demonstrated the stability
of spectra from inoculated ProteinChip Arrays over extended time periods
when stored away from light interference, which could degrade the matrix
compound. Furthermore, experiments have been performed to test spectra
reproducibility of the same strain with repeated sample preparations which
documented the constancy of the resulting data (data unpublished). Ion intensity can vary slightly between repeated experiments without affecting pattern
stability.
Development of ANN model
We used a three-layer MLP ANN with a feed forward BP algorithm with a
sigmoidal transfer function (Bishop, 1995). Prior to training, the data were
scaled linearly between 0 and 1 using their minima and maxima, so that the
smallest value in the dataset was mapped to the minimum value and the largest
value in the dataset was mapped to the maximum value. Because the scaling is
linear, all relationships amongst the data values are preserved and no
potential bias is introduced into
the data. The raw data obtained from the SELDI-MS instrument consist of
individual m : z values with their corresponding relative abundance values.
It is these relative abundance values for each m : z value that were used as
inputs in the input layer. The network utilized a constrained approach to
maximize the efficiency of the analysis whereby two hidden nodes were used
in the hidden layer. Two hidden nodes were used in order to amplify the
importance of key ions within the mass spectrometry data, while producing
accurate predictions and maintaining model generalization. This approach
was adopted with success on earlier mass spectrometry data (Ball et al.,
2002; Mian et al., 2003). Increasing the number of nodes in the hidden layer
did not result in an increase in the ability of the model to predict strain (data
not presented). The output layer consisted of a single node encoded with
Boolean representation, N.meningitidis represented by 1 and other strains
represented by 2, and the initial weights of the network were randomized
between 0 and 0.001.
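The scaling step and the constrained architecture described above can be sketched as follows. This is a hedged illustration rather than the Neuroshell 2 implementation: the function names are hypothetical, bias terms are omitted, and shifting the sigmoidal output into the range 1 to 2 (to match the 1/2 target coding) is an assumption made for the example.

```python
import math
import random

def min_max_scale(values):
    """Linearly rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_weights(n_inputs, n_hidden=2, seed=0):
    """Initial weights drawn uniformly between 0 and 0.001, as stated in the text."""
    rng = random.Random(seed)
    w_ih = [[rng.uniform(0.0, 0.001) for _ in range(n_hidden)]
            for _ in range(n_inputs)]
    w_ho = [rng.uniform(0.0, 0.001) for _ in range(n_hidden)]
    return w_ih, w_ho

def forward(x, w_ih, w_ho):
    """Forward pass: inputs -> two sigmoidal hidden nodes -> one output node."""
    hidden = [sigmoid(sum(xi * w_ih[i][h] for i, xi in enumerate(x)))
              for h in range(len(w_ho))]
    # Assumption: the output is shifted into (1, 2) to match the target coding.
    return 1.0 + sigmoid(sum(hj * w for hj, w in zip(hidden, w_ho)))

x = min_max_scale([12.0, 3.5, 40.2, 7.1])   # hypothetical relative abundances
w_ih, w_ho = init_weights(len(x))
untrained_output = forward(x, w_ih, w_ho)
```

With such small initial weights the untrained output sits close to the midpoint between the two classes; it is the iterative weight updates during training that move it towards 1 or 2.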
Prior to analysis and model development, data with m : z values below
3 kDa were removed as anything below 3 kDa was deemed to be noisy and
unimportant, and due to the limitations in accurate mass resolution beyond
30 kDa, everything above this mass value was also removed (as shown in Ball
et al., 2002).
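The mass-range filter can be expressed as a short sketch. The function name `keep_mass_window` and the treatment of the 3 and 30 kDa limits as inclusive are assumptions for illustration; the text specifies only that values below 3 kDa and above 30 kDa were removed.

```python
def keep_mass_window(mz_values, intensities, low=3000.0, high=30000.0):
    """Keep only the m/z columns inside [low, high] (in Da).

    mz_values   : list of m/z values, one per spectrum column
    intensities : list of rows (one per sample), each aligned with mz_values
    """
    keep = [i for i, mz in enumerate(mz_values) if low <= mz <= high]
    kept_mz = [mz_values[i] for i in keep]
    kept_intensities = [[row[i] for i in keep] for row in intensities]
    return kept_mz, kept_intensities

# Hypothetical two-sample example with five m/z columns.
mz = [2500.0, 3000.0, 15000.0, 30000.0, 31000.0]
spectra = [[0.9, 0.2, 0.5, 0.1, 0.3],
           [0.8, 0.1, 0.6, 0.2, 0.4]]
kept_mz, kept_spectra = keep_mass_window(mz, spectra)
```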
The model was trained, tested and validated using 206 samples. The data
utilized were evenly split with 103 as N.meningitidis and 103 as other strains
to prevent the predictive performance being in favour of one output class.
Of the other strains the majority belonged to the genus Neisseria, with the
addition of a few other closely related taxa such as Kingella and Moraxella.
An additional 188 samples (60 of which were N.meningitidis and 128 were
‘other’ species) were brought in at a later date to provide a second order of
validation utilizing blind data for the final optimal model.
Prior to training, the data were split into three sets, training, test and
validation. During training, the ANN utilizes the training data to develop
predictive performance, and model performance is continually monitored
for the test data, which is unseen to the network during training. Training
was conducted to convergence on the test data, i.e. when the mean squared
error (MSE) failed to improve for 20 000 training events. This early stopping
approach helps to minimize over-training of the ANN. Once trained the ANN
model can then be validated using the second blind validation set put aside
for this purpose.
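The early stopping rule can be sketched as below. The callbacks `train_step` and `test_mse` are hypothetical placeholders for one weight update and one evaluation of the test-set MSE; evaluating after every training event and the `max_events` safeguard are simplifications assumed for the example.

```python
def train_with_early_stopping(train_step, test_mse, patience=20000,
                              max_events=1_000_000):
    """Train until the test MSE has not improved for `patience` events.

    train_step : callable performing one training event (hypothetical)
    test_mse   : callable returning the current MSE on the test set (hypothetical)
    Returns the best test MSE and the event at which it was reached.
    A full implementation would also store the weights at that event.
    """
    best_mse = float("inf")
    best_event = 0
    event = 0
    while event < max_events and event - best_event < patience:
        train_step()
        event += 1
        mse = test_mse()
        if mse < best_mse:            # improvement: reset the patience window
            best_mse = mse
            best_event = event
    return best_mse, best_event

# Toy run with a scripted sequence of test errors and a patience of 3 events.
errors = iter([0.5, 0.4, 0.3, 0.35, 0.36, 0.37, 0.2])
events_run = []
best, reached_at = train_with_early_stopping(
    lambda: events_run.append(1), lambda: next(errors), patience=3)
```

In the toy run, training halts three events after the last improvement, so the later, lower error in the scripted sequence is never reached; this is the trade-off accepted by any patience-based stopping rule.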
An initial model was tested using all remaining ions as inputs (12 822 inputs
in total) resulting in a classification accuracy of 76% (results not shown).
Whilst this is relatively high, this model contains extremely high dimensionality, resulting in long training times due to its high complexity. Therefore
parameterization steps ensued with the aim of identifying which ions were
the most important within the dataset, allowing the model complexity to be
reduced and the predictive capabilities to be markedly increased.
This was achieved by a ‘rolling input subset’ approach which involved
training the data containing inputs across a 3 kDa mass range, and then shifting this along 1 kDa at a time, in order to create a new data block. For
example, the first block contained data within the 3–6 kDa mass range, the
next block ranged from 4 to 7 kDa, the next from 5 to 8 kDa and so on up to
30 kDa. Each model was then trained over 50 random training/test/validation
sub-models and relative importance values for each individual input were
recorded so that they could be ranked according to their influence upon correct
sample assignment. Relative importance values were calculated for each
input by:
(i) Multiplying the absolute weight values of the network from the input
node to the first hidden node with the absolute weight values of the
network between the first hidden layer node to the output node.
(ii) Calculating the same value using the weightings to the second hidden
node.
(iii) Summing these two values to give a relative importance value for the
input.
The process was repeated for all of the remaining inputs to give the relative
importance of each with respect to all of the other inputs.
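The rolling blocks and steps (i) to (iii) above can be sketched as follows. The function names and the list-of-lists weight layout are assumptions for illustration; the calculation itself follows the description in the text, generalized to any number of hidden nodes.

```python
def rolling_blocks(start_kda=3, stop_kda=30, width_kda=3, step_kda=1):
    """Return the (low, high) mass ranges in kDa: (3, 6), (4, 7), ... (27, 30)."""
    return [(low, low + width_kda)
            for low in range(start_kda, stop_kda - width_kda + 1, step_kda)]

def relative_importance(w_input_hidden, w_hidden_output):
    """Relative importance of each input from the trained weights.

    w_input_hidden  : one row per input, one weight per hidden node
    w_hidden_output : one weight per hidden node
    For each input, |input->hidden weight| * |hidden->output weight| is
    computed for every hidden node and the products are summed.
    """
    return [sum(abs(w_ih) * abs(w_ho)
                for w_ih, w_ho in zip(row, w_hidden_output))
            for row in w_input_hidden]

blocks = rolling_blocks()

# Hypothetical trained weights for two inputs and two hidden nodes.
importance = relative_importance([[0.5, -0.2], [0.1, 0.4]], [-1.0, 0.5])
```

For the hypothetical weights shown, the first input scores 0.5 × 1.0 + 0.2 × 0.5 = 0.6 and the second 0.1 × 1.0 + 0.4 × 0.5 = 0.3, so the first input would be ranked higher.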
Once this initial analysis was completed, ions with the greatest importance were selected from the data in order to reduce model complexity. This
was accomplished by selecting the top 1000 inputs with the greatest relative importance values and repeating the training process. Relative importance
analysis was again used to determine the top 100 inputs from these 1000. This
was repeated again to deduce the top 30 inputs, at which point the process
was stopped due to no further significant improvement in the model.
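The staged reduction from all inputs to 1000, 100 and then 30 can be sketched as below. Here `score_fn` is a hypothetical callback standing in for the expensive step of retraining the sub-models on the currently selected inputs and returning their mean relative importance; the deterministic scores in the example exist only to make the sketch runnable.

```python
def top_k(indices, importance, k):
    """Return the k input indices with the highest importance, in index order."""
    ranked = sorted(zip(importance, indices), reverse=True)
    return sorted(idx for _, idx in ranked[:k])

def staged_reduction(n_inputs, score_fn, stages=(1000, 100, 30)):
    """Repeatedly re-score the surviving inputs and keep the top k at each stage.

    score_fn(selected) must return one importance value per selected input;
    in the study this corresponds to retraining and re-ranking (assumed here).
    """
    selected = list(range(n_inputs))
    for k in stages:
        importance = score_fn(selected)
        selected = top_k(selected, importance, k)
    return selected

# Hypothetical fixed scores for 5000 inputs, all distinct by construction.
scores = [(i * 7919) % 5003 for i in range(5000)]
final_inputs = staged_reduction(5000, lambda sel: [scores[i] for i in sel])
```

In practice the scores change at every stage because the network is retrained on fewer inputs, which is why the ranking is recomputed rather than reused.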
In addition to ANN analysis at each parameterization step, cluster analysis and principal components analysis (PCA) were performed, firstly to look
at the relationships between the samples within the populations, and secondly,
to provide additional information regarding samples which were misclassified by the ANN as a means of providing a possible explanation for these
misclassifications. It is important to note that PCA and cluster analysis were
not used as predictive tools, but purely as a means to understand and visualize the data structures being analyzed. Figure 1 shows a summary of the
methodology used in the analysis.
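As an illustration of how PCA can be used purely for visualization, the sketch below projects samples onto their leading principal components. It assumes NumPy is available and uses a singular value decomposition of the mean-centred data; the software used for PCA in the study is not stated, so this is not a reproduction of it.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows of X) onto the first principal components.

    Returns the component scores and the fraction of variance each explains.
    """
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)                 # centre each input column
    U, S, Vt = np.linalg.svd(centred, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Hypothetical intensity matrix: 4 samples by 3 ions.
X_demo = np.array([[0.1, 0.9, 0.2],
                   [0.2, 0.8, 0.1],
                   [0.9, 0.1, 0.7],
                   [0.8, 0.2, 0.9]])
demo_scores, demo_explained = pca_scores(X_demo)
```

Plotting the two columns of `demo_scores` against each other gives the kind of two-dimensional view in which samples of one class may be seen to fall along a common vector.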
RESULTS
ANN analysis
Ion mass intensity profiles generated from SELDI analysis were analyzed in 3 kDa blocks in order to determine their relative importance
in classifying the samples into their respective groups. This enabled
a proteomic profile over the whole mass range that was being analyzed to be produced. Figure 2 shows the mean relative importance
profile generated from the 50 sub-models trained from each 3 kDa
block over the whole mass range of the SELDI data available. In
order to reduce the complexity of the model and determine which of
these ions were the most significant in strain prediction, all ions were
ranked in descending order of relative importance and the top 1000
(an arbitrary value) ions of most importance were selected for further
analysis. This reduction is necessary in order to find any potential
biomarkers whose intensities correspond with strain identification
and which are capable of accurate discrimination between the two
groups.
Fig. 1. Flow diagram representing methodology of ANN analysis resulting in the identification of potential biomarkers. Parameterization methods were
conducted in order to deduce the top 30 molecular ions of greatest importance.
Different randomly extracted training, test and validation sets were
utilized to develop each sub-model. This repeated random sample
cross-validation approach ensures that each sample is set aside for
validation purposes a number of times, enabling confidence intervals and also probability of identification to be calculated. So, for
example, repeated random sample cross-validation enables us to calculate the probability, or percentage chance, that a sample belongs
to one group or another. The results from this 1000 input model
showed that for validation data, 203 out of the 206 samples were correctly classified, with 100% sensitivity (percentage of N.meningitidis
samples correctly classified) and 97% specificity and an area under
the curve (AUC) value of 0.9994 when analyzed using a receiver
operating characteristic (ROC) curve. It should be noted that an
error threshold value of 0.5 was used to designate sample destination,
i.e. the actual output for an N.meningitidis was 1, and an output of
2 would represent ‘other’ species. So if a sample was predicted at
anything between 1 and 1.5, then it was classed as N.meningitidis,
and if it was predicted at between 1.5 and 2, then it was grouped into
the ‘other’ strain category. The magnitude of this error may be used
to identify strains of a similar nature.
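The decision rule just described amounts to the short sketch below. The function names are hypothetical, and the handling of an output of exactly 1.5 (assigned here to 'other') is an assumption, since the text does not specify it. The two example outputs are of the kind reported in Table 1.

```python
def assign_class(network_output, threshold=1.5):
    """Outputs below the threshold are N.meningitidis (coded 1), otherwise 'other' (coded 2)."""
    return "N.meningitidis" if network_output < threshold else "other"

def prediction_error(actual_output, network_output):
    """Absolute difference between the known coding (1 or 2) and the network output."""
    return abs(actual_output - network_output)

class_a = assign_class(1.1415)            # a sample whose actual output is 1
class_b = assign_class(1.8194)            # a sample whose actual output is 2
error_b = prediction_error(2, 1.8194)     # approximately 0.1806
```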
Although the results, based on data from the whole profile, showed
high prediction rates, for development as a rapid diagnostic aid,
the system would need to be much more parsimonious to facilitate ease of data gathering and minimize potential sources of error.
Therefore, the relative importance values of these 1000 inputs were
again ranked and the top 100 were selected for further training.
Training was again repeated as with the 1000 input model and results
from this model showed that 204 out of 206 samples were correctly classified, again with 100% sensitivity and an increase to 98%
specificity. Analysis by the ROC curve illustrated an AUC value
of 0.9999.
Fig. 2. Relative importance values for ion masses between 3000 and 29 999 Da. These values represent a value obtained from multiple sub-models in which
random initial weightings were applied to each.
Table 1. Error values demonstrating the ability of the top 30 ions to accurately
predict bacterial strain^a
Sample number   Actual output   Network output   Error
1               1               1.0000           0.0000
2               1               1.0000           0.0000
3               1               1.0000           0.0000
4               1               1.0000           0.0000
5               1               1.0000           0.0000
6               1               1.0000           0.0000
7               1               1.1415           0.1415
8               1               1.0000           0.0000
9               1               1.0014           0.0014
10              1               1.0647           0.0647
11              2               2.0000           0.0000
12*             2               1.0000           1.0000
13              2               1.8194           0.1806
14              2               2.0000           0.0000
15              2               2.0000           0.0000
16              2               2.0000           0.0000
17              2               1.9700           0.0300
18              2               2.0000           0.0000
19              2               2.0000           0.0000
20              2               2.0000           0.0000
^a The actual output is shown by a 1 for N.meningitidis and a 2 for all other species. The
prediction from the ANN model is indicated by the network output. All samples were
classified correctly except for sample 12 (marked with an asterisk).
To deduce whether the number of inputs could be reduced further,
the top 30 inputs were chosen from these 100 and training repeated.
Predictive performance showed a further increase, with a correct
classification of 205 out of 206 samples, with 100% sensitivity and
99% specificity and an AUC value of 0.9949. An example of prediction and associated errors is shown in Table 1. Further input number
reductions beyond 30 resulted in a drop in predictive performance,
so these 30 inputs were deemed as the optimal set and the most
important for predictive performance. The 30 ions identified, with
their respective relative importance values, are tabulated in Table 2
and it is interesting to note that the majority of these inputs appear
to arise in clusters around certain m : z values.
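The grouping of the selected ions around certain m : z values can be made explicit with a small sketch. It takes the 30 m : z values listed in Table 2 and merges neighbouring values whose separation is within a relative tolerance of 0.2%, the approximate mass accuracy quoted in the Discussion. The chaining rule (each value is compared with the last member of the current group) and the function name are assumptions for illustration, not the procedure used by the authors.

```python
def group_by_mass_tolerance(mz_values, rel_tol=0.002):
    """Chain sorted m/z values into groups when consecutive values differ
    by no more than rel_tol times the previous value."""
    groups = []
    for mz in sorted(mz_values):
        if groups and mz - groups[-1][-1] <= rel_tol * groups[-1][-1]:
            groups[-1].append(mz)
        else:
            groups.append([mz])
    return groups

# The 30 m/z values reported in Table 2.
TABLE2_MZ = [4709, 5085, 8181, 10663, 10665, 11672, 11674, 11676, 11678,
             12515, 12517, 14609, 14611, 19055, 19057, 19060, 19062, 19065,
             19067, 19070, 19072, 23302, 23305, 23322, 23325, 23328, 23330,
             23333, 23336, 29000]

groups = group_by_mass_tolerance(TABLE2_MZ)
group_sizes = [len(g) for g in groups]
# group_sizes is [1, 1, 1, 2, 4, 2, 2, 8, 8, 1]
```

Under this tolerance the 30 ions fall into ten groups, two of which (around 19 kDa and 23.3 kDa) contain eight ions each.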
Once these key ions were identified, they were then applied further to the blind dataset of 188 samples. In this sample population the
model correctly identified 184 out of 188 (97.9%) of the samples, with a sensitivity of 100% and specificity of 96.9%. Figure 3
shows the ROC curve when the model was applied to these data, which
resulted in an AUC value of 0.986.
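The reported figures can be checked against the composition of the blind set (60 N.meningitidis and 128 'other' samples) with a short sketch. It treats N.meningitidis as the positive class and assumes, consistent with the 100% sensitivity, that all four errors were 'other' samples predicted as N.meningitidis.

```python
def classification_metrics(true_pos, false_neg, true_neg, false_pos):
    """Sensitivity, specificity and overall accuracy from confusion counts."""
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    total = true_pos + false_neg + true_neg + false_pos
    accuracy = (true_pos + true_neg) / total
    return sensitivity, specificity, accuracy

# 60 N.meningitidis all correct; 124 of 128 'other' samples correct.
sens, spec, acc = classification_metrics(true_pos=60, false_neg=0,
                                         true_neg=124, false_pos=4)
```

This gives a sensitivity of 60/60 = 100%, a specificity of 124/128 ≈ 96.9% and an accuracy of 184/188 ≈ 97.9%, in agreement with the values stated above.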
Table 2. The 30 ions with the highest relative importance with regard to
strain prediction were ranked according to their molecular mass and were
then assigned to particular sub-groups if their masses clustered around specific
mass ranges
Input m : z value   Relative importance
4709                0.19951
5085                0.07419
8181                0.00813
10 663              0.19977
10 665              0.08953
11 672              0.00324
11 674              0.01489
11 676              0.02819
11 678              0.02234
12 515              0.00799
12 517              0.00724
14 609              0.04217
14 611              0.01433
19 055              0.03528
19 057              0.02147
19 060              0.01855
19 062              0.00753
19 065              0.01346
19 067              0.00423
19 070              0.01419
19 072              0.01921
23 302              0.01651
23 305              0.017
23 322              0.01444
23 325              0.02724
23 328              0.03176
23 330              0.01239
23 333              0.00186
23 336              0.01831
29 000              0.01503
Cluster analysis, PCA and similarity analysis
A clustering algorithm (complete linkage with distances measured by
Euclidean distances) was applied to the data in parallel to the ANNs
in order to measure consistencies between the various approaches and
to visualize any possible outliers in the two populations, which may
in turn lead to some explanations regarding the few samples which
were misclassified by the ANN models. The initial cluster analysis
was with the 1000 input model and results from this analysis can be
seen in Figure 4a–d.
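Complete linkage clustering with Euclidean distances can be sketched in a few lines of standard-library Python. This naive version is for illustration only and is not the software used in the study: it repeatedly merges the two clusters whose most distant pair of members is closest, and stops at a requested number of clusters rather than building the full tree. For real datasets a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="complete"` would be the more usual choice.

```python
import math

def complete_linkage(points, n_clusters=2):
    """Agglomerative clustering with complete linkage and Euclidean distance.

    points : list of equal-length numeric tuples (one per sample)
    Returns a list of clusters, each a list of sample indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        candidates = [(cluster_distance(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters))]
        _, a, b = min(candidates)             # closest pair of clusters
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical two-dimensional profiles forming two well-separated groups.
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
found = complete_linkage(profiles, n_clusters=2)
```

Comparing the resulting cluster membership with the known identity of each sample is what allows misclassified samples to be located relative to the two populations.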
It is evident from Figure 4a–d that this clustering approach,
although not used to predict class, was capable of grouping samples
according to their known identification reasonably well, with fairly
reproducible results as the dimensionality of the data is reduced,
although notably with some overlap between the populations. It is
interesting to recognize however that of the three samples which the
ANN incorrectly identified using the top 1000 ions (highlighted in
Figure 4a by arrows), two are in clear clusters of the opposite group,
i.e. the sample on the extreme right of the incorrect classifications
was identified using 16S rDNA methods as Kingella denitrificans, but
clearly showed more similarities to the N.meningitidis samples than
other species. Furthermore, this trend continued, with the majority
of the samples misclassified by the ANN being placed in clusters
together with species more closely related to the ANN classification, and not the initial pre-determined species. This is shown
even more clearly in Figure 4d, which represents the cluster analysis of the second validation set of samples. Of the four samples
which the ANN misclassified as being N.meningitidis (which by
16S rDNA were identified as N.gonorrhoeae), the cluster analysis,
like the ANN, placed these in the cluster containing the majority of
the N.meningitidis indicating that these isolates are more similar
to the latter than N.gonorrhoeae by proteome analysis. Thus by using
these clustering approaches in concert with ANNs we can visualize
the data using cluster analysis, and predict class with the ANNs. Furthermore we can begin to understand the nature of samples which
may be outliers, and therefore may show characteristics which are
not archetypal of that particular species which it is initially believed
to belong to, and begin to understand why the ANN models that
we have developed predict some species to be one class and not
another. To provide additional support to this, a PCA and similarity
analysis were conducted on this validation data (Figs 5 and 6).
This PCA matrix shows that all of the N.meningitidis samples
(represented in dark grey) fall along one clear vector, all clustered
closely together. Whilst it is evident that the other species belonging
to the second population exhibit much more variation and fall along
a vector far less clustered, it is interesting to see that the four samples
which the ANN identified incorrectly as N.meningitidis are clearly
placed along the same vector as the N.meningitidis by the PCA,
indicated by arrows in black.
A similarity analysis was also performed to determine which
samples were the most similar to the incorrectly classified ones with
respect to their mass spectra. From the 25 most similar samples to
these four misclassified ones, 19 were N.meningitidis and just six
were ‘other’ closely related species. This again provides the reason
as to why the ANN classified them as N.meningitidis, and that these
samples appear to actually be more related to what the ANN predicted, than to what they were initially identified as using 16S rDNA
analysis.
Fig. 3. Receiver operating characteristic curve for the 30-ion model applied
to the second validation dataset consisting of 188 samples.
DISCUSSION
The ability to consistently and rapidly identify closely related
bacterial species, such as those from the genus Neisseria, could
pave the way for new technologies becoming more accessible to aid
human judgment as an on-site laboratory-based decision tool. Mass
spectrometry methodologies allow for the generation of information regarding thousands of ions in any particular sample. In order to
determine whether ANNs have the learning capabilities to potentially
identify subsets of ions whose phenotypic ‘fingerprints’ correlate
strongly to a particular bacterial isolate, we applied parameterization
methodologies to a relatively large dataset containing N.meningitidis
and other closely related species with the aim of correctly classifying
the samples into their respective groups.
Model parameterization allows for the identification of molecular ions which are important in classifying the population into their
respective groups. Thus, by determining which ions are contributing most in the classification, unimportant and noisy ions may
be removed. Using the ‘rolling input subset’ approach described
earlier, the number of ions used as inputs in the ANN model was
reduced from an initial set of approximately 13 000 (between 3
and 30 kDa) to just the top 1000 which could accurately predict
species type.
For a system to have practical application in a diagnostic laboratory where there is a high throughput of samples and a demand
for rapid analysis, a simple application tool is required; therefore,
it was decided that the number of ions needed to be reduced to
the smallest number possible which could discriminate between
the two species. In order to achieve this, the top 1000 ions were
ranked in order of importance in predicting a bacterial strain identity, and the top 100 were selected for further training. This resulted
in a model which out-performed the previous one. This process was
repeated until the models ceased to improve in predictive capabilities resulting in a model containing just 30 molecular ions which
could predict with a sensitivity of 100% and specificity >96% on
a second order validation dataset, which had not been used in any
way in the training of the model. Therefore this approach of analyzing the interconnecting weights of the trained network can be
used as a rapid and efficient means to greatly reduce the number of
inputs in a model in order to decrease complexity, whilst increasing
predictive capabilities as a result of removing noisy inputs, which
may be causing a conflict in the model and reducing predictive
performance.
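The reduction loop described above can be sketched as follows. This is a minimal illustration only: the importance score (sum of absolute input-to-hidden weights), the function names and the `fit` callback are assumptions made for the example, and are not taken from the study or from Neuroshell 2.

```python
from typing import Callable, List, Sequence, Tuple

Weights = Sequence[Sequence[float]]  # one row of input-to-hidden weights per input


def rank_inputs_by_weight(w_in_hidden: Weights) -> List[int]:
    """Return input positions ordered from most to least important.

    Importance is assumed here to be the sum of absolute input-to-hidden
    weights; other weight-based scores are equally possible.
    """
    scores = [sum(abs(w) for w in row) for row in w_in_hidden]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def prune_until_no_gain(
    n_inputs: int,
    fit: Callable[[List[int]], Tuple[float, Weights]],
    sizes: Sequence[int],
) -> Tuple[List[int], float]:
    """Shrink the input set step by step while validation performance improves.

    `fit` is a hypothetical callback that trains a network on the given
    input indices and returns (validation score, input-to-hidden weights).
    `sizes` lists the subset sizes to try in turn, e.g. [1000, 100, 30].
    """
    kept = list(range(n_inputs))
    best_score, weights = fit(kept)
    for k in sizes:
        order = rank_inputs_by_weight(weights)
        candidate = [kept[i] for i in order[:k]]  # top-k inputs of the current model
        score, cand_weights = fit(candidate)
        if score <= best_score:  # models have ceased to improve: stop
            break
        kept, best_score, weights = candidate, score, cand_weights
    return kept, best_score
```

In this sketch the stopping rule mirrors the text: a smaller input set is accepted only if the retrained model scores better on held-out data than the model it replaces.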
Fig. 4. Cluster analysis (complete linkage using Euclidean distance measures) for (a) top 1000 ion model, (b) top 100 ion model, (c) top 30 ion model applied
to the original dataset and (d) top 30 ion model applied to the second validation dataset. The black blocks indicate samples which were N.meningitidis whilst
the light grey blocks represent ‘other’ species. The dark grey blocks show incorrectly classified samples (also highlighted with arrows).
The ANN did not assume any prior relationships
between the inputs used in the model; it was therefore interesting that the
ANN did not appear to identify single peaks at given mass values, but instead found clusters of ions around masses which were important
in the classification. Because of the mass accuracy of the instrument
(around 0.2%) and data averaging, these clusters are highly likely
to represent the same molecular species, which may be confirmed
with further research. The scope of this study was then expanded to
investigate why the ANN incorrectly classified the few samples that
it did. This was done mainly by cluster analysis, but also incorporated PCA and a similarity analysis, in which the raw mass spectra
of these samples were examined to identify which other samples
most closely matched them. From the clustering approach it was clear that
the majority of the samples that the ANN misclassified in the various models were in fact grouped together with samples from
their ANN-predicted class and not with the true class assigned to
them by 16S rDNA methods. This was supported further by a PCA on the validation dataset, which showed similar
findings: the N.meningitidis samples were all found along a clear
vector, together with the non-N.meningitidis samples that the ANN
misclassified. The similarity analysis gave the same result,
with only six of the 25 most similar samples belonging to the ‘other’
species group, whilst 19 of the 25 were N.meningitidis. This further suggests that the samples are actually more closely related to their ANN-predicted
class than to their actual class. It also provides valuable information about the nature of the two populations and about the outliers, which
exhibit characteristics typical of both groups and
so appear to exist in a continuum between the two. This information
serves to highlight the need for rapid, highly robust methods which
are generalized and can prove to be a reliable decision support tool
in microbial identification.
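The two distance-based checks discussed above can be illustrated with a short sketch. It assumes each sample is a vector of ion intensities compared by Euclidean distance; the function names, the toy data and the choice of a plain nearest-neighbour tally for the similarity analysis are assumptions for illustration, not a description of the software used in the study.

```python
import math
from collections import Counter
from typing import List, Sequence

Spectrum = Sequence[float]


def euclidean(a: Spectrum, b: Spectrum) -> float:
    """Euclidean distance between two intensity vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbour_class_tally(query: Spectrum, spectra: Sequence[Spectrum],
                          labels: Sequence[str], k: int) -> Counter:
    """Count the class labels among the k spectra closest to `query`."""
    order = sorted(range(len(spectra)), key=lambda i: euclidean(query, spectra[i]))
    return Counter(labels[i] for i in order[:k])


def complete_linkage(spectra: Sequence[Spectrum], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering with complete linkage.

    At each step the two clusters whose farthest members are closest
    are merged, until `n_clusters` clusters remain.
    """
    clusters = [[i] for i in range(len(spectra))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(euclidean(spectra[i], spectra[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy two-dimensional "spectra" (made-up values, for illustration only).
toy_spectra = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = ["Nm", "Nm", "Nm", "other", "other", "other"]
tally = neighbour_class_tally([1, 1], toy_spectra, toy_labels, k=4)
groups = complete_linkage(toy_spectra, n_clusters=2)
```

A tally dominated by one class, as in the 19 of 25 result reported above, indicates that the query spectrum sits among samples of that class regardless of its reference label.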
Fig. 5. Principal components analysis of the 30-ion model applied to the second validation dataset. The dark grey blocks indicate N.meningitidis samples and
the light grey blocks other samples. The black blocks and arrows indicate the samples which were misclassified.
Fig. 6. Similarity analysis to assess which spectra were most closely related to those samples incorrectly classified. Of the 25 most similar samples to the four
which were misclassified, 19 were N.meningitidis, concurring with the ANN predictions rather than the 16S rDNA assignments.
CONCLUSIONS
We have shown that ANNs can be successfully used to identify
molecular biomarkers whose proteomic profiles correlate
strongly with bacterial species. We have identified 30 ions which
are capable of correctly identifying 184 out of 188 samples from a separate validation dataset and, furthermore, have shown using
cluster analysis and PCA that the few misclassified samples may
actually show more resemblance to their ANN-predicted class than
to their actual class. The majority of these 30 ions
appear in clusters around certain mass values, and as such there is
a high probability that they belong to the same molecular species,
suggesting that there may only be around
10 individual species present. Although the dataset used was relatively small (206 samples), all models were validated using a random
sample cross-validation approach so that each sample was treated
as blind a number of times. This was enhanced further by applying the models to an additional dataset consisting of 188 samples
that had not previously been used in the training of the model in
any way. The high classification accuracies for this dataset (>96%)
reinforce the assumption that a generalized, robust model has been
created. This study has therefore shown that ANNs are powerful
enough to model molecular microbiological data and identify
potential markers representative of the different species of interest.
This may lead to the development of robust tools capable of
providing unbiased decision support in the laboratory and reducing
incorrect diagnosis of disease and infections by highly pathogenic
organisms.
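As a worked check of the figures quoted above, the short sketch below computes the overall accuracy from the reported counts (184 of 188) and defines sensitivity and specificity in the usual way. The function names are illustrative; the per-class breakdown of the validation set is not given in this section, so no class-level counts are assumed here.

```python
def accuracy(n_correct: int, n_total: int) -> float:
    """Fraction of samples assigned to the correct class."""
    return n_correct / n_total


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true N.meningitidis samples that were detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of 'other' samples that were correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Counts reported for the second validation dataset.
validation_accuracy = accuracy(184, 188)
print(f"{validation_accuracy:.1%}")  # prints "97.9%"
```

With a reported sensitivity of 100%, the four misclassified samples would all be false positives, which is consistent with the misclassified samples being non-N.meningitidis isolates predicted as N.meningitidis.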
REFERENCES
Anagnostou,T. et al. (2003) Artificial neural networks for decision-making in urologic
oncology. Eur. Urol., 43, 596–603.
Ball,G.R. et al. (2000) Identification of non-linear influences on the seasonal ozone
dose response of sensitive and resistant clover clones using artificial neural networks.
Ecol. Model., 129, 153–168.
Ball,G. et al. (2002) An integrated approach utilizing artificial neural networks and
SELDI mass spectrometry for the classification of human tumours and rapid
identification of potential biomarkers. Bioinformatics, 18, 395–404.
Balls,G.R. et al. (1996) Towards unravelling the complex interactions between microclimate ozone dose and ozone injury in clover. Water Air Soil Poll., 85, 1467–1472.
Batuello,J.T. et al. (2001) Artificial neural network model for the assessment of lymph
node spread in patients with clinically localized prostate cancer. Urology, 57,
481–485.
Bishop,C.M. (1995) Neural Networks for Pattern Recognition. Oxford University Press
Inc, New York.
Craven,D.E. et al. (1978) Serogroup identification of Neisseria meningitidis: comparison of an antiserum agar method with bacterial slide agglutination. J. Clin.
Microbiol., 7, 410–414.
D’Amato,R.F. et al. (1978) Rapid identification of Neisseria gonorrhoeae and Neisseria
meningitidis by using enzymatic profiles. J. Clin. Microbiol., 7, 77–81.
Das,A. et al. (2003) Prediction of outcome in acute lower-gastrointestinal haemorrhage
based on an artificial neural network: internal and external validation of a predictive
model. Lancet, 362, 1261–1266.
Desilva,C. et al. (1994) Artificial neural networks and breast-cancer prognosis. Aust.
Comput. J., 26, 78–81.
Janda,W.M. et al. (1984) Use of the API NeIdent system for identification of
pathogenic Neisseria spp. and Branhamella catarrhalis. J. Clin. Microbiol., 19,
338–341.
Khan,J. et al. (2001) Classification and diagnostic prediction of cancers using gene
expression profiling and artificial neural networks. Nat. Med., 7, 673–679.
Lancashire,L.J., Mian,S., Rees,R.C. and Ball,G.R. (2003) Preliminary artificial neural
network analysis of SELDI mass spectrometry data for the classification of
melanoma tissue. In 17th European Simulation Multiconference, Nottingham,
Society for Modeling and Simulation International, SCS European Publishing House,
Erlanger, Germany, pp. 131–135.
Mian,S. et al. (2003) A prototype methodology combining surface-enhanced
laser desorption/ionization protein chip technology and artificial neural network algorithms to predict the chemoresponsiveness of breast cancer cell lines
exposed to Paclitaxel and Doxorubicin under in vitro conditions. Proteomics, 3,
1725–1737.
Peterson,C. and Ringnér,M. (2003) Analyzing tumor gene expression profiles. Artif.
Intell. Med., 28, 59–74.
Reddick,A. (1975) A simple carbohydrate fermentation test for identification of the
pathogenic Neisseria. J. Clin. Microbiol., 2, 72–73.
Robinson,M.J. and Oberhofer,T.R. (1983) Identification of pathogenic Neisseria species
with the RapID NH system. J. Clin. Microbiol., 17, 400–404.
Rumelhart,D.E. and McClelland,J.L. (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Foundations. MIT Press, Cambridge,
MA, USA.
Wei,J.T. et al. (1998) Understanding artificial neural networks and exploring their
potential applications for the practising urologist. Urology, 52, 161–172.