BIOINFORMATICS ORIGINAL PAPER
Vol. 21 no. 10 2005, pages 2191–2199
doi:10.1093/bioinformatics/bti368

Genome analysis

Classification of bacterial species from proteomic data using combinatorial approaches incorporating artificial neural networks, cluster analysis and principal components analysis

L. Lancashire1,2, O. Schmid3, H. Shah3 and G. Ball1,2,∗

1 The Nottingham Trent University, 2 Loreus Ltd., School of Biomedical and Natural Sciences, Clifton Campus, Clifton Lane, Nottingham, NG11 8NS, UK and 3 Molecular Identification Services Unit, Central Public Health Laboratory, Health Protection Agency, 61 Colindale Avenue, London, NW9 5HT, UK

Received October 29, 2004; revised on February 23, 2005; accepted on February 28, 2005
Advance Access publication March 3, 2005

ABSTRACT

Motivation: Robust computer algorithms are required to interpret the vast amounts of proteomic data currently being produced and to generate generalized models which are applicable to ‘real world’ scenarios. One such scenario is the classification of bacterial species. These vary immensely, some remaining remarkably stable whereas others are extremely labile, showing rapid mutation and change. Such variation makes clinical diagnosis difficult and pathogens may be easily misidentified.

Results: We applied artificial neural networks (Neuroshell 2) in parallel with cluster analysis and principal components analysis to surface enhanced laser desorption/ionization (SELDI)-TOF mass spectrometry data with the aim of accurately identifying the bacterium Neisseria meningitidis from species within this genus and other closely related taxa. A subset of ions was identified that allowed for the consistent identification of species, classifying >97% of a separate validation subset of samples into their respective groups.

Availability: Neuroshell 2 is commercially available from Ward Systems.
Contact: [email protected]
∗ To whom correspondence should be addressed.

INTRODUCTION

The increase in the current application of proteomic technologies in biological research has led to the generation of vast amounts of data, the majority of which are seemingly impossible to analyze by traditional means, especially with respect to biomarker identification. For example, a typical mass spectrometry experiment utilizing Matrix Assisted Laser Desorption/Ionisation Time of Flight Mass Spectrometry (MALDI-TOF-MS) with ProteinChip technology (SELDI: surface enhanced laser desorption/ionization) yields in excess of 34 000 individual data points per sample, per analysis. Therefore, in studies aimed at analyzing datasets containing large sample sizes and high dimensionality, more advanced computer algorithms must be utilized if biomarkers that explain the variation within the data and determine population characteristics are to be identified.

In some cases, traditional microbiological tests (such as those outlined by Reddick, 1975; D’Amato et al., 1978; Craven et al., 1978; Robinson and Oberhofer, 1983; Janda et al., 1984) have performed poorly in the classification of bacterial species and strains. For example, it is often difficult to determine a correct classification of a field isolate based on tests created for reference strains (type strains), because in many instances the field isolates having counterpart reference strains have undergone evolutionary changes resulting in intermediaries within the population. This variation makes clinical diagnosis difficult and pathogens may be misidentified when diagnosis is based purely on traditional microbiological tests.
If computer algorithms are capable of identifying phenotypes through expression profile analysis, and ultimately candidate biomarkers which correlate strongly to a pre-determined observation, then this would provide the potential for the development of decision support tools, which may be utilized to supplement human judgement and potentially reduce the turnaround time of identification or diagnosis compared with existing clinical and microbiological methods. Artificial neural networks (ANNs) are a form of machine learning capable of accurately modelling proteomic data and identifying biomarkers which discriminate between one class and another, for example different tumour states (Lancashire et al., 2003; Mian et al., 2003; Ball et al., 2002), whilst at the same time providing a measure of certainty for each prediction given for each individual sample. To determine if these ANN approaches could be applied to bacterial data, members of the genus Neisseria and other closely related species were utilized with the aim of accurately identifying Neisseria meningitidis and other species which are closely related.

Artificial neural networks are capable of providing a robust approach to outcome approximation, and are being used with great effect in the biological sciences. Their uses range from the prediction of outcome in acute lower-gastrointestinal haemorrhages (Das et al., 2003) and the diagnosis of cancers (Khan et al., 2001; Batuello et al., 2001; Anagnostou et al., 2003; Peterson and Ringnér, 2003) to the identification of environmental influences on plant species (Ball et al., 2000). This study utilized a multi-layer perceptron (MLP) ANN with a back propagation (BP) algorithm. It is well documented that MLPs are commonly used for the data mining of complex data (Wei et al., 1998; Balls et al., 1996; Desilva et al., 1994), as they are able to cope with data containing high levels of background noise and can be utilized both in identifying specific markers which may be responsible for the classification of certain outcomes, such as different bacterial species, and in identifying the influence of interacting factors between these markers. Another advantage of ANNs is that they do not rely on any pre-determined relationships between variables, meaning that each individual input is not initially assumed to be interacting in a biological manner with any other.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Artificial neural networks identify patterns in the data which correspond to a specific class by acquiring knowledge through a learning process, in which interneuron connection weights (analogous to synaptic activity in the human brain) are used to store this knowledge. Curved decision surfaces are then generated by the iterative modification of multiple sigmoidal transfer functions (Rumelhart and McClelland, 1986), and it is these which introduce non-linearity into the system, allowing the ANNs to generate non-linear decision boundaries based upon the dataset in question.

Artificial neural networks using the BP algorithm learn (or train) in the following manner. First, input data are presented to the network. (For example, each input may correspond to a mass-to-charge ratio from the mass spectrometer with its respective intensity value. A mass spectrometer does not actually measure mass directly but rather the mass-to-charge ratio, or m : z, of the ions formed.) The network then computes an output based upon the data it has seen, and this output is compared to the actual known output (for example an output of 1 would represent N.meningitidis and an output of 2 ‘other’ strains), and an error value is calculated between the two.
The interconnecting network weights are then passed through a training algorithm and modified so that the next time the inputs are presented to the system, the network output is closer to the known, actual output. This is an iterative process which continues until a pre-determined condition is met, which may be a target error value or a failure of the network to improve for a certain number of learning cycles (epochs). Once the ANN is suitably trained, it can then be validated by introducing additional (previously unseen) samples to the model, which then computes an output based on the new data. Artificial neural networks may also be validated as part of the training process, where the dataset is split into separate training and test cases. The ANN will then train using the samples chosen in the training set, and is then validated on the separate set of test cases whilst training is occurring. A second validation set is used in addition to a test set because a test set used in the optimization process cannot be said to be truly blind, as it has had influence on model development and optimization. Training during BP is a supervised process, meaning that before training occurs, the identification of each sample is already known. In this study, the means by which the bacterial samples were identified into their respective species was 16S rDNA sequence analysis.

In molecular microbiology, hierarchical clustering is a method which is commonly used to study and determine the relationships between different bacterial strains, and is often used to place new strains into a particular taxon. It works by organizing the data into meaningful tree-like structures using distance measures to establish how closely related one sample is to another.
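As a rough illustration of the distance-based grouping just described, the following is a minimal sketch of agglomerative clustering. It assumes complete linkage with Euclidean distance (the settings this paper reports for its own cluster analysis), returns a flat set of clusters rather than a full tree, and uses hypothetical two-dimensional points in place of mass spectra. The function names are illustrative only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair; b > a, so deleting b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical samples: three near the origin, two far away.
samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (0.5, 0.5)]
groups = complete_linkage(samples, n_clusters=2)
print(groups)
```

The naive search over all cluster pairs is cubic or worse in the number of samples, which is acceptable for a few hundred spectra but would normally be replaced by a library routine.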
Bearing this in mind, it seemed appropriate that this study should incorporate clustering into the analysis in order to study whether any possible outliers misclassified by the ANN within the two populations were more closely related to their identification by 16S rDNA analysis or to their ANN classification.

The aims of this study were to utilize ANN methodologies in order to identify potential biological markers among well-characterized strains of N.meningitidis which could then be used to create a model to distinguish them from closely related species. The genus Neisseria contains a number of species and these may be both normal flora and pathogens of humans and animals. Data used were generated from standard NCTC (National Collection of Type Cultures) strains. Of these species, N.meningitidis and N.gonorrhoeae have been studied widely because of the severity of the infections they cause. N.meningitidis and N.gonorrhoeae exhibit over 90% homology between their genomes; however, their respective site of infection, disease picture and antibiotic therapy vary markedly, highlighting the need for new, more accurate, rapid diagnostic tests to aid in the identification of these pathogens.

METHODS

Culturing and storage

Strains of N.meningitidis and N.gonorrhoeae were grown for 24 and 48 h respectively on Chocolate Blood Agar (Media Dept., CPHL) in 5% CO2 at 37 °C. All were harvested and applied to Microbank™ plastic storage beads (Pro-Lab Diagnostics) and stored at −80 °C for long term storage.

Protein assay

Protein estimations were performed according to the Bradford method using bovine serum albumin (Sigma) as a standard. Absorbance measurements were taken using an Ultrospec 4300 spectrophotometer (Amersham) with the SWIFT II software package (v.2.01) (Biochrom).
SELDI-TOF-MS sample preparation

Cells from one plate of growth were harvested and re-suspended in 1 ml of prechilled distilled water (Media Dept., CPHL) in 2 ml microtubes (Eppendorf) to which glass beads (Sigma) were added to approximately one quarter of the total volume. Cells were disrupted mechanically by vortexing them for 4 × 1 min in a Mickle (Mickle Laboratory Engineering Co Ltd), with 1 min on ice in between intervals. Tubes were left on ice for 20 min for the beads to settle to the bottom. The supernatant was transferred into fresh 1.5 ml Eppendorf tubes and spun at 21 000 RCF at 4 °C for 40 min. The supernatant was transferred into fresh 1.5 ml Eppendorf tubes, ready for analysis or frozen at −80 °C.

Chip preparation

H50 ProteinChips were bulk washed twice in 50% methanol (BDH) for 5 min and then left to air-dry for 1 h. Each spot was pre-activated with 5 µl of 0.1% trifluoroacetic acid (TFA) (Sigma) for 5 min and the remaining solution removed. Ten microlitres of sample were diluted in 10 µl of ammonium acetate (Sigma) with 25% acetonitrile at pH 4 (BDH). The samples were pulsed down to ensure proper mixing. Four microlitres of sample-ammonium acetate mix were applied to individual spots and incubated in a humidity chamber for 1 h. Excess sample was pipetted off and each spot was washed twice with 5 µl of 50% methanol for ∼2 min and then air-dried for 10 min. The matrix was prepared by dissolving 35 mg/ml of sinapic acid in 50% methanol and 50% TFA (1%), and 1 µl of sinapic acid matrix was applied to each spot and left to air-dry for 10 min. The samples were analyzed using the SELDI-TOF-MS Ciphergen (Series PBS II, Fremont, CA, USA) according to an automated data collection protocol. The instrument was operated in the positive ion mode and a nitrogen laser emitting at 337 nm was used. The data were electronically exported from the ProteinChip software for further analysis (below).
All samples were run in duplicate and all spectra were calibrated for molecular mass with Ciphergen's All-in-One Protein Mix TLF15kDa as well as for total ion current. Previous tests have demonstrated the stability of spectra from inoculated ProteinChip Arrays over extended time periods when stored away from light interference, which could degrade the matrix compound. Furthermore, experiments testing the reproducibility of spectra from the same strain across repeated sample preparations documented the constancy of the resulting data (data unpublished). Ion intensity can vary slightly between repeated experiments without affecting pattern stability.

Development of ANN model

We used a three-layer MLP ANN with a feed forward BP algorithm with a sigmoidal transfer function (Bishop, 1995). Prior to training, the data were scaled linearly between 0 and 1 using minima and maxima. The raw values were scaled linearly so that the smallest value in the dataset was scaled to the minimum value and the largest value in the dataset was set to the maximum value. This scaling method ensures that all relationships amongst the data values are kept identical, therefore not introducing any potential bias into the data. The raw data obtained from the SELDI-MS instrument consist of individual m : z values with their corresponding relative abundance values. It is these relative abundance values for each m : z value that were used as inputs in the input layer. The network utilized a constrained approach to maximize the efficiency of the analysis whereby two hidden nodes were used in the hidden layer. Two hidden nodes were used in order to amplify the importance of key ions within the mass spectrometry data, while producing accurate predictions and maintaining model generalization. This approach was adopted with success on earlier mass spectrometry data (Ball et al., 2002; Mian et al., 2003).
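The linear scaling described above can be sketched as follows. This is a minimal illustration, not the Neuroshell 2 implementation: it assumes the scaling is applied to one input (one m : z value across samples) at a time, and the handling of a constant input is an added assumption. The function name and the intensity values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant input carries no information, so map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical relative abundance values for one m : z input across five samples.
intensities = [2.0, 4.0, 6.0, 10.0, 18.0]
scaled = min_max_scale(intensities)
print(scaled)
```

Because the mapping is linear, the ordering of the values and the ratios of the differences between them are preserved, which is the property the text relies on.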
Increasing the number of nodes in the hidden layer did not result in an increase in the ability of the model to predict strain (data not presented). The output layer consisted of a single node encoded with Boolean representation, N.meningitidis represented by 1 and other strains represented by 2, and the initial weights of the network were randomized between 0 and 0.001. Prior to analysis and model development, data with m : z values below 3 kDa were removed as anything below 3 kDa was deemed to be noisy and unimportant, and due to the limitations in accurate mass resolution beyond 30 kDa, everything above this mass value was also removed (as shown in Ball et al., 2002). The model was trained, tested and validated using 206 samples. The data utilized were evenly split with 103 as N.meningitidis and 103 as other strains to prevent the predictive performance being in favour of one output class. Of the other strains the majority belonged to the genus Neisseria, with the addition of a few other closely related taxa such as Kingella and Moraxella. An additional 188 samples (60 of which were N.meningitidis and 128 were ‘other’ species) were brought in at a later date to provide a second order of validation utilizing blind data for the final optimal model. Prior to training, the data were split into three sets, training, test and validation. During training, the ANN utilizes the training data to develop predictive performance, and model performance is continually monitored for the test data, which is unseen to the network during training. Training was conducted to convergence on the test data, i.e. when the mean squared error (MSE) failed to improve for 20,000 training events. This early stopping approach helps to minimize over-training of the ANN. Once trained the ANN model can then be validated using the second blind validation set put aside for this purpose. 
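The training scheme described above can be sketched in plain Python. This is only an illustrative toy under stated assumptions, not the Neuroshell 2 procedure: the two classes are encoded as 0 and 1 (rather than 1 and 2) so that a sigmoid output can reach both targets, bias terms are included, patience is counted in epochs rather than training events, the learning rate is arbitrary, and the data are hypothetical. The two hidden nodes, sigmoid transfer function and initial weights in the range 0 to 0.001 follow the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """Three-layer perceptron: inputs -> 2 sigmoid hidden nodes -> 1 sigmoid output."""

    def __init__(self, n_inputs, n_hidden=2, seed=0):
        rng = random.Random(seed)
        # Initial weights drawn from [0, 0.001], as stated in the text.
        # The last entry of each weight list is a bias term (an assumption).
        self.w_hidden = [[rng.uniform(0.0, 0.001) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(0.0, 0.001) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update on a single sample (squared-error loss)."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights as they were before this update.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]


def mse(net, samples):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in samples) / len(samples)


def train_with_early_stopping(net, train, test, patience=1000, max_epochs=3000):
    """Train until the test MSE has not improved for `patience` epochs."""
    best = mse(net, test)
    stale = 0
    epochs_run = 0
    for _ in range(max_epochs):
        for x, t in train:
            net.train_step(x, t)
        epochs_run += 1
        current = mse(net, test)
        if current < best:
            best, stale = current, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, epochs_run


# Hypothetical toy data: one scaled input; target 0 = N.meningitidis, 1 = 'other'.
train_set = [([0.1], 0.0), ([0.2], 0.0), ([0.8], 1.0), ([0.9], 1.0)]
test_set = [([0.15], 0.0), ([0.85], 1.0)]
net = TinyMLP(n_inputs=1)
initial_mse = mse(net, test_set)
best_mse, epochs_run = train_with_early_stopping(net, train_set, test_set)
print(initial_mse, best_mse, epochs_run)
```

The sketch reports the best test error seen but keeps the final weights; a fuller version would store and restore the weights from the best epoch, and would hold back a third, blind validation set as the text describes.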
An initial model was tested using all remaining ions as inputs (12 822 inputs in total), resulting in a classification accuracy of 76% (results not shown). Whilst this is relatively high, this model contains extremely high dimensionality, resulting in long training times due to its high complexity. Therefore parameterization steps ensued with the aim of identifying which ions were the most important within the dataset, allowing the model complexity to be reduced and the predictive capabilities to be markedly increased. This was achieved by a ‘rolling input subset’ approach which involved training on the inputs within a 3 kDa mass range, and then shifting this range along 1 kDa at a time in order to create a new data block. For example, the first block contained data within the 3–6 kDa mass range, the next block ranged from 4 to 7 kDa, the next from 5 to 8 kDa and so on up to 30 kDa. Each model was then trained over 50 random training/test/validation sub-models and relative importance values for each individual input were recorded so that they could be ranked according to their influence upon correct sample assignment. Relative importance values were calculated for each input by:

(i) Multiplying the absolute weight value of the network from the input node to the first hidden node by the absolute weight value of the network from the first hidden node to the output node.
(ii) Calculating the same value using the weightings to the second hidden node.
(iii) Summing these two values to give a relative importance value for the input.

The process was repeated for all of the remaining inputs to give the relative importance of each with respect to all of the other inputs. Once this initial analysis was completed, ions with the greatest importance were selected from the data in order to reduce model complexity. This was accomplished by selecting the top 1000 inputs with the greatest relative importance values and repeating the training process.
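The rolling mass windows and the relative importance calculation described above can be sketched as follows. The function names, the weight layout (`w_input_hidden[j][i]` for input i to hidden node j) and the example weights are hypothetical; only the window scheme and steps (i) to (iii) are taken from the text.

```python
def rolling_windows(start_kda=3, stop_kda=30, width_kda=3, step_kda=1):
    """Return (low, high) mass windows: 3-6, 4-7, ... up to the window ending at stop_kda."""
    windows = []
    low = start_kda
    while low + width_kda <= stop_kda:
        windows.append((low, low + width_kda))
        low += step_kda
    return windows


def relative_importance(w_input_hidden, w_hidden_output):
    """Sum over hidden nodes of |input-to-hidden weight| * |hidden-to-output weight|.

    w_input_hidden[j][i] is the weight from input i to hidden node j;
    w_hidden_output[j] is the weight from hidden node j to the output node.
    """
    n_inputs = len(w_input_hidden[0])
    return [
        sum(abs(row[i]) * abs(w_out)
            for row, w_out in zip(w_input_hidden, w_hidden_output))
        for i in range(n_inputs)
    ]


def top_inputs(importances, k):
    """Indices of the k inputs with the largest importance, highest first."""
    ranked = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return ranked[:k]


# Hypothetical trained weights: three inputs, two hidden nodes, one output.
w_ih = [[0.5, -0.25, 1.0],
        [-1.0, 0.5, 0.0]]
w_ho = [2.0, -0.5]

windows = rolling_windows()
importance = relative_importance(w_ih, w_ho)
print(windows[0], windows[-1], len(windows))
print(importance, top_inputs(importance, 2))
```

In the study these values were averaged over the 50 sub-models per window before ranking; the sketch shows the calculation for a single trained network only.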
Relative importance analysis was again used to determine the top 100 inputs from these 1000. This was repeated again to deduce the top 30 inputs, at which point the process was stopped because no further significant improvement in the model was obtained. In addition to ANN analysis at each parameterization step, a cluster analysis and a principal components analysis (PCA) were performed, firstly to examine the relationships between the samples within the populations and secondly to provide additional information regarding samples which were misclassified by the ANN, as a means of providing a possible explanation for these misclassifications. It is important to note that PCA and cluster analysis were not used as predictive tools, but purely as a means to understand and visualize the data structures being analyzed. Figure 1 shows a summary of the methodology used in the analysis.

RESULTS

ANN analysis

Ion mass intensity profiles generated from SELDI analysis were analyzed in 3 kDa blocks in order to determine their relative importance in classifying the samples into their respective groups. This enabled a proteomic profile to be produced over the whole mass range being analyzed. Figure 2 shows the mean relative importance profile generated from the 50 sub-models trained from each 3 kDa block over the whole mass range of the SELDI data available. In order to reduce the complexity of the model and determine which of these ions were the most significant in strain prediction, all ions were ranked in descending order of relative importance and the top 1000 (an arbitrary value) ions of most importance were selected for further analysis. This reduction is necessary in order to find any potential biomarkers whose intensities correspond with strain identification and which are capable of accurate discrimination between the two groups. Different randomly extracted training, test and validation sets were utilized to develop each sub-model.
This repeated random sample cross-validation approach ensures that each sample is set aside for validation purposes a number of times, enabling confidence intervals and also the probability of identification to be calculated. So, for example, repeated random sample cross-validation enables us to calculate the probability, or percentage chance, that a sample belongs to one group or another. The results from this 1000 input model showed that for validation data, 203 out of the 206 samples were correctly classified, with 100% sensitivity (percentage of N.meningitidis samples correctly classified), 97% specificity and an area under the curve (AUC) value of 0.9994 when analyzed using a receiver operating characteristic (ROC) curve. It should be noted that an error threshold value of 0.5 was used to designate sample destination, i.e. the actual output for an N.meningitidis was 1, and an output of 2 would represent ‘other’ species. So if a sample was predicted at anything between 1 and 1.5, then it was classed as N.meningitidis, and if it was predicted at between 1.5 and 2, then it was grouped into the ‘other’ strain category. The magnitude of this error may be used to identify strains of a similar nature.

Fig. 1. Flow diagram representing methodology of ANN analysis resulting in the identification of potential biomarkers. Parameterization methods were conducted in order to deduce the top 30 molecular ions of greatest importance.

Although the results, based on data from the whole profile, showed high prediction rates, for development as a rapid diagnostic aid, the system would need to be much more parsimonious to facilitate ease of data gathering and minimize potential sources of error. Therefore, the relative importance values of these 1000 inputs were again ranked and the top 100 were selected for further training.
Training was again repeated as with the 1000 input model and results from this model showed that 204 out of 206 samples were correctly classified, again with 100% sensitivity and an increase to 98% specificity. Analysis by the ROC curve illustrated an AUC value of 0.9999. To deduce whether the number of inputs could be reduced further, the top 30 inputs were chosen from these 100 and training repeated. Predictive performance showed a further increase, with a correct classification of 205 out of 206 samples, with 100% sensitivity and 99% specificity and an AUC value of 0.9949. An example of prediction and associated errors is shown in Table 1. Further input number reductions beyond 30 resulted in a drop in predictive performance, so these 30 inputs were deemed as the optimal set and the most important for predictive performance.

Fig. 2. Relative importance values for ion masses between 3000 and 29 999 Da. These values represent a value obtained from multiple sub-models in which random initial weightings were applied to each.

Table 1. Error values demonstrating the ability of the top 30 ions to accurately predict bacterial strain (a)

Sample number   Actual output   Network output   Error
 1              1               1.0000           0.0000
 2              1               1.0000           0.0000
 3              1               1.0000           0.0000
 4              1               1.0000           0.0000
 5              1               1.0000           0.0000
 6              1               1.0000           0.0000
 7              1               1.1415           0.1415
 8              1               1.0000           0.0000
 9              1               1.0014           0.0014
10              1               1.0647           0.0647
11              2               2.0000           0.0000
12 *            2               1.0000           1.0000
13              2               1.8194           0.1806
14              2               2.0000           0.0000
15              2               2.0000           0.0000
16              2               2.0000           0.0000
17              2               1.9700           0.0300
18              2               2.0000           0.0000
19              2               2.0000           0.0000
20              2               2.0000           0.0000

(a) The actual output is shown by a 1 for N.meningitidis and a 2 for all other species. The prediction from the ANN model is indicated by the network output. All samples were classified correctly except for sample 12 (marked with an asterisk).
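The thresholding rule and the error values in Table 1 can be checked with a short sketch. The actual and network outputs below are copied from Table 1; the function names are illustrative, and assigning an output of exactly 1.5 to the ‘other’ class is an assumption, since the text does not say how that boundary case was handled.

```python
# Actual and network outputs for the 20 samples listed in Table 1.
actual = [1] * 10 + [2] * 10
network = [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.1415, 1.0000, 1.0014, 1.0647,
           2.0000, 1.0000, 1.8194, 2.0000, 2.0000, 2.0000, 1.9700, 2.0000, 2.0000, 2.0000]


def classify(output, threshold=1.5):
    """Class 1 (N.meningitidis) below the threshold, class 2 ('other') otherwise."""
    return 1 if output < threshold else 2


def sensitivity_specificity(actual, predicted, positive=1):
    """Fraction of positives and of negatives that were classified correctly."""
    pos = [p for a, p in zip(actual, predicted) if a == positive]
    neg = [p for a, p in zip(actual, predicted) if a != positive]
    sensitivity = sum(p == positive for p in pos) / len(pos)
    specificity = sum(p != positive for p in neg) / len(neg)
    return sensitivity, specificity


predicted = [classify(o) for o in network]
errors = [round(abs(a - o), 4) for a, o in zip(actual, network)]
misclassified = [i + 1 for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
sens, spec = sensitivity_specificity(actual, predicted)
print(misclassified, sens, spec)
```

Note that the sensitivity and specificity computed here refer only to the 20 example samples in Table 1, not to the full set of 206 samples.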
The 30 ions identified, with their respective relative importance values, are tabulated in Table 2 and it is interesting to note that the majority of these inputs appear to arise in clusters around certain m : z values. Once these key ions were identified, they were then applied further to the blind dataset of 188 samples. In this sample population the model correctly identified 184 out of 188 (97.9%) of the samples, with a sensitivity of 100% and specificity of 96.9%. Figure 3 shows the ROC curve when the model was applied to this data, which resulted in an AUC value of 0.986.

Table 2. The 30 ions with the highest relative importance with regard to strain prediction were ranked according to their molecular mass and were then assigned to particular sub-groups if their masses clustered around specific mass ranges

Input m : z value   Relative importance
 4709               0.19951
 5085               0.07419
 8181               0.00813
10 663              0.19977
10 665              0.08953
11 672              0.00324
11 674              0.01489
11 676              0.02819
11 678              0.02234
12 515              0.00799
12 517              0.00724
14 609              0.04217
14 611              0.01433
19 055              0.03528
19 057              0.02147
19 060              0.01855
19 062              0.00753
19 065              0.01346
19 067              0.00423
19 070              0.01419
19 072              0.01921
23 302              0.01651
23 305              0.017
23 322              0.01444
23 325              0.02724
23 328              0.03176
23 330              0.01239
23 333              0.00186
23 336              0.01831
29 000              0.01503

Fig. 3. Receiver operating characteristic curve for the 30-ion model applied to the second validation dataset consisting of 188 samples.

Cluster analysis, PCA and similarity analysis

A clustering algorithm (complete linkage with distances measured by Euclidean distances) was applied to the data in parallel to the ANNs in order to measure consistencies between the various approaches and to visualize any possible outliers in the two populations, which may in turn lead to some explanations regarding the few samples which were misclassified by the ANN models. The initial cluster analysis was with the 1000 input model and results from this analysis can be seen in Figure 4a–d.

It is evident from Figure 4a–d that this clustering approach, although not used to predict class, was capable of grouping samples according to their known identification reasonably well, with fairly reproducible results as the dimensionality of the data is reduced, although notably with some overlap between the populations. It is interesting to recognize, however, that of the three samples which the ANN incorrectly identified using the top 1000 ions (highlighted in Figure 4a by arrows), two are in clear clusters of the opposite group; for example, the sample on the extreme right of the incorrect classifications was identified using 16S rDNA methods as Kingella denitrificans, but clearly showed more similarities to the N.meningitidis samples than to other species. Furthermore, this trend continued, with the majority of the samples misclassified by the ANN being placed in clusters together with species more closely related to the ANN classification, and not the initial pre-determined species. This is shown even more clearly in Figure 4d, which represents the cluster analysis of the second validation set of samples. Of the four samples which the ANN misclassified as being N.meningitidis (which by 16S rDNA were identified as N.gonorrhoeae), the cluster analysis, like the ANN, placed these in the cluster containing the majority of the N.meningitidis, indicating that these isolates are more similar to the latter than to N.gonorrhoeae by proteome analysis. Thus by using these clustering approaches in concert with ANNs we can visualize the data using cluster analysis, and predict class with the ANNs. Furthermore we can begin to understand the nature of samples which may be outliers, and which may therefore show characteristics that are not archetypal of the species to which they are initially believed to belong, and begin to understand why the ANN models that we have developed predict some species to be one class and not another.

To provide additional support to this, a PCA and similarity analysis were conducted on this validation data (Figs 5 and 6). This PCA matrix shows that all of the N.meningitidis samples (represented in dark grey) fall along one clear vector, all clustered closely together. Whilst it is evident that the other species belonging to the second population exhibit much more variation and fall along a vector far less clustered, it is interesting to see that the four samples which the ANN identified incorrectly as N.meningitidis are clearly placed along the same vector as the N.meningitidis by the PCA, indicated by arrows in black. A similarity analysis was also performed to determine which samples were the most similar to the incorrectly classified ones with respect to their mass spectra. From the 25 most similar samples to these four misclassified ones, 19 were N.meningitidis and just six were ‘other’ closely related species. This again offers an explanation as to why the ANN classified them as N.meningitidis, and suggests that these samples are actually more closely related to what the ANN predicted than to what they were initially identified as using 16S rDNA analysis.

DISCUSSION

The ability to consistently and rapidly identify closely related bacterial species, such as those from the genus Neisseria, could pave the way for new technologies becoming more accessible to aid human judgment as an on-site laboratory-based decision tool. Mass spectrometry methodologies allow for the generation of information regarding thousands of ions in any particular sample.
In order to determine whether ANNs have the learning capabilities to potentially identify subsets of ions whose phenotypic ‘fingerprints’ correlate strongly to a particular bacterial isolate, we applied parameterization methodologies to a relatively large dataset containing N.meningitidis and other closely related species with the aim of correctly classifying the samples into their respective groups.

Model parameterization allows for the identification of molecular ions which are important in classifying the population into their respective groups. Thus, by determining which ions are contributing most in the classification, unimportant and noisy ions may be removed. Using the ‘rolling input subset’ approach described earlier, the number of ions used as inputs in the ANN model was reduced from an initial set of approximately 13 000 (between 3 and 30 kDa) to just the top 1000 which could accurately predict species type. For a system to have practical application in a diagnostic laboratory where there is a high throughput of samples and a demand for rapid analysis, a simple application tool is required; therefore, it was decided that the number of ions needed to be reduced to the smallest number possible which could discriminate between the two species. In order to achieve this, the top 1000 ions were ranked in order of importance in predicting a bacterial strain identity, and the top 100 were selected for further training. This resulted in a model which out-performed the previous one. This process was repeated until the models ceased to improve in predictive capabilities, resulting in a model containing just 30 molecular ions which could predict with a sensitivity of 100% and specificity >96% on a second order validation dataset, which had not been used in any way in the training of the model.
Therefore this approach of analyzing the interconnecting weights of the trained network can be used as a rapid and efficient means to greatly reduce the number of inputs in a model in order to decrease complexity, whilst increasing predictive capabilities as a result of removing noisy inputs, which may be causing a conflict in the model and reducing predictive performance.

Fig. 4. Cluster analysis (complete linkage using Euclidean distance measures) for (a) top 1000 ion model, (b) top 100 ion model, (c) top 30 ion model applied to the original dataset and (d) top 30 ion model applied to the second validation dataset. The black blocks indicate samples which were N.meningitidis whilst the light grey blocks represent ‘other’ species. The dark grey blocks show incorrectly classified samples (also highlighted with arrows).

Prior to analysis, the ANN did not assume any prior relationships between any inputs used in the model; thus it was interesting that the ANN did not appear to identify single peaks around given mass values, but found clusters of ions around masses which were important in the classification. Because of the mass accuracy of the instrument (around 0.2%) and data averaging, these clusters are highly likely to represent the same molecular species, which may be confirmed with further research.

The scope of this study was then expanded to investigate why the ANN incorrectly classified the few samples that it did. This was done mainly by cluster analysis, but also incorporated PCA and a similarity analysis, in which samples were examined with regard to their raw mass spectra, to identify which other samples closely matched these. From the clustering approach it was clear that the majority of the samples that the ANN misclassified in the various models were in fact grouped together with samples from their ANN predicted class and not with the class to which 16S rDNA methods had assigned them.
This was further supported by a PCA on the validation dataset, which showed similar findings. The N.meningitidis samples all lay along a clear vector, together with the non-N.meningitidis samples that the ANN misclassified. A similarity analysis gave the same result: only six of the 25 most similar samples belonged to the ‘other’ species group, whilst 19 of the 25 were N.meningitidis. This further suggests that the misclassified samples are more closely related to their ANN predicted class than to their actual class. It also provides valuable information about the nature of the two populations and about the outliers, which exhibit characteristics typical of both species and so appear to exist in a continuum between the two. This information highlights the need for rapid, highly robust methods which are generalized and can serve as a reliable decision support tool in microbial identification.

Fig. 5. Principal components analysis of the 30-ion model applied to the second validation dataset. The dark grey blocks indicate N.meningitidis samples and the light grey blocks other samples. The black blocks and arrows indicate the samples which were misclassified.

Fig. 6. Similarity analysis to assess which spectra were most closely related to those samples incorrectly classified. Of the 25 most similar samples to the four which were misclassified, 19 were N.meningitidis, concurring with the ANN predictions rather than the 16S rDNA assignments.

CONCLUSIONS

We have shown that ANNs can be used successfully to identify molecular biomarkers whose proteomic profiles correlate strongly with bacterial species. We have identified 30 ions which are capable of correctly identifying 184 out of 188 samples from a separate validation dataset and, furthermore, have shown using cluster analysis and PCA that the few misclassified samples may actually show more resemblance to their ANN predicted class than to their actual class. The majority of these 30 ions appear in clusters around certain mass values, and as such there is a high probability that the ions within a cluster belong to the same molecular species, suggesting that there may be only around 10 individual species present.

Although the dataset used was relatively small (206 samples), all models were validated using a random sample cross-validation approach so that each sample was treated as blind a number of times. This was enhanced further by applying the models to an additional dataset consisting of 188 samples that had not previously been used in the training of the model in any way. The high classification accuracies for this dataset (>96%) support the assumption that a generalized, robust model has been created. This study has therefore shown that ANNs are powerful enough to model molecular microbiological data and to identify potential markers representative of the different species of interest. This may lead to the development of robust tools capable of providing unbiased decision support in the laboratory and of reducing incorrect diagnosis of disease and infections by highly pathogenic organisms.

REFERENCES

Anagnostou,T. et al. (2003) Artificial neural networks for decision-making in urologic oncology. Eur. Urol., 43, 596–603.
Ball,G.R. et al. (2000) Identification of non-linear influences on the seasonal ozone dose response of sensitive and resistant clover clones using artificial neural networks. Ecol. Model., 129, 153–168.
Ball,G. et al. (2002) An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumours and rapid identification of potential biomarkers. Bioinformatics, 18, 395–404.
Balls,G.R. et al. (1996) Towards unravelling the complex interactions between microclimate ozone dose and ozone injury in clover. Water Air Soil Poll., 85, 1467–1472.
Batuello,J.T. et al. (2001) Artificial neural network model for the assessment of lymph node spread in patients with clinically localized prostate cancer. Urology, 57, 481–485.
Bishop,C.M. (1995) Neural Networks for Pattern Recognition. Oxford University Press Inc, New York.
Craven,D.E. et al. (1978) Serogroup identification of Neisseria meningitidis: comparison of an antiserum agar method with bacterial slide agglutination. J. Clin. Microbiol., 7, 410–414.
D’Amato,R.F. et al. (1978) Rapid identification of Neisseria gonorrhoeae and Neisseria meningitidis by using enzymatic profiles. J. Clin. Microbiol., 7, 77–81.
Das,A. et al. (2003) Prediction of outcome in acute lower-gastrointestinal haemorrhage based on an artificial neural network: internal and external validation of a predictive model. Lancet, 362, 1261–1266.
Desilva,C. et al. (1994) Artificial neural networks and breast-cancer prognosis. Aust. Comput. J., 26, 78–81.
Janda,W.M. et al. (1984) Use of the API NeIdent system for identification of pathogenic Neisseria spp. and Branhamella catarrhalis. J. Clin. Microbiol., 19, 338–341.
Khan,J. et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med., 7, 673–679.
Lancashire,L.J., Mian,S., Rees,R.C. and Ball,G.R. (2003) Preliminary artificial neural network analysis of SELDI mass spectrometry data for the classification of melanoma tissue. In 17th European Simulation Multiconference, Nottingham, Society for Modeling and Simulation International, SCS European Publishing House, Erlanger, Germany, pp. 131–135.
Mian,S. et al. (2003) A prototype methodology combining surface-enhanced laser desorption/ionization protein chip technology and artificial neural network algorithms to predict the chemoresponsiveness of breast cancer cell lines exposed to Paclitaxel and Doxorubicin under in vitro conditions. Proteomics, 3, 1725–1737.
Peterson,C. and Ringnér,M. (2003) Analyzing tumor gene expression profiles. Artif. Intell. Med., 28, 59–74.
Reddick,A. (1975) A simple carbohydrate fermentation test for identification of the pathogenic Neisseria. J. Clin. Microbiol., 2, 72–73.
Robinson,M.J. and Oberhofer,T.R. (1983) Identification of pathogenic Neisseria species with the RapID NH system. J. Clin. Microbiol., 17, 400–404.
Rumelhart,D.E. and McClelland,J.L. (1986) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Foundations. MIT Press, Cambridge, MA, USA.
Wei,J.T. et al. (1998) Understanding artificial neural networks and exploring their potential applications for the practising urologist. Urology, 52, 161–172.