Novel quantitative methods that enhance clinical decision support

Project title:
Novel quantitative methods that enhance clinical decision support
based on routine pathology testing
Quality Use of Pathology Program (QUPP)
Final Report (December 2015)
Prepared by:
Dr Tony Badrick, Royal College of Pathologists (QAP),
Associate Professor Brett Lidbury, ANU
Prepared for:
The Commonwealth Department of Health
TABLE OF CONTENTS
EXECUTIVE SUMMARY
GLOSSARY
SUMMARY OF SCIENTIFIC RESULTS
CHANGES, LIMITATIONS EXPERIENCED DURING THE QUPP FUNDING PERIOD
FULL PROJECT ACTIVITY REPORTS - ENTIRE PROJECT PERIOD
(1) LIVER FUNCTION TESTS
(a) Liver Function Tests (LFTs): Increasing the Efficacy of Community Testing
(b) Hepatitis B Virus Testing: Predicting Immunoassay Results for Early and Chronic HBV Infection from Routine Chemistry and Haematology Data
(c) HBV Immunoassay Continued: Predicting Persistent Infection by HBe Antigen or Antibody
(d) Validation of HBV Immunoassay Prediction
(2) VITAMIN D AND RENAL FUNCTION
(3) RED CELL DISTRIBUTION WIDTH
Validation of RDW Predictions
(4) FINAL CONCLUSIONS - HOW THE RESULTS OF THIS ACTIVITY WILL BE USED TO BENEFIT PATHOLOGY STAKEHOLDERS
REFERENCES
APPENDICES (INCLUDING FULL ACTIVITY REPORTS FOR LFT AND RDW PROJECTS)
Appendix (A)
Appendix (B)
Appendix (C)
Project Leaders: Dr Tony Badrick, Royal College of Pathologists (QAP), Sydney NSW
(formerly Health Sciences & Medicine, Bond University), and Associate Professor Brett
Lidbury, Pattern Recognition & Pathology (Department of Genome Sciences), The John
Curtin School of Medical Research, College of Medicine, Biology and Environment, The
Australian National University (ANU).
Acknowledgements: The authors wish to thank Sullivan Nicolaides Pathology (SNP)
Brisbane for their ongoing support of these research investigations via data provision and
advice. Particular thanks to Mr Ashley Arnott who has acted as our contact at SNP, and
arranged the data access. We are also grateful to ACT Pathology (The Canberra Hospital)
for their earlier support through many discussions and data access required for our initial
work in the area of machine learning and pathology informatics. We are particularly grateful
to Dr Gus Koerbin and Mr Michael de Souza (Immunoassay). Special thanks to Dr Alice
Richardson for her expert advice and guidance on a number of statistical and machine-learning problems encountered during the project.
Thanks to the staff in the Research Offices at Bond University and the John Curtin School of
Medical Research, ANU for general administrative support and the facilitation of contracts,
payments and agreements between the Universities and the Commonwealth Department of
Health.
Executive Summary:
(1) Project Objectives. As stated in the original proposal, “The aim of the project is to identify,
using sophisticated computational knowledge discovery methods (e.g. pattern recognition),
relationships between existing routine clinical chemistry and haematology tests to second tier
tests, which will ultimately reduce the need for additional special tests …”. The
availability of mass data, together with machine learning algorithms developed by computer
scientists and statisticians, makes it possible to examine complex medical data as a reflection
of an interacting human physiological and biochemical network. While reference intervals continue
to be important to the interpretation of pathology data, machine learning allows the detection
of previously unrecognised and nuanced relationships between blood test variables, adding
an extra analytical dimension for disease detection and prediction. The results and analyses
presented herein focus particularly on prediction, and specifically, how routine pathology data
can be used to anticipate the outcome of a specialised second tier test (e.g. hepatitis B
immunoassay, serum ferritin), potentially representing significant savings on time, cost and
patient anxiety. The data analysed to reach our conclusions represent community cohorts
collected across Queensland by the SNP network.
(2) Work Activity undertaken. The results and recommendations of this report were based on
the application of two machine learning algorithms, namely, recursive partitioning (“trees”)
and support vector machines (SVM), to large cross-sectional data samples obtained from
Sullivan Nicolaides Pathology (SNP, Taringa, Queensland) representing the
infection/disease state of interest to the specific research question. All data collections
interrogated contained routine chemistry and haematology markers, as well as special
(second tier) assay results where appropriate. Prior to investigation, data was subject to
cleaning and other pre-analytical preparation to provide a suitable population sample.
Descriptive and inferential statistical analyses were also performed to present and interpret
the data.
(3) Evaluation of the Model. From the analytical methods summarised in (2), models were
initially evaluated by assessing the accuracy of outcome prediction (e.g. prediction of HBsAg
positive or negative immunoassay result) after model training and testing. The top predictors
from tree modelling were applied to SVM for higher dimensional investigation (via kernel
selection), to produce definitive predictive models of infection/disease via routine blood and
serum markers. Model evaluation was ultimately performed by validating model predictions
on additional data collections from other time periods, by assessing the accuracy of
predictive rules generated from trees. Data from another laboratory, ACT Pathology
Canberra, were used for validation of predictions obtained from SNP data sourced from
Brisbane, on HBsAg models.
(4) Challenges and how resolved. Challenges encountered during the conduct of the project
were largely of a technical nature, and were resolved gradually as the machine learning
experiments were developed towards the final model. For example, for HBV studies it was
normal to have many more negative immunoassay cases than positive results;
to address this, a range of solutions was tried and successfully implemented, for example, building a
data-weighting component into the machine learning code.
(5) Key Findings/Conclusions (including benefit for pathology stakeholders). Summaries of the scientific results for
individual studies are on the next page. Benefits for stakeholders are the provision of new
decision tools, beyond the reference interval, to assist enhanced diagnostic procedures and
thereafter patient outcomes. A tangible example was the identification of redundant assays;
particular examples were for routine LFTs, IDA testing via RDW and the role of serum ferritin.
Early prediction of HBV infection and persistence via routine blood/serum markers will
support early intervention, particularly for rural and remote laboratories that do not have easy
access to reference labs and second tier assays. The conclusions herein were produced via
blood test results only; combined with patient history and clinical examination it would be
expected that the power of laboratory-based predictions would be further augmented.
Glossary:
| Abbreviation | Definition |
|---|---|
| Ab | Antibody |
| Ag | Antigen |
| AFP | Alpha-Foetal Protein |
| Category | Class * |
| CRP | C-Reactive Protein |
| ESR | Erythrocyte Sedimentation Rate |
| HBV | Hepatitis B virus |
| HBe | Hepatitis B “e” antigen (HBV persistence) |
| HBsAg | Hepatitis B surface Antigen |
| IDA | Iron Deficient Anaemia |
| INR | International Normalised Ratio (PT - Prothrombin Time) |
| LFT | Liver Function Test |
| RDW | Red cell Distribution Width |
| RFA | Random Forest Analysis |
| SVM | Support Vector Machine |
* Class/Category are used interchangeably
** Refer to the footnotes for Table 1 for definitions of routine blood and serum pathology markers, as well as Appendices (B) and
(C)
Period of activity: Final Report (July 2013 – November 2015)
This document will report on the following research activities and results, as stated in the
agreement finalised before receipt of the QUPP funding to support the above project:
1. “All work undertaken for the entire period of the Activity including an evaluation of the
Activity against the Performance Indicators”;
2. “The extent to which the Activity achieved the goal of developing machine learning
modelling to recognise factors in routine clinical chemistry and haematology tests that will
predict secondary special test results thereby obviating the need for additional special
tests”;
3. “The evaluation of outcomes via a collaborating pathology laboratory”;
4. “How the results of this Activity will be used to benefit pathology stakeholders”.
Context: This report must be read in the context of community patient testing. Our project
industry partner, Sullivan Nicolaides Pathology (Brisbane) generously provided the data for
modelling, representing communities across Queensland.
Summary of Scientific Results:
Liver Function Tests (LFTs) - The application of consecutive decision tree (DT) and support
vector machine (SVM) machine learning algorithms to mass data (~ 25,000 cases)
successfully predicted normal or elevated serum gamma-glutamyl transferase (GGT)
concentrations at accuracies of > 80% using only alanine aminotransferase (ALT) and
alkaline phosphatase (ALP). Analyses also calculated decision thresholds to suggest normal
or abnormal GGT response. The addition of other LFT markers (e.g. albumin, LDH) did not
enhance accuracy. Recommendation: ALT and ALP alone are sufficient for routine
community screening.
Hepatitis B virus (HBV) - Applying decision tree machine learning algorithms (single trees
and random forests) to model hepatitis B surface antigen (HBsAg) immunoassay results with
routine clinical chemistry and haematology data revealed predictive patterns among routine
pathology serum/blood markers that successfully predicted an HBsAg positive or negative test
outcome. Patient age at the time of testing was an important predictor variable. Patterns
were distinct for cases with elevated versus normal serum ALT concentrations. For HBsAg
positive cases, tree-based modelling also successfully differentiated HBV e antigen (HBeAg)
and antibody (HBeAb) positive from negative cases through modelling routine clinical
chemistry and haematology results. The HBeAb predictive pattern featured RDW and WCC,
while HBeAg required only a serum albumin threshold, suggesting chronic hepatic damage
from persistent HBV infection. Recommendation: Rules calculated from routine pathology
test results can provide very early indications of whether a patient has been infected with
HBV, particularly for specific age ranges. The progress of HBV infection can also be
predicted through modelling HBeAg results from those who were previously HBsAg
positive. These results will be very valuable to remote, rural and other isolated laboratories
without easy access to immunoassay testing.
Vitamin D testing and Kidney Function - The leading predictor of eGFR (estimated
glomerular filtration rate) stage, indicating degrees of kidney health or disease, was 25-OH
Vitamin D. An associated finding was that serum calcium (corrected for albumin) and
phosphate did not significantly alter with eGFR stage. These conclusions were reached via
DT modelling and inferential statistical analysis. Recommendation: When assessing serum
vitamin D sufficiency or deficiency, kidney function status as reflected by eGFR must be
consulted as a variable that impacts on serum concentration. Aggregated results did not
show a significant impact on serum calcium or phosphate. In spite of the relationship with
25-OH Vitamin D, the decision threshold was high (above or below 72.5 nmol/L) and, as observed, serum
calcium and phosphate were not impacted by kidney impairment, suggesting that Vitamin D
testing may not be required often.
Red cell Distribution Width (RDW) - Based on a previous commentary to the Medical Journal
of Australia (MJA) reporting the potential of RDW as an early marker of anaemia (Dugdale,
2011), analysis of covariance (ANCOVA) modelling was conducted on females and males
divided into age cohorts to assess the value of second tier anaemia tests, particularly serum
ferritin. The results showed that serum ferritin was significantly predicted by RDW, but only
for young females (15 - 50 years of age), and not males of the same age range, or older
females and males (> 55 years of age at testing). RDW was a significant predictor of vitamin
B12 status in older men (> 55 years), but in general RDW was effective only for the
diagnosis of iron-deficient anaemia (IDA), not folate or vitamin B12 deficiency.
Recommendation: RDW is an excellent early marker of IDA as suggested by the previous
MJA comment. RDW predicted serum ferritin as a significant marker, but only for younger
women. When using RDW as an early marker for IDA, running ferritin assays for men and
older women may not be necessary. For older men with elevated RDW, vitamin B12
deficiency should be considered.
Changes, limitations experienced during the QUPP funding period:
Overall, the Activity achieved all of its goals as stated in the original proposal. However, there
were some changes from the research approach originally described, and the movement of a
collaborator and a chief investigator influenced progress and/or altered the stated order
of investigation.
• Associate Professor Tony Badrick left Bond University in January 2015, but will continue
as an adjunct academic at that institution. This move will not change the original
arrangement of the grant leadership or investigator position on the grant, but Tony’s move
to a new position did slow activity progress temporarily;
• A key collaborator and contact at ACT Pathology (Canberra) left to take another position
interstate. This move impacted our ability to obtain validation data from another laboratory
(although previously obtained data can be used for a validation study);
• It was originally stated that the production of manuscripts from the project would occur in
the final six months of the project. Opportunities to publish came earlier than anticipated: two
papers have been written based on the results of this QUPP study, and both were
published during 2015 (LFT and RDW studies: Appendices D and E). This delayed the
final stage of machine learning modelling via SVMs. Please note that the final publications
for the LFT and RDW projects will be attached to this report, and will deliver the specific
final documentation for each of these sub-projects.
Full Project Activity Reports - Entire Project Period
(1) Liver Function Tests
Liver Function Tests - Increasing the Efficiency of Routine Community Testing, and Earlier
Prediction of Second Tier Tests (Hepatitis B Immunoassay) via Routine Data.
(a) Liver Function Tests (LFTs): Increasing the Efficacy of Community Testing – A
study reporting the interactions between the routine LFT markers ALT, serum albumin, total
bilirubin, LDH, GGT and ALP was published during early 2015. This study focussed on the
relationship of serum GGT with other routine LFT markers, and found that serum GGT could
be predicted at high percentage accuracy by ALT and ALP alone. For screening of
community patients this implies that ALT and ALP may be sufficient, particularly for cases of
drug or alcohol abuse where increases above calculated ALT and ALP concentrations
accurately suggest GGT elevation. Significant interactions were also detected for LFT and
other chemistry markers (e.g. cholesterol) to distinguish between degrees of serum GGT
elevation above the upper limit of the reference range. Overall, an ALT level > 30 U/L in
combination with an ALP level > 100–125 U/L suggested an elevated GGT at accuracies of >
80% (with variation associated with sex, but not age). This was also a proof-of-concept study
for machine learning value.
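The screening thresholds quoted above can be written as a simple rule. The following is a minimal sketch only, not the published model: the function name and the `alp_threshold` parameter are hypothetical, and because the report gives the ALP cut-off as a band (100–125 U/L, varying with sex), the default of 100 U/L is an assumption chosen for illustration.

```python
def ggt_likely_elevated(alt_u_per_l, alp_u_per_l, alp_threshold=100.0):
    """Return True when the quoted ALT/ALP pattern suggests an elevated GGT.

    Thresholds are those quoted in the text: ALT > 30 U/L combined with
    ALP above a cut-off somewhere in the 100-125 U/L band.
    alp_threshold is a hypothetical parameter; its default is an assumption.
    """
    if not 100.0 <= alp_threshold <= 125.0:
        raise ValueError("alp_threshold should lie in the reported 100-125 U/L band")
    return alt_u_per_l > 30.0 and alp_u_per_l > alp_threshold


# Illustrative (hypothetical) patient values
print(ggt_likely_elevated(45.0, 130.0))
print(ggt_likely_elevated(25.0, 130.0))
```

The rule only flags a pattern; the reported accuracy of > 80% applies to the published models, not to this sketch.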
Please refer to the report Appendix (D) for the complete and published paper on the machine
learning modelling of data to provide a strategy for individual LFT redundancy; paper entitled
- Assessment of machine learning techniques on large pathology data sets to address assay
redundancy in routine liver function test profiles.
(b) Hepatitis B Virus Testing: Predicting Immunoassay Results for Early and Chronic HBV
Infection from Routine Chemistry and Haematology Data - The following analyses were
conducted on SNP community patient data that comprised results for HBsAg alone and HBe
antigen or antibody immunoassay (response variable) and routine clinical chemistry and
haematology results (predictor variables). The HBsAg and HBe responses were categorised
as either “positive” or “negative” as dictated by specific SNP reference intervals and/or
confirmatory test results. The tree-based machine learning algorithms were applied to this
data to ascertain predictor variable patterns and thresholds to differentiate immunoassay
positive from negative responses. These analyses were done for all cases, as well as for
sub-cohorts separated by an elevated or normal ALT response.
Initial HBV infection (HBsAg) Prediction - The final cohort for investigation comprised 3942
individuals, with 42% male and 58% female cases. Age ranged from 17 to 111 years of age.
Cases were sorted and duplicates removed prior to analysis. The patient samples that produced the
data analysed were collected by Sullivan Nicolaides Pathology between 1 June 2011
and 31 May 2012. Approximately 85 - 90% of cases were HBsAg negative, thus
producing a significant data imbalance. Negative cases were divided into 6 - 8 subcategories
of similar sample size as the HBsAg positive cohort. Age is a very powerful predictor of
HBsAg positive versus negative immunoassay results, therefore only HBsAg negative
subcategories where the mean age did not differ significantly (p > 0.10) from the HBsAg
positive cohort were used for machine learning analyses. Mean age for HBsAg positive
cases generally incorporated an age range of late-thirties to mid-forties. Mean age and other
variables included in modelling were compared by unpaired t-test (Table 1), and
comparisons also performed by one-way ANOVA with post-hoc analysis of subcategories
done by Sidak or Dunnett methods (SPSS version 22).
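The age-matching criterion (p > 0.10) can be checked from summary statistics alone. The sketch below recomputes an unpaired, equal-variance t statistic from the age means, standard deviations and group sizes reported in Table 1. It is not the SPSS procedure used in the study: the function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable only because the degrees of freedom are large (about 900).

```python
import math
from statistics import NormalDist


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Unpaired (equal-variance) t statistic computed from group summaries."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / std_err


def two_sided_p_normal_approx(t_stat):
    """Two-sided p-value via a normal approximation (assumes large df)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


# Age at test, as reported in Table 1: HBsAg positive vs negative sub-category 6
t_age = pooled_t_from_summary(42.36, 13.172, 470, 41.11, 13.894, 434)
p_age = two_sided_p_normal_approx(t_age)
age_matched = p_age > 0.10  # the matching criterion stated in the text

print(f"t = {t_age:.3f}, approximate p = {p_age:.3f}, age-matched: {age_matched}")
```

A negative sub-category that fails this check would be excluded from modelling under the criterion described above.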
Table 1 summarises the comparison of HBsAg positive cases (n = 470) with HBsAg negative
cases (sub-category 6) (n = 434), and includes significance as estimated by unpaired t-test.
Age was not significantly different between the HBsAg cohorts (as stated above, age is a
strong predictor of HBV infection - the negative cohort was age-matched to the positive
group). Routine markers that were significantly different at p < 0.001 were serum albumin,
cholesterol, globulin, MCV, monocytes, neutrophils, platelets and WCC. Total white cell
count, neutrophils and monocytes for HBsAg positive cases had significantly reduced means
compared to the negative cohort, as did platelet count. Serum globulin, which is produced by
the liver, was significantly increased for HBsAg infection (Figure 4).
Analysis by the tree machine learning algorithms provided a predictive model of HBsAg
response (Figure 1). Random Forest algorithms, where 500 – 10,000 individual trees were
run and the predictor variables (routine chemistry and haematology markers) were ranked in
order of importance for classification as HBsAg positive or negative, were run on all cases
(Fig. 1a). Identical analyses were performed for cases with elevated ALT (Fig. 1b), or ALT
within the reference interval (Fig. 1c). Figure 1(d) shows the single decision tree results for
the same data used to produce the random forest summarised by Fig. 1(a).
Figure 1 summarises the varied HBsAg prediction patterns according to cohort stratification
by ALT response. With all cases included (ALT range = 4 - 3334 U/L), serum globulin was
the top ranked predictor variable for HBsAg classification as positive or negative, followed by
platelet count, neutrophils and monocytes (Fig. 1a). In spite of controlling for age, age was
the clear top predictor for cases with elevated ALT (50 - 3334 U/L), followed by the renal
markers urea and eGFR (Fig. 1b). Where all cases had an ALT result within reference range
(4 - 49 U/L), bilirubin, RDW and lymphocytes were the top positive/negative classification
predictors. The random forest algorithm also calculated an “out-of-bag” (OOB) estimate for
prediction error rate (Table 2).
Table 1 - Mean serum and blood markers for patients testing positive or negative for hepatitis B surface antigen
(HBsAg). HBsAg positive and negative cohorts comprise female and male cases of age 17 to 111 years, with no
significant difference in mean age between groups.
| Variable | HBsAg Category (0 = HBsAg POS, 6 = HBsAg NEG) | N | Mean | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| Age at Test | 0 | 470 | 42.36 | 13.172 | .608 |
| Age at Test | 6 | 434 | 41.11 | 13.894 | .667 |
| Albumin * | 0 | 470 | 43.42 | 3.355 | .155 |
| Albumin * | 6 | 434 | 44.20 | 3.159 | .152 |
| ALT | 0 | 470 | 47.00 | 182.464 | 8.416 |
| ALT | 6 | 434 | 32.91 | 74.381 | 3.570 |
| Anion Gap | 0 | 447 | 10.30 | 2.862 | .135 |
| Anion Gap | 6 | 432 | 10.61 | 2.870 | .138 |
| AST | 0 | 470 | 35.71 | 93.555 | 4.315 |
| AST | 6 | 433 | 26.62 | 39.971 | 1.921 |
| Basophils | 0 | 381 | .0279 | .02211 | .00113 |
| Basophils | 6 | 391 | .0304 | .02602 | .00132 |
| Bilirubin # | 0 | 470 | 11.18 | 14.566 | .672 |
| Bilirubin # | 6 | 434 | 8.89 | 8.947 | .429 |
| Calcium (corrected) # | 0 | 447 | 2.3069 | .07814 | .00370 |
| Calcium (corrected) # | 6 | 432 | 2.3221 | .08889 | .00428 |
| Cholesterol * | 0 | 447 | 4.892 | 1.0393 | .0492 |
| Cholesterol * | 6 | 434 | 5.243 | .1344 | .0065 |
| Creatinine ^ | 0 | 449 | 77.12 | 19.135 | .903 |
| Creatinine ^ | 6 | 432 | 84.35 | 70.228 | 3.379 |
| eGFR # | 0 | 446 | 83.17 | 10.670 | .505 |
| eGFR # | 6 | 426 | 80.53 | 14.205 | .688 |
| Eosinophils | 0 | 382 | .2177 | .19709 | .01008 |
| Eosinophils | 6 | 392 | .1947 | .19023 | .00961 |
| ESR | 0 | 82 | 11.63 | 18.571 | 2.051 |
| ESR | 6 | 100 | 11.72 | 15.360 | 1.536 |
| Fasting Glucose ^ | 0 | 188 | 5.518 | 1.4275 | .1041 |
| Fasting Glucose ^ | 6 | 175 | 5.225 | .8145 | .0616 |
| GGT | 0 | 470 | 33.32 | 52.396 | 2.417 |
| GGT | 6 | 434 | 36.13 | 63.372 | 3.042 |
| Globulin * | 0 | 470 | 28.40 | 5.163 | .238 |
| Globulin * | 6 | 434 | 26.34 | 4.186 | .201 |
| Haematocrit | 0 | 383 | .4262 | .04189 | .00214 |
| Haematocrit | 6 | 394 | .4266 | .04363 | .00220 |
| Haemoglobin | 0 | 382 | 141.437 | 15.7178 | .8042 |
| Haemoglobin | 6 | 393 | 140.824 | 15.7053 | .7922 |
| LDH | 0 | 447 | 173.86 | 44.700 | 2.114 |
| LDH | 6 | 429 | 176.90 | 49.508 | 2.390 |
| Lymphocytes | 0 | 382 | 2.0373 | .60038 | .03072 |
| Lymphocytes | 6 | 392 | 2.1336 | .75806 | .03829 |
| MCH ^ | 0 | 382 | 29.601 | 2.3950 | .1225 |
| MCH ^ | 6 | 392 | 29.918 | 1.8456 | .0932 |
| MCHC | 0 | 382 | 332.13 | 11.773 | .602 |
| MCHC | 6 | 392 | 330.67 | 10.579 | .534 |
| MCV * | 0 | 382 | 89.15 | 5.899 | .302 |
| MCV * | 6 | 392 | 90.56 | 4.892 | .247 |
| Monocytes * | 0 | 382 | .4776 | .19117 | .00978 |
| Monocytes * | 6 | 392 | .5615 | .21289 | .01075 |
| Neutrophils * | 0 | 382 | 3.6608 | 1.69202 | .08657 |
| Neutrophils * | 6 | 392 | 4.3909 | 1.76968 | .08938 |
| Platelets * | 0 | 373 | 223.32 | 58.703 | 3.040 |
| Platelets * | 6 | 390 | 249.60 | 67.960 | 3.441 |
| RDW | 0 | 382 | 13.427 | 1.1809 | .0604 |
| RDW | 6 | 392 | 13.476 | 1.1108 | .0561 |
| RCC | 0 | 382 | 4.802 | .5160 | .0264 |
| RCC | 6 | 392 | 4.731 | .5280 | .0267 |
| Triglycerides | 0 | 131 | 1.315 | .9123 | .0797 |
| Triglycerides | 6 | 149 | 1.432 | .8927 | .0731 |
| Urea | 0 | 448 | 5.106 | 1.5916 | .0752 |
| Urea | 6 | 432 | 5.314 | 2.6042 | .1253 |
| WCC * | 0 | 382 | 6.417 | 2.0002 | .1023 |
| WCC * | 6 | 392 | 7.310 | 2.1047 | .1063 |
* p < 0.001 # p < 0.05 ^ p < 0.01.
ALT - Alanine Aminotransferase; AST - Aspartate Aminotransferase; ESR - Erythrocyte Sedimentation Rate;
eGFR - Glomerular Filtration rate (estimated); GGT - Gamma glutamyl transferase; MCV – Mean Corpuscular
Volume; MCH - Mean Corpuscular Haemoglobin; MCHC - Mean Corpuscular Haemoglobin Concentration; LDH –
Lactate Dehydrogenase; RCC - Red Cell Count; RDW – Red cell Distribution Width; WCC – White Cell Count.
Figure 1 - Random Forest analysis to identify the leading routine pathology assay predictors of HBsAg positive or
negative status for (a) all cases, (b) where cases had elevated ALT, and (c) cases with normal (within reference
interval) ALT serum concentrations. A single decision tree (d) example is presented summarising the all-case
analysis from random forest (a). Within the single decision tree predictor variable thresholds are calculated to
formulate rules to guide HBsAg positive or negative prediction. The data used for the above tree models was age-matched and is summarised in Table 1. HBsAg Positive = 0, HBsAg Negative = 6 (age-matched sub-category).
Table 2 - HBsAg positive or negative classification out-of-bag error rate for (a) all cases analysed by random
forest (overall error rate = 34.32%), (b) cases with elevated ALT (overall error rate = 3.36%), and (c) cases with
ALT within the laboratory reference interval (overall error rate = 31.22%). Tables support the results presented in
Figure 1a - 1c. Classification error was calculated using all predictor variables presented in Fig. 1a, whereas the
top 4 predictors were used for the calculation of error rate for elevated and normal ALT response (Fig. 1b, Fig. 1c)
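(a)

| HBsAg Category | Prediction: HBsAg Pos. | Prediction: HBsAg Neg. | Classification Error |
|---|---|---|---|
| Positive | 231 | 131 | 0.362 |
| Negative | 124 | 257 | 0.325 |

(b)

| HBsAg Category | Prediction: HBsAg Pos. | Prediction: HBsAg Neg. | Classification Error |
|---|---|---|---|
| Positive | 54 | 4 | 0.069 |
| Negative | 0 | 61 | 0.00 |

(c)

| HBsAg Category | Prediction: HBsAg Pos. | Prediction: HBsAg Neg. | Classification Error |
|---|---|---|---|
| Positive | 227 | 110 | 0.326 |
| Negative | 117 | 273 | 0.300 |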
For all cases (Table 2a) and cases with ALT within the normal range (Table 2c), error rates
were lower for the prediction of HBsAg negative classification at 30 - 32.5%, showing that the
correct prediction of negative cases was between 67.5 - 70%. HBsAg positive classification
prediction had higher error rates at 33 - 36% (suggesting correct HBsAg prediction at 64 - 67%).
The analysis of cases with elevated ALT (Table 2b) showed perfect prediction (0%
error rate) of HBsAg negative and the correct prediction of 93.1% of HBsAg positive cases,
suggesting that random forest machine learning is particularly effective for the later phases of
initial HBV infection, when liver damage occurs and the subsequent increase in transaminase
enzymes is detectable in the serum. This prediction for elevated ALT cases was achieved with
only patient age, serum urea, eGFR and GGT. As expected, mean ALT and AST were higher
for HBsAg positive cases, while mean GGT was < 100 U/L for positive cases, in keeping with
a known LFT marker pattern associated with viral hepatitis (results not shown).
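The class and overall error rates quoted for Table 2 follow directly from the confusion counts. As a minimal sketch (the function and variable names are hypothetical, and rows are read as the actual HBsAg category with columns as the predicted category), the arithmetic is:

```python
def class_errors(confusion):
    """Per-class and overall error from a confusion table.

    confusion maps each actual class to a dict of {predicted class: count}.
    """
    per_class = {}
    wrong = 0
    total = 0
    for actual, row in confusion.items():
        n_cases = sum(row.values())
        misclassified = n_cases - row.get(actual, 0)
        per_class[actual] = misclassified / n_cases
        wrong += misclassified
        total += n_cases
    return per_class, wrong / total


# Counts copied from Table 2 (a) all cases, (b) elevated ALT, (c) normal ALT
table_2a = {"pos": {"pos": 231, "neg": 131}, "neg": {"pos": 124, "neg": 257}}
table_2b = {"pos": {"pos": 54, "neg": 4}, "neg": {"pos": 0, "neg": 61}}
table_2c = {"pos": {"pos": 227, "neg": 110}, "neg": {"pos": 117, "neg": 273}}

for label, table in (("a", table_2a), ("b", table_2b), ("c", table_2c)):
    per_class, overall = class_errors(table)
    print(label, {k: round(v, 3) for k, v in per_class.items()}, f"{overall:.2%}")
```

For panel (b), 4 of 58 positive cases were misclassified, which corresponds to the 93.1% correct prediction of HBsAg positive cases stated above.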
Figure 1(d) provides an example of a single decision tree (10-fold cross validation) analysis
of all cases (to assist tree interpretation, the complexity parameter (cp) was increased
from the default of 0.01 to 0.02). As shown for the random forest analysis of the same data
(Fig. 1a), globulin, neutrophils, platelets and monocytes were the leading predictor variables
of HBsAg positive or negative immunoassay results. The advantage of single decision trees
is the calculation of decision thresholds for each predictor used to understand a response,
allowing the formulation of “rules” to define the classification accuracy of interest. As with
random forests, classification accuracy is also calculated. The following rules
give the most accurate prediction of HBsAg positive or negative for all cases,
regardless of ALT response:
(1) Neutrophils < 3.875 + Globulin < 21.5 = HBsAg Negative (73.9% accuracy)
(2) Neutrophils < 3.875 + Globulin > 21.5 + Monocytes < 0.335 = HBsAg Positive (78.9%
accuracy).
Therefore, with two or three routine pathology markers, predictions of whether a
patient has been infected with HBV can be made with an accuracy of ~ 75%.
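Rules (1) and (2) can be expressed as a small function. This is an illustrative sketch only: the function name is hypothetical, inputs are assumed to be in the same units as Table 1, and cases not covered by the two quoted rules are returned as undecided rather than guessed, since the report does not list the remaining branches of the tree.

```python
from typing import Optional


def hbsag_rule(neutrophils: float, globulin: float, monocytes: float) -> Optional[str]:
    """Apply the two decision-tree rules quoted in the text.

    Returns "negative", "positive", or None when neither rule applies.
    """
    if neutrophils < 3.875:
        if globulin < 21.5:
            return "negative"  # rule (1): reported accuracy 73.9%
        if globulin > 21.5 and monocytes < 0.335:
            return "positive"  # rule (2): reported accuracy 78.9%
    return None  # outside the two rules quoted; no prediction is made


# Hypothetical marker values for illustration
print(hbsag_rule(neutrophils=3.0, globulin=20.0, monocytes=0.5))
print(hbsag_rule(neutrophils=3.0, globulin=28.0, monocytes=0.3))
print(hbsag_rule(neutrophils=5.0, globulin=28.0, monocytes=0.3))
```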
Conclusions – Tree Analyses: The tree analyses, both forests and single decision trees,
provide an excellent precursor to SVM modelling. This style of machine learning is effective
for dimension reduction, namely, the identification of the leading predictors of class/category.
Figure 1 shows the full results of three forest analyses; it is important to note that the error
rates (accuracies) presented in Table 2 were calculated for the top 3 – 5 predictor variables.
For all random forest models, the inclusion of the additional 15 or more predictor variables
only increased prediction accuracy by 1 – 2%. The application of single trees to the same
data set confirms the primacy of the leading predictors, as illustrated by Figs. 1a and d. The
calculations of multiple predictor variable thresholds provide the basis for constructing simple
rules for category prediction, as displayed above for the interaction of neutrophils, serum
globulin and monocytes.
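To illustrate how a single decision tree arrives at a predictor threshold, the sketch below searches one variable for the cut-point that minimises weighted Gini impurity. This is a simplified, assumed version of the procedure: the report does not state the splitting criterion used, the function names are hypothetical, and the globulin values and labels are invented toy data, not study data.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == 1) / n
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Return (cut, weighted impurity) for the best single split on one variable."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut-point between identical values
        cut = (low + high) / 2.0
        left = [y for v, y in pairs if v < cut]
        right = [y for v, y in pairs if v >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# Hypothetical toy data: serum globulin values and HBsAg status (1 = positive)
globulin = [20.0, 21.0, 22.0, 27.0, 29.0, 31.0]
hbsag_positive = [0, 0, 0, 1, 1, 1]
cut, impurity = best_threshold(globulin, hbsag_positive)
print(cut, impurity)
```

A full tree repeats this search over every predictor at each node, which is how combined thresholds such as those for neutrophils, globulin and monocytes arise.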
The random forest results summarised in Fig. 1 emphasise the impact of ALT concentration
on predictor profile for HBsAg category. ALT is routinely measured for suspected HBV cases,
and has importance as a marker of infection phase, indicating a shift from the pre-clinical,
asymptomatic phase to active virus-mediated liver damage and symptom development
(Álvarez-Suárez et al. 2010). The inclusion of all cases regardless of ALT concentration
produced different predictor patterns compared to cases selected with elevated ALT (Fig.
1b), and cases selected with ALT within the reference range (Fig. 1c), with profound
differences also noted for ALT selected cases (selection based on ALT concentrations was
done for both HBsAg classes). Of course, selecting cases by ALT as elevated or within the
reference interval affects the profile of other markers used for category prediction; for
example, with elevated ALT it is more likely that serum ALP and LDH are also elevated. As stated
above, this is important to monitoring the shift from pre-clinical to clinical viral hepatitis. Of
particular note from the random forest analysis for elevated ALT cases was the very high
accuracy of category prediction, suggesting that tree analysis is useful for profiling cases once
they have shifted into the clinical phase. However, the relatively small sample sizes used for these
predictions mean that further testing with a larger data set is required.
Support Vector Machine (SVM) Analysis of HBsAg Detection by Immunoassay: SVM is a
powerful classification tool, modelling data in “high-dimensional space” as kernel patterns
that represent linear, radial and other distribution models; these features allow the analysis of
complex data without data distribution complications. The other advantage of SVM is that,
unlike the tree-based algorithms that rely on node purity to predict a response classification,
this machine learning algorithm does not discard cases. Please see Appendix D for the recent
publication on SVM modelling of LFT data, and Karatzoglou et al. for the theoretical
foundations of SVM (Karatzoglou A, et al., 2006).
The SVM models represented by Figures 2 and 3 predicted HBsAg positive cases at 70.1%,
while negative results were predicted at 63.5%, using the routine serum and blood markers
ALT, neutrophils, platelets, monocytes and globulin, as well as age. Prior to modelling, the
data matrix for SVM was tuned and the optimal cost and gamma (γ) coefficients were
calculated as 8 and 0.1 respectively. With the response (dependent) variable represented as
HBsAg positive or negative categories/classes, the C-classification method was applied to
this SVM model, with a radial kernel chosen as a recommended formula for data with complex
features (in the HBsAg data set, predictor variable distributions varied; for example,
haematology markers followed a normal distribution while enzyme markers followed a variety
of skewed distributions).
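As an illustration of the radial kernel named above, the following minimal Python sketch computes the standard radial basis function similarity between two cases, K(u, v) = exp(-gamma * ||u - v||^2), with gamma = 0.1 as reported. It is not the code used for this project; the function name and the example vectors are assumptions made for illustration only.

```python
import math


def rbf_kernel(u, v, gamma=0.1):
    """Radial (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Hypothetical, already-scaled marker vectors (illustrative values only).
case_a = [0.2, -1.0, 0.5]
case_b = [0.2, -1.0, 0.5]
case_c = [1.2, 0.0, -0.5]

print(rbf_kernel(case_a, case_b))  # identical cases give a similarity of 1.0
print(rbf_kernel(case_a, case_c))  # more distant cases give a value closer to 0
```

A larger gamma makes the similarity fall away faster with distance, which is why gamma is tuned together with the cost parameter.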
The HBsAg positive and negative classes used for SVM modelling were age-matched and drawn
from the same data set summarised in Table 1 and applied for tree modelling (Figure 1). Age
was investigated across the spectrum of 20 – 90 years, including the kinetics of age as a
predictor of HBV infection (Fig. 3). Similarly, ALT is a powerful predictor of liver damage
post-infection, so was also featured in the SVM models presented in Figure 2.
Data imbalance between classes was a problem, with approximately 10 times more HBsAg
negative immunoassay cases than positive cases. Weighting the classes within the SVM
code was effective (data not shown) and was subsequently confirmed by randomising the
negative cases (via the Excel RAND() function) and running a similar number of randomised
negative cases as a sub-population versus all available positive cases as the categorical
response for modelling. The SVM results presented here are those for the randomised
negative cases.
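The randomisation step described above can be sketched as random undersampling of the majority class. The report used the Excel RAND() function; the Python version below substitutes the standard library random module, and the function name, seed and case labels are hypothetical.

```python
import random


def undersample_majority(majority, minority, seed=0):
    """Randomly draw as many majority-class cases as there are minority-class cases."""
    majority = list(majority)
    minority = list(minority)
    if len(majority) <= len(minority):
        return majority, minority
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(majority, len(minority)), minority


# Illustrative identifiers only, mimicking a roughly 10:1 imbalance.
negatives = [f"neg_{i}" for i in range(1000)]
positives = [f"pos_{i}" for i in range(100)]

neg_subset, pos_all = undersample_majority(negatives, positives, seed=42)
print(len(neg_subset), len(pos_all))  # 100 100
```

Sampling without replacement keeps every selected negative case unique, so the balanced set contains no duplicated records.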
The general SVM models were run with the top six predictor variables of HBsAg positive or
negative immunoassay results from decision trees and forests (Fig. 1a), namely ALT,
neutrophils, platelets, monocytes, globulin and age, which provided the model for
calculations of prediction accuracy. However, only the relationship between serum globulin
and platelets was plotted for Figures 2 and 3, with slices introduced into the models for ALT
(20 – 100 U/L, Fig. 2) and age (20 – 90 years at the time of HBV immunoassay, Fig. 3).
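The idea of a slice can be sketched as follows: two predictors are varied over a grid while the remaining predictors are held at fixed values, and the fitted model is then asked for a class at every grid point. The Python below only builds such a grid. The function name and the fixed values for age, neutrophils and monocytes are assumptions, not values taken from the project code.

```python
def slice_grid(x_name, x_values, y_name, y_values, fixed):
    """Return one feature row per (x, y) grid point, other predictors held fixed."""
    rows = []
    for y in y_values:
        for x in x_values:
            row = dict(fixed)  # copy the fixed slice values
            row[x_name] = x
            row[y_name] = y
            rows.append(row)
    return rows


# Only ALT = 20 U/L matches a slice named in the report; the rest are hypothetical.
fixed_slice = {"ALT": 20, "Age": 40, "Neutrophils": 4.0, "Monocytes": 0.5}
platelets = range(0, 501, 100)  # x 10^9/L
globulin = range(20, 71, 10)    # g/L

grid = slice_grid("Platelets", platelets, "Globulin", globulin, fixed_slice)
print(len(grid))  # 36 grid points (6 platelet values x 6 globulin values)
```

Colouring each grid point by its predicted class would give a two-dimensional classification plot of the kind described for Figures 2 and 3.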
Figure 2 – SVM plots describing the interaction of serum globulin, blood platelets and serum ALT for the
classification of Hepatitis B surface antigen (HBsAg) positive (blue) versus HBsAg negative (pink) cases,
as previously detected by specific HBsAg immunoassay. (a) ALT = 20 U/L; (b) ALT = 30 U/L; (c) ALT = 50 U/L; (d)
ALT = 100 U/L.
The total SVM model to separate HBsAg positive from negative responses included the predictor (independent)
variables globulin (g/L), ALT (U/L), age (years), neutrophils (x 10^9/L), platelets (x 10^9/L) and monocytes (x 10^9/L):
(cost = 8, gamma = 0.1, C-classification method and radial kernel). After 10-fold training/testing of the data set,
HBsAg positive cases were predicted at 70.1%, while negative results were predicted at 63.5%.
HBsAg Positive = 0 (Blue), HBsAg Negative = 6 (Pink: age-matched sub-category).
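The 10-fold training/testing mentioned in the caption can be sketched as a k-fold split, in which every case is used for testing exactly once and for training in the other folds. This Python sketch is an assumption about the general procedure, not the project's own code; the function name, seed and case count are hypothetical.

```python
import random


def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)       # randomise case order once
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(25, k=10, seed=1))
print(len(splits))  # 10 train/test pairs
```

Per-class prediction rates would then be calculated from the test-fold predictions pooled over all ten folds.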
[SVM classification plots, panels (a) – (d). Axes: Platelets and Globulin.]
Figure 3 – SVM plots describing the interaction of serum globulin, blood platelets and patient age at the time of
HBV testing for the classification of Hepatitis B surface antigen (HBsAg) positive (blue) versus HBsAg
negative (pink) cases, as previously detected by specific HBsAg immunoassay. (a) Age = 20 years; (b) Age =
30 years; (c) Age = 40 years; (d) Age = 50 years; (e) Age = 60 years; (f) Age = 70 years; (g) Age = 80 years;
(h) Age = 90 years. The total SVM model to separate HBsAg positive from negative
responses included the predictor (independent) variables globulin (g/L), ALT (U/L), age (years), neutrophils (x
10^9/L), platelets (x 10^9/L) and monocytes (x 10^9/L): (cost = 8, gamma = 0.1, C-classification method and radial
kernel). After 10-fold training/testing of the data set, HBsAg positive cases were predicted at 70.1%, while
negative results were predicted at 63.5%. HBsAg Positive = 0, HBsAg Negative = 6 (age-matched sub-category).
ALT Kinetics Associated with the Primary Predictor Variables Globulin and Platelets: For the
SVM plots presented in Figure 2, the blue area represents the HBsAg positive cases, and the
pink area the negative HBsAg immunoassay results. A serum ALT of 20 U/L (Fig. 2a) is well below the
upper limit of the SNP reference interval for this LFT marker, and therefore represents
HBsAg positive cases prior to liver damage. Of interest is the window of approximately 40 –
50 g/L serum globulin for very low platelet counts. With increasing platelet count, up to 300 x
10^9/L, the globulin range extends to approximately 25 – 55 g/L before gradually decreasing
to an upper limit of 40 g/L. Platelets greater than 300 x 10^9/L and globulin less than 25 g/L
result in only negative case predictions. For Fig. 2(a) as well as Figs. 2(b – d), a
pronounced relationship between globulin and platelets was detected, which interacted with
ALT, as demonstrated by the alterations in the globulin – platelet relationship with increasing
serum concentrations of this enzyme. For Figures 2(b – d), while the upper range of globulin
stayed at approximately 55 g/L, two other features of the HBsAg positive population (blue) were
pronounced with the increase of ALT from 30 to 100 U/L. The first was the increase in platelet
count for the HBsAg positive category to almost 500 x 10^9/L, and the second the widening of the
globulin concentration range due to gradual decreases in the lower limit associated with increasing ALT. At
100 U/L ALT, the globulin range for positive HBsAg was from less than (<) 20 g/L to
approximately 55 g/L (at ~ 150 x 10^9/L platelets). At ALT 50 U/L, very low serum globulin (<
20 g/L) for HBsAg positive was associated with platelet counts from approximately 200 – 480
x 10^9/L; for ALT 100 U/L the platelet counts for positive cases ranged from almost
zero to 480 x 10^9/L, with a wide globulin range from < 20 g/L to > 50 g/L.
The SVM investigation summarised in Figure 2 emphasises the interaction of ALT with platelet
count. An increase in ALT from 20 to 30 U/L resulted in a dramatic increase in the upper
range of platelet count for positive cases. With further ALT increases (50 and 100 U/L),
platelets increase only slightly in comparison to 30 U/L, with the serum globulin range
reducing dramatically.
The value of the Figure 2 analyses could be for HBV-infected, asymptomatic patients with
low serum ALT concentrations, well within the reference interval. Elevated serum globulin in
the range of 40 – 50 g/L in the presence of a very low platelet count (< 100 x 10^9/L) may be an early
warning of infection, particularly in individual cases with suggestive histories (e.g.
intravenous drug use, recent travel to environments with poor sanitation/contaminated water
supply). The widest globulin concentration range (approx. 25 – 30 g/L to 55 g/L) was found
for platelet counts from close to 200 x 10^9/L to slightly above 300 x 10^9/L. Within these
ALT, globulin and platelet boundaries, additional decision support can be calculated to allow
the earliest possible detection of HBV infection, and this was achieved with only three routine
markers.
The sensitivity of globulin, platelets and some white cell markers to HBV infection is
supported further by the kinetics of these serum and blood markers across the age range
(Figure 4). When comparing HBsAg positive with negative cases across the range of ages
tested (20 – 90 years), mean globulin is consistently and significantly higher for positives
(ANOVA, p < 0.01).
Impact of Age on the SVM Prediction of HBsAg Immunoassay Result by Globulin and
Platelets: As noted earlier, the age of the patient at the time of HBsAg testing is a powerful
predictor of immunoassay outcome. This is most likely a reflection of the time of life where
people are more likely to be involved in high-risk behaviours that expose them to HBV
infection. Across the many studies conducted for this project, the mean age of HBsAg
positive cohorts fell within a narrow range from the late thirties to the early – mid
forties (Table 1).
Figure 3 examines the impact of increasing age on prediction of HBsAg outcome by serum
globulin and platelet count. The age range introduced into the SVM model was from 20 – 90
years at the time of HBsAg testing. The age factor, as done for ALT (Fig. 2), was introduced
into the SVM model as a static slice, hence providing a model of globulin – platelet
interaction at that specific age.
At 20 years of age (Fig. 3a) the distribution of negative (pink) to positive (blue) is dominated
by two distinct HBsAg negative sub-populations. The first of these sub-populations is defined
by globulin responses less than 31 g/L and a platelet count not greater than 400 x 10^9/L.
From approximately 180 x 10^9/L, increasing platelet count is associated with decreasing
globulin. The second of the negative sub-populations has a globulin range of 45 to > 70 g/L
and platelet counts ranging from approximately 280 x 10^9/L to greater than (>) 500 x 10^9/L.
The positive sub-population (blue) that separates the two negative populations is defined by
globulin concentrations from less than 20 g/L to > 70 g/L depending on the platelet count (for
example, a platelet count of 500 x 10^9/L has a serum globulin concentration from less than (<)
20 g/L to 45 g/L).
For 30 years of age (Fig. 3b) the general pattern was similar to 20 years with two negative
sub-populations; however, between approximately 100 – 220 x 10^9/L platelets a distinct
diagonal pattern emerged, suggesting that at this platelet range serum globulin
concentrations increased to 40 g/L, before decreasing rapidly with further increases in
platelet count. By 40 years (Fig. 3c) the diagonal pattern extended to divide the HBsAg
positive class (blue) into two distinct sub-populations. The first sub-population is defined by
serum globulin concentrations greater than 40 g/L but with platelet counts not exceeding
approximately 260 x 10^9/L. The second positive sub-population was defined by a maximum
serum globulin of 45 g/L and a minimum platelet count of 220 x 10^9/L (20 – 30 g/L globulin).
For the ages 50 – 90 years (Fig. 3d – h) the negative (pink) sub-population that divides the
two positive sub-populations (blue) becomes thicker, representing the changing relationship
between globulin and platelets with increasing age, and the associated shrinking of the
HBsAg positive sub-populations. By 90 years of age (Fig. 3h), the first positive
sub-population is defined by a serum globulin range of approximately 50 – 70 g/L and a platelet
count not exceeding 200 x 10^9/L, while the second sub-population is defined by higher
platelet counts (200 to almost 500 x 10^9/L), but globulin concentrations less than 40 g/L.
From 50 – 90 years, it was interesting to note that for the second sub-population (lower right)
the upper limit for globulin remained consistent at around 40 – 45 g/L, while the platelet count
fluctuated between 200 – 400 x 10^9/L.
Conclusions - SVMs: The final SVM modelling demonstrated that prediction of HBsAg
immunoassay results could be achieved via the routine blood test markers ALT, neutrophils,
platelets, monocytes and globulin, as well as patient age at the time of testing. Therefore, via
access to results from basic (first tier) laboratory blood and serum testing, and patient age,
HBsAg positive immunoassay results could be predicted with an accuracy of 70.1%.
Knowledge of patient history and access to clinical notes almost certainly would enhance this
accuracy rate (clinical criteria can be coded and included in the machine learning models
presented here). The SVM models presented herein highlighted the utility of the serum
globulin concentration – blood platelet count interactions to reveal predictive rules at varying
serum ALT concentrations or age (Figs. 2 and 3 respectively). Similar analyses can be
conducted for other marker pairings such as neutrophils – monocytes, globulin – neutrophils,
platelets – monocytes and so on, allowing the formulation of additional predictive rules of
HBsAg detection. Among the advantages of SVMs, plotting the category patterns after
applying the radial kernel produced visible evidential guidance on the nature of the classes
being predicted; this was very useful when considering the globulin – platelet interaction at
different ages, with the detection of two distinct HBsAg positive populations.
Figure 4 – Comparison of mean ALT (a), Globulin (b), Platelets (c), White Cell Count (d), Monocytes (e) and
Neutrophils (f) kinetics for HBsAg positive (green) and HBsAg negative (blue) cases across the Age range
investigated by SVM and tree machine learning. Mean Monocytes (e) and Neutrophils (f) versus Age at the time
of testing.
(c) HBV Immunoassay Continued: Predicting Persistent Infection by HBe Antigen or Antibody
– To assess the capacity for routine chemistry and haematology markers to predict whether
an HBV infection will persist, identical machine learning analyses were conducted on known
HBsAg positive patients subsequently tested for HBe antigen or antibody. For random
forests, single decision trees and SVM, HBe antigen (HBeAg) and HBe antibody (HBeAb)
positive cases were placed into separate response categories and tested individually
against HBsAg positive cases that tested negative to HBe antigen and antibody (of the cases
available, only one was positive for both HBe antigen and antibody).
HBe Antibody: A total of 297 HBeAb positive cases (n = 297) were available for classification
analysis against HBeAb negative cases that numbered only forty in total (n = 40); therefore,
as described elsewhere, a class imbalance problem existed that required a solution before
interrogation via machine learning. As described, class weighting within the R code is an
option to accommodate class imbalance; however, the results presented here were based on
randomising the 297 HBeAb positive cases and dividing the total cohort into five
sub-categories of equal size (n = 55). All sub-categories were tested against the single HBeAb
negative class; the following results for one HBeAb positive sub-category represent the
overall findings.
Figure 5 shows the top-rated predictor variables to separate the HBeAb positive class from
HBeAb negatives. The accompanying table shows that this algorithm was effective at
predicting HBeAb positive, with an error rate of 18.2% (therefore a successful prediction
accuracy of 81.8%), whereas prediction of HBeAb negative cases was poor. In total the
single HBeAb negative class (n = 40) was compared to HBeAb positive sub-classes five
times, and once as a weighted comparison of the total 297 positive cases versus the 40
negative cases. From the six separate random forest analyses, Age, RDW and WCC were
most prevalent, with Age rating within the top 4 – 5 predictors for all six analyses, and RDW
and WCC appearing in five of the six (5/6) analyses.
Predictions Table
Confusion Matrix:
HBe Antibody Class    HBeAb NEG    HBeAb POS    Class Error
HBeAb NEG             21           17           0.4473684
HBeAb POS             10           45           0.1818182
Overall OOB (Out-of-Bag) estimate of error rate: 29.03%
Figure 5 – Top predictor variables from a random forest classifier to discriminate between classes representing
Hepatitis B e antibody positive cases (n = 55) and Hepatitis B e antibody negative cases (n = 40). Both classes
comprised individuals who are positive for HBsAg and subsequently presented for further HBV immunoassay.
The table summarises the accuracy of prediction for both classes via an “out-of-bag” estimate of error rate
calculated over 10,000 trees.
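The class errors and the overall out-of-bag error in the table can be recomputed directly from the four counts. The Python sketch below assumes that rows are the observed class and columns the predicted class, which is consistent with the reported class errors; the function names are illustrative.

```python
def class_errors(confusion):
    """Per-class error: misclassified cases divided by all cases observed in that class."""
    errors = {}
    for observed, row in confusion.items():
        total = sum(row.values())
        errors[observed] = (total - row[observed]) / total
    return errors


def overall_error(confusion):
    """Overall error: all misclassified cases divided by all cases."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row[observed] for observed, row in confusion.items())
    return (total - correct) / total


# Counts from the Predictions Table (assumed: rows observed, columns predicted).
confusion = {
    "HBeAb NEG": {"HBeAb NEG": 21, "HBeAb POS": 17},
    "HBeAb POS": {"HBeAb NEG": 10, "HBeAb POS": 45},
}

errors = class_errors(confusion)
print(round(errors["HBeAb NEG"], 7))            # 0.4473684
print(round(errors["HBeAb POS"], 7))            # 0.1818182
print(round(100 * overall_error(confusion), 2)) # 29.03
```

The HBeAb positive accuracy follows as 1 - 0.1818 = 0.818, or 81.8%.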
The primacy of WCC and RDW as predictors of HBe antibody status was confirmed by single
decision trees (minimum split number of 50, complexity parameter 0.01 – 0.02), which
yielded the following result (Figure 6).
Predictive Rules:
WCC > 6.6 x 10^9/L + RDW > 13.75% = Class 1 (HBeAb Negative): Accuracy = 76%
WCC < 6.6 x 10^9/L = Class 2 (HBeAb Positive): Accuracy = 75.6%
Figure 6 – Single decision tree classifier of HBe antibody (HBeAb) status based on routine pathology chemistry
and haematology predictor variables. This Figure links directly to the random forest results presented in Figure 5,
and utilised the identical data set for decision tree interrogation. Class 1 = HBeAb Neg: Class 2 = HBeAb Pos.
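The two predictive rules from the single decision tree can be written as a small function. This Python sketch encodes only the rules as stated; the function name is hypothetical, and cases the stated rules do not cover (for example WCC above 6.6 with RDW at or below 13.75) are deliberately returned as None rather than guessed.

```python
def predict_hbeab_class(wcc, rdw):
    """Apply the stated decision tree rules.

    wcc: total white cell count (x 10^9/L); rdw: red cell distribution width (%).
    Returns a class label, or None where the stated rules give no prediction.
    """
    if wcc < 6.6:
        return "HBeAb Positive"  # Class 2, reported accuracy 75.6%
    if wcc > 6.6 and rdw > 13.75:
        return "HBeAb Negative"  # Class 1, reported accuracy 76%
    return None                  # not covered by the two stated rules


print(predict_hbeab_class(5.0, 14.0))  # HBeAb Positive
print(predict_hbeab_class(8.0, 15.0))  # HBeAb Negative
print(predict_hbeab_class(8.0, 12.0))  # None
```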
HBe Antigen: HBsAg positive cases that on subsequent immunoassay also tested positive for HBe
antigen (HBeAg), but negative for HBe antibody, were placed into a separate category for
classification analysis against the HBsAg positive cases that tested negative for HBeAg (and
HBeAb). Thirty-eight (38) HBeAg positive cases were available for analysis, matching the
size of the control HBeAg negative class, so class weighting or randomisation was not
required for this experiment.
As for previous machine learning investigations, random forest (10,000 trees) and single
decision tree analyses were conducted to identify the powerful predictors of class discrimination.
Figure 7 summarises the results from a 10,000 tree RFA (3 – 4 variables tested per tree),
with the top ranked predictor variables of HBeAg positive/negative class discrimination also
modelled by a single decision tree.
(a) OOB estimate of error rate: 28.77%
Confusion matrix:
Class    1     2     Class Error
1        31    7     0.18
2        14    21    0.40
(b) Decision Tree Rules:
Serum Albumin > 41.5 g/L = HBeAg Negative (Accuracy = 65.3% (32/49))
Serum Albumin < 41.5 g/L = HBeAg Positive (Accuracy = 75% (18/24))
Figure 7 – Random Forest (a) and single decision tree (b) classifier analysis of HBe Antigen (Ag) positive (n =
38) versus negative cases (n = 40) with prediction of top-ranked predictor variables from routine chemistry and
haematology markers, and overall prediction accuracy (confusion matrix and out-of-bag estimate of error rate for
random forest (a) and decision rules for the single decision tree (b)).
Class/Category 1 = HBe antigen Negative: Class/Category 2 = HBe antigen Positive
The profile of top predictors for HBeAg shared ALP with HBe antibody (Ab), but was
ultimately a different pattern, highlighted by the role of serum albumin for HBeAg prediction.
Figure 7(b) emphasises this finding with the ultimate single tree model detecting albumin
alone as the top predictor of HBeAg positive or negative immunoassay result. The HBeAg
prediction via albumin was simply whether the serum concentration is above or below 41.5
g/L, with prediction accuracy for positive cases at 75%, from 24 cases (Fig. 7b). Serum
albumin is a marker of chronic liver disease and damage, suggesting for this analysis that
HBeAg positive cases had a persistent HBV infection and associated inflammation, with an
attendant loss of liver tissue and biosynthetic capacity.
Support Vector Machine Investigation of HBe Positive or Negative Immunoassay Results: To
detect nuanced responses that help explain how HBV immunoassay results can be predicted from
routine data, an identical SVM interrogation of HBe response was performed, taking the same
approach as described earlier for HBsAg. The results presented in Figure 8 deal only with
the HBe antibody (HBeAb positive or negative) response as an example.
Figure 8 - SVM plots describing the interaction of red cell distribution width (RDW), total white cell count (WCC) and
patient age at the time of HBe antibody testing for the detection of positive (pink) versus negative (blue)
cases, as detected by subsequent immunoassay on HBsAg positive patients. (a) Age = 25 years; (b) Age = 35
years. The total SVM model to separate HBeAb positive from negative responses included the predictor
(independent) variables RDW (%), WCC (x 10^9/L) and Age (years): (cost = 10, gamma = 0.1, C-classification
method and radial kernel). After 10-fold cross-validation of the data set, total prediction accuracy for the above
model was calculated at 67.1%. HBsAg Positive = 2 (positive sub-category 2.22), HBsAg Negative = 1.
Figure 8 (a) and (b) describe the predictions of HBeAb positive (pink) from negative (blue)
cases using RDW and WCC in patients aged 25 and 35 years respectively. The SVM results
for both plots show three separate HBeAb positive sub-populations, with the proportion of
positive cases greater at the age of 35 years. In the middle of the plot x-axis, a positive
population is defined by an RDW of approximately 12% or less and a WCC range of 6 – 11 x
10^9/L (25 years; Fig. 8a); Figure 8(b) shows the expansion in range of this population, with
RDW increasing to almost 13%, and WCC ranging from slightly above 4 x 10^9/L to almost 13
x 10^9/L. Two additional sub-populations are positioned at the top left and top right of the plots,
for both 25 and 35 year old cases. As for the sub-population on the x-axis, these additional
positive sub-populations increase in range with an age increase. The additional positive
sub-populations are defined by RDW values from within the reference range (see appendices B and C
for SNP reference intervals) to significantly increased RDW values of 17% and beyond. This
RDW characteristic is combined with comparatively narrow WCC ranges distinctly divided
between low to moderate WCC and WCC elevation above the upper limit of the laboratory
reference interval; the blue plot areas are representative of HBeAb negative cases (Figure 8).
Similar to the SVMs for HBsAg (Figs. 2 and 3), the SVM for HBeAb demonstrated that this
algorithm is powerful for prediction, but also has a significant pattern recognition capacity.
Through adopting a radial kernel analysis, complex sub-class (sub-category) patterns were
detected, with multiple positive or negative HBV immunoassay sub-populations illustrated by
the plots that accompanied prediction matrices after SVM training and testing. The results of
the plots in Figures 2, 3 and 8 emphasise a potential explanation for difficulties faced when
analysing pathology data by standard inferential statistics and correlation/regression. These
methods are unlikely to detect the fragmentation of an apparently homogeneous response
class/category into sub-populations, thus not capturing the complexity of the infection,
disease or other laboratory/clinical response of interest.
Conclusions (HBe): Through the prediction of HBe antibody and antigen immunoassay
results, classified as positive or negative categories, via routine chemistry and haematology
markers, a number of rules of persistent HBV infection were formulated, that while involving
HBe, were distinct for antigen and antibody prediction. For HBeAb, the model presented by
single decision tree emphasised the utility of RDW and WCC, while for antigen serum
albumin alone provided a strong single decision tree prediction. The serum albumin
prediction for HBeAg suggests an impact on liver function due to a persistent, unresolved
HBV infection. It must be noted that the HBeAg positive cases used for modelling were all
HBeAb (antibody) negative, suggesting lack of a specific immune response to the HBe
antigen, hence further suggesting unrestrained persistent infection. Serum albumin did not
feature as an important predictor for HBeAb, with WCC significant in the tree models.
Both HBe antibody and antigen profiles contained one haematology marker each (MCH and
RDW respectively) as a leading predictor of positive or negative status. This may point to an
interaction between chronic liver inflammation and changes in reticuloendothelial function, or
haemoglobin biosynthesis.
(d) Validation of HBV Immunoassay Prediction
Predictions were validated via comparison of results from Sullivan Nicolaides Pathology (Queensland)
and ACT Pathology (Canberra).
Data collected from SNP over 2011 to 2012, and from January to June 2014 were compared
to ACT Pathology data collected over 2007 from patients tested for HBsAg (Fig. 9). The
prediction of HBsAg positive immunoassay results had accuracy rates across the
comparisons of SNP and ACT Pathology data of 68 – 75%, while HBsAg negative classes
were predicted at 64 – 75%.
The profile of HBsAg positive/negative predictors was similar for SNP (2011 – 2012)
compared to ACT Pathology (Figs. 9a and b), with ALT, WCC, Monocytes, Neutrophils and
Age common predictors of HBsAg positive or negative status. Serum globulin was not
available in the ACT Pathology data set, which depending on the nature of marker
interactions could influence the outcomes for these analyses. However, only ALT and
platelets were common when comparing the profile of SNP - 2014 data with ACT Pathology
(Figs. 9b and c). Platelets ranked as the eighth most powerful predictor for SNP 2011 – 2012 data
(with serum creatinine seventh) for this RFA model, and were one of three primary variables
used for SVM modelling of HBsAg response (Figs. 2 and 3). The SNP 2014 profile featured
the LFT markers ALP and LDH, as well as serum creatinine and haemoglobin, so was quite
different to SNP 2011 – 2012 and ACT Pathology that featured white cell markers
prominently (Fig. 9a).
Figure 9 – Comparison of profiles for the prediction of HBsAg positive or negative immunoassay classes (categories) by random forest classification modelling.
Only the top six predictors are featured, generated by data from (a) Sullivan Nicolaides Pathology (SNP, Taringa, Qld) between June 2011 – May 2012; (b) ACT
Pathology (Woden, ACT – data collected during 2007), and (c) SNP data collected between January – June 2014.
All random forests tested 3 – 4 predictor variables per tree, over a total of 10,000 individual decision trees. As well as a decreased Gini Index scale to calculate
differences, a mean decrease in accuracy scale is also presented for each analysis. Proximity and “keep forest” parameters were true for all random forests
presented.
[Variable importance plots, panels (a) SNP Data (2011 - 2012), (b) ACT Pathology Data and (c) SNP Data (2014). Axes: MeanDecreaseAccuracy and MeanDecreaseGini.]
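The MeanDecreaseGini scale used in Figure 9 is built on the Gini impurity of tree nodes: a predictor scores highly when splitting on it makes the resulting nodes purer. The Python sketch below shows the impurity and the decrease for a single split only. It is a simplified illustration with hypothetical function names and toy labels, not the weighting and averaging over 10,000 trees performed by the random forest software.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Toy node with equal numbers of positive and negative cases (illustrative only).
parent = ["pos"] * 4 + ["neg"] * 4
print(gini_impurity(parent))                              # 0.5
print(gini_decrease(parent, ["pos"] * 4, ["neg"] * 4))    # 0.5 for a perfect split
```

A split that leaves both children as mixed as the parent gives a decrease of zero, so uninformative predictors accumulate little importance.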
Generally, there was strong validation between the SNP and ACT Pathology predictions for
HBsAg in spite of being collected from patients separated geographically by large distances
and living under different climates (SNP provides Pathology services for all of Queensland).
The lesser agreement between SNP 2014 (Fig. 9c) and ACT Pathology (Fig. 9b) may be
related to the smaller SNP data set available; the SNP 2011 – 2012 data set comprised a
total of approximately 700 cases, ACT Pathology approximately 550 cases, but the SNP
2014 data set contained only around 100 cases in total for analysis.
From the above comparison across time and location, ALT was the most consistent
predictor, in line with current knowledge, together with age (years) and total white cell count.
Within the white cell population, neutrophils and monocytes were also consistently powerful
predictors, while lymphocytes, eosinophils and basophils were not important to HBsAg.
Based on SVM analyses (Figs. 2 and 3), serum globulin and platelets are also important and
must be considered for a final refined model of early HBV detection via HBsAg prediction.
(2) Vitamin D and Renal Function - There is considerable current interest in
vitamin D deficiency in populations around the world, including in Australia. This concern
stems from research reports linking vitamin D deficiency to various disease conditions,
including cancer, which has resulted in a dramatic increase in pathology laboratory testing.
To increase the concentration of active (25-OH) vitamin D in the body, increased exposure to
UV via sunlight, and/or oral supplementation are recommended. Questions of interest to
diagnostic pathology, given concerns about unnecessary testing (and hence health budgets),
are: is the current volume of vitamin D testing necessary? Do other factors influence
serum vitamin D concentration, and if so, what additional questions might be asked before
ordering tests? Kidney function was the subject of analysis in the context of other factors
associated with vitamin D serum measurement. This was assessed by machine learning
modelling of eGFR responses, with three response categories created based on the five
eGFR stages used for kidney function diagnosis and monitoring. Only three categories (not
five stages) were used because the numbers of very low eGFR cases (stages 4 & 5) were
insufficient for analysis, resulting in the consolidation of available cases into three eGFR
categories (see caption to Figure 10 for definitions).
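The three-category scheme can be sketched as a small lookup function. This is only an illustration of the category boundaries given in the caption to Figure 10; the function name is hypothetical, and the handling of non-integer values between 60 and 61 (assigned here to category 2.1) is an assumption, since the report lists integer bounds only.

```python
def egfr_category(egfr):
    """Map an eGFR value (mL/min/1.73 m2) to the report's three categories.

    Boundaries follow the Figure 10 caption:
    1 = > 90 (healthy), 2.1 = 90 - 61 (moderate), 3 = 60 - 7 (poor).
    """
    if egfr > 90:
        return "1"      # healthy
    if egfr > 60:
        return "2.1"    # moderate (assumption: 60 < eGFR <= 90)
    if egfr >= 7:
        return "3"      # poor
    return None         # below the range represented in the data set


for value in (95, 75, 45):
    print(value, egfr_category(value))
```

Values below 7 return `None` because the data set described here contained no such cases, so no category is defined for them.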
On both the random forest (Fig. 10a) and the single decision tree (Fig. 10b), 25-OH vitamin D
was the top-ranked predictor variable for eGFR category. The single decision tree also
provided a vitamin D threshold of 72.5 nmol/L, suggesting this as the critical serum
concentration for separating patient kidney function as reflected by eGFR. In terms of
separating healthy from poor kidney function, an age threshold of 59.5 years was also important,
leading to category 1 and 3 prediction accuracies of > 80%. The rules therefore suggest
healthy kidney function (eGFR > 90) by this simple rule:
(3) 25-OH vitamin D < 72.5 nmol/L + Age < 59.5 years (eGFR category 1 prediction =
87.7%).
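[Figure 10(a): random forest variable importance plot (eGFR.Cat_Subcat2.1.rf). Predictors shown: 25-OH vitamin D, age, alkaline phosphatase, corrected calcium and phosphate. Horizontal axis: MeanDecreaseGini (0 to 150).]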
Figure 10 - Random forest (a) and single decision tree (b) analysis of eGFR category predictors (n = 874).
Categories for eGFR response were: 1 = > 90 (healthy), 2.1 = 90 - 61 (moderate), and 3 = 60 - 7 (poor).
The poorest kidney function is described by: (4) 25-OH vitamin D < 72.5 nmol/L + Age >
59.5 years (eGFR category 3 prediction = 80.9%).
For 25-OH vitamin D > 72.5 nmol/L, an age threshold of 70.5 years was crucial, but
predictions of eGFR category were weaker, with accuracies of 69% and 58%.
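Rules (3) and (4) can be written as a short function, sketched below. It encodes only the two rules stated in full above. The branch for 25-OH vitamin D above 72.5 nmol/L is left undecided because the text gives the age threshold (70.5 years) and the accuracies but not which categories were predicted. The function and constant names are hypothetical.

```python
VITD_THRESHOLD = 72.5   # nmol/L, threshold from the single decision tree
AGE_THRESHOLD = 59.5    # years, threshold from the single decision tree


def predict_egfr_category(vit_d, age):
    """Return (eGFR category, reported accuracy %) for rules (3) and (4).

    Returns (None, None) when neither rule applies.
    """
    if vit_d < VITD_THRESHOLD:
        if age < AGE_THRESHOLD:
            return ("1", 87.7)   # rule (3): healthy, eGFR > 90
        if age > AGE_THRESHOLD:
            return ("3", 80.9)   # rule (4): poor, eGFR 60 - 7
    # Vitamin D above the threshold: categories not specified in the text.
    return (None, None)


print(predict_egfr_category(55.0, 42))   # rule (3)
print(predict_egfr_category(55.0, 71))   # rule (4)
print(predict_egfr_category(90.0, 65))   # undecided
```

The accuracies returned are those reported for the training data and should not be read as guarantees for new patients.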
As shown in Figure 10(a), vitamin D and age were clearly the top predictors at > 150 on the
mean decrease Gini index, with calcium concentration (corrected for albumin) at less than
100 on this scale, and phosphate at less than 50. Alkaline phosphatase (ALP) was included as
a control because it is a marker of bone disease. Sex (female or male) was included in
preliminary analyses and was found to be the least important predictor of eGFR category,
with a score slightly above 0 on the mean decrease Gini index scale, suggesting that sex did not
influence kidney function in this community sample.
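For readers unfamiliar with the Gini scale, the following minimal sketch shows how the decrease in Gini impurity for one tree split is calculated. Random forest software typically accumulates such decreases for each predictor over all splits and averages them over the trees; the exact weighting used to produce Figure 10(a) is not stated here, so this is an illustration of the principle only. The class counts are invented for the example.

```python
def gini(counts):
    """Gini impurity of a node, given the number of cases in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted


# Hypothetical node with 50 cases in each of two classes.
print(gini_decrease([50, 50], [50, 0], [0, 50]))    # a split that separates the classes perfectly
print(gini_decrease([50, 50], [25, 25], [25, 25]))  # a split that separates nothing
```

A predictor that repeatedly produces large decreases, as 25-OH vitamin D and age did, ranks high on the importance plot; one that rarely improves purity, as sex did, scores near zero.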
It could be expected that given the role of 25-OH vitamin D in calcium metabolism, and the
link revealed here between vitamin D and eGFR, some change to serum calcium and
phosphate concentrations may be observed for the different levels of kidney function. Figure
11 shows the mean serum calcium and phosphate concentrations for the three eGFR
categories investigated; no significant difference due to eGFR category was found for either
bone marker. The variation in ALP, 25-OH vitamin D and age across the three eGFR
categories is shown in Figure 12, which clearly shows that increasing age is strongly associated
with decreasing eGFR.
Figure 11 - Mean (+ 2 x SEM) corrected serum calcium and phosphate concentrations for three eGFR stages
(categories) - Categories for eGFR response: 1 = > 90 (healthy), 2.1 = 90 - 61 (moderate), and 3 = 60 - 7 (poor).
Figure 12 - Mean (+ 2 x SEM) patient age, alkaline phosphatase and 25-OH vitamin D concentrations for three
eGFR stages (categories) - Categories for eGFR response: 1 = > 90 (healthy), 2.1 = 90 - 61 (moderate), and 3 =
60 - 7 (poor).
As demonstrated by tree-based machine learning, 25-OH vitamin D was the leading predictor
of kidney function as assessed by eGFR, with the further complication that kidney function
declined with age, and age may thus also influence serum vitamin D concentration. Data on diabetic
status were not available with the cases provided, so kidney damage due to poor diabetic
control could not be screened out of the data set. While age had a distinct linear relationship
with eGFR, vitamin D did not, with mean concentrations not significantly different between
eGFR > 90 and eGFR less than 60. Serum calcium and phosphate did not differ significantly
between healthy, moderate and poor kidney function. The relationship between vitamin D
and kidney function is well known, and we have shown with this machine learning
investigation that vitamin D testing may not always be warranted without first considering
other health markers and age.
(3) Red cell Distribution Width - Please refer to the report Appendix (E) for the
published paper on the modelling of RDW data to assess its capacity to predict the value of
second tier anaemia tests, for example serum ferritin. This paper was published in August
2015; paper entitled - The Early Detection of Anaemia and Aetiology Prediction through the
Modelling of Red Cell Distribution Width (RDW) in Cross-Sectional Community Patient Data.
A summary of this project and recommendation can be found on page 4 of this report.
Validation of RDW Predictions: The modelling and predictions presented in our publication,
The Early Detection of Anaemia and Aetiology Prediction through the Modelling of Red Cell
Distribution Width (RDW) in Cross-Sectional Community Patient Data, used data collected
via SNP over August – September 2012. For a validation study, additional SNP data
collected over March 2012 was compared to assess the accuracy of prediction on a different
patient group from earlier in the year.
Figure 13 – Random Forest regression modelling of linear RDW values obtained from patients tested during (a)
March, or (b) September 2012 by Sullivan Nicolaides Pathology (Brisbane). Both forests modelled 500 trees with
the most powerful predictors of linear RDW response rated at the top right of the plots, with decreasing prediction
power represented by a lower Node Purity. The above models for March and September 2012 explained 40%
and 45% of RDW variation, respectively.
Figure 13 summarises the results of random forest analysis applied to determine the best
predictor variables of linear RDW variation, with data sourced from SNP collected in March
or September 2012. For both months examined, MCH and MCHC were the top-ranked
predictors, validating the findings of the initial models using only September data. Beyond
these top two ranked RDW predictors, the orders of importance were different, but the trends
were similar. Haemoglobin and serum ferritin were in the top half of predictors, while other
special, second tier tests for IDA ranked lower for both the March and September results.
MCV was also in the top half of predictors for both, but surprisingly not of a higher rating
given it is used to calculate RDW. These results also support the published findings from our
earlier studies, suggesting that iron, saturation and TIBC are not of primary importance to
predicting IDA via RDW; it appears that haemoglobin, ferritin and one or more of the red cell
indices (e.g. MCH) are most useful.
To further evaluate the generalizability of RDW predictions to other patients, Chi-Square (χ²) analysis
assessed the frequency of correct RDW classifications by comparing how well the September decision
tree rules aligned with the March data. For this analysis, RDW responses were divided into two
separate categories (classes) based on ranges inside or above the SNP RDW reference
interval. These classes were: RDW from 11 - 12.2% (Category 0; n = 122) and RDW from 14.5 - 22.1%
(Category 1; n = 132). Single decision trees calculated decision rules, and the
rules were then tested on the March data by counting the number of RDW cases within each
category that fell above or below the predictor variable threshold determined by the trees.
The results (Table 3) give an example of a comparison to validate the original predictions, by
applying decision rules associated with MCH.
Table 3 – Validation of RDW class prediction originally calculated by a single decision tree
on patient data collected in September 2012, by comparison with March 2012 data.
(a) Category 0 (RDW range of 11 – 12.2%; n = 122) was predicted by MCH alone at > 28.95
pg (picograms);
(b) Category 1 (RDW range of 14.5 – 22.1%; n = 132) was predicted by MCH < 28.95 pg.
March 2012 cases that fell within the category 0 or 1 RDW ranges were counted for the
frequency of MCH greater than or less than 28.95 pg, and compared to the prediction
frequencies from trees.
(a) MCH > 28.95 pg
Month SNP Data Collected    Cat 0 (RDW = 11 - 12.2%)    Cat 1 (RDW = 14.5 - 22.1%)    Prediction Accuracy (%)
March                       112                         36                            75.7
September                   157                         8                             95.0

(b) MCH < 28.95 pg
Month SNP Data Collected    Cat 0 (RDW = 11 - 12.2%)    Cat 1 (RDW = 14.5 - 22.1%)    Prediction Accuracy (%)
March                       10                          96                            90.6
September                   72                          183                           71.8
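The prediction accuracies in Table 3 appear to be the share of cases on one side of the MCH threshold that belong to the category the rule predicts. The sketch below recomputes them on that assumption, which is inferred from the counts and not stated explicitly; the function name is hypothetical.

```python
def rule_accuracy(n_cat0, n_cat1, predicted):
    """Percentage of cases satisfying a rule that fall in the predicted category.

    n_cat0, n_cat1: counts of Category 0 and Category 1 cases meeting the rule.
    predicted: 0 for the MCH > 28.95 pg rule, 1 for the MCH < 28.95 pg rule.
    """
    correct = n_cat0 if predicted == 0 else n_cat1
    return 100.0 * correct / (n_cat0 + n_cat1)


# Counts taken from Table 3.
print(round(rule_accuracy(112, 36, predicted=0), 1))   # (a) March
print(round(rule_accuracy(157, 8, predicted=0), 1))    # (a) September
print(round(rule_accuracy(10, 96, predicted=1), 1))    # (b) March
print(round(rule_accuracy(72, 183, predicted=1), 1))   # (b) September
```

On this reading the March values for both rules and the September value for (b) match the table, while the September value for (a) recomputes to about 95.2% against the 95.0% listed, a small difference that may reflect rounding or a slightly different denominator.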
Chi-Square (2) analyses were conducted on individual comparisons, like presented in Table
3, as well as the aggregated data across four separate decision tree prediction models. To
statistically validate the original decision tree models, no significant difference (p > 0.05)
between March and September data was required. All of the 2 estimates showed significant
differences (p < 0.05) between the September predictions and the March validation data,
with the pooling of data across the four predictions attempted also resulting in a significant
difference (2 = 6.01, p = 0.014, df = 1). Therefore, the validation of the original September
single decision tree predictions was not achieved with the frequency of correct class (Cat 0
versus Cat 1) assignment significantly different when compared to an unrelated RDW data
set from another time period.
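The mechanics of a χ² comparison with one degree of freedom can be sketched as follows. This is a plain Pearson χ² for a 2 x 2 table without continuity correction, written with the Python standard library only; it is not necessarily the procedure or software used for the analyses above, and the report does not state exactly which counts entered each test. The function names are hypothetical.

```python
import math


def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, p_value_df1(chi2)


# The pooled statistic reported in the text: chi-square = 6.01, df = 1.
print(round(p_value_df1(6.01), 3))

# Illustration only: March versus September counts from Table 3(a).
chi2, p = chi_square_2x2(112, 36, 157, 8)
print(round(chi2, 2), p < 0.05)
```

The first call reproduces the reported p = 0.014 from the pooled statistic. The second shows that the Table 3(a) counts alone differ significantly between months, in line with the conclusion that validation was not achieved.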
Prediction accuracies for RDW class, when comparing March to September, were within
approximately 20 percentage points of each other, with individual predictions for category 0 and category 1 exceeding 90%.
The choice of single decision tree may have been a contributing factor to the observation that
statistical similarity was not detected. Single decision trees create their ultimate rules by
discarding cases that don’t agree with a decision at a particular node/terminus within the
overall tree, thus enriching a sub-population for a final decision rule. When revisiting this
investigation in future, it will be worthwhile to also generate class predictions for comparison
using RFA or SVM, since these algorithms do not discard cases. As for any machine learning
model, a larger data set would also be beneficial to enhance accuracy through representing
a truer sample of the population to be investigated.
(4) Final Conclusions - How the Results of this Activity will be used
to Benefit Pathology Stakeholders: Through a machine learning and statistical
focus on predictive models of liver function tests (LFTs), hepatitis B virus (HBV)
immunoassay, red cell distribution width (RDW) for early anaemia detection and the
investigation on the impact of kidney function on serum vitamin D measurements, novel
patterns have been detected based primarily on the modelling of blood/serum pathology
markers that comprise routine multiple biochemical analysis (including serum urea,
electrolytes and creatinine) and the full blood count (FBC). The key findings from each
project are summarised from page 4 of this report. As an overall contribution to quality in
pathology testing, the novel application of the machine learning algorithms random forest,
single decision trees and SVM (radial kernel) in sequence to cross-sectional data has
revealed outcome/response predictions generally in excess of 70% accuracy using only routine
pathology markers, and has detected unexpected interactions among markers that are worthy of
further research and of interest to diagnostics and pathogenesis (for example, the interaction
between serum cholesterol and ALP for elevated GGT cases).
The results from this study have provided impetus for the re-evaluation of LFT profiles for
community patients, provided tools to assist the early detection of HBV infection, and
predictions of infection persistence for HBsAg positive cases, all based on routine data
available to all laboratories. Furthermore, RDW has been confirmed as an early marker of
anaemia and additional analyses discriminated between various second tier anaemia
markers based on sex and age. Parameters to consider for serum vitamin D testing have
been proposed too, providing evidence to encourage extra thought on the value of vitamin D
testing, rather than being used indiscriminately.
The abundance of routine pathology marker profiles from patients who are healthy, but
present each year for routine health checks, to cases of an initial diagnosis of disease, to the
monitoring of known health conditions, provides massive regular population sampling from
the healthy to the chronically unwell, nationwide. Availability of massive data collections
combined with advances in computer science that lead to reliable machine learning
algorithms presents opportunities to evolve pathology beyond the interpretation of reference
intervals alone, to networks of routine markers associated with health changes. The range of
routine markers available today for pathology testing reflects alterations to physiology and
biochemistry caused by disease, although these alterations may be subtle and not always detected
in relation to reference intervals.
Using mass data and machine learning, four conditions where pathology testing is central to
diagnosis were investigated with the primary aim to enhance clinical decision making by
maximising the potential of routine results to provide a laboratory diagnosis. In this respect,
two of the research activities presented herein closely examined the capacity for routine data
patterns to predict the outcomes of second tier tests, namely HBV immunoassay and special
markers of iron-deficient anaemia (IDA) like serum ferritin, iron and vitamin B12. Another
aspect was inspired by the UK BALLETS study (Lilford et al., 2013), and examined how
machine learning could be used to reduce the number of assays required for LFTs, while
continuing to provide effective diagnostic information (Lidbury et al., 2015 – Appendix D).
The other aspect concerned the ordering of unnecessary serum vitamin D tests, as assessed
via kidney function testing.
The results from each project have the potential to enhance diagnostic decision-making
through the application of methods capable of dissecting complex relationships from routine
pathology data. The enhancement pertains to savings of patient and medical practitioner
time, quicker diagnosis and intervention (if warranted) through predictive rules of second tier
test outcomes, and data-centred decisions to avoid ordering unnecessary routine or second tier
tests. Providing additional and earlier clarity to, for example, the results of a routine LFT for a
community patient will also contribute to reduced patient anxiety and unnecessary follow up
procedures (e.g. liver ultrasound). All of these time savings translate into significant financial
savings for citizens, the health system and government. With the routine LFT central to the
testing of community patients for example, the reduction of the LFT profile to only ALT and
ALP (Appendix D) would result in significant cost savings due to the enormous volume of
LFTs performed.
As concluded, the benefits for stakeholders are the provision of new decision tools, beyond
the reference interval, to assist enhanced diagnostic procedures and thereafter patient
outcomes, both in terms of earlier diagnosis and savings of time, money and anxiety.
Another example of significant time savings is in the context of rural and remote
laboratories that do not have easy access to reference labs and second tier assays. For a
potentially life-threatening infection like HBV, early prediction of infection and persistence via
routine blood/serum markers can support early intervention, especially in cases where there
is history of behaviours that place the patient at risk of infection. The conclusions herein were
produced via blood test results only, with no access to clinical notes or history to set the
clinical parameters. Combined with patient history and clinical examination it would be
expected that the power of laboratory-based predictions would be further augmented.
With an eye to the future and further benefits to health practitioners and hospitals, the results
described herein, once validated, could be integrated into existing pathology department
computer systems, forming intelligent systems in silico for the enhanced management and
application of patient results. As shown, rules based on routine pathology test results can
predict to high accuracy conditions that normally require special tests, like immunoassay.
While the special test may still be conducted, the early indication of disease or infection will
provide quicker responses and may prevent the development of serious disease through
timely intervention. While the predictions based on routine markers are rarely 100%, early
flags from the intelligent system would provide a basis for direct action, even if simply
suggesting additional questions for patients that may provide further clues to a health
problem (e.g. “Have you travelled to S.E. Asia recently?” – possibility of hepatitis B/C
exposure?).
In addition to the immediate benefits for pathology stakeholders discussed above, the
research conducted to promote the better utilisation of pathology testing also uncovered
some unexpected relationships between pathological processes, which were revealed by
using biomarkers of those processes. The studies reported herein have therefore also
become exploratory pathology research, which promotes the value of this rich data as a
source of further investigation.
References
1. Dugdale AE. Diagnosis and management of iron deficiency anaemia: a clinical update.
Med J Aust. 2011;194:429.
2. Álvarez-Suárez B, de-la-Revilla-Negro J, Ruiz-Antorán B, Calleja-Panero JL.
Hepatitis B reactivation and current clinical impact. Rev Esp Enferm Dig.
2010;102:542-552.
3. Karatzoglou A, Meyer D, Hornik K. Support Vector Machines in R. J Stat Softw.
2006;15(9).
4. Lilford RJ, Bentham L, Girling A, Litchfield I, Lancashire R, Armstrong D, et al.
Birmingham and Lambeth Liver Evaluation Testing Strategies (BALLETS): a prospective
cohort study. Health Technol Assess. 2013;17:i-xiv, 1-307.
5. Lidbury BA, Richardson AM, Badrick T. Assessment of machine-learning techniques on
large pathology data sets to address assay redundancy in routine liver function test
profiles. Diagnosis. 2015;2:41-51.
6. Badrick T, Richardson AM, Arnott A, Lidbury BA. The early detection of anaemia and
aetiology prediction through the modelling of red cell distribution width (RDW) in
cross-sectional community patient data. Diagnosis. 2015;2:171-9.
Appendices (Including full activity reports for LFT and RDW
projects):
(A) Summary - Evaluation of the Activity against the Performance Indicators
(B) Sullivan Nicolaides Pathology Chemistry Reference Intervals and Abbreviations
(C) Sullivan Nicolaides Pathology Haematology Reference Intervals and Abbreviations
(D) Manuscript accepted for publication with minor revision by Diagnosis - “Assessment of
Machine-Learning Techniques on Large Pathology Data Sets to Address Assay Redundancy
in Routine Liver Function Test Profiles”.
(E) Manuscript accepted for publication by Diagnosis - “The Early Detection of Anaemia and
Aetiology Prediction through the Modelling of Red Cell Distribution Width (RDW) in
Cross-Sectional Community Patient Data”.
Appendix (A)
Evaluation of the Activity against the Performance Indicators - Summary:
What are the key milestones for this project that will identify
that you have achieved the objectives of the project?
(Milestone Status - June 2013 - Nov 2015)

Milestone: Completed spread sheets for loading into R statistical software,
devoid of ambiguous data (e.g. ND, > 5, < 1000); clear response
versus predictor variables.
Status: Achieved

Milestone: Determine independence of certain parameters – Ca/P; ALT/AST;
GGT/Alk Phosphatase.
Status: Achieved

Milestone: The successful production of data network “rules” reflected by
routine pathology predictor variables (e.g. ALT, RCC) in response to
the dependent (response) variable (e.g. RDW). In silico prediction
accuracy > 70%.
Status: Achieved

Milestone: On statistical analysis (Chi-square), there are no significant
differences between the in silico predicted results (as predicted via
rules in routine pathology data) and the results from prospective
testing of new samples, indicating accurate machine learning
(pattern recognition) prediction. May require repeated analyses and
validation to optimise methods.
Status: Achieved. RDW validation testing performed and validation studies for
HBV reported above. More optimisation needed.

Milestone: Completed and published research manuscripts, adoption of rules
approach by SNP and validation trial at a non-SNP pathology
laboratory.
Status: Achieved - two manuscripts published, and one letter to Annals of
Clin. Biochem. in press, plus an invited review article in preparation. Rules
not adopted as yet – more validation needed.
Appendix (B)
Sullivan Nicolaides Pathology Chemistry Reference Intervals and Abbreviations
Liver function:
Full Analyte Name             Abbreviation   Reference Range (Adult Male)    Reference Range (Adult Female)
Total Bilirubin               TBili          4 - 20 µmol/L                   3 - 15 µmol/L
Alkaline phosphatase          ALP            60 - 200 U/L (17 - 19 yrs)      20 - 105 U/L (19 - 49 yrs)
Alkaline phosphatase          ALP            35 - 110 U/L (20 + yrs)         30 - 115 U/L (50 + yrs)
Aspartate aminotransferase    AST            10 - 40 U/L                     10 - 35 U/L
Alanine aminotransferase      ALT            5 - 40 U/L                      5 - 30 U/L
Gamma glutamyl transferase    GGT            5 - 50 U/L                      5 - 35 U/L
Lactate dehydrogenase         LDH            120 - 250 U/L                   120 - 250 U/L
Serum Proteins:
Full Analyte Name    Abbreviation   Reference Range (Adult Male)    Reference Range (Adult Female)
Total Protein        TP             61 - 83 g/L                     61 - 83 g/L
Albumin              ALB            34 - 50 g/L                     34 - 50 g/L
Globulin             GLOB           23 - 39 g/L                     23 - 39 g/L

Urea, Electrolytes, Creatinine:
Full Analyte Name                       Abbreviation   Reference Range (Adult Male)    Reference Range (Adult Female)
Sodium                                  Na+            135 - 145 mmol/L                135 - 145 mmol/L
Potassium                               K+             3.5 - 5.5 mmol/L                3.5 - 5.5 mmol/L
Chloride                                Cl-            95 - 110 mmol/L                 95 - 110 mmol/L
Bicarbonate                             HCO3-          20 - 32 mmol/L                  20 - 32 mmol/L
Anion Gap                               AG             5 - 15 mmol/L                   5 - 15 mmol/L
Creatinine                              Creat          60 - 120 µmol/L                 45 - 85 µmol/L
Urea                                    Urea           3 - 10 mmol/L                   3 - 10 mmol/L
Urate                                   Urate          0.20 - 0.50 mmol/L              0.15 - 0.40 mmol/L
Estimated Glomerular Filtration Rate    eGFR           ≥ 60 mL/min/1.73 m2             ≥ 60 mL/min/1.73 m2
Osmolality                              Osmol          285 - 300 mmol/kg               285 - 300 mmol/kg
Bone:
Full Analyte Name                                   Abbreviation   Reference Range (Adult Male)    Reference Range (Adult Female)
Calcium (total)                                     Ca++           2.15 - 2.55 mmol/L              2.15 - 2.55 mmol/L
Corrected Calcium (corrected for albumin binding)   Ca_Corr        2.15 - 2.55 mmol/L              2.15 - 2.55 mmol/L
Phosphate                                           PO4            0.8 - 1.5 mmol/L                0.8 - 1.5 mmol/L
Other:
Full Analyte Name    Abbreviation   Reference Range (Adult Male)       Reference Range (Adult Female)
Cholesterol          CHOL           Varies with lab and population     Varies with lab and population
Age                                 years                              years
Sex                                 Male (2)                           Female (1)
Appendix (C)
Sullivan Nicolaides Pathology Haematology Reference Intervals and Abbreviations
RED CELL MARKERS (Explanatory Variables)

Routine Red cell Markers                              Laboratory Reference Range (Male)    Laboratory Reference Range (Female)
Haemoglobin (Hb)                                      125 - 175 g/L*                       110 - 165 g/L*
Haematocrit (Hct)                                     0.38 - 0.54 *                        0.34 - 0.47 *
Mean Corpuscular Volume (MCV)                         80 - 100 fL                          80 - 100 fL
Mean Corpuscular Haemoglobin (MCH)                    27.5 - 34 pg                         27.5 - 34 pg
Mean Corpuscular Haemoglobin Concentration (MCHC)     310 – 360 g/L                        310 – 360 g/L
Red Cell Count (RCC)                                  4.2 - 6.5 x 10^12/L*                 3.7 - 5.6 x 10^12/L*

Special, Second tier tests for Anaemia Investigation  Male                                 Female
Ferritin                                              25 - 220 ng/mL                       25 - 110 ng/mL
Serum Iron                                            5 - 30 µmol/L                        5 - 30 µmol/L
Saturation                                            20 - 55%                             20 - 55%
Transferrin                                           445 - 472 µmol/L                     445 - 472 µmol/L
Red cell Folate (RCF)                                 > 150 nmol/L                         > 150 nmol/L
Serum Folate                                          > 7.0 nmol/L                         > 7.0 nmol/L
Vitamin B12 (VitB12)                                  > 150 pmol/L                         > 150 pmol/L

Red Cell Distribution Width (RDW) (Response Variable)
(CV%)    11 - 16%
(SD)     30 - 50 fL