2016 IEEE International Conference on Big Data (Big Data)
A Framework to Predict Outcome for Cancer Patients
Using data from a Nursing EHR
Muhammad K Lodhi∗ , Rashid Ansari∗ , Yingwei Yao† , Gail M. Keenan† , Diana J. Wilkie † and Ashfaq Khokhar‡
∗ College of Engineering
University of Illinois at Chicago, Chicago, IL, 60607
Email: mlodhi3,[email protected]
† College of Nursing
University of Florida, Gainesville, FL, 32610
Email:y.yao,gkeenan,[email protected]
‡ College of Engineering, Chicago, IL, 60616
Email: [email protected]
Abstract—With the rapid growth of electronic data repositories in diverse application domains, including healthcare,
considerable research interest has developed in extracting
the hidden knowledge in these repositories. Among these,
electronic health record systems (EHRs) are the
fastest growing in terms of size and data diversity. In this work,
we focus on mining a high-dimensional, sparse dataset, using
nursing care data as an exemplar. Mining such a dataset
is challenging for a number of reasons. Several
dimension-reduction methods exist; however, they do not
work well with contextual datasets. In
our study, we use association mining both as a dimension-reduction
step and for extracting important features from
the dataset. Our results show that association mining can be
used effectively for dimension reduction and feature
extraction. Our predictive modeling results show that decision tree
models generally have high accuracy, and that their results are easy
to interpret when determining the influence of different variables.
Keywords-dimension reduction, association mining, predictive modeling, Electronic Health Records (EHRs).
I. INTRODUCTION
The use of electronic media to capture, process and
accumulate information is witnessing extraordinary growth [1], in part due to advances in digital technologies. This has resulted in high-dimensional
and sparse datasets in applications such as
electronic health records (EHRs), biology, astronomy, medical imaging, video archiving, and web data. Different data
mining techniques have been utilized to extract knowledge
embedded in some of these data sets, albeit with limited
success [2], [3].
Existing data mining techniques and models can be applied
successfully on traditional transactional datasets. However, a
number of challenges exist for analyzing high-dimensional
data using the traditional methods. Firstly, the complexity
of most of these algorithms is exponential in the number
of dimensions (columns) since there are an exponential
number of column arrangements. Hence, as the number of
978-1-4673-9005-7/16/$31.00 ©2016 IEEE
dimensions of a dataset increases, so does the search space.
These column-enumeration based algorithms are therefore
infeasible for many of the real world applications. A
few row-enumeration algorithms have been introduced [4],
[5], though they work best with dense datasets. Secondly,
the predictive power of models decreases as the dimensionality increases [7]. Dimension-reduction
algorithms such as Principal Component Analysis [6] are
effective and popular for reducing dimensionality.
However, the effectiveness of these algorithms is limited
due to their global linearity. All of these conventional
dimension-reduction techniques only characterize linear subspaces within the data and do not consider the context of
different variables and predictors in the dataset.
In this work, we analyze a high dimensional and sparse
dataset stemming from nursing care EHRs. These datasets
contain thousands of variables (high dimensionality), yet,
only a few of them are monitored for an individual patient.
Although nursing data are a vital part of the EHR, they are frequently disregarded [8], even though nurses are the front-line
providers of care to patients. Effective use of data mining
methods can support better and more effective patient
care and thereby reduce healthcare costs. Our aim in
this work is to use the EHR data to reveal best palliative
care practices and, once an automated system has been developed,
to deliver clinical decision support directly to the nurses.
Accurately timed and controlled palliative care has been
shown to decrease the healthcare costs significantly, as well
as increasing the quality of care provided to the chronically
ill patients [9]. A lot of money is spent every year on
chronically ill patients, but palliative care is underutilized
due, in part, to ineffective use of available technology [10].
Applying big data techniques to discover knowledge can
help save around $300 million annually [11].
Previously, nursing EHR data have not been amenable to big
data methods due to the lack of standardization.
Our emphasis in this study is on patients suffering
from cancer. Cancer is one of the leading causes of death
worldwide, and the costs of treatment are high due to the
complexity of the disease. Cancer is not a single disease,
but a multitude of diseases, meaning that effective treatment
of one type of cancer may not be as effective for treatment
of another type of cancer. According to a study from the
World Health Organization, cancer accounted for nearly
13% of all deaths over the world in 2012. As the baby
boomers age, cancer-related deaths are projected to continue
rising, reaching 12 million deaths by 2030, as cancer is
more prevalent in those who are advanced in age [12],
[13]. Therefore, there is an urgent need to provide cost
effective care, and medical and nursing interventions need
to be applied more successfully to the hospitalized patients
suffering from cancer.
Recently, many studies have used patient-level data
to generate results that are a step towards individualized
care. Predictive models have been built on data of cancer
patients [14]–[21]. However, these studies have some limitations. Some of the studies [15], [16], [18], [19] target
only a selected group of patients based on sex or race,
whereas others consider only patients belonging to
a single hospital unit [14], [20], [21]. Only a few studies, by
Caballero et al. [14] and Ghassemi et al. [21], have considered
text data for prediction models. None of these works, however,
has considered nursing data for predictive modeling, which
makes our study unique in this regard.
Our work is innovative and differs from this earlier research
in several ways. Firstly, none of the previous works have
considered nursing care data for hospitalized
patients suffering from cancer. Secondly, we use data mining
techniques to help discover hidden knowledge about best
practices performed by nurses that are associated with better
outcomes among such patients. We believe that this is an
important innovation for advancing the relevance of EHRs
to clinicians: owing to the unavailability of standardized nursing
care data, much is still not known about the care that
nurses provide to patients. Standardized data allow
for personalization of the care provided to the patient to achieve
the desired goals and will also help in understanding the
important features of care for specific populations. Lastly,
although hierarchical classification structures have been
used extensively in other domains such as
text mining [23], they have not been used as extensively
in the medical domain, especially for predicting cancer
outcomes.
In this on-going work, our aim is to identify models with
strong and significant prediction of patient outcomes relevant to palliative care at the shift level and the episode
level from different predictors including patient and nurse
characteristics and nursing care. The goal is to determine the
associations between a variety of nursing care interventions
or patient characteristics and improved outcomes for cancer
patients.

Table I
SAMPLE LIST OF WORDS OR PHRASES USED TO IDENTIFY CANCER
PATIENTS IN THE DATABASE
Adenocarcinoma
Basal Cell
Breast
Chemomobilization
Duodenectomy
Glioblastoma
hemifacialectomy
laryngeal cancer
Leukemia
Lymphoma
Mass
Merkell Cell
Prostatectomy
Radiation
Radiation Therapy
Rectal Bleeding
Tumor
VATS
II. DATA DESCRIPTION
Our data were gathered using the HANDS database,
a nursing EHR system deployed in 9 different hospital
units for three years (2005-2008) [24]. In the database,
nursing diagnoses are based on NANDA-I [25],
nursing outcomes for each diagnosis are based on the Nursing Outcomes Classification (NOC) [26], and the interventions
provided to achieve the expected outcomes are based on the
Nursing Interventions Classification (NIC) [27].
During the original study, 34,927 patients were recorded
with 42,403 unique episodes. An episode is defined as an
uninterrupted patient stay in a hospital unit, comprised of
nurse shifts (single or multiple). Each shift corresponds to
a single plan of care (POC) in which nurses documented
many attributes (variables), including the patients’ diagnoses
(medical and NANDA-I), different outcomes (NOCs) for
each diagnosis along with their initial and expected score
(between 1 and 5), and the interventions (NICs) provided to
achieve the expected outcomes. The POC also incorporates
the characteristics of each nurse (e.g., experience, education,
patient load, continuity, work pattern, etc.) who provided
care during the patient's hospital stay. Linked to the POC,
HANDS also stores demographic information about the
patient (e.g., medical diagnosis, age, gender, and race). The
database also contains medical notes that contain text
data.
A. Identification of Cancer Patients
To identify the cancer patients, the data were extracted
from the database using SQL queries. For this study, we
extracted cancer patients that had a NOC of Knowledge:
Cancer Management. We also examined the medical notes
in the database to identify all the patients suffering from
cancer. We extracted a list of nouns from the notes associated
with each nurse shift in the database. These lists were
reviewed by domain experts, and the final list consists
of only those words or phrases that indicate the presence of
cancer in the patient. Table I gives a sample of the words
that were used to identify cancer patients.
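As an illustration of the note-screening step, the following minimal Python sketch flags patients whose shift notes contain any phrase from a reviewed term list. The function names, the example notes, the patient identifiers, and the case-insensitive whole-word matching rule are all assumptions made for illustration; the study itself only states that SQL queries and expert-reviewed noun lists were used.

```python
import re

# A few phrases taken from Table I; the full list was curated by domain experts.
CANCER_TERMS = ["adenocarcinoma", "glioblastoma", "leukemia", "lymphoma",
                "radiation therapy", "tumor"]

def build_pattern(terms):
    """Compile one case-insensitive, whole-word pattern covering all terms."""
    # Longer phrases first, so multi-word phrases are tried before shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)

def flag_cancer_patients(notes_by_patient, terms=CANCER_TERMS):
    """Return the set of patient ids with at least one matching note.

    notes_by_patient: dict mapping a (hypothetical) patient id to a list of
    free-text notes, one per nurse shift.
    """
    pattern = build_pattern(terms)
    return {pid for pid, notes in notes_by_patient.items()
            if any(pattern.search(note) for note in notes)}

# Made-up notes, for illustration only.
notes = {
    "p1": ["Pt s/p resection of brain tumor, pain 6/10."],
    "p2": ["Admitted with pneumonia, no acute distress."],
    "p3": ["Hx of Lymphoma; scheduled for radiation therapy."],
}
flagged = flag_cancer_patients(notes)
```

Ambiguous terms from Table I, such as "Mass" or "Breast", would produce false positives under a purely lexical rule like this one, which is presumably why the lists were reviewed by domain experts.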
In this work, our focus is on cancer patients suffering from
pain since cancer is a common condition where addressing
pain is a concern for the patient. Only a limited number of
the patients receive proper pain treatment [22]. Table II gives
a brief overview of the characteristics of our final dataset.
Table II
DATASET CHARACTERISTICS
Full Dataset Characteristics
Total number of Patient Admissions       42,403
Total number of Plans of Care (POCs)     400,000
Number of Patients                       34,927
Number of Cancer Patients                3,685
Number of POCs for Cancer Patients       40,753
III. DIMENSION REDUCTION AND FEATURE EXTRACTION
In this work, our emphasis is on finding patterns of
predictors of interest and building predictive models. We
perform predictive analysis at two levels
of granularity. The first analysis is performed on
episode-level data, whereas the second is performed
at the POC (shift) level and is more "fine-grained".
The episode-level, or "coarse-grained", models
predict the result at the end of the patient's hospitalization,
whereas the fine-grained models, built on shift-level data, predict the outcome at
the end of each nurse shift. Together they give us a trajectory of
predicted outcomes over the entire hospitalization and the
ability to track care provision during it.
We use different nursing and patient characteristics, along
with nursing care parameters to determine whether a
NOC is met or not met. The target variable for predictive
modeling is ”NOC met” which is defined as follows:


\[
\text{NOC met} =
\begin{cases}
\text{Met}, & \text{Final NOC rating} \geq \text{Expected Outcome rating},\\
\text{Not met}, & \text{otherwise.}
\end{cases}
\tag{1}
\]
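The target definition in Equation (1) can be written as a small function. This is a minimal sketch: the function name and the range check on the 1 to 5 rating scale are illustrative additions, not part of the original study.

```python
def noc_met(final_rating, expected_rating):
    """Return "Met" if the final NOC rating reaches the expected rating (Eq. 1)."""
    for rating in (final_rating, expected_rating):
        # NOC ratings are documented on a 1-5 scale; reject anything else.
        if not 1 <= rating <= 5:
            raise ValueError("NOC ratings are on a 1-5 scale")
    return "Met" if final_rating >= expected_rating else "Not met"
```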
Our dataset has 747 variables or features, including but not
limited to patients and nurses demographics and the nursing
diagnoses, outcomes and interventions. Most of these features, however, are null or empty for a given patient record.
A typical patient has four different diagnoses, five NOCs
and ten different NICs on their care plans, so only
about 2% of the entries in the dataset contain any useful information. The dataset is therefore both sparse
and high-dimensional. To reduce the dimensions,
we do not use typical dimension-reduction techniques such as
Principal Component Analysis or data manifolds, since these
techniques do not consider the contextual information of
the variables. These techniques transform the data and have
global linearity assumptions. Instead, association mining has
been used to reduce the dimensions. Association mining
helps identify prevalent patterns, which
can then be used as a basis for predictive modeling. We
use the Apriori algorithm to discover and extract association
rules from the dataset. Only rules with a
confidence of at least 55% and a support of at least 10% have been
included.
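To make the two thresholds concrete, the sketch below computes support and confidence for rules of the form "feature value → outcome" and keeps those that pass both cut-offs. It is a simplified stand-in restricted to single-item antecedents, not the full Apriori algorithm, and it assumes the usual definitions (support is the fraction of records containing both sides of the rule; confidence is that count divided by the number of records containing the antecedent). The record layout, key names, and toy data are hypothetical.

```python
from collections import Counter

def mine_rules(records, outcome_key="noc_met",
               min_support=0.10, min_confidence=0.55):
    """Return rules (item, outcome, support, confidence) passing both thresholds.

    records: list of dicts; every key other than outcome_key is a
    discretized feature, e.g. {"age": "Young", "noc_met": "Not met"}.
    """
    n = len(records)
    item_counts = Counter()   # occurrences of each (feature, value) item
    pair_counts = Counter()   # co-occurrences of an item with an outcome
    for rec in records:
        outcome = rec[outcome_key]
        for feature, value in rec.items():
            if feature == outcome_key:
                continue
            item_counts[(feature, value)] += 1
            pair_counts[((feature, value), outcome)] += 1

    rules = []
    for (item, outcome), both in pair_counts.items():
        support = both / n                      # P(item and outcome)
        confidence = both / item_counts[item]   # P(outcome | item)
        if support >= min_support and confidence >= min_confidence:
            rules.append((item, outcome, support, confidence))
    # Highest-confidence rules first.
    return sorted(rules, key=lambda r: -r[3])

# Tiny made-up dataset, for illustration only.
toy = (
    [{"age": "Young", "noc_met": "Not met"}] * 3
    + [{"age": "Young", "noc_met": "Met"}] * 1
    + [{"age": "Old", "noc_met": "Met"}] * 4
    + [{"age": "Old", "noc_met": "Not met"}] * 2
)
rules = mine_rules(toy)
```

On the toy data only two rules survive: "Young → Not met" (support 0.30, confidence 0.75) and "Old → Met" (support 0.40, confidence about 0.67).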
To study the clinical significance of the patient and nurse
variables like patient’s age, length of stay (LOS) and nurse
experience on the outcome being met or not met, these
features have been discretized based on theoretical rationale
and data frequency distribution [28]–[30]. Furthermore, the
NANDA-Is and NICs have been clustered together into their
respective domains and classes [25]–[27], according to the
nursing literature to further reduce the dimensions of the
data. Some of the rules generated by association mining are
given below:
• Young Patient → Expected Outcome is not met (confidence = 61.2%, support = 11.3%)
• Old Patient → Expected Outcome is met (confidence = 57.5%, support = 20.7%)
• Short Stay → Expected Outcome is not met (confidence = 59.0%, support = 12.1%)
• Medium Stay → Expected Outcome is met (confidence = 56.0%, support = 18.1%)
• First Rating = 5 → Expected Outcome is not met (confidence = 60.3%, support = 29.0%)
• First Rating = 3 or 4 → Expected Outcome is met (confidence = 77.1%, support = 36.2%)
• Expected Rating ≤ 4 → Expected Outcome is met (confidence = 76.5%, support = 37.4%)
• At least one diagnosis from the Infection Class is present → Expected Outcome is met (confidence = 60.4%, support = 40.9%)
• No NANDA from the Infection Class is present → Expected Outcome is met (confidence = 55.8%, support = 16.0%)
• No Intervention from the Patient Education Class is present → Expected Outcome is not met (confidence = 58.6%, support = 12.3%)
• Inexperienced Nurse → Expected Outcome is met (confidence = 56.9%, support = 39.2%)
Using the results from the association mining, we determine
a comprehensive list of all the features that we will be using
for our predictive modeling experiments. All of the features
(primitive as well as derived) are listed in Table III.

Table III
FEATURES USED FOR PREDICTIVE MODELING
Feature Name: Feature Values
Initial & Expected NOC Rating: 1-5
Age: Young (18-49), Middle-aged (50-64), Old (65-84) & Very Old (85+)
Length of Stay (LOS) (Derived): Short (<2 days), Medium (2-5 days) & Long (5+ days)
Average Nurse Experience (Derived): Inexperienced (experience of less than 2 years) & Experienced (experience of 2+ years)
NANDA Domains: Activity/Rest, Comfort, Coping/Stress Tolerance, Elimination, Nutrition, Perception, & Safety/Protection
NANDA Classes: Cardiovascular/Pulmonary Responses, Cognition, Hydration, Infection, Physical Comfort, Physical Injury, & Pulmonary System
NOC Domains: Community Health, Family Health, Functional Health, Health Knowledge & Behavior, Perceived Health, Physiologic Health, & Psychosocial Health
NOC Classes: Community Health Protection, Family Caregiver Performance, Growth & Development, Mobility, Self-Care, Health Behavior, Health Knowledge, Risk Control & Safety, Cardiopulmonary, Sensory, Therapeutic Response, Psychological Well-Being, Social Interaction, & Self-Control
NIC Domains: Behavioral, Health System, Safety, Physiological: Basic, & Physiological: Complex
NIC Classes: Activity & Exercise Management, Cognitive Therapy, Communication Enhancement, Drug Management, Electrolyte and Acid/Base Management, Immobility Management, Information Management, Nutrition Support, Patient Education, Physical Comfort Promotion, Psychological Comfort Promotion, Respiratory Management, Risk Management, Self-Care Facilitation, Skin/Wound Management, & Tissue Perfusion Management
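The derived categorical features in Table III can be produced with simple binning functions. The sketch below uses the cut-offs from the table; the function names are hypothetical, and placing a stay of exactly 5 days in the Medium bin is an assumption, since the table lists both "2-5 days" and "5+ days".

```python
def discretize_age(age_years):
    """Map age in years to the age groups of Table III."""
    if age_years < 18:
        raise ValueError("Table III defines age groups for adults (18+) only")
    if age_years <= 49:
        return "Young"
    if age_years <= 64:
        return "Middle-aged"
    if age_years <= 84:
        return "Old"
    return "Very Old"

def discretize_los(days):
    """Map length of stay in days to Short / Medium / Long."""
    if days < 2:
        return "Short"
    # Assumption: exactly 5 days counts as Medium (the table is ambiguous here).
    if days <= 5:
        return "Medium"
    return "Long"

def discretize_experience(years):
    """Map average nurse experience in years to the two groups of Table III."""
    return "Inexperienced" if years < 2 else "Experienced"
```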
IV. FRAMEWORK OVERVIEW
In this section, we explain the framework
shown in Figure 1 and the experimental setup that
has been used for predictive modeling. A brief overview of
the models applied in our
experiments is also provided.
Figure 1. Framework for Predictive Analysis of Patient Hospitalization
and Analysis of Plans of Care
A. Problem Definition 1
Given a patient suffering from cancer, our objective is
to determine whether that patient meets their expected pain
outcome at the end of their hospitalization, irrespective
of whether the patient is discharged or dies.
Different classification methods have been used for prediction analysis using the features listed in Table III. Our aim, therefore,
is to determine the outcome at the end of the patient
hospitalization (Z) such that:


\[
Z =
\begin{cases}
1, & \text{patient } i \text{ meets the expected outcome at the end of patient hospitalization},\\
0, & \text{otherwise.}
\end{cases}
\tag{2}
\]
Figure 1 depicts the two types of analysis that can
be performed on the data. The "Workflow for Episodes
Analysis" part of Figure 1 shows the framework for prediction analysis of
the episodes. After identifying all the patients suffering
from cancer, as described previously, we start our
prediction analysis using different classification methods.
The process may be repeated iteratively to improve the
results. The best predictive models are selected according
to the selected evaluation metrics, and the results can be
clinically interpreted.
A simplified procedure for finding the model that best predicts whether patients met or did
not meet their expected outcome is given in Algorithm 1.
Algorithm 1: Finding the best classification model to identify
patients that meet their Expected Outcome
Input: set S = {s | s is a patient in the database}.
Select Y ⊂ S such that Y = {y | y is a patient that has
cancer and pain, identified through SQL and medical notes}
while (y in Y) do
    Use different classification methods to build a model
    that predicts whether patient y meets their expected outcome.
end while
Output: Classification model X that gives the best accuracy and AUC results.
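The model-selection loop of Algorithm 1 can be sketched as follows, using k-fold cross-validated accuracy as the selection criterion. This is an illustrative sketch only: the two learners are deliberately trivial stand-ins rather than the models used in the study, selection here is by accuracy alone, and all function names and the toy data are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit, X, y, k=10, seed=0):
    """Mean held-out accuracy over k folds.

    fit(X_train, y_train) must return a function predict(x) -> label.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        test = set(held_out)
        X_tr = [x for i, x in enumerate(X) if i not in test]
        y_tr = [t for i, t in enumerate(y) if i not in test]
        predict = fit(X_tr, y_tr)
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def select_best(candidates, X, y, k=10):
    """Return (name, accuracy) of the candidate with the highest CV accuracy."""
    results = {name: cross_val_accuracy(fit, X, y, k)
               for name, fit in candidates.items()}
    best = max(results, key=results.get)
    return best, results[best]

# Stand-in learner 1: always predict the majority class of the training fold.
def fit_majority(X_tr, y_tr):
    majority = max(set(y_tr), key=y_tr.count)
    return lambda x: majority

# Stand-in learner 2: a fixed rule that ignores the training data and
# predicts "Met" whenever the first rating is below 5.
def fit_first_rating_rule(X_tr, y_tr):
    return lambda x: "Met" if x["first_rating"] < 5 else "Not met"

# Made-up data in which the fixed rule happens to be exactly right.
X = [{"first_rating": r} for r in [3, 4, 5, 5] * 5]
y = ["Met" if x["first_rating"] < 5 else "Not met" for x in X]
best, acc = select_best(
    {"majority": fit_majority, "first_rating_rule": fit_first_rating_rule},
    X, y, k=5)
```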
B. Problem Definition 2
Given a patient suffering from cancer, our second objective is to determine whether that patient meets their
pain outcome at the end of the current nurse shift. Similar
classification methods are utilized for prediction analysis
using the features listed along with the result of the previous
shift. Hence, our objective is to determine the outcome at
the end of each nurse shift (Z') such that:

\[
Z' =
\begin{cases}
1, & \text{patient } i \text{ meets the expected outcome at the end of the current nurse shift},\\
0, & \text{otherwise.}
\end{cases}
\tag{3}
\]
We have developed a dynamic hierarchical framework
for prediction analysis of the care plans, as shown in the
"Workflow of POC Analysis" part of Figure 1. After identifying all the
patients suffering from cancer, as described previously, we
start our prediction analysis using different classification methods. The first level uses all the features to predict
whether the patient will meet
their outcome at the end of the first nurse shift. In each
further step, the result of the previous step is used, together
with the other features, to determine whether the
patient meets their outcome in the current nurse shift. The
process continues until the result of the last care plan is
predicted. Since the patient's total number of care plans
is not known at the beginning of the hospitalization, the
framework does not have a definite number of steps and
can thus be considered dynamic and hierarchical. A
simplified procedure is given in Algorithm 2.
Algorithm 2: Finding the best classification model that determines whether the patient meets their expected outcome at
the end of the current nurse shift
Input: set S = {s | s is a patient in the database}.
Select Y ⊂ S such that Y = {y | y is a patient that has
cancer and pain, identified through SQL and medical notes}
while (y in Y) do
    For the first care plan or POC of patient y, use different
    classification methods to build a predictive model that
    predicts whether patient y meets their expected outcome
    at the end of the first nurse shift.
    for (shift number = 2 to last shift) do
        Use the result from the previous iteration, along
        with the other features, to predict whether patient y meets
        their expected outcome at the end of nurse shift
        shift number, using the various classification algorithms.
    end for
end while
Output: Classification model X' that gives the best accuracy and AUC results.
C. Modeling
After the data extraction and data refinement step, the next
step is building the actual predictive models. To predict whether the patient will be able to meet their expected
outcome, either at the end of their hospitalization or at the
end of the current nurse shift, we built multiple predictive
models on our dataset and compared their performance.
These models are based on Decision Trees [31], k-NN [32],
Support Vector Machines (SVM) [33], Naïve-Bayes [34] and
Linear Regression (LR) [32].
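The shift-by-shift procedure of Algorithm 2 can be sketched as a loop that feeds each shift's result into the next prediction. The two model callables, the feature names, and the key `previous_result` are hypothetical; the sketch also assumes that the predicted (rather than the observed) result of the previous shift is carried forward, which the algorithm description leaves open.

```python
def predict_episode(shifts, first_shift_model, later_shift_model):
    """Predict the outcome of every shift in one episode, in order.

    shifts: list of feature dicts, one per nurse shift.
    first_shift_model(features) and later_shift_model(features) are
    callables returning a label; the second also sees "previous_result".
    """
    predictions = []
    for number, features in enumerate(shifts, start=1):
        if number == 1:
            label = first_shift_model(features)
        else:
            # Add the previous shift's predicted result as an extra feature.
            enriched = dict(features, previous_result=predictions[-1])
            label = later_shift_model(enriched)
        predictions.append(label)
    return predictions

# Toy stand-in models, for illustration only.
def first_model(f):
    return "Met" if f["expected_rating"] <= 4 else "Not met"

def later_model(f):
    return "Met" if f["patient_education"] else f["previous_result"]

episode = [
    {"expected_rating": 5, "patient_education": False},
    {"expected_rating": 5, "patient_education": False},
    {"expected_rating": 5, "patient_education": True},
    {"expected_rating": 5, "patient_education": False},
]
trajectory = predict_episode(episode, first_model, later_model)
```

Because the loop simply runs until the episode's shifts are exhausted, the number of steps does not need to be known in advance, which matches the "dynamic" character of the framework.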
D. Experimental Setup and Evaluation Metrics
The performance of the models has been evaluated using
the 10-fold cross-validation method [35]. Different evaluation
metrics, namely accuracy, F-measure and Area Under the Curve (AUC),
are used to compare the results of the experiments.
Each of these metrics is informative under different circumstances. Wherever practical, we have also performed the χ² test [36] to verify whether our results are statistically
significant.
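For reference, the three metrics can be computed from predictions and scores as in the sketch below. It assumes the standard definitions (F-measure as the harmonic mean of precision and recall for the positive class; AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative one, with ties counted as one half). The function names, the choice of "Met" as the positive class, and the toy values are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred, positive="Met"):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores, positive="Met"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels, predictions and scores, for illustration only.
y_true = ["Met", "Met", "Not met", "Met", "Not met"]
y_pred = ["Met", "Not met", "Not met", "Met", "Met"]
scores = [0.9, 0.4, 0.3, 0.8, 0.6]   # model confidence that the label is "Met"
```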
V. RESULTS
A. Predicting outcome at the end of hospitalization (Coarse-Grained Models)
In the first set of experiments, we determine whether the
patient will be able to meet their expected outcome at the
end of their hospitalization, based on different patient and
nurse characteristics, along with the nursing diagnoses
and interventions that were applied to achieve the expected
rating. The foremost objective is to determine, from these
variables, best practices that lead to better outcomes.
The target variable "NOC met" is a binary
variable. The results are given in Table IV.
As observed from the results, the decision tree (shown in
Figure 2) has the best prediction accuracy (80.1%) and
F-measure (0.81) of all the models, with an AUC of
0.731. The Naïve-Bayes model also has good
accuracy and an AUC comparable to that of the
decision tree. The k-NN models fare less well when accuracy
is compared, though k-NN (k = 10) has a better
AUC than the decision tree and k-NN (k = 5) has
almost the same AUC as the decision tree. The SVM results are
virtually the same as those of the k-NN models, whereas LR has the
worst prediction accuracy, AUC and F-measure of all the
models.
We use the z-test to check whether the results of different
predictive models are statistically different. We
consider a difference significant if the p-value is
below 0.05. We first compare the best and the second-best
results; if they are statistically the same, we compare the best result
with the third best, and so on. Using the z-test, the accuracy
of the Decision Tree model was significantly different
from that of Naïve-Bayes (p-value
< 0.001). The AUC results are statistically the same for Decision
Tree, Naïve-Bayes, k-NN (k = 5 or 10) and SVM; they
were significantly better than that of k-NN (k = 2) (p-value = 0.035). The F-measure of the Decision Tree was
also significantly different from that of Naïve-Bayes (p-value < 0.01).
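A comparison of two accuracies of this kind can be sketched as a two-sided, two-proportion z-test. The paper does not state which form of the z-test or which sample sizes were used, so the pooled, independent-samples form below and the sample size of 1,000 per model are assumptions made purely for illustration.

```python
import math

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two proportions.

    p1, p2: observed proportions (e.g. accuracies); n1, n2: sample sizes.
    Returns (z statistic, two-sided p-value).
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Accuracies from Table IV; the sample size of 1000 is hypothetical.
z, p = two_proportion_z_test(0.801, 1000, 0.708, 1000)
```

Since all models are evaluated on the same records, a paired test could also be argued for; the independent-samples form is shown only because it is the simplest instance of a z-test on proportions.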
The decision tree model predicting whether
a patient meets their expectation is given in Figure 2. The
NOC rating in the first nurse shift is the most important
attribute for predicting whether the patient will meet their
expectation. When the nurse sets the rating at 5, there is a
68.6% probability that the patient will not meet the
outcome; conversely, when the nurse sets the first rating at
4 or below, the patients meet their outcome 77.1% of the
time. This difference was statistically highly
significant (p-value ≤ 0.0005).
When the rating in the first nurse shift is set at 4 or below,
the next important attribute is the Electrolyte and Acid/Base
Management NIC class. Whenever any intervention from
this class was applied to the patient, they did not meet their
outcome 55.6% of the time, whereas, when no intervention
from the Electrolyte and Acid/Base Management class was
used, 79.7% of the patients met their outcome. The
chi-squared test showed this result to be statistically
significant (p-value ≤ 0.001).
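A chi-squared test on a split like this one operates on a 2×2 table of counts (intervention applied or not, outcome met or not). The sketch below implements the Pearson statistic without continuity correction and its p-value for one degree of freedom. The counts are hypothetical and were chosen only to roughly mirror the percentages quoted above; the actual counts are not given in the text.

```python
import math

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p-value) with one degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = intervention applied / not applied,
# columns = outcome met / not met.
stat, p = chi_squared_2x2(44, 56, 797, 203)
```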
When the rating was set to 5 by the nurse in the first
shift, the next significant feature is the Cognitive Therapy
NIC class. Whenever no intervention from the Cognitive
Therapy class was administered, the probability of the
patient not meeting their outcome was 71.5%. Conversely,
55.2% of the patients met their expected outcome whenever
an intervention from this class was present in the patient's care plan.
Once more, the finding was statistically significant (p-value
= 0.003). When the patient's POC included an
intervention from the Cognitive Therapy class, age was the next
key element. Young patients did not fare well: two-thirds of them did not meet their outcome. On
the other hand, 59.7% of the middle-aged patients and
54.7% of the older patients met their expected
outcomes.

Table IV
PREDICTION RESULTS FOR CANCER COARSE-GRAINED MODELS
Model               Accuracy (%)   AUC     F-Measure
Decision Tree       80.1           0.731   0.81
Naïve-Bayes         70.8           0.728   0.74
k-NN (k=2)          68.1           0.671   0.73
k-NN (k=5)          66.7           0.718   0.70
k-NN (k=10)         69.6           0.744   0.72
SVM                 67.1           0.732   0.68
Linear Regression   63.1           0.638   0.63

Table V
PREDICTION RESULTS FOR FINE-GRAINED MODELS
Model Used          Accuracy (%)
Decision Tree       72.8
Naïve-Bayes         68.2
k-NN (k = 2)        63.8
k-NN (k = 5)        69.8
k-NN (k = 10)       70.2
SVM                 64.4
LR                  62.6
B. Fine-grained Analysis of Plans of Care (POCs)
After performing the earlier experiment at the episode level, we proceeded to build models at the plan of care
(POC), or nurse shift, level. Our main objective for this
experiment was to predict whether the patient would meet
their expected outcome at the end of each nurse shift within
each episode. The predictors for this experiment included the
NOC rating in the previous shift (not available for the first shift), the
expected NOC rating set by the nurse in the first shift, the age
of the patient, the experience of the nurse taking care of the
patient in the current shift, and the different NANDA and NIC
domains and classes. We excluded the LOS feature
from the analysis of POC-level data.
SVM and LR models work only for binomial target variables; we therefore used a 1-against-all strategy, under
polynomial by binomial classification, for both the SVM and LR
models. Only accuracy was used as a performance metric
for this experiment. The other metrics could not be used because
the target variable takes more than two values. The
results are given in Table V.
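The 1-against-all strategy reduces a multi-valued target to one binary problem per class and then picks the class whose binary model is most confident. The sketch below shows only the decision step; the per-class scoring functions are assumed to have been trained already, and the toy scorers and the feature name `previous_rating` are hypothetical.

```python
def one_vs_all_predict(binary_scorers, x):
    """Pick the class whose binary scorer is most confident for x.

    binary_scorers: dict mapping a class label to a function score(x) that
    is larger when x looks like that class rather than any other class.
    """
    return max(binary_scorers, key=lambda label: binary_scorers[label](x))

# Toy scorers for the NOC ratings 1-5: each prefers records whose previous
# rating is close to its own class label (illustration only).
scorers = {
    rating: (lambda x, rating=rating: -abs(x["previous_rating"] - rating))
    for rating in range(1, 6)
}
predicted = one_vs_all_predict(scorers, {"previous_rating": 4})
```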
The results indicate that the decision tree model (Figure 3) has
the best accuracy, 72.8%, among all
the models. The k-NN models (k = 5 or 10) and the Naïve-Bayes
model also have good accuracy. The 2-NN, SVM and LR models do
not perform as well as the other models.
Using the z-test, it was determined that the difference between
the accuracy of the decision tree model and that of the k-NN
(k = 10) model is statistically significant (p-value < 0.01).
Figure 3 depicts the decision tree model for predicting the
NOC rating at the end of each shift. The NOC ratings
were coded in the experiment as follows: 1 as worst,
2 as bad, 3 as average, 4 as good, and 5 as
best. The decision tree shows that the
most important attribute for determining the NOC rating in the
current shift is the NOC rating of the previous shift. Other
important attributes were the Patient Education NIC class, the Risk
Management NIC class, diagnoses from the Comfort domain,
interventions from the Tissue Perfusion Management class,
and the age of the patient.
Figure 2. Decision Tree Diagram for Patients suffering from Cancer
Figure 3. Decision Tree Model for Cancer Patients shift level records (Predicting NOC Rating in the current shift)
VI. CONCLUSION
Big data from EHR systems have provided new
opportunities for rapidly discovering evidence that can be
used to improve care for the patients. Predictive modeling is
one method of examining EHR data that has great promise.
It is, however, a difficult task to employ predictive modeling
techniques on high-dimensional and sparse datasets like
those gathered in EHRs. Traditional dimension-reduction
techniques are not well suited to such datasets because these
methods do not consider the contextual features, which
are taken into account when association mining is applied
as a dimension-reduction technique. Through association
mining, strong associations are found in the data, which
helps in eliminating the dimensions that have little to
no impact on prediction accuracy.
In this study, we used association mining as a dimension-reduction step in the data transformation stage to determine
the relationship, if any, between different nursing and
patient features and the outcomes of cancer patients. The results
obtained from association mining were used to extract the
features needed for predictive modeling. Models were
then built using various existing techniques. Generally
high accuracy was achieved in our experiments at the
episode level (coarse-grained experiment), with the decision
tree giving the best accuracy and a good AUC. Using the
decision tree, we also identified the features that were vital
in predicting whether the outcome was met or not met.
Some of the results were statistically significant and, with
the help of domain experts, we were able to confirm that
those results were clinically significant.
In the later experiment, we drilled down to shift-level data
to make more fine-grained models in a dynamic hierarchical
framework. We predicted the outcome rating at the end
of each shift, using different features, including the rating
from the previous shift, once again getting high accuracy
results, especially from decision tree model. Using the
prediction results at both the episode-level and POC level,
we are providing a trajectory of predicted outcomes.
The results of the experiments conducted
on big data from a nursing EHR system have demonstrated
that these systems can be used as a building block of a decision
support framework for nursing and related fields
in healthcare. Using innovative methods to
explore data can help in understanding the full potential of
such systems.
Combining EHRs with data mining can complement
traditional research methods by filling gaps in knowledge,
for example by suggesting novel methods for
systematic research. The results obtained from predictive
models can be integrated into these systems quite easily,
whereas it previously took decades to incorporate research
evidence into common practice. If EHR data are not used
for prediction, EHRs are useful only for monitoring
the patient's progress.
Data mining techniques, including association
mining and predictive modeling, can help hospital
management improve the quality of care provided to the
general community [37]. These models can potentially
help minimize healthcare costs while
improving nursing care. Although limitations in the data
exist, standardizing these techniques to help
patients is an important step toward personalized
care.
REFERENCES
[1] M. Hilbert and P. López, The world's technological capacity
to store, communicate, and compute information, Science, 332
(2011), pp. 60–65.
[2] Y. Chen, D. Hu and G. Zhang, Data Mining and Critical Success Factors in Data Mining, Knowledge Enterprise: Intelligent
Strategies in Product Design, Manufacturing, and Management,
(2006), pp. 281–287.
[3] M. K. Lodhi, R. Ansari, Y. Yao, G. M. Keenan, D. J. Wilkie,
and A. A. Khokhar, Predictive Modeling for Comfortable Death
Outcome Using Electronic Health Records, 2015 IEEE International Congress on Big Data, 2015, pp. 409–415.
[4] F. Pan, G. Cong, A. Tung, J. Yang, and M. J. Zaki,
CARPENTER: Finding Closed Patterns in Long Biological
Datasets, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
(2003), pp. 637–642.
[5] H. Han, J. Liu, D. Shao, and Z. Xin, Mining Frequent Patterns
from Very High Dimensional Data: A Top-Down Row Enumeration Approach, Proceedings of the Sixth SIAM International
Conference on Data Mining, 124 (2006), pp. 282
[6] I. Jolliffe, Principal component analysis, Wiley Online Library
(2002).
[7] G. Hughes, On the Mean Accuracy of Statistical Pattern
Recognizers, IEEE Transactions on Information Theory, 14(1),
January 1968, pp. 55–63.
[8] E. Harper and J. Sensmeier, Why is big data important
to Nurses?, HIMSS (2015), Retrieved June 10, 2016
from: http://www.himss.org/News/NewsDetail.aspx?ItemNumber=43374
[9] D. E. Meier, Increased access to palliative care and hospice
services: opportunities to improve value in health care, Milbank
Quarterly, 89(3), 2011, pp. 343–380.
[10] B. Fung, How the U.S. Health-Care System Wastes
$750 Billion Annually, Retrieved June 12, 2016 from
http://www.theatlantic.com/health/archive/2012/09/how-the-us-health-care-system-wastes-750-billion-annually/262106/
[23] J. Li, S. Fong, Y. Zhuang, and R. Khoury, Hierarchical
Classification in Text Mining for Sentiment Analysis, IEEE
International Conference on Soft Computing and Machine
Intelligence (ISCMI), 2014, pp. 46–51.
[11] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C.
Roxburgh, and A. H. Byers, Big data: The next frontier for
innovation, competition, and productivity, 2011
[24] G. M. Keenan, E. Yakel, Y. Yao, D. Xu, L. Szalacha, D.
Tschannen, Y. Ford, Y. Chen, A. Johnson, K. D. Lopez, and
D. J. Wilkie, Maintaining a Consistent Big Picture: Meaningful
Use of a Web-based POC EHR System, International Journal
of Nursing Knowledge, 23 (2012), pp. 119–133.
[12] S. H. Landis, T. Murray, S. Bolden, and P. A. Wingo, Cancer
statistics, 1998, CA: A Cancer Journal for Clinicians, 48(1), 1998,
pp. 6–29.
[13] R. Yancik, and M. E. Holmes, NIA/NCI Report of the Cancer
Center Workshop (June 13–15, 2001). Exploring the role of
cancer centers for integrating aging and cancer research 2002.
[14] B. Caballero, L. Karla and R. Akella, Dynamically Modeling
Patient’s Health State from Electronic Medical Records: A
Time Series Approach, Proceedings of the 21st ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining, 2015, pp. 69–78.
[15] M. H. Gail, L. A. Brinton, D. P. Byar, D. K. Corle, S. B.
Green, C. Schairer, and J. J. Mulvihill, Projecting individualized probabilities of developing breast cancer for white females
who are being examined annually, Journal of the National
Cancer Institute, 81(24), 1989, pp. 1879–1886.
[25] North American Nursing Diagnosis Association, NANDA
Nursing Diagnoses, North American Nursing Diagnosis Association, (2007).
[26] S. Moorhead, M. Johnson, and M. Maas, Iowa Outcomes
Project, Nursing Outcomes Classification (NOC), Mosby,
(2004).
[27] G. M. Bulechek, H. K. Butcher, and J. M. Dochterman,
Nursing Interventions Classification (NIC), Mosby, (2008).
[28] K. W. Gronbach, The age curve: How to profit from the
coming demographic storm, AMACOM Div American Mgmt
Assn, (2008).
[29] US Department of Health and Human Services, Hospital
utilization (in non-federal short-stay hospitals), (2012).
[16] M. H. Gail, J. P. Costantino, D. Pee, M. Bondy, L. Newman,
M. Selvan, G. L. Anderson, K. E. Malone, P. A. Marchbanks,
W. McCaskill-Stevens and others, Projecting individualized
absolute invasive breast cancer risk in African American women,
Journal of the National Cancer Institute, 99(23), 2007, pp. 1782–1792.
[30] P. Benner, From novice to expert, Menlo Park, (1984).
[17] M. Garzotto, G. Hudson, L. Peters, Y. Hsieh, E. Barrera, M.
Mori, T. M. Beer, and T. Klein, Predictive modeling for the
presence of prostate carcinoma using clinical, laboratory, and
ultrasound parameters in patients with prostate specific antigen
levels ≤ 10 ng/mL, Cancer, 98(7), 2003, pp. 1417–1422.
[32] D. W. Aha, D. Kibler, and M. K. Albert, Instance-based
learning algorithms, Machine Learning, 6 (1991), pp. 37–66.
[31] J. R. Quinlan, C4.5: Programs for Machine Learning,
Elsevier, (2014).
[33] C. Cortes and V. Vapnik, Support-vector networks, Machine
Learning, 20 (1995), pp. 273–297.
[18] R. K. Matsuno, J. P. Costantino, R. G. Ziegler, G. L. Anderson, H. Li, D. Pee, and M. H. Gail, Projecting individualized
absolute invasive breast cancer risk in Asian and Pacific
Islander American women, Journal of the National Cancer
Institute, 2011.
[34] M. A. Arbib, The Handbook of Brain Theory and Neural
Networks, MIT press, (2003).
[19] R. Yancik, M. N. Wesley, L. A. Ries, R. J. Havlik, B. K.
Edwards, and J. W. Yates, Effect of age and comorbidity in
postmenopausal breast cancer patients aged 55 years and
older, JAMA, 285(7), 2001, pp. 885–892.
[36] F. Yates, Contingency tables involving small numbers and the
χ² test, Supplement to the Journal of the Royal Statistical
Society, 1(2), 1934, pp. 217–235.
[20] K. Angelo, A. Dalhaug, A. Pawinski, E. Haukland, and C.
Nieder, Survival prediction score: a simple but age-dependent
method predicting prognosis in patients undergoing palliative
radiotherapy, ISRN oncology, 2014.
[21] M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R.
Joshi, A. Rumshisky, and P. Szolovits, Unfolding physiological
state: mortality modelling in intensive care units, Proceedings
of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 75–84.
[22] N. S. Murthy, and A. Mathew, Cancer epidemiology, prevention and control, Curr Sci, 86(4), 2004, pp. 518-524.
[35] I. H. Witten and E. Frank, Data Mining: Practical machine
learning tools and techniques, Morgan Kaufmann, (2005).
[37] P. Desikan, R. Khare, J. Srivastava, R. M. Kaplan, J. Ghosh,
and L. Liu, Predictive Modeling in Healthcare: Challenges and
Opportunities (2013).