Original Articles

An Evaluation of Patient Safety Event Report Categories Using Unsupervised Topic Modeling

A. Fong1; R. Ratwani1,2
1MedStar Institute for Innovation – National Center for Human Factors in Healthcare, Washington, D.C., USA;
2Georgetown University School of Medicine, Washington, D.C., USA
Keywords
Patient safety event reports, topic model,
latent Dirichlet allocation, general event type,
natural language processing, unsupervised
learning
Summary
Objective: Patient safety event data repositories have the potential to dramatically improve safety if analyzed and leveraged appropriately. These safety event reports often
consist of both structured data, such as general event type categories, and unstructured
data, such as free text descriptions of the
event. Analyzing these data, particularly the
rich free text narratives, can be challenging,
especially with tens of thousands of reports.
To overcome the resource intensive manual
review process of the free text descriptions,
we demonstrate the effectiveness of using an
unsupervised natural language processing
approach.
Methods: An unsupervised natural language
processing technique, called topic modeling,
was applied to a large repository of patient
safety event data to identify topics, or themes, from the free text descriptions of the data. Entropy measures were used to evaluate and compare these topics to the general event type categories that were originally assigned by the event reporter.
Results: Measures of entropy demonstrated that some topics generated from the unsupervised modeling approach aligned with the clinical general event type categories that were originally selected by the individual entering the report. Importantly, several new latent topics emerged that were not originally identified. The new topics provide additional insights into the patient safety event data that would not otherwise easily be detected.
Conclusion: The topic modeling approach provides a method to identify topics or themes that may not be immediately apparent and has the potential to allow for automatic reclassification of events that are ambiguously classified by the event reporter.

Methods Inf Med 2015; 54: 338–345
http://dx.doi.org/10.3414/ME15-01-0010
received: January 14, 2015
accepted: February 27, 2015
epub ahead of print: April 2, 2015

Correspondence to:
Allan Fong, MS
MedStar Institute for Innovation –
National Center for Human Factors in Healthcare
3007 Tilden St. NW, Suite 7M
Washington, D.C. 20008
USA
E-mail: [email protected]
1. Introduction
The Institute of Medicine and several state
legislatures have recommended the use of
patient safety event reporting systems
(PSRS) to better understand and improve
safety hazards [1, 2]. Numerous healthcare
providers have adopted these systems, which provide a framework for healthcare staff, including frontline clinicians, nurses, and technicians, to report patient safety events [3]. Reported patient
events range from “near misses”, where no
patient harm occurs, to serious safety
events that result in patient harm. If the
reported data can be analyzed effectively,
reporting systems have the potential to
dramatically improve the safety and quality
of care by exposing possible weaknesses in
the care process [4].
A patient safety event (PSE) report generally consists of both structured and unstructured data elements [5]. Structured
data are pre-defined, fixed fields that solicit
specific information about the event. For
example, there is often a field for “event
type” which provides pre-determined categories of different types of patient safety
hazards (e.g. medication, falls, lab/specimen, diagnosis/imaging, safety/security,
miscellaneous, etc.). Generally, no definition of the event type category is provided
on the reporting portal; consequently, the
reporter must select a category based on
their own knowledge and intuition. As a result, the miscellaneous category is used frequently as a “catch all” when the reporter is
unsure of the specific category for the event
being reported. The unstructured data
fields generally include a free text field
where the reporter can enter a text description of the event. The text descriptions are
often a rich data source in that the reporter
is not constrained to limited categories or
selection options and is able to freely describe the details of the event. Below is an
example of a PSE report narrative that was
originally classified by the reporter as a
Diagnosis/Imaging event:
“... patient had MRI ordered. Pt needed
premedication. Pt called for at 1100. At 1130
nurse said she needed to call MD to get
meds. At 1145 nurse called to say pt was
given meds and would be sent up. At 1225
nurse called to ask why we hadn’t gotten the
pt. When asked if she called transport she
said no. I told her that transport needs to be
© Schattauer 2015
Methods Inf Med 4/2015
Downloaded from www.methods-online.com on 2017-07-29 | IP: 88.99.165.207
For personal or educational use only. No other uses without permission. All rights reserved.
called or they won’t come. At 1245 transport
called us back to let us know the pt was on
the way. Pt was on the MRI table in the
scanner. Before we could start she started
yelling and tried to get out of the scanner. It
was difficult to get her back on the stretcher
to continue the scan. We made several attempts to reach her nurse and decided to
send her back downstairs without the MRI.
Called the charge nurse to have her let her
nurse know since we couldn’t reach her. We
were later told that the family was very
angry saying that the delay caused the medicine to wear off . . .”
While this event took place in the context of performing an MRI, it is apparent
from the free text description that there are
several contributing factors (e.g., medication, communication). Eliciting this
kind of detail in a structured data format
would require the reporter to spend a tremendous amount of time inputting a report; yet, analyzing this kind of free text
narrative is resource intensive.
PSRS databases can grow to include
thousands of case reports, depending on
the size of the healthcare provider, and effectively analyzing the report data to make
improvements in safety and quality is a significant challenge [6]. While the structured
data elements of the reports are conducive
to analysis with descriptive statistics, the
unstructured text descriptions are particularly difficult to analyze. Without a detailed
review of the text descriptions it is difficult
to categorize and quantify the events. Many
events span multiple categories (e.g. diagnostic imaging, medication, communication); unlike the structured data fields, which constrain the reporter to a single category,
the unstructured descriptions contain details that reflect multiple categories. With
databases containing tens of thousands of
event reports, more efficient algorithmic
techniques are needed to effectively analyze
these data.
In the present study we describe the application of an unsupervised natural language processing (NLP) technique, called
topic modeling, to better understand and
categorize the free text descriptions of PSE
reports. The topics that emerge from the
unsupervised NLP approach are compared
to the general event categories provided by
the reporter to determine how well the
topics align and to examine which new
topics arise. New topics that emerge from
this NLP approach represent latent concepts that have the potential to shed light
on patterns in the PSE data that may otherwise be difficult to detect. An algorithmic
approach to analyzing the free text narratives of PSE reports has the potential to
dramatically reduce the resources required
to analyze these data and provide new insights to better identify safety hazards.
2. Background
Natural language processing (NLP) is the
research and application of computers to
study and analyze written or spoken language [7, 8]. Numerous NLP algorithms
and models have been developed to analyze and help extract meaning from text in
a variety of disciplines and applications,
such as understanding social media, political speeches, and physician discharge summaries [9 –11].
With the increased use of electronic
health records, there has been a growing
corpus of text information in healthcare
that is ripe for NLP. Researchers have applied NLP to the analysis of clinical documentation and discharge summaries to improve care and workflow processes [8, 12].
For example, NLP has been used to assess
clinical conditions, improve clinical decision support, and identify medications
[13–15].
2.1 Natural Language Processing
to Improve Patient Safety
NLP techniques can be used to better recognize patient safety hazards and to improve the analysis of patient safety data.
Researchers have applied NLP to recognize
adverse events from clinical documentation and have also applied NLP to large adverse event reporting repositories to classify events [11, 16, 17]. Our focus is on the
application of NLP to PSE report data with
a focus on the free text narrative within
each report. NLP techniques have been
successfully used to categorize and classify
PSE reports, for example, to classify serious
safety events or health information technology (IT) events [17–20]. This research
demonstrated that the NLP approach can
be applied successfully and that unique insights can be gained from the analysis of
the free text. Previous NLP approaches,
however, have relied on supervised modeling, which requires a “gold standard” or
ground truth from which algorithms can
learn from and train on in order to perform well. One challenge with supervised
learning is that the training sets are not always available and often have to be created
using a manual review process that is resource intensive, especially given large
datasets with free text as found in PSE reports.
Unsupervised NLP techniques, such as
topic modeling, offer another method to
understand free text without the resource
intensive training datasets required of
supervised learning. Topic modeling, such
as Latent Dirichlet Allocation (LDA), is a
statistical approach to discover or identify “topics” associated with words or phrases [21]. Topic models have several advantages when working with large, complex text datasets where the topic categories might not be clear or easy to discern a priori, and have been shown to be generally useful in several applications, such as analyzing text-based survey responses, foreign media, and FDA drug labels [22–24]. Topic models, however, have not been used to analyze free text in PSE reports. The application of an unsupervised NLP approach to analyzing PSE reports provides several advantages. This approach does not require the resource intensive task of creating a training set, it is less influenced by human annotation biases, and it has the potential to identify latent themes in the data that are not immediately apparent.

3. Objectives

We extend previous topic modeling research by utilizing unsupervised LDA to model topics in PSE reports and compare the resulting LDA topics to the general event type (GET) categories originally selected by the event reporters. In this study, we are interested in how reports are distributed across the LDA topics and, in particular, how this distribution relates to the reporter-defined GET categories. We hypothesize that while some LDA topics will align well with certain GET categories, there will be several new topics or themes discovered through LDA that provide greater insight into the PSE data. The algorithmic recognition of these “latent” topics may serve to uncover patterns in the data that would otherwise be difficult to detect.

4. Methods

LDA was used to discover topics, or themes, in the free text descriptions of PSE reports. We compared the LDA topics to the default general event type (GET) categories available to event reporters. When submitting a report, reporters have to select one category (from the list of 21 GET categories) believed to be most applicable to the event; the categories available to reporters are shown in the left column of ▶ Table 1. Furthermore, we used entropy measures, which are indicators of information content, to evaluate the report distributions along the LDA topics and the GET categories. Our method for topic modeling and evaluation is summarized in ▶ Figure 1.

Figure 1  Outline of topic modeling and evaluation methodology

Table 1  Distribution of patient safety events

General Event Type Category            Percent of total reports
Medication/Fluid                       18%
Lab/Specimen                           15%
Fall                                   12%
Miscellaneous                          10%
Blood Bank                             8%
Diagnosis/Treatment                    5%
Patient ID/Documentation/Consent       5%
Surgery/Procedure                      4%
Skin/Tissue                            4%
Lines/Tubes/Drain                      3%
Safety/Security                        3%
Diagnostic Imaging                     3%
Professional Conduct                   2%
Equipment/Medical Device               2%
Maternal/Childbirth                    1%
Airway Management                      1%
Infection Prevention                   1%
Facilities                             1%
Healthcare IT                          less than 1%
Restraints/Seclusion Injury            less than 1%
Tube/Drain                             less than 1%

4.1 Data Source and Preprocessing
PSE reports were collected over a 16 month
period (January 2013 to April 2014) from a
large, multi-hospital, healthcare system in
the mid-Atlantic region of the United
States. A total of 29,300 reports were collected during this time period from a
common patient safety reporting system
(PSRS), RL Solutions (www.rlsolutions.
com). Each report has an associated GET
category that is selected by the reporter of
the event. ▶ Table 1 shows the GET categories and the frequency of reports in
each GET category. Reports were most
commonly categorized by the reporter as
“Medication/Fluid”, “Lab/Specimen”, “Fall”,
and “Miscellaneous”. Reports categorized
under “Tube/Drain” were combined with
the “Miscellaneous” report category because there were only two “Tube/Drain” reports; all other GET categories had at least
100 unique reports. We used these 20 GET
categories for the remainder of our analysis. The free text narrative in each PSE report was extracted for analysis using the
LDA approach.
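The extraction and cleanup applied to these narratives can be illustrated with a short sketch; the example strings below are invented, not from the study's data:

```python
# Illustrative sketch only: deduplicate narratives, strip numbers and
# punctuation, tokenize into unigrams, and drop reports under two words.
# The example report strings are invented; they are not from the study.
import re

def preprocess(narratives):
    seen = set()
    cleaned = []
    for text in narratives:
        # Keep letters only, lowercase, and collapse extra whitespace.
        tokens = re.sub(r"[^a-zA-Z\s]", " ", text.lower()).split()
        if len(tokens) < 2:
            continue  # reports with fewer than two words are removed
        key = " ".join(tokens)
        if key in seen:
            continue  # duplicate free-text reports are removed
        seen.add(key)
        cleaned.append(tokens)
    return cleaned

reports = ["Pt fell in bathroom at 1130.", "Pt fell in bathroom at 1130.", "fall"]
print(preprocess(reports))  # → [['pt', 'fell', 'in', 'bathroom', 'at']]
```

Note that this sketch deliberately leaves abbreviations such as "pt" unexpanded, mirroring the study's decision not to resolve context-specific abbreviations.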
To perform the topic model analysis the
data were first preprocessed. Duplicate free
text reports and reports that had less than
two words were removed from the data.
This resulted in 29,131 useful reports. We
also removed numbers, punctuations, extra
white space, and tokenized the text into
unigrams. Unlike newspaper articles or
physician discharge summaries, which are
typical candidates for NLP, the text in the
PSE reports are generally written in a colloquial manner and the text are considerably
less structured than other sources. The nature of the PSE reports makes determining
certain features, such as abbreviations difficult. For example, the “pt” in the phrase “pt
arrived in the morning” could mean “patient” or “physical therapist”. For the purposes of this study, we did not account for all of the nuanced and context-specific differences in abbreviations, synonyms, or misspellings. This will be an important topic to address in future work.

4.2 Topic Modeling

LDA was used to model topics in the PSE reports [21]. LDA assumes a generative process represented by a standard plate model (▶ Figure 2). In this model, there are M reports and N words per report. Each report i has a topic distribution Θi and each latent topic k has a word distribution. Unlike the GET categories, the topics generated through this process are considered latent because they are not directly observable. The LDA approach assumes a Dirichlet prior α on the distribution of topics per report, and a Dirichlet prior β on the distribution of words per topic. Intuitively, each word w in report i has a probability of being generated by a latent topic j.

Figure 2  Generative model for LDA

We used LDA as implemented in the LDA R package [25]. All reports were first assigned to the LDA topic with which they most aligned. Following a method similar to Bisgin et al., topics were ranked by the total number of reports assigned to the topic [24]. To reduce the topic space, we selected the top one-third most populated topics (33 topics) and used these topics to re-categorize all the reports. Thirty-three topics were chosen because this number was slightly larger than, but similar to, the number of GET categories. At the end of this process, each of the 29,131 reports had two labeled assignments: one from the 33 LDA topics and one from the 20 GET categories.

Figure 3  LDA topic distribution results after re-categorization

The dispersion of reports across LDA topics and GET categories was evaluated using measures of entropy [26]. Entropy was used as a proxy for the information content of a topic or category. A topic, or category, with low entropy is considered stable, well understood, and well characterized. In contrast, a topic with a higher entropy value would be considered more uncertain or unpredictable, suggesting that the topic carries new, additional information. We defined the entropy of the reports across an LDA topic t as:

H(x_t) = − Σ_g p(x_{t,g}) log p(x_{t,g})

H(x_t) is the entropy of the reports across the LDA topic t: the probability that a report categorized as LDA topic t has general event type category g, scaled by the information content of x_{t,g}, with this product summed across all GET categories. For example, suppose there were only two GET categories (G-a, G-b) and we were evaluating two LDA topics or themes (L-a, L-b). Assume L-a was more highly associated with the G-a category (the probability of L-a appearing in G-a was higher than the probability of L-a appearing in G-b), while L-b appeared similarly throughout the G-a and G-b reports. As a result, the entropy of L-a across the GET categories would be much lower than that of L-b. The former LDA topic more clearly aligned with the GET categories, while the latter represented a theme that crossed multiple clinical GET categories. Similarly, we calculate the entropy of the reports across the GET categories, H(x_g), by switching the t and g notations.

We used this approach to evaluate and compare how the LDA topics and the GET categories categorize the reports. An LDA topic with low entropy implies that the topic can be explained fairly well by a few GET categories: there is little information in the LDA topic that was not already accounted for by the GET category, and little randomness in the GET category associated with that specific LDA topic. On the other hand, an LDA topic with high entropy suggests that the topic is not clearly accounted for by any of the GET categories and implies a latent topic in the PSE reports that spans many GET categories. The high entropy LDA topics are particularly interesting as they highlight important patterns and trends in the reports that would otherwise be hidden amongst the GET categories. Similarly, GET categories with low entropy imply low randomness across the LDA topics, while GET categories with high entropy suggest that those categories do not have highly structured themes. This approach allows us both to understand the latent topics in the PSE reports and to assess how well these topics align with the GET categories.

5. Results

LDA was used to generate 100 topics and each PSE report was initially assigned to its most probable LDA topic. The most populated 33% of LDA topics were then used to re-categorize all the reports, resulting in the LDA topic distribution shown in ▶ Figure 3. This process reduced the topic space to the more relevant topics and provided smoothing to the report distribution.

Figure 4  Entropy for LDA topics across GET categories
5.1 LDA Entropy
We first used entropy to evaluate the GET
category information content for each LDA
topic. The entropy varied greatly (range =
0.6 to 3.5, mean = 2.4, SD = 1.2) depending on
the LDA topic (▶ Figure 4). The LDA
topics with the lowest entropy were clearly
associated with specific general event type
(GET) categories such as “Medication/
Fluid”, “Fall”, and “Lab/Specimen” (▶ Table
2). The top words for these LDA topics are
intuitive and are easy to understand as
clinically relevant to the assigned GET category.
On the other hand, LDA topics with the
highest entropy, 26 and 32, were not clearly
identifiable with any single GET category
(▶ Table 3). The high entropy topic words
were more representative of themes or
topics relevant to several GET categories
and tended to cross the more clinically
oriented GET boundaries. For example,
words from topic 26 centered around parts
of the body, spatial orientation, and position. This body orientation/position
“theme” relates to one of many GETs such
as “Fall”, “Surgery/Procedure”, and “Skin/
Tissue”. Furthermore, words from topic 32
centered around the topic of breathing
which highlights the complexity of respiratory events as not just an “Airway Management” GET category, but also as related
to the diagnosis, treatment, safety and security of the patient.
5.2 GET Entropy
Lastly we used entropy to evaluate the LDA
topic information content for each of the
GET categories. Our results (▶ Figure 5)
show that “Airway Management”, “Skin/
Tissue”, “Blood Bank”, “Maternal/Childbirth”, and “Fall” categories had the lowest
entropy across the LDA topics. “Equipment/Medical Device”, “Infection Prevention”, “Diagnosis/Treatment”, “Miscellaneous”, and “Healthcare IT” had the
highest entropy across the LDA topics.
While some of these results were expected, such as high entropy for “Miscellaneous” reports, other results were more
surprising. For example, we expected “Fall”
reports to have much lower entropy because falls are generally well defined.
However, upon further inspection we noticed that “Fall” reports were distributed
amongst several LDA topics including 2
[i.e., fall, bathroom, head, floor, hit], 1 [i.e.,
bed, floor, alarm, chair, sit], and 15 [i.e.,
chair, floor, wheelchair, therapist].
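The entropy behavior reported above can be reproduced in spirit with a short sketch. The counts below are invented, and the study does not state its logarithm base; base 2 is assumed here (the base only rescales the values):

```python
# Worked illustration of the entropy measure: H(x_t) computed from a topic's
# report counts across GET categories (toy counts, base-2 logarithm assumed).
from math import log2

def topic_entropy(counts_by_get):
    """H(x_t) = -sum_g p(x_t,g) * log2(p(x_t,g)) over GET categories g."""
    total = sum(counts_by_get.values())
    return -sum((c / total) * log2(c / total)
                for c in counts_by_get.values() if c > 0)

# A topic concentrated in one GET category has low entropy,
low = topic_entropy({"Medication/Fluid": 90, "Lab/Specimen": 10})
# while a topic spread evenly across four categories has high entropy.
high = topic_entropy({"Fall": 25, "Skin/Tissue": 25,
                      "Surgery/Procedure": 25, "Miscellaneous": 25})
print(round(low, 3), round(high, 3))  # → 0.469 2.0
```

The same function applied to a GET category's counts across LDA topics gives H(x_g), the category-side entropy used in this section.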
Table 2  Top topic words and associated GET categories for LDA topics with low entropies

LDA Topic ID   H(xt)   Top words                               Predominantly associated GET categories
8              0.655   mg, dose, po, order, daily              Medication/Fluid
2              0.943   fell, bathroom, head, fall, floor       Fall
4              0.944   lab, specimen, urine, results, receive  Lab/Specimen
5              1.084   medication, pharmacy, pyxis, med        Medication/Fluid
9              1.131   enter, done, critical, result, mgdl     Lab/Specimen

Table 3  Top topic words and associated GET categories for LDA topics with large entropies

LDA Topic ID   H(xt)   Top words                            Predominantly associated GET categories
26             3.644   right, left, hand, arm, side         Fall, Miscellaneous, Safety/Security, Skin/Tissue, Surgery/Procedure
32             3.586   respirator, oxygen, sat, air, place  Airway Management, Diagnosis/Treatment, Miscellaneous, Safety/Security

6. Discussion

An unsupervised topic modeling approach was used to identify latent topics in a large dataset of PSE reports and these topics
were compared to the general event type
(GET) categories as defined by the event
reporters. The analyses demonstrated instances where certain LDA topics aligned
with the GET categories, while there were
several other instances where the LDA
topics spanned multiple GET categories
and there was no clear alignment.
While previous researchers have utilized
supervised modeling approaches to better
understand PSE reports, supervised modeling requires an established training and
test set for effective model development
[17, 18, 20]. The advantage of the unsupervised approach, as demonstrated here, is
that no training and test set is required.
There are, however, certain drawbacks to
the unsupervised approach. Convergence
time of unsupervised approaches often
takes longer and the results may not be as
accurate as supervised approaches. Furthermore, the unsupervised topic models
do require humans to interpret and evaluate the utility of the outputs. Both supervised and unsupervised approaches have
advantages and limitations; utilizing a
combination of the approaches will likely
yield the most fruitful path for more effectively analyzing PSE data. Natural language
processing (NLP) approaches provide several benefits to PSE reporting and, if leveraged appropriately, NLP can both enhance
the rate of reporting PSEs and can dramatically improve the analysis of these events.
6.1 Reducing the Burden of
Event Reporting
One of the primary barriers to event
reporting is the time cost of entering a report; for busy provider staff, taking several minutes to fill out an event report is difficult, and, consequently, reports may
not be submitted. Traditional event reporting systems require several different structured data fields to be completed, including
selection of the GET. By using a NLP approach some of the structured data elements may be reduced in favor of a more
natural text based narrative from the reporter. This may then reduce the time
burden of reporting. Utilizing NLP can
provide a balance between structured and
free text data elements in the reporting
process.
Selecting a category for an event is also
difficult for provider staff entering event
reports. In many circumstances, the event
reporter may not know the GET. For
example, if a nurse is about to administer a
medication that was ordered by the physician through a computerized physician
order entry (CPOE) system and the nurse
notices that the medication is the wrong
dose, what type of event should the nurse
enter? Is this a “Medication” event or a “Healthcare IT” event, or both? Expecting the nurse to determine the event type under these conditions may be unreasonable. The NLP approach offers an alternative to having to select the event category and may also serve to control bias that may be introduced by the reporter selecting the event category. Further, certain events may very well span multiple categories and should be classified as such for a more rigorous analysis of the data. While reporting systems can allow the reporter to select multiple categories, it is more efficient to utilize an algorithmic approach.

Figure 5  Entropy for GET categories across LDA topics

6.2 Enhancing the Analysis of Patient Safety Event Reports

There are several ways in which the NLP approach can dramatically improve the analysis of PSE data and reduce the resources required to analyze these data.

6.2.1 Re-categorizing Reports

In the dataset that was analyzed, approximately ten percent of the reports were categorized as miscellaneous. Without an algorithmic approach these events would require manual review in order to re-categorize them under more meaningful topics. The NLP approach provides a method to more efficiently re-categorize these events. In addition, some event reports may be inappropriately categorized, and the NLP approach can be used to identify those reports that do not fit within the category that was specified by the reporter.

6.2.2 Discovery of New Topics

The high entropy LDA topics represent latent topic areas that are different from the GET categories, and in many cases these LDA topics span multiple GET categories. The LDA topics that do not align with the GET categories provide a novel way to look at the data that may lead to new insights to improve patient safety and better understand underlying causes. For example, LDA topic 26 represents spatial orientation and position, especially as it relates to the patient’s body. This theme spans multiple GET categories including “Surgery/Procedure”, “Skin/Tissue”, and “Fall”. Without this LDA category, one might not readily think about confusion over body orientation as a potential underlying factor that connects seemingly unrelated reports. The LDA topics highlight the commonalities that would be more difficult to discover through the GET categories.

This approach also allows for the discovery of topics that may fall outside of traditional clinical categories. Topic clusters centered on human factors concepts like communication and teamwork may arise, providing even greater insight into the underlying causes of the PSE reports being entered. Leveraging NLP to reshape the data to make the underlying causes of events more transparent is an area ripe for future research.

6.3 Challenges and Opportunities

PSE reports are a rich data source that, if utilized appropriately, can be used to identify safety hazards and dramatically improve the delivery of safe care. However, efficiently and effectively analyzing the narratives in PSE reports is going to be critical to achieving the promise that this data source holds. Supervised and unsupervised NLP approaches are showing signs of success, yet there are several challenges to overcome with the PSE narratives such as domain specificity, nuanced language differences, medical jargon, abbreviations, and colloquial acronyms. Similar to other application areas, specific ontologies centered on patient safety may need to be developed to address these challenges. The unsupervised approach presented here contributes to the foundational work to demonstrate the effectiveness of NLP in analyzing PSE data.
Acknowledgment
The work presented here was supported by
many Safety and Quality staff and we are
grateful to them for their support.
References
1. Aspden P, Corrigan JW, Erickson SM. Patient
Safety Reporting Systems and Applications. In: Patient Safety: Achieving a new standard of care.
Washington, D.C.: National Academy Press; 2004. pp 250–278.
2. Rosenthal J, Booth M. Maximizing the Use of State Adverse Event Data to Improve Patient Safety. Portland, ME: 2005.
3. Clarke JR. How a system for reporting medical
errors can and cannot improve patient safety. Am
Surg 2006; 72: 1088–1091; discussion 1126 –1148.
http://www.ncbi.nlm.nih.gov/pubmed/17120952
4. Pronovost P, Morlock LL, Sexton B. Improving the
value of patient safety reporting systems. In: Advances in patient safety: New directions and alternative approaches. Vol 1. Assessment. Rockville,
MD: Agency for Healthcare Research and Quality;
2008.
5. White J. Adverse Event Reporting and Learning
Systems: A Review of the Relevant Literature. The
Canadian Patient Safety Institute; 2007.
6. Longo DR, Hewett JE, Ge B, et al. The long road to
patient safety: a status report on patient safety
systems. JAMA 2005; 294: 2858–2865. http://
www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=
Retrieve&db=PubMed&dopt=Citation&list_uids=
16352793
7. Spyns P. Natural language processing in medicine:
an overview. Methods Inf Med 1996; 35: 285–301.
8. Nadkarni PM, Ohno-Machado L, Chapman WW.
Natural language processing: an introduction.
J Am Med Inform Assoc 2011; 18: 544–551.
doi: 10.1136/amiajnl-2011-000464
9. De Choudhury M, Gamon M, Counts S, et al. Predicting Depression via Social Media. ICWSM 2013; 2: 128–137. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/viewFile/6124/6351 (accessed Sep 11, 2014).
10. Monroe BL, Colaresi MP, Quinn KM. Fightin’
Words: Lexical Feature Selection and Evaluation
for Identifying the Content of Political Conflict.
Polit Anal 2008; 16: 372– 403. doi: 10.1093/pan/
mpn018
11. Melton G, Hripcsak G. Automated detection of
adverse events using natural language processing
of discharge summaries. J Am Med Informatics
Assoc 2005; 12: 448 – 457.
12. Chapman WW, Nadkarni PM, Hirschman L, et al.
Overcoming barriers to NLP for clinical text: the
role of shared tasks and the need for additional
creative solutions. J Am Med Inform Assoc 2011;
18: 540 –543. doi: 10.1136/amiajnl-2011-000465.
13. Wagholikar KB, MacLaughlin KL, Henry MR, et
al. Clinical decision support with automated text
processing for cervical cancer screening. J Am
Med Inform Assoc 2012; 19: 833–839. doi:
10.1136/amiajnl-2012-000820.
14. Doan S, Bastarache L, Klimkowski S, et al. Integrating existing natural language processing tools
for medication extraction from discharge summaries. J Am Med Inform Assoc 2010; 17:
528–531. doi: 10.1136/jamia.2010.003855.
15. Ware H, Mullett CJ, Jagannathan V. Natural language processing framework to assess clinical conditions. J Am Med Inform Assoc 2009; 16:
585–589. doi: 10.1197/jamia.M3091.
16. Botsis T, Nguyen MD, Woo EJ, et al. Text mining
for the Vaccine Adverse Event Reporting System:
medical text classification using informative feature selection. J Am Med Inform Assoc 2011; 18:
631– 638. doi: 10.1136/amiajnl-2010-000022.
17. Chai KEK, Anthony S, Coiera E, et al. Using statistical text classification to identify health information technology incidents. J Am Med Informatics
Assoc 2013; 20: 1–6. doi: 10.1136/amiajnl-2012-001409.
18. Ong M-S, Magrabi F, Coiera E. Automated identification of extreme-risk events in clinical incident
reports. J Am Med Informatics Assoc 2012; 19:
e110 –118. doi:10.1136/amiajnl-2011-000562.
19. Magrabi F, Ong M-S, Runciman W, et al. Using
FDA reports to inform a classification for health
information technology safety problems. J Am
Med Informatics Assoc 2012; 19: 45–53. doi:
10.1136/amiajnl-2011-000369.
20. Ong M-S, Magrabi F, Coiera E. Automated categorisation of clinical incident reports using statistical text classification. Qual Saf Health Care 2010;
19: e55. doi: 10.1136/qshc.2009.036657.
21. Blei D, Ng A, Jordan M. Latent dirichlet allocation.
J Mach Learn Res 2003; 3: 993–1022.
http://dl.acm.org/citation.cfm?id=944937
(accessed Sep 11, 2014).
22. Roberts ME, Stewart BM, Tingley D, et al. Structural Topic Models for Open-Ended Survey Responses. Am J Pol Sci 2014; 58 (4): 1064 –1082.
doi: 10.1111/ajps.12103
23. Roberts M, Stewart B, Tingley D, et al. The structural topic model and applied social science. 2013.
http://mimno.infosci.cornell.edu/nips2013ws/
slides/stm.pdf (accessed Sep 11, 2014).
24. Bisgin H, Liu Z, Fang H, et al. Mining FDA drug
labels using an unsupervised learning technique –
topic modeling. BMC Bioinformatics 2011; 12
(Suppl 1): S11. doi: 10.1186/1471-2105-12-S10-S11.
25. Chang J. lda: Collapsed Gibbs sampling methods for topic models. R package version 1.3.2. 2014. Retrieved from http://cran.r-project.org/web/packages/lda/index.html.
26. Shannon C. A mathematical theory of communication. ACM SIGMOBILE Mob Comput Commun Rev 2001; 5: 3–55. http://dl.acm.org/citation.cfm?id=584093 (accessed Sep 11, 2014).