Health Care Data Analytics

Audio Transcript
Health Care Data Analytics
Machine Learning and
Natural Language
Processing
Lecture b
Health IT Workforce Curriculum
Version 4.0/Spring 2016
This material (Comp 24 Unit 6) was developed by Oregon Health & Science University, funded by the
Department of Health and Human Services, Office of the National Coordinator for Health Information
Technology under Award Number 1U24OC000015. This material was updated in 2016 by [Institution
Name] under Award Number 90WT00006.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-sa/4.0/
Slide 1
Welcome to Health Care Data Analytics: Machine Learning and Natural Language
Processing. This is lecture b.
The component, Health Care Data Analytics, covers the topic of health care data
analytics, which applies the use of data, statistical and quantitative analysis, and
explanatory and predictive models to drive decisions and actions in health care
Slide 2
The learning objectives for this unit, Machine Learning and Natural Language
Processing, are to:
Describe the major tasks for which machine learning is used
Compare and contrast the major approaches for machine learning
Describe the major tasks for which natural language processing is used
And discuss the major approaches and challenges for processing clinical narratives
Slide 3
In this lecture, we begin our discussion of natural language processing, or NLP, of
clinical text. First, we will look at basic definitions and approaches to NLP. This will be
followed by challenges in processing the clinical narrative. In the next lecture, we will
discuss various clinical NLP approaches and projects. And finally, we will describe
alternatives and future directions.
Slide 4
Let's begin with basic definitions and approaches. Successful NLP of the clinical
narrative could help better enable the use of data in electronic health records, or EHRs.
We know, for example, that current coded data, such as ICD-10, does not cover the
complexity of what is described in the clinical narrative. We also know that a good deal
of clinical information is “locked” in that text, meaning we cannot easily extract and
process the information to use for various purposes.
Slide 5
Some have noted that the term NLP could actually be better described as natural
language understanding, because the goal of NLP is the understanding of natural
language in computerized text. For those desiring more detail on the various
approaches to NLP and its uses, the references given in the last few slides of this
presentation can be consulted.
Slide 6
What are some of the use cases for clinical NLP? The three major ones are listed on
this slide.
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
2
The first use case is classification, where we are trying to classify what we find in the
text into some sort of category. For example, we may want to classify a patient finding
into a category such as when determining if they might be eligible for a clinical study.
Probably the major use case for clinical NLP is extraction, where we want to extract
information from a clinical narrative. For example, we might want to extract the findings
that occur in a radiology report and even the measurements that are reported within that
text.
A third use case is summarization, where we may want to summarize or abstract the
information that is in the narrative. We may do this for medical literature to summarize
scientific information or the clinical narrative where were trying to summarize the major
findings that the patient has.
Slide 7
We can delve further into use cases for NLP by considering cancer care. This set of use
cases comes from some promotional literature from a company that sells clinical NLP
products, but actually gives a good set of use cases for which NLP might help us. For
example, we might identify potential clinical trials matches, something akin to what we
mentioned on the last slide. We might be able to do advanced information extraction
from complex patient documents. We may be able to carry out more precise information
retrieval for clinical case histories and outcome studies. We also may be able to
streamline the process of entering patients into cancer registries. In addition, we may be
able to use the data that we extract using NLP to apply predictive models and care
coordination rules to clinical narratives in the patient record.
Slide 8
We may also be able to perform semantic enrichment of patient documentation to
improve the ability to search their notes.
We can analyze patient narratives for insights into treatment outcomes and also to
assess the effect of genetic aberrations on disease.
Finally, we may be able to support tumor boards, where the care of patients who
developed cancer is discussed by those providing care for them.
Slide 9
Let's take a more detailed look at human language so we can understand the
applications and limitations of NLP tools in clinical documents.
Linguists talk about the levels of human language. We begin with phonology. The sound
units that make up a language’s discrete sounds are called phonemes.
The next level up is morphology, which is the analysis of parts of words, which are
called morphemes. Sometimes a whole word is the morpheme, but other times they
may be bound morphemes that are a part of the word. For example, many anatomic
locations such as the appendix are bound to another word such as –itis, indicating
inflammation. Thus, appendicitis indicates there is inflammation of the appendix.
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
3
There are other morphemes, such as pharyng- and –itis. There are also bound
morphemes that indicate procedures such as appendectomy.
Syntax refers to the rules that govern the construction of language, sometimes called
the grammar.
Slide 10
Semantics describes the meaning of the words, phrases, and th sentences that make
up language.
Linguists also talk about pragmatics, which is how the context of language affects its
meaning.
And then there is larger world knowledge that is not explicitly part of language but is the
general knowledge that's necessary to understand it.
Slide 11
The classic approach to NLP goes through three phases.
The first phase is syntax, where we attempt to recognize the grammatical constituents
of language: sentences, phrases within them, and down to nouns, verbs, adjectives etc.
The next phase is semantics, where we attempt to recognize the meaning of those
words, phrases, and sentences.
Finally is the context in which the sentence occurs.
Slide 12
Each of these levels is successively harder and requires more knowledge engineering,
but would add more value if we could solve those problems.
One of the ways we address the inability to completely perform classic NLP is through
the use of rules and matching, where we don't aim for complete understanding of
everything in the document but instead try to recognize the terms that occur and
perhaps normalize them. This may allow us to understand what was said. Or, instead of
using detailed grammar rules, we may use machine learning techniques where we learn
the rules of parsing rather than developing human enumerations of all the possible
grammar rules that might exist.
Slide 13
Let's explore syntax and semantics in more detail. Processing of syntax is usually done
via a technique called parsing. This requires a grammar, which is the rules that govern
the syntax of language.
The most common way that we express a grammar is as a set of rewrite rules, where a
more complex grammatical construct is rewritten from constituent parts.
Slide 14
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
4
For example, a relatively simple grammatical rule is that a sentence consists of a noun
phrase, a verb phrase, and a noun phrase. An example is, the patient has severe
hypertension. The first noun phrase is the patient. The verb phrase is the verb has. The
second noun phrase is severe hypertension.
Of course, noun phrases themselves can be rewritten into more basic constituents.
There are determiners, such as a grammatical article an example of which is the word
the.
There are also adjectives, such as severe.
And a noun phrase can also just consist of a single noun.
The symbols that cannot be further decomposed, such as an adjective and a noun, are
called terminal symbols. Likewise, those that can be further decomposed, such as
sentence and noun phrase, are called non-terminal symbols.
Slide 15
As you can imagine, the grammar supporting the English language can get highly
complex, with many, many rewrite rules. This is why the machine learning approach has
superseded the approach of trying to enumerate every last grammar rule.
In semantics, we aim to map the parts of speech, these nouns, adjectives, verbs, etc.
into standardized terminology. For medicine, probably the most descriptive terminology
is SNOMED CT.
Slide 16
Processing language has been one of the most challenging computer tasks and is
difficult not only in the clinical narrative but almost in all forms of natural language.
Clinical narratives, such as progress notes and discharge summaries, can be even
more difficult to process than other types of text for many reasons. One is that clinical
narratives are written in a telegraphic, elliptical style. Oftentimes, the narratives are not
complete sentences. We will see examples of that in a moment. Clinical text also may
have spelling errors or grammatical errors. We also know that physicians and others
may take license with language and oftentimes there may be important information that
is buried within normal language that is implicit but not actually in the words and
phrases.
Slide 17
We will look at some of the challenges at the syntactic, semantic, and contextual levels.
Slide 18
Here is a look at some of the syntactic challenges that were first enumerated by Sager
in the 1980s; others have since validated these challenges. As mentioned in the
previous slide, a great deal of clinical narrative text is syntactically incomplete, that is, at
least according to Sager's analysis, half of all sentences in the clinical narrative were
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
5
found to be grammatically incomplete. If we think of the minimal English sentence as
subject verb object we see different types of incomplete sentences.
Slide 19
For example, the medical record may delete the verb and object. When the text says,
“stiff neck and fever,” there has been a deletion of the verb and object from the
sentence.
In “Brain scan negative,” there is deletion of the verb “is.”
For the statement, “Positive for heart disease,” there is deletion of the subject and verb,
such as “The patient has.”
And finally, “Was seen by local doctor” has deletion of the subject.
As humans we can read these and still for the most part understand what is happening,
but computer algorithms, especially those that are solely based on rules, have difficulty
with these sorts of violations of rules of basic English grammar.
Slide 20
There are also semantic challenges, which again as humans, especially those who
have some clinical knowledge, we readily understand. But to a computer that is just
functioning based on rules, there is a lot more difficulty. We know that words have
different senses and meanings. For example, when we read in a medical chart murmur
is appreciated, we know that likely there is a clinician who is listening, probably with a
stethoscope, to the heart and there is a murmur. It's not so much that the murmur is
appreciated in the sense of it being liked.
By the same token, when we read about eye drops, we are thinking about drops of
liquid containing medication put into the eye and not the eye physically dropping.
Likewise, when we read mass at 3 o'clock, we know that were likely reading about
something that is felt on the left-hand side of the abdomen and not that there's a
religious service in the afternoon.
Slide 21
Another semantic challenge is synonymy, where different words and phrases have the
same meaning but they are expressed differently. For example, consider the phrase
epigastric pain after eating versus another phrase postprandial stomach discomfort.
These two phrases have no words in common but essentially mean the same thing.
There is also polysemy, where the same words and phrases have different meetings.
For example, someone might say, the PCP of the patient with PCP advised him to stop
using PCP. PCP is an acronym that stands for several things, such as primary care
physician, pneumocystis carinii pneumonia, or an abbreviated name for the drug
phencyclidine.
Slide 22
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
6
There are a number of additional semantic challenges. One is negation. The clinical
narrative is often full of negation. Clinicians may say the patient does not have this
finding or that finding. Or that this disease is not present. Or saying we are choosing not
to use this treatment and instead are using another one. Negation is common in medical
text, for example, Patient does not have any chest pain.
There is also uncertainty in natural language text. Clinicians may say things like, Patient
treated for possible pneumonia.
There is also temporality. Just because something is mentioned doesn't mean that it is
present now. For example, Patient has history of pneumonia. Or there might be
something that has been resolved, such as, Chest pain resolved after administration of
nitroglycerin.
Slide 23
There are also contextual challenges in the clinical narrative. The term that describes a
broad category of these is coreference, which is the relation between linguistic
expressions that refer to the same real-world entity.
Consider the sentence, Chest x-ray shows nodule in left upper lobe, followed by another
sentence, The tumor has increased in size to 2 cm. The phrase the tumor from the
second sentence is actually referring to that same nodule from the first sentence.
There is a particular type of coreference that can be challenging, which is anaphora, or
the use of pronouns. Consider these two sentences: He complains of chest pain. It
awakens him at night. “It” in the second sentence refers to “chest pain” in the first
sentence.
Slide 24
There is another type of contextual challenge where there is the deletion of subjects.
This is quite common in clinical narratives, so we may see strings of sentences such as,
Complains of chest pain. Increasing frequency. Worse in the morning. Again, as human
readers we usually understand that quite easily, but when we have a natural language
processing system, the computer may not make the connections across the sentences.
Slide 25
Are there any silver linings that may enable us to have hope that we can carry out
clinical NLP. It turns out that there are.
First is the notion of subgrammars, after work by Sager in medicine and other
disciplines. She determined that there were subgrammars that were grammars that
were specific to disciplines, and that there was a subgrammar of clinical narratives that
actually fairly regular and predictable.
Another finding is that medical charts tend to have a predictable discourse, especially
documents like the history and physical, where the document begins with the history of
the patient, goes into the past medical history, and then into the physical exam.
Physicians for the most part follow a well prescribed pathway through the exam.
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
7
More recently, another silver lining has been that we should abandon the notion of
processing the entire clinical narrative and instead focus on specific elements that we
can identify to indicate whether or not a specific disease or specific clinical finding is
present. Thus, giving up on the approach of processing everything and instead focusing
on specific elements present.
Slide 26
Before we look at usage of clinical NLP and systems for it in the next lecture, let's talk
briefly about how we evaluate how well NLP systems work. There are a variety of ways
that systems can be measured but basically we want to determine how well they identify
correct concepts and how well they don't identify incorrect concepts.
The measures that we typically use are recall and precision. Recall is the proportion of
correct concepts found. For example, if there are100 concepts that should be found by
an NLP system and 75 actually are found, then the recall is 75%. Precision is the
proportion of found concepts that are correct, so if we identify 150 concepts and 75 of
them are correct, then our precision is 50%.
Slide 27
Many evaluations of NLP are carried out in so-called challenge evaluations, where there
is a common data set that different researchers use. These different research groups
will compare the results on the same task. For the clinical NLP community, the largest
and most participatory challenge evaluation has been the i2b2 NLP shared task. There
has also been a systematic review of all studies through 2010 that was published and
will be described more in the next lecture.
Slide 28
This concludes lecture b of machine learning and natural language processing. In
summarizing this lecture, we learned:
The major use cases for NLP are classification, extraction, and summarization.
The major phases of NLP are syntax, semantics, and context, each of which has
challenges and is successively harder to do with computers.
Slide 29
And, there may be some silver linings to help with NLP, such as sub grammars,
predictable discourse, and focus on processing less than the entire meaning of
everything in the document.
Slide 30
No Audio.
Slide 31
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
8
No Audio.
Slide 32
No Audio.
Slide 33
No Audio.
Slide 34
No Audio.
Health IT Workforce Curriculum
Version 4.0
Health Care Data Analytics
Machine Learning and Natural Language Processing
9