Slides - Healthcare Analytics Summit 2017

Session 35:
Text Analytics: You Need More than NLP
Eric Just
Senior Vice President
Health Catalyst
Learning Objectives
• Why text search is an important part of clinical text analytics
• The fundamentals of how search works
• How clinical text search can be refined with natural language processing (NLP) and other techniques
Poll Question #1
For my organization, I see text analytics as:
a) Completely unnecessary for analytics
b) A “nice to have” for analytics
c) Very important for a few key areas of analytics
d) Mission critical across nearly all areas of analytics
e) Unsure or not applicable
High-Risk Population: Peripheral Arterial Disease (PAD)
• PAD affects over 3 million patients per year
• Narrowed arteries reduce blood flow to limbs
• Patients with PAD are considered high risk
For organizations trying to understand their risk, not being able to find high-risk patients is a problem.
Patients identified, by method:
• ICD/CPT: N=9,592
• Natural Language Processing (NLP): N=41,741
Terms shown: Peripheral artery disease, Claudication, Rest pain, Ischemic limb
Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiology Scientific Session. Chicago, IL, April 2016.
Analytics has a problem
• Up to 80% of clinical data is stored in text
• Most organizations ignore text analytics because it is expensive and difficult
• Most text analytics requires advanced technical skillsets
Typical Scenario
“As a healthcare system administrator, I want to understand my high-risk population better.
I want to find all patients with peripheral arterial disease (PAD). I know there are more
patients than I was able to find by simply querying diagnosis and procedure codes.”
• Data scientist develops PAD text mining algorithm
• Algorithm validated
• The patient cohort is defined
• Results returned to investigator
Better Scenario
“As a healthcare system administrator, I want to understand my high-risk population better.
I want to find all patients with peripheral arterial disease (PAD). I know there are more
patients than I was able to find by simply querying diagnosis and procedure codes.”
• Data scientist develops PAD algorithm
• Algorithm validated
• PAD algorithm run nightly and stored in data warehouse
• PAD algorithm output combined with coded data to create PAD registry
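The last step, combining algorithm output with coded data, can be pictured as a set union that remembers where each patient came from. This is a minimal sketch only: `build_registry`, the source labels, and the patient IDs are hypothetical names for illustration, not part of any warehouse schema described in the session.

```python
def build_registry(coded_ids, nlp_ids):
    """Map each patient ID to the set of sources that identified the patient."""
    registry = {}
    for source, ids in (("coded", coded_ids), ("nlp", nlp_ids)):
        for patient_id in ids:
            registry.setdefault(patient_id, set()).add(source)
    return registry


# Hypothetical IDs: p2 is found by both the codes and the text algorithm.
REGISTRY = build_registry({"p1", "p2"}, {"p2", "p3"})
```

Keeping the source per patient makes it easy to report how many patients were found only through text.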
How to Best Leverage Text Analytics
• PAD Algorithm
• Diabetes Algorithm
• Ejection Fraction Algorithm
• PreDiabetes Algorithm
• CHF Algorithm
• Hypertension
• Breast Cancer
Poll Question #2
I see text analytics being most important in the area of:
(Choose 3, if applicable)
a) Clinical care improvement
b) Regulatory reporting
c) Research
d) Operational improvement
e) Financial analytics
f) Unsure or not applicable
Google
Why do we love Google?
• Simple, effective interface
• Fast
• Accurate
How Would You Build ‘Google’ For Clinical Text?
The Basis of Text Search: The Inverted Index
Document 0: Patient is a 67 year old female with NIDDM and hypertension.
Document 1: The patient has no diabetes or hypertension.
Document 2: Patient’s mother is diabetic. Patient’s sister is diabetic.

Inverted index (each entry is a (document, word position) pair, counting from 0)
Word       Documents   Inverted Index
67         0           {(0,3)}
diabet     1,2         {(1,4),(2,3),(2,7)}
female     0           {(0,6)}
hyperten   0,1         {(0,10),(1,6)}
mother     2           {(2,1)}
niddm      0           {(0,8)}
no         1           {(1,3)}
old        0           {(0,5)}
patient    0,1,2       {(0,0),(1,1),(2,0),(2,4)}
sister     2           {(2,5)}
year       0           {(0,4)}
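A minimal sketch of how such an index could be built for the three example documents. The stop-word list and the suffix-stripping `stem` function are assumptions chosen only to reproduce stems like "diabet" and "hyperten"; a real engine such as Lucene uses its own analyzers. Counting word positions from zero, "hypertension" in Document 0 falls at position 10.

```python
import re
from collections import defaultdict

DOCS = [
    "Patient is a 67 year old female with NIDDM and hypertension.",
    "The patient has no diabetes or hypertension.",
    "Patient’s mother is diabetic. Patient’s sister is diabetic.",
]

# Assumed stop words: skipped in the index, but they still occupy a position.
STOP_WORDS = {"a", "and", "has", "is", "or", "the", "with"}
# Toy suffix list standing in for a real stemmer (an assumption for this sketch).
SUFFIXES = ("'s", "’s", "sion", "es", "ic")


def tokenize(text):
    """Lowercase the text and split it into word tokens (keeping possessives)."""
    return re.findall(r"[a-z0-9]+(?:['’]s)?", text.lower())


def stem(token):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Return {stem: [(doc_id, position), ...]} for all non-stop words."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, token in enumerate(tokenize(text)):
            if token not in STOP_WORDS:
                index[stem(token)].append((doc_id, position))
    return dict(index)


def search(index, term):
    """Return the sorted IDs of documents containing the stemmed term."""
    return sorted({doc_id for doc_id, _ in index.get(stem(term.lower()), [])})


INDEX = build_index(DOCS)
```

A lookup is a single dictionary access, which is why results come back quickly no matter how many documents were indexed: `search(INDEX, "diabetes")` finds Documents 1 and 2 through the shared stem, and misses the NIDDM mention in Document 0.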
Tools To Quickly Index Text and Provide Search Capability
Apache Lucene
• Originally written in 1999
• Open-Source Java API
• Create index
• Maintain index
• Search index
• Hit ranking
• Result sorting
• ...much more
Provides the foundation for more advanced search engine capabilities. Most users use it through Solr or Elasticsearch. Used directly by Twitter.

Apache Solr
• Originally written in 2004
• Open Source Enterprise Search
• Built on Lucene
• Scalability (distributed indexes)
• REST APIs
• Plugin architecture
• Additional features over Lucene

Elasticsearch
• Originally written in 2010
• Open Source Enterprise Search
• Built on Lucene
• Scalability (distributed indexes)
• REST APIs
• Plugin architecture
• Additional features over Lucene
Search: diabetes [Go]
Results: 2 records, 0.0 ms
Document 2: Patient’s mother is diabetic. Patient’s sister is diabetic.
Document 1: The patient has no diabetes or hypertension.
• Found both diabetes and diabetic (word stemming)
• Missed mention of NIDDM (synonyms)
• Neither result is relevant to a medical cohort query for diabetics (context)
What Works?
• Simple, familiar interface
• Using inverted index means fast results
What Doesn’t
• Results display not optimized for use cases
  • Need better ability to view aggregate results
• Want more results!
  • Medical language has many synonyms. (How do we find NIDDM?)
• Want less results!
  • Context matters for different search types (How do we exclude ‘no diabetes’?)
Showing the results
• Many users are more interested in exploring aggregate results than reviewing individual records
• Aggregating results opens search up to users without access to PHI
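One way to picture aggregate results is a count of hits by a non-identifying field, so that no note text or patient identifier leaves the search layer. The hit records and field names below (`patient_id`, `note_type`) are hypothetical, used only to illustrate the idea.

```python
from collections import Counter

# Hypothetical search hits; the field names are assumptions for this sketch.
HITS = [
    {"patient_id": "p1", "note_type": "Progress Note"},
    {"patient_id": "p1", "note_type": "Discharge Summary"},
    {"patient_id": "p2", "note_type": "Progress Note"},
]


def aggregate_hits(hits, field):
    """Count hits by one field; all other fields are left out of the output."""
    return Counter(hit[field] for hit in hits)


def distinct_patients(hits):
    """Return how many different patients the hits cover, without listing them."""
    return len({hit["patient_id"] for hit in hits})
```

Here `aggregate_hits(HITS, "note_type")` reports two progress notes and one discharge summary, and `distinct_patients(HITS)` reports two patients.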
Get More Results: Synonyms
When you say ‘diabetes’ what do you really mean?
"diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle"
OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm"
OR "insulin dependent diabetic" OR "insulin-dependent diabetes mellitus" OR "juvenile
diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i
diabetes mellitus without mention of complication" OR "type 1 diabetes mellitus" OR
"diabetes mellitus maturity onset" OR "diabetes mellitus non insulin-dep" OR "diabetes
mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes
of the young" OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii
diabetes mellitus" OR "type ii diabetes mellitus without mention of complication" OR "type 2
diabetes mellitus"
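An expansion like the one above could be generated from a synonym table instead of being typed by hand. In this sketch the `SYNONYMS` dictionary holds a small subset of the terms on the slide and stands in for a lookup against a medical terminology; `expand_term` is a hypothetical helper name.

```python
# Small subset of the slide's synonyms; a real list would come from a terminology.
SYNONYMS = {
    "diabetes": [
        "diabetes",
        "diabetes mellitus",
        "diabetic",
        "iddm",
        "niddm",
        "type 2 diabetes mellitus",
    ],
}


def expand_term(term, synonyms=SYNONYMS):
    """Turn one search term into a quoted OR clause over its known synonyms."""
    variants = synonyms.get(term.lower(), [term.lower()])
    return " OR ".join(f'"{variant}"' for variant in variants)
```

A term with no entry in the table falls back to itself, so the user's original search is never lost.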
Leveraging Medical Terminologies
Expanding Search With Terminologies
A more complex example:
Diabetic patients who are on an ACE/ARB or who had their microalbumin checked during the calendar year
• Queries free text for all reports that contain “Diabetes” AND “(ace OR arb)” AND “microalbumin”
• Filtered for reports within the last year
Note: terms are selected by synonym finder, or grouped terms of all trade name, generic name, or active medication ingredients
("diabetes" OR "diabetes mellitus" OR "diabetic" OR "brittle diabetes" OR "diabetes brittle" OR "diabetes mellitus insulin-dependent" OR "diabetes mellitus juvenile onset" OR "iddm" OR "insulin dependent diabetic" OR "insulin-dependent diabetes
mellitus" OR "juvenile diabetes" OR "ketosis-prone diabetes mellitus" OR "type i diabetes mellitus" OR "type i diabetes mellitus
without mention of complication" OR "type 1 diabetes mellitus" OR "diabetes mellitus maturity onset" OR "diabetes mellitus non
insulin-dep" OR "diabetes mellitus non-insulin-dependent" OR "maturity onset diabetes" OR "maturity-onset diabetes of the young"
OR "niddm" OR "non-insulin-dependent diabetes mellitus" OR "type ii diabetes mellitus" OR "type ii diabetes mellitus without
mention of complication" OR "type 2 diabetes mellitus") AND ( ("benazepril" OR "lotensin" OR "captopril" OR "capoten" OR
"enalapril" OR "vasotec" OR "epaned" OR "fosinopril" OR "monopril" OR "lisinopril" OR "prinivil" OR "zestril" OR "moexipril" OR
"univasc" OR "perindopril" OR "aceon" OR "quinapril" OR "accupril" OR "ramipril" OR "altace" OR "trandolapril" OR "mavik") OR
("azilsartan" OR "edarbi" OR "candesartan" OR "atacand" OR "eprosartan" OR "teveten" OR "irbesartan" OR "avapro" OR
"telmisartan" OR "micardis" OR "valsartan" OR "diovan" OR "losartan" OR "cozaar" OR "olmesartan" OR "benicar") ) AND
("albumin urine" OR "urine microalbumin" OR "urine microalbumin present")
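The query above has a regular shape: groups of synonyms joined by OR, and the groups joined by AND. A minimal sketch of composing that shape follows, with only a couple of terms per group taken from the slide. The helper names `any_of`, `either`, and `all_of` are hypothetical.

```python
def any_of(terms):
    """Join quoted terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def either(clauses):
    """Join already-built clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(clauses):
    """Join already-built clauses with AND."""
    return " AND ".join(clauses)


# Abbreviated term groups; the slide lists many more per group.
DIABETES = ["diabetes", "niddm"]
ACE = ["lisinopril", "zestril"]
ARB = ["losartan", "cozaar"]
MICROALBUMIN = ["albumin urine", "urine microalbumin"]

QUERY = all_of([
    any_of(DIABETES),
    either([any_of(ACE), any_of(ARB)]),
    any_of(MICROALBUMIN),
])
```

Building the string from named groups keeps each concept reviewable on its own, which matters once a group grows to dozens of trade and generic names.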
Get Less Results: ConText Matters
ConText is an NLP pattern-matching algorithm published in 2009
To be useful for clinical applications such as looking for genotype/phenotype correlations, retrieving
patients eligible for a clinical trial, or identifying disease outbreaks, simply identifying clinical
conditions in the text is not sufficient—information described in the context of the clinical condition is
critical for understanding the patient’s state.
J Biomed Inform. 2009 Oct; 42(5): 839–851.
Detects conditions and whether they are
• Negated (e.g., “ruled out pneumonia”)
• Historical (e.g., “past history of pneumonia”)
• Experienced by someone else (e.g., “family history of pneumonia”)
ConText Algorithm
Wendy W. Chapman, David Chu, John N. Dowling
J Biomed Inform. 2009 Oct; 42(5): 839–851.
Example sentence: “No history of chest tightness but family history of CHF.”
Annotations on the sentence: condition, negation trigger, historical trigger, other experiencer trigger, termination

Condition         Property      Before ConText   After ConText
Chest tightness   Negation      affirmed         negated
Chest tightness   Experiencer   patient          patient
Chest tightness   Temporality   recent           historical
CHF               Negation      affirmed         affirmed
CHF               Experiencer   patient          other
CHF               Temporality   recent           historical
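The trigger-and-termination idea can be sketched in a few lines. This is a simplified illustration, not the published ConText implementation: the trigger list is a tiny assumed sample, only triggers that appear before a condition are considered, and a termination term ("but") simply ends the scope of every earlier trigger.

```python
import re

DEFAULTS = {"negation": "affirmed", "experiencer": "patient", "temporality": "recent"}

# (pattern, property, value): a small assumed sample of trigger phrases.
TRIGGERS = [
    (r"\bno\b", "negation", "negated"),
    (r"\bhistory of\b", "temporality", "historical"),
    (r"\bfamily history of\b", "experiencer", "other"),
    (r"\b(?:mother|sister)\b", "experiencer", "other"),
]
TERMINATIONS = r"\bbut\b"


def context(sentence, conditions):
    """Return {condition: properties} for each condition found in one sentence."""
    text = sentence.lower()
    stops = [match.start() for match in re.finditer(TERMINATIONS, text)]
    results = {}
    for condition in conditions:
        found = re.search(r"\b" + re.escape(condition.lower()) + r"\b", text)
        if found is None:
            continue
        properties = dict(DEFAULTS)
        for pattern, prop, value in TRIGGERS:
            for trigger in re.finditer(pattern, text):
                # A trigger applies if it precedes the condition and no
                # termination term sits between the two.
                blocked = any(trigger.end() <= stop < found.start() for stop in stops)
                if trigger.end() <= found.start() and not blocked:
                    properties[prop] = value
        results[condition] = properties
    return results


EXAMPLE = context(
    "No history of chest tightness but family history of CHF.",
    ["chest tightness", "CHF"],
)
```

On the example sentence, "no" and "history of" reach chest tightness, while "but" stops them from reaching CHF, which is instead modified by "family history of".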
ConText: Negation
Sentence: “The patient had no diabetes or hypertension.”
Annotations on the sentence: experiencer, negation trigger, clinical conditions, termination
• Diabetes: Negation: negated; Experiencer: patient; Temporality: recent
• Hypertension: Negation: negated; Experiencer: patient; Temporality: recent
ConText: Experiencer
Sentence: “Patient’s mother has diabetes.”
Annotations on the sentence: experiencer, clinical conditions, termination
• Diabetes: Negation: affirmed; Experiencer: other; Temporality: recent
Sentence: “Patient’s sister has hypertension.”
Annotations on the sentence: experiencer, clinical conditions, termination
• Hypertension: Negation: affirmed; Experiencer: other; Temporality: recent
How?
• Analysis of context uses a sentence as an operand
• Identifying sentences in clinical text is not straightforward
  • Have you ever seen punctuation in a clinical note?
• An NLP analysis pipeline ties it all together
Pipeline: Search results → Sentence detection → Entity recognition (e.g., diabetes) → Context algorithm → Present user with additional filters
NLP Pipeline Frameworks
• Apache Unstructured Information Management Architecture (UIMA)
• General Architecture for Text Engineering (GATE)
• Natural Language Toolkit (NLTK)
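The first two pipeline stages can be sketched without any framework. The sketch below assumes that line breaks often stand in for missing punctuation in clinical notes, so it splits on both. The `LEXICON` of conditions and the function names are hypothetical, and a production pipeline built on UIMA, GATE, or NLTK would handle abbreviations and section headers far more carefully.

```python
import re

# Hypothetical condition lexicon for the entity recognition step.
LEXICON = ("diabetes", "hypertension", "chf")


def split_sentences(note):
    """Split on sentence-ending punctuation or on line breaks."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", note)
    return [part.strip() for part in parts if part.strip()]


def find_entities(sentence, lexicon=LEXICON):
    """Return the lexicon terms found in the sentence, in order of appearance."""
    hits = []
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, re.IGNORECASE):
            hits.append((match.start(), term))
    return [term for _, term in sorted(hits)]


def run_pipeline(note):
    """Pair every detected sentence with the conditions mentioned in it."""
    return [(sentence, find_entities(sentence)) for sentence in split_sentences(note)]


NOTE = "Pt seen today\nNo diabetes or hypertension. Mother has CHF"
ANNOTATED = run_pipeline(NOTE)
```

Each (sentence, entities) pair is the unit a context algorithm would then inspect, which is why sentence detection has to come first.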
ConText: Apply to Search Results
Filter Diabetes Results
Other Pieces to the NLP Pipeline: Extract Values
ef_phrase                                      qualifiers                ef_low   ef_high   ef_mid   ef_word
ejection fraction is at least 70-75            is at least               70       75        72.5     NULL
ejection fraction of about 20                  of about                  20       20        20       NULL
ejection fraction of 60                        of                        60       60        60       NULL
ejection fraction of greater than 65           of greater than           65       65        65       NULL
ejection fraction of 55                        of                        55       55        55       NULL
ejection fraction by visual inspection is 65   by visual inspection is   65       65        65       NULL
LVEF is normal                                 is                        NULL     NULL      NULL     normal
\b(((LV)?EF)|(Ejection\s+Fraction))\s+(?<qualifiers>([^\s\d]+\s+){0,5})\(?(((?<ef_low>\d+)\s*-\s*(?<ef_high>\d+))|(?<ef_mid_txt>\d+)|(?<ef_word>([^\s]*?normal)|(moderate)|(severe)))
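A Python adaptation of the pattern above, offered as a sketch and not as the presenter's implementation. Python writes named groups as `(?P<name>...)`, and this version assumes a hyphen separates the low and high values of a range such as 70-75. The group name `ef_single`, the `extract_ef` function, and the midpoint calculation are assumptions made to reproduce the columns of the table.

```python
import re

EF_PATTERN = re.compile(
    r"\b(?:(?:LV)?EF|ejection\s+fraction)\s+"      # LVEF, EF, or the full phrase
    r"(?P<qualifiers>(?:[^\s\d]+\s+){0,5})"        # up to five non-numeric words
    r"\(?(?:(?P<ef_low>\d+)\s*-\s*(?P<ef_high>\d+)"  # a range such as 70-75
    r"|(?P<ef_single>\d+)"                         # or a single value
    r"|(?P<ef_word>\S*?normal|moderate|severe))",  # or a descriptive word
    re.IGNORECASE,
)


def extract_ef(text):
    """Return the first ejection fraction mention as a dict, or None."""
    match = EF_PATTERN.search(text)
    if match is None:
        return None
    if match.group("ef_low"):
        low, high = int(match.group("ef_low")), int(match.group("ef_high"))
    elif match.group("ef_single"):
        low = high = int(match.group("ef_single"))
    else:
        low = high = None
    word = match.group("ef_word")
    return {
        "ef_phrase": match.group(0),
        "qualifiers": match.group("qualifiers").strip(),
        "ef_low": low,
        "ef_high": high,
        "ef_mid": (low + high) / 2 if low is not None else None,
        "ef_word": word.lower() if word else None,
    }
```

Storing low, high, and midpoint separately lets downstream analytics choose how to treat ranges, while `ef_word` preserves mentions that carry no number at all.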
Other Extraction Projects
• Aortic Root Size
• Blood Pressure
• Breast Cancer ER Biomarker
• Cancer Staging, TNM, and stage
• Abdominal fistula
• Height/Weight/BMI
• Hypoglycemia with low blood sugars
• Microalbumin
• Ankle Brachial Index
High-risk Population: Peripheral Arterial Disease (PAD)
• PAD affects over 3 million patients per year
• Narrowed arteries reduce blood flow to limbs
• Patients with PAD are considered high risk
• Measured by Ankle Brachial Index (ABI)
Patients identified, by method:
• ICD/CPT: N=9,592
• Natural Language Processing (NLP): N=41,741
• ABI < 0.9: N=4,349
Terms shown: Peripheral artery disease, Claudication, Rest pain, Ischemic limb
This is a precise patient registry!
Duke J, Chase M, Ring N, Martin J, Fuhr R, Chatterjee A, Hirsh AT. Use of natural language processing of unstructured data significantly increases the detection of peripheral arterial disease in observational data. American College of Cardiology Scientific Session. Chicago, IL, April 2016.
Validation
• Build studies to review results of query
• Assign to team members to review results
• Randomly selects records to represent study
• Highlights key words for easy chart review
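The random selection step can be sketched with a seeded sampler so that the same study can be regenerated later. The function names, the seed parameter, and the choice of positive predictive value as the performance measure are assumptions for illustration; the session does not specify which measure is used.

```python
import random


def sample_for_review(record_ids, sample_size, seed=0):
    """Draw a reproducible random sample of record IDs for chart review."""
    rng = random.Random(seed)
    ids = sorted(record_ids)  # fixed order so the seed fully determines the sample
    return rng.sample(ids, min(sample_size, len(ids)))


def positive_predictive_value(review_labels):
    """Share of reviewed records that reviewers confirmed as true positives."""
    if not review_labels:
        raise ValueError("no reviewed records")
    return sum(review_labels.values()) / len(review_labels)


# Hypothetical reviewer decisions: record ID -> confirmed by chart review?
LABELS = {"r1": True, "r2": True, "r3": False, "r4": True}
```

With the labels above, three of four reviewed records were confirmed, so `positive_predictive_value(LABELS)` is 0.75.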
Text Analytics Must Be Interoperable!
Validated Text Analytics
• Diabetes Cohort
• PAD Cohort
• Ejection Fraction
• Tumor Sizes
Data Warehouse
• Population Analytics
• Care Improvement
• Operational Improvement
• Financial Improvement
• Research
• …
A ‘Late Binding’ Approach to Text Analytics
Search
• Easy starting point
Synonym Finding
• Uses terminologies
• Allow user to find synonyms
Context Filtering
• Excludes negated concepts
• Good for cohort queries
Regular Expression
• Extraction of discrete values:
  • Ejection Fractions
  • ABI
Validation
• Expert review of algorithm output
• Performance measurement
Integration
• Operationalize algorithm
• Incorporate into analytics
Many More Techniques
• Section tagging
• Entity recognition
• N-gram analysis
• Document clustering
Final Thought
To leverage the power of text analytics… Make the data accessible first!
Lessons Learned
• Using search technology for clinical text is an engaging and accessible entry point for text analytics problems. Searching clinical text is powered by an inverted index that catalogs words present in the documents, which documents they are present in, and their position in the documents.
• Medical terminologies provide a dictionary of relevant terms, synonyms, and logical structures that can enhance clinical text exploration.
• NLP algorithms that are based on the context surrounding clinical terms can identify when the term is negated (“no evidence of pneumonia”) or applies to another person (“patient's grandmother had breast cancer”). Regular expressions can be applied to text to identify patterns and extract discrete values, like ejection fraction and ankle brachial index, that are stored in text.
• Text analytics should be validated and integrated with an enterprise data warehouse where the information extracted from text can be combined with discrete, coded data.
Analytic Insights
Questions & Answers
What You Learned…
Write down the key things you’ve learned related to each of the learning objectives after attending this session
Thank You