Using the UMLS MetaMap as a Cause of Death Analyzer

Michael Hogarth, MD
Michael Resendez, MS
Univ. of California, Davis
Overview
• Causes of Death: A Historical Perspective
• Overview of the California EDRS
• Cause of Death Analysis tool (BECA)
• NLM MetaMap and the UMLS
• BECA-MetaMap experiment
• Discussion
NAPHSIS 2007
Historical Perspectives on causes of death
• Bills of Mortality (1532)
  • Arose from the need to better understand death rates in medieval England during plague epidemics (1361, 1368, 1375, 1390, 1406, …)
• John Graunt (1620-74)
  • Used the Bills of Mortality and found an infant death rate of 36% in England, which was not previously known or understood
• London Bills of Mortality classification
  • Used by Dr. John Snow to characterize a cholera outbreak traced to a water source in London
  • Evolved to become the Intl. Classification of Disease (1850’s)
• International Classification of Disease (ICD): used for the last 150 years
CA-EDRS Causes of Death
Causes of Death
• Importance
  • key epidemiological information is contained in the cause of death
• Issues and Challenges
  • absolutely correct versus ‘close to correct’
  • absolute correctness requires significant time and manual effort
  • is ‘close to correct’ in an automated fashion still useful?
• Typical process in California
  • COD --> SuperMICAR --> Stat Master File
  • turnaround for the entire process can be lengthy (~2 years)
  • a trend in causes of death could exist and not be known by local jurisdictions for 2 years
• Today in California
  • a significant number of jurisdictions don’t wait for the final statistical files from the State office to look at trends; they *manually* ‘code’ (if they have the staff), which takes time and funding
Preliminary COD classification
• Possible uses of a preliminary COD classification using automated methods that are ‘close to correct’
  • early identification of trends in a local jurisdiction
  • disease vs. injury/poisoning: coroner referral cross-checking
  • identify specific infectious causes (encephalitis, cholera, etc.)
• What it is not
  • not for ‘absolutely correct’ cause of death classification
  • will not replace the nosologist’s expertise in understanding the sequence of events leading to death, nor their understanding of ICD-10, with its includes/excludes
How to analyze causes of death?
• Challenges
  • text is verbatim and thus ‘arbitrary’ (free text)
  • need to go beyond simple keyword matching
  • biomedical knowledge and content is vast, and constantly changing!
• A possible approach: text mining and computational linguistic techniques
BECA
• We built BECA, a generic concept analyzer framework that can incorporate any ‘concept identifier’ engine, such as NLM MetaMap, along with other text processing tools
  • BECA = BECA Enables Concept Analysis
  • Supports a ‘plug-in’ design for the concept matcher and other components (e.g., a spell checker)
• Designed to support multiple transformations of the text in step-by-step fashion
  • transformations: strip special characters, convert to lower case, run the text through the concept matcher engine (MetaMap or other), run it through an available spell checker (Jazzy spell checker, etc.)
  • example transformations: convert to lowercase, remove all punctuation, map the string using the concept mapper, etc.
• First version of BECA uses the NLM MetaMap as a concept mapper
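The step-by-step transformation idea can be sketched in a few lines of Python. This is only an illustration of the plug-in pattern, not BECA's actual code: the function names (`to_lowercase`, `strip_punctuation`, `run_pipeline`) are hypothetical, and a concept mapper or spell checker would be plugged in as further steps.

```python
import string
from typing import Callable, List

# A transformation is any function that takes a string and returns a string.
Transformation = Callable[[str], str]


def to_lowercase(text: str) -> str:
    """Convert the literal to lower case."""
    return text.lower()


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return " ".join(text.translate(table).split())


def run_pipeline(text: str, steps: List[Transformation]) -> List[str]:
    """Apply each step in order, keeping every intermediate result."""
    results = [text]
    for step in steps:
        results.append(step(results[-1]))
    return results


# Hypothetical usage on one of the literals discussed later in the talk.
history = run_pipeline("LUNF CARCINOMA, METASTATIC",
                       [to_lowercase, strip_punctuation])
for stage in history:
    print(stage)
```

Keeping the intermediate results makes it possible to inspect what each transformation did to a literal before it reaches the concept mapper.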
BECA system design
Example transformations
What is NLM MetaMap?
• The National Library of Medicine’s MetaMap
  • a free, open source software component built by the NLM Lister Hill Laboratory
  • uses computational linguistic techniques to map biomedical text to a large corpus of biomedical content (the NLM Unified Medical Language System)
• Provides a number of text processing functions
  • Includes a ‘concept mapper’ that attempts to match phrases with concepts in the UMLS Metathesaurus
  • Includes a UMLS concept-to-code mapping for multiple coding systems (ICD, SNOMED, etc.)
How does MetaMap work?
• Takes text as input, attempts to identify ‘concepts’ in the text, and matches them to concepts in a large corpus of biomedical phrases and concepts (the UMLS Metathesaurus)
• Each retrieved “candidate” match includes a score that reflects how confident MetaMap is that the match is correct
• Each retrieved candidate also includes its semantic type
  • “Disease or Syndrome”, “Injury or Poisoning”, etc.
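A candidate can be thought of as a small record holding a concept name, a semantic type, and a score. The sketch below is an assumption about how such output might be represented and filtered, not MetaMap's real output format: the `Candidate` class, the `best_candidate` helper, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Candidate:
    concept_name: str   # matched Metathesaurus concept (hypothetical values below)
    semantic_type: str  # e.g. "Disease or Syndrome"
    score: int          # match score; the talk reports values up to 1000


def best_candidate(candidates: Iterable[Candidate],
                   allowed_types: Set[str],
                   min_score: int = 800) -> Optional[Candidate]:
    """Return the highest-scoring candidate of an allowed semantic type."""
    eligible = [c for c in candidates
                if c.semantic_type in allowed_types and c.score >= min_score]
    return max(eligible, key=lambda c: c.score, default=None)


# Made-up candidates for a literal such as "heart failure".
candidates = [
    Candidate("Heart failure", "Disease or Syndrome", 1000),
    Candidate("Heart", "Body Part, Organ, or Organ Component", 861),
    Candidate("Failure", "Functional Concept", 861),
]
best = best_candidate(candidates,
                      {"Disease or Syndrome", "Injury or Poisoning"})
print(best)
```

Restricting the allowed semantic types is one way to separate disease literals from injury or poisoning literals, at the cost of returning nothing for some inputs.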
The UMLS
• Developed by the National Library of Medicine
• Derived from over 100 sources (ICD, SNOMED, etc.)
• The Unified Medical Language System
  • A system built to support information retrieval in biomedicine
  • Used in PubMed, ClinicalTrials.gov, etc.
• Consists of:
  • (1) UMLS Metathesaurus
  • (2) UMLS Semantic Network
  • (3) UMLS SPECIALIST Lexicon
UMLS in detail
• UMLS Metathesaurus: the world’s largest repository of biomedical phrases
  • 1.3 million concepts, 6.4 million unique phrases (concept names)
  • over 100 source vocabularies (ICD, SNOMED, CPT, etc.)
• UMLS SPECIALIST Lexicon
  • a file that provides individual words found in the UMLS Metathesaurus and their linguistic information, including grammatical ‘type’ (noun, verb, adjective, adverb, etc.)
• UMLS Semantic Network
  • a set of files that classify each Metathesaurus ‘concept’ into a particular type
  • Examples: “Disease”, “Injury/Poisoning”, “Neoplasm”, ...
MetaMap Algorithm
• MetaMap’s algorithm consists of four steps
  • (1) Parsing
    • using a part-of-speech tagger, text is decomposed into one or more noun phrases
      • “ocular complications of myasthenia gravis” ==> “ocular complications” and “myasthenia gravis”
    • noun phrases are processed independently by decomposing them into their grammatical parts
      • “ocular complications” ==> modifier “ocular” and head of the phrase “complications”
  • (2) Variant Generation: ‘variants’ for each phrase are generated using SPECIALIST
    • variants: all synonyms of the term, acronyms containing the term, abbreviations, plural/singular variants
    • each variant has a ‘distance’ score obtained from SPECIALIST
    • “ocular”: “eye”, “eyes”, “optic”, “ophthalmic”, “ophthalmia”, “oculus”, “oculi”
MetaMap Algorithm
• MetaMap Algorithm continued
  • (3) Candidate Retrieval from the Metathesaurus
    • all Metathesaurus strings that contain at least one of the variants are retrieved
    • can exclude those where the variant is present in a large number of strings (i.e., a very common string)
  • (4) Candidate Evaluation: the MMTX score
    • each Metathesaurus candidate is evaluated by calculating the ‘strength’ of the similarity between the original input phrase and the candidate phrase from the Metathesaurus
    • the calculation involves a weighted average of four metrics: distance scores for variants from the input noun phrase (‘variation’), whether the match involves the ‘head’ of the phrase (‘centrality’), ‘coverage’, and ‘cohesiveness’
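The weighted-average idea in step (4) can be illustrated with a short sketch. The slide does not give the weights or the scaling, so both are assumptions here: each metric is taken to lie between 0 and 1, coverage and cohesiveness are counted twice as heavily as centrality and variation, and the result is scaled to 0-1000 to line up with the score range reported later in the talk. The function name `mmtx_style_score` is hypothetical.

```python
from typing import Sequence


def mmtx_style_score(centrality: float,
                     variation: float,
                     coverage: float,
                     cohesiveness: float,
                     weights: Sequence[float] = (1.0, 1.0, 2.0, 2.0)) -> int:
    """Weighted average of four metrics in [0, 1], scaled to 0-1000.

    The default weights are an assumption for illustration only.
    """
    values = (centrality, variation, coverage, cohesiveness)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each metric must lie between 0 and 1")
    weighted = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return round(1000 * weighted)


# A candidate that matches perfectly on every metric gets the top score.
print(mmtx_style_score(1.0, 1.0, 1.0, 1.0))
# A candidate reached only through a more distant variant scores lower.
print(mmtx_style_score(1.0, 0.5, 1.0, 1.0))
```

With these assumed weights, a weaker variation score lowers the result less than an equally weak coverage score would.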
Example
• BECA MetaMap output
• Input phrase: “ocular complications”
The question
• Can BECA, using the NLM MetaMap, be useful in:
  1. Identifying biomedical concepts in a cause of death literal, which is narrative text?
  2. “Auto-coding” literals into ICD-10 codes?
Cause of Death Literals in CA-EDRS
• CA-EDRS data is a combination of records initiated in EDRS (EDRS counties) and those submitted on paper (non-EDRS counties)
• Causes of death are verbatim from the certifier and are typically entered into EDRS or typed on a paper certificate by funeral home staff or hospital staff
• Overall COD statistics for CA-EDRS
  • 462,564 registered death certificates
  • 985,330 unique literals (phrases) in all COD fields
  • 88,719 unique literals (phrases) in the Immediate Cause of Death field
Experiment
• We randomly selected 1,000 literals from the 88,719 unique literals in the Immediate Cause of Death field
• We submitted these “as is” to BECA (MetaMap, no spell checking component)
• BECA returned an average of about 7.8 candidate matches per literal (7,791 candidates for 1,000 strings)
• Candidate scores ranged from 517 to 1000
• Match score distribution for the 7,791 candidates
[Figure: Match Score Distribution (x-axis: match score; y-axis: number at that score)]
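The sampling step and the per-literal average can be sketched as follows. The population here is a stand-in list, since the actual literals are not reproduced in the slides; the helper name `sample_literals` and the fixed seed are assumptions made for repeatability.

```python
import random
from typing import List


def sample_literals(literals: List[str], k: int, seed: int = 0) -> List[str]:
    """Draw k distinct literals uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(literals, k)


# Stand-in population with the same size as the set of unique
# Immediate Cause of Death literals reported in the talk.
population = [f"literal {i}" for i in range(88_719)]
sample = sample_literals(population, 1_000)

# Average number of candidates per literal, from the reported totals.
mean_candidates = 7_791 / 1_000
print(len(sample), round(mean_candidates, 1))
```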
Example Output
Literals with high score matches >=800
High Score Candidate Matches
• 3,017 (38.7%) of the 7,791 candidates had a score >=800
• 95.3% of the original literals (953/1000) had at least one candidate with a match score >=800
• 54.5% of the original literals (545/1000) had at least one candidate with a match score >=900
• 30.7% of the original literals (307/1000) had at least one candidate with a match score =1000
• Note: only 7.5% were exact matches of a string found in the UMLS Metathesaurus
• Match score distribution for the 3,017 candidates
[Figure: Match Score Distribution (>800) (x-axis: match score; y-axis: number at score)]
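The "at least one candidate at or above a threshold" figures can be computed with a small helper like the one below. This is a sketch under the assumption that candidate scores are grouped per literal; the function name and the three example literals with their scores are made up for illustration.

```python
from typing import Dict, List


def share_with_score_at_least(scores_by_literal: Dict[str, List[int]],
                              threshold: int) -> float:
    """Fraction of literals having at least one candidate score >= threshold."""
    if not scores_by_literal:
        return 0.0
    hits = sum(1 for scores in scores_by_literal.values()
               if any(score >= threshold for score in scores))
    return hits / len(scores_by_literal)


# Made-up scores for three literals, for illustration only.
toy_scores = {
    "heart failure": [1000, 861],
    "hear failure": [827, 694],
    "pending tox & micro": [604],
}
for threshold in (800, 900, 1000):
    share = share_with_score_at_least(toy_scores, threshold)
    print(threshold, f"{share:.1%}")
```

Applied to the real data, the same calculation gives the reported 953/1000, 545/1000, and 307/1000 figures for thresholds of 800, 900, and 1000.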
Semantic Type correct matches
• BECA with MetaMap correctly categorized 720 (72%) of the literals by semantic type
• Of these, “Neoplastic Process” had the highest reliability
Wrong matches
• Semantic types most frequently in error
ICD-10 Coding
• 252 of the 1,000 (25.2%) literals had an ICD-10 code matched by BECA-MetaMap
• Categories
  • 1 = good match
  • 2 = approximate match (within ICD category)
  • 0 = incorrect code
• Results: 97% were good or approximate
  • 82.5% “good match”
  • 14.3% “approximate match”
  • 3.2% “incorrect match”
ICD-10 Autocoding data
Some interesting challenges
• “CSTFIOTRDPIRATORY FAILURE”
• “CHRONIC ALCOHOLISHM”
• “ESOPHAGELA VARICES”
• “END STAGE RENAL DOSEASE”
• “HEAR FAILURE”
• “OVARION CANCER WITH METASTASES”
• “LUNF CARCINOMA, METASTATIC”
• “PENDING TOX & MICRO”
• “SEP[TIC SHOCK”
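Several of these literals are one or two keystrokes away from a recognizable term, which is what a spell-checking step would target. The sketch below uses Python's standard `difflib` for approximate word matching; it is not the spell checker BECA would use, and the small `VOCABULARY` list and the 0.7 cutoff are assumptions chosen for this example.

```python
import difflib
import string
from typing import Sequence

# Hypothetical mini-vocabulary; a real system would use a medical lexicon.
VOCABULARY = [
    "alcoholism", "cancer", "carcinoma", "chronic", "disease", "end",
    "esophageal", "failure", "heart", "lung", "metastases", "metastatic",
    "ovarian", "renal", "stage", "varices",
]


def correct_word(word: str,
                 vocabulary: Sequence[str] = VOCABULARY,
                 cutoff: float = 0.7) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_phrase(phrase: str) -> str:
    """Lower-case, replace punctuation with spaces, then correct word by word."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    words = phrase.lower().translate(table).split()
    return " ".join(correct_word(word) for word in words)


for literal in ("HEAR FAILURE", "END STAGE RENAL DOSEASE",
                "ESOPHAGELA VARICES", "CHRONIC ALCOHOLISHM"):
    print(literal, "->", correct_phrase(literal))
```

Word-by-word matching of this kind only helps when the vocabulary is restricted to the domain ("hear" is itself a valid English word), and it does not repair literals such as “SEP[TIC SHOCK”, where stray punctuation splits a word.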
Discussion
• MetaMap may be useful for preliminary categorization of causes of death by semantic type
• Excluding certain semantic types would improve match precision (at the cost of a lower number of matches)
• BECA-MetaMap only assigned an ICD-10 code 25.2% of the time
• When BECA-MetaMap did assign an ICD-10 code, it was a good match in about 83% of cases (82.5%), and a good or approximate match in 97% of cases
• We found that MetaMap was “confused” if:
  • there are multiple concepts (noun phrases) in a single string
  • the phrase has a compound statement (“metastasis to brain and bone” or “gunshot wounds of the head and right arm”)
  • the phrase begins with certain words (e.g., complications)
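Combining the assignment rate with the accuracy figures gives a rough sense of overall yield across all 1,000 literals. The short calculation below uses only the percentages reported in the talk; the variable names are ours, and the products are back-of-the-envelope estimates rather than results stated by the authors.

```python
# Reported figures from the ICD-10 coding results.
assigned_rate = 0.252      # share of literals that received an ICD-10 code
good_rate = 0.825          # share of assigned codes judged a "good match"
approximate_rate = 0.143   # share of assigned codes judged "approximate"

# Share of ALL literals that ended up with a good code.
overall_good = assigned_rate * good_rate
# Share of ALL literals that ended up with a good or approximate code.
overall_usable = assigned_rate * (good_rate + approximate_rate)

print(f"good code for {overall_good:.1%} of all literals")
print(f"good or approximate code for {overall_usable:.1%} of all literals")
```

In other words, high accuracy on the coded subset still leaves roughly three quarters of the literals without any ICD-10 code, which is the gap the future directions aim to close.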
Future Directions for BECA
• Build a new “concept mapper” to replace MetaMap, specifically designed to analyze cause of death phrases
  • include a spell checker
  • disambiguation for phrases that have compound statements
  • match SNOMED first, then match to ICD-10 (increases the hit rate for ICD-10 autocoding)
  • improve performance
  • implement ICD-10 includes/excludes using an open source rules engine (jBoss Rules Engine)
Credits
• National Library of Medicine, Lister Hill Lab
• University of California
• Michael Resendez, MS
• Cecil Lynch, MD, MS
• California Department of Health (California Department of Public Health)
• Terry Trinidad
• David Fisher
• Debbie McDowell
California EDRS
• Developed by the University of California and California DHS (2004-2005)
• Implementation (2005 - 2008)
  • all death certificates entered into EDRS since Jan 1, 2005
  • full EDRS (implemented counties): DC originates in EDRS and is electronically completed locally
  • KDE EDRS (non-EDRS counties): DC completed in standard ‘paper’ fashion, eventually entered by State office into EDRS
• June 2007 - where are we?
  • today --> 510,000 certificates (2005 - present)
  • Originate locally (EDRS records) or are entered later into EDRS (non-EDRS records)
  • Today, June 2007, ~65% originate locally as EDRS electronic records
  • By Nov 2007, over 90% of all CA records will originate in EDRS
Cause of Death Workflow with CA-EDRS
• CA-EDRS does not provide electronic support for gathering of the COD today
[Figure: workflow diagram. Labels: certifier and funeral home exchange (fax) worksheet; once COD is finalized by the certifier, funeral home staff create the EDRS record and enter the causes]