MCORES: a system for noun phrase
coreference resolution for clinical records
2012 SHARPn Summit “Secondary Use”
Andreea Bodnari,1 Peter Szolovits,1 Ozlem Uzuner2
1MIT, CSAIL, Cambridge, MA, USA
2Department of Information Studies, University at Albany SUNY, Albany, NY, USA
10.16.2012 - Rochester, MN
• Medical coreference resolution system (MCORES)
• Experimental results
• Conclusion
Page 2
• Electronic Medical Records (EMRs): large information repositories
• Clinical information requires processing
  ◦ Lower level: sentence parsing, tokenization
  ◦ Higher level: coreference resolution, semantic disambiguation
• Coreference resolution: a fundamental step in text processing
Page 3
• English medical corpus provided by i2b2 National Center for Biomedical Computing
  ◦ De-identified medical discharge summaries
    ▪ Source: PH & BIDMC
    ▪ Content: 230 (PH) + 196 (BIDMC) discharge summaries
  ◦ Annotated concepts and coreference chains
• Concept types
  ◦ Persons
  ◦ Problems
  ◦ Treatments
  ◦ Tests
  ◦ Pronouns
Page 4
NP Instance Creation → Feature Generation → Classification → Output Clustering
Page 5
• Markables of the same semantic category are paired together
• MCORES creates positive instances only from neighboring markable pairs in a chain
1 Instance creation akin to McCarthy and Lehnert
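The pairing rule above can be sketched as follows. This is a minimal illustration, not the MCORES implementation: `Markable`, its fields, and `create_instances` are hypothetical names. The slide does not say how negative instances are chosen, so the sketch assumes that same-category pairs from different chains are negatives and that non-neighboring pairs within one chain are skipped.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Markable:
    """Hypothetical container for an annotated noun phrase."""
    text: str
    category: str                   # e.g. "person", "problem", "treatment", "test"
    position: int                   # order of appearance in the document
    chain_id: Optional[int] = None  # gold coreference chain; None for singletons


def create_instances(markables: List[Markable]) -> List[Tuple[Markable, Markable, bool]]:
    """Pair same-category markables and label each pair as coreferent or not."""
    ordered = sorted(markables, key=lambda m: m.position)

    # Collect the members of each chain in document order.
    chains = {}
    for m in ordered:
        if m.chain_id is not None:
            chains.setdefault(m.chain_id, []).append(m)

    # Positive instances come only from neighboring markables within a chain.
    neighbors = set()
    for members in chains.values():
        neighbors.update(zip(members, members[1:]))

    instances = []
    for antecedent, anaphor in combinations(ordered, 2):
        if antecedent.category != anaphor.category:
            continue  # only markables of the same semantic category are paired
        if (antecedent, anaphor) in neighbors:
            instances.append((antecedent, anaphor, True))
        elif antecedent.chain_id is None or antecedent.chain_id != anaphor.chain_id:
            # ASSUMPTION: pairs from different chains are negative instances.
            instances.append((antecedent, anaphor, False))
        # ASSUMPTION: same-chain, non-neighboring pairs are left out entirely.
    return instances
```

With a chain of three person mentions, only the first-second and second-third pairs become positive instances; the first-third pair is not emitted.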
Page 6
Textual overlap   Class           Persons  Problems  Treatments  Tests  Across all categories
Exact             Coreferent         3347       984         786    206                   5323
Exact             Non-coreferent      100        29          21      7                    157
Partial           Coreferent          337      1353         764    239                   2693
Partial           Non-coreferent      711      1217         557    317                   2802
None              Coreferent         5461       597         329     56                   6443
None              Non-coreferent     5403     46056       19328   6709                  77496
Total             Coreferent         9145      2934        1879    501                  14459
Total             Non-coreferent     6214     47302       19906   7033                  80455
Table 3: Distribution of coreferent and non-coreferent instances per semantic category over instances containing exact, partial, and no textual overlap.
Page 7
• Multi-perspective features
  ◦ Antecedent perspective
  ◦ Anaphor perspective
  ◦ Greedy perspective
  ◦ Stingy perspective
• Phrase-level lexical
• Sentence-level lexical
• Syntactic
• Semantic
• Miscellaneous
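The slide names four perspectives without defining them. As a rough sketch of how one overlap feature could be viewed from each, the function below normalizes the shared-token count by the antecedent, by the anaphor, and then takes the larger (greedy) and smaller (stingy) of the two. These definitions and the name `token_overlap_perspectives` are assumptions made for illustration and may differ from what MCORES computes.

```python
from typing import Dict


def token_overlap_perspectives(antecedent: str, anaphor: str) -> Dict[str, float]:
    """Token overlap of two mentions under four assumed perspectives."""
    ante_tokens = set(antecedent.lower().split())
    ana_tokens = set(anaphor.lower().split())
    shared = len(ante_tokens & ana_tokens)

    # ASSUMPTION: each perspective differs only in the normalizing length.
    by_antecedent = shared / len(ante_tokens) if ante_tokens else 0.0
    by_anaphor = shared / len(ana_tokens) if ana_tokens else 0.0

    return {
        "antecedent": by_antecedent,
        "anaphor": by_anaphor,
        "greedy": max(by_antecedent, by_anaphor),  # most generous view
        "stingy": min(by_antecedent, by_anaphor),  # most conservative view
    }


scores = token_overlap_perspectives("chest pain", "the severe chest pain")
# Under these assumptions: antecedent 1.0, anaphor 0.5, greedy 1.0, stingy 0.5
```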
Page 8
Phrase-level lexical
• Token overlap*
• Normalized token overlap
• Edit-distance
• Normalized edit-distance
Sentence-level lexical
• Sentence-level token overlap*
• Filtered sentence-level token overlap*
• Left and right mention overlap
  ◦ stingy and greedy perspectives only
* multi-perspective feature
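For the edit-distance features, a minimal sketch using the standard Levenshtein distance is shown below. The slide does not state which edit operations or which normalization MCORES uses; dividing by the length of the longer string is an assumption, and both function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]. ASSUMPTION: normalize by the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Normalizing keeps the value comparable across short and long mentions, since a distance of 2 means something very different for a five-character name than for a long drug description.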
Page 9
Syntactic
• Number agreement
• Noun overlap*
• Surname match
Semantic
• UMLS CUI overlap*
• UMLS CUI token overlap*
• UMLS semantic type overlap*
• Anaphor UMLS semantic type
* multi-perspective feature
Page 10
• Token distance
• Mention distance
• All-mention distance
• Sentence distance
• Section match
• Section distance
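The distance features above are listed by name only. One plausible reading, sketched below, derives them from token, sentence, and section indices. The `Mention` class and the exact definitions are assumptions: in particular, "mention distance" is taken to count intervening mentions of the same semantic category, and "all-mention distance" to count intervening mentions of any category.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record with document offsets."""
    category: str
    token_start: int  # index of the first token in the document
    token_end: int    # index of the last token (inclusive)
    sentence: int     # sentence index
    section: int      # section index


def distance_features(antecedent: Mention, anaphor: Mention,
                      all_mentions: List[Mention]) -> Dict[str, object]:
    """Distance features for a pair where the antecedent precedes the anaphor."""
    # Mentions lying strictly between the two mentions of the pair.
    between = [
        m for m in all_mentions
        if antecedent.token_end < m.token_start and m.token_end < anaphor.token_start
    ]
    return {
        "token_distance": anaphor.token_start - antecedent.token_end - 1,
        # ASSUMPTION: same-category mentions only.
        "mention_distance": sum(1 for m in between if m.category == anaphor.category),
        # ASSUMPTION: mentions of every category.
        "all_mention_distance": len(between),
        "sentence_distance": anaphor.sentence - antecedent.sentence,
        "section_match": antecedent.section == anaphor.section,
        "section_distance": anaphor.section - antecedent.section,
    }
```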
Page 11
• C4.5 decision tree algorithm
  ◦ Flexible
  ◦ Readable prediction model
• Classify pairs of markables based on values of the feature vectors
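The slide gives no training details, so the sketch below illustrates only the split criterion that C4.5 is known for, the gain ratio, on a discrete feature. It is not a full decision tree learner (C4.5 also handles continuous attributes and pruning), and the function names and toy feature values are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values: Sequence[Hashable], labels: Sequence[bool]) -> float:
    """Information gain of splitting on the feature, divided by split information."""
    total = len(labels)

    # Group the labels by the value the feature takes for each instance.
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)

    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(feature_values)
    return information_gain / split_information if split_information > 0 else 0.0


# Toy data: a feature that separates coreferent from non-coreferent pairs
# perfectly, and one that carries no information about the label.
labels = [True, True, False, False]
informative = gain_ratio([1, 1, 0, 0], labels)
uninformative = gain_ratio([1, 0, 1, 0], labels)
```

At each node the tree splits on the feature with the highest gain ratio, which is part of why the resulting model can be read as a sequence of feature tests.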
Page 12
• Classifier makes pairwise predictions only
• Pairwise predictions clustered into coreference chains
  ◦ Aggressive-merge1 clustering algorithm
    ▪ Given a prediction [M1] - [M2], merge it with all preceding pairwise predictions linked to [M1] or [M2]
1 Aggressive-merge algorithm proposed by McCarthy and Lehnert
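The merging step described above amounts to taking the transitive closure of the positive pairwise predictions. A minimal sketch using a union-find structure follows; the function name and the string mention identifiers are illustrative, and the sketch assumes singletons are not reported as chains.

```python
from typing import Dict, Iterable, List, Tuple


def aggressive_merge(mentions: List[str],
                     coreferent_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Cluster mentions so that every predicted-coreferent pair shares a chain."""
    parent: Dict[str, str] = {m: m for m in mentions}

    def find(m: str) -> str:
        # Follow parent links to the cluster representative, halving the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    chains: Dict[str, List[str]] = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)

    # ASSUMPTION: a mention linked to nothing is a singleton, not a chain.
    return [chain for chain in chains.values() if len(chain) > 1]


chains = aggressive_merge(
    ["M1", "M2", "M3", "M4", "M5"],
    [("M1", "M2"), ("M3", "M4"), ("M2", "M3")],
)
# chains == [["M1", "M2", "M3", "M4"]]
```

Because every positive prediction forces a merge, a single false-positive link joins two otherwise separate chains, so pairwise precision matters a great deal with this strategy.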
Page 13
• Feature set evaluation
• Perspectives evaluation
• Performance evaluation against
  ◦ In-house baseline
  ◦ Third-party systems (RECONCILEACL09 & BART)
• Evaluation metric: unweighted averages of Recall, Precision, and F-measures of
  ◦ MUC
  ◦ B3
  ◦ CEAF
  ◦ BLANC
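To make the evaluation metric concrete, the sketch below implements one of the four listed scorers, B3, together with the unweighted averaging step. MUC, CEAF, and BLANC are not implemented here. The function names are illustrative, and the sketch assumes that a mention missing from either side is treated as a singleton there.

```python
from typing import List, Sequence, Tuple


def b_cubed(key_chains: List[List[str]],
            response_chains: List[List[str]]) -> Tuple[float, float, float]:
    """Mention-level B3 precision, recall, and F-measure."""
    key_of = {m: frozenset(chain) for chain in key_chains for m in chain}
    response_of = {m: frozenset(chain) for chain in response_chains for m in chain}
    mentions = set(key_of) | set(response_of)
    if not mentions:
        return 0.0, 0.0, 0.0

    precision = recall = 0.0
    for m in mentions:
        # ASSUMPTION: a mention absent from one side is a singleton on that side.
        key = key_of.get(m, frozenset([m]))
        response = response_of.get(m, frozenset([m]))
        shared = len(key & response)
        precision += shared / len(response)
        recall += shared / len(key)

    precision /= len(mentions)
    recall /= len(mentions)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def unweighted_average(scores: Sequence[Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Average (precision, recall, F) triples from several metrics, equally weighted."""
    return tuple(sum(column) / len(scores) for column in zip(*scores))
```

For example, scoring the response `[["a", "b"], ["c", "d", "e"]]` against the key `[["a", "b", "c"], ["d", "e"]]` gives a B3 precision and recall of 11/15 each.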
Page 14
Page 15
• MCORES’ advantage comes from linking markables with no token overlap
• Phrase-level sub-MCORES performs similarly to MCORES
• Greedy perspective system is the most favorable single-perspective system
• Multi-perspective system performs as well as or better than single-perspective systems
• Error analysis
  ◦ MCORES fails to classify misspelled person pairs
  ◦ Medical problem false positives are due to the difference between new and recurring events
  ◦ Treatment false positives are due to medications with different routes of administration
  ◦ Test false positives are due to the large number of full-overlap instances that did not corefer
Page 16
• Developed coreference resolution system for the medical domain (MCORES)
• MCORES innovates through a multi-perspective and knowledge-based feature set
• MCORES outperforms third-party systems and an in-house baseline, improving coreference resolution on clinical records
Page 17