
CRF & SVM in Medication Extraction
A Cascade Approach to Extracting Medication Events
Dongfang Xu
School of Information
Outline
• Methodology
– Extraction Definition
– Data Preparation
– System Architecture
• Results & Discussion
– NER (CRF) Experiment
– Relationship Classification (SVM)
– CONTEXT Engine Evaluation
• Conclusion
Extraction Definition
Extract the following information (called fields) about
medications experienced by the patient from the
discharge summary:
– Medications (m): names, brand names, generics, and collective names of
prescription substances, over-the-counter medications, and other biological
substances.
– Dosages (do): indicating the amount of a medication
– Modes (mo): indicating the route for administering the medication
– Frequencies (f): indicating how often each dose of the medication should be
taken.
– Durations (du): indicating how long the medication is to be administered.
– Reasons (r): stating the medical reason for which the medication is given.
– List/narrative (ln): indicating whether the medication information appears in a list
structure or in narrative running text in the discharge summary.
Data Preparation
160 discharge summaries, annotated by a physician and
revised by the researcher:
– Training data: 130
– Testing data: 30
The annotation process took approximately 1.5 hours
per record due to the length of the clinical records.
System Architecture
• Basic strategy for medication event extraction
– Use CRF to identify the entities, including medication,
dosage, frequency, mode, etc.
– Build candidate pairs for each medication relationship
(only a drug and its possibly related entity are considered).
– Classify each binary relationship with an SVM.
– Generate medication entries based on the results from
the CRF and SVM.
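The four steps above can be sketched as a small cascade. This is only an illustration: `toy_tagger` and `toy_relation` are hypothetical stand-ins for the trained CRF and SVM, and the dictionary layout of an entry is an assumption, not the system's actual data structure.

```python
def extract_medication_events(tokens, tag_entities, is_related):
    """Cascade: tag entities, pair each drug with other entities,
    keep the pairs judged related, and group them into entries."""
    entities = tag_entities(tokens)                   # step 1: NER (a CRF in the slides)
    drugs = [e for e in entities if e["type"] == "m"]
    others = [e for e in entities if e["type"] != "m"]
    entries = []
    for drug in drugs:
        entry = {"m": drug["text"]}
        for other in others:                          # step 2: candidate pairs
            if is_related(tokens, drug, other):       # step 3: binary classification
                entry[other["type"]] = other["text"]  # step 4: entry generation
        entries.append(entry)
    return entries


# Toy stand-ins so the sketch runs; a real system would plug in trained models.
def toy_tagger(tokens):
    lexicon = {"aspirin": "m", "81mg": "do", "po": "mo",
               "daily": "f", "lasix": "m", "40mg": "do"}
    return [{"text": t, "type": lexicon[t.lower()], "pos": i}
            for i, t in enumerate(tokens) if t.lower() in lexicon]


def toy_relation(tokens, drug, other):
    # Hypothetical rule: related if the entity follows the drug within 3 tokens.
    return 0 < other["pos"] - drug["pos"] <= 3


tokens = "Aspirin 81mg po daily and Lasix 40mg".split()
entries = extract_medication_events(tokens, toy_tagger, toy_relation)
```

Passing the tagger and the relation classifier in as functions keeps the cascade independent of any particular model library.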
System Architecture
• CRF feature builder
– Seven feature sets were built for the CRF: drug, dosage,
mode, frequency, duration, reason, and morphology.
– Other features were also used, such as the medical
category of each word and whether the word is
capitalized.
– Backward elimination was used to select the most useful
feature sets.
– The context window for the CRF was set to five
words.
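A minimal sketch of per-token features with a context window follows. It assumes the five-word window means the current word plus two words on each side; the slides do not say how the window is centred. The feature names and the `<PAD>` marker are hypothetical.

```python
def token_features(tokens, i, window=5):
    """Feature dict for tokens[i]: surface cues plus neighbouring words."""
    half = window // 2  # assumption: window is centred on the current token
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
        "has_digit": any(c.isdigit() for c in tokens[i]),
    }
    for offset in range(-half, half + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        # Pad positions that fall outside the sentence.
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


tokens = ["Continue", "Lasix", "40", "mg", "daily"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

The dictionary-based feature sets (drug, dosage, and so on) would be added as further keys in the same dict, for example a flag for whether the word appears in a drug lexicon.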
System Architecture
• SVM Convertor converts CRF results into SVM input:
– Unigram Sentences
Each pair of medication elements at the unigram sentence level is
used to build an SVM training record.
– Sentence Pairs
A MEDICATION and its REASON may span two sentences. Using the same
mechanism as for the unigram sentence input, medication
pairs are also built at the sentence-pair level.
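The two levels of pair building might look like the sketch below. It assumes that only REASON entities are paired across adjacent sentences and that entities arrive as `(text, type)` tuples grouped by sentence; both are simplifications made for illustration.

```python
def build_candidate_pairs(sentences):
    """sentences: one list of (text, type) entities per sentence.
    Returns (drug, entity, level) candidates for the relation classifier."""
    pairs = []
    for idx, entities in enumerate(sentences):
        drugs = [e for e in entities if e[1] == "m"]
        # Single-sentence level: each drug with every non-drug entity.
        for drug in drugs:
            for ent in entities:
                if ent[1] != "m":
                    pairs.append((drug, ent, "same_sentence"))
        # Sentence-pair level (assumption: only reasons cross sentences).
        for neighbour in (idx - 1, idx + 1):
            if 0 <= neighbour < len(sentences):
                for drug in drugs:
                    for ent in sentences[neighbour]:
                        if ent[1] == "r":
                            pairs.append((drug, ent, "adjacent_sentence"))
    return pairs


sentences = [
    [("chest pain", "r")],
    [("nitroglycerin", "m"), ("0.4 mg", "do")],
]
pairs = build_candidate_pairs(sentences)
```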
System Architecture
• SVM features built from the input data:
– 1. Three words before and after the first entity.
– 2. Three words before and after the second entity.
– 3. Words between the two entities.
– 4. Words inside each entity.
– 5. The types of the two entities determined by the CRF
classifier.
– 6. The types of the entities between the two entities.
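The six feature groups can be collected for one candidate pair roughly as follows. The span representation (`start`, exclusive `end`, `type`) and the key names are hypothetical, and the sketch assumes the first entity precedes the second.

```python
def pair_features(tokens, first, second, all_entities, k=3):
    """Collect the six feature groups for an entity pair.
    Entities are dicts with token offsets 'start', 'end' (exclusive) and 'type'."""
    def around(ent):
        before = tokens[max(0, ent["start"] - k):ent["start"]]
        after = tokens[ent["end"]:ent["end"] + k]
        return before, after

    first_before, first_after = around(first)
    second_before, second_after = around(second)
    between_types = [e["type"] for e in all_entities
                     if e["start"] >= first["end"] and e["end"] <= second["start"]]
    return {
        "first_before": first_before, "first_after": first_after,      # group 1
        "second_before": second_before, "second_after": second_after,  # group 2
        "between_words": tokens[first["end"]:second["start"]],         # group 3
        "first_words": tokens[first["start"]:first["end"]],            # group 4
        "second_words": tokens[second["start"]:second["end"]],         # group 4
        "types": (first["type"], second["type"]),                      # group 5
        "between_types": between_types,                                # group 6
    }


tokens = "Started Lasix 40 mg po daily for edema".split()
entities = [
    {"start": 1, "end": 2, "type": "m"},
    {"start": 2, "end": 4, "type": "do"},
    {"start": 4, "end": 5, "type": "mo"},
    {"start": 5, "end": 6, "type": "f"},
    {"start": 7, "end": 8, "type": "r"},
]
feats = pair_features(tokens, entities[0], entities[3], entities)
```

Before training, these word lists would still need to be turned into a numeric vector, for instance as bag-of-words indicators per group.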
System Architecture
• Context Engine
– The CONTEXT engine identifies medication entries under
special section headings, such as “MEDICATIONS ON
ADMISSION:” and “DISCHARGE MEDICATIONS:”, or in the
narrative part of the clinical record.
• Medication Entry Generation
– The results from the previous steps, namely the CRF,
SVM and CONTEXT engine, are combined here.
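As a rough illustration of entry generation, the sketch below merges a drug, the entities the classifier linked to it, and the list/narrative decision into one record. The field abbreviations come from the extraction definition; the function name, the dict layout and the use of `None` for missing fields are assumptions.

```python
FIELDS = ["m", "do", "mo", "f", "du", "r"]


def make_entry(drug, related, in_list_section):
    """drug: medication text; related: (field, text) pairs accepted by the
    relation classifier; in_list_section: decision of the context step."""
    entry = {field: None for field in FIELDS}  # None marks a field not found
    entry["m"] = drug
    for field, text in related:
        if field in entry and field != "m":
            entry[field] = text
    entry["ln"] = "list" if in_list_section else "narrative"
    return entry


entry = make_entry("lasix", [("do", "40 mg"), ("f", "daily")], True)
```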
Outline
• Methodology
– Extraction Definition
– Data Preparation
– System Architecture
• Results & Discussion
– NER (CRF) Experiment
– Relationship Classification (SVM)
– CONTEXT Engine Evaluation
– Final output evaluation
NER (CRF) results
• Comparison of exact-match performance using the 7 feature
sets and the bag-of-words feature set (baseline, in
brackets).
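For reference, exact-match scoring can be computed as in the sketch below, where an entity counts as correct only if its span and type both match the gold annotation. Representing entities as `(start, end, type)` tuples is an assumption, and the sample data is made up for illustration.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F-score over exactly matching entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


gold = {(1, 2, "m"), (2, 4, "do"), (5, 6, "f")}
pred = {(1, 2, "m"), (2, 3, "do"), (5, 6, "f")}  # dosage span is one token short
precision, recall, f_score = exact_match_scores(gold, pred)
```

Here the truncated dosage span counts as both a false positive and a false negative, which is what makes exact match stricter than partial-overlap scoring.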
NER (CRF) results
• The F-scores for Duration and Reason using the 7
feature sets are approximately 10% higher than the
baseline, because:
1. The frequencies of the REASON and DURATION entities are much
smaller than those of the other four entity types.
2. For the DURATION entities, the rule-based regular expressions
can match non-medication terms (low precision), and some
DURATION terms cannot be discovered by our rules (low
recall).
3. REASON extraction depends highly on the Finding category in
SNOMED CT and the performance of TTSCT (Patrick et al.
2007).
NER (CRF) results
• The F-scores for Mode, Dosage and Frequency
improved by 2%.
• These errors come from:
1. Misspelling of drug names, such as “nitrog-lycerin”.
2. Drug names used in other contexts, such as “coumadin” in
the phrase “Coumadin Clinic”.
3. The drug allergy detector cannot cover all situations.
Relation Classification
• Comparison of relation classification performance
using all feature sets and a subset of the feature
sets (1, 2, 4; baseline, in brackets) in the SVM.
CONTEXT Engine Evaluation
The CONTEXT engine was adopted to discover the span of
the medication list (the span between the medication
heading and the next heading).
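Span discovery of this kind can be sketched as below. The heading pattern (an all-capitals line ending in a colon) and the test for the word "MEDICATION" are hypothetical simplifications; real discharge summaries use more varied headings, sometimes with content on the same line.

```python
import re

# Assumption: a section heading is an all-caps line that ends with a colon.
HEADING = re.compile(r"^[A-Z][A-Z ]*:$")


def medication_list_spans(lines):
    """Return (start, end) line ranges, end exclusive, that lie between a
    medication heading and the next heading (or the end of the record)."""
    spans = []
    start = None
    for i, line in enumerate(lines):
        if HEADING.match(line.strip()):
            if start is not None:       # any heading closes an open list
                spans.append((start, i))
                start = None
            if "MEDICATION" in line:    # a medication heading opens a list
                start = i + 1
    if start is not None:
        spans.append((start, len(lines)))
    return spans


lines = [
    "HOSPITAL COURSE:",
    "The patient was started on Lasix.",
    "DISCHARGE MEDICATIONS:",
    "Lasix 40 mg daily",
    "Aspirin 81 mg daily",
    "FOLLOW UP:",
    "See PCP in two weeks.",
]
spans = medication_list_spans(lines)
```

Medications found inside a returned span would be marked as list entries, and those outside any span as narrative.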
Final Output Evaluation
Due to errors in the NER, Relationship Classification
and Medication Entry Generator steps, the final F-scores for
each entity type are lower than those of the NER step alone.
The final scores for the medication event are between
86.23%.
Thank you!