CRF & SVM in Medication Extraction: A Cascade Approach to Extracting Medication Events
Dongfang Xu, School of Information

Outline
• Methodology
  – Extraction Definition
  – Data Preparation
  – System Architecture
• Results & Discussion
  – NER (CRF) Experiment
  – Relationship Classification (SVM)
  – CONTEXT Engine Evaluation
• Conclusion

Extraction Definition
Extract the following information (called fields) about medications experienced by the patient from the discharge summary:
– Medications (m): names, brand names, generics, and collective names of prescription substances, over-the-counter medications, and other biological substances.
– Dosages (do): indicating the amount of a medication.
– Modes (mo): indicating the route for administering the medication.
– Frequencies (f): indicating how often each dose of the medication should be taken.
– Durations (du): indicating how long the medication is to be administered.
– Reasons (r): stating the medical reason for which the medication is given.
– List/narrative (ln): indicating whether the medication information appears in a list structure or in narrative running text in the discharge summary.

Data Preparation
• 160 discharge summaries, annotated by a physician and revised by the researcher:
  – Training data: 130
  – Testing data: 30
• The annotation process took approximately 1.5 hours per record due to the length of the clinical records.

Outline
• Methodology
  – Extraction Definition
  – Data Preparation
  – System Architecture
• Results & Discussion
  – NER (CRF) Experiment
  – Relationship Classification (SVM)
  – CONTEXT Engine Evaluation
• Conclusion

System Architecture
• Basic strategy for medication event extraction:
  – Use a CRF to identify the entities, including medication, dosage, frequency, mode, etc.
  – Build candidate pairs for each medication relationship (only a drug and its related entity are considered).
  – Classify each binary relationship with an SVM.
  – Generate medication entries from the CRF and SVM results.

System Architecture
• CRF feature builder
  – Seven feature sets were built for the CRF: drug, dosage, mode, frequency, duration, reason, and morphology.
  – Many other features were also used: the medical category of each word, whether the word is capitalized, etc.
  – Backward elimination was used to select the most useful feature sets.
  – The context window for the CRF was set to five words.
  – (A minimal feature-extraction sketch is given at the end of this section.)

System Architecture
• The SVM Convertor converts CRF results into SVM input:
  – Unigram sentences: each pair of medication elements within a single sentence is used to build an SVM training record.
  – Sentence pairs: a MEDICATION and its REASON can span two sentences, so, following the same mechanism as for unigram sentences, medication pairs are also built at the sentence-pair level.

System Architecture
• SVM features built from the input data (see the pair-building sketch at the end of this section):
  1. Three words before and after the first entity.
  2. Three words before and after the second entity.
  3. Words between the two entities.
  4. Words inside each entity.
  5. The types of the two entities, as determined by the CRF classifier.
  6. The types of the entities lying between the two entities.

System Architecture
• CONTEXT Engine
  – The CONTEXT engine identifies whether a medication entry appears under a special section heading, such as "MEDICATIONS ON ADMISSION:" or "DISCHARGE MEDICATIONS:", or in the narrative part of the clinical record. (A section-detection sketch follows below.)
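The slides do not include implementation details for the CRF feature builder, so the following is a minimal sketch, assuming an sklearn-crfsuite-style tagger over BIO-labelled tokens. The lexicon contents, feature names, and the two-tokens-each-side window (five words total) are illustrative assumptions, not the authors' actual feature sets.

```python
import sklearn_crfsuite

# Illustrative lexicons standing in for the drug/mode/... feature sets;
# the real system would rely on much larger resources (e.g., SNOMED CT).
DRUG_LEXICON = {"lisinopril", "coumadin", "nitroglycerin"}
MODE_LEXICON = {"po", "iv", "orally", "subcutaneously"}

def token_features(tokens, i, window=2):
    """Features for token i and its neighbours in a five-word window (i-2..i+2)."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            w = tokens[j]
            key = f"w[{offset:+d}]"
            feats[f"{key}.lower"] = w.lower()
            feats[f"{key}.is_capitalized"] = w[:1].isupper()
            feats[f"{key}.is_digit"] = w.isdigit()
            feats[f"{key}.in_drug_lexicon"] = w.lower() in DRUG_LEXICON
            feats[f"{key}.in_mode_lexicon"] = w.lower() in MODE_LEXICON
    return feats

def sentence_to_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Standard CRF training recipe over BIO-labelled sentences:
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
# crf.fit([sentence_to_features(s) for s in train_sentences], train_labels)
```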
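The SVM Convertor can be sketched in a similar hedged way. The entity-span representation, helper names, and the use of scikit-learn's DictVectorizer and LinearSVC are my assumptions; the sketch builds one candidate record per drug/attribute pair and extracts feature groups 1 to 5 from the slide, with group 6 noted but omitted for brevity.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def pair_features(tokens, ent1, ent2):
    """ent1/ent2 are (start, end, type) token spans from the CRF output,
    with ent1 assumed to occur before ent2 in the sentence."""
    (s1, e1, t1), (s2, e2, t2) = ent1, ent2
    feats = {}
    # 1-2. Three words before and after each entity.
    for k, (s, e) in enumerate([(s1, e1), (s2, e2)], start=1):
        for off, w in enumerate(tokens[max(0, s - 3):s]):
            feats[f"e{k}.before.{off}"] = w.lower()
        for off, w in enumerate(tokens[e:e + 3]):
            feats[f"e{k}.after.{off}"] = w.lower()
    # 3. Words between the two entities.
    for off, w in enumerate(tokens[e1:s2]):
        feats[f"between.{off}"] = w.lower()
    # 4. Words inside each entity.
    for off, w in enumerate(tokens[s1:e1]):
        feats[f"e1.inside.{off}"] = w.lower()
    for off, w in enumerate(tokens[s2:e2]):
        feats[f"e2.inside.{off}"] = w.lower()
    # 5. The CRF-assigned types of the two entities.
    feats["e1.type"], feats["e2.type"] = t1, t2
    # 6. The types of entities lying between the pair would additionally
    #    require the full entity list for the sentence (omitted here).
    return feats

# vec = DictVectorizer()
# X = vec.fit_transform(pair_features(t, a, b) for t, a, b in training_pairs)
# clf = LinearSVC().fit(X, y)   # y = 1 if the pair is a true relation, else 0
```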
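Finally, a minimal sketch of the CONTEXT-engine idea: locate medication section headings and treat the span up to the next heading as list context, everything else as narrative. The two medication headings come from the slide; the generic-heading pattern and function names are assumptions.

```python
import re

MED_HEADINGS = re.compile(
    r"^(MEDICATIONS ON ADMISSION|DISCHARGE MEDICATIONS|MEDICATIONS)\s*:",
    re.MULTILINE)
ANY_HEADING = re.compile(r"^[A-Z][A-Z /]+:", re.MULTILINE)

def medication_list_spans(text):
    """Return (start, end) character spans lying under a medication heading."""
    spans = []
    for m in MED_HEADINGS.finditer(text):
        nxt = ANY_HEADING.search(text, m.end())
        spans.append((m.end(), nxt.start() if nxt else len(text)))
    return spans

def context_of(offset, spans):
    """Label an entity offset as 'list' or 'narrative' for the ln field."""
    return "list" if any(s <= offset < e for s, e in spans) else "narrative"
```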
System Architecture
• Medication Entry Generation
  – The results from the previous steps, namely the CRF, the SVM and the CONTEXT engine, are combined here to generate the final medication entries (see the merge sketch at the end of the deck).

Outline
• Methodology
  – Extraction Definition
  – Data Preparation
  – System Architecture
• Results & Discussion
  – NER (CRF) Experiment
  – Relationship Classification (SVM)
  – CONTEXT Engine Evaluation
  – Final Output Evaluation
• Conclusion

NER (CRF) Results
• Comparison of exact-match performance using the seven feature sets versus the bag-of-words feature set (baseline, in brackets).

NER (CRF) Results
• The F-scores for Duration and Reason with the seven feature sets are approximately 10% higher than the baseline, because:
  1. REASON and DURATION occur much less frequently than the other four entity types.
  2. For DURATION entities, the rule-based regular expressions match some non-medication terms (low precision), and some DURATION terms cannot be discovered by our rules (low recall).
  3. REASON extraction depends heavily on the Finding category in SNOMED CT and on the performance of TTSCT (Patrick et al., 2007).

NER (CRF) Results
• The F-scores for Mode, Dosage and Frequency improved by 2%.
• The remaining errors come from:
  1. Misspellings of drug names (e.g., misspelled forms of "nitroglycerin").
  2. Drug names used in other contexts, such as "coumadin" in the phrase "Coumadin Clinic".
  3. The drug-allergies detector cannot cover all situations.

Relation Classification
• Comparison of relation-classification performance in the SVM using all feature sets versus the subset of feature sets 1, 2 and 4 (baseline, in brackets).

CONTEXT Engine Evaluation
• The CONTEXT engine was adopted to discover the span of the medication list (the span between a medication heading and the next following heading).

Final Output Evaluation
• Due to errors in the NER, relationship classification and medication entry generation steps, the final F-scores for each entity type are lower than those from NER alone.
• The final score for medication events is 86.23%.

Thank you!
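As a closing illustration of the Medication Entry Generation step described above, the following is a minimal, self-contained sketch of how the three components' outputs might be merged into one entry per drug mention. The span and relation representations, the None defaults for unmentioned fields, and the helper name are assumptions, not the authors' code.

```python
from collections import defaultdict

# Map CRF entity types to the field codes from the Extraction Definition slide.
FIELD_BY_TYPE = {"DOSAGE": "do", "MODE": "mo", "FREQUENCY": "f",
                 "DURATION": "du", "REASON": "r"}

def build_entries(drugs, relations, list_spans):
    """drugs: [(start, end, text)] drug mentions from the CRF;
    relations: [((start, end), other_type, other_text)] pairs accepted by the
    SVM, keyed by the drug span; list_spans: spans from the CONTEXT engine."""
    rel_by_drug = defaultdict(list)
    for drug_span, other_type, other_text in relations:
        rel_by_drug[drug_span].append((other_type, other_text))
    entries = []
    for start, end, text in drugs:
        in_list = any(s <= start < e for s, e in list_spans)
        entry = {"m": text, "do": None, "mo": None, "f": None,
                 "du": None, "r": None,
                 "ln": "list" if in_list else "narrative"}
        for other_type, other_text in rel_by_drug.get((start, end), []):
            entry[FIELD_BY_TYPE.get(other_type, other_type)] = other_text
        entries.append(entry)
    return entries
```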