Algorithms Methods Results Conclusion

Comparison of abbreviation recognition algorithms
Michal Gawlik
2010 REU Program, MSCS Department, Marquette University
August 12, 2010

Introduction
  Abbreviations occur frequently in scientific journals
  Can lead to confusion
  Identifying abbreviations and definitions is important
  Purpose: compare algorithms for abbreviation finding
  Abbreviation, acronym = short form (SF)
  Definition = long form (LF)
  To be used on MEDLINE abstracts. MEDLINE is a collection of 19 million biomedical publications

Schwartz & Hearst (Schwartz, 2003)
  Alignment-based
  Abbreviations identified in parentheses
  Long form must appear in same sentence
  Long form must contain all characters of short form
  Characters must be in order
  First letter of SF and LF must be same
  Example (PMID: 12505779):
    A quantitative analysis indicates that as the number of neurons containing neurofibrillary tangles (NFTs)
    <NFTs, neurofibrillary tangles>
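The Schwartz & Hearst rules listed above can be sketched in a few lines of Python. This is only an illustrative sketch of those rules, not the authors' reference implementation: the function names `find_long_form` and `extract_pairs` are hypothetical, the input is assumed to be a single sentence, and the right-to-left scan is one simple way to enforce "all characters, in order, first letter at a word start".

```python
import re


def find_long_form(short_form, candidate):
    """Return the shortest trailing part of `candidate` that contains the
    characters of `short_form` in order, or None if there is no match."""
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():
            # Skip punctuation inside the short form (e.g. hyphens).
            s -= 1
            continue
        # Move left until the character matches. The first character of the
        # short form must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Start the long form at the beginning of the word where matching ended.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Find '(SF)' patterns and pair each with a long form found before it."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        if not any(c.isalpha() for c in short_form):
            continue  # assumption: a short form needs at least one letter
        candidate = sentence[: m.start()].rstrip()
        long_form = find_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


sentence = (
    "A quantitative analysis indicates that as the number of neurons "
    "containing neurofibrillary tangles (NFTs)"
)
print(extract_pairs(sentence))
# [('NFTs', 'neurofibrillary tangles')]
```

Scanning from the right keeps the long form as short as possible, which is why the sketch returns "neurofibrillary tangles" rather than a longer stretch of the sentence.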
ALICE (Ao, 2005)
  Template/rule-based
  Examines each abbreviation pair in 3 phases:
    1. Inner Search
    2. Outer Extraction
    3. Validity Judgment
  Uses handcrafted templates
    320 pattern combinations
  Example template: F/pp/S/T|F/S/pp/T
    F, S, T = first 3 characters of the abbreviation
    pp = prepositional phrase
    National Institute of Health Stroke Scale (NIHSS)

BIOADI (Kuo, 2009)
  Machine learning
  Learns about abbreviations from training data
    Morphological, numerical, and contextual features
  Uses a learning algorithm
  SF-LF pairs are given a probability of being correct
  Character pattern of SF (PMID: 11336653):
    <p85-PI3K, p85-regulatory subunit of the phosphoinositide-3-kinase>
    p85-PI3K → a1A1A

Corpus
  Corpus = collection of writings
  Used previously annotated MEDLINE abstracts
  Created a list of all abbreviations and definitions

  Corpus    Articles  SF-LF Pairs  Unique SFs
  BioText   1000      974          722
  A&T       986       1095         909
  Ab3P      1250      1220         997

Performance Measure
  Match: both SF and LF match the true pair
  Partial: SF matches, LF is incorrect
  Wrong: both SF and LF are incorrect

  Precision: $P = \frac{\#\text{matches}}{\#\text{found}}$
  Recall: $R = \frac{\#\text{matches}}{\#\text{total}}$
  F-score: $F = \frac{2PR}{P + R}$

Results

  Algorithm  P_total (%)  R_total (%)  F_total (%)
  S&H        95.7         83.9         89.3
  ALICE      96.9         85.2         90.6
  BIOADI     93.1         86.6         89.6

Conclusion
  Each algorithm uses a different approach, but all attain overall precision in the mid-90% range, recall around 85%, and an F-score near 90%
  ALICE performs best overall: highest precision (96.9%) and F-score (90.6%), with strong recall (85.2%)
  BIOADI has the highest recall (86.6%) but the lowest precision (93.1%)
  The size of the training data can affect the result
  Schwartz & Hearst is the fastest, due to its simplicity

References
  1. Schwartz, A. and Hearst, M. (2003) A simple algorithm for identifying abbreviation definitions in biomedical text. In Proceedings of PSB'03, Kauai, 8, 451-462.
  2. Ao, H. and Takagi, T. (2005) ALICE: an algorithm to extract abbreviations from MEDLINE. J. Am. Med. Inform. Assoc., 12, 576-586.
  3. Kuo, C., Ling, M., Lin, K., and Hsu, C. (2009) BIOADI: a machine learning approach to identify abbreviations and definitions in biological literature. BMC Bioinformatics, 10(Suppl 15), S7.
  4. Sohn, S., Comeau, D., Kim, W., and Wilbur, J.W. (2008) Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics, 9, 402.
  5. Manning, C. and Schütze, H. (2001) Foundations of Statistical Natural Language Processing. MIT Press: Cambridge, MA.
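
As a worked illustration of the Performance Measure definitions (match, partial, wrong; precision, recall, F-score), the following Python sketch scores a list of extracted pairs against a gold standard. The helper names `classify` and `precision_recall_f`, the assumption that each short form has exactly one true long form, and the example counts are all hypothetical; they are not taken from the evaluation reported in the slides.

```python
def classify(pair, gold):
    """Label one extracted (SF, LF) pair against a gold dict {SF: LF}."""
    short_form, long_form = pair
    if short_form not in gold:
        return "wrong"      # neither SF nor LF corresponds to a true pair
    if gold[short_form] == long_form:
        return "match"      # both SF and LF match the true pair
    return "partial"        # SF matches, LF is incorrect


def precision_recall_f(matches, found, total):
    """matches: correct pairs; found: pairs returned by the algorithm;
    total: pairs in the gold standard."""
    precision = matches / found if found else 0.0
    recall = matches / total if total else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical gold standard and algorithm output.
gold = {
    "NFTs": "neurofibrillary tangles",
    "SF": "short form",
    "LF": "long form",
}
extracted = [
    ("NFTs", "neurofibrillary tangles"),  # match
    ("SF", "form"),                       # partial
    ("XY", "something else"),             # wrong
]

labels = [classify(pair, gold) for pair in extracted]
matches = labels.count("match")
p, r, f = precision_recall_f(matches, found=len(extracted), total=len(gold))
print(labels)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Only full matches count toward precision and recall here, which follows the slide's formulas where both use #matches in the numerator.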