Comparison of abbreviation recognition algorithms

Algorithms
Methods
Results
Conclusion
Michal Gawlik
2010 REU Program
MSCS Department
Marquette University
August 12, 2010
Introduction
Abbreviations occur frequently in scientific journals
Can lead to confusion
Identifying abbreviations and definitions is important
Purpose: compare algorithms for abbreviation finding
Abbreviation or acronym = short form (SF)
Definition = long form (LF)
Algorithms are to be used on MEDLINE abstracts; MEDLINE is a
collection of 19 million biomedical publications
Schwartz & Hearst
(Schwartz, 2003)
Alignment-based
Abbreviations identified in parentheses
Long form must appear in same sentence
Long form must contain all characters of short form
Characters must appear in the same order
First character of SF and LF must be the same
Example (PMID: 12505779)
A quantitative analysis indicates that as the number
of neurons containing neurofibrillary tangles (NFTs)
<NFTs, neurofibrillary tangles>
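
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `best_long_form` and `extract_pairs` are hypothetical, the whole text before the parenthesis is used as the candidate (an assumption; no word-window limit is applied), and sentence splitting is left to the caller.

```python
import re


def best_long_form(short_form, candidate):
    """Align short_form against candidate from right to left.

    Returns the matching tail of candidate, or None if no alignment exists.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():  # skip punctuation inside the short form
            s -= 1
            continue
        # Move left until the character matches; the first SF character
        # must additionally sit at the start of a word in the candidate.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Start the long form at the beginning of the word that was matched last.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Find parenthesized short forms and align each with the preceding text."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        candidate = sentence[: m.start()].rstrip()
        long_form = best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


sentence = (
    "A quantitative analysis indicates that as the number "
    "of neurons containing neurofibrillary tangles (NFTs)"
)
print(extract_pairs(sentence))  # [('NFTs', 'neurofibrillary tangles')]
```

Scanning from the right keeps the long form as short as possible, which is why the earlier words of the sentence are not pulled into the definition.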
ALICE
(Ao, 2005)
Template/Rule-based
Examines abbreviation pair in 3 phases
1. Inner Search
2. Outer Extraction
3. Validity Judgment
Uses handcrafted templates
320 pattern combinations
Example template: F/pp/S/T | F/S/pp/T
F, S, T = first 3 characters of abbreviation
pp = prepositional phrase
National Institute of Health Stroke Scale (NIHSS)
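
To make the template notation concrete, the sketch below labels the words of a long form with F, S, T, or pp and checks the result against the two alternatives of the example template. It is only an illustration under stated assumptions: `template_pattern` and the `PREPOSITIONS` set are hypothetical, a prepositional phrase is simplified to a single preposition word, and none of ALICE's actual templates or phases are reproduced.

```python
# Assumed, illustrative list of prepositions; not taken from ALICE.
PREPOSITIONS = {"of", "in", "for", "on", "to", "with", "by"}


def template_pattern(short_form, long_form):
    """Label long-form words with F/S/T (first 3 SF characters) or pp.

    Returns a pattern such as "F/S/pp/T", or None if the words do not
    line up with the first three characters of the short form.
    """
    labels = ["F", "S", "T"]
    targets = [c.lower() for c in short_form[:3]]
    pattern = []
    i = 0
    for word in long_form.split():
        if i == len(targets):  # all three characters accounted for
            break
        w = word.lower()
        if w in PREPOSITIONS:
            pattern.append("pp")
        elif w[0] == targets[i]:
            pattern.append(labels[i])
            i += 1
        else:
            return None
    return "/".join(pattern) if i == len(targets) else None


pattern = template_pattern("NIHSS", "National Institute of Health Stroke Scale")
print(pattern)                             # F/S/pp/T
print(pattern in {"F/pp/S/T", "F/S/pp/T"})  # True
```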
BIOADI
(Kuo, 2009)
Machine Learning
Learns about abbreviations from training data
Features: morphological, numerical, and contextual
Uses learning algorithm
SF-LF pairs are given a probability of being correct
Character Pattern of SF (PMID: 11336653)
<p85-PI3K, p85-regulatory subunit of the
phosphoinositide-3-kinase>
p85-PI3K -> a1A1A
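
One way to arrive at a pattern like the one above is sketched here. The mapping is inferred from the single example on this slide and is an assumption, not BIOADI's documented feature definition: lowercase letters become `a`, uppercase letters `A`, digits `1`, other characters are dropped, and runs of the same class are collapsed. The name `char_pattern` is hypothetical.

```python
from itertools import groupby


def char_class(ch):
    """Map one character to its class symbol ('' means: drop it)."""
    if ch.isdigit():
        return "1"
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    return ""  # punctuation such as '-' is dropped (assumption)


def char_pattern(short_form):
    """Collapse runs of identical character classes into one symbol."""
    classes = [c for c in map(char_class, short_form) if c]
    return "".join(symbol for symbol, _ in groupby(classes))


print(char_pattern("p85-PI3K"))  # a1A1A
```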
Corpus
Corpus = collection of writings
Used previously annotated MEDLINE abstracts
Created list of all abbreviations and definitions
Corpus    Articles  SF-LF Pairs  Unique SFs
BioText   1000      974          722
A&T       986       1095         909
Ab3P      1250      1220         997
Performance Measure
Match: both SF and LF match the true pair
Partial: SF matches, LF is incorrect
Wrong: both SF and LF are incorrect
Precision (P) = $\frac{\#\text{matches}}{\#\text{found}}$
Recall (R) = $\frac{\#\text{matches}}{\#\text{total}}$
F-score (F) = $\frac{2PR}{P + R}$
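
As a small worked example, the outcome categories and the three measures can be computed as follows. The function names and the gold and found pairs are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from the corpora used in this study.

```python
def classify(pair, gold):
    """Classify one extracted (SF, LF) pair against the annotated pairs."""
    gold_sfs = {sf for sf, _ in gold}
    if pair in gold:
        return "match"    # both SF and LF agree with the true pair
    if pair[0] in gold_sfs:
        return "partial"  # SF is annotated, but the LF differs
    return "wrong"        # SF is not an annotated abbreviation


def precision_recall_f(found, gold):
    """Return (P, R, F) with P = matches/found and R = matches/total."""
    found, gold = set(found), set(gold)
    matches = sum(1 for pair in found if classify(pair, gold) == "match")
    p = matches / len(found) if found else 0.0
    r = matches / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical annotations and algorithm output.
gold = {
    ("NFTs", "neurofibrillary tangles"),
    ("SF", "short form"),
    ("LF", "long form"),
    ("PP", "prepositional phrase"),
}
found = [("NFTs", "neurofibrillary tangles"), ("SF", "form")]

p, r, f = precision_recall_f(found, gold)
print(p, r, round(f, 3))  # 0.5 0.25 0.333
```

Only full matches count toward precision and recall here, so a partial result lowers precision just as a wrong one does.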
Results
Results, cont.
         P_total (%)  R_total (%)  F_total (%)
S&H      95.7         83.9         89.3
ALICE    96.9         85.2         90.6
BIOADI   93.1         86.6         89.6
Conclusion
Each algorithm uses a different approach, but all
attain overall precision in the mid-90% range, recall around
85%, and F-score near 90%
ALICE is best overall: highest precision and F-score, strong recall
BIOADI has the highest recall but the lowest precision
Size of training data can affect results
Schwartz & Hearst is fastest (due to simplicity)
References
1. Schwartz, A. and Hearst, M. (2003) A simple algorithm for identifying abbreviation
   definitions in biomedical text. In Proceedings of PSB’03. Kauai, 8, 451-462.
2. Ao, H. and Takagi, T. (2005) ALICE: an algorithm to extract abbreviations from
   MEDLINE. J. Am. Med. Inform. Assoc., 12, 576-586.
3. Kuo, C., Ling, M., Lin, K., and Hsu, C. (2009) BIOADI: a machine learning
   approach to identify abbreviations and definitions in biological literature.
   BMC Bioinformatics, 10(Suppl 15):S7.
4. Sohn, S., Comeau, D., Kim, W., and Wilbur, JW. (2008) Abbreviation definition
   identification based on automatic precision estimates. BMC Bioinformatics, 9:402.
5. Manning, C. and Schütze, H. (2001) Foundations of Statistical Natural Language
   Processing. MIT Press: Cambridge, MA.