Robust Biomedical Entity Recognition Using Optimal Feature Set

Background
Our Participation in CALBC Challenge I
Ongoing and future work on BNER
Faisal Chowdhury 1, 2
[email protected]
Alberto Lavelli 2
[email protected]
1 Human Language Technologies Research Unit, Fondazione Bruno Kessler (FBK), Italy
2 ICT Doctoral School, University of Trento, Italy
Faisal Chowdhury and Alberto Lavelli
Robust Biomedical Entity Recognition
Outline
1 Background
2 Our Participation in CALBC Challenge I
3 Ongoing and future work on BNER
Trends
Genes....Genes....Genes
Extensive use of orthographic features
Limited use of contextual features
Reluctance to use syntactic dependencies
Exploitation of systems based on multiple classifiers
Issues regarding multiple classifiers
Complex and computationally resource-intensive approach
Difficulties in case of disagreements and overlaps
Not clear how the classifiers complement each other
=> Unreliable error analyses
Our approach
Beyond genes/proteins
More emphasis on contextual features
Extensive segmentation and normalization
Single classifier with more informative features
Based on conditional random fields (CRFs)
Default annotation scheme: BioCreAtivE II
e.g. 76 Comparison with alkaline phosphatases
76|14 33|alkaline phosphatases
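The example line above follows the BioCreAtivE II gene mention convention, in which the two offsets count characters with whitespace ignored and the last offset is inclusive. A minimal sketch of that calculation, assuming the mention occurs verbatim in the sentence (the function names `gm_offsets` and `gm_line` are illustrative, not part of the system described here):

```python
from typing import Tuple


def gm_offsets(sentence: str, mention: str) -> Tuple[int, int]:
    """Return (first, last) offsets of `mention`, counting only non-space characters."""
    start = sentence.find(mention)
    if start < 0:
        raise ValueError("mention not found in sentence")
    # Non-whitespace characters before the mention give the first offset.
    first = sum(1 for ch in sentence[:start] if not ch.isspace())
    # The last offset is inclusive, so subtract one after adding the mention length.
    length = sum(1 for ch in mention if not ch.isspace())
    return first, first + length - 1


def gm_line(sentence_id: str, sentence: str, mention: str) -> str:
    """Format one annotation as `id|first last|mention`."""
    first, last = gm_offsets(sentence, mention)
    return f"{sentence_id}|{first} {last}|{mention}"


print(gm_line("76", "Comparison with alkaline phosphatases", "alkaline phosphatases"))
# prints: 76|14 33|alkaline phosphatases
```

Ignoring whitespace makes the offsets independent of how the sentence was tokenized, which is convenient when the tokenization is changed during pre-processing.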
Our system in CALBC challenge I
Goal: to build a robust and portable BNER system
Target semantic groups: Genes/Proteins, Diseases, Species and Chemicals
Separate 1st order CRF model for each semantic group
Features and post-processing rules are selected based on
experiments on the BioCreAtivE II GM corpus
Features and post-processing rules selection
Extensive experiments on the BioCreAtivE II GM corpus
No post-processing rule specific to any particular semantic group
e.g. extension of the mention boundary to the left if a single
letter with a hyphen precedes X
Features that are general enough for a broad range of semantic
groups
e.g. PoS tags of the next two tokens
Has_Two_Digits X
F-measure of as much as 85% on BioCreAtivE II GM test data
– without using any external dictionary
– Version: January 31, 2010
Feature types
Orthographic features
e.g. Has_Greek
General linguistic features
e.g. 2–4 character suffixes
Contextual features
e.g. bi-grams of lemmatized token
=> CtxLemma_{k,k+1} for i ≤ k < i + 2
Others
=> is_Nucleoside, is_Nucleotide, is_Nucleic_Acid,
is_Amino_Acid
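The feature types listed above can be sketched for a single token position i as follows. This is an illustration under stated assumptions, not the system's actual feature extractor: the Greek-letter list, the feature-name strings and the `|` separator are hypothetical, and lemmas are assumed to be supplied by an external tagger.

```python
import re
from typing import Dict, List

# Assumption: Greek letters appear either as Unicode characters or spelled out.
GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "kappa", "lambda"}
GREEK_CHARS = re.compile(r"[\u0370-\u03FF]")


def has_greek(token: str) -> bool:
    """Orthographic feature: token contains a Greek letter or a spelled-out name."""
    parts = re.split(r"[^a-z]+", token.lower())
    return bool(GREEK_CHARS.search(token)) or any(p in GREEK_NAMES for p in parts)


def token_features(tokens: List[str], lemmas: List[str], i: int) -> Dict[str, object]:
    feats: Dict[str, object] = {"Has_Greek": has_greek(tokens[i])}
    # General linguistic features: 2-4 character suffixes.
    for n in (2, 3, 4):
        if len(tokens[i]) >= n:
            feats[f"Suffix{n}"] = tokens[i][-n:]
    # Contextual features: lemma bi-grams (k, k+1) for i <= k < i + 2,
    # named here by their position relative to i.
    for k in range(i, i + 2):
        if k + 1 < len(lemmas):
            feats[f"CtxLemma_{k - i},{k - i + 1}"] = lemmas[k] + "|" + lemmas[k + 1]
    return feats


tokens = ["Comparison", "with", "alkaline", "phosphatases"]
lemmas = ["comparison", "with", "alkaline", "phosphatase"]
print(token_features(tokens, lemmas, 1))
```

Bi-grams that would run past the end of the sentence are simply skipped in this sketch; padding with a boundary symbol would be an equally reasonable choice.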
Pre-processing
Training data preparation
=> Mapping from the CALBC data format to the system’s default format
=> Analysis of the annotations in 50 randomly selected sentences
=> Fixing some specific error types
Initial tokenization by GeniaTagger
Adjustment of inconsistencies in tokenized data
e.g. two single inverted commas instead of a double inverted
comma
Further tokenization
e.g. Denys-Drash => Denys - Drash
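The quote adjustment and the further tokenization step can be sketched as below. This is a simplified stand-in that starts from raw text rather than GeniaTagger output; treating a pair of backticks as an opening double quote is an assumption, and the ASCII-only character class means non-ASCII letters are split off as separate symbols.

```python
import re
from typing import List

# A token is either a run of ASCII letters/digits or one other non-space character.
TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def normalize_quotes(text: str) -> str:
    # Two single inverted commas become one double inverted comma.
    # Assumption: a pair of backticks is handled the same way.
    return text.replace("''", '"').replace("``", '"')


def tokenize(text: str) -> List[str]:
    """Split on whitespace and on every non-alphanumeric character."""
    return TOKEN.findall(normalize_quotes(text))


print(tokenize("Denys-Drash ''syndrome''"))
# prints: ['Denys', '-', 'Drash', '"', 'syndrome', '"']
```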
Types of CALBC data errors that we correct/discard
Digits tagged as entities
Incomplete sentences, e.g. <s id="365735">NAP12.9 .</s>
Overlapping annotations of the same semantic group
Hypothesis: a certain word in the same context can
refer to (or be part of) only one concept of a certain
semantic group
Overlapping annotations of the same semantic group
473442 .... comparison with C57BL/6 and AKR mice, CBA
mice displayed ....
SPE annotations: AKR mice
mice, CBA
CBA
424059 .... pathophysiology of penicillamine-induced
myasthenia gravis were studied ....
DISO annotations: penicillamine-induced myastheni
myasthenia gravis
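Overlaps like the ones above can be detected mechanically once each annotation is available as a character span. The sketch below only finds the conflicting pairs within one semantic group; how a conflict is then corrected or discarded is not specified here, so no resolution policy is shown. The span representation and function names are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, semantic group)


def overlapping_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return all pairs of spans that share a group and at least one character."""
    pairs = []
    for a, b in combinations(sorted(spans), 2):
        same_group = a[2] == b[2]
        intersects = a[0] < b[1] and b[0] < a[1]
        if same_group and intersects:
            pairs.append((a, b))
    return pairs


text = "comparison with C57BL/6 and AKR mice, CBA mice displayed"


def span_of(mention: str, group: str) -> Span:
    # Uses the first occurrence of the mention in the example text.
    start = text.find(mention)
    return (start, start + len(mention), group)


spans = [span_of("AKR mice", "SPE"), span_of("mice, CBA", "SPE"), span_of("CBA", "SPE")]
for a, b in overlapping_pairs(spans):
    print(text[a[0]:a[1]], "<->", text[b[0]:b[1]])
```

For the sentence above this reports two conflicts: "AKR mice" with "mice, CBA", and "mice, CBA" with "CBA".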
Other issues of CALBC corpus
Inconsistent tokenization: our system performs its own
tokenization (using whitespace and any non-alphanumeric
characters) regardless of whether the data is already tokenized.
Overlapping entities for different semantic
groups: we train separate models for separate semantic
groups.
Model training
Labelling format: IOB2
1st order CRF
Using Mallet
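In IOB2 labelling, the first token of every mention is tagged B, the following tokens of the same mention are tagged I, and all other tokens are tagged O. A minimal sketch of the conversion from character spans to labels, assuming whitespace tokens and assuming the semantic group name is appended to the tag (with one model per group, plain B/I/O would work as well):

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end exclusive) character offsets


def whitespace_spans(sentence: str) -> List[Span]:
    """Character spans of whitespace-separated tokens (a simplification)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]


def to_iob2(token_spans: List[Span], mention_spans: List[Span], group: str) -> List[str]:
    labels = ["O"] * len(token_spans)
    for m_start, m_end in mention_spans:
        inside = False
        for idx, (t_start, t_end) in enumerate(token_spans):
            # A token belongs to the mention if it lies entirely within it.
            if t_start >= m_start and t_end <= m_end:
                labels[idx] = ("I-" if inside else "B-") + group
                inside = True
    return labels


sentence = "myasthenia gravis were studied"
start = sentence.find("myasthenia gravis")
labels = to_iob2(whitespace_spans(sentence), [(start, start + len("myasthenia gravis"))], "DISO")
print(labels)
# prints: ['B-DISO', 'I-DISO', 'O', 'O']
```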
Post-processing
Bracket mismatch correction
e.g. P-450(11 beta => P-450(11 beta)
OR
P-450
One sense per discourse
=> same label(s) for all instances of same token(s)
Short/long form detection and annotation
(using a modified version of the Schwartz and Hearst (2003) algorithm)
e.g. procollagen (PC III)
Ungrammatical conjunction structure correction
e.g. neurodegeneration, cancer
=> neurodegeneration
=> cancer
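The bracket mismatch rule above offers two repairs: extend the mention to take in the missing closing bracket, or cut it back before the opening bracket. A minimal sketch of one way to choose between them, assuming the mention is extended only when the closing bracket immediately follows it (that criterion, and the function name, are assumptions; only a missing closing round bracket is handled):

```python
from typing import Tuple


def fix_bracket_mismatch(text: str, start: int, end: int) -> Tuple[int, int]:
    """Return repaired (start, end exclusive) offsets for a mention in `text`."""
    mention = text[start:end]
    if mention.count("(") <= mention.count(")"):
        return start, end  # no missing closing bracket, nothing to repair
    # Option 1: the closing bracket immediately follows the mention, so extend.
    if end < len(text) and text[end] == ")":
        return start, end + 1
    # Option 2: otherwise cut the mention just before its last opening bracket
    # (a simplification when several brackets are nested).
    cut = mention.rfind("(")
    return start, start + len(mention[:cut].rstrip())


closed = "cytochrome P-450(11 beta) activity"
s = closed.find("P-450")
a, b = fix_bracket_mismatch(closed, s, s + len("P-450(11 beta"))
print(closed[a:b])  # prints: P-450(11 beta)

unclosed = "cytochrome P-450(11 beta activity"
a, b = fix_bracket_mismatch(unclosed, s, s + len("P-450(11 beta"))
print(unclosed[a:b])  # prints: P-450
```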
Results - training data
             Species   Diseases   Chemicals   Genes/Proteins
Precision    90.9%     85.9%      85.2%       80.9%
Recall       90.6%     85.0%      78.9%       74.5%
F-measure    90.7%     85.4%      81.9%       77.6%
Two fold cross validation on CALBC training data
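The F-measure values in the training-data results can be related to precision and recall with a short calculation. The sketch below assumes that the reported F-measure is the balanced F1 (the harmonic mean of precision and recall); the slides do not state this explicitly. The function name `f_measure` is chosen for illustration only.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as reported for the training data.
training_results = {
    "Species": (90.9, 90.6),
    "Diseases": (85.9, 85.0),
    "Chemicals": (85.2, 78.9),
    "Genes/Proteins": (80.9, 74.5),
}

for group, (p, r) in training_results.items():
    print(f"{group:15s} F = {f_measure(p, r):.1f}%")
```

Under that assumption the recomputed values agree with the reported training-data F-measures to one decimal place (90.7, 85.4, 81.9, 77.6). Because the inputs are already rounded, such agreement is not guaranteed for every table.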
Results - test data
             Species          Diseases         Chemicals        Genes/Proteins
             exact   cos98    exact   cos98    exact   cos98    exact   cos98
Precision    91.2%   92.4%    86.0%   86.5%    82.0%   82.9%    80.0%   81.7%
Recall       91.9%   93.2%    87.8%   88.3%    81.4%   82.3%    79.1%   80.8%
F-measure    91.6%   92.8%    86.9%   87.4%    81.7%   82.6%    79.6%   81.2%
Official evaluation results for the CALBC SSC I test data
Ongoing and Future Work on BNER
Disease Mention Recognition with Specific Features
(To appear in ACL 2010 BioNLP workshop)
Our system on AZDC corpus: 81.08%
BANNER: 77.9%, JNET: 77.2%
(Leaman et al. 2009)
Treatment, drug, symptom identification
Thank you.