Background | Our Participation in CALBC Challenge I | Ongoing and future work on BNER

Robust Biomedical Entity Recognition Using Optimal Feature Set

Faisal Chowdhury 1, 2    [email protected]
Alberto Lavelli 2        [email protected]
1 Human Language Technologies Research Unit, Fondazione Bruno Kessler (FBK), Italy
2 ICT Doctoral School, University of Trento, Italy

Faisal Chowdhury and Alberto Lavelli: Robust Biomedical Entity Recognition

Outline
1 Background
2 Our Participation in CALBC Challenge I
3 Ongoing and future work on BNER

Trends
- Genes....Genes....Genes
- Extensive use of orthographic features
- Limited use of contextual features
- Reluctance to use syntactic dependencies
- Exploitation of systems based on multiple classifiers

Issues regarding multiple classifiers
- Complex and computationally intensive approach
- Difficulties in case of disagreements and overlaps
- Not clear how the classifiers complement each other
  => Unreliable error analyses

Our approach
- Beyond genes/proteins
- More emphasis on contextual features
- Extensive segmentation and normalization
- Single classifier with more informative features
- Based on conditional random fields (CRFs)
- Default annotation scheme: BioCreAtivE II
  e.g. sentence:   76 Comparison with alkaline phosphatases
       annotation: 76|14 33|alkaline phosphatases
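
The offsets in the annotation example above (14 and 33) are consistent with counting only non-whitespace characters, with an inclusive end offset. The following is a minimal sketch of that convention; the function names `bc2_offsets` and `bc2_annotation` are hypothetical and not taken from the authors' system.

```python
def bc2_offsets(sentence: str, mention: str) -> tuple[int, int]:
    """Return (start, end) of the first occurrence of `mention`,
    counting only non-whitespace characters; `end` is inclusive."""
    char_start = sentence.find(mention)
    if char_start < 0:
        raise ValueError("mention not found in sentence")
    # Whitespace before the mention does not count towards the start offset.
    start = sum(1 for ch in sentence[:char_start] if not ch.isspace())
    # Whitespace inside the mention does not count towards its length either.
    length = sum(1 for ch in mention if not ch.isspace())
    return start, start + length - 1


def bc2_annotation(sentence_id: str, sentence: str, mention: str) -> str:
    """Format one annotation line as `id|start end|mention`."""
    start, end = bc2_offsets(sentence, mention)
    return f"{sentence_id}|{start} {end}|{mention}"


print(bc2_annotation("76", "Comparison with alkaline phosphatases",
                     "alkaline phosphatases"))
# prints: 76|14 33|alkaline phosphatases
```

Because whitespace is ignored, the offsets stay valid however the sentence is later tokenized.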

Our system in CALBC challenge I
- Goal: to build a robust and portable BNER system
- Target semantic groups: Genes/Proteins, Diseases, Species and Chemicals
- Separate 1st order CRF model for each semantic group
- Features and post-processing rules are selected based on experiments on the BioCreative II GM corpus

Features and post-processing rules selection
- Extensive experiments on the BioCreAtivE II GM corpus
- No post-processing rule specific to any particular semantic group
  e.g. extension of the mention boundary to the left if a single letter with a hyphen precedes it ✓
- Features that are general enough for a broad range of semantic groups
  e.g. PoS tags of the next two tokens, Has_Two_Digits ✓
- F-measure of up to 85% on BioCreAtivE II GM test data
  (without using any external dictionary; version: January 31, 2010)

Feature types
- Orthographic features, e.g. Has_Greek
- General linguistic features, e.g. 2–4 character suffixes
- Contextual features, e.g. bi-grams of lemmatized tokens
  => CtxLemma_{k,k+1} for i ≤ k < i + 2
- Others
  => is_Nucleoside, is_Nucleotide, is_Nucleic_Acid, is_Amino_Acid
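
A minimal sketch of how a few of the feature types listed above could be computed for the token at position i. The slides name the features but do not define them, so every definition here is an assumption: `Has_Greek` is taken to mean a Greek Unicode letter or a spelled-out Greek letter name, `Has_Two_Digits` to mean at least two digit characters, and the feature keys and helper names are hypothetical.

```python
# Illustrative subset of spelled-out Greek letter names (an assumption).
GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "kappa"}


def has_greek(token: str) -> bool:
    # Assumption: a Greek Unicode letter or a spelled-out name counts.
    if any("\u0370" <= ch <= "\u03ff" for ch in token):
        return True
    return token.lower() in GREEK_NAMES


def has_two_digits(token: str) -> bool:
    # Assumption: "two digits" means at least two digit characters.
    return sum(ch.isdigit() for ch in token) >= 2


def suffixes(token: str) -> dict[str, str]:
    # 2-4 character suffixes, kept only when shorter than the token itself.
    return {f"suffix{n}": token[-n:] for n in (2, 3, 4) if len(token) > n}


def ctx_lemma_bigrams(lemmas: list[str], i: int) -> dict[str, str]:
    # CtxLemma_{k,k+1} for i <= k < i + 2; keys use offsets relative to i.
    feats = {}
    for k in range(i, i + 2):
        if k + 1 < len(lemmas):  # skip bigrams running past the sentence end
            feats[f"CtxLemma_{k - i},{k - i + 1}"] = f"{lemmas[k]}_{lemmas[k + 1]}"
    return feats


def token_features(tokens: list[str], lemmas: list[str], i: int) -> dict:
    feats = {
        "Has_Greek": has_greek(tokens[i]),
        "Has_Two_Digits": has_two_digits(tokens[i]),
    }
    feats.update(suffixes(tokens[i]))
    feats.update(ctx_lemma_bigrams(lemmas, i))
    return feats


tokens = ["IL-2", "receptor", "alpha", "chains"]
lemmas = ["il-2", "receptor", "alpha", "chain"]
print(token_features(tokens, lemmas, 2))
```

None of these features depends on a semantic group, which matches the stated aim of keeping the feature set general.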

Pre-processing
- Training data preparation
  => Mapping from the CALBC data format to the system’s default format
  => Analyses of annotations of 50 randomly selected sentences
  => Fixing some specific error types
- Initial tokenization by GeniaTagger
- Adjustment of inconsistencies in tokenized data
  e.g. two single inverted commas instead of a double inverted comma
- Further tokenization
  e.g. Denys-Drash => Denys - Drash
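
The quote adjustment and the further tokenization step can be sketched as follows. This is only an illustration, not the authors' code: the regular expression splits on whitespace and isolates every non-alphanumeric character, and the direction of the quote adjustment (two single inverted commas mapped back to one double inverted comma) is an assumption. The names `normalize_quotes` and `tokenize` are hypothetical.

```python
import re

# A run of alphanumeric characters, or any single non-alphanumeric,
# non-whitespace character (the underscore is handled separately because
# the regex class \w includes it).
TOKEN_RE = re.compile(r"[^\W_]+|[^\w\s]|_")


def normalize_quotes(text: str) -> str:
    # Assumption: a pair of single inverted commas stands for one double one.
    return text.replace("''", '"')


def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(normalize_quotes(text))


print(" ".join(tokenize("Denys-Drash")))  # prints: Denys - Drash
```

Applying the same function to text that is already tokenized changes nothing beyond splitting any remaining punctuation, so the output granularity is the same either way.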

Types of CALBC data error that we correct/discard
- Digits tagged as entities
- Incomplete sentences, e.g. <s id="365735">NAP12.9 .</s>
- Overlapping annotations of the same semantic group
  Hypothesis: a given word in a given context can refer to (or be part of) only one concept of a given semantic group
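
Two of the error types above can be detected mechanically. The sketch below flags annotations that consist only of digits and finds pairs of same-group annotations whose character spans overlap. The `Annotation` class and function names are hypothetical, and since the slides do not say how an overlap is resolved, the sketch only detects it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # character offset of the first character
    end: int    # character offset just past the last character
    text: str
    group: str  # semantic group label, e.g. "SPE" or "DISO"


def is_digits_only(ann: Annotation) -> bool:
    stripped = "".join(ch for ch in ann.text if not ch.isspace())
    return stripped.isdigit()


def overlapping_pairs(anns: list[Annotation]) -> list[tuple[Annotation, Annotation]]:
    """Pairs of annotations of the same group that share at least one character."""
    pairs = []
    ordered = sorted(anns, key=lambda a: (a.start, a.end))
    for idx, first in enumerate(ordered):
        for second in ordered[idx + 1:]:
            if second.start >= first.end:
                break  # sorted by start, so no later span can overlap `first`
            if first.group == second.group:
                pairs.append((first, second))
    return pairs


def annotate(sentence: str, text: str, group: str) -> Annotation:
    # Helper for the example: locate the first occurrence of `text`.
    start = sentence.index(text)
    return Annotation(start, start + len(text), text, group)


sentence = "comparison with C57BL/6 and AKR mice, CBA mice displayed"
anns = [annotate(sentence, t, "SPE") for t in ("AKR mice", "mice, CBA", "CBA")]
for a, b in overlapping_pairs(anns):
    print(repr(a.text), "overlaps", repr(b.text))
```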

Overlapping annotations of the same semantic group
- 473442 .... comparison with C57BL/6 and AKR mice, CBA mice displayed ....
  SPE annotations: "AKR mice"; "mice, CBA"; "CBA"
- 424059 .... pathophysiology of penicillamine-induced myasthenia gravis were studied ....
  DISO annotations: "penicillamine-induced myastheni"; "myasthenia gravis"

Other issues of CALBC corpus
- Inconsistent tokenization: our system always tokenizes (on whitespace and any non-alphanumeric character), regardless of whether the data is already tokenized.
- Overlapping entities for different semantic groups: we train separate models for separate semantic groups.

Model training
- Labelling format: IOB2
- 1st order CRF
- Using Mallet
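
For readers unfamiliar with IOB2: the first token of every mention is labelled B-<group>, the remaining tokens of the mention I-<group>, and all other tokens O. The sketch below produces such labels for one semantic group from character spans. It is an illustration with hypothetical names, not the authors' Mallet pipeline, and it assumes the mentions of that group no longer overlap.

```python
import re

# Same tokenization rule as described for the system: split on whitespace
# and isolate every non-alphanumeric character.
TOKEN_RE = re.compile(r"[^\W_]+|[^\w\s]|_")


def tokens_with_offsets(text: str) -> list[tuple[int, int, str]]:
    return [(m.start(), m.end(), m.group()) for m in TOKEN_RE.finditer(text)]


def iob2_labels(tokens, mentions, group: str) -> list[str]:
    """tokens: (start, end, text) triples; mentions: non-overlapping
    (start, end) character spans of one semantic group."""
    labels = ["O"] * len(tokens)
    for m_start, m_end in mentions:
        inside = False
        for idx, (t_start, t_end, _) in enumerate(tokens):
            if t_start >= m_start and t_end <= m_end:
                labels[idx] = ("I-" if inside else "B-") + group
                inside = True
    return labels


sentence = "pathophysiology of penicillamine-induced myasthenia gravis were studied"
start = sentence.index("myasthenia gravis")
tokens = tokens_with_offsets(sentence)
labels = iob2_labels(tokens, [(start, start + len("myasthenia gravis"))], "DISO")
for (_, _, tok), lab in zip(tokens, labels):
    print(tok, lab)
```

Training one model per semantic group means each label sequence contains a single group, so entities of different groups that overlap never compete for the same label.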
Post-processing
Bracket mismatch correction, e.g. P-450(11 beta => P-450(11 beta) OR P-450
One sense per discourse => same label(s) for all instances of the same token(s)
Short/long form detection and annotation (using a modified version of the Schwartz et al. (2003) algorithm), e.g. procollagen (PC III)
Ungrammatical conjunction structure correction, e.g. neurodegeneration, cancer => neurodegeneration => cancer
Results - training data

             Species   Diseases   Chemicals   Genes/Proteins
Precision    90.9%     85.9%      85.2%       80.9%
Recall       90.6%     85.0%      78.9%       74.5%
F-measure    90.7%     85.4%      81.9%       77.6%

Two-fold cross-validation on CALBC training data

Results - test data

             Species          Diseases         Chemicals        Genes/Proteins
             exact   cos98    exact   cos98    exact   cos98    exact   cos98
Precision    91.2%   92.4%    86.0%   86.5%    82.0%   82.9%    80.0%   81.7%
Recall       91.9%   93.2%    87.8%   88.3%    81.4%   82.3%    79.1%   80.8%
F-measure    91.6%   92.8%    86.9%   87.4%    81.7%   82.6%    79.6%   81.2%

Official evaluation results for the CALBC SSC I test data
Ongoing and Future Work on BNER
Disease Mention Recognition with Specific Features (to appear in ACL 2010 BioNLP workshop)
Our system on AZDC corpus: 81.08%
BANNER: 77.9%, JNET: 77.2% (Leaman et al. 2009)
Treatment, drug, symptom identification

Thank you.
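Supplementary sketch: tokenization and IOB2 labelling

The slides state that the system tokenizes on whitespace and on any non-alphanumeric character, and that training labels use the IOB2 format. The sketch below illustrates those two steps under stated assumptions; it is not the system's actual code. The function names `tokenize` and `to_iob2`, the decision to keep punctuation as separate tokens, the ASCII-only notion of "alphanumeric", and the example sentence and gene span are all hypothetical.

```python
import re

# Assumption: a run of ASCII letters/digits is one token, every other
# non-whitespace character is its own token, and whitespace is discarded.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    """Return (token, start, end) triples with character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def to_iob2(tokens, spans, label):
    """Tag tokens with IOB2 labels for a single semantic group.

    `spans` holds (start, end) character offsets of entity mentions.
    The first token of a mention gets B-<label>, the other tokens of the
    mention get I-<label>, and every remaining token gets O.
    """
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for span_start, span_end in spans:
            if span_start <= tok_start and tok_end <= span_end:
                tag = ("B-" if tok_start == span_start else "I-") + label
                break
        tags.append(tag)
    return tags


text = "Mutations in BRCA-1 cause cancer."
tokens = tokenize(text)

# Hypothetical annotation: "BRCA-1" is marked as a gene mention.
start = text.index("BRCA-1")
gene_spans = [(start, start + len("BRCA-1"))]

tags = to_iob2(tokens, gene_spans, "GENE")
for (tok, _, _), tag in zip(tokens, tags):
    print(tok, tag)
```

Because the slides describe training separate models for separate semantic groups, a sketch like this would be run once per group, each time with only that group's spans, so overlapping entities from different groups never compete for the same tag.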
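Supplementary sketch: two post-processing rules

Two of the post-processing steps listed on the slides, bracket mismatch correction and one sense per discourse, can be illustrated with a few lines of code. This is a minimal sketch under assumptions, not the authors' implementation. The slides show two possible repairs for an unmatched bracket (extend to `P-450(11 beta)` or cut back to `P-450`) but do not say how the system chooses between them; the rule used here (extend when a `)` immediately follows the mention, otherwise truncate) is an assumption. The function names and the example inputs are hypothetical, and only round brackets are handled.

```python
def fix_bracket_mismatch(text, start, end):
    """Repair a mention text[start:end] that contains an unmatched '('.

    Returns the corrected (start, end) character offsets.
    """
    mention = text[start:end]
    open_positions = []  # positions of '(' that are still unmatched
    for i, ch in enumerate(mention):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")" and open_positions:
            open_positions.pop()
    if not open_positions:
        return start, end  # brackets already balanced
    # Assumed rule 1: absorb a ')' that directly follows the mention.
    if len(open_positions) == 1 and text[end:end + 1] == ")":
        return start, end + 1
    # Assumed rule 2: otherwise cut the mention before the first unmatched '('.
    kept = mention[:open_positions[0]].rstrip()
    return start, start + len(kept)


def one_sense_per_discourse(tokens, mentions):
    """Copy each predicted mention to every other occurrence in the document.

    `mentions` is a set of (start, end) token-index spans, end exclusive,
    all belonging to one semantic group. Longer mentions are handled first
    so that they are not blocked by shorter ones.
    """
    result = set(mentions)
    covered = {i for s, e in mentions for i in range(s, e)}
    for s, e in sorted(mentions, key=lambda m: m[0] - m[1]):
        phrase, n = tokens[s:e], e - s
        for i in range(len(tokens) - n + 1):
            window = set(range(i, i + n))
            if tokens[i:i + n] == phrase and not covered & window:
                result.add((i, i + n))
                covered |= window
    return result


text = "P-450(11 beta) hydroxylase"
s, e = fix_bracket_mismatch(text, 0, 13)   # mention was "P-450(11 beta"
print(text[s:e])                            # P-450(11 beta)

doc = ["IL", "-", "2", "binds", "the", "IL", "-", "2", "receptor"]
print(sorted(one_sense_per_discourse(doc, {(0, 3)})))  # [(0, 3), (5, 8)]
```

Matching on exact token sequences is the simplest reading of "same label(s) for all instances of the same token(s)"; a real system might also normalise case or restrict propagation to a single abstract.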