Annotating Chemical Names
Peter Corbett
University of Cambridge

Named Entity Recognition
• Go through text and highlight names of:
  – People
  – Places
  – Organisations
  – Genes
  – Proteins
  – Chemicals
  – …

Some Chemistry
In addition, we have found in previous studies that the Zn2+–Tris system is also capable of efficiently hydrolyzing other β-lactams, such as clavulanic acid, which is a typical mechanism-based inhibitor of active-site serine β-lactamases (clavulanic acid is also a fairly good substrate of the zinc-β-lactamase from B. fragilis).

Human Disagreement
(The same passage as under “Some Chemistry”.)
• “That’s Zn2+ and Tris”
• “That’s Zn2+-Tris”

Named entity guidelines
• Extensive guidelines (~30 pages, 100 rules) => consistent annotation => high quality test + training data
• Five classes of named entity
• Excludes:
  – Large biomolecules
  – Application/biological role terms

Annotation of Chemical Named Entities
Peter Corbett, Colin Batchelor and Simone Teufel, Proceedings of BioNLP 2007

Example
(The same passage as under “Some Chemistry”.)

An Extract From The Guidelines

Inter-annotator agreement
• 14 Chemistry Papers
• Analytical, Synthetic, Organic, Inorganic, Physical, Biological, Materials, Environmental etc. Chemistry
• ~40 000 tokens, 3000 named entities
• 3 Annotators, all with chemistry degrees
  – A = main guidelines developer
  – B = assisted with guidelines
  – C = (almost) no work on guidelines, a little training
• Reference to extensive guidelines
• Reference to PubChem, Wikipedia, Google etc.
• No reference to previous annotations
• No conferring

Disagreement
• Whether to annotate
  – Lipid (agreement), Fat (disagreement)
• How much to annotate
  – Lipid A, Lipid A
  – 8:0 2-OH FA, 8:0 2-OH FA
• One entity or two?
  – Zn2+-Tris, Zn2+-Tris

Results
• “average” = macro-average over documents
• “corpus” = micro-average over whole corpus

A Problem
• CM does not distinguish between
  – Specific chemical compounds
  – Classes of chemical compounds
  – Parts of chemical compounds
• Early versions of guidelines attempted to deal with this, using simple name-internal cues (e.g. plural => class)
• Problem: Polysemy

Pyridine
“The green residue was dissolved in pyridine”
Properties (from Wikipedia)
• Molecular formula: C5H5N
• Molar mass: 79.101 g/mol
• Appearance: colourless liquid
• Density: 0.9819 g/cm³, liquid
• Melting point: −41.6 °C
• Boiling point: 115.2 °C
• Solubility in water: Miscible
• Viscosity: 0.94 cP at 20 °C

Pyridines
• 4-Dimethylaminopyridine: C7H10N2, m.p. 110–113 °C
• 2,6-lutidine: C7H9N, m.p. −5.8 °C
• 2,4,5-collidine: C8H11N, m.p. −46 °C
“Typically this reaction may be carried out in the presence of a pyridine such as an alkylpyridine…”

Pyridine Rings
• pyridine ring: C5N, m.p. NOT APPLICABLE
“In this paper, we report two pyridine-containing triphenylbenzene derivatives of 1,3,5-tri(m-pyrid-3-yl-phenyl)benzene…”

Pyridine is a pyridine
• One Sense Per Discourse does not apply
• Found using Google
  – “A pyridine such as pyridine”
  – “Pyridines such as pyridine itself”
  – “Pyridines including pyridine, 4-dimethylaminopyridine…”

Regular Polysemy
• Ambiguity is not just for pyridine, but widespread throughout chemical nomenclature
• Some chemical terms are less ambiguous
  – e.g. “alkane”
    • No specific-compound sense
    • Usually in class-of-compounds sense
    • Also has part-of-compound sense
• Other regular polysemies exist, e.g.:
  – Metonymy
  – Gene/protein ambiguity

Guidelines
• Apply to pre-existing NE annotation
• Classification problem
  – Assign exactly one “subtype” to each NE
• Use informal “practice” rounds on other papers to develop guidelines
• Test agreement on 42 papers

Example
(The same passage as under “Some Chemistry”.)
• Subtypes: EXACT, CLASS, PART

Inter-Annotator Agreement
• 42 papers, already annotated for NEs
• 2 annotators
  – Both PhD chemists
  – Both guidelines developers
• Reference to guidelines, reference sources etc.
• No conferring, or reference to previous attempts
• 86.0% Agreement
• Cohen’s kappa = 0.784

Pyridine, Pyridines and Pyridine Rings: Disambiguating Chemical Named Entities
Peter Corbett, Colin Batchelor and Ann Copestake, Proceedings of BERBTM 2008 (LREC 2008 Workshop)

Automatic Annotation – The Challenge
• 2-acetoxybenzoic acid
• acetylsalicylic acid
• aspirin
• C6H12O6, EtOAc
• NADH, AZT, 5-HT
• A23187
• H, He, Li, Be, B, C, N…
• As, In, P, lead…

OSCAR3
• http://sourceforge.net/projects/oscar3-chem/
• Open Source (Artistic License)
• Recognition of chemical names
• Association of chemical names with structures
• Chemical text search engine
• Experimental data processing

Methods
• Dictionary Look-up
  – ChEBI
• Character 4-Grams
  – “acetone” -> “^^^a” “^^ac” “^ace” “acet” “ceto” “eton” “tone” “one$”
  – Compare frequencies in chemical names vs. English words
  – Gather from hand-annotated papers, ChEBI, English dictionary
• Maximum Entropy Markov Model

Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition
Peter Corbett and Ann Copestake, Proceedings of BioNLP 2008

Evaluating
• Precision:
  – True Positives / (True Positives + False Positives)
  – 80% = “If I say it’s a chemical name, there’s an 80% chance it really is a chemical name”
• Recall:
  – True Positives / (True Positives + False Negatives)
  – 80% = “If it’s a chemical name, there’s an 80% chance I’ll spot it and say it is one”
• F – harmonic mean of P and R

Recall-Precision Curve

Overlaps
• fuming sulphuric acid
• fuming sulphuric acid
• fuming sulphuric acid

Acknowledgements
• Peter Murray-Rust
• Ann Copestake
• Simone Teufel
• Colin Batchelor
• David Jessop
• The Royal Society of Chemistry
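
Worked sketch for the “Methods” and “Evaluating” slides
This is an illustration only, not OSCAR3's code and not part of the original talk. It shows how the padded character 4-grams from the “acetone” example and the precision, recall and F definitions could be written in Python. The function names char_ngrams and precision_recall_f, and the counts used in the usage lines, are assumptions made for this example; the ^ and $ padding follows the slide.

```python
def char_ngrams(word, n=4):
    """Return the character n-grams of a word, padded as on the slide.

    The word is prefixed with n-1 '^' characters and suffixed with one '$',
    so "acetone" with n=4 becomes "^^^acetone$" before slicing.
    """
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F (their harmonic mean) from counts.

    tp, fp, fn are the numbers of true positives, false positives and
    false negatives. Empty denominators yield 0.0 (an assumption made
    here to avoid division by zero).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    # Reproduces the list on the "Methods" slide.
    print(char_ngrams("acetone"))
    # Hypothetical counts, chosen so that P and R are both 80%.
    print(precision_recall_f(tp=80, fp=20, fn=20))
```

The 4-grams on their own are only features. The slides describe comparing their frequencies in chemical names against English words, and that comparison is not reproduced here.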