Annotating Chemical Names - Royal Society of Chemistry

Annotating Chemical Names
Peter Corbett
University of Cambridge
Named Entity Recognition
• Go through text and highlight names of:
– People
– Places
– Organisations
– Genes
– Proteins
– Chemicals
–…
Some Chemistry
In addition, we have found in previous studies
that the Zn2+–Tris system is also capable of
efficiently hydrolyzing other β-lactams, such as
clavulanic acid, which is a typical mechanism-based inhibitor of active-site serine
β-lactamases (clavulanic acid is also a fairly good
substrate of the zinc-β-lactamase from B.
fragilis).
Human Disagreement
In addition, we have found in previous studies
that the Zn2+–Tris system is also capable of
efficiently hydrolyzing other β-lactams, such as
clavulanic acid, which is a typical mechanism-based inhibitor of active-site serine
β-lactamases (clavulanic acid is also a fairly good
substrate of the zinc-β-lactamase from B.
fragilis).
“That’s Zn2+ and Tris”
“That’s Zn2+-Tris”
Named entity guidelines
• Extensive guidelines (~30 pages, 100 rules) =>
consistent annotation => high quality test +
training data
• Five classes of named entity
• Excludes:
– Large biomolecules
– Application/biological role terms
Annotation of Chemical Named Entities
Peter Corbett, Colin Batchelor and Simone Teufel, Proceedings of BioNLP 2007
Example
In addition, we have found in previous studies
that the Zn2+–Tris system is also capable of
efficiently hydrolyzing other β-lactams, such as
clavulanic acid, which is a typical mechanism-based inhibitor of active-site serine
β-lactamases (clavulanic acid is also a fairly good
substrate of the zinc-β-lactamase from B.
fragilis).
An Extract From The Guidelines
Inter-annotator agreement
• 14 Chemistry Papers
• Analytical, Synthetic, Organic, Inorganic, Physical,
Biological, Materials, Environmental etc. Chemistry
• ~ 40 000 tokens, 3000 named entities
• 3 Annotators, all with chemistry degrees
• A = main guidelines developer
• B = assisted with guidelines
• C = (almost) no work on guidelines, a little training
• Reference to extensive guidelines
• Reference to PubChem, Wikipedia, Google etc.
• No reference to previous annotations
• No conferring
Disagreement
• Whether to annotate
– Lipid (agreement), Fat (disagreement)
• How much to annotate
– Lipid A, Lipid A
– 8:0 2-OH FA, 8:0 2-OH FA
• One entity or two?
– Zn2+-Tris, Zn2+-Tris
Results
• “average” = macro-average over documents
• “corpus” = micro-average over whole corpus
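The difference between the two averages can be sketched as follows. This is only an illustration: the per-document counts are hypothetical, the function names are made up, and the score shown is a plain ratio rather than the measure used in the study.

```python
def macro_average(docs):
    """Mean of the per-document scores: every document counts equally."""
    return sum(matched / total for matched, total in docs) / len(docs)


def micro_average(docs):
    """Score over the pooled counts: every entity counts equally."""
    return sum(m for m, _ in docs) / sum(t for _, t in docs)


# Hypothetical (matched entities, total entities) for two documents,
# one short and one long.
docs = [(9, 10), (50, 100)]
print(f"average: {macro_average(docs):.3f}")
print(f"corpus:  {micro_average(docs):.3f}")
```

The long document dominates the corpus figure but has the same weight as the short one in the average, so the two numbers diverge when documents differ in size and difficulty.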
A Problem
• CM does not distinguish between
– Specific chemical compounds
– Classes of chemical compounds
– Parts of chemical compounds
• Early versions of guidelines attempted to deal
with this, using simple name-internal cues
(e.g. plural => class)
• Problem: Polysemy
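A minimal sketch of what a name-internal cue looks like, assuming the crudest possible plural test (a trailing "s"). The function name is hypothetical and the labels are borrowed from the subtype scheme that appears later in the talk; this is not the rule from the early guidelines.

```python
def naive_subtype(name):
    """Name-internal cue only: a plural name is taken to denote a class."""
    return "CLASS" if name.lower().endswith("s") else "EXACT"


print(naive_subtype("pyridines"))  # CLASS
print(naive_subtype("pyridine"))   # EXACT, whatever the surrounding text says
```

Because the cue looks only at the name, the same string always receives the same label. It has no way to tell "dissolved in pyridine" from "a pyridine such as an alkylpyridine".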
Pyridine
[Structure diagram: pyridine]
“The green residue was
dissolved in pyridine”
Properties (From Wikipedia)
• Molecular formula: C5H5N
• Molar mass: 79.101 g/mol
• Appearance: colourless liquid
• Density: 0.9819 g/cm³, liquid
• Melting point: −41.6 °C
• Boiling point: 115.2 °C
• Solubility in water: Miscible
• Viscosity: 0.94 cP at 20 °C
Pyridines
[Structure diagrams: three substituted pyridines]
• 4-Dimethylaminopyridine: C7H10N2, m.p. 110-113 °C
• 2,6-lutidine: C7H9N, m.p. -5.8 °C
• 2,4,5-collidine: C8H11N, m.p. -46 °C
“Typically this reaction may be carried out in the
presence of a pyridine such as an alkylpyridine…”
Pyridine Rings
[Structure diagrams: compounds containing a pyridine ring]
• pyridine ring: C5N, m.p. NOT APPLICABLE
“In this paper, we report two pyridine-containing
triphenylbenzene derivatives of
1,3,5-tri(m-pyrid-3-yl-phenyl)benzene…”
Pyridine is a pyridine
• One Sense Per Discourse does not apply
• Found using Google
– “A pyridine such as pyridine”
– “Pyridines such as pyridine itself”
– “Pyridines including pyridine, 4-dimethylaminopyridine…”
Regular Polysemy
• Ambiguity is not just for pyridine, but widespread
throughout chemical nomenclature
• Some chemical terms are less ambiguous
– e.g. “alkane”
• No specific-compound sense
• Usually in class-of-compounds sense
• Also has part-of-compound sense
• Other regular polysemies exist, e.g.:
– Metonymy
– Gene/protein ambiguity
Guidelines
• Apply to pre-existing NE annotation
• Classification problem
– Assign exactly one “subtype” to each NE
• Use informal “practice” rounds on other
papers to develop guidelines
• Test agreement on 42 papers
Example
In addition, we have found in previous studies
that the Zn2+–Tris system is also capable of
efficiently hydrolyzing other β-lactams, such as
clavulanic acid, which is a typical mechanism-based inhibitor of active-site serine
β-lactamases (clavulanic acid is also a fairly good
substrate of the zinc-β-lactamase from B.
fragilis).
EXACT
CLASS
PART
Inter-Annotator Agreement
• 42 papers, already annotated for NEs
• 2 annotators
– Both PhD chemists
– Both guidelines developers
• Reference to guidelines, reference sources etc.
• No conferring, or reference to previous attempts
• 86.0% Agreement
• Cohen’s kappa = 0.784
Pyridine, Pyridines and Pyridine Rings: Disambiguating Chemical Named Entities
Peter Corbett, Colin Batchelor and Ann Copestake, Proceedings of BERBTM 2008
(LREC 2008 Workshop)
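Cohen's kappa corrects the raw agreement for the agreement two annotators would reach by chance, given how often each uses every label. A minimal sketch of the standard calculation; the label sequences below are invented for illustration and are not the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of each
    # annotator's proportion of that label.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical subtype labels from two annotators for six entities.
a = ["EXACT", "EXACT", "CLASS", "PART", "CLASS", "EXACT"]
b = ["EXACT", "CLASS", "CLASS", "PART", "CLASS", "EXACT"]
print(f"kappa = {cohens_kappa(a, b):.3f}")
```

Here the annotators agree on 5 of 6 items, but kappa is lower than 5/6 because some of that agreement is expected by chance.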
Automatic Annotation – The Challenge
• 2-acetoxybenzoic acid
• acetylsalicylic acid
• aspirin
• C6H12O6, EtOAc
• NADH, AZT, 5-HT
• A23187
• H, He, Li, Be, B, C, N…
• As, In, P, lead…
OSCAR3
• http://sourceforge.net/projects/oscar3-chem/
• Open Source (Artistic License)
• Recognition of chemical names
• Association of chemical names with structures
• Chemical text search engine
• Experimental data processing
Methods
• Dictionary Look-up
– ChEBI
• Character 4-Grams
– “acetone” -> “^^^a” “^^ac” “^ace” “acet” “ceto”
“eton” “tone” “one$”
– Compare frequencies in chemical names vs. English
words
– Gather from hand-annotated papers, ChEBI, English
dictionary
• Maximum Entropy Markov Model
Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition
Peter Corbett and Ann Copestake, Proceedings of BioNLP 2008
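The character 4-gram idea can be sketched in a few lines. The padding scheme reproduces the “acetone” example from the slide. The scoring function is an assumption made for illustration (a mean log-ratio of add-one-smoothed frequencies), not the formula used in OSCAR3, and the two tiny word lists stand in for the hand-annotated papers, ChEBI and English dictionary.

```python
import math
from collections import Counter


def char_ngrams(word, n=4):
    """Pad with n-1 start markers and one end marker, then slide a window."""
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_counts(words, n=4):
    """Count every n-gram occurring in a list of words."""
    counts = Counter()
    for word in words:
        counts.update(char_ngrams(word.lower(), n))
    return counts


def chemical_score(word, chem, eng, n=4):
    """Mean log-ratio of smoothed n-gram frequencies (illustrative only).

    Positive values mean the word looks more like the chemical list.
    """
    chem_total, eng_total = sum(chem.values()), sum(eng.values())
    vocab = len(set(chem) | set(eng)) + 1  # +1 leaves room for unseen n-grams
    grams = char_ngrams(word.lower(), n)
    score = 0.0
    for g in grams:
        p_chem = (chem[g] + 1) / (chem_total + vocab)
        p_eng = (eng[g] + 1) / (eng_total + vocab)
        score += math.log(p_chem / p_eng)
    return score / len(grams)


print(char_ngrams("acetone"))

# Toy training lists, hypothetical stand-ins for the real resources.
chem = ngram_counts(["acetone", "ethanol", "pyridine", "benzene"])
eng = ngram_counts(["residue", "system", "studies", "capable"])
print(chemical_score("butanone", chem, eng) > 0)
print(chemical_score("table", chem, eng) < 0)
```

With realistic training data the same comparison lets unseen names such as new systematic names score as chemical because they share endings and fragments with known ones.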
Evaluating
• Precision:
– True Positives / (True Positives + False Positives)
– 80% = “If I say it’s a chemical name, there’s an 80%
chance it really is a chemical name”
• Recall:
– True Positives / (True Positives + False Negatives)
– 80% = “If it’s a chemical name, there’s an 80% chance
I’ll spot it and say it is one”
• F – harmonic average of P and R
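The three measures above can be computed directly from the counts. A minimal sketch; the function name and the counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and their harmonic mean from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Hypothetical counts: 80 names found correctly, 20 spurious, 20 missed.
p, r, f = precision_recall_f(tp=80, fp=20, fn=20)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.80 R=0.80 F=0.80
```

The harmonic mean stays close to the lower of the two values, so a system cannot score well on F by maximising only precision or only recall.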
Recall-Precision Curve
Overlaps
fuming sulphuric acid
fuming sulphuric acid
fuming sulphuric acid
Acknowledgements
• Peter Murray-Rust
• Ann Copestake
• Simone Teufel
• Colin Batchelor
• David Jessop
• The Royal Society of Chemistry