Research approach: Building a Wikipedia-derived training corpus for named entity recognition
1 Contribution
My honours research aims to fill the need for high-coverage training corpora for named entity recognition (NER) by deriving one from Wikipedia; the alternative, manual annotation, is labour-intensive. This corpus aims to be an up-to-date (and updatable) resource to improve statistical NER, across domains and potentially across languages.

Independent of the system's success at improving NER performance, it will also make contributions to the currently popular field of Wikipedia mining. The creation of a high-quality annotated data set will illustrate the efficacy of using Wikipedia as a structured corpus of world knowledge, in particular in order to derive ML training corpora. On a more detailed level, I will extend the work of others in identifying the features of Wikipedia that are most useful for named entity identification and classification, and show the flexibility of using Wikipedia's structure and text for such a task.
2 Methodology
The process of deriving a corpus of named entity-annotated sentences from Wikipedia will consist of two main sub-tasks: (1) selecting sentences to include in the corpus; (2) classifying articles mentioned in those sentences into named entity classes. Together these can be used to produce tagged data which will train the Curran and Clark (2003) NER system, possibly in conjunction with other training corpora. See figure 1 for an overview of the system. Since we are relying on redundancy, sentences that are questionable and articles that are difficult to classify with confidence may simply be discarded. Prior to either of these sub-tasks being possible, though, the corpus needs to be digested into a more useful form than its default wiki-markup.
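No particular parsing tool is prescribed here; as a minimal sketch, assuming the third-party mwparserfromhell library, digesting a single article might keep both the stripped prose and the link targets that the two sub-tasks below rely on:

import mwparserfromhell

def digest_article(wikitext):
    """Strip wiki-markup from one article, keeping link targets for later use.

    Returns (plain_text, links), where links maps each linked surface form
    to the title of the article it points to.
    """
    parsed = mwparserfromhell.parse(wikitext)
    links = {}
    for link in parsed.filter_wikilinks():
        surface = str(link.text) if link.text else str(link.title)
        links[surface] = str(link.title)
    # strip_code() drops templates, markup and references, leaving plain prose.
    return parsed.strip_code(), links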
Selecting sentences Since all named entities within a sentence need to be annotated in our corpus, only sentences where all named entities have known referents can be used. For a baseline implementation, this involves using only those sentences where all capitalised terms are linked to articles, with some minor complications due to sentence-initial capitalisation. This may be avoided by simply ignoring sentences that begin with links, or by using other heuristics to identify whether sentence-initial links refer to named entities.
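To make this baseline concrete, a minimal sketch (with hypothetical helper names; it assumes sentences are already tokenised and the token positions covered by wiki links are known) might discard sentences that begin with a link, as the simplest treatment of sentence-initial capitalisation:

def select_sentence(tokens, link_spans):
    """Baseline selection: keep a sentence only if every capitalised term
    has a known referent (a wiki link).

    tokens     -- word tokens for one sentence
    link_spans -- set of token indices covered by a wiki link
    """
    # Simplest handling of sentence-initial capitalisation: discard any
    # sentence that begins with a link, since capitalisation alone cannot
    # tell us whether that link names an entity.
    if 0 in link_spans:
        return False
    for i, token in enumerate(tokens[1:], start=1):
        if token[0].isupper() and i not in link_spans:
            return False  # a capitalised term with no known referent
    return True

# "He visited Canberra ." with Canberra linked is kept; the same sentence
# with an unlinked capitalised term, or one beginning with a link, is not.
print(select_sentence(["He", "visited", "Canberra", "."], link_spans={2}))  # True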
At least two further problems exist at this stage: (a) most entities that are mentioned multiple times in an article are only linked on their first appearance; (b) capitalisation cues are only applicable for English and a few other European languages, so any attempts to use this method to train named entity recognition in other languages would be limited. The former constraint is not merely one of reduced recall, but may produce a deficient training corpus that does not help to identify abbreviated forms, since these are rarely used on first mention of an entity in an encyclopedia article. The latter problem may be solved by relying on predetermined article classifications, as discussed in the following paragraph. Generally, diverging from a simple baseline may imply a loss of precision.
Classifying articles Choosing a NE category (or none) for each linked article will require determining heuristics that mark each category, or selecting features on which a machine learner could be trained to associate articles with named entity classes (see figure 2). The latter approach is much more flexible and would allow corpora to be generated for different classification schemes, or indeed for single topical domains. It would nonetheless require the manual classification of a representative collection of articles from each class, although in some cases this could plausibly be achieved with gazetteers.
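As a sketch of the machine-learning option, assuming scikit-learn and a hypothetical handful of manually classified seed articles, features drawn from an article's category names and opening sentence could be fed to a standard classifier:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def article_features(categories, first_sentence):
    """Turn an article's categories and opening sentence into sparse features."""
    feats = {}
    for cat in categories:
        feats["category=" + cat.lower()] = 1
    for word in first_sentence.lower().split():
        feats["first_word=" + word] = 1
    return feats

# Hypothetical manually classified seed articles: (categories, first sentence, NE class).
seed = [
    (["Cities in Australia"], "Sydney is the largest city in Australia.", "LOC"),
    (["American computer scientists"], "Alan Kay is a computer scientist.", "PER"),
    (["Software companies"], "Mozilla is a free-software community.", "ORG"),
    (["Chemistry"], "An acid is a molecule that donates protons.", "NONE"),
]
X = [article_features(cats, sent) for cats, sent, _ in seed]
y = [label for _, _, label in seed]

model = Pipeline([("vec", DictVectorizer()), ("clf", LogisticRegression())])
model.fit(X, y)

# Classify a newly linked article from its features.
print(model.predict([article_features(["Rivers of Europe"],
                                       "The Danube is a river in Europe.")]))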
Figure 1: System overview: 1 (sentence selections) and 2 (article classifications) are extracted from Wikipedia to form 3 (the Wikipedia-derived annotated NE training corpus). A selection of training corpora (3, 4: CoNLL-2003, 5: Web-derived) may then train 6 (the NER tagger), with performance results compared on some test set.
Figure 2: A flexible article classification approach: article features and known classifications are used to train an ML classifier, which outputs article classifications.
3 Evaluation
Some evaluation will be performed for each of the two sub-tasks, as well as for the overall performance induced by our generated training corpus.
Sentence selection Sentence selection can be evaluated for accuracy by taking a random sample of retrieved sentences and manually counting those that include named entities our system would fail to tag, such as a single-word named entity beginning a sentence. An absolute measure of recall could also be given, comparing the number of sentences retrieved to the estimated number of sentences in Wikipedia (or the number with proper noun capitalisation). Otherwise, refinements could only be compared to an initial baseline implementation.
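With hypothetical figures purely for illustration, both measures reduce to simple ratios:

# Hypothetical counts, for illustration only.
sampled = 200                 # randomly sampled retrieved sentences, checked by hand
missed = 14                   # of those, sentences with an entity we could not tag
accuracy = 1 - missed / sampled                 # 0.93

retrieved = 1_200_000         # sentences kept by our selection
estimated_total = 20_000_000  # rough estimate of candidate sentences in Wikipedia
absolute_recall = retrieved / estimated_total   # 0.06
print(accuracy, absolute_recall)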
Article classification Different methods of article classification could be measured with the usual statistics of precision and recall (combined as F1 score) for each class and overall, given a manually annotated random selection. If a machine learner is used for the classification, cross-validation can be performed on what is otherwise training data. Results could also be compared to gazetteers (at least in the case of unambiguous entities), to identify losses in coverage and misclassifications.
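As a sketch of this scoring, assuming scikit-learn and hypothetical gold and predicted labels for the manually annotated selection:

from sklearn.metrics import classification_report

# Hypothetical gold labels for a manually annotated random selection of
# articles, alongside a classifier's predictions for the same articles.
gold = ["LOC", "PER", "ORG", "NONE", "LOC", "ORG"]
predicted = ["LOC", "PER", "ORG", "LOC", "LOC", "NONE"]

# Per-class and overall precision, recall and F1.
print(classification_report(gold, predicted, zero_division=0))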
Overall evaluation Although the evaluation described in the preceding paragraphs will helpfully describe the coverage and accuracy of our training corpus, its performance in training a system for NER is of primary significance. NER accuracy is usually measured in terms of per-class and overall precision, recall and F1 statistics, and initially the system would be tested in comparison to baselines on the CoNLL 2003 shared task English test data, in order to provide comparison to known results. (Given a multi-language-capable system, other languages of CoNLL data may also be used.) The primary baseline would be the CoNLL 2003 training corpus, while an additional baseline could be constructed by extracting a training corpus from the web using a method similar to An et al. (2003). Thus, results could be compared when training the Curran and Clark (2003) NER system on each corpus and combinations thereof, for the CoNLL baseline, the Web-derived baseline and the Wikipedia-derived corpora.
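The phrase-level scoring itself is simple enough to sketch. The following minimal version over BIO-tagged sequences (not the official conlleval script) counts exact span-and-type matches:

def spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    found, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        ends_current = tag == "O" or tag.startswith("B-") or \
            (tag.startswith("I-") and etype != tag[2:])
        if ends_current:
            if etype is not None:
                found.add((start, i, etype))
                etype = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        # a matching I- tag simply extends the current span
    return found

def ner_scores(gold_tags, predicted_tags):
    """Phrase-level precision, recall and F1 over exact span-and-type matches."""
    gold, pred = spans(gold_tags), spans(predicted_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(ner_scores(["B-PER", "I-PER", "O", "B-LOC"],
                 ["B-PER", "I-PER", "O", "B-ORG"]))  # (0.5, 0.5, 0.5)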
Particular focus will be given to relative
strengths regarding:
• different classes of entity;
• ambiguous entities;
• entities unknown from training data.
There may also be a genre mismatch between CoNLL and Wikipedia-derived data, but testing this claim will depend on the availability of other gold standard corpora.
4 Conclusion
Although there are a number of options for implementation detail, the sub-tasks of building and evaluating NER improvements with a Wikipedia-derived corpus are clear.