Research approach: Building a Wikipedia-derived training corpus for named entity recognition

1 Contribution

My honours research aims to fill the need for high-coverage training corpora for named entity recognition (NER) by deriving one from Wikipedia; the alternative, manual annotation, is labour-intensive. This corpus aims to be an up-to-date (and updatable) resource for improving statistical NER, across domains and potentially across languages. Independent of the system's success at improving NER performance, it will also contribute to the currently popular field of Wikipedia mining. The creation of a high-quality annotated data set will illustrate the efficacy of using Wikipedia as a structured corpus of world knowledge, in particular for deriving machine-learning training corpora. At a more detailed level, I will extend the work of others in identifying the features of Wikipedia that are most useful for named entity identification and classification, and show the flexibility of using Wikipedia's structure and text for such a task.

2 Methodology

The process of deriving a corpus of named entity-annotated sentences from Wikipedia will consist of two main sub-tasks: (1) selecting sentences to include in the corpus; (2) classifying the articles mentioned in those sentences into named entity classes. Together these can be used to produce tagged data to train the Curran and Clark (2003) NER system, possibly in conjunction with other training corpora. See figure 1 for an overview of the system. Prior to either of these sub-tasks being possible, though, the corpus needs to be digested into a more useful form than its default wiki-markup. Since we are relying on redundancy, sentences that are questionable and articles that are difficult to classify with confidence may simply be discarded.

Selecting sentences

Since all named entities within a sentence need to be annotated in our corpus, only sentences where all named entities have known referents can be used. For a baseline implementation, this means using only those sentences in which all capitalised terms are linked to articles, with some minor complications due to sentence-initial capitalisation. These may be avoided by simply ignoring sentences that begin with links, or by using other heuristics to decide whether sentence-initial links refer to named entities. At least two further problems exist at this stage: (a) most entities that are mentioned multiple times in an article are only linked on their first appearance; (b) capitalisation cues apply only to English and a few other European languages, so any attempt to use this method to train named entity recognition in other languages would be limited. The former constraint is not merely one of reduced recall: it may produce a deficient training corpus that does not help to identify abbreviated forms, which are rarely used on the first mention of an entity in an encyclopedia article. The latter problem may be solved by relying on predetermined article classifications, as discussed under Classifying articles below. Generally, diverging from a simple baseline may imply a loss of precision.
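To make the baseline selection rule concrete, the following sketch keeps a sentence only if every capitalised, non-initial token is covered by a wiki-link, and side-steps sentence-initial capitalisation by discarding sentences that begin with a link. It is a minimal illustration under assumed data structures (Token, Sentence, link_target) and one possible reading of the heuristics above, not the implementation proposed here.

```python
# Illustrative sketch only: the Token/Sentence representation and the handling
# of the sentence-initial token are assumptions made for this example.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    text: str
    link_target: Optional[str] = None  # linked article title, if any


@dataclass
class Sentence:
    tokens: List[Token] = field(default_factory=list)


def is_capitalised(token: Token) -> bool:
    """Crude named entity cue: the token starts with an upper-case letter."""
    return token.text[:1].isupper()


def usable_for_corpus(sentence: Sentence) -> bool:
    """Baseline filter: all capitalised terms must have a known referent (a link).

    Sentences beginning with a link are discarded, since sentence-initial
    capitalisation gives no evidence about whether the link is a named entity.
    The check is skipped for an unlinked first token, which is capitalised
    regardless of entity status (one possible treatment of the complication
    noted above).
    """
    if not sentence.tokens:
        return False
    if sentence.tokens[0].link_target is not None:
        return False
    for tok in sentence.tokens[1:]:
        if is_capitalised(tok) and tok.link_target is None:
            return False  # capitalised term with no known referent
    return True


# Example: the second sentence is rejected because "Parramatta" is unlinked.
keep = Sentence([Token("He"), Token("founded"), Token("Sydney", "Sydney"), Token(".")])
drop = Sentence([Token("He"), Token("visited"), Token("Parramatta"), Token(".")])
print(usable_for_corpus(keep), usable_for_corpus(drop))  # True False
```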
Classifying articles

Choosing a NE category (or none) for each linked article will require either determining heuristics that mark each category, or selecting features on which a machine learner could be trained to associate articles with named entity classes (see figure 2). The latter approach is much more flexible and would allow corpora to be generated for different classification schemes, or indeed for single topical domains. It would nonetheless require the manual classification of a representative collection of articles from each class, although in some cases this could plausibly be achieved with gazetteers.

[Figure 1 (diagram): 1 Sentence selections, 2 Article classifications, 3 Wikipedia-derived, 4 CoNLL-2003, 5 Web-derived, 6 NER tagger; 3–5 are grouped as annotated NE training corpora.]
Figure 1: System overview: 1 and 2 are extracted from Wikipedia to form 3. A selection of training corpora (3, 4, 5) may then train 6, with performance results compared on some test set.

[Figure 2 (diagram): Article features and known classifications feed an ML classifier, which outputs article classifications.]
Figure 2: A flexible article classification approach.
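As a concrete illustration of the machine-learning option in figure 2, the sketch below maps an article's categories and defining first sentence to indicator features and trains a generic classifier on a few manually classified seed articles. The feature set, the toy seed data and the use of scikit-learn are assumptions made for this illustration; they are not details of the proposed system.

```python
# Illustrative sketch only: the feature extraction and seed data are invented here.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def article_features(article):
    """Map an article (a dict with 'title', 'categories', 'first_sentence')
    to simple indicator features that may cue its named entity class."""
    feats = {}
    for cat in article.get("categories", []):
        feats["category=" + cat.lower()] = 1
    # The defining first sentence of an article is a common cue for NE class.
    for word in article.get("first_sentence", "").lower().split():
        feats["first_sent=" + word] = 1
    feats["title_length=%d" % len(article.get("title", "").split())] = 1
    return feats


# A small manually classified seed set (CoNLL-style classes used as an example).
seed = [
    ({"title": "Sydney", "categories": ["Cities in Australia"],
      "first_sentence": "Sydney is the most populous city in Australia"}, "LOC"),
    ({"title": "BHP", "categories": ["Mining companies of Australia"],
      "first_sentence": "BHP is an Australian mining company"}, "ORG"),
]

classifier = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit([article_features(a) for a, _ in seed], [label for _, label in seed])

# Predict a class for an unseen article.
print(classifier.predict([article_features(
    {"title": "Melbourne", "categories": ["Cities in Australia"],
     "first_sentence": "Melbourne is a coastal city in Australia"})]))
```

3 Evaluation

Some evaluation will be performed for each of the two sub-tasks, as well as for the overall performance induced by our generated training corpus.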
Sentence selection

Sentence selection can be evaluated for accuracy by manually counting, in a random selection of retrieved sentences, those containing named entities that our system would fail to tag, such as a single-word named entity beginning a sentence. An absolute measure of recall could also be given by comparing the number of sentences retrieved to the estimated number of sentences in Wikipedia (or the number with proper-noun capitalisation). Otherwise, refinements could only be compared to an initial baseline implementation.

Article classification

Different methods of article classification could be measured with the usual statistics of precision and recall (combined as an F1 score) for each class and overall, given a manually annotated random selection. If a machine learner is used for the classification, cross-validation can be performed on what is otherwise training data. Results could also be compared to gazetteers (at least in the case of unambiguous entities) to identify losses in coverage and misclassifications.

Overall evaluation

Although the evaluation described in the preceding paragraphs will usefully describe the coverage and accuracy of our training corpus, its performance in training a system for NER is of primary significance. NER accuracy is usually measured in terms of per-class and overall precision, recall and F1 statistics, and the system would initially be tested against baselines on the CoNLL 2003 shared task English test data, in order to provide comparison with known results. (Given a multi-language-capable system, other languages of CoNLL data may also be used.) The primary baseline would be the CoNLL 2003 training corpus, while an additional baseline could be constructed by extracting a training corpus from the web using a method similar to An et al. (2003). Thus, results could be compared when training the Curran and Clark (2003) NER system on each corpus and on combinations thereof: the CoNLL baseline, the Web-derived baseline and the Wikipedia-derived corpora. Particular focus will be given to relative strengths regarding:

• different classes of entity;
• ambiguous entities;
• entities unknown from training data.

There may also be a genre mismatch between CoNLL and Wikipedia-derived data, but testing this claim will depend on the availability of other gold-standard corpora.
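For reference, the sketch below shows the per-class and overall precision, recall and F1 computation that the comparisons above rely on, assuming that gold and predicted entities are represented as (sentence id, start, end, class) spans scored by exact match. It is a generic re-implementation for illustration, not the CoNLL-2003 evaluation script or part of the Curran and Clark (2003) tools.

```python
# Illustrative sketch only: the entity-span representation and exact-match
# criterion are assumptions for this example.
def prf_by_class(gold, predicted):
    """Return {ne_class: (precision, recall, f1)}, plus an 'OVERALL' entry,
    counting an entity as correct only if its span and class both match."""
    gold, predicted = set(gold), set(predicted)
    classes = sorted({e[-1] for e in gold | predicted})
    results = {}
    for cls in classes + ["OVERALL"]:
        g = gold if cls == "OVERALL" else {e for e in gold if e[-1] == cls}
        p = predicted if cls == "OVERALL" else {e for e in predicted if e[-1] == cls}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[cls] = (precision, recall, f1)
    return results


# Comparing taggers trained on different corpus combinations amounts to calling
# prf_by_class once per trained model against the same gold test set.
gold = [(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 3, 4, "ORG")]
pred = [(0, 0, 2, "PER"), (0, 5, 6, "ORG")]
for cls, (p, r, f) in prf_by_class(gold, pred).items():
    print("%-8s P=%.2f R=%.2f F1=%.2f" % (cls, p, r, f))
```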
4 Conclusion

Although there are a number of options for implementation detail, the sub-tasks of building and evaluating NER improvements with a Wikipedia-derived corpus are clear.