Definition, dictionaries and tagger for Extended Named Entity Hierarchy Satoshi Sekine Chikashi Nobata New York University 715 Broadway 7th floor New York , NY 10003 USA [email protected] Communications Research Laboratory 3-5 Hikaridai, Seika-chou, Soraku-gun Kyoto, 619-0289, Japan [email protected] Abstract The tagging of Named Entities, the names of particular things or classes, is regarded as an important component technology for many NLP applications. The first Named Entity set had 7 types, organization, location, person, date, time, money and percent expressions. Later, in the IREX project artifact was added and ACE added two, GPE and facility, to pursue the generalization of the technology. However, 7 or 8 kinds of NE are not broad enough to cover general applications. We proposed about 150 categories of NE (Sekine et al. 2002) and now we have extended it again to 200 categories. Also we have developed dictionaries and an automatic tagger for NEs in Japanese. 2) Based on existing systems and tasks Introduction The tagging of Named Entities, the names of particular things or classes, is regarded as an important component technology for many NLP applications. These applications include question answering (QA), summarization, information retrieval (IR) and information extraction (IE), from which it was born. The first Named Entity set had 7 types (Grishman and Sundheim 1996), organization, location, person, date, time, money and percent expressions. Later, in the IREX project (Sekine and Isahara 2000) artifact (for example, product name “Windows” or book title “Odyssey”) was added and ACE (ACE homepage) added two, GPE (location with political function, such as Portugal or Lisbon) and facility (such as building name), to pursue the generalization of the technology. However, 7 or 8 kinds of NE are not broad enough to cover general applications. For example, when we encounter a new IE domain “airplane crashes” we need a new NE category “name of airplane”. In QA applications, in order to answer the question “What is the name of an international prize the Korean President Kim Dae received in 2002?”, we need to have a NE category “name of prize”. Aiming to cover such needs, we proposed about 150 categories of Extended NE (Sekine et al. 2002). However, in the process of developing system applications since then, we noticed the need for additional Extended NE categories and now we have extended it again to 200 categories. In this paper we will describe the Extended NE and report on our effort to create the dictionaries and an automatic tagger for ENEs in Japanese. Initial Extended NE We reported, at the previous LREC (Sekine et. al 2002), that we designed the Extended NE hierarchy based on three procedures. Note that our hierarchy is intended as a general hierarchy for newspaper domains, rather than a special hierarchy for a particular domain. This is because our target application is general IE or QA. 1) Based on a newspaper corpus We extracted about 3500 candidate NE expressions from a corpus, using surface clues, namely capitalized words and the context of numerals. We assigned NE categories to each expression. There are many systems and tasks related to NE, e.g. (ISI Webclopedia HP) 3) Based on thesaurus A thesaurus provides data closely related to the NE hierarchy. We consulted two well-known thesauri, WordNet (WordNet HP) and Roget Thesaurus. Our hierarchy was used for several applications, and some people referred to it to design their own hierarchy. We also developed an English tagger based on the hierarchy, and used it in several applications. 200 category Extended NE We used the 150 categories for our applications, such as QA and IE systems in the newspaper domain. In the process of system development, we observed that 150 categories are not yet enough. We extended our hierarchy again, to about 200 categories. We are trying to make the Extended NE hierarchy cover major newspaper domains so that applications in such domains will work well with this hierarchy. One of the major changes is that the non-terminal categories now have no instances. Instead, we created an “other” category for each non-terminal to put the instances which do not fit in any of its child categories. This is to make use of rule/feature inheritance. We used to put such instances to the non-terminal node, but then there is no way to make a rule specifically for such “other” instances. For example, in a QA application, we listed groups of noun phrases for each category. Then, for a question like “What is the name the area where the WTC used to exist?”, the answer category (Ground Zero) is a location which shall be found by the word “area” in the question. But it doesn’t fall into any of the subcategories of location, so it should be “location_other”. In the old scheme, where such entities belonged to the non-terminal node “location”, there was a conflict in that we would have to assign “area” as a noun phrase for all locations, which is not the case, as you don’t usually use the noun phrase “area” to indicate the name of a star, which is one of the location categories. To avoid this problem, we introduced an “other” category for each nonterminal category. Also, we extended several categories, mostly to make the existing categories finer. For example, we introduced 1977 This includes 130 notes addressing difficulties in definitions or tagging. For example, how we will tag imaginary entities, like fictional characters in movies, criteria for deciding if entities belong one category or another, how to treat special entities, e.g. “National central bank”, anaphoric expressions, abbreviations, coordination, initials, nicknames, compound nouns, overlapping of several entities, etc. In the manual, there are links from each note to related entities and other notes, so that the user can easily refer to related issues. "natural disaster", as we found that such things are often reported in newspapers and useful for many applications, such as QA and IE. We extended numeric expressions, too. For example, “school age” (e.g. second grade) was introduced. It is obvious that the more categories we have, the more difficult it is to make clear definitions, although in some particular cases, extending the categories helps in finding the right category for an entity. For example, “The Supreme Court”, which was ambiguous between a location and an organization in the 7 or 8 category NE can be clearly classified as a GOE (Geographical Organizational Entities), which is a facility with an identity. However, there are more cases where it becomes difficult to find the right category, e.g. whether a civil strife is a war or an affair, whether a typhoon with minor casualties is a natural phenomena or a natural disaster, the ambiguity of a religious name between the religion and the group of people who believe it, or the definition of an ethnic group. This is the problem of categorizing the world into semantic categories, and finding the right category for each word (of each occurrence). We believe that there is no ultimate solution, so we made definitions with a lot of examples in addition to the verbal definition of each category, as the verbal definition itself can’t be concrete enough to define most categories. The 200 categories are listed in the appendix, in the form of a tree. The root node is TOP and its three children are NAME, TIME_TOP and NUMEX. The number of “>” signs at the beginning of each line indicates the depth of the node. The parent of a line can be found by searching upwards for the line with one fewer “>”. In order to save space, several categories are listed on a single line if there is more than one category at the same level and these do not have any children. Dictionaries Definition We created a hyper-linked definition document in Japanese (available on the web http://nlp.cs.nyu.edu/ene), which is about 150 pages long. An English version is also available, but currently not as detailed as the Japanese version. The definition includes 3 major parts. 1) Global definition and explanation This includes the goal and background of the hierarchy, and the sentence to test whether an instance belongs to a category. Suppose we are testing if instance I belongs to Category C. We use a sentence “Tell me the name of that C.” and then if the question is a natural question with response I, we regard I as an instance of C. The phrase in the question has to be “the name of that C” rather than “the class of that C” or “the name of a C”. For example, “New York University” can be an answer to “What is the name of that school?”, so it is an instance of the school category, but “elementary school” can’t be. Such a test sentence is also used for numeric expressions; for example “What is its length?” is the test sentence to check if an instance can belong to the category “physical extent”. The noun phrases to be used in the test sentence for each category are listed in the manual. 2) Definition of each category Each category is defined, along with the link to its parent node, its child nodes, the typical noun phrases (used in the test sentence) for the category, examples, and links to the notes of difficult cases, described below. As mentioned before, we are not relying only on the top down description of the category. We listed many examples for each category. For example, "product_other" includes 40 positive and 12 negative examples. 3) Notes for the difficult cases We have accumulated instances of each category from the Web, newspapers and other sources. This was all done by hand and includes about 130,000 instances. Table 1 shows some categories and the number of instances in the dictionary. We also created a common noun phrase dictionary with about 50,000 instances. In the dictionary, common noun phrases which express a particular category are listed. For example, for the "people" category, words like "scientist" and "baseball player" are listed. One of the uses of the dictionary is for QA systems. For a question like "What is the name of the scientist who created the electric light?", you can determine the category of the answer using the dictionary. Person Landform Company Water form Disease City Mineral … 32,606 17,197 7,261 5,244 4,777 3,750 2,561 Theory Conference Crime … Train Reptile GPE … 190 180 152 67 66 47 Table 1. Number of instances for some categories Tagger We developed an NE tagger using the dictionary and pattern-based rules. Rules are used to identify entities which can’t be tagged by dictionary. For example, although popular person names are listed in the dictionary, we can’t include all person names. So rules like "Mr. XXX" are used to tag person names. About 1,400 such rules were developed by hand from a study of newspapers and other sources. Some examples of rules are listed in Table 2; these examples are slightly modified to make them easier to understand. In the table, the left side shows patterns to be matched to the input sentence. Each element of a pattern can have up to four components: literal string, word class, POS and currently tagged NE label. For example, the first element of the first example 1978 in Table 2 matches a token of Katakana which is currently tagged as “person”. The right side indicates the NE tags to be changed for the token at each position of the pattern. These rules make use of word classes to group similar words, such as "Mr.", "Mrs." and "Miss". There are 140 classes with about 2,500 word instances; some examples are shown in Table 3. Pattern NE tag (* ~Katakana * PERSON) (.) (* 1:B-PERSON ~Alphabet * OTHER) 2:I-PERSON 3:I-PERSON (TSUMA|ANI) (* * NNP) (SHI|SAN) 2:B-PERSON 3:B-TITLE (* ~Katakana) (* COM_SUFFIX) 1:B-COMP 2:I-COMP (`) (*) (’) (TOIU) (EIGA) 2: B-MOVIE Translation: “Katakana”: A Japanese character type. It usually represents borrowed words, like names of foreign persons. “TSUMA”: “wife”. “ANI”: “big brother”. “SHI|SAN”: Prefix indicating person, like “Mr.”, but located after a person name. “X TOIU EIGA”: “Movie called X” Table 2. Example of tagging rules Class COM_SUFFIX PERSON_SUFFIX POSITION_PREFIX FACILITY_SUFFIX Examples KABUSHIKIGAISYA (Cooperation), Co., Ltd. SAN (Mr.), SAMA (Mr. with respect) , Reporter, Driver FUKU (Vice), SHIN (New), ZEN (Previous) KYOUGIJOU (Stadium), GEKIJOU (Theater) Table 3. Example of Classes ENE category TOTAL Person Date Country Position title … Money Event … Product Landform … Weight Picture Frequency 6708 803 709 574 433 Precision 72% 65% 88% 92% 72% Recall 80% 77% 86% 90% 69% 118 90 95% 36% 98% 41% 40 36 0% 86% 0% 86% 5 5 60% 0% 100% 0% Table 4. Evaluation Results The accuracy of the tagger is 72% recall and 80% precision. Table 4 shows the frequency in the test data and accuracy for the overall categories and several categories. Although there is a room for improvement, we can't expect it to be on a par with the state-of-the-art 8category NE taggers. As the number of categories is large, it is quite difficult to prepare tagged sentences for supervised training, which is a major technique for creating taggers for a small number of categories. We adopt the strategy of using a dictionary and rules at the moment. Discussion and Future Directions Domain dependent category vs. General NE It is controversial whether a general purpose Extended NE hierarchy really exists or not. It might be better to create a domain dependent NE hierarchy if the application targets only one domain. However, there are applications like open-domain Question Answering or open-domain Information Extraction where the target domain is very broad, like the newspaper domain. Then we can’t afford to create NE hierarchies for each subdomain, but must create a general Extended NE hierarchy. Extended NE hierarchy and thesaurus As the number of categories grows, NE tagging becomes more similar to the problem of sense disambiguation or finding the appropriate node in a thesaurus for the entity, as evaluated in SENSEVAL (Senseval HP). Indeed there are Question Answering systems which use a thesaurus like WordNet (WordNet HP) for finding the type of an entity. However, the current WordNet includes mostly common nouns rather than proper nouns or names. Also as we observed when we designed the hierarchy in the first place, WordNet is not really suitable for use as a hierarchy for names, as it is designed for common nouns. Accuracy of the tagger Obviously the accuracy of the tagger is not satisfactory. Some large categories, including person and organizations, make the overall accuracy worse. It may be a good idea to combine machine learning methods with our system. Also, for some categories, we found that collecting more instances could easily help. We may have to work hard to make the list longer. Finally, we are considering other kinds of machine learning; including weakly supervised learning, bootstrapping, active learning and unsupervised methods using linguistic clues, in order to improve the accuracy of the tagger. Acknowledgements A part of this research is supported by the Defense Advanced Research Projects Agency as part of the Translingual Information Detection, Extraction and Summarization (TIDES) program, under Grant N66001001-1-8917 from the Space and Naval Warfare Systems Center, San Diego, and by the National Science Foundation under Grant ITS-00325657. This paper does not necessarily reflect the position of the U.S. Government. We would like to thank our colleagues at New York University and Communications Research Laboratory, who provided useful suggestions and discussions, including, Prof. Ralph Grishman, Mr. Kiyoshi Sudo, Mr. Yusuke Shinyama at NYU and Mr. Kiyotaka Uchimoto and Hitoshi Isahara at CRL. 1979 Bibliography (Grishman and Sundheim 1996) Ralph Grishman and Beth Sundheim, "Message Understanding Conference 6: A Brief History", COLING-1996. (Sekine and Isahara 2000) Satoshi Sekine and Hitoshi Isahara, "IREX: IR and IE Evaluation project in Japanese", LREC-2000. (Sekine et al. 2002) Satoshi Sekine, Kiyoshi Sudo and Chikashi Nobata. “Extended Named Entity Hierarchy”, LREC-2002 (ACE homepage) http://www.itl.nist.gov/iaui/894.01/ tests/ace (Senseval HP) http://www.senseval.org/ (WordNet HP) http://www.cogsci.princeton.edu/~wn/ (ISI Webclopedia HP) http://www.isi.edu/naturallanguage/projects/webclopedia/ APPENDIX: 200 ENE categories TOP >NAME >>NAME_OTHER >>PERSON >>ORGANIZATION >>>ORGANIZATION_OTHER COMPANY COMPANY_GROUP MILITARY INSTITUTE GOVERNMENT POLITICAL_PARTY GAME_GROUP INTERNATIONAL_ORGANIZATION ETHNIC_GROUP NATIONALITY >>LOCATION >>>LOCATION_OTHER >>>GPE >>>>GPE_OTHER CITY COUNTY PROVINCE COUNTRY >>>CONTINENTAL_REGION DOMESTIC_REGION >>>GEOLOGICAL_REGION >>>>GEOLOGICAL_REGION_OTHER LANDFORM WATER_FORM SEA >>> ASTRAL_BODY >>>>ASTRAL_BODY_OTHER STAR PLANET >>>ADDRESS >>>>ADDRESS_OTHER POSTAL_ADDRESS PHONE_NUMBER EMAIL URL >>FACILITY >>>FACILITY_OTHER >>>GOE >>>>GOE_OTHER SCHOOL PUBLIC_INSTITUTION MARKET MUSEUM AMUSEMENT_PARK WORSHIP_PLACE >>>>STATION_TOP >>>>>STATION_TOP_OTHER AIRPORT STATION PORT CAR_STOP >>>LINE >>>>LINE_OTHER RAILROAD ROAD WATERWAY TUNNEL BRIDGE >>>PARK SPORTS_FACILITY MONUMENT FACILITY_PART >>PRODUCT >>>PRODUCT_OTHER >>>VEHICLE >>>>VEHICLE_OTHER CAR TRAIN AIRCRAFT SPACESHIP SHIP >>>FOOD CLOTHES DRUG WEAPON STOCK AWARD THEORY RULE SERVICE CHARACTER METHOD_SYSTEM DOCTRINE CULTURE RELIGION LANGUAGE PLAN ACADEMIC CLASS SPORTS OFFENCE >>>ART >>>>ART_OTHER PICTURE BROADCAST_PROGRAM MOVIE SHOW MUSIC BOOK >>>PRINTING >>>>PRINTING_OTHER NEWSPAPER MAGAZINE >>EVENT >>>EVENT_OTHER >>>OCCASION >>>>OCCASION_OTHER GAMES CONFERENCE >>>NATURAL_PHENOMENA NATURAL_DISASTER WAR INCIDENT >>NATURAL_OBJECT >>>NATURAL_OBJECT_OTHER >>>LIVING_THING >>>>LIVING_THING_OTHER >>>>ANIMAL >>>>>ANIMAL_OTHER >>>>>INVERTEBRATE >>>>>>INVERTEBRATE_OTHER INSECT >>>>>VERTEBRATE >>>>>>VERTEBRATE_OTHER FISH REPTILE AMPHIBIAN BIRD MAMMAL >>>>FLORA BODY_PARTS FLORA_PARTS >>>MINERAL >>TITLE >>>TITLE_OTHER POSITION_TITLE >>UNIT UNIT_OTHER CURRENCY >>VOCATION >>DISEASE >>GOD >>ID_NUMBER >>COLOR >TIME_TOP >>TIME_TOP_OTHER >>>TIMEX >>>>TIMEX_OTHER TIME DATE DAY_OF_WEEK ERA >>>PERIODX >>>>PERIODX_OTHER TIME_PERIOD DATE_PERIOD WEEK_PERIOD MONTH_PERIOD YEAR_PERIOD >NUMEX >>NUMEX_OTHER MONEY STOCK_INDEX POINT PERCENT MULTIPLICATION FREQUENCY RANK AGE SCHOOL_AGE LATITUDE_LONGITUDE >>MEASUREMENT >>>MEASUREMENT_OTHER PHYSICAL_EXTENT SPACE VOLUME WEIGHT SPEED INTENSITY TEMPERATURE CALORIE SEISMIC_INTENSITY SEISMIC_MAGNITUDE >>COUNTX >>>COUNTX_OTHER N_PERSON N_ORGANIZATION >>>N_LOCATION >>>>N_LOCATION_OTHER N_COUNTRY >>>N_FACILITY N_PRODUCT N_EVENT N_ANIMAL N_FLORA N_MINERAL >>ORDINAL_NUMBER 1980
© Copyright 2025 Paperzz