Advance Access publication on 18 October 2013 © The British Computer Society 2013. All rights reserved. For Permissions, please email: [email protected] doi:10.1093/comjnl/bxt112
Extracting and Displaying Temporal and Geospatial Entities from Articles on Historical Events
Rachel Chasin1, Daryl Woodward2, Jeremy Witmer3 and Jugal Kalita4,∗
1 Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
2 Department of Electrical and Computer Engineering, University of Colorado, Colorado Springs, CO 80918, USA
3 The MITRE Corporation, 49185 Transmitter Road, San Diego, CA 92152-7335, USA
4 Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA
∗Corresponding author: [email protected]
This paper discusses a system that extracts and displays temporal and geospatial entities in text. The first task involves identification of all events in a document followed by identification of important events using a classifier. The second task involves identifying named entities associated with the document. In particular, we extract geospatial named entities. We disambiguate the set of geospatial named entities and geocode them to determine the correct coordinates for each place name, often called grounding. We resolve ambiguity based on sentence and article context. Finally, we present a user with the key events and their associated people, places and organizations within a document in terms of a timeline and a map. For purposes of testing, we use Wikipedia articles about historical events, such as those describing wars, battles and invasions. We focus on extracting major events from the articles, although our ideas and tools can be easily used with articles from other sources such as news articles. We use several existing tools such as Evita, Google Maps, publicly available implementations of Support Vector Machines, Hidden Markov Models and Conditional Random Fields, and the MIT SIMILE Timeline.
Keywords: information extraction; temporal extraction; geospatial entity extraction; natural language processing
Received 23 December 2012; revised 11 September 2013
Handling editor: Fionn Murtagh
1. INTRODUCTION
The amount of user-generated content on the Internet increases significantly every day. Whether they are short news articles or large-scale collaborations on Wikipedia, textual documents harbor an abundance of information, explicit or implicit. Relationships between people, places and events within a single text document are sometimes difficult for a reader to extract; meanwhile, the thought of digesting and integrating information from a whole series of lengthy articles is daunting. Consequently, techniques to automatically extract, disambiguate, organize and visualize information from raw text are increasing in importance. In this paper, we are interested in mining documents to identify important events along with the named entities to which they are related. The objective is to create the basis of a system that combines automatically extracted temporal, geospatial and nominal information into a large database in a form that facilitates further processing and visualization.
The Computer Journal, Vol. 57 No. 3, 2014
FIGURE 1. A Wikipedia article highlighted for event references and named entities.
This paper describes a well-crafted combination of state-of-the-art techniques (many of them embodied in off-the-shelf toolkits) into an interesting working system. We believe that the article will be instructive for anyone trying to set up an operational system with the complex functionalities described. Instead of focusing on a subproblem, this paper's focus is on an overall system. The two main foci of this paper are to accurately and automatically acquire various pieces of event-related information and to
visualize the extracted data. Visualization is an important post-process of information extraction as it can offer a level of understanding not inherent in reading the text alone. Visualization is ideal for users who need quick, summary information from an article. Possible domains for obtaining the information used in our process are numerous and varied; however, we have identified two primary sources of rich amounts of reliable information: online news archives and Wikipedia. For the research reported in this paper, we work with Wikipedia articles on historical events because of the availability of high-quality articles in large numbers. Figure 1 shows a portion of a Wikipedia article that has been marked up for event references and named entities. One of the most influential features of the Internet is the ability to easily collaborate in generating information. Wikis are perfect examples. A wiki offers an ideal environment for the sharing of information while allowing for a form of peer review that is not present elsewhere. Wikipedia is only one of many corpora that can be mined for knowledge that can be stored in machine-readable form. A system such as the one we report in this paper has exciting and practical applications. An example would be assistance in researching World War II. One would be able to select a time period, perhaps 1939–1940. Then one would be able to select an area on a map, perhaps Europe. One could choose a source of information, such as the World War II article on Wikipedia and all of the articles linked on that page. The result may be an interactive map and timeline that plays through the time period of interest, showing only events on the timeline that occurred in the selected region. Meanwhile, markers would fade in and out on the map as events begin and end, each one clickable for a user to retrieve more information.
The people and organizations involved in a particular battle could be linked to articles about them. The system described in this paper collects the necessary information; although the visualization is not yet as sophisticated as in this scenario, we describe the basis for such a system here. It should be noted that although we extract information from Wikipedia articles, all information mining in our system is based on raw text, which allows it to work with historical articles, books and memoirs, and even newspaper articles describing currently unfolding events. Thus, it is possible to analyze text on the same topic from different sources, and find the amount of similarity, dissimilarity or inconsistency among the sources at a gross level. The existing visualization can already enable a human to make such distinctions, e.g. when the same event appears at different places on a timeline.
2. RELATED RESEARCH
Our overarching goal is to build a system that extracts, stores and displays important events and named entities from Wikipedia articles on historical events. It should be noted that our methods do not limit the system to this source; the event and named entity extraction works directly from plain text. Research on event extraction has often focused on identifying the most important events in a set of news stories [1–3]. The advantage of this domain is that important information is usually repeated in many different stories, all of which are being examined. This allows algorithms like TF-IDF [4] to be used. In the Wikipedia project, there is only one document per specific topic, so these algorithms cannot be used. There has also been research into classifying events into specific categories and determining their attributes and argument roles [5]. Another issue that has been addressed is event coreference, determining which descriptions of events in a document refer to the same actual event [6, 7].
Even setting aside the problem of automating the process, identifying and representing the temporal relations in a document is a daunting task for humans. Several structures have been proposed, from Allen's interval notation and 13 relations [8], and variations on it using points instead, to a constraint structure for the medical domain in [9]. A recent way to represent the relations is a set of XML tags called TimeML [10]. Software has been developed [11, 12] that identifies events and time expressions, and then generates relations among them. Most events are represented by verbs in the document, but some nouns qualify as well. The link or relation generation is done differently depending on the system, but uses a combination of rule-based components and classifiers.
We are also interested in named entity recognition (NER), i.e. the extraction of words and strings of text within documents that represent discrete concepts, such as names and locations [13]. Successful approaches include Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), Maximum Entropy models, Neural Networks and Support Vector Machines (SVMs). The extraction of named entities continues to invite new methods, tools and publications. SVMs [14] have shown significant promise for the task of NER. Starting with the CoNLL 2002 shared task on NER, SVMs have been applied to this problem domain with improving performance. Bhole et al. [15] report extracting named entities from Wikipedia and relating them in time using SVMs. In the biomedical domain, Takeuchi and Collier [16] have demonstrated that SVMs outperform HMMs for NER. Lee et al. [17] and Habib [18–20] have used SVMs for the multi-class NER problem. Isozaki and Kazawa [21] show that chunking and part-of-speech tagging also provide performance increases for NER with SVMs, both in speed and in accuracy.
Dakka and Cucerzan [22] demonstrate high accuracy using SVMs for locational entities in Wikipedia articles. Mansouri et al. [23] demonstrate excellent performance on NER using what are called fuzzy SVMs. Thus, as attested by many published papers, SVMs provide good performance on NER tasks. HMMs have also shown excellent results on the NER problem. Klein et al. [24] demonstrate that a character-level HMM can identify both English and German named entities fairly well for location entities in testing data. Zhou and Su [25] evaluate an HMM and an HMM-based chunk tagger on English NE tasks, achieving extremely high detection performance. CRFs have been extremely successful in NER [26, 27]. With publicly available CRF NER tools, it is possible to achieve excellent results. Our particular interest is in the extraction of named geospatial entities. Chen et al. [28] perform NER with a 'regularized maximum entropy classifier with Viterbi decoding' and achieve over 88% detection with regard to geospatial entities. Our goal is to achieve high recall during initial extraction. NER precision for geospatial entities can be sacrificed since we subsequently disambiguate and geocode such entities. Disambiguation is necessary since the same name may refer to several geospatial entities in different parts of the world. Geocoding obtains latitude and longitude information for a geospatial entity on the map of the world. If an identified geospatial named entity cannot be geocoded, it is assumed to be mis-identified and discarded. Scharl and Tochtermann [29] use an interesting disambiguation process in their geospatial resolution pipeline. Nearby named entities are used to disambiguate more specific places. An example is a state being used to determine exactly which city is being referenced. In addition, NEs previously extracted in the document are also used to reinforce the score of a particular instance of a place name.
For example, if a particular region in the USA has been previously referenced and a new named entity arises that could be within that region or in Europe, it will be weighted more toward being the instance within that region in the USA. The basic design of our graphical user interface (GUI) is based on the GUI referenced in [28]. Recently, interest in extracting and displaying spatiotemporal information from unstructured documents has been on the rise. For example, Strötgen et al. [30] discuss a prototype system that combines temporal information retrieval and geographic information retrieval; the system also displays the information. Cybulska and Vossen [31, 32] describe a historical information extraction system that extracts event and participant information from Dutch historical archives. They extract information using what they call profiles. For example, they have developed 402 profiles for event extraction, although they use only 22 of them in the reported system. For extraction of participants, they use 314 profiles. They also use 43 temporal profiles and 23 location profiles to extract temporal and locational information. Profiles are created using semantic and syntactic information as well as information gleaned from WordNet [33]. Ligozat et al. [34] present a system that takes a simple natural language description of a historical scene, identifies landmarks, performs matching with a set of choremes, situates the choremes in appropriate locations and displays them. The details of the system are not presented in depth. Choremes [35, 36] are icons that can be used to represent specific spatial configurations and processes in geography. In layman's terms, they are like an elaborate set of graphical signs used on roads and highways to indicate things like a road turning in a certain direction, an upcoming intersection, etc. Ligozat et al.
[37] describe another system that monitors SMS messages sent by individuals describing events taking place in a compact geographical area, discovers relations among these events and reports on the severity of these events. They consider events to be entities with a temporal span and a spatial extent, and describe them using a representational scheme that allows for both [38]. They display events of interest in an animated manner using the approach described in [34]. Events are represented schematically, but can have texts, sounds and lights associated with them. The most interesting part of their system is the representation of space and time. Time is represented using Allen's interval calculus [8]. Space is quantized into rectangles, and the relationships among these rectangles are expressed using a calculus that mimics the relations used in the representation of time, ensuring that both time and space are treated in a similar manner. Computer Science literature is by no means the only place where aspects of the work discussed in this paper are pursued. Geographical Information Systems (GISs) are one such area. A GIS is a system designed to capture, store, manipulate, analyze, manage and present geographic information. Traditionally, such information has been used in cartography, resource management and scientific work, and increasingly with mobile computing platforms like cell phones and tablets for mapping, location-aware searches and advertisements. Books by Rigaux et al. [39] and Clarke [40] present the main theoretical concepts behind these systems, namely spatial data models, algorithms and indexing methods.
As in many fields, with the tremendous increase in the amount of geographic information available in recent years, techniques from statistical analysis, data mining, information retrieval and machine learning are becoming tools in the quest to discover useful patterns and nuggets of geographic knowledge and information. Development and use of geographic ontologies to facilitate geographic information retrieval is an active area of research [41–44]. Fu et al. [45] describe the use of an ontology for deriving the spatial footprints of a query, focusing on queries that involve spatial relations. Martins [46] describes issues in the development of a location-aware Web search engine. He creates a geographic inference graph of the places in the document and uses the PageRank algorithm [47] and the HITS algorithm [48] to associate the main geographical contexts/scopes for a document. Martins [46] explores the idea of returning documents in response to queries that may be geographically interesting to the individual asking the query. This means queries must be assigned geographic scopes as well, and the geographic scope of the query should be matched with the geographic scopes of the documents returned. He also discusses ranking search results according to geographical significance.
3. OUTLINE OF OUR WORK
The work reported in this paper is carried out in several phases. First, we extract all events from a document. Next, we identify significant events in the text. We follow by identifying associated named entities in the text. We process the geospatial entities further for disambiguation and geolocation. Finally, we create a display with maps and timelines to represent the information extracted from the documents. The activities we carry out in the system reported in this paper are listed below.
(i) Extract events from texts.
(ii) Identify important events from among the events extracted.
(iii) Extract temporal relations among identified important events.
(iv) Extract named entities in the texts, including those associated with important events.
(v) Disambiguate locational named entities and geocode them.
(vi) Create a display with a map and timeline showing relationships among important events.
We discuss each one of these topics in detail in the rest of the paper.
FIGURE 2. Overall architecture.
Figure 2 shows the overall processing architecture and segments of the system, including the visualization subsystem. The operation of the entire system is based on an initial NER pass over the document to extract PER (people), LOC (location) and ORG (organization) named entities. Temporal information, document structure and NE information are used to identify important events. In parallel, the LOC named entities are geocoded and grounded to a single specific location. If a LOC entity is ambiguous, document information is used to ground it to the correct geospatial location. Once the LOC and event processing is complete, all of the information is stored in a database. We have also developed a map and timeline front-end that allows users to query documents and visualize the geospatial and temporal information on a map and timeline.
4. EXTRACTING EVENTS FROM TEXTS
Automatically identifying events in a textual document enables a higher level of understanding of the document than identifying named entities alone. The usual approach to identifying events in a text is to provide a set of relations along with associated patterns that produce their natural language realization, possibly in a domain-dependent manner. For example, [49–51] are able to provide answers to queries about individual facts or factoid questions using relation patterns. Others [52, 53] take a deeper approach that converts both the query text and the text of candidates for answers into a logical form and then use an inference engine to choose the answer text from a set of candidates.
Still others [54] use linguistic contexts to identify events to answer factoid questions. For non-factoid questions, where there are usually no unique answers and which may require dealing with several events, pattern matching does not work well [55, 56].
FIGURE 3. Portion of an article with events tagged by EVITA (words followed by e_i). We show only the event tags here.
FIGURE 4. A portion of the XML representation of Fig. 3.
Evita [57] does not use any pre-defined patterns to identify the language of event occurrence. It is also not restricted to any domain. Evita combines linguistic and statistical knowledge for event recognition. It is able to identify grammatical information associated with an event-referring expression, such as tense, aspect, polarity and modality. Linguistic knowledge is used to identify local contexts such as verbal phrases and to extract morphological information. The tool has a pre-processing stage and uses finite-state machine-based and Bayesian-based techniques. We run the Evita program1 on the article, which labels all possible events in the TimeML format. Evita recognizes events by first preprocessing the document to tag parts of speech and chunk it, then examining each verb, noun and adjective for linguistic (using rules) or lexical (using statistical methods) properties that indicate it is an event. Instances of events also have further properties that help establish temporal relations in later processing steps; these properties include tense and modality. Specifically, Evita places XML 'EVENT' tags around single words denoting events; these may be verbs or nouns, as seen in Figs 1, 3 and 4. Each event has a class; in this case, 'fought' is an occurrence. Other classes include states and reporting events. For our purposes, occurrences will be most important because those are the kinds of events generally shown on timelines.
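As a concrete illustration of this output format, the 'EVENT' tags can be pulled out of a TimeML-style fragment with a few lines of code (a minimal sketch; the fragment below is illustrative, not verbatim Evita output):

```python
import xml.etree.ElementTree as ET

# Illustrative TimeML-style fragment (not verbatim Evita output).
fragment = (
    '<s>The battle was '
    '<EVENT eid="e1" class="OCCURRENCE" tense="PAST">fought</EVENT> '
    'near the town, a <EVENT eid="e2" class="STATE">stalemate</EVENT>.</s>'
)

def extract_events(xml_text):
    """Return (eid, class, word) triples for every EVENT tag."""
    root = ET.fromstring(xml_text)
    return [(e.get("eid"), e.get("class"), e.text) for e in root.iter("EVENT")]

events = extract_events(fragment)
# Keep only occurrences, the class most relevant for timelines.
occurrences = [word for eid, cls, word in events if cls == "OCCURRENCE"]
```

A downstream component would keep the occurrences and discard states and reporting events before plotting anything on a timeline.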
For event extraction, there were a few other alternatives, although we found that for our task Evita's results were comparable; Evita is also easy to download and use. The other systems include TRIPS-TRIOS [58, 59] and TIPSem [60]. TRIPS-TRIOS parses the raw text with a parser that produces a deep logical form (LF) from text and then extracts events by matching semantic patterns, using about 100 hand-coded extraction patterns that match against parsed phrases. TIPSem uses CRFs [26, 61].
1 The entire TARSQI Toolkit is freely downloadable upon email request; see http://timeml.org/site/tarsqi/toolkit/download.html.
5. IDENTIFYING IMPORTANT EVENTS
When storing events in our database, we currently associate an event with its containing sentence, saving the entire line. Each event/sentence is classified as important or not using a classifier trained with a set of mostly word-level and sentence-level features. We experimented with several SVM-based classifiers to identify important events/sentences. We also experimented with choices of features for the classifiers. Several features are concerned with characteristics of the event word in the sentence (part of speech, grammatical aspect, distance from a named entity, etc.) or the word by itself (length). Other features relate to the sentence as a whole (for example, presence of negation and presence of digits). The features are based on practical intuition. For example, a low distance from a named entity, especially one that is mentioned often, probably indicates importance because a reader cares about events that the named entity is involved in; this stands in contrast to more generic or abstract nouns, which generally suggest less importance. For example, consider the sentence from the Wikipedia article on the Battle of Gettysburg: 'Lee led his army through the Shenandoah Valley to begin his second invasion of the North'.
This sentence describes an important event, and Lee is a named entity that occurs often in this article. Intuitively speaking, negation is usually not associated with importance since we care about what happened, not what did not. Grammatical aspect, in particular the progressive, seems less important because it tends to pertain to a state of being, not a clearly defined event. For example, consider another sentence from the Battle of Gettysburg article: 'The 26th North Carolina (the largest regiment in the army with 839 men) lost heavily, leaving the first day's fight with around 212 men'. This has 'leaving' in the progressive aspect, and the sentence does not really depict an event so much as a state. Longer words are sometimes more descriptive and thus more important, while nouns and verbs are usually more important than adjectives. Digits are also a key indicator of importance as they are often part of a date or statistic. Similar features have been used by other researchers when searching for events or important events or in similar tasks [5, 58, 59, 62]. We also experimented with features that relate to the article as a whole, such as the position of a sentence in the document, although we did not find these as useful. We use the similarity of the event word to article 'keywords'. These keywords are taken as the first noun and verb or first two nouns of the first sentence of the article. These were chosen because the first sentence of these historical narratives often sums up the main idea and will therefore often contain important words. In the articles of our corpus, words such as 'war', 'conflict' and 'fought' are often used to specify salient topics. An event word's similarity to one of these words may have a bearing on its importance. The decision to use two keywords helps in case one of the words is not a good keyword; only two are used because computing similarity to a keyword is expensive in time.
Similarity is measured using the 'vector pairs' measure from the WordNet::Similarity Perl module [63]. As explained in the documentation for this Perl module,2 it calculates the similarity of the vectors representing the glosses of the words to be compared. This measure was chosen because it is one of the few that can calculate similarity between words from different parts of speech. Given two word senses, the module looks at their WordNet glosses and computes the amount of similarity. It implements several semantic relatedness measures, including the ones described by Leacock and Chodorow [64], Jiang and Conrath [65], Resnik [66, 67], Hirst and St-Onge [68], Wu and Palmer [69], and the extended gloss overlap measure by Banerjee and Pedersen [70]. After initial experimentation, we changed some word-level features to sentence-level features by adding, averaging or taking the maximum of the word-level features for each event in the sentence. Another feature added was based on TextRank [71], an algorithm developed by Mihalcea and Tarau that ranks sentences based on importance and is used in text summarization. TextRank works using the PageRank algorithm developed by Page et al. [47]. While PageRank works on web pages and the links among them, TextRank treats each sentence like a page, creating a weighted, undirected graph whose nodes are the document's sentences. Edge weights between nodes are determined by a function of how similar the sentences are. After the graph is created, PageRank is run on it to rank the nodes in order of, essentially, how much weight points at them (taking into account incident edges and the ranks of the nodes on the other sides of these edges). Thus, sentences that are in some way most similar to most other sentences get ranked highest. We wrote our own implementation of TextRank with our own function for similarity between two sentences.
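A minimal sketch of this procedure, using a generic word-overlap measure as a stand-in for our actual similarity function, might look like the following:

```python
import re

def similarity(s1, s2):
    # Stand-in edge weight: word overlap normalized by the two sentence
    # lengths (our deployed function differs in detail).
    w1 = set(re.findall(r"\w+", s1.lower()))
    w2 = set(re.findall(r"\w+", s2.lower()))
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

def textrank(sentences, d=0.85, iters=50):
    # Build a weighted, undirected sentence graph and run PageRank on it.
    n = len(sentences)
    W = [[similarity(a, b) for b in sentences] for a in sentences]
    for i in range(n):
        W[i][i] = 0.0                      # no self-loops
    out = [sum(row) for row in W]          # total edge weight per node
    ranks = [1.0 / n] * n
    for _ in range(iters):
        ranks = [(1 - d) / n
                 + d * sum(W[j][i] / out[j] * ranks[j]
                           for j in range(n) if out[j])
                 for i in range(n)]
    return ranks

sents = [
    "Lee led his army through the valley to begin the invasion",
    "The army began the invasion of the North",
    "Bananas are yellow",
]
ranks = textrank(sents)
```

The two mutually similar sentences reinforce each other and end up ranked above the unrelated one.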
Our function automatically gives a weight of essentially 0 if either sentence is shorter than a threshold of 10 words, which we chose based on the average sentence length in our corpora. For all others, it calculates the distance between the sentences using a simple approach. Each sentence is treated as a bag of words. The words are stemmed. Then, we simply compute how many words are the same between the two sentences. The similarity is then chosen as the sum of the sentence lengths divided by their distance.
2 http://search.cpan.org/dist/WordNet-Similarity/doc/intro.pod.
TABLE 1. A final list of features for classifying events as important or not.
Important event classifier features:
Presence of an event in the perfective aspect
Percent of events in the sentence with class 'occurrence'
Digit presence
Maximum length of any event word in the sentence
Sum of the Named Entity 'weights' in the sentence (NE weight being the number of times this NE was mentioned in the article divided by the number of all NE mentions in the article)
Negation presence in the sentence
Number of events in the sentence
Percent of events in the sentence that are verbs
Position of the sentence in the article normalized by the number of sentences in the article
Maximum similarity of any event word in the sentence
Percent of events in the sentence that are in some past tense
TextRank rank of the sentence in the article, divided by the number of sentences in the article
Number of 'to be' verbs in the sentence
Named entities are also given more importance in the process than before. Instead of just asking if the sentence contains one, some measure of the importance of the named entity is calculated and taken into account for the feature. This is done by counting the number of times the named entity is mentioned in the article (people being equal if their last names are equal, and places being equal if their most specific parts are equal).
This total is then normalized by the number of named entities in the article. The final set of features used can be seen in Table 1.
6. IDENTIFYING TEMPORAL RELATIONS
We experimented with three different ways to identify temporal relations among events in a document. The first approach used the TARSQI Toolkit to identify temporal relations among events. The second approach was the development of a large number of regular expressions. We also downloaded HeidelTime [72] and integrated it with our system. HeidelTime is also a regular expression-based system, written in a modular manner. Since the results of our regular expressions were comparable to HeidelTime's (although our expressions were a little messier) and we had developed and integrated our regular expression module before HeidelTime became available for download, we decided to keep our module. However, we can switch to HeidelTime if the need arises.
6.1. Using TARSQI to extract occurrence times
We use the TARSQI Toolkit (TTK) to create event–event and event–time links (called TLINKs), in addition to identifying events and time expressions in a document. The time expressions include attributes like type (DATE, TIME, etc.) and value. Some events are anchored to time expressions, and some to other events. These anchorings are represented by the TLINK XML tag, which has attributes including the type of relation. An example of each kind of TLINK is shown in Fig. 5. The first represents that the event with ID 1 is related to the time with ID 3 by 'before'; the second also represents 'before', between events 5 and 6.
FIGURE 5. Example TLINK elements.
6.2. Using regular expressions to extract occurrence times
Our second approach applied regular expressions to extract times. The results of using extensive regular expressions versus using the TTK showed that the regular expressions pull out many more complete dates and times.
For example, in the text given in Figs 1 and 3, TTK only finds 1862, while the regular expressions would find a range from 11 December 1862 to 15 December 1862. Because of this, we decided to use our own program based on regular expressions rather than TTK. The regular expressions in our program are built from smaller ones representing months, years, seasons, and so on. We use 24 total combinations of smaller expressions, some of which recursively test the groups they match for subexpressions. One example of an expression is
(early |late |mid-|mid |middle |the end of |the middle of |the beginning of |the start of )?(Winter|Spring|Summer|Fall|Autumn) (of )?(((([12][0-9])|([1-9]))[0-9][0-9])|('?[0-9][0-9]))
which finds expressions like '[By] early fall 1862 [. . .]', a type of phrase that occurs in the corpus. A flaw present in our program and not in TTK is that it picks out only one time expression (point or interval) per sentence. This is consistent with our current view of events as sentences, although it would arguably be better to duplicate the sentence's presence on a timeline while capturing all time expressions present in it. To anchor time expressions to real times—specifically to a year—we use a naïve algorithm that chooses the previous anchored time's year. We heuristically choose the most probable year out of the previous anchored time's year and the two adjacent to it by looking at the difference in the month, if available. For example, a month that is more than 5 months after the month of the anchoring time is statistically in the next year rather than the same year (with 'The first attack occurred in December 1941. In February, the country had been invaded', it is probably February 1942). In addition to the year, if the time needing to be anchored lacks more fields, we fill them with as many corresponding fields from the anchoring time as possible.
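The year-selection heuristic can be sketched as follows (a hedged illustration; the deployed rules may differ in detail):

```python
def anchor_year(month, prev_year, prev_month):
    # Candidate years: the previous anchored time's year and its two
    # neighbors; pick the one that minimizes the distance in months
    # from the anchoring time.
    candidates = (prev_year - 1, prev_year, prev_year + 1)
    return min(candidates,
               key=lambda y: abs((y - prev_year) * 12 + (month - prev_month)))

# 'The first attack occurred in December 1941. In February, ...'
print(anchor_year(2, 1941, 12))   # 1942
print(anchor_year(10, 1941, 12))  # 1941
```

With December 1941 as the anchor, a bare 'February' lands in 1942, while a bare 'October' stays in 1941, matching the intuition above.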
Each time expression extracted is considered to have a beginning and an end, at the granularity of days (though we do extract times when they are present). Thus the expression '7 December 1941' has the same start and end point, while the expression 'December 1941' is considered to start on December 1 and end on December 31. Similarly, modifiers like 'early' and 'late' change this interval according to empirical evidence; for example, 'early December' corresponds to December 1 to December 9. While these endpoints are irrelevant to a sentence like 'The Japanese planned many invasions in December 1941', exact start and end points are necessary to plot a time on a timeline. Thus, while the exact start and end days are often arbitrary for meaning, they are chosen by the extractor.

Some expressions cannot be given a start or end point at all. For example, 'The battle began at 6:00 AM' tells us that the start point of the event is 6:00 AM but says nothing about the end point. Any expression like this takes on the start or end point of the interval for the entire event the article describes (for example, the Gulf War). This interval is found by choosing the first beginning time point and first ending time point that are anchored directly from the text, which is likely to yield the correct span. While this method is reasonable for some expressions, many of them have an implicit end point somewhere else in the text, rather than stretching until the end of the article's event. It would likely be better to instead follow the method we use for giving times to sentences with no times at all, described below. After initial extraction of time expressions and finding a tentative article span, we discard times that are far from either end point, currently using 100 years as a cutoff margin.
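The interval expansion described above can be sketched as follows; the day boundary for 'early' matches the example in the text, while the 'late' boundary is an illustrative assumption (the paper only gives the 'early' case):

```python
from datetime import date, timedelta

def expand_month(year, month, modifier=None):
    """Expand a month-granularity expression into a (start, end) day interval.

    'December 1941' -> Dec 1 .. Dec 31; 'early December 1941' -> Dec 1 .. Dec 9.
    The modifier boundaries are empirical choices; 'late' here is assumed.
    """
    # last day of the month: step to the 1st of the next month, minus one day
    if month == 12:
        last = date(year, 12, 31)
    else:
        last = date(year, month + 1, 1) - timedelta(days=1)
    start, end = date(year, month, 1), last
    if modifier == "early":
        end = date(year, month, 9)
    elif modifier == "late":
        start = date(year, month, 21)  # assumed boundary, not stated in the text
    return start, end

print(expand_month(1941, 12))           # December 1941 -> whole month
print(expand_month(1941, 12, "early"))  # early December -> Dec 1 to Dec 9
```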
This helps avoid times that are obviously irrelevant to the event, as well as expressions that are not actually times but look like they could be (for example, 'The bill was voted down 166–269', which looks like a year range). We also discard expressions with an earlier end date than start date, which helps avoid the latter problem.

Despite the thoroughness of the patterns we look for in the text, the majority of sentences still have no times at all. However, they may still be deemed important by the other part of our work, and so must be able to be displayed on a timeline. Here, we exploit the characteristic of historical descriptions that events are generally mentioned in the order in which they occurred. For a sentence with no explicit time expression, the closest (textwise) sentence on either side that does have a time expression is found, and the start times of those sentences are used as the start and end time of the unknown sentence. There are often many unknown sentences in a row; each one's position in this sequence is tracked so that they can be plotted in this order, despite having no specific date information outside the interval, which is the same for them all.

FIGURE 6. Flow of time in a few sentences from the Battle of Gettysburg article.

For example, the first sentence of the passage in Fig. 6 sets a start point, June 9, for the subsequent events. The year for this expression would be calculated using a previously found time expression, possibly the first one in the article. There are various sentences in the passage following this one that do not contain time expressions, some of which may be considered important enough to place on the timeline, for example, 'Alfred Pleasonton's combined arms force of two cavalry divisions (8000 troopers) and 3000 infantry, but Stuart eventually repulsed the Union attack'.
In context, a human can tell that this also takes place on June 9; the system places it (correctly) between June 9 (the previously found expression) and June 15, the date that appears next. Sometimes the start and end times we get in this manner are invalid because the end time is earlier than the start time. In this case, we look for the next possible end time (the start time of the next closest sentence with a time expression). If nothing can be found, we use a default end point. This default end point is chosen as the last time in the middle N of the sorted times, where N is some fraction specified in the program (currently 1/3). We do this because times tend to be sparse around the ends of the list of sorted times, since there are often just a few mentions of causes or effects of the article's topic.

7. EXTRACTING NAMED GEOSPATIAL AND OTHER NAMED ENTITIES

The next goal is to extract all geospatial named entities from the documents. In particular, we are interested in extracting NEs that coincide with the important events or sentences identified in the documents. We experimented with three approaches to extracting geospatial named entities from the texts of documents. The first approach uses an SVM trained to extract geospatial named entities. The second approach uses an HMM-based named entity recognizer. The third approach uses a CRF-based (Conditional Random Field) named entity recognizer. A by-product of the second and third approaches is that we obtain non-geospatial named entities as well. The reason for experimenting with the three methods is that we wanted to evaluate whether training an SVM on just geospatial entities would work better than off-the-shelf HMM or CRF tools. Precision, recall and F-measure are standard metrics used to evaluate the performance of a classifier, especially in natural language processing and information retrieval.
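The neighbor-based assignment of times to undated sentences, described above, can be sketched as follows (sentences are modeled simply as optional start times; names are illustrative):

```python
def assign_times(sentences):
    """For each sentence without a time, borrow the start times of the
    closest preceding and following sentences that have one.

    `sentences` is a list of start times (None where no time expression
    was extracted); returns a list of (start, end) pairs. List position
    preserves the original order, which is how ties are plotted.
    """
    n = len(sentences)
    result = []
    for i, t in enumerate(sentences):
        if t is not None:
            result.append((t, t))
            continue
        prev = next((sentences[j] for j in range(i - 1, -1, -1)
                     if sentences[j] is not None), None)
        nxt = next((sentences[j] for j in range(i + 1, n)
                    if sentences[j] is not None), None)
        result.append((prev, nxt))
    return result

# Two undated sentences between June 9 and June 15 both get that interval.
print(assign_times([9, None, None, 15]))  # [(9, 9), (9, 15), (9, 15), (15, 15)]
```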
After a trained classifier has been used to classify a set of objects, precision is the fraction of objects assigned to a class that actually belong to that class. Recall is the fraction of objects that belong to a specific class in the test dataset and are correctly classified by the classifier. When there are several classes under consideration, precision and recall can be obtained for each class and averaged. Precision and recall often trade off against each other. The F-measure combines the two as the harmonic mean of precision and recall [73, 74]. If only one accuracy metric is to be used as an evaluation criterion, the F-measure is usually preferred.

7.1. SVM-based extraction of geospatial entities

This effort is discussed in detail in [75–77]. An SVM [14, 78, 79] is an example of a type of supervised learning algorithm called a classifier. A classifier finds the boundary function(s) between instances of two or more classes. A binary SVM works with two classes, but a multi-class SVM can learn the discriminant (separator) functions for several classes at the same time. A binary SVM attempts to discover a wide-margin separator between two classes. To learn this separator, the SVM must be presented with examples of known instances from the two classes. Shown a number of such known instances, the SVM formulates classification as a complex optimization problem. To train an SVM, each instance must be described in terms of a number of attributes or features. A multi-class SVM is a generalization of the binary SVM.

The desired output of this stage of processing is a set of words and strings that may be place names. To define the features describing the geospatial NEs, we draw primarily from the work done on word-based feature selection by [18–20].
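For concreteness, the three metrics for a single class can be computed from true positives, false positives and false negatives (a standard sketch, not code from the paper):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure for one class.

    tp: items correctly assigned to the class;
    fp: items wrongly assigned to it;
    fn: class members the classifier missed.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f

# e.g. 8 correctly extracted place names, 2 spurious ones, 8 missed:
print(prf(8, 2, 8))  # precision 0.8, recall 0.5, F = 8/13 ~ 0.615
```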
They used lexical, word-shape and other language-independent features to train an SVM for the extraction of NEs. Furthermore, the work done by Cucerzan et al. [80] on extracting NEs from Wikipedia provided further features. For feature set selection, we postulate that each word in the geospatial NE should be independent, and that the training corpus should be split into individual words. A word can be labeled as the beginning, middle or end of a named entity, for multi-word named entities. After testing various combinations of features, we used the following features for each word to generate the feature vectors for the SVM, with equal weighting on all features:

(i) Word length: alphanumeric character count, ignoring punctuation marks.
(ii) Vowel count, consonant count and capital count in the word.
(iii) First capital: whether the word begins with a capital letter.
(iv) Whether the word contains two identical consonants together.
(v) Whether the word ends with a comma, a period, a hyphen, a semi-colon or a colon.
(vi) Whether the word begins or ends with a quote.
(vii) Whether the word begins or ends with a parenthesis.
(viii) Whether the word is plural, or a single character.
(ix) Whether the word is alphanumeric, purely numeric, purely text, all uppercase or all lowercase.
(x) Whether the word has a suffix or prefix.
(xi) Whether the word is a preposition, an article, a pronoun, a conjunction, an interrogative or an auxiliary verb.

7.3. HMM-based approach using LingPipe

HMMs have shown excellent results in the task of NER. An HMM [81] is also a type of classifier that can learn to discriminate between new and unknown instances of two (or more) classes, once it has been trained with known and labeled examples of the classes.
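A few of the word-shape features in the list above can be sketched as follows (the exact encodings and weightings used in the paper are not specified, so this is purely illustrative):

```python
VOWELS = set("aeiouAEIOU")

def word_features(word):
    """A subset of the word-shape features listed above, as a dict."""
    alnum = [c for c in word if c.isalnum()]
    return {
        "length": len(alnum),                          # (i)  ignores punctuation
        "vowels": sum(c in VOWELS for c in word),      # (ii)
        "capitals": sum(c.isupper() for c in word),    # (ii)
        "first_capital": word[:1].isupper(),           # (iii)
        "double_consonant": any(                       # (iv)
            a == b and a.isalpha() and a not in VOWELS
            for a, b in zip(word, word[1:])),
        "ends_punct": word[-1:] in {",", ".", "-", ";", ":"},  # (v)
        "all_upper": word.isupper(),                   # (ix)
    }

print(word_features("Gettysburg,"))
```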
Unlike an SVM, an HMM is a sequence classifier: it classifies individual objects into classes, but in doing so, it takes into account characteristics of the objects or entities that occur before and after the entity currently being classified. An HMM is a graphical model of learning that views a sequence as a set of nodes, with transitions among them. Transitions have probabilities on them. When the system transitions into a state, it is assumed to emit certain symbols with associated probabilities. Learning involves finding a sequence of 'hidden' states that best explains a certain emitted sequence. The idea of sequence classification is useful in many natural language processing tasks because the classification of an instance (e.g. a word as describing an event or not) may depend on the nature of the words that occur before or after it in a sentence. An SVM can also take into account what comes before and after the object being classified, but the idea of sequence has to be handled in an indirect manner. One must note that sequencing does not always make sense, but in some situations it is natural and important. Klein et al. [24] and Zhou and Su [25] have demonstrated excellent performance in extracting named entities using HMMs. We chose to use the HMM implemented by the LingPipe library for NER.3

3 http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html.

7.4. CRF-based approach

We also used the named entity extraction tool designed by the Stanford Natural Language Processing Group.4 This package uses a CRF for NER, which achieved very high performance for all named entities when tested on the CoNLL 2003 test data [82]. The Stanford Web site states that the CRF was trained on 'data from CoNLL, MUC6, MUC7 and ACE'.5 These datasets originate from news articles from the USA and the UK, making it a strong model for most English domains. CRFs [26, 61] provide a supervised learning framework to build statistical models using sequence data.
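As a toy illustration of such a sequence model (a generic linear-chain sketch, not the Stanford implementation), a label sequence receives an unnormalized exponential score from weighted feature functions over adjacent label pairs, and normalizing over all sequences yields a distribution:

```python
import math
from itertools import product

def score(labels, words, weights):
    """Unnormalized linear-chain score: exp of summed feature weights.

    `weights` maps (prev_label, label, word) triples to reals; unseen
    triples contribute 0. A toy stand-in for real feature templates.
    """
    s, prev = 0.0, "<s>"
    for y, w in zip(labels, words):
        s += weights.get((prev, y, w), 0.0)
        prev = y
    return math.exp(s)

def normalize(words, weights, labelset=("O", "LOC")):
    """P(y|x) for every label sequence, by brute-force enumeration."""
    seqs = list(product(labelset, repeat=len(words)))
    raw = {y: score(y, words, weights) for y in seqs}
    z = sum(raw.values())
    return {y: v / z for y, v in raw.items()}

w = {("<s>", "O", "near"): 1.0, ("O", "LOC", "Gettysburg"): 2.0}
dist = normalize(["near", "Gettysburg"], w)
best = max(dist, key=dist.get)
print(best)  # ('O', 'LOC')
```

Real CRF toolkits avoid the brute-force enumeration by using dynamic programming over the chain, but the log-linear form is the same.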
A CRF is an undirected graphical model that defines a single log-linear distribution over label sequences, given a particular observation sequence. HMMs make certain strong independence assumptions among neighboring samples; CRFs relax these assumptions. CRFs allow encoding of known relationships among observations in terms of likelihoods and construct consistent interpretations. The linear-chain CRF is popular in natural language processing for predicting a sequence of labels for a sequence of input objects.

8. GROUNDING GEOSPATIAL NAMED ENTITIES

Each of the three learning-based methods extracts a set of candidate geospatial NEs from the article text. For each candidate string, the second objective is to decide whether it is a geospatial NE, and to determine the correct (latitude, longitude), or (φ, λ), coordinate pair for the place name in the context of the article, which is often referred to as grounding the named entity. To resolve the candidate NE, a lookup is made using Google Geocoder.6,7 If the entity reference resolves to a single geospatial location, no further action is required. Otherwise, the context of the place name in the article, a novel data structure and a rule-driven algorithm are used to decide the correct spatial location for the place name. Our research targeted the so-called context locations, defined as 'the geographic location that the content of a web resource describes' [83]. Our task is close to that of word sense disambiguation, as defined by Cucerzan [80], except that we consider the geospatial context and domain instead of the lexical context and domain. Sehgal et al. [84] demonstrate good results for geospatial entity resolution using both spatial (coordinate) and non-spatial (lexical) features of the geospatial entities. Zong et al.
[85] demonstrate a rule-based method for place name assignment, achieving a precision of 88.6% on disambiguating place names in the United States, from the Digital Library for Earth System Education (DLESE) metadata. While related to the work done by Martins et al. [86], we apply a learning-based method for the initial NER task, and only use the geocoder/gazetteer for lookups after the initial candidate geospatial NEs have been identified. We also draw on our earlier work in this area, covered in [77].

4 http://nlp.stanford.edu/.
5 Stanford CRF README.txt.
6 http://code.google.com/apis/maps/documentation/geocoding/index.html.
7 Google Geocoder was chosen because of the accessibility and accuracy of its data, as this research does not focus on the creation of a geocoder.

The primary challenge to the resolution and disambiguation is that multiple names can refer to the same place, and multiple places can have the same name. The statistics for these place name and reference overlaps are given in Table 2, from [87]. The table demonstrates that the largest area with ambiguous place names is North and Central America, the prime area for our research. Most of the name overlaps are in city names, not in state/province names. These statistics are further supported by the location name breakdowns for a few example Wikipedia articles in Table 3. The details in this table are broken down into countries, states, foreign cities, foreign features, US cities and US features, where 'features' are any non-state, non-city geospatial features, such as rivers, hills and ridges. The performance of the NER extraction and grounding system is indicated. The Battle of Gettysburg, with over 50% US city names, many of which overlap, resulted in the lowest performance. The Liberty Incident in the Mediterranean Sea, with <10% US city and state names combined, showed significantly better performance.
TABLE 2. Places with multiple names and names applied to more than one place in the Getty Thesaurus of Geographic Names.

Continent                   % places with multiple names   % names with multiple places
North and Central America   11.5                           57.1
Oceania                      6.9                           29.2
South America               11.6                           25.0
Asia                        32.7                           20.3
Africa                      27.0                           18.2
Europe                      18.2                           16.6

8.1. Google Geocoder

We used Google Geocoder as the gazetteer and geocoder for simplicity, as much research has already been done in this area. Vestavik [88] and D'Roza and Bilchev [89] provide an excellent overview of existing technology in this area. Google Geocoder provides a simple REST-based interface that can be queried over HTTP, returning data in a variety of formats. For each geospatial NE string submitted as a query, Google Geocoder returns zero or more placemarks as a result. Each placemark corresponds to a single (φ, λ) coordinate pair. If the address string is unknown, or another error occurs, the geocoder returns no placemarks. If the query resolves to a single location, Google Geocoder returns a single placemark. If the query string is ambiguous, it may return more than one placemark. A country code can also be passed as part of the request parameters to bias the geocoder toward a specific country. Google Geocoder returns locations at roughly four different levels, Country, State, City and Street/Feature, depending on the address query string and the area of the world. In our system, the address string returned from Google Geocoder is separated into these four parts, and each part is stored separately with the coordinates. The coordinates are also checked against the existing set of locations in the database for uniqueness. For our system, coordinates had to be more than 1/10th of a mile apart to be considered a unique location.
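The 1/10th-mile uniqueness check implies a distance test between candidate coordinates. A typical sketch uses the standard haversine formula (the threshold value is from the text; the formula itself is standard and not described in the paper):

```python
import math

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points (haversine)."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_unique(coord, known, threshold_miles=0.1):
    """A new coordinate is unique if farther than the threshold from all known ones."""
    return all(miles_between(*coord, *k) > threshold_miles for k in known)

gettysburg = (39.8309, -77.2311)
print(is_unique((39.8310, -77.2312), [gettysburg]))  # a few dozen feet away -> False
print(is_unique((40.4406, -79.9959), [gettysburg]))  # Pittsburgh -> True
```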
TABLE 3. Details on place name categories for example Wikipedia articles.

Article name          Foreign features  US features  Foreign cities  USA cities  States  Countries  F-measure
Battle of Gettysburg  0                 31           1               51          15      2          68.2
War of 1812           3                 18           5               30          8       36         81.6
World War II          20                0            17              1           0       62         87.0
Liberty Incident      25                1            10              6           1       57         90.5

All numbers are percentages.

FIGURE 7. Example location tree showing duplicated node count.

8.2. Location tree data structure

The output from the Google Geocoder is fed through our location tree data structure, which, along with an algorithm driven by a simple set of rules, orders the locations by geospatial division and aids in the disambiguation of any ambiguous place name. The location tree has four levels of hierarchy: Country, State/Province, City and Street/Local Geographic Feature. Each placemark response from the geocoder corresponds to one of these geospatial divisions. A separate set of rules governs the insertion of a location at each level of the tree. For simple additions to the tree, the rules are given below.

(1) If the location is a country, add it to the tree.
(2) If the location is a state or province, add it to the tree.
(3) If the location is a city in the USA, and the parent node in the tree is a state, add it to the tree.
(4) If the location is a city outside the USA, add it to the tree.
(5) If the location is a street or feature inside the USA, and the parent node is a state or city, add it to the tree.
(6) If the location is a street or feature outside the USA, and the parent node is either a country or a city, add it to the tree.

Rules 3 and 5 are US-specific because Google Geocoder has much more detailed place information for the US than for other areas of the world, requiring extra processing. Any location that does not match any rule is placed on a pending list.
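The six insertion rules can be condensed into a single predicate (hypothetical level/parent encoding; the paper's tree implementation is not shown):

```python
def may_insert(level, country, parent_level):
    """Return True if a location may be added under the given parent.

    level: 'country', 'state', 'city' or 'feature'; country: e.g. 'US';
    parent_level: level of the prospective parent node, or None.
    Mirrors rules (1)-(6); non-matching locations go to a pending list.
    """
    if level in ("country", "state"):               # rules 1 and 2
        return True
    if level == "city":
        if country == "US":                         # rule 3
            return parent_level == "state"
        return True                                 # rule 4
    if level == "feature":
        if country == "US":                         # rule 5
            return parent_level in ("state", "city")
        return parent_level in ("country", "city")  # rule 6
    return False

print(may_insert("city", "US", "state"))    # True
print(may_insert("city", "US", "country"))  # False -> pending list
print(may_insert("feature", "UK", "city"))  # True
```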
Each time a new location is added, the tree is re-sorted to ensure that the correct hierarchy is maintained, and each location on the pending list is rechecked to see if it now matches the rules for insertion. If the location returned from Google Geocoder has a single placemark and matches the rules, it is placed in the tree. If the location returned corresponds to multiple coordinates, another set of calculations is required before running the insertion rule set. First, any ambiguous locations are placed on the pending list until all other locations are processed. This ensures that we have the most complete context possible, and that all non-ambiguous locations are in the tree before the ambiguous locations are processed. Once we finish processing all the non-ambiguous locations in the candidate list, we have a location tree and a set of pending locations, some of which could fit more than one place on the tree. As each node is inserted into the tree, a running count of nodes and repeated nodes is kept. Using this information, when each ambiguous location is considered for insertion, a weight is placed on each possible node in the tree that could serve as the parent node for the location. As an example, we consider the location 'Cambridge'. This could be either Cambridge, MA, USA, or Cambridge, UK. The tree shown in Fig. 7 demonstrates a pair of possible parent leaf nodes for 'Cambridge', with the occurrence count for each node in the article text. The location 'Cambridge' can be added underneath either leaf node in the tree. The weight calculation shown in Equation (1) determines the correct node. The weight for node n is determined by summing the insertion counts for all the parent nodes of n in the tree (up to three parents), and dividing by the total insertion count for the tree. A parent node's count is set to zero if the tree does not contain that node.
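The weight computation just described, using the counts from the Cambridge example in the text (3 for the UK node; 2 each for the Massachusetts and USA nodes; 7 insertions in the whole tree), can be sketched as:

```python
def parent_weight(parent_counts, total_count):
    """Weight of a candidate parent node: summed insertion counts of its
    country/state/city ancestors (missing levels count 0), over the tree total."""
    return sum(parent_counts) / total_count

# Cambridge, UK: count 3 for the UK node; Cambridge, MA: 2 (Massachusetts) + 2 (USA).
weight_uk = parent_weight([3], 7)
weight_ma = parent_weight([2, 2], 7)
# ~0.429 and ~0.571; the text rounds these to 0.428 and 0.572.
print(round(weight_uk, 3), round(weight_ma, 3))
```

Since the Massachusetts weight is higher, 'Cambridge' is attached under Cambridge, MA.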
The insertion count for each node is the total number of occurrences of the location represented by that node in the original article, so the insertion count for the whole tree is equal to the number of all location occurrences in the article.

Weight_n = (Ct_country + Ct_state + Ct_city) / Ct_total.    (1)

Calculating the weight for the two possible parent nodes of 'Cambridge', we use 3 as the insertion count for the UK, and 4 for the insertion count for Massachusetts, 2 from the Massachusetts node and 2 from the USA node. Dividing both by the insertion count of 7 for the entire tree, we find Weight_UK = 0.428 and Weight_MA = 0.572. As Weight_MA is the higher of the two, 'Cambridge' is associated with Cambridge, MA. The equation for the weight calculation was chosen based on empirical testing. Our testing shows that we place 78% of the ambiguous locations correctly in the tree. This does not factor in unambiguous locations, as they are always placed correctly in the tree. Figure 8 shows the complete location tree for the Operation Nickel Grass Wikipedia article. Note that the Operation Orchard node is the only incorrect node placed in the tree.

FIGURE 8. Complete location tree for the Operation Nickel Grass Wikipedia article.

9. EXPERIMENTS, RESULTS AND ANALYSIS

For this research, we downloaded the English language pages and links database from the June 18, 2008 dump of Wikipedia. This download provides the full text of all the pages in the English Wikipedia at the time of creation. This full text includes all wiki and HTML markup. The download contains ∼7.2 million pages.8 A selection of 90 pages about battles and wars was obtained and hand-tagged. The pages were obtained by automatically crawling a Wikipedia index of battles9 and dismissing extremely short articles. These pages resulted in ∼230 000 words of data, an average of 7% of which are geospatial named entities. The pages also provided a total of 5000 sentences. Table 4 lists a selection of the articles we work with.

8 http://download.wikimedia.org/enwiki/20080618/.

TABLE 4. Partial Wikipedia corpus article metrics.

Page title                  Word Ct.  Sentence Ct.  NE Ct.
Battle of Antietam          6167      400           260
Battle of Britain           10 611    512           229
Battle of the Bulge         7614      339           265
Battle of Chancellorsville  2979      137           133
Battle of Chantilly         1170      50            56
Battle of Chickamauga       2000      99            108
American Civil War          7961      397           298
First Barbary War           1906      88            97
Battle of Fredericksburg    2354      111           77
Battle of Gettysburg        5289      256           274
First Gulf War              11 562    895           674
Korean War                  10 570    462           495
Mexican-American War        5822      275           472
Operation Eagle Claw        1869      60            56
Operation Nickel Grass      1092      52            47
Attack on Pearl Harbor      3414      188           198
Battle of Shiloh            4899      253           157
Spanish–American War        4229      194           296
Battle of Vicksburg         3369      178           119
The Whiskey Rebellion       1475      65            49
World War II                6775      331           661

It is necessary to pre-process the text of the articles before it can be used for information extraction. We remove all infoboxes. For Wikipedia-style links, the link text is removed and replaced with either the name of the linked page or with the caption text for links with a caption. Links to external sites are removed. Certain special characters and symbols that do not add any information, e.g. HTML tags, are removed, as they tend to cause exceptions and problems in string processing. The characters we remove do not include those that would appear in names, such as accented characters, commas, periods, etc. The cleaned primary text of each page is split into an array of individual words, and the individual words are hand-tagged for various purposes as necessary.

9.1. Extracting important events

We use supervised learning for extracting important events from the historical documents.
For this purpose, we require a set of hand-tagged articles indicating whether each event word identified by EVITA is part of an important event. If an event word is part of an important event, the sentence containing it is considered important because the word contributes to this importance. Because there is no good objective measure of importance, we asked multiple volunteers to tag the articles to avoid bias. Subjectivity was a real problem, however. The most basic guideline was to judge whether one would want the event on a timeline of the article. Other guidelines were to not tag an event if it was more of a state than an action, and to tag all event words that referred to the same event. Each article was annotated by more than one person so that disagreements could usually be resolved using a majority vote. Thirteen articles were annotated in total; the breakdown of the numbers of annotators was: one article annotated by five people, four articles annotated by four people, six articles annotated by three people and two articles annotated by two people. If we take one annotator's tagging as ground truth and measure every other annotator's accuracy against it, the average inter-annotator F1-score comes out as 42.6%, indicating substantial disagreement. Using an alternative approach, we labeled a sentence as important for the classifier if either (1) some event in the sentence was marked important by every annotator, or (2) at least half of the events in the sentence were marked important by at least half of the annotators. Using these criteria, over the 13 articles 696 sentences were labeled important and 1823 unimportant. The average F1-score between annotators for this new method rose to 66.8% (the lowest being 47.7% and the highest 78.5%). By using data gathered by multiple annotators and filtering it in such a way, we are able to concentrate on events that multiple people can agree are important.

9 http://en.wikipedia.org/wiki/List_of_battles.
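The two-part labeling criterion above can be sketched as follows (event annotations are modeled as per-event lists of per-annotator booleans; names are illustrative):

```python
def sentence_important(event_votes, n_annotators):
    """A sentence is important if (1) some event was marked important by
    every annotator, or (2) at least half its events were marked important
    by at least half of the annotators.

    event_votes: one list per event word in the sentence, each holding
    that event's "important" votes (True/False), one per annotator.
    """
    if any(sum(v) == n_annotators for v in event_votes):  # criterion 1
        return True
    half_backed = sum(1 for v in event_votes
                      if sum(v) * 2 >= n_annotators)      # criterion 2
    return half_backed * 2 >= len(event_votes)

# Three annotators, two events: nobody is unanimous, but one event has 2/3
# support, which is half of the sentence's events -> important.
print(sentence_important([[True, True, False], [False, False, False]], 3))  # True
```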
This is our attempt both to avoid overly focusing and to avoid overly diluting our training data. Adding more participants can impact our training data in a variety of ways that depend on the like-mindedness of the individuals; the addition could increase our confidence in a particular event being important, add events to or drop events from acceptable confidence levels, etc. Depending on our annotators, the training data could represent anything from a very focused method of importance identification to a noisy, over-saturated list of events. Overall, the human training aspect is necessary and has the potential to be improved, likely by adding a short phase of mass-tagging, potentially followed by periodic additions to the training corpus. Although we currently use one generic set of training data that we apply to text from any source, another option is to develop various models trained on different sets of text to achieve better performance across sources, i.e. utilizing metadata along with the training text to allow classification of the text based on time period, structure, source, etc. We can generate multiple models trained on different subsets of our tagged texts, grouped by their characteristics, and when we apply the system to a new source of text, we can use metadata, if it is available for the text, to choose the most relevant model.

TABLE 5. SVM parameter searching test results (averaged over cross-validation sets).
c     g        w   F-score
1.0   1.0      1   29.0
1.0   1.0      2   46.1
1.0   0.5      2   46.9
1.0   0.25     2   47.5
1.0   0.125    2   48.0
1.0   0.0625   2   49.0
1.0   0.03125  2   48.0
1.0   1.0      3   46.2
1.0   0.5      3   48.9
1.0   0.25     3   50.1
1.0   0.125    3   51.0 (p: 41.06066, r: 75.0522, f: 51.0428)
1.0   0.0625   3   50.9
0.5   1.0      2   45.2
0.5   0.5      2   47.1
0.5   0.25     2   47.5
0.5   0.125    2   48.2
0.5   0.0625   2   48.1
0.5   1.0      3   48.0
0.5   0.5      3   49.2
0.5   0.25     3   50.55
0.5   0.125    3   51.0 (p: 41.14866, r: 74.7939, f: 51.03596)
0.5   0.0625   3   51.0 (p: 41.35864, r: 73.9689, f: 51.01095)

Ultimately, we trained a binary SVM with a radial basis function kernel using the LIBSVM software [90]. The parameters for the kernel and the cost constant were found with one of LIBSVM's built-in tools. SVMs were trained and tested using 10-fold cross-validation. To determine the best SVM model, the parameter space was partially searched for the cost of missing examples (c), the parameter gamma of the radial basis kernel function (g) and the weight ratio of the cost for missing positive examples to the cost for missing negative examples (w). For any given choice of c, g and w, an SVM was trained on each cross-validation training set and tested on the corresponding test set, using LIBSVM. The resulting F-scores were averaged over the different sets and recorded. The search space was explored starting at c = 1.0, g = 1.0, w = 1; c and g were reduced by powers of 2 and w was incremented. This portion of the space was explored because in the original classification SVM training, the optimal c and g (as found by a LIBSVM tool) were usually 0.5 and 0.5. Fixing c and w, the next g would be chosen until F-scores no longer went up; w was usually set to 2 almost immediately because of low scores for w = 1. This was repeated for w up to 3 (as 4 produced slightly poorer results), and then the next value of c was chosen. The results of these tests are shown in Table 5. However, 0.5 and 0.5 were optimal values for the event-word-level classifiers, and so a broader search may have been beneficial.
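This kind of coarse grid search over (c, g, w) can be sketched as follows, with a hypothetical cross-validation scoring function standing in for the LIBSVM runs (here it replays a few rows of Table 5):

```python
def grid_search(cv_f_score, cs, gs, ws):
    """Return the (c, g, w) triple with the best averaged F-score.

    cv_f_score(c, g, w) is assumed to train/test over the cross-validation
    folds and return the mean F-score, as the LIBSVM tools do in the paper.
    """
    best, best_f = None, -1.0
    for c in cs:
        for g in gs:
            for w in ws:
                f = cv_f_score(c, g, w)
                if f > best_f:
                    best, best_f = (c, g, w), f
    return best, best_f

# Stand-in scorer replaying a few recorded rows of Table 5:
table5 = {(1.0, 0.125, 3): 51.0, (1.0, 0.25, 3): 50.1, (0.5, 0.125, 2): 48.2}
best, f = grid_search(lambda c, g, w: table5.get((c, g, w), 0.0),
                      cs=[1.0, 0.5], gs=[0.25, 0.125], ws=[2, 3])
print(best, f)  # (1.0, 0.125, 3) 51.0
```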
The best results were at the 51.0% F-score level and came from a few different choices of parameters. The set c = 1.0, g = 0.125, w = 3 was chosen, and the final model was trained with those parameters on the entire dataset.

9.2. Experiments in extracting occurrence times

Testing the times is significantly more difficult, since the intervals generated by this program are certainly different from the ones intended by the article, and representing events with times is difficult even for human annotators. We use two evaluation methods for extracted times. First, spot-checking showed the strengths of our regular-expression-based time extraction method when time expressions are explicit. Two undergraduates examined a random sample of several dozen automatically generated times. We found that ∼80% of the automatic extractions were correct when there was an explicit time expression.

When a time of occurrence is not given explicitly, we use a measure proposed by Ling and Weld [12] that they term 'Temporal Entropy' (TE). This indirectly measures how large the generated intervals are, smaller (and, therefore, better) ones yielding smaller TE. Different Wikipedia articles have different spans and time granularities, and therefore TE varies greatly among them. For example, a war is usually measured in years while a battle is measured in days. An event whose interval spans a few months in a war article should not be penalized the way that span should be in a battle article. The range of temporal entropies obtained is shown in Fig. 9. It is also necessary to normalize the TE. To do this, we divide the length of the event's interval in seconds by the length in days of the unit that is found for display on the timeline, as described in the visualization section. The TE from these results is primarily between 5 and 15, with 1400 sentences that had temporal expressions having TE in this range, and the majority of these close to 10.
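Given the normalization described and the units on Fig. 9 (log of seconds over days), the computation can be sketched as follows; a natural logarithm is assumed, which is consistent with the text's example of TE = 10 corresponding to roughly a 2-day interval in a week-unit article:

```python
import math

def temporal_entropy(interval_seconds, unit_days):
    """Normalized Temporal Entropy: log of the event interval in seconds
    divided by the article's display unit in days. Smaller intervals
    (relative to the unit) yield smaller TE."""
    return math.log(interval_seconds / unit_days)

# A 2-day interval in a battle article measured in weeks (unit = 7 days)
# lands close to TE = 10, matching the example in the text.
two_days = 2 * 24 * 3600
print(round(temporal_entropy(two_days, 7), 2))
```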
A TE of 10 represents an interval that is about 0.25 of the unit used to measure the article; for example, for a battle article measured in weeks, an interval with a TE of 10 would span about 2 days. There is no theoretical lower or upper bound on the TE range, but in practice it is bounded above by the TE of the interval with the largest interval-to-article-unit ratio.

FIGURE 9. Temporal Entropy graph: the temporal entropies were sorted in increasing order for plotting. Temporal entropy is in log(seconds/days).

Temporal entropy does not give all the information, however. Spot-checking of articles reveals that many events, particularly those whose times were estimated, are not in the correct interval at all. An example of this error can be seen in the USS Liberty article. This article's span was found to be the single day 8 June 1967. Regular expressions on the sentence 'She was acquired by the United States Navy ... and began her first deployment in 1965, to waters off the west coast of Africa' find 'began' and '1965', and it is concluded that this event starts in 1965 and ends at the end of the whole event, in 1967. This is incorrect. Worse, a sentence closely following this one begins 'On June 5, at the start of the war', and is guessed to be in 1965 because the correct year, 1967, was not mentioned again between the sentences. This error then propagates through at least one more sentence. Similar examples succeed, however, as in the Battle of Gettysburg article: 'The two armies began to collide at Gettysburg on 1 July 1863, as Lee urgently concentrated his forces there' is given the date 1 July 1863, and the next sentence with a time expression, 'On the third day of battle, July 3, fighting resumed on Culps Hill,...', uses 1863 as its year.
A useful but impractical additional test, which we have not performed, would be human examination of the results to tell whether each event's assigned interval includes or is included in its actual interval. Events that fail this test could then be given maximum temporal entropy, to integrate this test's results into the former results.

9.3. Experiments in geospatial named entity extraction

We performed extensive experiments in extracting geospatial named entities, using three different methods for extracting and geocoding them. Our goal was to see whether a specially trained and tuned SVM would perform better than HMM or CRF approaches, particularly for geospatial names. More details can be found in [75–77, 91].

9.3.1. Using machine learning for geospatial named entity extraction

We downloaded a number of previously tagged datasets to provide the training material for the SVM, and also hand-tagged a dataset of our own. The relevant datasets are:

(i) the Spell Checker-Oriented Word List (SCOWL)10, to provide a list of English words;
(ii) the CoNLL 2003 shared task dataset on multi-language NE tagging11, containing tagged named entities for PER, LOC and ORG, in English;
(iii) the Reuters Corpus from NIST:12 15 000 selected words from the corpus were hand-tagged for LOC entities;
(iv) the CoNLL 2004 and 2005 shared task datasets on Semantic Role Labeling13, tagged for English LOC NEs;
(v) the Geonames database of 6.2 million names of cities, counties, countries, land features, rivers and lakes.14

To generate the SVM training corpus, we combined our hand-tagged Reuters corpus and the three CoNLL datasets. We extracted those records in the 2.5-million-word Geonames database with either the P or A Geonames codes (population center and administrative division), which provided the names of cities, villages, countries, states and regions. Combined with the SCOWL wordlist, which provided negative examples, this resulted in a training set of 1.2 million unique words.
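The corpus assembly described above can be sketched as follows; the tiny in-memory lists are illustrative stand-ins for the real Geonames, SCOWL and hand-tagged files:

```python
# Geonames records with feature class P (populated place) or A
# (administrative division) supply positive LOC examples; a plain
# English word list supplies negatives. Other feature classes
# (e.g. T, terrain features) are excluded.
geonames = [
    ("Denver", "P"), ("Colorado", "A"), ("Pikes Peak", "T"),
]
scowl_words = ["battle", "army", "river"]

positives = {name for name, cls in geonames if cls in ("P", "A")}
negatives = {w for w in scowl_words if w not in positives}

# Labeled lexicon: 1 = LOC named entity, 0 = ordinary word.
training = [(w, 1) for w in sorted(positives)] + \
           [(w, 0) for w in sorted(negatives)]
print(training)
```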
We used LibSVM to train an SVM based on the feature vectors and two parameters, C and γ, the cost parameter and the kernel coefficient of the SVM kernel function. After a grid search, the optimal parameters were determined to be C = 8 and γ = 0.03125. We selected a radial basis function kernel and applied it to the C-SVC, the standard cost-based SVM classifier. The SVM was trained using feature vectors based on the training corpus, requiring 36 h to complete. Our goal was to obtain high recall while maintaining good precision. Table 6 shows the precision, recall and F-measure numbers for training corpora created with different subsets of the training data. The performance numbers are the results on the Wikipedia testing corpus.

10 http://downloads.sourceforge.net/wordlist/scowl-6.tar.gz.
11 http://www.cnts.ua.ac.be/conll2003/ner/.
12 http://trec.nist.gov/data/reuters/reuters.html.
13 http://www.lsi.upc.es/~srlconll/st04/st04.html.
14 http://www.geonames.org/export/.

TABLE 6. Training corpus combination results for the NER SVM.

Corpus                              Precision  Recall  F
R                                   0.38       0.63    0.47
R + C03 + C04 + C05                 0.45       0.68    0.54
R + C03 + C04 + C05 + SCW           0.49       0.75    0.59
Wiki                                0.65       0.34    0.44
Wiki + R + C03 + C04 + C05 + SCW    0.60       0.55    0.57
Geo + R + C03 + C04 + C05 + SCW     TC         TC      TC
FGeo + R + C03 + C04 + C05 + SCW    0.51       0.99    0.67

R, Reuters corpus; C03, CoNLL 2003 dataset; C04, CoNLL 2004 dataset; C05, CoNLL 2005 dataset; Wiki, Wikipedia data; SCW, SCOWL wordlist; Geo, full Geonames database; FGeo, filtered Geonames database; TC, training canceled. The last row (FGeo) is the chosen training corpus.

After selecting the optimal training corpus, we optimized the feature set for the SVM. Table 7 shows a breakdown of the feature sets we tested separately. We found that the full feature set provided the best performance, by close to 10% in F-measure over any other combination of features.

TABLE 7. Feature set combination results for the NER SVM.

Feature set                                            F-measure
SHP + PUN + HYP + SEP                                  0.44
SHP + CAP + PUN + HYP + SEP                            0.47
CAP + VOW + CON                                        0.51
SHP + CAP                                              0.54
SHP + CAP + VOW + CON                                  0.56
SHP + CAP + VOW + CON + AFF + POS                      0.58
SHP + CAP + VOW + CON + PUN + HYP + SEP + AFF + POS    0.67

SHP, word shape, including length and alphanumeric content; CAP, capital positions and counts; VOW, vowel positions and counts; CON, consonant positions and counts; PUN, punctuation; HYP, hyphens; SEP, separators, quotes, parentheses, etc.; AFF, affixes; POS, part-of-speech categorization. The last row is the chosen feature set.

With this setup, we were successful in training the SVM to return high recall, recognizing 99.8% of the geospatial named entities. Unfortunately, this came at the cost of precision: the SVM returned noisy results with only 51.0% precision, but the noise was filtered out during the NE geocoding process to obtain our final results. As described earlier, we also performed geospatial NER using an HMM-based tool from LingPipe and a CRF-based tool from the Stanford Natural Language Processing Group.

TABLE 8. Grounded geospatial NE results.

                       Precision  Recall  F-measure
HMM NER results        0.489      0.615   0.523
CRF NER results        0.579      0.575   0.572
SVM NER results        0.510      0.998   0.675
HMM resolved results   0.878      0.734   0.796
SVM resolved results   0.869      0.789   0.820
CRF resolved results   0.954      0.695   0.791

9.3.2.
Comparison of geospatial NER using machine learning methods

Table 8 shows the raw results from geospatial NE recognition using the three learning methods, along with the results after processing through geospatial resolution and disambiguation (discussed in the following section). The NER results show the accuracy of the NE recognizers before any further processing. The resolved results in Table 8 specifically show the performance in correctly identifying location strings and geocoding the locations. A string correctly identified by the NER process is one that exactly matched a hand-tagged ground-truth named entity. Figure 10 shows a more detailed breakdown of the precision, recall and F-measure for a subset of the articles processed. A handful of articles with foreign names, such as the article on the Korean War, brought down the average results for all three methods, but to a larger extent for the SVM and, to a lesser degree, the HMM. This is most likely because our training data contained a limited number of foreign names, and the HMM had trouble recognizing these strings as LOC named entities. The SVM also produces far more traffic to the geocoder than the HMM or CRF method. Table 9 shows the decrease in the number of candidate location NEs extracted by the HMM relative to the SVM for some of the articles in the test corpus; it also shows the number of these NEs that were successfully geocoded and disambiguated. Although the HMM or the CRF often extracted fewer than half as many potential NEs as the SVM, the final results came out similar. The HMM demonstrates better performance than the SVM on the longer articles, and worse performance on shorter articles. The F-measures for some of these articles are pictured in Fig. 10, in which the three lowest and three highest scoring articles of the HMM's results are shown side by side.
For the HMM, the lowest scoring articles were those with the fewest potential NEs in the text, and the highest scoring had the most potential NEs. We see in Table 8 that, as far as F-measure after resolution goes, all three approaches are comparable in performance, although our specially trained SVM has high recall, as tuned. However, the traffic to the geocoder is much lower for the HMM and CRF methods. Our system allows a choice of all three methods, although CRF is the default because of its fast training speed.

FIGURE 10. Comparing SVM and HMM results for a subset of Wikipedia articles.

TABLE 9. Hand-tagged articles: potential location NEs.

Article           HMM extracted     SVM extracted     HMM        SVM        HMM     SVM
                  potential NEs/    potential NEs/    precision  precision  recall  recall
                  grounded NEs      grounded NEs
Chancellorsville  119/47            566/75            0.7015     0.8621     0.4947  0.7895
Gettysburg        327/115           1209/117          0.7718     0.7267     0.6319  0.6429
Korean War        625/328           2167/331          0.9371     0.6910     0.8700  0.8780
War of 1812       752/384           2173/408          0.9165     0.8518     0.7370  0.7831
World War 2       668/464           1641/448          0.9915     0.9124     0.8609  0.8312

FIGURE 11. Comparing HMM- and CRF-based named entity recognizers on selected articles.

9.3.3. Disambiguation of geospatial named entities

We found that there are problems with context-based disambiguation in areas of North America where city and feature names overlap significantly. The ultimate result is that places with globally unique city and feature names perform best, as there is no need to run the context disambiguation algorithms. We also found that the worst areas in North America are the Northeast and the Southwest USA, where there is significant overlap between place names in the USA and place names in Mexico.
In general, our context disambiguation algorithm performs at 78% accuracy, but in these areas of the country the performance drops to around 64.5%, due to the significant place-name overlap.

9.3.4. Geocoding and caching geocoded results

Once a list of geospatial names is identified, the names are fed into Google Geocoder, a service that attempts to resolve each name as an object with various attributes, including a latitude and longitude. The geocoder is accessed via an online API and can return multiple results if there are multiple places in Google's catalog that match the provided name. Geocoding success is defined as correctly resolving a string to a single location in the context of the document by the end of our post-processing of the geocoder's results. We find that the geocoder performs much better on land-based than on ocean- and sea-based locations, primarily because land tends to be better divided and named than the ocean. This problem arises mostly in articles about naval battles, like the Battle of Midway article. We also find that Google Geocoder has much better data in its gazetteer for the USA and Western Europe than for other areas of the world. While we are able to resolve major cities in Asia and Africa, smaller-scale features are often returned as 'not found'. Russian and East Asian locations also introduce the occasional problem of other character sets in Unicode. In general, the articles in our corpus do not refer to many features smaller than a city, so these geocoding problems do not significantly impact our performance. Historical location names cause some small issues with the geocoder, but our research demonstrates that, for North America and Europe at least, Google Geocoder has historical names in its gazetteer going back at least 300 years.
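A simplified illustration of post-processing multi-result geocoder output: among candidate coordinates for an ambiguous name, prefer the one closest to the centroid of locations already resolved for the article. This is a toy stand-in for the context disambiguation described above, with hand-made candidate lists instead of live Google Geocoder responses:

```python
def centroid(points):
    """Mean (lat, lon) of a list of coordinates."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def disambiguate(candidates, resolved):
    """Pick the candidate (lat, lon) nearest the article centroid."""
    cx, cy = centroid(resolved)
    return min(candidates, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)

# 'Springfield' has many matches; the article context is in Virginia.
resolved = [(38.9, -77.0), (37.5, -77.4)]            # already-grounded places
springfields = [(39.8, -89.6), (42.1, -72.6), (38.8, -77.2)]
print(disambiguate(springfields, resolved))           # -> (38.8, -77.2)
```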
It should be noted that we cache geocoded results, which decreases the number of online geocoding queries by an average of 70%. Google Geocoder limits the speed at which a public user can make consecutive requests, so we add a delay of 0.1 s between queries. By reducing our online accesses by such a large percentage, we save seconds of processing per article, although more detailed performance evaluations are still needed to quantify how much the cache improves efficiency. Since Google Geocoder outputs have not yet been disambiguated by our system when cached, the cache holds information across the processing of multiple articles. In initial development, it was the single final disambiguated output that was saved into the cache rather than the raw geocoder output. This caused problems when two completely different geospatial entities with the same name were referenced in two different articles that would have been distinguished correctly given the raw geocoder output. We have since fixed this problem by immediately saving the list of results returned by the geocoder into the cache, rather than the specific location identified after disambiguation. Unfettered access to Google Geocoder costs several tens of thousands of dollars per year, which makes caching attractive.

10. VISUALIZATION

The overall objective of the research presented in this paper is to create a structured dataset that represents the events, entities, temporal data and grounded locations contained in a corpus of unstructured text. While this is valuable in itself, our follow-up objective is the visualization of this structured data. We create a bifurcated GUI that provides visualizations of the geospatial and event/temporal information: a combination map and timeline generated using the Google Maps API15 and the MIT SIMILE Timeline.16 The map interface provides a clickable list of events on the right, with an associated map of the events.
Clicking on a location on the map or in the sidebar shows an infobox on the map that provides more details about the location, including the sentence in which it appears. Figure 12 shows this interface, and Fig. 13 shows the infobox that appears on the map after clicking. The second part of the interface is a timeline with short summaries of the important events. Clicking on any of the events brings up an infobox with a more detailed description of the event, shown in Fig. 14.

15 http://code.google.com/apis/maps/documentation/javascript/.
16 http://www.simile-widgets.org/timeline/.

FIGURE 12. Example of an article's visualization.
FIGURE 13. Geospatial entity infobox in the visualization.

10.1. Timeline hot zones

Key to the timeline visualization is the concept of 'hot zones', or clusters of important events on the timeline. An example set of hot zones is shown in Fig. 15. In the figure, the orange bars represent three different hot zones where multiple events have been identified as occurring at about the same time. Blue markers, along with the text of the event itself, are aligned vertically in each zone. Events not identified as residing in a hot zone are staggered outside the orange bars. To generate the endpoints of the hot zones, we sort the times for an article and examine the time differences (in days) between consecutive times. Long stretches of low differences indicate dense time intervals, and hence a hot zone. The differences need not be zero, just low, so a measure is needed of how low is acceptable. We remove outliers from the list of time differences and then take the average of the remaining values. To remove outliers, we proceed in the common manner: we calculate the first and third quartiles (Q1 and Q3) and then

FIGURE 14.
Event infobox in the visualization, after clicking 'Richmond, VA, USA'.
FIGURE 15. Timeline for the Mexican–American War, containing three hot zones.

remove values greater than Q3 + 3(Q3 − Q1) or less than Q1 − 3(Q3 − Q1). The latter quantity is usually zero, so this ends up removing high outliers. The average of the remaining differences is the threshold below which a difference is called low. To ensure that too many hot zones are not generated, we also require a minimum number of consecutive low differences for an interval to become a hot zone; we use a threshold of 10, based on what is appropriate for the dates in the majority of articles we process.

10.2. Article processing flow

The website both allows a user to search for and visualize articles, and serves to control the processing of new articles as they are requested. Figure 16 shows the processing flow of articles. While the system constantly queues and processes articles, requested articles that have not been previously processed are immediately moved to the top of the queue. A user can also choose to re-process an article that has already been processed, if it has been too long since it was last processed, as the article content may have changed on Wikipedia.

FIGURE 16. System workflow.

11.
CONCLUSIONS AND FUTURE WORK

In this paper, we have presented details of a system that combines many tools and approaches into a useful working whole. Of course, the work presented here can be improved in many ways. The performance of the important-event classifier could be improved in the future; more features might help. Ideas for sentence features that were not implemented include calculating the current features for the adjacent sentences. Dependency relations obtained from dependency parsers could be used as well, and classifiers other than SVMs could be explored. Since the task is similar to text summarization, methods used for summarization could also be adapted. Assuming we perform extractive summarization, the sentences selected for a document's summary can be considered important, and these in turn can be assumed to contain references to important events. If the number of sentences selected for the summary can be controlled, this gives control over the number or percentage of sentences, and hence events, considered important. Calculating the features to train the classifier takes more time than desirable. Multithreading the processing of articles would help. Three specific features take the most time to calculate: similarity, named entity weight and TextRank rank. Experimenting with removing any of these features while preserving accuracy could yield a more efficient classifier. The extraction of explicit temporal expressions turns out to work fairly well for this domain of historical narratives, because the sentences are often ordered in the text similarly to their temporal order. In another domain, even one rich in temporal information like biographies, it might not do as well. Although they work well, most of the algorithms we used here are naïve and do not make use of existing tools.
The regular-expression extractor works fairly well, but needs expansion to handle relative expressions like 'Two weeks later, the army invaded'. Implementing this with the extractor's current accuracy should not be hard: once the expression is recognized, the function it implies (in this example, adding 2 weeks) can be applied to the anchoring time. The extractor also cannot currently handle times B.C.E. More difficult would be extending the program to extract multiple expressions per sentence (beyond the range-like expressions it already finds); if events are considered smaller than sentences, this is crucial. Having events smaller than sentences would also allow the use of Ling and Weld's temporal information extraction system [12], which temporally relates events within one sentence. We need to improve performance when time expressions are implicit: errors tend to build up and carry over from sentence to sentence, and some kind of check or reset condition would help accuracy. The extracted information is currently stored in a MySQL database. We would like to convert it into an ontology based on OWL or RDF so that it can be easily integrated into other programs. For example, this may enable searches for text content filtered by geographic area. Such a search could be performed in two steps: first, a standard free-text search for articles about the topic; then, filtering that list to articles with appropriate associated locations. Reversing this paradigm, the location data provided by the system could also enable location-centric search. The visualization could also be improved with changes to SIMILE's code for displaying a timeline, particularly the width of the timeline, which does not automatically adjust for more events.
Because the important-event classifier positively identifies many sentences that are often close in time, the listing of events on the timeline overflows the allotted space, so more space must be allotted even for sparser articles. This could be improved with automatic resizing or zooming. Another improvement would be to create event titles that better identify the events, since they are currently just the beginnings of the sentences. It will also be interesting to see whether the visualization can be made more expressive using a language of choremes [35, 36]. Finally, more features could be added to the visualization, especially filtering by time or location range. Applied to other corpora, this event and geospatial information could also be very useful in finding geospatial trends and relationships. For instance, consider a database of textual disease outbreak reports: the system could extract all the locations, which could then be presented graphically on a map, allowing trends to be found much more easily. In our current effort, we perform limited association of events and geo-entities based on their proximity within the sentence, which provides only the most basic association. While the research done on event extraction did heavily involve the time of the event, the further association of events with geo-entities is a topic for future research. In practical terms, we also envision developing our user interface further by giving the system a slider for time (rather than just the ability to enter start/end dates), which, used continuously, could provide a nice overview of where events happen over time. Some kind of clustering could be applied at each time step of some granularity to see where the action moves.
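One of the future-work items above, handling relative expressions like 'Two weeks later', can be sketched as follows; the pattern, word list and helper names are ours for illustration, and the paper's extractor does not yet implement this:

```python
import re
from datetime import date, timedelta

# Coarse day counts per unit and a small number-word table; both are
# simplifications for this sketch.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def resolve_relative(sentence, anchor):
    """Return the anchor date shifted by an '<N> <unit> later'
    expression, or None if the sentence has no such expression."""
    m = re.search(r"\b(\w+)\s+(day|week|month|year)s?\s+later\b",
                  sentence, re.IGNORECASE)
    if not m:
        return None
    word = m.group(1).lower()
    if word in WORDS:
        n = WORDS[word]
    elif word.isdigit():
        n = int(word)
    else:
        return None
    return anchor + timedelta(days=n * UNIT_DAYS[m.group(2).lower()])

anchor = date(1846, 5, 8)  # the previous event's anchoring date
print(resolve_relative("Two weeks later, the army invaded.", anchor))
# -> 1846-05-22
```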
FUNDING

The work reported in this paper has been partially supported by NSF grant ARRA: 0851783. R.C. and D.W. are undergraduate researchers supported by this NSF Research Experience for Undergraduates (REU) grant. This paper is a much extended version of [92].

REFERENCES

[1] Allan, J., Gupta, R. and Khandelwal, V. (2001) Temporal Summaries of News Topics. SIGIR'01: Proc. 24th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, New Orleans, LA, USA, pp. 10–18.
[2] Allan, J. (2002) Topic Detection and Tracking: Event-Based Information Organization. Springer, New York, NY, USA.
[3] Swan, R. and Allan, J. (1999) Extracting Significant Time Varying Features from Text. CIKM'99: Proc. 8th Int. Conf. on Information and Knowledge Management, Kansas City, MO, USA, pp. 38–45.
[4] Manning, C.D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.
[5] Ahn, D. (2006) The Stages of Event Extraction. Proc. Workshop on Annotating and Reasoning about Time and Events, Sydney, Australia, July, pp. 1–8.
[6] Humphreys, K., Gaizauskas, R. and Azzam, S. (1997) Event Coreference for Information Extraction. Proc. Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, Madrid, Spain, pp. 75–81.
[7] Bagga, A. and Baldwin, B. (1999) Cross-document Event Coreference: Annotations, Experiments, and Observations. Proc. Workshop on Coreference and its Applications, Madrid, Spain, pp. 1–8.
[8] Allen, J.F. (1983) Maintaining knowledge about temporal intervals. Commun. ACM, 26, 832–843.
[9] Zhou, L., Melton, G.B., Parsons, S. and Hripcsak, G. (2005) A temporal constraint structure for extracting temporal information from clinical narrative. J. Biomed. Inf., 39, 424–439.
[10] Pustejovsky, J., Castano, J., Ingria, R., Sauri, R., Gaizauskas, R., Setzer, A., Katz, G. and Radev, D.
(2003) TimeML: Robust Specification of Event and Temporal Expressions in Text. New Directions in Question Answering, pp. 28–34. AAAI Press, Palo Alto, CA, USA.
[11] Verhagen, M. and Pustejovsky, J. (2008) Temporal Processing with the TARSQI Toolkit. 22nd Int. Conf. on Computational Linguistics: Demonstration Papers, Manchester, UK, August, pp. 189–192.
[12] Ling, X. and Weld, D. (2010) Temporal Information Extraction. Proc. 24th Conf. on Artificial Intelligence (AAAI-10), Atlanta, GA, USA, July, pp. 1385–1390.
[13] Grishman, R. and Sundheim, B. (1996) Message Understanding Conference-6: A Brief History. Proc. COLING, Copenhagen, Denmark, pp. 466–471.
[14] Vapnik, V. (2000) The Nature of Statistical Learning Theory. Springer, New York, NY, USA.
[15] Bhole, A., Fortuna, B., Grobelnik, M. and Mladenić, D. (2007) Extracting named entities and relating them over time based on Wikipedia. Informatica, 22, 20–26.
[16] Takeuchi, K. and Collier, N. (2002) Use of Support Vector Machines in Extended Named Entity Recognition. Proc. COLING-02, Taipei, Taiwan, pp. 1–7.
[17] Lee, K.-J., Hwang, Y.-S. and Rim, H.-C. (2003) Two-Phase Biomedical NE Recognition Based on SVMs. Proc. ACL 2003 Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, pp. 33–40.
[18] Habib, M. and Kalita, J. (2007) Language and Domain-Independent Named Entity Recognition: Experiment using SVM and High-dimensional Features. Proc. 4th Biotechnology and Bioinformatics Symposium (BIOT-2007), Colorado Springs, CO, USA, pp. 69–73.
[19] Habib, M. (2008) Improving scalability of support vector machines for biomedical named entity recognition. PhD Thesis, University of Colorado, Colorado Springs, CO, USA.
[20] Habib, M. and Kalita, J. (2010) Scalable biomedical named entity recognition: investigation of a database-supported SVM approach. Int. J. Bioinf. Res.
Appl., 6, 191–208.
[21] Isozaki, H. and Kazawa, H. (2002) Efficient Support Vector Classifiers for Named Entity Recognition. Proc. 19th Int. Conf. on Computational Linguistics (COLING), Taipei, Taiwan, pp. 1–7.
[22] Dakka, W. and Cucerzan, S. (2008) Augmenting Wikipedia with Named Entity Tags. Proc. 3rd Int. Joint Conf. on Natural Language Processing, Hyderabad, India, pp. 545–552.
[23] Mansouri, A., Affendey, L. and Mamat, A. (2008) Named entity recognition using a new fuzzy support vector machine. Int. J. Comput. Sci. Netw. Secur., 8, 320.
[24] Klein, D., Smarr, J., Nguyen, H. and Manning, C.D. (2003) Named Entity Recognition with Character-Level Models. Proc. 7th Conf. on Natural Language Learning (CoNLL) at HLT-NAACL 2003, Edmonton, Canada, pp. 180–183.
[25] Zhou, G. and Su, J. (2002) Named Entity Recognition using an HMM-Based Chunk Tagger. Proc. 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp. 473–480.
[26] Lafferty, J., McCallum, A. and Pereira, F.C. (2001) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proc. Int. Conf. on Machine Learning, Williamstown, MA, USA.
[27] McCallum, A. and Li, W. (2003) Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-enhanced Lexicons. Proc. 7th Conf. on Natural Language Learning (CoNLL) at HLT-NAACL 2003, Berkeley, CA, USA, pp. 188–191.
[28] Chen, Y., Di Fabbrizio, G., Gibbon, D., Jora, S., Renger, B. and Wei, B. (2007) Geotracker: Geospatial and Temporal RSS Navigation. WWW'07: Proc. 16th Int. Conf. on World Wide Web, Banff, Alberta, Canada, pp. 41–50.
[29] Scharl, A. and Tochtermann, K.
(2007) The Geospatial Web: How Geobrowsers, Social Software and the Web 2.0 Are Shaping the Network Society. Springer, New York, NY, USA.
[30] Strötgen, J., Gertz, M. and Popov, P. (2010) Extraction and Exploration of Spatio-temporal Information in Documents. Proc. 6th Workshop on Geographic Information Retrieval, Zurich, Switzerland, pp. 16–23.
[31] Cybulska, A. and Vossen, P. (2010) Event Models for Historical Perspectives: Determining Relations between High and Low Level Events in Text, Based on the Classification of Time, Location and Participants. Proc. Int. Conf. on Language Resources and Evaluation, Malta, pp. 17–23.
[32] Cybulska, A. and Vossen, P. (2011) Historical Event Extraction from Text. Proc. ACL-HLT 2011 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2011), Portland, OR, USA, pp. 39–43.
[33] Miller, G. (1995) WordNet: a lexical database for English. Commun. ACM, 38, 39–41.
[34] Ligozat, G., Nowak, J. and Schmitt, D. (2007) From Language to Pictorial Representations. Proc. Language and Technology Conf., Poznan, Poland.
[35] Brunet, R. (1980) La composition des modèles dans l'analyse spatiale. Espace Géographique, 9, 253–265.
[36] Klippel, A., Tappe, H., Kulik, L. and Lee, P.U. (2005) Wayfinding choremes: a language for modeling conceptual route knowledge. J. Vis. Lang. Comput., 16, 311–329.
[37] Ligozat, G., Vetulani, Z. and Osiński, J. (2011) Spatiotemporal aspects of the monitoring of complex events for public security purposes. Spat. Cogn. Comput., 11, 103–128.
[38] Goyal, R.K. and Egenhofer, M.J. (2001) Similarity of Cardinal Directions. Advances in Spatial and Temporal Databases, pp. 36–55. Springer.
[39] Rigaux, P., Scholl, M. and Voisard, A. (2001) Spatial Databases: With Application to GIS. Morgan Kaufmann, San Francisco, CA, USA.
[40] Clarke, K.C. (2001) Getting Started with Geographic Information Systems. Prentice Hall.
[41] Hill, L.L.
(2000) Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints. Research and Advanced Technology for Digital Libraries, pp. 280–290. Springer.
[42] Fonseca, F.T., Egenhofer, M.J., Agouris, P. and Câmara, G. (2002) Using ontologies for integrated geographic information systems. Trans. GIS, 6, 231–257.
[43] Buscaldi, D., Rosso, P. and Peris, P. (2006) Inferring Geographical Ontologies from Multiple Resources for Geographical Information Retrieval. Proc. 3rd SIGIR Workshop on Geographical Information Retrieval, Seattle, WA, USA.
[44] Jones, C.B., Purves, R., Ruas, A., Sanderson, M., Sester, M., Van Kreveld, M. and Weibel, R. (2002) Spatial Information Retrieval and Geographical Ontologies: An Overview of the SPIRIT Project. Proc. 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Tampere, Finland, pp. 387–388.
[45] Fu, G., Jones, C.B. and Abdelmoty, A.I. (2005) Ontology-Based Spatial Query Expansion in Information Retrieval. On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, pp. 1466–1482. Springer.
[46] Martins, B.E.d.G. (2009) Geographically aware Web text mining. PhD Thesis, Universidade de Lisboa, Lisbon, Portugal.
[47] Page, L., Brin, S., Motwani, R. and Winograd, T. (1999) The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab, available at http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.
[48] Kleinberg, J.M. (1999) Authoritative sources in a hyperlinked environment. J. ACM, 46, 604–632.
[49] Brill, E., Dumais, S. and Banko, M. (2002) An Analysis of the AskMSR Question-Answering System. Proc. ACL-02 Conf. on Empirical Methods in Natural Language Processing, Philadelphia, PA, USA, pp. 257–264.
[50] Ravichandran, D. and Hovy, E.
(2002) Learning Surface Text Patterns for a Question Answering System. Proc. 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, pp. 41–47. [51] Hovy, E., Hermjakob, U. and Ravichandran, D. (2002) A Question/Answer Typology with Surface Text Patterns. Proc. 2nd Int. Conf. on Human Language Technology Research, San Diego, CA, USA, pp. 247–251. [52] Moldovan, D., Harabagiu, S., Girju, R., Morarescu, P., Lacatusu, F., Novischi, A., Badulescu, A. and Bolohan, O. (2003) LCC Tools for Question Answering. Proc. Text Retrieval Conf. (TREC), Gaithersburg, MD, USA, pp. 388–397. [53] Harabagiu, S., Moldovan, D., Clark, C., Bowden, M., Williams, J. and Bensley, J. (2003) Answer Mining by Combining Extraction Techniques with Abductive Reasoning. Proc. 12th Text Retrieval Conf. (TREC 2003), Gaithersburg, MD, USA, pp. 375–382. [54] Voorhees, E. (2003) Overview of TREC 2003. Proc. Text Retrieval Conf. (TREC), Gaithersburg, MD, USA, pp. 1–16. [55] Small, S., Liu, T., Shimizu, N. and Strzalkowski, T. (2003) HITIQA: An Interactive Question Answering System: A Preliminary Report. Proc. ACL 2003 Workshop on Multilingual Summarization and Question Answering, Sapporo, Japan, pp. 46–53. Association for Computational Linguistics. [56] Soricut, R. and Brill, E. (2004) Automatic Question Answering: Beyond the Factoid. Proc. HLT-NAACL, Boston, MA, USA, pp. 57–64. Human Language Technologies–North American Chapter of the Association for Computational Linguistics. [57] Sauri, R., Knippen, R., Verhagen, M. and Pustejovsky, J. (2005) EVITA: A Robust Event Recognizer for QA Systems. Proc. Conf. on Human Language Technology and Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp. 700–707. [58] UzZaman, N. and Allen, J.
(2010) TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from Text. Proc. 5th Int. Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 276–283. [59] UzZaman, N. and Allen, J. (2010) Event and Temporal Expression Extraction from Raw Text: First Step towards a Temporally Aware System. Int. J. Semant. Comput., 4, 487–508. [60] Llorens, H., Saquete, E. and Navarro, B. (2010) TIPSem (English and Spanish): Evaluating CRFs and Semantic Roles in TempEval-2. Proc. 5th Int. Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 284–291. [61] Wallach, H.M. (2004) Conditional Random Fields: An Introduction. Technical Report, Department of Computer Science, University of Pennsylvania, Philadelphia, PA, USA. [62] Naughton, M., Stokes, N. and Carthy, J. (2008) Investigating Statistical Techniques for Sentence-Level Event Classification. Proc. 22nd Int. Conf. on Computational Linguistics, Manchester, UK, pp. 617–624. [63] Pedersen, T., Patwardhan, S. and Michelizzi, J. (2004) WordNet::Similarity: Measuring the Relatedness of Concepts. HLT-NAACL'04: Demonstration Papers, Boston, MA, USA, pp. 38–41. [64] Leacock, C. and Chodorow, M. (1998) Combining Local Context and WordNet Similarity for Word Sense Identification. In WordNet: An Electronic Lexical Database, pp. 265–283. MIT Press, Cambridge, MA, USA. [65] Jiang, J. and Conrath, D. (1997) Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proc. Int. Conf. on Research in Computational Linguistics (ROCLING), Taipei, Taiwan. [66] Resnik, P. (1995) Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proc. 14th Int. Joint Conf. on Artificial Intelligence, Montreal, QC, Canada, pp. 448–453. [67] Lin, D. (1998) An Information-Theoretic Definition of Similarity. Proc. 15th Int. Conf. on Machine Learning, Madison, WI, USA, pp. 296–304. [68] Hirst, G. and St-Onge, D. (1998) Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. In WordNet: An Electronic Lexical Database, pp. 305–332. MIT Press, Cambridge, MA, USA. [69] Wu, Z. and Palmer, M. (1994) Verb Semantics and Lexical Selection. Proc. 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, USA, pp. 133–138. [70] Banerjee, S. and Pedersen, T. (2002) An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. Conf. on Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico, pp. 117–171. [71] Mihalcea, R. and Tarau, P. (2004) TextRank: Bringing Order into Texts. Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, pp. 404–411. [72] Strötgen, J. and Gertz, M. (2010) HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions. Proc. 5th Int. Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 321–324. [73] van Rijsbergen, C. (1979) Information Retrieval. Butterworths, London, UK. [74] Weiss, S.M. and Zhang, T. (2003) Performance Analysis and Evaluation. The Handbook of Data Mining, pp. 426–440. Lawrence Erlbaum Associates, Mahwah, NJ, USA. [75] Witmer, J. (2009) Mining Wikipedia for Geospatial Entities and Relationships. Master's Thesis, Department of Computer Science, University of Colorado, Colorado Springs, CO, USA. [76] Witmer, J. and Kalita, J. (2009) Extracting Geospatial Entities from Wikipedia. IEEE Int. Conf. on Semantic Computing, Berkeley, CA, USA, pp. 450–457. Institute of Electrical and Electronics Engineers. [77] Witmer, J. and Kalita, J. (2009) Mining Wikipedia Article Clusters for Geospatial Entities and Relationships. AAAI Spring Symposium, Stanford, CA, USA, pp. 80–81. Association for the Advancement of Artificial Intelligence. [78] Cortes, C. and Vapnik, V. (1995) Support-vector networks. Mach. Learn., 20, 273–297. [79] Vapnik, V. (1998) Statistical Learning Theory: Adaptive and Learning Systems for Signal Processing, Communications, and Control. Wiley, New York, NY, USA. [80] Cucerzan, S. (2007) Large-Scale Named Entity Disambiguation Based on Wikipedia Data. Proc. Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 708–716. [81] Rabiner, L.R. (1989) A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77, 257–286. [82] Finkel, J.R., Grenager, T. and Manning, C. (2005) Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proc. 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Ann Arbor, MI, USA, pp. 363–370. [83] Wang, C., Xie, X., Wang, L., Lu, Y. and Ma, W. (2005) Detecting Geographic Locations from Web Resources. 2005 Workshop on Geographical Information Retrieval, Bremen, Germany, pp. 17–24. [84] Sehgal, V., Getoor, L. and Viechnicki, P. (2006) Entity Resolution in Geospatial Data Integration. ACM Int. Symp. on Advances in Geographical Information Systems, Irvine, CA, USA, pp. 83–90. Association for Computing Machinery. [85] Zong, W., Wu, D., Sun, A., Lim, E. and Goh, D. (2005) On Assigning Place Names to Geography Related Web Pages. ACM/IEEE-CS Joint Conf. on Digital Libraries, Denver, CO, USA, pp. 354–362. Association for Computing Machinery–Institute of Electrical and Electronics Engineers, Computer Science section. [86] Martins, B., Manguinhas, H. and Borbinha, J. (2008) Extracting and Exploring the Geo-Temporal Semantics of Textual Resources. IEEE Int. Conf. on Semantic Computing, Santa Clara, CA, USA, pp. 1–9. Institute of Electrical and Electronics Engineers. [87] Smith, D. and Crane, G. (2001) Disambiguating Geographic Names in a Historical Digital Library. Research and Advanced Technology for Digital Libraries, pp. 127–136. Springer, New York, NY, USA. [88] Vestavik, Ø.
(2003) Geographic Information Retrieval: An Overview. Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway, available at http://www.idi.ntnu.no/~oyvindve/article.pdf. [89] D'Roza, T. and Bilchev, G. (2003) An Overview of Location-Based Services. BT Technol. J., 21, 20–27. [90] Chang, C.-C. and Lin, C.-J. (2001) LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. [91] Woodward, D., Witmer, J. and Kalita, J. (2010) A Comparison of Approaches for Geospatial Entity Extraction from Wikipedia. IEEE Int. Conf. on Semantic Computing, Pittsburgh, PA, USA, pp. 402–407. Institute of Electrical and Electronics Engineers. [92] Chasin, R., Woodward, D. and Kalita, J. (2011) Extracting and Displaying Temporal Entities from Historical Articles. National Conf. on Machine Learning, Tezpur University, Assam, India, pp. 1–13. Narosa Publishing House, New Delhi, India.