
Advance Access publication on 18 October 2013
© The British Computer Society 2013. All rights reserved.
For Permissions, please email: [email protected]
doi:10.1093/comjnl/bxt112
Extracting and Displaying Temporal
and Geospatial Entities from Articles
on Historical Events
Rachel Chasin1 , Daryl Woodward2 , Jeremy Witmer3 and Jugal Kalita4,∗
1 Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
2 Department of Electrical and Computer Engineering, University of Colorado, Colorado Springs, CO 80918, USA
3 The MITRE Corporation, 49185 Transmitter Road, San Diego, CA 92152-7335, USA
4 Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA
∗Corresponding author: [email protected]
This paper discusses a system that extracts and displays temporal and geospatial entities in text. The
first task involves identification of all events in a document followed by identification of important
events using a classifier. The second task involves identifying named entities associated with the
document. In particular, we extract geospatial named entities. We disambiguate the set of geospatial
named entities and geocode them to determine the correct coordinates for each place name, often
called grounding. We resolve ambiguity based on sentence and article context. Finally, we present a
user with the key events and their associated people, places and organizations within a document in
terms of a timeline and a map. For purposes of testing, we use Wikipedia articles about historical
events, such as those describing wars, battles and invasions. We focus on extracting major events
from the articles, although our ideas and tools can be easily used with articles from other sources
such as news articles. We use several existing tools such as Evita, Google Maps, publicly available
implementations of Support Vector Machines, Hidden Markov Models and Conditional Random Fields,
and the MIT SIMILE Timeline.
Keywords: information extraction; temporal extraction; geospatial entity extraction; natural language
processing
Received 23 December 2012; revised 11 September 2013
Handling editor: Fionn Murtagh
1. INTRODUCTION
The amount of user-generated content on the Internet increases
significantly every day. Whether they be short news articles
or large-scale collaborations on Wikipedia, textual documents
harbor an abundance of information, explicit or implicit.
Relationships between people, places and events within a
single text document are sometimes difficult for a reader to
extract—meanwhile the thought of digesting and integrating
information from a whole series of lengthy articles is daunting.
Consequently, the need for techniques to automatically extract,
disambiguate, organize and visualize information from raw text
is increasing in importance. In this paper, we are interested
in mining documents to identify important events along with
the named entities to which they are related. The objective is
to create the basis of a system that combines automatically
extracted temporal, geospatial and nominal information into a
large database in a form that facilitates further processing and
visualization.
This paper describes a state-of-the-art system that combines
several existing techniques, many of them embodied in
off-the-shelf toolkits, into an interesting working system.
We believe that the article will be instructive for anyone
trying to set up an operational system with the complex
functionalities described. Instead of focusing on a single
subproblem, this paper's focus is on an overall system. The two
main foci of this paper are to accurately and automatically
acquire various pieces of event-related information and to
visualize the extracted data.

FIGURE 1. A Wikipedia article highlighted for event references and named entities.

Visualization is an important
post-process of information extraction as it can offer a level
of understanding not inherent in reading the text alone.
Visualization is ideal for users who need quick, summary
information from an article. Possible domains for obtaining
the information used in our process are numerous and varied;
however, we have identified two primary sources for rich
amounts of reliable information. These domains are online news
archives and Wikipedia.
For the research reported in this paper, we work with
Wikipedia articles on historical events because of the
availability of high-quality articles in large numbers. Figure 1
shows a portion of a Wikipedia article that has been marked
up for event references and named entities. One of the
most influential features of the Internet is the ability to
easily collaborate in generating information. Wikis are perfect
examples. A wiki offers an ideal environment for the sharing
of information while allowing for a form of peer review that is
not present elsewhere. Wikipedia is only one of many corpora
that can be mined for knowledge that can be stored in machine-readable form.
A system such as the one we report in this paper may have
exciting and practical applications. An example would be the
assistance in researching World War II. One would be able to
select a time period, perhaps 1939–1940. Then one would be
able to select an area on a map—perhaps Europe. One could
choose a source of information, such as the World War II article
on Wikipedia and all of the articles linked on that page. The
result may be an interactive map and timeline that plays between
the time period of interest, showing only events on the timeline
that occurred in the selected region. Meanwhile markers would
fade in and out on the map as events begin and end—each one
clickable for a user to retrieve more information. The people and
organizations involved in a particular battle could be linked to
articles about them. The system described in this paper collects
the necessary information; although our current visualization is
not as sophisticated as this scenario, we describe the basis for
such a system here.
It should be noted that although we extract information from
Wikipedia articles, all information mining in our system is based
on raw text, which allows it to work with historical
articles, books and memoirs, and even newspaper articles
describing currently unfolding events. Thus, it is possible to
analyze text on the same topic from different sources, and
find the amount of similarity or dissimilarity or inconsistency
among the sources at a gross level. The existing visualization can
currently enable a human to make such distinctions, e.g. when
the same event may appear at different places on a timeline.
2. RELATED RESEARCH
Our overarching goal is to build a system that extracts, stores and
displays important events and named entities from Wikipedia
articles on historical events. It should be noted that our methods
do not limit the system to this source and the event and named
entity extraction works directly from plain text.
Research on event extraction has often focused on identifying
the most important events in a set of news stories [1–3].
The advantage of this domain is that important information
is usually repeated in many different stories, all of which are
being examined. This allows algorithms like TF-IDF [4] to be
used. In the Wikipedia project, there is only one document per
specific topic, so these algorithms cannot be used. There has also
been research into classifying events into specific categories and
determining their attributes and argument roles [5]. Another
issue that has been addressed is event coreference, determining
which descriptions of events in a document refer to the same
actual event [6, 7].
Even setting aside the problem of automating the process,
identifying and representing the temporal relations in a
document is a daunting task for humans. Several structures
have been proposed, from Allen’s interval notation and 13
relations [8] and variations on it using points instead, to a
constraint structure for the medical domain in [9]. A recent
way to represent the relations is a set of XML tags called
TimeML [10]. Software has been developed [11, 12] that
identifies events and time expressions, and then generates
relations among them. Most events are generally represented
by verbs in the document, but also include some nouns. The
link or relation generation is done differently depending on the
system, but uses a combination of rule-based components and
classifiers.
We are also interested in named entity recognition (NER),
i.e. the extraction of words and strings of text within
documents that represent discrete concepts, such as names
and locations [13]. Successful approaches include Hidden
Markov Models (HMMs), Conditional Random Fields (CRFs),
Maximum Entropy models, Neural Networks and Support
Vector Machines (SVMs). The extraction of named entities
continues to invite new methods, tools and publications.
SVMs [14] have shown significant promise for the task
of NER. Starting with the CoNLL 2002 shared task on
NER, SVMs have been applied to this problem domain with
improving performance. Bhole et al. [15] report extracting
named entities from Wikipedia and relating them in time using
SVMs. In the biomedical domain, Takeuchi and Collier [16]
have demonstrated that SVMs outperform HMMs for NER.
Lee et al. [17] and Habib [18–20] have used SVMs for the
multi-class NER problem. Isozaki and Kazawa [21] show that
chunking and part-of-speech tagging also provide performance
increases for NER with SVMs, both in speed and in accuracy.
Dakka and Cucerzan [22] demonstrate high accuracy using
SVMs for locational entities in Wikipedia articles. Mansouri
et al. [23] demonstrate excellent performance on NER using
what are called fuzzy SVMs. Thus, as attested by many
published papers, SVMs provide good performance on NER
tasks. HMMs have also shown excellent results in the NER
problem. Klein et al. [24] demonstrate that a character-level
HMM can identify both English and German named entities
fairly well for location entities in testing data. Zhou and Su [25]
evaluate an HMM and an HMM-based chunk tagger on English NE
tasks, achieving extremely high detection performance. CRFs
have been extremely successful in NER [26, 27]. With publicly
available CRF NER tools, it is possible to achieve excellent
results.
Our particular interest is in extraction of named geospatial
entities. Chen et al. [28] perform NER with a ‘regularized
maximum entropy classifier with Viterbi decoding’ and achieve
over 88% detection with regard to geospatial entities. Our goal
is to achieve high recall during initial extraction. NER precision
for geospatial entities can be sacrificed since we disambiguate
and geocode such entities. Disambiguation is necessary since
the same name may refer to several geospatial entities in
different parts of the world. Geocoding obtains latitude and
longitude information for a geospatial entity on the map of
the world. If an identified geospatial named entity cannot be
geocoded, it is assumed to be mis-identified and discarded.
Scharl and Tochtermann [29] use an interesting disambiguation
process in their geospatial resolution process. Nearby named
entities are used to disambiguate more specific places. An
example is a state being used to determine exactly which city is
being referenced. In addition, NEs previously extracted in the
document are also used to reinforce the score of a particular
instance of a place name. For example, if a particular region
in the USA has been previously referenced and a new named
entity arises that could be within that region or in Europe, it will
be weighted more toward being the instance within that region
in the USA. The basic design of our graphical user interface
(GUI) is based on the GUI referenced in [28].
Recently, interest in extracting and displaying spatiotemporal information from unstructured documents has been on
the rise. For example, Strötgen et al. [30] discuss a prototype
system that combines temporal information retrieval and
geographic information retrieval. This system also displays the
information. Cybulska and Vossen [31, 32] describe a historical
information extraction system where they extract event and
participant information from Dutch historical archives. They
extract information using what they call profiles. For example,
they have developed 402 profiles for event extraction although
they use only 22 of them in the reported system. For extraction
of participants, they use 314 profiles. They also use 43
temporal profiles and 23 location profiles to extract temporal
and locational information. Profiles are created using semantic
and syntactic information as well as information gleaned from
WordNet [33].
Ligozat et al. [34] present a system that takes a simple
natural language description of a historical scene, identifies
landmarks and performs matching with a set of choremes,
situates the choremes in appropriate locations and displays the
choremes. The details of the system are not discussed in detail.
Choremes [35, 36] are icons that can be used to represent
specific spatial configurations and processes in geography. In
layman's terms, they are like an elaborate set of graphical signs
used on roads and highways to indicate things like a road turning
in a certain direction, an intersection coming up, etc. Ligozat
et al. [37] describe another system that monitors SMS messages
sent by individuals describing events taking place in a compact
geographical area, discovers relations among these events and
reports on the severity of these events. They consider events
to be entities with a temporal span and a spatial extent and
describe them using a representational scheme that allows for
both [38]. They display events of interest in an animated manner
using the approach described in [34]. Events are represented
schematically, but can have texts, sounds and lights associated
with them. The most interesting part of their system is the
representation of space and time. Time is represented using
Allen's interval calculus [8]. Space is quantized into rectangles,
and the relationships among these rectangles are expressed
using a calculus that mimics the relationships used in the
representation of time, ensuring that both time and space are
treated in a similar manner.
Computer Science literature is by no means the only
place where aspects of the work being discussed in this
paper are pursued. The field of Geographical Information Systems
(GIS) is one such area. A GIS is a system designed to capture,
store, manipulate, analyze, manage and present geographic
information. Traditionally, such information has been used in
cartography, resource management and scientific work, and in
an increasing manner with mobile computing platforms like cell
phones and tablets for mapping and location-aware searches and
advertisements. Books by Rigaux et al. [39] and Clarke [40]
present the main theoretical concepts behind these systems,
namely spatial data models, algorithms and indexing methods.
As in many fields, with the tremendous increase in the amount
of geographic information available in recent years,
techniques from statistical analysis, data mining, information
retrieval and machine learning are becoming tools in the quest to
discover useful patterns and nuggets of geographic knowledge
and information. Development and use of geographic ontologies
to facilitate geographic information retrieval is an active area
of research [41–44]. Fu et al. [45] describe the use of an
ontology for deriving the spatial footprints of a query, focusing
on queries that involve spatial relations. Martins [46] describes
issues in the development of a location-aware Web search
engine. He creates a geographic inference graph of the places
in the document and uses the PageRank algorithm [47] and
the HITS algorithm [48] to associate the main geographical
contexts/scopes for a document. Martins [46] explores the idea
of returning documents in response to queries that may be
geographically interesting to the individual asking the query.
This means queries must be assigned geographic scopes as
well and the geographic scope of the query should be matched
with the geographic scopes of the documents returned. It also
discusses ranking search results according to geographical
significance.
3. OUTLINE OF OUR WORK
The work reported in this paper is carried out in several
phases. First, we extract all events from a document. Next, we
identify significant events in the text. We follow by identifying
associated named entities in the text. We process the geospatial
entities further for disambiguation and geolocation. Finally,
we create a display with maps and timelines to represent the
information extracted from the documents. The activities we
carry out in the system reported in this paper are listed below.
(i) Extract events from texts.
(ii) Identify important events from among the events extracted.
(iii) Extract temporal relations among identified important events.
(iv) Extract named entities in the texts, including those associated with important events.
(v) Disambiguate locational named entities and geocode them.
(vi) Create a display with a map and timeline showing relationships among important events.

FIGURE 2. Overall architecture.

We discuss each one of these topics in detail in the rest of the
paper. Figure 2 shows the overall processing architecture and
segments of the system, including the visualization subsystem.
The operation of the entire system is based on an initial NER
processing of the document to extract PER (people), LOC
(location) and ORG (organization) named entities. Temporal
information, document structure and NE information are used to
identify important events. In parallel, the LOC named entities
are geocoded and grounded to a single specific location. If a
LOC entity is ambiguous, document information is used to ground
it to the correct geospatial location. Once the LOC and event
processing is complete, all of the information is stored in a
database. We have also developed a map and timeline front-end
that allows users to query documents and visualize the geospatial
and temporal information on a map and timeline.

4. EXTRACTING EVENTS FROM TEXTS
Automatically identifying events in a textual document enables
a higher level of understanding of the document than only
identifying named entities.
The usual approach to identifying events in a text is by
providing a set of relations along with associated patterns that
produce their natural language realization, possibly in a domain-dependent manner. For example, [49–51] are able to provide
answers to queries about individual facts or factoid questions
using relation patterns. Others [52, 53] take a deeper approach
that converts both the query text and the text of candidates for
answers into a logical form and then use an inference engine to
choose the answer text from a set of candidates. Still others [54]
use linguistic contexts to identify events to answer factoid
questions. For non-factoid questions, where there are usually
no unique answers and which may require dealing with several
events, pattern matching does not work well [55, 56].
FIGURE 3. Portion of an article with events tagged by EVITA (words followed by e_i). We show only the event tags here.

FIGURE 4. A portion of the XML representation of Fig. 3.
Evita [57] does not use any pre-defined patterns to identify
the language of event occurrence. It is also not restricted to any
domain. Evita combines linguistic and statistical knowledge for
event recognition. It is able to identify grammatical information
associated with an event-referring expression such as tense,
aspect, polarity and modality. Linguistic knowledge is used to
identify local contexts such as verbal phrases and to extract
morphological information. The tool has a pre-processing
stage and uses finite-state machine-based and Bayesian-based
techniques.
We run the Evita program1 on the article, which labels
all possible events in the TimeML format. EVITA recognizes
events by first preprocessing the document to tag parts of speech
and chunk it, then examining each verb, noun and adjective
for linguistic (using rules) or lexical (using statistical methods)
properties that indicate it is an event. Instances of events also
have more properties that help establish temporal relations
in later processing steps; these properties include tense and
modality. Specifically, Evita places XML ‘EVENT’ tags around
single words defining events; these may be verbs or nouns,
as seen in Figs 1, 3 and 4. Each event has a class; in this
case, ‘fought’ is an occurrence. Other classes include states and
reporting events. For our purposes, occurrences will be most
important because those are the kinds of events generally shown
on timelines.
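
As an illustration of how this output can be consumed downstream, the short sketch below pulls EVENT annotations out of TimeML-style markup with regular expressions. The attribute names (eid, class) follow the TimeML convention shown in Fig. 4; the exact markup emitted by a given Evita/TARSQI version may differ, and the sample sentence is only a stand-in for real output.

import re

# Minimal sketch: pull EVENT annotations out of TimeML-style markup.
EVENT_RE = re.compile(r'<EVENT\b([^>]*)>(.*?)</EVENT>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def extract_events(timeml_text):
    """Return (word, attributes) pairs for each tagged event."""
    events = []
    for attr_str, word in EVENT_RE.findall(timeml_text):
        events.append((word.strip(), dict(ATTR_RE.findall(attr_str))))
    return events

# Hypothetical snippet in the style of Figs 3 and 4.
sample = 'The two armies <EVENT eid="e1" class="OCCURRENCE">fought</EVENT> for three days.'
for word, attrs in extract_events(sample):
    if attrs.get('class') == 'OCCURRENCE':   # the class most useful for timelines
        print(word, attrs.get('eid'))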
For event extraction, there were a few other alternatives,
although we found that for our task EVITA's results were
comparable; EVITA is also easily downloadable and usable.
The other systems include TRIPS-TRIOS [58, 59] and
TIPSem [60]. TRIPS-TRIOS parses the raw text using a parser
that produces a deep logical form (LF) and then extracts
events by matching about 100 hand-coded semantic extraction
patterns against the parsed phrases. TIPSem uses CRFs [26, 61].
1 The entire TARSQI Toolkit is freely downloadable upon email request; see
http://timeml.org/site/tarsqi/toolkit/download.html.
5. IDENTIFYING IMPORTANT EVENTS
When storing events in our database, we currently associate
an event with its containing sentence, saving the entire line.
Each event/sentence is classified as important or not using a
classifier trained with a set of mostly word-level and sentence-level features.
We experimented with several SVM-based classifiers to
identify important events/sentences. We also experimented with
choices of features for the classifiers.
Several features are concerned with the characteristics of
the event word in the sentence (part of speech, grammatical
aspect, distance from a named entity, etc.) or the word by itself
(length). On the other hand, some features relate to the sentence
as a whole (for example, presence of negation and presence
of digits). The features are based on practical intuition. For
example, low distance from a named entity, especially from
a named entity that is mentioned often, probably indicates
importance because a reader cares about events that the named
entity is involved in. This is opposed to more generic nouns,
or abstract nouns, which generally indicate less importance. For
example, consider the sentence from the Wikipedia article on the
Battle of Gettysburg: ‘Lee led his army through the Shenandoah
Valley to begin his second invasion of the North’. This sentence
describes an important event and Lee is a named entity that
occurs often in this article. Intuitively speaking, negation is
usually not associated with importance since we care about what
happened, not what did not. Grammatical aspect, in particular,
progressive, seems less important because it tends to pertain to a
state of being, not a clearly defined event. For example, consider
another sentence from the Battle of Gettysburg article: ‘The 26th
North Carolina (the largest regiment in the army with 839 men)
lost heavily, leaving the first day’s fight with around 212 men’.
This has 'leaving' in the progressive aspect, and the sentence
does not really depict an event so much as a state. Longer words
are sometimes more descriptive and thus more important, while
nouns and verbs are usually more important than adjectives.
Digits are also a key factor of importance as they are often part
of a date or statistic. Similar features have been used by other
researchers when searching for events or important events or
similar tasks [5, 58, 59, 62].
We also experimented with features that relate to the article
as a whole such as the position of a sentence in the document
although we did not find these as useful. We use the similarity of
the event word to article ‘keywords’. These keywords are taken
as the first noun and verb or first two nouns of the first sentence of
the article. These were chosen because the first sentence of these
historical narratives often sums up the main idea and will often
therefore contain important words. In the articles of our corpus,
words such as ‘war’, ‘conflict’ and ‘fought’ are often used to
specify salient topics. An event word's similarity to one of these
words may have a bearing on its importance. The decision
to use two keywords helps in case one of the words is not a
good keyword; only two are used because finding similarity
to a keyword is expensive in time. Similarity is measured using
the 'vector pairs' measure from the WordNet::Similarity
Perl module from [63]. As explained in the documentation for
this Perl module,2 it calculates the similarity of the vectors
representing the glosses of the words to be compared. This
measure was chosen because it was one of the few that can
calculate similarity between words from different parts of
speech. Given two word senses, this module looks at their
WordNet glosses and computes the amount of similarity. It
implements several semantic relatedness measures including
the ones described by Leacock and Chodorow [64], Jiang and
Conrath [65], Resnik [66, 67], Hirst and St-Onge [68], Wu
and Palmer [69], and the extended gloss overlap measure by
Banerjee and Pedersen [70].
After initial experimentation, we changed some word-level
features to sentence-level features by adding, averaging or
taking the maximum of the word-level features for each event in
the sentence. A feature added was based on TextRank [71], an
algorithm developed by Mihalcea and Tarau that ranks sentences
based on importance and is used in text summarization.
TextRank works using the PageRank algorithm developed by
Page et al. [47]. While PageRank works on web pages and
the links among them, TextRank treats each sentence like a
page, creating a weighted, undirected graph whose nodes are
the document’s sentences. Edge weights between nodes are
determined using a function of how similar the sentences are.
After the graph is created, PageRank is run on it to rank the
nodes in order of essentially how much weight points at them
(taking into account incident edges and the ranks of the nodes
on the other sides of these edges). Thus, sentences that are
in some way most similar to most other sentences get ranked
highest. We wrote our own implementation of TextRank with
our own function for similarity between two sentences. Our
function automatically gives a weight of essentially 0 if either
sentence is shorter than a threshold of 10 words, which we
chose based on the average sentence length in our corpora. For
all others, it calculates the distance between the sentences using
a simple approach. Each sentence is treated as a bag of words.
The words are stemmed. Then, we simply compute how many
words are the same between the two sentences. The similarity is
then chosen as the sum of the sentence lengths divided by their
distance. Named entities are also considered more important to
the process than before. Instead of just asking if the sentence
contains one, some measure of the importance of the named
entity is calculated and taken into account for the feature. This
is done by counting the number of times the named entity is
mentioned in the article (people being equal if their last names
are equal, and places being equal if their most specific parts are
equal). This total is then normalized for the number of named
entities in the article.

The final set of features used can be seen in Table 1.

TABLE 1. A final list of features for classifying events as important or not.

Important event classifier features
Presence of an event in the perfective aspect
Percent of events in the sentence with class 'occurrence'
Digit presence
Maximum length of any event word in the sentence
Sum of the Named Entity 'weights' in the sentence (NE weight being the number of times this NE was mentioned in the article divided by the number of all NE mentions in the article)
Negation presence in the sentence
Number of events in the sentence
Percent of events in the sentence that are verbs
Position of the sentence in the article normalized by the number of sentences in the article
Maximum similarity of any event word in the sentence
Percent of events in the sentence that are in some past tense
TextRank rank of the sentence in the article, divided by the number of sentences in the article
Number of 'to be' verbs in the sentence

2 http://search.cpan.org/dist/WordNet-Similarity/doc/intro.pod.
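
To make the classification step concrete, the sketch below trains an SVM on sentence-level vectors of the kind listed in Table 1. It is a minimal illustration using scikit-learn rather than the exact toolkit and feature-extraction code used in our experiments; the feature values are placeholders that stand in for the computed features.

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Each row is one sentence; the columns stand in for features from Table 1
# (digit presence, number of events, NE weight sum, TextRank rank, ...).
# The values below are illustrative placeholders, not real data.
X = [
    [1, 3, 0.12, 0.05],
    [0, 1, 0.01, 0.90],
    [1, 2, 0.08, 0.20],
    [0, 0, 0.00, 0.75],
]
y = [1, 0, 1, 0]   # 1 = important sentence, 0 = not important

clf = SVC(kernel='rbf', class_weight='balanced')   # balance the skewed classes
print(cross_val_score(clf, X, y, cv=2, scoring='f1'))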
6. IDENTIFYING TEMPORAL RELATIONS
We experimented with three different ways to identify temporal
relations among events in a document. The first approach used
the TARSQI toolkit to identify temporal relations among events.
The second approach was the development of a large number
of regular expressions. We also downloaded Heideltime [72]
and integrated it with our system. Heideltime is also a regular
expression-based system, written in a modular manner. Since
the results of our regular expressions were comparable to
Heideltime’s (although our expressions were a little messier)
and we had developed and integrated our regular expression
module before Heideltime became available for download,
we decided to keep our module. However, we can switch to
Heideltime if the need arises.
6.1. Using TARSQI to extract occurrence times
We use the TARSQI Toolkit (TTK) to create event–event and
event–time links (called TLINKs) in addition to identifying
events and time expressions in a document. The time expressions
include attributes like the type of expression (DATE, TIME, etc.)
and a value. Some events are
anchored to time expressions, and some to other events. These
are represented by the TLINK XML tag, which has attributes
including the type of relation. An example of each kind of
TLINK is shown in Fig. 5. The first represents that the event
with ID 1 is related to the time with ID 3 by ‘before’; the second
also represents 'before', between events 5 and 6.

FIGURE 5. Example TLINK elements.
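
One way to read such links out of the TTK output is sketched below. The attribute names (relType, eventInstanceID, relatedToTime, relatedToEventInstance) follow the TimeML convention and should be checked against the actual toolkit output.

import xml.etree.ElementTree as ET

def extract_tlinks(timeml_xml):
    """Collect (source, relation, target) triples from TLINK elements."""
    root = ET.fromstring(timeml_xml)
    links = []
    for tlink in root.iter('TLINK'):
        source = tlink.get('eventInstanceID') or tlink.get('timeID')
        target = tlink.get('relatedToEventInstance') or tlink.get('relatedToTime')
        links.append((source, tlink.get('relType'), target))
    return links

doc = """<TimeML>
  <TLINK lid="l1" relType="BEFORE" eventInstanceID="ei1" relatedToTime="t3"/>
  <TLINK lid="l2" relType="BEFORE" eventInstanceID="ei5" relatedToEventInstance="ei6"/>
</TimeML>"""
print(extract_tlinks(doc))   # [('ei1', 'BEFORE', 't3'), ('ei5', 'BEFORE', 'ei6')]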
6.2. Using regular expressions to extract occurrence times
Our second approach applied regular expressions to extract
times. The results of using extensive regular expressions versus
using the TTK showed that the regular expressions pull out
many more complete dates and times. For example, in the text
given in Figs 1 and 3, TTK only finds 1862, while the regular
expressions would find a range from 11 December 1862 to 15
December 1862. Because of this, we decided to use our own
program based on regular expressions rather than TTK. The
regular expressions in our program are built from smaller ones
representing months, years, seasons, and so on. We use 24 total
combinations of smaller expressions, some of which recursively
tested groups they matched for subexpressions. One example of
an expression is
(early |late |mid-|mid |middle |the end of
|the middle of |the beginning of |the start
of )?(Winter|Spring|Summer|Fall|Autumn)
(of )? (((([12][0-9])|([1-9]))
[0-9][0-9])|(’?[0-9][0-9]))
which finds expressions like '[By] early fall 1862 [. . .]', a type of
phrase that occurs in the corpus. A flaw present in our program
and not in TTK is that it picks out only one time expression
(point or interval) per sentence. This is consistent with our
current view of events as sentences, although it would arguably
be better to duplicate the sentence’s presence on a timeline while
capturing all time expressions present in it.
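
To illustrate how the smaller expressions are combined, the fragment below compiles a season/year pattern similar to the one above and scans a sentence with it. The sub-patterns are simplified and the full set of 24 combinations is not reproduced.

import re

# Simplified sketch of one combined temporal pattern: the sub-expressions
# for modifiers, seasons and years are built separately and then joined,
# mirroring (but not reproducing) the full set of 24 patterns.
MODIFIER = r'(?:early |late |mid-|the end of |the beginning of )?'
SEASON = r'(?:Winter|Spring|Summer|Fall|Autumn)'
YEAR = r"(?:[12][0-9][0-9][0-9]|'?[0-9][0-9])"
SEASON_YEAR = re.compile(MODIFIER + SEASON + r'(?: of)? ' + YEAR, re.IGNORECASE)

match = SEASON_YEAR.search('By early fall 1862 the campaign had stalled.')
print(match.group(0) if match else 'no temporal expression')   # early fall 1862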
To anchor time expressions to real times—specifically to
a year—we use a naïve algorithm that chooses the previous
anchored time’s year. We heuristically choose the most probable
year out of the previous anchored time’s year and the two
adjacent to it by looking at the difference in the month
if available. For example, a month that is more than 5 months
after the month of the anchoring time is statistically in the next
year rather than the same year (with ‘The first attack occurred
in December 1941. In February, the country had been invaded’,
it is probably February 1942). In addition to the year, if the time
needing to be anchored lacks more fields, we fill them with as
many corresponding fields from the anchoring time as possible.
Each time expression extracted is considered to have a
beginning and an end, at the granularity of days (though we
do extract times when they are present). Then the expression
‘7 December 1941’ would have the same start and end point,
while the expression ‘December 1941’ would be considered
to start on December 1 and end on December 31. Similarly,
modifiers like ‘early’ and ‘late’ change this interval according to
empirical evidence; for example, ‘early December’ corresponds
to December 1 to December 9. While these endpoints are
irrelevant to a sentence like, ‘The Japanese planned many
invasions in December 1941’, it is necessary to have exact start
and end points in order to plot a time on a timeline. Thus, while
the exact start and end days are often arbitrary for meaning, they
are chosen by the extractor.
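
A sketch of this interval assignment is shown below. The day range for 'early' is the empirical choice given above; the range used for 'late' is an assumption for illustration, and only the month-level case is handled.

import calendar
from datetime import date

def month_interval(year, month, modifier=None):
    """Map a (possibly modified) month expression to a (start, end) date pair.

    'December 1941' covers the whole month; a full date such as
    '7 December 1941' would simply get identical start and end points.
    """
    last_day = calendar.monthrange(year, month)[1]
    start, end = date(year, month, 1), date(year, month, last_day)
    if modifier == 'early':
        end = date(year, month, 9)      # empirical choice described in the text
    elif modifier == 'late':
        start = date(year, month, 21)   # assumed value; the text gives none
    return start, end

print(month_interval(1941, 12))             # full month
print(month_interval(1941, 12, 'early'))    # 1-9 December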
Some expressions cannot be given a start or end point at all.
For example, ‘The battle began at 6:00 AM’ tells us that the
start point of the event is 6:00 AM but says nothing about the
end point. Any expression like this takes on the start or end
point of the interval for the entire event the article describes
(for example, the Gulf War). This interval is found by choosing
the first beginning time point and first ending time point that
are anchored directly from the text, which is likely to yield
the correct span. While this method is reasonable for
some expressions, many of them have an implicit end point
somewhere else in the text, rather than stretching until the end
of the article’s event. It would likely be better to instead follow
the method we use for giving times to sentences with no times
at all, described below.
After initial extraction of time expressions and finding a
tentative article span, we discard times that are far off from
either end point, currently using 100 years as a cutoff margin.
This helps avoid times that are obviously irrelevant to the event,
as well as expressions that are not actually times but look like
they could be (for example, ‘The bill was voted down 166–269’
which looks like a year range). We also discard expressions with
an earlier end date than start date, which helps avoid the latter
problem.
Despite the thoroughness of the patterns we look for in the
text, the majority of sentences still have no times at all. However,
they may still be deemed important by the other part of our work,
and so must be able to be displayed on a timeline. Here, we
exploit the characteristic of historical descriptions that events
are generally mentioned in the order in which they occurred. For
a sentence with no explicit time expression, the closest (text-wise) sentence on either side that does have a time expression
is found, and the start times of those sentences are used as the
start and end time of the unknown sentence. There are often
many unknown sentences in a row; each one’s position in this
sequence is kept track of so that they can be plotted in this order,
despite having no specific date information outside the interval,
which is the same for them all.

FIGURE 6. Flow of time in a few sentences from the Battle of Gettysburg article.

For example, the first sentence of
the passage in Fig. 6 sets a start point, June 9, for the subsequent
events. The year for this expression would be calculated using
a previously found time expression, possibly the first one in the
article. There are various sentences in the passage following
this one that do not contain time expressions, some of which
may be considered important enough to place on the timeline—
for example, ‘Alfred Pleasonton’s combined arms force of two
cavalry divisions (8000 troopers) and 3000 infantry, but Stuart
eventually repulsed the Union attack’. In context, a human can
tell that this also takes place on June 9; the system places it
(correctly) between June 9 (the previously found expression)
and June 15, the date, which appears next.
Sometimes the start and end time we get in this manner are
invalid because the end time is earlier than the start time. In this
case, we look for the next possible end time (the start time of
the next closest sentence with a time expression). If nothing can
be found, we use a default end point. This default end point is
chosen as the last time in the middle N of the sorted times, where
N is some fraction specified in the program (currently chosen
as 1/3). We do this because times tend to be sparse around the
ends of the list of sorted times, since there are often just a few
mentions of causes or effects of the article’s topic.
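
A minimal sketch of the neighbor-based assignment described above is given below, assuming sentences are kept in document order with an optional start/end interval each; the fallback to the next possible end time and the default end point are omitted for brevity.

def fill_missing_times(sentences):
    """sentences: ordered list of dicts with optional 'start'/'end' dates.

    For each untimed sentence, use the start time of the closest timed
    sentence before it and after it as its (start, end) interval, relying
    on the largely chronological order of historical narratives. The
    fallback and default-end-point handling described above are omitted.
    """
    timed = [i for i, s in enumerate(sentences) if s.get('start') is not None]
    for i, s in enumerate(sentences):
        if s.get('start') is not None:
            continue
        before = max((j for j in timed if j < i), default=None)
        after = min((j for j in timed if j > i), default=None)
        if before is not None and after is not None:
            s['start'] = sentences[before]['start']
            s['end'] = sentences[after]['start']
            s['order'] = i - before   # preserves relative order on the timeline
    return sentences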
7. EXTRACTING NAMED GEOSPATIAL AND OTHER NAMED ENTITIES
The next goal is to extract all geospatial named entities from the
documents. In particular, we are interested in extracting NEs
that coincide with the important events or sentences identified
in the documents. We experimented with three approaches
to extracting geospatial named entities from the texts of
documents. The first approach uses an SVM which is trained
to extract geospatial named entities. The second approach uses
an HMM-based named entity recognizer. The third approach
uses a CRF-based (Conditional Random Field) named entity
recognizer. A by-product of the second and third approaches is
that we obtain non-geospatial named entities as well.
The reason for experimenting with the three methods is that
we wanted to evaluate if training an SVM on just geospatial
entities would work better than off-the-shelf HMM or CRF
tools. Precision, recall and F -measure are standard metrics used
to evaluate the performance of a classifier, especially in natural
language processing and information retrieval. After a trained
classifier has been used to classify a set of objects, precision is
the fraction of objects assigned to a class that actually belong to
that class. Recall is the fraction of objects that belong to a specific
class in the test dataset and are correctly classified by the
classifier. When there
are several classes under consideration, precision and recall can
be obtained for each class and averaged. Precision and recall are
often inversely proportional to each other and there is normally
a trade-off between these two ratios. The F -measure mixes the
properties of the previous two measures as the harmonic mean
of precision and recall [73, 74]. If we want to use only one
accuracy metric as an evaluation criterion, F -measure is the
most preferable.
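
In the standard formulation, for a class with TP true positives, FP false positives and FN false negatives,

P = TP / (TP + FP),    R = TP / (TP + FN),    F = 2PR / (P + R).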
7.1. SVM-based extraction of geospatial entities
This effort is discussed in detail in [75–77]. An SVM [14, 78, 79]
is an example of a type of supervised learning algorithm
called a classifier. A classifier finds the boundary function(s)
between instances of two or more classes. A binary SVM
works with two classes, but a multi-class SVM can learn the
functions for the discriminant or separator functions for several
classes at the same time. A binary SVM attempts to discover
a wide-margin separator between two classes. The SVM can
learn this separator, but to do so, it must be presented with
examples of known instances from the two classes. Shown
a number of such known instances, the SVM formulates the
problem of classification as solving a complex optimization
problem. To train an SVM, each instance must be described
in terms of a number of attributes or features. A multi-class
SVM is a generalization of the binary SVM.
The desired output of this stage of processing is a set of words
and strings that may be place names. To define the features
describing the geospatial NEs, we draw primarily from the work
done for word-based feature selection by [18–20]. They used
lexical, word-shape and other language-independent features
to train an SVM for the extraction of NEs. Furthermore, the
work done by Cucerzan et al. [80] on extracting NEs from
Wikipedia provided further features. For feature set selection,
we postulate that each word in the geospatial NE should be
independent, and that the training corpus should be split into
individual words. A word can be labeled as beginning, middle
or end of a named entity for multi-word named entities. After
testing various combinations of features, we used the following
features for each word to generate the feature vectors for the
SVM, with equal weighting on all features:
(i) Word length: alphanumeric character count, ignoring
punctuation marks.
(ii) Vowel count, consonant count, and capital count in the
word.
(iii) First capital: if the word begins with a capital letter.
(iv) If the word contains two identical consonants together.
(v) If the word ends with a comma, a period, a hyphen, a
semi-colon or a colon.
(vi) If the word begins or ends with a quote.
(vii) If the word begins or ends with a parenthesis.
(viii) If the word is plural, or a single character.
(ix) If the word is alphanumeric, purely numeric, purely text,
all uppercase or all lowercase.
(x) If the word has a suffix or prefix.
(xi) If the word is a preposition, an article, a pronoun, a
conjunction, an interrogative or an auxiliary verb.
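
As an illustration (not the exact implementation), the function below computes a handful of these word-shape features for a single token; the remaining features in the list above are encoded analogously.

def word_features(word):
    """Sketch of a word-shape feature vector in the spirit of the list above.

    Only a subset of the features is shown; suffix/prefix presence,
    stop-word classes, quotes, parentheses, etc. would be encoded similarly.
    """
    vowels = 'aeiou'
    return [
        sum(c.isalnum() for c in word),                                  # word length
        sum(c.lower() in vowels for c in word),                          # vowel count
        sum(c.isalpha() and c.lower() not in vowels for c in word),      # consonant count
        sum(c.isupper() for c in word),                                  # capital count
        int(word[:1].isupper()),                                         # first capital
        int(any(a == b and a.isalpha() and a.lower() not in vowels
                for a, b in zip(word, word[1:]))),                       # doubled consonant
        int(word.endswith((',', '.', '-', ';', ':'))),                   # trailing punctuation
        int(word.isdigit()),                                             # purely numeric
    ]

print(word_features('Gettysburg,'))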
7.2. HMM-based approach using LingPipe
HMMs have shown excellent results in the task of NER.
An HMM [81] is also a type of classifier that can learn to
discriminate between new and unknown instances of two (or
more classes), once it has been trained with known and labeled
examples of the classes. Unlike an SVM, an HMM is a sequence
classifier, in the sense that an HMM classifies individual objects
into classes, but in doing so, it takes into account characteristics
of objects or entities that occur before and after the current
entity that is being classified. An HMM is a graphical model
of learning that visualizes a sequence as a set of nodes, with
transitions among them. Transitions have probabilities on them.
When the system transitions into a state, it is assumed to emit
certain symbols with associated probabilities. Learning involves
finding a sequence of ‘hidden’ states that best explains a certain
emitted sequence.
The idea of sequence classification is useful in many natural
language processing tasks because the classification of an
instance (e.g. a word as describing an event or not) may depend
on the nature of words that occur before or after it in a sentence.
An SVM can also take into account what comes before and after
the object being classified, but the idea of sequence has to be
handled in an indirect manner. One must note that sequencing
does not always make sense, but in some situations, it is
natural and important. Klein et al. [24] and Zhou and Su [25]
have demonstrated excellent performance in extracting named
entities using HMMs. We chose to use the HMM implemented
by the LingPipe library for NER.3
3 http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html.
7.3. CRF-based approach
We also used the named entity extraction tool designed by the
Stanford Natural Language Processing Group.4 This package
uses a CRF for NER which achieved very high performance for
all named entities when tested on the CoNLL 2003 test data [82].
The Stanford Web site states that the CRF was trained on
‘data from CoNLL, MUC6, MUC7 and ACE’.5 These datasets
originate from news articles from the USA and the UK, making
it a strong model for most English domains.
CRFs [26, 61] provide a supervised learning framework to
build statistical models using sequence data. A CRF is an undirected graphical model that defines a single log-linear distribution over label sequences, given a particular observation
sequence. HMMs make certain strong independence assumptions among neighboring samples, which can be relaxed by
CRFs. CRFs allow encoding of known relationships among
observations in terms of likelihoods and construct consistent
interpretations. The linear chain CRF is popular in natural language processing for predicting a sequence of labels for a
sequence of input objects.
8. GROUNDING GEOSPATIAL NAMED ENTITIES
Each of the three learning-based methods extracts a set of
candidate geospatial NEs from the article text. For each
candidate string, the second objective is to decide whether
it is a geospatial NE, and to determine the correct (latitude,
longitude), or (φ, λ), coordinate pair for the place name in the
context of the article, which is often referred to as grounding
the named entity.
To resolve the candidate NE, a lookup is made using
Google Geocoder.6,7 If the entity reference resolves to a single
geospatial location, no further action is required. Otherwise, the
context of the place name in the article, a novel data structure
and a rule-driven algorithm are used to decide the correct spatial
location for the place name.
Our research targeted the so-called context locations, defined
as ‘the geographic location that the content of a web resource
describes' [83]. Our task is close to that of word sense
disambiguation, as defined by Cucerzan [80], except that we
consider the geospatial context and domain instead of the lexical
context and domain. Sehgal et al. [84] demonstrate good results
for geospatial entity resolution using both spatial (coordinate)
and non-spatial (lexical) features of the geospatial entities. Zong
et al. [85] demonstrate a rule-based method for place name
assignment, achieving a precision of 88.6% on disambiguating
place names in the United States, from the Digital Library for
Earth System Education (DLESE) metadata.
4 http://nlp.stanford.edu/.
5 Stanford CRF – README.txt.
6 http://code.google.com/apis/maps/documentation/geocoding/index.html.
7 Google Geocoder was chosen because of accessibility and accuracy of the
data, as this research does not focus on the creation of a geocoder.
While related to the work done by Martins et al. [86], we
apply a learning-based method for the initial NER task, and only
use the geocoder/gazetter for lookups after the initial candidate
geospatial NEs have been identified. We also draw on our earlier
work in this area, covered in [77].
The primary challenge to the resolution and disambiguation
is that multiple names can refer to the same place, and multiple
places can have the same name. The statistics for these place
name and reference overlaps are given in Table 2, from [87].
The table demonstrates that the largest area with ambiguous
place names is North and Central America, the prime areas for
our research. Most of the name overlaps are in city names,
and not in state/province names. These statistics are further
supported by the location name breakdowns on a few example
individual Wikipedia articles in Table 3. The details in this
table are broken down into countries, states, foreign cities,
foreign features, US cities and US features, where ‘features’
are any non-state or city geospatial features, such as rivers,
hills and ridges. The performance of the NER extraction and
grounding system is indicated. The Battle of Gettysburg, with
over 50% US city names, many of which overlap, resulted in the
lowest performance. The Liberty Incident in the Mediterranean
Sea, with <10% US city and state names combined, showed
significantly better performance.
8.1. Google Geocoder
We used Google Geocoder as the gazetteer and geocoder for
simplicity, as much research has already been done in this area.
TABLE 2. Places with multiple names and names applied to more than one place in the Getty Thesaurus of geographic names.

Continent                     % places with multiple names    % names with multiple places
North and Central America     11.5                            57.1
Oceania                       6.9                             29.2
South America                 11.6                            25.0
Asia                          32.7                            20.3
Africa                        27.0                            18.2
Europe                        18.2                            16.6
Vestavik [88] and D’Roza and Bilchev [89] provide an excellent
overview of existing technology in this area. Google Geocoder
provides a simple REST-based interface that can be queried over
HTTP, returning data in a variety of formats.
For each geospatial NE string submitted as a query, Google
Geocoder returns 0 or more placemarks as a result. Each
placemark corresponds to a single (φ, λ) coordinate pair. If the
address string is unknown, or another error occurs, the geocoder
returns no placemarks. If the query resolves to a single location,
Google Geocoder returns a single placemark. If the query string
is ambiguous, it may return more than one placemark. A country
code can also be passed as part of the request parameters to
bias the geocoder toward a specific country. Google Geocoder
returns locations at roughly four different levels, Country, State,
City and Street/Feature, depending on the address query string
and the area of the world.
In our system, the address string returned from Google
Geocoder is separated into the four parts, and each part is
stored separately with the coordinates. The coordinates are also
checked against the existing set of locations in the database for
uniqueness. For our system, coordinates had to be more than
1/10th of a mile apart to be considered a unique location.
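
A sketch of this uniqueness check, using the haversine great-circle distance and the 0.1 mile threshold stated above, is given below; the function and data layout are illustrative rather than the production code.

from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlam = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def is_new_location(lat, lon, known_coords, threshold=0.1):
    """A coordinate counts as unique only if no stored location lies within 0.1 mi."""
    return all(haversine_miles(lat, lon, klat, klon) > threshold
               for klat, klon in known_coords)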
8.2. Location tree data structure
The output from the Google Geocoder is fed through our
location tree data structure, which, along with an algorithm
driven by a simple set of rules, orders the locations by geospatial
division, and aids in the disambiguation of any ambiguous place
name.
The location tree operates at four levels of hierarchy, Country,
State/Province, City and Street/Local Geographic Feature. Each
placemark response from the geocoder corresponds to one of
these geospatial divisions. A separate set of rules governs the
insertion of a location at each level of the tree. For simple
additions to the tree, the rules are given below.
(1) If the location is a country, add it to the tree.
(2) If the location is a state or province, add it to the tree.
(3) If the location is a city in the USA, and the parent node
in the tree is a state, add it to the tree.
(4) If the location is a city outside the USA, add it to the
tree.
(5) If the location is a street or feature inside the USA, and
the parent node is a state or city, add it to the tree.
(6) If the location is a street or feature outside the USA, and
the parent node is either a country or city, add it to the tree.

TABLE 3. Details on place name categories for example Wikipedia articles. All numbers are percentages.

Article name            Foreign features   US features   Foreign cities   US cities   States   Countries   F-measure
Battle of Gettysburg    0                  31            1                51          15       2           68.2
War of 1812             3                  18            5                30          8        36          81.6
World War II            20                 0             17               1           0        62          87.0
Liberty Incident        25                 1             10               6           1        57          90.5

FIGURE 7. Example location tree showing duplicated node count.
Rules 3 and 5 are US-specific because Google Geocoder has
much more detailed place information for the US than other
areas of the world, requiring extra processing.
Any location that does not match any rule is placed on a
pending list. Each time a new location is added, the tree is
re-sorted to ensure that the correct hierarchy is maintained, and
each location on the pending list is rechecked to see if it now
matches the rules for insertion.
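
The insertion rules can be expressed compactly in code. The sketch below is a simplification: it checks only whether a suitable parent level is already present in the tree, rather than which specific parent node applies, and it omits the re-sorting and the ambiguity weighting described in the following paragraphs.

LEVELS = ('country', 'state', 'city', 'feature')

def can_insert(tree, loc):
    """Simplified encoding of insertion rules 1-6; loc is a dict with
    'level' (one of LEVELS) and 'in_usa' (bool)."""
    level, in_usa = loc['level'], loc['in_usa']
    parent_levels = {node['level'] for node in tree}
    if level in ('country', 'state'):
        return True                                          # rules 1 and 2
    if level == 'city':
        return 'state' in parent_levels if in_usa else True  # rules 3 and 4
    if level == 'feature':                                   # streets and local features
        if in_usa:
            return bool(parent_levels & {'state', 'city'})   # rule 5
        return bool(parent_levels & {'country', 'city'})     # rule 6
    return False

def build_tree(locations):
    """Repeatedly place locations; anything left pending never matched a rule."""
    tree, pending = [], list(locations)
    progress = True
    while progress:
        progress = False
        for loc in list(pending):
            if can_insert(tree, loc):
                tree.append(loc)
                pending.remove(loc)
                progress = True
    return tree, pending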
If the location returned from Google Geocoder has a single
placemark and matches the rules, it is placed in the tree. If the
location returned corresponds to multiple coordinates, another
set of calculations is required before running the insertion rule
set. First, any ambiguous locations are placed on the pending list
until all other locations are processed. This ensures that we have
the most complete context possible, and that all non-ambiguous
locations are in the tree before the ambiguous locations are
processed.
Once we finish processing all the non-ambiguous locations in
the candidate list, we have a location tree, and a set of pending
locations, some of which could fit more than one place on the
tree. As each node is inserted into the tree, a running count of
nodes and repeated nodes is kept. Using this information, when
each ambiguous location is considered for insertion, a weighting
is placed on each possible node in the tree that could serve as
the parent node for the location. As an example, we consider
the location 'Cambridge'. This could be either Cambridge, MA,
USA, or Cambridge, UK. The tree shown in Fig. 7 demonstrates
a pair of possible parent leaf nodes for 'Cambridge', with the
occurrence count for each node in the article text. The location
‘Cambridge’ can be added underneath either leaf node in the
tree. The weight calculation shown in Equation (1) determines
the correct node. The weight for node n is determined by
summing the insertion count for all the parent nodes of n in
the tree (up to three parents), and dividing by the total insertion
count for the tree. A parent node is set to zero if the tree does not
contain that node. The insertion count for each node is the total
number of occurrences of the location represented by that node
in the original article; so the insertion count for the whole tree
is equal to the number of all location occurrences in the article.

Weight_n = (Ct_country + Ct_state + Ct_city) / Ct_total.    (1)

FIGURE 8. Complete location tree for the Operation Nickel Grass Wikipedia article.
Calculating weight for the two possible parent nodes of
‘Cambridge’, we use 3 as the insertion count for the UK,
and 4 for the insertion count for Massachusetts, 2 from the
Massachusetts node and 2 from the USA node. Dividing both by
the insertion count of 7 for the entire tree, we find Weight_UK =
0.428 and Weight_MA = 0.572. As Weight_MA is the higher
of the two, 'Cambridge' is associated with Cambridge, MA.
The equation for the weight calculation is chosen based on
empirical testing. Our testing shows that we place 78% of the
ambiguous locations correctly in the tree. This does not factor
in unambiguous locations, as they are always placed correctly
in the tree.
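
The weight in Equation (1) is straightforward to compute once per-node insertion counts are available; the sketch below reproduces the Cambridge example with the counts read off Fig. 7.

def node_weight(counts, country, state=None, city=None):
    """Equation (1): insertion counts of the candidate parent chain divided by
    the total insertion count of the tree; missing levels contribute zero."""
    total = sum(counts.values())
    chain = sum(counts.get(name, 0) for name in (country, state, city) if name)
    return chain / total

# Insertion counts from the worked example (Fig. 7): UK = 3, USA = 2, Massachusetts = 2.
counts = {'UK': 3, 'USA': 2, 'Massachusetts': 2}
print(node_weight(counts, 'UK'))                       # weight of the UK parent chain
print(node_weight(counts, 'USA', 'Massachusetts'))     # the larger weight, so Cambridge, MA wins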
Figure 8 shows the complete location tree for the Operation
Nickel Grass Wikipedia article. Note that the Operation Orchard
node is the only incorrect node placed in the tree.
9. EXPERIMENTS, RESULTS AND ANALYSIS
For this research, we downloaded the English language pages
and links database from the June 18, 2008 dump of Wikipedia.
This download provides the full text of all the pages in the
English Wikipedia at the time of creation. This full text includes
all wiki and HTML markup. The download contains ∼7.2
million pages.8 A selection of 90 pages about battles and
wars were obtained and hand-tagged. The pages were obtained
by automatically crawling a Wikipedia index of battles9 and
dismissing extremely short articles. These pages resulted in
∼230 000 words of data, an average of 7% of which are
geospatial named entities. The pages also provided a total of
5000 sentences. Table 4 shows a selection of the articles we
work with.

8 http://download.wikimedia.org/enwiki/20080618/.
9 http://en.wikipedia.org/wiki/List_of_battles.

TABLE 4. Partial Wikipedia corpus article metrics.

Page title                    Word Ct.   Sentence Ct.   NE Ct.
Battle of Antietam            6167       400            260
Battle of Britain             10 611     512            229
Battle of the Bulge           7614       339            265
Battle of Chancellorsville    2979       137            133
Battle of Chantilly           1170       50             56
Battle of Chickamauga         2000       99             108
American Civil War            7961       397            298
First Barbary War             1906       88             97
Battle of Fredericksburg      2354       111            77
Battle of Gettysburg          5289       256            274
First Gulf War                11 562     895            674
Korean War                    10 570     462            495
Mexican-American War          5822       275            472
Operation Eagle Claw          1869       60             56
Operation Nickel Grass        1092       52             47
Attack on Pearl Harbor        3414       188            198
Battle of Shiloh              4899       253            157
Spanish–American War          4229       194            296
Battle of Vicksburg           3369       178            119
The Whiskey Rebellion         1475       65             49
World War II                  6775       331            661
It is necessary to pre-process the text of the articles before it
can be used for information extraction. We remove all infoboxes.
For Wikipedia-style links, the link text is removed, and replaced
with either the name of the linked page or with the caption text
for links with a caption. Links to external sites are removed.
Certain special characters and symbols that do not add any
information are removed as they tend to cause exceptions and
problems in string processing, e.g. HTML tags. The characters
we remove do not include those that would appear in names
such as accented characters, commas, periods, etc. The cleaned
primary text of each page is split into an array of individual
words, and the individual words are hand-tagged for various
purposes as necessary.
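The following is a minimal sketch of this kind of clean-up, assuming simple regular expressions over raw wiki markup; it covers only un-nested cases and is not the exact rule set used by the system.

    import re

    # Illustrative pre-processing of Wikipedia markup: drop infoboxes,
    # keep caption or page-name text for internal links, drop external
    # links and bare HTML tags. Nested templates are not handled.
    def clean_wikitext(text):
        text = re.sub(r'\{\{[Ii]nfobox[^{}]*\}\}', '', text)        # drop infoboxes
        text = re.sub(r'\[\[([^|\]]*)\|([^\]]*)\]\]', r'\2', text)  # keep caption text
        text = re.sub(r'\[\[([^\]]*)\]\]', r'\1', text)             # keep linked page name
        text = re.sub(r'\[https?://\S+( [^\]]*)?\]', '', text)      # drop external links
        text = re.sub(r'<[^>]+>', '', text)                         # drop HTML tags
        return text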
9.1. Extracting important events
We use supervised learning for extracting important events from
the historical documents. For this purpose, we require a set of hand-tagged articles indicating whether each event word
identified by EVITA is part of an important event. If an event
word is part of an important event, the sentence containing it
is considered important because the word contributes to this
importance. Because there is no good objective measure of
importance, we asked multiple volunteers to tag the articles
to avoid bias. Subjectivity was a real problem, however. The
most basic guideline was to judge whether one would want the
event on a timeline of the article. Other guidelines were to not tag a word if it described more of a state than an action, and to tag all event words that referred to the same event. Each article was
annotated by more than one person so that disagreements could
usually be resolved using a majority vote. Thirteen articles were
annotated in total and the breakdown of numbers of annotators
was: one article annotated by five people, four articles annotated
by four people, six articles annotated by three people and two
articles annotated by two people.
If we took one annotator’s tagging as ground truth and measured every other annotator’s accuracy against it, the average inter-annotator F1-score came out as 42.6%, indicating substantial disagreement. Using an alternative approach, we labeled
a sentence as important for the classifier if either (1) some event
in the sentence was marked important by every annotator, or
(2) at least half of the events in the sentence were marked
important by at least half of the annotators. Using these criteria,
over the 13 articles 696 sentences were labeled important and
1823 unimportant. The average F1-score between annotators
for this new method rose to 66.8% (lowest being 47.7% and
highest being 78.5%).
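A small sketch of this labelling rule is shown below; the data structures (a map from annotator to the set of event ids that annotator marked important, and the list of event ids occurring in a sentence) are assumptions made for illustration.

    # Sentence-level importance label derived from multiple annotators.
    def sentence_is_important(sentence_events, annotations):
        if not sentence_events:
            return False
        n = len(annotations)
        # (1) some event in the sentence was marked important by every annotator
        if any(all(event in marked for marked in annotations.values())
               for event in sentence_events):
            return True
        # (2) at least half of the sentence's events were marked important
        #     by at least half of the annotators
        agreed = [event for event in sentence_events
                  if sum(event in marked for marked in annotations.values()) >= n / 2]
        return len(agreed) >= len(sentence_events) / 2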
By using data gathered by multiple annotators and filtering it
in such a way, we are able to concentrate on events that multiple
people can agree are important. This is our attempt to avoid both overly focusing and overly diluting our training data. Adding more participants can impact our training data in a variety of ways that depend on the like-mindedness of the individuals;
the addition could increase our confidence in a particular
event being important, add to or drop events from acceptable
confidence levels, etc. Depending on our annotators, training
data could represent anything from a very focused method of
importance identification to a noisy over-saturated list of events.
Overall, the human training aspect is necessary and has the
potential to be improved, likely by adding a short phase of
mass-tagging potentially followed by periodic additions to the
training corpus. Although we currently use one generic set of
training data which we apply to text from any source, another
option is to develop various models trained on different sets of
text to achieve better performance across sources, i.e. utilizing
metadata along with the training text to allow classification
of the text based on time period, structure, source, etc. We
can generate multiple models trained on different subsets of
our tagged texts, grouped by their characteristics, and when
we apply the system to a new source of text, we can use
metadata, if it is available for the text, to choose the most relevant
model.
TABLE 5. SVM parameter searching test results (averaged over cross-validation sets).

c      g         w    F-score
1.0    1.0       1    29.0
1.0    1.0       2    46.1
1.0    0.5       2    46.9
1.0    0.25      2    47.5
1.0    0.125     2    48.0
1.0    0.0625    2    49.0
1.0    0.03125   2    48.0
1.0    1.0       3    46.2
1.0    0.5       3    48.9
1.0    0.25      3    50.1
1.0    0.125     3    51.0 (p: 41.06066, r: 75.0522, f: 51.0428)
1.0    0.0625    3    50.9
0.5    1.0       2    45.2
0.5    0.5       2    47.1
0.5    0.25      2    47.5
0.5    0.125     2    48.2
0.5    0.0625    2    48.1
0.5    1.0       3    48.0
0.5    0.5       3    49.2
0.5    0.25      3    50.55
0.5    0.125     3    51.0 (p: 41.14866, r: 74.7939, f: 51.03596)
0.5    0.0625    3    51.0 (p: 41.35864, r: 73.9689, f: 51.01095)
Ultimately, we trained a binary SVM with a radial basis function kernel using the LIBSVM software [90]. The parameters for the kernel and the cost constant were found with
one of LIBSVM’s built-in tools. SVMs were trained and tested
using 10-fold cross-validation.
To determine the best SVM model, the parameter space was
partially searched for the cost of missing examples (c), the
parameter gamma of the radial basis kernel function (g) and
the weight ratio of cost for missing positive examples to cost
for missing negative examples (w). For any given choice of c, g
and w, an SVM was trained on each cross-validation training set
and tested on the corresponding test set, using LIBSVM. The
resulting F -scores were averaged over the different sets and
recorded. The search space was explored starting at c = 1.0,
g = 1.0, w = 1; c and g were reduced by powers of 2 and w was
incremented. This portion of the space was explored because in
the original classification SVM training, the optimal c and g (as
found by a LIBSVM tool) were usually 0.5 and 0.5. Fixing c and
w, the next g would be chosen until F -scores no longer went
up; w was usually set to 2 almost immediately because of low
scores for w = 1. This was repeated for w up to 3 (as 4 produced
slightly poorer results), and then the next value of c was chosen.
The results of these tests are shown in Table 5. However, 0.5
and 0.5 were optimal values for the event-word-level classifiers,
and so a broader search may have been beneficial.
The best results were at the 51.0% F -score and came from a
few different choices of parameters. The set c = 1.0, g = 0.125,
w = 3 was chosen and the final model was trained with those
parameters on the entire dataset.
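The sketch below illustrates this kind of search using scikit-learn's libsvm wrapper rather than the LIBSVM command-line tools actually used; the parameter ranges follow Table 5, while feature extraction and data handling are omitted.

    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Sketch only: X is the matrix of sentence feature vectors, y the 0/1
    # importance labels. class_weight plays the role of the w parameter.
    def parameter_search(X, y):
        best_params, best_f1 = None, -1.0
        for c in (1.0, 0.5):
            for w in (2, 3):                      # weight on the positive class
                for g in (1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125):
                    clf = SVC(kernel='rbf', C=c, gamma=g, class_weight={1: w})
                    f1 = cross_val_score(clf, X, y, cv=10, scoring='f1').mean()
                    if f1 > best_f1:
                        best_params, best_f1 = (c, g, w), f1
        return best_params, best_f1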
9.2. Experiments in extracting occurrence times
Testing the times is significantly more difficult, since the
intervals generated by this program are certainly different from
the ones intended by the article and representing events with
times is difficult even for human annotators. We use two
evaluation methods for extracted times.
First, spot-checking showed the strengths of our regular-expression-based time extraction method when time expressions are explicit. Two undergraduates examined a random sample of several dozen automatically generated times. We found that ∼80% of the automatic extractions were correct when the sentence contained an explicit time expression.
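A toy version of such an extractor is sketched below; the production pattern set is larger, and the patterns shown here are only illustrative.

    import re

    # Toy patterns for explicit dates of the kinds seen in the articles.
    MONTH = ('January|February|March|April|May|June|July|August|'
             'September|October|November|December')
    EXPLICIT_DATE = re.compile(
        r'(?:(?:' + MONTH + r')\s+\d{1,2},?\s+\d{4}'   # e.g. "July 3, 1863"
        + r'|\d{1,2}\s+(?:' + MONTH + r')\s+\d{4}'     # e.g. "1 July 1863"
        + r'|\b\d{4}\b)')                              # bare year, e.g. "1965"

    def explicit_times(sentence):
        return EXPLICIT_DATE.findall(sentence)

    # explicit_times('The two armies began to collide at Gettysburg on 1 July 1863')
    # -> ['1 July 1863']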
When a time of occurrence is not given explicitly, we use
a measure proposed by Ling and Weld [12] that they term
‘Temporal Entropy’ (TE). This indirectly measures how large
the intervals generated are, smaller and, therefore, better ones
yielding smaller TE. Different Wikipedia articles have different
spans and time granularities, and therefore TE varies greatly
among them. For example, a war is usually measured in years
while a battle is measured in days. An event whose interval spans
a few months in a war article should not be penalized the way
that span should be in a battle article. The range of temporal
entropies obtained is shown in Fig. 9. It is also necessary to
normalize the TE. To do this, we divide the length of the event’s
interval in seconds by the length in days of the unit that is found
for display on the timeline as described in the visualization
section. The TE from these results is primarily between 5 and 15, with 1400 sentences that had temporal expressions having TE in this range, and the majority of these close to 10. A TE of
10 represents an interval that was about 0.25 of the unit used to
measure the article—for example, for a battle article measured
in weeks, an interval with 10 TE would span about 2 days. There
is no theoretical lower or upper bound on the TE range, but in
practice it is bounded above by the TE of the interval with the
longest interval to article unit ratio.

FIGURE 9. Temporal Entropy graph—the temporal entropies were sorted in increasing order for plotting. Temporal entropy is in log(seconds/days).
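In code, the normalization described above amounts to the following sketch, where unit_days stands for the display unit chosen during visualization:

    import math

    # Normalized temporal entropy: event interval length in seconds,
    # divided by the article's display unit in days, on a natural-log
    # scale (the figure's axis is log(seconds/days)).
    def temporal_entropy(interval_seconds, unit_days):
        return math.log(interval_seconds / unit_days)

    # Example from the text: for a battle article measured in weeks
    # (unit_days = 7), a 2-day interval gives a TE close to 10.
    temporal_entropy(2 * 24 * 3600, 7)   # about 10.1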
Temporal entropy does not give all the information, however.
Spot-checking of articles reveals that many events—particularly
those whose times were estimated—are not in the correct
interval at all. An example of this error can be seen in the
USS Liberty article. This article’s span was found to be the
1 day 8 June 1967. Regular expressions on the sentence ‘She
was acquired by the United States Navy ... and began her first
deployment in 1965, to waters off the west coast of Africa’ find
‘began’ and ‘1965’ and it is concluded that this event starts in
1965 and ends at the end of the whole event, in 1967. This is
incorrect. Worse, a sentence closely following this one begins
‘On June 5, at the start of the war’, and is guessed to be in 1965 because the correct year, 1967, was not mentioned again
between the sentences. This error then propagates through at
least one more sentence. Similar examples succeed, however,
as in the Battle of Gettysburg article: ‘The two armies began
to collide at Gettysburg on 1 July 1863, as Lee urgently
concentrated his forces there’ is given the date 1 July 1863, and
the next sentence with a time expression ‘On the third day of
battle, July 3, fighting resumed on Culp’s Hill,...’ uses 1863 as a
year. A useful but impractical additional test, which we have not
performed, would be human examination of the results to tell
whether each event’s assigned interval includes or is included
in its actual interval. Then those that fail this test could be given
maximum temporal entropy, to integrate this test’s results into
the former results.
9.3. Experiments in geospatial named entity extraction
We performed extensive experiments in extracting geospatial
named entities. We used three different methods for extracting
and geocoding geospatial named entities. Our goal was to see if an especially trained and tuned SVM would perform better than HMM or CRF approaches, particularly for geospatial names.
More details can be found in [75–77, 91].
9.3.1. Using machine learning for geospatial named entity
extraction
We downloaded a number of previously tagged datasets to
provide the training material for the SVM. We also hand-tagged
a dataset on our own. Here are the relevant datasets.
(i) Spell Checker-Oriented Word List (SCOWL)10 to
provide a list of English words.
(ii) CoNLL 2003 shared task dataset on multi-language NE
tagging11 containing tagged named entities for PER,
LOC and ORG, in English.
(iii) Reuters Corpus from NIST:12 15 000 selected words
from the corpus were hand-tagged for LOC entities.
(iv) CoNLL 2004 and 2005 shared task datasets, on
Semantic Role Labeling13 tagged for English LOC NEs.
(v) Geonames database of 6.2 million names of cities,
counties, countries, land features, rivers and lakes.14
To generate the SVM training corpus, we combined our
hand-tagged Reuters corpus and the three CoNLL datasets.
We extracted those records in the 2.5-million word Geonames database with either the P or A Geonames codes (Administrative division and Population center), which provided the names of
cities, villages, countries, states, and regions. Combined with
the SCOWL wordlist, which provided negative examples, this
resulted in a training set of 1.2 million unique words.
We used LibSVM to train an SVM based on the feature vectors and two parameters, C and γ, which are the cost constant and the kernel coefficient of the SVM kernel function. After using a grid search,
the optimal parameters were determined to be C = 8 and
γ = 0.03125. We selected a radial basis function kernel, and
applied it to the C-SVC, the standard cost-based SVM classifier.
The SVM was trained using feature vectors based on the training
corpus, requiring 36 h to complete.
10 http://downloads.sourceforge.net/wordlist/scowl-6.tar.gz.
11 http://www.cnts.ua.ac.be/conll2003/ner/.
12 http://trec.nist.gov/data/reuters/reuters.html.
13 http://www.lsi.upc.es/s̃rlconll/st04/st04.html.
14 http://www.geonames.org/export/.
TABLE 6. Training corpus combination results for the NER SVM.

Corpus                                Precision    Recall    F
R                                       0.38        0.63     0.47
R + C03 + C04 + C05                     0.45        0.68     0.54
R + C03 + C04 + C05 + SCW               0.49        0.75     0.59
Wiki                                    0.65        0.34     0.44
Wiki + R + C03 + C04 + C05 + SCW        0.60        0.55     0.57
Geo + R + C03 + C04 + C05 + SCW          TC          TC       TC
FGeo + R + C03 + C04 + C05 + SCW        0.51        0.99     0.67

R, Reuters corpus; C03, CoNLL 2003 dataset; C04, CoNLL 2004 dataset; C05, CoNLL 2005 dataset; Wiki, Wikipedia data; SCW, SCOWL wordlist; Geo, full Geonames database; FGeo, filtered Geonames database; TC, training canceled. The chosen training corpus (shown in bold in the original table) is the FGeo combination in the last row.
TABLE 7. Feature set combination results for the NER SVM.

Feature set                                              F-measure
SHP + PUN + HYP + SEP                                      0.44
SHP + CAP + PUN + HYP + SEP                                0.47
CAP + VOW + CON                                            0.51
SHP + CAP                                                  0.54
SHP + CAP + VOW + CON                                      0.56
SHP + CAP + VOW + CON + AFF + POS                          0.58
SHP + CAP + VOW + CON + PUN + HYP + SEP + AFF + POS        0.67

SHP, word shape, including length and alphanumeric content; CAP, capital positions and counts; VOW, vowel positions and counts; CON, consonant positions and counts; PUN, punctuation; HYP, hyphens; SEP, separators, quotes, parentheses, etc.; AFF, affixes; POS, part of speech categorization. Bold content indicates the chosen feature set.
Our goal was to obtain high recall while maintaining good precision. Table 6 shows the precision, recall and F-measure numbers for training corpora created with different subsets of the training data. The performance numbers are the results from the Wikipedia testing corpus.
After selecting the optimal training corpus, we optimized the
feature set for the SVM. Table 7 shows a breakdown of the
feature sets we tested separately. We found that the full feature list presented earlier provided the best performance, by close to 10% in F-measure over any other combination of features. With this setup,
we were successful in training the SVM to return high recall
numbers, recognizing 99.8% of the geospatial named entities.
Unfortunately, this was at the cost of precision. The SVM
returned noisy results with only 51.0% precision, but the noise
was processed out during the NE geocoding process, to obtain
our final results.
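For illustration, a rough sketch of how surface features of this kind can be computed per token is given below; the exact encoding used to train the NER SVM is not reproduced here.

    import string

    # Surface features in the spirit of Table 7: shape, capitalization,
    # vowels/consonants, punctuation and hyphens.
    def surface_features(token):
        vowels = set('aeiouAEIOU')
        return {
            'length': len(token),
            'shape': ''.join('X' if c.isupper() else
                             'x' if c.islower() else
                             'd' if c.isdigit() else c for c in token),
            'init_cap': token[:1].isupper(),
            'cap_count': sum(c.isupper() for c in token),
            'vowel_count': sum(c in vowels for c in token),
            'consonant_count': sum(c.isalpha() and c not in vowels for c in token),
            'has_punct': any(c in string.punctuation for c in token),
            'has_hyphen': '-' in token,
        }

    # surface_features('Gettysburg') -> {'length': 10, 'shape': 'Xxxxxxxxxx', ...}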
As described earlier, we also performed geospatial NER using
an HMM-based tool from LingPipe and a CRF-based tool from
Stanford Natural Language Processing Group.
TABLE 8. Grounded geospatial NE results.

                        Precision    Recall    F-measure
HMM NER results           0.489      0.615       0.523
CRF NER results           0.579      0.575       0.572
SVM NER results           0.510      0.998       0.675
HMM resolved results      0.878      0.734       0.796
SVM resolved results      0.869      0.789       0.820
CRF resolved results      0.954      0.695       0.791
9.3.2. Comparison of geospatial NER using machine
learning methods
Table 8 shows the raw results from the geospatial NE
recognition using the three learning methods along with the
results after processing through geospatial resolution and
disambiguation (discussed in the following section). The NER
Results show the accuracy of the NERs before any further
processing. The Resolved Results in Table 8 specifically show
the performance in correctly identifying location strings and
geocoding the locations. A string correctly identified by the
NER process is one that exactly matched a hand-tagged ground
truth named entity.
Figure 10 shows a more detailed breakdown of the precision,
recall and F-measure for a subset of the articles processed. A handful of articles with foreign names, such as the article on the Korean War, brought down the average results for all three methods, but to a larger extent for the SVM and, to a lesser degree, for the HMM. This is most likely due
to the fact that our training data contained a limited number
of foreign names, and the HMM had trouble recognizing these
strings as LOC named entities.
The SVM produces a lot more traffic to the geocoder
compared with the HMM or CRF method. Table 9 shows how many fewer candidate location NEs the HMM extracted than the SVM for some of the articles in the test corpus. It also shows the number of these NEs that were successfully geocoded and disambiguated. Although the HMM or the CRF often extracted fewer than half as many potential NEs as the SVM, the final results came out similar. The HMM demonstrates better performance than the SVM in the longer articles, and worse performance in shorter articles. The F-measures for some of these articles are pictured in Fig. 10, in which the three lowest
and three highest scoring articles of the HMM’s results are
shown side by side. For the HMM, the lowest scoring articles
were the articles with the fewest potential NEs in the text, and
the highest had the most potential NEs.
We see in Table 8 that, as far as F -measure after resolution
goes, all three approaches are comparable in performance,
although our especially trained SVM has high recall as tuned.
However, the traffic to the geocoder is much lower for the HMM or CRF method. Our system allows a choice of all three methods as options, although CRF is the default because it has the fastest training speed.
FIGURE 10. Comparing SVM and HMM results for a subset of Wikipedia articles.
TABLE 9. Hand-tagged articles: potential location NEs.

Article             HMM extracted      SVM extracted      HMM         SVM         HMM      SVM
                    potential NEs/     potential NEs/     precision   precision   recall   recall
                    grounded NEs       grounded NEs
Chancellorsville    119/47             566/75             0.7015      0.8621      0.4947   0.7895
Gettysburg          327/115            1209/117           0.7718      0.7267      0.6319   0.6429
Korean War          625/328            2167/331           0.9371      0.6910      0.8700   0.8780
War of 1812         752/384            2173/408           0.9165      0.8518      0.7370   0.7831
World War 2         668/464            1641/448           0.9915      0.9124      0.8609   0.8312
FIGURE 11. Comparing HMM- and CRF-based named entity recognizers on selected articles.
9.3.3. Disambiguation of geospatial named entities
We found that there are problems with the context-based
disambiguation in North America in areas where the city and
feature names have significant overlap. The ultimate result
of this is that places with globally unique city and feature
names perform the best, as there is no need to run the context
disambiguation algorithms. We also found that the worst areas in North America are the Northeast and the Southwest USA, where there is significant overlap between place names in the USA and place names in Mexico. In general, our context
disambiguation algorithm performs at 78%, but in these areas
of the country, the performance drops to around 64.5%, due to
the significant place name overlap.
9.3.4. Geocoding and caching geocoded results
Once a list of geospatial names is identified, the names are fed into Google Geocoder, a service that attempts to resolve
each name as an object with various attributes, including a
latitude and longitude. The geocoder is accessed via an online
API and can return multiple results if there are multiple places
in Google’s catalog that match the provided name. Geocoding
success is defined as correctly resolving a string to a single
location in the context of the document by the end of our post-processing of the geocoder’s results.
We also perform post-processing on the Geocoder results.
We find that the geocoder performs much better on land-based
than ocean- and sea-based locations. This is primarily because
land tends to be better divided and named than the ocean.
This problem is found primarily in articles about naval battles,
like the Battle of Midway article. We also find that Google
Geocoder has much better data in its gazetteer for the USA
and Western Europe than other areas of the world. While we
are able to resolve major cities in Asia and Africa, smaller-scale
features are often returned as ‘not found’. Russian and Eastern
Asian locations also introduce the occasional problem of other
character sets in Unicode characters. We find that the articles
in our corpus do not refer to very many features at smaller
than a city level in general, so these geocoding problems do
not significantly impact our performance. Historical location
names cause some small issues with the geocoder, but our
research demonstrates that, for North America and Europe at
least, Google Geocoder has historical names in its gazetteer
going back at least 300 years.
It should be noted that we cache geocoded results, which
decreases the number of online geocoding queries by an average
of 70%. Google Geocoder limits the speed at which a public
user can make consecutive requests and so we add a delay
of 0.1 s between queries. By reducing our online accesses by
such a large percentage, we end up saving seconds processing
each article. More detailed performance evaluations still need
to be made to more clearly identify how well the cache has
improved efficiency. Since Google Geocoder outputs have not
yet been disambiguated by our system when cached, the cache
holds information across the processing of multiple articles.
In initial development, it was the single final disambiguated
output that was saved into the cache rather than raw geocoder
output. This caused problems when two completely different
geospatial entities with the same name were referenced in two
different articles that would have been distinguished correctly
given raw geocoder output. We have since fixed this problem by
immediately saving the list of results returned by the geocoder
into the cache, rather than the specific location identified after
disambiguation. Unfettered access to Google Geocoder costs
several tens of thousands of dollars per year.
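A minimal sketch of such a cache is shown below; geocode_fn is a placeholder for whatever client performs the online lookup and returns the raw candidate list, and is not the actual Google Geocoder API.

    import time

    # Cache the raw candidate list (not the disambiguated location), so
    # later articles can re-run their own context disambiguation.
    class GeocodeCache:
        def __init__(self, geocode_fn, delay_seconds=0.1):
            self.geocode_fn = geocode_fn
            self.delay_seconds = delay_seconds   # pause between online queries
            self.cache = {}

        def lookup(self, place_name):
            if place_name not in self.cache:
                time.sleep(self.delay_seconds)   # respect request-rate limits
                self.cache[place_name] = self.geocode_fn(place_name)
            return self.cache[place_name]        # raw candidate list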
10. VISUALIZATION
The overall objective of the research presented in this paper is
to create a structured dataset that represents the events, entities,
temporal data and grounded locations contained in a corpus of
unstructured text. While this structured dataset is valuable in its own right, our follow-up objective is the visualization of the data extracted from the text. We create a bifurcated GUI that
provides the visualizations for the geospatial and event/temporal
information. We generate a combination map and timeline using
the Google Maps API15 and the MIT SIMILE Timeline.16
15 http://code.google.com/apis/maps/documentation/javascript/.
16 http://www.simile-widgets.org/timeline/.
The map interface provides a clickable list of events on the
right, with an associated map of the events. Clicking on a
location on the map or in the sidebar shows an infobox on the
map that provides more details about the location, including the
sentence in which it appears. Figure 12 shows this interface,
and Fig. 13 shows the clicked map. The second part of the
interface is a timeline with short summaries of the important
events. Clicking on any of the events brings up an infobox with
a more detailed description of the event, shown in Fig. 14.

FIGURE 12. Example of an article’s visualization.
FIGURE 13. Geospatial entity infobox in the visualization.
FIGURE 14. Event infobox in the visualization, after clicking ‘Richmond, VA, USA’.
10.1. Timeline hot zones
Key to the timeline visualization is the concept of ‘hot zones’,
or clusters of important events on the timeline. An example set
of hot zones is shown in Fig. 15. In the figure, the orange bars
represent three different hot zones where multiple events have
been identified to occur at about the same time. Blue markers,
along with the text of the event itself, are aligned vertically in
each zone. Events that were not identified to reside in a hot
zone are staggered outside of the orange bars. To generate the endpoints of the hot zones, we sort the times for an article and examine the time differences (in days) between
consecutive times. Long stretches of low differences indicate dense time intervals, i.e. a hot zone. It is not necessary
for the differences to be zero, just low, and so a measure is
needed for how low is acceptable. We choose to remove outliers
from the list of time differences and then take the average of the
new list. To remove outliers, we proceed in a common manner
and calculate the first and third quartiles (Q1 and Q3) and then
remove values greater than Q3 + 3(Q3 − Q1) or less than Q1 − 3(Q3 − Q1). The latter bound is usually at or below zero, so this mainly removes high outliers. The average of the remaining differences is the threshold below which a difference is called low. We also ensure that too many hot zones are not generated by requiring a minimum number of consecutive low differences before an interval becomes a hot zone; we choose a threshold of 10, based on what is appropriate for the dates in the majority of articles we process.

FIGURE 15. Timeline for Mexican–American War, containing three hot zones.
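A compact sketch of the hot-zone detection just described, under the assumption that event times are available as sorted day numbers, follows.

    import statistics

    # Find runs of at least min_run consecutive low day-differences.
    def hot_zones(times, min_run=10):
        diffs = [b - a for a, b in zip(times, times[1:])]
        q1, _, q3 = statistics.quantiles(diffs, n=4)
        kept = [d for d in diffs if q1 - 3 * (q3 - q1) <= d <= q3 + 3 * (q3 - q1)]
        threshold = statistics.mean(kept or diffs)   # "low" difference cut-off
        zones, run_start = [], None
        for i, d in enumerate(diffs):
            if d <= threshold:
                if run_start is None:
                    run_start = i
            else:
                if run_start is not None and i - run_start >= min_run:
                    zones.append((times[run_start], times[i]))
                run_start = None
        if run_start is not None and len(diffs) - run_start >= min_run:
            zones.append((times[run_start], times[-1]))
        return zones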
10.2. Article processing flow
The website both allows a user to search for and visualize articles, and serves to control the processing of new articles as they
are requested. Figure 16 shows the processing flow of articles.
While the system constantly queues and processes articles, an article requested by a user that has not been previously processed is immediately moved to the top of the queue for processing. A user can also choose to re-process an article that
has already been processed, if it has been too long since it was last processed, as the article content may change in Wikipedia.

FIGURE 16. System workflow.
11. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented details of a system that puts together many tools and approaches into a working whole that is useful and interesting. Of course, the work
presented here can be improved in many ways.
The performance of the important event classifier could
be improved upon in the future. It is possible that more features would help. Ideas for sentence features that were not
implemented included calculating the current features for the
adjacent sentences. Dependency relations obtained from dependency parsers can be used as well. Classifiers other than SVMs
could be explored. Since the task is similar to text summarization, methods used for summarization could be adapted as well.
Assuming we perform extractive summarization, the sentences
that are selected for the summary for a document can be considered important; these in turn are assumed to contain references
to important events. If the number of sentences selected for
the summary can be controlled, it will give us control over the
number or percentage of sentences selected and hence, events
considered important. Calculating the features to train the classifier takes more time than desirable; using multithreading to process articles in parallel would bring some improvement.
There are three specific features that take the most time to
calculate—similarity, named entity weight and TextRank rank.
Experimenting with removing any of these features while
preserving accuracy could make a more efficient classifier.
The extraction of temporal expressions that are explicit turns
out to work fairly well for this domain of historical narratives
because the sentences are often ordered in the text similarly
to their temporal order. In another domain, even one rich in
temporal information like biographies, it might not do as well.
Although they work well, most of the algorithms we used here
are naïve and do not make use of existing tools. The regular
expression extractor works fairly well, but needs expansion
to handle relative expressions like ‘Two weeks later, the army
invaded’. To implement this with the same accuracy that the
extractor currently has should not be hard, as when the expression is recognized, the function that it implies (in this example,
adding 2 weeks) can be applied to the anchoring time. It also
cannot currently handle times B.C.E. More difficult would be
a change to the program to have it extract multiple expressions
per sentence (beyond the range-like expressions it already
finds). If events are considered smaller than sentences, this is
crucial. Having events smaller than sentences would also allow
the use of Ling and Weld’s Temporal Information Extraction
system [12], which temporally relates events within one sentence. We need to improve performance when time expressions
are implicit. Errors tend to build up and carry over from sentence to sentence; some kind of check or reset condition would
help accuracy.
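A possible sketch of the proposed handling of relative expressions, assuming a small vocabulary of number words and units, is given below; it is not part of the current extractor.

    import re
    from datetime import timedelta

    # Recognize "<N> <unit>(s) later" and apply the implied offset to the
    # anchoring time of the previous event. Vocabulary is illustrative.
    UNITS = {'day': 1, 'week': 7, 'month': 30, 'year': 365}
    WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
    RELATIVE = re.compile(r'\b(\w+)\s+(day|week|month|year)s?\s+later\b', re.I)

    def resolve_relative(sentence, anchor_date):
        m = RELATIVE.search(sentence)
        if not m:
            return None
        n = int(m.group(1)) if m.group(1).isdigit() else WORDS.get(m.group(1).lower())
        if n is None:
            return None
        return anchor_date + timedelta(days=n * UNITS[m.group(2).lower()])

    # resolve_relative('Two weeks later, the army invaded.', anchor_date)
    # -> anchor_date + 14 days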
The extracted information is stored in a MySQL database
at this time. We would like to convert it into an ontology
based on OWL or RDF so that it can be easily integrated into
programs for effective use. For example, this may enable us
to allow searches for text content to be filtered by geographic
area. The search could then be performed in two steps. The
first would be a standard free text search, for articles about the
topic. That list could then be further filtered to those articles
with appropriate locations associated with them. Reversing
this paradigm, the location data provided by the system could
also allow location-centric search.
The visualization could also be improved with changes
to SIMILE’s code for displaying a timeline, particularly the
width of the timeline, which does not automatically adjust for
more events. Because the important event classifier positively
identifies a lot of sentences that are often close in time, the listing
of events on the timeline overflows the space, so more must be
allotted even for sparser articles. This could be improved with
automatic resizing or zooming. Another improvement would
be to create event titles that better identify the events since
they are currently just the beginnings of the sentences. It will
also be interesting to see if the visualization can be made more
expressive using a language of choremes [35, 36]. Finally, more
features could be added to the visualization, especially in terms
of filtering by time or location range.
Applied to other corpora of information, this event and
geospatial information could also be very useful in finding
geospatial trends and relationships. For instance, consider a
database of textual disease outbreak reports. The system could
extract all the locations, and then they could be graphically presented on a map, allowing trends to be found much more easily.
In our current effort, we perform limited association of events
and geo-entities based on their proximity within the sentence,
which provides only the most basic association. While the
research done on event extraction did heavily involve the time
of the event, the further association of the events with the geoentities is a topic for future research. In practical terms, we
also envision developing our user interface further by having
the system we created have a slider for time (rather than just
being able to enter start/end dates), which if used continuously
could provide a nice overview of where events happen over time.
Some kind of clustering could be applied at each time step of
some granularity to see where the action moves.
FUNDING
The work reported in this paper has been partially supported by
NSF grant ARRA: 0851783. R.C. and D.W. are undergraduate
researchers supported by this NSF Research Experience for
Undergraduates (REU) Grant. This paper is a much extended
version of [92].
REFERENCES
[1] Allan, J., Gupta, R. and Khandelwal, V. (2001) Temporal
Summaries of New Topics. SIGIR’01: Proc. 24th Annual Int.
ACM SIGIR Conf. on Research and Development in Information
Retrieval, New Orleans, LA, USA, pp. 10–18.
[2] Allan, J. (2002) Topic Detection and Tracking: Event-Based
Information Organization. Springer, New York, NY, USA.
[3] Swan, R. and Allan, J. (1999) Extracting Significant Time Varying
Features from Text. CIKM’99: Proc. 8th Int. Conf. on Information
and Knowledge Management, Kansas City, MO, USA, pp. 38–45.
[4] Manning, C.D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press, Cambridge, MA USA.
[5] Ahn, D. (2006) The Stages of Event Extraction. Proc. Workshop
on Annotating and Reasoning about Time and Events, Sydney,
Australia, July, pp. 1–8.
[6] Humphreys, K., Gaizauskas, R. and Azzam, S. (1997) Event
Coreference for Information Extraction. Proc. Workshop on
Operational Factors in Practical, Robust Anaphora Resolution
for Unrestricted Texts, Madrid, Spain, pp. 75–81.
[7] Bagga, A. and Baldwin, B. (1999) Cross-document Event
Coreference: Annotations, Experiments, and Observations. Proc.
Workshop on Coreference and its Applications, Madrid, Spain,
pp. 1–8.
[8] Allen, J.F. (1983) Maintaining knowledge about temporal
intervals. Commun. ACM, 26, 832–843.
[9] Zhou, L., Melton, G.B.B., Parsons, S. and Hripcsak, G. (2005) A
temporal constraint structure for extracting temporal information
from clinical narrative. J. Biomed. Inf., 39, 424–439.
[10] Pustejovsky, J., Castano, J., Ingria, R., Sauri, R., Gaizauskas, R., Setzer, A., Katz, G. and Radev, D. (2003) TimeML: Robust Specification of Event and Temporal Expressions in Text. New Directions in Question Answering, pp. 28–34. AAAI Press, Palo Alto, CA, USA.
[11] Verhagen, M. and Pustejovsky, J. (2008) Temporal Processing with the TARSQI Toolkit. 22nd Int. Conf. on Computational Linguistics: Demonstration Papers, Manchester, UK, August, pp. 189–192.
[12] Ling, X. and Weld, D. (2010) Temporal Information Extraction. Proc. 24th Conf. on Artificial Intelligence (AAAI-10), Atlanta, GA, July, pp. 1385–1390.
[13] Grishman, R. and Sundheim, B. (1996) Message Understanding Conference-6: A Brief History. Proc. COLING, Copenhagen, Denmark, pp. 466–471. International Conference on Computational Linguistics.
[14] Vapnik, V. (2000) The Nature of Statistical Learning Theory. Springer, New York, NY, USA.
[15] Bhole, A., Fortuna, B., Grobelnik, M. and Mladenić, D. (2007) Extracting named entities and relating them over time based on Wikipedia. Informatica, 22, 20–26.
[16] Takeuchi, K. and Collier, N. (2002) Use of Support Vector Machines in Extended Named Entity Recognition. COLING-02, Taipei, Taiwan, pp. 1–7. International Conference on Computational Linguistics.
[17] Lee, K.-J., Hwang, Y.-S. and Rim, H.-C. (2003) Two-Phase Biomedical NE Recognition Based on SVMs. Proc. ACL 2003 Workshop on Natural Language Processing in Biomedicine, Sapporo, Japan, pp. 33–40. Association for Computational Linguistics.
[18] Habib, M. and Kalita, J. (2007) Language and Domain-Independent Named Entity Recognition: Experiment using SVM and High-dimensional Features. Proc. 4th Biotechnology and Bioinformatics Symposium (BIOT-2007), Colorado Springs, CO, USA, pp. 69–73.
[19] Habib, M. (2008) Improving scalability of support vector machines for biomedical named entity recognition. PhD Thesis, University of Colorado, Colorado Springs, CO, USA.
[20] Habib, M. and Kalita, J. (2010) Scalable biomedical named entity recognition: investigation of a database-supported SVM approach. Int. J. Bioinf. Res. Appl., 6, 191–208.
[21] Isozaki, H. and Kazawa, H. (2002) Efficient Support Vector Classifiers for Named Entity Recognition. Proc. 19th Int. Conf. on Computational Linguistics, Taipei, Taiwan, pp. 1–7. COLING.
[22] Dakka, W. and Cucerzan, S. (2008) Augmenting Wikipedia with Named Entity Tags. Proc. 3rd Int. Joint Conf. on Natural Language Processing, Hyderabad, India, pp. 545–552.
[23] Mansouri, A., Affendey, L. and Mamat, A. (2008) Named entity recognition using a new fuzzy support vector machine. Int. J. Comput. Sci. Netw. Secur., 8, 320.
[24] Klein, D., Smarr, J., Nguyen, H. and Manning, C.D. (2003) Named Entity Recognition with Character-Level Models. Proc. 7th Conf. on Natural Language Learning at HLT-NAACL 2003, Edmonton, Canada, pp. 180–183. Human Language Technologies–North American Chapter of Association for Computational Linguistics.
[25] Zhou, G. and Su, J. (2002) Named Entity Recognition using an HMM-Based Chunk Tagger. Proc. 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, Pennsylvania, pp. 473–480.
[26] Lafferty, J., McCallum, A. and Pereira, F.C. (2001) Conditional
Random Fields: Probabilistic Models for Segmenting and
Labeling Sequence Data. Proc. Int. Conf. on Machine Learning,
Williamstown, MA, USA.
[27] McCallum, A. and Li, W. (2003) Early Results for Named
Entity Recognition with Conditional Random Fields, Feature
Induction and Web-enhanced Lexicons. Proc. 7th Conf. on
Natural Language Learning at HLT-NAACL 2003, Berkeley,
CA, USA, pp. 188–191. Human Language Technologies–North
American Chapter of Association for Computational Linguistics.
[28] Chen, Y., Di Fabbrizio, G., Gibbon, D., Jora, S., Renger, B.
and Wei, B. (2007) Geotracker: Geospatial and Temporal RSS
Navigation. WWW’07: Proc. 16th Int. Conf. on World Wide Web,
Banff, Alberta, Canada, pp. 41–50.
[29] Scharl, A. and Tochtermann, K. (2007) The Geospatial Web: How
Geobrowsers, Social Software and the Web 2.0 Are Shaping the
Network Society. Springer, New York, NY.
[30] Strötgen, J., Gertz, M. and Popov, P. (2010) Extraction and
Exploration of Spatio-temporal Information in Documents. Proc.
6th Workshop on Geographic Information Retrieval, Zurich,
Switzerland, pp. 16–23.
[31] Cybulska, A. and Vossen, P. (2010) Event models for Historical
Perspectives: Determining Relations between High and Low
Level Events in Text, Based on the Classification of Time,
Location and Participants. Proc. Int. Conf. on Language
Resources and Evaluation, Malta, pp. 17–23.
[32] Cybulska, A. and Vossen, P. (2011) Historical Event Extraction
from Text. ACL HLT 2011 Workshop on Language Technology
for Cultural Heritage, Social Sciences, and Humanities (LaTeCH
2011), Portland, OR, USA, pp. 39–43. Association for
Computational Linguistics–Human Language Technologies.
[33] Miller, G. (1995) WordNet: a lexical database for English.
Commun. ACM, 38, 39–41.
[34] Ligozat, G., Nowak, J. and Schmitt, D. (2007) From Language to
Pictorial Representations. Proc. Language and Technology Conf.,
Poznan, Poland.
[35] Brunet, R. (1980) La Composition des Modèles dans L’analyse
Spatiale. Espace Géographique, 9, 253–265.
[36] Klippel, A., Tappe, H., Kulik, L. and Lee, P.U. (2005) Wayfinding
choremes: a language for modeling conceptual route knowledge.
J. Vis. Lang. Comput., 16, 311–329.
[37] Ligozat, G., Vetulani, Z. and J., O. (2011) Spatiotemporal aspects
of the monitoring of complex events for public security purposes.
Spat. Cogn. Comput., 11, 103–128.
[38] Goyal, R.K. and Egenhofer, M.J. (2001) Similarity of Cardinal
Directions. Advances in Spatial and Temporal Databases, pp.
36–55. Springer.
[39] Rigaux, P., Scholl, M. and Voisard, A. (2001) Spatial Databases:
With Application to GIS. Morgan Kaufmann, CA, USA.
[40] Clarke, K.C. (2001) Getting Started with Geographic Information
Systems. Prentice Hall.
[41] Hill, L.L. (2000) Core Elements of Digital Gazetteers:
Placenames, Categories, and Footprints. Research and Advanced
Technology for Digital Libraries, pp. 280–290. Springer.
[42] Fonseca, F.T., Egenhofer, M.J., Agouris, P. and Câmara, G. (2002)
Using ontologies for integrated geographic information systems.
Trans. GIS, 6, 231–257.
[43] Buscaldi, D., Rosso, P. and Peris, P. (2006) Inferring Geographical Ontologies from Multiple Resources for Geographical Information Retrieval. Proc. 3rd SIGIR Workshop on Geographical Information Retrieval, Seattle, WA, USA. Association for
Computing Machinery Special Interest Group on Information
Retrieval.
[44] Jones, C.B., Purves, R., Ruas, A., Sanderson, M., Sester, M.,
Van Kreveld, M. and Weibel, R. (2002) Spatial Information
Retrieval and Geographical Ontologies an Overview of the
SPIRIT Project. Proc. 25th Annual Int. ACM SIGIR Conf. on
Research and Development in Information Retrieval, Tampere,
Finland, pp. 387–388. Association for Computing Machinery
Special Interest Group on Information Retrieval.
[45] Fu, G., Jones, C.B. and Abdelmoty, A.I. (2005) OntologyBased Spatial Query Expansion in Information Retrieval. On the
Move to Meaningful Internet Systems 2005: CoopIS, DOA, and
ODBASE, pp. 1466–1482. Springer.
[46] Martins, B.E.d.G. (2009) Geographically aware Web text mining.
PhD Thesis, Universidade de Lisboa, Lisbon, Portugal.
[47] Page, L., Brin, S., Motwani, R. and Winograd, T. (1999)
The PageRank Citation Ranking: Bringing Order to the Web.
Technical report., Stanford InfoLab, available at http://ilpubs.
stanford.edu:8090/422/1/1999-66.pdf.
[48] Kleinberg, J.M. (1999) Authoritative sources in a hyperlinked
environment. J. ACM, 46, 604–632.
[49] Brill, E., Dumais, S. and Banko, M. (2002) An Analysis
of the AskMSR Question-Answering System. Proc. ACL02 Conf. on Empirical Methods in Natural Language
Processing, Philadelphia, PA, USA, pp. 257–264. Association
for Computational Linguistics.
[50] Ravichandran, D. and Hovy, E. (2002) Learning Surface
Text patterns for a Question Answering System. Proc. 40th
Annual Meeting on Association for Computational Linguistics,
Philadelphia, PA, USA, pp. 41–47.
[51] Hovy, E., Hermjakob, U. and Ravichandran, D. (2002) A
Question/Answer Typology with Surface Text Patterns. Proc. 2nd
Int. Conf. on Human Language Technology Research, San Diego,
CA, USA, pp. 247–251.
[52] Moldovan, D., Harabagiu, S., Girju, R., Morarescu, P., Lacatusu,
F., Novischi, A., Badulescu, A. and Bolohan, O. (2003) LCC
tools for Question Answering. Proc. Text Retrieval Conference
(TREC), Gaithersburg, MD, USA, pp. 388–397.
[53] Harabagiu, S., Moldovan, D., Clark, C., Bowden, M., Williams, J.
and Bensley, J. (2003) Answer Mining by Combining Extraction
Techniques with Abductive Reasoning. Proce. 12th Text Retrieval
Conference (TREC 2003), Gaithersburg, MD, USA, pp. 375–382.
[54] Voorhees, E. (2003) Overview of TREC 2003. Proc. Text Retrieval
Conf. (TREC), Gaithersburg, MD, USA, pp. 1–16.
[55] Small, S., Liu, T., Shimizu, N. and Strzalkowski, T. (2003)
HITIQA: An Interactive Question Answering System a
Preliminary Report. Proc. ACL 2003 Workshop on Multilingual
Summarization and Question Answering, Sapporo, Japan,
pp. 46–53. Association for Computational Linguistics.
[56] Soricut, R. and Brill, E. (2004) Automatic Question Answering:
Beyond the Factoid. Proc. HLT-NAACL, Boston, MA, USA.
pp. 57–64. Human Language Technologies–North American Chapter of Association for Computational Linguistics.
[57] Sauri, R., Knippen, R., Verhagen, M. and Pustejovsky, J. (2005)
EVITA: A Robust Event Recognizer for QA Systems. Proc.
Conf. on Human Language Technology and Empirical Methods
in Natural Language Processing, Vancouver, British Columbia,
Canada, October, pp. 700–707.
[58] UzZaman, N. and Allen, J. (2010) TRIPS and TRIOS System for
TempEval-2: Extracting Temporal Information from Text. Proc.
5th Int. Workshop on Semantic Evaluation, Uppsala, Sweden, pp.
276–283.
[59] UzZaman, N. and Allen, J. (2010) Event and Temporal Expression
Extraction from Raw Text: First Step towards a Temporally Aware
System. Int. J. Semant. Comput., 4, 487–508.
[60] Llorens, H., Saquete, E. and Navarro, B. (2010) TIPSem (English
and Spanish): Evaluating CRFs and Semantic Roles in TempEval2. Proc. 5th Int. Workshop on Semantic Evaluation, Uppsala,
Sweden, pp. 284–291.
[61] Wallach, H.M. (2004) Conditional Random Fields: An introduction. Technical Report, Department of Computer Science, University of Pennsylvania, Philadelphia, PA, USA.
[62] Naughton, M., Stokes, N. and Carthy, J. (2008) Investigating
Statistical Techniques for Sentence-level Event Classification.
Proc. 22nd Int. Conf. on Computational Linguistics, Manchester,
UK, pp. 617–624.
[63] Pedersen, T., Patwardhan, S. and Michelizzi, J. (2004)
WordNet::Similarity: Measuring the Relatedness of Concepts.
HLT-NAACL’04: Demonstration Papers, Boston, MA, USA, pp.
38–41.
[64] Leacock, C. and Chodorow, M. (1998). WordNet: An Electronic
Lexical Database, Chapter Combining Local Context and
WordNet Similarity for Word Sense Identification, pp. 265–283.
MIT Press, Cambridge, MA.
[65] Jiang, J. and Conrath, D. (1997) Semantic Similarity based
on Corpus Statistics and Lexical Taxonomy. Proc. Int. Conf.
on Research in Computational Linguistics (ROCLING), Taipei,
Taiwan.
[66] Resnik, P. (1995) Using Information Content to Evaluate
Semantic Similarity in a Taxonomy. Proc. 14th Int. Joint Conf.
on Artificial Intelligence, Montreal, QC, Canada, pp. 448–453.
[67] Lin, D. (1998) An Information-theoretic Definition of Similarity.
Proc. 15th Int. Conf. on Machine Learning, Madison, WI, USA,
pp. 296–304.
[68] Hirst, G. and St-Onge, D. (1998). WordNet: An Electronic Lexical
Database, Chapter Lexical Chains as Representations of Context
for the Detection and Correction of Malapropisms, pp. 305–332.
MIT Press, Cambridge, MA, USA.
[69] Wu, Z. and Palmer, M. (1994) Verbs Semantics and Lexical
Selection. Proc. 32nd Annual Meeting on Association for
Computational Linguistics, Las Cruces, NM, USA, pp. 133–138.
[70] Banerjee, S. and Pedersen, T. (2002) An Adapted Lesk Algorithm
for Word Sense Disambiguation using WordNet. Conf. on
Computational Linguistics and Intelligent Text Processing,
Mexico City, Mexico, pp. 117–171.
[71] Mihalcea, R. and Tarau, P. (2004) TextRank: Bringing Order
into Texts. Proc. EMNLP 2004, Barcelona, Spain, July, pp. 404–
411. Conference on Empirical Methods in Natural Language
Processing.
[72] Strötgen, J. and Gertz, M. (2010) HeidelTime: High Quality Rule-based Extraction and Normalization of Temporal Expressions. Proc. 5th Int. Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 321–324.
[73] van Rijsbergen, C. (1979) Information Retrieval. Butterworths, London, UK.
[74] Weiss, S.M. and Zhang, T. (2003) Performance Analysis and Evaluation. The Handbook of Data Mining, pp. 426–440. Lawrence Erlbaum Associates, Mahwah, NJ, USA.
[75] Witmer, J. (2009) Mining Wikipedia for Geospatial Entities and Relationships. Master’s Thesis, Department of Computer Science, University of Colorado, Colorado Springs.
[76] Witmer, J. and Kalita, J. Extracting Geospatial Entities from Wikipedia. IEEE Int. Conf. on Semantic Computing, Berkeley, CA, USA, pp. 450–457. Institute of Electrical and Electronics Engineers.
[77] Witmer, J. and Kalita, J. (2009) Mining Wikipedia Article Clusters for Geospatial Entities and Relationships. AAAI Spring Symposium, Stanford, CA, USA, pp. 80–81. Association for the Advancement of Artificial Intelligence.
[78] Cortes, C. and Vapnik, V. (1995) Support-vector networks. Mach. Learn., 20, 273–297.
[79] Vapnik, V. (1998) Statistical Learning Theory: Adaptive and Learning Systems for Signal Processing, Communications, and Control. Wiley, New York, NY, USA.
[80] Cucerzan, S. (2007) Large-Scale Named Entity Disambiguation Based on Wikipedia Data. Proc. EMNLP-CoNLL, pp. 708–716. Conference on Empirical Methods in Natural Language Processing–Computational Natural Language Learning.
[81] Rabiner, L.R. (1989) A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE, 77, 257–286.
[82] Finkel, J.R., Grenager, T. and Manning, C. (2005) Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL’05: Proc. 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, MI, USA, pp. 363–370.
[83] Wang, C., Xie, X., Wang, L., Lu, Y. and Ma, W. (2005) Detecting Geographic Locations from Web Resources. 2005 Workshop on Geographical Information Retrieval, Bremen, Germany, pp. 17–24.
[84] Sehgal, V., Getoor, L. and Viechnicki, P. (2006) Entity Resolution in Geospatial Data Integration. ACM Int. Symp. on Advances in Geographical Information Systems, Irvine, CA, USA, pp. 83–90. Association for Computing Machinery.
[85] Zong, W., Wu, D., Sun, A., Lim, E. and Goh, D. (2005) On Assigning Place Names to Geography Related Web Pages. ACM/IEEE-CS Joint Conf. on Digital Libraries, Denver, CO, USA, pp. 354–362. Association for Computing Machinery–Institute for Electrical and Electronics Engineers, Computer Science section.
[86] Martins, B., Manguinhas, H. and Borbinha, J. (2008) Extracting and Exploring the Geo-Temporal Semantics of Textual Resources. IEEE Int. Conf. on Semantic Computing, pp. 1–9. Institute for Electrical and Electronics Engineers, Santa Clara, CA, USA.
[87] Smith, D. and Crane, G. (2001) Disambiguating Geographic Names in a Historical Digital Library. Research and Advanced Technology for Digital Libraries, pp. 127–136. Springer, New York, NY, USA.
[88] Vestavik, Ø. (2003) Geographic Information Retrieval: An
Overview. Department of Computer and Information Science,
Norwegian University of Technology and Science, Oslo, Norway,
available http://www.idi.ntnu.no/ oyvindve/article.pdf.
[89] D’Roza, T. and Bilchev, G. (2003) An Overview of LocationBased Services. BT Technol. J., 21, 20–27.
[90] Chang, C.-C. and Lin, C.-J. (2001) LIBSVM: a Library for
Support Vector Machines. Software available at http://www.csie.
ntu.edu.tw/ cjlin/libsvm.
[91] Woodward, D., Witmer, J. and Kalita, J. (2010) A Comparison
of Approaches for Geospatial Entity Extraction from Wikipedia.
IEEE Conf. on Semantic Computing, Pittsburg, PA, USA,
pp. 402–407. Institute for Electrical and Electronic Engineers.
[92] Chasin, R., Woodward, D. and Kalita, J. (2011) Extracting
and Displaying Temporal Entities from Historical Articles.
National Conf. on Machine Learning, Tezpur University,
Assam, India, pp. 1–13. Narosa Publishing House, New Delhi,
India.