Countering Security Threats Using Natural Language Processing
Pierre Isabelle
C. Cherry
R. Kuhn
S. Mohammad
National Research Council
Prepared By:
National Research Council
1200 Montreal Rd., Building M-50
Ottawa, ON K1A 0M5
Contract Reference Number: CSSP-2013-CP-1031
Technical Authority: Rodney Howes, DRDC – Centre for Security Science
Disclaimer: The scientific or technical validity of this Contract Report is entirely the responsibility of the Contractor and the
contents do not necessarily have the approval or endorsement of the Department of National Defence of Canada.
Contract Report
DRDC-RDDC-2016-C344
December 2016
© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2016
© Sa Majesté la Reine (en droit du Canada), telle que représentée par le ministre de la Défense nationale, 2016
Countering Security Threats Using Natural Language Technology
Prepared by:
P. Isabelle, C. Cherry, R. Kuhn and S. Mohammad
National Research Council
1200 Montreal Rd., building M-50
Ottawa, ON K1A 0M5
Contract Reference Number: CSSP-2013-CP-1031
Scientific Authority:
Rodney Howes
DRDC Centre for Security Science
613-943-2474
The scientific or technical validity of this Contract report is entirely the
responsibility of the Contractor and the contents do not necessarily have the
approval or endorsement of the Department of National Defence of Canada.
Abstract
The quantity of data that is available to information analysts has
been growing at an exponential rate in the last two decades and will
continue to do so in the foreseeable future. At the forefront of that
growth are the new social media such as Twitter, Instagram and Facebook. Those vehicles carry a wealth of information that could be
of great value to security analysts, but the challenge is to uncover the
small information gems in a huge mass of worthless material. In recent
years, Canada has invested substantial amounts of money in research
efforts on natural language technologies. The NRC has been highly
successful on that front, developing world-class technologies for machine translation, text summarization, information extraction and sentiment and emotion analysis. While these technologies are already
being used in various application areas, their potential in security
analysis remains to be firmly established.
This is exactly what this technology demonstration project set out
to accomplish. Together with our industrial partners Thales TRT
(Quebec city) and MediaMiser (Ottawa), and with the assistance of a
professional intelligence service, we have developed a prototype system
that can: 1) monitor social media on an ongoing basis, extracting from
them huge multilingual collections of documents about interesting topics; 2) translate foreign language documents within such collections;
3) enrich the content of all documents using advanced linguistic analysis such as information extraction and sentiment analysis; 4) store
the results in a special-purpose database; 5) provide users with unmatched flexibility in tailoring multi-faceted search queries based on
criteria as diverse as source language, document genre, posting location, posting date, keywords, linguistic entities, author sentiment and
emotions; and 6) present users with rich visualizations of the results
of their search queries.
The core part of this report contains the following: a) a mostly non-technical presentation of the architecture of the prototype system that constitutes our main project result; b) an extensive video tour of that
prototype which makes it easy to understand its value for information
analysts; and c) a description of the user contributions, feedback and
conclusions about this prototype system which is found to be “on its
way to being a high-quality analytic tool [...]”.
Contents
1 Introduction
2 Overall Architecture
  2.1 MediaMiser: data collection
  2.2 NRC: text enrichment
  2.3 Thales: data storage, retrieval and visualisation
3 A tour of the technology demonstrator
4 User Contributions and Feedback
5 Conclusions
A Machine Translation
  A.1 Improving the throughput of the MT module
  A.2 Improving Translation Quality
    A.2.1 Rules for Handling Tweets
    A.2.2 Additional Training Data
  A.3 Discussion and Recommendations
B Summarization
  B.1 Functionality
  B.2 Interface
  B.3 Summarization Algorithm
C Information Extraction
  C.1 Named Entity Recognition
    C.1.1 Data
    C.1.2 Methods
    C.1.3 Experiments
    C.1.4 Discussion
  C.2 Entity Linking
D Sentiment & Emotion Analysis
  D.1 Sentiment Lexicons
    D.1.1 Existing, Automatically Created Sentiment Lexicons
    D.1.2 New, Tweet-Specific, Automatically Generated Sentiment Lexicons
  D.2 Task: Automatically Detecting the Sentiment of a Message
    D.2.1 Classifier and features
List of Figures
1 Overall architecture of the CST system
2 Initial MT module
3 Improved version of MT module, December 2014
4 Speed improvement in MT module
5 Handling of Twitter #hashtags
6 Handling non-Arabic scripts and multiple hash tags
7 MT module, version 2
8 A user can drag a selection box to select tweets under his or her concern, to generate a summary. In the figure, two spikes of tweets between September 9th and 11th were selected.
9 Once the summary is ready, an “Open Summarization” button is shown (upper figure). Users can then click the “Open Summarization” button to read the summaries (lower figure).
10 An example of semi-Markov tagging.
List of Tables
1 Details of our NER-annotated corpora. A line is a tweet in Twitter and a sentence in newswire.
2 A system trained only on newswire data, tested on newswire (CoNLL) and social media data (Rit11, Fro14). Reporting F1.
3 The progression of the NRC named entity recognizer throughout the CST project. Reporting F1.
4 F1 for our final system, organized by entity class.
1 Introduction
The quantity of data that is available to information analysts has been growing at an exponential rate in the last two decades and will continue to do
so in the foreseeable future. At the forefront of that growth are the new
social media such as Twitter, Instagram and Facebook. It is clear that those
vehicles carry a wealth of information that can be of great value to security
analysts. An international example might be blogs or tweets in Arabic or
Chinese that indicate developing threats to Canadian embassies. A Canadian example might be blogs or tweets from Canadians that suggest a social
disturbance may be developing. Recall that during the June 2011 Vancouver hockey riot, many rioters and onlookers used Twitter, Facebook, etc. to
describe what was going on in real time.
Unfortunately, the exponential growth in data size on social media also
means that any valuable information nugget tends to be buried underneath
massive amounts of irrelevant material. For security analysts, time is of the essence: they cannot afford to waste much time tossing away large amounts
of worthless chaff. A large proportion of the potentially useful data is in the
form of natural language texts of many different languages. Consequently,
information analysts could greatly benefit from tools that could make them
more efficient at finding useful pieces of information hidden within massive
quantities of multilingual text.
In recent years, most developed countries, including Canada, have invested substantial amounts of research money into natural language processing (NLP) technology. Lately, researchers in that area have been moving
away from the traditional paradigm of manually-encoded rule systems to
embrace a radically different paradigm: that of machines that can automatically learn from examples. This has resulted in very significant progress on a
broad range of applications including machine translation, text classification,
summarization, information extraction and sentiment & emotion analysis.
Canadian researchers have been highly active on that R&D front. For example, the NRC has succeeeded in developing leading-edge technology on all
the applications just mentioned. Our “leading-edge” qualification is backed
up by the fact that over the last ten years, NRC has repeatedly and consistently obtained some of the very best marks in international technology
benchmarking exercises including the following ones:
• In 2012 NRC tied with Raytheon BBN for first place in Chinese-to-English and Arabic-to-English machine translation at NIST Open MT 2012. [1]
• In 2010, 2011 and 2012, NRC participated in the i2b2 technology benchmarking exercise for information extraction in the medical domain. [2] Each time NRC’s results placed at the top. See for example [Zhu et al., 2013].
• Between 2013 and 2015, NRC’s sentiment analysis technology was a top performer on six different tasks of the SemEval annual benchmarking exercises. [3] See for example [Wilson et al., 2013].
• In 2014 and 2015, NRC’s text categorization technology ranked first in the Discriminating Similar Languages Shared Task. [4] See [Goutte et al., 2014].
Moreover, NRC’s NLP technologies have already been deployed in many
practical applications. Here are some examples:
• The Extractor multilingual text summarization technology has been on the market for about 15 years. [5]
• The PORTAGE machine translation system has been commercialized since 2009 and is currently in use by several private linguistic service providers as well as by the Canadian Translation Bureau. [6]
However, to the best of our knowledge, the value of state-of-the-art NLP
technology has yet to be firmly established in the context of security analysis.
The goal of the present CSSP project was precisely to demonstrate that
there is indeed substantial value there for security analysts. Most of the
relevant technologies were already available individually at the start of the
project. The core of our effort was devoted to: 1) adapting each technology to the specificities of social media and security analysis; 2) assembling these components into a coherent, demonstrable system that can be tested by
[1] See http://www.itl.nist.gov/iad/mig/tests/mt/
[2] See https://www.i2b2.org/
[3] See https://en.wikipedia.org/wiki/SemEval
[4] See http://ttg.uni-saarland.de/lt4vardial2015/papers/goutte2015.pdf
[5] See http://www.extractor.com/
[6] See http://www.terminotix.com/
professional security analysts; 3) collecting feedback from analysts and using
it to produce successively improved versions of the prototype.
Such a system has successfully been assembled and extensively demonstrated not only to the user-partner of this project but also to many other
organizations, both public and private.
In section 2 we will examine the architecture of the project and of the
resulting technology. In section 3 we will take the reader on an audio-visual
tour of the prototype system that was built. In section 4, we will synthesize the feedback that we received from our user-partner after testing the
technology at different stages of its development.
Then, more technical aspects of our work are presented in a set of appendices that the less technically-minded reader can safely ignore.
2 Overall Architecture
Our project involved three technical contributors: NRC, Thales TRT Canada
Inc. and MediaMiser Inc. Each contributor developed or adapted software
of its own and made it available to the project partners, typically through
web services. Figure 1 shows the overall architecture.
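To make the division of labour in Figure 1 concrete, here is a minimal Python sketch of the overall data flow; every name in it is an illustrative assumption rather than actual project code.

```python
# Minimal, illustrative sketch of the CST data flow (all names are assumptions):
# MediaMiser collects and normalizes documents, NRC enriches them, and
# Thales stores the result for search and visualization.

def collect(stream):
    """MediaMiser stage: normalize raw postings into a common record format."""
    for raw in stream:
        yield {"text": raw["text"], "source": raw.get("source", "twitter")}

def enrich(doc):
    """NRC stage: placeholder enrichment standing in for MT, NER, sentiment, etc."""
    doc["entities"] = []        # real system: named-entity extraction
    doc["sentiment"] = 0.0      # real system: score in [-1, +1]
    return doc

DATABASE = []                   # stand-in for Thales' document-oriented database

def index(doc):
    """Thales stage: store the enriched document and make it searchable."""
    DATABASE.append(doc)

stream = [{"text": "Reports of unrest near the embassy", "source": "twitter"}]
for doc in collect(stream):
    index(enrich(doc))
print(DATABASE)
```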
2.1 MediaMiser: data collection
Each social medium produces its own stream of documents. Generally speaking, those streams are much too large to be captured in their entirety. In practice, this means that interesting “topics” have to be monitored on a continuing basis so that potentially interesting documents about each such topic can be captured on the fly, ahead of examination time by the users, and stored for further processing and examination.
For the purposes of our project, the partners agreed on a small number of topics (often referred to as “scenarios” in our project documentation) to be used as a testbed for the technology. Each selected topic was then encapsulated
as a boolean combination of keywords which was used by MediaMiser to
extract matching documents on a continuing basis from the set of media
that we were monitoring.
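The boolean keyword matching described above can be illustrated with a small Python sketch; the topic definition and helper functions are hypothetical and stand in for MediaMiser’s actual (much richer) monitoring infrastructure.

```python
import re

# Hypothetical topic definition: a boolean combination of keywords.
# Here a topic is a list of OR-groups that must ALL match (conjunctive form).
SYRIA_TOPIC = [
    {"syria", "syrian"},            # at least one of these ...
    {"war", "crisis", "conflict"},  # ... AND at least one of these
]

def tokenize(text):
    """Lower-case word tokens; a stand-in for a real tokenizer."""
    return set(re.findall(r"\w+", text.lower()))

def matches_topic(text, topic):
    """True if every OR-group shares at least one keyword with the text."""
    tokens = tokenize(text)
    return all(group & tokens for group in topic)

# Documents streaming in would be kept or discarded on the fly.
print(matches_topic("Reports of a new crisis in Syria today", SYRIA_TOPIC))  # True
print(matches_topic("Winter games open in Sochi", SYRIA_TOPIC))              # False
```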
[Figure placeholder — Overall System Architecture: Twitter, blog and news streams are collected as text, enriched into annotated text, and streamed into the search and visualization components.]

Figure 1: Overall architecture of the CST system

Here are some of the topics that were in focus at some time during the project:
• Sochi Olympics. During winter 2014, we collected a sizable amount of data about the Sochi Olympics, using keywords such as “Sochi Olympics”,
“winter games”, etc. The scope of the collection process was limited to
English documents.
• The Ottawa War Memorial shootings. In October 2014, we started
collecting data about the October 22 shooting events and their impact
soon after the shooting happened, using keywords such as “Ottawa
shooting”. Fortunately, using Twitter’s historical search mechanism,
we were also able to go back in time, so that our War Memorial collection covers the whole event. Here again, the collection process was
limited to English documents.
• The Syria crisis. Almost 100 million documents about the current civil
war in Syria were extracted from social media and processed between
June 2014 and the end of the project. In this case, both English and
Arabic keywords were used so as to extract documents in those two
languages. In the second year of the project, most experiments and
demonstrations concentrated on that particular dataset.
MediaMiser’s primary role was to monitor social media on a continuing
basis for documents matching the keywords associated with any of our active
topics and to extract all such documents from each relevant stream, no matter how numerous the matches might be. The volume of the Syria dataset
reached a peak of about 1.5 million documents per day.
In practice it was found necessary to limit the media sources to the following list: Twitter, plus various English newswire and English blog wires
that were already being monitored by MediaMiser for other purposes. Later
on, we added Arabic blogs for the benefit of our Syria dataset.
The data extracted by MediaMiser included not only the documents as
seen on the social media platforms but also some metadata that the different
media provide about each document. For example, documents extracted
from Twitter each take the form of a JSON record which, in addition to the document text, includes metadata elements such as the following (a simplified sketch of such a record appears after this list):
• Author’s (pen) name.
• Author’s (declared) place of residence.
• When available, geographical coordinates of the origin of the posting.
This is only available in cases where the document was sent from a
mobile device that then had geo-tracking turned on (about 2 percent
in the case of our Syria dataset).
• Language of the document. Twitter provides that information using
their in-house language guesser. Our other sources contained English-only or Arabic-only documents.
• Social network information (Twitter only). The author field of each
document is enriched with the lists of following and followed authors
as well as a list of favorites.
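For concreteness, a normalized record of the kind just described might look roughly as follows; the field names are assumptions for illustration and do not reproduce the actual Twitter or MediaMiser schemas.

```python
# Illustrative (not the actual Twitter/MediaMiser schema) normalized record
# for one collected tweet, before NRC enrichment.
tweet_record = {
    "id": "123456789",
    "text": "Reports of shelling near Aleppo tonight #Syria",
    "author": {
        "pen_name": "some_user",                 # author's (pen) name
        "declared_location": "Beirut, Lebanon",  # free-text profile field
        "following": ["user_a", "user_b"],       # social network information
        "followers": ["user_c"],
        "favorites": ["user_d"],
    },
    "geo": None,             # precise coordinates only for ~2% of tweets
    "language": "en",        # Twitter's in-house language guess
    "timestamp": "2014-09-10T21:14:03Z",
    "source": "twitter",     # vs. "news" or "blog"
}
```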
As the details of the available metadata vary greatly between different
sources, MediaMiser carried out a process of metadata normalization so as
to simplify downstream processing.
The initial plan was that, once normalized, the data extracted by MediaMiser would be streamed to NRC for linguistic processing, and the result
would thereafter be streamed to Thales TRT in Quebec City where it would
be stored and made searchable. However, as it turned out, it proved impractical for NRC to deploy the high-bandwidth web service that would have
been required for that purpose. For that reason, it was decided to install
NRC’s linguistic technology on MediaMiser’s premises.
The heavy computational burden involved in language processing (especially for machine translation) was thereby transferred to MediaMiser, forcing
them at times to struggle to keep up with the incoming flow of documents.
The available computing resources turned out to be somewhat underpowered
given the high demands placed by NRC linguistic technology (especially those
of machine translation). As a result, during some of the peaks in data volume, MediaMiser was unable to translate all of the extracted documents. In
such cases, the original version of the document was extracted and streamed
to Thales without its translation. In order to minimize the impact of this
problem, MediaMiser provided Thales with access to the NRC translation
server running on their premises. Thales was then able to add an on-demand
translation service to the demonstration system, so that the user would still
be able, if needed, to see translations of the foreign language documents that
had not been translated at capture time.
2.2 NRC: text enrichment
NRC’s primary role was to enrich the texts from the documents collected
on social media using its leading-edge natural language technology. This
included the following components:
• Machine translation. The initial plan was to deploy the NRC’s
PORTAGE statistical machine translation system for both Chinese-to-English
and Arabic-to-English. However, as the project unfolded, the partners
decided to concentrate on Arabic-to-English only, thereby reflecting
the growing focus of the project on the Syria crisis dataset. The first
step was to integrate a general-purpose Arabic-to-English PORTAGE
translator into the CST prototype. Then, in a second step, the translation component was customized for the peculiarities of social media
texts (in particular, Twitter), thereby obtaining significant gains in the
quality of the translations.
• Document summarization. NRC proceeded to deploy its Extractor™
technology to automatically pull out from each document:
– A set of words or phrases that can be considered as good keywords
for that document, in that they reflect the core contents of the
document. In the CST prototype, Extractor keywords are called
“topics”.
– A set of sentences that constitute a good summary for the document. The connection with the above keywords is that the extracted sentences are chosen so as to maximize the exposition
of those keywords, without being overly redundant among themselves.
Note that Extractor summarization operates on single documents only.
In the case of micro-blogs such as Twitter, sentence extraction is irrelevant since documents typically contain only one sentence. However,
during the project we realized that a technology capable of summarizing related groups of documents would be useful to security analysts.
This is why halfway into the project we decided to develop such a
capability (see below).
• Information extraction. We use this name for any technology capable of automatically extracting structured information from semi-structured or unstructured text. The kinds of information users are typically interested in include the following:
– What entities (e.g. persons, places, dates, amounts, etc.) are being
referred to in some particular text or collection of texts?
– What relationships are being expressed between those entities (e.g.
person X was born on date Z).
– What events are being described involving those entities (e.g. an
earthquake happened at place Y on date Z).
For the purposes of this project we chose to concentrate on extracting
the following entity types: persons, places and organizations. Early in
the project, a first entity extractor was integrated which was based on a
pre-existing generic open-source implementation. As that initial version yielded unsatisfactory results, we successively
developed two improved versions. The final one yielded some of the
best results ever reported on entity extraction for social media (see
details in Appendix C below).
The more advanced entity extractors include a capability to merge together references to one and the same entity through different expressions. For example, both “Gaddafi” and “Qadaffi”are sometimes used
to refer to Muammar Gaddafi, the former Lybyan leader. The final
version of our CST entity extractor implements such a functionality. It
does so by linking each entity mention with its Wikipedia page (assuming it has one). “Gaffadi” and “Qadaffi” would then lead the user to the
same Wikipedia page, namely https://en.wikipedia.org/wiki/Muammar Gaddafi.
Extracting meaningful relationships and events from the text itself was
beyond the scope of our resource-limited project. However, as we will
see below, the CST prototype still contains mechanisms for extracting
simple co-occurrence relationships such as: entities X and Y tend to
occur in the same documents or in documents by the same author or
in documents that share the same hashtags, etc.
• Sentiment and emotion analysis. We use the term “sentiment analysis” to designate the operation of assigning a given piece of text to
one of the categories “positive”, “negative” or “neutral” according to
whether the author of the text is expressing a positive, negative or neutral attitude towards the content of that piece of text. In this project,
our sentiment analyzer is applied independently on each sentence of an
input document and the result is a sentiment score ranging between -1
(perfectly negative) and +1 (perfectly positive) with 0 meaning perfectly neutral.
We designate under the term “emotion analysis” the operation of assigning to some piece of text one or more labels denoting the emotions,
if any, that the author of that piece of text is conveying through it. In
this project, we use the following list of six emotions that are drawn
from Plutchik’s set of basic emotions [Plutchik, 1962]: joy, surprise,
sadness, dislike, anger and fear. Our emotion analyzer is also applied
independently to each sentence of an input document. Each sentence
receives a number between 0 and 1 for each of the six emotions, according to the measured strength of the relevant emotion in that sentence.
• Multi-document summarizer. As mentioned above, halfway into the
project we realized that there was an acute need for a capability to summarize groups of documents rather than just single documents. This
particular task turns out to be quite different from that of summarizing
individual documents. For example, a multi-document summarizer is
likely to have to deal with much more redundancy in its input. Think
for example about similar accounts of the same event being published
by different newspapers. In the case of micro blogs such as Twitter, single document summarization is irrelevant but a capability to summarize
groups of tweets is potentially very useful. For example, sudden peaks
in volume on a given topic are most often caused by specific events.
Applying a multi-document summarizer to the set of documents in a
given peak would instantly bring such events to light.
NRC decided to produce its own novel technology for multi-document
summarization and to apply it to the needs of the CST project. Over
the last 6 months of the project, two successive versions of NRC’s
multi-document summarizer were incorporated in the CST prototype
system.
An interesting aspect of this evolution is that it brought a significant
change in the overall project architecture. Up to then, NRC’s language
technology was only working at the single-document level: documents
extracted by MediaMiser were individually subjected to summarization, information extraction and sentiment and emotion analysis by a
process devoid of any awareness of the broader collection to which the
document belongs.
However, the scenario in which multi-document summarization is needed
is one in which some user arbitrarily targets some specific group of documents (e.g. those contained in a particular peak). As a result, multi-document summarization cannot be performed at collection time: it
needs to be user-triggerable at any time. We were thus led to amend
the overall system architecture so as to give Thales, our system integrator, direct access to the NRC’s multi-document summarizer. Thales
was then able to implement a user-triggered multi-document summarization capability in our prototype system.
• Implementation of NRC’s linguistic technology
NRC’s linguistic processing components are implemented through the
following three schemes:
1. A machine translation service embodied as a batching and queuing
system running on a machine located in MediaMiser’s data center.
MediaMiser calls this service upon extracting any document that
is marked as non-English. The translated text is then added as
an additional metadata field in the JSON record associated with
the document. During the project, this was used to deal with the
50 million Arabic documents included in the Syria crisis dataset.
All the other datasets used in the project were English-only.
Machine translation is a computation-intensive technology. As
mentioned above, during some peaks in the volume of extracted
foreign language data, the machine translation server was occasionally unable to cope in real time with the full incoming flow. On
such occasions, some of the foreign language data was streamed to
Thales without any translation or with only a partial translation.
However, Thales was provided access to the translation server so
that they were able to build an on-demand machine translation
service that the users could resort to in case they were interested
in reading some of the untranslated foreign language documents.
2. A linguistic annotation service implemented as a Representational
State Transfer (REST) service working on JSON records. The
Web server would then call the following sub-annotators: a tokenizer, the named-entity extractor and the sentiment and emotion analyzer, each implemented as one or more TCP/IP servers,
as well as the Extractor summarizer implemented as a directly
callable library. The effect is as follows:
– The NRC tokenization sub-service segments the text from
each input document into separate word tokens and separate
sentences. The segmented version of the text is stored in additional metadata elements in the JSON record of each document. This prepares the ground for the application of the
remaining sub-services.
– The NRC Extractor sub-service adds to each input document:
a) a metadata field containing a set of “key words” or “key
phrases” that capture the topic of the document; and b) another metadata field which, in the case of multi-sentence documents, contains a few key sentences that constitute an extractive summary of the document.
– The NRC entity extraction sub-service adds to each document
a metadata field containing a list of entities of each of the
following types: person, place or organization. An additional
metadata field is also added to associate each such entity with
a link to its Wikipedia entry (if it has one).
– The NRC sentiment and emotion analysis sub-service adds
the following metadata fields to each document: a) a field containing a document-level aggregation of sentence-level sentiment, namely the proportion of positive, negative and neutral sentences in the document; b) a field that contains the
list of sentence-level sentiment and emotions. In the latter
field, each sentence of the document is assigned its most likely
sentiment (positive, negative or neutral) together with probabilities for each, plus a probability value for each one of the
following emotions: anger, dislike, fear, joy, sadness and surprise.
– And finally, NRC’s multi-document summarizer service has
been implemented as a Java library running on Thales’ machines in Quebec City. Recall that each set of documents
to be summarized by that module is selected by the user at
runtime.
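As a recap of the enrichment steps above, the sketch below shows what a single document record might contain once all NRC sub-services have run; the field names and values are illustrative assumptions, not the project’s actual JSON schema.

```python
# Schematic view of one document after NRC enrichment (field names are
# assumptions for illustration; the project's actual JSON schema differs).
enriched_record = {
    "text": "...",                              # original text (from MediaMiser)
    "translation": "...",                       # MT output for non-English documents
    "tokens": [["Reports", "of", "shelling"]],  # tokenizer: one token list per sentence
    "topics": ["chlorine attack", "Aleppo"],    # Extractor key words / key phrases
    "summary": ["First key sentence."],         # extractive summary (multi-sentence docs)
    "entities": {
        "person": ["Muammar Gaddafi"],
        "place": ["Aleppo"],
        "organization": ["UN"],
    },
    "entity_links": {                           # link to Wikipedia when a page exists
        "Muammar Gaddafi": "https://en.wikipedia.org/wiki/Muammar_Gaddafi",
    },
    "sentiment": {"positive": 0.1, "negative": 0.7, "neutral": 0.2},  # sentence proportions
    "sentences": [
        {"sentiment": "negative", "score": -0.8,
         "emotions": {"anger": 0.6, "dislike": 0.3, "fear": 0.7,
                      "joy": 0.0, "sadness": 0.5, "surprise": 0.1}},
    ],
}
```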
2.3 Thales: data storage, retrieval and visualisation
For each active dataset (or “scenario”) Thales is receiving an uninterrupted real-time stream of social media documents in JSON format.
As discussed above, the metadata associated with each document has
been normalized by MediaMiser and enriched using NRC technology
with translations, document summaries, list of included entities and
sentiment and emotion markings. All received documents are then
stored and made searchable.
For that purpose, Thales developed their own document-oriented database
which supports real-time indexing of huge quantities of documents, as
well as real-time search based on multi-faceted queries on the raw document content and the associated metadata.
Starting from the whole collection of datasets available in the system,
the user is given a wide range of filtering mechanisms that allow narrowing down to specific subsets of interest (a sketch of a combined multi-faceted query follows this list):
– Dataset. Each document fed into the database belongs to one of
the datasets being actively monitored in the project and identified
as such in its metadata. The user starts by choosing one among
the available datasets, such as “Syria crisis” (which currently contains some 100 million documents), “War memorial shootings” or
“Ebola Canada”.
– Language of the documents to be retrieved. This relevant
piece of metadata is inserted in each tweet by Twitter. Our other
sources are all monolingual and MediaMiser adds the relevant language metadata at collection time. The system user is then provided with the means to narrow down their focus to one or more
of the languages for which documents are available. With the
current datasets this is only used for the “Syria crisis” dataset,
which is made up of roughly equal numbers of English and Arabic
documents.
– Genre(s) of documents to be retrieved. MediaMiser adds
to each extracted document a metadata field describing its genre as one of the following three: “tweet”, “news” or “blog”.
The user can then restrict the scope of any given search to any
combination of those three genres.
– Documents matching some user-specified boolean combination of words (e.g. “chlorine AND attacks”). This is of course
a very basic and standard mechanism for filtering down any document collection. A user-selected option is also provided to allow
for matching the query not only against the document text but
also against its metadata elements.
– Documents posted at a specific time. The metadata provided
to Thales includes a posting timestamp on each document. Thales
was able to use this to provide the user with the means to restrict
the search to any time interval between the moment the dataset
under inspection started being collected and the present time.
– Location of posting. The user can select on a map any specific rectangular geographical area to which the search will be restricted. This interesting functionality is unfortunately not available for all documents, since the relevant metadata is only available for a subset of them. For example, only about 2% of Twitter
posts come with metadata indicating the precise geographical coordinates of the posting site, namely those posts that were sent
with a device in which this kind of tracking is both available and
enabled. However, Thales has contributed a mechanism that attempts to infer the posting location from the Twitter user profiles,
in which users are allowed to include a free text description of their
place of residence. Such descriptions often turn out to be difficult to interpret because they are not standardized (e.g. variable
granularity of the location: country, region, town, etc). Moreover,
the residence location and the posting location are not necessarily
identical: the author may be traveling or misrepresenting their true
place of residence. But this approximation allows us to increase
the geo-location coverage to about 30% of the input data.
– Hashtags (Twitter only). The user can restrict the document
collection to those bearing some particular hashtag(s).
– @authors (Twitter only). The user can restrict the document
collection to those posted by a specific author.
– Topics. NRC’s Extractor system has been used to annotate the
English text (original or obtained by machine translation) with a
set of words and phrases that capture the topic of the document.
The user is enabled to filter down the current collection to those
documents marked with any of those topics.
– Entities. NRC has added to each document some metadata showing the entities that have been identified in the text of that document, among the following entity types: persons, places and organizations. The user can take advantage of this marking and filter
down the current collection to those documents containing some
particular entity or set of entities.
– Sentiment. NRC has added to each document metadata that
indicates a sentiment score between -1 (completely negative) and
+1 (completely positive). The user is given the means to filter
down the current collection to show only those within some subrange of sentiment score. For example, a user might use this to
filter the set of posts that contain the word “ISIS” to the subset
of them that is very positive (say, sentiment score > 0.75).
– Emotions. NRC has added to each document a score ranging between [0-1] for each of six emotions: joy, surprise, sadness, dislike,
anger and fear. Thales has used a threshold of 0.5 to binarize the
presence or absence of each emotion. The user is then enabled
to filter down the current selection so as to only show those that
express some particular emotion. For example, the user could ask
to see all posts that refer to some particular entity (say, a given
person) while expressing anger.
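To illustrate how these facets combine, a multi-faceted query could be represented along the following lines; this is a sketch under assumed field names, not Thales’ actual query syntax.

```python
# Hypothetical representation of a multi-faceted search query combining several
# of the filters described above (not Thales' actual query language).
query = {
    "dataset": "Syria crisis",
    "languages": ["en", "ar"],
    "genres": ["tweet", "blog"],
    "keywords": "chlorine AND attacks",           # boolean keyword filter
    "time_range": ("2014-09-09", "2014-09-11"),
    "hashtags": ["#Aleppo"],
    "entities": {"person": ["Bashar al-Assad"]},
    "sentiment_range": (0.75, 1.0),               # only very positive posts
    "emotions": ["anger"],                        # score >= 0.5 counts as present
}

def passes_sentiment(doc, query):
    """One facet shown as an example; the other facets follow the same pattern."""
    low, high = query["sentiment_range"]
    return low <= doc.get("sentiment_score", 0.0) <= high

print(passes_sentiment({"sentiment_score": 0.9}, query))   # True
print(passes_sentiment({"sentiment_score": -0.4}, query))  # False
```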
Given any collection of documents that represents either a complete
dataset (such as “Syria crisis”) or a subset of it that has been obtained
using any combination of the filtering devices enumerated above, the
user will be presented with viewing mechanisms both at the collection
level and at the document level.
At the collection level, the user can see all of the following:
– A timeline that displays the density level of postings over
time bins that span the whole period during which the sub-collection
was extracted (up to now if the extraction process is still active).
This is useful, for example, to spot significant peaks in volume,
which are often associated with important new events. This is
implemented as a bar chart. Moreover, each bar is segmented
in such a way as to display the sentiment distribution between
positive (green), neutral (blue) or negative (red). This makes it
possible to observe the evolution of relative sentiment over time.
Note also that the timeline display is interactive in that it allows
the user to apply the time-based filtering mentioned above.
– A map that shows the geographical distribution of postings in the current sub-collection. That distribution can be observed at various levels of granularity from a whole-earth view
down to city-block level using a zooming function. The display
can be switched between a small widget on the main interface Web
page and a fullscreen view. Here again, the display can be used
not only as a viewing device over the current collection but also
as a triggering device for the location-based filtering mentioned
above.
– A word cloud that shows the relative salience of user-selected classes of objects within the current sub-collection.
The classes that can be selected include the following: hashtags,
topics, twitter authors, persons, places and organizations. The
latter three classes correspond to the classes of entities extracted
using NRC technology. Without such a technology it would not
be possible to observe the relative salience of persons (or places,
organisations) in a given dataset. Once again, the display is used
not only as an output device but also as a trigger mechanism for document filtering. When the user clicks on any element of the word cloud, the current collection gets filtered down to the sub-collection containing that element. Note that the word cloud always interacts strongly with all the filtering mechanisms. For example, if the collection is reduced to the production of one particular author, a word cloud on topics makes it possible to instantly survey the range of interests of that particular author.
– A sentiment graph that shows the distribution of posts
on the negative/positive axis. One can use this to observe
differences between datasets or subsets. For example, one can
easily see that the recently introduced “airline” dataset is neatly
centered on neutrality while the “Syria crisis” dataset is skewed
towards the negative side. Like the other widgets described above,
the sentiment graph is not only a display mechanism. It also serves
as a triggering device for sentiment-based filtering: selecting any
range on the sentiment axis will have the effect of filtering the
current collection down to documents that are within that range.
– An emotion graph that displays the relative salience of
the six annotated emotions in the current sub-collection. The
user can toggle the display between a bar chart and a radar chart. Like the other widgets described before, this one can also be used as a filtering trigger: clicking on the zone associated with a particular
emotion will have the effect of filtering the current collection down
to documents that have been marked as expressing that emotion.
– A co-occurrence network that allows the user to examine co-occurrence relations between various kinds of objects within the current sub-collection (a small counting sketch follows this list). In doing so, we can distinguish
between: a) the objects between which co-occurrences are to be
observed; b) the domain in which the co-occurrence is taking place;
and c) the strength of a given co-occurrence between a pair of objects in a given domain. In a graphical representation, the objects
are the nodes, the domains can be represented by types of edges
between the nodes and the association strength can be represented
as the relative thickness of the edges. The network available in our
final prototype should be viewed as an incomplete attempt to give
users a very general tool for exploring a large variety of possible
associations. The generality comes from offering a large choice of
objects (the nodes can be hashtags, topics, authors, posting locations and entities including persons, places and organisations) and
a large choice of co-occurrence domains (single documents, documents from the same author, documents about the same topic,
documents referring to the same person, etc.). Unfortunately, we ran out of time before we could implement any display of association strength. In the current state, the user can toggle between
the small widget view and a fullscreen view. In the widget view,
one can choose among preset types of co-occurrence relations. For
example, the selection “Co-mentioned hashtags” displays pairs of
hashtags that appear at least once in the same document. Initially, the view is centered on one arbitrary hashtag, but the user
can change that center at will. Two other presets include the same
kinds of document-internal co-occurrence relation but for topics
and entities. The remaining presets tackle more complex kinds
of co-occurrence that will not be discussed here. When the user
switches to the fullscreen view, the same presets are available, but
the user can also select a custom type of co-occurrence among an
almost endless variety. For example, a user choosing “topics” as
nodes and “same document” as the domain would get the same as one of the
presets mentioned above. However, when the domain is moved
to “same author”, the target is shifted from the number of documents to the number of authors mentioning those two topics. We
believe that the approach we have sketched opens up a large array
of interesting possibilities which will be easier to investigate once
the quantitative aspect (relative strength of co-occurrences) has
been implemented.
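A minimal sketch of how such co-occurrence edges can be counted over a set of documents is given below; the real prototype computes these networks inside Thales’ database, so the code is purely illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, field="hashtags"):
    """Count how many documents mention each pair of objects (e.g. hashtags).

    The counts give the strength of each edge; the objects are the nodes.
    Choosing another field (topics, entities, authors) gives another network.
    """
    counts = Counter()
    for doc in documents:
        objects = sorted(set(doc.get(field, [])))
        for a, b in combinations(objects, 2):
            counts[(a, b)] += 1
    return counts

docs = [
    {"hashtags": ["#Aleppo", "#Syria"]},
    {"hashtags": ["#Syria", "#chlorine", "#Aleppo"]},
    {"hashtags": ["#Sochi2014"]},
]
print(cooccurrence_counts(docs))
# ('#Aleppo', '#Syria') appears in 2 documents; the other pairs in 1 each.
```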
3 A tour of the technology demonstrator
We are pleased to offer the reader a detailed tour of the demonstrator
system that constitutes the main deliverable of our CSSP project. The
present report should normally be accompanied by an HTML file named “Tour of the CST prototype system.html”, which contains a user interface to the video, and a rather large file named “CST Tour.webm”, which contains the video itself. All you need to do to embark on the tour is to load “Tour of the CST prototype system.html” in your Web browser.
The whole tour takes about 28 minutes, but it is possible to cherry-pick
your preferred topics using a set of cue points provided in the user
interface.
Enjoy the tour!
4 User Contributions and Feedback
The main roles of the user partner as defined in the initial project
plan were: 1) to help us define the requirements and functional
specifications that our technology should meet; 2) to manually annotate
some raw data extracted from social media in order to help us train
supervised machine learning algorithms; and 3) to provide feedback on
the successive versions of the system prototype developed in the course
of the project so as to help prioritize areas for improvement.
Concerning the first point, a noteworthy contribution from our users
was their strong involvement in the so-called “scenario day” meeting
that took place on 2 October 2013. While all partners were represented,
the user partner sent a 10-strong delegation to discuss the technology
requirements for our project. Given the need for the project to work in
an unclassified setting, it was agreed on that occasion that we should
focus on requirements that would allow us to deal well with real-life
“proxy scenarios” (i.e. datasets). It was then agreed that the following
datasets presented all the characteristics and complexity of the genuine
datasets that are of interest for information analysts:
– The Sochi Olympic Winter Games (February 7-23, 2014);
– The League of Legends online game;
– The civil war in Syria.
Data on those three topics soon started to be extracted from social
media and was used for our experiments and demonstrations. In particular, the Syria dataset soon became the core focus for the remainder
of the project.
User comments frequently led to revisions in our priority scheme. One
of the most important such revisions was to abandon our original plan
of working on the Chinese language as well. Given the developing emphasis of the project on the testbed provided by the Syria crisis dataset,
the users pointed out their preference for the project to concentrate on
doing the best possible job on the Arabic and English languages.
During the project, our user partner also manually annotated some
data from the Syria crisis dataset to help us evaluate the performance
of our technology on that kind of material:
– Evaluate the level of noise in the raw data extracted by MediaMiser (Syria dataset).
– Evaluate the precision of our sentiment and emotion analysis on
the Syria dataset.
– Provide English translations for some Arabic documents from the
Syria dataset.
Given available human resources, it was obviously not possible to annotate enough data for training machine learning algorithms. Rather,
the data annotated by our users was used for technology evaluation
purposes. The three samples described above allowed us to confirm that our
technologies were working reasonably well on the Syria dataset.
In the initial project plan, we had assumed that three different versions
of the prototype would successively be tested in house by our user
partner. Unfortunately, we soon found out that this was impractical.
Since our system was being developed in an unclassified setting, it was
not possible to address our users’ security requirements for in-house
technologies in a satisfactory manner.
The on-site testing was thus replaced with two different test settings.
The first one was a series of demonstrations and hands-on sessions that
were organized for representatives of our user partner:
– Task 2 “go-nogo” meeting (11 March 2014);
– Task 3 meeting (28 August 2014);
– Task 4 meeting (12 February 2015);
– Task 5 meeting (4 May 2015);
– Task 6 “final” meeting (28 July 2015).
The second test setting was a permanent one: even though this was
not part of the original plan, Thales decided to provide project partners
with uninterrupted access to the evolving version of the CST prototype
through the Web. This way it was also possible for project partners
to give spontaneous demos of the prototype to interested parties who
were not official project participants and to collect their feedback. This
ongoing availability proved to be a huge asset for participants.
We now turn to synthesizing the feedback our prototype system received from our user partner. Looking at the final version of the prototype, we can say that multiple components were implemented as a
result of user feedback:
– Various data filtering, sorting and grouping mechanisms such as
filtering by authors or languages, sorting by chronological order
and grouping tweets and their retweets.
– The capability to exclude any geographical area from the search.
– The capability to export sets of results so that they could then be
processed using different systems.
– The ability to define persistent search queries, over and above the
current working session.
The overall feedback regarding our system was very positive. Basically,
we received a strong confirmation of the core hypothesis underlying
the whole project, namely that advanced linguistic technology could
be harnessed for the benefit of information analysts. In particular, the
feedback made it clear that entity extraction, sentiment analysis and
machine translation were each extremely valuable. The integration of
these linguistic technologies with other technologies also proved highly
valuable. For example, users greatly appreciated the unique capabilities
of the map widget that Thales incorporated in our prototype. In that
respect it was also noted on occasion that some of those non-linguistic
technologies could have been pushed further. For example, some users
remarked that the timeline widget incorporated by Thales could have
been extended in a way to better cope with short time intervals.
The overall conclusion from the user partner was that our prototype
system was on its way to being a high-quality analytic tool and just needed a little more development to reach its full potential.
5 Conclusions
Our CSSP project set out to provide a concrete demonstration of the
claim that leading-edge linguistic technologies such as machine translation, summarization, information extraction and sentiment and emotion analysis can be extremely useful to security analysts interested in “big data” monitoring on social media. The goal was not only to integrate those linguistic technologies together, but also to integrate them with several other basic capabilities that were needed in order to provide a realistic testbed: information retrieval from social media, social network
analysis and advanced visualization and user interaction facilities.
In accordance with our plan, a first version of the prototype we were
building was demonstrated six months after the beginning of this two-year project. Moreover, even though this was not part of the original plan, Thales (our system integrator) provided the partners with ongoing access to the evolving prototype over the World Wide Web.
This greatly facilitated ongoing interaction between system developers
and the user partner, which helped steer the project towards the most
successful outcome possible. This interaction led to some changes with
respect to the original plan. For example, the idea of covering the
Chinese language alongside English and Arabic was abandoned in favor
of more sustained work on the two remaining languages. While this and
a few other planned capabilities were dropped, many new ones were
added. This included the permanently online demo mentioned above
but also a myriad of specific system features such as entity linking,
various filtering and sorting devices, persistent search queries, a data
export capability, etc.
The partnership worked in a very smooth way: all partners repeatedly
expressed their satisfaction with the way the project was unfolding.
The most tangible result was a technology demonstrator that has been
extensively tested by our user partner and demonstrated to a wider
public on many occasions, the last one of which was a public event
hosted by Borden Ladner Gervais in their downtown Ottawa office on
24 November 2015.
The interested reader is invited to embark on our audio-visual tour
of the resulting prototype system by following the indications given in
section 3 above. Hopefully, he/she will then come to agree with the final
verdict of our user-partner who declared that our prototype was well
“on its way to being a high-quality analytic tool [...]”. In this respect,
our industrial partners each have their own plans to make good use of
the results of our project for improving or augmenting their respective
commercial offerings.
Appendices
A Machine Translation
As mentioned above, NRC has state-of-the-art machine translation
(MT) technology. The NRC’s phrase-based MT system is called PORTAGE,
and it regularly places at or near the top in international evaluations of
the quality of MT system outputs. Another measure of PORTAGE’s
standing is that the NRC has twice received substantial funding from
DARPA in exchange for its participation in R&D projects (under the
DARPA GALE program from 2006-2009, and under the DARPA BOLT
program 2012-2015). Since PORTAGE learns how to translate from a
large collection of bilingual sentence pairs, each consisting of a sentence
in the source language and its translation into the target language, a
version of PORTAGE can potentially be trained for any language pair
for which a sufficiently large number of bilingual sentence pairs can be
obtained. In practice, the NRC’s MT group has mainly created versions
of PORTAGE for Arabic → English (Arabic to English) MT, Chinese
→ English MT, and English ↔ French (bidirectional English-French)
MT.
When the CST project began, NRC already had an Arabic → English
version of PORTAGE: the version that tied for first place in the NIST
(US National Institute of Standards & Technology) Open MT evaluation of 2012. That by no means ensured that NRC would be able to
deliver an MT module that would satisfy the needs of the CST project.
There were several potential problems:
– Genre. It is well-known among experts on MT that no matter
what language pair is involved, an MT system trained on one genre
of bilingual text — e.g., news stories — will yield very low-quality
translations when it is deployed in an environment where it must
translate texts from a different genre – e.g., Tweets. This problem
is particularly acute when one of the two genres is formal, and the
other informal. The initial Arabic → English PORTAGE system
was mainly trained on formal data such as news or quasi-formal
genres such as Web forums.
Though the main CST scenarios sometimes involved translating Arabic
blogs (a quasi-formal genre), the main task for PORTAGE turned out
to be translating Arabic tweets: an informal genre for which there was
no bilingual training data whatsoever. The closest genre to tweets in
the original training data was 90,104 sentence pairs from SMS/Chat
data from the BOLT project, but with about 20 million sentence pairs
in total, this represented barely 0.4% of the training data. Apart from
being much more informal in word choice and syntax than the training data, the data the translation module encountered in its CST deployment included many phenomena such as hash tags, emoticons, and
strange spellings (e.g., the Arabic equivalents to AHHHH or Yuckkkkk!
in English tweets) never observed in the training data. A genre problem unique to Arabic is Arabizi: the phenomenon, fairly frequent in
Arabic social media, of using Roman-language characters to represent
an Arabic word. A given Arabic word may be written several different
ways in Arabizi.
– Dialect. This problem is related to, but distinct from, the problem of genre. Educated Arabs use Modern Standard Arabic
(MSA) for most written communication, and for formal speech.
MSA is the only version of Arabic taught in schools; it is derived from Classical Arabic, which has high prestige because it
is the language of the Koran. However, much spoken communication occurs in local vernaculars: dialects of Arabic which are
often mutually incomprehensible (from a European perspective,
they might be considered separate languages). An analogous situation might have arisen in Europe if Romance-language speakers
had continued to use Latin for formal, especially written, communication (as was once the case for educated Europeans), but used Portuguese, Spanish, French and so on on a daily basis to talk to
people from their own country. We were warned by Arabic experts before we began working on CST that social media were an
exception to the general rule that speakers of all the variants of
Arabic generally use MSA for written communication: we could
expect to see tweets heavily laced with or entirely consisting of
Tunisian Arabic, Iraqi Arabic, and so on. None of these dialects
were represented in the training data available to us. To make
matters worse, we were told that dialect words in tweets are often
written in Arabizi.
– Throughput. This was an even more serious problem than the
two previous ones. NRC’s existing Arabic → English system was
designed to perform well in international evaluations of the output quality of MT systems; speed was not a consideration. The
focus in building the existing system had been to use every possible technique to get good translations, even if some of these
techniques were computationally inefficient. Yet the CST project
would fail if the MT module was too slow: for the CST system to
be practically useful, the MT module had to chew through several
thousand sentences each minute. In initial tests of the previous
Arabic → English system in the CST environment, it was intolerably slow: e.g., it took 325 seconds to translate 100 Arabic
sentences (roughly 20 sentences per minute). We faced the challenge of speeding up Arabic → English PORTAGE by a factor of
at least 10 for it to be practically useful for CST and we had to
achieve this speed-up without compromising translation quality.
In the course of the project, we updated the MT module several times.
We tackled the throughput problem first since the module would be
unusable if it was too slow, and then worked to improve translation quality.
A.1 Improving the throughput of the MT module
The initial MT module (which we’ll call “Version 0”) is shown in Figure 2. The models required to translate Arabic text to English are
trained offline on bilingual Arabic-English sentence pairs. In the terminology of the MT community, translation is called “decoding” and is
thus carried out by a module called the “decoder”. Arabic is unusual
among languages that have had considerable attention devoted to them
by the MT community: almost all research groups that work on Arabic
as a source language use a software package from outside each group to
preprocess Arabic texts. The consensus in the community is that Arabic preprocessing software from Columbia University (called MADA,
TOKAN, or MADAMIRA) is essential if you want to build a stateof-the-art system for translating Arabic into other languages (for most
other source languages, the details of preprocessing are less important).
As Figure 2 shows, in Version 0, MADA from Columbia U. was used
to translate Arabic text into “Buckwalterese”, a way of representing
Arabic text in the Latin alphabet. MADA does more than this: it splits
Figure 2: Initial MT module (Version 0). There are two stages: MADA
preprocessing, which converts the Arabic input into units that make MT
easier, followed by the Portage decoder. MADA was slow to load but fast
afterwards; the decoder loaded quickly but took nearly 3 seconds per
sentence, so translating N sentences required approximately 25 + 3*N
seconds.
MADA does more than this: it splits Arabic words into forms that more closely resemble English words. For
instance, in Arabic, the equivalent of articles like “a” and “the” are
fused to the following noun. MADA splits fused words of this type into
two separate words: e.g., it turns the Arabic versions of “adog” and
“thedog” into “a dog” and “the dog” written in Buckwalterese (and
carries out several other types of preprocessing as well).
Unfortunately, the version of MADA which NRC had permission to
deploy in the CST project took 25 seconds to load. Though MADA
preprocessing itself was relatively fast, the Version 0 decoder took about
3 seconds to translate each Arabic sentence. When called on to translate a new block of N sentences, Version 0 of PORTAGE therefore took
approximately (25 + 3*N) seconds to complete the task.
Figure 3 shows how, by speeding up both Arabic preprocessing and decoding, we were able to turn these (25 + 3*N) seconds for translating
N sentences into approximately (1 + 0.1*N) seconds. To achieve a
25-fold speed-up in Arabic preprocessing, we removed the MADA software from the decoding process. We substituted a table, trained in
advance using MADA, that maps Arabic words onto their MADA-ized
equivalents in Buckwalterese. There is a potential loss of quality here:
MADA can preprocess Arabic words it has never seen before, but the
map table can only preprocess words that it encountered while it was
being trained. This loss can be minimized by training the map table
on a large, varied collection of Arabic texts; the training data should
be chosen to include Arabic words that are likely to be encountered when
the MT module is deployed in a practical application (e.g., the
scenarios for CST).
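To make the map-table idea concrete, here is a minimal Python sketch of table-based preprocessing (the file format, function names and OOV handling are illustrative assumptions, not the PORTAGE implementation):

# Minimal sketch: the table is built offline by running MADA over a large
# Arabic corpus; at translation time, preprocessing is a dictionary lookup.

def load_map_table(path):
    """Load an offline-built table of Arabic word -> Buckwalterese form
    (assumed format: one tab-separated pair per line)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            arabic, buckwalter = line.rstrip("\n").split("\t")
            table[arabic] = buckwalter
    return table

def preprocess(sentence, table):
    """Map each token through the table; unseen tokens pass through
    unchanged and surface as OOVs for the decoder."""
    return " ".join(table.get(tok, tok) for tok in sentence.split())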
We then studied the internal workings of the phrase-based decoder
and discovered that it was considering far too many possible English
translations of each Arabic phrase. By cutting the number of phrase
translations considered, we speeded the decoder up by a factor of 15.
Then, we adjusted some decoder search parameters (stack size, beam
size, etc.) and were able to double decoder speed again at a cost of
only about 0.2 BLEU points in output quality (typically, differences in
BLEU score of less than 1.0 are not perceptible to MT system users).
In total, we thus speeded up the decoder by a factor of 30.
Figure 4 shows the cumulative effect of these changes on the time required to translate a block of sentences.
Figure 3: Improved version of MT module, December 2014 (Version 1).
Replacing MADA with a MADA map table made loading of the system 25
times faster, and changes to the decoder made decoding 30 times faster;
translating N sentences now required approximately 1 + 0.1*N seconds.
Figure 4: Speed improvement in the MT module, Version 1 vs. Version 0.
Translation time is plotted as a function of the number of sentences
(view 1) and of its logarithm (view 2).
A.2
Improving Translation Quality
In early 2015, we turned our attention to improving translation quality. Throughout the CST project, there was a major problem: the
lack of bilingual training data for tweets, the main genre targeted by
the project. We also expected the lack of dialectal written Arabic in the
training data to be a problem.
We addressed this problem in three steps. First, we wrote a small set of
preprocessing rules for making Arabic tweets more tractable. Second,
we acquired a large number of unilingual Arabic tweets that enabled
us to retrain the MADA map table, and a small amount of additional
bilingual training data that was somewhat closer in genre to tweets
than the original training data. Third, we constructed a "dev" tuning
set that we expected to resemble tweets to some extent (and that
included a small number of translated tweets).
Figure 5: Handling of Twitter #hashtags. The Arabic input is tokenized,
the hashtag's words are wrapped in <ht> ... </ht> markers, the sentence is
translated with the markup transferred, and the markers are then unwrapped,
yielding "Oh God, forsake our #Muslim_brothers_everywhere." rather than the
baseline "Oh God, forsake our brothers # Muslims _ in _ all the place".
A.2.1
Rules for Handling Tweets
Figures 5 and 6 show how the specialized rules we implemented for
handling Arabic tweets improve translation quality. Each figure also
shows, for comparison, the translation of the Arabic input produced
prior to the implementation of these rules.
In the example in Figure 5, the translation was originally "Oh God,
forsake our brothers # Muslims _ in _ all the place"; with the rules
in place, it became the more understandable "Oh God, forsake our
#Muslim_brothers_everywhere". In the course of this work, we discovered
that in Arabic tweets, words that form part of a hashtag are often
used as part of a sentence as well.
Figure 6: Handling non-Arabic scripts and multiple hash tags. The three
Latin-script hashtags in the Arabic input are wrapped in <nas> ... </nas>
markers, excluded from translation, and restored intact at the end of the
output.
The English equivalent would be a tweet like "I'm #Angry_in_San_Francisco because it's so cold today." Here, the words "angry in San Francisco" do double duty as
constituents of a hash tag and as part of a sentence. Because exactly
the same phenomenon often occurs in Arabic tweets, we decided on a
strategy where hash tags and underscores are ignored during translation, then restored afterwards. This typically results in a more fluent
translation.
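For illustration, a minimal Python sketch of this wrap-and-unwrap strategy, mirroring the <ht> markers of Figure 5 (the deployed preprocessing rules are more elaborate; the regular expressions below are our own simplification):

import re

def wrap_hashtags(tokens):
    """Turn tokenized '# w1 _ w2 _ w3' runs into '<ht> w1 w2 w3 </ht>' so
    the hashtag's words are translated as ordinary text."""
    text = " ".join(tokens)
    return re.sub(r"# ((?:\S+ _ )*\S+)",
                  lambda m: "<ht> " + m.group(1).replace(" _ ", " ") + " </ht>",
                  text)

def unwrap_hashtags(translation):
    """Rebuild a single '#Word_Word_...' token from the transferred markup."""
    return re.sub(r"<ht> (.*?) </ht>",
                  lambda m: "#" + "_".join(m.group(1).split()),
                  translation)

# unwrap_hashtags("forsake our <ht> Muslim brothers everywhere </ht>")
# -> 'forsake our #Muslim_brothers_everywhere'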
The example in Figure 6 shows how multiple hash tags, which may
involve non-Arabic script, are handled by the rules. Here, the Arabic
input contained three hash tags in the Latin alphabet. Version 0 of the
MT module generates output where these hash tags are separated from
the word sequences they are meant to tag: “Oh, those accumulated
weapons ... does not sleep you eye ... # # Gaza Syria gazaunderattack
# while”. With the rules in place, the system puts the hash tags
in a neat sequence at the end of the translated tweet: “Oh, those
accumulated weapons ... does not sleep you eye... #GazaUnderAttack
#Gaza #Syria.”
A.2.2
Additional Training Data
Recall that by replacing the MADA preprocessing software with a table that maps Arabic text in the input into preprocessed Buckwalterese
text, we obtained a large speedup in loading of the MT module. However, there is a cost: quality goes down because, unlike the original
MADA software, the map table cannot preprocess Arabic words that
aren’t in the training data in a useful way. They will be OOV (“out of
vocabulary”) words.
Fortunately, the frequency of OOV words encountered when the module
is deployed can be reduced by training the MADA map table offline on
large Arabic corpora, preferably ones likely to contain words it will encounter under operational conditions. This section has referred several
times to the shortage of bilingual tweet data for training the module’s translation models: Arabic tweets that have been translated into
English. But to retrain the MADA map table to reduce the frequency
of OOVs, we mainly need unilingual Arabic tweets, and there is no
shortage of those. Thus, after implementing the specialized rules for
handling tweets described in the previous subsection, the next step we
took to improve the quality of the MT module was to retrain the map
table on a large number of Arabic tweets (and some other unilingual
Arabic data as well).
Improving the translation models was much harder, because here we
require bilingual tweet-like data. We had originally planned to obtain
3000 or so translated tweets from the end-users, but this proved to
be unrealistic. They did supply us with 103 translated tweets, and a
native speaker of Arabic we hired (Norah Alkharashi) translated 175
tweets. As will be described shortly, even this small number of bilingual
tweets was useful for improving the system. However, with the total
bilingual training data consisting of 20 million sentence pairs, adding
a few hundred tweets would have no impact at all on the translation
models.
We therefore incorporated three other resources as training data:
– Inspecting the OOVs that remained after we retrained the MADA
map table, we noticed that a high proportion of them were names
of people, places, and organizations. We therefore added to the
data a set of paired Arabic-English Wikipedia article titles, as
this genre is known to be rich in named entities. There were 28K
Wikipedia title pairs (roughly 0.1% of the total training data).
– In the course of the BOLT project, Raytheon BBN had hired
speakers of two Arabic dialects – Levantine and Egyptian– via
Mechanical Turk. These Turkers translated 162K segment pairs
(22% Egyptian, 78% Levantine) from Weblogs into English. Our
BBN colleagues had warned us that the quality of these translations was poor. However, we asked the Arabic speaker we had
hired to assess them by looking at a random sample. Though she
is from Saudi Arabia, she reported that the Arabic texts were
mostly MSA with a few dialect words inserted from time to time,
and almost entirely understandable by speakers of other Arabic dialects. The problems were on the English side. Though the English was
in general of acceptable quality, there were mistakes involving idioms,
phrasal verbs, verb tenses and verb agreement, as well as spelling errors.
We decided that the extra coverage of some dialect words would
more than make up for some problems with English, and incorporated these 162K pairs in the training data (that’s roughly 0.8%
of the total).
– Finally, there was one source of bilingual Arabic-English tweet
data available to us. Unfortunately, it proved to be of rather
poor quality. It is the result of a project carried out by CMU in
which tweets containing both Arabic and English were given to
Mechanical Turkers, who were asked to find Arabic sub-segments
in each tweet that had a matching English sub-segment (of course,
for each tweet the Turker also had the option of indicating that it
had no matches). Again, we asked the Arabic native speaker to
assess the quality of the CMU data. She found that around 55%
of it was bad. In most of the bad pairs, some of the information
contained in one of the two texts was missing from the text in the
other language. We therefore resorted to a length-based heuristic,
removing pairs where the Arabic text had at least 1.5 times as many
tokens as the English text, or vice versa (a sketch of this filter
follows the list). This left 28K pairs, which we added to the training
data (they constituted roughly 0.1% of it).
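A minimal sketch of such a length-ratio filter (whitespace tokenization and the exact threshold semantics are our assumptions):

def keep_pair(arabic, english, max_ratio=1.5):
    """Length-based filter for noisy bilingual pairs: drop a pair when
    either side has more than max_ratio times as many tokens as the other."""
    a, e = len(arabic.split()), len(english.split())
    if a == 0 or e == 0:
        return False
    return max(a, e) / min(a, e) <= max_ratio

# Usage: filtered = [(a, e) for a, e in pairs if keep_pair(a, e)]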
At this point in our work, we did not have a set of bilingual tweets
on which to measure BLEU (the traditional measure of MT quality).
To measure the impact of some of the changes just listed, we therefore
measured the OOV rate (this only requires unilingual source data).
As Figure 7 shows, the retraining of the MADA map table yielded a
64% reduction in the OOV rate on a sample of 300 Arabic tweets, and
the incorporation of data from the three new corpora – the Wikipedia
titles, the BBN dialect weblogs, and the CMU partial tweets – resulted
in an overall 73% reduction of OOVs for the sample. Though we believe
implementation of the tweet preprocessing rules significantly improved
translation quality, the nature of these rules means they had no impact
on the OOV rate.
To make further progress, we needed to construct both a test set (for
calculating BLEU) and a “dev” set. The latter requires explanation.
An important part of building a statistical MT system is the tuning
step, which decides on the weights of the various information sources
(the various language models, the various translation models, and so
on). These weights are determined on a bilingual set of “dev” sentence
pairs; they can have a surprisingly large effect on MT performance.
Figure 7: MT module, Version 2 (February 2015). Version 2 has tweet
preprocessing rules, a retrained MADA map table and three new bilingual
training corpora for the translation models (Wikipedia titles, BBN dialect
weblogs, CMU partial tweets). Number of OOVs in a sample of 300 tweets:
Version 1 (Dec. 2014): 451; with the new map table: 162 (a 64% reduction);
with the new map table and the new translation models: 120 (a 73% reduction
overall).
Though we did not have enough bilingual Arabic-English tweet pairs
to use as training data, we were able to construct dev and test sets that
split between them the 103 tweets translated for us by the end-users
and the 171 tweets translated by our in-house Arabic speaker. These
274 tweet pairs are far too few to construct either a dev or a test set, so
we added to them CMU data (because it comes from tweets) and BBN
data (because it is informal and contains some dialect phenomena).
The dev set we constructed contains 137 tweet pairs, 488 CMU segment
pairs and 478 BBN text pairs; our test set contains 137 tweet pairs, 488
CMU segment pairs and 476 BBN text pairs.
We then built two MT modules trained on the training data described
above (with some minor differences), tuned both of them on this dev
set, and tested them on the test set. The details of the differences
in configuration between the two modules, Version 3a and Version
3b, would take up an inappropriate amount of space in this report.
Both systems achieved very respectable BLEU scores, with Version 3b
scoring 1.0 BLEU points higher than Version 3a. Version 3b also had
fewer OOVs than Version 3a.
A.3
Discussion and Recommendations
Building a module that translates Arabic tweets into English was as
challenging as we’d expected, except in one respect. Contrary to our
expectation, dialect was not a major issue. According to the native
speaker of Arabic we’d hired, there were few tweets among those we
collected in the Syria scenario or among those collected earlier by BBN
that were too dialect-heavy for a reader of MSA from another region to
understand. Many tweets were identifiable as being written by someone
who speaks a given dialect, but this was typically a case of a few dialect words being sprinkled into text that was mainly MSA.
Maybe our Arabic speaker is downplaying the extent of the problem;
maybe we dodged the dialect bullet by choosing a topic, Syria, that
for some reason does not attract heavily dialectal tweets. On the other
hand, maybe the difficulties posed by dialect to MT systems that translate Arabic social media texts have been exaggerated (perhaps because
dialect is a very big problem in spoken Arabic).
By contrast, the difficulties posed by the tweet genre itself were
fully as serious as we'd expected.
Version 3b of the MT module was the final version deployed in the
project. Further algorithmic improvements to the MT module are likely
to encounter the phenomenon of diminishing returns: quality will not
go up significantly no matter what clever techniques are applied to the
current training data. By far the best way of improving the module
would be to collect a large number of Arabic tweets, to translate them,
and to incorporate the resulting bilingual corpus in the training data
for a new system. Building a component that can handle Arabizi would
also be very helpful.
B
Summarization
The summarization component provides the capability of summarizing input documents of interest. It distills and presents the most
important content of the documents. In the case of micro blogs such
as Twitter, the capability to summarize groups of tweets is potentially
very useful. For example, sudden peaks in volume on a given topic
are most often caused by specific events. Applying a multi-document
summarizer to the set of documents in a given peak would instantly
bring such events to light.
B.1
Functionality
In general, automatic summarization (summarization for short) is
a Natural Language Processing (NLP) technology that automatically
generates summaries for documents. More specifically, our summarization component performs extractive summarization for multiple documents. Below we first provide some background knowledge about
summarization (refer to [Zhu, 2010] for more work in the literature).
– Summarization vs. information retrieval (IR): IR is often set up in
a scenario in which users roughly know what they are looking for,
by providing queries. Summarization does not assume this but
aims at finding salient/representative information and removing
redundant content for a single document or a set of them.
– Single vs. multiple document summarization: single document
summarization generates a summary for each single document
(e.g., a news article). Multiple document summarization generates a summary for a set of documents, e.g., a set of news articles
or tweets. The approaches used in these two situations are similar.
One major difference is that a multi-document summarizer needs
to remove more redundant content.
– Extractive vs. abstractive summarization: extractive summarization selects sentences or larger pieces of text from the original
documents and presents them as summaries; abstractive summarization also attempts to rewrite the selected pieces to form more
coherent and cohesive summaries. The state-of-the-art approaches
focus more on extractive summarization, as abstractive summarization is a harder problem and its performance is less reliable.
We focus on extractive summarization.
In brief, our summarization component performs a multi-document,
extractive summarization of the input documents of interest.
B.2
Interface
Conceptually, the input and output of our summarization component
are straightforward: the summarizer takes in a set of documents and
outputs important excerpts. The interface is shown in Figure 8.
Figure 8: A user can drag a selection box to select tweets of interest and
generate a summary. In the figure, two spikes of tweets between
September 9th and 11th were selected.
A user can leverage a selection box to choose tweets in a time period
of interest. In the figure, two spikes of tweets between September 9th
and 11th are selected. The user can click the "summarize" button at
the middle-top of the figure to request a summary. The summarization
component then takes some time (often several seconds) to generate the
summary; the amount of time depends on the number of documents selected.
Once the summary is ready, an "Open Summarization" button, shown in
Figure 9, is displayed. Users can then click
the “Open Summarization” button to read the summary.
B.3
Summarization Algorithm
Our model is built on a summarization algorithm called Maximal Marginal
Relevance (MMR) [Carbonell and Goldstein, 1998, Zhu, 2010]. The
reason for choosing MMR is two-fold. First, compared with more complicated models such as graph-based models, MMR is computationally
efficient, which accords well with our need to handle large document
sets within a reasonable amount of time. In addition, MMR is an unsupervised summarization model and does not require human-annotated
training data.
MMR builds a summary by selecting summary units iteratively. A
summary unit is a tweet when we have a collection of tweets, or a
sentence when we have a set of news articles. More specifically, in
each round of selection, MMR selects into the summary a unit that is
most similar to the documents to be summarized, but most dissimilar
to the previously selected units, to avoid redundant content. This is
repeated until the summary reaches the predefined length (20 units in
the current version, but the length can be easily changed). MMR
uses the following equation to determine the next unit to be selected.
\mathrm{next\_sent} = \arg\max_{U_i \in D \setminus S} \Big( \lambda\, \mathrm{sim}_1(D, U_i) - (1 - \lambda) \max_{U_j \in S} \mathrm{sim}_2(U_j, U_i) \Big) \qquad (1)
As shown in Formula 1, the sim1 term calculates the similarity between a unit Ui and a set of documents D. The assumption is that
a unit with a higher sim1 represents the content of the document set
better. The sim2 term calculates the similarity between a candidate summary unit Ui and a unit Uj already in the summary S.
Figure 9: Once the summary is ready, an "Open Summarization" button
is shown (upper figure). Users can then click the "Open Summarization"
button to read the summaries (lower figure).
Accordingly, max(sim2(.)) is the largest sim2 score between the candidate
unit and the already-in-summary units. The assumption is that a unit with
a higher max(sim2(.)) score is more redundant with respect to the
already-in-summary units, and therefore should receive a penalty. The
similarity scores sim1 and sim2 are computed as the cosine similarity
between the corresponding units discussed above. The parameter λ
linearly combines sim1 and max(sim2(.)). The unit that maximizes the
combined score is selected into the summary. The value of λ has been
set in our code, but it can be adjusted.
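For illustration, a minimal Python sketch of MMR selection using bag-of-words cosine similarity (the tokenization, the λ value and the helper names are illustrative; the deployed component differs in detail):

import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mmr_summarize(units, length=20, lam=0.7):
    """Greedy MMR selection (Formula 1): favour units similar to the whole
    document set D, penalize similarity to already-selected units S.
    lam is an assumed value; the deployed lambda is set in the CST code."""
    vecs = [Counter(u.lower().split()) for u in units]
    doc = sum(vecs, Counter())          # D represented as one big bag of words
    selected = []                       # indices of S, in selection order
    while len(selected) < min(length, len(units)):
        best, best_score = None, float("-inf")
        for i, v in enumerate(vecs):
            if i in selected:
                continue
            redundancy = max((cosine(vecs[j], v) for j in selected), default=0.0)
            score = lam * cosine(doc, v) - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [units[i] for i in selected]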
C
Information Extraction
For the information extraction component of the CST project, we focused on two technologies: named entity recognition and entity linking.
The majority of our efforts focused on improving the state of the art in
entity recognition for social media texts, while we employed a known
technique for entity linking. We describe both solutions in detail below.
C.1
Named Entity Recognition
Named entity recognition (NER) is the task of finding rigid designators
as they appear in free text and assigning them to coarse types [Nadeau and Sekine, 2007].
For the CST project, we recognize the types person, location and organization, as illustrated in Figure 10. NER is the first step in many
information extraction pipelines, but it is also useful in its own right.
It provides a form of keyword spotting, allowing us to highlight terms
that are likely to be important in a text item. Furthermore, it allows
the system operator to organize items by the entities they contain, and
to collect statistics over mentions of specific entities.
There is considerable excitement at the prospect of porting information
extraction technology to social media platforms such as Twitter. Social media reacts to world events faster than traditional news sources,
and its sub-communities pay close attention to topics that other sources
might ignore. An early example of the potential inherent in social information extraction is the Twitter Calendar [Ritter et al., 2012], which
detects upcoming events (concerts, elections, video game releases, etc.)
based on the anticipatory chatter of Twitter users. Unfortunately, processing social media text presents a unique set of challenges, especially
for technologies designed for newswire: Twitter posts are short, the
language is informal, capitalization is inconsistent, and spelling variations and abbreviations run rampant. Tools that perform quite well on
newspaper articles can easily fail when applied to social media.
Our approach assumes a single message as input, with no accompanying
meta-data regarding the user posting the message or the date it was
posted. We are then tasked with finding each mention of a concrete
person, location or organization within that message. The location of
each such mention is indicated by a tag, as shown in Figure 10, where
the "O" tag is given special status to indicate the lack of an entity.

Figure 10: An example of semi-Markov tagging.
Our training data takes the form of tweets (in-domain) and sentences
from newspaper stories (out-of-domain), where both have been tagged
by humans. We then train a supervised machine learning algorithm
that can replicate the human tags on its training data, and generalize
to produce reasonable tags on data it has never seen before. We use
held-out, human-labeled data as a test set to measure the accuracy of
our tagger on previously unseen tweets, which allows us to determine
how well the system has generalized from its training data.
Armed with an affordable training set of 1,000 human-annotated tweets,
we establish a strong system for Twitter NER using a novel combination of well-understood techniques. We build two unsupervised word
representations in order to leverage a large collection of unannotated
tweets, while a data-weighting technique allows us to benefit from annotated newswire data that is outside of the Twitter domain. Taken
together, these two simple ideas establish a new state-of-the-art for
both our test sets. We rigorously test the impact of both continuous
and cluster-based word representations on Twitter NER, emphasizing
the dramatic improvement that they bring.
C.1.1
Data
Vital statistics for all of our data sets are shown in Table 1. For in-domain NER data, we use three collections of annotated tweets: Fin10
was originally crowd-sourced by [Finin et al., 2010], and was manually
corrected by [Fromreide et al., 2014], while Rit11 [Ritter et al., 2011]
and Fro14 [Fromreide et al., 2014] were built by expert annotators. We
divide Fin10 temporally into a training set and a development set, and
we consider Rit11 and Fro14 to be our test sets. This reflects a plausible training scenario, with train and dev drawn from the same pool,
but with distinct tests drawn from later in time. These three data sets
were collected and unified by [Plank et al., 2014], who normalized the
tags into three entity classes: person (PER), location (LOC) and organization (ORG). The source text has also been normalized; notably,
all numbers are normalized to NUMBER, and all URLs and Twitter
@user names have been normalized to URL and @USER respectively.
We use the CoNLL 2003 newswire training set as a source of out-of-domain NER annotations [Tjong Kim Sang and De Meulder, 2003].
This serves two purposes: first, it provides a large supply of out-of-domain training data. Second, it allows us to illustrate the huge gap in
performance that occurs when applying newswire tools to social media.
The source text has been normalized to match the Twitter NER data,
and we have removed the MISC tag from the gold-standard, leaving
PER, LOC and ORG.
We use unannotated tweets to build various word representations. Our
unannotated corpus collects 98M tweets (1,995M tokens) from between
May 2011 and April 2012. These tweets have been tokenized and
post-processed to remove many special Unicode characters. Furthermore, the corpus consists only of tweets in which the NER system of
[Ritter et al., 2011] detects at least one entity. The automatic NER
tags are used only to select tweets for inclusion in the corpus, after
which the annotations are discarded. Filtering our tweets in this way
has two immediate effects: first, each tweet is very likely to contain an
entity mention, and therefore, be more useful to our unsupervised techniques. Second, the tweets are longer and seem to be more grammatical
than tweets drawn at random.
Data               Lines   Types    Tokens   # PER  # LOC  # ORG
Fin10 (Train)      1,000   4,865    17,276     192    143    172
Fin10Dev (Test)    1,975   7,734    33,770     325    279    287
Rit11 (Test)       2,394   8,686    46,469     454    377    280
Fro14 (Test)       1,545   5,392    20,666     390    163    200
CoNLL (Train)     14,041  20,752   203,621   6,601  7,142  6,322
Unlabeled tweets     98M     57M    1,995M       –      –      –

Table 1: Details of our NER-annotated corpora. A line is a tweet in Twitter
and a sentence in newswire.
C.1.2
Methods
We will briefly summarize how we train a tagger to locate entities
in tweets below. In this framework, a complete tag sequence for an
input tweet is represented as a bag of features. The learning component learns weights on these features so that good tag sequences
receive higher scores than bad tag sequences. We call these weights the
model. The tagging component uses dynamic programming to search
the very large space of possible tag sequences for the highest-scoring
sequence according to our model. Therefore, the framework can be
specified modularly by describing the tagger, the learner and the features. As a rule of thumb, the quality of the features generally determines how well a system can generalize. More details can be found in
[Cherry and Guo, 2015, Cherry et al., 2015].
Tagger: We tag each tweet independently using a semi-Markov tagger [Sarawagi and Cohen, 2004], which tags phrasal entities using a
single operation, as opposed to traditional word-based entity tagging
schemes. An example tag sequence, drawn from one of our test sets,
is shown in Figure 10. Semi-Markov tagging gives us the freedom to
design features at either the phrase or the word level, while also simplifying our tag set. Furthermore, with our semi-Markov tags, we find
we have no need for Markov features that track previous tag assignments, as our entity labels cohere naturally. This speeds up tagging
dramatically.
Learner: Our tagger is trained online with large-margin updates, following a structured variant of the passive aggressive (PA) algorithm
[Crammer et al., 2006]. We regularize the model both with early stopping and by using PA’s regularization term C, which is similar to that
of an SVM. We also have the capacity to deploy example-specific C-parameters, allowing us to assign some examples more weight during
training. This is useful when combining Twitter training sets with
newswire training sets.
Lexical Features: Recall that our semi-Markov model allows for both
word and phrase-level features. The vast majority of our features are
word-level, with the representation for a phrase being the sum of the
features of its words. Our word-level features closely follow the set
proposed by [Ratnaparkhi, 1996], covering word identity, the identities
of surrounding words within a window of 2 tokens, and prefixes and
suffixes up to three characters in length. Each word identity feature
has three variants, with the first reporting the original word, the second reporting a lowercased version, and the third reporting a summary
of the word’s shape (“Mrs.” becomes “Aa.”). All word-level features
also have a variant that summarizes the word’s position within its entity. Our phrase-level features report phrase identity, with lowercased
and word shape variants, along with a bias feature that is always on.
Phrase identity features allow us to memorize tags for common phrases
explicitly. Following the standard discriminative tagging paradigm, all
features have the tag identity appended to them.
Representation Features: We also produce word-level features corresponding to a number of external representations: gazetteer membership, Brown clusters [Brown et al., 1992] and word embeddings. These
features are intended to help connect the words in our training data to
previously unseen words with similar representations.
Gazetteers are lists of words and phrases that share specific properties. In this project, we use a number of word lists covering common
entity types like people, films, jobs and nationalities. These were derived from a number of open sources by researchers at the University of Illinois [Ratinov and Roth, 2009]. To create features from these
gazetteers, we first segment the tweet into longest matching gazetteer
phrases, resolving overlapping phrases with a greedy left-to-right walk
through the tweet. Each word then generates a set of features indicating which gazetteers (if any) include its phrase.
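A minimal sketch of this greedy longest-match segmentation (the gazetteer representation and the feature names are illustrative assumptions):

def gazetteer_features(tokens, gazetteers, max_len=5):
    """Greedy left-to-right, longest-match segmentation of a tweet into
    gazetteer phrases; every word in a matched phrase fires one feature
    per gazetteer containing that phrase.  gazetteers maps a gazetteer
    name to a set of lowercased phrases."""
    features = [set() for _ in tokens]
    i = 0
    while i < len(tokens):
        matched = 1
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest first
            phrase = " ".join(tokens[i:i + n]).lower()
            hits = [name for name, phrases in gazetteers.items()
                    if phrase in phrases]
            if hits:
                for j in range(i, i + n):
                    features[j].update("IN-GAZ=" + name for name in hits)
                matched = n
                break
        i += matched
    return features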
Brown clusters are deterministic word clusters learned using a language-modeling objective. Each word maps to exactly one cluster, and similar
words tend to be mapped to the same clusters. For cluster representations, we train Brown clusters on our unannotated corpus, using the
implementation by [Liang, 2005] to build 1,000 clusters over types that
occur with a minimum frequency of 10. Following [Miller et al., 2004],
each word generates indicators for bit prefixes of its binary cluster signature, for prefixes of length 2, 4, 8 and 12.
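For illustration, a minimal sketch of these cluster-prefix indicator features (the mapping from words to cluster bit strings and the feature names are assumptions):

def brown_features(word, word_to_bits, prefix_lengths=(2, 4, 8, 12)):
    """Indicator features from bit-prefixes of a word's Brown cluster
    signature (e.g. '0110100110'), following Miller et al. (2004)."""
    bits = word_to_bits.get(word)
    if bits is None:
        return []
    return ["BROWN-%d=%s" % (n, bits[:n])
            for n in prefix_lengths if len(bits) >= n]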
Word embeddings are continuous representations of words, also learned
using a language-modeling objective. Each word is mapped to a fixed-size vector of real numbers, such that similar words tend to be given
similar vectors. For word embeddings, we use an in-house Java reimplementation of word2vec [Mikolov et al., 2013] to build 300-dimensional
vector representations for all types that occur at least 10 times in our
unannotated corpus. Each word then reports a real-valued feature (as
opposed to an indicator) for each of the 300 dimensions in its vector
representation. A single random vector is created to represent all out-of-vocabulary words. Our vectors and clusters cover 2.5 million types.
Note that Brown clusters and word vectors are both trained using
language-modeling objectives on our large corpus of 98M unannotated
tweets, making their use an instance of semi-supervised learning. In
contrast, gazetteers are either constructed by experts or extracted from
Wikipedia categories.
C.1.3
Experiments
We have two scientific papers on NER that outline a rich set of experiments to help understand how various versions of our NER system
perform [Cherry and Guo, 2015, Cherry et al., 2015]. For the purposes
of this report, we felt it would be illustrative to condense these experiments down to three questions:
1. How does a newswire system perform on Twitter data?
2. How effective was our attempt to modify our newswire system for
Twitter data?
3. How well does our system perform on the various entity types on
Twitter data?
We will briefly answer these three questions in turn.
In all experiments we report F-measure (F1), which combines two other
metrics: precision and recall. Let #right count the number of entities
extracted by our system that match the human-labeled gold standard
exactly, and let #sys be the number of entities found by our system;
that is, the sum of both correct and incorrect entities. Precision measures the percentage of system entities that are correct:
\mathrm{prec} = \frac{\#\mathrm{right}}{\#\mathrm{sys}} \qquad (2)
System                  CoNLL Test   Rit11   Fro14
NRC 0.0 (CoNLL only)          84.3    27.1    29.4

Table 2: A system trained only on newswire data, tested on newswire
(CoNLL) and social media data (Rit11, Fro14). Reporting F1.
Let #gold be the number of entities found in the human-labeled gold
standard annotation. Recall measures the percentage of gold entities that
are extracted by the system:

\mathrm{rec} = \frac{\#\mathrm{right}}{\#\mathrm{gold}} \qquad (3)
Finally, F-measure is the harmonic mean of precision and recall:
F_1 = \frac{2 \cdot \mathrm{prec} \cdot \mathrm{rec}}{\mathrm{prec} + \mathrm{rec}} \qquad (4)
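For concreteness, a minimal sketch of how these exact-match scores could be computed from entity annotations represented as (start, end, type) tuples (this representation is our assumption):

def f1_score(gold, system):
    """Exact-match precision, recall and F1 over entity annotations.
    gold and system are sets of (start, end, entity_type) tuples."""
    right = len(gold & system)
    prec = right / len(system) if system else 0.0
    rec = right / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1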
Newswire versus social media In our first experiment, we train a
system on our CoNLL newswire training set, and test on both held-out
newswire data (CoNLL Test), and on our held-out Twitter data. The
version of our system that we test here corresponds to the NRC’s entity
recognizer before the CST project began. This recognizer uses all our
lexical features, but has no representation features. The results are
shown in Table 2. As one can see, there is a huge divide between the
social media tests and the newswire tests. The NRC tagger was quite
well-suited to news, but handled social media very poorly.
Progression of the NRC Twitter NER system In the next comparison, we coarsely map our progress throughout the CST project,
shown in Table 3. The first and most obvious thing to do was to begin
training on in-domain Twitter data. We used 1,000 tweets from our
Fin10 set as training data, creating NRC 1.0. In transitioning from
NRC 0.0 to 1.0, we trade data volume for data quality, replacing 14K
out-of-domain sentences with 1k in-domain tweets. Note that this system does not use CoNLL newswire data, as we hadn’t yet developed
the data weighting algorithm that allowed us to effectively combine two
drastically different data sources [Cherry and Guo, 2015]. Next, we obtained our corpus of 98M unlabeled tweets, and set out to make use
of it. Our first attempt involved training word vectors on this corpus
with word2vec, creating word representation features for our system.
This resulted in NRC 2.0, which performs substantially better. Finally,
NRC 3.0 adds Brown clusters and gazetteers, and incorporates the CoNLL
newswire data. It represents our strongest system, with F1 figures
having doubled with respect to NRC 1.0.
Performance by entity type Finally, we report the per-entity performance of the NRC 3.0 system in Table 4. For both test sets, ORG
is the most difficult. ORG is perhaps the broadest entity class, which
makes it difficult to tag, as it is rife with subtle distinctions: bands
(ORG) versus musicians (PER); companies (ORG) versus their products (O); and sports teams (ORG) versus their home cities (LOC). We
also suspect that our word representation features are not as suited
to organizations as they are to people and locations. These results also
show that we are actually performing much better on person and location classes than our aggregate scores suggest, as the system’s total
performance is dragged down substantially by its difficulty with organizations.
C.1.4
Discussion
We have presented a brief overview of the NRC’s named entity recognition system for social media. It is characterized by its small in-domain
training set, and its extensive use of semi-supervised word representations constructed from large pools of unlabeled data. This is the
information extraction component that consumed the bulk of our efforts, but those efforts were well placed, as they resulted in a stronger
CST system, while also having a substantial impact on the scientific
community.

System                                        Rit11   Fro14
NRC 0.0 (CoNLL only)                           27.1    29.4
NRC 1.0 (Fin10 only)                           29.0    30.4
NRC 2.0 (1.0 + word vectors)                   56.4    58.4
NRC 3.0 (2.0 + CoNLL, clusters, gazetteers)    59.9    63.2

Table 3: The progression of the NRC named entity recognizer throughout
the CST project. Reporting F1.

Test Set    PER    LOC    ORG
Rit11      70.8   61.9   36.9
Fro14      69.4   70.2   42.6

Table 4: F1 for our final system, organized by entity class.
C.2
Entity Linking
Entity linking is the task of resolving entity mentions found in text to
specific real-world entities in some background database. For the
CST project, we solved this variant of the problem: for each entity
detected by the NRC NER system, find its corresponding Wikipedia
page, if one exists. This effort came late in the project, and we were
unable to obtain labeled in-domain training and test data for this task.
As such, we will briefly describe what we did, but we will provide no
evaluation of this approach.
We obtained a massive dictionary that connects English Wikipedia concepts to HTML anchor-text from other web pages, as collected throughout the web on a 2011 Google crawl [Spitkovsky and Chang, 2012]. For
example, someone linking to the Barack Obama Wikipedia page from
their own web page may use the anchor text, “Barack Obama,” or variants such as “Barack Hussein Obama II,” misspellings such as “Barak
Obama,” or a nickname like “the Obamanation.” Harvesting all of
these from throughout the web can provide a very reliable dictionary
of all the ways one can express each Wikipedia concept in text. Furthermore, the frequency with which each possible anchor phrase links
to a particular Wikipedia page helps disambiguate ambiguous phrases,
such as “George Clinton,” which is more likely to refer to the 1970s funk
musician than the 1800s US vice president. This dictionary alone has
been shown to establish an extremely high baseline for the Wikipedia
entity-linking task [Spitkovsky and Chang, 2011].
Given this dictionary, one can theoretically take any entity detected
by our NER system, and return the most frequently-linked page found
in the dictionary for that phrase. Most of the challenges with this approach came from the dictionary’s sheer size and its noisy nature. The
dictionary consists of 297,073,139 associations, mapping 175,100,788
unique strings to related English Wikipedia articles. It needed to be
pruned substantially to be used efficiently: we pruned according to
thresholds on a phrase-page pair’s raw frequency, as well as the probability of a page given the phrase. Once the dictionary was reduced to a
more manageable size, we loaded it into main memory as a sorted array,
and searched it with binary search, trading time efficiency for memory
efficiency. As mentioned above, the accuracy of this system was never
formally measured, but spot checks indicated that it has good coverage,
and that it does a good job of unifying different mentions of the same
person.
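A minimal sketch of this lookup strategy, keeping the pruned dictionary as a sorted array and querying it with binary search via Python's bisect module (the entry format is an assumption; the pruning step is not shown):

import bisect

class AnchorDictionary:
    """Pruned anchor-text dictionary kept as a sorted array of
    (phrase, page, probability) triples and queried with binary search."""

    def __init__(self, entries):
        # entries: iterable of (phrase, wikipedia_page, probability)
        self.entries = sorted(entries)
        self.phrases = [phrase for phrase, _, _ in self.entries]

    def link(self, mention):
        """Return the most probable Wikipedia page for a detected mention,
        or None if the phrase is not in the dictionary."""
        lo = bisect.bisect_left(self.phrases, mention)
        hi = bisect.bisect_right(self.phrases, mention)
        if lo == hi:
            return None
        best = max(self.entries[lo:hi], key=lambda e: e[2])
        return best[1]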
D
Sentiment & Emotion Analysis
In this section, we describe how we created a state-of-the-art SVM
classifier to detect the sentiment in tweets. The sentiment can be
one out of three possibilities: positive, negative, or neutral. We originally developed these classifiers to participate in an international competition organized by the Conference on Semantic Evaluation Exercises (SemEval-2013) [Wilson et al., 2013].7 The organizers created and
shared sentiment-labeled tweets for training, development, and testing.
The competition, officially referred to as Task 2: Sentiment Analysis
in Twitter, had more than 40 participating teams. Our submissions
stood first, obtaining a macro-averaged F-score of 69.02.
We implemented a number of surface-form, semantic, and sentiment
features. We also generated two large word–sentiment association lexicons, one from tweets with sentiment-word hashtags, and one from
tweets with emoticons. The automatically generated lexicons were particularly useful. In the message-level task for tweets, they alone provided a gain of more than 5 F-score points over and above that obtained
using all other features. The lexicons are made freely available.8
The emotion classification system used in this project also follows the
same architecture as our sentiment analysis system. That system classifies tweets into whether they express anger or no anger, fear or no
fear, dislike or no dislike, surprise or no surprise, joy or no joy, and
sadness or no sadness.
D.1
Sentiment Lexicons
Sentiment lexicons are lists of words with associations to positive and
negative sentiments.
D.1.1
Existing, Manually Created Sentiment Lexicons
The manually created lexicons we used include the NRC Emotion
Lexicon [Mohammad and Turney, 2010, Mohammad and Yang, 2011]
(about 14,000 words), the MPQA Lexicon [Wilson et al., 2005] (about
8,000 words), and the Bing Liu Lexicon [Hu and Liu, 2004] (about
6,800 words).

7 http://www.cs.york.ac.uk/semeval-2013/task2
8 www.purl.com/net/sentimentoftweets
D.1.2 New, Tweet-Specific, Automatically Generated Sentiment Lexicons
NRC Hashtag Sentiment Lexicon:
Certain words in tweets are specially marked with a hashtag (#) to
indicate the topic or sentiment. [Mohammad, 2012] showed that hashtagged emotion words such as joy, sadness, angry, and surprised are
good indicators that the tweet as a whole (even without the hashtagged
emotion word) is expressing the same emotion. We adapted that idea
to create a large corpus of positive and negative tweets.
We polled the Twitter API every four hours from April to December
2012 in search of tweets with either a positive word hashtag or a negative word hashtag. A collection of 78 seed words closely related to
positive and negative such as #good, #excellent, #bad, and #terrible
was used (32 positive and 36 negative). These terms were chosen from
entries for positive and negative in the Roget’s Thesaurus.
A set of 775,000 of these tweets were used to generate a large word–
sentiment association lexicon. A tweet was considered positive if it had
one of the 32 positive hashtagged seed words, and negative if it had
one of the 36 negative hashtagged seed words. The association score
for a term w was calculated from these pseudo-labeled tweets as shown
below:
\mathrm{score}(w) = \mathrm{PMI}(w, \mathrm{positive}) - \mathrm{PMI}(w, \mathrm{negative}) \qquad (5)
where PMI stands for pointwise mutual information. A positive score
indicates association with positive sentiment, whereas a negative score
indicates association with negative sentiment. The magnitude is indicative of the degree of association. The final lexicon, which we will refer
to as the NRC Hashtag Sentiment Lexicon, has entries for 54,129 unigrams and 316,531 bigrams. Entries were also generated for unigram–
unigram, unigram–bigram, and bigram–bigram pairs that were not necessarily contiguous in the tweets corpus. Pairs containing certain punctuation marks, '@' symbols, or some function words were removed. The lexicon
has entries for 308,808 non-contiguous pairs.
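For illustration, a minimal sketch of this PMI-based scoring over hashtag-labeled tweets (whitespace tokenization; words seen in only one class are skipped here, whereas the released lexicon handles them and also covers bigrams and non-contiguous pairs):

import math
from collections import Counter

def build_sentiment_lexicon(pos_tweets, neg_tweets):
    """score(w) = PMI(w, positive) - PMI(w, negative), computed from tweets
    pseudo-labeled by their seed hashtags.  For words seen in both classes
    this reduces to log(freq(w, pos) * N_neg / (freq(w, neg) * N_pos))."""
    pos_counts = Counter(w for t in pos_tweets for w in t.lower().split())
    neg_counts = Counter(w for t in neg_tweets for w in t.lower().split())
    n_pos, n_neg = sum(pos_counts.values()), sum(neg_counts.values())
    lexicon = {}
    for w in set(pos_counts) & set(neg_counts):
        lexicon[w] = math.log((pos_counts[w] * n_neg) /
                              (neg_counts[w] * n_pos))
    return lexicon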
Sentiment140 Lexicon:
The sentiment140 corpus [Go et al., 2009] is a collection of 1.6 million
tweets that contain positive and negative emoticons. The tweets are
labeled positive or negative according to the emoticon. We generated
a sentiment lexicon from this corpus in the same manner as described
above (Section D.1.2). This lexicon has entries for 62,468 unigrams,
677,698 bigrams, and 480,010 non-contiguous pairs.
D.2 Task: Automatically Detecting the Sentiment
of a Message
The objective of this task is to determine whether a given message is
positive, negative, or neutral.
D.2.1
Classifier and features
We trained a Support Vector Machine (SVM) [Fan et al., 2008] on the
training data provided. SVM is a state-of-the-art learning algorithm
proven to be effective on text categorization tasks and robust to large
feature spaces. The linear kernel and the value for the parameter
C=0.005 were chosen by cross-validation on the training data.
We normalized all URLs to http://someurl and all userids to @someuser.
We tokenized and part-of-speech tagged the tweets with the Carnegie
Mellon University (CMU) Twitter NLP tool [Gimpel et al., 2011]. Each
tweet was represented as a feature vector made up of the following
groups of features:
– word ngrams: presence or absence of contiguous sequences of 1, 2,
3, and 4 tokens; non-contiguous ngrams (ngrams with one token
replaced by *);
– character ngrams: presence or absence of contiguous sequences of
3, 4, and 5 characters;
– all-caps: the number of words with all characters in upper case;
– POS: the number of occurrences of each part-of-speech tag;
– hashtags: the number of hashtags;
– lexicons: the following sets of features were generated for each
of the three manually constructed sentiment lexicons (NRC Emotion Lexicon, MPQA, Bing Liu Lexicon) and for each of the two
automatically constructed lexicons (Hashtag Sentiment Lexicon
and Sentiment140 Lexicon). Separate feature sets were produced
for unigrams, bigrams, and non-contiguous pairs. The lexicon
features were created for all tokens in the tweet, for each part-of-speech tag, for hashtags, and for all-caps tokens. For each token
w and emotion or polarity p, we used the sentiment/emotion score
score(w, p) to determine:
∗ total count of tokens in the tweet with score(w, p) > 0;
∗ total score = \sum_{w \in tweet} score(w, p);
∗ the maximal score = \max_{w \in tweet} score(w, p);
∗ the score of the last token in the tweet with score(w, p) > 0;
– punctuation:
∗ the number of contiguous sequences of exclamation marks,
question marks, and both exclamation and question marks;
∗ whether the last token contains an exclamation or question
mark;
– emoticons: The polarity of an emoticon was determined with a
regular expression adopted from Christopher Potts’ tokenizing
script:9
∗ presence or absence of positive and negative emoticons at any
position in the tweet.
∗ whether the last token is a positive or negative emoticon;
– elongated words: the number of words with one character repeated
more than two times, for example, ‘soooo’;
– clusters: The CMU pos-tagging tool provides the token clusters
produced with the Brown clustering algorithm on 56 million English-language tweets. These 1,000 clusters serve as an alternative representation of tweet content, reducing the sparsity of the token
space.
∗ the presence or absence of tokens from each of the 1,000 clusters;

9 http://sentiment.christopherpotts.net/tokenizing.html
– negation: the number of negated contexts. Following [Pang et al., 2002],
we defined a negated context as a segment of a tweet that starts
with a negation word (e.g., no, shouldn’t) and ends with one
of the punctuation marks: ‘,’, ‘.’, ‘:’, ‘;’, ‘!’, ‘?’. A negated
context affects the ngram and lexicon features: we add a '_NEG'
suffix to each word following the negation word ('perfect' becomes 'perfect_NEG'). The '_NEG' suffix is also added to polarity
and emotion features ('POLARITY_positive' becomes 'POLARITY_positive_NEG'). The list of negation words was adopted from
Christopher Potts' sentiment tutorial.10 A sketch of this marking
is given below.
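A minimal sketch of this negated-context marking (the negation list below is a small illustrative subset; the project used Christopher Potts' full list):

# Illustrative subset only; the project used Christopher Potts' negation list.
NEGATION = {"no", "not", "never", "cannot", "shouldn't", "don't", "won't"}
CLAUSE_END = {",", ".", ":", ";", "!", "?"}

def mark_negation(tokens):
    """Append '_NEG' to every token from a negation word up to the next
    clause-level punctuation mark, as in 'perfect' -> 'perfect_NEG'."""
    out, in_neg = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            in_neg = False
            out.append(tok)
        elif in_neg:
            out.append(tok + "_NEG")
        else:
            out.append(tok)
            if tok.lower() in NEGATION:
                in_neg = True
    return out

# mark_negation("this is not a perfect day .".split())
# -> ['this', 'is', 'not', 'a_NEG', 'perfect_NEG', 'day_NEG', '.']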
We trained the SVM classifier on the set of 9,912 annotated tweets
(8,258 in the training set and 1,654 in the development set). We
applied the model to the previously unseen tweets gathered as
part of the CST system.
10 http://sentiment.christopherpotts.net/lingstruc.html
References
[Brown et al., 1992] Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra,
V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural
language. Computational linguistics, 18(4):467–479.
[Carbonell and Goldstein, 1998] Carbonell, J. G. and Goldstein, J. (1998).
The use of mmr, diversity-based reranking for reordering documents and
producing summaries. In Proc. of ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 335–336.
[Cherry and Guo, 2015] Cherry, C. and Guo, H. (2015). The unreasonable
effectiveness of word representations for twitter named entity recognition.
In Proceedings of the 2015 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, pages 735–745, Denver, Colorado. Association for Computational
Linguistics.
[Cherry et al., 2015] Cherry, C., Guo, H., and Dai, C. (2015). Nrc: Infused
phrase vectors for named entity recognition in twitter. In Proceedings of
the Workshop on Noisy User-generated Text, pages 54–60, Beijing, China.
Association for Computational Linguistics.
[Crammer et al., 2006] Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz,
S., and Singer, Y. (2006). Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585.
[Fan et al., 2008] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and
Lin, C.-J. (2008). LIBLINEAR: A Library for Large Linear Classification.
Journal of Machine Learning Research, 9:1871–1874.
[Finin et al., 2010] Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., and Dredze, M. (2010). Annotating named entities in twitter
data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical
Turk, pages 80–88.
[Fromreide et al., 2014] Fromreide, H., Hovy, D., and Søgaard, A. (2014).
Crowdsourcing and annotating NER for Twitter #drift. In LREC, pages
2544–2547, Reykjavik, Iceland.
[Gimpel et al., 2011] Gimpel, K., Schneider, N., O’Connor, B., Das, D.,
Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., and
Smith, N. A. (2011). Part-of-Speech Tagging for Twitter: Annotation,
Features, and Experiments. In Proceedings of the Annual Meeting of the
Association for Computational Linguistics.
[Go et al., 2009] Go, A., Bhayani, R., and Huang, L. (2009). Twitter Sentiment Classification using Distant Supervision. In Final Projects from
CS224N for Spring 2008/2009 at The Stanford Natural Language Processing Group.
[Goutte et al., 2014] Goutte, C., Léger, S., and Carpuat, M. (2014). The nrc
system for discriminating similar languages. In Proceedings of the First
Workshop on Applying NLP Tools to Similar Languages, Varieties and
Dialects, pages 139–145, Dublin, Ireland. Association for Computational
Linguistics and Dublin City University.
[Hu and Liu, 2004] Hu, M. and Liu, B. (2004). Mining and summarizing
customer reviews. In Proceedings of the 10th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’04, pages
168–177, New York, NY, USA. ACM.
[Liang, 2005] Liang, P. (2005). Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J.
(2013). Efficient estimation of word representations in vector space. In
ICLR Workshop.
[Miller et al., 2004] Miller, S., Guinness, J., and Zamanian, A. (2004). Name
tagging with word clusters and discriminative training. In HLT-NAACL,
pages 337–342.
[Mohammad and Yang, 2011] Mohammad, S. and Yang, T. (2011). Tracking
Sentiment in Mail: How Genders Differ on Emotional Axes. In Proceedings
of the 2nd Workshop on Computational Approaches to Subjectivity and
Sentiment Analysis (WASSA 2011), pages 70–79, Portland, Oregon.
[Mohammad, 2012] Mohammad, S. M. (2012). #emotional tweets. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared
task, and Volume 2: Proceedings of the Sixth International Workshop on
Semantic Evaluation, SemEval ’12, pages 246–255, Stroudsburg, PA.
[Mohammad and Turney, 2010] Mohammad, S. M. and Turney, P. D. (2010).
Emotions evoked by common words and phrases: Using mechanical turk
to create an emotion lexicon. In Proceedings of the NAACL-HLT 2010
Workshop on Computational Approaches to Analysis and Generation of
Emotion in Text, LA, California.
[Nadeau and Sekine, 2007] Nadeau, D. and Sekine, S. (2007). A survey of
named entity recognition and classification. Lingvisticae Investigationes,
30(1):3–26.
[Pang et al., 2002] Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs
up?: sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 79–86, Philadelphia, PA.
[Plank et al., 2014] Plank, B., Hovy, D., McDonald, R., and Søgaard, A.
(2014). Adapting taggers to Twitter with not-so-distant supervision. In
COLING, pages 1783–1792, Dublin, Ireland.
[Plutchik, 1962] Plutchik, R. (1962). The Emotions. New York: Random
House.
[Ratinov and Roth, 2009] Ratinov, L. and Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In CoNLL, pages
147–155.
[Ratnaparkhi, 1996] Ratnaparkhi, A. (1996). A maximum entropy model for
part-of-speech tagging. In EMNLP, pages 133–142.
[Ritter et al., 2011] Ritter, A., Clark, S., Mausam, and Etzioni, O. (2011).
Named entity recognition in tweets: An experimental study. In EMNLP,
pages 1524–1534, Edinburgh, Scotland, UK.
[Ritter et al., 2012] Ritter, A., Mausam, Etzioni, O., and Clark, S. (2012).
Open domain event extraction from twitter. In Proceedings of the 18th
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD ’12, pages 1104–1112, New York, NY, USA. ACM.
[Sarawagi and Cohen, 2004] Sarawagi, S. and Cohen, W. W. (2004). Semimarkov conditional random fields for information extraction. In NIPS,
pages 1185–1192.
[Spitkovsky and Chang, 2011] Spitkovsky, V. I. and Chang, A. X. (2011).
Strong baselines for cross-lingual entity linking. In Proceedings of the
Fourth Text Analysis Conference (TAC 2011), Gaithersburg, Maryland,
USA.
[Spitkovsky and Chang, 2012] Spitkovsky, V. I. and Chang, A. X. (2012). A
cross-lingual dictionary for English Wikipedia concepts. In Proceedings of
the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.
[Tjong Kim Sang and De Meulder, 2003] Tjong Kim Sang, E. F. and
De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task:
Language-independent named entity recognition. In CoNLL, pages 142–
147.
[Wilson et al., 2013] Wilson, T., Kozareva, Z., Nakov, P., Rosenthal, S.,
Stoyanov, V., and Ritter, A. (2013). SemEval-2013 Task 2: Sentiment
analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’13, Atlanta, Georgia, USA.
[Wilson et al., 2005] Wilson, T., Wiebe, J., and Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 347–354,
Stroudsburg, PA, USA.
[Zhu, 2010] Zhu, X. (2010). Summarizing Spoken Documents Through Utterance Selection. PhD thesis, Department of Computer Science, University
of Toronto.
[Zhu et al., 2013] Zhu, X., Cherry, C., Kiritchenko, S., Martin, J., and
de Bruijn, B. (2013). Detecting concept relations in clinical text: Insights from a state-of-the-art model. Journal of Biomedical Informatics,
46:275–285.