Countering Security Threats Using Natural Language Processing

Pierre Isabelle, C. Cherry, R. Kuhn and S. Mohammad
National Research Council

Prepared By: National Research Council, 1200 Montreal Rd., Building M-50, Ottawa, ON K1A 0M5
Contract Reference Number: CSSP-2013-CP-1031
Technical Authority: Rodney Howes, DRDC – Centre for Security Science, 613-943-2474

Disclaimer: The scientific or technical validity of this Contract Report is entirely the responsibility of the Contractor and the contents do not necessarily have the approval or endorsement of the Department of National Defence of Canada.

Contract Report DRDC-RDDC-2016-C344
December 2016

© Her Majesty the Queen in Right of Canada, as represented by the Minister of National Defence, 2016
© Sa Majesté la Reine (en droit du Canada), telle que représentée par le ministre de la Défense nationale, 2016

Abstract

The quantity of data available to information analysts has been growing at an exponential rate over the last two decades and will continue to do so for the foreseeable future. At the forefront of that growth are the new social media such as Twitter, Instagram and Facebook. These vehicles carry a wealth of information that could be of great value to security analysts, but the challenge is to uncover the small number of information gems buried in a huge mass of worthless material. In recent years, Canada has invested substantial amounts of money in research on natural language technologies. The NRC has been highly successful on that front, developing world-class technologies for machine translation, text summarization, information extraction, and sentiment and emotion analysis. While these technologies are already being used in various application areas, their potential for security analysis remains to be firmly established. This is exactly what this technology demonstration project set out to accomplish. Together with our industrial partners Thales TRT (Quebec City) and MediaMiser (Ottawa), and with the assistance of a professional intelligence service, we have developed a prototype system that can: 1) monitor social media on an ongoing basis, extracting from them huge multilingual collections of documents about topics of interest; 2) translate the foreign-language documents within such collections; 3) enrich the content of all documents using advanced linguistic analysis such as information extraction and sentiment analysis; 4) store the results in a special-purpose database; 5) provide users with unmatched flexibility in tailoring multi-faceted search queries based on criteria as diverse as source language, document genre, posting location, posting date, keywords, linguistic entities, and author sentiment and emotions; and 6) present users with rich visualizations of the results of their search queries.
The core part of this report contains the following: a) a mostly non-technical presentation of the architecture of the prototype system that constitutes our main project result; b) an extensive video tour of that prototype which makes it easy to understand its value for information analysts; and c) a description of the user contributions, feedback and conclusions about this prototype system, which is found to be “on its way to being a high-quality analytic tool [...]”.

Contents

1 Introduction
2 Overall Architecture
  2.1 MediaMiser: data collection
  2.2 NRC: text enrichment
  2.3 Thales: data storage, retrieval and visualisation
3 A tour of the technology demonstrator
4 User Contributions and Feedback
5 Conclusions
A Machine Translation
  A.1 Improving the throughput of the MT module
  A.2 Improving Translation Quality
    A.2.1 Rules for Handling Tweets
    A.2.2 Additional Training Data
  A.3 Discussion and Recommendations
B Summarization
  B.1 Functionality
  B.2 Interface
  B.3 Summarization Algorithm
C Information Extraction
  C.1 Named Entity Recognition
    C.1.1 Data
    C.1.2 Methods
    C.1.3 Experiments
    C.1.4 Discussion
  C.2 Entity Linking
D Sentiment & Emotion Analysis
  D.1 Sentiment Lexicons
    D.1.1 Existing, Automatically Created Sentiment Lexicons
    D.1.2 New, Tweet-Specific, Automatically Generated Sentiment Lexicons
  D.2 Task: Automatically Detecting the Sentiment of a Message
    D.2.1 Classifier and features

List of Figures

1 Overall architecture of the CST system
2 Initial MT module
3 Improved version of MT module, December 2014
4 Speed improvement in MT module
5 Handling of Twitter #hashtags
6 Handling non-Arabic scripts and multiple hash tags
7 MT module, version 2
8 A user can drag a selection box to select the tweets of interest, in order to generate a summary. In the figure, two spikes of tweets between September 9th and 11th were selected.
9 Once the summary is ready, an “Open Summarization” button is shown (upper figure). Users can then click the “Open Summarization” button to read the summaries (lower figure).
10 An example of semi-Markov tagging.

List of Tables

1 Details of our NER-annotated corpora. A line is a tweet in Twitter and a sentence in newswire.
2 A system trained only on newswire data, tested on newswire (CoNLL) and social media data (Rit11, Fro14). Reporting F1.
3 The progression of the NRC named entity recognizer throughout the CST project. Reporting F1.
4 F1 for our final system, organized by entity class.

1 Introduction

The quantity of data that is available to information analysts has been growing at an exponential rate in the last two decades and will continue to do so for the foreseeable future. At the forefront of that growth are the new social media such as Twitter, Instagram and Facebook. It is clear that those vehicles carry a wealth of information that can be of great value to security analysts. An international example might be blogs or tweets in Arabic or Chinese that indicate developing threats to Canadian embassies. A Canadian example might be blogs or tweets from Canadians that suggest a social disturbance may be developing. Recall that during the June 2011 Vancouver hockey riot, many rioters and onlookers used Twitter, Facebook and other platforms to describe what was going on in real time.

Unfortunately, the exponential growth in data size on social media also means that any valuable information nugget tends to be buried underneath massive amounts of irrelevant material. For security analysts, time is of the essence: they cannot afford to waste much of it tossing away large amounts of worthless chaff. A large proportion of the potentially useful data is in the form of natural language texts in many different languages. Consequently, information analysts could greatly benefit from tools that make them more efficient at finding useful pieces of information hidden within massive quantities of multilingual text.

In recent years, most developed countries, including Canada, have invested substantial amounts of research money into natural language processing (NLP) technology. Lately, researchers in that area have been moving away from the traditional paradigm of manually encoded rule systems to embrace a radically different paradigm: that of machines that can automatically learn from examples. This has resulted in very significant progress on a broad range of applications including machine translation, text classification, summarization, information extraction and sentiment & emotion analysis.

Canadian researchers have been highly active on that R&D front. For example, the NRC has succeeded in developing leading-edge technology for all the applications just mentioned. Our “leading-edge” qualification is backed up by the fact that over the last ten years, NRC has repeatedly and consistently obtained some of the very best marks in international technology benchmarking exercises, including the following:

• In 2012, NRC tied with Raytheon BBN for first place in Chinese-to-English and Arabic-to-English machine translation at NIST Open MT 2012.¹

• In 2010, 2011 and 2012, NRC participated in the i2b2 technology benchmarking exercise for information extraction in the medical domain.² Each time, NRC’s results placed at the top. See for example [Zhu et al., 2013].

• Between 2013 and 2015, NRC’s sentiment analysis technology was a top performer on six different tasks of the SemEval annual benchmarking exercises.³ See for example [Wilson et al., 2013].

• In 2014 and 2015, NRC’s text categorization technology ranked first in the Discriminating Similar Languages Shared Task.⁴ [Goutte et al., 2014].

Moreover, NRC’s NLP technologies have already been deployed in many practical applications.
Here are some examples:

• The Extractor multilingual text summarization technology has been on the market for about 15 years.⁵

• The PORTAGE machine translation system has been commercialized since 2009 and is currently in use by several private linguistic service providers as well as by the Canadian Translation Bureau.⁶

However, to the best of our knowledge, the value of state-of-the-art NLP technology has yet to be firmly established in the context of security analysis. The goal of the present CSSP project was precisely to demonstrate that there is indeed substantial value there for security analysts. Most of the relevant technologies were already available individually at the start of the project. The core of our effort was devoted to: 1) adapting each technology to the specificities of social media and security analysis; 2) assembling these components into a coherent, demonstrable system that can be tested by professional security analysts; and 3) collecting feedback from analysts and using it to produce successively improved versions of the prototype. Such a system has successfully been assembled and extensively demonstrated not only to the user-partner of this project but also to many other organizations, both public and private.

¹ See http://www.itl.nist.gov/iad/mig/tests/mt/
² See https://www.i2b2.org/
³ See https://en.wikipedia.org/wiki/SemEval
⁴ See http://ttg.uni-saarland.de/lt4vardial2015/papers/goutte2015.pdf
⁵ See http://www.extractor.com/
⁶ See http://www.terminotix.com/

In section 2 we will examine the architecture of the project and of the resulting technology. In section 3 we will take the reader on an audio-visual tour of the prototype system that was built. In section 4, we will synthesize the feedback that we received from our user-partner after testing the technology at different stages of its development. More technical aspects of our work are then presented in a set of appendices that the less technically minded reader can safely ignore.

2 Overall Architecture

Our project involved three technical contributors: NRC, Thales TRT Canada Inc. and MediaMiser Inc. Each contributor developed or adapted software of its own and made it available to the project partners, typically through web services. Figure 1 shows the overall architecture.

Figure 1: Overall architecture of the CST system

2.1 MediaMiser: data collection

Each social medium produces its own stream of documents. Generally speaking, those streams are much too large to be captured in their entirety. In practice, this means that interesting “topics” have to be monitored on a continuing basis, so that potentially interesting documents about each such topic can be captured on the fly, ahead of examination time by the users, and stored for further processing and examination. For the purposes of our project, the partners agreed on a small number of topics (often referred to as “scenarios” in our project documentation) to be used as a testbed for the technology. Each selected topic was then encapsulated as a boolean combination of keywords, which MediaMiser used to extract matching documents on a continuing basis from the set of media that we were monitoring.
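To make this keyword-based capture step concrete, here is a minimal sketch of how a boolean topic query can be matched against incoming documents. It is purely illustrative: the query encoding, the topic definition and the function names are our own assumptions, not MediaMiser's actual implementation.

```python
# Minimal sketch of boolean keyword matching for topic capture.
# The topic encoding and helper names are illustrative assumptions,
# not MediaMiser's actual query language or schema.

def matches(text, query):
    """Return True if 'text' satisfies a query given as an OR-list of AND-groups."""
    lowered = text.lower()
    return any(all(term.lower() in lowered for term in group) for group in query)

# A topic is encoded here as a disjunction (outer list) of conjunctions (inner lists).
SYRIA_TOPIC = [["syria", "chlorine"], ["syria", "attack"], ["#syria"]]

incoming = [
    "Reports of a chlorine attack near Aleppo, Syria",
    "Great weather in Ottawa today",
]
captured = [doc for doc in incoming if matches(doc, SYRIA_TOPIC)]
print(captured)  # only the first document is captured for the "Syria crisis" topic
```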
Here are some of the topics that were in focus at some time during the project:

• Sochi Olympics. During winter 2014, we collected a sizable amount of data about the Sochi Olympics, using keywords such as “Sochi Olympics”, “winter games”, etc. The scope of the collection process was limited to English documents.

• The Ottawa War Memorial shootings. In October 2014, soon after the shooting happened, we started collecting data about the October 22 shooting events and their impact, using keywords such as “Ottawa shooting”. Fortunately, using Twitter’s historical search mechanism, we were also able to go back in time, so that our War Memorial collection covers the whole event. Here again, the collection process was limited to English documents.

• The Syria crisis. Almost 100 million documents about the current civil war in Syria were extracted from social media and processed between June 2014 and the end of the project. In this case, both English and Arabic keywords were used so as to extract documents in those two languages. In the second year of the project, most experiments and demonstrations concentrated on that particular dataset.

MediaMiser’s primary role was to monitor social media on a continuing basis for documents matching the keywords associated with any of our active topics and to extract all such documents from each relevant stream, no matter how numerous the matches might be. The volume of the Syria dataset reached a peak of about 1.5 million documents per day. In practice it was found necessary to limit the media sources to the following list: Twitter, plus various English newswire and English blog wires that were already being monitored by MediaMiser for other purposes. Later on, we added Arabic blogs for the benefit of our Syria dataset.

The data extracted by MediaMiser included not only the documents as seen on the social media platforms but also some metadata that the different media provide about each document. For example, documents extracted from Twitter are each in the form of a JSON record which, in addition to the document text, includes metadata elements such as the following:

• Author’s (pen) name.

• Author’s (declared) place of residence.

• When available, geographical coordinates of the origin of the posting. This is only available in cases where the document was sent from a mobile device that had geo-tracking turned on (about 2 percent in the case of our Syria dataset).

• Language of the document. Twitter provides that information using their in-house language guesser. Our other sources only contained English-only or Arabic-only documents.

• Social network information (Twitter only). The author field of each document is enriched with the lists of following and followed authors as well as a list of favorites.

As the details of the available metadata vary greatly between different sources, MediaMiser carried out a process of metadata normalization so as to simplify downstream processing; a sketch of what such a normalized record might look like is given below.
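For illustration only, here is a hedged sketch of the kind of normalized record the downstream components might receive. The field names and values are our own assumptions and do not reflect MediaMiser's actual schema.

```python
# Illustrative sketch of a normalized social-media record after metadata
# normalization. Field names are assumptions for exposition, not the real schema.
normalized_record = {
    "dataset": "syria_crisis",            # topic/scenario the document was captured for
    "source": "twitter",                  # twitter, newswire or blog wire
    "genre": "tweet",
    "text": "...",                        # original posting text
    "language": "ar",                     # as guessed by the source platform
    "author": "some_pen_name",
    "author_location": "Damascus",        # free-text declared place of residence
    "geo": {"lat": 33.51, "lon": 36.29},  # present for only ~2% of tweets
    "timestamp": "2014-09-10T14:22:00Z",
    "social": {"followers": 120, "following": 300, "favorites": 45},
}
```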
The initial plan was that, once normalized, the data extracted by MediaMiser would be streamed to NRC for linguistic processing, and the result would thereafter be streamed to Thales TRT in Quebec City, where it would be stored and made searchable. However, it proved impractical for NRC to deploy the high-bandwidth web service that would have been required for that purpose. For that reason, it was decided to install NRC’s linguistic technology on MediaMiser’s premises. The heavy computational burden involved in language processing (especially machine translation) was thereby transferred to MediaMiser, and the available computing resources turned out to be somewhat underpowered given the high demands placed on them. As a result, during some of the peaks in data volume, MediaMiser was unable to translate all of the extracted documents. In such cases, the original version of the document was streamed to Thales without its translation. In order to minimize the impact of this problem, MediaMiser provided Thales with access to the NRC translation server running on their premises. Thales was then able to add an on-demand translation service to the demonstration system, so that the user would still be able, if needed, to see translations of the foreign-language documents that had not been translated at capture time.

2.2 NRC: text enrichment

NRC’s primary role was to enrich the texts of the documents collected on social media using its leading-edge natural language technology. This included the following components:

• Machine translation. The initial plan was to deploy NRC’s PORTAGE statistical machine translation for both Chinese-to-English and Arabic-to-English. However, as the project unfolded, the partners decided to concentrate on Arabic-to-English only, reflecting the growing focus of the project on the Syria crisis dataset. The first step was to integrate a general-purpose Arabic-to-English PORTAGE translator into the CST prototype. Then, in a second step, the translation component was customized for the peculiarities of social media texts (in particular, Twitter), thereby obtaining significant gains in the quality of the translations.

• Document summarization. NRC deployed its Extractor™ technology to automatically pull out from each document:

  – A set of words or phrases that can be considered good keywords for that document, in that they reflect its core contents. In the CST prototype, Extractor keywords are called “topics”.

  – A set of sentences that constitute a good summary of the document. The connection with the above keywords is that the extracted sentences are chosen so as to maximize the coverage of those keywords, without being overly redundant among themselves.

  Note that Extractor summarization operates on single documents only. In the case of micro-blogs such as Twitter, sentence extraction is irrelevant since documents typically contain only one sentence. However, during the project we realized that a technology capable of summarizing related groups of documents would be useful to security analysts. This is why, halfway into the project, we decided to develop such a capability (see below). A minimal sketch of keyword-driven sentence extraction is shown after this item.
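The following is a minimal, hedged sketch of the general idea of keyword-driven extractive summarization just described: greedily pick sentences that cover the document's keywords while penalizing redundancy. It is not NRC's Extractor algorithm; the scoring and names are illustrative assumptions.

```python
# Illustrative greedy extractive summarizer: favour sentences that cover many
# document keywords, penalize overlap with sentences already selected.
# This is NOT the Extractor algorithm, only a sketch of the general idea.

def summarize(sentences, keywords, max_sentences=2):
    keywords = {k.lower() for k in keywords}
    chosen, covered = [], set()
    for _ in range(max_sentences):
        best, best_score = None, 0.0
        for sent in sentences:
            if sent in chosen:
                continue
            words = set(sent.lower().split())
            gain = len(words & (keywords - covered))   # new keywords exposed
            redundancy = len(words & covered)          # overlap with the summary so far
            score = gain - 0.5 * redundancy
            if score > best_score:
                best, best_score = sent, score
        if best is None:
            break
        chosen.append(best)
        covered |= set(best.lower().split()) & keywords
    return chosen

doc = ["Rebels shelled the old city overnight.",
       "The old city was shelled again overnight, residents said.",
       "Aid convoys remain blocked at the border."]
# Picks the first and third sentences, skipping the redundant second one.
print(summarize(doc, ["shelled", "city", "aid", "convoys", "border"]))
```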
• Information extraction. We designate under that name any technology capable of automatically extracting structured information from semi-structured or unstructured text. The kinds of information users are typically interested in include the following:

  – What entities (e.g. persons, places, dates, amounts, etc.) are being referred to in some particular text or collection of texts?

  – What relationships are being expressed between those entities (e.g. person X was born on date Z)?

  – What events are being described involving those entities (e.g. an earthquake happened at place Y on date Z)?

  For the purposes of this project we chose to concentrate on extracting the following entity types: persons, places and organizations. Early in the project, a first entity extractor was integrated, based on a pre-existing generic open-source implementation. As that initial version yielded unsatisfactory results, we successively developed two improved versions. The final one yielded some of the best results ever reported on entity extraction for social media (see details in Appendix C below).

  The more advanced entity extractors include a capability to merge together references to one and the same entity through different expressions. For example, both “Gaddafi” and “Qadaffi” are sometimes used to refer to Muammar Gaddafi, the former Libyan leader. The final version of our CST entity extractor implements such a functionality. It does so by linking each entity mention with its Wikipedia page (assuming it has one). “Gaddafi” and “Qadaffi” would then lead the user to the same Wikipedia page, namely https://en.wikipedia.org/wiki/Muammar_Gaddafi. Extracting meaningful relationships and events from the text itself was beyond the scope of our resource-limited project. However, as we will see below, the CST prototype still contains mechanisms for extracting simple co-occurrence relationships such as: entities X and Y tend to occur in the same documents, or in documents by the same author, or in documents that share the same hashtags, etc.

• Sentiment and emotion analysis. We use the term “sentiment analysis” to designate the operation of assigning a given piece of text to one of the categories “positive”, “negative” or “neutral”, according to whether the author of the text is expressing a positive, negative or neutral attitude towards the content of that piece of text. In this project, our sentiment analyzer is applied independently to each sentence of an input document, and the result is a sentiment score ranging between -1 (perfectly negative) and +1 (perfectly positive), with 0 meaning perfectly neutral. We designate under the term “emotion analysis” the operation of assigning to some piece of text one or more labels denoting the emotions, if any, that the author of that piece of text is conveying through it. In this project, we use the following list of six emotions drawn from Plutchik’s set of basic emotions [Plutchik, 1962]: joy, surprise, sadness, dislike, anger and fear. Our emotion analyzer is also applied independently to each sentence of an input document. Each sentence receives a number between 0 and 1 for each of the six emotions, according to the measured strength of the relevant emotion in that sentence. The sketch after this item illustrates the shape of this per-sentence output.
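As an illustration of the scoring scheme just described (sentence-level sentiment in [-1, +1] and six emotion strengths in [0, 1]), here is a hedged sketch of the kind of per-sentence annotation a document might receive. The numbers, field names and the neutral band used in the aggregation are invented for exposition; this is not the output format or the classifier of the NRC analyzer.

```python
# Illustrative per-sentence sentiment/emotion annotations for one document.
# Scores are invented; the NRC classifiers and their output format are not shown here.
EMOTIONS = ["joy", "surprise", "sadness", "dislike", "anger", "fear"]

sentence_annotations = [
    {"sentence": "The ceasefire gives families a chance to return home.",
     "sentiment": 0.62,   # in [-1, +1]; positive
     "emotions": {"joy": 0.71, "surprise": 0.10, "sadness": 0.05,
                  "dislike": 0.02, "anger": 0.01, "fear": 0.08}},
    {"sentence": "Shelling resumed before dawn and people are terrified.",
     "sentiment": -0.83,  # negative
     "emotions": {"joy": 0.00, "surprise": 0.22, "sadness": 0.55,
                  "dislike": 0.30, "anger": 0.40, "fear": 0.90}},
]

# A simple document-level aggregation: proportion of positive/negative/neutral
# sentences (the neutral band of +/-0.1 is an assumption, not the real threshold).
def aggregate(annotations, neutral_band=0.1):
    pos = sum(a["sentiment"] > neutral_band for a in annotations)
    neg = sum(a["sentiment"] < -neutral_band for a in annotations)
    neu = len(annotations) - pos - neg
    n = len(annotations)
    return {"positive": pos / n, "negative": neg / n, "neutral": neu / n}

print(aggregate(sentence_annotations))  # {'positive': 0.5, 'negative': 0.5, 'neutral': 0.0}
```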
• Multi-document summarizer. As mentioned above, halfway into the project we realized that there was an acute need for a capability to summarize groups of documents rather than just single documents. This particular task turns out to be quite different from that of summarizing individual documents. For example, a multi-document summarizer is likely to have to deal with much more redundancy in its input; think of similar accounts of the same event being published by different newspapers. In the case of micro-blogs such as Twitter, single-document summarization is irrelevant, but a capability to summarize groups of tweets is potentially very useful. For example, sudden peaks in volume on a given topic are most often caused by specific events. Applying a multi-document summarizer to the set of documents in a given peak would instantly bring such events to light. NRC decided to produce its own novel technology for multi-document summarization and to apply it to the needs of the CST project. Over the last six months of the project, two successive versions of NRC’s multi-document summarizer were incorporated into the CST prototype system.

An interesting aspect of this evolution is that it brought a significant change in the overall project architecture. Up to then, NRC’s language technology was only working at the single-document level: documents extracted by MediaMiser were individually subjected to summarization, information extraction and sentiment and emotion analysis by a process devoid of any awareness of the broader collection to which the document belongs. However, the scenario in which multi-document summarization is needed is one in which a user arbitrarily targets some specific group of documents (e.g. those contained in a particular peak). As a result, multi-document summarization cannot be performed at collection time: it needs to be user-triggerable at any time. We were thus led to amend the overall system architecture so as to give Thales, our system integrator, direct access to NRC’s multi-document summarizer. Thales was then able to implement a user-triggered multi-document summarization capability in our prototype system.

• Implementation of NRC’s linguistic technology. NRC’s linguistic processing components are implemented through the following three schemes:

1. A machine translation service, embodied as a batching and queuing system running on a machine located in MediaMiser’s data center. MediaMiser calls this service upon extracting any document that is marked as non-English. The translated text is then added as an additional metadata field in the JSON record associated with the document. During the project, this was used to deal with the 50 million Arabic documents included in the Syria crisis dataset; all the other datasets used in the project were English-only. Machine translation is a computation-intensive technology. As mentioned above, during some peaks in the volume of extracted foreign-language data, the machine translation server was occasionally unable to cope in real time with the full incoming flow. On such occasions, some of the foreign-language data was streamed to Thales without any translation or with only a partial translation. However, Thales was provided access to the translation server, so that they were able to build an on-demand machine translation service that users could resort to if they were interested in reading some of the untranslated foreign-language documents.

2. A linguistic annotation service, implemented as a Representational State Transfer (REST) service working on JSON records. The Web server calls the following sub-annotators: a tokenizer, the named-entity extractor and the sentiment and emotion analyzer, each implemented as one or more TCP/IP servers, as well as the Extractor summarizer, implemented as a directly callable library. The effect is as follows:

– The NRC tokenization sub-service segments the text of each input document into separate word tokens and separate sentences. The segmented version of the text is stored in additional metadata elements in the JSON record of each document. This prepares the ground for the application of the remaining sub-services.

– The NRC Extractor sub-service adds to each input document: a) a metadata field containing a set of “key words” or “key phrases” that capture the topic of the document; and b) another metadata field which, in the case of multi-sentence documents, contains a few key sentences that constitute an extractive summary of the document.
– The NRC entity extraction sub-service adds to each document a metadata field containing a list of entities of each of the following types: person, place or organization. An additional metadata field is also added to associate each such entity with a link to its Wikipedia entry (if it has one).

– The NRC sentiment and emotion analysis sub-service adds the following metadata fields to each document: a) a field containing a document-level aggregation of sentence-level sentiment, namely the proportion of positive, negative and neutral sentences in the document; and b) a field that contains the list of sentence-level sentiments and emotions. In the latter field, each sentence of the document is assigned its most likely sentiment (positive, negative or neutral) together with probabilities for each, plus a probability value for each one of the following emotions: anger, dislike, fear, joy, sadness and surprise.

3. Finally, NRC’s multi-document summarizer service has been implemented as a Java library running on Thales’ machines in Quebec City. Recall that each set of documents to be summarized by that module is selected by the user at runtime.

2.3 Thales: data storage, retrieval and visualisation

For each active dataset (or “scenario”), Thales receives an uninterrupted real-time stream of social media documents in JSON format. As discussed above, the metadata associated with each document has been normalized by MediaMiser and enriched using NRC technology with translations, document summaries, lists of included entities, and sentiment and emotion markings. All received documents are then stored and made searchable. For that purpose, Thales developed their own document-oriented database, which supports real-time indexing of huge quantities of documents as well as real-time search based on multi-faceted queries over the raw document content and the associated metadata.

Starting from the whole collection of datasets available in the system, the user is given a wide range of filtering mechanisms that allow narrowing down on specific subsets of interest:

– Dataset. Each document fed into the database belongs to one of the datasets being actively monitored in the project and is identified as such in its metadata. The user starts by choosing one among the available datasets, such as “Syria crisis” (which currently contains some 100 million documents), “War memorial shootings” or “Ebola Canada”.

– Language of the documents to be retrieved. This relevant piece of metadata is inserted in each tweet by Twitter. Our other sources are all monolingual and MediaMiser adds the relevant language metadata at collection time. The user is then provided with the means to narrow down his or her focus to one or more of the languages for which documents are available. With the current datasets this is only relevant for the “Syria crisis” dataset, which is made up of roughly equal numbers of English and Arabic documents.

– Genre(s) of documents to be retrieved. MediaMiser adds to each extracted document a metadata field describing its genre as one of the following three: “tweet”, “news” or “blog”. The user can then restrict the scope of any given search to any combination of those three genres.

– Documents matching some user-specified boolean combination of words (e.g. “chlorine AND attacks”). This is of course a very basic and standard mechanism for filtering down any document collection.
A user-selectable option is also provided to allow matching the query not only against the document text but also against its metadata elements.

– Documents posted at a specific time. The metadata provided to Thales includes a posting timestamp on each document. Thales was able to use this to provide the user with the means to restrict the search to any time interval between the moment the dataset under inspection started being collected and the present time.

– Location of posting. The user can select on a map any rectangular geographical area to which the search will be restricted. This interesting functionality is unfortunately not available for all documents, since the relevant metadata is only available for a subset of them. For example, only about 2% of Twitter posts come with metadata indicating the precise geographical coordinates of the posting site, namely those posts that were sent with a device on which this kind of tracking is both available and enabled. However, Thales has contributed a mechanism that attempts to infer the posting location from the Twitter user profiles, in which users are allowed to include a free-text description of their place of residence. Such descriptions often turn out to be difficult to interpret because they are not standardized (e.g. variable granularity of the location: country, region, town, etc.). Moreover, the residence location and the posting location are not necessarily identical: the author may be traveling, or lying about his or her true place of residence. But this approximation allows us to increase the geo-location coverage to about 30% of the input data.

– Hashtags (Twitter only). The user can restrict the document collection to those bearing some particular hashtag(s).

– @authors (Twitter only). The user can restrict the document collection to those posted by a specific author.

– Topics. NRC’s Extractor system has been used to annotate the English text (original or obtained by machine translation) with a set of words and phrases that capture the topic of the document. The user can filter down the current collection to those documents marked with any of those topics.

– Entities. NRC has added to each document metadata showing the entities that have been identified in the text of that document, among the following entity types: persons, places and organizations. The user can take advantage of this marking and filter down the current collection to those documents containing some particular entity or set of entities.

– Sentiment. NRC has added to each document metadata that indicates a sentiment score between -1 (completely negative) and +1 (completely positive). The user is given the means to filter down the current collection to show only those documents within some subrange of sentiment score. For example, a user might use this to filter the set of posts that contain the word “ISIS” down to the subset that is very positive (say, sentiment score > 0.75).

– Emotions. NRC has added to each document a score ranging between 0 and 1 for each of six emotions: joy, surprise, sadness, dislike, anger and fear. Thales has used a threshold of 0.5 to binarize the presence or absence of each emotion. The user can then filter down the current selection so as to only show those documents that express some particular emotion. For example, the user could ask to see all posts that refer to some particular entity (say, a given person) while expressing anger. A minimal sketch showing how several of these filters can be combined is given after this list.
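The following is a minimal sketch of how several of the above facets (genre, time window, sentiment range, keyword matching and the 0.5 emotion threshold) could be combined over a set of enriched records. The record layout, the in-memory approach and the function names are our own assumptions, not the actual Thales database or its query API.

```python
# Illustrative multi-faceted filter over enriched records. The record fields and
# this in-memory approach are assumptions for exposition; the real system uses a
# document-oriented database with its own query interface.
from datetime import datetime

def select(docs, genres=None, language=None, after=None, before=None,
           sentiment_min=-1.0, sentiment_max=1.0, emotion=None, keyword=None):
    out = []
    for d in docs:
        if genres and d["genre"] not in genres:
            continue
        if language and d["language"] != language:
            continue
        ts = datetime.fromisoformat(d["timestamp"])
        if (after and ts < after) or (before and ts > before):
            continue
        if not (sentiment_min <= d["sentiment"] <= sentiment_max):
            continue
        if emotion and d["emotions"].get(emotion, 0.0) < 0.5:  # 0.5 binarization threshold
            continue
        if keyword and keyword.lower() not in d["text"].lower():
            continue
        out.append(d)
    return out

docs = [
    {"genre": "tweet", "language": "en", "timestamp": "2014-09-10T08:00:00",
     "sentiment": -0.8, "emotions": {"anger": 0.7}, "text": "Chlorine attacks reported again"},
    {"genre": "news", "language": "en", "timestamp": "2014-09-12T09:00:00",
     "sentiment": 0.1, "emotions": {"anger": 0.1}, "text": "Aid convoy reaches the border"},
]
angry_tweets = select(docs, genres={"tweet"}, emotion="anger",
                      sentiment_max=0.0, keyword="chlorine")
print(len(angry_tweets))  # 1
```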
Given any collection of documents that represents either a complete dataset (such as “Syria crisis”) or a subset of it obtained using any combination of the filtering devices enumerated above, the user is presented with viewing mechanisms both at the collection level and at the document level. At the collection level, the user can see all of the following:

– A timeline that displays the density of postings over time bins spanning the whole period during which the sub-collection was extracted (up to now if the extraction process is still active). This is useful, for example, to spot significant peaks in volume, which are often associated with important new events. This is implemented as a bar chart. Moreover, each bar is segmented in such a way as to display the sentiment distribution between positive (green), neutral (blue) and negative (red). This makes it possible to observe the evolution of relative sentiment over time. Note also that the timeline display is interactive, in that it allows the user to apply the time-based filtering mentioned above.

– A map that shows the geographical distribution of postings in the current sub-collection. That distribution can be observed at various levels of granularity, from a whole-earth view down to city-block level, using a zooming function. The display can be switched between a small widget on the main interface Web page and a fullscreen view. Here again, the display can be used not only as a viewing device over the current collection but also as a triggering device for the location-based filtering mentioned above.

– A word cloud that shows the relative salience of user-selected classes of objects within the current sub-collection. The classes that can be selected include the following: hashtags, topics, Twitter authors, persons, places and organizations. The latter three classes correspond to the classes of entities extracted using NRC technology; without such a technology it would not be possible to observe the relative salience of persons (or places, or organisations) in a given dataset. Once again, the display is used not only as an output device but also as a trigger mechanism for document filtering: when the user clicks on any element of the word cloud, the current collection gets filtered down to the sub-collection containing that element. Note that the word cloud always interacts strongly with all the filtering mechanisms. For example, if the collection is reduced to the production of one particular author, a word cloud on topics makes it possible to instantly survey the range of interests of that particular author.

– A sentiment graph that shows the distribution of posts on the negative/positive axis. One can use this to observe differences between datasets or subsets. For example, one can easily see that the recently introduced “airline” dataset is neatly centered on neutrality while the “Syria crisis” dataset is skewed towards the negative side. Like the other widgets described above, the sentiment graph is not only a display mechanism: it also serves as a triggering device for sentiment-based filtering, since selecting any range on the sentiment axis will have the effect of filtering the current collection down to documents that are within that range.

– An emotion graph that displays the relative salience of the six annotated emotions in the current sub-collection. The user can toggle the display between a bar chart and a radar chart.
Like the other widgets described before, this one can also be used as a filtering trigger: clicking on the zone associated with a particular emotion will have the effect of filtering the current collection down to documents that have been marked as expressing that emotion.

– A co-occurrence network that allows the user to examine co-occurrence relations between various kinds of objects within the current sub-collection. In doing so, we can distinguish between: a) the objects between which co-occurrences are to be observed; b) the domain in which the co-occurrence takes place; and c) the strength of a given co-occurrence between a pair of objects in a given domain. In a graphical representation, the objects are the nodes, the domains can be represented by types of edges between the nodes, and the association strength can be represented as the relative thickness of the edges. The network available in our final prototype should be viewed as an incomplete attempt to give users a very general tool for exploring a large variety of possible associations. The generality comes from offering a large choice of objects (the nodes can be hashtags, topics, authors, posting locations and entities, including persons, places and organisations) and a large choice of co-occurrence domains (single documents, documents from the same author, documents about the same topic, documents referring to the same person, etc.). Unfortunately, we ran out of time before we could implement any display of association strength. In the current state, the user can toggle between the small widget view and a fullscreen view. In the widget view, one can choose among preset types of co-occurrence relations. For example, the selection “Co-mentioned hashtags” displays pairs of hashtags that appear at least once in the same document. Initially, the view is centered on one arbitrary hashtag, but the user can change that center at will. Two other presets cover the same kind of document-internal co-occurrence relation, but for topics and entities. The remaining presets tackle more complex kinds of co-occurrence that will not be discussed here. When the user switches to the fullscreen view, the same presets are available, but the user can also select a custom type of co-occurrence among an almost endless variety. For example, a user choosing “topics” as nodes and “same document” as domain would get the same result as one of the presets mentioned above; however, when the domain is changed to “same author”, the counts shift from the number of documents to the number of authors mentioning the two topics. We believe that the approach we have sketched opens up a large array of interesting possibilities, which will be easier to investigate once the quantitative aspect (relative strength of co-occurrences) has been implemented. A minimal sketch of how such a co-occurrence network can be built is given below.
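To make the distinction between objects, domains and strength concrete, here is a minimal sketch that builds a weighted co-mention graph of entities from a set of documents, where the edge weight counts how many documents mention both entities. It is an illustrative sketch only; the prototype's actual implementation and data model are not shown here.

```python
# Illustrative co-occurrence network: nodes are entities, the domain is
# "same document", and the edge weight is the number of documents in which
# both entities are mentioned. A sketch only, not the prototype's implementation.
from collections import Counter
from itertools import combinations

def co_mention_graph(docs):
    """docs: list of records with an 'entities' field (list of entity names)."""
    edges = Counter()
    for d in docs:
        for a, b in combinations(sorted(set(d["entities"])), 2):
            edges[(a, b)] += 1
    return edges

docs = [
    {"entities": ["Aleppo", "ISIS", "Red Crescent"]},
    {"entities": ["Aleppo", "ISIS"]},
    {"entities": ["Damascus", "Red Crescent"]},
]
for (a, b), weight in co_mention_graph(docs).most_common(3):
    print(f"{a} -- {b}: {weight}")
# Changing the grouping key from single documents to, say, documents by the same
# author would change the co-occurrence "domain" without changing the idea.
```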
3 A tour of the technology demonstrator

We are pleased to offer the reader a detailed tour of the demonstrator system that constitutes the main deliverable of our CSSP project. The present report should normally be accompanied by an HTML file named “Tour of the CST prototype system.html”, which contains a user interface to the video, and a rather large file named “CST Tour.webm”, which contains the video itself. All you need to do to embark on the tour is to load “Tour of the CST prototype system.html” in your Web browser. The whole tour takes about 28 minutes, but it is possible to cherry-pick your preferred topics using a set of cue points provided in the user interface. Enjoy the tour!

4 User Contributions and Feedback

The main roles of the user partner as defined in the initial project plan were: 1) to help us define the requirements and functional specifications that our technology should meet; 2) to manually annotate some raw data extracted from social media in order to help us train supervised machine learning algorithms; and 3) to provide feedback on the successive versions of the system prototype developed in the course of the project, so as to help prioritize areas for improvement.

Concerning the first point, a noteworthy contribution from our users was their strong involvement in the so-called “scenario day” meeting that took place on 2 October 2013. While all partners were represented, the user partner sent a 10-strong delegation to discuss the technology requirements for our project. Given the need for the project to work in an unclassified setting, it was agreed on that occasion that we should focus on requirements that would allow us to deal well with real-life “proxy scenarios” (i.e. datasets). It was then agreed that the following datasets presented all the characteristics and complexity of the genuine datasets that are of interest to information analysts:

– The Sochi Olympic Winter Games (February 7-23, 2014);
– The League of Legends online game;
– The civil war in Syria.

Data on those three topics soon started to be extracted from social media and was used for our experiments and demonstrations. In particular, the Syria dataset soon became the core focus for the remainder of the project.

User comments frequently led to revisions in our priority scheme. One of the most important such revisions was to abandon our original plan of working on the Chinese language as well. Given the developing emphasis of the project on the testbed provided by the Syria crisis dataset, the users expressed their preference for the project to concentrate on doing the best possible job on the Arabic and English languages.

During the project, our user partner also manually annotated some data from the Syria crisis dataset to help us evaluate the performance of our technology on that kind of material:

– Evaluate the level of noise in the raw data extracted by MediaMiser (Syria dataset).
– Evaluate the precision of our sentiment and emotion analysis on the Syria dataset.
– Provide English translations for some Arabic documents from the Syria dataset.

Given the available human resources, it was obviously not possible to annotate enough data for training machine learning algorithms. Rather, the data annotated by our users was used for technology evaluation purposes. The three samples described above allowed us to confirm that our technologies were working reasonably well on the Syria dataset.

In the initial project plan, we had assumed that three different versions of the prototype would successively be tested in house by our user partner. Unfortunately, we soon found out that this was impractical: since our system was being developed in an unclassified setting, it was not possible to address our users’ security requirements for in-house technologies in a satisfactory manner. The on-site testing was thus replaced with two different test settings. The first one was a series of demonstrations and hands-on sessions that were organized for representatives of our user partner:

– Task 2 “go/no-go” meeting (11 March 2014);
– Task 3 meeting (28 August 2014);
– Task 4 meeting (12 February 2015);
– Task 5 meeting (4 May 2015);
– Task 6 “final” meeting (28 July 2015).
The second test setting was a permanent one: even though this was not part of the original plan, Thales decided to provide project partners with uninterrupted access to the evolving version of the CST prototype through the Web. This way it was also possible for project partners to give spontaneous demos of the prototype to interested parties who were not official project participants and to collect their feedback. This ongoing availability proved to be a huge asset for participants.

We now turn to synthesizing the feedback our prototype system received from our user partner. Looking at the final version of the prototype, we can say that multiple components were implemented as a result of user feedback:

– Various data filtering, sorting and grouping mechanisms, such as filtering by authors or languages, sorting in chronological order and grouping tweets with their retweets.
– The capability to exclude any geographical area from the search.
– The capability to export sets of results so that they can then be processed using different systems.
– The ability to define persistent search queries, over and above the current working session.

The overall feedback regarding our system was very positive. Basically, we received a strong confirmation of the core hypothesis underlying the whole project, namely that advanced linguistic technology can be harnessed for the benefit of information analysts. In particular, the feedback made it clear that entity extraction, sentiment analysis and machine translation were each extremely valuable. The integration of these linguistic technologies with other technologies also proved highly valuable. For example, users greatly appreciated the unique capabilities of the map widget that Thales incorporated in our prototype. In that respect it was also noted on occasion that some of those non-linguistic technologies could have been pushed further. For example, some users remarked that the timeline widget incorporated by Thales could have been extended so as to better cope with short time intervals. The overall conclusion from the user partner was that our prototype system was on its way to being a high-quality analytic tool and just needs a little more development to reach its full potential.

5 Conclusions

Our CSSP project set out to provide a concrete demonstration of the claim that leading-edge linguistic technologies such as machine translation, summarization, information extraction and sentiment and emotion analysis can be extremely useful to security analysts interested in “big data” monitoring of social media. The goal was not only to integrate those linguistic technologies together, but also to integrate them with several other basic capabilities that were needed in order to provide a realistic testbed: information retrieval from social media, social network analysis, and advanced visualization and user interaction facilities.

In accordance with our plan, a first version of the prototype we were building was demonstrated six months after the beginning of this two-year project. Moreover, even though this was not part of the original plan, Thales (our system integrator) provided the partners with ongoing access to the evolving prototype over the World Wide Web. This greatly facilitated ongoing interaction between system developers and the user partner, which helped steer the project towards the most successful outcome possible. This interaction led to some changes with respect to the original plan.
For example, the idea of covering the Chinese language alongside English and Arabic was abandoned in favor of more sustained work on the two remaining languages. While this and a few other planned capabilities were dropped, many new ones were added. These included the permanently online demo mentioned above, but also a myriad of specific system features such as entity linking, various filtering and sorting devices, persistent search queries, a data export capability, etc.

The partnership worked in a very smooth way: all partners repeatedly expressed their satisfaction with the way the project was unfolding. The most tangible result was a technology demonstrator that has been extensively tested by our user partner and demonstrated to a wider public on many occasions, the last of which was a public event hosted by Borden Ladner Gervais in their downtown Ottawa office on 24 November 2015.

The interested reader is invited to embark on our audio-visual tour of the resulting prototype system by following the indications given in section 3 above. Hopefully, he or she will then come to agree with the final verdict of our user-partner, who declared that our prototype was well “on its way to being a high-quality analytic tool [...]”. In this respect, our industrial partners each have their own plans to make good use of the results of our project for improving or augmenting their respective commercial offerings.

Appendices

A Machine Translation

As mentioned above, NRC has state-of-the-art machine translation (MT) technology. The NRC’s phrase-based MT system is called PORTAGE, and it regularly places at or near the top in international evaluations of the quality of MT system outputs. Another measure of PORTAGE’s standing is that the NRC has twice received substantial funding from DARPA in exchange for its participation in R&D projects (under the DARPA GALE program from 2006 to 2009, and under the DARPA BOLT program from 2012 to 2015). Since PORTAGE learns how to translate from a large collection of bilingual sentence pairs, each consisting of a sentence in the source language and its translation into the target language, a version of PORTAGE can potentially be trained for any language pair for which a sufficiently large number of bilingual sentence pairs can be obtained. In practice, the NRC’s MT group has mainly created versions of PORTAGE for Arabic → English (Arabic-to-English) MT, Chinese → English MT, and English ↔ French (bidirectional English-French) MT.

When the CST project began, NRC already had an Arabic → English version of PORTAGE: the version that tied for first place in the NIST (US National Institute of Standards & Technology) Open MT evaluation of 2012. That by no means ensured that NRC would be able to deliver an MT module that would satisfy the needs of the CST project. There were several potential problems:

– Genre. It is well known among MT experts that no matter what language pair is involved, an MT system trained on one genre of bilingual text (e.g., news stories) will yield very low-quality translations when it is deployed in an environment where it must translate texts from a different genre (e.g., tweets). This problem is particularly acute when one of the two genres is formal and the other informal. The initial Arabic → English PORTAGE system was mainly trained on formal data such as news or quasi-formal genres such as Web forums.
Though the main CST scenarios sometimes involved translating Arabic blogs (a quasi-formal genre), the main task for PORTAGE turned out to be translating Arabic tweets: an informal genre for which there was no bilingual training data whatsoever. The closest genre to tweets in the original training data was 90,104 sentence pairs of SMS/chat data from the BOLT project, but with about 20 million sentence pairs in total, this represented barely 0.4% of the training data. Apart from being much more informal in word choice and syntax than the training data, the data the translation module encountered in its CST deployment included many phenomena, such as hashtags, emoticons and strange spellings (e.g., the Arabic equivalents of AHHHH or Yuckkkkk! in English tweets), never observed in the training data. A genre problem unique to Arabic is Arabizi: the phenomenon, fairly frequent in Arabic social media, of using Roman characters to represent an Arabic word. A given Arabic word may be written several different ways in Arabizi.

– Dialect. This problem is related to, but distinct from, the problem of genre. Educated Arabs use Modern Standard Arabic (MSA) for most written communication and for formal speech. MSA is the only version of Arabic taught in schools; it is derived from Classical Arabic, which has high prestige because it is the language of the Koran. However, much spoken communication occurs in local vernaculars: dialects of Arabic which are often mutually incomprehensible (from a European perspective, they might be considered separate languages). An analogous situation might have arisen in Europe if Romance-language speakers had continued to use Latin for formal, especially written, communication (as was once the case for educated Europeans), but used Portuguese, Spanish, French and so on on a daily basis to talk to people from their own country. We were warned by Arabic experts before we began working on CST that social media were an exception to the general rule that speakers of all the variants of Arabic generally use MSA for written communication: we could expect to see tweets heavily laced with, or entirely consisting of, Tunisian Arabic, Iraqi Arabic, and so on. None of these dialects were represented in the training data available to us. To make matters worse, we were told that dialect words in tweets are often written in Arabizi.

– Throughput. This was an even more serious problem than the two previous ones. NRC’s existing Arabic → English system was designed to perform well in international evaluations of the output quality of MT systems; speed was not a consideration. The focus in building the existing system had been to use every possible technique to get good translations, even if some of these techniques were computationally inefficient. Yet the CST project would fail if the MT module was too slow: for the CST system to be practically useful, the MT module had to chew through several thousand sentences each minute. In initial tests of the previous Arabic → English system in the CST environment, it was intolerably slow: e.g., it took 325 seconds to translate 100 Arabic sentences (roughly 20 sentences per minute). We faced the challenge of speeding up Arabic → English PORTAGE by a factor of at least 10 for it to be practically useful for CST, and we had to achieve this speed-up without compromising translation quality.

In the course of the project, we updated the MT module several times.
We tackled the throughput problem first, since the module would be unusable if it was too slow, and then worked to improve translation quality.

A.1 Improving the throughput of the MT module

The initial MT module (which we’ll call “Version 0”) is shown in Figure 2. The models required to translate Arabic text to English are trained offline on bilingual Arabic-English sentence pairs. In the terminology of the MT community, translation is called “decoding” and is thus carried out by a module called the “decoder”. Arabic is unusual among languages that have had considerable attention devoted to them by the MT community: almost all research groups that work on Arabic as a source language use a software package from outside their group to preprocess Arabic texts. The consensus in the community is that Arabic preprocessing software from Columbia University (called MADA, TOKAN, or MADAMIRA) is essential if you want to build a state-of-the-art system for translating Arabic into other languages (for most other source languages, the details of preprocessing are less important).

Figure 2: Initial MT module. Translation proceeds in two stages, preprocessing and decoding; MADA preprocessing was slow to load but fast afterwards, while the decoder loaded quickly but took nearly 3 seconds per sentence, so translating N sentences required about 25 + 3*N seconds.

As Figure 2 shows, in Version 0, MADA from Columbia University was used to convert Arabic text into “Buckwalterese”, a way of representing Arabic text in the Latin alphabet. MADA does more than this: it splits Arabic words into forms that more closely resemble English words. For instance, in Arabic, the equivalents of articles like “a” and “the” are fused to the following noun. MADA splits fused words of this type into two separate words: e.g., it turns the Arabic versions of “adog” and “thedog” into “a dog” and “the dog” written in Buckwalterese (and carries out several other types of preprocessing as well). Unfortunately, the version of MADA which NRC had permission to deploy in the CST project took 25 seconds to load. Though MADA preprocessing itself was relatively fast, the Version 0 decoder took about 3 seconds to translate each Arabic sentence. When called on to translate a new block of N sentences, Version 0 of PORTAGE therefore took approximately (25 + 3*N) seconds to complete the task.

Figure 3 shows how, by speeding up both Arabic preprocessing and decoding, we were able to turn these (25 + 3*N) seconds for translating N sentences into approximately (1 + 0.1*N) seconds. To achieve a 25-fold speed-up in Arabic preprocessing, we removed the MADA software from the decoding process. We substituted a table, trained in advance using MADA, that maps Arabic words onto their MADA-ized equivalents in Buckwalterese. There is a potential loss of quality here: MADA can preprocess Arabic words it has never seen before, but the map table can only preprocess words that it encountered while it was being trained. This loss can be minimized by training the map table on a large, varied collection of Arabic texts; the training data should be chosen to include Arabic words that are likely to be encountered when the MT module is deployed in a practical application (e.g., the scenarios for CST).
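Here is a minimal sketch of the map-table idea: a dictionary built offline (by running MADA over a large Arabic corpus) replaces the MADA call at translation time, and any word missing from the table is passed through unchanged, which is the out-of-vocabulary case discussed above. The table contents and function names are illustrative assumptions, not the actual PORTAGE code.

```python
# Illustrative sketch of replacing on-line MADA preprocessing with an offline
# word-to-Buckwalterese map table. Entries and names are invented for exposition.

# Built offline by running MADA once over a large, varied Arabic corpus:
# {arabic_word: its MADA-ized, Buckwalter-transliterated form}
MADA_MAP = {
    "وقد": "wqd",
    "نالت": "nAlt",
    "بابوا": "bAbwA",
}

def preprocess(tokens, table=MADA_MAP):
    """Map each Arabic token to its preprocessed form; pass OOV tokens through unchanged."""
    out, oov = [], 0
    for tok in tokens:
        if tok in table:
            out.append(table[tok])
        else:
            out.append(tok)   # unseen word: no MADA-style preprocessing possible
            oov += 1
    return out, oov

processed, oov_count = preprocess(["وقد", "نالت", "غينيا"])
print(processed, oov_count)  # the third token is OOV for this tiny table and passes through
```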
This loss can be minimized by training the map table on a large, varied collection of Arabic texts; the training data should be chosen to include Arabic words that are likely to be encountered when the MT module is deployed in a practical application (e.g., the scenarios for CST).

We then studied the internal workings of the phrase-based decoder and discovered that it was considering far too many possible English translations of each Arabic phrase. By cutting the number of phrase translations considered, we speeded the decoder up by a factor of 15. Then, we adjusted some decoder search parameters (stack size, beam size, etc.) and were able to double decoder speed again at a cost of only about 0.2 BLEU points in output quality (typically, differences in BLEU score of less than 1.0 are not perceptible to MT system users). In total, we thus speeded up the decoder by a factor of 30.

Figure 3: Improved version of MT module, December 2014. (The figure shows Version 1: replacing MADA with a MADA map table made loading of the system 25 times faster, and the changes to the decoder made decoding 30 times faster; translating N sentences now requires roughly 1 + 0.1*N seconds.)

Figure 4 shows the cumulative effect of these changes on the time required to translate a block of sentences.

Figure 4: Speed improvement in MT module. (The figure plots Version 1 against Version 0 timing, both as a function of the number of sentences and as a function of its logarithm.)

A.2 Improving Translation Quality

In early 2015, we turned our attention to improving translation quality. Throughout the CST project, there was a major problem: the lack of bilingual training data for tweets, the main genre targeted by the project. We also expected the lack of dialectal written Arabic in the training data to be a problem. We addressed this problem in three steps: first, we wrote a small set of preprocessing rules for making Arabic tweets more tractable. Second, we acquired a large number of unilingual Arabic tweets that enabled us to retrain the MADA map table, and a small amount of additional bilingual training data that was somewhat closer to the tweet genre than the original training data. Third, we constructed a “dev” tuning set that we expected to resemble tweets to some extent (and that included a small number of translated tweets).

A.2.1 Rules for Handling Tweets

Figures 5 and 6 show how the specialized rules we implemented for handling Arabic tweets improve translation quality. In each figure, the blue arrow on the left points to the translation of the Arabic input prior to implementation of these rules.

Figure 5: Handling of Twitter #hashtags. (The example tweet is tokenized, its hashtag is wrapped in <ht> ... </ht> markup, the decoder translates the sentence and transfers the markup, and the hashtag is then restored, yielding “Oh God, forsake our #Muslim_brothers_everywhere.” rather than “Oh God, forsake our brothers # Muslims _ in _ all the place”.)
In the example in Figure 5, the translation was originally “Oh God, forsake our brothers # Muslims in all the place”; with the rules in place, it became the more understandable “Oh God, forsake our #Muslim brothers everywhere”. In the course of this work, we discovered that in Arabic tweets, words that form part of a hashtag are often used as part of a sentence as well. The English equivalent would be a tweet like “I’m #Angry in San Francisco because it’s so cold today.” Here, the words “angry in San Francisco” do double duty as constituents of a hash tag and as part of a sentence. Because exactly the same phenomenon often occurs in Arabic tweets, we decided on a strategy where hash tags and underscores are ignored during translation, then restored afterwards. This typically results in a more fluent translation.

The example in Figure 6 shows how multiple hash tags, which may involve non-Arabic script, are handled by the rules. Here, the Arabic input contained three hash tags in the Latin alphabet. Version 0 of the MT module generates output where these hash tags are separated from the word sequences they are meant to tag: “Oh, those accumulated weapons ... does not sleep you eye ... # # Gaza Syria gazaunderattack # while”. With the rules in place, the system puts the hash tags in a neat sequence at the end of the translated tweet: “Oh, those accumulated weapons ... does not sleep you eye ... #GazaUnderAttack #Gaza #Syria.”

Figure 6: Handling non-Arabic scripts and multiple hash tags. (The example tweet ends with three Latin-alphabet hash tags; these are marked as non-Arabic script, excluded from translation, and re-attached intact to the English output: “#GazaUnderAttack #Gaza #Syria”.)

A.2.2 Additional Training Data

Recall that by replacing the MADA preprocessing software with a table that maps Arabic text in the input into preprocessed Buckwalterese text, we obtain a large speedup in loading of the MT module. However, there is a cost: quality will go down, because unlike the original MADA software, the map table cannot preprocess Arabic words that aren’t in its training data in a useful way. They will be OOV (“out of vocabulary”) words. Fortunately, the frequency of OOV words encountered when the module is deployed can be reduced by training the MADA map table offline on large Arabic corpora, preferably ones likely to contain words it will encounter under operational conditions.

This section has referred several times to the shortage of bilingual tweet data for training the module’s translation models: Arabic tweets that have been translated into English. But to retrain the MADA map table to reduce the frequency of OOVs, we mainly need unilingual Arabic tweets, and there is no shortage of those. Thus, after implementing the specialized rules for handling tweets described in the previous subsection, the next step we took to improve the quality of the MT module was to retrain the map table on a large number of Arabic tweets (and some other unilingual Arabic data as well). Improving the translation models was much harder, because here we require bilingual tweet-like data.
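Returning for a moment to the hashtag rules of Section A.2.1, a minimal sketch of the wrap-and-restore idea may make it concrete. This is only an illustration: the <ht> markup follows Figure 5, while translate() stands in for the PORTAGE decoder, which must be configured to transfer such markup to its output; the non-Arabic-script handling of Figure 6 is not shown.

    import re

    HASHTAG = re.compile(r"#(\w+)")

    def wrap_hashtags(text):
        # '#word_word' -> '<ht> word word </ht>' so the decoder sees ordinary words.
        return HASHTAG.sub(lambda m: "<ht> " + m.group(1).replace("_", " ") + " </ht>", text)

    def unwrap_hashtags(translation):
        # '<ht> some words </ht>' -> '#some_words' in the English output.
        return re.sub(r"<ht>\s*(.*?)\s*</ht>",
                      lambda m: "#" + m.group(1).replace(" ", "_"),
                      translation)

    # Assumed usage, with translate() as a placeholder for the decoder:
    # english = unwrap_hashtags(translate(wrap_hashtags(arabic_tweet)))

With the markup restored after decoding, the sentence is translated as ordinary running text while the hashtags survive intact.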
We had originally planned to obtain 3,000 or so translated tweets from the end-users, but this proved to be unrealistic. They did supply us with 103 translated tweets, and a native speaker of Arabic we hired (Norah Alkharashi) translated 175 tweets. As will be described shortly, even this small number of bilingual tweets was useful for improving the system. However, with the total bilingual training data consisting of 20 million sentence pairs, adding a few hundred tweets would have no impact at all on the translation models. We therefore incorporated three other resources as training data:

– Inspecting the OOVs that remained after we retrained the MADA map table, we noticed that a high proportion of them were names of people, places, and organizations. We therefore added to the data a set of paired Arabic-English Wikipedia article titles, as this genre is known to be rich in named entities. There were 28K Wikipedia title pairs (roughly 0.1% of the total training data).

– In the course of the BOLT project, Raytheon BBN had hired speakers of two Arabic dialects – Levantine and Egyptian – via Mechanical Turk. These Turkers translated 162K segment pairs (22% Egyptian, 78% Levantine) from weblogs into English. Our BBN colleagues had warned us that the quality of these translations was poor. However, we asked the Arabic speaker we had hired to assess them by looking at a random sample. Though she is from Saudi Arabia, she reported that the Arabic segments were mostly MSA with a few dialect words inserted from time to time, and almost entirely understandable by speakers of other Arabic dialects. The problems were on the English side: though it was in general of acceptable quality, there were mistakes involving idioms, phrasal verbs, verb tenses and verb agreement, as well as spelling mistakes. We decided that the extra coverage of some dialect words would more than make up for some problems with English, and incorporated these 162K pairs in the training data (roughly 0.8% of the total).

– Finally, there was one source of bilingual Arabic-English tweet data available to us. Unfortunately, it proved to be of rather poor quality. It is the result of a project carried out by CMU in which tweets containing both Arabic and English were given to Mechanical Turkers, who were asked to find Arabic sub-segments in each tweet that had a matching English sub-segment (of course, for each tweet the Turker also had the option of indicating that it had no matches). Again, we asked the Arabic native speaker to assess the quality of the CMU data. She found that around 55% of it was bad. In most of the bad pairs, some of the information contained in one of the two texts was missing from the text in the other language. We therefore resorted to a length-based heuristic where we removed pairs in which the Arabic text had at least 1.5 times as many tokens as the English text, or vice versa. This left 28K pairs, which we added to the training data (they constituted roughly 0.1% of it).

At this point in our work, we did not have a set of bilingual tweets on which to measure BLEU (the traditional measure of MT quality). To measure the impact of some of the changes just listed, we therefore measured the OOV rate, which only requires unilingual source data.
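Measuring the OOV rate is a simple counting exercise; the sketch below is only an illustration (the function name is hypothetical, and the vocabulary is taken here to be the set of Arabic tokens covered by the map table and translation models, which is an assumption of this sketch rather than a statement about the deployed module):

    def oov_statistics(tweets, known_vocabulary):
        # Count source-side tokens that the module has never seen in training.
        total, oov = 0, 0
        for tweet in tweets:
            for token in tweet.split():
                total += 1
                if token not in known_vocabulary:
                    oov += 1
        rate = oov / total if total else 0.0
        return oov, rate

Figure 7 reports counts of exactly this kind over a 300-tweet sample.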
As Figure 7 shows, the retraining of the MADA map table yielded a 64% reduction in the OOV rate on a sample of 300 Arabic tweets, and the incorporation of data from the three new corpora – the Wikipedia titles, the BBN dialect weblogs, and the CMU partial tweets – resulted in an overall 73% reduction of OOVs for the sample. Though we believe the implementation of the tweet preprocessing rules significantly improved translation quality, the nature of these rules means they had no impact on the OOV rate.

Figure 7: MT module, Version 2 (February 2015). (Version 2 has tweet preprocessing rules, a retrained MADA map table and three new bilingual training corpora for the translation models: Wikipedia titles, BBN dialect weblogs and CMU partial tweets. Number of OOVs in a sample of 300 tweets: Version 1 (December 2014): 451; with the new map table: 162, a 64% reduction; with the new map table and the new translation models: 120, a 73% reduction overall.)

To make further progress, we needed to construct both a test set (for calculating BLEU) and a “dev” set. The latter requires explanation. An important part of building a statistical MT system is the tuning step, which decides on the weights of the various information sources (the various language models, the various translation models, and so on). These weights are determined on a bilingual set of “dev” sentence pairs; they can have a surprisingly large effect on MT performance.

Though we did not have enough bilingual Arabic-English tweet pairs to use as training data, we were able to construct dev and test sets that split between them the 103 tweets translated for us by the end-users and the 171 tweets translated by our in-house Arabic speaker. These 274 tweet pairs are far too few on their own to constitute either a dev or a test set, so we added to them CMU data (because it comes from tweets) and BBN data (because it is informal and contains some dialect phenomena). The dev set we constructed contains 137 tweet pairs, 488 CMU segment pairs and 478 BBN text pairs; our test set contains 137 tweet pairs, 488 CMU segment pairs and 476 BBN text pairs. We then built two MT modules trained on the training data described above (with some minor differences), tuned both of them on this dev set, and tested them on the test set. The details of the differences in configuration between the two modules, Version 3a and Version 3b, would take up an inappropriate amount of space in this report. As shown on MT Slide 7, both systems have very respectable BLEU scores, with Version 3b scoring about 1.0 BLEU point higher than Version 3a. Version 3b also has fewer OOVs than Version 3a.

A.3 Discussion and Recommendations

Building a module that translates Arabic tweets into English was as challenging as we’d expected, except in one respect. Contrary to our expectation, dialect was not a major issue. According to the native speaker of Arabic we’d hired, there were few tweets among those we collected in the Syria scenario, or among those collected earlier by BBN, that were too dialect-heavy for a reader of MSA from another region to understand. Many tweets were identifiable as being written by someone who speaks a given dialect, but this was typically a case of a few dialect words being sprinkled into text that was mainly in standard MSA. Maybe our Arabic speaker is downplaying the extent of the problem; maybe we dodged the dialect bullet by choosing a topic, Syria, that for some reason does not attract heavily dialectal tweets.
On the other hand, maybe the difficulties posed by dialect to MT systems that translate Arabic social media texts have been exaggerated (perhaps because dialect is a very big problem in spoken Arabic). By contrast, the difficulties posed by the genre of the CST Arabic tweets were fully as serious as we’d expected.

Version 3b of the MT module was the final version deployed in the project. Further algorithmic improvements to the MT module are likely to encounter the phenomenon of diminishing returns: quality will not go up significantly no matter what clever techniques are applied to the current training data. By far the best way of improving the module would be to collect a large number of Arabic tweets, to translate them, and to incorporate the resulting bilingual corpus in the training data for a new system. Building a component that can handle Arabizi would also be very helpful.

B Summarization

The summarization component provides the capability of summarizing input documents of interest. It distills and presents the most important content of the documents. In the case of microblogs such as Twitter, the capability to summarize groups of tweets is potentially very useful. For example, sudden peaks in volume on a given topic are most often caused by specific events. Applying a multi-document summarizer to the set of documents in a given peak would instantly bring such events to light.

B.1 Functionality

Automatic summarization, or simply summarization, is a Natural Language Processing (NLP) technology that automatically generates summaries for documents. More specifically, our summarization component performs extractive summarization for multiple documents. Below we first provide some background on summarization (refer to [Zhu, 2010] for more work in the literature).

– Summarization vs. information retrieval (IR): IR is often set up in a scenario in which users roughly know what they are looking for, and express it by providing queries. Summarization does not assume this, but aims at finding salient, representative information and removing redundant content for a single document or a set of them.

– Single vs. multiple document summarization: single-document summarization generates a summary for each single document (e.g., a news article). Multiple-document summarization generates a summary for a set of documents, e.g., a set of news articles or tweets. The approaches used in these two situations are similar. One major difference is that a multi-document summarizer needs to remove more redundant content.

– Extractive vs. abstractive summarization: extractive summarization selects sentences or larger pieces of text from the original documents and presents them as summaries; abstractive summarization also attempts to rewrite the selected pieces to form more coherent and cohesive summaries. State-of-the-art approaches focus more on extractive summarization, as abstractive summarization is a harder problem and its performance is less reliable. We focus on extractive summarization.

In brief, our summarization component performs multi-document, extractive summarization for the input documents of interest.

B.2 Interface

Conceptually, the input and output of our summarization component are straightforward: the summarizer takes in a set of documents and outputs important excerpts. The interface is shown in Figure 8.

Figure 8: A user can drag a selection box to select tweets under his or her concern, to generate a summary.
In the figure, two spikes of tweets between September 9th and 11th were selected.

A user can use a selection box to choose tweets in a time period of interest. In the figure, two spikes of tweets between September 9th and 11th are selected. The user can then click the “summarize” button at the middle-top of the figure to request a summary. The summarization component then takes some time (often a number of seconds) to generate the summary; the amount of time depends on the number of documents selected. Once the summary is ready, an “Open Summarization” button, shown in Figure 9, is displayed. Users can then click the “Open Summarization” button to read the summary.

Figure 9: Once the summary is ready, an “Open Summarization” button is shown (upper figure). Users can then click the “Open Summarization” button to read the summaries (lower figure).

B.3 Summarization Algorithm

Our model is built on a summarization algorithm called Maximal Marginal Relevance (MMR) [Carbonell and Goldstein, 1998, Zhu, 2010]. The reason for choosing MMR is two-fold. First, compared with more complicated models such as graph-based models, MMR is computationally efficient, which accords well with our need to handle large document sets within a reasonable amount of time. In addition, MMR is an unsupervised summarization model and does not require human-annotated training data.

MMR builds a summary by selecting summary units iteratively. A summary unit is a tweet when we have a collection of tweets, or a sentence when we have a set of news articles. More specifically, in each round of selection, MMR selects into the summary a unit that is most similar to the documents to be summarized, but most dissimilar to the previously selected units, so as to avoid redundant content. This is repeated until the summary reaches the predefined length (20 units in the current version, but the length can easily be changed). MMR uses the following equation to determine the next unit to be selected:

\text{next unit} = \arg\max_{U_i \in D \setminus S} \Big( \lambda \, \mathrm{sim}_1(D, U_i) - (1 - \lambda) \max_{U_j \in S} \mathrm{sim}_2(U_j, U_i) \Big)   (1)

As shown in Formula 1, the sim1 term calculates the similarity between a unit Ui and the set of documents D. The assumption is that a unit with a higher sim1 represents the content of the document set better. The sim2 term calculates the similarity between a candidate summary unit Ui and a unit Uj already in the summary S. Accordingly, max(sim2(·)) is the largest sim2 score between the candidate unit and the units already in the summary. The assumption is that a unit with a higher max(sim2(·)) score is more redundant with respect to the units already in the summary; this unit should therefore receive a penalty. The similarity scores sim1 and sim2 are calculated as the cosine similarity between the corresponding units discussed above. The parameter λ is used to linearly combine sim1 and max(sim2(·)). A unit is selected into the summary if it maximizes the combined score. The value of λ has been set in our code, but it can be adjusted.

C Information Extraction

For the information extraction component of the CST project, we focused on two technologies: named entity recognition and entity linking. The majority of our effort went into improving the state of the art in entity recognition for social media texts, while we employed a known technique for entity linking. We describe both solutions in detail below.
C.1 Named Entity Recognition

Named entity recognition (NER) is the task of finding rigid designators as they appear in free text and assigning them to coarse types [Nadeau and Sekine, 2007]. For the CST project, we recognize the types person, location and organization, as illustrated in Figure 10. NER is the first step in many information extraction pipelines, but it is also useful in its own right. It provides a form of keyword spotting, allowing us to highlight terms that are likely to be important in a text item. Furthermore, it allows the system operator to organize items by the entities they contain, and to collect statistics over mentions of specific entities.

There is considerable excitement at the prospect of porting information extraction technology to social media platforms such as Twitter. Social media reacts to world events faster than traditional news sources, and its sub-communities pay close attention to topics that other sources might ignore. An early example of the potential inherent in social information extraction is the Twitter Calendar [Ritter et al., 2012], which detects upcoming events (concerts, elections, video game releases, etc.) based on the anticipatory chatter of Twitter users. Unfortunately, processing social media text presents a unique set of challenges, especially for technologies designed for newswire: Twitter posts are short, the language is informal, capitalization is inconsistent, and spelling variations and abbreviations run rampant. Tools that perform quite well on newspaper articles can easily fail when applied to social media.

Our approach assumes a single message as input, with no accompanying meta-data regarding the user posting the message or the date it was posted. We are then tasked with finding each mention of a concrete person, location or organization within that message. The location of each such mention is indicated by a tag, as shown in Figure 10, where the “O” tag is given special status to indicate the lack of an entity.

Figure 10: An example of semi-Markov tagging.

Our training data takes the form of tweets (in-domain) and sentences from newspaper stories (out-of-domain), both of which have been tagged by humans. We then train a supervised machine learning algorithm that can replicate the human tags on its training data, and generalize to produce reasonable tags on data it has never seen before. We use held-out, human-labeled data as a test set to measure the accuracy of our tagger on previously unseen tweets, which allows us to determine how well the system has generalized from its training data.

Armed with an affordable training set of 1,000 human-annotated tweets, we establish a strong system for Twitter NER using a novel combination of well-understood techniques. We build two unsupervised word representations in order to leverage a large collection of unannotated tweets, while a data-weighting technique allows us to benefit from annotated newswire data that is outside of the Twitter domain. Taken together, these two simple ideas establish a new state of the art on both our test sets. We rigorously test the impact of both continuous and cluster-based word representations on Twitter NER, emphasizing the dramatic improvement that they bring.

C.1.1 Data

Vital statistics for all of our data sets are shown in Table 1.
For in-domain NER data, we use three collections of annotated tweets: Fin10 was originally crowd-sourced by [Finin et al., 2010] and was manually corrected by [Fromreide et al., 2014], while Rit11 [Ritter et al., 2011] and Fro14 [Fromreide et al., 2014] were built by expert annotators. We divide Fin10 temporally into a training set and a development set, and we consider Rit11 and Fro14 to be our test sets. This reflects a plausible training scenario, with train and dev drawn from the same pool, but with distinct tests drawn from later in time. These three data sets were collected and unified by [Plank et al., 2014], who normalized the tags into three entity classes: person (PER), location (LOC) and organization (ORG). The source text has also been normalized; notably, all numbers are normalized to NUMBER, and all URLs and Twitter @user names have been normalized to URL and @USER respectively.

We use the CoNLL 2003 newswire training set as a source of out-of-domain NER annotations [Tjong Kim Sang and De Meulder, 2003]. This serves two purposes: first, it provides a large supply of out-of-domain training data. Second, it allows us to illustrate the huge gap in performance that occurs when applying newswire tools to social media. The source text has been normalized to match the Twitter NER data, and we have removed the MISC tag from the gold standard, leaving PER, LOC and ORG.

We use unannotated tweets to build various word representations. Our unannotated corpus contains 98M tweets (1,995M tokens) from between May 2011 and April 2012. These tweets have been tokenized and post-processed to remove many special Unicode characters. Furthermore, the corpus consists only of tweets in which the NER system of [Ritter et al., 2011] detects at least one entity. The automatic NER tags are used only to select tweets for inclusion in the corpus, after which the annotations are discarded. Filtering our tweets in this way has two immediate effects: first, each tweet is very likely to contain an entity mention and is therefore more useful to our unsupervised techniques. Second, the tweets are longer and seem to be more grammatical than tweets drawn at random.

Data               Lines    Types    Tokens     # PER   # LOC   # ORG
Fin10 (Train)      1,000    4,865    17,276     192     143     172
Fin10Dev (Test)    1,975    7,734    33,770     325     279     287
Rit11 (Test)       2,394    8,686    46,469     454     377     280
Fro14 (Test)       1,545    5,392    20,666     390     163     200
CoNLL (Train)      14,041   20,752   203,621    6,601   7,142   6,322
Unlabeled tweets   98M      57M      1,995M     –       –       –

Table 1: Details of our NER-annotated corpora. A line is a tweet in Twitter and a sentence in newswire.

C.1.2 Methods

Below, we briefly summarize how we train a tagger to locate entities in tweets. In this framework, a complete tag sequence for an input tweet is represented as a bag of features. The learning component learns weights on these features so that good tag sequences receive higher scores than bad tag sequences. We call these weights the model. The tagging component uses dynamic programming to search the very large space of possible tag sequences for the highest-scoring sequence according to our model. Therefore, the framework can be specified modularly by describing the tagger, the learner and the features. As a rule of thumb, the quality of the features generally determines how well a system can generalize. More details can be found in [Cherry and Guo, 2015, Cherry et al., 2015].
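To make the division of labour between the learner and the tagger concrete, the toy sketch below scores a candidate tag sequence as a weighted bag of features. It is an illustration only, under simplifying assumptions: it uses two made-up feature templates and word-level tags, whereas the actual tagger is semi-Markov and relies on the much richer feature set described next.

    from collections import Counter

    def features(tokens, tags):
        """Toy feature extractor: (word, tag) and 3-character-suffix features only."""
        feats = Counter()
        for word, tag in zip(tokens, tags):
            feats[("word", word.lower(), tag)] += 1
            feats[("suffix3", word[-3:].lower(), tag)] += 1
        return feats

    def score(tokens, tags, weights):
        """Score of a candidate tag sequence: dot product of its feature bag with the model."""
        return sum(weights.get(f, 0.0) * count
                   for f, count in features(tokens, tags).items())

    # The learner adjusts `weights` so that the human-annotated sequence outscores
    # its competitors; the tagger uses dynamic programming to find the
    # highest-scoring sequence rather than enumerating all of them.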
Tagger: We tag each tweet independently using a semi-Markov tagger [Sarawagi and Cohen, 2004], which tags phrasal entities using a single operation, as opposed to traditional word-based entity tagging schemes. An example tag sequence, drawn from one of our test sets, is shown in Figure 10. Semi-Markov tagging gives us the freedom to design features at either the phrase or the word level, while also simplifying our tag set. Furthermore, with our semi-Markov tags, we find we have no need for Markov features that track previous tag assignments, as our entity labels cohere naturally. This speeds up tagging dramatically.

Learner: Our tagger is trained online with large-margin updates, following a structured variant of the passive-aggressive (PA) algorithm [Crammer et al., 2006]. We regularize the model both with early stopping and by using PA’s regularization term C, which is similar to that of an SVM. We also have the capacity to deploy example-specific C-parameters, allowing us to assign some examples more weight during training. This is useful when combining Twitter training sets with newswire training sets.

Lexical Features: Recall that our semi-Markov model allows for both word- and phrase-level features. The vast majority of our features are word-level, with the representation for a phrase being the sum of the features of its words. Our word-level features closely follow the set proposed by [Ratnaparkhi, 1996], covering word identity, the identities of surrounding words within a window of 2 tokens, and prefixes and suffixes up to three characters in length. Each word identity feature has three variants, with the first reporting the original word, the second reporting a lowercased version, and the third reporting a summary of the word’s shape (“Mrs.” becomes “Aa.”). All word-level features also have a variant that summarizes the word’s position within its entity. Our phrase-level features report phrase identity, with lowercased and word-shape variants, along with a bias feature that is always on. Phrase identity features allow us to memorize tags for common phrases explicitly. Following the standard discriminative tagging paradigm, all features have the tag identity appended to them.

Representation Features: We also produce word-level features corresponding to a number of external representations: gazetteer membership, Brown clusters [Brown et al., 1992] and word embeddings. These features are intended to help connect the words in our training data to previously unseen words with similar representations.

Gazetteers are lists of words and phrases that share specific properties. In this project, we use a number of word lists covering common entity types like people, films, jobs and nationalities. These were derived from a number of open sources by researchers at the University of Illinois [Ratinov and Roth, 2009]. To create features from these gazetteers, we first segment the tweet into longest matching gazetteer phrases, resolving overlapping phrases with a greedy left-to-right walk through the tweet. Each word then generates a set of features indicating which gazetteers (if any) include its phrase.

Brown clusters are deterministic word clusters learned using a language-modeling objective. Each word maps to exactly one cluster, and similar words tend to be mapped to the same clusters. For cluster representations, we train Brown clusters on our unannotated corpus, using the implementation by [Liang, 2005] to build 1,000 clusters over types that occur with a minimum frequency of 10.
Following [Miller et al., 2004], each word generates indicators for bit prefixes of its binary cluster signature, for prefixes of length 2, 4, 8 and 12.

Word embeddings are continuous representations of words, also learned using a language-modeling objective. Each word is mapped to a fixed-size vector of real numbers, such that similar words tend to be given similar vectors. For word embeddings, we use an in-house Java reimplementation of word2vec [Mikolov et al., 2013] to build 300-dimensional vector representations for all types that occur at least 10 times in our unannotated corpus. Each word then reports a real-valued feature (as opposed to an indicator) for each of the 300 dimensions in its vector representation. A single random vector is created to represent all out-of-vocabulary words. Our vectors and clusters cover 2.5 million types.

Note that Brown clusters and word vectors are both trained using language-modeling objectives on our large corpus of 98M unannotated tweets, making their use an instance of semi-supervised learning. In contrast, gazetteers are either constructed by experts or extracted from Wikipedia categories.

C.1.3 Experiments

We have two scientific papers on NER that outline a rich set of experiments to help understand how various versions of our NER system perform [Cherry and Guo, 2015, Cherry et al., 2015]. For the purposes of this report, we felt it would be illustrative to condense these experiments down to three questions:

1. How does a newswire system perform on Twitter data?
2. How effective was our attempt to modify our newswire system for Twitter data?
3. How well does our system perform on the various entity types on Twitter data?

We will briefly answer these three questions in turn. In all experiments we report F-measure (F1), which combines two other metrics: precision and recall. Let #right count the number of entities extracted by our system that match the human-labeled gold standard exactly, and let #sys be the number of entities found by our system, that is, the sum of both correct and incorrect entities. Precision measures the percentage of system entities that are correct:

\mathrm{prec} = \frac{\#\mathrm{right}}{\#\mathrm{sys}}   (2)

Let #gold be the number of entities found in the human-labeled gold-standard annotation. Recall measures the percentage of gold entities that are extracted by the system:

\mathrm{rec} = \frac{\#\mathrm{right}}{\#\mathrm{gold}}   (3)

Finally, F-measure is the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot \mathrm{prec} \cdot \mathrm{rec}}{\mathrm{prec} + \mathrm{rec}}   (4)

Newswire versus social media. In our first experiment, we train a system on our CoNLL newswire training set, and test on both held-out newswire data (CoNLL Test) and on our held-out Twitter data. The version of our system that we test here corresponds to the NRC’s entity recognizer before the CST project began. This recognizer uses all our lexical features, but has no representation features. The results are shown in Table 2. As one can see, there is a huge divide between the social media tests and the newswire tests. The NRC tagger was quite well-suited to news, but handled social media very poorly.

System                  CoNLL Test   Rit11   Fro14
NRC 0.0 (CoNLL only)    84.3         27.1    29.4

Table 2: A system trained only on newswire data, tested on newswire (CoNLL) and social media data (Rit11, Fro14). Reporting F1.

Progression of the NRC Twitter NER system. In the next comparison, we coarsely map our progress throughout the CST project, shown in Table 3. The first and most obvious thing to do was to begin training on in-domain Twitter data.
We used 1,000 tweets from our Fin10 set as training data, creating NRC 1.0. In transitioning from NRC 0.0 to 1.0, we trade data volume for data quality, replacing 14K out-of-domain sentences with 1K in-domain tweets. Note that this system does not use CoNLL newswire data, as we had not yet developed the data-weighting algorithm that allowed us to effectively combine two drastically different data sources [Cherry and Guo, 2015]. Next, we obtained our corpus of 98M unlabeled tweets, and set out to make use of it. Our first attempt involved training word vectors on this corpus with word2vec, creating word representation features for our system. This resulted in NRC 2.0, which performs substantially better. Finally, NRC 3.0 adds Brown clusters and gazetteers, and incorporates the CoNLL newswire data. It represents our strongest system, with F1 figures having doubled with respect to NRC 1.0.

System                                        Rit11   Fro14
NRC 0.0 (CoNLL only)                          27.1    29.4
NRC 1.0 (Fin10 only)                          29.0    30.4
NRC 2.0 (1.0 + word vectors)                  56.4    58.4
NRC 3.0 (2.0 + CoNLL, clusters, gazetteers)   59.9    63.2

Table 3: The progression of the NRC named entity recognizer throughout the CST project. Reporting F1.

Performance by entity type. Finally, we report the per-entity performance of the NRC 3.0 system in Table 4. For both test sets, ORG is the most difficult. ORG is perhaps the broadest entity class, which makes it difficult to tag, as it is rife with subtle distinctions: bands (ORG) versus musicians (PER); companies (ORG) versus their products (O); and sports teams (ORG) versus their home cities (LOC). We also suspect that our word representation features are not as well suited to organizations as they are to people and locations. These results also show that we are actually performing much better on the person and location classes than our aggregate scores suggest, as the system’s total performance is dragged down substantially by its difficulty with organizations.

Test Set   PER    LOC    ORG
Rit11      70.8   61.9   36.9
Fro14      69.4   70.2   42.6

Table 4: F1 for our final system, organized by entity class.

C.1.4 Discussion

We have presented a brief overview of the NRC’s named entity recognition system for social media. It is characterized by its small in-domain training set, and by its extensive use of semi-supervised word representations constructed from large pools of unlabeled data. This is the information extraction component that consumed the bulk of our efforts, but those efforts were well placed, as they resulted in a stronger CST system, while also having a substantial impact on the scientific community.

C.2 Entity Linking

Entity linking is the task of resolving entity mentions found in text to specific real-world entities in some background database. For the CST project, we solved the following variant of the problem: for each entity detected by the NRC NER system, find its corresponding Wikipedia page, if one exists. This effort came late in the project, and we were unable to obtain labeled in-domain training and test data for this task. As such, we will briefly describe what we did, but we will provide no evaluation of this approach.

We obtained a massive dictionary that connects English Wikipedia concepts to the HTML anchor text of other web pages, as collected throughout the web in a 2011 Google crawl [Spitkovsky and Chang, 2012].
For example, someone linking to the Barack Obama Wikipedia page from their own web page may use the anchor text “Barack Obama,” variants such as “Barack Hussein Obama II,” misspellings such as “Barak Obama,” or a nickname like “the Obamanation.” Harvesting all of these from throughout the web can provide a very reliable dictionary of all the ways one can express each Wikipedia concept in text. Furthermore, the frequency with which each possible anchor phrase links to a particular Wikipedia page helps disambiguate ambiguous phrases, such as “George Clinton,” which is more likely to refer to the 1970s funk musician than to the 1800s US vice president. This dictionary alone has been shown to establish an extremely high baseline for the Wikipedia entity-linking task [Spitkovsky and Chang, 2011].

Given this dictionary, one can theoretically take any entity detected by our NER system, and return the most frequently linked page found in the dictionary for that phrase. Most of the challenges with this approach came from the dictionary’s sheer size and its noisy nature. The dictionary consists of 297,073,139 associations, mapping 175,100,788 unique strings to related English Wikipedia articles. It needed to be pruned substantially to be used efficiently: we pruned according to thresholds on a phrase-page pair’s raw frequency, as well as on the probability of a page given the phrase. Once the dictionary was reduced to a more manageable size, we loaded it into main memory as a sorted array, and searched it with binary search, trading time efficiency for memory efficiency. As mentioned above, the accuracy of this system was never formally measured, but spot checks indicated that it has good coverage, and that it does a good job of unifying different mentions of the same person.

D Sentiment & Emotion Analysis

In this section, we describe how we created a state-of-the-art SVM classifier to detect the sentiment of tweets. The sentiment can be one of three possibilities: positive, negative, or neutral. We originally developed these classifiers to participate in an international competition organized by the Conference on Semantic Evaluation Exercises (SemEval-2013) [Wilson et al., 2013] (http://www.cs.york.ac.uk/semeval-2013/task2). The organizers created and shared sentiment-labeled tweets for training, development, and testing. The competition, officially referred to as Task 2: Sentiment Analysis in Twitter, had more than 40 participating teams. Our submissions ranked first, obtaining a macro-averaged F-score of 69.02.

We implemented a number of surface-form, semantic, and sentiment features. We also generated two large word–sentiment association lexicons, one from tweets with sentiment-word hashtags, and one from tweets with emoticons. The automatically generated lexicons were particularly useful: in the message-level task for tweets, they alone provided a gain of more than 5 F-score points over and above that obtained using all other features. The lexicons are made freely available (www.purl.com/net/sentimentoftweets).

The emotion classification system used in this project follows the same architecture as our sentiment analysis system. That system classifies tweets according to whether they express anger or no anger, fear or no fear, dislike or no dislike, surprise or no surprise, joy or no joy, and sadness or no sadness.

D.1 Sentiment Lexicons

Sentiment lexicons are lists of words with associations to positive and negative sentiments.
D.1.1 Existing, Automatically Created Sentiment Lexicons

The manually created lexicons we used include the NRC Emotion Lexicon [Mohammad and Turney, 2010, Mohammad and Yang, 2011] (about 14,000 words), the MPQA Lexicon [Wilson et al., 2005] (about 8,000 words), and the Bing Liu Lexicon [Hu and Liu, 2004] (about 6,800 words).

D.1.2 New, Tweet-Specific, Automatically Generated Sentiment Lexicons

NRC Hashtag Sentiment Lexicon: Certain words in tweets are specially marked with a hashtag (#) to indicate the topic or sentiment. [Mohammad, 2012] showed that hashtagged emotion words such as joy, sadness, angry, and surprised are good indicators that the tweet as a whole (even without the hashtagged emotion word) is expressing the same emotion. We adapted that idea to create a large corpus of positive and negative tweets: we polled the Twitter API every four hours from April to December 2012 in search of tweets with either a positive-word hashtag or a negative-word hashtag. A collection of 78 seed words closely related to positive and negative, such as #good, #excellent, #bad, and #terrible, was used (32 positive and 36 negative). These terms were chosen from the entries for positive and negative in Roget’s Thesaurus.

A set of 775,000 of these tweets was used to generate a large word–sentiment association lexicon. A tweet was considered positive if it had one of the 32 positive hashtagged seed words, and negative if it had one of the 36 negative hashtagged seed words. The association score for a term w was calculated from these pseudo-labeled tweets as shown below:

\mathrm{score}(w) = \mathrm{PMI}(w, \mathrm{positive}) - \mathrm{PMI}(w, \mathrm{negative})   (5)

where PMI stands for pointwise mutual information. A positive score indicates association with positive sentiment, whereas a negative score indicates association with negative sentiment. The magnitude is indicative of the degree of association. The final lexicon, which we will refer to as the NRC Hashtag Sentiment Lexicon, has entries for 54,129 unigrams and 316,531 bigrams. Entries were also generated for unigram–unigram, unigram–bigram, and bigram–bigram pairs that were not necessarily contiguous in the tweet corpus. Pairs with certain punctuation marks, ‘@’ symbols, and some function words were removed. The lexicon has entries for 308,808 non-contiguous pairs.

Sentiment140 Lexicon: The Sentiment140 corpus [Go et al., 2009] is a collection of 1.6 million tweets that contain positive and negative emoticons. The tweets are labeled positive or negative according to the emoticon. We generated a sentiment lexicon from this corpus in the same manner as described above. This lexicon has entries for 62,468 unigrams, 677,698 bigrams, and 480,010 non-contiguous pairs.

D.2 Task: Automatically Detecting the Sentiment of a Message

The objective of this task is to determine whether a given message is positive, negative, or neutral.

D.2.1 Classifier and features

We trained a Support Vector Machine (SVM) [Fan et al., 2008] on the training data provided. The SVM is a state-of-the-art learning algorithm that has proved effective on text categorization tasks and is robust on large feature spaces. The linear kernel and the value of the parameter C = 0.005 were chosen by cross-validation on the training data. We normalized all URLs to http://someurl and all userids to @someuser. We tokenized and part-of-speech tagged the tweets with the Carnegie Mellon University (CMU) Twitter NLP tool [Gimpel et al., 2011].
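Before listing the features, note that Equation 5, applied to unigrams, reduces to a few lines of counting over the pseudo-labeled tweets. The sketch below is illustrative only: the function name and data layout are hypothetical, it uses unsmoothed token-level counts, and it scores only words observed with both labels (how rare or one-sided words are handled in the released lexicons is not specified here).

    import math
    from collections import Counter

    def build_lexicon(labelled_tweets):
        """labelled_tweets: iterable of (tokens, label), label in {'positive', 'negative'},
        where the labels are pseudo-labels obtained from seed hashtags or emoticons.
        Returns {token: score} with score(w) = PMI(w, positive) - PMI(w, negative)."""
        counts = {"positive": Counter(), "negative": Counter()}
        tokens_per_label = Counter()
        for tokens, label in labelled_tweets:
            counts[label].update(tokens)
            tokens_per_label[label] += len(tokens)
        n = tokens_per_label["positive"] + tokens_per_label["negative"]
        lexicon = {}
        for w in set(counts["positive"]) & set(counts["negative"]):
            p_w = (counts["positive"][w] + counts["negative"][w]) / n
            pmi = {}
            for label in ("positive", "negative"):
                p_w_label = counts[label][w] / n       # P(w, label)
                p_label = tokens_per_label[label] / n  # P(label)
                pmi[label] = math.log2(p_w_label / (p_w * p_label))
            lexicon[w] = pmi["positive"] - pmi["negative"]
        return lexicon

The same procedure, applied to the emoticon-labeled Sentiment140 corpus, yields the second lexicon.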
Each tweet was represented as a feature vector made up of the following groups of features:

– word ngrams: presence or absence of contiguous sequences of 1, 2, 3, and 4 tokens, and of non-contiguous ngrams (ngrams with one token replaced by *);

– character ngrams: presence or absence of contiguous sequences of 3, 4, and 5 characters;

– all-caps: the number of words with all characters in upper case;

– POS: the number of occurrences of each part-of-speech tag;

– hashtags: the number of hashtags;

– lexicons: the following sets of features were generated for each of the three manually constructed sentiment lexicons (NRC Emotion Lexicon, MPQA, Bing Liu Lexicon) and for each of the two automatically constructed lexicons (Hashtag Sentiment Lexicon and Sentiment140 Lexicon). Separate feature sets were produced for unigrams, bigrams, and non-contiguous pairs. The lexicon features were created for all tokens in the tweet, for each part-of-speech tag, for hashtags, and for all-caps tokens. For each token w and emotion or polarity p, we used the sentiment/emotion score score(w, p) to determine:
  ∗ the total count of tokens in the tweet with score(w, p) > 0;
  ∗ the total score, \sum_{w \in \mathrm{tweet}} score(w, p);
  ∗ the maximal score, \max_{w \in \mathrm{tweet}} score(w, p);
  ∗ the score of the last token in the tweet with score(w, p) > 0;

– punctuation:
  ∗ the number of contiguous sequences of exclamation marks, question marks, and both exclamation and question marks;
  ∗ whether the last token contains an exclamation or question mark;

– emoticons: the polarity of an emoticon was determined with a regular expression adopted from Christopher Potts’ tokenizing script (http://sentiment.christopherpotts.net/tokenizing.html):
  ∗ presence or absence of positive and negative emoticons at any position in the tweet;
  ∗ whether the last token is a positive or negative emoticon;

– elongated words: the number of words with one character repeated more than two times, for example, ‘soooo’;

– clusters: the CMU POS-tagging tool provides the token clusters produced with the Brown clustering algorithm on 56 million English-language tweets. These 1,000 clusters serve as an alternative representation of tweet content, reducing the sparsity of the token space:
  ∗ the presence or absence of tokens from each of the 1,000 clusters;

– negation: the number of negated contexts. Following [Pang et al., 2002], we defined a negated context as a segment of a tweet that starts with a negation word (e.g., no, shouldn’t) and ends with one of the punctuation marks ‘,’, ‘.’, ‘:’, ‘;’, ‘!’, ‘?’. A negated context affects the ngram and lexicon features: we add a ‘_NEG’ suffix to each word following the negation word (‘perfect’ becomes ‘perfect_NEG’). The ‘_NEG’ suffix is also added to polarity and emotion features (‘POLARITY_positive’ becomes ‘POLARITY_positive_NEG’). The list of negation words was adopted from Christopher Potts’ sentiment tutorial (http://sentiment.christopherpotts.net/lingstruc.html).

We trained the SVM classifier on the set of 9,912 annotated tweets (8,258 in the training set and 1,654 in the development set). We applied the model to the previously unseen tweets gathered as part of the CST system.

References

[Brown et al., 1992] Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

[Carbonell and Goldstein, 1998] Carbonell, J. G. and Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries.
In Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336.

[Cherry and Guo, 2015] Cherry, C. and Guo, H. (2015). The unreasonable effectiveness of word representations for Twitter named entity recognition. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 735–745, Denver, Colorado. Association for Computational Linguistics.

[Cherry et al., 2015] Cherry, C., Guo, H., and Dai, C. (2015). NRC: Infused phrase vectors for named entity recognition in Twitter. In Proceedings of the Workshop on Noisy User-generated Text, pages 54–60, Beijing, China. Association for Computational Linguistics.

[Crammer et al., 2006] Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online passive-aggressive algorithms. The Journal of Machine Learning Research, 7:551–585.

[Fan et al., 2008] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

[Finin et al., 2010] Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., and Dredze, M. (2010). Annotating named entities in Twitter data with crowdsourcing. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 80–88.

[Fromreide et al., 2014] Fromreide, H., Hovy, D., and Søgaard, A. (2014). Crowdsourcing and annotating NER for Twitter #drift. In LREC, pages 2544–2547, Reykjavik, Iceland.

[Gimpel et al., 2011] Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., and Smith, N. A. (2011). Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

[Go et al., 2009] Go, A., Bhayani, R., and Huang, L. (2009). Twitter sentiment classification using distant supervision. In Final Projects from CS224N for Spring 2008/2009 at the Stanford Natural Language Processing Group.

[Goutte et al., 2014] Goutte, C., Léger, S., and Carpuat, M. (2014). The NRC system for discriminating similar languages. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pages 139–145, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

[Hu and Liu, 2004] Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 168–177, New York, NY, USA. ACM.

[Liang, 2005] Liang, P. (2005). Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology.

[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In ICLR Workshop.

[Miller et al., 2004] Miller, S., Guinness, J., and Zamanian, A. (2004). Name tagging with word clusters and discriminative training. In HLT-NAACL, pages 337–342.

[Mohammad and Yang, 2011] Mohammad, S. and Yang, T. (2011). Tracking sentiment in mail: How genders differ on emotional axes. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2011), pages 70–79, Portland, Oregon.

[Mohammad, 2012] Mohammad, S. M. (2012). #Emotional tweets.
In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval ’12, pages 246–255, Stroudsburg, PA.

[Mohammad and Turney, 2010] Mohammad, S. M. and Turney, P. D. (2010). Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL-HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, LA, California.

[Nadeau and Sekine, 2007] Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

[Pang et al., 2002] Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 79–86, Philadelphia, PA.

[Plank et al., 2014] Plank, B., Hovy, D., McDonald, R., and Søgaard, A. (2014). Adapting taggers to Twitter with not-so-distant supervision. In COLING, pages 1783–1792, Dublin, Ireland.

[Plutchik, 1962] Plutchik, R. (1962). The Emotions. New York: Random House.

[Ratinov and Roth, 2009] Ratinov, L. and Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In CoNLL, pages 147–155.

[Ratnaparkhi, 1996] Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In EMNLP, pages 133–142.

[Ritter et al., 2011] Ritter, A., Clark, S., Mausam, and Etzioni, O. (2011). Named entity recognition in tweets: An experimental study. In EMNLP, pages 1524–1534, Edinburgh, Scotland, UK.

[Ritter et al., 2012] Ritter, A., Mausam, Etzioni, O., and Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 1104–1112, New York, NY, USA. ACM.

[Sarawagi and Cohen, 2004] Sarawagi, S. and Cohen, W. W. (2004). Semi-Markov conditional random fields for information extraction. In NIPS, pages 1185–1192.

[Spitkovsky and Chang, 2011] Spitkovsky, V. I. and Chang, A. X. (2011). Strong baselines for cross-lingual entity linking. In Proceedings of the Fourth Text Analysis Conference (TAC 2011), Gaithersburg, Maryland, USA.

[Spitkovsky and Chang, 2012] Spitkovsky, V. I. and Chang, A. X. (2012). A cross-lingual dictionary for English Wikipedia concepts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey.

[Tjong Kim Sang and De Meulder, 2003] Tjong Kim Sang, E. F. and De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL, pages 142–147.

[Wilson et al., 2013] Wilson, T., Kozareva, Z., Nakov, P., Rosenthal, S., Stoyanov, V., and Ritter, A. (2013). SemEval-2013 Task 2: Sentiment analysis in Twitter. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’13, Atlanta, Georgia, USA.

[Wilson et al., 2005] Wilson, T., Wiebe, J., and Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, pages 347–354, Stroudsburg, PA, USA.

[Zhu, 2010] Zhu, X. (2010). Summarizing Spoken Documents Through Utterance Selection. PhD thesis, Department of Computer Science, University of Toronto.
[Zhu et al., 2013] Zhu, X., Cherry, C., Kiritchenko, S., Martin, J., and de Bruijn, B. (2013). Detecting concept relations in clinical text: Insights from a state-of-the-art model. Journal of Biomedical Informatics, 46:275–285.