HORIZON 2020 ICT - Information and Communication Technologies
Promoting Financial Awareness and Stability
H2020 - 687895

Deliverable D2.2: Data and Information Streams - Assessment Tools

Work package no.: WP2
Work package title: Linked Data Life Cycle
Task no.: T2.1
Task title: Data Harvesting, Extraction and Assessment
Milestone no.: 2
Organization name of lead contractor for this deliverable: SWC
Editor: Artem Revenko (SWC)
Contributors: Heidelinde Hobel (SWC), Ioannis Pragidis, Eirini Karapistoli (DUTH), George Panos, Christoforos Bouzanis (UOG), Anna Satsiou, Ioannis Kompatsiaris (CERTH)
Reviewers: Antonis Sarigiannidis (DUTH), Peter Hanecak (EEA)
Status: F (final)
Nature: R - Report
Dissemination level: PU - Public
Project start date and duration: January 2016, 36 months
Due date of deliverable: September 30, 2016
Actual submission date: September 30, 2016

Revision History
0.1 (11-07-2016, A. Revenko, SWC): first version of the ToC
0.2 (22-07-2016, H. Hobel, SWC): drafted executive summary, introduction, conclusion
0.2.1 (01-08-2016, A. Revenko, SWC): ToC refined and finalized
0.3 (10-08-2016, A. Revenko, SWC): added Section 2
0.4 (16-08-2016, A. Revenko, SWC): added Section 3
0.4.1 (17-08-2016, A. Revenko, SWC): feedback on the data assessment tools received from I. Pragidis, E. Karapistoli (DUTH), G. Panos, C. Bouzanis (UOG), I. Kompatsiaris, A. Satsiou (CERTH)
0.5 (19-08-2016, A. Revenko, SWC): added Section 4
0.6 (23-08-2016, A. Revenko, SWC): added Section 5
0.7 (26-08-2016, A. Revenko, SWC): added Section 6
0.7.1 (29-08-2016, A. Revenko, SWC): finalized executive summary, introduction, conclusion
0.8 (02-09-2016, A. Revenko, SWC): final official preliminary draft version of D2.2
0.9 (22-09-2016, A. Revenko, SWC): revision according to the comments from the reviewers
1.0 (29-09-2016, A. Revenko, SWC): final review, proofing and quality control; ready for submission

List of Abbreviations
ARI - Automated Readability Index
CLI - Coleman-Liau Index
FKGL - Flesch-Kincaid Grade Level Formula
IDF - Inverse Document Frequency
LDA - Latent Dirichlet Allocation
ME - Maximum Entropy
NMF - Non-negative Matrix Factorization
PLSA - Probabilistic Latent Semantic Analysis
POS - Part of Speech
SBD - Sentence Boundary Disambiguation
SMOG - Simple Measure of Gobbledygook
STW - Standard Thesaurus Wirtschaft
TF - Term Frequency
WSJ - Wall Street Journal

List of Figures
1 Reconstruction error and density of topics with plain annotations
2 Reconstruction error and density of topics with enriched annotations
3 Topic density and relatedness on plain training data
4 Topic density and relatedness on enriched training data
5 Topic density and relatedness on plain test data
6 Topic density and relatedness on enriched test data
7 Topic transition matrix using pseudo-inverse matrix
8 Topic transition matrix using optimization algorithms

List of Tables
1 Precision, recall, and f1-measure for all items
2 Average and standard deviation of precision, recall, and f1-measure for all items
3 Average and standard deviation of precision, recall, and f1-measure for 14 items
4 Textual statistics
5 Statistics on the number of extracted concepts
6 Interpretations of the readability scores
7 FKGL scores per category
8 ARI scores per category
9 CLI scores per category
10 SMOG scores per category
11 Accuracy scores
12 Confusion matrix
13 Top 15 most important concepts per category for plain data
14 Top 15 most important concepts per category for enriched data
15 5 topics of weeks 0 and 1 of year 2015
16 5 topics of weeks 2 and 3 of year 2015

Table of Contents
Executive Summary
1 Introduction
  1.1 Scope of the Document
  1.2 Relation to PROFIT Project
  1.3 Goals
  1.4 Notes on Concept Extraction
  1.5 Note on Implementation
2 Preprocessing
  2.1 Sentence Boundary Disambiguation
    2.1.1 Introduction
    2.1.2 Implemented Method
    2.1.3 Evaluation Results
  2.2 Counting Syllables
  2.3 Vectorization
    2.3.1 Introduction
    2.3.2 Term Frequency - Inverse Document Frequency
3 Annotated Articles
  3.1 Introduction
  3.2 Text Statistics
  3.3 Feature Statistics
4 Readability Indices
  4.1 Introduction
  4.2 Flesch-Kincaid Grade Level
    4.2.1 Evaluation Results
  4.3 Automated Readability Index
    4.3.1 Evaluation Results
  4.4 Coleman-Liau Index
    4.4.1 Evaluation Results
  4.5 Simple Measure of Gobbledygook
    4.5.1 Evaluation Results
  4.6 Usage
5 Text Categorization
  5.1 Introduction
  5.2 Classifiers
  5.3 Evaluation Result
  5.4 Usage
6 Topic Discovery
  6.1 Introduction
  6.2 Factorization and Topics
    6.2.1 Evaluation Results
  6.3 Topical Density and Relatedness
    6.3.1 Evaluation Results
  6.4 Topic Transition
    6.4.1 Evaluation Results
  6.5 Usage
Conclusion

Executive Summary

The deliverable presents the results of preparing tools for the assessment of the information flowing into the platform. It consists of the following results achieved in the first nine months of the project:
1. analysis of possible data and information streams;
2. analysis, implementation, and test results of the text preprocessing techniques;
3. analysis of the available techniques and approaches for assessing the possible data and information;
4. implementation and testing of the suitable methods.

The main contributions of the deliverable may be found in Section 2 (analysis of preprocessing techniques), Section 4 (analysis of readability metrics), Section 5 (analysis of the categorization task), and Section 6 (analysis of topic modeling). Each section contains an introduction, a discussion of the methods, test results, and a description of usage patterns.

1 Introduction

1.1 Scope of the Document

The document describes the output of the preparation of the assessment tools for the data input to the PROFIT project platform. Based on the discussions with the partners of the consortium, the main input streams were identified to be textual. Based on this outcome, methods and techniques for assessing textual input were identified, investigated, implemented, and tested. The data obtained in the course of D2.3 "Data crawlers, adaptors and extractors" (M12) was reused; the data was suggested by experts in the field and represents a high-quality collection of news articles relevant to the field of financial economics. The document consists of a detailed description of the assessment methods together with the test results on the described data.
Based on the test results, the useful variants of the methods were identified and calibrated. Moreover, usage patterns of each method are described. Some of the developed methods were identified to be useful in WP4 "Market Sentiment-based Financial Forecasting".

1.2 Relation to PROFIT Project

An investigation of tools for assessing the input was carried out within this deliverable. In the discussions with the partners it was identified that numerical and factual data will be obtained from trusted sources such as Eurostat (http://ec.europa.eu/eurostat) and/or OPEC (http://www.opec.org/opec_web/en/); therefore, no further verification of such data is required. The only input that requires assessment is expected to be the textual input. Most of the unprocessed information is going to be represented at least partially in textual form, including news articles and educational materials. The interaction with the users is considered a very important part of the project; interaction among users and between users and the platform will also be textual. Assessing the textual data is of great importance in order to identify potentially malicious, difficult, provocative, or significant input.

Moreover, some of the work in this deliverable, such as topic discovery, may be reused within Work Package 4 "Market Sentiment-based Financial Forecasting". Topic discovery is capable of identifying newly emerging topical trends and the evolution of existing ones, and hence may be used to identify the subjects of sentiment analysis.

1.3 Goals

The goal of this deliverable is to prepare the methodological framework and to implement tools for assessing (textual) input. Those tools should facilitate the identification of input requiring additional attention and the processing of such input by the moderators. Moreover, the tools should be useful for gaining insights for further analysis of the input from users and from the news articles. Hence, the assessment includes methods for assessing the quality of the text, the relation to the field of interest, the distribution of the topics (with respect to prior topics extracted from the already processed data), and the identification of potentially new trends and dependencies.

1.4 Notes on Concept Extraction

For all the assessment methods except purely textual analysis, the extraction of concepts is used. The extracted concepts serve as a better alternative to keywords for identifying the important elements of documents. As the concepts are chosen and curated by experts in advance, they represent essential background knowledge about the field of interest; namely, the concepts identify the main elements of the text that an expert would focus attention on. Moreover, since the concepts are taken from a thesaurus, the semantic relations between concepts are known and may be used for even deeper analysis. For performing the work described in this deliverable, the STW Economics thesaurus was used [Borst and Neubert, 2009]. The extraction was performed using PoolParty (poolparty.biz). For more information see Deliverable 2.1 "PROFIT core knowledge model", delivered in Month 6.

1.5 Note on Implementation

The described methods were implemented in Python using the scikit-learn library [Pedregosa et al., 2011].
2 Preprocessing

2.1 Sentence Boundary Disambiguation

2.1.1 Introduction

Sentence boundary disambiguation is the task of identifying the individual sentences within a text. Because the sentence is the basic textual unit immediately above the word and phrase, Sentence Boundary Disambiguation (SBD) is one of the most essential problems for many applications of Natural Language Processing: parsing, information extraction, machine translation, and document summarization.

The SBD problem is not always simple. Usually a sentence ends with a terminal punctuation mark such as ".", "?", or "!". However, a period can also be part of an abbreviation, such as "Mr.", or represent a decimal point in a number like $12.58; in these cases it does not delimit a sentence, because it has a different meaning. On the other hand, the trailing period of an abbreviation can represent the end of a sentence at the same time. In most such cases, the word following this period is a capitalized common word (e.g., "The President lives in Washington D.C. He likes that place.").

The original SBD systems were built from manually generated rules in the form of regular expressions for grammar, augmented by lists of abbreviations, common words, proper names, etc. For example, the Alembic system [Aberdeen et al., 1995] deploys over 100 regular-expression rules written in Flex. Such a system may work well on the language or corpus for which it was initially designed. Nevertheless, developing and maintaining an accurate rule-based system requires substantial hand-coding effort and domain knowledge, and it is difficult to port an existing system to other languages. On the positive side, a basic rule-based system is easy to develop and does not require any annotated data for training.

The current research activity in SBD focuses on employing machine learning techniques, which treat the SBD task as a standard classification problem. The general principle of these systems is to train the system on a (usually annotated) training set, making it "remember" the features of the local context around the sentence-breaking punctuation or global information such as lists of abbreviations and proper names, and then to recognize the sentences of real texts using the trained system. Those systems have the following drawbacks:
• they require more effort for development than rule-based systems,
• they demand annotated data for training,
• they lack transparency (compared to rule-based systems).

[Palmer and Hearst, 1997] developed a system, called SATZ, to classify potential sentence boundaries using the local syntactic context. To obtain the syntactic information
for the local context, SATZ needs the words in the context to be tagged with part of speech (POS). An additional drawback of using POS tagging is that the POS data is language-specific, hence no straightforward extension to new languages is possible. The authors reported an error rate of around 1.0% on Wall Street Journal (WSJ) data.

In order to solve the problems encountered by the SATZ system, [Mikheev, 2000] proposed a method that segments a sentence into smaller sections, claiming an error rate of only 0.25% on the Brown corpus and 0.39% on the WSJ corpus.

There are also known approaches to SBD that do not use POS tags. [Reynar and Ratnaparkhi, 1997] presented a solution based on a Maximum Entropy (ME) model for the SBD problem. The model attains an accuracy of 98.8% on the WSJ data, a good performance given that the model is simple and the feature selection quite flexible.

Although the reported systems achieve error rates of around 1%, one can expect a somewhat worse performance on a general corpus, i.e., not the one used for tuning the system; on such corpora, state-of-the-art systems may achieve an accuracy of 95% and above.

2.1.2 Implemented Method

The implemented method is based on manual rules for finding the sentence boundaries. The following advantages of the rule-based method were considered:
• no annotated data is needed;
• no user interaction is required;
• it is easy to implement and use.

The results of SBD are going to be used for assessing the readability of a text. For this purpose, error rates of up to 5-10% are acceptable. Therefore, in the implementation special attention was paid to making the method robust, simple, and transparent; in this direction, the number of rules is kept small, even at the cost of some decrease in accuracy. The following regular expression [Aho and Ullman, 1992, Chapter 10] is used to find candidates for sentence boundaries:

(?<=[.?!\]\n]) ([\"\']?[\s|\r\n]+[\"\']?) (?=(\([a-z])|(\(?[A-Z])) (1)

The first part, starting with (?<=, is a so-called positive look-behind and is responsible for matching the characters that precede a sentence boundary: one of the characters ., ?, !, ] or a newline must be present before the boundary. The second part specifies the boundary itself: the boundary is whitespace or a newline and may be surrounded by single or double quotes on both sides. The third part, the positive look-ahead, is responsible for the succeeding characters: the boundary must be followed by either an opening bracket and a lower-case character, or an optional opening bracket and an upper-case character.

Although the specified regular expression is able to find most sentence boundaries, it also matches many cases that are not sentence boundaries; in particular, abbreviations, names, and special words like "etc." trigger matches. In order to avoid this, a list of exceptions is adopted. A short list containing 4 items is used:
1. "Mr",
2. "Mrs",
3. "etc",
4. all single upper-case characters.

Although user interaction is not required, the applied method is flexible enough to be extended. For instance, it is possible to add new rules for detecting boundaries and non-boundaries: the first and third parts of the regular expression (1) are implemented as lists and can be extended with further entries. Moreover, in order to capture possible dependencies between the first and third parts, one can easily add more regular expressions like (1). The list of exceptions can also be extended at any time.
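To make the method concrete, the following is a minimal, self-contained Python sketch of the splitter described above. The regular expression is a lightly simplified transcription of expression (1), and the function names (split_sentences, is_exception) are ours, not those of the actual implementation.

```python
import re

# A simplified transcription of regular expression (1): terminal punctuation
# behind, (optionally quoted) whitespace at the boundary, and a plausible
# sentence opener ahead.
BOUNDARY = re.compile(
    r'(?<=[.?!\]\n])'        # positive look-behind: ., ?, !, ] or newline
    r'["\']?\s+["\']?'       # the boundary: whitespace, optional quotes
    r'(?=\([a-z]|\(?[A-Z])'  # positive look-ahead: "(a" or optional "(" + capital
)

EXCEPTIONS = {"Mr", "Mrs", "etc"}

def is_exception(prefix):
    """True if the text before a candidate boundary ends in an exception."""
    words = prefix.split()
    if not words:
        return False
    token = words[-1].rstrip(".?!]")
    # single upper-case characters (e.g. initials) are also exceptions
    return token in EXCEPTIONS or (len(token) == 1 and token.isupper())

def split_sentences(text):
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        if is_exception(text[:m.start()]):
            continue  # abbreviation, not a sentence boundary
        sentences.append(text[start:m.start()].strip())
        start = m.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences("Mr. Smith lives in Washington D.C. He likes that place."))
# -> ['Mr. Smith lives in Washington D.C.', 'He likes that place.']
```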
2.1.3 Evaluation Results

The implemented procedure was tested on the Project Gutenberg corpora (http://www.gutenberg.org/) as provided by NLTK (http://www.nltk.org/nltk_data/). The corpora consist of 18 items:
1. "Emma" by Jane Austen,
2. "Persuasion" by Jane Austen,
3. "Sense and Sensibility" by Jane Austen,
4. Bible,
5. Poems by William Blake,
6. Stories by Sara Cone Bryant,
7. "The Adventures of Buster Bear" by Thornton Waldo Burgess,
8. "Alice in Wonderland" by Lewis Carroll,
9. "The Ball and the Cross" by Gilbert Keith Chesterton,
10. "The Innocence of Father Brown" by Gilbert Keith Chesterton,
11. "The Man Who Was Thursday" by Gilbert Keith Chesterton,
12. "The Parent's Assistant" by Maria Edgeworth,
13. "Moby-Dick; or, The Whale" by Herman Melville,
14. "Paradise Lost" by John Milton,
15. "The Tragedy of Julius Caesar" by William Shakespeare,
16. "The Tragedy of Hamlet, Prince of Denmark" by William Shakespeare,
17. "The Tragedy of Macbeth" by William Shakespeare,
18. "Leaves of Grass" by Walt Whitman.

The correct partitioning of the texts into sentences is known for these corpora. The results for each individual corpus are presented in terms of precision, recall, and f1-measure [Forman, 2003] in Table 1; the averages and standard deviations can be found in Table 2.

Table 1: Precision, recall, and f1-measure for all items
Item   Precision   Recall   f1-measure
1      0.92        0.85     0.88
2      0.96        0.91     0.93
3      0.94        0.89     0.91
4      0.10        0.01     0.02
5      0.17        0.40     0.24
6      0.83        0.82     0.82
7      0.91        0.87     0.89
8      0.74        0.65     0.69
9      0.91        0.87     0.89
10     0.94        0.90     0.92
11     0.90        0.86     0.88
12     0.86        0.77     0.81
13     0.89        0.80     0.84
14     0.85        0.78     0.81
15     0.97        0.94     0.95
16     0.98        0.97     0.98
17     0.97        0.92     0.94
18     0.62        0.46     0.53

Table 2: Average and standard deviation of precision, recall, and f1-measure for all items
                     Precision   Recall   F1-measure
Average              0.80        0.76     0.78
Standard deviation   0.25        0.24     0.25

4 items from the list are significant outliers in our analysis:
• item 4, the Bible, is written in a special style which we do not expect as input;
• items 5 and 18 are poems and hence much more difficult to process;
• item 8 is written in a whimsical style that is also difficult to process.

The scores for these items are significantly lower than for the other entries because of the very special styles used in these corpora. After considering the remaining 14 items (excluding the 4 special cases listed above), we obtain much better scores, as depicted in Table 3.

Table 3: Average and standard deviation of precision, recall, and f1-measure for 14 items
                     Precision   Recall   F1-measure
Average              0.92        0.89     0.89
Standard deviation   0.05        0.04     0.05

Although we only achieve an accuracy of about 90%, at least 5% below the state of the art, the implemented procedure is satisfactory: this accuracy is sufficient for our purpose and the approach is kept simple.

2.2 Counting Syllables

The complexity of counting the syllables in words depends on the language. For some languages, like Finnish, each word can be divided into syllables using only general rules. However, possibly due to the weak correspondence between sounds and letters in the spelling of modern English, written syllabification in English is based mostly on etymological or morphological principles instead of phonetic ones (see https://en.wikipedia.org/wiki/Syllabification).

In the implementation a simple syllabification algorithm is used: each occurrence of one or more consecutive vowels is counted as one syllable. Only one exception is made: if a word ends in a vowel, a consonant, and "e", then this final "e" is not counted as a separate syllable.
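A minimal sketch of this counting rule is given below; treating "y" as a vowel is an assumption of the sketch, as the deliverable does not specify the vowel set.

```python
VOWELS = set("aeiouy")  # whether "y" counts as a vowel is our assumption

def count_syllables(word):
    """Count maximal runs of consecutive vowels, with the silent-e rule."""
    word = word.lower()
    runs, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in VOWELS
        if is_vowel and not prev_vowel:
            runs += 1  # a new run of vowels starts a new syllable
        prev_vowel = is_vowel
    # exception: a trailing vowel-consonant-"e" does not add a syllable
    if (len(word) >= 3 and word.endswith("e")
            and word[-2] not in VOWELS and word[-3] in VOWELS and runs > 1):
        runs -= 1
    return max(runs, 1)

for w in ("rate", "economics", "gobbledygook"):
    print(w, count_syllables(w))  # rate 1, economics 4, gobbledygook 4
```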
2.3 Vectorization

2.3.1 Introduction

In order to apply numerical and discrete algorithms to textual data, a suitable representation of the text is necessary. A common approach, used in many applications, is to represent the text as a vector in some feature space [Agirre and Soroa, 2009, Mihalcea, 2010, Yarowsky, 1995]. In the project setting we have valuable background data at our disposal, namely the thesauri. The concepts in the thesauri are chosen by experts and represent valuable instances in the field of study; those concepts are used as features in the text representations. As a result, each text is represented as a vector with numerical entries. Each entry is a non-negative number indicating the degree to which the document can be described by the respective concept.

2.3.2 Term Frequency - Inverse Document Frequency

The degree to which each document can be described by a concept can be computed in multiple ways. A very commonly used technique for weighting different terms in the vectorization process is term frequency - inverse document frequency [Salton and McGill, 1986]. The first part, the term frequency (TF), is the number of occurrences of the term in the document: the more often the term occurs, the better it represents the document [Luhn, 1957]. The second part, the inverse document frequency (IDF), is the inverse of the number of documents that contain the term: the specificity of a term can be quantified as an inverse function of the number of documents in which it occurs [Sparck Jones, 1972]. There exist multiple variations of the TF-IDF weighting; in the assessment tools the straightforward version is used: tf · idf.
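As an illustration, such a vectorization over a fixed concept vocabulary can be sketched with scikit-learn as follows. The concept labels are hypothetical placeholders for the thesaurus concepts, and scikit-learn's TfidfVectorizer applies a smoothed IDF variant, so the sketch approximates rather than exactly reproduces the plain tf · idf weighting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical concept labels standing in for the expert-curated thesaurus concepts.
concepts = ["oil price", "exchange rate", "stock exchange", "euro"]

# Fixed vocabulary = concepts; norm=None because no normalization is applied
# after TF-IDF in the assessment tools (see Section 3.3).
vectorizer = TfidfVectorizer(vocabulary=concepts, ngram_range=(1, 2), norm=None)

docs = [
    "The oil price fell while the euro recovered against the dollar.",
    "The stock exchange reacted to the new exchange rate policy.",
]
X = vectorizer.fit_transform(docs)  # documents x concepts, sparse matrix
print(X.toarray())
```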
3 Annotated Articles

3.1 Introduction

As a part of Deliverable 2.3 "Data crawlers, adaptors and extractors", due in Month 12, a crawler for collecting data from the website investing.com was developed and used. This source was provided by the partners from DUTH as a source of quality financial news articles. The news articles were collected for three categories:
• Euro / USD exchange rate: 19119 articles,
• Eurostoxx50: 5834 articles,
• Crude oil: 14209 articles.

We use this set of data as a basis for the evaluation of the designed assessment tools. Since the articles are approved by the experts, we do not doubt their quality; therefore, the articles represent a good testbed for calibrating and benchmarking the procedures. In the following subsections we analyze the articles in order to gain insights into the obtained data.

3.2 Text Statistics

We start the analysis of the articles with purely textual statistics, namely counting the words, unique words, polysyllables (words with 3 or more syllables), etc. The results of this analysis are presented in Table 4. Each entry in the table has the form "average ± standard deviation".

Table 4: Textual statistics
                     euro / usd    eustoxx       crude oil     all
Words                310 ± 167     380 ± 109     436 ± 184     367 ± 176
Unique words         185 ± 84      252 ± 67      260 ± 102     223 ± 96
Polysyllables        47 ± 28       66 ± 20       66 ± 37       57 ± 32
Syllables per word   1.47 ± 0.15   1.53 ± 0.09   1.48 ± 0.19   1.48 ± 0.16
Words per sentence   23 ± 8        24 ± 3        24 ± 6        23 ± 7

From the table we observe that, though the deviation of the article length counted in words is large, the articles about crude oil are about 1.5 times longer than the articles about the euro / usd exchange rate. Moreover, the articles about eurostoxx are on average more than 50 words shorter than the articles about crude oil, while both use about the same number of unique words on average, meaning that the ratio of unique words is higher in the eurostoxx news. In terms of word length, the value remains more or less the same across categories, at about 1.5 syllables per word. Articles about the euro / usd exchange rate tend to contain fewer long words (polysyllables), though they are also shorter on average. On average each sentence contains about 23 words in all categories.

3.3 Feature Statistics

All the fetched articles were sent to the PoolParty extractor (see Deliverable 2.1 "PROFIT core knowledge model", delivered in Month 6, and https://www.poolparty.biz/poolparty-extractor/) in order to run the extraction procedure, i.e., to find all the relevant concepts in the articles. In the course of Task 2.2 "Semantic Data Modeling, Linking and Enrichment" the STW Economics thesaurus was slightly modified by the experts: about 300 concepts were added. This modified thesaurus was used in the extraction process; the results of extraction and vectorization using the original STW Economics thesaurus should resemble the results reported in this deliverable.

We take advantage of having the thesaurus and create a second vectorization of the articles. In the second vectorization we enrich the data using the hierarchical relations, i.e., we add all the broader concepts of the extracted concepts. In this vectorization many top-level concepts occur very often; for example, a general concept such as "Economics" will occur in almost every article, because "Economics" is a broader concept of at least one concept extracted from the article. Therefore, pre-filtering of the features is desirable: pre-filtering based on the frequency of the concepts eliminates a priori irrelevant concepts and speeds up the learning process. However, even after pre-filtering the enriched vectorization contains many irrelevant features; moreover, many features are highly dependent on each other, since the tight semantic relation between broader and narrower concepts holds between features by design. This fact is taken into account when performing the relevant tasks.

Table 5 presents the number of concepts and the frequency of their occurrences. The density of the data is the ratio of the non-zero values in the matrix to the total number of entries in the matrix. The frequency rows in the table contain the number of concepts occurring with at least or at most the specified frequency; for example, there are 10 concepts occurring in more than 50% of the documents for the plain vectorization and 100 such concepts for the enriched vectorization.

Table 5: Statistics on the number of extracted concepts
                     Plain   Enriched
Number of concepts   2498    4185
Density              0.018   0.055
Frequency > 50%      10      100
Frequency > 33%      30      210
Frequency < 0.1%     1512    2960
Frequency < 0.01%    722     2078

As expected, the density of the enriched representation is much higher.
This is due to the fact that more general concepts are taken into account in the enriched representation: they appear in the extraction each time any of their narrower concepts is found in the article.

The total number of articles is 39150; hence 0.01% of the articles is about 4 articles. Rare features, occurring in only 0.01% of the documents, may be a source of overfitting and are often discarded during pre-filtering. When the rare concepts are removed from the enriched data, their broader concepts are preserved; hence we may consider the elimination of the rare concepts as a generalization.

After the extraction of concepts, the TF-IDF weighting is applied, see Section 2.3.2. (Note that no normalization is done after TF-IDF, i.e., the sum of the scores for each article can take any value. Some classifiers, for example Support Vector Machines, are known to work much better on normalized data [Zhang and Oles, 2001]; for logistic regression, however, normalization is not essential.) The vectorized data is used in the subsequent sections of this document for testing and calibration purposes.
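As an illustration of the frequency-based pre-filtering discussed above, the following sketch drops too-rare and too-frequent concepts from a sparse document-concept matrix; the thresholds and the helper name prefilter are ours.

```python
import numpy as np
from scipy import sparse

def prefilter(X, concepts, min_df=0.0001, max_df=0.5):
    """Drop concepts that are too rare or too frequent.

    X: sparse documents x concepts matrix; the df thresholds are fractions
    of the total number of documents.
    """
    n_docs = X.shape[0]
    df = np.asarray((X > 0).sum(axis=0)).ravel() / n_docs  # document frequency
    keep = np.flatnonzero((df >= min_df) & (df <= max_df))
    return X[:, keep], [concepts[i] for i in keep]

# Toy example: 4 documents, 3 concepts.
X = sparse.csr_matrix([[1, 0, 2], [1, 0, 0], [1, 1, 0], [1, 0, 3]])
Xf, kept = prefilter(X, ["Economics", "rare concept", "Oil price"],
                     min_df=0.3, max_df=0.9)
print(kept)  # ['Oil price']: "Economics" is too frequent, "rare concept" too rare
```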
4 Readability Indices

4.1 Introduction

Readability tests, readability formulae, or readability metrics are formulae for evaluating the readability of a text, usually by counting syllables, words, and sentences. Readability tests are often used as an alternative to conducting an actual statistical survey of human readers of the subject text (a readability survey). Word processing applications often have readability tests built in, which can be applied to documents while editing. The application of a useful readability test protocol offers an indication of a text's readability; the accuracy can be further improved by averaging the readability over a large number of works. The tests generate a score based on characteristics such as the statistical average word length (used as an unreliable proxy for semantic difficulty) and sentence length (as an unreliable proxy for syntactic complexity). Although different formulae for assessing readability are used in this work, they all follow the same interpretation pattern: the obtained score predicts the number of years of study required to easily comprehend the text. Table 6 provides the details of this interpretation.

Table 6: Interpretations of the readability scores
Score   Grade level
1       Kindergarten
2       First grade
...     ...
9       Eighth grade
10      High school freshman
11      High school sophomore
12      High school junior
13      High school senior
14      College freshman
15      College sophomore
16      College junior
17      College senior
18      College graduate

4.2 Flesch-Kincaid Grade Level

The Flesch-Kincaid readability test is designed to indicate how difficult a reading passage in English is to understand; word length and sentence length are used for this purpose [Kincaid et al., 1975]. The readability test is used extensively in the field of education. The "Flesch-Kincaid Grade Level Formula" (FKGL) presents the score as a U.S. grade level, making it easier for teachers, parents, librarians, and others to judge the readability level of various books and texts. It can also be interpreted as the number of years of education generally required to understand the text, relevant when the formula results in a number greater than 10. The grade level is calculated with the following formula:

  0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

The result is a number that corresponds to a U.S. grade level. For instance, the sentence "The Australian platypus is seemingly a hybrid of a mammal and reptilian creature" gets a score of 13.1, as it has 26 syllables and 13 words. The grade level formula emphasizes sentence length over word length. By creating one-word strings with hundreds of random characters, grade levels may be attained that are hundreds of times larger than high school completion in the United States; due to the formula's construction, the score has no upper bound. The lowest grade level score in theory is −3.40, but there are few real passages in which every sentence consists of a single one-syllable word. "Green Eggs and Ham" by Dr. Seuss comes close, averaging 5.7 words per sentence and 1.02 syllables per word, with a grade level of −1.3 (see https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests).

4.2.1 Evaluation Results

The averages and standard deviations of the FKGL score on the annotated corpora described in Section 3 are presented in Table 7.

Table 7: FKGL scores per category
Category     Average   Standard deviation
eustoxx      11.8      1.6
crude oil    11.2      3.2
euro / usd   10.6      4.1
all          11.0      3.5

4.3 Automated Readability Index

The Automated Readability Index (ARI) is a readability test for English texts, designed to measure the understandability of a text. Like the FKGL, the SMOG index, and the CLI, it produces an approximate representation of the U.S. grade level needed to comprehend the text (see https://en.wikipedia.org/wiki/Automated_readability_index). The formula for calculating the automated readability index is:

  4.71 × (total characters / total words) + 0.5 × (total words / total sentences) − 21.43

Unlike most other indices, the ARI, along with the CLI, relies on a factor of characters per word instead of the usual syllables per word. Although opinion varies on its accuracy compared to the syllables-per-word and complex-word indices, characters per word is often faster to calculate, as the number of characters is more readily and accurately counted by computer programs than syllables. In fact, this index was designed for real-time monitoring of readability on electric typewriters [Smith and Senter, 1967].

4.3.1 Evaluation Results

The averages and standard deviations of the ARI score on the annotated corpora described in Section 3 are presented in Table 8.

Table 8: ARI scores per category
Category     Average   Standard deviation
eustoxx      14.0      1.6
crude oil    13.4      3.4
euro / usd   12.6      4.5
all          13.1      3.8

4.4 Coleman-Liau Index

The Coleman-Liau index (CLI) is a readability test designed by Meri Coleman and T. L. Liau to gauge the understandability of a text [Coleman and Liau, 1975]. Like the FKGL, the Gunning fog index, the SMOG index, and the ARI, its output approximates the U.S. grade level thought necessary to comprehend the text (see https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index).
Like the ARI, but unlike most other indices, the CLI relies on characters instead of syllables per word. Although opinion varies on its accuracy compared to the syllable-based and complex-word indices, characters are more readily and accurately counted by computer programs than syllables.

The CLI was designed to be easily calculated mechanically from samples of hard-copy text. Unlike syllable-based readability indices, it does not require knowledge of the character content of words: the character count alone is enough. Therefore, it could be used in conjunction with theoretically simple mechanical scanners that would only need to recognize character, word, and sentence boundaries, so that full optical character recognition or manual keypunching is not required. The CLI is calculated with the following formula:

  5.88 × (total characters / total words) − 29.6 × (total sentences / total words) − 15.8

4.4.1 Evaluation Results

The averages and standard deviations of the CLI score on the annotated corpora described in Section 3 are presented in Table 9.

Table 9: CLI scores per category
Category     Average   Standard deviation
eustoxx      12.3      1.0
crude oil    11.4      2.2
euro / usd   11.0      1.7
all          11.3      1.9

4.5 Simple Measure of Gobbledygook

The SMOG grade is a measure of readability that estimates the years of education needed to understand a piece of writing. SMOG is an acronym derived from Simple Measure of Gobbledygook. It is widely used, particularly for checking health messages. The SMOG grade yields a 0.985 correlation, with a standard error of 1.5159 grades, with the grades of readers who had 100% comprehension of test materials [Mc Laughlin, 1969] (see https://en.wikipedia.org/wiki/SMOG). The formula for calculating the SMOG grade was developed by G. Harry McLaughlin as a more accurate and more easily calculated substitute for the Gunning fog index, and was published in 1969. A 2010 study [Fitzsimmons et al., 2010] published in the Journal of the Royal College of Physicians of Edinburgh stated that "SMOG should be the preferred measure of readability when evaluating consumer-oriented healthcare material"; the study found that "The Flesch-Kincaid formula significantly underestimated reading difficulty compared with the gold standard SMOG formula." The SMOG grade is calculated as:

  1.043 × sqrt(30 × (number of polysyllables / number of sentences)) + 3.1291

4.5.1 Evaluation Results

The averages and standard deviations of the SMOG score on the annotated corpora described in Section 3 are presented in Table 10.

Table 10: SMOG scores per category
Category     Average   Standard deviation
eustoxx      14.7      1.0
crude oil    13.8      1.8
euro / usd   13.5      2.2
all          13.8      2.0

4.6 Usage

The readability tests are used to assess the readability of the textual data in the project. Depending on the purpose of a text, different scores can be expected or desirable. For introductory educational resources one may require a readability score under 12 in order to make them accessible to a wider audience. For the comments in a specialized discussion, on the other hand, one may require a higher readability score in order to identify potentially unprofessional comments and maintain the discussion on a high level.
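For reference, all four indices can be computed directly from the raw counts produced by the preprocessing of Section 2; a minimal sketch follows (the function names are ours).

```python
import math

def fkgl(words, sentences, syllables):
    """Flesch-Kincaid Grade Level (Section 4.2)."""
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59

def ari(characters, words, sentences):
    """Automated Readability Index (Section 4.3)."""
    return 4.71 * characters / words + 0.5 * words / sentences - 21.43

def cli(characters, words, sentences):
    """Coleman-Liau Index (Section 4.4)."""
    return 5.88 * characters / words - 29.6 * sentences / words - 15.8

def smog(polysyllables, sentences):
    """Simple Measure of Gobbledygook (Section 4.5)."""
    return 1.043 * math.sqrt(30 * polysyllables / sentences) + 3.1291

# Worked check with the platypus example from Section 4.2:
# 13 words, 1 sentence, 26 syllables -> FKGL of about 13.1.
print(round(fkgl(13, 1, 26), 1))
```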
The most stable results, i.e., the smallest standard deviations, were shown by the SMOG index and the CLI score. However, the average values of these two scores differ by about 2.5 from each other. As this difference persists across the categories, it can be treated as a calibration coefficient for the financial domain. As an outcome of this analysis, the average of the two scores, (CLI + 2.5) and SMOG, will be used to assess the readability of a text. The expected value for a specialized article is around 14, with a deviation of 2.

5 Text Categorization

5.1 Introduction

Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. For example, news stories are typically organized by subject categories or geographical codes; academic papers are often classified by technical domains and sub-domains; patient reports in health-care organizations are often indexed from multiple aspects, using taxonomies of disease categories, types of surgical procedures, insurance reimbursement codes, and so on. Another widespread application of text categorization is spam filtering, where email messages are classified into the two categories of spam and non-spam [Yang and Joachims, 2008].

5.2 Classifiers

Classifiers are used to solve the categorization task. In mathematical abstraction, a classifier is a function that takes a text and a set of categories as input and outputs one category; the text is predicted to belong to the output category. Hence, the categories should be provided in advance. Moreover, before a classifier is able to do its job, it has to be trained on annotated data, i.e., a set of texts with known categories.

Different types of classifiers can be used. Empirical evaluations have shown the performance of linear classifiers to be comparable to that of stronger non-linear classifiers [Fan and Yang, 2003, Zhang and Oles, 2001]. Taking this information into account, we have chosen so-called logistic regression for performing the task [Hosmer Jr and Lemeshow, 2004]. Besides the ease of use and the availability of libraries for different programming languages, logistic regression offers a straightforward interpretation of the feature weights as the importance of features. In other words, for each feature (concept) the classifier contains information about its "importance" for the categories. Although this importance is not used for the assessment of the text directly, it may become useful for other applications in the project.

The logistic regression classifier accepts several meta-parameters that influence the quality of classification. For estimating the values of these parameters, grid search is used [Lerman, 1980]. In the course of the grid search, multiple logistic regression classifiers with different meta-parameters are trained and their performance is evaluated; the meta-parameters yielding the best performance are chosen for the final classifier.

In the data there exist 3 classes. In different applications of the categorization task in the project platform we may expect more classes; therefore, it is necessary to design a system without restrictions on the number of classes. Taking this into account, we need multi-class classification.
However, logistic regression is originally designed for the binary classification problem, i.e., distinguishing between two classes. Several approaches to extend logistic regression to multi-class classification are known [Tsoumakas and Katakis, 2006], the two most popular being one-vs-one and one-vs-rest. Both approaches rely on the idea of learning an ensemble of classifiers. We have chosen the one-vs-rest approach because of its simplicity; moreover, it is reported that under common conditions it is not any worse than other, more sophisticated approaches [Rifkin and Klautau, 2004]. The approach consists in the following: for each class the training data is re-annotated so that the class is separated from all the other classes, which are merged into one class "rest". Thus for each class we learn a classifier distinguishing this class from the rest. In the testing and working phases all the classifiers are used; the predicted class is the class of the classifier with the highest score.

In order to evaluate the performance of the classifier, the well-known technique of cross-validation is used [Kohavi et al., 1995]. The technique consists in the following: first the annotated data is divided into several chunks. One chunk is left out for testing and does not participate in the learning phase; the classifier is learned on the training chunks (all the chunks except the testing chunk). Then the next chunk is taken for testing, and so on. We measure the accuracy of the classifiers, i.e., the number of instances with correctly predicted classes divided by the total number of instances; in this way we aggregate the score over all classes in multi-class classification.

When learning a logistic regression classifier, one often considers regularization to avoid over-fitting, especially when there is only a small number of training examples or a large number of parameters to be learned [Goodman et al., 2004, Ng, 2004]. The two most widely used regularization methods are the L1 and L2 regularizations. L1-regularized logistic regression is often used for feature selection and has been shown to have good generalization performance in the presence of many irrelevant features: L1 adds a penalty for each new feature during the learning process, thereby forcing the classifier to use as few features as possible. We test the performance of the classifiers with both regularizations.

5.3 Evaluation Result

For the testing, the articles collected from investing.com were used. The category of each article is known, therefore the data is annotated. Though the categories are not balanced in the number of samples, the dataset is large. For the testing, stratified K-fold cross-validation is used; the difference from the cross-validation described above is that in every chunk the percentage of samples of each class is kept the same as in the complete set. The number of chunks is chosen to be 10, i.e., 90% of the data is used for training and 10% for testing.
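A minimal sketch of this setup with the scikit-learn API is shown below; X is assumed to be the TF-IDF concept matrix of Section 2.3 and y the array of category labels. The liblinear solver trains one binary classifier per class in the one-vs-rest fashion and supports both L1 and L2 regularization.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def train_and_evaluate(X, y):
    # grid search over the regularization type and strength
    grid = GridSearchCV(
        LogisticRegression(solver="liblinear"),
        param_grid={"penalty": ["l1", "l2"], "C": [0.1, 1.0, 10.0]},
        cv=3,
    )
    grid.fit(X, y)
    best = grid.best_estimator_

    # stratified 10-fold cross-validation of the selected classifier
    scores = cross_val_score(best, X, y, cv=StratifiedKFold(n_splits=10))
    print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
    return best
```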
The accuracy of the 10-fold cross-validation with L1 and L2 regularization, as well as the number of features with non-zero weights for the L1 regularization, is presented in Table 11.

Table 11: Accuracy scores
            L1             L1 number of features   L2
Plain       93.3% ± 0.5%   1492                    93.2% ± 0.3%
Enriched    93.4% ± 0.5%   1428                    93.2% ± 0.3%

From the results we may deduce that the enrichment of the data does not improve the quality of the classification; however, the number of important features may be decreased. This may happen due to the generalization of some features and may lead to a more robust performance. Further tests show that this behavior is not monotonous, and further investigation would be needed to gain more insight into these phenomena; such an investigation, however, lies outside the scope of the project.

Classification scores on a similar dataset of news articles containing 20 classes are reported to be between 80% and 90% [Nigam et al., 2000, Li and Vogel, 2010, Dai and Liu, 2014]. Although the number of classes in our case is much lower, the classes themselves are more similar and belong to the same domain. Therefore, the scores of above 90% are satisfactory for the considered use case. We believe that these scores are due to the selection of features, namely the usage of concepts from the thesaurus as features: since the thesaurus was created in interaction with experts from the field, the features are guaranteed to be meaningful.

Another useful method of surveying the results and accuracy of the classification is the confusion matrix, which gives an overview of the correctly and incorrectly classified instances. For creating the matrix, 10% of the data was held out, preserving the percentage of instances of the different classes. The matrix can be found in Table 12.

Table 12: Confusion matrix (columns: actual class, rows: predicted class)
Predicted \ Actual   Crude oil   Euro / USD   Eustoxx50
Crude oil            1294        34           7
Euro / USD           85          1870         19
Eustoxx50            41          7            557

As can be seen from the matrix, a significant share of the mistakes is caused by the misclassification of instances of the "Crude oil" class. Moreover, some instances of the "Eustoxx50" class were classified as belonging to the "Euro / USD" class.

We also give an overview of the most important features per class in Tables 13 and 14. The table entries are features originating from concepts; they are represented by their preferred labels taken from the STW Economics thesaurus.

Table 13: Top 15 most important concepts per category for plain data (in decreasing order of importance)
Crude oil: Natural gas; composite output index; Membership; Foreign exchange; Stock exchange; oil; brent oil for delivery; Conservative party; MSCI; Oil price; Reuters; barrel; crude; sweet crude future; Petroleum
Euro / USD: New York Stock Exchange; AUD/USD; Monetary union; USD/CHF; Commodity Futures; Lobbying; Replacement investment; Euro; Economic forecast; Offshore financial centre; Exchange; Futures exchange; EUR/JPY; greenback versus a basket; EUR/USD
Eustoxx50: Forecast; South Korea; Information; Market; EUR/GBP; FTSE 100 (U.K.); Interest rate policy; U.K.; Euro Stoxx (Euro zone); hang seng index; million barrel; pound to dollar; NYSE; CAC 40 (France); DAX

It is worth noting that for the plain data, in the "Crude oil" class the 4th and the 5th most important features contain the token "exchange". Due to the internal annotation algorithm of PoolParty, overlapping preferred labels may be annotated with several concepts; i.e., in many occurrences of "Foreign exchange" the concept "Exchange" is also present in the annotations.
This may be a source of misclassification of instances of the "Crude oil" class. Moreover, it is interesting that among the 4 currency pairs important for the "Euro / USD" class, the pair EUR/USD is actually the least important. It is also worth mentioning that the pound-to-dollar pair is important for Eustoxx50.

Table 14: Top 15 most important concepts per category for enriched data (in decreasing order of importance)
Crude oil: Gases; Treasuries; Industries; S&P 500; Reuters; Commodity price; Bloomberg; crude stockpiles; MSCI; oil contract; Stock exchange; Oil price; Commodity Futures; brent oil for delivery; crude oil inventories
Euro / USD: Futures market; Foreign Exchange Market; V.16.04 Probability Theory; Time series analysis; Industrial production; Bond; One-person household; European Central Bank; New York Stock Exchange; greenback versus a basket; WTI; US Dollar; USD/JPY; EUR/GBP; EUR/JPY
Eustoxx50: P.21 Packages; Chinese; United Kingdom; Bank; P.10 Electrical Engineering; Asian; LONDON; International financial market; Forecast; U.K.; Euro Stoxx (Euro zone); hang seng index; CAC 40 (France); NYSE; DAX

For the enriched data we may observe that more general features gain importance. This corresponds to the general intuition and may ease the analysis; however, the quality of the classification is not improved.

5.4 Usage

The categories offer the functionality of grouping the articles. As opposed to fully automatic grouping, the categorization described in this section allows for manual control over the final outcome. Such a grouping may be useful for end users to pre-filter the data, as well as for the platform designers to organize the data into categories for further analysis.

The users' input may be categorized automatically for better structuring of the data in the storage. The input obtained from other websites may also be categorized if the category is not known in advance. In case the input is a comment from a user, the categorization may check whether the comment belongs to the same category as the material it refers to. As can be seen from the discussion of the results, some important insights into the data may be obtained through the investigation of the trained classifiers.
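To make this concrete: the per-class feature weights of the one-vs-rest logistic regression can be read off its coefficient matrix. Below is a sketch of how overviews such as Tables 13 and 14 can be produced; the helper name is ours and a fitted scikit-learn classifier is assumed.

```python
import numpy as np

def top_concepts(classifier, concepts, n=15):
    """Print the n highest-weighted concepts per category (cf. Tables 13, 14).

    classifier: a fitted scikit-learn LogisticRegression (one-vs-rest);
    concepts: the list of feature names, aligned with the matrix columns.
    """
    for label, weights in zip(classifier.classes_, classifier.coef_):
        top = np.argsort(weights)[::-1][:n]  # indices of the largest weights
        print(label, "->", [concepts[i] for i in top])
```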
6 Topic Discovery

6.1 Introduction

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering the topics and the distribution of those topics in each document.

In the age of information, the amount of written material we encounter each day is simply beyond our processing capacity. Topic models can help us organize large collections of unstructured text bodies and offer insights for understanding them. Topic modeling algorithms analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time. Topic modeling algorithms do not require any prior annotations or labeling of the documents: the topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation. Automatic extraction of meaningful themes from large amounts of documents can help detect events taking place in real time and facilitate the exploration of unlabeled archives of user-generated content.

6.2 Factorization and Topics

In mathematics, factorization is the decomposition of an object into a product of other objects, or factors, which when multiplied together give the original object. The aim of factoring is usually to reduce something to "basic building blocks". Factorization techniques have proved useful in many different application fields, such as image factorization [Tomasi and Kanade, 1992], biomedicine [Gao and Church, 2005], and robotics [Meng and Zhuang, 2007]. As applied to text mining, the most common usage of factorization is so-called topic modeling. The rationale behind topic modeling is to model the set of topics that best represent the texts; the topics are effectively the textual factors. There exist two main approaches to topic modeling:

Probabilistic approach. A statistical generative model attempts to explain a set of observations (documents) through latent variables, which in this case are the topics [Blei, 2012]. The similarities between documents are explained by the fact that they result from the activity of the same latent variables; each latent variable has a probability of generating a certain word. One of the first techniques of probabilistic topic modeling is probabilistic latent semantic analysis (PLSA) [Blei et al., 2003]. The most popular probabilistic topic modeling approach is latent Dirichlet allocation (LDA) [Pritchard et al., 2000]; its advantage over PLSA is that each document may be represented as a mixture of topics. Many other extensions exist, allowing the consideration of multiple other conditions. In order to achieve better results some assumptions have to be made; for example, in LDA an assumption about a Dirichlet prior distribution is made. Such assumptions may be arguable and seem artificial, and the mathematical representation of the model is rather complicated.

Algebraic approach. Non-negative matrix factorization (NMF), also non-negative matrix approximation, is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as the processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. The model is rather simple and robust, and it allows for a straightforward interpretation of its foundations and results. Moreover, thanks to the non-uniqueness of the decomposition, the model is easily customizable to take additional conditions (such as, for example, sparsity) into account. Theoretically, some variations of the model are equivalent to PLSA.

In the course of this work we focus on the algebraic approach and the NMF method in particular.
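As a sketch of this approach with scikit-learn's NMF implementation (whose default solver is, incidentally, also coordinate descent), the following fits one model per candidate number of topics and collects the two quantities evaluated below; X stands for the TF-IDF concept matrix of Section 3, and the function name is ours.

```python
import numpy as np
from sklearn.decomposition import NMF

def topic_model_curve(X, topic_range):
    """Fit NMF for each candidate number of topics; collect the
    reconstruction error |V - W*H| and the density of H."""
    results = {}
    for k in topic_range:
        model = NMF(n_components=k, random_state=0)
        W = model.fit_transform(X)                # documents x topics
        H = model.components_                     # topics x concepts
        results[k] = (model.reconstruction_err_,  # |V - W*H|
                      np.mean(H > 0))             # ratio of non-zero entries in H
    return results

# e.g. curve = topic_model_curve(X, range(10, 81, 2)), then apply the elbow rule
```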
The learning algorithm makes use of the coordinate descent method from the LIBLINEAR library [Fan et al., 2008]. The number of topics is a parameter of the method and has to be given in advance. However, it is rare that the number of topics is known in advance, so it has to be determined by other means. Different techniques exist for determining this number; we use the "elbow rule" for this purpose [Thorndike, 1953, Ketchen and Shook, 1996]. The idea of the elbow rule is to increase the number of topics until each additional topic improves the approximation of the data less than the previous ones did. Figure 2 illustrates this behavior: the difference in performance between 14 and 16 topics is larger than the difference in performance between 74 and 76 topics. The optimal number of topics is the number at which the behavior of the curve changes. Note that increasing the number of topics also increases the computation time, while the generalization capabilities of the model decrease.

6.2.1 Evaluation Results

The performance of topic modeling is measured depending on the number of topics, using all the documents from the corpora described in Section 3. The number of topics varies from 10 to 80. The performance is measured in terms of:

Reconstruction error The norm of the difference $|V - W * H|$ is measured; $W * H$ is the reconstruction of the original data through the found decomposition into topics, hence the name.

Density The density of the topic matrix H is measured, i.e., the ratio of non-zero entries to the total number of entries. The denser H is, the more often the same words are used in different topics. A sparse matrix is desirable, so that each topic is described by a small number of different words.

The result for the article annotations without broaders is presented in Figure 1; the result with broaders is presented in Figure 2. For both cases the optimal number of topics determined with the elbow rule is about 30.

Figure 1: Reconstruction error and density of topics with plain annotations
Figure 2: Reconstruction error and density of topics with enriched annotations

6.3 Topical Density and Relatedness

The topics represent the main directions of possible contribution of a certain piece of text. The more topics are represented in the text, the less focused the contribution is. Moreover, since the topics are weighted, we can assess the numerical density of the topics. As the measure of topical density we have chosen the ratio $D = \max_i(w_i) / \sum_i w_i$, i.e., the maximal topic weight divided by the sum of the weights of the represented topics. Hence the topical density D denotes the ratio between the main contribution and the overall sum of the contributions. The topical relatedness is defined as the sum of all contributions, i.e., $R = \sum_i w_i$. The topical relatedness R identifies to which extent a text is related to the field of interest, i.e., to the main topics of the texts from the training set. The two values D and R are complementary to each other in a certain sense.
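A minimal sketch of these two metrics follows (illustrative code; the topic weights $w_i$ of a text would come, for example, from transforming the text with a trained NMF model as above):

```python
import numpy as np

def topic_density_and_relatedness(weights):
    """Topical density D and relatedness R from the non-negative
    topic weights w_i of a single text (e.g. one row of W)."""
    w = np.asarray(weights, dtype=float)
    R = w.sum()                          # R = sum_i w_i
    D = w.max() / R if R > 0 else 0.0    # D = max_i(w_i) / sum_i w_i
    return D, R

# A focused text: one dominant topic, high density.
print(topic_density_and_relatedness([0.9, 0.05, 0.05]))     # D = 0.90, R = 1.0
# A scattered text: weights spread over many topics, low density.
print(topic_density_and_relatedness([0.2, 0.2, 0.2, 0.2]))  # D = 0.25, R = 0.8
```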
If a text contains only a few concepts, its topical density is expected to be high, since only a few topics (or one) will be represented in the text; however, in this case the topical relatedness will be low. On the other hand, if there are a lot of concepts in a text, then its topical relatedness is likely to be high; however, if the concepts are taken randomly and placed in the text, then the topical density of the text will be low. Therefore, only texts with a focused contribution to several topics and a high number of concepts are expected to have high scores in both values.

6.3.1 Evaluation Results

The tests are performed on the annotated articles using plain and enriched annotations. The dataset is split into 80% for learning the topics and 20% for testing; this way we check that the trained topics perform well on unseen data. The dependencies of topic relatedness and topic density on the number of words in the article and on the number of concepts extracted from the article are built. The results are presented in Figures 3, 4, 5, and 6. In each figure a least-squares approximation is built and the coefficient of determination $R^2$ is presented. The best results are obtained for the enriched data and the dependency of topic relatedness on the number of words. For the topic density the coefficient was found to be quite small; therefore, the dependency of the topic density on the number of extracted concepts can be neglected. One may also note that the characteristics of the dependencies change from plain data to enriched data: namely, if the data is enriched with broader concepts, then the number of concepts becomes a more reliable predictor for the topic relatedness. Overall, one may assume that for shorter snippets of text the topic density is an important value and may reach values up to 1, whereas the topic relatedness may still remain low, close to 0. For longer snippets of text the topic density may be lower, whereas the topic relatedness may be expected to be approximately equal to the number of words multiplied by the slope computed from the training data.

Figure 3: Topic density and relatedness on plain training data
Figure 4: Topic density and relatedness on enriched training data
Figure 5: Topic density and relatedness on plain test data
Figure 6: Topic density and relatedness on enriched test data

6.4 Topic Transition

News articles (especially in the financial field) attempt to respond to the latest changes in all possible fields of life that may have an impact on economics and finance. Hence, in the analysis of topics we cannot rely on news articles that are years or even months old. However, some news topics tend to persist over long periods of time, while others emerge and fade out rapidly. The detection of such emerging and fading topics may be important for economic forecasting and may be valuable for the users of the platform.
Therefore, the task of detecting new topics and investigating the transitions of the old topics is a relevant assessment task. Over the last decade, many strategies for topic detection based on probabilistic models have appeared [AlSumait et al., 2008, Cao et al., 2007, Vaca et al., 2014]. However, these models have a major drawback: their high computational cost makes them unable to deal with large amounts of documents arriving in real time. Hence, these approaches cannot be applied in many real-world scenarios where data must be processed online and efficiently. Prominent examples are online news outlets and social media, where users continuously produce large amounts of data whose topics rapidly grow and fade in intensity across time. Moreover, these approaches make use of complex models that are difficult to implement and investigate; some of them [Vaca et al., 2014] introduce new meta-parameters, making the models even more difficult to use. Other authors use a simpler approach relying on existing methods [Panisson et al., 2014]: the idea is to sample the data into chunks based on the date of appearance, compute the topic distribution for those chunks, and then compare the obtained distributions. The comparison may yield insights into the transitions of the topics and the emergence of new topics. Though the results of the investigation of topic transitions cannot be directly used for text assessment, since a set of texts (a corpus) is needed to identify the transitions, it makes sense to investigate this functionality within the scope of this deliverable, since topic modeling is under investigation.

6.4.1 Evaluation Results

We compute the distribution of topics $H_{n,n+k}$ for the time period of k weeks starting from week n. Next we take $H_{n+k,n+2k}$ and compare the two topic distributions. We aim to find a matrix $\hat{M}$ such that $H_{n+k,n+2k} = \hat{M} * H_{n,n+k}$, where $*$ denotes the matrix (dot) product. In general such a matrix $\hat{M}$ does not exist, so we look for a matrix that would "explain" most of the values in $H_{n+k,n+2k}$. The approach described in [Panisson et al., 2014] uses the cosine score to find the transitions between topics. In case the topics correlate in the original distribution, the cosine approach may overcount the evidence and "explain" the same new topic twice. Therefore, we have decided to use a different approach. The initial idea is to use inverse matrices: if the inverse matrix $H_{n,n+k}^{-1}$ exists, then we could express $M = H_{n+k,n+2k} * H_{n,n+k}^{-1}$. However, the inverse matrix does not exist in the general case. Therefore, we take the Moore-Penrose pseudo-inverse matrix $H_{n,n+k}^{+}$ instead [Ben-Israel and Greville, 2003] and obtain an approximation of the transition matrix $M = H_{n+k,n+2k} * H_{n,n+k}^{+}$. The disadvantage of the pseudo-inverse is that the resulting transition matrix M may have negative entries, and negative entries cannot be interpreted meaningfully. As an alternative approach we implement a method that solves a proper optimization task: minimizing the difference between $H_{n+k,n+2k}$ and $M * H_{n,n+k}$ under an additional non-negativity constraint on M. We present the results of both methods as colormaps: in Figures 7 and 8 we present the two matrices obtained for the topics from the year 2015 with n = 0, k = 2.

Figure 7: Topic transition matrix using pseudo-inverse matrix
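The two variants could be sketched as follows (illustrative code only; `H1` and `H2` stand for $H_{n,n+k}$ and $H_{n+k,n+2k}$, random matrices replace the real topic distributions, and SciPy's non-negative least-squares solver is used here as one possible way to impose the non-negativity constraint):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_topics, n_terms = 10, 500
H1 = rng.random((n_topics, n_terms))  # topics of weeks n .. n+k (placeholder)
H2 = rng.random((n_topics, n_terms))  # topics of weeks n+k .. n+2k (placeholder)

# Variant 1: Moore-Penrose pseudo-inverse; may yield negative entries.
M_pinv = H2 @ np.linalg.pinv(H1)

# Variant 2: minimize ||H2 - M * H1|| subject to M >= 0.
# nnls solves min ||A x - b|| with x >= 0; each row of M is one such
# problem with A = H1.T and b = the corresponding row of H2.
M_nn = np.vstack([nnls(H1.T, row)[0] for row in H2])

# Compare the reconstruction errors of the two transition matrices.
print(np.linalg.norm(H2 - M_pinv @ H1))
print(np.linalg.norm(H2 - M_nn @ H1))

# Interpretation sketch: an old topic whose column of M carries almost no
# mass fades out; a new topic whose row is explained by almost nothing emerges.
fading = np.where(M_nn.sum(axis=0) < 1e-3)[0]
emerging = np.where(M_nn.sum(axis=1) < 1e-3)[0]
```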
For better representation we limit the number of topics to 10. The reconstruction errors are:

Pseudo-inverse: 37.8
Optimization: 38.6

As there is no big difference in the reconstruction error, we choose the optimization problem as the preferred option for building the topic transitions.

Figure 8: Topic transition matrix using optimization algorithms

Table 15: Topics of weeks 0 and 1 of year 2015

Topic 0: NASDAQ, New York Stock Exchange, NYSE, International financial market, Financial market, Corporation, B.08.03 Taxes and Choice of Organizational Form or Location, Organizational form, B.01.04 Organizational Forms, N.05.02.02 Economic Private Law
Topic 1: Western Europe, G.01.05 Western Europe, EU countries, France, PARIS, Europe, LONDON, United Kingdom, Private bank, Bank
Topic 2: percent, G.02 Asia, Asia, East Asia, G.02.02 East Asia, S&P 500, Yield, Return to capital, Tokyo, Japan
Topic 3: Central America, G.04.01 Central America, Mexico, Refinery, Chemical Industry, Petrochemical industry, Basic chemical industry, Mexicans, Mexican, Latin Americans
Topic 4: XETRA, DAX, Stock price, AG NA O.N, Partnership, KGAA, MDAX, DAX (Germany), W.14.03.02 Financial Economics, B.08.03 Taxes and Choice of Organizational Form or Location
Topic 5: V.07.05 Foreign Trade, W.10.04 Export Sector, Export, External sector, Foreign Trade, Process, Enterprise, Law, W.21 Business Services, Energy
Topic 6: National Accounts, Consumer good, Goods, Gross Domestic Product, National product, V.03.05 Consumption and Savings, Balance of Payments, V.07.08 Balance of Payments, Durable good, durable goods
Topic 7: Airline, Transport sector, W.12.01.04 Air Transport, Hedging, Fuel, W.12.01 Transport Mode, W.12 Transport and Tourism, Costs, B.03.02 Cost Accounting, Strategy
Topic 8: manufacture, National Accounts, National product, European Central Bank, V.02.02 Theory of the Firm, Factor price, Costs, Chairman of the ECB, Mario Draghi, N.04.04.04 European Integration and EU Policy
Topic 9: Arab countries, G.02.01 Middle East, Middle East, Syria, Asia, G.02 Asia, Iraq, Social conflict, Labour dispute, Strike
Table 16: Topics of weeks 2 and 3 of year 2015

Topic 0: Western Europe, France, PARIS, G.01.05 Western Europe, EU countries, Europe, CAC 40 (France), Private bank, France, LONDON
Topic 1: NASDAQ, New York Stock Exchange, NYSE, International financial market, Financial market, Corporation, Organizational form, B.08.03 Taxes and Choice of Organizational Form or Location, B.01.04 Organizational Forms, N.05.02.02 Economic Private Law
Topic 2: B.03.02 Cost Accounting, Factor price, V.02.02 Theory of the Firm, Costs, B.04.02 Wage Payment Systems and Fringe Benefits, Income, Cost of capital, V.05.04 Interest Rate, Return to capital, V.03.03 Capital
Topic 3: Mineral resources, Natural gas resources, Natural gas, Gases, cubic foot, cubic foot, Gases, billion cubic foot, natural gas future, Climate
Topic 4: XETRA, DAX, Stock price, DAX (Germany), MDAX, Partnership, KGAA, W.14.03.02 Financial Economics, AG NA O.N, Germany
Topic 5: Oil Prices, Future contracts, V.05.06.03 Forward Market, Commodity Futures, Futures contract, million barrel, Brent Crude Oil Futures, crude oil inventories, W.04.01.04 Energy Policy, V.12.02.01 Energy Policy, B.08.02 Statement of Assets
Topic 6: International Monetary System, Economic Integration, International economic relations, International relations, Economic union, European Economic and Monetary Union, Euro Area, N.04.04.04 European Integration and EU Policy, European Integration, Monetary union
Topic 7: percent, B.02.01.02 Debt Financing, V.05.06.02.02 Bond Market, Fixed-income securities, Yield, Bond, V.03.01 Aggregate Investment, Public Debt, Debt, V.09.06 State-Owned Assets and Public Debt
Topic 8: V.09.07 Economics of Taxation, V.09.07.03 Tax Types, Revenue, Public Revenues, V.09.03 Public Budget, Public Finance, W.22.01.02 Financial Administration and Public Finance, Tax, V.11 Regional Economics and Infrastructure, V.09.07.01 Theory of Taxation
Topic 9: N.10.01.02 Climate, Weather, and Air, W.12 Transport and Tourism, bpd, Means of transport, P.09 Vehicles, Climate change, G.02.01 Middle East, G.02 Asia, Middle East, Climate

As the output of the transition analysis we obtain the information about the disappearing and the emerging topics:

Disappearing: topics 5, 6, 7
Emerging: topics 3, 5, 8

6.5 Usage

The topic density and topic relatedness may be used to assess the overall impact of textual contributions to the platform coming from different sources, including user contributions (see the sketch at the end of this section). The overall methodology was developed in the course of this deliverable; further tests may be needed to evaluate its efficiency on real user inputs. The topic transitions can be used to analyze text coming from news and from user inputs as well. In contrast to the other methods developed in the course of this deliverable, the topic transition analysis is not applied to individual texts but to a set of texts. The usage of these methods has been discussed with the partners in the context of WP4 "Market Sentiment-based Financial Forecasting". Moreover, this methodology allows for a straightforward extension to observing the evolution of topics. As an additional tool, methodologies for the identification of potential trends, together with the reasons for those trends, are planned based on positive and negative association rules. These methodologies may be developed in the further months of Task 2.1 "Data Harvesting, Extraction and Assessment" (M1-M24).
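As an illustration of this usage pattern, a new contribution could be scored with a trained model and the metrics from Section 6.3 (a sketch assuming the illustrative `vectorizer`, `model`, and `topic_density_and_relatedness` defined in the earlier sketches; the thresholds are arbitrary placeholders, not calibrated values):

```python
# Hypothetical assessment of a user contribution (names and thresholds
# are illustrative; see the sketches in Sections 6.2 and 6.3).
new_text = ["ECB comments pushed the euro higher against the dollar"]
v = vectorizer.transform(new_text)   # vectorize with the trained vocabulary
w = model.transform(v)[0]            # topic weights of the contribution
D, R = topic_density_and_relatedness(w)
if D < 0.5 or R < 0.01:              # placeholder thresholds
    print("Contribution looks unfocused or weakly related to the platform topics.")
else:
    print(f"Focused contribution: density {D:.2f}, relatedness {R:.3f}")
```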
Conclusion

Deliverable D2.2 "Data and information streams - assessment tools" presents the results of preparing the assessment tools for the PROFIT project. In the discussions with the partners it was established that the only input that requires assessment is the textual input. Therefore, the assessment methods were developed for the analysis of purely textual information, based on the data acquired earlier and the background thesaurus. In the context of this activity the following was done:

1. Methods for preprocessing the textual input were developed. These methods include sentence boundary disambiguation, syllabification, and standard text vectorization methods.

2. Readability indices were investigated and implemented. Evaluation tests were carried out to identify the most reliable indices.

3. Text categorization methods were implemented. The results of categorization using different regularization techniques and thesaurus-based enrichment of the texts were investigated, and the relevance of the features was examined.

4. Topic modeling and topic discovery approaches were investigated and implemented. Two metrics based on the topic decomposition were introduced: topic density and topic relatedness. The behavior of these metrics was investigated with respect to the possible usage patterns. Topic transition methodologies were investigated, implemented, and tested.

5. The data extracted in the frame of D2.3 "Data crawlers, adaptors and extractors" (M12) was used for tests and calibration. This input data set was suggested and approved by the partners, who are experts in the field. The conclusions are based on the evaluation results obtained from this data set.

6. Usage patterns of the metrics were investigated and suggested. For each metric, an application scenario and the conditions of usage were identified.

References

[Aberdeen et al., 1995] Aberdeen, J., Burger, J., Day, D., Hirschman, L., Robinson, P., and Vilain, M. (1995). MITRE: Description of the Alembic system used for MUC-6. In Proceedings of the 6th Conference on Message Understanding, pages 141-155. Association for Computational Linguistics.

[Agirre and Soroa, 2009] Agirre, E. and Soroa, A. (2009). Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pages 33-41, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Aho and Ullman, 1992] Aho, A. V. and Ullman, J. D. (1992). Foundations of Computer Science. Computer Science Press, Inc.

[AlSumait et al., 2008] AlSumait, L., Barbará, D., and Domeniconi, C. (2008). On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In 2008 Eighth IEEE International Conference on Data Mining, pages 3-12. IEEE.

[Ben-Israel and Greville, 2003] Ben-Israel, A. and Greville, T. N. (2003). Generalized Inverses: Theory and Applications, volume 15. Springer Science & Business Media.

[Blei, 2012] Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77-84.

[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022.

[Borst and Neubert, 2009] Borst, T. and Neubert, J. (2009). Case study: Publishing STW Thesaurus for Economics as linked open data.
W3C Semantic Web Use Cases and Case Studies.

[Cao et al., 2007] Cao, B., Shen, D., Sun, J.-T., Wang, X., Yang, Q., and Chen, Z. (2007). Detect and track latent factors with online nonnegative matrix factorization. In IJCAI, volume 7, pages 2689-2694.

[Coleman and Liau, 1975] Coleman, M. and Liau, T. L. (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283.

[Dai and Liu, 2014] Dai, J. and Liu, X. (2014). Approach for text classification based on the similarity measurement between normal cloud models. The Scientific World Journal, 2014.

[Fan and Yang, 2003] Fan, L. and Yang, Y. (2003). A loss function analysis for classification methods in text categorization. In Proc. ICML, pages 472-479.

[Fan et al., 2008] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871-1874.

[Fitzsimmons et al., 2010] Fitzsimmons, P., Michael, B., Hulley, J., and Scott, G. (2010). A readability assessment of online Parkinson's disease information. The Journal of the Royal College of Physicians of Edinburgh, 40(4):292-296.

[Forman, 2003] Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(Mar):1289-1305.

[Gao and Church, 2005] Gao, Y. and Church, G. (2005). Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics, 21(21):3970-3975.

[Goodman et al., 2004] Goodman, J. et al. (2004). Exponential priors for maximum entropy models. In HLT-NAACL, pages 305-312. Citeseer.

[Hosmer Jr and Lemeshow, 2004] Hosmer Jr, D. W. and Lemeshow, S. (2004). Applied Logistic Regression. John Wiley & Sons.

[Ketchen and Shook, 1996] Ketchen, D. J. and Shook, C. L. (1996). The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17(6):441-458.

[Kincaid et al., 1975] Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., and Chissom, B. S. (1975). Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel. Technical report, DTIC Document.

[Kohavi et al., 1995] Kohavi, R. et al. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, volume 14, pages 1137-1145.

[Lerman, 1980] Lerman, P. (1980). Fitting segmented regression models by grid search. Applied Statistics, pages 77-84.

[Li and Vogel, 2010] Li, B. and Vogel, C. (2010). Improving multiclass text classification with error-correcting output coding and sub-class partitions. In Canadian Conference on Artificial Intelligence, pages 4-15. Springer.

[Luhn, 1957] Luhn, H. P. (1957). A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1(4):309-317.

[Mc Laughlin, 1969] Mc Laughlin, G. H. (1969). SMOG grading: a new readability formula. Journal of Reading, 12(8):639-646.

[Meng and Zhuang, 2007] Meng, Y. and Zhuang, H. (2007). Autonomous robot calibration using vision technology. Robotics and Computer-Integrated Manufacturing, 23(4):436-446.

[Mihalcea, 2010] Mihalcea, R. (2010). Word sense disambiguation. In Sammut, C.
and Webb, G., editors, Encyclopedia of Machine Learning, pages 1027-1030. Springer US.

[Mikheev, 2000] Mikheev, A. (2000). Tagging sentence boundaries. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 264-271. Association for Computational Linguistics.

[Ng, 2004] Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78. ACM.

[Nigam et al., 2000] Nigam, K., Mccallum, A. K., Thrun, S., and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103-134.

[Palmer and Hearst, 1997] Palmer, D. D. and Hearst, M. A. (1997). Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241-267.

[Panisson et al., 2014] Panisson, A., Gauvin, L., Quaggiotto, M., and Cattuto, C. (2014). Mining concurrent topical activity in microblog streams. arXiv preprint arXiv:1403.1403.

[Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

[Pritchard et al., 2000] Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2):945-959.

[Reynar and Ratnaparkhi, 1997] Reynar, J. C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16-19. Association for Computational Linguistics.

[Rifkin and Klautau, 2004] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5(Jan):101-141.

[Salton and McGill, 1986] Salton, G. and McGill, M. J. (1986). Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.

[Smith and Senter, 1967] Smith, E. A. and Senter, R. (1967). Automated Readability Index. AMRL-TR. Aerospace Medical Research Laboratories (6570th).

[Sparck Jones, 1972] Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21.

[Thorndike, 1953] Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4):267-276.

[Tomasi and Kanade, 1992] Tomasi, C. and Kanade, T. (1992). Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2):137-154.

[Tsoumakas and Katakis, 2006] Tsoumakas, G. and Katakis, I. (2006). Multi-label classification: An overview. Dept. of Informatics, Aristotle University of Thessaloniki, Greece.

[Vaca et al., 2014] Vaca, C. K., Mantrach, A., Jaimes, A., and Saerens, M. (2014). A time-based collective factorization for topic discovery and monitoring in news. In Proceedings of the 23rd International Conference on World Wide Web, pages 527-538. ACM.

[Yang and Joachims, 2008] Yang, Y. and Joachims, T. (2008). Text categorization. Scholarpedia, 3(5):4242,
revision #91858.

[Yarowsky, 1995] Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196. Association for Computational Linguistics.

[Zhang and Oles, 2001] Zhang, T. and Oles, F. J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5-31.