H2020 – 687895 DATA AND INFORMATION

HORIZON 2020
ICT - INFORMATION AND COMMUNICATION TECHNOLOGIES
PROMOTING FINANCIAL AWARENESS AND STABILITY
H2020 – 687895
DATA AND INFORMATION STREAMS - ASSESSMENT TOOLS
WORK PACKAGE NO.: WP2
WORK PACKAGE TITLE: LINKED DATA LIFE CYCLE
TASK NO.: T2.1
TASK TITLE: DATA HARVESTING, EXTRACTION AND ASSESSMENT
MILESTONE NO.: 2
ORGANIZATION NAME OF LEAD CONTRACTOR FOR THIS DELIVERABLE: SWC
EDITOR: ARTEM REVENKO (SWC)
CONTRIBUTORS: HEIDELINDE HOBEL (SWC), IOANNIS PRAGIDIS, EIRINI KARAPISTOLI (DUTH), GEORGE PANOS, CHRISTOFOROS BOUZANIS (UOG), ANNA SATSIOU, IOANNIS KOMPATSIARIS (CERTH)
REVIEWERS: ANTONIS SARIGIANNIDIS (DUTH), PETER HANECAK (EEA)
STATUS (F: FINAL; D: DRAFT): F
NATURE: R – REPORT
DISSEMINATION LEVEL: PU – PUBLIC
PROJECT START DATE AND DURATION: JANUARY 2016, 36 MONTHS
DUE DATE OF DELIVERABLE: SEPTEMBER 30, 2016
ACTUAL SUBMISSION DATE: SEPTEMBER 30, 2016
REVISION HISTORY

VERSION | DATE | MODIFIED BY | CHANGES
0.1 | 11-07-2016 | A. REVENKO (SWC) | FIRST VERSION OF TOC
0.2 | 22-07-2016 | H. HOBEL (SWC) | DRAFTED EXECUTIVE SUMMARY, INTRODUCTION, CONCLUSION
0.2.1 | 01-08-2016 | A. REVENKO (SWC) | TOC REFINED AND FINALIZED
0.3 | 10-08-2016 | A. REVENKO (SWC) | ADDED SECTION 2
0.4 | 16-08-2016 | A. REVENKO (SWC) | ADDED SECTION 3
0.4.1 | 17-08-2016 | A. REVENKO (SWC) | I. PRAGIDIS, E. KARAPISTOLI (DUTH), G. PANOS, C. BOUZANIS (UOG), I. KOMPATSIARIS, A. SATSIOU (CERTH) PROVIDED FEEDBACK ON DATA ASSESSMENT TOOLS
0.5 | 19-08-2016 | A. REVENKO (SWC) | ADDED SECTION 4
0.6 | 23-08-2016 | A. REVENKO (SWC) | ADDED SECTION 5
0.7 | 26-08-2016 | A. REVENKO (SWC) | ADDED SECTION 6
0.7.1 | 29-08-2016 | A. REVENKO (SWC) | FINALIZED EXECUTIVE SUMMARY, INTRODUCTION, CONCLUSION
0.8 | 02-09-2016 | A. REVENKO (SWC) | FINAL OFFICIAL PRELIMINARY DRAFT VERSION OF D2.2
0.9 | 22-09-2016 | A. REVENKO (SWC) | REVISION ACCORDING TO COMMENTS FROM THE REVIEWERS
1.0 | 29-09-2016 | A. REVENKO (SWC) | FINAL REVIEW, PROOFING AND QUALITY CONTROL. READY FOR SUBMISSION
List of Abbreviations
ARI    Automated Readability Index
CLI    Coleman-Liau Index
FKGL   Flesch–Kincaid Grade Level Formula
IDF    Inverse Document Frequency
LDA    Latent Dirichlet Allocation
ME     Maximum Entropy
NMF    Non-negative Matrix Factorization
PLSA   Probabilistic Latent Semantic Analysis
POS    Part of Speech
SBD    Sentence Boundary Disambiguation
SMOG   Simple Measure of Gobbledygook
STW    Standard Thesaurus Wirtschaft
TF     Term Frequency
WSJ    Wall Street Journal
List of Figures

1  Reconstruction error and density of topics with plain annotations
2  Reconstruction error and density of topics with enriched annotations
3  Topic density and relatedness on plain training data
4  Topic density and relatedness on enriched training data
5  Topic density and relatedness on plain test data
6  Topic density and relatedness on enriched test data
7  Topic transition matrix using pseudo-inverse matrix
8  Topic transition matrix using optimization algorithms
List of Tables

1   Precision, recall, and f1-measure for all items
2   Average and standard deviation of precision, recall, and f1-measure for all items
3   Average and standard deviation of precision, recall, and f1-measure for 14 items
4   Textual statistics
5   Statistics on the number of extracted concepts
6   Interpretations of the readability scores
7   FKGL scores per category
8   ARI scores per category
9   CLI scores per category
10  SMOG scores per category
11  Accuracy scores
12  Confusion matrix
13  Top 15 most important concepts per category for plain data
14  Top 15 most important concepts per category for enriched data
15  5 Topics of weeks 0 and 1 of year 2015
16  5 Topics of weeks 2 and 3 of year 2015
Table of Contents

Executive Summary

1 Introduction
  1.1 Scope of the Document
  1.2 Relation to PROFIT Project
  1.3 Goals
  1.4 Notes on Concept Extraction
  1.5 Note on Implementation

2 Preprocessing
  2.1 Sentence Boundary Disambiguation
    2.1.1 Introduction
    2.1.2 Implemented Method
    2.1.3 Evaluation Results
  2.2 Counting Syllables
  2.3 Vectorization
    2.3.1 Introduction
    2.3.2 Term Frequency – Inverse Document Frequency

3 Annotated Articles
  3.1 Introduction
  3.2 Text Statistics
  3.3 Feature Statistics

4 Readability Indices
  4.1 Introduction
  4.2 Flesch-Kincaid Grade Level
    4.2.1 Evaluation Results
  4.3 Automated Readability Index
    4.3.1 Evaluation Results
  4.4 Coleman-Liau Index
    4.4.1 Evaluation Results
  4.5 Simple Measure of Gobbledygook
    4.5.1 Evaluation Results
  4.6 Usage

5 Text Categorization
  5.1 Introduction
  5.2 Classifiers
  5.3 Evaluation Result
  5.4 Usage

6 Topic Discovery
  6.1 Introduction
  6.2 Factorization and Topics
    6.2.1 Evaluation Results
  6.3 Topical Density and Relatedness
    6.3.1 Evaluation Results
  6.4 Topic Transition
    6.4.1 Evaluation Results
  6.5 Usage

Conclusion
Executive Summary
The deliverable presents the results of preparing tools for assessment of the information
flowing into the platform. It consists of the following results achieved in the first nine
months of the project:
1. Analysis of possible data and information streams;
2. Analysis, implementation, and test results of the text preprocessing techniques;
3. Analysis of the available techniques and approaches for assessing the possible data
and information;
4. Implementation and testing of the suitable methods.
The main contributions of the deliverable may be found in:
Section 2: analysis of preprocessing techniques;
Section 4: analysis of readability metrics;
Section 5: analysis of the categorization task;
Section 6: analysis of topic modeling.
Each section contains an introduction, a discussion of the methods, test results, and a
description of usage patterns.
1 Introduction
1.1 Scope of the Document
The document describes the output of the preparation of the assessment tools for the
data input to the PROFIT project platform. Based on discussions with the partners
of the consortium, the main input streams were identified to be textual. Based
on this outcome, methods and techniques for assessing textual input were identified,
investigated, implemented, and tested. The data obtained in the course of D2.3 “Data
crawlers, adaptors and extractors” (M12) was reused. The data was suggested by
experts in the field and represents a high quality collection of news articles relevant to
the field of financial economics.
The document consists of a detailed description of the assessment methods together with
the test results on the described data. Based on the test results the useful variants of
the methods were identified and calibrated. Moreover, usage patterns of each method
are described.
Some of the developed methods were also identified as useful for WP4 “Market
Sentiment-based Financial Forecasting”.
1.2 Relation to PROFIT Project
An investigation of tools for assessing the input was carried out within the frame of this deliverable. In the discussion with the partners it was identified that the numerical and factual data will be obtained from trusted sources like Eurostat1 and/or OPEC2, therefore no assessment of such data is required. The only input that requires assessment is expected to be the textual input. Most of the unprocessed information is going to be represented at least partially in textual form, including news articles and educational materials. The interaction with the users is considered a very important part of the project; interaction between the users and between users and the platform will also take textual form. Assessing the textual data is of great importance in order to identify potentially malicious, difficult, provocative, or otherwise significant input.

Moreover, some of the work in this deliverable, such as topic discovery, may be reused in the project within the frame of Work Package 4 “Market Sentiment-based Financial Forecasting”. Topic discovery is capable of identifying new emerging topical trends and the evolution of existing ones, hence it may be used to identify the subjects of sentiment analysis.
1 http://ec.europa.eu/eurostat
2 http://www.opec.org/opec_web/en/
1.3 Goals
The goal of this deliverable is to prepare the methodological framework and to implement
tools for assessing (textual) input. Those tools would facilitate the identification of
input requiring additional attention and the processing of such input by the moderators.
Moreover, the tools should be useful for gaining insights for further analysis of
the input from users and from the news articles. Hence, the assessment includes
methods for assessing the quality of the text, the relation to the field of interest, the
distribution of the topics (with respect to prior topics extracted from the already processed data),
and the identification of potentially new trends and dependencies.
1.4 Notes on Concept Extraction
For all the assessment methods except the purely textual analysis, concept extraction
was used. The extracted concepts are used as a better alternative to plain keywords for
characterizing the documents. As the concepts are chosen and curated by
experts in advance, they represent essential background knowledge about the field
of interest. Namely, the concepts identify the main elements of the text that an expert
would focus attention on. Moreover, since the concepts are taken from a thesaurus,
the semantic relations between concepts are known and may be used for even deeper
analysis.
For performing the work described in this deliverable the STW Economics thesaurus
was used [Borst and Neubert, 2009]. The extraction was performed using PoolParty3.
For more information see Deliverable 2.1 “PROFIT core knowledge model” delivered in
Month 6.
1.5 Note on Implementation
The described methods were implemented in Python using the scikit-learn library
[Pedregosa et al., 2011].
3 poolparty.biz
2 Preprocessing
2.1 Sentence Boundary Disambiguation
2.1.1 Introduction
Sentence boundary disambiguation is the task of identifying the individual sentences
within a text. Because the sentence is the basic textual unit immediately above the
word and phrase, Sentence Boundary Disambiguation (SBD) is one of the most essential
problems for many applications of Natural Language Processing – Parsing, Information
Extraction, Machine Translation, and Document Summarization.
The SBD problem is not always simple. Usually a sentence ends with a terminal punctuation mark, such as ., ?, or !. However, a period can be associated with an abbreviation,
such as Mr., or represent a decimal point in a number like $12.58. In these cases, it is a
part of an abbreviation or a number; we cannot delimit a sentence because the period
has a different meaning here. On the other hand, the trailing period of an abbreviation
can also represent the end of a sentence at the same time. In most such cases, the
word following this period is a capitalized common word (e.g., “The President lives in
Washington D.C. He likes that place.”).
The original SBD systems were built from manually generated rules in the form of regular
expressions for grammar, which is augmented by a list of abbreviations, common words,
proper names, etc. For example, the Alembic system [Aberdeen et al., 1995] deploys
over 100 regular-expression rules written in Flex. Such a system may work well on the
language or corpus for which the system was initially designed. Nevertheless, developing
and maintaining an accurate rule-based system requires substantial hand coding effort
and domain knowledge. Another drawback of this kind of system is that it is difficult
to port an existing system to other languages. Among the advantages of such systems one
may note that they are easy to develop, and they do not require any annotated
data for training.
The current research activity in SBD focuses on employing machine learning techniques,
which treat the SBD task as a standard classification problem. The general principles
of these systems are: training the system on a training set (usually annotated) to make
the system “remember” the features of the local context around the sentence-breaking
punctuation or global information on the list of abbreviations and proper names, and
then recognize the real text sentences using this trained system. Those systems have the
following drawbacks:
• they require more effort for development than rule-based systems,
• they demand annotated data for training,
• they lack transparency (compared to rule-based systems).
[Palmer and Hearst, 1997] developed a system, called SATZ, to classify the potential sentence boundary by using the local syntactic context. To obtain the syntactic information
for local context, SATZ needs the words in the context to be tagged with part-of-speech
(POS). An additional drawback of using POS tagging is that the POS data is language
specific, hence no straight-forward extension to new languages is possible. The authors
reported a performance of around 1.0% in terms of error rate on Wall Street Journal
(WSJ) data.
In order to solve the problems encountered by the SATZ system, [Mikheev, 2000] proposed a method that segments a sentence into smaller sections. The authors claimed
only a 0.25% error rate on the Brown corpus and a 0.39% error rate on the WSJ corpus.
There are known approaches to SBD without using POS tags. [Reynar and Ratnaparkhi, 1997] presented a solution based on a Maximum Entropy (ME) model for the SBD problem. The model can attain an accuracy of 98.8% on the WSJ data, a quite good performance given that the model is simple and the data feature selection is quite flexible.
Although the systems reported by their authors achieve a strong performance of around 1% in terms of error rate, one can expect a somewhat worse performance on a general corpus that was not used for tuning the system. In such a setting, state of the art systems may achieve an accuracy of 95% and above.
2.1.2 Implemented Method
The implemented method is based on the manual rules for finding the sentence boundaries. The following advantages of the rule-based method are considered:
• No annotated data is needed;
• No user interaction is required;
• Easy to implement and use.
The results of SBD are going to be used in assessing the readability of a text. For this
purpose, error rates of up to 5-10% are acceptable. Therefore, in the implementation
special attention was paid to ensure that the method is robust, easy, and transparent.
In this direction, the number of rules is kept small even at the cost of some accuracy
decrease.
The following regular expression [Aho and Ullman, 1992, Chapter 10] was used to find
candidates for sentence boundaries:
(?<=[.?!\]\n]) ([\"\']?[\s|\r\n]+[\"\']?) (?=(\([a-z])|(\(?[A-Z]))    (1)
The first part starting with (?<= is a so-called positive look-behind and is responsible for
matching the characters that precede a sentence boundary. The expression specifies that
one of the .?!] symbols or a new line should be present before the sentence boundary.
The second part specifies the boundary itself. In this part we define that the sentence
boundary is either a space or a new line and may be surrounded by single or double quotes
on both sides.
The third part is responsible for the characters that follow the boundary and is called the positive look-ahead.
In this part it is specified that the sentence boundary is followed by either an opening
bracket and a lower case character or an optional opening bracket and an upper case
character.
Although the specified regular expression is able to find most of the sentence boundaries
it also matches a lot of cases which are not sentence boundaries. In particular, all the
abbreviations, names, special words like “etc.” are going to trigger the matching. In
order to avoid this a list of exceptions is adopted. A short list containing 4 items was
used:
1. “Mr”,
2. “Mrs”,
3. “etc”,
4. all the single upper case characters.
Although user interaction is not required, the applied method is flexible enough to be
extended. For instance, it is possible to add new rules for detecting boundaries and
non-boundaries. The first and the third parts of the regular expression (1) are actually
implemented as lists and can be extended by further entries. Moreover, in order to
capture a possible dependency between the first and third parts one can easily add more
regular expressions like (1). The list of exceptions can be extended at any time as well.
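As an illustration, a minimal Python sketch of such a rule-based splitter is given below. It combines the regular expression (1) with the exception list; the function name split_sentences and the way the preceding token is checked against the exceptions are assumptions of this sketch, not details of the project code.

import re

# Candidate boundaries follow expression (1): a look-behind for the characters that may
# end a sentence, the boundary itself (whitespace, optionally wrapped in quotes), and a
# look-ahead for the characters that may start the next sentence.
BOUNDARY = re.compile(
    r'(?<=[.?!\]\n])'            # look-behind: ., ?, !, ] or a new line
    r'["\']?[\s\r\n]+["\']?'     # the boundary: whitespace with optional quotes
    r'(?=\([a-z]|\(?[A-Z])'      # look-ahead: "(" plus lower case, or optional "(" plus upper case
)

# Exception list from the text; single upper-case letters cover initials such as the "W."
# in "George W. Bush".
EXCEPTIONS = {"Mr", "Mrs", "etc"} | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def split_sentences(text):
    """Split text at candidate boundaries unless the preceding token is an exception."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        tokens = text[start:match.start()].rstrip('.?!]"\'').split()
        if tokens and tokens[-1] in EXCEPTIONS:
            continue  # e.g. "Mr." or a single initial: not a sentence boundary
        sentences.append(text[start:match.start()].strip())
        start = match.end()
    sentences.append(text[start:].strip())
    return [s for s in sentences if s]

print(split_sentences('The President lives in Washington D.C. He likes that place. Mr. Smith agrees.'))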
2.1.3 Evaluation Results
The implemented procedure was tested on the Project Gutenberg4 corpus as provided
by NLTK5. The corpus consists of 18 items:
1. “Emma” by Jane Austen,
2. “Persuasion” by Jane Austen,
3. “Sense and Sensibility” by Jane Austen,
4. Bible,
5. Poems by William Blake,
6. Stories by Sara Cone Bryant,
7. “The Adventures of Buster Bear” by Thornton Waldo Burgess,
8. “Alice in Wonderland” by Lewis Carroll,
9. “The Ball and the Cross” by Gilbert Keith Chesterton,
10. “The Innocence of Father Brown” by Gilbert Keith Chesterton,
11. “The Man Who Was Thursday” by Gilbert Keith Chesterton,
12. “The Parent’s Assistant” by Maria Edgeworth,
13. “Moby-Dick; or, The Whale” by Herman Melville,
14. “Paradise Lost” by John Milton,
15. “The Tragedy of Julius Caesar” by William Shakespeare,
16. “The Tragedy of Hamlet, Prince of Denmark” by William Shakespeare,
17. “The Tragedy of Macbeth” by William Shakespeare,
18. “Leaves of Grass” by Walt Whitman.

4 http://www.gutenberg.org/
5 http://www.nltk.org/nltk_data/

Table 1: Precision, recall, and f1-measure for all items

Item | Precision | Recall | f1-measure
1  | 0.92 | 0.85 | 0.88
2  | 0.96 | 0.91 | 0.93
3  | 0.94 | 0.89 | 0.91
4  | 0.10 | 0.01 | 0.02
5  | 0.17 | 0.40 | 0.24
6  | 0.83 | 0.82 | 0.82
7  | 0.91 | 0.87 | 0.89
8  | 0.74 | 0.65 | 0.69
9  | 0.91 | 0.87 | 0.89
10 | 0.94 | 0.90 | 0.92
11 | 0.90 | 0.86 | 0.88
12 | 0.86 | 0.77 | 0.81
13 | 0.89 | 0.80 | 0.84
14 | 0.85 | 0.78 | 0.81
15 | 0.97 | 0.94 | 0.95
16 | 0.98 | 0.97 | 0.98
17 | 0.97 | 0.92 | 0.94
18 | 0.62 | 0.46 | 0.53
The correct partitioning of the texts into sentences is known for the corpus. The results
for each individual item are presented in terms of precision, recall, and f1-measure
[Forman, 2003] in Table 1. The averages and standard deviations can be found in Table 2.
Table 2: Average and standard deviation of precision, recall, and f1-measure for all items

                   | Precision | Recall | F1 measure
Average            | 0.80      | 0.76   | 0.78
Standard deviation | 0.25      | 0.24   | 0.25

4 items from the list are significant outliers in our analysis:
• item 4, the Bible, is written in a special style which we do not expect as input,
• items 5 and 18 are poems, hence much more difficult to process,
• item 8 is written in a fancy way and is also difficult to process.
The scores for these items are significantly lower than for the other entries because of
the very special styles used in these corpora. After excluding the 4 special cases listed
above and considering the remaining 14 items we obtain much better scores, as depicted in Table 3.

Table 3: Average and standard deviation of precision, recall, and f1-measure for 14 items

                   | Precision | Recall | F1 measure
Average            | 0.92      | 0.89   | 0.89
Standard deviation | 0.05      | 0.04   | 0.05
Although we only achieve an accuracy of about 90%, which is at least 5% below the
state of the art systems, we may be satisfied with the implemented procedure because
this accuracy is sufficient for our purposes and the approach is kept simple.
2.2 Counting Syllables
The complexity of counting syllables in the words depends on the language used. For
some languages, like Finnish, each word can be divided into syllables using only general
rules. However, possibly due to the weak correspondence between sounds and letters
in the spelling of modern English, written syllabification in English is based mostly on
etymological or morphological principles instead of phonetic principles 6 .
In the implementation a simple syllabification algorithm was used. The algorithm counts
each occurrence of one or more consecutive vowels as a syllable. Only one exception
is made: if a word ends in a vowel, a consonant, and “e”, then this “e” is not counted as a
separate syllable.
6 https://en.wikipedia.org/wiki/Syllabification
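A minimal sketch of such a syllable counter is shown below; treating 'y' as a vowel and the function name count_syllables are assumptions made for illustration, not details taken from the project implementation.

import re

VOWELS = "aeiouy"  # treating 'y' as a vowel is an assumption of this sketch

def count_syllables(word):
    """Count each run of consecutive vowels as one syllable; a trailing
    vowel-consonant-'e' pattern (silent 'e') does not count as an extra syllable."""
    word = word.lower()
    count = len(re.findall(r"[%s]+" % VOWELS, word))
    if re.search(r"[%s][^%s]e$" % (VOWELS, VOWELS), word) and count > 1:
        count -= 1  # the exception described above: drop the final silent 'e'
    return max(count, 1)

for example in ("rate", "economics", "oil"):
    print(example, count_syllables(example))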
2.3 Vectorization
2.3.1 Introduction
In order to use numerical and discrete algorithms on textual data it is necessary to have
a suitable representation of the textual data. A common approach used in many applications is to represent the text as a vector in some feature space [Agirre and Soroa, 2009,
Mihalcea, 2010, Yarowsky, 1995]. In the project setting we have valuable background
data which we are free to use, namely the thesauri. The concepts in the thesauri are
chosen by experts and represent valuable instances in the field of study. Those concepts
are used as features in the text representations.
As a result, each text is represented as a vector with numerical entries. Each entry is a
non-negative number indicating the degree to which the document can be described by
the respective concept.
2.3.2 Term Frequency – Inverse Document Frequency
The degree to which each document can be described by a concept can be computed in
multiple ways. A very commonly used technique for weighting different
terms in the vectorization process is term frequency – inverse document frequency
[Salton and McGill, 1986]. The first part – term frequency (TF) – is the number of occurrences of the term in the document. The more often the term occurs, the better it
represents the document [Luhn, 1957]. The second part – inverse document frequency
(IDF) – is the inverse of the number of documents that contain the term. The specificity
of a term can be quantified as an inverse function of the number of documents in which
it occurs [Sparck Jones, 1972].

There exist multiple variations of the TF-IDF weighting. In the assessment tools the
straightforward version was used: tf ∗ idf.
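A minimal sketch of this vectorization step using scikit-learn (the library named in Section 1.5) is shown below. The concept labels and the configuration choices (whitespace-separated concept tokens, no normalization) are illustrative assumptions; note also that scikit-learn applies a smoothed variant of the plain tf ∗ idf weighting by default.

from sklearn.feature_extraction.text import TfidfVectorizer

# Each document is represented by the concepts extracted from it; multi-word
# concept labels are joined with underscores so that each concept stays one feature.
documents_as_concepts = [
    "Oil_price Petroleum Stock_exchange",
    "Euro Monetary_union Stock_exchange",
    "Oil_price Euro Forecast",
]

# A whitespace analyzer keeps the concept tokens intact; norm=None mirrors the
# remark later in the document that no normalization is applied after TF-IDF.
vectorizer = TfidfVectorizer(analyzer=str.split, norm=None)
concept_matrix = vectorizer.fit_transform(documents_as_concepts)

print(vectorizer.get_feature_names_out())
print(concept_matrix.toarray())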
3 Annotated Articles
3.1 Introduction
As a part of Deliverable 2.3 “Data crawlers, adapters and extractors” due in Month 12 a
crawler for collecting the data from the website investing.com was developed and used.
This source was provided by the partners from DUTH as a source of quality financial
news articles. The news articles were collected for three categories:
Euro / USD exchange rate: 19119 articles,
Eurostoxx50: 5834 articles,
Crude oil: 14209 articles.
We use this set of data as a basis for the evaluation of the designed assessment tools.
Since the articles are approved by the experts we do not doubt their quality. Therefore,
the articles represent a good testbed for calibrating and benchmarking the procedures.
In the following subsections we analyze the articles in order to gain insights into
the obtained data.
3.2 Text Statistics
We start the analysis of the articles from the purely textual analysis, namely, counting
the words, unique words, polysyllables (words with 3 or more syllables), etc. The results
of this analysis are represented in Table 4. Each entry in the table is formed as “average
± standard deviation”.
From the table we observe that though the deviation of the length of articles counted in
words is large, the articles about crude oil are about 1.5 times longer than the articles
about euro / usd exchange rate. Moreover, the articles about eurostoxx are on average
100 words shorter than the articles about crude oil. Both use about the same number
of unique words on average, meaning that the ratio of unique words is higher in the
eurostoxx news. In terms of word length we observed that the value remains more or
less the same across categories and is about 1.5 syllables per word. Articles about euro
/ usd exchange rate tend to contain fewer long words (polysyllables), though they are also
shorter on average. On average each sentence contains about 23 words in all categories.
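The sketch below shows how the per-article quantities reported in Table 4 could be computed, reusing the sentence splitter and syllable counter sketched in Section 2; the word tokenization by a simple regular expression and the function name are assumptions of this sketch.

import re
from statistics import mean

def text_statistics(text, split_sentences, count_syllables):
    """Per-article counts as in Table 4; the two helpers are the functions
    sketched in Sections 2.1.2 and 2.2."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = split_sentences(text)
    syllables = [count_syllables(word) for word in words]
    return {
        "words": len(words),
        "unique words": len({word.lower() for word in words}),
        "polysyllables": sum(1 for s in syllables if s >= 3),  # words with 3 or more syllables
        "syllables per word": mean(syllables) if words else 0.0,
        "words per sentence": len(words) / max(len(sentences), 1),
    }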
Table 4: Textual statistics

                   | euro / usd  | eustoxx     | crude oil   | all
Words              | 310 ± 167   | 380 ± 109   | 436 ± 184   | 367 ± 176
Unique words       | 185 ± 84    | 252 ± 67    | 260 ± 102   | 223 ± 96
Polysyllables      | 47 ± 28     | 66 ± 20     | 66 ± 37     | 57 ± 32
Syllables per word | 1.47 ± 0.15 | 1.53 ± 0.09 | 1.48 ± 0.19 | 1.48 ± 0.16
Words per sentence | 23 ± 8      | 24 ± 3      | 24 ± 6      | 23 ± 7

3.3 Feature Statistics

All the fetched articles were sent to the PoolParty extractor (see Deliverable 2.1
“PROFIT core knowledge model” delivered in Month 6 and https://www.poolparty.biz/poolparty-extractor/) in order to run the extraction procedure, i.e., to find all
the relevant concepts in the articles. In the course of Task 2.2 “Semantic Data Modeling, Linking and Enrichment” the STW Economics thesaurus was slightly modified by the
experts; about 300 concepts were added. This modified thesaurus was used in the extraction process. The results of the extraction and vectorization using the original STW
Economics thesaurus should resemble the results reported in this deliverable.
We take advantage of having the thesaurus and create a second vectorization of the
articles. In the second vectorization we enrich the data using the hierarchical relations,
i.e. we add all the broader concepts for the extracted concepts. In this vectorization
many top level concepts occur very often; for example, such general concepts as “Economics” will occur in almost every article because “Economics” would be a broader
concept for at least one concept extracted from the article. Therefore, pre-filtering of
the features is desirable; pre-filtering based on the frequency of the concepts eliminates
a-priori irrelevant concepts and speeds up the process of learning. However, even after
pre-filtering the enriched vectorization contains a lot of irrelevant features; moreover,
obviously, many features are highly dependent on each other since the tight semantic
relation between broader and narrower concepts holds between features by design. This fact is taken into
account when performing the relevant tasks.
Table 5 presents the number of concepts and the frequency of their occurrences as well.
The density of the data is the ratio of the non-zero values in the matrix to the total
number of entries in the matrix. The frequency rows in the table contain the information
about the number of concepts occurring in at least or at most the specified fraction of documents. For
example, there are 10 concepts occurring in more than 50% of documents for “plain”
vectorization and 100 such concepts for the enriched vectorization.
As expected the density of the enriched representation is much higher. This is due to
the fact that more general concepts are taken into account in the enriched representation,
hence they appear in the extraction each time any of their narrower concepts is found
in the article.
The total number of articles is 39150. Hence 0.01% of the articles is about 4 articles.
The rare features occurring in fewer than 0.01% of the documents may be a source of overfitting
and are often discarded during the pre-filtering. After removing the rare concepts from
the enriched data the broader concepts are preserved. Hence, we may consider the
elimination of the rare concepts as a generalization.
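A sketch of how the quantities in Table 5 and the pre-filtering of rare concepts could be computed from a sparse document-by-concept matrix (such as the one produced by the vectorization of Section 2.3) is given below; the function names and the exact thresholds are illustrative assumptions.

import numpy as np

def concept_statistics(X):
    """Density and document-frequency counts for a sparse document-by-concept matrix X."""
    n_docs, n_concepts = X.shape
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel() / n_docs  # fraction of documents per concept
    return {
        "number of concepts": n_concepts,
        "density": X.nnz / float(n_docs * n_concepts),
        "frequency > 50%": int((doc_freq > 0.50).sum()),
        "frequency > 33%": int((doc_freq > 0.33).sum()),
        "frequency < 0.1%": int((doc_freq < 0.001).sum()),
        "frequency < 0.01%": int((doc_freq < 0.0001).sum()),
    }

def drop_rare_concepts(X, min_doc_fraction=0.0001):
    """Pre-filtering: remove concepts occurring in fewer than the given fraction of documents."""
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel() / X.shape[0]
    keep = doc_freq >= min_doc_fraction
    return X[:, keep]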
Table 5: Statistics on the number of extracted concepts

                   | Plain | Enriched
Number of concepts | 2498  | 4185
Density            | 0.018 | 0.055
Frequency > 50%    | 10    | 100
Frequency > 33%    | 30    | 210
Frequency < 0.1%   | 1512  | 2960
Frequency < 0.01%  | 722   | 2078
After the extraction of concepts the TF-IDF processing is used7 , see Section 2.3.2. The
vectorized data is used in the subsequent sections of this document for testing and
calibration purposes.
7 No normalization is done after TF-IDF, i.e. the sum of the scores for each article can take any value. Some classifiers, for example Support Vector Machines, are known to work much better on normalized data [Zhang and Oles, 2001]. However, for the logistic regression the normalization is not essential.
4 Readability Indices

4.1 Introduction

Readability tests, readability formulae, or readability metrics are formulae for evaluating
the readability of text, usually by counting syllables, words, and sentences. Readability
tests are often used as an alternative to conducting an actual statistical survey of human
readers of the subject text (a readability survey). Word processing applications often
have readability tests built in, which can be applied to documents during editing.

The application of a useful readability test protocol will offer a rigorous indication
of a text’s readability. The accuracy can be further improved when finding the average
readability of a large number of works. The tests generate a score based on characteristics
such as statistical average word length (used as an unreliable proxy for semantic
difficulty) and sentence length (as an unreliable proxy for syntactic complexity) of the
work. Although different formulae for assessing the readability are used in this work,
they all follow the same interpretation pattern: the obtained score predicts the number
of years of study required to easily comprehend the text. Table 6 provides helpful details
in this direction.

Table 6: Interpretations of the readability scores

Score | Grade Level
1     | Kindergarten
2     | First Grade
...   | ...
9     | Eighth grade
10    | High school freshman
11    | High school sophomore
12    | High school junior
13    | High school senior
14    | College freshman
15    | College sophomore
16    | College junior
17    | College senior
18    | College graduate
4.2 Flesch-Kincaid Grade Level
The Flesch–Kincaid readability test is a readability test designed to indicate how difficult
a reading passage in English is to understand. For this purpose word length and sentence
length are used [Kincaid et al., 1975].
Table 7: FKGL scores per category

Category   | Average | Standard Deviation
eustoxx    | 11.8    | 1.6
crude oil  | 11.2    | 3.2
euro / usd | 10.6    | 4.1
all        | 11.0    | 3.5
The readability test is used extensively in the field of education. The “Flesch–Kincaid
Grade Level Formula” (FKGL) presents a score as a U.S. grade level, making it easier for
teachers, parents, librarians, and others to judge the readability level of various books
and texts. It can also mean the number of years of education generally required to
understand this text, relevant when the formula results in a number greater than 10.
The grade level is calculated with the following formula:
0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59
The result is a number that corresponds to a U.S. grade level. For instance, the sentence
“The Australian platypus is seemingly a hybrid of a mammal and reptilian creature” gets
a score of 13.1 as it has 26 syllables and 13 words. The grade level formula emphasizes
the sentence length over the word length. By creating one-word strings with hundreds
of random characters, grade levels may be attained that are hundreds of times larger
than high school completion in the United States. Due to the formula’s construction the
score does not have an upper bound.
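A small sketch of the formula in code, checked against the platypus example above (13 words, 1 sentence, 26 syllables); the function name is illustrative.

def flesch_kincaid_grade_level(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level as defined by the formula above."""
    return (0.39 * total_words / total_sentences
            + 11.8 * total_syllables / total_words
            - 15.59)

# The example sentence from the text: 13 words, 1 sentence, 26 syllables -> about 13.1.
print(round(flesch_kincaid_grade_level(13, 1, 26), 1))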
The lowest grade level score in theory is −3.40, but there are few real passages in which
every sentence consists of a single one-syllable word. Green Eggs and Ham by Dr. Seuss
comes close, averaging 5.7 words per sentence and 1.02 syllables per word, with a grade
level of −1.3.8
4.2.1 Evaluation Results
The averages and standard deviations of the FKGL score on the annotated corpora,
described in Section 3, are presented in Table 7.
4.3 Automated Readability Index
The Automated Readability Index (ARI) is a readability test for English texts, designed
to measure the understandability of a text. Like the FKGL, SMOG index, and CLI,
it produces an approximate representation of the US grade level needed to comprehend
the text9.

8 https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests

Table 8: ARI scores per category

Category   | Average | Standard Deviation
eustoxx    | 14.0    | 1.6
crude oil  | 13.4    | 3.4
euro / usd | 12.6    | 4.5
all        | 13.1    | 3.8
The formula for calculating the automated readability index is given below:
4.71 × (total characters / total words) + 0.5 × (total words / total sentences) − 21.43
Unlike other indices, the ARI, along with the CLI, relies on a factor of characters per
word, instead of the usual syllables per word. Although opinion varies on its accuracy
as compared to the syllables/word and complex words indices, characters/word is often
faster to calculate, as the number of characters is more readily and accurately counted
by computer programs than syllables. In fact, this index was designed for real-time
monitoring of readability on electric typewriters [Smith and Senter, 1967].
4.3.1 Evaluation Results
The averages and standard deviations of the ARI score on the annotated corpora described in Section 3 are presented in Table 8.
4.4 Coleman-Liau Index
The Coleman–Liau index (CLI) is a readability test designed by Meri Coleman and T.
L. Liau to gauge the understandability of a text [Coleman and Liau, 1975]. Like the
FKGL, Gunning fog index, SMOG index, and ARI, its output approximates the US
grade level thought necessary to comprehend the text10.
Like the ARI but unlike most of the other indices, CLI relies on characters instead of
syllables per word. Although opinion varies on its accuracy as compared to the syllable
word and complex word indices, characters are more readily and accurately counted by
computer programs than are syllables.
The CLI was designed to be easily calculated mechanically from samples of hard-copy text.
Unlike syllable-based readability indices, it does not entail knowledge of the character
content of words; the character count alone is enough.
Therefore, it could be used in conjunction with theoretically simple mechanical scanners
that would only need to recognize character, word, and sentence boundaries. Hence,
full optical character recognition or manual keypunching is not required.

9 https://en.wikipedia.org/wiki/Automated_readability_index
10 https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index

Table 9: CLI scores per category

Category   | Average | Standard Deviation
eustoxx    | 12.3    | 1.0
crude oil  | 11.4    | 2.2
euro / usd | 11.0    | 1.7
all        | 11.3    | 1.9
The CLI is calculated with the following formula:
5.88 × (total characters / total words) − 29.6 × (total sentences / total words) − 15.8
4.4.1 Evaluation Results
The averages and standard deviations of the CLI score on the annotated corpora described in Section 3 are presented in Table 9.
4.5 Simple Measure of Gobbledygook
The SMOG grade is a measure of readability that estimates the years of education needed
to understand a piece of writing. SMOG is the acronym derived from Simple Measure of
Gobbledygook. It is widely used, particularly for checking health messages. The SMOG
grade yields a 0.985 correlation with a standard error of 1.5159 grades with the grades
of readers who had 100% comprehension of test materials. [Mc Laughlin, 1969] 11
The formula for calculating the SMOG grade was developed by G. Harry McLaughlin
as a more accurate and more easily calculated substitute for the Gunning fog index and
published in 1969.
A 2010 study [Fitzsimmons et al., 2010] published in the Journal of the Royal College of
Physicians of Edinburgh stated that “SMOG should be the preferred measure of readability when evaluating consumer-oriented healthcare material.” The study found that
“The Flesch–Kincaid formula significantly underestimated reading difficulty compared
with the gold standard SMOG formula.”
11 https://en.wikipedia.org/wiki/SMOG
Table 10: SMOG scores per category

Category   | Average | Standard Deviation
eustoxx    | 14.7    | 1.0
crude oil  | 13.8    | 1.8
euro / usd | 13.5    | 2.2
all        | 13.8    | 2.0
To calculate the SMOG grade:

1.043 × √(number of polysyllables × 30 / number of sentences) + 3.1291
4.5.1 Evaluation Results
The averages and standard deviations of the SMOG grade on the annotated corpora described in Section 3 are presented in Table 10.
4.6 Usage
The readability tests are used to assess the readability of the textual data in the project.
Depending on the purpose of the text, different scores can be expected or desirable. For
introductory educational resources one may require a readability score under 12 to
make them accessible to a wider audience. However, for comments in a specialized discussion one may require a higher readability score to identify potentially
unprofessional comments and maintain the discussion on a high level.
The most stable results, i.e., the smallest standard deviation, were shown by the SMOG
index and the CLI score. However, the average values of these two scores differ by about
2.5. As this difference persists over different categories, it can be considered
as a calibration coefficient for the financial domain. As an outcome of this analysis the
average of two scores will be used to assess the readability of the text: CLI score +
2.5 and SMOG index. The expected value for a specialized article is around 14 with a
deviation of 2.
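A sketch of the combined assessment described above, using the CLI and SMOG formulae from Sections 4.4 and 4.5; the helper names and the way the calibration offset of 2.5 is hard-coded are written out here only for illustration.

import math

def coleman_liau_index(total_words, total_sentences, total_characters):
    # CLI formula from Section 4.4
    return 5.88 * total_characters / total_words - 29.6 * total_sentences / total_words - 15.8

def smog_grade(total_sentences, total_polysyllables):
    # SMOG formula from Section 4.5
    return 1.043 * math.sqrt(total_polysyllables * 30.0 / total_sentences) + 3.1291

def combined_readability(total_words, total_sentences, total_characters, total_polysyllables):
    """Average of the calibrated CLI (CLI + 2.5) and the SMOG grade, as proposed above."""
    cli = coleman_liau_index(total_words, total_sentences, total_characters) + 2.5
    smog = smog_grade(total_sentences, total_polysyllables)
    return (cli + smog) / 2.0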
5 Text Categorization
5.1 Introduction
Text categorization (a.k.a. text classification) is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections
and has important applications in the real world. For example, news stories are typically
organized by subject categories or geographical codes. Academic papers are often classified by technical domains and sub-domains; patient reports in health-care organizations
are often indexed from multiple aspects, using taxonomies of disease categories, types
of surgical procedures, insurance reimbursement codes and so on. Another widespread
application of text categorization is spam filtering, where email messages are classified
into the two categories of spam and non-spam [Yang and Joachims, 2008].
5.2 Classifiers
The classifiers are used to solve the categorization task. In the mathematical abstraction
a classifier is a function that takes a text and a set of categories as an input and outputs
one category. The text is predicted to belong to the output category. Hence, the
categories should be provided in advance. Moreover, before a classifier is able to do
its job it has to be trained (“learned”) on annotated data, i.e. a set of texts with known
categories.
Different types of classifiers can be used. Empirical evaluations have shown the performance of non-linear classifiers to be comparable to that of stronger linear classifiers [Fan and Yang, 2003, Zhang and Oles, 2001]. Taking this information into account we have chosen the so-called logistic regression for performing the task [Hosmer Jr and Lemeshow, 2004]. Besides the ease of use and the availability of libraries
for different programming languages, the logistic regression offers a straightforward interpretation of the feature weights as the importance of features. In other words, for
the different features (concepts) the classifier contains the information about the “importance” for the categories. Although this importance is not used for the assessment
of the text directly it may become useful for some other application in the project.
The logistic regression classifier accepts several meta-parameters that influence the quality of classification. For estimating the values of the parameters the grid search is used
[Lerman, 1980]. In the course of the grid search multiple logistic regression classifiers
with different meta-parameters are learned and their performance is then evaluated. The
meta-parameters yielding the best performance are chosen as the parameters for the final
classifier.
In the data there exist 3 classes. In different applications of the categorization task in
the project platform we may expect to have more classes. Therefore, it is necessary
to design a system without restrictions on the number of classes. Taking this into
account we need a multi-class classification. However, originally logistic regression is
designed for a binary classification problem, i.e., distinguishing between two classes.
Several approaches to extend logistic regression to multi-class classification are known
[Tsoumakas and Katakis, 2006]. The two most popular are one-vs-one and one-vs-rest. Both approaches rely on the idea of learning an ensemble of classifiers. We have
chosen the one-vs-rest approach because of its simplicity. Moreover, it is reported that
under common conditions it is not any worse than other more sophisticated approaches
[Rifkin and Klautau, 2004]. The approach consists in the following: for each class the
training data is re-annotated; the class is separated from the rest of the classes in that
all the other classes are merged into one class “rest”. Therefore for each class we learn
a classifier to distinguish this class from the rest. In the testing and working phase all
the classifiers are used; the predicted class is the class of the classifier with the highest
score.
In order to evaluate the performance of the classifier a well-known technique of cross
validation is used [Kohavi et al., 1995]. The technique consists in the following: first the
annotated data is divided into several chunks. Next one chunk is left for testing and does
not participate in the learning phase. The classifier is learned on the training chunks (all
the chunks except the testing chunk). Then the next chunk is taken for testing and so on.
We measure the accuracy of the classifiers, i.e. the number of instances with correctly
predicted classes divided by the total number of instances. In this way we aggregate the
score for all classes in multi-class classification.
When learning a logistic regression classifier one often considers regularization to avoid
over-fitting, especially when there is only a small number of training examples, or when
there are a large number of parameters to be learned [Goodman et al., 2004, Ng, 2004].
The two most widely used regularization methods are L1 and L2 regularizations. The
L1 regularized logistic regression is often used for feature selection, and has been shown
to have good generalization performance in the presence of many irrelevant features.
Namely, L1 adds a penalty for each new feature during the learning process, therefore
forcing the classifier to use as few features as possible. We test the performance of the
classifiers with both regularizations.
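A sketch of this setup with scikit-learn is given below: one logistic regression per class in a one-vs-rest ensemble, with a grid search over the regularization type (L1 or L2) and strength. The grid values and the variable names X (the TF-IDF concept matrix) and y (the category labels) are assumptions of the sketch, not the parameters actually used in the deliverable.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# One binary logistic regression per class; the 'liblinear' solver supports
# both L1 and L2 regularization.
model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))

# Meta-parameters explored by the grid search (illustrative values).
param_grid = {
    "estimator__penalty": ["l1", "l2"],
    "estimator__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(model, param_grid, scoring="accuracy", cv=5)

# X: document-by-concept TF-IDF matrix from Section 2.3, y: known categories.
# search.fit(X, y)
# print(search.best_params_, search.best_score_)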
5.3 Evaluation Result
For the testing the articles collected from investing.com were used. The category of
each article is known, therefore the data is annotated.
Though the categories are not balanced in the number of samples, the dataset is large.
For the testing the stratified K-fold cross validation is used. The difference from the
cross validation described above is that in every chunk the percentage of the samples of
each class is kept the same as in the complete set. The number of chunks is chosen to
be 10, i.e., 90% of the data is used for training and 10% is used for testing.
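A sketch of this evaluation with scikit-learn, covering the stratified 10-fold accuracy estimate and the 10% stratified hold-out used for the confusion matrix discussed below; the function name and the fixed random seed are illustrative assumptions.

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

def evaluate(classifier, X, y):
    """Stratified 10-fold accuracy and a confusion matrix on a 10% stratified hold-out."""
    # Every fold keeps the class proportions of the complete data set.
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    accuracies = cross_val_score(classifier, X, y, cv=folds, scoring="accuracy")

    # Hold out 10% of the data, again preserving the class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, stratify=y, random_state=0)
    classifier.fit(X_train, y_train)
    matrix = confusion_matrix(y_test, classifier.predict(X_test))

    return np.mean(accuracies), np.std(accuracies), matrix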
The accuracy of the 10 fold cross validation of L1 and L2 normalizations as well as the
number of features with non-zero entries for L1 regularization are presented in Table 11.
Table 11: Accuracy scores

         | L1            | L1 number of features | L2
Plain    | 93.3% ± 0.5 % | 1492                  | 93.2% ± 0.3 %
Enriched | 93.4% ± 0.5 % | 1428                  | 93.2% ± 0.3 %
Table 12: Confusion matrix

                   | Predicted Crude oil | Predicted Euro / USD | Predicted Eustoxx50
Actual Crude oil   | 1294                | 85                   | 41
Actual Euro / USD  | 34                  | 1870                 | 7
Actual Eustoxx50   | 7                   | 19                   | 557
From the results we may deduce that the enrichment of the data does not improve the
quality of the classification, however, the number of important features may be decreased.
This may happen due to generalization of some features and may lead to a more robust
performance. However, further tests show that this behavior is not monotonous and
further investigation would be needed to gain more insight into this phenomenon; such
an investigation lies outside of the scope of the project.
Classification scores on a similar dataset of news articles containing 20
classes are reported to be between 80 and 90% [Nigam et al., 2000, Li and Vogel, 2010,
Dai and Liu, 2014]. Although the number of classes in our case is much lower, the classes
themselves are more similar and belong to the same domain. Therefore, the scores of
above 90% are satisfactory for the considered use-case. We believe that these scores are
due to the selection of features, namely the usage of concepts from the thesaurus as
features. Since the thesaurus was created in the interaction with experts from the field,
the features are guaranteed to be meaningful.
Another useful method of surveying the results and accuracy of the classification is
the confusion matrix. The confusion matrix gives an overview of the correctly and
incorrectly classified instances. For creating the matrix, 10% of the data was held out
preserving the percentage of the instances for different classes. The matrix can be found
in Table 12. As can be seen from the matrix a significant amount of mistakes is caused
by misclassification of the instances of the “Crude oil” class. Moreover, some instances
of the “Eustoxx50” class were classified as belonging to “Euro / USD” class.
We also give an overview of the most important features per class in Tables 13 and 14.
The table entries are the features used, which originate from the concepts. They are represented
by their preferred label taken from the STW Economics thesaurus.
Table 13: Top 15 most important concepts per category for plain data

   | Crude oil               | Euro / USD                | Eustoxx50
1  | Natural gas             | New York Stock Exchange   | Forecast
2  | composite output index  | AUD/USD                   | South Korea
3  | Membership              | Monetary union            | Information
4  | Foreign exchange        | USD/CHF                   | Market
5  | Stock exchange          | Commodity Futures         | EUR/GBP
6  | oil                     | Lobbying                  | FTSE 100 (U.K.)
7  | brent oil for delivery  | Replacement investment    | Interest rate policy
8  | Conservative party      | Euro                      | U.K
9  | MSCI                    | Economic forecast         | Euro Stoxx (Euro zone)
10 | Oil price               | Offshore financial centre | hang seng index
11 | Reuters                 | Exchange                  | million barrel
12 | barrel                  | Futures exchange          | pound to dollar
13 | crude                   | EUR/JPY                   | NYSE
14 | sweet crude future      | greenback versus a basket | CAC 40 (France)
15 | Petroleum               | EUR/USD                   | DAX
It is worth noting that for plain data in the “Crude oil” class, the 4th and the 5th most important features
contain the token “exchange”. Due to the internal annotation algorithm of PoolParty used
for annotation, overlapping preferred labels may be annotated by several concepts,
i.e., in many occurrences of “Foreign exchange” the concept “Exchange” is also present
in the annotations. This may be one of the sources of misclassification of the instances of the “Crude
oil” class. Moreover, it is interesting that among the 4 currency pairs important for
the “Euro / USD” class the pair Euro to USD is actually the least important. It is also
worth mentioning that the pound to dollar pair is important for Eustoxx50.
For the enriched data we may observe that more general features gain importance. This
fact may correspond to the general intuition and may ease the analysis. However, the
quality of the classification is not improved.
5.4 Usage
The categories offer the functionality of grouping the articles. As opposed to some
automatic grouping, the categorization described in this section allows for the manual
control over the final outcome. Such a grouping may be useful for end users to pre-filter
the data as well as for the platform designers to organize the data in the categories for
further analysis.
The users’ input may be categorized automatically for better structuring of the data
in the storage. The input obtained from other websites may be also categorized if the
category is not known in advance. In case the input is a comment from a user, the
categorization may check if the comment belongs to the same category as the material
it refers to.

As can be seen from the discussion of the results, some important insights into the data
may be obtained through the investigation of the trained classifiers.

Table 14: Top 15 most important concepts per category for enriched data

   | Crude oil               | Euro / USD                  | Eustoxx50
1  | Gases                   | Futures market              | P.21 Packages
2  | Treasuries              | Foreign Exchange Market     | Chinese
3  | Industries              | V.16.04 Probability Theory  | United Kingdom
4  | S&P 500                 | Time series analysis        | Bank
5  | Reuters                 | Industrial production       | P.10 Electrical Engineering
6  | Commodity price         | Bond                        | Asian
7  | Bloomberg               | One-person household        | LONDON
8  | crude stockpiles        | European Central Bank       | International financial market
9  | MSCI                    | New York Stock Exchange     | Forecast
10 | oil contract            | greenback versus a basket   | U.K.
11 | Stock exchange          | WTI                         | Euro Stoxx (Euro zone)
12 | Oil price               | US Dollar                   | hang seng index
13 | Commodity Futures       | USD/JPY                     | CAC 40 (France)
14 | brent oil for delivery  | EUR/GBP                     | NYSE
15 | crude oil inventories   | EUR/JPY                     | DAX
6 Topic Discovery
6.1 Introduction
In machine learning and natural language processing, a topic model is a type of statistical
model for discovering the abstract “topics” that occur in a collection of documents. Topic
modeling is a frequently used text-mining tool for discovery of hidden semantic structures
in a text body. The “topics” produced by topic modeling techniques are clusters of
similar words. A topic model captures this intuition in a mathematical framework,
which allows examining a set of documents and discovering the topics and distribution
of those topics in each document.
In the age of information, the amount of the written material we encounter each day
is simply beyond our processing capacity. Topic models can help to organize and offer
insights for us to understand large collections of unstructured text bodies. Topic modeling algorithms analyze the words of the original texts to discover the themes that run
through them, how those themes are connected to each other and how they change over
time. Topic modeling algorithms do not require any prior annotations or labeling of the
documents – the topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be
impossible by human annotation.
Automatic extraction of meaningful themes from large amounts of documents can help
detecting events taking place in real time and facilitate the exploration of unlabeled
user-generated content archives.
6.2 Factorization and Topics
In mathematics, factorization is the decomposition of an object into a product of other
objects, or factors, which when multiplied together give the original object. The aim of
factoring is usually to reduce something to “basic building blocks”. Factorization technique were proved to be useful in many different application fields such as image factorization [Tomasi and Kanade, 1992], biomedicine [Gao and Church, 2005], and robotics
[Meng and Zhuang, 2007].
As applied to text mining, the most common usage of factorization is so-called topic modeling. The rationale behind topic modeling is to model the set of topics that best represent the texts; the topics are the textual factors. There exist two main approaches to topic modeling:
Probabilistic approach A statistical generative model attempts to explain a set of observations (documents) by latent variables, which in this case are the topics [Blei, 2012]. The similarities between the documents may be explained by the fact that they are the result of the activity of the same latent variables. Each latent variable has a probability of generating a certain word. One of the first techniques
of probabilistic topic modeling is probabilistic latent semantic analysis (PLSA). The most popular probabilistic topic modeling approach is latent Dirichlet allocation (LDA) [Blei et al., 2003, Pritchard et al., 2000]; the advantage over PLSA is that each document may be represented as a mixture of topics. Many other extensions exist allowing the consideration of multiple other conditions.
In order to achieve better results some assumptions have to be made. For example, in LDA an assumption of a Dirichlet prior distribution is made. Such assumptions may be arguable and seem artificial. The mathematical representation of the model is rather complicated.
Algebraic approach Non-negative matrix factorization (NMF), also called non-negative matrix approximation, is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.
The model is rather simple and robust; it allows for a straightforward interpretation of its foundations and results. Moreover, thanks to the non-uniqueness of the decomposition, the model is easily customizable to take additional conditions into account (such as, for example, sparsity). Theoretically, some variations of the model are equivalent to PLSA.
In the course of this work we focus on the algebraic approach and the NMF method in particular. The learning algorithm makes use of the coordinate descent method from the LIBLINEAR library [Fan et al., 2008].
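A minimal sketch of such a factorization, assuming a scikit-learn environment and a small illustrative list of preprocessed texts (the variable names and toy documents are ours), could look as follows:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    # toy stand-ins for the preprocessed article texts
    documents = [
        "crude oil prices fell as stockpiles rose",
        "the central bank left interest rates unchanged",
        "european stock markets closed higher on friday",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    V = vectorizer.fit_transform(documents)      # documents x terms matrix

    n_topics = 2                                 # toy value; about 30 for the real corpus
    nmf = NMF(n_components=n_topics, solver="cd", random_state=0)
    W = nmf.fit_transform(V)                     # document-topic weights
    H = nmf.components_                          # topic-term weights

    # inspect the most important terms of each topic
    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(H):
        top_terms = [terms[i] for i in topic.argsort()[::-1][:10]]
        print(k, top_terms)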
The number of topics is a parameter of the method and has to be given in advance. However, it is rare that the number of topics is known beforehand, so it has to be determined somehow. There exist different techniques for determining this number; we use the “elbow rule” for this purpose [Thorndike, 1953, Ketchen and Shook, 1996]. The idea of the elbow rule is to increase the number of topics until each additional topic improves the approximation of the data less than the previous ones did. Figure 2 illustrates this behavior: the difference in performance between 14 and 16 topics is larger than the difference in performance between 74 and 76 topics. The optimal number of topics is the number at which the behavior of the curve changes. Increasing the number of topics also increases the computation time, while the generalization capabilities of the model decrease.
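A sketch of such a sweep, assuming a realistically sized document-term matrix V (as produced in the previous sketch) so that large topic numbers make sense, might look as follows; scikit-learn's NMF exposes the approximation error directly via reconstruction_err_:

    from sklearn.decomposition import NMF

    errors = {}
    for n_topics in range(10, 81, 2):
        nmf = NMF(n_components=n_topics, solver="cd", random_state=0)
        nmf.fit(V)                               # V: documents x terms matrix
        errors[n_topics] = nmf.reconstruction_err_

    # improvement gained by each step; the "elbow" is where these gains flatten out
    steps = sorted(errors)
    gains = {b: errors[a] - errors[b] for a, b in zip(steps, steps[1:])}
    # plot errors (or inspect gains) and pick the number of topics at the elbow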
6.2.1 Evaluation Results
The performance of topic modeling is measured depending on the number of topics, using all the documents from the corpora described in Section 3.
Figure 1: Reconstruction error and density of topics with plain annotations
The number of topics varies from 10 to 80. The performance is measured in terms of:
Reconstruction Error The norm of the difference |V − W ∗ H| is measured. W ∗ H is
the reconstruction of the original data through the found decomposition into the
topics, hence the name.
Density The density of the topic matrix H is measured, i.e. the ratio of the non-zero entries to the total number of entries. The denser H is, the more often the same words are used in different topics. It is desirable to have a sparse matrix so that each topic is described with a small number of words.
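Both quantities can be computed directly from a fitted factorization; a short sketch, assuming the matrices V, W and H from the sketches above, might be:

    import numpy as np
    from scipy.sparse import issparse

    dense_V = V.toarray() if issparse(V) else np.asarray(V)

    # reconstruction error: norm of the difference between the data and its approximation
    reconstruction_error = np.linalg.norm(dense_V - W @ H)

    # density of H: fraction of non-zero entries (the lower, the sparser the topics)
    density = np.count_nonzero(H) / H.size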
The result for the article annotations without broaders is presented in Figure 1; the result with broaders is presented in Figure 2. In both cases the optimal number of topics determined with the elbow rule is about 30.
Figure 2: Reconstruction error and density of topics with enriched annotations
6.3 Topical Density and Relatedness
The topics represent the main directions of possible contribution of a certain piece of
text. The more topics are represented in the text the less focused the contribution is.
Moreover, since the topics are weighted we can assess the numerical density of the topics.
As the measure of the topical density we have chosen the ratio D = max_i(w_i) / Σ_i w_i, i.e. the maximal topic weight divided by the sum of the weights of the represented topics. Hence the topical density D denotes the ratio between the main contribution and the overall sum of the contributions.
The topical relatedness is defined as the sum of all contributions, i.e., R = Σ_i w_i. The topical relatedness R identifies to which extent a text is related to the field of interest,
i.e., to the main topics of the texts from the training set. The two values D and R are
complementary to each other in a certain sense. If a text contains only a few concepts, its
topical density is expected to be high since only a few (or one) topics will be represented
in the text. However, in this case topical relatedness will be low. On the other hand,
if there are a lot of concepts in a text then its topical relatedness is likely to be high;
however, if the concepts are taken randomly and placed in the text then the topical density
of the text will be low. Therefore, only texts with focused contribution in several topics
and a high number of concepts are expected to have high scores in both values.
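Given the topic weights of a single text (for instance a row of W obtained by transforming the text with a fitted NMF model), both measures reduce to a few lines; the thresholding of near-zero weights is an illustrative choice of ours:

    import numpy as np

    def density_and_relatedness(weights, eps=1e-9):
        """Topical density D and relatedness R for one text, given its topic weight vector."""
        w = np.asarray(weights, dtype=float)
        w = w[w > eps]                   # keep only the topics actually represented in the text
        if w.size == 0:
            return 0.0, 0.0
        relatedness = w.sum()            # R: overall contribution to the topics of the field
        density = w.max() / relatedness  # D: share of the dominant topic
        return density, relatedness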
6.3.1 Evaluation Results
The tests are performed on the annotated articles using plain and enriched annotations. The dataset is split into 80% for learning the topics and 20% for testing; this way we check that the trained topics perform well on unseen data. The dependencies of topic relatedness and topic density on the number of words in the article and the number of concepts extracted from the article are built. The results are presented in Figures 3, 4, 5, and 6. In each figure a least-squares approximation is built and the correlation coefficient R² is presented. The best results are obtained for the enriched data and the dependency of topic relatedness on the number of words. For the topic density the correlation coefficient was found to be quite small. Therefore, the dependency of the topic density on the number of extracted concepts can be neglected.
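The least-squares fits can be reproduced along the following lines, assuming hypothetical NumPy arrays word_counts and relatedness with one entry per article:

    import numpy as np

    # word_counts and relatedness are assumed 1-D arrays of equal length, one entry per article
    slope, intercept = np.polyfit(word_counts, relatedness, deg=1)
    predicted = slope * word_counts + intercept

    # R^2 of the linear fit
    ss_res = np.sum((relatedness - predicted) ** 2)
    ss_tot = np.sum((relatedness - np.mean(relatedness)) ** 2)
    r_squared = 1.0 - ss_res / ss_tot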
One may also note that the characteristics of the dependencies change from plain data
to enriched data. Namely, if the data is enriched with broader concepts then the number
of concepts becomes a more reliable predictor for the topic relatedness.
Overall, one may assume that for shorter snippets of text the topic density appears to be an important value and may reach values up to 1, whereas the topic relatedness may still remain low, close to 0. For longer snippets of text the topic density may be lower, whereas the topic relatedness may be expected to be roughly the number of words multiplied by the slope computed from the training data.
Figure 3: Topic density and relatedness on plain training data
Figure 4: Topic density and relatedness on enriched training data
Figure 5: Topic density and relatedness on plain test data
Figure 6: Topic density and relatedness on enriched test data
6.4 Topic Transition
News articles (especially in the financial field) attempt to respond to the latest changes in all possible fields of life that may have an impact on economics and finance. Hence, in the analysis of topics we cannot rely on news articles that are years or even months old. However, some news topics tend to persist over long periods of time while others emerge and fade out rapidly. Detection of such emerging and fading topics may be important for economic forecasting and may be valuable for the users of the platform. Therefore, detecting new topics and investigating the transitions of the old topics is a relevant assessment task.
Over the last decade, many strategies for topic detection based on probabilistic models have appeared [AlSumait et al., 2008, Cao et al., 2007, Vaca et al., 2014]. However, these models have a major drawback: their high computational cost makes them unable to deal with large amounts of documents arriving in real time. Hence, these approaches cannot be applied in many real-world scenarios where data must be processed online and efficiently. Prominent examples are online news outlets and social media, where users continuously produce large amounts of data whose topics rapidly grow and fade in intensity across time. Moreover, these approaches make use of complex models that are difficult to implement and investigate. Some approaches [Vaca et al., 2014] introduce new meta-parameters, thereby making the models even more difficult to use.
Other authors make use of a simpler approach relying on existing methods [Panisson et al., 2014]. The idea is to sample the data into chunks based on the date of appearance, compute the topic distribution for each chunk, and then compare the obtained distributions. The comparison may yield insights about the transitions of the topics and the emergence of new topics. Though the results of the investigation of the topic transitions cannot be directly used for text assessment, since a set of texts (a corpus) is needed to identify the transitions, it makes sense to investigate this functionality in the frame of this deliverable since topic modeling is under investigation.
6.4.1 Evaluation Results
We compute the distribution of topics H_{n,n+k} for the time period of k weeks starting from week n. Next we take H_{n+k,n+2k} and compare the two topic distributions. We aim to find a matrix M̂ such that H_{n+k,n+2k} = M̂ ∗ H_{n,n+k}, where ∗ is the dot product. In general such a matrix M̂ does not exist, so we look for a matrix that would “explain” most of the values in H_{n+k,n+2k}.
The approach described in [Panisson et al., 2014] uses the cosine score to find the transitions between topics. In case topics correlate in the original distribution the cosine approach may overcount the evidence and “explain” the same new topic twice. Therefore
we have decided to use a different approach. The initial idea is to use inverse matrices: if the inverse matrix H_{n,n+k}^{-1} exists, we could express M = H_{n+k,n+2k} ∗ H_{n,n+k}^{-1}. However, the inverse matrix does not exist in the general case. Therefore, we could take the Moore-Penrose pseudo-inverse matrix H_{n,n+k}^{+} instead [Ben-Israel and Greville, 2003]. Hence, we obtain an approximation of the transition matrix M = H_{n+k,n+2k} ∗ H_{n,n+k}^{+}. The disadvantage of the inverse matrices is that the resulting transition matrix M may have negative entries, and negative entries cannot be interpreted meaningfully.

Figure 7: Topic transition matrix using pseudo-inverse matrix
As an alternative approach we implement a method that solves a proper optimization task: minimizing the difference between H_{n+k,n+2k} and M ∗ H_{n,n+k} under an additional non-negativity constraint on M. We present the results of both methods as colormaps: Figures 7 and 8 show the two matrices obtained for the topics of the year 2015 with n = 0 and k = 2. For a clearer representation we limit the number of topics to 10.
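A compact sketch of both variants, assuming the two topic matrices H1 (weeks n to n+k) and H2 (weeks n+k to n+2k) are available as NumPy arrays of shape (topics, terms), could look as follows; solving one non-negative least-squares problem per row of M is one possible way of imposing the constraint:

    import numpy as np
    from scipy.optimize import nnls

    def transition_pinv(H1, H2):
        """Transition matrix via the Moore-Penrose pseudo-inverse (may contain negative entries)."""
        return H2 @ np.linalg.pinv(H1)

    def transition_nonneg(H1, H2):
        """Transition matrix with a non-negativity constraint, one NNLS problem per new topic."""
        M = np.zeros((H2.shape[0], H1.shape[0]))
        for j in range(H2.shape[0]):
            M[j], _ = nnls(H1.T, H2[j])  # fit H2[j] ≈ M[j] @ H1 with M[j] >= 0
        return M

    def reconstruction_error(M, H1, H2):
        return np.linalg.norm(H2 - M @ H1)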
The reconstruction errors are:
Pseudo-inverse 37.8
Optimization 38.6
As there is no big difference in the reconstruction error, we choose the optimization problem as the preferred option for building the topic transitions.
Figure 8: Topic transition matrix using optimization algorithms
Table 15: Topics of weeks 0 and 1 of year 2015
Topic 0: NASDAQ, New York Stock Exchange, NYSE, International financial market, Financial market, Corporation, B.08.03 Taxes and Choice of Organizational Form or Location, Organizational form, B.01.04 Organizational Forms, N.05.02.02 Economic Private Law
Topic 1: Western Europe, G.01.05 Western Europe, EU countries, France, PARIS, Europe, LONDON, United Kingdom, Private bank, Bank
Topic 2: percent, G.02 Asia, Asia, East Asia, G.02.02 East Asia, S&P 500, Yield, Return to capital, Tokyo, Japan
Topic 3: Central America, G.04.01 Central America, Mexico, Refinery, Chemical Industry, Petrochemical industry, Basic chemical industry, Mexicans, Mexican, Latin Americans
Topic 4: XETRA, DAX, Stock price, AG NA O.N, Partnership, KGAA, MDAX, DAX (Germany), W.14.03.02 Financial Economics, B.08.03 Taxes and Choice of Organizational Form or Location
Topic 5: V.07.05 Foreign Trade, W.10.04 Export Sector, Export, External sector, Foreign Trade, Process, Enterprise, Law, W.21 Business Services, Energy
Topic 6: National Accounts, Consumer good, Goods, Gross Domestic Product, National product, V.03.05 Consumption and Savings, Balance of Payments, V.07.08 Balance of Payments, Durable good, durable goods
Topic 7: Airline, Transport sector, W.12.01.04 Air Transport, Hedging, Fuel, W.12.01 Transport Mode, W.12 Transport and Tourism, Costs, B.03.02 Cost Accounting, Strategy
Topic 8: manufacture, National Accounts, National product, European Central Bank, V.02.02 Theory of the Firm, Factor price, Costs, Chairman of the ECB, Mario Draghi, N.04.04.04 European Integration and EU Policy
Topic 9: Arab countries, G.02.01 Middle East, Middle East, Syria, Asia, G.02 Asia, Iraq, Social conflict, Labour dispute, Strike
Table 16: Topics of weeks 2 and 3 of year 2015
Topic 0: Western Europe, France, PARIS, G.01.05 Western Europe, EU countries, Europe, CAC 40 (France), Private bank, France, LONDON
Topic 1: NASDAQ, New York Stock Exchange, NYSE, International financial market, Financial market, Corporation, Organizational form, B.08.03 Taxes and Choice of Organizational Form or Location, B.01.04 Organizational Forms, N.05.02.02 Economic Private Law
Topic 2: B.03.02 Cost Accounting, Factor price, V.02.02 Theory of the Firm, Costs, B.04.02 Wage Payment Systems and Fringe Benefits, Income, Cost of capital, V.05.04 Interest Rate, Return to capital, V.03.03 Capital
Topic 3: Mineral resources, Natural gas resources, Natural gas, Gases, cubic foot, cubic foot, Gases, billion cubic foot, natural gas future, Climate
Topic 4: XETRA, DAX, Stock price, DAX (Germany), MDAX, Partnership, KGAA, W.14.03.02 Financial Economics, AG NA O.N, Germany
Topic 5: Oil Prices Future contracts, V.05.06.03 Forward Market, Commodity Futures, Futures contract, million barrel, Brent Crude Oil Futures, crude oil inventories, W.04.01.04 Energy Policy, V.12.02.01 Energy Policy, B.08.02 Statement of Assets
Topic 6: International Monetary System, Economic Integration, International economic relations, International relations, Economic union, European Economic and Monetary Union, Euro Area, N.04.04.04 European Integration and EU Policy, European Integration, Monetary union
Topic 7: percent, B.02.01.02 Debt Financing, V.05.06.02.02 Bond Market, Fixed-income securities, Yield, Bond, V.03.01 Aggregate Investment, Public Debt, Debt, V.09.06 State-Owned Assets and Public Debt
Topic 8: V.09.07 Economics of Taxation, V.09.07.03 Tax Types, Revenue, Public Revenues, V.09.03 Public Budget, Public Finance, W.22.01.02 Financial Administration and Public Finance, Tax, V.11 Regional Economics and Infrastructure, V.09.07.01 Theory of Taxation
Topic 9: N.10.01.02 Climate, Weather, and Air, W.12 Transport and Tourism, bpd, Means of transport, P.09 Vehicles, Climate change, G.02.01 Middle East, G.02 Asia, Middle East, Climate
As the output of the transition analysis we obtain information about the disappearing and the emerging topics:
Disappearing 5, 6, 7
Emerging 3, 5, 8
6.5 Usage
The topic density and topic relatedness may be used to assess the overall impact of the textual contributions on the platform coming from different sources, including users’ contributions. The overall methodology was developed in the course of this deliverable; further tests may be needed to evaluate its efficiency on real user inputs.
The topic transitions can be used to analyze texts coming from news as well as from user inputs. In contrast to the other methods developed in the course of this deliverable, the topic transition is not applied to individual texts but to a set of texts. The usage of these methods has been discussed with the partners in the context of WP4 “Market Sentiment-based Financial Forecasting”. Moreover, this methodology allows for a straightforward extension to the observation of the evolution of topics. As an additional tool, methodologies for the identification of potential trends, together with the reasons for those trends, are planned based on positive and negative association rules. These methodologies may be developed in the remaining months of Task 2.1 “Data Harvesting, Extraction and Assessment” (M1-M24).
Conclusion
The deliverable D2.2 “Data and information streams – assessment tools” presents the results of preparing the assessment tools for the PROFIT project. In the discussion with the partners it was found that the only input that requires assessment is the textual input. Therefore, the assessment methods were developed for the analysis of purely textual information, based on the data acquired earlier and the background thesaurus.
In the context of this activity the following was done:
1. Methods for preprocessing of the textual input are developed. Those methods
include sentence boundary disambiguation, syllabification, and standard text vectorization methods.
2. Readability indices are investigated and implemented. Evaluation tests were carried out to identify the most reliable indices.
3. Text categorization methods are implemented. The results of the categorization
using different regularization techniques and enrichment of the texts using thesauri
were investigated. The relevance of features is examined.
4. Topic modeling and topic discovery approaches are investigated and implemented. Two metrics based on the topic decomposition are introduced: topic density and topic relatedness. The behavior of these metrics is investigated with respect to the possible usage patterns. Topic transition methodologies are investigated, implemented, and tested.
5. The data extracted in the frame of D2.3 “Data crawlers, adaptors and extractors” (M12) is used for tests and calibration. This input data set was suggested and approved by the partners who are experts in the field. The conclusions are based on the evaluation results obtained from this data set.
6. Usage patterns of the metrics are investigated and suggested. For each metric an application scenario and the conditions of usage are identified.
References
[Aberdeen et al., 1995] Aberdeen, J., Burger, J., Day, D., Hirschman, L., Robinson, P., and Vilain, M. (1995). MITRE: Description of the Alembic system used for MUC-6. In Proceedings of the 6th conference on Message understanding, pages 141–155. Association for Computational Linguistics.
[Agirre and Soroa, 2009] Agirre, E. and Soroa, A. (2009). Personalizing pagerank for
word sense disambiguation. In Proceedings of the 12th Conference of the European
Chapter of the Association for Computational Linguistics, EACL ’09, pages 33–41,
Stroudsburg, PA, USA. Association for Computational Linguistics.
[Aho and Ullman, 1992] Aho, A. V. and Ullman, J. D. (1992). Foundations of computer
science. Computer Science Press, Inc.
[AlSumait et al., 2008] AlSumait, L., Barbará, D., and Domeniconi, C. (2008). Online lda: Adaptive topic models for mining text streams with applications to topic
detection and tracking. In 2008 eighth IEEE international conference on data mining,
pages 3–12. IEEE.
[Ben-Israel and Greville, 2003] Ben-Israel, A. and Greville, T. N. (2003). Generalized
inverses: theory and applications, volume 15. Springer Science & Business Media.
[Blei, 2012] Blei, D. M. (2012). Probabilistic topic models. Communications of the
ACM, 55(4):77–84.
[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet
allocation. Journal of machine Learning research, 3(Jan):993–1022.
[Borst and Neubert, 2009] Borst, T. and Neubert, J. (2009). Case study: Publishing
stw thesaurus for economics as linked open data. W3C Semantic Web Use Cases and
Case Studies.
[Cao et al., 2007] Cao, B., Shen, D., Sun, J.-T., Wang, X., Yang, Q., and Chen, Z.
(2007). Detect and track latent factors with online nonnegative matrix factorization.
In IJCAI, volume 7, pages 2689–2694.
[Coleman and Liau, 1975] Coleman, M. and Liau, T. L. (1975). A computer readability
formula designed for machine scoring. Journal of Applied Psychology, 60(2):283.
[Dai and Liu, 2014] Dai, J. and Liu, X. (2014). Approach for text classification based
on the similarity measurement between normal cloud models. The Scientific World
Journal, 2014.
[Fan and Yang, 2003] Fan, L. and Yang, Y. (2003). A loss function analysis for classification methods in text categorization. In Proc. ICML, pages 472–479.
[Fan et al., 2008] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J.
(2008). Liblinear: A library for large linear classification. Journal of machine learning
research, 9(Aug):1871–1874.
[Fitzsimmons et al., 2010] Fitzsimmons, P., Michael, B., Hulley, J., and Scott, G.
(2010). A readability assessment of online parkinson’s disease information. The journal of the Royal College of Physicians of Edinburgh, 40(4):292–296.
[Forman, 2003] Forman, G. (2003). An extensive empirical study of feature selection
metrics for text classification. Journal of machine learning research, 3(Mar):1289–
1305.
[Gao and Church, 2005] Gao, Y. and Church, G. (2005). Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics,
21(21):3970–3975.
[Goodman et al., 2004] Goodman, J. et al. (2004). Exponential priors for maximum
entropy models. In HLT-NAACL, pages 305–312. Citeseer.
[Hosmer Jr and Lemeshow, 2004] Hosmer Jr, D. W. and Lemeshow, S. (2004). Applied
logistic regression. John Wiley & Sons.
[Ketchen and Shook, 1996] Ketchen, D. J. and Shook, C. L. (1996). The application of
cluster analysis in strategic management research: an analysis and critique. Strategic
management journal, 17(6):441–458.
[Kincaid et al., 1975] Kincaid, J. P., Fishburne Jr, R. P., Rogers, R. L., and Chissom,
B. S. (1975). Derivation of new readability formulas (automated readability index, fog
count and flesch reading ease formula) for navy enlisted personnel. Technical report,
DTIC Document.
[Kohavi et al., 1995] Kohavi, R. et al. (1995). A study of cross-validation and bootstrap
for accuracy estimation and model selection. In IJCAI, volume 14, pages 1137–1145.
[Lerman, 1980] Lerman, P. (1980). Fitting segmented regression models by grid search.
Applied Statistics, pages 77–84.
[Li and Vogel, 2010] Li, B. and Vogel, C. (2010). Improving multiclass text classification
with error-correcting output coding and sub-class partitions. In Canadian Conference
on Artificial Intelligence, pages 4–15. Springer.
[Luhn, 1957] Luhn, H. P. (1957). A statistical approach to mechanized encoding
and searching of literary information. IBM Journal of Research and Development,
1(4):309–317.
[Mc Laughlin, 1969] Mc Laughlin, G. H. (1969). Smog grading-a new readability formula. Journal of reading, 12(8):639–646.
[Meng and Zhuang, 2007] Meng, Y. and Zhuang, H. (2007). Autonomous robot calibration using vision technology. Robotics and Computer-Integrated Manufacturing,
23(4):436–446.
[Mihalcea, 2010] Mihalcea, R. (2010). Word sense disambiguation. In Sammut, C. and
Webb, G., editors, Encyclopedia of Machine Learning, pages 1027–1030. Springer US.
[Mikheev, 2000] Mikheev, A. (2000). Tagging sentence boundaries. In Proceedings of the
1st North American chapter of the Association for Computational Linguistics conference, pages 264–271. Association for Computational Linguistics.
[Ng, 2004] Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78. ACM.
[Nigam et al., 2000] Nigam, K., Mccallum, A. K., Thrun, S., and Mitchell, T. (2000).
Text classification from labeled and unlabeled documents using em. Machine Learning,
39(2):103–134.
[Palmer and Hearst, 1997] Palmer, D. D. and Hearst, M. A. (1997). Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241–267.
[Panisson et al., 2014] Panisson, A., Gauvin, L., Quaggiotto, M., and Cattuto, C.
(2014). Mining concurrent topical activity in microblog streams. arXiv preprint
arXiv:1403.1403.
[Pedregosa et al., 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay,
E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning
Research, 12:2825–2830.
[Pritchard et al., 2000] Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2):945–
959.
[Reynar and Ratnaparkhi, 1997] Reynar, J. C. and Ratnaparkhi, A. (1997). A maximum entropy approach to identifying sentence boundaries. In Proceedings of the
fifth conference on Applied natural language processing, pages 16–19. Association for
Computational Linguistics.
[Rifkin and Klautau, 2004] Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all
classification. Journal of machine learning research, 5(Jan):101–141.
[Salton and McGill, 1986] Salton, G. and McGill, M. J. (1986). Introduction to Modern
Information Retrieval. McGraw-Hill, Inc., New York, NY, USA.
[Smith and Senter, 1967] Smith, E. A. and Senter, R. (1967). Automated readability
index. AMRL-TR. Aerospace Medical Research Laboratories (6570th).
[Sparck Jones, 1972] Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21.
[Thorndike, 1953] Thorndike, R. L. (1953). Who belongs in the family? Psychometrika,
18(4):267–276.
[Tomasi and Kanade, 1992] Tomasi, C. and Kanade, T. (1992). Shape and motion from
image streams under orthography: a factorization method. International Journal of
Computer Vision, 9(2):137–154.
[Tsoumakas and Katakis, 2006] Tsoumakas, G. and Katakis, I. (2006). Multi-label classification: An overview. Dept. of Informatics, Aristotle University of Thessaloniki,
Greece.
[Vaca et al., 2014] Vaca, C. K., Mantrach, A., Jaimes, A., and Saerens, M. (2014). A
time-based collective factorization for topic discovery and monitoring in news. In
Proceedings of the 23rd international conference on World wide web, pages 527–538.
ACM.
[Yang and Joachims, 2008] Yang, Y. and Joachims, T. (2008). Text categorization. Scholarpedia, 3(5):4242. Revision #91858.
[Yarowsky, 1995] Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling
supervised methods. In Proceedings of the 33rd annual meeting on Association for
Computational Linguistics, pages 189–196. Association for Computational Linguistics.
[Zhang and Oles, 2001] Zhang, T. and Oles, F. J. (2001). Text categorization based on
regularized linear classification methods. Information Retrieval, 4(1):5–31.