
1. Enabler Description, Requirements and Roadmap
1.1 Definition of the master use scenarios
Reader’s Background
Although this document is kept as simple as possible, the reader needs some general background on
semantic technologies, formats and datasets in order to fully understand its content. Namely:
- Basics of semantic technologies (in particular those exploiting RDF and the concept of unique entities)
- Basics of “named entity recognition” and “automatic classification”
- The RDF format and triple stores.
Technological Background
The evolution of the traditional Web into a Semantic Web and the continuous increase in the
amount of data published as Linked Data open up new opportunities for annotation and
categorization systems to reuse these data as semantic knowledge bases. Accordingly, Linked Data
has been used by information extraction systems to exploit the semantic knowledge bases, which
can be interconnected and structured in order to increase the precision and recall of annotation and
categorization mechanisms.
The goal of the Semantic Web (or Web of Data) is to describe the meaning of the information
published on the Web to allow retrieval based on an accurate understanding of its semantics. The
Semantic Web adds structure to the resources accessible online in ways that are not only usable by
humans, but also by software agents that can rapidly process them. Linked Data (LD) refers to a
way for publishing and interlinking structured data on the Web. LD is part of the design of the
Semantic Web and represents the foundation “bricks” needed to build it. Although the concept of
“linked data” was already present in the theory of the Semantic Web, it only came into vogue in
computer science later, because the computing capacity available at the time was too low.
However, in recent years, thanks to the growing number of datasets based on LD, it has become
possible to exploit their implicit knowledge through text classification and annotation processes in
order to build semantic applications.
Text classification is the assignment of a text to one or more pre-existing classes (also known as
features). This process determines the class membership of a text document given a set of distinct
classes with a profile and a number of features. The criterion for the selection of relevant features
for classification is essential and is determined a priori by the classifier (human or software).
Semantic classification takes place when the elements of interest in the classification refer to
the “meaning” of the document. Text annotation refers to the common practice of adding
information to the text itself through underlining, notes, comments, tags or links. The annotation of
text can also be semantic, when a document is enriched with information about its meaning
or the meaning of the individual elements that compose it. This is done primarily using links that
connect a word, an expression or a phrase to an information resource on the Web or to an
unambiguous entity present in a knowledge base. For example, the word “Turin” in a sentence can
be linked to the entity http://dbpedia.org/resource/Turin.
Social Semantic Enricher
The Social Semantic Enricher (SSE) will work on text. Given an input text, the enabler will
perform a semantic classification process on it in order to identify a configurable number of
concepts, ordered by relevance, which can both describe and extend the input text.
Each concept is represented by a single word or a set of words related to a specific object,
referenced by a Linked Open Data entry in the dataset. Since concepts are returned ordered by
relevance, the last concepts may have low relevance with respect to the text, which opens the way
to some optimization problems.
The returned concepts are then used as inputs for the enrichment part. The system will enrich the
text by searching different data sources for media and text related to the concepts found.
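As a purely illustrative example (the exact output format is not specified here), for the input
sentence “Turin is the capital city of Piedmont, in northern Italy” the classification step might
return, in decreasing order of relevance, concepts such as:

    http://dbpedia.org/resource/Turin
    http://dbpedia.org/resource/Piedmont
    http://dbpedia.org/resource/Italy

The enrichment part would then look up media (for instance images or videos) related to each of
these entities.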
The system will be composed of a “core” part, containing the data in a format that is efficient to
query, which will communicate with an “Enhancement” module. The interaction among the
different modules will output the enriched concepts for the input text.
This “Core interface” will actually be the one provided to process the text in order to obtain the list
of concepts related to that text. For each of these concepts a basic set of information will be
returned, with the common property of being found in the core dataset itself.
Connected to this core interface there is the “enhancer client”: it wraps the core interface and also
exposes its own APIs (in REST format) so that the data coming out of the core can be enhanced
with external media/text.
A call to the REST client interface will specify how to enhance the data coming out of the SSE
Core. Different policies will be defined, such as “enhance only with images” or “enhance with
videos and images” etc.
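To give an idea of how such a call might look, the sketch below (in Java) posts a text to a
hypothetical enhancement endpoint and asks for image-only enhancement. The host, path and
parameter names are assumptions used for illustration only, not the actual SSE REST API.

    // Illustrative sketch only: endpoint, host and parameter names are assumptions,
    // not the actual SSE REST interface.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SseEnhanceCallSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Hypothetical call: send the text to classify and ask for image-only enhancement.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/sse/enhance?policy=images"))
                    .header("Content-Type", "text/plain")
                    .POST(HttpRequest.BodyPublishers.ofString(
                            "Turin is the capital city of Piedmont, in northern Italy."))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The response is expected to contain the detected concepts plus the related media.
            System.out.println(response.body());
        }
    }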
The external sources available will certainly include images and video, with a possible extension to
external text, news and places.
Besides this, in order to provide a freestanding enabler, a Graphical User Interface module will be
developed, offering a friendly environment in which users can play with the enabler and understand
whether this kind of service is what they really need for their own purposes.
In order to be as generic as possible, the core will contain the Lucene indexes of the DBpedia
datasets, also because DBpedia can be mapped onto Wikipedia, which makes the enhancement part
easier, especially for images.
In general, the module structure should allow the core to be replaced with another one with
reasonable effort, thanks to a clear definition of the interfaces.
At least for DBpedia data updates, a tool to generate the Lucene indexes shall be provided.
1.2 Stakeholders/User Stories
An object like the Social Semantic Enricher could be useful in different environments and for
different stakeholders:
News Operators: Enriching the text of specific news items could be a way to provide insights on
specific topics to end users, allowing them to go in depth into “related” topics. This usage of
the Social Semantic Enricher would be a borderline case between classification and
recommendation.
Broadcasters/Broadcasting Service providers: The EPG description of the program could be
passed as text to the enricher in order to obtain some related topics to be provided on a second
screen or on an evolved EPG.
Mobile App Developers: The API paradigm makes the service suitable for mobile apps, to be used
in different contexts.
Book/Education operators: The service could be suitable for applications related to books.
Passing parts of books, such as chapters, to the service should give a hint of what they are about.
Such an approach could also work for papers in the scientific community.
End users could also play with the service through the GUI, in whatever way they find useful.
Telecom Italia is contributing to FI-CORE by means of its Innovation department. The unit is
chartered to identify new exploitation opportunities combining technological and strategic
aspects.
In trying to clearly define TI’s plans for the exploitation of such an enabler, we identified
several areas of potential interest for such a service within Telecom Italia’s business:
● Social Reading
● Smart Spaces
● Social TV/Video Services
Based on these findings, the Social TV/Social Video Services area seems the most suitable,
foreseeing a possible integration with the Kurento and Social Data Aggregator enablers.
1.3 Requirements
Having described the vision of the Social Semantic Enricher and its possible usage inside and
outside of the project, we now define some requirements in a more formal way, so as to always be
able to provide a clear view of the enabler’s status. The requirements can have three different levels
of importance:
● Must: Indicates a feature considered essential for the success of the application. Without this
feature the success of the application could be at risk.
● Relevant: Indicates a feature considered important for the success of the application.
● Desirable: Indicates a feature that might be interesting for some scenarios, but whose real
interest would need further analysis and experimentation.
R1: MUST
The system must offer a classification service which, given an input text, is able to perform
semantic topic detection and return the analysis result to the caller.
R2: MUST
The system must provide an enhancement service which, given a text and the topics detected by the
classification service (R1), is able to enhance each topic with related multimedia elements.
R2.1: MUST
An Image Enhancement Service must exist, trying to get a relevant image for a given (detected)
topic
R2.2: RELEVANT
A Video Enhancement service must be provided, trying to get a relevant video for a given
(detected) topic.
R2.3: DESIRABLE
A Place Enhancement service should be provided, trying to get a relevant place (if any) to enhance
a given (detected) topic.
R2.4: DESIRABLE
A Text Enhancement service should be provided, trying to get a relevant text (if any) to enhance
a given (detected) topic.
R2.5: DESIRABLE
A News Enhancement service should be provided, trying to get a relevant news item (if any) to
enhance a given (detected) topic.
R3: MUST
The system must have a modular structure, with a “core” part containing Lucene indexes of the
DBpedia datasets to be queried in order to offer the classification service.
R3.1: MUST
A tool to generate the Lucene indexes of DBpedia must be provided (also for update purposes).
R3.2: DESIRABLE
In the final architecture it should be possible to substitute the Dataset/Core module with another
one with reasonable effort.
R4: MUST
The system must offer a core module and an enhancement module communicating with each other
to obtain the final result.
R5: MUST
Responses of the core module must relate only to data inside the core.
R6: MUST
An external System Interface, wrapping and extending the core functionalities must be provided
R6.1: MUST
The External System Interface must offer a set of APIs wrapping the classification service of the
core interface and extending it by offering, via API, the Enhancement Services described in
requirements R2.1 to R2.5.
R7: MUST
A Graphical User Interface offering both the classification and enhancement services must be
provided.
R7.1: DESIRABLE
The GUI should call and present the results of a subset of the APIs described in R6.
1.4 Roadmap
Reviewing the requirements, we decided to give priority in the first period to R3, R3.1, R3.2, R5
and R6. In the next 7 months (March 2015 and June 2015 releases) we will provide:
- A tool for generating the Lucene indexes for the core (March 2015)
- The SSE Core module with a test interface to check the results
The choice of these requirements and activities was made in order to confine and minimize in time
the development of software that can hardly be demonstrated. The software parts matching the
requirements that will be developed later will be easier to demonstrate in terms of progress, so we
decided to leave them for the advanced phase of the project, in which it is more critical to monitor
progress.
2. Component details
2.1 Core Module dataset and Core updater Module details and interfaces
Introduction
The classification process of the SSE uses machine-learning techniques with a memory-based
approach. The target text or document is assigned to a class by comparing it with a set of
pre-classified documents, called the training set.
It works as follows: for each document to be classified, a vector distance between the training set
instances and the document itself is computed. The SSE training set is based on the Wikipedia
contents, stored in a specific document database/collection, named the Lucene (Corpus) Index, for
which the corresponding DBpedia entry is used as the index key. Each entry in this dataset is
formally called a Lucene Document.
For each Lucene Document, indexed with its DBpedia URI (such as
http://dbpedia.org/resource/NameOfTheResource), a specific set of information is stored, including
a field called “context”, which is the core element used to compute the similarity between documents.
The need for an automatic or semi-automatic process to update the indexes in the dataset comes
from the fact that, since the entries in Wikipedia are user-generated, the dataset tends to become
outdated quite soon, due to the number of new entries which are necessarily unprocessed. We will
therefore consider a specific version of the Lucene Corpus Index, necessarily tied to a specific
version of DBpedia and Wikipedia.
The Core Updater, which will be developed in FIWARE Data Sprints 4.2.1, 4.2.2 and 4.2.3 and
released with release 4.2, will generate the datasets to populate the core based on DBpedia version 3.9.
The choice of the DBpedia URI as index key is due to the fact that the main elements we want to
extract are entities, which are uniquely identified by URIs.
Lucene Index Generation Process Details
As said previously, each entry in the Corpus Index document collection has a set of information
called the context. The complete information that will be included in the document context in the
corpus is:
- URI: DBpedia entity URI
- URI COUNT: number of times the entity appears as a wikilink in the whole Wikipedia
- TITLE: Wikipedia title, identified by the URI
- TYPE: value of the rdf:type property related to the entity
- IMAGE: image URL as indicated in the dataset RDF property foaf:depiction
- CONTEXT: Wikipedia paragraph containing the specific entity in wikilink format
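A minimal sketch of how one such entry could be written with the Apache Lucene API is shown
below. The field values, analyzer choice and storage options are illustrative assumptions, not the
actual SSE implementation.

    // Minimal sketch with the Apache Lucene API. Field names follow the list above;
    // field values, analyzer and storage options are assumptions, not the SSE implementation.
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class CorpusIndexEntrySketch {
        public static void main(String[] args) throws Exception {
            try (FSDirectory dir = FSDirectory.open(Paths.get("corpus-index"));
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

                Document doc = new Document();
                // The DBpedia URI identifies the Lucene Document.
                doc.add(new StringField("URI", "http://dbpedia.org/resource/Turin", Field.Store.YES));
                doc.add(new StringField("URI_COUNT", "1234", Field.Store.YES)); // hypothetical count
                doc.add(new StringField("TITLE", "Turin", Field.Store.YES));
                doc.add(new StringField("TYPE", "http://dbpedia.org/ontology/City", Field.Store.YES));
                doc.add(new StringField("IMAGE", "http://example.org/turin.jpg", Field.Store.YES)); // placeholder URL
                // One CONTEXT field per Wikipedia paragraph containing a wikilink to the entity.
                doc.add(new TextField("CONTEXT",
                        "Turin is a city and an important business and cultural centre in northern Italy.",
                        Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }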
As explained previously, the context is the key to computing the similarity between documents.
The similarity metric is computed by taking into account all the paragraphs in which a wikilink
occurs across the whole Wikipedia. If a paragraph contains a wikilink, that paragraph is saved as a
“CONTEXT” field in the Lucene Document corresponding to the specific DBpedia resource. So,
given that each Wikipedia entry has a corresponding DBpedia entry, we will find:
- 1 document for each DBpedia/Wikipedia entry
- N CONTEXT fields, each one related to a wikilink contained in the document and
pointing to its specific resource
For the similarity computation the default similarity metric of Lucene is used, leveraging a
combination of the Vector Space Model and the Boolean model of information retrieval.
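As a rough sketch of this retrieval step, the input text can be issued as a query against the
CONTEXT fields, so that the DBpedia URIs of the best-matching Lucene Documents become the
candidate concepts. The index location, field names and the explicit use of ClassicSimilarity
(Lucene’s historical TF-IDF “default” similarity) are assumptions consistent with the description
above.

    // Rough retrieval sketch against the corpus index built above. Field names and the
    // use of ClassicSimilarity (the historical "default" TF-IDF similarity) are assumptions.
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.similarities.ClassicSimilarity;
    import org.apache.lucene.store.FSDirectory;

    public class ConceptLookupSketch {
        public static void main(String[] args) throws Exception {
            String inputText = "A new exhibition about the Alps opened in Turin last week";
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("corpus-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                searcher.setSimilarity(new ClassicSimilarity()); // Vector Space + Boolean model scoring
                // Match the input text against the CONTEXT fields of the corpus.
                Query query = new QueryParser("CONTEXT", new StandardAnalyzer())
                        .parse(QueryParser.escape(inputText));
                // Candidate concepts (DBpedia URIs) come back ordered by relevance score.
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("URI") + "  score=" + hit.score);
                }
            }
        }
    }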
From a technological point of view, the development of this part will leverage DBpedia Spotlight
and Apache Lucene.
The very first step performed in generating the indexes was to prepare a small portion of the
datasets and to check the results by means of a Java tool called Luke.
We started with some datasets in English, aiming at progressively extending the portion of the
dataset covered. The next figure shows the Luke output for a single resource.
The key parameters are URI COUNT and the CONTEXT field(s). The software will compute
similarity metrics between the given input text and all the contexts in the index.
The next step will be the index generation and the generalization of the code for the upcoming datasets.