1. Enabler Description, Requirements and Roadmap

1.1 Definition of the master use scenarios

Reader's Background
Even though this document is kept as simple as possible, the reader needs some general knowledge about semantic technologies, formats and datasets in order to fully understand its content. Namely:
- Basics of semantic technologies (specifically those exploiting RDF and the concept of unique entities)
- Basics of "named entity recognition" and "automatic classification"
- RDF format and triple stores

Technological Background
The evolution of the traditional Web into a Semantic Web and the continuous increase in the amount of data published as Linked Data open up new opportunities for annotation and categorization systems to reuse these data as semantic knowledge bases. Accordingly, Linked Data has been used by information extraction systems as a source of semantic knowledge bases, which can be interconnected and structured in order to increase the precision and recall of annotation and categorization mechanisms.
The goal of the Semantic Web (or Web of Data) is to describe the meaning of the information published on the Web, so that retrieval can be based on an accurate understanding of its semantics. The Semantic Web adds structure to the resources accessible online in ways that are usable not only by humans but also by software agents that can process them rapidly.
Linked Data (LD) refers to a way of publishing and interlinking structured data on the Web. LD is part of the design of the Semantic Web and represents the foundation "bricks" needed to build it. Although the concept of "linked data" was already present in the theory of the Semantic Web, it came into vogue in computer science only later, because computing capacity was too low at the time. In recent years, however, the growing number of LD-based datasets has made it possible to exploit their implicit knowledge through text classification and annotation processes in order to build semantic applications.
Text classification is the assignment of a text to one or more pre-existing classes (also known as features). This process determines the class membership of a text document given a set of distinct classes, each with a profile and a number of features. The criterion for selecting the features relevant for classification is essential and is determined a priori by the classifier (human or software). Classification is semantic when the elements of interest refer to the "meaning" of the document.
Text annotation refers to the common practice of adding information to the text itself through underlining, notes, comments, tags or links. Annotation can also be semantic, when a document is enriched with information about its meaning or the meaning of the individual elements that compose it. This is done primarily through links that connect a word, an expression or a phrase to an information resource on the Web or to an unambiguous entity present in a knowledge base.

Social Semantic Enricher
The Social Semantic Enricher (SSE) works on text. Given an input text, the enabler performs a semantic classification process in order to identify a configurable number of concepts, ordered by relevance, which can both describe and extend the input text. Each concept is represented by a single word or a set of words related to a specific object, referenced by a Linked Open Data entry in the dataset.
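To make the output of the classification step concrete, the following is a minimal sketch of how a detected concept could be represented. The class and field names are purely illustrative assumptions and are not part of the enabler's specification.

```java
/**
 * Illustrative only: a hypothetical value object for one concept detected by
 * the SSE classification service. Names are assumptions, not the actual API.
 */
public class DetectedConcept {

    private final String label;       // the word(s) representing the concept
    private final String dbpediaUri;  // Linked Open Data entry, e.g. a DBpedia resource URI
    private final double relevance;   // relevance score used to order the concepts

    public DetectedConcept(String label, String dbpediaUri, double relevance) {
        this.label = label;
        this.dbpediaUri = dbpediaUri;
        this.relevance = relevance;
    }

    public String getLabel()      { return label; }
    public String getDbpediaUri() { return dbpediaUri; }
    public double getRelevance()  { return relevance; }
}
```

A classification call would then return a list of such objects sorted by decreasing relevance, for example ("Rome", http://dbpedia.org/resource/Rome, 0.92).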
Concepts are returned ordered by relevance, meaning that the last concepts may have low relevance with respect to the text, which opens the way to some optimization problems. The returned concepts are then used as inputs for the enrichment part: the system enriches the text by searching different data sources for media and text related to the concepts found.
The system will be composed of a "core" part, containing the data in a format that is efficient to query, communicating with an "Enhancement" module. The interplay between the different modules produces, for the input text, the list of enriched concepts. The "core interface" is the one provided to process the text and obtain the list of concepts related to it. For each concept a basic set of information is returned, with the common characteristic of being found in the core dataset itself.
Connected to this core interface there is the "enhancer client". The enhancer client wraps the core interface and also exposes its own APIs (in REST format), so that the data coming out of the core can be enhanced with external media and text. A call to the REST client interface specifies how to enhance the data coming out of the SSE Core. Different policies will be defined, such as "enhance only with images" or "enhance with videos and images", etc. (see the illustrative sketch at the end of this section). The external sources will certainly include images and video, with a possible extension to external text, news and places.
Besides this, in order to have a "freestanding" enabler, a Graphical User Interface module will be developed to provide a friendly environment in which users can simply play with the enabler or understand whether this kind of service is what they really need for their own purposes.
In order to be as generic as possible, the core will contain the Lucene indexes of the DBpedia datasets, also because DBpedia can be mapped onto Wikipedia, which makes the enhancement part easier, especially for images. In general, the module structure should allow, with reasonable effort and thanks to a clear definition of interfaces, the core to be replaced with another one. At least for DBpedia data updates, a tool to generate the Lucene indexes shall be provided.
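As an illustration of how a client might invoke the enhancement REST interface with a policy, the following sketch assumes a hypothetical endpoint and parameter names (enhance, text, policy); none of these are defined by the enabler yet and they are used here only to make the idea concrete.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class EnhancerClientExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint of the SSE enhancer client; the real URL and
        // parameter names will be defined by the enabler's API specification.
        String endpoint = "http://localhost:8080/sse/enhance";

        String text = "The Colosseum is an oval amphitheatre in the centre of Rome.";
        String policy = "images-only"; // e.g. the "enhance only with images" policy

        String body = "text=" + URLEncoder.encode(text, StandardCharsets.UTF_8)
                    + "&policy=" + URLEncoder.encode(policy, StandardCharsets.UTF_8);

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // The response is expected to contain the detected concepts, each one
        // enhanced according to the requested policy.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```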
1.2 Stakeholders/User Stories

A service like the Social Semantic Enricher could be useful in different environments for different stakeholders:
- News operators: Enriching the text of specific news items could be a way to provide insights on specific topics to end users, allowing them to go deeper into "related" topics. This usage of the Social Semantic Enricher would sit on the border between classification and recommendation.
- Broadcasters/Broadcasting service providers: The EPG description of a programme could be passed as text to the enricher in order to obtain related topics to be presented on a second screen or on an evolved EPG.
- Mobile app developers: The API paradigm makes the service suitable for mobile apps, to be used in different contexts.
- Book/Education operators: The service could be suitable for applications related to books. Passing parts of books, such as chapters, to the service should give a hint of what they are about. Such an approach could also work for papers in the scientific community.
- End users could also play with the service through the GUI in whatever way they find useful.

Telecom Italia is contributing to FI-CORE by means of its Innovation department. The unit is chartered to identify new exploitation opportunities combining technological and strategic aspects. In trying to clearly define TI's plans for the exploitation of such an enabler, we identified several areas of potential interest for this service within Telecom Italia's business:
● Social Reading
● Smart Spaces
● Social TV/Video Services
Based on these findings, the Social TV/Social Video Services area seems the most suitable, foreseeing a possible integration with the Kurento and Social Data Aggregators enablers.

1.3 Requirements

Having described the vision of the Social Semantic Enricher and its possible usage inside and outside the project, we now define some requirements in a more formal way, so as to always provide a clear view of the enabler's status. The requirements can have three different levels of importance:
● Must: a feature considered essential for the success of the application. Without this feature the success of the application could be at risk.
● Relevant: a feature considered important for the success of the application.
● Desirable: a feature that might be interesting for some scenarios, but whose real interest would need further analysis and experimentation.

R1: MUST - The system must offer a classification service which, given a text in input, should be able to perform semantic topic detection and return the analysis result to the caller.
R2: MUST - The system must provide an enhancement service which, given a text and the topics detected as per R1, should be able to enhance each topic with multimedia elements related to it.
R2.1: MUST - An Image Enhancement service must exist, trying to get a relevant image for a given (detected) topic.
R2.2: RELEVANT - A Video Enhancement service must be provided, trying to get a relevant video for a given (detected) topic.
R2.3: DESIRABLE - A Place Enhancement service should be provided, trying to get a relevant place (if any) to enhance a given (detected) topic.
R2.4: DESIRABLE - A Text Enhancement service should be provided, trying to get a relevant text (if any) to enhance a given (detected) topic.
R2.5: DESIRABLE - A News Enhancement service should be provided, trying to get relevant news (if any) to enhance a given (detected) topic.
R3: MUST - The system must have a modular structure, with a "core part" containing the Lucene indexes of the DBpedia datasets to be queried in order to offer the classification service.
R3.1: MUST - There should be a tool to generate the Lucene indexes of DBpedia (also for update purposes).
R3.2: DESIRABLE - In the final architecture it should be possible to substitute the dataset/core module with another one with reasonable effort.
R4: MUST - The system must offer a core module and an enhancement module communicating with each other to obtain the final result.
R5: MUST - Responses of the core module should relate only to data inside the core.
R6: MUST - An External System Interface, wrapping and extending the core functionalities, must be provided.
R6.1: MUST - The External System Interface should offer a set of APIs wrapping the classification service of the core interface and extending it by offering, via API, the Enhancement Services described in requirements R2.1 to R2.5 (a possible shape of such an interface is sketched after this list).
R7: MUST - A Graphical User Interface offering both the classification and enhancement services must be provided.
R7.1: DESIRABLE - The GUI must call and present the results of a subset of the APIs described in R6.
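Purely as an illustration of the modular structure required by R3.2, R4 and R6 (a core classification module behind a stable interface, wrapped by an external interface that adds the enhancement services), the following Java sketch uses hypothetical interface and method names that are not part of the enabler's specification.

```java
import java.util.List;

/** Hypothetical contract of the SSE core module (R3, R5): classification only,
 *  answered exclusively from data inside the core dataset. */
interface CoreClassifier {
    /** Returns the topics detected in the text, ordered by relevance. */
    List<String> classify(String text, int maxTopics);
}

/** Hypothetical contract of the enhancement module (R2.x): enriches a single
 *  detected topic with external media according to a policy. */
interface Enhancer {
    List<String> enhance(String topicUri, String policy); // e.g. "images-only"
}

/** Hypothetical External System Interface (R6/R6.1): wraps the core classifier
 *  and extends it with the enhancement services. Because it depends only on the
 *  two interfaces above, the core could be swapped for another implementation
 *  with reasonable effort (R3.2). */
class ExternalSystemInterface {
    private final CoreClassifier core;
    private final Enhancer enhancer;

    ExternalSystemInterface(CoreClassifier core, Enhancer enhancer) {
        this.core = core;
        this.enhancer = enhancer;
    }

    /** Classifies the text, then enhances each detected topic. */
    void classifyAndEnhance(String text, String policy) {
        for (String topic : core.classify(text, 10)) {
            System.out.println(topic + " -> " + enhancer.enhance(topic, policy));
        }
    }
}
```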
1.4 Roadmap

Looking at the requirements, we decided in the first period to give priority to R3, R3.1, R3.2, R5 and R6. In the next 7 months (releases of March 2015 and June 2015) we will provide:
- A tool for generating the Lucene indexes for the core (March 2015)
- The SSE Core module with a test interface to check the results
The choice of these requirements and activities was made to confine and minimize in time the development of software that can hardly be demonstrated. The software parts matching the requirements that will be developed later will be easier to demonstrate in terms of progress, and we decided to leave them for the advanced phase of the project, in which it is more critical to monitor progress.

2. Component details

2.1 Core Module dataset and Core updater Module details and interfaces

Introduction
The classification process of the SSE uses machine-learning techniques with a memory-based approach. The target text or document is assigned to a class by comparing it with a set of pre-classified documents, called the training set. It works like this: for each document to be classified, a vector distance between the training set instances and the document itself is computed.
The SSE training set is based on Wikipedia content, stored in a specific document database/collection named the Lucene (Corpus) Index, for which the corresponding DBpedia entry is used as the index key. Each entry in this dataset is formally called a Lucene Document. For each Lucene Document, indexed by its DBpedia URI (such as http://dbpedia.org/resource/NameOfTheResource), a specific set of information is stored, including a field called "context", which is central to computing the similarity between documents.
The need for an automatic or semi-automatic process to update the indexes in the dataset comes from the fact that, since the entries in Wikipedia are user-generated, the dataset tends to become outdated quite soon, due to a number of new entries that are necessarily unprocessed. We will therefore consider a specific version of the Lucene Corpus Index, necessarily tied to a specific version of DBpedia and Wikipedia. The Core updater, which will be developed in FIWARE Data Sprints 4.2.1-4.2.2-4.2.3 and released with release 4.2, will generate the datasets to populate the core based on DBpedia version 3.9.
The choice to use the DBpedia URI as the index key is due to the fact that the main elements we want to extract are entities, which are uniquely identified by URIs.

Lucene Index Generation Process Details
As said previously, each entry in the Corpus Index document collection has a set of information called the context. The complete information included in each document of the corpus is:
- URI: DBpedia entity URI
- URI COUNT: number of times the entity appears as a wikilink in the whole of Wikipedia
- TITLE: Wikipedia title, identified by the URI
- TYPE: value of the rdf:type property related to the entity
- IMAGE: image URL as indicated in the dataset by the RDF property foaf:depiction
- CONTEXT: Wikipedia paragraph containing the specific entity in wikilink format
As explained previously, the context is the key to computing the similarity between documents. The similarity metric is computed by taking into account all of the paragraphs in which a wikilink occurs in the whole of Wikipedia: if a paragraph contains a wikilink, the paragraph is saved as a CONTEXT field in the Lucene document corresponding to that specific DBpedia resource (a sketch of this indexing step is given below).
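The following is a minimal sketch, not the actual Core updater code, of how such a Lucene document could be added to the index with Apache Lucene. The field names mirror the list above (URI_COUNT is used in place of "URI COUNT"); the index path, analyzer choice and sample values are assumptions.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CorpusIndexerSketch {

    public static void main(String[] args) throws Exception {
        // Directory where the Lucene Corpus Index is written (path is an assumption).
        try (FSDirectory dir = FSDirectory.open(Paths.get("sse-corpus-index"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            // One Lucene Document per DBpedia/Wikipedia entry, keyed by its DBpedia URI.
            Document doc = new Document();
            doc.add(new StringField("URI", "http://dbpedia.org/resource/Colosseum", Field.Store.YES));
            doc.add(new StoredField("URI_COUNT", 1234));                // wikilink occurrences in Wikipedia
            doc.add(new StringField("TITLE", "Colosseum", Field.Store.YES));
            doc.add(new StringField("TYPE", "http://dbpedia.org/ontology/Building", Field.Store.YES));
            doc.add(new StoredField("IMAGE", "http://example.org/colosseum.jpg")); // foaf:depiction
            // One CONTEXT field per paragraph that contains the entity as a wikilink;
            // CONTEXT is a tokenized TextField because similarity is computed on it.
            doc.add(new TextField("CONTEXT",
                    "The Colosseum is an oval amphitheatre in the centre of the city of Rome.",
                    Field.Store.YES));

            writer.addDocument(doc);
        }
    }
}
```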
So, given that each Wikipedia entry has a corresponding DBpedia entry, the index will contain:
- one document for each DBpedia/Wikipedia entry
- N CONTEXT fields per document, each one related to a wikilink contained in it and pointing to its specific resource
For the similarity computation, Lucene's default similarity metric is used, which relies on a combination of the Vector Space Model and the Boolean model of information retrieval. From a technological point of view, the development of this part leverages DBpedia Spotlight and Apache Lucene.
The very first step performed in generating the indexes was to prepare small portions of the datasets and to check the results by means of Luke, a Java tool for inspecting Lucene indexes. We started with some datasets in English, aiming at extending the portion of the dataset obtained in this area. The next figure shows the Luke output for a single resource; the key parameters are URI COUNT and the CONTEXT field(s). The software computes the similarity metric between the given input text and all the contexts in the index (a minimal sketch of such a query is given below). The next step will be the full index generation and the generalization of the code for the upcoming datasets.
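To make the retrieval step concrete, here is a minimal sketch, under the same assumptions as the indexing sketch above (index path and field names), of how an input text could be matched against all the CONTEXT fields using Lucene's default scoring. It is an illustration, not the SSE Core implementation.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class ContextSimilaritySketch {

    public static void main(String[] args) throws Exception {
        String inputText = "gladiators fighting in an ancient Roman amphitheatre";

        try (FSDirectory dir = FSDirectory.open(Paths.get("sse-corpus-index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {

            // Lucene's default scoring is used as-is, i.e. the combination of the
            // Vector Space Model and the Boolean model mentioned above.
            IndexSearcher searcher = new IndexSearcher(reader);

            // Turn the input text into a query over the CONTEXT field; escaping
            // avoids problems with characters that are special to the query syntax.
            QueryParser parser = new QueryParser("CONTEXT", new StandardAnalyzer());
            Query query = parser.parse(QueryParser.escape(inputText));

            // The top-ranked documents correspond to the DBpedia resources whose
            // contexts are most similar to the input text.
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(doc.get("URI") + "  score=" + hit.score);
            }
        }
    }
}
```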