Extracting Multilingual Natural-Language Patterns for RDF Predicates
Daniel Gerber and Axel-Cyrille Ngonga Ngomo
Universität Leipzig, Institut für Informatik, AKSW
Presented by Tyler Klement
Introduction – Challenge
+ Challenge:
- Extract new knowledge from raw, natural-language text on the web, in multiple languages
+ Solution:
- Use existing methods that leverage knowledge bases to identify new facts in text
- Improve on these methods with more features and a neural network
Introduction - Solution
- Do so in a language-independent manner, so that the approach is usable in any language
- Generate facts in a machine-readable format (RDF) to feed back into the knowledge base
- The system developed to do all of this is called BOA
Introduction - BOA
+ BOA: Bootstrapping Linked Data
- The core product of this paper: BOA is a system published in 2011 by the same authors, and was used and expanded upon in this paper, published the following year
- Bootstrapping: “a technique used to iteratively improve a classifier's performance . . . recursive self-improvement” (Wikipedia)
Introduction - BOA
- “The goal of the BOA framework is to allow extracting structured data as RDF from unstructured data.” (Gerber & Ngomo 2012)
- BOA is a framework that generates RDF from natural-language text, using the Data Web as background knowledge: a system that can continually learn new facts from raw textual data so that they can be added back to the Data Web
Introduction - The Data Web
- Data Web = Linked Data Web = Semantic Web
- “Data Web” essentially refers to a standardized methodology for structuring data (RDF), and to all openly available data that is structured with this methodology
Approach
+ How did they do it?
Approach - Input
+ Input
- Knowledge base(s)
- Text corpus (mostly extracted from the web)
- (optional) Wikipedia dump
- (if wiki dump) Generate surface forms for all entities in the source knowledge base:
- For each predicate in the knowledge sources, gather sentence-level co-occurrence statistics for the label pairs linked via the predicate
Approach – Pattern Extraction
+ Step 1. Pattern Extraction
a. (if wiki dump) Generate surface forms for all entities in the source knowledge base
- Given a predicate in the knowledge base, identify the subject and object of the predicate, then look up all variations of the names of these entities (using Wikipedia redirect and disambiguation pages)
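A minimal sketch of the surface-form lookup, with a hand-built dictionary standing in for the Wikipedia redirect/disambiguation data that BOA derives from the dump (the entries here are illustrative assumptions):

```python
# Hypothetical redirect table; BOA builds this from a Wikipedia dump.
REDIRECTS = {
    "United_Kingdom": ["UK", "U.K.", "Britain", "Great Britain"],
}

def surface_forms(entity_label: str) -> list[str]:
    """Return the canonical label plus all known surface forms."""
    canonical = entity_label.replace(" ", "_")
    return [entity_label] + REDIRECTS.get(canonical, [])

print(surface_forms("United Kingdom"))
# → ['United Kingdom', 'UK', 'U.K.', 'Britain', 'Great Britain']
```

During extraction, any of these forms found in the corpus is mapped back to the one knowledge-base entity, which is what raises recall.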
Approach – Pattern Extraction
- e.g. “UK” can be identified with United_Kingdom
- Using surface forms gives significantly higher recall
Approach – Pattern Extraction
- [Figure: recall with surface forms (red, green) and without (yellow, blue)]
Approach – Pattern Extraction
b. Get patterns (i.e. natural-language representations (NLRs) of DBpedia predicates) from the corpus
- A pattern is technically a predicate paired with an NLR
- So, for each (subject, predicate, object) triple instance of a predicate from DBpedia:
- Find all substrings in the corpus of the form “subject . . . object” or “object . . . subject”
- Extract the substring between the subject and object, replacing subject and object with placeholders. These are our NLRs; associating them with the predicate gives us patterns.
- All patterns for a given predicate together form a pattern mapping
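The extraction step above can be sketched as follows; `?D?` and `?R?` are the domain (subject) and range (object) placeholders used in the slides, and the helper name `extract_patterns` is my own:

```python
import re

def extract_patterns(sentence, subject, predicate, obj):
    """Extract BOA-style patterns: the substring between subject and object,
    with ?D? marking the domain (subject) and ?R? the range (object)."""
    patterns = []
    for first, second, left, right in [(subject, obj, "?D?", "?R?"),
                                       (obj, subject, "?R?", "?D?")]:
        m = re.search(re.escape(first) + r"(.+?)" + re.escape(second), sentence)
        if m:
            patterns.append((predicate, left + m.group(1) + right))
    return patterns

sent = "The Empire State Building was designed by Shreve, Lamb and Harmon."
print(extract_patterns(sent, "Empire State Building", ":architect",
                       "Shreve, Lamb and Harmon"))
# → [(':architect', '?D? was designed by ?R?')]
```

This only handles one sentence and exact labels; BOA runs the same idea over the whole corpus and over all surface forms of each entity.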
Approach – Pattern Extraction
- Example: given the DBpedia predicate “architect” and the following information from DBpedia:
Approach – Pattern Extraction
- Search the corpus for a sentence containing “Empire State Building” followed eventually by “Shreve, Lamb and Harmon” for an (s, p, o) triple, or the reverse for an (o, p, s) triple, thus finding:
- Patterns extracted are:
- (:architect, “?D? was designed by ?R?”)
- (:architect, “?R? also designed the ?D?”)
Approach – Pattern Extraction
- BOA filters patterns according to constraints:
-
-
Patterns must have at least one non stop-word
-
Token count between two specific thresholds
-
Do not begin with “and” or “, and”
-
Must be common, popular, statistically
significant (above a threshold)
Different predicates can have the same patterns
(e.g. “architect” and “builder”)
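A filter mirroring these constraints might look like the sketch below; the stop-word list and thresholds are illustrative, not the paper's actual values, and the frequency/significance check is omitted:

```python
STOPWORDS = {"the", "a", "an", "was", "by", "and", "also", "of", "in", "is"}

def keep_pattern(nlr: str, min_tokens: int = 1, max_tokens: int = 10) -> bool:
    """Apply the structural filtering constraints listed above.
    (The statistical-significance threshold is not modelled here.)"""
    inner = nlr.replace("?D?", "").replace("?R?", "").strip()
    tokens = inner.split()
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    if all(t.lower() in STOPWORDS for t in tokens):
        return False  # must contain at least one non-stop-word
    if tokens[0].lower() == "and" or inner.startswith(", and"):
        return False
    return True

print(keep_pattern("?D? was designed by ?R?"))  # → True
print(keep_pattern("?D? and ?R?"))              # → False (stop-words only)
```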
Approach – Feature Extraction
+ Step 2. Feature Extraction
- When a pattern has survived filtering, features are extracted from it
- The features are as follows:
Approach – Feature Extraction
- Support (2 features): captures how often a pattern occurs between the elements of all the (s, o) pairs for that predicate
- s1 (first feature): the number of distinct (s, o) pairs a pattern has been learned from
- s2 (second feature): the maximum number of occurrences of the same pattern between the elements of a single pair from the set {(s, o) : (s p o) ∈ KB}
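On a concrete occurrence log, the two support features reduce to simple counting; the data below is hypothetical:

```python
from collections import Counter

# Hypothetical log: the (subject, object) pair observed each time the
# pattern "?D? was designed by ?R?" matched in the corpus.
occurrences = [
    ("Empire State Building", "Shreve, Lamb and Harmon"),
    ("Empire State Building", "Shreve, Lamb and Harmon"),
    ("Chrysler Building", "William Van Alen"),
]

counts = Counter(occurrences)
s1 = len(counts)            # distinct (s, o) pairs the pattern was learned from
s2 = max(counts.values())   # max occurrences between the elements of one pair
print(s1, s2)               # → 2 2
```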
Approach – Feature Extraction
- Specificity (1 feature): the IDF of a pattern, i.e. how exclusively the pattern expresses the predicate
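An IDF-style specificity can be sketched as below; the pattern-to-predicate table and the predicate count are assumptions for illustration, not the paper's numbers:

```python
import math

# Hypothetical pattern mappings: which predicates each NLR was learned for.
nlr_to_predicates = {
    "?D? was designed by ?R?": {":architect"},
    "?D? is located in ?R?": {":location", ":headquarters", ":birthPlace"},
}
NUM_PREDICATES = 10  # assumed total number of predicates considered

def specificity(nlr: str) -> float:
    """IDF-style score: the fewer predicates an NLR expresses, the higher."""
    return math.log(NUM_PREDICATES / len(nlr_to_predicates[nlr]))

# A pattern tied to a single predicate scores higher than a generic one.
assert specificity("?D? was designed by ?R?") > specificity("?D? is located in ?R?")
```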
Approach – Feature Extraction
- Typicity (3 features): higher typicity means a pattern connects entities that lie more closely within the domain and range restrictions of the predicate, i.e. the subject and object of the predicate conform more closely to the predicate's known subject/object classes in the ontology
- t1: the subject's closeness to the known subject classes
- t2: the object's closeness to the known object classes
- t3: prevents promoting only patterns that have low recall
Approach – Feature Extraction
- IICM (1 feature): Intrinsic Information Content Metric: gets all synsets for the predicate from WordNet (e.g. “architect” and “designer”), computes the similarity of each of these terms to each term in the pattern, and returns the highest similarity over all pairs
- ReVerb (1 feature): the average of the confidence scores assigned by ReVerb's classifier (thereby taking advantage of POS-based regular expressions) to the pattern over all of its occurrences/contexts
Approach – Scoring
+ Step 3. Scoring – Neural Network
- Because so many features are available, a feed-forward neural network was used to find the optimal combination of feature values
- Network structure:
- Input: one input neuron per feature
- One hidden layer (various layer sizes were tested)
- Output: one neuron, whose activation is used as the score
- Trained on 200 manually annotated patterns
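The forward pass of such a network can be sketched in a few lines; the layer sizes are arbitrary here, and the random weights merely stand in for the trained ones (BOA trains on the 200 annotated patterns):

```python
import math
import random

random.seed(0)

N_FEATURES, HIDDEN = 8, 4  # illustrative sizes; the paper tests several

# Randomly initialised weights stand in for a trained network.
W1 = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(HIDDEN)]
W2 = [random.uniform(-1, 1) for _ in range(HIDDEN)]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def score(features: list[float]) -> float:
    """Forward pass: features -> one hidden layer -> single output neuron,
    whose activation serves as the pattern score."""
    hidden = [sigmoid(sum(w * f for w, f in zip(row, features))) for row in W1]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)))

s = score([0.5] * N_FEATURES)
print(s)  # a value in (0, 1)
```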
Approach – RDF Generation
+ Step 4. RDF Generation
- For each occurrence of a well-scored pattern in the corpus, use a named-entity recognizer to get the subject and object of the predicate in that occurrence
- Use these entities as the labels for the RDF domain and range
- If a URI already exists for an entity, use it; otherwise create a new one
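A minimal sketch of the URI-reuse step, emitting an N-Triples line; the namespaces and the `uri_for`/`make_triple` helpers are my own illustrations, not BOA's API:

```python
# Known URIs stand in for a lookup against the knowledge base.
known_uris = {
    "Empire State Building": "http://dbpedia.org/resource/Empire_State_Building",
}

def uri_for(label: str) -> str:
    """Reuse an existing URI if one is known; otherwise mint a new one
    in an assumed local namespace."""
    fallback = "http://boa.example.org/resource/" + label.replace(" ", "_")
    return known_uris.get(label, fallback)

def make_triple(subj_label: str, predicate_uri: str, obj_label: str) -> str:
    """Serialise one extracted fact as an N-Triples line."""
    return f"<{uri_for(subj_label)}> <{predicate_uri}> <{uri_for(obj_label)}> ."

t = make_triple("Empire State Building",
                "http://dbpedia.org/ontology/architect",
                "Shreve, Lamb and Harmon")
print(t)
```

(A production system would also percent-encode labels before minting URIs; that detail is skipped here.)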
Evaluation - Setup
+ Experimental Setup
- Sample 100 triples extracted by BOA (for each corpus) and manually annotate their precision
- Multilinguality was evaluated on 2 English corpora and 2 German corpora
- Note: the English corpora were twice the size of the German corpora
Evaluation – Score Function
- 200 patterns per corpus were hand-annotated:
- 1 = high-quality pattern
- 0 = not high-quality
- One neural network per dataset was then trained
Evaluation - Results
+ Results
- Only triples generated by at least 2 patterns were integrated; this boosted precision, resulting in . . .
- 92/91% precision for English and 70/74% for German
Evaluation – Results
Conclusion
- German is possibly representative of the system's behavior on other languages, but highly morphological languages remain risky
- More language-independent features should be added to approach the high performance achieved for English
- Predicate learning would complete the picture
Thank you.