
Finding frequent and interesting triples in text
Janez Brank, Dunja Mladenić, Marko Grobelnik
Jožef Stefan Institute, Ljubljana, Slovenia
Motivation
• Help with populating a knowledge base / ontology (e.g. something like Cyc) with common-sense “facts” that would help with reasoning or querying
– We’ll be interested in ⟨concept1, relation, concept2⟩ triples
– E.g. ⟨person, inhabit, country⟩ tells us that a country is something that can be inhabited by a person, which is potentially useful
• We’d like to automatically extract such triples from a corpus of text
– They are likely to contain slightly abstract concepts and aren’t mentioned directly in the text, but their specializations are
– We will use WordNet to generalize concepts
Overview of the approach
• [Flow diagram] Corpus of text → parser + some heuristics → list of ⟨subject, predicate, object⟩ triples → (WordNet) → list of concept triples → (generalization, minimum support threshold) → list of frequent triples → (measures of interest) → list of frequent, interesting triples
Associating input triples with WordNet concepts
• Our input was a list of ⟨subject, predicate, object⟩ triples
– Each component is a phrase in natural language, e.g. ⟨European Union finance ministers, approved, convergence plans⟩
– But we’d like each component to be a WordNet concept so that we’ll be able to use WordNet for generalization
• We use a simple heuristic approach (sketched in code below):
– Look for the longest subsequence of words that also happens to be the name of a WordNet concept
• Thus “finance minister”, not “minister”
– Break ties by selecting the rightmost such sequence
• Thus “finance minister”, not “European Union”
– Be prepared to normalize words when matching
• “ministers” → “minister”
– Use only the nouns in WordNet when processing the subject and object, and only the verbs when processing the predicate
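A minimal sketch of this matching heuristic, assuming NLTK’s WordNet interface (the slides do not name an implementation); taking the first returned sense is a simplification, since sense disambiguation is not discussed here:

```python
# Illustrative sketch only; NLTK's WordNet interface is an assumption,
# not something the slides specify.
from nltk.corpus import wordnet as wn

def match_concept(words, pos):
    """Return a WordNet synset for the longest contiguous subsequence of
    words that names a WordNet concept; ties are broken by taking the
    rightmost such subsequence."""
    n = len(words)
    for length in range(n, 0, -1):                 # prefer longer matches
        for start in range(n - length, -1, -1):    # prefer rightmost matches
            phrase = "_".join(words[start:start + length])
            # wn.synsets() normalizes morphologically, e.g. "ministers" -> "minister"
            synsets = wn.synsets(phrase, pos=pos)
            if synsets:
                return synsets[0]   # first sense; disambiguation not addressed here
    return None

# Nouns for subject and object, verbs for the predicate:
subj = match_concept("European Union finance ministers".split(), wn.NOUN)
pred = match_concept("approved".split(), wn.VERB)
obj  = match_concept("convergence plans".split(), wn.NOUN)
```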
Identifying frequent triples
• Now we have a list of concept triples, each of which corresponds roughly to one clause in the input textual corpus
• Let u ≼ v denote that v is a hypernym (direct or indirect) of u in WordNet (including u = v)
• support(s, v, o) := the number of concept triples (s', v', o') such that s' ≼ s, v' ≼ v, o' ≼ o (see the sketch below)
– Thus, a triple that supports ⟨finance minister, approve, plan⟩ also supports ⟨executive, approve, idea⟩
• We want to identify all ⟨s, v, o⟩ whose support exceeds a certain threshold
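Read literally, the definition can be sketched as follows (concepts are assumed to be NLTK WordNet synsets; generalizes() encodes the ≼ relation above):

```python
# Sketch of the support count from the definition above; concepts are assumed
# to be NLTK WordNet synsets.

def generalizes(u, v):
    """True if v equals u or is a (direct or indirect) hypernym of u, i.e. u ≼ v."""
    return v == u or v in u.closure(lambda x: x.hypernyms())

def support(s, v, o, concept_triples):
    """Number of input concept triples (s', v', o') with s' ≼ s, v' ≼ v, o' ≼ o."""
    return sum(
        1 for (s2, v2, o2) in concept_triples
        if generalizes(s2, s) and generalizes(v2, v) and generalizes(o2, o)
    )
```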
Identifying frequent triples
• We use an algorithm inspired by Apriori
• However, we have to adapt it to prevent the generation of an intractably large number of candidate triples (most of which would turn out to be infrequent)
• We use the depth of concepts in the WordNet hierarchy to order the search space
• Process triples in increasing order of the sum of the depths of their concepts (see the sketch below)
– Each depth-sum requires one pass through the data
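An illustrative sketch of such a level-wise search, not the authors’ exact algorithm: candidates at depth-sum d are derived here by replacing one component of a frequent triple at depth-sum d−1 with a direct hyponym (assuming a direct hyponym sits exactly one level deeper), and each depth-sum is counted in one pass over the data; hyponyms() and supports() are assumed helpers.

```python
# Illustrative level-wise search ordered by depth-sum; not the authors' exact algorithm.
from collections import Counter

def find_frequent_triples(concept_triples, root_triple, min_support,
                          max_depth_sum, hyponyms, supports):
    """root_triple is the most general triple, e.g. <entity, act, entity>;
    hyponyms(c) yields the direct hyponyms of concept c;
    supports(data_triple, candidate) is the component-wise hypernym test."""
    frequent = {root_triple: len(concept_triples)}
    by_depth = {0: {root_triple}}
    for d in range(1, max_depth_sum + 1):
        # Candidates: specialize one component of a frequent triple at depth-sum d-1.
        candidates = set()
        for (s, v, o) in by_depth.get(d - 1, ()):
            candidates.update((s2, v, o) for s2 in hyponyms(s))
            candidates.update((s, v2, o) for v2 in hyponyms(v))
            candidates.update((s, v, o2) for o2 in hyponyms(o))
        counts = Counter()
        for triple in concept_triples:              # one pass per depth-sum
            for cand in candidates:
                if supports(triple, cand):
                    counts[cand] += 1
        by_depth[d] = {c for c in candidates if counts[c] >= min_support}
        frequent.update({c: counts[c] for c in by_depth[d]})
    return frequent
```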
Identifying interesting triples
• Not all frequent triples are interesting
– Generalizing one or more components of the triple leads to a higher (or at least equal) support
– Thus the most general triples are also the most frequent, but they aren’t interesting
• E.g. ⟨entity, act, entity⟩
• We are investigating heuristics to identify which triples are likely to be interesting
– Let s be a concept and s' its hypernym.
– Every input triple that supports s in its subject also supports s', but the other way around is usually not true.
– We can think of the ratio support(s) / support(s') as a “conditional probability” P(s|s').
– So we might naively expect that P(s|s') · support(s', v, o) input triples will support the triple ⟨s, v, o⟩.
– But the actual support(s, v, o) can be quite different. If it is significantly higher, we conclude that s fits well together with v and o.
– Thus, interestingness_S(s, v, o) = support(s, v, o) / (P(s|s') · support(s', v, o)) (sketched in code below).
– Analogous measures can be defined for v and o as well.
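A direct transcription of this measure as a sketch; support1() (support of a single concept in the subject slot), support3() (triple support, as above) and hypernym() (the hypernym s' used for comparison) are assumed helpers, not something specified in the slides:

```python
# Sketch of interestingness_S as defined above; support1, support3 and hypernym
# are assumed helpers, not part of the original slides.

def interestingness_S(s, v, o, support1, support3, hypernym):
    """support(s, v, o) / (P(s|s') * support(s', v, o)), where s' is a
    hypernym of s and P(s|s') = support(s) / support(s')."""
    s_prime = hypernym(s)                                   # e.g. the direct hypernym of s
    p_s_given_sprime = support1(s) / support1(s_prime)      # "conditional probability" P(s|s')
    expected = p_s_given_sprime * support3(s_prime, v, o)   # naive expectation for support(s, v, o)
    return support3(s, v, o) / expected                     # >> 1: s fits well with v and o
```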
Identifying interesting triples
• But this measure of interestingness turns out to be too sensitive to outliers and quirks in the WordNet hierarchy
• Define the sv-neighbourhood of a triple ⟨s, v, o⟩ as the set of all (frequent) triples with the same s and v.
– The so- and vo-neighbourhoods can be defined analogously.
• Possible criteria to select interesting triples now include (see the sketch below):
– A triple is interesting if it is the most interesting in two (or even all three) of its neighbourhoods (sv-, so- and vo-).
– We might also require that the neighbourhoods be large enough.
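One way to read these criteria as code (a sketch; the slides leave open which interestingness variant is used inside each neighbourhood, so a single interestingness(triple) score is assumed here):

```python
# Sketch of the neighbourhood-based selection criterion; assumes a single
# interestingness(triple) score and a minimum neighbourhood size.
from collections import defaultdict

def select_interesting(frequent_triples, interestingness, min_size=2, min_wins=2):
    neighbourhoods = defaultdict(list)
    for (s, v, o) in frequent_triples:
        neighbourhoods[("sv", s, v)].append((s, v, o))
        neighbourhoods[("so", s, o)].append((s, v, o))
        neighbourhoods[("vo", v, o)].append((s, v, o))

    selected = []
    for triple in frequent_triples:
        s, v, o = triple
        wins = 0
        for key in (("sv", s, v), ("so", s, o), ("vo", v, o)):
            group = neighbourhoods[key]
            # "most interesting in its neighbourhood", with the neighbourhood large enough
            if len(group) >= min_size and triple == max(group, key=interestingness):
                wins += 1
        if wins >= min_wins:          # use min_wins=3 for "all three neighbourhoods"
            selected.append(triple)
    return selected
```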
Experiments: Frequent triples
• Input: 15.9 million ⟨subject, predicate, object⟩ triples extracted from the Reuters (RCV1) corpus
• We were able to associate 11.8 million of them with WordNet concepts. These are the basis of further processing.
• Frequent triple discovery:
– Found 40 million frequent triples (at various levels of generalization) in about 60 hours of CPU time
– Required 35 passes through the data (one for each depth-sum)
– At no pass was the number of candidates generated greater than the number of actually frequent triples by more than 60%
Experiments: Interesting triples
• We manually evaluated the interestingness of all the frequent triples that are specializations of ⟨person, inhabit, location⟩ (there were 1321 of them)
– On a scale of 1..5, we consider 4 and 5 as being interesting
– If, instead of looking at all these triples, we select a smaller group of them on the basis of our interestingness measures, does the percentage of triples scored 4 or 5 increase?
Conclusions and future work
• Frequent triples
– Our frequent triple algorithm successfully handles large amounts of data
– Its memory footprint only minimally exceeds the amount needed to store the actual frequent triples themselves
• Interesting triples
– Our measure of interestingness has some potential, but it remains to be seen what the right way to use it is
– Evaluation involving a larger set of triples is planned
• Ideas for future work: covering approaches
– Suppose we fix s and v, and look where the corresponding o’s (i.e. those for which ⟨s, v, o⟩ is frequent) fall in the WordNet hypernym tree
– We want to identify nodes whose subtrees cover a lot of these concepts but not too many other concepts (combined with an MDL criterion)
– Alternative: think of the input concept triples as positive examples, and generate random triples of concepts as negative examples. Use this as the basis for a coverage problem similar to those used in learning association rules.