ppt - osm.cs.byu.edu

by Warren Shen, Pedro DeRose, Robert
McCann, AnHai Doan, and Raghu
Ramakrishnan, SIGMOD'08, June 2008,
Vancouver, British Columbia, Canada,
pp. 1031-1042
Presented by Andrew Zitzelberger


Many solutions exist for extracting structured
data from raw data pages.
But …

Virtually all of these solutions focus on precise
Information Extraction programs that output exact
results



Generally cannot execute a partially specified
version of the program.
Generally take a long time (days or weeks)
before producing the first meaningful results
(partially due to the first limitation), which is not
acceptable for time-sensitive applications.
Writing precise IE programs can be a waste of
time in some instances.

Given 500 pages, find all houses that cost
more than $500,000 and whose high school is
Lincoln.


Case 1: Define price as a numeric value and run the
approximate program. 9 pages are returned
containing a number greater than 500,000 and the
word Lincoln. Search these 9 pages manually.
Case 2: Instead, 120 pages are returned. The program
is underspecified, so the Next-Effort assistant is
consulted, which asks whether price tags are always bolded.
After discovering that they are, the developer adds this to the
specification, and this time 35 pages are returned.



Allows the developer to quickly develop an
approximate extraction program (Alog)
The approximate program can then be run to
quickly retrieve approximate results (Compact
Tables)
To improve results the developer can enlist the
aid of the Next-Effort assistant

Xlog is a variant of datalog



Consists of a number of rules of the form p :- q1, …, qn,
where p and the qi are predicates; p is the
head and the qi’s form the body.
Xlog does not allow rules with negated
predicates or recursion.
Xlog can accommodate procedural steps of real
world IE using p-predicates and p-functions.

p-predicate



A p-predicate takes the form q(a1, . . . , an, b1, . . . ,
bm), where ai and bi are variables and q is associated
with some procedural code module.
The associated procedural code module takes as
input a tuple (u1, . . . , un), where ui is bound to ai, i ∈
[1, n], and produces as output a set of tuples (u1, . . .
, un, v1, . . . , vm).
p-function

p-function f(a1, . . . , an) takes as input a tuple (u1, . . .
, un) and returns a scalar value.
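The two definitions above can be sketched in Python. This is only an illustration of the input/output shapes: the names extract_schools and approx_match mirror the predicates used later in the slides, but the regular expression and the matching logic are assumptions, not the procedures from the paper.

```python
import re
from typing import Set, Tuple


def extract_schools(y: str) -> Set[Tuple[str, str]]:
    """p-predicate extractSchools(y, s): input tuple (y,), output tuples (y, s)."""
    # Hypothetical extraction logic: a capitalized word followed by "High".
    return {(y, m.group(1)) for m in re.finditer(r"\b([A-Z][a-z]+) High\b", y)}


def approx_match(h: str, s: str) -> bool:
    """p-function approxMatch(h, s): maps a tuple of values to one scalar."""
    # Hypothetical matching logic: case-insensitive equality after trimming.
    return h.strip().lower() == s.strip().lower()


page = "Top schools: Lincoln High and Cherry High."
school_tuples = extract_schools(page)
```

Each output tuple repeats the bound input (y) and extends it with the extracted value (s), which is the shape the p-predicate definition requires.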

Extract houses with a price above $500,000, more
than 4500 square feet, and with a top high school.

p-predicates
 extractHouses(x, p, a, h)
 extractSchools(y, s)

p-function
 approxMatch(h, s)

Query
 R1: houses(x,p,a,h) :- housePages(x), extractHouses(x,p,a,h)
 R2: schools(s) :- schoolPages(y), extractSchools(y,s)
 R3: Q(x,p,a,h) :- houses(x,p,a,h), schools(s), p>500000,
a>4500, approxMatch(h,s)
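As a rough illustration of what rule R3 computes, the sketch below evaluates the same join and selections over already-extracted relations. The sample rows and the body of approx_match are made up for the example; only the rule structure comes from the slide.

```python
from typing import List, Tuple

House = Tuple[str, int, int, str]  # (x, p, a, h)

# Toy stand-ins for the relations produced by R1 and R2; all values are made up.
houses: List[House] = [
    ("page1", 619000, 4700, "Lincoln"),
    ("page2", 351000, 2750, "Lincoln"),
    ("page3", 725000, 5100, "Basktall"),
]
schools: List[str] = ["Lincoln", "Vanhise"]


def approx_match(h: str, s: str) -> bool:
    # Hypothetical p-function body: case-insensitive equality.
    return h.lower() == s.lower()


def query(houses: List[House], schools: List[str]) -> List[House]:
    """R3: Q(x,p,a,h) :- houses(x,p,a,h), schools(s), p>500000, a>4500, approxMatch(h,s)."""
    return [
        (x, p, a, h)
        for (x, p, a, h) in houses
        for s in schools
        if p > 500000 and a > 4500 and approx_match(h, s)
    ]


result = query(houses, schools)
```

With this toy data only page1 satisfies all three conditions: page2 fails the price and area tests, and page3 has no matching school.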




To write an Xlog program the developer must
first decompose the program into smaller tasks.
Then p-predicates and p-functions are
designed to reflect the decomposition.
Next, procedural modules that implement the
p-predicates and p-functions must be designed and written (this takes
a lot of time, and the modules must be fairly complete before
testing can begin).
Finally, the modules must be linked in.


IE predicate – a p-predicate that extracts one
or more output spans from a single input
document or span.
The procedure writing stage of Xlog is replaced
by the ability to write description rules to do
“good enough.”


The developer can also attach procedural modules if
desired.
The developer also specifies the type of
approximation to use with annotations.


Written in the same form as traditional Xlog
rules except that the head of the rule must be
an IE predicate.
Can be used to define domain constraints in the
form of f(a) = v (example: numeric(a) = yes)



Values can be yes, distinct-yes, no, distinct-no, and
unknown
Can also describe text features such as bold-font, followed-by, underlined, hyperlinked, etc.
iFlex provides a rich set of built in features and
provides an interface for the user to add more.



Verify(s, f, v) checks whether f(s) = v.
Refine(s, f, v) returns all subspans t from s such
that f(t) = v
This implementation is done once and stored
so that all future Alog programs can make use
of it.
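A minimal sketch of the Verify/Refine pair for a single feature, assuming that sub-spans are runs of whitespace-separated tokens and that features live in a simple registry. The FEATURES dictionary and the numeric test are hypothetical and cover only the values yes and no.

```python
import re
from typing import Callable, Dict, List

# Hypothetical feature registry: feature name -> function from a span to a value.
FEATURES: Dict[str, Callable[[str], str]] = {
    "numeric": lambda s: "yes" if re.fullmatch(r"\d+(\.\d+)?", s) else "no",
}


def verify(s: str, f: str, v: str) -> bool:
    """Verify(s, f, v): check whether f(s) = v."""
    return FEATURES[f](s) == v


def refine(s: str, f: str, v: str) -> List[str]:
    """Refine(s, f, v): return all sub-spans t of s such that f(t) = v."""
    tokens = s.split()
    # Assumption: a sub-span is any contiguous run of tokens.
    sub_spans = [
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    ]
    return [t for t in sub_spans if verify(t, f, v)]


numbers = refine("Price 619000 Area 4700", "numeric", "yes")
```

Because the registry is separate from the rules, a new feature is added once and is then available to every later program, which matches the "implemented once and stored" point above.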


Description rules must be safe – meaning that
they don’t produce an infinite relation.
extractHouses(x, p, a, h) :- numeric(p)=yes,
numeric(a)=yes is not safe because it does not
specify where p, a, and h are extracted from.
iFlex provides a built-in predicate from(x, y) that
conceptually extracts all sub-spans y from document
x.
 This predicate can be used to easily make rules safe.

 extractHouses(x, p, a, h) :- from(x, p), from(x, a),
from(x, h), numeric(p)=yes, numeric(a)=yes
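The safe rule above can be read as a finite enumeration, sketched below. The helper from_spans stands in for from(x, y) (renamed because from is a Python keyword), and token-level sub-spans and the isdigit test are assumptions made for the example.

```python
from typing import List, Tuple


def from_spans(x: str) -> List[str]:
    """Stand-in for from(x, y): all contiguous token sub-spans y of document x."""
    tokens = x.split()
    return [
        " ".join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    ]


def numeric(span: str) -> str:
    # Hypothetical feature test: digits only.
    return "yes" if span.isdigit() else "no"


def extract_houses(x: str) -> List[Tuple[str, str, str, str]]:
    """extractHouses(x,p,a,h) :- from(x,p), from(x,a), from(x,h),
    numeric(p)=yes, numeric(a)=yes"""
    spans = from_spans(x)
    return [
        (x, p, a, h)
        for p in spans if numeric(p) == "yes"
        for a in spans if numeric(a) == "yes"
        for h in spans
    ]


doc = "Lincoln 619000 4700"
candidates = extract_houses(doc)
```

Every variable is now drawn from the finite set of sub-spans of x, so the result is finite (here 2 x 2 x 6 = 24 candidate tuples), although still far from precise.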

Existence Annotation



Indicates that a tuple in the relation may or may not
exist.
schools(s)? :- schoolPages(y), extractSchools(y,s)
Attribute Annotation


Indicates that an attribute takes a value from a given
set, but we do not know which value.
houses(x,<p>,<a>,<h>) :- housePages(x),
extractHouses(x,p,a,h)




Suppose we determine that school names are in
bold font.
It is not likely that every bold word in the
document is a school name.
Thus we can use the existence annotation to
specify that each tuple found may or may not
be in the actual relation.
Every tuple found is added to a relation and
the power set is returned to specify the set of
relations that are possibly correct.
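The power-set reading of the existence annotation can be sketched as follows. The bold words are invented sample data, and materializing every subset is only for illustration; it says nothing about how iFlex stores these relations.

```python
from itertools import combinations
from typing import List, Set, Tuple


def possible_relations(maybe_tuples: List[Tuple[str, ...]]) -> List[Set[Tuple[str, ...]]]:
    """Return every subset of the maybe tuples; each subset is one possible relation."""
    return [
        set(chosen)
        for size in range(len(maybe_tuples) + 1)
        for chosen in combinations(maybe_tuples, size)
    ]


# Hypothetical bold words found on a school page; only some are school names.
bold_words = [("Lincoln",), ("Sale",), ("Vanhise",)]
worlds = possible_relations(bold_words)
```

Three maybe tuples already give 2^3 = 8 possible relations, which is why a compact representation is needed instead of listing them.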



Suppose that each document x in housePages
describes exactly one house (that is, x is a key of
the relation).
Then we can specify that price, area, and high
school come from some matching values we
found on the page.
All possible relations are constructed for
houses where one value is selected for each
attribute.





Need a way to store the set of relations an Alog
program produces.
An a-table is a multiset of a-tuples.
An a-tuple is a tuple (V1,…,Vn), where each Vi is a
multiset of possible values.
An a-tuple may be annotated with a ‘?’, in which
case it is also called a maybe a-tuple.
An a-table represents the set of all possible
relations that can be constructed by:
(a) selecting a subset of the maybe a-tuples and all non-maybe a-tuples, then
 (b) selecting one possible value for each attribute in each
a-tuple in (a).
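Steps (a) and (b) can be sketched directly. The encoding of an a-tuple as a pair (cells, is_maybe) and the sample values are assumptions made for this example.

```python
from itertools import combinations, product
from typing import List, Tuple

# Hypothetical encoding: an a-tuple is (cells, is_maybe), where each cell
# lists the possible values of one attribute.
ATuple = Tuple[List[list], bool]


def possible_relations(a_table: List[ATuple]) -> List[List[tuple]]:
    required = [cells for cells, maybe in a_table if not maybe]
    optional = [cells for cells, maybe in a_table if maybe]
    relations = []
    # (a) keep all non-maybe a-tuples plus any subset of the maybe a-tuples
    for size in range(len(optional) + 1):
        for chosen in combinations(optional, size):
            kept = required + list(chosen)
            # (b) pick one value per attribute in each kept a-tuple
            choices_per_tuple = [list(product(*cells)) for cells in kept]
            for rows in product(*choices_per_tuple):
                relations.append(list(rows))
    return relations


a_table = [
    ([["page1"], [351000, 619000], ["Lincoln"]], False),
    ([["page2"], [725000], ["Vanhise", "Lincoln"]], True),
]
relations = possible_relations(a_table)
```

Here the non-maybe a-tuple contributes 2 value choices and the maybe a-tuple is either absent or contributes 2 more, so the a-table stands for 2 + 2 x 2 = 6 possible relations.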




A-tables are typically not succinct enough because
an Alog rule may produce a huge
number of extracted values.
iFlex employs compact tables, which exploit the
sequential nature of text to “pack” the set of values
in each cell into a much smaller set of so-called
assignments.
A compact table is a multiset of compact tuples. A
compact tuple is a tuple of cells (c1,…,cn) where
each cell ci is a multiset of assignments or an
expansion cell. A compact tuple may optionally be
designated as a maybe compact tuple, denoted
with a ‘?’.

exact


exact(s) – encodes a value that is exactly span s
contain

contain(s) – encodes all values that are sub-spans of
s on the page (example: contain(“Cherry”) includes
{“C”, “Ch”, …, “Cherry”})
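A small sketch of how the two assignments unpack into value sets. It follows the character-level example on the slide and compares values as plain strings; both choices are assumptions for illustration.

```python
from typing import Set


def exact(s: str) -> Set[str]:
    """exact(s): the single value that is exactly span s."""
    return {s}


def contain(s: str) -> Set[str]:
    """contain(s): every value that is a contiguous sub-span of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


values = contain("Cherry")
```

The compact table stores only the one assignment contain("Cherry"), while the unpacked set holds 20 distinct strings, which shows where the space saving comes from.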


Suppose a tuple t with cells (c1, …, ci, …, cn)
where ci = expand(v1, …, vk).
Then t can be expanded into a set of compact tuples
obtained by replacing cell ci with an
assignment exact(vj): (c1, …, exact(vj), …, cn),
where 1 ≤ j ≤ k.
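The expansion step can be sketched as below, assuming a cell is encoded as a (kind, payload) pair. That encoding and the sample tuple are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[str, object]  # hypothetical encoding: ("exact", v), ("contain", s), ("expand", [v1, ...])


def expand_tuple(cells: List[Cell], i: int) -> List[List[Cell]]:
    """Replace the expansion cell at position i by exact(vj), once per value vj."""
    kind, values = cells[i]
    if kind != "expand":
        raise ValueError("cell i is not an expansion cell")
    return [cells[:i] + [("exact", v)] + cells[i + 1:] for v in values]


t = [("exact", "page1"), ("expand", ["351000", "619000"]), ("contain", "Lincoln High")]
expanded = expand_tuple(t, 1)
```

One compact tuple with k values in its expansion cell becomes k compact tuples, and the other cells are carried over unchanged.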


Not a complete model for approximate data
(cannot do mutual exclusion)
Not closed under traditional relational
operators




Ensure superset semantics – the result is always a
superset of the actual results
Projection – ignore duplicate detection
Selection – if any of the possible tuples in a
compact tuple meets the selection condition, the
tuple is retained. If only some of the possible tuples
meet the condition, it becomes a maybe
compact tuple.
θ-join – evaluate θ condition on all compact
tuples in the Cartesian product using the
selection criteria.
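The selection rule can be sketched on a simplified compact tuple whose cells are already unpacked into sets of possible values. The (cells, is_maybe) encoding is an assumption, and the sketch leaves cell contents untouched, which keeps the result a superset but may differ from what iFlex does internally.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical encoding: (cells, is_maybe); each cell is a set of possible values.
CompactTuple = Tuple[Dict[str, Set[int]], bool]


def select(table: List[CompactTuple], attr: str,
           cond: Callable[[int], bool]) -> List[CompactTuple]:
    result = []
    for cells, maybe in table:
        hits = [cond(v) for v in cells[attr]]
        if not any(hits):
            continue  # no possible value satisfies the condition: drop the tuple
        # Some but not all values satisfy it: the tuple may or may not qualify.
        result.append((cells, maybe or not all(hits)))
    return result


table = [
    ({"p": {351000, 619000}}, False),
    ({"p": {725000}}, False),
    ({"p": {120000}}, False),
]
selected = select(table, "p", lambda p: p > 500000)
```

The first tuple survives as a maybe tuple because only one of its two possible prices qualifies, the second survives unchanged, and the third is dropped.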


Unfold all rules (unifying variables if
necessary) until only IE predicates remain that
are associated with procedures in the program.
Construct a logical plan fragment

Suggests ways to refine the current information
extraction program by asking the developer
questions.



Example: “is price in bold font?”
iFlex adds new constraints to the program
based on the developer’s responses.
If the number of tuples does not change for k
iterations, the assistant can notify the developer
that the results have converged.
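One way to read the convergence test is sketched below. The function name and the sample counts are hypothetical, and the sketch assumes "does not change for k iterations" means the last k + 1 recorded counts are equal.

```python
from typing import List


def has_converged(tuple_counts: List[int], k: int) -> bool:
    """True if the result size stayed the same across the last k iterations."""
    if k < 1 or len(tuple_counts) < k + 1:
        return False
    recent = tuple_counts[-(k + 1):]
    return len(set(recent)) == 1


# Hypothetical result sizes after each round of questions.
history = [120, 35, 9, 9, 9]
```

With this history, has_converged(history, 2) holds because the last three counts are equal, while has_converged(history, 3) does not because the window still includes 35.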

Sequential



Rank attributes in decreasing importance (using
various heuristics)
Always ask questions about the most important
attribute
Simulation

Ask questions whose answers will eliminate the
most possible answers
 The results of each stage of the execution plan are
stored, so that only the changes have to be rerun.
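The idea behind the simulation strategy can be sketched with a simple scoring rule: for each candidate question, simulate a yes and a no answer and keep the question whose worse outcome leaves the fewest candidate spans. This worst-case scoring is a stand-in chosen for the example, not the scoring used by iFlex, and the span records are invented.

```python
from typing import Dict, List

Span = Dict[str, object]  # hypothetical span record with boolean feature flags


def worst_case_remaining(spans: List[Span], feature: str) -> int:
    """Spans left after the less helpful of the two simulated answers."""
    yes = sum(1 for s in spans if s[feature])
    no = len(spans) - yes
    return max(yes, no)


def next_question(spans: List[Span], features: List[str]) -> str:
    """Pick the feature to ask about next (ties go to the earlier feature)."""
    return min(features, key=lambda f: worst_case_remaining(spans, f))


spans = [
    {"text": "619000", "numeric": True, "bold": True},
    {"text": "4700", "numeric": True, "bold": False},
    {"text": "Lincoln", "numeric": False, "bold": True},
    {"text": "2008", "numeric": True, "bold": False},
]
question = next_question(spans, ["numeric", "bold"])
```

Asking about bold leaves at most 2 of the 4 spans whichever way the developer answers, while asking about numeric could leave 3, so bold is asked first.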


Domains: Movies, DBLP, Books
Comparisons in performance are based on the
time it takes to write the program for extraction
(or do the extraction in the manual case).


Times are averaged over 1-3 volunteers for each task.
Time stops when correct result is obtained or the
program converges.
iFlex reduced time by 25-98% in all 27 scenarios
• iFlex converged correctly in 23 out of 27 of the scenarios (not shown due to
space limitations)
• The four remaining cases were 170%, 161%, 114%, and 102%.
• Two of those cases had a small number of tuples
Tasks took 104, 351, and 107 seconds to run; iFlex running time
is comparable to Perl extraction programs.




iFlex is a best-effort information extraction
program that can be used to quickly obtain
approximate results.
iFlex significantly reduces the developer time
in creating information extraction programs.
iFlex is efficient enough to run with
comparable speed to Perl.
Simulated question patterns from the Next-Effort
assistant outperform the sequential
pattern.
