PODS'06

Finding and Approximating Top-k
Answers in Keyword Proximity Search
Benny Kimelfeld and Yehoshua Sagiv
The Selim and Rachel Benin School of Engineering and Computer Science
‫האוניברסיטה העברית בירושלים‬
The Hebrew University of Jerusalem
CIKM 2005
Keyword Proximity Search (KPS)
A paradigm for data extraction
Data have varying degrees of structure
– Relational databases, XML, Web sites
Queries are sets of keywords
− No structural constraints
The Goal:
Extract meaningful parts of data w.r.t. the keywords
Querying Structure & Content by Keywords
Keywords appear in different parts of the data
Answers show occurrences of keywords, as well as
the associations among these occurrences
[Figure: a search for "Vardi Databases" and example data fragments (nodes article, title, author, journal, name) that connect occurrences of "Databases" and "Vardi"]
Proximity of the keywords in the answer indicates
a close (strong) semantic association among them
Past Work on KPS (Keyword Proximity Search)
• DataSpot (SIGMOD 1998)
• Information Units (WWW 2001)
• BANKS (ICDE 2002, VLDB 2005)
• DISCOVER (VLDB 2002)
• DBXplorer (ICDE 2002)
• XKeyword (ICDE 2003)
• …
The Goal of this Paper
Devise efficient algorithms for finding high-quality answers in keyword proximity search
Contents
Introduction
Formal Setting
The Main Results
Enumerating in the Exact Order
Enumerating in an Approximate Order
Conclusion and Future Work
Formal Setting
Data Graphs
• Structural and keyword nodes
• Edges may have weights
– Weak relationships are penalized by high weights
[Figure: example data graph with structural nodes (company, hq, president, supplies, supply, supplier, product, department, manager) and keyword nodes (Paris, Cohen, A4, papers, coffee, Summers)]
Queries
Queries are sets of keywords
from the data graph
Q = {Summers, Cohen, coffee}
[Figure: the example data graph, shown together with the query Q]
Query Answers
[Figure: an answer to Q within the example data graph]
Query Answers
An answer is a directed subtree of the data graph
• Contains all keywords of the query
• Has no redundant edges (and nodes)
[Figure: an answer in the example data graph, annotated "The root has two or more children" and "The keywords of the query are the leaves"]
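The definition above can be checked mechanically. The following is a minimal sketch, not code from the paper: it assumes a candidate answer is given as a list of (parent, child) edges, and the function name `is_answer` and the small example trees are hypothetical. It reads "no redundant edges" as the two annotations on the slide: every leaf is a keyword, and the root has at least two children.

```python
def is_answer(edges, keywords):
    """Check whether `edges` (a list of (parent, child) pairs) forms an answer.

    Assumed reading of the slide: a rooted directed tree whose leaves are
    exactly the query keywords and whose root has two or more children.
    """
    parent, children, nodes = {}, {}, set()
    for u, v in edges:
        if v in parent:          # a node with two parents is not in a tree
            return False
        parent[v] = u
        children.setdefault(u, []).append(v)
        nodes.update((u, v))
    roots = nodes - parent.keys()
    if len(roots) != 1:
        return False
    root = next(iter(roots))
    # Every node must be reachable from the root (rules out stray cycles).
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        seen.add(n)
        stack.extend(children.get(n, []))
    if seen != nodes:
        return False
    leaves = {n for n in nodes if n not in children}
    return leaves == set(keywords) and len(children[root]) >= 2


# Hypothetical fragments loosely modeled on the example graph.
good = [("company", "supplies"), ("supplies", "Cohen"),
        ("company", "department"), ("department", "Summers")]
redundant_root = [("top", "company")] + good
```

Here `good` is an answer for the keywords Cohen and Summers, while `redundant_root` is not, because its root has a single child and the edge below it is redundant.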
Ranking: Inversely Proportional to Weight
rank(A) = (weight(A))^{-1}
Smaller subtrees represent closer associations
[Figure: example answers for the keywords "Vardi" and "databases" in a DBLP-like data graph (nodes dblp, article, title, references, cite), with edge weights]
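As a small illustration of the ranking function, the sketch below computes the weight of an answer as the sum of its edge weights and the rank as the inverse of that weight. The representation (triples of parent, child, weight) and the two example trees with their weights are assumptions made for this example only.

```python
def weight(tree_edges):
    """Total weight of an answer given as (parent, child, weight) triples."""
    return sum(w for _, _, w in tree_edges)


def rank(tree_edges):
    """Rank is the inverse of the weight, so lighter answers rank higher."""
    return 1.0 / weight(tree_edges)


# Hypothetical answers: a single article versus two articles linked by a citation.
small = [("article", "title", 1), ("title", "databases", 1),
         ("article", "author", 1), ("author", "Vardi", 1)]
large = small + [("article", "cite", 1), ("cite", "article2", 3)]
```

With these made-up weights, `small` has weight 4 and `large` has weight 8, so `small` is ranked higher.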
Enumerating in Exact (Ranked) Order
[Figure: a sequence of enumerated answers. If one answer precedes another in the enumeration, then its weight is ≤ the weight of the other]
Top-k Answers
Enumerating in a C-Approximate Order
C may be a function of G and Q
[Figure: a sequence of enumerated answers. If one answer precedes another in the enumeration, then its weight is ≤ C times the weight of the other]
C-Approximation of the Top-k Answers
(Fagin et al., PODS’01)
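The condition for a C-approximate order can be tested on a list of weights. This is a minimal sketch under the reading that whenever one answer precedes another, its weight is at most C times the weight of the later one; the function name is hypothetical.

```python
def is_c_approximate_order(weights, c):
    """Return True if `weights` (in enumeration order) form a c-approximate order.

    For every pair i < j we need weights[i] <= c * weights[j]; it suffices to
    compare each weight with the heaviest weight seen before it.
    """
    heaviest_so_far = float("-inf")
    for w in weights:
        if heaviest_so_far > c * w:
            return False
        heaviest_so_far = max(heaviest_so_far, w)
    return True
```

With c = 1 the check reduces to the exact order: the sequence 3, 5, 4, 8 fails for c = 1 (5 precedes 4) but passes for c = 1.25.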
Polynomial Delay
Yardstick of efficiency:
Polynomial delay
Polynomial time between
generating successive answers
Exponentially many answers even for 2 keywords
(it is inefficient to generate all answers and then sort)
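To see why enumeration with bounded delay suits top-k retrieval, consider a lazy enumeration from which only the first k answers are pulled. The sketch below uses a stand-in generator (`ranked_answers`, hypothetical, yielding just weights) and records how many answers were actually produced.

```python
from itertools import islice


def ranked_answers(log):
    """Hypothetical stand-in for a ranked enumeration: yields weights 1, 2, 3, ..."""
    weight = 1
    while True:
        log.append(weight)   # record that this answer was actually generated
        yield weight
        weight += 1


def top_k(answers, k):
    """Take the first k answers; nothing beyond them is ever generated."""
    return list(islice(answers, k))


log = []
first_three = top_k(ranked_answers(log), 3)
```

Only three answers are generated here, even though the enumeration itself is unbounded, so the cost of the top-k is k delays rather than the cost of generating and sorting everything.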
The Main Results
Top Answers are Steiner Trees
• Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable
– Therefore, one cannot enumerate all answers
in ranked order with polynomial delay
• However, the top answer can be found efficiently
under data complexity
– That is, the number of keywords is fixed
• Approximations can be found efficiently under
query-and-data complexity
– There is a lot of work on Steiner-tree approximations
So What Can Be Done?
Can answers of KPS be enumerated
in the exact order with polynomial
delay, under data complexity?
Can approximations of Steiner trees
be used for efficiently enumerating in
an approximate order (while preserving
the approximation ratio)?
Our Results
Theorem 1:
Under data complexity, answers of KPS can
be enumerated in the exact order with
polynomial delay
Our Results (cont’d)
Theorem 2:
Under query-and-data complexity, given an
efficient C-approximation for finding Steiner
trees, one can enumerate with polynomial
delay in a (C+1)-approximate order
The Meaning of the Results
KPS is tractable under
data complexity
All results on Steiner trees
can be applied to KPS
Under query-and-data complexity, an efficient
enumeration in an approximate order can be done
with almost the same ratios as Steiner trees
From a theoretical point of view,
using heuristics is not the only option
Existing approaches to KPS are heuristics
– Exponential delay in the worst case
– No provable nontrivial approximation ratios
Enumerating in the Exact Order
Lawler’s Method
• We use the technique of Lawler (1972),
which is an iterative method for finding the
top-k answers
• Each iteration generates the next answer
by finding the top answer under constraints
• Lawler’s method is designed for general
(discrete) optimization problems
• When applying it to a specific problem, one
needs to deal with the following two issues
Two Problems to Solve
1. What exactly are the constraints?
(That is, how can we apply Lawler’s
method so that the constraints make it
possible to find top answers efficiently?)
2. How can we efficiently find the
top answer under constraints?
Solving the First Problem
Constraints are subtrees of the graph
• Pairwise node disjoint
• Their leaves are exactly the keywords of the query
An answer satisfies the constraints if it
contains all the subtrees (i.e., a supertree)
[Figure: constraint subtrees whose leaves are the keywords, and an answer (supertree) that contains them]
Two Problems to Solve (One Left)
1. What exactly are the constraints?
(That is, how can we apply Lawler in a
way that the constraints enable finding
the top answer efficiently?)
2. How can we efficiently find the
top answer under constraints?
Formulation of the Second Problem
Input: constraints
(node-disjoint subtrees, keywords as leaves)
Objective:
A minimal answer satisfying the constraints
(i.e., containing all the subtrees)
Next, an algorithm that solves “almost” this problem, namely:
(Almost the same) Objective:
A minimal supertree satisfying the constraints
Finding a Minimal Supertree
Input: G, T (constraints, i.e., subtrees)
1. Collapse each of the subtrees of T into a node
2. Find a Steiner tree T of the collapsed subtrees
3. Restore the collapsed subtrees in T
(more details in the proceedings…)
This is not Enough!
Input: constraints
(node-disjoint subtrees, keywords as leaves)
Objective:
A minimal answer satisfying the constraints
(i.e., containing all the subtrees)
Not the same!
(Almost the same) Objective:
A minimal supertree satisfying the constraints
Query Answers Revisited
An answer is a directed subtree of the data graph
• Contains all keywords of the query
• Has no redundant edges (and nodes)
[Figure: an answer in the example data graph, annotated "Keywords are the leaves" and "The root has two or more children"]
An Example
[Figure: example graph with nodes A, B, C, D]
An Example
This edge is redundant!
But, it cannot be removed since it is a constraint!
[Figure: two trees over the nodes A, B, C, D, labeled "The minimal supertree satisfying the constraints" and "The minimal answer satisfying the constraints"]
The minimal answer can be completely different from the minimal supertree
Furthermore, there can be no answer even if there is a supertree
What if We Remove Edges of Constraints?
• What if we first generate a minimal supertree and, as long as
the root has only one child, simply remove the root
(until an answer is obtained)?
• The constraints are violated, leading to a
failure of Lawler’s method!
• That is,
– Some answers will be duplicated
– While other answers will not be generated at all
Our Approach
[Figure: Constraints → (Transform) → New constraints → (Min. Supertree) → Answer, drawn over a graph with nodes A to H]
The root of this subtree has more than one
child and it must be the root of the answer
This Process is Repeated
[Figure: the constraints are transformed in several different ways, and a minimal supertree is computed for each transformation (labels: Constraints, Min. Supertree)]
Up to 2^(#keywords) times
(fixed & usually fewer)
The best answer is the final answer
About the Transformation
• The details of the exact transformation and
the proof of correctness are intricate
• All can be found in the proceedings…
This concludes the algorithm for
enumerating in the exact order
A Different View: Chain of Reductions
Enumerating answers in ranked order
  ↓ (Adapting Lawler’s method)
Finding the top answer under constraints
  ↓ (Transformation of constraints)
Finding minimal supertrees
  ↓ (Collapse and restore)
Finding Steiner trees
Enumerating in an Approximate Order
Modifying the Chain of Reductions
Enumeration in an approximate order
  ↓ (Similar)
Finding approximate answers under constraints
  ↓ (Completely different!)
Finding approximations of minimal supertrees
  ↓ (Similar)
Finding approximations of Steiner trees
Exact Order Revisited
[Figure: the repeated transformation process of the exact-order algorithm (labels: Constraints, Min. Supertree)]
Up to 2^(#keywords) times
We cannot allow it under query-and-data complexity!
The Algorithm
[Figure: two subtrees computed from the constraints, over a graph with nodes A to H]
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum
Combine the Subtrees
The combined subgraph contains an answer: ≤ (C+1) times the optimum
• A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum
• A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum
[Figure: the two subtrees and their union, over a graph with nodes A to H]
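The (C+1) bound follows by adding the two bounds annotated on the slide. In the sketch below, the symbols are notation introduced here rather than taken from the slides: OPT is the weight of a minimal answer satisfying the constraints, T_1 is the C-approximate supertree, T_2 is the minimal answer for 3 or fewer constraints, w is total edge weight, and A is an answer contained in the combined subgraph.

```latex
\begin{align*}
w(T_1) &\le C \cdot \mathrm{OPT}, \\
w(T_2) &\le 1 \cdot \mathrm{OPT}, \\
w(A) \;\le\; w(T_1 \cup T_2) \;\le\; w(T_1) + w(T_2) &\le (C+1) \cdot \mathrm{OPT}.
\end{align*}
```

The first inequality in the last line holds because A uses only edges of the combined subgraph, and the second because the union cannot weigh more than the two parts together.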
Conclusion and Future Work
Keyword Proximity Search
• A common paradigm for keyword search over
structured databases
• In the formal model:
– Data are directed and weighted graphs
– Queries are sets of keywords (i.e., nodes) from
the data graph
– Query answers are non-redundant subtrees
containing the keywords of the query
• The goal is to find the top-k answers, where the
rank is inversely proportional to the weight
• A stronger goal: enumeration with polynomial delay
Our Results
• Under data complexity, answers can be
enumerated in the exact ranked order with
polynomial delay
• Under query-and-data complexity, every efficient
C-approximation to the Steiner-tree problem yields
an algorithm for enumerating answers with
polynomial delay in a (C+1)-approximate order
Our Chain of Reductions
Enumerating answers in sorted order
  ↓ (Lawler’s approach)
Finding the top answer under constraints
  ↓ (The intricate part …)
Finding minimal supertrees
  ↓ (Subtree Collapse/Restore)
Finding Steiner trees
Other Variants of KPS
Our algorithms can be adapted
to other popular variants of KPS
Undirected Variant
Answers are undirected trees
[Figure: an undirected answer in the example data graph]
Strong Variant
Answers are undirected trees
and keywords are leaves
[Figure: an undirected answer in the example data graph in which the keywords are leaves]
Open Problems
• Can we improve the space efficiency of our
algorithms?
• Some ranking functions (e.g., height) are
easier than weight when looking for the top
answer (no constraints), but
– The chain of reductions doesn’t work
– The complexity of finding the top answer under
constraints is unknown
• Can our results hold for richer queries that
also have structural constraints?
Implementation Considerations
• Bottlenecks: Steiner-tree algorithms and
approximations
• Thin graphs allow in-memory execution of
our algorithms, even for large XML
documents (e.g., DBLP)
• New and intuitive ranking functions that are
easier to implement efficiently
Related Work: Order vs. Efficiency
[Figure: orderings arranged along an axis from "More Desirable" to "More Efficient"]
• Exact Order (queries have a fixed size): this work
• Approximate Order: this work
• Heuristic Order (no approx. guaranteed): past work
• No Order: past work
Thank you.
Questions?
Illustration of Lawler’s Method
Lawler’s Method (1972)
1. Find the Top Answer
In principle, at this point we should
find the second-best answer
But instead…
2. Partition the Remaining Answers
Each partition is defined by
a distinct set of constraints
3. Find the Top of each Set
4. Find the Second Answer
The second answer is the best among
all the top answers in the partitions
5. Further Divide the Chosen Partition
And so on…
Adapting Lawler’s Method
Our Constraints
Inclusion constraints
• Node-disjoint subtrees of the data graph
• All the leaves are keywords
• An answer must contain all the subtrees
Exclusion constraints
• Edges of the data graph
• An answer must not contain any of the edges
[Figure: example constraints over the keywords A, B, C, D]
Partitioning a Partition (cont)
edges(A) \ I = {e_1, …, e_k}

Partition    Inclusion                  Exclusion      Top answer
(current)    I                          E              A
0            I                          E ∪ {e_1}      A_0
1            I ∪ {e_1}                  E ∪ {e_2}      A_1
2            I ∪ {e_1, e_2}             E ∪ {e_3}      A_2
3            I ∪ {e_1, e_2, e_3}        E ∪ {e_4}      A_3
…
k-1          I ∪ {e_1, …, e_{k-1}}      E ∪ {e_k}      A_{k-1}
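The partitioning scheme can be tried on a toy problem. The sketch below is not the paper's algorithm: instead of trees in a data graph, the "answers" are the m-element subsets of a weighted item set, for which the top answer under inclusion and exclusion constraints is trivial to find greedily. All names (`best_under`, `lawler`, the item weights) are hypothetical; only the way a partition (I, E) is split into sub-partitions (I ∪ {e_1, …, e_{i-1}}, E ∪ {e_i}) mirrors the slides.

```python
import heapq
from itertools import count


def best_under(weights, m, include, exclude):
    """Cheapest m-subset containing `include` and avoiding `exclude`, or None."""
    free = sorted((w, x) for x, w in weights.items()
                  if x not in include and x not in exclude)
    need = m - len(include)
    if need < 0 or need > len(free):
        return None
    return frozenset(include) | {x for _, x in free[:need]}


def lawler(weights, m):
    """Yield (cost, subset) for all m-subsets in nondecreasing order of cost."""
    def cost(s):
        return sum(weights[x] for x in s)

    tie = count()  # tie-breaker so the heap never compares sets
    first = best_under(weights, m, frozenset(), frozenset())
    if first is None:
        return
    heap = [(cost(first), next(tie), first, frozenset(), frozenset())]
    while heap:
        c, _, answer, inc, exc = heapq.heappop(heap)
        yield c, answer
        # Split the rest of this partition: sub-partition i keeps e_1..e_{i-1}
        # and forbids e_i, so the sub-partitions are disjoint and omit `answer`.
        grown = set(inc)
        for e in sorted(answer - inc):
            sub_exc = exc | {e}
            cand = best_under(weights, m, frozenset(grown), sub_exc)
            if cand is not None:
                heapq.heappush(heap, (cost(cand), next(tie), cand,
                                      frozenset(grown), sub_exc))
            grown.add(e)


results = list(lawler({"a": 1, "b": 2, "c": 3, "d": 4}, 2))
```

Each of the six 2-element subsets is produced exactly once, in nondecreasing order of cost. In the setting of the slides, the role of `best_under` is played by the much harder step of finding the top answer under constraints.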
Generating Constraints (intuition)
Constraints (subtrees/edges) are obtained from existing
constraints of the current partition and the top answer
[Figure: repeated copies of a tree over the nodes A, B, C, D, E, illustrating the generated constraints]
Collapsing Subtrees
Collapsing a Subtree
1. Remove All Edges and Internal Nodes
Only the root is left
2. Remove Incoming Edges of Internal Nodes
3. Add Outgoing Edges to the Root
An edge that emanates from an internal node
becomes an outgoing edge of the root
More Details
• When adding an outgoing edge (r,u) to the root,
the weight of (r,u) is the minimal weight among all
the edges from the collapsed subtree to u
• When restoring a subtree, each outgoing edge
(r,u) of the root is replaced with an (arbitrary)
original edge from the restored subtree to u, with
the same weight
• Incoming edges of internal nodes of the subtree
are never restored
– Such edges cannot participate in G-supertrees
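The collapse step described above can be sketched as follows. This is a simplified illustration, not the construction from the proceedings: it assumes the graph is a dictionary from directed edges (u, v) to weights, and it treats every non-root node of the subtree as collapsed. The function name and the example graph are hypothetical. The returned `origin` map records which original edge each surviving edge stands for, which is what a restore step would need.

```python
def collapse_subtree(edges, root, subtree_nodes):
    """Collapse a subtree into its root.

    edges: dict mapping (u, v) -> weight; subtree_nodes includes `root`.
    Returns (collapsed_edges, origin), where origin maps each surviving edge
    to the original edge it represents.
    """
    inner = set(subtree_nodes) - {root}
    collapsed, origin = {}, {}
    for (u, v), w in edges.items():
        if v in inner:
            continue  # subtree edges and incoming edges of collapsed nodes are dropped
        if u in inner and v == root:
            continue  # would become a self-loop on the root
        key = (root if u in inner else u, v)
        # Keep the minimal weight among all edges from the subtree to v.
        if key not in collapsed or w < collapsed[key]:
            collapsed[key] = w
            origin[key] = (u, v)
    return collapsed, origin


# Hypothetical graph: subtree r -> a -> k, plus edges to and from the outside.
graph = {("r", "a"): 1, ("a", "k"): 1, ("a", "u"): 5, ("r", "u"): 7,
         ("x", "a"): 2, ("x", "r"): 3, ("u", "v"): 1}
collapsed, origin = collapse_subtree(graph, "r", {"r", "a", "k"})
```

In this example the edge (r, u) ends up with weight 5, the smaller of the two edges leaving the subtree toward u, and `origin` remembers that it came from (a, u).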