Integration of Database and Information Retrieval technologies for

XSEarch
XML Search Engine
Jonathan MAMOU
October 2002
Motivation
XML


Getting popular
Allows meta-data to be embedded into
documents



Data-centric view : exchange format for
structured data – meta data
Document-centric view : Content – text,
meta data
Querying data and meta-data
Buy our Classic Children’s books.
One Fish Two Fish by John Meyer & Peter Smith
Costs Only: $7.95
Goodnight Moon by Margaret Brown
Costs Only: $10.55
Brown Bear by Bill Martin Jr.
Costs Only: $6.00
amazing.com
<bookinfo>
<book><title>One Fish Two Fish</title>
<author>John Meyer</author>
<author>Peter Smith</author>
<price>7.95</price></book>
<book><title>Goodnight Moon</title>
<author>Margaret Brown</author>
<price>10.55</price></book> ....
</bookinfo>
A query

Find titles and prices of books by
‘Meyer’ or ‘Smith’
IR Approach

How to deal with tags?

Discard all tags




Simplicity
Loss of information (structure) → lower retrieval
performance
Keep tags as keywords
How to write the query?

“Title price book author Meyer Smith”
IR Approach (cont’d)



Can’t specify that Meyer and Smith are
the authors
Can’t specify that title, price and author
belong to the same book
Can’t specify desired output (i.e., titles,
price)
Database approach
FOR $b IN document("bib.xml")//book
WHERE $b/author contains 'Meyer' OR $b/author contains 'Smith'
RETURN
<result>
<title> $b/title </title>
<price> $b/price </price>
</result>
• Difficult for a naive user
• Requires knowledge of document structure
• Dependent on document structure
Our Goal






Combine IR and database techniques :
tags + text
Simple language
Logical Structure, not physical
Require knowledge of tag names, not
structure
Queries should work even if structure
changes
Rank results
Framework
Tree Representation
[Figure: tree of a bookinfo document with two book nodes. The first book has title "Just Lost", authors "Mercy Meyer" and "Gina Meyer", and price $5.75. The second book has title "Brown Bear" and price $13.95.]
We need to find tuples of related title and
price nodes.
Another Tree Representation
[Figure: tree of a bookinfo document in which book nodes are nested under author nodes. One author (name "Dr. Meyer") has two books: "One Fish Two Fish" ($12.50) and "Cat in the Hat" ($14.95). Another author (name "M. Brown") has the book "Goodnight Moon".]
Similar document, but with different
hierarchical structure from the previous.
We need to find tuples of related title, author
and price nodes.
Interconnection
[Figure: the bookinfo tree (book nodes with title, name and price children) with two nodes circled; a callout marks the lowest common ancestor of the circled nodes.]
Consider a title and price node
Intuition: The nodes belong to different book
entities
Interconnection (cont’d)
[Figure: the same bookinfo tree with a different pair of nodes circled; a callout marks the lowest common ancestor of the circled nodes.]
Intuition: The nodes belong to same book entity
Interconnection (cont’d)
[Figure: the same bookinfo tree with another pair of nodes highlighted.]
Intuition: The nodes belong to same book entity
Relationship tree




Let n1, n2 be two nodes
Let n be their lowest common ancestor
Let Tn be the subtree rooted at n
The relationship tree of n1, n2 is the tree
obtained by pruning from Tn every node other
than n1, n2 that is not an ancestor of n1 or n2
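The definition above can be sketched in a few lines of Python. This is only an illustration under assumptions not in the slides: the tree is stored as a parent array (`PARENT[i]` is the parent of node `i`, `None` for the root), node numbers are arbitrary, and the names `path_to_root` and `relationship_tree` are hypothetical. After pruning, what remains is exactly the two upward paths from n1 and n2 to their lowest common ancestor.

```python
def path_to_root(parent, n):
    """Return n, its parent, its grandparent, ... up to the root."""
    path = []
    while n is not None:
        path.append(n)
        n = parent[n]
    return path


def relationship_tree(parent, n1, n2):
    """Return (lca, nodes): the lowest common ancestor of n1 and n2 and
    the set of nodes that survive the pruning described above."""
    path1 = path_to_root(parent, n1)
    path2 = path_to_root(parent, n2)
    above_n2 = set(path2)
    # First node on n1's upward path that also lies on n2's upward path.
    lca = next(n for n in path1 if n in above_n2)
    nodes = set(path1[: path1.index(lca) + 1])
    nodes |= set(path2[: path2.index(lca) + 1])
    return lca, nodes


# Hypothetical sample tree, loosely following the bookinfo example:
# 0 bookinfo
#   1 book: 2 title, 3 author, 4 author, 5 price
#   6 book: 7 title, 8 price
PARENT = [None, 0, 1, 1, 1, 1, 0, 6, 6]

same_book = relationship_tree(PARENT, 2, 5)    # title and price of book 1
other_book = relationship_tree(PARENT, 2, 8)   # title of book 1, price of book 6
```

In the first call the lowest common ancestor is the book node 1; in the second it is the root, and both book nodes end up in the relationship tree.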
Interconnection

We say that n1, n2 are interconnected
if
the relationship tree does not contain two
distinct nodes with the same label
or
the relationship tree contains exactly one
pair of distinct nodes with the same label
and this pair consists of n1 and n2
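A direct, unoptimized check of this definition might look as follows. It is a sketch, not the system's implementation: the parent-array and label-array representation, the sample tree, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def relationship_nodes(parent, n1, n2):
    """Nodes on the paths from n1 and n2 up to their lowest common ancestor."""
    def upward(n):
        path = []
        while n is not None:
            path.append(n)
            n = parent[n]
        return path

    path1, path2 = upward(n1), upward(n2)
    above_n2 = set(path2)
    lca = next(n for n in path1 if n in above_n2)
    return set(path1[: path1.index(lca) + 1]) | set(path2[: path2.index(lca) + 1])


def interconnected(parent, label, n1, n2):
    """Apply the two conditions of the definition to the relationship tree."""
    counts = Counter(label[n] for n in relationship_nodes(parent, n1, n2))
    repeated = [tag for tag, c in counts.items() if c > 1]
    if not repeated:
        return True  # no two distinct nodes share a label
    # Otherwise the only repeated label must occur exactly twice,
    # and the two nodes carrying it must be n1 and n2 themselves.
    return (n1 != n2 and len(repeated) == 1 and counts[repeated[0]] == 2
            and label[n1] == label[n2] == repeated[0])


# Hypothetical sample tree (same shape as the bookinfo example).
PARENT = [None, 0, 1, 1, 1, 1, 0, 6, 6]
LABEL = ["bookinfo", "book", "title", "author", "author",
         "price", "book", "title", "price"]
```

With this sample tree, the title and price of the same book (nodes 2 and 5) are interconnected, the two authors of one book (nodes 3 and 4) are interconnected by the second condition, and a title and price from different books (nodes 2 and 8) are not, because the relationship tree contains two book nodes.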

All-Pairs Interconnection

A set of nodes is all-pairs
interconnected if every pair of nodes
in the set is interconnected
Star interconnection
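[Figure: the bookinfo tree in which each author node has a name child. The first book has title "Just Lost", price $5.75, and two authors with names "Mercy Meyer" and "Gina Meyer". The second book has title "Brown Bear" and price $13.95.]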
The 2 names are not interconnected
Star Interconnection (cont’d)

A set of nodes is star interconnected if
all the nodes in the set are
interconnected to the same node
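The two set-level notions can be compared side by side with a small sketch. It takes the pairwise test as a parameter, so it does not depend on how interconnection is computed. Two assumptions are made here that the slides do not spell out: the common node of a star (the "center") is taken to be a member of the set, and an empty set counts as interconnected. The function names are hypothetical.

```python
from itertools import combinations


def all_pairs_interconnected(nodes, interconnected):
    """True if every pair of distinct nodes in the set is interconnected."""
    return all(interconnected(a, b) for a, b in combinations(list(nodes), 2))


def star_interconnected(nodes, interconnected):
    """True if some node of the set (assumed center) is interconnected
    with every other node of the set."""
    nodes = list(nodes)
    if not nodes:
        return True  # assumption: the empty set is trivially a star
    return any(
        all(center == other or interconnected(center, other) for other in nodes)
        for center in nodes
    )


# Toy symmetric relation for illustration: 1-2 and 1-3 are related, 2-3 is not.
RELATED = {frozenset((1, 2)), frozenset((1, 3))}


def toy_relation(a, b):
    return frozenset((a, b)) in RELATED


is_star = star_interconnected([1, 2, 3], toy_relation)
is_all_pairs = all_pairs_interconnected([1, 2, 3], toy_relation)
```

In the toy relation, the set {1, 2, 3} is a star around node 1 but is not all-pairs interconnected, which shows that the star condition is the weaker of the two.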
Search terms, Search query

Search Term (l,k)
l: label (context)
k: keyword
Search Query AND:L1 OR:L2
L1, L2: lists of search terms
Example:
AND:(title,)(price,)
OR:(author,Meyer)(author,Smith)
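One possible in-memory form of search terms is sketched below. The class `SearchTerm`, the helpers `parse_terms` and `matches`, and the matching rule (exact label, case-insensitive whole-word keyword) are assumptions for illustration; the slides only fix the (label, keyword) shape and show that the keyword may be left empty.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # l: tag name giving the context, or None
    keyword: Optional[str]  # k: word to look for, or None


def parse_terms(text: str) -> List[SearchTerm]:
    """Parse a list written as '(label,keyword)(label,keyword)...'."""
    pairs = re.findall(r"\(([^,()]*),([^()]*)\)", text)
    return [SearchTerm(l.strip() or None, k.strip() or None) for l, k in pairs]


def matches(term: SearchTerm, node_label: str, node_text: str) -> bool:
    """Assumed rule: the label must be equal, the keyword must be a word of the text."""
    label_ok = term.label is None or term.label == node_label
    keyword_ok = (term.keyword is None
                  or term.keyword.lower() in node_text.lower().split())
    return label_ok and keyword_ok


and_terms = parse_terms("(title,)(price,)")
or_terms = parse_terms("(author,Meyer)(author,Smith)")
```

Here `and_terms` holds two label-only terms and `or_terms` holds two label-plus-keyword terms, mirroring the AND and OR parts of the example query.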
Answer

AND:N1 OR:N2





N1, N2 are lists of nodes
The nodes of N1, N2 match the search terms of L1, L2
The nodes of N1 and N2 are interconnected (all-pairs or star)
Every all-pairs answer is also a star answer
Maximal answer
Example
[Figure: the search terms title, author and price, plus a null entry, drawn next to the bookinfo tree (first book: title "Just Lost", authors "Mercy Meyer" and "Gina Meyer", price $5.75; second book: title "Brown Bear", price $13.95).]
(title,) (price,) (author,Meyer)
Find matchings of title, author and
price to the nodes in the tree
Computing answers

All-pairs



Determining whether the set of answers is
empty is NP-complete
If L1 is empty, computing the set of
answers is polynomial in the size of input
and output
Star

computing the set of answers is polynomial
in the size of input and output
Ranking results

Unstructured




Keyword weight (tfilf)
Tags weight
Result size
Structured


Nodes distance
Ancestor-descendant
Keyword Weight


Compute the weight of a keyword k
within a given node n
Variation of tf-idf, one of the metrics
of the Vector Space Model (the classical model
in IR)
Keyword Weight (cont’d)
Term Frequency (tf): number of
appearances of k within n, relative to the most frequent keyword k' in n
tf(k,n) = occ(k,n) / max_k' occ(k',n)
Inverse Leaf Frequency (ilf): inverse
frequency of k among all the leaves in the
corpus
ilf(k) = log(1 + N/Nk), where N is the number of leaves and Nk the number of leaves containing k
W(k,n) = tf(k,n) * ilf(k)
Normalized per leaf
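As a worked illustration of the formulas, the sketch below computes W(k,n) for every keyword of every leaf. It assumes that leaves are given as lists of already tokenized words, that N counts the leaves and Nk the leaves containing k, and that the logarithm is natural; the final per-leaf normalization is left out because the slides do not specify it. The name `tf_ilf` is hypothetical.

```python
import math
from collections import Counter


def tf_ilf(leaves):
    """leaves: dict mapping a leaf id to its list of words.
    Returns a dict mapping (leaf id, keyword) to tf(k,n) * ilf(k)."""
    n_leaves = len(leaves)
    # Nk: number of leaves in which keyword k occurs at least once.
    leaves_with = Counter()
    for words in leaves.values():
        leaves_with.update(set(words))

    weights = {}
    for leaf_id, words in leaves.items():
        occ = Counter(words)
        if not occ:
            continue  # empty leaf: nothing to weigh
        max_occ = max(occ.values())
        for k, count in occ.items():
            tf = count / max_occ
            ilf = math.log(1 + n_leaves / leaves_with[k])
            weights[(leaf_id, k)] = tf * ilf
    return weights


# Hypothetical two-leaf corpus.
weights = tf_ilf({
    "t1": ["just", "lost"],
    "t2": ["brown", "bear", "brown"],
})
```

In this corpus "brown" is the most frequent word of leaf t2, so its tf is 1, while "bear" gets tf 0.5; both occur in one of the two leaves, so they share the same ilf of log(1 + 2/1).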

Tag Weight

Give weight to tags according to their
importance

E.g. give more weight to <title> than to
<abstract>
Result Size

Number of search terms appearing in
the result (OR part)
Ranking-Structured

Nodes distance


size of the relationship tree
Ancestor-descendant relationship

“more” interconnected
System overview
XSEarch overview
[Figure: XSEarch architecture. Offline: the Indexer processes the XML corpus with logical hierarchy. Online: a query is passed to Search, which returns Results.]
Document Location array



Generate a unique id, did
Associate each did with the physical
location of the corresponding document
Logical structure of the corpus
Node Encoding Array


Generate an id, nid, for each interior node
Node encoding

Defined recursively




Node encoding of its parent
Index of the node among its siblings
Eg: 13.8.1.9
Associate each nid with its node
encoding
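The recursive encoding can be sketched as follows. The representation is an assumption: children are given as ordered lists, sibling indexes start at 1, and the root receives the encoding `1`; the slides do not say what the first component of an encoding such as 13.8.1.9 stands for. One useful consequence of the scheme is that ancestor tests reduce to prefix tests.

```python
def encode(children, root):
    """children: dict mapping a node to the ordered list of its children.
    Returns a dict mapping each node to its encoding as a tuple of ints."""
    encoding = {root: (1,)}  # assumption: the root is encoded as 1
    stack = [root]
    while stack:
        node = stack.pop()
        for index, child in enumerate(children.get(node, []), start=1):
            # Parent's encoding followed by the index among the siblings.
            encoding[child] = encoding[node] + (index,)
            stack.append(child)
    return encoding


def is_ancestor(enc_a, enc_b):
    """True if the node encoded enc_a is a proper ancestor of the one encoded enc_b."""
    return len(enc_a) < len(enc_b) and enc_b[: len(enc_a)] == enc_a


def to_string(enc):
    return ".".join(str(part) for part in enc)


# Hypothetical sample tree: 0 is the root with children 1 and 6, and so on.
CHILDREN = {0: [1, 6], 1: [2, 3, 4, 5], 6: [7, 8]}
ENC = encode(CHILDREN, 0)
price_of_first_book = to_string(ENC[5])
```

In the sample tree, node 5 is the fourth child of the first child of the root, so its encoding is 1.1.4.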
Node Label Array

Associate each nid with its label
Inverted Tag Index

For each tag, keep
posting list: list of nodes labeled with this
tag
weight
tag → Nid1, Nid2, Nid3
Inverted Keyword Index

For each kw, keep
posting list: list of leaves containing this
keyword
weight of the kw within the leaf (tfilf)
kw → (Nid1,w1), (Nid2,w2), (Nid3,w3)
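Both inverted indexes can be pictured as dictionaries of posting lists. The sketch below is an illustration only: the function names are hypothetical, node ids are plain integers, and the keyword weight is supplied by the caller as a function `weight(kw, nid)` so that the example does not commit to a particular weighting formula.

```python
from collections import defaultdict


def build_tag_index(labels):
    """labels[nid] is the tag of node nid. Returns tag -> list of nids."""
    index = defaultdict(list)
    for nid, tag in enumerate(labels):
        index[tag].append(nid)
    return dict(index)


def build_keyword_index(leaf_words, weight):
    """leaf_words: dict nid -> list of words in that leaf.
    Returns keyword -> list of (nid, weight of the keyword in that leaf)."""
    index = defaultdict(list)
    for nid in sorted(leaf_words):
        for kw in sorted(set(leaf_words[nid])):
            index[kw].append((nid, weight(kw, nid)))
    return dict(index)


# Hypothetical sample data.
LABELS = ["bookinfo", "book", "title", "author", "author",
          "price", "book", "title", "price"]
LEAF_WORDS = {2: ["just", "lost"], 3: ["mercy", "meyer"],
              4: ["gina", "meyer"], 7: ["brown", "bear"]}

tag_index = build_tag_index(LABELS)
# Constant weight used only to keep the example short.
keyword_index = build_keyword_index(LEAF_WORDS, lambda kw, nid: 1.0)
```

With this data, the posting list of the tag "book" contains nodes 1 and 6, and the posting list of the keyword "meyer" contains leaves 3 and 4.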
Node Interconnection Matrix

element ij contains:




1, if ni and nj are interconnected
0, else
n*n symmetric sparse matrix
Dynamic programming
Alternative

Hash set : keep only interconnected
nodes

Key: pair (ni, nj)
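A minimal sketch of the hash-set alternative: only interconnected pairs are stored, and because the relation is symmetric each pair is stored once under an ordered key. The class name `InterconnectionSet` and its methods are hypothetical.

```python
class InterconnectionSet:
    """Stores only the interconnected pairs, instead of a full n*n matrix."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(ni, nj):
        # The relation is symmetric, so (ni, nj) and (nj, ni) share one key.
        return (ni, nj) if ni <= nj else (nj, ni)

    def add(self, ni, nj):
        self._pairs.add(self._key(ni, nj))

    def connected(self, ni, nj):
        return self._key(ni, nj) in self._pairs

    def __len__(self):
        return len(self._pairs)


pairs = InterconnectionSet()
pairs.add(5, 2)
pairs.add(2, 5)  # same pair, stored once
```

Memory use grows with the number of interconnected pairs rather than with n*n, which is the point of the alternative when the matrix is sparse.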
Interconnection




Let n be the number of nodes
It is possible to determine whether n1
and n2 are interconnected in O(n) time
It is possible to determine
interconnection of all pairs in O(n^2) time
Offline/Online computation
Interconnection

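// connected(i,j): true if nodes i and j are interconnected (symmetric)
// xFather: parent of x; iChild: child of i on the path down to j
for (i=size-1; i>=0; i--)
  for (j=i+1; j<size; j++)
    if i ancestor of j
      connected(i,j) = (i == jFather) OR
        (connected(iChild,j) AND connected(i,jFather) AND
         labelIChild != labelJ AND labelI != labelJFather)
  for (j=i+1; j<size; j++)
    if i not ancestor of j
      connected(i,j) = connected(i,jFather) AND connected(iFather,j) AND
        labelI != labelJFather AND labelIFather != labelJ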
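The same recurrence can be written as a runnable sketch. It is not the system's code: it uses memoized recursion rather than a fixed loop order, so no particular node numbering is assumed, and it adds two base cases that are assumptions here (a node is treated as connected to itself, and a parent is always connected to its child). The tree is given as hypothetical parent and label arrays.

```python
from functools import lru_cache


def make_connected(parent, label):
    """Return a memoized connected(i, j) predicate for one tree."""

    def proper_ancestors(n):
        found = set()
        n = parent[n]
        while n is not None:
            found.add(n)
            n = parent[n]
        return found

    anc = [proper_ancestors(n) for n in range(len(parent))]

    def child_towards(i, j):
        # Child of i on the path down to its descendant j.
        while parent[j] != i:
            j = parent[j]
        return j

    @lru_cache(maxsize=None)
    def connected(i, j):
        if i == j:
            return True                      # assumed base case
        if j in anc[i]:
            return connected(j, i)           # make i the upper node
        j_par = parent[j]
        if i in anc[j]:                      # ancestor-descendant pair
            if j_par == i:
                return True                  # assumed base case: parent and child
            i_child = child_towards(i, j)
            return (label[i_child] != label[j] and label[i] != label[j_par]
                    and connected(i_child, j) and connected(i, j_par))
        i_par = parent[i]                    # neither node is above the other
        return (label[i] != label[j_par] and label[i_par] != label[j]
                and connected(i, j_par) and connected(i_par, j))

    return connected


# Hypothetical sample tree (same shape as the bookinfo example).
PARENT = [None, 0, 1, 1, 1, 1, 0, 6, 6]
LABEL = ["bookinfo", "book", "title", "author", "author",
         "price", "book", "title", "price"]
connected = make_connected(PARENT, LABEL)
```

Each recursive call works on a strictly shorter path between the two nodes, so the recursion terminates, and the cache keeps the total work proportional to the number of pairs visited. For very deep trees an iterative formulation would avoid Python's recursion limit.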
Demo