XSEarch
XML Search Engine
Jonathan MAMOU
October 2002

Motivation
• XML is getting popular
• Allows meta-data to be embedded into documents
• Data-centric view: exchange format for structured data – meta data
• Document-centric view: content – text, meta data
• Querying data and meta-data

Buy our Classic Children’s books.
One Fish Two Fish by John Meyer & Peter Smith. Costs Only: $7.95
Goodnight Moon by Margaret Brown. Costs Only: $10.55
Brown Bear by Bill Martin Jr. Costs Only: $6.00
amazing.com

<bookinfo>
<book><title>One Fish Two Fish</title> <author>John Meyer</author> <author>Peter Smith</author> <price>7.95</price></book>
<book><title>Goodnight Moon</title> <author>Margaret Brown</author> <price>10.55</price></book>
....
</bookinfo>

A query
Find titles and prices of books by ‘Meyer’ or ‘Smith’

IR Approach
How to deal with tags?
• Discard all tags
  • Simplicity
  • Loss of information (structure), hence lower retrieval performance
• Keep tags as keywords
How to write the query?
“Title price book author Meyer Smith”

IR Approach (cont’d)
• Can’t specify that Meyer and Smith are the authors
• Can’t specify that title, price and author belong to the same book
• Can’t specify the desired output (i.e., titles, prices)

Database approach
FOR $b IN document(“bib.xml”)//book
WHERE $b/author contains ‘Meyer’ OR $b/author contains ‘Smith’
RETURN <result> <title> $b/title </title> <price> $b/price </price> </result>
• Difficult for a naive user
• Requires knowledge of the document structure
• Dependent on the document structure

Our Goal
• Combine IR and database techniques: tags + text
• Simple language
• Logical structure, not physical
• Require knowledge of tag names, not structure
• Queries should work even if the structure changes
• Rank results

Framework

Tree Representation
[Figure: document tree rooted at bookinfo. First book: price $5.75, title “Just Lost”, authors “Mercy Meyer” and “Gina Meyer”. Second book: title “Brown Bear”, price $13.95.]
We need to find tuples of related title and price nodes.
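
To make the need for "related" nodes concrete, here is a minimal Python sketch that parses the two books spelled out in the amazing.com fragment and pairs title and price nodes by label alone. It is only an illustration under stated assumptions: the `DOC` string, the variable names, and the use of the standard-library `xml.etree.ElementTree` module are choices made for this sketch and are not part of XSEarch.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Assumption: only the two books written out in the fragment are used;
# the elided part ("....") is left out.
DOC = """
<bookinfo>
  <book><title>One Fish Two Fish</title>
        <author>John Meyer</author><author>Peter Smith</author>
        <price>7.95</price></book>
  <book><title>Goodnight Moon</title>
        <author>Margaret Brown</author>
        <price>10.55</price></book>
</bookinfo>
"""

root = ET.fromstring(DOC)

# Collect nodes by tag name only, ignoring where they sit in the tree.
titles = list(root.iter("title"))
prices = list(root.iter("price"))

# Naive pairing: every title node with every price node.
candidates = [(t.text, p.text) for t, p in product(titles, prices)]

for pair in candidates:
    print(pair)
```

The naive pairing yields four candidates, two of which combine the title of one book with the price of the other. Knowing tag names is therefore not enough on its own; some structural criterion is needed to keep only the pairs that belong to the same book.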

Another Tree Representation
[Figure: tree rooted at bookinfo in which book nodes are nested under author nodes. Node labels: author, name, book, title, price. Values: “Dr. Meyer”, “One Fish Two Fish”, $12.50, “Cat in the Hat”, $14.95, “M. Brown”, “Goodnight Moon”.]
Similar document, but with a different hierarchical structure from the previous one.
We need to find tuples of related title, author and price nodes.

Interconnection
[Figure: the bookinfo tree with a title node and a price node circled; caption: the lowest common ancestor of the circled nodes.]
Consider a title and a price node.
Intuition: The nodes belong to different book entities.

Interconnection (cont’d)
[Figure: the bookinfo tree with two nodes circled; caption: the lowest common ancestor of the circled nodes.]
Intuition: The nodes belong to the same book entity.

Interconnection (cont’d)
[Figure: the bookinfo tree with two nodes circled.]
Intuition: The nodes belong to the same book entity.

Relationship tree
• Nodes n1, n2
• n: their lowest common ancestor
• Tn: the subtree rooted at n
• The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes other than n1, n2 that are not ancestors of n1, n2

Interconnection
We say that n1, n2 are interconnected if
• the relationship tree does not contain 2 distinct nodes with the same label, or
• the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1, n2

All-Pairs Interconnection
A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected.

Star interconnection
[Figure: bookinfo tree. First book: price $5.75, title “Just Lost”, and two author nodes, each with a name child (“Mercy Meyer”, “Gina Meyer”). Second book: title “Brown Bear”, price $13.95.]
The 2 names are not interconnected.

Star Interconnection (cont’d)
A set of nodes is star interconnected if all the nodes in the set are interconnected to the same node.

Search terms, Search query
• Search Term: (l,k)
  • l: label (context)
  • k: keyword
• Search Query: AND:L1 OR:L2
  • L1, L2: lists of search terms
• Example: AND:(title,)(price,) OR:(author,Meyer)(author,Smith)

Answer
AND:N1 OR:N2
• N1, N2 are lists of nodes
• Matching between N1, N2 and L1, L2
• N1 and N2 are interconnected
• All all-pairs answers are star answers
• Maximal answer

Example
[Figure: the bookinfo tree (first book: price $5.75, title “Just Lost”, authors “Mercy Meyer” and “Gina Meyer”; second book: title “Brown Bear”, price $13.95) next to a result table with columns title, author, price, containing a null entry.]
(title,) (price,) (author,Meyer)
Find matchings of title, author and price to the nodes in the tree.

Computing answers
• All-pairs
  • Determining whether the set of answers is empty is NP-complete
  • If L1 is empty, computing the set of answers is polynomial in the size of the input and output
• Star
  • Computing the set of answers is polynomial in the size of the input and output

Ranking results
• Unstructured
  • Keyword weight (tfilf)
  • Tag weight
  • Result size
• Structured
  • Node distance
  • Ancestor-descendant

Keyword Weight
• Compute the weight of a keyword k within a given node n
• Variation of tfidf, one of the metrics of the Vector Space Model (a classical model in IR)

Keyword Weight (cont’d)
• Term Frequency (tf): number of appearances of k within n, relative to the most frequent keyword k’ in n
  tf(k,n) = occ(k,n) / max_{k’} occ(k’,n)
• Inverse Leaf Frequency (ilf): inverse frequency of k among all the leaves in the corpus (N: number of leaves in the corpus; Nk: number of leaves containing k)
  ilf(k) = log(1 + N/Nk)
• W(k,n) = tf(k,n) * ilf(k)
• Normalized per leaf

Tag Weight
• Give weight to tags according to their importance
• E.g. give more weight to <title> than to <abstract>

Result Size
• Number of search terms appearing in the result (OR part)

Ranking-Structured
• Node distance: size of the relationship tree
• Ancestor-descendant relationship: “more” interconnected

System overview

XSEarch overview
[Figure: architecture diagram with components labeled Online, Offline, XML corpus with logical hierarchy, Indexer, query, Search, Results.]

Document Location array
• Generate a unique id, did, for each document
• Associate each did with the physical location of the corresponding document
• Logical structure of the corpus

Node Encoding Array
• Generate for each interior node an id, nid
• Node encoding, defined recursively:
  • Node encoding of its parent
  • Index of the node among its siblings
  • E.g. 13.8.1.9
• Associate each nid with its node encoding

Node Label Array
• Associate each nid with its label

Inverted Tag Index
For each tag, keep
• posting list: list of nodes labeled with this tag
• weight
tag: Nid1, Nid2, Nid3

Inverted Keyword Index
For each keyword (kw), keep
• posting list: list of leaves containing this keyword
• weight of the keyword within the leaf (tfilf)
kw: (Nid1,w1), (Nid2,w2), (Nid3,w3)

Node Interconnection
• Matrix: element ij contains
  • 1, if ni and nj are interconnected
  • 0, otherwise
• n×n symmetric sparse matrix
• Dynamic programming
• Alternative: hash set that keeps only interconnected nodes
  • Key: pair (ni, nj)

Interconnection
• Let n be the number of nodes
• It is possible to determine whether n1 and n2 are interconnected in O(n) time
• It is possible to determine interconnection of all pairs in O(n²)
• Offline/Online computation

Interconnection
for (i = size-1; i >= 0; i--)
    for (j = i+1; j < size; j++)
        if i is an ancestor of j
            connected(i,j) = connected(iChild,j) AND connected(i,jFather) AND labelIChild != labelJ AND labelI != labelJFather
    for (j = i+1; j < size; j++)
        if i is not an ancestor of j
            connected(i,j) = connected(i,jFather) AND connected(iFather,j) AND labelI != labelJFather AND labelIFather != labelJ

Demo
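
The relationship-tree definition of interconnection given in the slides can be checked directly for a single pair of nodes. The following is a minimal Python sketch of that definition, not the XSEarch implementation and not the dynamic-programming scheme above: the `Node` class, the function names, and the small example tree are all assumptions made for illustration, and both nodes are assumed to belong to the same tree.

```python
from collections import Counter


class Node:
    """Hypothetical tree node: a label plus a pointer to its parent."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def relationship_tree(n1, n2):
    """Nodes on the two paths from the lowest common ancestor to n1 and n2."""
    up1 = path_to_root(n1)
    on_path1 = {id(x) for x in up1}
    # Climb from n2 until we reach a node on n1's path: that node is the LCA.
    branch2 = []
    cur = n2
    while id(cur) not in on_path1:
        branch2.append(cur)
        cur = cur.parent
    lca = cur
    branch1 = up1[: up1.index(lca) + 1]  # n1 up to and including the LCA
    return branch1 + branch2


def interconnected(n1, n2):
    """Apply the two conditions of the interconnection definition."""
    counts = Counter(x.label for x in relationship_tree(n1, n2))
    repeated = [label for label, c in counts.items() if c > 1]
    if not repeated:
        return True  # no two distinct nodes share a label
    # Otherwise the only same-label pair must be exactly {n1, n2}.
    return (
        len(repeated) == 1
        and counts[repeated[0]] == 2
        and n1 is not n2
        and n1.label == n2.label == repeated[0]
    )


# Example tree (assumed): bookinfo with two books.
root = Node("bookinfo")
book1, book2 = Node("book", root), Node("book", root)
title1, price1 = Node("title", book1), Node("price", book1)
author1a, author1b = Node("author", book1), Node("author", book1)
title2, price2 = Node("title", book2), Node("price", book2)

print(interconnected(title1, price1))      # same book
print(interconnected(title1, price2))      # different books
print(interconnected(author1a, author1b))  # two authors of one book
```

Each call walks at most two root paths, which is consistent with the O(n) bound stated for a single pair. The sketch recomputes everything per pair; the matrix or hash-set variants in the slides would instead store the results for reuse.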