Keyword Search

Keyword Searching
Weighted Federated Search with Key Word in Context
Date: 10/2/2008
Metadata Solutions
Dan McCreary
President
Dan McCreary & Associates
[email protected]
(952) 931-9198
Acknowledgements
• Joe Wicentowski wrote the original
keyword search examples
• Joe’s work was based on the KWIC code
done by Wolfgang Meier
Copyright 2008 Dan McCreary & Associates
Note About Example and Functions
• In an actual production system the code
would be modularized into a series of
functions
• This example has the functions intentionally
removed to make the process easier to view
• A functionalized version will also be
available for students to use in their
production applications
Motivation
• You have a large complex web site with
many heterogeneous data collections
– people, blogs, news stories, event calendar etc
• Want a single search function that will find
any item in any of these collections
• Each item has different:
– Collection
– Title
– Item Viewer Function
Heterogeneous Items in a Collection
[Figure: a sequence of hit items (person, blog, country, person, blog, blog), each containing a title element (t)]
• Search results come back as heterogeneous items in a sequence
• Each hit item has a different structure
• Each hit item has a document type, and the title is consistently at the same XPath expression for each item type
Detailed Steps
• Gather search keywords
• Construct scope (collections)
• Execute query (generate hits)
• Score and sort
• Prepare summary results for top hits
• Display top results
Basic Search Algorithm
pseudo-code
let $q := get-parameter("q", "")
for $hit in $collection-list/type[contains(., $q)]
return $hit
1. Get the search query
2. Find the documents that match; the [ ] predicate is like the SQL WHERE clause
3. Return a short summary of the matching documents
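The three steps can be tried without a database. The following is a minimal sketch in standard XQuery, assuming a hypothetical in-memory $docs element in place of a real collection and a hard-coded query in place of the request parameter; contains() is a plain substring test, not a full-text search.

```xquery
xquery version "1.0";

(: Hypothetical in-memory documents standing in for a database collection :)
let $docs :=
  <docs>
    <article><title>XQuery Basics</title><body>Learn keyword search in XQuery.</body></article>
    <article><title>Gardening</title><body>Tomatoes need sun.</body></article>
    <article><title>Site Search</title><body>A search box helps visitors.</body></article>
  </docs>
(: Step 1: get the search query; hard-coded here instead of read from the request :)
let $q := "search"
(: Step 2: the predicate in [ ] keeps only the matching documents :)
for $hit in $docs/article[contains(body, $q)]
(: Step 3: return a short summary of each match :)
return <summary>{ string($hit/title) }</summary>
```

With this sample data the query returns summaries for "XQuery Basics" and "Site Search" only.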
Collection Paths and Predicate
for $hit in
(collection('/db/test/articles')/article/body,
collection('/db/test/people')/person/biography)
[. &= $q]
In a production system the list of collections would be
stored in an XML file, and a function would return a
sequence of the collections
Sample HTML Search Form
<html>
<head><title>Keyword Search</title></head>
<body>
<h1>Keyword Search</h1>
<form method="GET" action="search.xq">
<p>
<strong>Keyword Search:</strong>
<input name="q" type="text"/>
</p>
<p>
<input type="submit" value="Search"/>
</p>
</form>
</body>
</html>
The action attribute is the path to the XQuery REST service that your form uses
Protection against injection attacks
let $q := xs:string(request:get-parameter("q", ""))
(: hyphens are escaped so they are not read as character ranges :)
let $filtered-q :=
    replace($q,
        "[&amp;&quot;\-*;`~!@#$%^()_+=\[\]\{\}\|':/.,?]",
        "")
This removes from the input query any special characters
that could be used in injection attacks similar to
SQL injection.
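An alternative to listing every dangerous character is to keep only what is known to be safe. The sketch below is not the filter from this slide: it assumes a hypothetical local:sanitize() helper that replaces everything except letters, digits and whitespace with a space, using only the standard replace() and normalize-space() functions.

```xquery
xquery version "1.0";

(: Hypothetical allow-list filter: anything that is not a letter, digit
   or whitespace becomes a space, then runs of spaces are collapsed :)
declare function local:sanitize($q as xs:string) as xs:string {
  normalize-space(replace($q, "[^\p{L}\p{N}\s]", " "))
};

(: The quote, semicolon, comment delimiters and hyphen are all removed :)
local:sanitize("O'Brien; (: drop :) xml-db")
```

The trade-off is that an allow-list also strips harmless punctuation, so a hyphenated search term is split into two words.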
Create a Scope Sequence
let $scope := (
collection('/db/test/articles')/article/body,
collection('/db/test/people')/people/person/biography
)
A scope is the list of all the items that you will query against.
In the future we will usually replace this "inline" scope variable with
a function, xrx:get-searchable-collections(), that returns all
searchable collections
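To show how a scope built from several sources behaves, here is a small sketch in standard XQuery. The sample data and the local:get-scope() function are hypothetical; the function only stands in for the idea of a function that returns the scope and is not the xrx function named above.

```xquery
xquery version "1.0";

(: Hypothetical sample data; in eXist these would be collection() calls :)
declare variable $articles :=
  <articles>
    <article><body>Federated search spans collections.</body></article>
    <article><body>Unrelated text.</body></article>
  </articles>;

declare variable $people :=
  <people>
    <person><biography>She built a search service.</biography></person>
  </people>;

(: Hypothetical scope function: one flat sequence of searchable nodes :)
declare function local:get-scope() as element()* {
  ($articles/article/body, $people/person/biography)
};

let $q := "search"
(: One predicate filters nodes from every source in the scope :)
for $hit in local:get-scope()[contains(., $q)]
return concat(name($hit), ": ", $hit)
```

Because the scope is a single flat sequence, the same predicate applies to body and biography nodes alike, which is what makes the search federated.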
Scoring Each Hit
let $keyword-matches := text:match-count($hit)
let $hit-node-length := string-length($hit)
let $score := $keyword-matches div $hit-node-length
text:match-count() returns the number of keyword matches in a hit. If a
document has five occurrences of the keywords, the match count is 5.
Once you have the sequence of hits, you can score each hit and return a
new sequence of the top scoring hits.
In the example above the score is the number of matches within the hit divided
by the length of the hit node (the number of characters in its string value).
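The scoring idea can be sketched in standard XQuery. Here local:match-count() is a hypothetical stand-in for text:match-count() that counts case-insensitive whole-word occurrences of a single keyword; the real function works on full-text matches and may count differently.

```xquery
xquery version "1.0";

(: Hypothetical stand-in for text:match-count(): counts whole-word,
   case-insensitive occurrences of one keyword in a string :)
declare function local:match-count($text as xs:string, $keyword as xs:string) as xs:integer {
  count(tokenize(lower-case($text), "\W+")[. = lower-case($keyword)])
};

(: Score = matches divided by text length; empty text scores zero
   so that we never divide by zero :)
declare function local:score($text as xs:string, $keyword as xs:string) as xs:decimal {
  let $length := string-length($text)
  return
    if ($length eq 0) then 0.0
    else local:match-count($text, $keyword) div $length
};

(: Two matches in a 23-character string :)
local:score("XML databases store XML", "xml")
```

Dividing by length favors short nodes that are dense with keywords over long nodes that mention a keyword in passing.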
Score and Sort
let $sorted-hits :=
    for $hit in $hits
    let $keyword-matches := text:match-count($hit)
    let $hit-node-length := string-length($hit)
    let $score := $keyword-matches div $hit-node-length
    order by $score descending
    return $hit
Once you have the sequence of hits, you can score each hit
and return the hits sorted so that the top scoring hits come first
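A runnable sketch of the same sort in standard XQuery follows. It assumes each hit already carries a hypothetical matches attribute in place of a call to text:match-count(), and that no hit is empty (an empty hit would cause a division by zero).

```xquery
xquery version "1.0";

(: Hypothetical hits; @matches stands in for text:match-count($hit) :)
let $hits := (
  <hit matches="1">A long passage that mentions the keyword only once in many words.</hit>,
  <hit matches="2">Keyword twice, keyword.</hit>,
  <hit matches="1">One keyword.</hit>
)
for $hit in $hits
(: Match density: matches per character of the hit text :)
let $score := xs:integer($hit/@matches) div string-length($hit)
order by $score descending
return <result score="{ round-half-to-even($score, 3) }">{ string($hit) }</result>
```

The densest hit comes first, and the long passage with a single match comes last.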
Result Pagination
let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $end := $start + $perpage
let $results := for $hit in $sorted-hits[position() = ($start + 1 to $end)]
The remainder of our example deals with iterating through the results N records at
a time, where N is the number of results per page ($perpage).
In this case $perpage and $start are both optional parameters to our search query.
$end is the sum of the start and the number per page.
$start is zero-based but XQuery positions start at 1, so the predicate selects positions
$start + 1 to $end. A bare [$start to $end] predicate is not valid, because a predicate
cannot be a sequence of several numbers.
Adding [position() = ($start + 1 to $end)] is the same as performing a subsequence()
operation on the sorted hits to get the final $results sequence to display on the page.
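The paging step can be sketched with the standard subsequence() function. The local:page() helper is hypothetical, and it assumes $start is a zero-based offset as in the request parameter defaults above.

```xquery
xquery version "1.0";

(: Hypothetical helper: returns one page of items.
   $start is a zero-based offset; subsequence() positions are one-based. :)
declare function local:page($items as item()*, $start as xs:integer, $perpage as xs:integer) as item()* {
  subsequence($items, $start + 1, $perpage)
};

(: Third page of 25 items at 10 per page: returns 21, 22, 23, 24, 25 :)
local:page((1 to 25), 20, 10)
```

subsequence() simply returns fewer items when the last page is not full, so no extra bounds check is needed.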
Showing Results
• With Highlighted Keyword in Context
• We want to show each result as an HTML
div element containing 3 components:
– The document title
– a summary with an excerpt of the hit showing
the keywords highlighted in context
– and a link to display the full document
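One way to build such a div is with an XQuery element constructor. This is only a sketch: the local:result-div() function, the class names and the sample URL are all hypothetical, and a real summary would contain highlighted markup rather than a plain string.

```xquery
xquery version "1.0";

(: Hypothetical renderer for one search result :)
declare function local:result-div($title as xs:string, $summary as xs:string, $url as xs:string) as element(div) {
  <div class="result">
    <h3>{ $title }</h3>
    <p class="summary">{ $summary }</p>
    <a href="{ $url }">View full document</a>
  </div>
};

local:result-div("XQuery Basics", "... keyword search in XQuery ...", "view-article.xq?id=1")
```

Keeping the markup in one function means every collection's hits are displayed the same way, whatever their internal structure.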
Extracting the Collection and Document
let $collection := util:collection-name($hit)
let $document := util:document-name($hit)
We do not need to keep track of the original collection and
document that the hit came from because we can always find the
collection and document using these two functions.
KWIC Functions
• let $summary := kwic:summarize($hit, $config)
Displaying the Keyword in Context
The word or words you used in your search should be
highlighted in the context of the search results. You
can customize how much of the surrounding text you
want to display.
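To illustrate what a keyword-in-context summary contains, here is a simplified sketch in standard XQuery. It is not the KWIC module: the local:kwic() function and its class names are hypothetical, it is case-sensitive, and it only shows the first occurrence of the keyword. The $width parameter plays the role of the setting that controls how much surrounding text is shown.

```xquery
xquery version "1.0";

(: Hypothetical keyword-in-context helper: shows up to $width characters
   on each side of the first occurrence of $keyword :)
declare function local:kwic($text as xs:string, $keyword as xs:string, $width as xs:integer) as element(p)? {
  if ($keyword ne "" and contains($text, $keyword)) then
    let $before := substring-before($text, $keyword)
    let $after := substring-after($text, $keyword)
    return
      <p>
        <span class="previous">{ substring($before, string-length($before) - $width + 1) }</span>
        <span class="hi">{ $keyword }</span>
        <span class="following">{ substring($after, 1, $width) }</span>
      </p>
  else ()
};

local:kwic("Native XML databases make keyword search easy to build.", "keyword", 12)
```

A stylesheet can then highlight the span that holds the keyword, and a larger $width gives the reader more context per hit.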
Calculating number of pages
let $perpage := xs:integer(request:get-parameter("perpage", "10"))
let $start := xs:integer(request:get-parameter("start", "0"))
let $total-result-count := count($hits)
(: never let $end run past the last hit :)
let $end := min(($start + $perpage, $total-result-count))
let $number-of-pages :=
    xs:integer(ceiling($total-result-count div $perpage))
let $current-page := xs:integer(($start + $perpage) div $perpage)
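The page arithmetic can be checked in isolation. The sketch below assumes a hypothetical local:page-info() function and a zero-based $start that is a multiple of $perpage; it uses idiv for the current page, which gives the same result as the integer cast above for non-negative values.

```xquery
xquery version "1.0";

(: Hypothetical helper that reports the paging values for one request :)
declare function local:page-info($total as xs:integer, $start as xs:integer, $perpage as xs:integer) as element(page-info) {
  <page-info
    first="{ $start + 1 }"
    last="{ min(($start + $perpage, $total)) }"
    pages="{ xs:integer(ceiling($total div $perpage)) }"
    current="{ $start idiv $perpage + 1 }"/>
};

(: 25 hits, 10 per page, offset 20: first 21, last 25, 3 pages, current page 3 :)
local:page-info(25, 20, 10)
```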
Managing Federated Search
• Each application you use needs to
communicate the following items to the
federated search tool:
– Collection name
– Collection data path
– Collection document path
– Collection title path
– Collection id path
– Collection viewer path
Sample App Config File
<app-info>
<app-name>Articles</app-name>
<app-path>/db/test/articles</app-path>
<doc-path>article/body</doc-path>
<doc-title-path>article/title/text()</doc-title-path>
<doc-id>article/id/text()</doc-id>
<doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
</app-info>
If you create a file called app-info.xml in each collection that you
want to search, you can dynamically create a list of the
applications to search. This lets you
automate the installation of interoperable applications.
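As a sketch of how the search page might use these files, the standard XQuery below picks the configuration for the application a hit came from and builds the link to its viewer. The People entry, its viewer path, and the $hit-app and $hit-id values are hypothetical; only a subset of the app-info elements is shown.

```xquery
xquery version "1.0";

(: Two app-info entries; the People entry is a hypothetical example :)
let $apps := (
  <app-info>
    <app-name>Articles</app-name>
    <doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
  </app-info>,
  <app-info>
    <app-name>People</app-name>
    <doc-viewer>/db/test/people/views/view-person.xq?id=</doc-viewer>
  </app-info>
)
(: Hypothetical hit: the application it came from and its document id :)
let $hit-app := "People"
let $hit-id := "42"
(: Look up the matching configuration and append the id to its viewer path :)
let $app := $apps[app-name = $hit-app]
return <a href="{ concat($app/doc-viewer, $hit-id) }">{ string($app/app-name) }</a>
```

Because the viewer path comes from configuration, adding a new searchable application requires no change to the search code itself.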
Thank You!
Please contact me for more information:
• Native XML Databases
• Metadata Management
• Metadata Registries
• Service Oriented Architectures
• Business Intelligence and Data Warehouse
• Semantic Web
Dan McCreary, President
Dan McCreary & Associates
Metadata Strategy Development
[email protected]
(952) 931-9198