Keyword Searching
Weighted Federated Search with Key Word in Context
Date: 10/2/2008
M D Metadata Solutions
Dan McCreary
President
Dan McCreary & Associates
[email protected]
(952) 931-9198
Copyright 2008 Dan McCreary & Associates

Acknowledgements
• Joe Wicentowski wrote the original keyword search examples
• Joe’s work was based on the KWIC code written by Wolfgang Meier

Note About Example and Functions
• In an actual production system the code would be modularized into a series of functions
• This example has the functions intentionally removed to make the process easier to follow
• A functionalized version will also be available for students to use in their production applications

Motivation
• You have a large, complex web site with many heterogeneous data collections
  – people, blogs, news stories, event calendar, etc.
• You want a single search function that will find any item in any of these collections
• Each item has a different:
  – Collection
  – Title
  – Item Viewer Function

Heterogeneous Items in a Collection
[Figure: a sequence of hit items of different types (person, blog, country), each containing a title element]
• Search results come back as heterogeneous items in a sequence
• Each hit item has a different structure
• Each hit item has a document type, and the title is always at the same XPath expression for a given item type

Detailed Steps
• Gather search keywords
• Construct scope (collections)
• Execute query (generate hits)
• Score and sort
• Prepare summary results for top hits
• Display top results

Basic Search Algorithm (pseudo-code)
    let $q := get-parameter("q", "")
    for $hit in $collection-list/type[contains(., $q)]
    return $hit
1. Get the search query
2. Find the documents that match; the [ ] predicate is like the SQL where clause
3. Return a short summary of the matching documents

Collection Paths and Predicate
    for $hit in (collection('/db/test/articles')/article/body,
                 collection('/db/test/people')/person/biography)[. &= $q]
In a production system the list of collections would be stored in an XML file and a function would return a sequence of the collections.

Sample HTML Search Form
    <html>
      <head><title>Keyword Search</title></head>
      <body>
        <h1>Keyword Search</h1>
        <form method="GET" action="search.xq">
          <p>
            <strong>Keyword Search:</strong>
            <input name="q" type="text"/>
          </p>
          <p>
            <input type="submit" value="Search"/>
          </p>
        </form>
      </body>
    </html>
The action attribute is the path to the XQuery REST service that your form uses.

Protection against injection attacks
    let $q := xs:string(request:get-parameter("q", ""))
    let $filtered-q := replace($q, "[&amp;&quot;\-*;`~!@#\$%\^()_+=\[\]\{\}\|':/.,?]", "")
This removes from the input query any special characters that could be used in an injection attack, in the same way that special characters are used in SQL injection attacks. Inside the XQuery string literal the ampersand and the double quote must be written as &amp; and &quot;.

Create a Scope Sequence
    let $scope := (
        collection('/db/test/articles')/article/body,
        collection('/db/test/people')/people/person/biography
    )
A scope is the list of all the items that you will query against. Note that in the future we will usually replace this inline scope variable with a function, xrx:get-searchable-collections(), that returns all searchable collections.

Scoring Each Hit
    let $keyword-matches := text:match-count($hit)
    let $hit-node-length := string-length($hit)
    let $score := $keyword-matches div $hit-node-length
text:match-count() returns the number of keyword matches within a hit. If a document has five occurrences of the keywords, the match count is 5.
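The scoring rule on this slide, match count divided by hit length, can be tried outside eXist-db. The following is a minimal sketch in Python rather than the deck's XQuery, so that it runs without a database. The function count_matches is a hypothetical stand-in for text:match-count(): it counts case-insensitive whole-word occurrences, which may not be how eXist's full-text index counts matches.

```python
import re
from typing import Iterable


def count_matches(text: str, keywords: Iterable[str]) -> int:
    """Hypothetical stand-in for text:match-count(): count whole-word,
    case-insensitive occurrences of any keyword in the text."""
    total = 0
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def score(text: str, keywords: Iterable[str]) -> float:
    """Match count divided by the length of the hit in characters."""
    if not text:
        return 0.0  # avoid dividing by zero for an empty hit
    return count_matches(text, list(keywords)) / len(text)


# Two made-up hits: a short one with two matches, a long one with one.
short_hit = "XML search with XML indexes"
long_hit = "XML is one of many formats discussed at length in this much longer biography"

short_score = score(short_hit, ["xml"])
long_score = score(long_hit, ["xml"])
```

Dividing by length favors short hits that are dense in keywords over long hits that mention a keyword once, which is why short_score comes out higher than long_score here.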
Once you have the sequence of hits, you can score each hit and return a new sequence of the top-scoring hits. In the scoring example, the score is the number of keyword matches within the hit divided by the total length of the hit (the number of characters in the hit node's text).

Score and Sort
    let $sorted-hits :=
        for $hit in $hits
        let $keyword-matches := text:match-count($hit)
        let $hit-node-length := string-length($hit)
        let $score := $keyword-matches div $hit-node-length
        order by $score descending
        return $hit
Once every hit has a score, ordering by that score in descending order returns the hits with the highest-scoring ones first.

Result Pagination
    let $perpage := xs:integer(request:get-parameter("perpage", "10"))
    let $start := xs:integer(request:get-parameter("start", "0"))
    let $end := $start + $perpage
    let $results :=
        for $hit in $sorted-hits[position() = (($start + 1) to $end)]
The remainder of our example deals with iterating through the results N records at a time, where N is the number of results per page ($perpage). Here $perpage and $start are both optional parameters to our search query, and $end is the sum of the start and the number per page. Because XQuery sequences are numbered from 1 while $start is a zero-based offset, the predicate selects positions $start + 1 through $end. Adding this predicate is the same as performing subsequence($sorted-hits, $start + 1, $perpage) on the sorted hits to get the final $results sequence to display on the page.
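The sort-then-slice step can be checked with a small runnable sketch. It is written in Python rather than XQuery so that it needs no database. It assumes each hit has already been paired with its score; the (name, score) tuples and the page_of_hits helper are hypothetical, and start is treated as a zero-based offset, as the default of "0" suggests.

```python
def page_of_hits(scored_hits, start=0, perpage=10):
    """Sort hits by score, highest first, and return one page of them.

    `start` is a zero-based offset, mirroring the "start" request
    parameter; `perpage` mirrors the "perpage" request parameter.
    """
    ranked = sorted(scored_hits, key=lambda pair: pair[1], reverse=True)
    end = start + perpage  # exclusive end in Python's zero-based slicing
    return ranked[start:end]


# Made-up hits as (document name, score) pairs.
hits = [
    ("doc-a", 0.010),
    ("doc-b", 0.045),
    ("doc-c", 0.002),
    ("doc-d", 0.030),
    ("doc-e", 0.020),
]

first_page = page_of_hits(hits, start=0, perpage=2)
second_page = page_of_hits(hits, start=2, perpage=2)
last_page = page_of_hits(hits, start=4, perpage=2)
```

Python slices are zero-based and exclude the end index, while XQuery sequences are numbered from 1. The same page in XQuery is therefore subsequence($sorted-hits, $start + 1, $perpage). A slice that runs past the end of the list simply returns the remaining items, so the last page may be shorter than perpage.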
Showing Results
• With Highlighted Keyword in Context
• We want to show each result as an HTML div element containing 3 components:
  – The document title
  – A summary with an excerpt of the hit showing the keywords highlighted in context
  – A link to display the full document

Extracting the Collection and Document
    let $collection := util:collection-name($hit)
    let $document := util:document-name($hit)
We did not need to keep track of the original collection and document that each hit came from, because we can always find them using these two functions.

KWIC Functions
• let $summary := kwic:summarize($hit, $config)

Displaying the Keyword in Context
The word or words you used in your search should be highlighted in the context of the search results. You can customize how much of the surrounding text you want to display.
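The idea of keyword in context, a matched word shown with a configurable amount of surrounding text, can be illustrated independently of eXist-db. The following Python sketch is not the kwic module and does not reproduce its output. The kwic_summary function, its width parameter, and the "hi" CSS class are hypothetical names chosen for illustration.

```python
import html
import re


def kwic_summary(text, keyword, width=30):
    """Return an HTML snippet showing the first occurrence of `keyword`
    with up to `width` characters of context on each side, or None
    when the keyword does not occur in the text."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return None
    before = text[max(0, match.start() - width):match.start()]
    after = text[match.end():match.end() + width]
    # Escape each piece so that document text cannot inject markup.
    return (
        html.escape(before)
        + '<span class="hi">' + html.escape(match.group(0)) + "</span>"
        + html.escape(after)
    )


snippet = kwic_summary("Native XML databases store documents directly.", "xml", width=7)
escaped = kwic_summary("a < b & xml", "xml", width=20)
missing = kwic_summary("no match here", "xml")
```

The width argument plays the role of the customizable amount of surrounding text: a larger value gives the reader more context at the cost of a longer result list.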
Calculating number of pages
    let $perpage := xs:integer(request:get-parameter("perpage", "10"))
    let $start := xs:integer(request:get-parameter("start", "0"))
    let $total-result-count := count($hits)
    let $end := min(($start + $perpage, $total-result-count))
    let $number-of-pages := xs:integer(ceiling($total-result-count div $perpage))
    let $current-page := xs:integer(($start + $perpage) div $perpage)
Taking the minimum keeps $end from running past the last hit, both when there are fewer hits than one page and on the final page of a longer result set.

Managing Federated Search
• Each application you use needs to communicate the following items to the federated search tool:
  – Collection name
  – Collection data path
  – Collection document path
  – Collection title path
  – Collection id path
  – Collection viewer path

Sample App Config File
    <app-info>
      <app-name>Articles</app-name>
      <app-path>/db/test/articles</app-path>
      <doc-path>article/body</doc-path>
      <doc-title-path>article/title/text()</doc-title-path>
      <doc-id>article/id/text()</doc-id>
      <doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
    </app-info>
If you create a file called app-info.xml in each collection that you want to search, you can dynamically create a list of the applications to search. If you do this, you can automate the installation of interoperable applications.

Thank You!
Please contact me for more information:
• Native XML Databases
• Metadata Management
• Metadata Registries
• Service Oriented Architectures
• Business Intelligence and Data Warehouse
• Semantic Web
Dan McCreary, President
Dan McCreary & Associates
Metadata Strategy Development
[email protected]
(952) 931-9198
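As a supplement to the Sample App Config File slide, the sketch below shows how a search tool might read one app-info.xml file and use it to build a link to the document viewer. It is written in Python with the standard library rather than in XQuery, and it is only an illustration: the read_app_info and viewer_url helpers and the document id "42" are hypothetical, while the XML is the sample from the slide.

```python
import xml.etree.ElementTree as ET

# The sample configuration from the slide, embedded as a string.
APP_INFO = """<app-info>
  <app-name>Articles</app-name>
  <app-path>/db/test/articles</app-path>
  <doc-path>article/body</doc-path>
  <doc-title-path>article/title/text()</doc-title-path>
  <doc-id>article/id/text()</doc-id>
  <doc-viewer>/db/test/articles/views/view-article.xq?id=</doc-viewer>
</app-info>"""


def read_app_info(xml_text):
    """Parse one app-info document into a dict keyed by element name."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def viewer_url(app, doc_id):
    """Build the link to the full document by appending the id to the
    viewer path, which already ends in '?id='."""
    return app["doc-viewer"] + doc_id


app = read_app_info(APP_INFO)
link = viewer_url(app, "42")  # "42" is a made-up document id
```

Collecting one such dict per searchable collection gives the federated search everything listed on the Managing Federated Search slide: where the data lives, where to find the title and id of each hit, and how to link to the viewer.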