Searching CiteSeer Metadata Using Nutch
Larry Reeve
INFO624 - Information Retrieval
Dr. Lin - Winter 2005

CiteSeer

CiteSeer Search Issues
- Keyword-based full-text search
- Boolean search syntax
- How to…
  - search by author name?
  - search author affiliation?
  - search by publication date?

CiteSeer Example:
- Suggested author search approach:
  - For authors, list all variants that appear in citations, separated by "OR"
- Examples:
  - m jordan or michael jordan or m i jordan or michael i jordan
  - howard w/2 white or h w/2 white

CiteSeer - phrase search

CiteSeer - term search

Goal
- Search selected metadata fields
  - Author name
  - Author affiliation
  - Publication Date (month, day, year)
  - Title
  - Others…
- Increase precision

Methodology - Nutch
- An open-source web search engine
- Includes crawling, indexing, searching
- Technologies: Java, JSP, Tomcat
- Extensible
  - new fields
  - new parsing/indexing facilities
  - adapt UI for searching

Methodology - Metadata

Methodology
1) Split XML file into HTML documents
   - Each HTML doc contains metadata
   - Allows existing crawler to be used/extended
2) Crawl and index HTML documents on local filesystem
3) Search generated index using JSP page

Methodology
- Pipeline: XML File (100 records) -> Split Program -> 100 HTML Documents -> Nutch Crawler -> Nutch Search (JSP)
- Implemented as part of project:
  - XML to HTML Split
  - Parse Filter
  - Index Filter
  - Query Filter

Methodology - Split

Methodology - Crawl/Index
- Requires 2 filters to process metadata
- CSParseFilter
  - Parses HTML for metadata values
  - Implements Nutch HtmlParseFilter interface
- CSIndexingFilter
  - Uses metadata generated by ParseFilter
  - Adds metadata to index
  - Implements Nutch IndexingFilter interface

Parse Filter - extract metadata

Index Filter

Methodology - Query
- Modification of Nutch search page
  - Change URL from filesystem metadata HTML to CiteSeer
  - Change to 20 hits, to match CiteSeer
- Query filter
  - Handles custom fields from index filter
    - Prefixed with cs_
  - Implements Nutch QueryFilter interface

Query Filter

Evaluation
- Testing for precision/recall
  - 100 documents
- Stress test
  - 10,000 documents
    - Approx 10 mins to crawl/index
  - 575,000 documents in CiteSeer metadata download (716,797 documents in CiteSeer)
    - 3.5 hours to split XML into HTML
    - 12 hours to crawl/index
    - ~551,000 indexed during crawling

Evaluation
- Precision & recall
  - Use first 100 docs (easy to measure recall)
  - Issue queries
    - Author last name
    - Author first & last name
    - Author affiliation
- Precision
  - Use max docs in each system
  - Issue author search queries to both systems
  - Measure precision on each page of 20 hits

Evaluation - P & R
- Look for all papers where Peter Lee is an author (1 document)
- cs_authorlast:lee
  - Returns 3 documents, all with last name of Lee
  - P=.33, R=1
- cs_authorlast:lee cs_authorfirst:peter
  - Returns single document
  - P=1, R=1

Evaluation - Precision
- Author search:
  - Q1: Peter Lee
    - Project: cs_authorfirst:peter cs_authorlast:lee
    - CiteSeer: peter w/2 lee
  - Q2: Jeffrey Ullman
    - Project: cs_authorfirst:jeffrey cs_authorlast:ullman
    - CiteSeer: jeffrey w/2 ullman
  - Q3: John Smith
    - Project: cs_authorfirst:john cs_authorlast:smith
    - CiteSeer: john w/2 smith

Evaluation - Precision

Search Demo
- Available fields:
  - cs_authorfirst
  - cs_authorlast
  - cs_authoraffiliation
  - cs_pubyear
  - cs_pubmonth
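
Worked example: precision and recall for the Peter Lee queries

The P and R figures quoted on the "Evaluation - P & R" slide can be reproduced with a few lines of arithmetic. The sketch below is an illustration only and is not part of the project code. The document IDs are hypothetical placeholders, and the function name precision_recall is an assumption introduced here. The only facts taken from the slides are the counts: one relevant document, three documents returned by the last-name query, and one returned by the first-and-last-name query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs; only the counts come from the slides.
relevant = {"doc_peter_lee"}  # 1 paper authored by Peter Lee

# cs_authorlast:lee -> 3 documents, all with last name Lee
last_name_hits = ["doc_peter_lee", "doc_other_lee_1", "doc_other_lee_2"]

# cs_authorlast:lee cs_authorfirst:peter -> single document
full_name_hits = ["doc_peter_lee"]

p1, r1 = precision_recall(last_name_hits, relevant)
p2, r2 = precision_recall(full_name_hits, relevant)

print(f"cs_authorlast:lee                       P={p1:.2f} R={r1:.2f}")
print(f"cs_authorlast:lee cs_authorfirst:peter  P={p2:.2f} R={r2:.2f}")
```

The same function could be applied to each page of 20 hits when comparing the project against CiteSeer, although recall is only measurable when the full set of relevant documents is known, as in the 100-document test set.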