
Searching CiteSeer Metadata Using Nutch
Larry Reeve
INFO624 – Information Retrieval
Dr. Lin – Winter 2005
CiteSeer

Search Issues
  Keyword-based full-text search
  Boolean search syntax

How to…
  search by author name?
  search by author affiliation?
  search by publication date?

CiteSeer

Example:

Suggested author search approach:

For authors, list all variants that appear in citations, separated by “OR”

Examples:

m jordan or michael jordan or m i jordan or michael i jordan

howard w/2 white or h w/2 white
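
The variant-listing approach above can be sketched in a few lines. This is only an illustration: the helper names `author_variants` and `citeseer_author_query` are hypothetical, and the variant rules (first initial, full first name, optional middle initial) are an assumption drawn from the m jordan example.

```python
def author_variants(first, last, middle=None):
    """Return lowercase name variants: initial and full first name,
    each with and without a middle initial when one is given."""
    first, last = first.lower(), last.lower()
    variants = [f"{first[0]} {last}", f"{first} {last}"]
    if middle:
        m = middle.lower()[0]
        variants += [f"{first[0]} {m} {last}", f"{first} {m} {last}"]
    return variants


def citeseer_author_query(first, last, middle=None):
    """Join all variants with 'or', as in the suggested approach."""
    return " or ".join(author_variants(first, last, middle))


print(citeseer_author_query("Michael", "Jordan", "I"))
# m jordan or michael jordan or m i jordan or michael i jordan
```

Even with every variant listed, the query still runs against full text, so it can match the name anywhere in a document, not just in the author field.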
CiteSeer – phrase search
CiteSeer – term search
Goal

Search selected metadata fields
  Author name
  Author affiliation
  Publication date (month, day, year)
  Title
  Others…


Increase precision
Methodology - Nutch

An open-source web search engine

Includes crawling, indexing, searching

Technologies: Java, JSP, Tomcat

Extensible
  new fields
  new parsing/indexing facilities
  adapt UI for searching

Methodology - Metadata
Methodology
1) Split XML file into HTML documents
  Each HTML doc contains metadata
  Allows existing crawler to be used/extended

2) Crawl and index HTML documents on local filesystem
3) Search generated index using JSP page
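
Step 1 might look roughly like the sketch below. The XML layout (`record`, `author`, `pubyear`), the `<meta>` tag names, and the sample records are all made up for illustration; the actual CiteSeer metadata schema and the project's split program are not shown in these slides.

```python
import html
import xml.etree.ElementTree as ET

# Made-up sample data in an assumed layout, not the real CiteSeer schema.
SAMPLE_XML = """<records>
  <record id="1">
    <title>Example Paper One</title>
    <author first="Jane" last="Doe" affiliation="Example University"/>
    <pubyear>1999</pubyear>
  </record>
  <record id="2">
    <title>Example Paper Two</title>
    <author first="John" last="Roe"/>
  </record>
</records>"""


def record_to_html(record):
    """Render one record as an HTML page with its metadata in <meta> tags."""
    title = record.findtext("title", default="")
    metas = []
    for author in record.findall("author"):
        for attr in ("first", "last", "affiliation"):
            value = author.get(attr)
            if value:
                metas.append((f"author{attr}", value))
    year = record.findtext("pubyear")
    if year:
        metas.append(("pubyear", year))
    meta_html = "\n".join(
        f'<meta name="{name}" content="{html.escape(value, quote=True)}">'
        for name, value in metas
    )
    return (f"<html><head><title>{html.escape(title)}</title>\n"
            f"{meta_html}\n</head><body>{html.escape(title)}</body></html>")


def split_records(xml_text):
    """Return {filename: html_text}, one HTML document per record."""
    root = ET.fromstring(xml_text)
    return {f"{rec.get('id')}.html": record_to_html(rec)
            for rec in root.iter("record")}


pages = split_records(SAMPLE_XML)
print(sorted(pages))  # ['1.html', '2.html']
```

Writing each entry of `pages` to a directory on the local filesystem would then give the crawler its input.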
Methodology
[Diagram: XML File (100 records) -> Split Program -> 100 HTML Documents -> Nutch Crawler (Parse Filter, Index Filter) -> Nutch Search (JSP) (Query Filter). Legend: "Implemented as part of project".]
XML to HTML Split
Methodology - Split
Methodology – Crawl/Index

Requires 2 filters to process metadata

CSParseFilter
  Parses HTML for metadata values
  Implements Nutch HtmlParseFilter interface

CSIndexingFilter
  Uses metadata generated by ParseFilter
  Adds metadata to index
  Implements Nutch IndexingFilter interface
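
The division of labor between the two filters can be illustrated outside Nutch. The sketch below is plain Python, not the Nutch plugin API: `MetaExtractor`, `parse_metadata`, and `add_to_index_document` are hypothetical names, and the `<meta>` tag names are assumed rather than taken from the project.

```python
from html.parser import HTMLParser

# Assumed <meta> names; the project's actual HTML layout is not shown.
METADATA_NAMES = {"authorfirst", "authorlast", "authoraffiliation",
                  "pubyear", "pubmonth"}


class MetaExtractor(HTMLParser):
    """Parse step: collect <meta name=... content=...> values from HTML."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name in METADATA_NAMES and content:
            self.metadata.setdefault(name, []).append(content)


def parse_metadata(html_text):
    """Run the parse step and return {name: [values]}."""
    extractor = MetaExtractor()
    extractor.feed(html_text)
    return extractor.metadata


def add_to_index_document(doc, metadata):
    """Index step: copy parsed metadata into cs_-prefixed, lowercased fields."""
    for name, values in metadata.items():
        doc[f"cs_{name}"] = [v.lower() for v in values]
    return doc


page = ('<html><head><meta name="authorfirst" content="Peter">'
        '<meta name="authorlast" content="Lee"></head><body></body></html>')
doc = add_to_index_document({"url": "1.html"}, parse_metadata(page))
print(doc["cs_authorlast"])  # ['lee']
```

Keeping the steps separate mirrors the slide: the parse step only extracts values, and the index step decides which fields they become.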

Parse Filter – extract metadata
Index Filter
Methodology – Query

Modification of Nutch search page
  Change hit URLs to point to CiteSeer instead of the metadata HTML files on the local filesystem
  Show 20 hits per page, to match CiteSeer

Query filter
  Handles custom fields from index filter
    Custom fields are prefixed with cs_
  Implements Nutch QueryFilter interface
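
How fielded cs_ queries narrow results can be sketched as follows. This is not the Nutch QueryFilter API: `parse_query` and `matches` are hypothetical helpers, the sample documents are invented, and treating multiple clauses as an implicit AND is an assumption.

```python
FIELD_PREFIX = "cs_"


def parse_query(query):
    """Split a query into (field, term) clauses; bare terms get field None."""
    clauses = []
    for token in query.lower().split():
        field, sep, term = token.partition(":")
        if sep and field.startswith(FIELD_PREFIX) and term:
            clauses.append((field, term))
        else:
            clauses.append((None, token))
    return clauses


def matches(doc, clauses):
    """A document matches only if every clause matches (implicit AND)."""
    for field, term in clauses:
        if field is None:
            if term not in doc.get("content", "").lower().split():
                return False
        elif term not in doc.get(field, []):
            return False
    return True


# Invented index entries: three authors named Lee, one of them Peter.
index = [
    {"id": 1, "cs_authorfirst": ["peter"], "cs_authorlast": ["lee"]},
    {"id": 2, "cs_authorfirst": ["alice"], "cs_authorlast": ["lee"]},
    {"id": 3, "cs_authorfirst": ["bob"], "cs_authorlast": ["lee"]},
    {"id": 4, "cs_authorfirst": ["peter"], "cs_authorlast": ["jones"]},
]


def search(query):
    clauses = parse_query(query)
    return [doc["id"] for doc in index if matches(doc, clauses)]


print(search("cs_authorlast:lee"))                       # [1, 2, 3]
print(search("cs_authorlast:lee cs_authorfirst:peter"))  # [1]
```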
Query Filter
Evaluation

Testing for precision/recall
  100 documents

Stress test
  10,000 documents
    Approx. 10 minutes to crawl/index
  575,000 documents in CiteSeer metadata download (716,797 documents in CiteSeer)
    3.5 hours to split XML into HTML
    12 hours to crawl/index
    ~551,000 indexed during crawling
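
For a rough sense of scale, the figures above can be turned into throughput and coverage numbers. The inputs are the approximate values from the slide, so the results are estimates only; `docs_per_minute` is a hypothetical helper.

```python
def docs_per_minute(docs, minutes):
    """Average crawl/index throughput in documents per minute."""
    return docs / minutes


stress_rate = docs_per_minute(10_000, 10)       # 10,000 docs in ~10 minutes
full_rate = docs_per_minute(551_000, 12 * 60)   # ~551,000 docs in ~12 hours
coverage = 551_000 / 575_000                    # share of the download indexed

print(stress_rate)          # 1000.0
print(round(full_rate))     # 765
print(round(coverage, 3))   # 0.958
```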


Evaluation

Precision & recall
  Use first 100 docs (easy to measure recall)
  Issue queries
    Author last name
    Author first & last name
    Author affiliation

Precision
  Use max docs in each system
  Issue author search queries to both systems
  Measure precision on each page of 20 hits
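
Measuring precision on each page of 20 hits can be sketched as below. It assumes each hit has already been judged relevant or not by hand; `precision_per_page` is a hypothetical helper and the judgments are invented.

```python
def precision_per_page(judgments, page_size=20):
    """Given relevance judgments in rank order (True = relevant),
    return the precision of each result page."""
    pages = []
    for start in range(0, len(judgments), page_size):
        page = judgments[start:start + page_size]
        pages.append(sum(page) / len(page))
    return pages


# Invented judgments: page 1 has 15 relevant hits, page 2 has 10.
judgments = [True] * 15 + [False] * 5 + [True] * 10 + [False] * 10
print(precision_per_page(judgments))  # [0.75, 0.5]
```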

Evaluation – P & R

Look for all papers where Peter Lee is an author (1 document)

cs_authorlast:lee
  Returns 3 documents, all with last name of Lee
  P=0.33, R=1

cs_authorlast:lee cs_authorfirst:peter
  Returns single document
  P=1, R=1
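
The P and R values above follow from the standard definitions: precision is relevant retrieved divided by total retrieved, and recall is relevant retrieved divided by total relevant. A minimal sketch, with invented document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"peter_lee_paper"}  # the single Peter Lee document

# Last-name-only query: 3 documents returned, 1 of them relevant.
p1, r1 = precision_recall(
    {"peter_lee_paper", "other_lee_1", "other_lee_2"}, relevant)
print(round(p1, 2), r1)  # 0.33 1.0

# First-and-last-name query: 1 document returned, and it is relevant.
p2, r2 = precision_recall({"peter_lee_paper"}, relevant)
print(p2, r2)  # 1.0 1.0
```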
Evaluation - Precision

Author search:

Q1: Peter Lee
  Project: cs_authorfirst:peter cs_authorlast:lee
  CiteSeer: peter w/2 lee

Q2: Jeffrey Ullman
  Project: cs_authorfirst:jeffrey cs_authorlast:ullman
  CiteSeer: jeffrey w/2 ullman

Q3: John Smith
  Project: cs_authorfirst:john cs_authorlast:smith
  CiteSeer: john w/2 smith
Evaluation - Precision
Search Demo

Available fields:
  cs_authorfirst
  cs_authorlast
  cs_authoraffiliation
  cs_pubyear
  cs_pubmonth
