
Captain Nemo: a Metasearch Engine with Personalized Hierarchical Search Space
(http://www.dblab.ntua.gr/~stef/nemo)
Stefanos Souldatos, Theodore Dalamagas, Timos Sellis
(National Technical University of Athens, Greece)
INTRODUCTION
Metasearching
Metasearch engines can reach a large part of
the web.
[Diagram: a Metasearch Engine dispatching the user query to Search Engine 1, Search Engine 2 and Search Engine 3]
Personalization
Personalization is an emerging need on the Web.
Personalization in Metasearching
Personalization can be applied in all 3 stages of metasearching:
- Result Retrieval: Personal Retrieval Model (search engines, #pages, timeout)
- Result Presentation: Personal Presentation Style (grouping, ranking, appearance)
- Result Administration: Thematic Classification of Results (k-Nearest Neighbor, Support Vector Machines, Naive Bayes, Neural Networks, Decision Trees, Regression Models)
Hierarchical Classification

Flat Model:
ROOT: ART (fine arts), CINEMA (movie, film, actor), PAINTING (painter, canvas, gallery), SPORTS (athlete, score, referee), BASKETBALL (basket, nba, game), FOOTBALL (ground, ball, match)

Hierarchical Model:
ROOT
├── ART (fine arts)
│   ├── CINEMA (movie, film, actor)
│   └── PAINTING (painter, canvas, gallery)
└── SPORTS (athlete, score, referee)
    ├── BASKETBALL (basket, nba, game)
    └── FOOTBALL (ground, ball, match)
RELATED WORK
Personalization in Metasearch Engines

Result Retrieval. The user defines:
- the search engines to be used: Ixquick, Infogrid, Mamma, Profusion, WebCrawler, Query Server
- the timeout option (i.e. max time to wait for search results): Infogrid, Mamma, Profusion, Query Server
- the number of pages to be retrieved by each search engine: Profusion, Query Server

Result Presentation:
- Results can be grouped by the search engine that retrieved them: Dogpile, WebCrawler, MetaCrawler

Result Administration:
- Northern Light: organizes search results into dynamic custom folders
- Inquirus2: recognizes thematic categories and improves queries towards a category
- Buntine et al. (2004): topic-based open source search engine
CAPTAIN NEMO
Personal Retrieval Model (stage: Result Retrieval)

For each search engine, the user sets the number of results, the timeout and a weight:

Search Engine      Number of Results   Timeout   Weight
Search Engine 1    20                  6         7
Search Engine 2    30                  8         10
Search Engine 3    10                  4         5
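As a minimal sketch, such a per-engine profile could be represented as plain records; the class and field names below are hypothetical, not taken from the Captain Nemo implementation:

```python
from dataclasses import dataclass

@dataclass
class EngineProfile:
    """Per-engine retrieval settings chosen by the user (hypothetical names)."""
    name: str
    num_results: int  # pages to request from this engine
    timeout: int      # max time to wait for its results
    weight: int       # reliability weight, reused later for metasearch ranking

# The example profile from the table above.
profile = [
    EngineProfile("Search Engine 1", num_results=20, timeout=6, weight=7),
    EngineProfile("Search Engine 2", num_results=30, timeout=8, weight=10),
    EngineProfile("Search Engine 3", num_results=10, timeout=4, weight=5),
]
```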
Personal Presentation Style (stage: Result Presentation)

Result Grouping (a grouping sketch follows this slide):
- Merged in a single list
- Grouped by search engine
- Grouped by relevant topic of interest

Result Content:
- Title
- Title, URL
- Title, URL, Description

Look 'n' Feel:
- Color Themes (XSL Stylesheets)
- Page Layout
- Font Size
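A minimal sketch of the grouping step, assuming each fetched result is a record with hypothetical engine and topic fields (not the actual Captain Nemo data model):

```python
from collections import defaultdict

def group_results(results: list[dict], key: str) -> dict[str, list[dict]]:
    """Group result records by 'engine' or 'topic'; 'merged' keeps one list."""
    if key == "merged":
        return {"all": list(results)}
    groups: dict[str, list[dict]] = defaultdict(list)
    for r in results:
        groups[r[key]].append(r)
    return dict(groups)

results = [
    {"title": "NBA news", "engine": "SE1", "topic": "BASKETBALL"},
    {"title": "Film review", "engine": "SE2", "topic": "CINEMA"},
    {"title": "NBA scores", "engine": "SE2", "topic": "BASKETBALL"},
]
print(group_results(results, "topic"))
```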
Topics of Personal Interest (stage: Result Administration)

Administration of topics of personal interest:
- The user defines a hierarchy of topics of personal interest (i.e. thematic categories).
- Each thematic category has a name and a description of 10-20 words.
- The system offers an environment for the administration of the thematic categories and their content.
Hierarchical classification of results:
- The system proposes the most appropriate thematic category for each result (Nearest Neighbor); a sketch follows below.
- The user can save each result in the proposed category or in another one.
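A minimal sketch of the nearest-neighbor suggestion, assuming cosine similarity between a result snippet and the 10-20-word category descriptions; the function names and the similarity measure are assumptions, not details from the paper:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest_category(snippet: str, categories: dict[str, str]) -> str:
    """Propose the thematic category whose description is nearest to the snippet."""
    doc = Counter(snippet.lower().split())
    return max(categories,
               key=lambda c: cosine(doc, Counter(categories[c].lower().split())))

# Hypothetical categories with short keyword descriptions, as in the slides.
categories = {
    "BASKETBALL": "basket nba game",
    "FOOTBALL": "ground ball match",
    "CINEMA": "movie film actor",
}
print(suggest_category("Michael Jordan nba basket highlights", categories))  # BASKETBALL
```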
Classification Example

Query: "Michael Jordan"

Results placed in the user's topics of interest (result counts per category):
ROOT
├── ART (fine arts): 3 results
│   ├── CINEMA (movie, film, actor): 3 results
│   └── PAINTING (painter, canvas, gallery): 2 results
└── SPORTS (athlete, score, referee)
    ├── BASKETBALL (basket, nba, game): 8 results
    └── FOOTBALL (ground, ball, match)
METASEARCH RANKING
Two Ranking Approaches:
- Using initial scores of search engines
- Not using initial scores of search engines
Using Initial Scores
- Rasolofo et al. (2001) argue that the initial scores of the search engines can be exploited.
- Normalization is required to achieve a common measure of comparison.
- A weight factor captures the reliability of each search engine: engines that return more Web pages should receive higher weight, on the assumption that the number of relevant pages retrieved is proportional to the total number of pages retrieved as relevant.
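A minimal sketch of the idea, assuming simple min-max normalization (the precise scheme of Rasolofo et al. is not reproduced here):

```python
def weighted_normalized_scores(scores: list[float], weight: float) -> list[float]:
    """Min-max normalize one engine's raw scores to [0, 1], then apply the
    engine's reliability weight so scores from different engines can be merged."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate case: all raw scores equal
        return [weight] * len(scores)
    return [weight * (s - lo) / (hi - lo) for s in scores]

# e.g. an engine with weight 7 returning raw scores on its own scale
print(weighted_normalized_scores([0.91, 0.72, 0.55], weight=7))
```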
Not Using Initial Scores
- The scores of different search engines are not compatible or comparable, even when normalized.
- Towell et al. (1995) note that the same document receives different scores from different search engines.
- Gravano and Papakonstantinou (1998) point out that comparison is not feasible even among engines using the same ranking algorithm.
- Dumais (1994) concludes that scores depend on the document collection used by a search engine.
Aslam and Montague (2001)
- Bayes-fuse uses probabilistic theory to calculate the probability that a result is relevant to a query.
- Borda-fuse is based on democratic voting: each search engine casts votes for the results it returns (N votes for the first result, N-1 for the second, etc.). The metasearch engine gathers the votes, and the ranking is determined by summing them up.
- Weighted Borda-fuse is a weighted variant of Borda-fuse in which search engines are not treated equally: their votes are weighted by the reliability of each search engine.
Weighted Borda-Fuse

V(r_ij) = w_j * (max_k(r_k) - i + 1)

where:
- V(r_ij): votes of the i-th result of the j-th search engine
- w_j: weight of the j-th search engine (set by the user)
- max_k(r_k): maximum number of results returned by any engine

Example (max_k(r_k) = 5):

        Positional votes   Weight    Weighted votes
SE1     5 4 3 2            w1 = 7    35 28 21 14
SE2     5 4 3              w2 = 10   50 40 30
SE3     5 4 3 2 1          w3 = 5    25 20 15 10 5
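A runnable sketch of the vote computation and merge; the result identifiers and engine names are hypothetical, not the actual Captain Nemo code:

```python
from collections import defaultdict

def weighted_borda_fuse(rankings: dict[str, list[str]],
                        weights: dict[str, int]) -> list[str]:
    """Merge per-engine ranked lists with Weighted Borda-Fuse:
    V(r_ij) = w_j * (max_k(r_k) - i + 1), votes summed per result."""
    max_len = max(len(r) for r in rankings.values())  # max_k(r_k)
    votes: dict[str, float] = defaultdict(float)
    for engine, results in rankings.items():
        for i, url in enumerate(results, start=1):
            votes[url] += weights[engine] * (max_len - i + 1)
    return sorted(votes, key=votes.get, reverse=True)

# Engines with the slide's weights 7, 10 and 5.
rankings = {
    "SE1": ["a", "b", "c", "d"],
    "SE2": ["b", "a", "e"],
    "SE3": ["c", "e", "a", "b", "f"],
}
weights = {"SE1": 7, "SE2": 10, "SE3": 5}
print(weighted_borda_fuse(rankings, weights))  # ['a', 'b', 'e', 'c', 'd', 'f']
```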
Captain Nemo
http://www.dblab.ntua.gr/~stef/nemo
Links: Introduction | Related Work | Captain Nemo | Metasearch Ranking