
Advanced information retrieval
Computer engineering department
Chapter 13 – Web IR
Spring 2005
Searching the web
■ Introduction (History, Definitions, Aims, Statistics)
■ Tasks of search engines
  - Gathering
  - Indexing
  - “Searching” (Querying and ranking algorithms)
  - Document and query management
■ Metasearch
■ Browsing and web directory
■ Users and web search
■ Research issues
Web Searching and Classical IR
Timeline: classic IR research through the 1970s, 1980s and 1990s (TREC); then came the web, and web searching in the 2000s.
Terminology and Definitions
■ A web page corresponds to a document in traditional IR
■ Web pages are different in their size and in the types of files that constitute them
  ◆ text, graphics, sound, video, GIF, JPEG, ASCII, PDF, etc.
■ IR on the web considers as its collection of documents the part of the web that is publicly indexable
  ◆ excludes pages that cannot be indexed (authorization, dynamic pages)
■ Location of web pages by navigation
■ Location of web pages by searching (IR on the web)
Challenges for web search
■ Problems with the data
  ◆ Distributed data
  ◆ High percentage of volatile data
  ◆ Large volume
  ◆ Unstructured data
  ◆ Redundant data
  ◆ Quality of data
  ◆ Heterogeneous data
■ Problems faced by users
  ◆ How to specify a query?
  ◆ How to interpret answers?
■ Portals / search engines
  ◆ identification of real names
  ◆ links to books from Amazon.com
  ◆ sending electronic postcards
  ◆ translation to other languages
  ◆ search of other media (metadata)
  ◆ language-specific searches
  ◆ weather, stock prices, traffic
■ Business model
  ◆ targeted advertising with revenue from clickthroughs
  ◆ word of mouth (since there is no industry-standard evaluation)
  ◆ fast response (< 1 s), 24 hours a day, 7 days a week
  ◆ filtering out spam
Examples
Search engine    URL
AltaVista        www.altavista.com
Excite           www.excite.com
Google           www.google.com
Infoseek         www.infoseek.com
Lycos            www.lycos.com
NorthernLight    www.nlsearch.com

■ Only 25-55% of the web is covered!
■ Some search engines are powered by the same IR engine
■ Most are based in the US and English-oriented
Other search engines
■ Specialised in different countries/languages (e.g., http://www.iol.it/)
■ Specific
  ◆ Rank according to popularity (e.g., DirectHit)
  ◆ Topic oriented (e.g., SearchBroker)
  ◆ Personal or institutional pages, electronic mail addresses, images, software applets
Tasks of a web search engine
■ Document gathering
  ◆ select the documents to be indexed
■ Document indexing
  ◆ represent the content of the selected documents
  ◆ often 2 indices maintained (full + small for frequent queries)
■ Searching
  ◆ represent the user information need as a query
  ◆ retrieval process (search algorithms, ranking of web pages)
■ Document and query management
  ◆ display the results
  ◆ virtual collection (documents discarded after indexing) vs. physical collection (documents maintained after indexing)
Document gathering
■ Document gathering = crawling the web
■ Crawler
  ◆ Also called robot, spider, wanderer, walker, knowbot, or web search agent
  ◆ Program that traverses the web to send new or updated pages to be indexed
  ◆ Runs on a local server and sends requests to remote servers
Crawling the web (1)
■ Crawling process (a breadth-first sketch follows below)
  ◆ Start with a set of URLs
    ✦ Submitted by users or companies
    ✦ Popular URLs
  ◆ Breadth-first or depth-first traversal
  ◆ Extract further URLs
■ Up to 10 million pages per day
■ Several crawlers
  ◆ Problem of redundancy
  ◆ Web partition ⇒ one robot per partition
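A minimal sketch of a breadth-first crawling loop, assuming a hypothetical extract_links() helper that pulls absolute URLs out of a fetched page; real crawlers add politeness delays, robot-exclusion checks and per-partition scheduling.

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def extract_links(base_url, html):
    # naive href extraction (hypothetical helper; a real crawler would use an HTML parser)
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed_urls, max_pages=100):
    seen = set(seed_urls)
    frontier = deque(seed_urls)          # FIFO queue => breadth-first traversal
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip unreachable pages
        yield url, html                  # hand the page to the indexer
        for link in extract_links(url, html):
            if link not in seen:         # avoid redundant re-crawling
                seen.add(link)
                frontier.append(link)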
Crawling the web (2)
■ Up-to-date?
  ◆ Non-submitted pages may need up to 2 months to be indexed
  ◆ The search engine learns the change frequency of pages
  ◆ Popular pages (having many links to them) are crawled more frequently
■ Guidelines for robot behaviour (the robots.txt convention, illustrated below)
  ◆ File placed at the root of the web server
  ◆ Indicates web pages that should not be indexed
  ◆ Avoids overloading servers/sites
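A small sketch of how a crawler can honour such a server-root exclusion file before fetching a page, using Python's standard urllib.robotparser; the URLs and the "MyCrawler" user agent are invented for illustration.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")   # file placed at the server root
rp.read()

# ask whether our (hypothetical) crawler may fetch a given page
if rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by the robots file")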
Document indexing
■ Document indexing = building the indices
■ Indices are variants of inverted files
  ◆ meta tag analysis
  ◆ stop word removal + stemming
  ◆ position data (for phrase searches)
  ◆ weights (a tf x idf sketch follows this slide)
    ✦ tf x idf
    ✦ downweight long URLs (not important pages)
    ✦ upweight terms appearing at the top of the document, or emphasised terms
  ◆ use de-spamming techniques
■ Hyperlink information
  ◆ count link popularity
  ◆ anchor text from source links
  ◆ hub and authority value of a page
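A toy tf x idf weighting sketch over a few made-up pages, showing the basic weight computation before the URL-length, position and emphasis heuristics listed above are applied.

import math
from collections import Counter

pages = {
    "p1": "web search engines crawl the web",
    "p2": "ranking algorithms for web pages",
    "p3": "image search on the web",
}

tokenised = {url: text.split() for url, text in pages.items()}
N = len(tokenised)
# document frequency: in how many pages does each term occur?
df = Counter(term for terms in tokenised.values() for term in set(terms))

def tf_idf(url):
    tf = Counter(tokenised[url])
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tf_idf("p1"))   # 'crawl' gets a high weight; 'web', present in every page, gets weight 0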
Searching
■ Querying
  ◆ 1 word or all words must be in the retrieved pages
  ◆ normalisation (stop word removal, stemming, etc.)
  ◆ complex queries (date, structure, region, etc.)
  ◆ Boolean expressions (advanced search)
  ◆ metadata
■ Ranking algorithms
  ◆ use of web links
  ◆ web page authority analysis
    ✦ HITS (Hyperlink Induced Topic Search)
    ✦ PageRank (Google)
Use of web links
■ A web link represents a relationship between the connected pages
■ The main difference between standard IR algorithms and web IR algorithms is the massive presence of web links
  ◆ web links are a source of evidence but also a source of noise
  ◆ classical IR counterpart: citation-based IR
  ◆ web track in TREC: in 2000, TREC-9 included a Small Web task (2 GB of web data) and a Large Web task (100 GB of web data, 18.5 million documents)
Use of anchor text
■ Anchor text can represent the referenced document
  ◆ why?
    ✦ it provides a more accurate and concise description than the page itself
    ✦ it (probably) contains more significant terms than the page itself
  ◆ used by ‘WWW Worm’, one of the first search engines (1994)
  ◆ representation of images, programs, …
■ Generate page descriptions from anchor text (see the sketch below)
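A minimal sketch of collecting anchor text per target URL, so that a page (or an image or program with no indexable text of its own) can be described by the words used in links pointing to it; the link_graph triples are invented for illustration.

from collections import defaultdict

# (source page, target URL, anchor text) triples gathered while crawling
link_graph = [
    ("a.html", "b.html", "excellent link analysis tutorial"),
    ("c.html", "b.html", "tutorial on web search"),
    ("a.html", "d.jpg",  "photo of the campus"),
]

anchor_index = defaultdict(list)
for source, target, anchor in link_graph:
    anchor_index[target].append(anchor)

# b.html and d.jpg are now described by terms they may not contain themselves
print(anchor_index["b.html"])
print(anchor_index["d.jpg"])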
Algorithms
■ Query-independent page quality
  ◆ global analysis
    ✦ PageRank (Google): simulates a random walk across the web and computes the “score” of a page as the probability of reaching the page
■ Query-dependent page quality
  ◆ local analysis
    ✦ HITS (Hyperlink Induced Topic Search): focusses on broad topic queries that are likely to be answered with too many pages
      • the more a page is pointed to by other pages, the more popular the page
      • popular pages are more likely to include relevant information than non-popular pages
PageRank (1)
■ Designed by Brin and Page at Stanford University
■ Used to implement Google
■ Main idea:
  ◆ a page has a high rank if the sum of the ranks of its in-links is high
    ✦ in-link of page p: a link from a page to page p
    ✦ out-link of page p: a link from page p to a page
  ◆ a high-PageRank page has many in-links or a few highly ranked in-links
■ Retrieval: use a cosine similarity score (content, features, term weights) combined with the PageRank value
PageRank (2)
■ Random Surfer Model: the user navigates randomly
  ◆ Initially the surfer is at a random page
  ◆ At each step the surfer proceeds
    ✦ to a randomly chosen Web page with probability d, called the “damping factor” (e.g. probability of a random jump = 0.2)
    ✦ to a randomly chosen page linked from the current page with probability 1-d (e.g. probability of following a random out-link = 0.8)
■ Process modelled by a Markov chain
  ◆ PageRank PR of a page a = probability that the surfer is at page a at a given time

    PR(a) = K d + K (1-d) Σ_{i=1..n} PR(a_i) / C(a_i)

    where d is set by the system, K is a normalisation factor, a_1, …, a_n are the pages that point to a (its in-links), and C(a_i) is the number of out-links of a_i (see the sketch below)
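A small power-iteration sketch of the formula above on a toy web graph, with d as the random-jump probability and the normalisation factor K applied by rescaling the scores to sum to 1 after each pass; the graph itself is invented.

# toy web graph: page -> list of pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, d=0.2, iterations=50):
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}           # uniform start
    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum of PR(a_i)/C(a_i) over the in-links a_i of p
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = d + (1 - d) * incoming
        total = sum(new.values())                        # normalisation (the K factor)
        pr = {p: score / total for p, score in new.items()}
    return pr

print(pagerank(links))   # C, pointed to by three pages, gets the highest rank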
HITS: Hyperlink Induced Topic Search
■ Originated from Kleinberg, 1997
■ Also referred to as the “Connectivity Analysis Approach”
■ Broad topic queries produce large sets of retrieved results
  ◆ abundance problem ⇒ too many relevant documents
  ◆ a new type of quality measure is needed ⇒ distinguish the most “authoritative” pages ⇒ high-quality response to a broad query
■ HITS: for a certain topic, it identifies
  ◆ good authorities
    ✦ pages that contain relevant information (good sources of content)
  ◆ good hubs
    ✦ pages that point to useful pages (good sources of links)
HITS (2)
■ Intuition
  ◆ authority comes from in-links
  ◆ being a good hub comes from out-links
  ◆ better authority comes from in-links from good hubs
  ◆ being a better hub comes from out-links to good authorities
■ Mutual reinforcement between hubs and authorities
  ◆ a good authority page is pointed to by many hub pages
  ◆ a good hub page points to many authority pages
HITS (3)
■ Set of pages S that are retrieved
  ◆ sometimes the k (e.g. k = 200) top-ranked pages
■ Set of pages T that point to or are pointed to by the retrieved set of pages S
  ◆ ranking pages by in_degree (number of in-links) alone is not effective
■ Within the set of pages T:
  ◆ authoritative pages relevant to the query should have a large in_degree
  ◆ there is considerable overlap in the sets of pages that point to them ⇒ hubs
Algorithm for HITS (General Principle)
■ Computation of the hub and authority value of a page through the iterative propagation of “authority weight” and “hub weight”
■ Initially all values are equal to 1
■ Authority weight x(p) of page p
  ◆ if p is pointed to by many pages with large y-values, then it should receive a large x-value
  ◆ x(p) = Σ_{q_i → p} y(q_i)
■ Hub weight y(p) of page p
  ◆ if p points to many pages with large x-values, then it should receive a large y-value
  ◆ y(p) = Σ_{p → q_i} x(q_i)
■ After each computation (iteration), the weights are normalised (see the sketch below)
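A compact sketch of that hub/authority iteration on an invented base set, using unit Euclidean length for the normalisation step (the choice of norm is an assumption; the slides only say the weights are normalised).

import math

# toy base set: page -> list of pages it points to
T = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "a1": [],
    "a2": ["a3"],
    "a3": [],
}

def hits(graph, iterations=20):
    x = {p: 1.0 for p in graph}      # authority weights
    y = {p: 1.0 for p in graph}      # hub weights
    for _ in range(iterations):
        # x(p) = sum of y(q) over pages q that point to p
        x = {p: sum(y[q] for q in graph if p in graph[q]) for p in graph}
        # y(p) = sum of x(q) over pages q that p points to
        y = {p: sum(x[q] for q in graph[p]) for p in graph}
        for w in (x, y):             # normalise after each iteration
            norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
            for p in w:
                w[p] /= norm
    return x, y

authority, hub = hits(T)
print(max(authority, key=authority.get))   # a1 (pointed to by both hubs)
print(max(hub, key=hub.get))               # h2 (points to all three authorities)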
Topic distillation
■ Process of finding ‘quality’ (authority / hub) Web documents related to a query topic, given an initial user information need
■ Extensions of HITS
  ◆ ARC (Automatic Resource Compilation)
    ✦ distance-2 neighbourhood graph
    ✦ anchor (& surrounding) text used in the computation of hub and authority values
  ◆ SALSA, etc.
■ Problems with HITS
  ◆ mutually reinforcing relationships between hosts
  ◆ automatically generated links
  ◆ non-relevant highly connected pages
  ◆ topic drift: generalisation of the query topic
Difference between PageRank and HITS
■ PageRank is computed for all web pages stored in the database, prior to any query; HITS is performed on the set of retrieved web pages, for each query.
■ HITS computes authorities and hubs; PageRank computes authorities only.
■ PageRank is non-trivial to compute; HITS is easy to compute, but real-time execution is hard.
■ Implementation details of PageRank have been reported.
Document and query management
■ Results
  ◆ Usually screens of 10 pages
  ◆ Clustering
  ◆ URL, size, date, abstract, etc.
  ◆ Various sorting options
  ◆ “Most similar documents” option
  ◆ Query refinement
■ Virtual collection vs. physical collection
  ◆ documents can change over time
  ◆ a document may be different from the one that was indexed
  ◆ broken links
Metasearch (1)
■ Problems of Web search engines:
  ◆ limited coverage of the publicly indexable Web
  ◆ they index different, overlapping sections of the Web
  ◆ based on different IR models ➙ different results for the same query
  ⇒ users do not have the time or knowledge to select the most appropriate search engine for their information need
■ Possible solution: metasearch engines
  ◆ a Web server that sends the query to
    ✦ several search engines, Web directories, databases
  ◆ collects the results
  ◆ unifies (merges) them - data fusion
■ Aim: better coverage, increased effectiveness
Metasearch (2)
■ Divided into phases (a merging sketch follows this list)
  ◆ search engine selection
    ✦ topic-dependent, past queries, network traffic, …
  ◆ document selection
    ✦ how many documents from each search engine?
  ◆ merging algorithm
    ✦ utilise rank positions, document retrieval scores, …
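A minimal data-fusion sketch for the merging phase: each engine returns an ordered list of URLs, and a URL's fused score is the sum of 1/rank over the engines that returned it. This reciprocal-rank heuristic is just one illustrative choice, not the algorithm of any particular metasearcher.

# invented result lists from three engines, best first
results = {
    "engineA": ["u1", "u2", "u3"],
    "engineB": ["u2", "u4"],
    "engineC": ["u2", "u1", "u5"],
}

scores = {}
for engine, ranking in results.items():
    for rank, url in enumerate(ranking, start=1):
        scores[url] = scores.get(url, 0.0) + 1.0 / rank   # reward high rank positions

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)   # u2 comes first: returned by all three engines, twice at rank 1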
Metasearcher    URL                    Sources used
MetaCrawler     www.metacrawler.com    13
Dogpile         www.dogpile.com        25
SavvySearch     www.search.com         > 1000
MetaCrawler
Browsing
■ Web directory
  ◆ Catalog, yellow pages, subject categories
  ◆ Many standard search engines provide subject categories now
■ Hierarchical taxonomy that classifies human knowledge, e.g.:
  Arts & Humanities, Automotive, Business & Economy, Computers & Internet, Education, Employment, Entertainment & Leisure, Games, Government, Investing, Kids & Family, Life & Style, News, People, Philosophy & Religion, Politics, Science & Technologies, Social Science …
Yahoo!
■ Around one million classified pages
■ More than 14 country-specific directories (Chinese, Danish, French, …)
■ Manual classification (has its problems)
■ Acyclic graphs
■ Search within the classified pages/taxonomy
■ Problem of coverage
Users and the web (1)
■ Queries on the web: average values

Measure                     Average value    Range
Number of words             2.35             0 - 393
Number of operators         0.41             0 - 958
Repetitions of queries      3.97             1 - 1.5 million
Queries per user session    2.02             1 - 173325
Screens per query           1.39             1 - 78496
Users and the web (2)
■ Some statistics
  ◆ Main purposes: research, leisure, business, education, …
    ✦ products and services (e-commerce)
    ✦ people and company names and home pages
    ✦ factoids (from any one of a number of documents)
    ✦ entire, broad documents
    ✦ mp3, image, video, audio
  ◆ 80% do not modify the query
  ◆ 85% look at the first screen only
  ◆ 64% of queries are unique
  ◆ 25% of users use single keywords (a problem for polysemous words and synonyms)
  ◆ 10% of queries are empty
Users and the web (3)
■ Results of search engines are identical, independent of
  ◆ the user
  ◆ the context in which the user made the request
■ Adding context information to improve search results
  ⇒ focus on the user need and answer it directly
  ◆ explicit context
    ✦ query + category
  ◆ implicit context
    ✦ based on documents edited or viewed by the user
  ◆ personalised search
    ✦ previous requests and interests, user profile
Research issues
■ Modelling
■ Querying
■ Distributed architecture
■ Ranking
■ Indexing
■ Dynamic pages
■ Browsing
■ User interface
■ Duplicated data
■ Multimedia