Sharons - Technion - Electrical Engineering

How Do Search Engines Work?
Ziv Bar-Yossef
Department of Electrical Engineering
Technion
1
What is the Internet?


A global network of computers connected to each other
Computers “talk” to each other using standard protocols
 TCP/IP
2
What is the World-Wide Web
(WWW)?

Collection of pages available via the Internet
 Internet users can view pages with web browsers
 WWW is only one application of the Internet
 Other applications: email, messengers, VoIP, newsgroups, FTP
3
Web Pages

Various formats
 pdf, word, excel, images, mp3, video, text
Most popular format: HTML
 HTML pages point to each other using hyperlinks
 Users “surf the web” by clicking hyperlinks
4
What are Search Engines?

Users have “information needs”
 Where can I find solutions to my math homework problem?
 Where can I find mp3s of Miri Messika’s latest album?
 What is the weather in Eilat on Hanukkah?
 What other Sharons are famous, except for our prime minister?
Search engines enable us to find web pages that match our information needs
5
Search Engines
[Diagram: a user with the “information need” “What other Sharons are famous, except for our prime minister?” sends the query “sharon -ariel” to the search engine, which draws on the web pages of the Web and returns a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts.]
6
How Search Engines (Don’t) Work

Common misconception: when a user submits a query, the search engine scans all web pages to find the relevant matches
[Diagram: the user's query “sharon -ariel” goes to the search engine, which is shown reading the web pages of the Web directly and returning the ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts.]
7
How Do Search Engines Work?

What do you do when you look for a term in an encyclopedia?
 Use the index!
[Diagram: the search engine holds an index of the web pages of the Web; the user's query “sharon -ariel” is answered from the index, and the ranked list of matching pages is returned: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts.]
8
Search Engine Architecture
[Diagram: the search engine and its four components: Crawler, Index, Query Processor, Ranking Algorithm.]
9
Web Crawler (a.k.a. Spider)
Fetches web pages and stores them in a local repository
 Tries to get as many web pages as possible
 Follows hyperlinks to learn about new pages
 Refetches pages that change frequently
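
The crawler's fetch-and-follow loop can be sketched roughly as below. This is a minimal illustration, not the design of any real crawler: the `fetch` callback, the `TOY_WEB` dictionary, and the `.example` URLs are all hypothetical stand-ins for real HTTP fetching and link extraction, and refetching of changed pages is left out.

```python
from collections import deque

def crawl(seed_urls, fetch, max_pages=1000):
    """Breadth-first crawl: fetch pages, store them, follow their links."""
    repository = {}              # local repository: url -> page text
    frontier = deque(seed_urls)  # URLs waiting to be fetched
    seen = set(seed_urls)        # URLs already queued, to avoid repeats
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)        # assumed to return (text, links) or None
        if page is None:
            continue             # page could not be fetched
        text, links = page
        repository[url] = text
        for link in links:       # hyperlinks reveal new pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

# Hypothetical in-memory "web" used instead of real network access.
TOY_WEB = {
    "a.example": ("page A", ["b.example", "c.example"]),
    "b.example": ("page B", ["a.example"]),
    "c.example": ("page C", ["d.example"]),  # d.example cannot be fetched
}

repo = crawl(["a.example"], TOY_WEB.get)
print(sorted(repo))
```

A queue gives breadth-first order, so pages close to the seeds are stored first; a real crawler would also need politeness rules and a refetch schedule.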

10
The Index
www.cnn.com
Ariel[1] Sharon[2], the[3] prime[4] minister[5] of[6] Israel[7] founded[8] a[9] new[10] political[11] party[12].
www.hollywood.com
Sharon[1] Stone[2] dressed[3] a[4] new[5] Jean[6] Paul[7] Gaultier[8] gown[9] at[10] the[11] Oscars[12] after[13] party[14].
Index
ariel: (cnn.com,1)
dress: (hollywood.com,3)
found: (cnn.com,8)
gaultier: (hollywood.com,8)
gown: (hollywood.com,9)
israel: (cnn.com,7)
jean: (hollywood.com,6)
minister: (cnn.com,5)
new: (cnn.com,10), (hollywood.com,5)
oscar: (hollywood.com,12)
party: (cnn.com,12), (hollywood.com,14)
paul: (hollywood.com,7)
political: (cnn.com,11)
prime: (cnn.com,4)
sharon: (cnn.com,2), (hollywood.com,1)
stone: (hollywood.com,2)
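
An index of this kind could be built along the following lines. This is only a sketch under assumptions: the stop-word list and the `toy_stem` suffix stripper are hypothetical and were chosen just so that the two example pages produce entries such as `dress`, `found`, and `oscar`; real engines use far more careful tokenization and stemming.

```python
import re
from collections import defaultdict

# Assumed stop words: the words the example index leaves out.
STOP_WORDS = {"the", "of", "a", "at", "after"}

def toy_stem(word):
    """Very crude stemmer (an assumption, good enough for this example)."""
    if word.endswith("ed"):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def build_index(documents):
    """Map each term to its posting list of (url, word position) pairs."""
    index = defaultdict(list)
    for url, text in documents.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        # Positions count every word, including stop words, starting at 1.
        for position, token in enumerate(tokens, start=1):
            if token in STOP_WORDS:
                continue
            index[toy_stem(token)].append((url, position))
    return dict(index)

documents = {
    "cnn.com": "Ariel Sharon, the prime minister of Israel "
               "founded a new political party.",
    "hollywood.com": "Sharon Stone dressed a new Jean Paul Gaultier "
                     "gown at the Oscars after party.",
}

index = build_index(documents)
print(index["sharon"])  # [('cnn.com', 2), ('hollywood.com', 1)]
print(index["new"])     # [('cnn.com', 10), ('hollywood.com', 5)]
```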
11
Index by “Anchor Text”

Anchor text: what’s written inside a link
 Example: Ariel Sharon, the prime minister…
Usually succinctly describes what’s written in the linked page
Under which terms is a page listed in the index?
 Terms that appear in the page
 Terms that appear in the anchor text of links to the page
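
The two sources of index terms can be sketched as follows. The function name, the `(source, anchor text, target)` link representation, and the `.example` pages are assumptions made for illustration only.

```python
from collections import defaultdict

def terms_for_pages(page_text, links):
    """Collect the terms under which each page would be listed.

    page_text: url -> text of the page
    links:     list of (source_url, anchor_text, target_url) triples
    """
    terms = defaultdict(set)
    # 1. Terms that appear in the page itself.
    for url, text in page_text.items():
        terms[url].update(text.lower().split())
    # 2. Terms that appear in the anchor text of links pointing to the page.
    for _source, anchor, target in links:
        terms[target].update(anchor.lower().split())
    return dict(terms)

# Hypothetical pages and one hypothetical link between them.
page_text = {
    "news.example": "latest headlines",
    "bio.example": "born in 1928",
}
links = [("news.example", "Ariel Sharon", "bio.example")]

terms = terms_for_pages(page_text, links)
print(sorted(terms["bio.example"]))  # ['1928', 'ariel', 'born', 'in', 'sharon']
```

Here `bio.example` becomes findable under "sharon" even though the word never occurs in its own text.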
12
Query Processor




Gets a user query
Fetches relevant posting lists from index
Extracts relevant matches from lists
Example: Query = “sharon -ariel”
 L1 = posting list of sharon
  sharon: (cnn.com,2), (hollywood.com,1)
 L2 = posting list of ariel
  ariel: (cnn.com,1)
 Return all pages in L1 that do not occur in L2
  hollywood.com
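
The example can be traced with a short sketch. The function `process_query` and its handling of a leading `-` are illustrative assumptions, not a description of any particular engine; the posting lists are the ones from the example index.

```python
index = {
    "sharon": [("cnn.com", 2), ("hollywood.com", 1)],
    "ariel": [("cnn.com", 1)],
}

def pages_of(term, index):
    """Set of pages that appear in the posting list of a term."""
    return {url for url, _position in index.get(term, [])}

def process_query(query, index):
    """Return pages containing all plain terms and none of the '-' terms."""
    required, excluded = [], []
    for term in query.lower().split():
        if term.startswith("-"):
            excluded.append(term[1:])
        else:
            required.append(term)
    if not required:
        return []
    # Pages that occur in every required posting list.
    matches = set.intersection(*(pages_of(t, index) for t in required))
    # Remove pages that occur in any excluded posting list.
    for term in excluded:
        matches -= pages_of(term, index)
    return sorted(matches)

print(process_query("sharon -ariel", index))  # ['hollywood.com']
```

Only the two posting lists are read; no page is scanned at query time.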
13
Ranking Algorithm

Many queries have many matching pages
 472 million matches for “London” in Google
Cannot return all of them to the user
 User needs the most relevant results anyway
Need to order results by relevance
 Most relevant results are at the top
Ranking algorithm: a method of ordering matches
 The “heart” of a search engine
 The reason why Google is the most preferred search engine today
14
Google’s PageRank

Ranking as an election
 Candidates: all web pages
 Voters: all web pages
 p votes for q if p has a hyperlink to q.
Favorites(p) = all the pages p votes for.
Fans(p) = all the pages that vote for p.

1
if p has no fans
15
Google’s PageRank
[Diagram: example pages labeled with their PageRank values: 1, 1.5, 1, 4, 1, 2.5, 1.]
Underlying principles:
 A page is “important” if it has important fans
 A page splits its “importance” evenly among its favorite pages.
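
The two principles, together with the rule that a page with no fans gets rank 1, can be tried out in a short sketch. This is the simplified scheme of these slides, not Google's production algorithm, and repeated passes are only assumed to settle for link graphs without cycles. The seven-page graph is hypothetical (not necessarily the one drawn on the slide).

```python
def simple_pagerank(favorites, iterations=50):
    """favorites: page -> list of pages it links to (its favorite pages)."""
    pages = set(favorites) | {q for qs in favorites.values() for q in qs}
    # Invert the links: fans[p] = pages that vote for p.
    fans = {p: [] for p in pages}
    for p, qs in favorites.items():
        for q in qs:
            fans[q].append(p)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            if not fans[p]:
                new_rank[p] = 1.0  # a page with no fans gets rank 1
            else:
                # Each fan q splits its rank evenly among its favorites.
                new_rank[p] = sum(rank[q] / len(favorites[q]) for q in fans[p])
        rank = new_rank
    return rank

# Hypothetical link graph with seven pages.
favorites = {
    "a": ["x"],
    "b": ["x", "y"],
    "c": ["y"],
    "d": ["y"],
    "x": ["z"],
    "y": ["z"],
}

rank = simple_pagerank(favorites)
print(rank["x"], rank["y"], rank["z"])  # 1.5 2.5 4.0
```

Page `z` ends up most important because both of its fans are themselves important.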
16
Google’s PageRank

Ranking algorithm:
 Find pages that match the given query
 Order them by their PageRank
 Return top 10 matches
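
These three steps amount to a sort followed by a cut-off, as in the sketch below. The function name, the default of rank 0 for pages without a score, and the example scores are assumptions for illustration.

```python
def rank_matches(matching_pages, pagerank, k=10):
    """Order matching pages by PageRank (highest first) and keep the top k."""
    ordered = sorted(
        matching_pages,
        key=lambda page: pagerank.get(page, 0.0),  # unknown pages rank last
        reverse=True,
    )
    return ordered[:k]

# Hypothetical matches and PageRank scores.
matches = ["p1.example", "p2.example", "p3.example"]
scores = {"p1.example": 1.0, "p2.example": 4.0, "p3.example": 2.5}

print(rank_matches(matches, scores, k=2))  # ['p2.example', 'p3.example']
```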
17
But… PageRank Does Not Always Work
SPAM
18
Conclusions
Search engines use an index to answer user queries
 Ranking is the most important component
 Spam is a problem

19
Thank You
20