How Do Search Engines Work?
Ziv Bar-Yossef
Department of Electrical Engineering, Technion

What is the Internet?
- A global network of computers connected to each other
- Computers “talk” to each other using standard protocols (TCP/IP)

What is the World-Wide Web (WWW)?
- A collection of pages available via the Internet
- Internet users can view pages with web browsers
- The WWW is only one application of the Internet
- Other applications: email, messengers, VoIP, newsgroups, FTP

Web Pages
- Various formats: PDF, Word, Excel, images, MP3, video, text
- Most popular format: HTML
- HTML pages point to each other using hyperlinks
- Users “surf the web” by clicking hyperlinks

What are Search Engines?
- Users have “information needs”:
  - Where can I find solutions to my math homework problem?
  - Where can I find mp3s of Miri Messika’s latest album?
  - What is the weather in Eilat during Hanukkah?
  - What other Sharons are famous, besides our prime minister?
- Search engines enable us to find web pages that match our information needs

Search Engines
[Diagram: the user turns the “information need” (What other Sharons are famous, except for our prime minister?) into the query "sharon -ariel"; the web search engine, drawing on the web pages, returns a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts]

How Search Engines (Don’t) Work
- Common misconception: when a user submits a query, the search engine scans all web pages to find the relevant matches
[Diagram: same query flow as on the previous slide]

How Do Search Engines Work?
- What do you do when you look for a term in an encyclopedia?
- Use the index!
[Diagram: same query flow, now with a web index added to the search engine]

Search Engine Architecture
- Crawler
- Index
- Query Processor
- Ranking Algorithm

Web Crawler (a.k.a. Spider)
- Fetches web pages and stores them in a local repository
- Tries to get as many web pages as possible
- Follows hyperlinks to learn about new pages
- Refetches pages that change frequently

The Index
Example pages (the number after each word is its position in the page):
  www.cnn.com: Ariel(1) Sharon(2), the(3) prime(4) minister(5) of(6) Israel(7) founded(8) a(9) new(10) political(11) party(12).
  www.hollywood.com: Sharon(1) Stone(2) dressed(3) a(4) new(5) Jean(6) Paul(7) Gaultier(8) gown(9) at(10) the(11) Oscars(12) after(13) party(14).
Index (terms are lowercased and reduced to their stems; common words such as “the”, “of”, and “a” are left out):
  ariel: (cnn.com,1)
  dress: (hollywood.com,3)
  found: (cnn.com,8)
  gaultier: (hollywood.com,8)
  gown: (hollywood.com,9)
  israel: (cnn.com,7)
  jean: (hollywood.com,6)
  minister: (cnn.com,5)
  new: (cnn.com,10), (hollywood.com,5)
  oscar: (hollywood.com,12)
  party: (cnn.com,12), (hollywood.com,14)
  paul: (hollywood.com,7)
  political: (cnn.com,11)
  prime: (cnn.com,4)
  sharon: (cnn.com,2), (hollywood.com,1)
  stone: (hollywood.com,2)

Index by “Anchor Text”
- Anchor text: what’s written inside a link
  - Example: Ariel Sharon, the prime minister…
- Usually succinctly describes what’s written in the linked page
- Under which terms is a page listed in the index?
  - Terms that appear in the page
  - Terms that appear in the anchor text of links to the page

Query Processor
- Gets a user query
- Fetches the relevant posting lists from the index
- Extracts the relevant matches from the lists
- Example: Query = "sharon -ariel"
  - L1 = posting list of sharon: (cnn.com,2), (hollywood.com,1)
  - L2 = posting list of ariel: (cnn.com,1)
  - Return all pages in L1 that do not occur in L2
  - Result: hollywood.com (cnn.com is dropped because it occurs in L2)

Ranking Algorithm
- Many queries have many matching pages
  - 472 million matches for “London” in Google
- Cannot return all of them to the user
  - The user needs only the most relevant results anyway
- Need to order results by relevance
  - Most relevant results are at the top
- Ranking algorithm: a method of ordering matches
  - The “heart” of a search engine
  - The reason why Google is the most preferred search engine today

Google’s PageRank
- Ranking as elections
  - Candidates: all web pages
  - Voters: all web pages
  - p votes for q if p has a hyperlink to q.
  - Favorites(p) = all the pages p votes for.
  - Fans(p) = all the pages that vote for p.
[Formula: only the case “1 if p has no fans” survives from the slide]

Google’s PageRank
[Figure: example link graph with page scores 1, 1.5, 1, 4, 1, 2.5, 1]
- Underlying principles:
  - A page is “important” if it has important fans.
  - A page splits its “importance” evenly among its favorite pages.

Google’s PageRank
- Ranking algorithm:
  - Find pages that match the given query
  - Order them by their PageRank
  - Return the top 10 matches

But… PageRank Does Not Always Work
- Spam

Conclusions
- Search engines use an index to answer user queries
- Ranking is the most important component
- Spam is a problem

Thank You
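The index and query-processor slides can be illustrated with a small sketch. It builds a positional index from the two example pages and answers the query "sharon -ariel" by taking the pages in the posting list of sharon that do not occur in the posting list of ariel. The function names (`stem`, `build_index`, `search`), the stop-word list, and the crude suffix-stripping stemmer are assumptions made for this illustration; they are not from the slides and are far simpler than what a real search engine would use.

```python
import re

# Assumed for illustration: a tiny stop-word list matching the words
# that the example index on the slide leaves out.
STOP_WORDS = {"the", "of", "a", "at", "after"}


def stem(word):
    """Crude stand-in for a stemmer: strip a trailing 'ed' or plural 's'."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def build_index(pages):
    """Map each term to its posting list: [(url, position), ...].

    Positions are 1-based and count every word, including stop words,
    so they match the numbering used in the slide's example.
    """
    index = {}
    for url, text in pages.items():
        words = re.findall(r"[a-z]+", text.lower())
        for position, word in enumerate(words, start=1):
            if word in STOP_WORDS:
                continue
            index.setdefault(stem(word), []).append((url, position))
    return index


def pages_of(index, term):
    """Set of URLs that appear in the posting list of a term."""
    return {url for url, _ in index.get(term, [])}


def search(index, query):
    """Return pages containing all plain terms and none of the '-' terms."""
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-"):
            exclude.append(stem(token[1:]))
        else:
            include.append(stem(token))
    if not include:
        return []
    result = pages_of(index, include[0])
    for term in include[1:]:
        result &= pages_of(index, term)
    for term in exclude:
        result -= pages_of(index, term)
    return sorted(result)


PAGES = {
    "cnn.com": "Ariel Sharon, the prime minister of Israel "
               "founded a new political party.",
    "hollywood.com": "Sharon Stone dressed a new Jean Paul Gaultier gown "
                     "at the Oscars after party.",
}

index = build_index(PAGES)
print(index["sharon"])                  # [('cnn.com', 2), ('hollywood.com', 1)]
print(search(index, "sharon -ariel"))   # ['hollywood.com']
```

Note that the sketch only finds matching pages; it returns them in alphabetical order and does no ranking, which is the job of the separate ranking algorithm described in the slides.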