Teaching Mathematics and Its Applications (2011) 30, 107-119. doi:10.1093/teamat/hrr007. Advance Access publication 12 May 2011.

How does Google come to a ranked list? Making visible the mathematics of modern society

HANS HUMENBERGER*
University of Vienna, Nordbergstraße 15 (UZA 4), 1090 Vienna, Austria
*Email: [email protected]

[Submitted October 2010; accepted February 2011]

When one uses Google (and many people do!), the result of a query is a list of sites that have something to do with the item one is looking for. The most relevant sites are usually at or near the top, so it is not necessary to look through hundreds of sites to find something relevant and informative. How can Google manage this? How does Google come to the suggested list? This article is written primarily for teachers and lecturers who want to share the idea of PageRank with students without the complications arising from 'concepts of higher mathematics' such as eigenvectors or eigenvalues. The basis is a special limit theorem (concerning Markov chains) which can be used unproved at school in order to come to interesting and elementary applications of mathematics. This example also provides a very good chance for cross-linking several mathematical fields: stochastics (probabilities, etc.), linear algebra (vectors, matrices, etc.) and analysis (limits, etc.). Another focus of this contribution is to make the use of mathematics in modern society more visible. This seems necessary because mathematics disappears more and more from societal perception, despite the fact that its role in our lives rises in importance (in most cases hidden); it is surely a so-called key technology.

1. Introduction

According to a talk [1], 26% of people worldwide were online (used the internet) in September 2009; this is called the internet penetration rate. In Europe, it was 52%, and in North America 74%.
Search engines are the second largest internet application (after email), and Google has become the most used internet search engine all over the world [2]. When using it, the following question arises quite naturally: how can Google manage the ranking (most relevant sites first)? The answer has to do with the famous 'PageRank'.

First, a simple introductory problem: the telephone market of a country is dominated by three companies (A-tel, B-tel and C-tel). The companies have annual contracts with their customers [3], and for reasons of simplicity let us assume that at the end of every year fixed percentages of the customers stay with their former company and change to the other companies, respectively. This situation can easily be described with a so-called directed graph (also called a transition graph; Fig. 1).

[1] Prof. Monika Henzinger (a former computer scientist of Google) in December 2009.
[2] Market shares (according to a television broadcast in June 2009): Google 62%, Yahoo 21%.
[3] Assumption: these contracts are always made for one year; at the end/beginning of a year the customers may change their telephone provider.

© The Author 2011. Published by Oxford University Press on behalf of The Institute of Mathematics and its Applications. All rights reserved. For permissions, please email: [email protected]

FIG. 1. Transition graph.

This means, for example, for the company C-tel that 70% of its customers stay at C-tel after 1 year, 20% change to A-tel and 10% to B-tel. The other transition rates can be interpreted in a similar way. Let us suppose, also for reasons of simplicity, that these transition rates do not change during the next 5 (10, 20) years. What would the distribution (percentages) of the customers over the companies be at that time if at the beginning it were (A_0, B_0, C_0) = (1/3, 1/3, 1/3) or (A_0, B_0, C_0) = (30%, 50%, 20%)?
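The yearly iteration just described can also be carried out in a few lines of ordinary code instead of a spreadsheet. The following Python sketch uses the transition rates shown in Fig. 1 (the helper name `year_step` is ours, chosen for illustration):

```python
# Iterate the yearly market shares of the three telephone companies.
# Transition rates taken from the transition graph (Fig. 1).
def year_step(a, b, c):
    """One year: each company keeps part of its customers and gains from the others."""
    return (0.8 * a + 0.3 * b + 0.2 * c,
            0.1 * a + 0.6 * b + 0.1 * c,
            0.1 * a + 0.1 * b + 0.7 * c)

shares = (0.30, 0.50, 0.20)          # try (1/3, 1/3, 1/3) as well
for year in range(50):
    shares = year_step(*shares)

print(tuple(round(s, 4) for s in shares))   # -> (0.55, 0.2, 0.25)
```

Whatever start distribution is chosen, the shares settle at (55%, 20%, 25%), which is exactly the experimental observation discussed next.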
Even if students have not heard anything about Markov chains or transition matrices, they can handle this problem easily by using a spreadsheet programme (e.g. Excel). One can establish the associated recursions by looking at the transition graph and enter them as formulas:

    A_{n+1} = 0.8 A_n + 0.3 B_n + 0.2 C_n
    B_{n+1} = 0.1 A_n + 0.6 B_n + 0.1 C_n
    C_{n+1} = 0.1 A_n + 0.1 B_n + 0.7 C_n

Especially for such iterative situations (problems), spreadsheet programmes are a very useful tool! Using the well-known dragging-down method, one can easily and quickly see the values after 5, 10, 20 years (using only a calculator would be much more cumbersome here). One will realize that the values quickly tend to (A_n, B_n, C_n) = (55%, 20%, 25%), completely independent of the initial distribution (A_0, B_0, C_0). Spreadsheets are a wonderful tool for determining such limit distributions experimentally when there are only a few possible 'stations' (above only 3). One does not need matrices or the theory behind them; one only needs very elementary knowledge of spreadsheets. The process in a spreadsheet is an iterative one, just like the real determination of the PageRank in Google's practice. Therefore, using spreadsheets here is on the one hand a simple introduction, and on the other hand it is not so far away from the procedure in reality (iterative methods are used there too). For dealing with such 'limit distributions' in more detail (especially a few theoretical aspects), spreadsheets are not sufficient; we need 'transition matrices' (see Section 3).

2. Google and its founders

This section does not have mathematical content, but some general information about Google and its founders. When teaching PageRank primarily as a mathematical phenomenon, I think it is also important to have items that lie outside mathematics and that are motivating to students. There are many internet search engines that can 'comb through' the www in a split second.
Google is a very famous and widely used one. The name 'Google' was selected to denote something very huge, in keeping with the tremendous plenty and richness of the www. 'Google' is a modification of 'googol', a word established by the American mathematician E. Kasner in 1938; it stands for the giant number 10^100. One googol is much bigger than the number of atoms in our universe (about 10^80); on the other hand, 10^100 ≈ 70!, so it is approximately the number of possibilities to arrange 70 different things in a row. Here we again see the power of mathematical notation and how fast factorials grow: 70! is about 10^20 times as big as the number of atoms in our universe! Who would not intuitively say that there are far more atoms in our universe than possibilities to arrange 70 different things?

Google is the leading search engine worldwide (compared, e.g., to Yahoo, MSN, etc.). What is the reason for that; why could Google prevail? One of the reasons lies in the 'PageRank algorithm', which at that time made Google better than the others: results should come very fast, and the information should be relevant to the users so that they do not need to click through many different links. And especially when companies want to win many new customers (in a young market), it is very important to be better than the other business rivals. Search engines should try hard to show good sites on the first page of the list because [4] 85% of users click only on sites on the first page of the list, and 77% of users make only one query (they do not change the words they are looking for). The criteria Google takes into account in order to rank the sites as well as possible are very manifold nowadays (about 200); the first and probably most important one was the PageRank.

Lawrence (Larry) Page (born in 1973) is an American computer scientist and co-founder of the internet search engine Google. At Stanford University, he received his Master's degree in computer science.
Together with his fellow student Sergey Mikhailovich Brin (born in Moscow in 1973), he created a prototype of an internet search engine for the www in 1996. None of the big companies (today business rivals, e.g. Yahoo) were interested in the search engine they had programmed. Therefore, in 1998 they founded Google Inc. together, with initial funding of 100,000 US$ from Sun Microsystems. They had started to work on PhD theses but did not continue them after the foundation of Google. This, of course, is not surprising; they surely had, and have, other important things to do. Moreover, having gone public in 2004 (stock exchange), they became very rich: why should they work on a PhD thesis?

Although the actual algorithm used by Google is more complicated than it will be presented here (there are several other aspects and constraints), the main idea behind the mathematics of the PageRank algorithm is a very elementary one. We will focus only on the mathematical ideas of PageRank. On the one hand, it is amazing that one can make so much money and establish something world-shaking, as L. Page and S. Brin did with Google, with such elementary ideas. On the other hand, herein lies a pleasant confirmation that basic mathematical ideas are very important (in this case for millions of users and, economically seen, for the founders of Google, its employees and shareholders).

[4] According to a talk by Monika Henzinger, a former Google computer scientist, in December 2009.

When we state that the principal idea is an elementary one, we do not want to detract from the achievement of the founders, quite the contrary! The transformation of a mathematically elementary idea into a programme that can handle thousands of queries in an acceptably short time is a really hard job and an excellent accomplishment (involving not only mathematical ideas but also many important computer science issues).
In 2008, Google searched over 1 trillion (= 1,000,000,000,000) URLs, a huge number! Every second, Google answers a great many queries [5] in more than 100 domains and languages, and every user wants the result immediately, without waiting. Google wanted, and wants, an answer time of at most half a second; in most cases the time is much shorter. This very quick delivery of results was one of the reasons for the success and popularity of Google in the 1990s. The business rivals needed a bit more time for answers and therefore were at a disadvantage. Nowadays Google employs many software engineers, but the first steps were probably done by the founders themselves, a great job! Reportedly, L. Page and S. Brin want to stay in the Google company at least until 2024. We just say 'ad multos annos'!

For doing internet searches with Google, a new verb 'to google' [6] has been established, also in German: 'googeln'. If somebody asks another person about a special word (term) and this person does not know very much about it, one can often hear the hint: 'Have you googled it already?' At Wikipedia, one can read: 'The verb to google (also spelled to Google) refers to using the Google search engine to obtain information on the Web. A neologism arising from the popularity and dominance of the eponymous search engine, the American Dialect Society chose it as the "most useful word of 2002". It was officially added to the Oxford English Dictionary on 15 June 2006, and to the 11th edition of the Merriam-Webster Collegiate Dictionary in July 2006. The first recorded usage of google used as a verb was on 8 July 1998, by Larry Page himself, who wrote on a mailing list: "Have fun and keep googling!"'

3. The www as a directed graph and the description by transition matrices

Search engines start their procedure by 'combing through' the www with a so-called spider or webcrawler (a special computer programme): which documents of the www include the word we are interested in and looking for?
One aim of this very large search process is to get a description of the link structure between the sites of the www containing the word (item) looked up [7]. Let us start with a very simple example: A, B, C, D are four different sites that are linked to each other as shown in Fig. 2. For example, there are links from site A to B and C, from site B there are links to C and D, etc.

Modelling assumption 1: for reasons of simplicity, we assume that every link on a site will be used with the same probability [8]. That means that if there are two arrows leaving a site, each of them carries the probability 1/2; when there are 3 (k) leaving arrows, each arrow carries the probability 1/3 (1/k). Thus, for clarity, we will omit the probability labels in the directed graphs.

[5] Per day, there are on average 60,000,000 queries from adults in the USA alone, i.e. 700 per second (Wills, 2006, p. 6). Chartier (2006, p. 17) writes that Google has more than 3000 queries per second; I suppose that this is also meant only for the USA.
[6] Officially, it is not allowed to use 'to google' in the meaning of 'using any internet search engine'; only when using Google should one say 'to google'.
[7] Which site (containing the word looked up) has links to which other one?
[8] Of course, in reality this is not exactly the case; a conspicuous link at the top of a page is probably used more often than a 'small link' at the bottom. But these kinds of simplifications and idealisations are very typical for mathematical modelling: we have to make such simplifying assumptions in order to be able to use mathematics successfully.

FIG. 2. Four sites.

Now we can imagine, just as in the case of the telephone companies, that many users are in the system of the sites A, B, C, D, at the beginning with the relative frequencies A_0, B_0, C_0, D_0 (fractions, percentages; A_0 + B_0 + C_0 + D_0 = 1). The change of a telephone company corresponds here to the change of an internet site.
We again think of discrete steps in time: the users change sites in these time steps (by following the links), so that after n time steps the distribution of the users is A_n, B_n, C_n, D_n. We can again read off the recursions easily from the transition graph (Fig. 2):

    A_{n+1} = C_n
    B_{n+1} = 0.5 A_n + 0.5 D_n
    C_{n+1} = 0.5 A_n + 0.5 B_n + 0.5 D_n
    D_{n+1} = 0.5 B_n

In order to check whether there is a limit distribution (A, B, C, D) (i.e. a stable distribution in the long run) and, if so, what it looks like, we could again use spreadsheets. But linear systems of equations can also be described very conveniently using matrices and vectors:

    ( 0    0    1    0  )   (A_n)   (A_{n+1})
    ( 0.5  0    0    0.5) · (B_n) = (B_{n+1})
    ( 0.5  0.5  0    0.5)   (C_n)   (C_{n+1})
    ( 0    0.5  0    0  )   (D_n)   (D_{n+1})

We abbreviate the matrix by T and the distribution after n time steps by the vector p_n, so that T·p_n = p_{n+1}. All transitions p_n → p_{n+1} are given by the same matrix T (the 'transition matrix'). In column i there are the probabilities that a user on site i changes to site j in the next step by using a link i → j (i, j = 1, …, 4). Somebody who is at the moment on site C must go to site A (probability 1) in the next step; this can be seen in the transition graph and in the transition matrix (the probability 1 in column 3 and row 1).

For the transitions we get: T·p_0 = p_1, T·(T·p_0) = T²·p_0 = p_2, …, T^n·p_0 = p_n.

When using matrices, we have the possibility of a direct formula for p_n (not only an iterative description as with spreadsheets). A vector whose entries are probabilities (relative frequencies, percentages) with sum 1 is called a stochastic vector. A square matrix is called stochastic if its column vectors are stochastic.
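As a quick check, the transition matrix T of the four-site example can be entered with NumPy and tested for stochasticity; a small sketch (the array mirrors the matrix given above):

```python
import numpy as np

# Transition matrix of the four-site web: entry T[j, i] is the probability
# of moving from site i to site j (columns correspond to A, B, C, D).
T = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.0]])

# T is stochastic: every entry lies in [0, 1] and every column sums to 1.
assert np.all((T >= 0) & (T <= 1))
assert np.allclose(T.sum(axis=0), 1.0)
```

The assertion on the column sums is exactly the definition of a stochastic matrix just given.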
Transition matrices are of course stochastic: they are square matrices, and in the first column there are the probabilities that a user at A lands at A, B, C, D in the next step. Of course, these numbers come from the interval [0, 1] and have sum 1 (and analogously for the other columns).

4. How can one measure the relevance of a site?

On relevant sites, one expects to read something informative, specific and worth knowing. Of course, a site s is the more relevant the more sites have a link to s, especially when these links come from relevant sites themselves. But this does not yet say how to measure relevance. Which is the most relevant page in the above graph, which is the next most relevant, etc.? How can one determine the relevance of a site within a directed graph? This question can be answered in different ways; Google has found its own answer, its own measurement of the relevance of a page. One can think of the following situation: many users are in the network (directed graph), say 1 million users, who surf randomly in the web for an unlimited time. What fraction (percentage) of them is at A, B, C, D in the long run? If it turns out that a particular site attracts 90% of the users, then it is clear that this site is the most relevant and must be placed at the top of the list. These long-term fractions are one possibility for measuring the relevance of a site, and for these fractions we need 'limit distributions'.

Let us assume that the users start to surf in the small web containing the four pages by chance and that the fractions of the users at the beginning are 1/4 each for A, B, C, D: p_0 = (0.25, 0.25, 0.25, 0.25)^t. If they continue surfing and using the links by chance, the distribution in the next step will be p_1 = T·p_0 = (0.25, 0.25, 0.375, 0.125)^t; taking another step yields p_2 = T·p_1 = T²·p_0 = (0.375, 0.1875, 0.3125, 0.125)^t. The sites A and C seem to have an advantage here. This is also plausible: all sites have a link to C, and from there one must go to page A.
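These first steps of the random surfer are easy to reproduce numerically; a NumPy sketch (the same matrix T as above) that also iterates long enough to reveal the long-run behaviour:

```python
import numpy as np

# Transition matrix of the four-site example (columns: from A, B, C, D).
T = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.0]])

p0 = np.array([0.25, 0.25, 0.25, 0.25])   # uniform start distribution
p1 = T @ p0                               # -> (0.25, 0.25, 0.375, 0.125)
p2 = T @ p1                               # -> (0.375, 0.1875, 0.3125, 0.125)

# Iterating further, the distribution settles at (3/9, 2/9, 3/9, 1/9):
p = p0
for _ in range(100):
    p = T @ p
print(np.round(p, 4))                     # -> [0.3333 0.2222 0.3333 0.1111]
```

Starting from another distribution p_0 gives the same long-run values, in line with the limit theorem discussed below.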
By multiplying with T from the left, one gets the distributions p_n = T^n·p_0; they converge to a 'limit distribution' p_n → π, which is given by π = (3/9, 2/9, 3/9, 1/9)^t. According to this, the sites A and C should be listed equally in first place, followed by B and D. Such limit distributions can be determined at school in several ways:

(1) Repeating the iteration with a spreadsheet programme until the values do not change any more.
(2) Determining a high power T^n of the matrix with a computer algebra system (CAS), so that p_n = T^n·p_0 should be near the limit distribution.
(3) Looking for a vector π with component sum 1 that does not change under multiplication with T: T·π = π. One has to solve a linear system of equations, conveniently by CAS [9].

[9] In terms of higher mathematics, one can speak of an eigenvector of T for the eigenvalue 1. But in most cases, students at school will not know these terms and the corresponding ideas; therefore, we do not go into details in this respect.

Problems that could occur: is it possible that there are several such limit distributions π? If there is more than one, do the distributions p_i converge sometimes to one and at other times to another (depending on the start distribution p_0)? This would not suit our purpose, because we want to use this limit distribution as a neutral and stable basis for establishing a relevance ranking. A limit distribution that is not unique but depends on the start distribution would not be a good basis. It would be best if the limit distribution were unique and independent of the start distribution.

All three possibilities mentioned above for determining the limit distribution π are practicable for relatively low dimensions, as above for a 4 × 4 matrix, perhaps also 20 × 20; but in the case of a 1,000,000 × 1,000,000 matrix (or more, as occurs in real Google searches) other methods are used: iterative algorithms that come to an approximate solution.
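For the small 4 × 4 example, method (3) can be sketched directly: solve (T − I)·π = 0 together with the normalisation π_1 + … + π_4 = 1. A NumPy sketch:

```python
import numpy as np

T = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.0]])

# The rows of (T - I) are linearly dependent (the columns of T sum to 1),
# so we may replace the last equation by the normalisation pi_1 + ... + pi_4 = 1.
A = T - np.eye(4)
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 0.0, 1.0])

pi = np.linalg.solve(A, b)
print(np.round(pi, 4))   # -> [0.3333 0.2222 0.3333 0.1111], i.e. (3/9, 2/9, 3/9, 1/9)
```

A CAS or a hand calculation on the 4 × 4 system gives the same result; the point is that the normalisation pins down a unique solution.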
These iterative algorithms have to be very fast, because very many queries are given to Google every second, and nobody wants to wait long for the result.

Regardless of whether or not Markov chains are dealt with, one special limit theorem is very important because it provides a simple condition on the transition matrix T that guarantees the existence of the limit distribution π, its uniqueness and its independence of the start distribution p_0 (used here without proof):

Limit theorem: if T is stochastic and T^n contains, for some n ≥ 1, only positive entries, then the limit matrix L := lim_{n→∞} T^n exists, is stochastic and has equal columns [10].

It is clear that these columns then determine the unique limit distribution π independently of the start distribution p_0: because of A_0 + B_0 + C_0 + D_0 = 1, in the case of a 4 × 4 matrix one gets for the limit distribution π with this limit matrix L (independently of the concrete values of A_0, B_0, C_0, D_0):

    ( t_1  t_1  t_1  t_1 )   (A_0)   (t_1)
    ( t_2  t_2  t_2  t_2 ) · (B_0) = (t_2)
    ( t_3  t_3  t_3  t_3 )   (C_0)   (t_3)
    ( t_4  t_4  t_4  t_4 )   (D_0)   (t_4)

Of course we have Σ t_i = 1, because L is stochastic. In our example, T itself does not have only positive entries, but T^5 does. So the convergence and the independence of the start distribution are guaranteed by the above limit theorem. In our case we determine, for example, T^20 with a CAS (four decimal places). Here π = (3/9, 2/9, 3/9, 1/9)^t, the limit distribution mentioned above, can easily be seen:

    T^20 = ( 0.3333  0.3333  0.3333  0.3333 )
           ( 0.2222  0.2222  0.2222  0.2222 )
           ( 0.3333  0.3333  0.3333  0.3333 )
           ( 0.1111  0.1111  0.1111  0.1111 )

[10] That means the entries are constant in each row. This theorem need not be proved at school; one can simply use it for understanding the PageRank algorithm. The other special features of the theory around it also need not be dealt with at school.
For a proof of a corresponding theorem see, for example, Kemeny & Snell (1976, p. 69ff). The conditions can even be weakened: it suffices that there exists one row with only positive entries in some power of T.

5. A slightly more complicated example: the general case

The link structure of a still very small network consisting of six internet sites is shown in Fig. 3.

FIG. 3. Small network.

The transition matrix T can again be read off easily:

    T = ( 0    0    1/3  0    0    0   )
        ( 1/2  0    1/3  0    0    0   )
        ( 1/2  0    0    0    0    0   )
        ( 0    0    1/3  0    0    1/2 )
        ( 0    0    0    1/2  0    1/2 )
        ( 0    0    0    1/2  1    0   )

This is a new situation: from site B there is no arrow leaving; there are no links on this page. Within the process of surfing, one could call this a dead end or sink. We can see this also in the second column of the matrix T: it contains only zeros. This is really bad for our purposes (stochastic matrix, column sums should be 1). What will one do in such a situation if this happens while surfing the net? There are several possibilities:

(a) Stop surfing and stay at site B; in the matrix this would mean replacing the second zero in the second column by 1, and in the directed graph we would have to add an arrow from B to itself. We will not choose this possibility.
(b) One could go back one step in the browser and then use another link (hopefully not again one leading to a dead end). One would have to distinguish from which page one came to B; this would make things rather complicated. We will not choose this possibility either.
(c) We decide in favour of another alternative: one leaves this site, comes back to the list (which we think of as not yet ranked) and clicks one of the many other sites at random.

This we want to formulate explicitly as the following assumption.
Modelling assumption 2: when we come to a dead end during the surfing process, we go back to the list and click one of the m possible sites at random, each with the same probability 1/m. Here we do not consider that one would probably not click the same page again [11] (if there are really many sites, it makes no big difference whether we 'take off' the site or not): we replace the entries of the second column (zeros) by 1/6 (in general: 1/m if there are m web sites). Instead of the zero column, we write the m-dimensional column vector (1/m, …, 1/m)^t and get the matrix T_1:

    T_1 = ( 0    1/6  1/3  0    0    0   )
          ( 1/2  1/6  1/3  0    0    0   )
          ( 1/2  1/6  0    0    0    0   )
          ( 0    1/6  1/3  0    0    1/2 )
          ( 0    1/6  0    1/2  0    1/2 )
          ( 0    1/6  0    1/2  1    0   )

So we can get a stochastic transition matrix T_1 even though there are dead ends in the structure of the network. Question: what about the situation of a site to which (instead of from which) no link exists? Would this also be so bad?

Modelling assumption 3: from the experience with the dead-end situation we can say: although a page may not be a dead end, it is quite possible that one does not follow the links on a page but goes back to the list and clicks another page (at random). Let us assume that one follows the links on a page with probability α and goes back to the list to take a new chance with probability 1 − α (the pages then being taken at random, with probability 1/m each).

How can this scenario be described mathematically? What does the new transition matrix U look like in this situation? When following the links of a page, the transition matrix is given by T_1. What must the transition matrix look like in the case of going back to the list and taking a new chance (a problem for students)? Because the sites are chosen randomly with probability 1/m in this case, the next distribution must be (1/m, …, 1/m)^t; that means the transition matrix has to be the matrix T_2 all of whose entries are 1/m:

    T_2 · p = (1/m, …, 1/m)^t   for every stochastic vector p, because Σ p_i = 1.

In sum, or in combination, we get the new transition matrix U by weighting these two cases with the factors α and 1 − α, respectively:

    U = α·T_1 + (1 − α)·T_2        (1)

(with probability α one follows the links; with probability 1 − α one makes a 'new start').

[11] By the way, it can really happen that a particular site is clicked a second time, although one has already seen it and did not want to open it a second time.

It is easy to see (a problem for students): because T_1 and T_2 are stochastic matrices, U is also stochastic. The crucial property of U: the matrix U has only positive entries, no zeros any more. According to the above limit theorem, this transition matrix provides the desired, easy case of a unique limit distribution independent of the start distribution. This limit distribution can give us a ranking of the pages according to their relevance (the 'PageRank').

Which value should we take for α? It is known that Google used α = 0.85 for a long time. Possibly nowadays Google uses another value for α. For the example above, we get the solution of the linear system U·π = π (α = 0.85; π_1 + … + π_6 = 1; π_i ≥ 0; CAS; four decimal places):

    (π_1, π_2, π_3, π_4, π_5, π_6)^t = (0.0517, 0.0737, 0.0574, 0.1999, 0.2686, 0.3487)^t

We get the same result if we determine a high power of U and then take one of its columns. Also by using spreadsheets we would get the same result. According to it, the sites ranked by relevance would be:

    site 6 → site 5 → site 4 → site 2 → site 3 → site 1.

This describes in general how the matrix U is created within the PageRank algorithm [12] in an elementary way.
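The whole construction for the six-site example fits in a few lines: build T_1, mix it with the uniform matrix T_2 using α = 0.85, and iterate (the 'high power of U' approach mentioned above). The direct solve at the end anticipates the explicit formula derived in Section 7. A NumPy sketch:

```python
import numpy as np

m, alpha = 6, 0.85

# T_1: six-site transition matrix with the dead-end column (site B)
# already replaced by the uniform column (1/6, ..., 1/6).
T1 = np.array([[0.0, 1/6, 1/3, 0.0, 0.0, 0.0],
               [0.5, 1/6, 1/3, 0.0, 0.0, 0.0],
               [0.5, 1/6, 0.0, 0.0, 0.0, 0.0],
               [0.0, 1/6, 1/3, 0.0, 0.0, 0.5],
               [0.0, 1/6, 0.0, 0.5, 0.0, 0.5],
               [0.0, 1/6, 0.0, 0.5, 1.0, 0.0]])

# U = alpha*T_1 + (1 - alpha)*T_2, where every entry of T_2 is 1/m.
U = alpha * T1 + (1 - alpha) * np.full((m, m), 1 / m)

pi = np.full(m, 1 / m)        # start from the uniform distribution
for _ in range(200):          # power iteration (an iterative method, as in practice)
    pi = U @ pi
print(np.round(pi, 4))
# approximately (0.0517, 0.0737, 0.0574, 0.1999, 0.2686, 0.3487), as in the text

# Cross-check with the explicit formula of Section 7:
# (alpha*T_1 - I) pi = (alpha - 1) * (1/m, ..., 1/m)^t
pi_direct = np.linalg.solve(alpha * T1 - np.eye(m), (alpha - 1) / m * np.ones(m))
assert np.allclose(pi, pi_direct)
```

Both routes give the same ranking: site 6 first, then sites 5, 4, 2, 3 and 1.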
We cannot deal here with the question of how the algorithm comes to a solution of U·π = π (Σ π_i = 1, π_i ≥ 0) in Google's huge practice; there the solution is derived in an iterative way, and it is an approximate solution.

6. The connection to 'recursive definitions'

In 2009, I published an article in German on this topic (Humenberger, 2009); then I received a mail from a student in Germany writing some sort of paper for a final exam: 'I am struggling with the question how one can derive the formula (2) which is well known from the www, could you please help me?' This is the reason why I wrote this section. Indeed, in several articles one can find a definition of the PageRank algorithm as an example of a so-called 'recursive definition'. For example, in the German Wikipedia one can read (translated):

The principle of the PageRank algorithm is that each site has its weight (PageRank), which is the higher the more sites with a high weight link to this site. The weight PR_i of a site i is determined by the weights of the sites j that have a link to site i. When there are C_j different links on site j, then the weight PR_j is divided into C_j equal parts. The following recursive formula can be seen as a definition of the PageRank algorithm:

    PR_i = (1 − d)/N + d · Σ_{j: j→i} PR_j / C_j        (2)

Here, N is the total number of sites and d is a damping factor between 0 and 1. With this damping factor, a small part (1 − d) of the weight of every page is taken off and distributed uniformly over all the sites that are covered by the algorithm. This is necessary so that the weight does not flow off to sites that have no further links.

[12] For deeper mathematics on this topic see Chartier (2006), Langville & Meyer (2006), Wills (2006).

In what respect does this definition fit our explanations above? If we insert (1) into U·π = π, we get (see above):

    π = (1 − α)·(1/m, …, 1/m)^t + α·T_1·π

Now we consider component i of this equation (the components π_i are, in the new notation, the PR_i):

    π_i = (1 − α)/m + α · Σ_j P(j→i)·π_j,

where the P(j→i) are the entries in row i of the transition matrix T_1, i.e. the probabilities with which a user comes from site j to site i when following the links. If there is no link from site j to site i, we have P(j→i) = 0; otherwise P(j→i) is exactly 1/C_j, where C_j is the number of outgoing links on site j. Taking this into account and setting d = α and N = m, we have illustrated the analogy between this recursive definition and our explanations above.

7. An explicit solution (formula)

When we insert (1) into U·π = π and use matrix notation, we can manipulate the equation and so obtain an explicit formula for the limit distribution π (here I denotes the m-dimensional identity matrix):

    α·T_1·π + (1 − α)·T_2·π = I·π
    ⇒ (α·T_1 − I)·π = (α − 1)·(1/m, …, 1/m)^t
    ⇒ π = (α − 1)·(α·T_1 − I)^{−1}·(1/m, …, 1/m)^t

One can show that the matrix α·T_1 − I is not singular, so that we always have a unique solution [13]. However, Google cannot use this explicit solution formula in practice, because there one has very big linear systems (e.g. m = 1,000,000 or more), and in such high dimensions determining the inverse matrix is a very hard and time-consuming job. In practice, iterative and approximate algorithms are used to come to a solution.

8. Summary and reflections

Three elementary modelling assumptions (see above) had enormous effects. These ideas are simultaneously elementary and ingenious; they guarantee that the algorithm 'always works' [14]. Of course, this modelling process is not meant to be done by the students themselves (autonomous work), but with the teacher's help they get to know a piece of a very up-to-date application of mathematics [see also Voskoglou (1995); for a German paper dealing with Markov chains at school, stressing that this topic can be a bridge between analysis, linear algebra and stochastics, see e.g. Wirths (1997)].

[13] This also follows from the limit theorem above.
[14] One has to admit: in practice, the algorithm is more complicated, but its 'mathematical heart' is a very elementary one (see above) and can be understood by students.

In the discussion document of ICMI Study 20, it is written that 'we need to make the use of mathematics in modern society more visible' (Damlamian & Sträßer, 2009, p. 527). In this example, this is realized in a very good way, and it is also an example of good teaching practice: internet search engines like Google are surely very important parts of modern society; they are used by very many people every day. All over the world we can detect the so-called relevance paradox of mathematics: mathematics is used more and more extensively in modern society (mobile phones, internet, electronic cash, cars, computers, insurance, CD players, networks, etc.; one could give hundreds of examples!) and therefore becomes more and more relevant for us as a society; in other words, mathematics surely is a so-called key technology for our future. But in many cases, the mathematics behind these things is very complex and cannot be understood by non-specialists. And for merely using these things, we have to admit, understanding is not necessary. This means that, in some sense, mathematics becomes less and less relevant for the individual. This is one reason why many people see the relevance of mathematics less and less. Therefore, in mathematics education we should come up with examples that show the relevance of mathematics in a striking and elementary way, and I think this article shows such an example.

Consequences for teaching: we should foster examples that show the use of mathematics for our society and for individuals. Mathematical modelling (with or without autonomous work of students) can be a way of doing that successfully. For teaching in this way, it is important that students have basic knowledge of several mathematical fields (above: matrices, vectors, using computers, etc.)
and that it is allowed to mention and use individual unproved mathematical theorems (above: the limit theorem), especially when one can thereby come to interesting mathematical phenomena. Of course, this does not mean that reasoning in mathematics education is not important! In almost all preambles to syllabuses, it is stressed that 'cross-linking fields' is something desirable. Also, most researchers and teachers in the field of mathematics education say that the teaching and learning process should more often give the opportunity for cross-linking mathematical topics. In this example, we have a very good chance of cross-linking stochastics (probabilities, etc.), linear algebra (vectors, matrices, etc.) and analysis (limits, etc.). Besides, dealing with this topic provides an opportunity for a reasonable use of computers in mathematics education (spreadsheet programmes, CASs). This example may give reason for motivation and surprise: with what elementary ideas one can establish something world-shaking and earn very much money [15]. Therefore, it may serve as a sort of advertisement for mathematics: a great career is possible by cleverly using both elementary and ingenious ideas. We also have an affirmation of the fact that basic ideas are still important!

[15] Again: establishing a company like Google and the real algorithms are not elementary things, but the basic mathematical idea of the PageRank is elementary.

REFERENCES

CHARTIER, T. P. (2006) Googling Markov. UMAP J., 27, 17-30.
DAMLAMIAN, A. & STRÄßER, R. (2009) ICMI Study 20: educational interfaces between mathematics and industry. Discussion Document. Zentralblatt für Didaktik der Mathematik, 41, 525-533.
HUMENBERGER, H. (2009) Das Google-Page-Rank-System - Mit Markoff-Ketten und linearen Gleichungssystemen Ranglisten erstellen. Mathematik lehren, 154, 58-63.
KEMENY, J. G. & SNELL, J. L. (1976) Finite Markov Chains. Undergraduate Texts in Mathematics. New York: Springer.
LANGVILLE, A. N. & MEYER, C. D. (2006) Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton: Princeton University Press.
VOSKOGLOU, M. G. (1995) Use of absorbing Markov chains to describe the process of mathematical modelling: a classroom experiment. Int. J. Math. Educ. Sci. Technol., 26, 759-763.
WILLS, R. S. (2006) Google's PageRank: the math behind the search engine. Math. Intelligencer, 28, 6-11.
WIRTHS, H. (1997) Markow-Ketten - Brücke zwischen Analysis, linearer Algebra und Stochastik. Mathematik in der Schule, 35, 601-613.