
Teaching Mathematics and Its Applications (2011) 30, 107–119
doi:10.1093/teamat/hrr007
Advance Access publication 12 May 2011
How does Google come to a ranked list? Making visible the mathematics of modern society
HANS HUMENBERGER*
University of Vienna, Nordbergstraße 15 (UZA 4), 1090 Vienna, Austria
*Email: [email protected]
[Submitted October 2010; accepted February 2011]
When one uses Google (and many people do!), the result of the query is a list of sites that have something to do with the item one is looking for. The relevant sites are usually at or near the top, so it is not necessary to look through hundreds of sites to find something relevant and informative. How can Google manage this? How does Google arrive at the suggested list? This article is primarily written for teachers and lecturers who want to share the idea of PageRank with students without the complications arising from 'concepts of higher mathematics' like eigenvectors or eigenvalues. The basis is a special limit theorem (concerning Markov chains) which can be used unproved in school in order to arrive at interesting and elementary applications of mathematics. This example also provides a very good chance for cross-linking several mathematical fields: stochastics (probabilities, etc.), linear algebra (vectors, matrices, etc.) and analysis (limits, etc.). Another focus of this contribution is to make the use of mathematics in modern society more visible. This seems necessary because mathematics is disappearing more and more from societal perception despite the fact that its role in our lives grows in importance (though in most cases hidden); it is surely a so-called key technology.
1. Introduction
According to a talk,1 26% of people worldwide were online (used the internet) in September 2009; this is called the internet penetration rate. In Europe it was 52%, and in North America 74%. Search engines are the second largest internet application (after email), and Google has become the most used internet search engine all over the world.2 When using it, the following question arises quite naturally: how can Google manage the ranking (relevant sites first)? The answer has to do with the famous 'PageRank'.
First, a simple problem as an introduction: the telephone market of a country is dominated by three companies (A-tel, B-tel and C-tel). The companies have annual contracts with their customers,3 and for simplicity let us assume that at the end of every year a certain percentage of the customers stays with their former company while the rest change to the other companies. This situation can easily be described with a so-called directed graph (also called a transition graph; see Fig. 1).

1 According to a talk by Prof. Monika Henzinger (a former computer scientist at Google) in December 2009.
2 Market shares (according to a television broadcast in June 2009): Google 62%, Yahoo 21%.
3 Assumption: these contracts are always made for one year; at the end/beginning of a year the customers may change their telephone provider.
FIG. 1. Transition graph.
This means, for example, for the company C-tel that 70% of their customers stay at C-tel after 1 year,
20% change to A-tel and 10% to B-tel. The other transition rates can be interpreted in a similar way.
Let us suppose, also for simplicity, that these transition rates do not change during the next 5 (10, 20) years; what would be the distribution (percentages) of the customers among the companies at that time if at the beginning it were $(A_0, B_0, C_0) = (\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$ or $(A_0, B_0, C_0) = (30\%, 50\%, 20\%)$? Even if
students have not heard anything about Markov chains or transition matrices, they can handle this
problem easily by using a spreadsheet programme (e.g. EXCEL). One can establish the associated
recursions
$$\begin{aligned}
0.8A_n + 0.3B_n + 0.2C_n &= A_{n+1}\\
0.1A_n + 0.6B_n + 0.1C_n &= B_{n+1}\\
0.1A_n + 0.1B_n + 0.7C_n &= C_{n+1}
\end{aligned}$$
by looking at the transition graph and enter them as formulas. Especially for such iterative situations (problems), spreadsheet programmes are a very useful tool! Using the well-known dragging-down method, one can easily and quickly see the values after 5, 10, 20 years (using only a calculator would be much more cumbersome here). One will realize that the values quickly tend to $(A_n, B_n, C_n) \to (55\%, 20\%, 25\%)$, completely independent of the initial distribution $(A_0, B_0, C_0)$.
Spreadsheets are a wonderful tool for determining such limit distributions experimentally when there are only a few possible 'stations' (above: only 3). One does not need matrices or the theory behind them; one only needs very elementary knowledge of spreadsheets. The process in spreadsheet programmes is an iterative one, like the real determination of the PageRank in Google's practice. Therefore, using spreadsheets here is on the one hand a simple introduction, and on the other hand not so far away from the procedure in reality (iterative methods are used there too).
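For readers without a spreadsheet at hand, the same iteration can be written as a minimal sketch in Python (my choice of tool, not the article's); the rates are those of Fig. 1 and all names are mine:

```python
# Minimal sketch of the spreadsheet iteration, assuming the rates of Fig. 1.
def step(a, b, c):
    """One year: the recursions read off the transition graph."""
    return (0.8 * a + 0.3 * b + 0.2 * c,
            0.1 * a + 0.6 * b + 0.1 * c,
            0.1 * a + 0.1 * b + 0.7 * c)

for start in [(1 / 3, 1 / 3, 1 / 3), (0.30, 0.50, 0.20)]:
    a, b, c = start
    for _ in range(20):          # 20 years
        a, b, c = step(a, b, c)
    print(start, '->', (round(a, 4), round(b, 4), round(c, 4)))
# Both starting distributions end near (0.55, 0.20, 0.25).
```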
For dealing with such 'limit distributions' in a more detailed way (especially a few theoretical aspects), spreadsheets are not sufficient; we need 'transition matrices' (see Section 3).
2. Google and its founders
This section does not have mathematical content, but offers some general information about Google and its founders. When teaching 'PageRank' primarily as a mathematical phenomenon, I think it is also important to include items that lie outside mathematics and that are motivating to students.
There are many internet search engines that can 'comb through' the www in a split second. Google is a very famous and common one. The name 'Google' was selected to denote something very huge and big, matching the tremendous abundance and richness of the www. 'Google' is a modification of 'Googol', a word that was established by the American mathematician E. Kasner in 1938. It stands for the giant number $10^{100}$. One googol is much bigger than the number of atoms in our universe (about $10^{80}$); on the other hand, $10^{100} \approx 70!$, so it is approximately the number of possibilities to arrange 70 different things in a row. Here we again see the power of mathematical notation and how fast factorials grow: 70! is $10^{20}$ times as big as the number of atoms in our universe! Who would not intuitively say that there were far more atoms in our universe than possibilities to arrange 70 different things?
Google is the leading search engine worldwide (compared, e.g., to Yahoo, MSN, etc.). What is the reason for that; why could Google prevail? One of the reasons lies in the 'PageRank algorithm', which at the time made Google better than the others: results should come very fast, and the information should be relevant to the users so that they do not need to click through many different links. Especially when companies want to win many new customers (in a young market), it is very important to be better than the business rivals. Search engines should try hard to show good sites on the first page of the list because4 85% of users click only on sites on the first page of the list, and 77% of users make only one query (they do not change the words they are looking for). The criteria Google takes into account in order to rank the sites as well as possible are nowadays very manifold (about 200); the first and probably most important one was the PageRank.
Lawrence (Larry) Page (born in 1973) is a US-American computer scientist and co-founder of the internet search engine Google. He received his Master's degree in computer science at Stanford University. Together with his fellow student Sergej Michailowitsch Brin (born in Moscow in 1973), he created a prototype of an internet search engine for the www in 1996. None of the big companies (today rivals in business, e.g. Yahoo) were interested in the search engine they had programmed. Therefore, in 1998 they founded Google Inc. together, with initial funding of 100,000 US$ from Andy Bechtolsheim, a co-founder of Sun Microsystems. They had started to work on PhD theses but did not continue them after the foundation of Google. This, of course, is not surprising; they surely had, and have, other important things to do. Moreover, having gone public in 2004 (stock exchange), they became very rich: why should they work on a PhD thesis?
Although the actual algorithm used by Google is more complicated than what will be presented here (there are several other aspects and constraints), the main idea behind the mathematics of the PageRank algorithm is a very elementary one. We will focus only on the mathematical ideas of PageRank.
On the one hand, it is amazing that one can make so much money and establish something world-shaking, as L. Page and S. Brin did with Google, with such elementary ideas. On the other hand, herein lies a pleasant confirmation that basic mathematical ideas are very important (in this case for millions of users and, economically seen, for the founders of Google, its employees and shareholders).
4 According to a talk by Monika Henzinger, a former Google computer scientist, in December 2009.
When we state that the principal idea is an elementary one, we do not want to detract from the founders' achievement, quite the contrary! The transformation of a mathematically elementary idea into a programme that can handle thousands of queries in an acceptably short time is a really hard job and an excellent achievement (involving not only mathematical ideas but also many important computer science issues).
In 2008, Google's crawl encompassed over 1 trillion (= 1,000,000,000,000) URLs, a truly huge number! Every second Google answers very many queries5 in more than 100 'domains' and languages, and every user wants the result immediately, without waiting. Google wanted, and wants, an answer time of at most half a second; in most cases the time is much shorter. This very quick supply of results was one of the reasons for the success and popularity of Google in the 1990s. The business rivals needed a bit more time for answers and therefore were at a disadvantage.
Nowadays Google employs many software engineers, but the first steps were probably done by the
founders themselves, a great job! Reportedly L. Page and S. Brin want to stay in the Google company
at least until 2024. We just say ‘ad multos annos’!
For doing internet searches with Google a new verb ‘to google’6 has been established—also
in German: ‘googeln’. If somebody asks another person about a special word (term) and this person
does not know very much about it, one can often hear the hint: ‘Have you googled it already?’
At Wikipedia, one can read: ‘The verb to google (also spelled to Google) refers to using the
Google search engine to obtain information on the Web. A neologism arising from the popularity
and dominance of the eponymous search engine, the American Dialect Society chose it as the ‘‘most
useful word of 2002’’. It was officially added to the Oxford English Dictionary on 15 June 2006, and
to the 11th edition of the Merriam-Webster Collegiate Dictionary in July 2006. The first recorded
usage of google used as a verb was on 8 July 1998, by Larry Page himself, who wrote on a mailing list:
‘‘Have fun and keep googling!’’ ’
3. The www as a directed graph and the description by transition matrices
Search engines start their procedure by ‘combing through’ the www with a so-called spider or
webcrawler (special computer programme): which documents of the www include the word we are
interested in and looking for? One aim of this very large search process is to get a description of
the link structure between the sites of the www containing the word (item) looked up.7 Let us start
with a very simple example: A, B, C, D are four different sites that are linked to each other as shown in
Fig. 2. For example, there is a link from the site A to B and C, from the site B there are links to C and
D, etc.
Modelling assumption 1: For reasons of simplicity, we assume that every link on a site is used with the same probability.8 That means if there are two arrows leaving a site, each of them carries the probability 1/2; when there are 3 (in general: k) leaving arrows, each arrow carries the probability 1/3 (1/k). Thus, for clarity, we will omit the probability labels in the directed graphs.

5 Per day, there are on average 60,000,000 queries from adults in the USA alone, i.e. about 700 per second (Wills, 2006, p. 6). Chartier (2006, p. 17) writes that Google receives more than 3000 queries per second; I suppose that this, too, refers only to the USA.
6 Officially, it is not allowed to use 'to google' in the meaning of 'using any internet search engine'; only when using Google should one say 'to google'.
7 Which site (containing the word looked up) has links to which other one?
8 Of course, in reality this is not exactly the case; a conspicuous link at the top of the page is probably used more often than a 'small link' at the bottom. But these kinds of simplifications and idealisations are very typical for mathematical modelling: we have to make such simplifying assumptions in order to be able to use mathematics successfully.

FIG. 2. Four sites.
Now we can imagine, just as in the case of the telephone companies, that many users are in the system of the sites A, B, C, D, at the beginning with the relative frequencies $A_0, B_0, C_0, D_0$ (fractions, percentages; $A_0 + B_0 + C_0 + D_0 = 1$).
The change of a telephone company corresponds here to the change of an internet site. We again think of discrete steps in time: the users change sites in these time steps (by following the links), so that after $n$ time steps the distribution of the users is $A_n, B_n, C_n, D_n$. We can again read off the recursions
$$\begin{aligned}
C_n &= A_{n+1}\\
0.5A_n + 0.5D_n &= B_{n+1}\\
0.5A_n + 0.5B_n + 0.5D_n &= C_{n+1}\\
0.5B_n &= D_{n+1}
\end{aligned}$$
easily from the transition graph (Fig. 2). In order to check whether there is a limit distribution $(\bar A, \bar B, \bar C, \bar D)$ (i.e. a stable distribution in the long run) and, if so, what it looks like, we could again use spreadsheets.
But linear equation systems can also be described very comfortably using matrices and vectors:

$$\underbrace{\begin{pmatrix} 0 & 0 & 1 & 0 \\ 0.5 & 0 & 0 & 0.5 \\ 0.5 & 0.5 & 0 & 0.5 \\ 0 & 0.5 & 0 & 0 \end{pmatrix}}_{=:T}\; \underbrace{\begin{pmatrix} A_n \\ B_n \\ C_n \\ D_n \end{pmatrix}}_{=:\vec{x}_n} = \underbrace{\begin{pmatrix} A_{n+1} \\ B_{n+1} \\ C_{n+1} \\ D_{n+1} \end{pmatrix}}_{=\vec{x}_{n+1}}$$

The vectors $\vec{x}_n$ denote the distribution after $n$ time steps. All transitions $\vec{x}_n \to \vec{x}_{n+1}$ are given by the same matrix $T$ ('transition matrix'). Column $i$ contains the probabilities that a user on site $i$ changes to site $j$ in the next step by using a link $i \to j$ ($i, j = 1, \dots, 4$).

Somebody who is at the moment on site C must go to site A (probability 1) in the next step; this can be seen both in the transition graph and in the transition matrix (the probability 1 in column 3, row 1).

For the transitions we get: $T\vec{x}_0 = \vec{x}_1$, $T(T\vec{x}_0) = T^2\vec{x}_0 = \vec{x}_2$, ..., $T^n\vec{x}_0 = \vec{x}_n$.
When using matrices, we have the possibility of a direct formula for $\vec{x}_n$ (not only an iterative description as with spreadsheets).

A vector that has probabilities (relative frequencies, percentages) with sum 1 as entries is called a stochastic vector. A square matrix is called stochastic if its column vectors are stochastic. Transition matrices are of course stochastic: they are square, and in the first column there are the probabilities for a user at A landing at A, B, C, D in the next step; these numbers come from the interval [0, 1] and have sum 1 (analogously for the other columns).
4. How can one measure the relevance of a site?
On relevant sites one expects to read something informative, specific and worth knowing. Of course, a site s is the more relevant, the more sites link to s, especially when those links come from relevant sites themselves. But this does not yet say how to measure relevance. Which is the most relevant page in the above graph, which is the next most relevant, etc.? How can one determine the relevance of a site within a directed graph?

This question can be answered in different ways; Google has found its own answer, its own measure of the relevance of a page.
One can think of the following situation: many users are in the network (directed graph), say 1 million users, who surf the web at random for an unlimited time. What fraction (percentage) of them is at A, B, C, D in the long run? If it turns out that one special site attracts 90% of the users, then it is clear that this site is the most relevant and must be placed at the top of the list. These long-term fractions are one way to measure the relevance of a site, and for these fractions we need 'limit distributions'.
Let us assume that the users start to surf in the small web containing the four pages by chance and that the fractions of the users at the beginning are 1/4 each for A, B, C, D: $\vec{x}_0 = (0.25, 0.25, 0.25, 0.25)^t$. If they continue surfing and using the links by chance, the distribution in the next step will be $\vec{x}_1 = T\vec{x}_0 = (0.25, 0.25, 0.375, 0.125)^t$; taking another step yields $\vec{x}_2 = T\vec{x}_1 = T^2\vec{x}_0 = (0.375, 0.1875, 0.3125, 0.125)^t$. The sites A and C seem to have an advantage here. This is also plausible: all sites have a link to C, and from there one must go to page A. By multiplying with $T$ from the left, one gets the distributions $\vec{x}_n = T^n\vec{x}_0$; they converge to a 'limit distribution' $\vec{x}_n \to \vec{x}$, which is given by $\vec{x} = (3/9, 2/9, 3/9, 1/9)^t$. According to this, the sites A and C should be listed equally in first place, followed by B and D.
Such limit distributions can be determined at school in several ways (see the sketch after this list):

(1) Repeating the iteration with a spreadsheet programme until the values no longer change.
(2) Determining a high power $T^n$ of the matrix with a computer algebra system (CAS), so that $\vec{x}_n = T^n\vec{x}_0$ should be near the limit distribution.
(3) Looking for a vector $\vec{x}$ with component sum 1 that does not change under multiplication with $T$: $T\vec{x} = \vec{x}$. One has to solve a linear equation system, e.g. by CAS.9
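Ways (2) and (3) can likewise be sketched in a few lines of Python with NumPy (my choice of tool; any CAS would do), using the $4 \times 4$ matrix $T$ from above:

```python
# Sketch of ways (2) and (3) for the four-site example; NumPy assumed.
import numpy as np

T = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0, 0.0]])

# Way (2): a high power of T applied to some start distribution.
x0 = np.full(4, 0.25)
print(np.linalg.matrix_power(T, 20) @ x0)      # close to (3/9, 2/9, 3/9, 1/9)

# Way (3): solve T x = x together with the normalisation sum(x) = 1.
A = np.vstack([T - np.eye(4), np.ones(4)])     # 5 equations, 4 unknowns
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
x, *_ = np.linalg.lstsq(A, b, rcond=None)      # consistent, so exact solution
print(x)                                       # (1/3, 2/9, 1/3, 1/9)
```

The normalisation row appended in way (3) is needed because $(T - I)\vec{x} = \vec{0}$ alone is singular; the extra equation picks out the one stochastic solution.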
9 In terms of higher mathematics, one can speak of an eigenvector of $T$ for the eigenvalue 1. But in most cases students at school will not know these terms and the corresponding ideas; therefore we do not go into detail in this respect.
Problems that could occur: is it possible that there is more than one such limit distribution $\vec{x}$? If there are several, do the vectors (distributions) $\vec{x}_n$ sometimes converge to one and at other times to another (depending on the start distribution $\vec{x}_0$)? This would not fit our purpose, because we want to use this limit distribution as a neutral and stable basis for establishing a relevance ranking. A limit distribution that is not unique but depends on the start distribution would not be a good basis. It would be best if the limit distribution were unique and independent of the start distribution.
All three ways mentioned above to determine the limit distribution $\vec{x}$ are practicable for relatively low dimensions, as above for a $4 \times 4$ matrix, possibly also $20 \times 20$; but in the case of a $1{,}000{,}000 \times 1{,}000{,}000$ matrix (or more, as occurs in real Google searches) other methods are used: iterative algorithms that arrive at an approximate solution. They have to be very fast, because very many queries are given to Google every second and nobody wants to wait a long time for the result.
Regardless of whether or not Markov chains are dealt with explicitly, one special limit theorem is very important because it provides a simple condition on the transition matrix $T$ that guarantees the existence of the limit distribution $\vec{x}$, its uniqueness and its independence of the start distribution $\vec{x}_0$ (we use it without proof):

Limit Theorem: If $T$ is stochastic and $T^n$ contains only positive entries for some $n \geq 1$, then the limit matrix $L := \lim_{n\to\infty} T^n$ exists, is stochastic and has equal columns.10
It is clear that these columns then determine the unique limit distribution $\vec{x}$, independent of the start distribution $\vec{x}_0$: because of $A_0 + B_0 + C_0 + D_0 = 1$, in the case of a $4 \times 4$ matrix one gets for the limit distribution $\vec{x}$ with this limit matrix $L$ (independent of the concrete values of $A_0, B_0, C_0, D_0$):

$$\vec{x} = \underbrace{\begin{pmatrix} t_1 & t_1 & t_1 & t_1 \\ t_2 & t_2 & t_2 & t_2 \\ t_3 & t_3 & t_3 & t_3 \\ t_4 & t_4 & t_4 & t_4 \end{pmatrix}}_{L}\; \underbrace{\begin{pmatrix} A_0 \\ B_0 \\ C_0 \\ D_0 \end{pmatrix}}_{\vec{x}_0} = \begin{pmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \end{pmatrix}$$

Of course we have $\sum t_i = 1$ because $L$ is stochastic.
In our example, $T$ itself does not have only positive entries, but $T^5$ does. So the convergence and the independence of the start distribution are guaranteed by the above limit theorem. In our case we determine, for example, $T^{20}$ with a CAS (four decimal places). Here $\vec{x} = (3/9, 2/9, 3/9, 1/9)^t$, the limit distribution mentioned above, can easily be seen:

$$T^{20} = \begin{pmatrix} 0.3333 & 0.3333 & 0.3333 & 0.3333 \\ 0.2222 & 0.2222 & 0.2222 & 0.2222 \\ 0.3333 & 0.3333 & 0.3333 & 0.3333 \\ 0.1111 & 0.1111 & 0.1111 & 0.1111 \end{pmatrix}$$
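Continuing the Python sketch above ($T$ and NumPy as defined there), the hypothesis of the limit theorem and the limit matrix can be checked directly:

```python
# Continuing the sketch above: check the hypothesis and the limit matrix.
print(np.all(np.linalg.matrix_power(T, 5) > 0))    # True: T^5 is positive
print(np.linalg.matrix_power(T, 20).round(4))      # all columns (nearly) equal
```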
10 That means the entries are constant in each row. This theorem need not be proved at school; one can simply use it for understanding the PageRank algorithm. The other fine points of the theory around it need not be dealt with at school either. For a proof of a corresponding theorem see, for example, Kemeny & Snell (1976, 69ff). The conditions could even be weakened: it suffices that there exists one row with only positive entries in some power of $T$.
FIG. 3. Small network.
5. A slightly more complicated example: the general case
The link structure of a still very small network consisting of six internet sites is shown in Fig. 3. The transition matrix can again be read off easily (matrix $T$ below):

$$T = \begin{pmatrix} 0 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 1/3 & 0 & 0 & 0 \\ 1/2 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 1/2 & 0 & 1/2 \\ 0 & 0 & 0 & 1/2 & 1 & 0 \end{pmatrix}$$
This is a new situation: from site B there is no arrow leaving; there are no links on this page. Within the process of surfing one could call this a dead end or sink. We can see this also in the second column of the matrix $T$: it contains only zeros. This is really bad for our purposes (stochastic matrix, column sums should be 1). What will one do in such a situation if this happens while surfing the net?
There are several possibilities:
(a) Stop surfing and stay at site B; in the matrix this would mean replacing the second zero in the second column by 1, and in the directed graph we would have to add an arrow from B to itself. We will not choose this possibility.
(b) One could go back one step in the browser and then use another link (hopefully not leading to a dead end again). One would have to distinguish from which page one came to B, which would make things rather complicated. We will not choose this possibility either.
(c) We decide in favour of another alternative: one leaves this site, coming back to the list (which we think of as not yet ranked), and clicks one of the many other sites at random.
We want to formulate this explicitly as the following assumption.

Modelling assumption 2: When we come to a dead end during the surfing process, we go back to the list and click one of the m possible sites by chance, each with the same probability 1/m.

Here we do not take into account that one will probably not click on the same page again11 (if there are really many sites, it makes no big difference whether we 'take off' the site or not): we replace the entries in the second column (zeros) by 1/6 (in general: 1/m if there are m web sites). Instead of the zero column, we write the m-dimensional column vector $(\tfrac{1}{m}, \dots, \tfrac{1}{m})^t$ and get the matrix $T_1$:
$$T_1 = \begin{pmatrix} 0 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 1/3 & 0 & 0 & 0 \\ 1/2 & 1/6 & 0 & 0 & 0 & 0 \\ 0 & 1/6 & 1/3 & 0 & 0 & 1/2 \\ 0 & 1/6 & 0 & 1/2 & 0 & 1/2 \\ 0 & 1/6 & 0 & 1/2 & 1 & 0 \end{pmatrix}$$
So we can get a stochastic transition matrix T1 even though there are dead ends in the structure of the
network.
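Modelling assumption 2 is easy to automate: one scans the matrix for zero columns and replaces each of them by the uniform column. A minimal NumPy sketch (variable names are mine, not the article's):

```python
# Sketch: repairing dead ends (zero columns), as in modelling assumption 2.
import numpy as np

T = np.array([[0,   0, 1/3, 0,   0, 0  ],
              [1/2, 0, 1/3, 0,   0, 0  ],
              [1/2, 0, 0,   0,   0, 0  ],
              [0,   0, 1/3, 0,   0, 1/2],
              [0,   0, 0,   1/2, 0, 1/2],
              [0,   0, 0,   1/2, 1, 0  ]])

m = T.shape[0]
T1 = T.copy()
dead_ends = T1.sum(axis=0) == 0      # columns summing to 0 (here: column 2)
T1[:, dead_ends] = 1 / m             # replace them by (1/m, ..., 1/m)^t
print(T1.sum(axis=0))                # every column now sums to 1
```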
Question: what about a site to which (instead of from which) no link exists? Would this also be as bad?
Modelling assumption 3: From the experience with the dead-end situation we can say: even if a page is not a dead end, it is quite possible that one does not follow the links on the page but goes back to the list and clicks another page (at random). Let us assume that one follows the links on a page with probability $\alpha$ and goes back to the list to take a new chance with probability $1-\alpha$ (the pages then being taken at random, each with probability 1/m). How can this scenario be described mathematically? What does the new transition matrix $U$ look like in this situation?

When following the links of a page, the transition matrix is given by $T_1$.
What must the transition matrix look like in the case of going back to the list and taking a new chance (a problem for students)? Because the sites are chosen randomly with probability 1/m in this case, the next distribution must be $(1/m, \dots, 1/m)^t$; that means the transition matrix has to be

$$T_2 = \begin{pmatrix} 1/m & \cdots & 1/m \\ \vdots & & \vdots \\ 1/m & \cdots & 1/m \end{pmatrix} \quad\text{because}\quad \begin{pmatrix} 1/m & \cdots & 1/m \\ \vdots & & \vdots \\ 1/m & \cdots & 1/m \end{pmatrix} \underbrace{\begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix}}_{\sum x_i = 1} = \begin{pmatrix} 1/m \\ \vdots \\ 1/m \end{pmatrix}.$$
In combination, we get the new transition matrix $U$ by weighting these two cases with the factors $\alpha$ and $1-\alpha$, respectively:

$$U = \underbrace{\alpha\, T_1}_{\text{with probability } \alpha \text{: following the links}} + \underbrace{(1-\alpha)\, T_2}_{\text{with probability } 1-\alpha \text{: 'new start'}} \qquad (1)$$

11 By the way, it can really happen that a particular site is clicked a second time although one has already seen it and did not want to open it again.
It is easy to see (a problem for students) that because $T_1$ and $T_2$ are stochastic matrices, $U$ is also stochastic.

The crucial attribute of $U$: the matrix $U$ has only positive entries, no zeros any more. According to the above limit theorem, this transition matrix therefore provides the desired easy case of a unique limit distribution independent of the start distribution. This limit distribution can give us a ranking of the pages concerning their relevance (→ 'PageRank').
Which value should we take for $\alpha$? It is known that Google used $\alpha = 0.85$ for a long time; possibly nowadays Google uses another value. For the example above, we get the solution of the linear equation system $U\vec{x} = \vec{x}$ ($\alpha = 0.85$; $x_1 + \dots + x_6 = 1$, $x_i \geq 0$; CAS; 4 decimal places):

$$(x_1, x_2, x_3, x_4, x_5, x_6)^t = (0.0517, 0.0737, 0.0574, 0.1999, 0.2686, 0.3487)^t.$$
We get the same result if we determine a high power of $U$ and then take one of its columns; also by using spreadsheets we would get the same result. According to it, the sites ranked by relevance would be:

site 6 → site 5 → site 4 → site 2 → site 3 → site 1.
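Continuing the sketch from above ($T_1$ and $m$ as defined there), the matrix $U$ and the resulting ranking can be computed in a few more lines; the printed values should reproduce the ones given in the text:

```python
# Continuing the sketch above: build U and solve U x = x with sum(x) = 1.
alpha = 0.85
T2 = np.full((m, m), 1 / m)              # the 'back to the list' matrix
U = alpha * T1 + (1 - alpha) * T2        # stochastic and strictly positive

A = np.vstack([U - np.eye(m), np.ones(m)])
b = np.append(np.zeros(m), 1.0)
x, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x.round(4))          # approx. (0.0517, 0.0737, 0.0574, 0.1999, 0.2686, 0.3487)
print(np.argsort(-x) + 1)  # ranking: 6, 5, 4, 2, 3, 1
```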
This describes in general how the matrix $U$ is created within the PageRank algorithm12 in an elementary way. We cannot deal here with the question of how the algorithm arrives at a solution of $U\vec{x} = \vec{x}$ ($\sum x_i = 1$, $x_i \geq 0$) in Google's huge practice; there the solution is derived in an iterative way and is an approximate one.
6. The connection to ‘recursive definitions’
In 2009, I published an article in German on this topic (Humenberger, 2009); afterwards I received an email from a student in Germany writing a paper for a final exam: 'I am struggling with the question how one can derive the formula (2) which is well known from the www; could you please help me?' This is the reason why I wrote this section.
Indeed, in several articles one can find a definition of the PageRank algorithm as an example of a so-called 'recursive definition'. For example, in the German Wikipedia one can read (translated):

The principle of the PageRank algorithm is that each site has a weight (PageRank) which is the higher, the more sites with a high weight link to this site. The weight $PR_i$ of a site $i$ is determined by the weights of the sites $j$ that have a link to site $i$. If there are $C_j$ different links on site $j$, then the weight $PR_j$ is divided into $C_j$ equal parts. The following recursive formula can be seen as a definition of the PageRank algorithm:

$$PR_i = \frac{1-d}{N} + d \sum_{j:\,j\to i} \frac{PR_j}{C_j} \qquad (2)$$

Here, $N$ is the total number of sites and $d$ is a damping factor between 0 and 1. With this damping factor, a small part ($1-d$) of the weight of every page is taken off and distributed uniformly over all the sites covered by the algorithm. This is necessary because the weight should not flow off to sites that have no further links.
12 For deeper mathematics on this topic see Chartier (2006), Langville & Meyer (2006), Wills (2006).
In what respect does this definition fit our explanations above? If we insert (1) into $U\vec{x} = \vec{x}$, we get (see above):

$$\vec{x} = (1-\alpha)\begin{pmatrix} 1/m \\ \vdots \\ 1/m \end{pmatrix} + \alpha\, T_1\, \vec{x}$$

Now we consider component $i$ of this equation (the components $x_i$ are, in the new notation, $PR_i$):

$$x_i = \frac{1-\alpha}{m} + \alpha \sum_j P(j\to i)\, x_j\,,$$

where the $P(j\to i)$ are the entries in row $i$ of the transition matrix $T_1$, i.e. the probabilities with which a user comes from site $j$ to site $i$ when following the links. If there is no link from site $j$ to site $i$, we have $P(j\to i) = 0$; otherwise $P(j\to i)$ is exactly $\frac{1}{C_j}$, where $C_j$ is the number of outgoing links on site $j$. Taking this into account and setting $d = \alpha$ and $N = m$, we have illustrated the analogy between this recursive definition and our explanations above.
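The recursive definition (2) can also be turned directly into a fixed-point iteration, which is essentially the kind of iterative procedure mentioned above. A self-contained sketch in plain Python, with the six-site web encoded as a dictionary of outgoing links (my own encoding) and dead ends handled as in modelling assumption 2:

```python
# Sketch: iterating the recursive definition (2) for the six-site example.
links = {1: [2, 3], 2: [], 3: [1, 2, 4], 4: [5, 6], 5: [6], 6: [4, 5]}
N, d = 6, 0.85
pr = {i: 1 / N for i in links}                    # uniform start

for _ in range(100):
    new = {}
    for i in links:
        # sum over all sites j that link to i, each contributing PR_j / C_j:
        s = sum(pr[j] / len(links[j]) for j in links if i in links[j])
        # dead ends (C_j = 0) spread their weight uniformly (assumption 2):
        s += sum(pr[j] / N for j in links if not links[j])
        new[i] = (1 - d) / N + d * s
    pr = new

print({i: round(v, 4) for i, v in pr.items()})
# approx. {1: 0.0517, 2: 0.0737, 3: 0.0574, 4: 0.1999, 5: 0.2686, 6: 0.3487}
```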
7. An explicit solution (formula)

When we insert (1) into $U\vec{x} = \vec{x}$ and use matrix notation, we can manipulate the equation to obtain an explicit formula for the limit distribution $\vec{x}$ (here $I$ denotes the $m$-dimensional identity matrix):

$$\alpha\, T_1\, \vec{x} + (1-\alpha)\underbrace{T_2\, \vec{x}}_{=(1/m,\dots,1/m)^t} = I\vec{x} \;\Rightarrow\; (\alpha T_1 - I)\,\vec{x} = (\alpha - 1)\begin{pmatrix} 1/m \\ \vdots \\ 1/m \end{pmatrix}$$

$$\Rightarrow\; \vec{x} = (\alpha - 1)\,(\alpha T_1 - I)^{-1}\begin{pmatrix} 1/m \\ \vdots \\ 1/m \end{pmatrix}.$$
One can show that the matrix $\alpha T_1 - I$ is not singular, so we always have a unique solution.13 However, Google cannot use this explicit formula in practice, because there the linear equation systems are very big (e.g. m = 1,000,000 or more), and in such high dimensions determining the inverse matrix is a very hard and time-consuming job. In practice, iterative and approximate algorithms are used to arrive at a solution.
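For the small example, the explicit formula can be checked directly (continuing the NumPy sketch, with T1, m and alpha as before):

```python
# Continuing the sketch above: the explicit formula for the limit distribution.
e = np.full(m, 1 / m)
x = (alpha - 1) * np.linalg.inv(alpha * T1 - np.eye(m)) @ e
print(x.round(4))    # the same limit distribution as before
```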
8. Summary and reflections
Three elementary modelling assumptions (see above) had enormous effects. These ideas are simultaneously elementary and ingenious; they guarantee that the algorithm 'always works'.14 Of course, this modelling process is not meant to be carried out by the students themselves (autonomous work), but with the teacher's help they get to know a very up-to-date application of mathematics [see also Voskoglu (1995); for a German paper dealing with Markov chains at school, stressing that this topic can be a bridge between analysis, linear algebra and stochastics, see e.g. Wirths (1997)].
13 This follows also from the limit theorem above.
14 One has to admit: in practice the algorithm is more complicated, but its 'mathematical heart' is a very elementary one (see above) and can be understood by students.
In the discussion document for ICMI Study 20 it is written that 'we need to make the use of mathematics in modern society more visible' (Damlamian & Sträßer, 2009, p. 527). This example realizes that aim in a very good way, and it is also an example of good teaching practice: internet search engines like Google are surely very important parts of modern society; they are used by very many people every day.
All over the world we can detect the so-called relevance paradox of mathematics: mathematics is used more and more extensively in modern society (mobile phones, internet, electronic cash, cars, computers, insurance, CD players, networks, etc.; one could give hundreds of examples!) and therefore becomes more and more relevant for us as a society; in other words, mathematics surely is a so-called key technology for our future. But in many cases the mathematics behind these things is very complex and cannot be understood by non-specialists. And for merely using these things, we have to admit, understanding is not necessary. In some sense, then, mathematics becomes less and less relevant for the individual. This is one reason why many people see the relevance of mathematics less and less. Therefore, in mathematics education we should come up with examples that show the relevance of mathematics in a striking and elementary way. I think this article shows such an example.
Consequences for teaching: we should foster examples that show the use of mathematics, for society and for individuals. Mathematical modelling (with or without autonomous work by students) can be a way of doing that successfully. For teaching in this way it is important that students have basic knowledge of several mathematical fields (above: matrices, vectors, using computers, etc.) and that it is permissible to mention and use single unproved mathematical theorems (above: the limit theorem), especially when they lead to interesting mathematical phenomena. Of course, this does not mean that reasoning in mathematics education is unimportant!
In almost all preambles to syllabuses it is stressed that 'cross-linking fields' is something desirable. Most researchers and teachers in the field of mathematics education also say that the teaching and learning process should more often give the opportunity for cross-linking mathematical topics. In this example, we have a very good chance of cross-linking stochastics (probabilities, etc.), linear algebra (vectors, matrices, etc.) and analysis (limits, etc.). Besides, dealing with this topic provides an opportunity for a reasonable use of computers in mathematics education (spreadsheet programmes, CASs).
This example may give reason for motivation and surprise: with such elementary ideas one can establish something world-shaking and earn very much money.15 Therefore it may serve as a sort of advertisement for mathematics: a great career is possible by cleverly using both elementary and ingenious ideas. We also have an affirmation of the fact that basic ideas are still important!

15 Again: establishing a company like Google and the real algorithms are not elementary things, but the basic mathematical idea of the PageRank is elementary.
REFERENCES
CHARTIER, T. P. (2006) Googling Markov. UMAP J., 27, 17–30.
DAMLAMIAN, A. & STRÄßER, R. (2009) ICMI Study 20: educational interfaces between mathematics and industry.
Discussion Document. Zentralblatt für Didaktik der Mathematik, 41, 525–533.
HUMENBERGER, H. (2009) Das Google-Page-Rank-System – Mit Markoff-Ketten und linearen Gleichungssystemen Ranglisten erstellen. mathematik lehren, 154, 58–63.
KEMENY, J. G. & SNELL, J. L. (1976) Finite Markov chains. Undergraduate Texts in Mathematics. New York:
Springer.
LANGVILLE, A. N. & MEYER, C. D. (2006) Google’s PageRank and Beyond: The Science of Search Engine
Rankings. Princeton: Princeton University Press.
VOSKOGLU, M. G. (1995) Use of absorbing Markov chains to describe the process of mathematical modelling: a
classroom experiment. Int. J. Math. Educ. Sci. Technol., 26, 759–763.
WILLS, R. S. (2006) Google’s PageRank: The Math Behind the Search Engine. Math. Intelligencer, 28, 6–11.
WIRTHS, H. (1997) Markow-Ketten – Brücke zwischen Analysis, linearer Algebra und Stochastik. Mathematik in
der Schule, 35, 601–613.