A classification algorithm for finding the optimal
rank aggregation method
Sibel Adalı
Rensselaer Polytechnic Institute
Troy, New York 12180
Email: [email protected]

Malik Magdon-Ismail
Rensselaer Polytechnic Institute
Troy, New York 12180
Email: [email protected]

Brandeis Marshall†
Purdue University
West Lafayette, Indiana 47907
Email: [email protected]

This work was partially supported by the National Science Foundation under grants IIS-0324947, CNS-0323324, EIA-0091505 and IIS-9876932. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

† This work was done while the author was a PhD student at Rensselaer Polytechnic Institute.
Abstract—In this paper, we develop a classification algorithm
for finding the optimal rank aggregation algorithm. The input features for the classification are measures of noise and
misinformation in the rankers. The optimal ranking algorithm
varies greatly with respect to these two factors. We develop two
measures to compute noise and misinformation: cluster quality
and rank variance. Further, we develop a cost based decision
method to find the least risky aggregator for a new set of ranked
lists and show that this decision method outperforms any static
rank aggregation method through rigorous experimentation.
I. INTRODUCTION
Rank aggregation refers to the problem of finding a combined ordering for objects given a set of rankings obtained
from different rankers. Rank aggregation is a frequently used
method in meta-search applications as well as in many other
domains where objects are ordered with respect to different
criteria. Based on the assumption that rankers are imperfect,
many aggregation methods [3], [4], [6], [10] have been introduced in the literature to best reflect the correct information
available in the rankers and disregard the irrelevant information or noise. In previous work [1], [2], we provided a principled analysis of aggregation methods with respect to two specific factors: noise, which includes spam, and misinformation. Noise
refers to the case where the mistakes made by rankers cause
some local perturbations of the rankings. Misinformation on
the other hand refers to a more fundamental mistake made
by the rankers such as using the wrong ranking function.
Noise is very common as rankers may be imperfect or are not
personalized. Misinformation can correspond to a difference
of opinion or a malicious agent that is trying to mislead
the rank aggregation algorithm. In general, one can cancel
out noise by using more and more rankers. However, the
best way to deal with misinformation is to disregard the
rankers with differences of opinion or treat each opinion
separately. In Section IV we describe how we model noise and
misinformation in our experimental setup. In [1], we show that
the choice of the best aggregation method may differ greatly
based on the amount of noise and misinformation available
in the rankers. As a result, if different queries have different
levels of noise and misinformation, then the best algorithm
needs to be chosen specifically for each query.
In this paper, we expand on this finding and develop a
classification method to find the optimal rank aggregation
algorithm dynamically. In contrast with previous work in
learning how to rank [8], [9], we concentrate on learning
the correct algorithm, not the correct parameters for a given algorithm. In this sense, our work is complementary to this type of work, as it reduces the number of parameters considered in training a specific algorithm. Most of the previous work trains a single, specific algorithm; however, we have already shown that committing to one algorithm can be suboptimal [1]. Furthermore, the algorithms considered in our work are general purpose, avoiding the problems caused by overfitting to the training set. In addition, our work requires similar or simpler training data, making it cheaper to use. Our main
contributions in this paper are: (1) We develop two measures
called cluster quality and rank variance to measure the degree
of noise and misinformation available in the rankers. (2) We
develop a cost based classification method that takes into
consideration the cost of making a mistake as well as the
probability of making the right choice. (3) Within a controlled
simulation setting, we show through rigorous experimentation that our method improves performance over a system that uses a static aggregation method.
II. RELATED WORK
In this paper, we deal with the rank aggregation problem
where only the ranks are available or are used. We review
the related work for this problem only. A great deal of the
previous work in rank aggregation concentrates on finding
an optimal algorithm for a specific scenario. In addition
to simple aggregators like average and median, algorithms
that optimize for the Kendall-tau performance measure have
become popular in recent years. These algorithms try to find
a ranking with the minimal Kendall-tau distance (number of
pairwise disagreements in the ordering of pairs of objects)
to the input ranked lists. The earliest such algorithm is
given by Dwork et al. [6], where multiple Markov chain
approximations to this problem are given as well as simple
localized search methods to improve the results. It has also
been shown that finding the ranked list with the minimal
Kendall-tau distance can be reduced to the minimal feedback
arc set problem [4]. Ailon et al. [3] develop the FAS-Pivot algorithm and its variations to approximately solve the minimum feedback arc set problem. The FAS-Pivot algorithm is identical to
the Condorcet-fuse algorithm [10] developed independently.
Recently, we introduced a global search algorithm called
IBF [2] to find the Kendall-tau optimal ranking. Other well-known rank aggregation algorithms consider different signals instead of pairwise disagreements. The CombMNZ algorithm [7] looks at the number of times an object appears in all ranked lists and
multiplies this by the sum of the rankings. The PrOpt [2]
algorithm uses only the number of appearances in all lists
but breaks ties with respect to the average ranker. We have
recently shown [2] that all these algorithms perform well under
different noise and misinformation conditions; hence, the best
algorithm very much depends on the specific application.
Another approach to rank aggregation is to learn the best
aggregation by learning the importance of different factors
based on user feedback or training data. In this line of work, a specific aggregation algorithm is already chosen. Joachims [8] uses clickthrough data to learn the importance of different factors in ranking. These factors are typically text-based factors that may not always be available for applications where only the ranks are available and no other properties of the objects are known. However, click data provides a very useful way of obtaining user preference information that can be used in training. Recently, Liu et al. [9] offer a supervised training
method for the Markov chains introduced in [6]. We note
that different algorithms mentioned above consider different
signals in rank aggregation such as pairwise orderings of
objects, the rank values, the statistical properties of ranks of
the same objects across rankers, the number of rankers that
rank a specific object, etc. Hence, an ideal learning algorithm
needs to consider the effectiveness of all these different signals
for an effective aggregation. To our knowledge, there is no
work that considers the wide range of algorithms and their
corresponding signals in rank aggregation that we consider in
this paper.
III. METHODOLOGY
Suppose there are n objects denoted by O = {o1 , . . . , on }
that exist in a database and can be queried. A ranker r is a
ranked list of objects from O. Let rank(r, o) denote the rank
of object o in ranker r with 1 being the highest rank. If object
o is not indexed by r, then rank(r, o) is undefined for this
object. DB(r) is the set of objects indexed by r. Given as input a set of rankers {r1, . . . , rs}, the aggregate is a ranker rA of the objects in DB(r1) ∪ . . . ∪ DB(rs), based on their ranks in {r1, . . . , rs}. The rank aggregation problem is that of finding
an aggregate ranker that optimizes a performance measure. We
assume both the input and aggregate rankers are of length K in
this paper and use the two well-known performance measures:
precision and Kendall-tau. Precision returns the number of
common objects between two lists. Kendall-tau returns the
total number of pairwise disagreements between two lists. In cases where an object o in ranker r1 does not appear in ranker r2, we assign it a default rank of K + 1 in r2.
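For concreteness, the following is a minimal Python sketch of the two performance measures as defined above; the function names and the list representation of a top-K ranking are our own choices, not part of the paper's implementation.

def precision(list1, list2):
    """Number of objects the two top-K lists have in common."""
    return len(set(list1) & set(list2))

def kendall_tau(list1, list2, K):
    """Total number of pairwise disagreements between two top-K lists.
    Objects missing from a list receive the default rank K + 1."""
    objects = list(set(list1) | set(list2))
    rank1 = {o: (list1.index(o) + 1) if o in list1 else K + 1 for o in objects}
    rank2 = {o: (list2.index(o) + 1) if o in list2 else K + 1 for o in objects}
    disagreements = 0
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            a, b = objects[i], objects[j]
            # the pair (a, b) is a disagreement if the two lists order it differently
            if (rank1[a] - rank1[b]) * (rank2[a] - rank2[b]) < 0:
                disagreements += 1
    return disagreements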
Given a set of rankers, we first try to estimate the amount
of noise and misinformation they contain. As estimating misinformation is impossible using only the ranking information,
we instead find whether there is an asymmetry between the
rankers using clustering. Below we describe our methodology
for estimating noise and misinformation.
A. Clustering of Rankers
Given a set of rankers, r1 , . . . , rs , we use a greedy approach
to cluster the rankers into C disjoint clusters. The objective is
to cluster rankers with the highest similarity together. Assume
that for every pair of rankers (ri , rj ), sim(i, j) refers to the
given measure of similarity between the rankers. We assume
sim(i, j) is a normalized measure. In the following, we use
average precision as our similarity measure. However, it is also possible to use (1 − τ′) as the measure of similarity, where τ′ is a normalized version of Kendall-tau. To cluster, we
use single link clustering where we initially place each ranker
in a different cluster. We then merge clusters with the highest
similarity until there are exactly C clusters. The similarity
between two clusters C1 and C2 is given by the average of all
pairwise similarity computations.
sim_{AVG}(C_1, C_2) = \frac{1}{|C_1| \cdot |C_2|} \sum_{r_i \in C_1, r_j \in C_2} sim(i, j)    (1)
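The following sketch illustrates this greedy merging step in Python; it assumes a precomputed, normalized similarity function sim(i, j) over ranker indices and is only meant to mirror Equation (1), not the authors' implementation.

def cluster_rankers(num_rankers, sim, C):
    """Greedily merge rankers into C clusters using the average pairwise similarity."""
    clusters = [[i] for i in range(num_rankers)]   # start with singleton clusters

    def sim_avg(c1, c2):                           # Equation (1)
        return sum(sim(i, j) for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > C:
        # merge the pair of clusters with the highest average similarity
        a, b = max(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda pair: sim_avg(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters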
1) Identification of Misinformation and Noise: Once the rankers have been grouped into C clusters, we must evaluate the quality of these rankers as represented by the clusters. We use cluster quality to determine whether there is an asymmetry between the rankers, as a way of detecting misinformation, and variance of ranks to determine noise. As our clustering method places similar rankers in the same cluster, we measure how close the clustered rankers are. To accomplish this, we compute two values: intra-cluster width and inter-cluster distance. The intra-cluster width measures how far apart the rankers in a single cluster are. The smaller the width, the better the quality of that specific cluster. The inter-cluster distance measures how far apart a pair of clusters are. The larger the distance, the better the cluster quality. We combine these two measures to compute the overall quality of the clusters as discussed below.
a) Distance within clusters (width w): Let ri , rj be two
rankers. Let C denote the set of rankers grouped together in a
cluster and |C| represent the size of this cluster. We compute
for all distinct pairs of rankers (ri , rj ∈ C), the width of the
cluster as follows:
width_C = \frac{\sum_{i,j=1,\, i \neq j}^{|C|} (1 - sim(r_i, r_j))}{|C| \cdot (|C| - 1)}    (2)
b) Distance across clusters (distance dist): Let C1 and C2 be two clusters, let rL ∈ C1 denote a ranker in cluster C1, and let rR ∈ C2 denote a ranker in cluster C2. We compute, over all pairs of rankers (rL, rR) from the two clusters, the average distance across the two clusters as follows:

dist_{LR} = \frac{\sum_{L=1}^{|C_1|} \sum_{R=1}^{|C_2|} (1 - sim(r_L, r_R))}{|C_1| \cdot |C_2|}    (3)

c) Cluster quality: Suppose there are C clusters. Let wi, wj be the widths of two clusters Ci, Cj, respectively, and let distij be the distance between these clusters. We combine these values over every distinct pair of clusters into the cluster quality Q as follows:

Q(C) = \sum_{i=1}^{|C|} \sum_{j=i+1}^{|C|} \frac{w_i + w_j}{2 \cdot dist_{ij}}    (4)

Both the width and distance are normalized on a scale [0, 1], where values closer to 0 denote higher similarity. For example, when all rankers are identical and there are two clusters, the width of each cluster is 0 and the distance between these clusters is 0. In order to handle the zero case (when distance = 0 or width = 0), we assign an insignificant value epsilon = 0.0001. The epsilon removes the divide-by-zero case. When both width and distance are zero, the cluster quality becomes 1, which is the worst-case scenario. The cluster quality is considered high (good) when the Q value is low. In addition, epsilon is used for any zero width, which may happen for clusters of size greater than 1. Note that Q would be high if some of the rankers have significantly different rankings.
d) Variance of ranks: While the cluster quality is used to measure the misinformation, we introduce the variance of ranks measure to compute the amount of noise in the rankers. Let n be the number of distinct objects in the input rankers, and let rank(rj, oi) be the rank of oi in ranker rj. For each cluster C, we compute the mean of the variance in the ranks of an object over all the rankers in cluster C. We then average this value over all the objects. This constitutes the average variance for this cluster. We then average these values over all the clusters. Note that if an object is not ranked by a ranker, then we assign it the default rank of K + 1, where K is the current retrieval size. If two rankers are missing the same objects, these objects reduce the variance measure, indicating an agreement between the two rankers, in line with the intended meaning of this measure.
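A Python sketch of the two measures as we read Equations (2)-(4) and the text above; the epsilon handling and the K + 1 default rank follow the description, while the data layout (clusters as lists of ranker ids and a ranks dictionary keyed by (ranker, object)) is our own assumption.

import itertools
from statistics import mean, pvariance

EPS = 0.0001  # "insignificant" value used when a width or distance is zero

def width(cluster, sim):
    """Average dissimilarity between distinct rankers in one cluster (Eq. 2)."""
    if len(cluster) < 2:
        return EPS
    pairs = list(itertools.combinations(cluster, 2))
    w = sum(1 - sim(a, b) for a, b in pairs) / len(pairs)
    return w if w > 0 else EPS

def distance(c1, c2, sim):
    """Average dissimilarity between rankers across two clusters (Eq. 3)."""
    d = sum(1 - sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
    return d if d > 0 else EPS

def cluster_quality(clusters, sim):
    """Cluster quality Q (Eq. 4); low values indicate well separated clusters."""
    q = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            wi, wj = width(clusters[i], sim), width(clusters[j], sim)
            q += (wi + wj) / (2 * distance(clusters[i], clusters[j], sim))
    return q

def rank_variance(clusters, ranks, objects, K):
    """Mean over clusters of the mean per-object rank variance within the cluster."""
    def cluster_var(cluster):
        return mean(pvariance([ranks.get((r, o), K + 1) for r in cluster])
                    for o in objects)
    return mean(cluster_var(c) for c in clusters)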
B. Decision Algorithm

We now describe how we use the above measures in our
decision algorithm. Suppose we are given a set of queries
D (dataset) containing a set of rankers and a set of rank
aggregation algorithms A. We assume the dataset has labels
that represent the correct ordering of objects for each query.
Using this information, we determine the performance of each
algorithm for each query using either average precision or
Kendall-tau. Furthermore, each query has a label that represents a specific noise and misinformation class. Note that we do not need to know how high or low the noise and misinformation are in each class; we just want to be able to differentiate
between different classes. The following algorithm shows how
we compute the average performance of each aggregator and
construct a set of values for cluster quality and rank variance
to be used in constructing classifiers.
 1: function constructClassifiers(D, A)
 2:   X = total number of noise/misinformation (n/m) class labels
 3:   Y = |A|
 4:   perf = array[X, Y], Q = array[X], Var = array[X]
 5:   for each query {(r1, . . . , rs) : l} in D {l is the identifier for the n/m label of this query} do
 6:     Find the cluster quality x1 and variance of ranks x2 for r1, . . . , rs
 7:     Q[l] = Q[l] ∪ {x1}, Var[l] = Var[l] ∪ {x2}
 8:     for each aggregator a ∈ A do
 9:       let p be the performance of a with respect to r1, . . . , rs using average precision or Kendall-tau
10:       perf[l, a] = perf[l, a] + p
11:   for each l, a do
12:     perf[l, a] = perf[l, a]/|Q[l]| {Find the average performance of each aggregator for each class}
13:   for each l do
14:     Find the classifier for l using Q[l] and Var[l] {See discussion below}
15:   return classifiers for each l and perf
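For concreteness, the same training procedure in Python; the dictionary-based layout and the features, measure and fit_gaussian parameters are our own choices (fit_gaussian stands for the per-class classifier construction discussed below).

import numpy as np

def construct_classifiers(D, A, features, measure, fit_gaussian):
    """D: list of (rankers, label) training queries; A: list of aggregators.
    features(rankers) -> (cluster quality x1, rank variance x2) of the query.
    measure(a, rankers) -> performance of aggregator a on the query.
    fit_gaussian(points) -> a density estimate for one n/m class."""
    Q, Var, perf, counts = {}, {}, {}, {}
    for rankers, l in D:
        x1, x2 = features(rankers)
        Q.setdefault(l, []).append(x1)
        Var.setdefault(l, []).append(x2)
        counts[l] = counts.get(l, 0) + 1
        for a in A:
            perf[(l, a)] = perf.get((l, a), 0.0) + measure(a, rankers)
    for (l, a) in perf:
        perf[(l, a)] /= counts[l]            # average performance per n/m class
    classifiers = {l: fit_gaussian(np.column_stack([Q[l], Var[l]])) for l in Q}
    return classifiers, perf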
e) Finding classifiers: Note that the aim of a classifier
is to return the appropriate noise and misinformation n/m
label for a given cluster quality x1 and variance of ranks x2
values. Hence, given a new set of rankers r1 , . . . , rs with an
unknown n/m class, we will determine the correct class using
these two values and then we will use the best aggregator
for this class given the above matrix perf we just computed.
First, we construct a cost matrix Cost(li, lj) that determines the cost of misclassifying label li as label lj. Let ai be the top
aggregator for label li and aj be the top aggregator for label
lj . Then, Cost(li , lj ) = perf [li , ai ] − perf [li , aj ]. In other
words, if we misclassify label li as lj , we will be using
the top aggregator aj instead of the correct aggregator ai .
We are given Cost(li , li ) = 0 by definition. The cost of
misclassifying label li is the difference in the performance
of the two aggregators for this class. We normalize the cost matrix such that \sum_j Cost(l_i, l_j) = 1.
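Under our reading of this definition, and assuming a higher-is-better performance measure such as average precision (for Kendall-tau the argmax below would become an argmin), the cost matrix can be built as follows.

import numpy as np

def build_cost_matrix(labels, A, perf):
    """Cost(li, lj): performance lost on class li by using lj's top aggregator,
    with each row normalized to sum to 1."""
    best = {l: max(A, key=lambda a: perf[(l, a)]) for l in labels}
    cost = np.zeros((len(labels), len(labels)))
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            cost[i, j] = perf[(li, best[li])] - perf[(li, best[lj])]
        if cost[i].sum() > 0:
            cost[i] /= cost[i].sum()         # normalize the row to sum to 1
    return cost, best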
Given this cost matrix and the set of x1 and x2 values for a specific label l, we construct a two-dimensional Gaussian distribution for each label based on the mean (µ) and covariance matrix (Σ) of the values in Q[l] (for x1) and Var[l] (for x2). We compute the probability of \vec{x} = ⟨x1, x2⟩ belonging to label l using the formula:

prob(\vec{x} \mid l) = \frac{1}{2\pi (\det \Sigma)^{1/2}} e^{-\frac{1}{2}(\vec{x} - \mu)' \Sigma^{-1} (\vec{x} - \mu)}
We are actually interested in prob(l \mid \vec{x}). By Bayes' theorem, we can write this as:

prob(l \mid \vec{x}) = prob(\vec{x} \mid l) \cdot \frac{prob(l)}{prob(\vec{x})}

As prob(\vec{x}) is common to all the classes, it is simply a scaling factor. Similarly, we assume that each class is equally likely in our simulation; hence, prob(l) = 1/X. As a result, we can simply write:

prob(l \mid \vec{x}) \propto prob(\vec{x} \mid l)
As we want to find the most likely class, we simply use the prob(\vec{x} \mid l) values to order the classes. As a result, we do not need to compute the actual value of prob(l \mid \vec{x}); hence, we use prob(\vec{x} \mid l) in our class decision computations instead of prob(l \mid \vec{x}). We compute the best class for a point \vec{x} using the decision cost given below:

DecisionCost(l) = \sum_j prob(\vec{x} \mid l_j) \cdot Cost(l, l_j)
The class with the lowest decision cost is chosen and the top
performing aggregator for that class is executed. This allows
us to choose not the most likely aggregator, but the aggregator
with the lowest risk. The cost of computing the cluster quality
and rank variance measures is O(ns^2), where n is the number of objects returned by all the rankers and s is the number of rankers. The cost of finding the correct class is O(l^2), where l is the number of noise/misinformation classes.
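Putting the pieces together, the following sketch fits one two-dimensional Gaussian per class and picks the class with the lowest decision cost; SciPy's multivariate_normal is our substitute for the closed-form density above, and the label ordering is assumed to match the rows and columns of the cost matrix.

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(points):
    """Fit a 2-D Gaussian to the (cluster quality, rank variance) training points."""
    return multivariate_normal(mean=points.mean(axis=0),
                               cov=np.cov(points, rowvar=False))

def choose_class(x, classifiers, labels, cost):
    """Return the label with the lowest decision cost for the feature vector x."""
    p = np.array([classifiers[l].pdf(x) for l in labels])     # prob(x | l_j)
    decision_cost = cost.dot(p)   # DecisionCost(l_i) = sum_j prob(x | l_j) * Cost(l_i, l_j)
    return labels[int(np.argmin(decision_cost))]

The top aggregator recorded for the chosen class is then run on the input rankers.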
IV. EXPERIMENTAL SETUP
To be able to test the decision method for different noise
and misinformation classes and for large datasets, we created
a statistical framework which has perfect information about
the performance of each aggregator and the specific noise and
misinformation scenario. To apply our methods to real data
sets, we need to first cluster the datasets based on the cluster
quality and rank variance values. Each cluster will represent
a specific noise and misinformation scenario. We then need
to identify the performance of aggregators in each scenario.
To accomplish this, we can use the clickthrough data [8] that
estimates the correct or preferred ordering of objects based on
the order of clicks. Testing the effectiveness of our method
with real data and clustering is part of ongoing work.
In our statistical framework, there is a ground truth which
corresponds to the correct ranking of the objects in the
database. Each ranker is a perturbation of the ground truth.
The ground truth ranker computes a score for a number of
factors denoted by f1, . . . , fm. The features are positive scores that correspond to the relevance of the object for a specific criterion. For example, in a web search application, the objects are web pages and the features can be keyword occurrence, recency of web page updates, or retrieval frequency.
In addition, the ground truth assigns positive weights to each
feature w = hw1 , . . . , wm i where each weight describes
the importance of the corresponding feature in determining
relevance. The score of an object is determined by a linear
combination method score(r, o) = w_1 · o.f_1 + . . . + w_m · o.f_m, and objects are ranked in decreasing order of their scores.
In this paper, we assume that the scores for each factor for
the ground truth are generated uniformly at random and are
independent of each other.
Each ranker tries to estimate the same factors but makes
randomly distributed mistakes. These mistakes correspond to
the noise and are added to the true scores for each factor.
In this paper, the errors made by each ranker for each factor
and for each object are independent of each other. We also
assume a specific distribution for the noise that models spam.
We define spam as the case of very low scored objects in
ground truth achieving high rank in the rankers. According to
this model, the mistakes made in the scores of good objects
are relatively small in magnitude. However, occasionally bad objects w.r.t. the ground truth are given a very high score, elevating
them to high ranks. The full details of the statistical framework
can be found in [1]. Note that for each data set that we
generate, we have a ground truth and a number of rankers.
When aggregating, we only use the rankers as input, assuming the ground truth is not known. We then evaluate
the performance of the aggregation method against the ground
truth. While in information retrieval, the objects are generally
labelled as relevant or irrelevant, the ground truth gives us
more detailed information. It allows us to correctly identify
the top 10 objects for the precision measure, and it also provides
the correct ordering of objects for the Kendall-tau performance
measure. Furthermore, we are able to generate a statistically
significant number of datasets for each specific test scenario
using our model. In real data sets, it is generally very hard
to understand what specific noise and misinformation model
they correspond to. As a result, the statistical model provides
us with a good way to analyze different algorithms.
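The exact distributions are specified in [1]; the sketch below only illustrates the overall shape of the generator (uniform factor scores, a weighted linear ground truth, Gaussian noise on the scores, occasional spam elevation, and reversed weights for misinformed rankers), with all numerical parameters chosen by us for illustration.

import numpy as np

def generate_query(n_objects=1000, weights=(1/15, 2/15, 3/15, 4/15, 5/15),
                   n_rankers=15, n_misinformed=2, noise_var=0.1,
                   spam_prob=0.01, K=10, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array(weights)
    features = rng.uniform(size=(n_objects, len(w)))      # uniform factor scores
    ground_truth = np.argsort(-(features @ w))[:K]        # correct top-K ordering

    rankers = []
    for r in range(n_rankers):
        wr = w[::-1] if r < n_misinformed else w          # reversed weights = misinformation
        scores = features @ wr + rng.normal(0.0, np.sqrt(noise_var), n_objects)
        spam = rng.random(n_objects) < spam_prob          # occasionally elevate bad objects
        scores[spam] += scores.max()
        rankers.append(np.argsort(-scores)[:K])           # each ranker's top-K list
    return ground_truth, rankers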
In our tests, we use 5 factors and 15 rankers. We vary the noise variance between 0.01, 0.1, 0.5 and 0.75. The ground truth uses the weight function w = ⟨1/15, 2/15, 3/15, 4/15, 5/15⟩. The rankers use either the correct weight function above or the reverse weight function wr = ⟨5/15, 4/15, 3/15, 2/15, 1/15⟩. We model misinformation by using the incorrect weight function.
Fig. 1. The best rankers in each noise/misinformation case for Kendall-tau. (The figure is a grid over low/high noise and no/high misinformation; the top two aggregators in each case are drawn from PrOpt, CombMNZ, Pg, PgADJ, PgIBF, MeIBF, RndIBF, MeADJ, Me, Av and CFuse.)
We test three misinformation cases nMI ∈ {0, 2, 7}: for nMI = 0, all rankers have the correct weights and there is no misinformation; for nMI = 2 and nMI = 7, two and seven of the rankers, respectively, have the incorrect weights and the remaining rankers have the correct weights. When the weights used by the rankers are incorrect, the information lost about each factor cannot be regained by aggregating more rankers. Furthermore, in the cases nMI = 2 and 7, there is an asymmetry between the rankers.
In our tests, we compare the performance of a wide range of
rank aggregation algorithms. In addition to the average (Av),
median (Me), CombMNZ [7], PrOpt [2] (precision optimal)
described in Section II, we test a number of aggregators that
aim to reduce the number of pairwise disagreements (i.e. the
Kendall-tau measure) between the aggregate ranker and the
input rankers. ADJ [6] (adjacent pairs) is a localized search
method that checks adjacent pairs and flips them if there is an
improvement in the overall Kendall-tau error. IBF [2] (iterative
best flip) is a global search method that finds the best flip for
each object, forces a flip even if there is a temporary increase
in the error and finds the lowest error configuration among all
visited. Both ADJ and IBF can be applied to the output of
any of the other aggregation algorithms. Dwork et al. [6] also
introduce a number of Markov-chain based aggregators. We
introduce a version of the Pagerank [5] (Pg) algorithm that
approximates the Markov chain M4 previously used in other
studies [11]. Details of our implementation can be found in [2].
Finally, Condorcet Fuse (CFuse) also aims to optimize the
Kendall-tau by simply approximating the ordering of objects
that agrees with the majority of the rankers for each pair.
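As an illustration of the local search idea behind ADJ (our own minimal sketch, not the authors' implementation), one can repeatedly flip adjacent objects in the aggregate whenever the flip lowers the summed Kendall-tau error against the input rankers.

def adj_local_search(aggregate, rankers, kendall_tau, K):
    """Flip adjacent pairs while doing so reduces the total Kendall-tau error."""
    agg = list(aggregate)
    improved = True
    while improved:
        improved = False
        for i in range(len(agg) - 1):
            candidate = agg[:i] + [agg[i + 1], agg[i]] + agg[i + 2:]
            old_err = sum(kendall_tau(agg, r, K) for r in rankers)
            new_err = sum(kendall_tau(candidate, r, K) for r in rankers)
            if new_err < old_err:
                agg, improved = candidate, True
    return agg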
In our previous work, we have shown that median is best when the noise is low and the majority of the rankers have the correct weights, while average is best when there is no misinformation but there is noise. When noise and misinformation are both present, different algorithms do well depending on the specific case and the performance measure used.
Figure 1 shows the top performing rankers for the Kendall-tau
performance measure for the scenario we study in this paper.
The top line in the figure is the top performing aggregator(s)
and the second line is the second best aggregator in each
scenario. We can see a large variety of algorithms performing
well under certain conditions and not so well in others. The
general trend shows that algorithms that disregard information
such as PrOpt, Pg and IBF are robust to noise. But they do
not do well in case of low noise. When there is asymmetry
between rankers, Me, MeADJ, MeIBF do well depending
on the noise. As misinformation increases further and the
majority of the rankers contain misinformation (not shown in
this paper), there is a greater need to incorporate information
from the rankers, and Av starts to do better than Me. As noise and misinformation both increase, more robust algorithms are
needed. One interesting side note is that better Kendall-tau
optimizers such as IBF tend to be more robust to noise as
they contain less information about the actual rank values.
A. Results
To show that the classification method is effective, we
first train the classifiers by running 40,000 data sets in each
noise/misinformation (n/m) setting. We also determine the best
aggregators in each n/m setting and use this information to
construct the cost matrix. We also determine the overall best
aggregators by averaging the rank of each aggregator in each
n/m setting. In a static scenario, the optimal choice would
be to use a static aggregator that performs well overall. We
run 40,000 data sets for each n/m setting, use the classifier
to find the correct class and use the top aggregator for the
estimated class. We compare the performance of the estimated top aggregator with that of the overall best aggregators used as static aggregators. We also compare the performance of our method to
the optimal classifier, one which always finds the correct class
for each dataset. Note that, when optimizing for precision,
precision is used for all computations. The same is true for
Kendall-tau performance measure. We have found that the correct class is identified in almost 75% of the cases for almost all classes, except for the high misinformation/asymmetry cases. In these cases, there is a big performance difference between the top aggregator and the rest. As a result, the risk of misidentifying
these cases is very high and the classifier tends to choose a
safer option. Figure 2 below shows the results for average
precision over all the classes and the top 3 overall aggregators
(high values are desirable) as well as the Kendall-tau results
(low values are desirable). As we can see, the cost-based
classification method that we introduce is able to perform
better than any of the static aggregators except for PrOpt. In
the test cases that we tried, PrOpt is the optimal ranker for
precision in a large number of cases. Hence, the best possible improvement in this case is very small. However, for Kendall-tau, we show significant improvement in performance over all
the classes.
         (a) Average Precision                      (b) Average Kendall-tau results
  optimal classification       8.5518        optimal classification       18.869
  cost-based classification    8.5364        cost-based classification    19.091
  CombMNZ                      8.3385        Pg                           21.313
  PrOpt                        8.5360        MeIBF                        19.384
  PgADJ                        8.4466        CombMNZ                      22.115

Fig. 2. The performance comparison of the cost-based classifier.
V. CONCLUSIONS
In this paper, we present a method to classify a given set
of rankers into a specific noise and misinformation scenario.
We have shown that by using this classifier, we are able to
estimate the noise and misinformation class correctly. By using
the top aggregator for each scenario, we are able to improve
the performance over any existing static aggregation method.
REFERENCES
[1] S. Adalı, B. Hill, and M. Magdon-Ismail. Information vs. robustness
in rank aggregation: Models, algorithms and a statistical framework for
evaluation (to appear). In JDIM, Special Issue on Web Information
Retrieval.
[2] S. Adalı, B. Hill, and M. Magdon-Ismail. The impact of ranker quality on
rank aggregation algorithms: Information vs robustness. In Proceedings
of the International Workshop on the Challenges in Web Information
Retrieval and Integration, pages 10–19, 2006.
[3] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent
information: ranking and clustering. In Proceedings of the ACM
Symposium on Theory of Computing (STOC), pages 684–693, 2005.
[4] J. J. Bartholdi, C. A. Tovey, and M. A. Trick. Voting schemes for
which it can be difficult to tell who won the election. Social Choice
and Welfare, 6(2):157–165, 1989.
[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web
search engine. In Proceedings of ACM WWW, pages 107–117, 1998.
[6] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation
methods for the web. In Proceedings of ACM WWW, pages 613–622,
2001.
[7] J. Fox and E. Shaw. Combination of multiple sources: The trec-2
interactive track matrix experiment. In Proceedings of ACM SIGIR,
1994.
[8] T. Joachims. Optimizing search engines using clickthrough data. In
Proceedings of ACM SIGKDD, pages 133–142, 2002.
[9] Y.-T. Liu, T.-Y. Liu, T. Qin, Z.-M. Ma, and H. Li. Supervised rank
aggregation. In Proceedings of ACM WWW, pages 481–489, 2007.
[10] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval.
In Proceedings of ACM CIKM, pages 538–548, 2002.
[11] M. E. Renda and U. Straccia. Web metasearch: Rank vs. score based
rank aggregation methods. In Proceedings of ACM SAC, pages 841–846,
2003.