A classification algorithm for finding the optimal rank aggregation method

Sibel Adalı and Malik Magdon-Ismail, Rensselaer Polytechnic Institute, Troy, New York 12180. Email: [email protected], [email protected]
Brandeis Marshall†, Purdue University, West Lafayette, Indiana 47907. Email: [email protected]

† This work was done while the author was a PhD student at Rensselaer Polytechnic Institute.

This work was partially supported by the National Science Foundation under grants IIS-0324947, CNS-0323324, EIA-0091505 and IIS-9876932. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Abstract—In this paper, we develop a classification algorithm for finding the optimal rank aggregation algorithm. The input features for the classification are measures of noise and misinformation in the rankers. The optimal ranking algorithm varies greatly with respect to these two factors. We develop two measures to compute noise and misinformation: cluster quality and rank variance. Further, we develop a cost-based decision method to find the least risky aggregator for a new set of ranked lists and show, through rigorous experimentation, that this decision method outperforms any static rank aggregation method.

I. INTRODUCTION

Rank aggregation refers to the problem of finding a combined ordering for objects given a set of rankings obtained from different rankers. Rank aggregation is a frequently used method in meta-search applications as well as in many other domains where objects are ordered with respect to different criteria. Based on the assumption that rankers are imperfect, many aggregation methods [3], [4], [6], [10] have been introduced in the literature to best reflect the correct information available in the rankers and disregard the irrelevant information or noise.

In previous work [1], [2], we provided a principled analysis of aggregation methods for two specific criteria: noise, which includes spam, and misinformation. Noise refers to the case where the mistakes made by rankers cause local perturbations of the rankings. Misinformation, on the other hand, refers to a more fundamental mistake made by the rankers, such as using the wrong ranking function. Noise is very common, as rankers may be imperfect or not personalized. Misinformation can correspond to a difference of opinion or to a malicious agent that is trying to mislead the rank aggregation algorithm. In general, one can cancel out noise by using more and more rankers. However, the best way to deal with misinformation is to disregard the rankers with differences of opinion or to treat each opinion separately. In Section IV we describe how we model noise and misinformation in our experimental setup.

In [1], we show that the choice of the best aggregation method may differ greatly based on the amount of noise and misinformation present in the rankers. As a result, if different queries have different levels of noise and misinformation, then the best algorithm needs to be chosen specifically for each query. In this paper, we expand on this finding and develop a classification method to find the optimal rank aggregation algorithm dynamically. In contrast with previous work on learning how to rank [8], [9], we concentrate on learning the correct algorithm, not the correct parameters for a given algorithm.
In this sense, our work is complementary to this type of work, as it reduces the number of parameters considered in training a specific algorithm. Most of the previous work uses a specific algorithm in training. Furthermore, the algorithms considered in our work are general purpose, avoiding the problems caused by overfitting to the training set. However, we have already shown that such an approach can be suboptimal [1]. Furthermore, our work requires similar or simpler training data, making it cheaper to use.

Our main contributions in this paper are: (1) we develop two measures, called cluster quality and rank variance, to measure the degree of noise and misinformation present in the rankers; (2) we develop a cost-based classification method that takes into consideration the cost of making a mistake as well as the probability of making the right choice; (3) within a controlled simulation setting, we show through rigorous experimentation that our method is able to provide an improvement in performance over a system that uses a static aggregation method.

II. RELATED WORK

In this paper, we deal with the rank aggregation problem where only the ranks are available or are used, and we review the related work for this problem only. A great deal of the previous work in rank aggregation concentrates on finding an optimal algorithm for a specific scenario. In addition to simple aggregators like average and median, algorithms that optimize for the Kendall-tau performance measure have become popular in recent years. These algorithms try to find a ranking with the minimal Kendall-tau distance (the number of pairwise disagreements in the ordering of pairs of objects) to the input ranked lists. The earliest such algorithm is given by Dwork et al. [6], where multiple Markov chain approximations to this problem are given, as well as simple localized search methods to improve the results. It has also been shown that finding the ranked list with the minimal Kendall-tau distance can be reduced to the minimum feedback arc set problem [4]. Ailon et al. [3] develop the FAS-Pivot algorithm and its variations to approximately solve the minimum feedback arc set problem. The FAS-Pivot algorithm is identical to the Condorcet-fuse algorithm [10], developed independently. Recently, we introduced a global search algorithm called IBF [2] to find the Kendall-tau optimal ranking.

Other well-known rank aggregation algorithms consider different signals instead of pairwise disagreements. The CombMNZ algorithm [7] looks at the number of times an object appears in all ranked lists and multiplies this by the sum of the rankings. The PrOpt [2] algorithm uses only the number of appearances in all lists but breaks ties with respect to the average ranker. We have recently shown [2] that all these algorithms perform well under different noise and misinformation conditions; hence the best algorithm very much depends on the specific application.

Another approach to rank aggregation is to learn the best aggregation by learning the importance of different factors based on user feedback or training data. In this line of work, a specific aggregation is already chosen. Joachims [8] uses clickthrough data to learn the importance of different factors in ranking. These factors are typically text-based factors that may not always be available for applications where only the ranks are available and no other property of the objects is known. However, click data provides a very useful way of obtaining user preference information that can be used in training.
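To make the Kendall-tau distance used above concrete, the following sketch counts pairwise disagreements between two ranked lists represented as object-to-rank dictionaries. The representation, the function name, and the K + 1 default rank for missing objects (defined formally in Section III) are our illustrative choices, not code from the paper.

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b, K):
    """Count pairwise disagreements between two rankings.

    rank_a, rank_b: dicts mapping object id -> rank (1 = highest).
    Objects missing from a list receive the default rank K + 1.
    """
    objects = set(rank_a) | set(rank_b)
    ra = lambda o: rank_a.get(o, K + 1)
    rb = lambda o: rank_b.get(o, K + 1)
    disagreements = 0
    for o1, o2 in combinations(objects, 2):
        # A pair counts as a disagreement if the two lists order it differently.
        if (ra(o1) - ra(o2)) * (rb(o1) - rb(o2)) < 0:
            disagreements += 1
    return disagreements

# Example: two length-3 lists over four objects (K = 3) -> distance 2.
print(kendall_tau_distance({"a": 1, "b": 2, "c": 3},
                           {"b": 1, "a": 2, "d": 3}, K=3))
```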
Recently, Liu et al. [9] offered a supervised training method for the Markov chains introduced in [6]. We note that the different algorithms mentioned above consider different signals in rank aggregation, such as pairwise orderings of objects, the rank values, the statistical properties of the ranks of the same objects across rankers, the number of rankers that rank a specific object, etc. Hence, an ideal learning algorithm needs to consider the effectiveness of all these different signals for an effective aggregation. To our knowledge, there is no work that considers the wide range of algorithms and their corresponding signals in rank aggregation that we consider in this paper.

III. METHODOLOGY

Suppose there are n objects denoted by O = {o_1, ..., o_n} that exist in a database and can be queried. A ranker r is a ranked list of objects from O. Let rank(r, o) denote the rank of object o in ranker r, with 1 being the highest rank. If object o is not indexed by r, then rank(r, o) is undefined for this object. DB(r) is the set of objects indexed by r. Given as input a set of rankers {r_1, ..., r_s}, the aggregate is a ranker r_A of the objects in DB(r_1) ∪ ... ∪ DB(r_s), based on their ranks in {r_1, ..., r_s}. The rank aggregation problem is that of finding an aggregate ranker that optimizes a performance measure. In this paper we assume both the input and aggregate rankers are of length K, and we use two well-known performance measures: precision and Kendall-tau. Precision returns the number of common objects between two lists. Kendall-tau returns the total number of pairwise disagreements between two lists. In the case where an object o in ranker r_1 does not appear in ranker r_2, we assign it a default rank of K + 1 in r_2.

Given a set of rankers, we first try to estimate the amount of noise and misinformation they contain. As estimating misinformation is impossible using only the ranking information, we instead determine whether there is an asymmetry between the rankers using clustering. Below we describe our methodology for estimating noise and misinformation.

A. Clustering of Rankers

Given a set of rankers r_1, ..., r_s, we use a greedy approach to cluster the rankers into C disjoint clusters. The objective is to cluster rankers with the highest similarity together. Assume that for every pair of rankers (r_i, r_j), sim(i, j) refers to the given measure of similarity between the rankers. We assume sim(i, j) is a normalized measure. In the following, we use average precision as our similarity measure. However, it is also possible to use (1 − τ') as the measure of similarity, where τ' is a normalized version of Kendall-tau. To cluster, we use single-link clustering, where we initially place each ranker in a different cluster. We then merge the clusters with the highest similarity until there are exactly C clusters. The similarity between two clusters C_1 and C_2 is given by the average of all pairwise similarity computations:

\mathrm{sim}_{AVG}(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{r_i \in C_1,\, r_j \in C_2} \mathrm{sim}(i, j) \quad (1)

1) Identification of Misinformation and Noise: Once the rankers have been grouped into C clusters, we must evaluate the quality of the rankers as represented by the clusters. We use cluster quality to determine whether there is an asymmetry between the rankers, as a way of determining misinformation, and the variance of ranks to determine noise. As our clustering method places similar rankers in the same cluster, we measure how close the clustered rankers are.
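As an illustration of the greedy clustering step, here is a minimal Python sketch, assuming the pairwise similarities sim(i, j) have already been computed (e.g., as average precision). The data structures and names are ours, not the authors' implementation.

```python
import itertools

def cluster_rankers(sim, s, C):
    """Greedily merge s rankers into C clusters (Section III-A).

    sim: dict mapping an (i, j) pair of ranker indices to a normalized
         similarity in [0, 1]; s: number of rankers; C: target cluster count.
    """
    clusters = [{i} for i in range(s)]

    def pair_sim(i, j):
        return sim[(i, j)] if (i, j) in sim else sim[(j, i)]

    def avg_sim(c1, c2):
        # Equation (1): average of all pairwise ranker similarities.
        total = sum(pair_sim(i, j) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > C:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] |= clusters[b]
        del clusters[b]
    return clusters
```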
To accomplish this, we compute two values: the intra-cluster width and the inter-cluster distance. The intra-cluster width measures how far apart the rankers within a single cluster are; the smaller the width, the better the quality of that cluster. The inter-cluster distance measures how far apart a pair of clusters are; the larger the distance, the better the cluster quality. We combine these two measures to compute the overall quality of the clusters, as discussed below.

a) Distance within clusters (width w): Let r_i, r_j be two rankers. Let C denote the set of rankers grouped together in a cluster and |C| represent the size of this cluster. We compute, over all distinct pairs of rankers r_i, r_j ∈ C, the width of the cluster as follows:

\mathrm{width}_C = \frac{1}{|C|\,(|C|-1)} \sum_{i,j=1,\, i \ne j}^{|C|} \bigl(1 - \mathrm{sim}(r_i, r_j)\bigr) \quad (2)

b) Distance across clusters (distance dist): Let C_1 and C_2 be two clusters. Let r_L ∈ C_1 be a ranker grouped in cluster C_1 and r_R ∈ C_2 be a ranker grouped in cluster C_2. We compute, over all pairs of rankers (r_L, r_R) from the two clusters, the average distance across the two clusters as follows:

\mathrm{dist}_{LR} = \frac{1}{|C_1|\,|C_2|} \sum_{L=1}^{|C_1|} \sum_{R=1}^{|C_2|} \bigl(1 - \mathrm{sim}(r_L, r_R)\bigr) \quad (3)

c) Cluster quality: Suppose there are C clusters. Let w_i, w_j be the widths of two clusters C_i, C_j, respectively, and dist_ij be the distance between these clusters. We can now compute the similarity between every distinct pair of clusters, denoted as the cluster quality Q, as follows:

Q(C) = \sum_{i=1}^{|C|} \sum_{j=i+1}^{|C|} \frac{w_i + w_j}{2 \cdot \mathrm{dist}_{ij}} \quad (4)

Both the width and the distance are normalized on a scale [0, 1], where values closer to 0 denote higher similarity. For example, when all rankers are identical and there are two clusters, the width of each cluster is 0 and the distance between the clusters is 0. In order to handle the zero case (when distance = 0 or width = 0), we assign an insignificant ε = 0.0001; the epsilon removes the divide-by-zero case. When both the width and the distance are zero, the cluster quality becomes 1, which is the worst-case scenario. The cluster quality is considered high if the Q value is low. In addition, epsilon is used for any zero width, which may happen for clusters of size greater than 1. Note that Q would be high if some of the rankers have significantly different rankings.

d) Variance of ranks: While the cluster quality is used to measure misinformation, we introduce the variance of ranks measure to compute the amount of noise in the rankers. Let n be the number of distinct objects in the input rankers, and let rank(r_j, o_i) be the rank of o_i in ranker r_j. For each cluster C, we compute the variance in the ranks of an object over all the rankers in cluster C. We then average this value over all the objects; this constitutes the average variance for the cluster. Finally, we average these values over all the clusters. Note that if an object is not ranked by a ranker, we assign it the default rank of K + 1, where K is the current retrieval size. If two rankers are missing the same objects, those objects reduce the variance measure, indicating an agreement between the two rankers, in line with the intended meaning of this measure.

B. Decision Algorithm

We now describe how we use the above measures in our decision algorithm. Suppose we are given a set of queries D (the dataset), each containing a set of rankers, and a set of rank aggregation algorithms A. We assume the dataset has labels that represent the correct ordering of objects for each query.
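The following sketch shows how Equations (2)-(4) and the rank-variance measure might be computed. It assumes a dissimilarity callback equal to 1 − sim, uses the population variance (the paper does not specify which variance estimator is used), and applies the paper's ε = 0.0001 to zero widths and distances; all names are ours.

```python
import statistics

EPS = 1e-4  # the paper's epsilon for zero widths/distances

def width(cluster, dissim):
    """Equation (2): average pairwise dissimilarity within a cluster."""
    pairs = [(i, j) for i in cluster for j in cluster if i != j]
    if not pairs:
        return 0.0
    return sum(dissim(i, j) for i, j in pairs) / len(pairs)

def distance(c1, c2, dissim):
    """Equation (3): average dissimilarity across two clusters."""
    return sum(dissim(i, j) for i in c1 for j in c2) / (len(c1) * len(c2))

def cluster_quality(clusters, dissim):
    """Equation (4): sum of (w_i + w_j) / (2 * dist_ij) over cluster pairs."""
    w = [max(width(c, dissim), EPS) for c in clusters]
    q = 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = max(distance(clusters[i], clusters[j], dissim), EPS)
            q += (w[i] + w[j]) / (2.0 * d)
    return q

def rank_variance(clusters, rankers, objects, K):
    """Average, over clusters and objects, of the per-object rank variance.

    rankers[r] maps object -> rank; missing objects get the default rank K + 1.
    """
    per_cluster = []
    for c in clusters:
        per_object = [statistics.pvariance(
                          [rankers[r].get(o, K + 1) for r in c])
                      for o in objects]
        per_cluster.append(sum(per_object) / len(per_object))
    return sum(per_cluster) / len(per_cluster)
```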
Using this information, we determine the performance of each algorithm for each query using either average precision or Kendall-tau. Furthermore, each query has a label that represents a specific noise and misinformation class. Note that we do not need to know how high or low the noise and misinformation are in each class; we just need to be able to differentiate between different classes. The following algorithm shows how we compute the average performance of each aggregator and construct the sets of cluster quality and rank variance values used in constructing the classifiers.

1: function constructClassifiers(D, A)
2:   X = total number of noise/misinformation (n/m) class labels
3:   Y = |A|
4:   perf = array[X, Y], Q = array[X], Var = array[X]
5:   for each query {(r_1, ..., r_s) : l} in D do   {l is the identifier for the n/m label of this query}
6:     Find the cluster quality x_1 and variance of ranks x_2 for r_1, ..., r_s
7:     Q[l] = Q[l] ∪ {x_1}, Var[l] = Var[l] ∪ {x_2}
8:     for each aggregator a ∈ A do
9:       let p be the performance of a with respect to r_1, ..., r_s using average precision or Kendall-tau
10:      perf[l, a] = perf[l, a] + p
11:  for each l, a do
12:    perf[l, a] = perf[l, a] / |Q[l]|   {find the average performance of each aggregator for each class}
13:  for each l do
14:    Find the classifier for l using Q[l] and Var[l]   {see discussion below}
15:  return the classifiers for each l and perf

e) Finding classifiers: The aim of a classifier is to return the appropriate noise and misinformation (n/m) label for given cluster quality x_1 and variance of ranks x_2 values. Hence, given a new set of rankers r_1, ..., r_s with an unknown n/m class, we will determine the correct class using these two values and then use the best aggregator for this class, given the matrix perf computed above.

First, we construct a cost matrix Cost(l_i, l_j) that determines the cost of misclassifying label l_i as label l_j. Let a_i be the top aggregator for label l_i and a_j be the top aggregator for label l_j. Then Cost(l_i, l_j) = perf[l_i, a_i] − perf[l_i, a_j]. In other words, if we misclassify label l_i as l_j, we will be using the top aggregator a_j instead of the correct aggregator a_i; the cost of misclassifying label l_i is the difference in the performance of the two aggregators for this class. By definition, Cost(l_i, l_i) = 0. We normalize the cost matrix such that \sum_j Cost(l_i, l_j) = 1.

Given this cost matrix and the set of x_1 and x_2 values for a specific label l, we construct a two-dimensional Gaussian distribution for each label based on the mean (µ) and covariance matrix (Σ) of the values in Q[l] (for x_1) and Var[l] (for x_2). We compute the probability of \vec{x} = ⟨x_1, x_2⟩ belonging to label l using the formula:

\mathrm{prob}(\vec{x} \mid l) = \frac{1}{2\pi \,(\det \Sigma)^{1/2}} \, e^{-\frac{1}{2} (\vec{x} - \mu)^{T} \Sigma^{-1} (\vec{x} - \mu)}

We are actually interested in prob(l | x). By Bayes' theorem, we can write this as:

\mathrm{prob}(l \mid \vec{x}) = \mathrm{prob}(\vec{x} \mid l) \cdot \frac{\mathrm{prob}(l)}{\mathrm{prob}(\vec{x})}

As prob(x) is common to all the classes, it is simply a scaling factor. Similarly, we assume that each class is equally likely in our simulation; hence prob(l) = 1/X. As a result, we can simply write:

\mathrm{prob}(l \mid \vec{x}) \propto \mathrm{prob}(\vec{x} \mid l)

As we want to find the most likely class, we simply use the prob(x | l) values to order the classes. As a result, we do not need to compute the actual value of prob(l | x); hence we use prob(x | l) in our class decision computations instead of prob(l | x).
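A minimal sketch of the classifier-fitting step, assuming NumPy: one two-dimensional Gaussian is fit per n/m label from the training (x_1, x_2) pairs, and its density gives prob(x | l). The small regularization term added to Σ is our addition to keep the covariance matrix invertible; it is not part of the paper's method.

```python
import numpy as np

def fit_class_gaussians(Q, Var):
    """Fit a 2-D Gaussian per n/m label from training (x1, x2) pairs.

    Q[l], Var[l]: lists of cluster-quality and rank-variance values for label l.
    Returns {l: (mu, Sigma)}.
    """
    params = {}
    for l in Q:
        X = np.column_stack([Q[l], Var[l]])                  # shape (m, 2)
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(2)   # keep Sigma invertible
        params[l] = (mu, Sigma)
    return params

def class_density(x, mu, Sigma):
    """prob(x | l) for a 2-D Gaussian, as in Section III-B."""
    diff = np.asarray(x) - mu
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
```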
We can compute the decision cost of assigning point x to each class l as follows:

\mathrm{DecisionCost}(l) = \sum_{j} \mathrm{prob}(\vec{x} \mid l_j) \cdot \mathrm{Cost}(l, l_j)

The class with the lowest decision cost is chosen, and the top-performing aggregator for that class is executed. This allows us to choose not the most likely aggregator, but the aggregator with the lowest risk. The cost of computing the cluster quality and rank variance measures is O(ns^2), where n is the number of objects returned by all the rankers and s is the number of rankers. The cost of finding the correct class is O(l^2), where l is the number of noise/misinformation classes.

IV. EXPERIMENTAL SETUP

To be able to test the decision method for different noise and misinformation classes and for large datasets, we created a statistical framework which has perfect information about the performance of each aggregator and the specific noise and misinformation scenario. To apply our methods to real data sets, we need to first cluster the datasets based on the cluster quality and rank variance values. Each cluster will represent a specific noise and misinformation scenario. We then need to identify the performance of the aggregators in each scenario. To accomplish this, we can use clickthrough data [8], which estimates the correct or preferred ordering of objects based on the order of clicks. Testing the effectiveness of our method with real data and clustering is part of ongoing work.

In our statistical framework, there is a ground truth which corresponds to the correct ranking of the objects in the database. Each ranker is a perturbation of the ground truth. The ground truth ranker computes a score for a number of factors denoted by f_1, ..., f_m. The features are positive scores that correspond to the relevance of the object for a specific criterion. For example, in a web search application, the objects are web pages and the features can be keyword occurrence, recency of web page update, or retrieval frequency. In addition, the ground truth assigns positive weights to each feature, w = ⟨w_1, ..., w_m⟩, where each weight describes the importance of the corresponding feature in determining relevance. The score of an object is determined by the linear combination score(r, o) = w_1 · o.f_1 + ... + w_m · o.f_m, and objects are ranked in decreasing order of their scores. In this paper, we assume that the scores for each factor for the ground truth are generated uniformly at random and are independent of each other.

Each ranker tries to estimate the same factors but makes randomly distributed mistakes. These mistakes correspond to the noise and are added to the true scores for each factor. In this paper, the errors made by each ranker for each factor and for each object are independent of each other. We also assume a specific distribution for the noise that models spam. We define spam as the case of objects with very low scores in the ground truth achieving high ranks in the rankers. According to this model, the mistakes made in the scores of good objects are relatively small in magnitude; however, occasionally bad objects with respect to the ground truth are given very high scores, elevating them to high ranks. The full details of the statistical framework can be found in [1]. Note that for each data set that we generate, we have a ground truth and a number of rankers. When aggregating, we only use the rankers as input, assuming the ground truth is not known. We then evaluate the performance of the aggregation method against the ground truth.
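Continuing the earlier sketches, the minimum-risk decision rule from the beginning of this section could look as follows. It reuses class_density from the previous sketch, and the dictionary layout of the cost matrix and of the top-aggregator table is our assumption, not the paper's.

```python
def choose_aggregator(x, params, cost, top_aggregator):
    """Pick the minimum-risk class for x and return its top aggregator.

    params: {label: (mu, Sigma)} from fit_class_gaussians;
    cost: dict mapping (l, lj) -> normalized misclassification cost;
    top_aggregator: {label: best aggregator name for that label}.
    Implements DecisionCost(l) = sum_j prob(x | l_j) * Cost(l, l_j).
    """
    densities = {lj: class_density(x, mu, Sigma)
                 for lj, (mu, Sigma) in params.items()}
    decision_cost = {
        l: sum(densities[lj] * cost[(l, lj)] for lj in params)
        for l in params
    }
    best_label = min(decision_cost, key=decision_cost.get)
    return best_label, top_aggregator[best_label]
```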
While in information retrieval the objects are generally labelled as relevant or irrelevant, the ground truth gives us more detailed information. It allows us to correctly identify the top 10 objects for the precision measure, and it also provides the correct ordering of objects for the Kendall-tau performance measure. Furthermore, we are able to generate a statistically significant number of datasets for each specific test scenario using our model. For real data sets, it is generally very hard to understand what specific noise and misinformation model they correspond to. As a result, the statistical model provides us with a good way to analyze different algorithms.

In our tests, we use 5 factors and 15 rankers. We vary the noise variance over 0.01, 0.1, 0.5 and 0.75. The ground truth uses the weight function w = ⟨1/15, 2/15, 3/15, 4/15, 5/15⟩. The rankers use either the correct weight function above or the reverse weight function w_r = ⟨5/15, 4/15, 3/15, 2/15, 1/15⟩. We model misinformation by using the incorrect weight function.

Fig. 1. The best rankers in each noise/misinformation case for Kendall-tau (axes: low to high noise vs. no to high misinformation).

We test three misinformation cases, n_MI = ⟨0, 2, 7⟩. For n_MI = 0, all rankers have the correct weights and there is no misinformation. For n_MI = 2 and 7, respectively, 2 and 7 of the rankers have the incorrect weights and the remaining rankers have the correct weights. When the weights used by the rankers are incorrect, the information lost about each factor cannot be regained by aggregating more rankers. Furthermore, in the cases n_MI = 2 and 7, there is an asymmetry between the rankers.

In our tests, we compare the performance of a wide range of rank aggregation algorithms. In addition to the average (Av), median (Me), CombMNZ [7] and PrOpt [2] (precision optimal) aggregators described in Section II, we test a number of aggregators that aim to reduce the number of pairwise disagreements (i.e., the Kendall-tau measure) between the aggregate ranker and the input rankers. ADJ [6] (adjacent pairs) is a localized search method that checks adjacent pairs and flips them if there is an improvement in the overall Kendall-tau error. IBF [2] (iterative best flip) is a global search method that finds the best flip for each object, forces a flip even if there is a temporary increase in the error, and returns the lowest-error configuration among all those visited. Both ADJ and IBF can be applied to the output of any of the other aggregation algorithms. Dwork et al. [6] also introduce a number of Markov-chain based aggregators. We include a version of the PageRank [5] (Pg) algorithm that approximates the Markov chain M4 previously used in other studies [11]; details of our implementation can be found in [2]. Finally, Condorcet Fuse (CFuse) [10] also aims to optimize Kendall-tau by approximating the ordering of objects that agrees with the majority of the rankers for each pair.

In our previous work, we have shown that the median is best when the noise is low and the majority of the rankers have the correct weights, while the average is best when there is noise but no misinformation. When noise and misinformation are both present, different algorithms do well depending on the specific case and the performance measure used.
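A rough sketch of how data sets in the spirit of this framework could be generated is given below. The parameter names are ours, and plain Gaussian noise is used here in place of the paper's spam-modeling noise distribution, so this is an approximation of the setup rather than a reproduction of it.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rankers(n_objects=1000, n_factors=5, n_rankers=15,
                 n_misinformed=2, noise_var=0.1, K=10):
    """Generate a ground truth and noisy/misinformed rankers (Section IV sketch).

    Ground-truth factor scores are uniform at random; each ranker adds
    independent noise to the scores and combines them with either the
    correct weights or the reversed weights (misinformation).
    """
    scores = rng.uniform(size=(n_objects, n_factors))
    w = np.arange(1, n_factors + 1) / 15.0      # correct weights <1/15, ..., 5/15>
    w_rev = w[::-1]                             # reversed weights (misinformation)

    ground_truth = np.argsort(-(scores @ w))[:K]

    rankers = []
    for i in range(n_rankers):
        weights = w_rev if i < n_misinformed else w
        noisy = scores + rng.normal(0.0, np.sqrt(noise_var), scores.shape)
        rankers.append(np.argsort(-(noisy @ weights))[:K])
    return ground_truth, rankers
```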
Figure 1 shows the top-performing rankers for the Kendall-tau performance measure for the scenario we study in this paper. The top line in the figure is the top-performing aggregator(s) and the second line is the second-best aggregator in each scenario. We can see a large variety of algorithms performing well under certain conditions and not so well in others. The general trend shows that algorithms that disregard information, such as PrOpt, Pg and IBF, are robust to noise, but they do not do well in the case of low noise. When there is asymmetry between the rankers, Me, MeADJ and MeIBF do well, depending on the noise. As misinformation increases further and the majority of the rankers contain misinformation (not shown in this paper), there is a greater need to incorporate information from the rankers, and Av starts to do better than Me. As noise and misinformation both increase, more robust algorithms are needed. One interesting side note is that the better Kendall-tau optimizers such as IBF tend to be more robust to noise, as they retain less information about the actual rank values.

A. Results

To show that the classification method is effective, we first train the classifiers by running 40,000 data sets in each noise/misinformation (n/m) setting. We also determine the best aggregators in each n/m setting and use this information to construct the cost matrix. We also determine the overall best aggregators by averaging the rank of each aggregator over all n/m settings. In a static scenario, the optimal choice would be to use a static aggregator that performs well overall. We then run 40,000 data sets for each n/m setting, use the classifier to find the correct class, and use the top aggregator for the estimated class. We compare the performance of the estimated top aggregator with the overall best aggregators used as static aggregators. We also compare the performance of our method to the optimal classifier, one which always finds the correct class for each dataset. Note that when optimizing for precision, precision is used for all computations; the same is true for the Kendall-tau performance measure.

We have found that the correct class is identified in almost 75% of the cases in almost all classes, except for the high misinformation/asymmetry cases. In these cases, there is a large performance difference between the top aggregator and the rest. As a result, the risk of misidentifying these cases is very high and the classifier tends to choose a safer option.

Figure 2 below shows the results for average precision over all the classes and the top 3 overall aggregators (higher values are better), as well as the Kendall-tau results (lower values are better). As we can see, the cost-based classification method that we introduce is able to perform better than any of the static aggregators except for PrOpt. In the test cases that we tried, PrOpt is the optimal ranker for precision in a large number of cases; hence, the best possible improvement in this case is very small. However, for Kendall-tau, we show a significant improvement in performance over all the classes.

(a) Average precision: optimal classification 8.5518; cost-based classification 8.5364; CombMNZ 8.3385; PrOpt 8.5360; PgADJ 8.4466.
(b) Average Kendall-tau: optimal classification 18.869; cost-based classification 19.091; Pg 21.313; MeIBF 19.384; CombMNZ 22.115.
Fig. 2. The performance comparison of the cost-based classifier.

V. CONCLUSIONS

In this paper, we present a method to classify a given set of rankers into a specific noise and misinformation scenario.
We have shown that, by using this classifier, we are able to estimate the noise and misinformation class correctly. By using the top aggregator for each scenario, we are able to improve the performance over any existing static aggregation method.

REFERENCES

[1] S. Adalı, B. Hill, and M. Magdon-Ismail. Information vs. robustness in rank aggregation: Models, algorithms and a statistical framework for evaluation. In JDIM, Special Issue on Web Information Retrieval, to appear.
[2] S. Adalı, B. Hill, and M. Magdon-Ismail. The impact of ranker quality on rank aggregation algorithms: Information vs. robustness. In Proceedings of the International Workshop on the Challenges in Web Information Retrieval and Integration, pages 10–19, 2006.
[3] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 684–693, 2005.
[4] J. J. Bartholdi, C. A. Tovey, and M. A. Trick. Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6(2):157–165, 1989.
[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of ACM WWW, pages 107–117, 1998.
[6] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proceedings of ACM WWW, pages 613–622, 2001.
[7] J. Fox and E. Shaw. Combination of multiple sources: The TREC-2 interactive track matrix experiment. In Proceedings of ACM SIGIR, 1994.
[8] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of ACM SIGKDD, pages 133–142, 2002.
[9] Y.-T. Liu, T.-Y. Liu, T. Qin, Z.-M. Ma, and H. Li. Supervised rank aggregation. In Proceedings of ACM WWW, pages 481–489, 2007.
[10] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In Proceedings of ACM CIKM, pages 538–548, 2002.
[11] M. E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In Proceedings of ACM SAC, pages 841–846, 2003.