Optimal Aggregation Algorithms for Middleware
Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
All right reserved by Xuehua Shen [email protected]

Problem: Rank Aggregation
  Each object is scored using m different criteria, with one sorted list per criterion
  The combined score is calculated by an aggregation function
  Problem: find the top-k objects with the highest combined scores

Rank Aggregation: Example
  Aggregation function, e.g. weighted sum:
    Combined score = 0.2 * mileage score + 0.3 * year score + 0.5 * price score

  carID  Mileage Score      carID  Year Score      carID  Price Score
  c      1.0                a      0.9             d      1.0
  a      0.8                b      0.7             e      0.9
  e      0.6                c      0.7             b      0.8
  b      0.5                d      0.7             c      0.7
  d      0.5                e      0.5             a      0.6

  Top 2 cars:
  carID  score
  d      0.81
  c      0.76

  Do we need to access all entries of all sorted lists?

Applications
  Multimedia database system
  Web search
  [Figure, from Zhang2002 talk: the query "Top k: Color='red' and Shape='round'" is sent to a rank aggregation engine, which combines a sorted list for Color='red' and a sorted list for Shape='round']

Outline
  Assumptions
  Fagin Algorithm
  Threshold Algorithm
  Summary & Comments

Assumption 1: Modes of Access
  Sequential access: obtain the score of the next object in a sorted list, proceeding sequentially from the current position
  Random access: obtain the score of a given object in a sorted list in one access

  carID  Year score
  a      0.8
  c      0.8
  e      0.7
  …

  Assumption: both access modes are available

Assumption 2: Aggregation Function
  Each object gets a score in the interval [0,1] from each subsystem
  The aggregation function combines these scores into a combined score, e.g. min, avg
  Monotone: $f(x_1, x_2, \dots, x_m) \le f(y_1, y_2, \dots, y_m)$ if $x_i \le y_i$ for every $i$

Intuition of Algorithms
  Objects near the top of the individual sorted lists are likely to be among the correct answers
  Do some accesses, then ask "Can we stop now?"

Fagin Algorithm
  carID  Price score      carID  Mileage score      carID  Year score
  a      0.9              b      1.0                a      0.8
  c      0.8              e      0.8                c      0.8
  e      0.7              f      0.7                e      0.7
  …                       …                         …

  'e' appears in all of them, so the top-1 object must be in {a, b, c, e, f}. Why?
  Because the function is monotone, object 'e' blocks all objects below it: an unseen object scores no higher than 'e' in every list
  Do random access for these 5 objects to get their scores and pick the top 1
  We can't say 'e' must be the top 1; the other seen objects can still have a higher combined score

Drawbacks of Fagin Algorithm
  Only uses the information provided by the sorted lists and the monotone property
  Has to remember lots of objects: large buffer size

Threshold Algorithm (TA)
  Intuition: the combined score calculated by the aggregation function provides extra information, namely an upper bound (or threshold) on the combined score of unseen objects
  When object R is seen under sequential access, immediately do random access to get all other scores of R and compute its combined score
  At the same time, keep track of the upper bound for the unseen objects
  Halt when at least k objects have combined scores no less than the upper bound

TA: Example (K=1, AVG aggregation)
  carID  Price score      carID  Mileage score      carID  Year score
  a      0.9              b      1.0                a      0.8
  c      0.8              e      0.8                c      0.8
  e      0.7              f      0.7                e      0.7
  …                       …                         …

  Upper bound: 0.9 after the first row, 0.8 after the second row
  Constant-size buffer: 0.77, then 0.8
  Step 1: sequential access to 'a' price score (0.9), then random access to 'a' mileage score (0.6) and year score (0.8); avg is 0.77
  Step 2: sequential access to 'b' mileage score (1.0), then random access to 'b' price score (0.7) and year score (0.7); avg is 0.8

Evaluation of TA
  TA never stops later than FA
  TA requires only a small constant-size (K) buffer
  However, TA may perform more random accesses

Summary
  FA and TA use both sequential access and random access
  TA extends to other situations:
    Approximate algorithm
    No random access

Comments
  The algorithms rely on universal identification of objects across the different lists
  The assumptions are not always valid, e.g. not every sorted list exists beforehand
  Do sequential access wisely to speed up TA on skewed data

Backup Slides

Middleware
  Middleware functions as a translation layer: it handles all incoming requests (such as a top-k query) and replies, interacting with the disparate back-office systems to gather the information it needs
  Application developers don't need to know that there are several heterogeneous systems behind the middleware
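Returning to the Threshold Algorithm from the main slides, its steps can be sketched in Python as follows. This is a minimal illustration rather than the paper's implementation: it assumes every object appears exactly once in every list, simulates random access with one dictionary per list, and uses the car lists and weighted sum from the rank aggregation example. The names `threshold_algorithm`, `weighted_sum`, `ta_top`, and `ta_depth` are invented for this sketch.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# One sorted list: (object id, score) pairs in descending score order.
SortedList = List[Tuple[str, float]]


def threshold_algorithm(
    lists: List[SortedList],
    aggregate: Callable[[List[float]], float],
    k: int,
) -> Tuple[List[Tuple[str, float]], int]:
    """Return the top-k (object, combined score) pairs and the depth reached."""
    # Assumption: random access is simulated by a dictionary per list.
    lookup: List[Dict[str, float]] = [dict(lst) for lst in lists]
    top: List[Tuple[float, str]] = []  # min-heap, never more than k entries
    depth = 0
    for depth in range(1, len(lists[0]) + 1):
        row = [lst[depth - 1] for lst in lists]  # one sequential access per list
        for obj, _ in row:
            if any(o == obj for _, o in top):
                continue  # already buffered
            # Random access: fetch the object's score in every list.
            score = aggregate([table[obj] for table in lookup])
            if len(top) < k:
                heapq.heappush(top, (score, obj))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, obj))
        # Upper bound on the combined score of any unseen object.
        threshold = aggregate([s for _, s in row])
        if len(top) >= k and top[0][0] >= threshold:
            break  # k buffered objects are no worse than the bound
    ranked = sorted(((o, s) for s, o in top), key=lambda p: -p[1])
    return ranked, depth


# Car data from the rank aggregation example (mileage, year, price).
mileage = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
year = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
price = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted_sum(scores: List[float]) -> float:
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


ta_top, ta_depth = threshold_algorithm([mileage, year, price], weighted_sum, k=2)
print([obj for obj, _ in ta_top], ta_depth)  # ['d', 'c'] 3
```

On this data the thresholds after rows 1, 2, and 3 are 0.97, 0.82, and 0.73, so the sketch halts at depth 3, once the second-best buffered score (0.76) is no less than the bound. The buffer holds only k entries, which is the property the "Evaluation of TA" slide highlights.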
Boolean Query vs. Fuzzy Query
  Semantics: get all the results that satisfy the conditions vs. get the best possible answers to the query
  Size of result: constant vs. variable
  Processing the query: for a Boolean query, whether a tuple belongs to the result can be determined from the tuple itself, so each tuple can be handled individually; for a fuzzy query, whether a tuple is in the result cannot be determined from the tuple alone

Fuzzy Query Processor (from Zhang02)
  [Figure: a query set feeds two query processors. The Boolean query processor handles "Title='database' and Price < 100" against a traditional database; the fuzzy query processor handles "Top k: Color='red' and Shape='round'" using sorted lists for Color='red' and Shape='round' from a database with fuzzy data]

Cost
  Goal: reduce the number of sequential accesses (Cs)
  The number of random accesses is bounded by the number of sequential accesses times a factor of m-1
  The overall cost is bounded by Cs times a constant factor
  Really optimal?

Approximation Algorithm
  Approximate top-k answers are acceptable or even desirable
  θ-approximation (θ > 1): for any object y in the answer and any object z in the database, $\theta\, t(y) \ge t(z)$
  Turning TA into an approximation algorithm: halt once the top k objects seen so far satisfy the inequality

Non Random Access (NRA)
  Similar to TA, except that:
    No exact scores
    No sorted order
    Lower and upper bounds are kept for such objects
  Do sequential access until there are k objects whose lower bounds are no less than the upper bounds of all other objects

NRA cont.
  Lower bound: use 0 for each missing score
  Upper bound: use the last score seen in that list

  carID  Price score      carID  Mileage score      carID  Year score
  a      0.9              b      1.0                a      0.8
  c      0.8              e      0.8                c      0.8
  e      0.7              f      0.7                e      0.7
  …                       …                         …

NRA example
  Advantage: R1(1,0), others(1/3,1/3)
  Top 1 vs. Top 2: R1(1,0), R2(1,1/4), others(1/3,1/3)
  Lots of bookkeeping

Optimality of FA
  Assumption: t is monotone
  Cost: $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability
  Optimality: every algorithm that correctly finds the top k answers for a strict monotone query $F_t(A_1, A_2, \dots, A_m)$, where $A_1, A_2, \dots, A_m$ are independent, and that makes no wild guesses has cost $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability
  FA is optimal among all such algorithms in the high-probability sense

Optimality of TA
  Assumption: t is monotone
  Instance optimality: for any algorithm C that correctly finds the top k answers for a monotone query $F_t(A_1, A_2, \dots, A_m)$ without wild guesses, and for any database D, Cost(TA, D) = O(Cost(C, D))
  TA is instance optimal among all such algorithms

Optimality of NRA
  Assumption: t is monotone
  Instance optimality: NRA is instance optimal among all algorithms that correctly find the top k objects for a monotone query t on every database and make no random accesses

Algorithm Comparison (from Zhang2002 talk)
  Algorithm  Assumption  Access Model    Termination (worst case)  Termination (expected)      Buffer Space
  FA         Monotone    Sorted, Random  n(m-1)/m + k/m            $N^{(m-1)/m} k^{1/m}$       N
  TA         Monotone    Sorted, Random  Bounded by FA             Depends on distribution     k
  NRA        Monotone    Sorted          N                         Depends on distribution     N

Worst Case
  [Figure: sorted lists over objects O1, O2, ..., On+1, On+2, On+3, ..., O2n+1 whose scores are all 1.0 or 0.0]
  Aggregation function: min
  n(m-1)/m + k/m

Naïve algorithm
  Algorithm:
    For each criterion, do sequential access to retrieve all objects and their scores
    Calculate the combined scores for all objects
    Pick the top K
  Comments:
    Accesses the entire database
    Cost is linear in the database size
    Does NOT use the fact that each list is sorted

Fagin Algorithm
  Algorithm:
    Do sequential access in parallel to all sorted lists Li until there are k "matches"; a "match" is an object that has been seen in all sorted lists Li
    Then, for each object that has been seen, do random access to get all of its scores
    Compute the combined scores and pick the top k
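The Fagin Algorithm steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes every object appears exactly once in every list, simulates random access with one dictionary per list, and reuses the car lists and weighted sum from the rank aggregation example. The names `fagin_algorithm`, `weighted_sum`, `fa_top`, and `fa_depth` are invented for this sketch.

```python
from typing import Callable, Dict, List, Set, Tuple

# One sorted list: (object id, score) pairs in descending score order.
SortedList = List[Tuple[str, float]]


def fagin_algorithm(
    lists: List[SortedList],
    aggregate: Callable[[List[float]], float],
    k: int,
) -> Tuple[List[Tuple[str, float]], int]:
    """Return the top-k (object, combined score) pairs and the depth reached."""
    m = len(lists)
    seen_in: Dict[str, Set[int]] = {}  # object -> indexes of lists it was seen in
    matches = 0
    depth = 0
    # Phase 1: sequential access in parallel until k objects are seen in all lists.
    while depth < len(lists[0]) and matches < k:
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            where = seen_in.setdefault(obj, set())
            if i not in where:
                where.add(i)
                if len(where) == m:
                    matches += 1  # obj is now a "match"
        depth += 1
    # Phase 2: random access for every seen object (simulated by dictionaries).
    lookup = [dict(lst) for lst in lists]
    scored = [
        (obj, aggregate([table[obj] for table in lookup])) for obj in seen_in
    ]
    # Phase 3: pick the k highest combined scores.
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k], depth


# Car data from the rank aggregation example (mileage, year, price).
mileage = [("c", 1.0), ("a", 0.8), ("e", 0.6), ("b", 0.5), ("d", 0.5)]
year = [("a", 0.9), ("b", 0.7), ("c", 0.7), ("d", 0.7), ("e", 0.5)]
price = [("d", 1.0), ("e", 0.9), ("b", 0.8), ("c", 0.7), ("a", 0.6)]


def weighted_sum(scores: List[float]) -> float:
    return 0.2 * scores[0] + 0.3 * scores[1] + 0.5 * scores[2]


fa_top, fa_depth = fagin_algorithm([mileage, year, price], weighted_sum, k=2)
print([obj for obj, _ in fa_top], fa_depth)  # ['d', 'c'] 4
```

On this small data set the first two matches ('b' and 'c') only appear at depth 4, by which point all five cars have been seen and buffered. That illustrates the buffer-size drawback noted for FA: every seen object must be remembered until the random-access phase.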