Operators for Similarity Search
Deepak Padmanabhan, PhD
Centre for Data Sciences and Scalable Computing
The Queen's University of Belfast, United Kingdom
[email protected]

Similarity Search in Action
Examples: image search, similar web pages, similar movies (Tastekid.com).

Similarity and Cognition
"This sense of sameness is the very keel and backbone of our thinking. ... the mind makes continual use of the notion of sameness, and if deprived of it, would have a different structure from what it has."
(William James, Principles of Psychology, 1890)

"Similarity is fundamental for learning, knowledge and thought, for only our sense of similarity allows us to order things into kinds so that these can function as stimulus meanings. Reasonable expectation depends on the similarity of circumstances and on our tendency to expect that similar causes will have similar effects."
(Quine, Ontological Relativity and Other Essays, 1969)

Geometric Similarity Model
The similarity of two objects O1 and O2 aggregates their per-feature similarities:
S(O1, O2) = Σ_f Sim(O1.f, O2.f)

Diagnosticity Principle
Similarity and grouping are related: features that are used to cluster have a disproportionate influence on judged similarity.

Pairwise Similarities: Object Representation and Similarity Measures
Example Representations (figure)

Estimating Similarity Between Objects
Example: two used-car records, CAR1023 (Remarks: Good condition; Model: Passat V6; Year: 2002; Battery Voltage: 12.9V; ...) and CAR560 (Remarks: Nice condition; Model: Passat; Year: 2000; Battery Voltage: 12.6V; ...). Per-attribute similarities are computed with suitable measures (text similarity for the remarks, a domain ontology for the model, domain knowledge and numeric comparison for year and voltage), giving the score vector {0.6, 0.8, 0.75, 0.9}. This vector can then be aggregated in different ways:
min = 0.60, max = 0.90, avg = 0.76, min2 (second smallest) = 0.75, or noagg = {0.6, 0.8, 0.75, 0.9} (no aggregation).

Outline for the Rest
• Construction-based Classification
• Property-based Classification
• Some Directions

Problem Overview
A query Q = (q1, ..., qn) is compared against each data object D = (d1, ..., dn).
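The aggregation choices in the car example (min, max, avg, min2, or no aggregation) can be sketched in a few lines of Python. The `aggregate` helper is a hypothetical name of our own; the scores are the per-attribute similarities {0.6, 0.8, 0.75, 0.9} from the slide.

```python
# Aggregating per-attribute similarity scores, as in the used-car example.
# `aggregate` is an illustrative helper, not an API from the talk.

def aggregate(scores, mode="avg"):
    """Collapse a vector of per-attribute similarities into one value."""
    s = sorted(scores)
    if mode == "min":
        return s[0]
    if mode == "max":
        return s[-1]
    if mode == "avg":
        return sum(scores) / len(scores)
    if mode == "min2":          # second-smallest score
        return s[1]
    raise ValueError(mode)

scores = [0.60, 0.80, 0.75, 0.90]
print(aggregate(scores, "min"))            # 0.6
print(aggregate(scores, "max"))            # 0.9
print(round(aggregate(scores, "avg"), 2))  # 0.76
print(aggregate(scores, "min2"))           # 0.75
```

The "noagg" option of the slide simply keeps the vector itself, deferring the choice of aggregation to a later operator.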
Scoring operators assign a score vector to each object by comparing it to the query: D = (d1, ..., dn) is compared to Q = (q1, ..., qn) attribute by attribute to produce S(Q,D) = (s1, ..., sn). Aggregation operators condense the score vector into a smaller number of values I(Q,D). Selection/filter operators then select a subset of objects based on whether they satisfy a criterion (e.g., skyline, rank-based, or threshold-based), determining each object's membership or score in the result.

Common Operations
• Aggregation operations
– Weighted Sum
– Max
– Min
– Distance
– N-Match
• Filter operations
– Skyline
– Rank (Top-k)
– Threshold (Bounding Box, Range query)
Different combinations lead to different operators.

Weighted Sum Top-k
With weights W(X) = 1 and W(Y) = 2, the distance of an object D from the query Q is
d(D) = W(X)·|D[x] - Q[x]| + W(Y)·|D[y] - Q[y]|, or, in general, d(D) = Σ_i W(i)·|D[i] - Q[i]|.
For Q = (3,4):
d((2,2)) = 1·1 + 2·2 = 5
d((1,3)) = 1·2 + 2·1 = 4
d((1,4)) = 1·2 + 2·0 = 2
d((5,1)) = 1·2 + 2·3 = 8
d((3,3)) = 1·0 + 2·1 = 2
d((6,3)) = 1·3 + 2·1 = 5
d((2,6)) = 1·1 + 2·2 = 5
d((5,6)) = 1·2 + 2·2 = 6
The Top-k filter sorts by distance and chooses the best k:
d((1,4)) = 2, d((3,3)) = 2, d((1,3)) = 4, d((2,2)) = 5, d((6,3)) = 5, d((2,6)) = 5, d((5,6)) = 6, d((5,1)) = 8
Useful when all attributes need to be considered. The locus of equidistant points is a diamond around Q, stretched according to the weights (a regular diamond with equal weights).
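The weighted-sum Top-k computation above can be sketched directly in Python. The query, weights and points are taken from the example; the helper names (`weighted_l1`, `top_k`) are our own.

```python
# Weighted-sum Top-k over the slide's example: Q = (3, 4), W = (1, 2).
Q = (3, 4)
W = (1, 2)
points = [(2, 2), (1, 3), (1, 4), (5, 1), (3, 3), (6, 3), (2, 6), (5, 6)]

def weighted_l1(p, q=Q, w=W):
    # Aggregation: sum_i W(i) * |D[i] - Q[i]|
    return sum(wi * abs(pi - qi) for wi, pi, qi in zip(w, p, q))

def top_k(points, k):
    # Top-k filter: sort by aggregated distance and keep the k best.
    return sorted(points, key=weighted_l1)[:k]

print(top_k(points, 2))  # [(1, 4), (3, 3)], both at distance 2
```

Ties (here between (1,4) and (3,3)) are broken by the original order, since Python's sort is stable.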
Max Top-k
d(D) = max{ |D[x] - Q[x]|, |D[y] - Q[y]| }, or, in general, d(D) = max_i |D[i] - Q[i]|.
For Q = (3,4):
d((2,2)) = 2, d((1,3)) = 2, d((1,4)) = 2, d((5,1)) = 3, d((3,3)) = 1, d((6,3)) = 3, d((2,6)) = 2, d((5,6)) = 2
Useful when the maximum dissimilarity needs to be bounded. The locus of equidistant points is a square around Q.

Min Top-k
d(D) = min{ |D[x] - Q[x]|, |D[y] - Q[y]| }, or, in general, d(D) = min_i |D[i] - Q[i]|.
For Q = (3,4):
d((2,2)) = 1, d((1,3)) = 1, d((1,4)) = 0, d((5,1)) = 2, d((3,3)) = 0, d((6,3)) = 1, d((2,6)) = 1, d((5,6)) = 2
Useful when the best matching attribute is sufficient. The locus of equidistant points is cross-shaped, centred on Q.

Skyline
An object is said to dominate another if the latter is farther away from the query than the former on all dimensions (they can be equal on some, but not all). All objects that are not dominated by any other object are output as results; in the figure, the results are (5,6), (2,6), (3,3) and (1,4). Useful when attribute scores cannot be aggregated.

Range Query (L2 aggregation + threshold filter)
Decision criterion: sqrt( Σ_x (D[x] - Q[x])² ) ≤ r, i.e., D lies within a circle of radius r around Q.

Bounding Box (null aggregation + threshold filter)
Decision criterion: for every attribute x, |D[x] - Q[x]| ≤ r_x, i.e., D lies within an axis-aligned box around Q with per-attribute thresholds r_x and r_y.

K-N-Match
Per-dimension distances from Q = (3,4), and the same distances sorted per object (A1 = best-matching attribute, A2 = second best):

Data   DisX  DisY  |  A1   A2
(2,6)   1     2    |  X:1  Y:2
(4,5)   1     1    |  X:1  Y:1
(1,4)   2     0    |  Y:0  X:2
(1,3)   2     1    |  Y:1  X:2
(6,3)   3     1    |  Y:1  X:3
(2,2)   1     2    |  X:1  Y:2
(3,2)   0     2    |  X:0  Y:2
(5,1)   2     3    |  X:2  Y:3

The K-N-Match operator ranks objects based on the match on the N-th best matching attribute.
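A skyline filter can be sketched as below. This version defines domination on per-dimension absolute distances from Q; the slide's figure draws quadrant-based domination regions, so the result set under that convention can differ from this one. The helper names are our own.

```python
# Skyline sketch: NULL aggregation + skyline filter.
# Domination is defined here on per-dimension absolute distances from Q
# (one common convention for dynamic skylines); the figure in the talk
# uses per-quadrant domination regions, which may give a different set.
Q = (3, 4)
points = [(2, 6), (5, 6), (1, 4), (1, 3), (3, 3), (6, 3), (2, 2), (5, 1)]

def dists(p, q=Q):
    return tuple(abs(pi - qi) for pi, qi in zip(p, q))

def dominates(a, b, q=Q):
    # a dominates b: a is no farther on any dimension, strictly nearer on one.
    da, db = dists(a, q), dists(b, q)
    return all(x <= y for x, y in zip(da, db)) and da != db

def skyline(points):
    # Keep every object that no other object dominates.
    return [p for p in points
            if not any(dominates(o, p) for o in points if o != p)]

print(skyline(points))  # [(1, 4), (3, 3)] under this convention
```

Note that (2,6) and (2,2) have identical distance vectors (1,2), so neither dominates the other; both are instead dominated by (3,3) with distances (0,1).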
For N = 1 the ranking uses each object's best-matching attribute; for N = 2, the second best. This is N-match aggregation + a rank filter, useful when at least N attributes should match.

Summarizing the Construction-based Classification

Operator             Aggregation      Filter
Weighted Sum Top-k   Weighted Sum     Top-k
Max Top-k            Max              Top-k
Min Top-k            Min              Top-k
Skyline              NULL             Skyline
Range                L2               Threshold
Bounding Box         NULL             Threshold on each attribute
K-N-Match            Nth best match   Top-k

Property-based Classification
• Ordered vs. Unordered Output
– Whether there is an ordering in the output result set
• Subset vs. All Attributes
– Whether all attributes contribute to deciding membership in the result set

Ordered vs. Unordered Output
Applicable to selection/filter operators: skyline returns an unordered result set, whereas Top-k returns a ranked list (1, 2, 3, ...).

Subset vs. All Attributes
Applicable to aggregation operators. We focus on the construction of I(Q,D) for this classification.

Some Example I(Q,D)s
• Weighted Sum: Σ_i w_i · S(Q,D)[i] (all attributes needed)
• Range Query: sqrt( Σ_i D(Q,D)[i]² ) (all attributes needed)
• Bounding Box / Skyline: S(Q,D) itself, with no aggregation
• Max: max_i { S(Q,D)[i] }
• Min: min_i { S(Q,D)[i] } ("some" attributes are enough)
• K-N-Match: the n-th best value among { S(Q,D)[i] } ("some" attributes are enough)

Classification Overview
• Aggregation
– Subset of attributes: Min, Max, N-Match
– All attributes: Weighted Sum, Lp aggregation
• Selection/Filter
– Ordered: Top-k
– Unordered: Range, Skyline, Bounding Box

"Add-on" Features for Similarity Operators
• Indirection (Reverse Operators)
• Multiple Queries
• Diversity
• Visibility
• Subspaces
• Typed Data (Chromaticity)

Reverse Operators
• Range query: get me all the restaurants within 1 km of my home.
– A common consumer usage scenario, e.g., a user searching for restaurants to dine at.
• Reverse range query: get me all the users for whom my restaurant is within 1 km.
– More of a service-provider question, e.g., finding potential consumers to whom targeted marketing may be done.
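The K-N-Match operator described above can be sketched as follows: score each object by its N-th best (smallest) per-attribute distance to Q, then apply a Top-k rank filter. The query and points are from the example table; the function names are our own.

```python
# K-N-Match sketch: rank objects by the N-th best-matching attribute.
Q = (3, 4)
points = [(2, 6), (4, 5), (1, 4), (1, 3), (6, 3), (2, 2), (3, 2), (5, 1)]

def n_match_score(p, n, q=Q):
    # n = 1 uses the best-matching attribute, n = 2 the second best, ...
    diffs = sorted(abs(pi - qi) for pi, qi in zip(p, q))
    return diffs[n - 1]

def k_n_match(points, k, n):
    # Top-k rank filter over the N-match aggregation.
    return sorted(points, key=lambda p: n_match_score(p, n))[:k]

print(k_n_match(points, 3, 1))  # [(1, 4), (3, 2), (2, 6)]
print(k_n_match(points, 3, 2))  # [(4, 5), (2, 6), (1, 4)]
```

For N = 1, (1,4) and (3,2) lead with a perfect match (distance 0) on one attribute; for N = 2, (4,5) leads because even its second-best attribute is within distance 1.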
• This reversal can be applied to various operators, e.g., Reverse Skyline, Reverse kNN, ...

Multiple Queries
"I plan to leave from the office, go to the club and then get home. I need to get some dinner somewhere during this travel. Give me restaurants or pubs that are within 1 km of any of these three locations."
This corresponds to a range query using multiple query points (home, club, office). The merging operator here is OR, since we would be content with places that are close to any one of these queries.

Diversity
The literal 3 nearest neighbours (in, say, a rating/cost space) may not be very diverse: they can be very similar to each other. A diversity constraint ensures that the pairwise distance between any two results is lower-bounded, so a more diverse set is returned.

Visibility Constraints
Return the k nearest neighbours that are visible from the query point. In the example with K = 3, kNN(Q) = {d4, d5, d6}, but the visible variant gives VkNN(Q) = {d4, d1, d2}.

Subspaces: Subspace Range Search
Find objects within a threshold distance in a user-specified subset of dimensions. In the example (dimensions Expense and Rating):
• Dimensions = {Expense, Rating}: R = {d4, d5, d6}
• Dimensions = {Expense}: R = {d4, d5, d6}
• Dimensions = {Rating}: R = {d1, d2, d4, d5, d6, d8}

Typed Data: Chromaticity
Find objects of one class (A) that have the query object, of another class (B), in their kNN result set. Example: people and restaurants; find the bi-chromatic rkNN set of a restaurant. In the example, RNN(r1) = {p2, r3}, whereas Bi-RNN(r1) = {p2, p1} and Bi-RNN(r3) = {p6}. There are two classes, P (people) and R (restaurants).
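The multi-query range search above can be sketched as follows. The OR merge keeps a place if it is within radius r of at least one query point; the coordinates below are illustrative, not taken from the slide.

```python
# Multi-query range search sketch: OR merge over several query points.
# All coordinates here are hypothetical examples.
from math import dist  # Euclidean distance, Python 3.8+

queries = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]  # e.g., office, club, home
places = [(0.5, 0.5), (4.2, 0.3), (2.0, 2.2), (9.0, 9.0)]
r = 1.0

def multi_query_range(places, queries, r):
    # OR merge: keep a place if it is within r of ANY query point.
    return [p for p in places
            if any(dist(p, q) <= r for q in queries)]

print(multi_query_range(places, queries, r))
# [(0.5, 0.5), (4.2, 0.3), (2.0, 2.2)]
```

Switching `any` to `all` would give the AND merge instead, i.e., places close to every query point.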
The query is from class R and the results come from class P:
BRkNN(q, k, P, R) = { p ∈ P : q ∈ kNN(p, k, R) }

Summary of Operators

Operator              Results    Attributes  Feature(s)
Reverse Skyline       Unordered  All         Reverse
Multi-query kNN       Ordered    All         Multi-query
KNDN                  Ordered    All         Diversity
Visible kNN           Ordered    All         Visibility
Subspace Range Query  Ordered    Subset      Sub-space
Bi-chromatic rkNN     Unordered  All         Reverse, Typed data

The Road Ahead
• The plethora of choices in each step leads to the large variety of similarity search operators (and keeps researchers busy).
• Choices in:
– Similarity measures
– Aggregation operators
– Selection/filter operators
– Additional features
– Algorithmic features
• Are we done yet?

Let Us Invent Some New Operators

N-Match-BB
• A bounding box query where at least N attribute bounds are satisfied: an adaptation of K-N-Match to bounding boxes.
• Unordered output over a subset of attributes. For 1-Match-BB, data points on either of the two bounding rectangles (satisfying the x-bound or the y-bound) qualify.

Multi-Query Bichromatic Reverse kNN
• A combination of:
– Weighted Sum
– Top-k filter
– Reverse (indirection)
– Multi-query
– Chromaticity
• Example use case: of the three chosen locations for Café X (all three are intended to be opened), find people who would find at least one of these locations among their k closest cafés.

Miscellaneous
• Revisiting algorithms on new platforms
– Hadoop/MapReduce
• Interpretability in results
– Can results of similarity search be shown in a manner such that the intuitive similarity between the query and the result is highlighted?
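The bichromatic reverse kNN definition BRkNN(q, k, P, R) = { p ∈ P : q ∈ kNN(p, k, R) } can be sketched directly. The point coordinates below are illustrative, not taken from the slide's figure.

```python
# Bichromatic reverse kNN sketch:
# BRkNN(q, k, P, R) = { p in P : q in kNN(p, k, R) }.
# Coordinates are hypothetical examples.
from math import dist  # Euclidean distance, Python 3.8+

def knn(p, k, R):
    # The k objects of class R nearest to p.
    return sorted(R, key=lambda r: dist(p, r))[:k]

def brknn(q, k, P, R):
    # People who have restaurant q among their k nearest restaurants.
    return [p for p in P if q in knn(p, k, R)]

people = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
restaurants = [(0.5, 0.5), (5.0, 4.0), (6.0, 6.0)]

print(brknn((0.5, 0.5), 1, people, restaurants))
# [(0.0, 0.0), (1.0, 1.0)]
```

Note the asymmetry that motivates the "reverse" feature: the kNN search runs from each person's point of view, yet the answer describes the restaurant's catchment.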
• Syntactic and semantic features
– Understand the dichotomy between syntactic similarity (e.g., shape similarity) and semantic similarity (e.g., two images being similar because both are maps).
– Would modeling them differently, and learning when to weigh each highly, lead to more efficient similarity search?
• Contextual similarity: conditioning on user history
– On searching for "IBM Watson", a travelling person should be shown IBM Watson Labs, whereas a technologist should be shown the IBM Watson system.

"Similarity lies in the eyes of the beholder"*

Thank You! Questions/Comments?
[email protected]
[email protected]

* Adapted from a famous quote; from http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt