Similarity Search: Navigating the choices for Similarity Operators

Operators for Similarity Search
Deepak Padmanabhan, PhD
Centre for Data Sciences and Scalable Computing
The Queen’s University of Belfast
United Kingdom
[email protected]
Similarity Search in Action
Image Search
Web Pages
Similar Movies (Tastekid.com)
Similarity and Cognition
"This sense of sameness is the very keel and backbone of our thinking. … the mind makes continual use of the notion of sameness, and if deprived of it, would have a different structure from what it has."
  - William James, Principles of Psychology, 1890
"Similarity is fundamental for learning, knowledge and thought, for only our sense of similarity allows us to order things into kinds so that these can function as stimulus meanings. Reasonable expectation depends on the similarity of circumstances and on our tendency to expect that similar causes will have similar effects."
  - W. V. O. Quine, Ontological Relativity and Other Essays, 1969
Geometric Similarity Model
[Figure: objects O1 and O2 represented as feature sets, compared feature by feature]

S(O1, O2) = Σ_f Sim(O1.f, O2.f)
Diagnosticity Principle
• Similarity and grouping are related
• Features that are used to cluster have a disproportionate influence
Pairwise Similarities:
Object Representation and
Similarity Measures
Example Representations
Estimating Similarity Between Objects
[Figure: attribute-wise similarity between two used-car records]

CAR1023: Remarks: Good condition | Model: Passat V6 | Year: 2002 | Battery Voltage: 12.9V | …
CAR560:  Remarks: Nice condition | Model: Passat | Year: 2000 | Battery Voltage: 12.6V | …

Attribute          Technique                    Similarity
Remarks            Text similarity              0.60
Model              Domain ontology              0.80
Year               Domain knowledge + numeric   0.75
Battery Voltage    Numeric                      0.90

Aggregations: min = 0.60 | max = 0.90 | avg = 0.76 | min2 (second smallest) = 0.75 | noagg = {0.6, 0.8, 0.75, 0.9}
Outline for the Rest
• Construction-based Classification
• Property-based Classification
• Some Directions
Problem Overview
[Figure: the query vector Q = (q1, …, qn) is compared with each data vector D = (d1, …, dn) to produce a score vector S(Q, D) = (s1, …, sn); aggregation collapses this into I(Q, D), and a filter decides membership in (or the score of) the result set, possibly using query parameters]

Scoring operators: assign a score vector to each object by comparing it to the query object.
Aggregation operators: aggregate the score vector into a smaller number of values.
Selection/filter operators: select a subset of objects based on whether they satisfy a criterion, e.g., skyline, rank-based or threshold-based.
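The three-stage pipeline can be sketched as composable functions. A minimal Python sketch (function and variable names are illustrative, not from the slides; the data points are the 2-D running example used in the following slides):

```python
def score(q, d):
    # Scoring operator: per-attribute distance vector S(Q, D)
    return [abs(di - qi) for qi, di in zip(q, d)]

def aggregate_sum(scores):
    # Aggregation operator: collapse the score vector into one value
    return sum(scores)

def filter_topk(scored, k):
    # Filter operator: keep the k objects with the smallest aggregate score
    return sorted(scored, key=lambda pair: pair[1])[:k]

def similarity_operator(q, data, aggregate, k):
    scored = [(d, aggregate(score(q, d))) for d in data]
    return filter_topk(scored, k)

data = [(2, 6), (5, 6), (1, 4), (1, 3), (3, 3), (6, 3), (2, 2), (5, 1)]
result = similarity_operator((3, 4), data, aggregate_sum, k=2)
```

Swapping `aggregate_sum` for a different aggregation, or `filter_topk` for a threshold filter, yields the other combinations catalogued in the following slides.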
Common Operations
• Aggregation operations
  – Weighted Sum
  – Max
  – Min
  – Distance
  – N-Match
• Filter operations
  – Skyline
  – Rank (Top-k)
  – Threshold (Bounding Box, Range query)

Different combinations lead to different operators.
Weighted Sum Top-k
Weights: W(X) = 1, W(Y) = 2
Query Q: (3, 4)
Data: (2,6), (5,6), (1,4), (1,3), (3,3), (6,3), (2,2), (5,1)

Score: W(X)·|D[x] - Q[x]| + W(Y)·|D[y] - Q[y]|; in general, Σ_i W(i)·|D[i] - Q[i]|

d((2,2)) = 1·1 + 2·2 = 5
d((1,3)) = 1·2 + 2·1 = 4
d((1,4)) = 1·2 + 2·0 = 2
d((5,1)) = 1·2 + 2·3 = 8
d((3,3)) = 1·0 + 2·1 = 2
d((6,3)) = 1·3 + 2·1 = 5
d((2,6)) = 1·1 + 2·2 = 5
d((5,6)) = 1·2 + 2·2 = 6

Top-k filter: sort and choose k:
d((1,4)) = 2, d((3,3)) = 2, d((1,3)) = 4, d((2,2)) = 5, d((6,3)) = 5, d((2,6)) = 5, d((5,6)) = 6, d((5,1)) = 8

[Figure: the locus of equal score is a diamond when all weights are equal]

Useful when all attributes need to be considered.
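A minimal Python sketch of this operator, run on the slide's data (the function name is illustrative):

```python
def weighted_sum_topk(q, data, w, k):
    # Aggregation: weighted sum of per-attribute absolute differences
    def dist(d):
        return sum(wi * abs(di - qi) for wi, qi, di in zip(w, q, d))
    # Filter: sort by aggregate score and keep the best k
    return sorted(data, key=dist)[:k]

data = [(2, 6), (5, 6), (1, 4), (1, 3), (3, 3), (6, 3), (2, 2), (5, 1)]
top2 = weighted_sum_topk((3, 4), data, w=(1, 2), k=2)  # (1,4) and (3,3), score 2 each
```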
Max Top-k
Score: max{ |D[x] - Q[x]|, |D[y] - Q[y]| }; in general, max_i{ |D[i] - Q[i]| }

Query Q: (3, 4)

d((2,2)) = 2
d((1,3)) = 2
d((1,4)) = 2
d((5,1)) = 3
d((3,3)) = 1
d((6,3)) = 3
d((2,6)) = 2
d((5,6)) = 2

[Figure: the locus of equal score is an axis-aligned square]

Useful when the maximum dissimilarity needs to be bounded.
Min Top-k
Score: min{ |D[x] - Q[x]|, |D[y] - Q[y]| }; in general, min_i{ |D[i] - Q[i]| }

Query Q: (3, 4)

d((2,2)) = 1
d((1,3)) = 1
d((1,4)) = 0
d((5,1)) = 2
d((3,3)) = 0
d((6,3)) = 1
d((2,6)) = 1
d((5,6)) = 2

[Figure: the locus of equal score is a cross-shaped band along the axes through Q]

Useful when the best matching attribute is sufficient.
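Both operators differ from Weighted Sum Top-k only in the aggregation step. A sketch in Python on the slide's data (names are illustrative):

```python
def max_agg(q, d):
    # Max aggregation: score is the worst (largest) attribute difference
    return max(abs(di - qi) for qi, di in zip(q, d))

def min_agg(q, d):
    # Min aggregation: score is the best (smallest) attribute difference
    return min(abs(di - qi) for qi, di in zip(q, d))

data = [(2, 6), (5, 6), (1, 4), (1, 3), (3, 3), (6, 3), (2, 2), (5, 1)]
q = (3, 4)
max_top1 = min(data, key=lambda d: max_agg(q, d))   # best object under Max aggregation
min_scores = {d: min_agg(q, d) for d in data}       # (1,4) and (3,3) score a perfect 0
```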
Skyline
[Figure: each point's domination region extends away from the query; regions shown for (2,6), (1,4), (5,6) and (3,3)]

Query Q: (3, 4)
Data: (2,6), (5,6), (1,4), (1,3), (3,3), (6,3), (2,2), (5,1)

An object is said to dominate another if the latter is farther away from the query than the former on all dimensions (they can be equal on some dimensions, but not on all).

All objects that are not dominated by any other are output as results.
Results: (5,6), (2,6), (3,3), (1,4)

Useful when attribute scores cannot be aggregated.
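One way to formalize the per-point domination regions in the figure is the sketch below. It assumes a point dominates another only when it lies on the same side of the query on every dimension (so domination regions extend away from Q, as drawn); under that assumption it reproduces the slide's result set:

```python
def skyline(q, data):
    # p dominates d if, on every dimension, p lies on the same side of the
    # query as d and is no farther from it, and p != d (equality is allowed
    # on some dimensions, but not all).
    def dominates(p, d):
        if p == d:
            return False
        for qi, pi, di in zip(q, p, d):
            if (pi - qi) * (di - qi) < 0 or abs(pi - qi) > abs(di - qi):
                return False
        return True
    return [d for d in data if not any(dominates(p, d) for p in data)]

data = [(2, 6), (5, 6), (1, 4), (1, 3), (3, 3), (6, 3), (2, 2), (5, 1)]
result = skyline((3, 4), data)
```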
Range Query
L2 aggregation + threshold filter.

[Figure: a circle of radius r centred at the query Q:(3,4); data points (2,6), (5,6), (1,4), (1,3), (3,3), (6,3), (2,2), (5,1); points inside the circle qualify]

Decision criterion: sqrt( Σ_x (D[x] - Q[x])² ) ≤ r
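A direct Python rendering of the decision criterion on the slide's data (names are illustrative):

```python
import math

def range_query(q, data, r):
    # L2 aggregation + threshold filter: keep objects within distance r of q
    def l2(d):
        return math.sqrt(sum((di - qi) ** 2 for qi, di in zip(q, d)))
    return [d for d in data if l2(d) <= r]

data = [(2, 6), (5, 6), (1, 4), (1, 3), (3, 3), (6, 3), (2, 2), (5, 1)]
within = range_query((3, 4), data, r=2.0)  # only (1,4) and (3,3) fall inside
```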
Bounding Box
Null aggregation + threshold filter.

[Figure: an axis-aligned box with half-widths rx and ry centred at the query Q:(3,4); points inside the box qualify]

Decision criterion: ∀x, |D[x] - Q[x]| ≤ rx
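The per-attribute criterion translates directly (names are illustrative):

```python
def bounding_box(q, data, radii):
    # Null aggregation + per-attribute threshold filter: every attribute
    # must fall within its own bound.
    return [d for d in data
            if all(abs(di - qi) <= ri for qi, di, ri in zip(q, d, radii))]

data = [(2, 6), (5, 6), (1, 4), (1, 3), (3, 3), (6, 3), (2, 2), (5, 1)]
inside = bounding_box((3, 4), data, radii=(2, 2))
```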
K-N-Match
N-match aggregation + rank filter.

Query Q: (3, 4)
Data: (2,6), (4,5), (1,4), (1,3), (6,3), (2,2), (3,2), (5,1)

Per-attribute distances, and the same distances sorted per object (A1 = best-matching attribute, A2 = second best):

Data     DisX   DisY     A1     A2
(2,6)     1      2       X:1    Y:2
(4,5)     1      1       X:1    Y:1
(1,4)     2      0       Y:0    X:2
(1,3)     2      1       Y:1    X:2
(6,3)     3      1       Y:1    X:3
(2,2)     1      2       X:1    Y:2
(3,2)     0      2       X:0    Y:2
(5,1)     2      3       X:2    Y:3

The K-N-Match operator ranks objects based on the match on the N-th best matching attribute (N = 1 uses column A1; N = 2 uses column A2).

Useful when at least N attributes should match.
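A minimal sketch of the operator on the slide's data (names are illustrative):

```python
def k_n_match(q, data, n, k):
    # Score each object by its n-th smallest per-attribute difference,
    # then rank-filter: keep the k objects with the smallest such score.
    def nth_best(d):
        return sorted(abs(di - qi) for qi, di in zip(q, d))[n - 1]
    return sorted(data, key=nth_best)[:k]

data = [(2, 6), (4, 5), (1, 4), (1, 3), (6, 3), (2, 2), (3, 2), (5, 1)]
best = k_n_match((3, 4), data, n=2, k=1)  # (4,5): both attributes within 1
```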
Summarizing the Construction-based
Classification
Operator             Aggregation       Filter
Weighted Sum Top-k   Weighted Sum      Top-k
Max Top-k            Max               Top-k
Min Top-k            Min               Top-k
Skyline              NULL              Skyline
Range                L2                Threshold
Bounding Box         NULL              Threshold on each attribute
K-N-Match            Nth best match    Top-k
Property-based Classification
• Ordered vs. Unordered Output
  – Whether there is an ordering in the output result set
• Subset vs. All Attributes
  – Whether all attributes contribute to deciding membership in the result set
Ordered vs. Unordered Output
Applicable to selection/filter operators.

[Figure: a skyline query returns an unordered result set R, whereas Top-k returns a ranked list (1, 2, 3)]
Subset vs. All Attributes
Applicable to aggregation operators.

[Figure: the scoring-aggregation-filter pipeline again, with the aggregation step I(Q, D) highlighted]

We focus on the construction of I(Q, D) for this classification.
Some Example I(Q,D)s
All attributes needed:
• Weighted Sum: Σ_i w_i · S(Q,D)[i]
• Range Query: sqrt( Σ_i S(Q,D)[i]² )
• Bounding Box/Skyline: S(Q,D) itself (no aggregation)

"Some" attributes enough:
• Max: max_i{ S(Q,D)[i] }
• Min: min_i{ S(Q,D)[i] }
• K-N-Match: max_{r ∈ R} r, where R ⊆ { S(Q,D)[i] }, |R| = n, and R holds the n smallest scores (i.e., the n-th best match)
Classification Overview
Aggregation operators:
  – Subset of attributes: Min, Max, N-Match
  – All attributes: Weighted Sum, Lp Aggregation
Selection/filter operators:
  – Ordered: Top-k
  – Unordered: Range, Skyline, Bounding Box
“Add-on” Features for Similarity
Operators
• Indirection (Reverse Operators)
• Multiple Queries
• Diversity
• Visibility
• Subspaces
• Typed Data (Chromaticity)
Reverse Operators
• Range query: get me all the restaurants within 1 km of my home
  – This is a common consumer usage scenario
  – E.g., a user searching for restaurants to dine at
• Reverse range query: get me all the users for whom my restaurant is within 1 km
  – This is more of a service-provider question
  – E.g., finding potential consumers for targeted marketing
• This reversal can be applied to various operators
  – E.g., Reverse Skyline, Reverse kNN, …
Multiple Queries
[Figure: map showing Home, Club and Office, with restaurants/pubs scattered around them]

"I plan to leave from the office, go to the club and then get home. I need to get some dinner somewhere during this trip. Give me restaurants or pubs that are within 1 km of any of these three locations."

This corresponds to a range query using multiple query points. The merging operator here is OR, since we would be content with places that are close to any one of the queries.
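The OR-merge over multiple query points is a small extension of the single-query range query. A sketch with hypothetical coordinates (the slide's map positions are not given):

```python
import math

def multi_query_range(queries, data, r):
    # OR-merge: an object qualifies if it lies within r of ANY query point
    def l2(q, d):
        return math.sqrt(sum((di - qi) ** 2 for qi, di in zip(q, d)))
    return [d for d in data if any(l2(q, d) <= r for q in queries)]

# Hypothetical coordinates for office, club and home (not from the slide)
stops = [(0, 0), (4, 0), (2, 5)]
venues = [(0, 1), (4, 1), (2, 2), (9, 9)]
reachable = multi_query_range(stops, venues, r=1.0)
```

Replacing `any` with `all` would give the AND-merge variant, for places close to every stop.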
Diversity
[Figure: restaurants plotted by cost and rating; the 3 nearest neighbours cluster tightly together]

The logical 3 nearest neighbours aren't very diverse: they are very similar to each other. A diversity constraint ensures that the pairwise distance between any two results is lower-bounded, so a more diverse set is returned.
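One simple greedy way to enforce the pairwise lower bound is sketched below; this is an illustrative strategy, not necessarily the algorithm the slide has in mind:

```python
import math

def diverse_knn(q, data, k, min_gap):
    # Greedy sketch: visit candidates in order of distance to the query and
    # accept one only if it is at least min_gap away from every result so far.
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    picked = []
    for d in sorted(data, key=lambda cand: l2(q, cand)):
        if all(l2(d, p) >= min_gap for p in picked):
            picked.append(d)
            if len(picked) == k:
                break
    return picked

points = [(1, 1), (1.1, 1), (5, 5), (9, 1)]
result = diverse_knn((0, 0), points, k=2, min_gap=2.0)  # skips the near-duplicate (1.1, 1)
```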
Visibility Constraints
Return the k nearest neighbours that are visible from the query point.

[Figure: points d1..d8 around the query Q, with obstacles blocking the line of sight to some of them]

K = 3
kNN = {d4, d5, d6}
VkNN = {d4, d1, d2}
Subspaces: Subspace Range Search

Find objects within a threshold distance in a user-specified subset of dimensions.

[Figure: points d1..d8 plotted by Expense and Rating around the query Q]

Dimensions = {Expense, Rating}: R = {d4, d5, d6}
Dimensions = {Expense}: R = {d4, d5, d6}
Dimensions = {Rating}: R = {d1, d2, d4, d5, d6, d8}
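A minimal sketch with hypothetical (expense, rating) points, since the slide's d1..d8 coordinates are not recoverable from the figure. Here the threshold is applied per chosen dimension; an L2 distance restricted to the subspace would work similarly:

```python
def subspace_range(q, data, dims, r):
    # Threshold restricted to the chosen dimensions; attributes outside
    # `dims` are ignored entirely.
    return [d for d in data
            if all(abs(d[i] - q[i]) <= r for i in dims)]

# Hypothetical (expense, rating) points, not from the slide
data = [(1, 9), (5, 9), (9, 1)]
by_rating = subspace_range((5, 8), data, dims=[1], r=2)     # rating alone admits two points
by_both = subspace_range((5, 8), data, dims=[0, 1], r=2)    # adding expense prunes one
```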
Typed Data: Chromaticity
Find objects (of class A) that have the query object (of
class B) in its kNN result set
Example: people and
restaurants
Find bi-chromatic
rKNN set of a
restaurant
(p4)
(p3)
(p1)
(r1)
(p2)
RNN(r1) = {p2, r3}
(r2)
(r3)
(p5)
Bi-RNN(r1) = {p2, p1}
Bi-RNN(r3) = {p6}
(p6)
Two classes P and R. Query is from class R, results from class P
BRkNN (q, k , P, R)  { p  P q  kNN( p, k , R)}
33
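The definition translates almost literally into code. A sketch with hypothetical coordinates (the slide's p1..p6 and r1..r3 positions are not recoverable from the figure):

```python
import math

def bi_rknn(q, k, P, R):
    # BRkNN(q, k, P, R) = { p in P : q is among the k nearest R-objects of p }
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [p for p in P if q in sorted(R, key=lambda rr: l2(p, rr))[:k]]

# Hypothetical people/restaurant coordinates (not from the slide)
people = [(1, 0), (9, 0)]
restaurants = [(0, 0), (10, 0)]
customers = bi_rknn((0, 0), k=1, P=people, R=restaurants)  # only (1,0) has q as its nearest restaurant
```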
Summary of operators
Operator               Results     Attributes   Feature
Reverse Skyline        Unordered   All          Reverse
Multi-query kNN        Ordered     All          Multi-query
KNDN                   Ordered     All          Diversity
Visible kNN            Ordered     All          Visibility
Subspace Range Query   Ordered     Subset       Subspace
Bi-chromatic RkNN      Unordered   All          Reverse, Typed data
The Road Ahead
• The plethora of choices in each step leads to the large variety of similarity search operators
  – And keeps researchers busy
• Choices exist in:
  – Similarity measures
  – Aggregation operators
  – Selection/filter operators
  – Additional features
  – Algorithmic features
• Are we done yet?
Let us invent some new operators
N-Match-BB
• A Bounding Box query where at least N attribute bounds are satisfied
  – An adaptation of K-N-Match to bounding boxes
  – Unordered output over a subset of attributes

[Figure: two overlapping rectangles around the query Q; for 1-Match-BB, data points inside either rectangle qualify]
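A sketch of the proposed operator (names and data are illustrative):

```python
def n_match_bb(q, data, radii, n):
    # Keep an object if at least n of its attributes fall within their bounds
    def matches(d):
        return sum(abs(di - qi) <= ri
                   for qi, di, ri in zip(q, d, radii))
    return [d for d in data if matches(d) >= n]

data = [(2, 6), (5, 6), (1, 4), (6, 3), (5, 1)]
# 1-Match-BB: one satisfied bound is enough
loose = n_match_bb((3, 4), data, radii=(1, 1), n=1)
# For 2-D data, 2-Match-BB degenerates to the ordinary bounding box
strict = n_match_bb((3, 4), data, radii=(1, 1), n=2)
```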
Multi-Query Bichromatic
Reverse kNN
• A combination of:
  – Weighted Sum
  – Top-k filter
  – Reverse (indirection)
  – Multi-query
  – Chromaticity
• Example use case: of the three chosen locations for Café X (all three are intended to be opened), find people who would find at least one of these locations among their k closest cafés
Miscellaneous
• Revisiting algorithms on new platforms
  – Hadoop/MapReduce
• Interpretability in results
  – Can results of similarity search be presented so that the intuitive similarity between the query and the result is highlighted?
• Syntactic and semantic features
  – Understand the dichotomy between syntactic similarity (e.g., shape similarity) and semantic similarity (e.g., two images being similar because both are maps)
  – Would modelling them differently, and learning when to weigh each highly, lead to more effective similarity search?
• Contextual similarity; conditioning on user history
  – On searching for "IBM Watson", a travelling person should be shown IBM Watson Labs, whereas a technologist should be shown the IBM Watson system
“Similarity lies in the eyes of the beholder”*
Thank You!
Questions/Comments?
[email protected]
[email protected]
* Adapted from a well-known quote; via http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt