Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order
Edgar Chávez
Karina Figueroa
Gonzalo Navarro
Universidad Michoacana, Mexico
Universidad de Chile, Chile
Content
1. About the problem
2. Basic concepts
3. Previous work
4. Our technique
5. Experiments
6. Conclusion and future work
Proximity Searching
Huge Database
Expensive distance
• Exact searching is not possible
Applications
• Information retrieval
• Classification
• People finder through the web
• Clustering
• Currently used on
– Classification of spider webs
– Face recognition on the Chilean Web
Problems (metric spaces)
Huge databases
Extraction of characteristics
High dimension
Complex objects
Limited memory
Index
Terminology
• Properties
– Symmetry
– Strict positiveness
– Triangle inequality
• Queries
– Range query
– K nearest neighbor
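The two query types can be illustrated with a brute-force baseline that compares the query against every element. This is only a sketch for reference: the Euclidean distance, the function names and the toy database are assumptions made for illustration and do not come from the slides.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance, one example of a metric (symmetric,
    strictly positive, satisfies the triangle inequality)."""
    return math.dist(a, b)


def range_query(db, q, r, dist=euclidean):
    """Return every element of db within distance r of the query q."""
    return [u for u in db if dist(q, u) <= r]


def knn_query(db, q, k, dist=euclidean):
    """Return the k elements of db closest to q, nearest first."""
    return heapq.nsmallest(k, db, key=lambda u: dist(q, u))


# Hypothetical toy database of 2-D points.
db = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
q = (0.0, 0.0)
in_range = range_query(db, q, 5.0)
nearest = knn_query(db, q, 2)
```

Both functions evaluate the distance once per database element, which is exactly the cost an index tries to avoid when the distance is expensive.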
Previous work
• Pivot based
[Figure labels: pivot, distance, q]
• Partition based
[Figure labels: q, center]
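The slides only name the pivot-based family, so the following is a generic sketch of how such indexes usually filter candidates, not the specific algorithm used in the comparison. It relies on the triangle inequality: for any pivot p, |d(q,p) - d(u,p)| is a lower bound on d(q,u). The function names and the toy data are assumptions.

```python
import math


def build_pivot_table(db, pivots, dist=math.dist):
    """Precompute the distance from every element to every pivot."""
    return [[dist(u, p) for p in pivots] for u in db]


def pivot_range_query(db, pivots, table, q, r, dist=math.dist):
    """Range query that discards elements using the pivot lower bound.

    Returns the matching elements and the number of database elements
    that were actually compared against q."""
    dq = [dist(q, p) for p in pivots]
    result, compared = [], 0
    for u, row in zip(db, table):
        # If any pivot proves d(q, u) > r, skip u without computing d(q, u).
        if any(abs(a - b) > r for a, b in zip(dq, row)):
            continue
        compared += 1
        if dist(q, u) <= r:
            result.append(u)
    return result, compared


# Hypothetical toy data: four points and a single pivot.
db = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
pivots = [(0.0, 0.0)]
table = build_pivot_table(db, pivots)
result, compared = pivot_range_query(db, pivots, table, (1.0, 1.0), 1.5)
```

In this toy run two of the four elements are discarded by the pivot alone, so only two distance evaluations against the database are needed.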
Our technique
[Figure labels: Permutation; permutants p1, p2, p4, p6, p5, p3; element u]
Our technique
• Exact matching elements have the same permutation
• Similar elements should have similar permutations (our conjecture)
• Spearman Footrule metric
– Measures the similarity of the permutations
– Promising elements first
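A minimal sketch of how the permutation of an element might be computed, assuming it is the list of permutants ordered by increasing distance to that element. The function name, the tie-breaking rule (lower index first) and the sample points are assumptions for illustration.

```python
import math


def permutation(u, permutants, dist=math.dist):
    """Return the permutant indices sorted by increasing distance to u.

    Ties are broken by the lower index so the result is deterministic."""
    return sorted(range(len(permutants)),
                  key=lambda i: (dist(u, permutants[i]), i))


# Hypothetical permutants in the plane.
permutants = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
perm_a = permutation((1.0, 0.0), permutants)   # closest to permutant 0
perm_b = permutation((9.0, 1.0), permutants)   # closest to permutant 1
perm_c = permutation((1.0, 0.0), permutants)   # same element as perm_a
```

Identical elements yield identical permutations, which matches the first bullet above; the second bullet is the conjecture that nearby elements yield nearby permutations.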
Spearman Footrule metric
Example
Difference of positions: 3-1, 6-2, 3-2, 4-1, 5-5, 6-4
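The Spearman Footrule distance between two permutations of the same items is the sum, over all items, of the absolute difference between the item's positions in the two permutations. The sketch below uses a hypothetical function name and, as inputs, permutations of three permutants like those shown on the searching-process slides.

```python
def spearman_footrule(perm_a, perm_b):
    """Sum of absolute position differences between two permutations
    of the same set of items."""
    pos_b = {item: i for i, item in enumerate(perm_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(perm_a))


same = spearman_footrule(["p3", "p1", "p2"], ["p3", "p1", "p2"])
far = spearman_footrule(["p3", "p1", "p2"], ["p2", "p1", "p3"])
near = spearman_footrule(["p2", "p3", "p1"], ["p2", "p1", "p3"])
```

Here `far` is larger than `near` because p3 and p2 swap the two extreme positions in the first pair, whereas in the second pair only two adjacent permutants are exchanged.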
Searching process (part 1)
Preprocessing time
Permutants: p1, p2, p3
[Figure: each database element is labeled with its permutation of the permutants: p3,p1,p2; p2,p1,p3; p2,p3,p1; p3,p2,p1]
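The preprocessing step can be sketched as follows, assuming the index simply stores one permutation per database element and that a permutation lists the permutants by increasing distance. All names and the sample points are hypothetical.

```python
import math


def build_permutation_index(db, permutants, dist=math.dist):
    """Store, for each database element, the permutant indices ordered
    by increasing distance to that element (ties: lower index first)."""
    index = []
    for u in db:
        order = sorted(range(len(permutants)),
                       key=lambda i: (dist(u, permutants[i]), i))
        index.append(order)
    return index


# Hypothetical data: three permutants and three database elements.
permutants = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
db = [(1.0, 0.0), (9.0, 1.0), (1.0, 9.0)]
index = build_permutation_index(db, permutants)
```

With k permutants the index needs k small integers per element, and the original distances to the permutants do not have to be kept.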
Searching process (part 2)
Query time
Sorting elements by Spearman Footrule metric
[Figure: the query q, the permutants p1, p2, p3, and the database elements listed by permutation (p3,p1,p2; p2,p1,p3; p2,p3,p1; p3,p2,p1; ...)]
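The query step can be sketched as below: compute the permutation of q, sort the database by Spearman Footrule distance to it, and compare q only against the most promising fraction. The `fraction` parameter, the function names and the toy data are assumptions; the slides do not specify how the cut-off is chosen.

```python
import math


def permutation(u, permutants, dist=math.dist):
    """Permutant indices by increasing distance to u (ties: lower index)."""
    return sorted(range(len(permutants)),
                  key=lambda i: (dist(u, permutants[i]), i))


def footrule(perm_a, perm_b):
    """Spearman Footrule distance between two permutations."""
    pos_b = {item: i for i, item in enumerate(perm_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(perm_a))


def approximate_knn(db, index, permutants, q, k, fraction, dist=math.dist):
    """Approximate k-NN: examine only the elements whose permutations
    are most similar to the permutation of q."""
    q_perm = permutation(q, permutants, dist)
    # Promising elements first: smallest Footrule distance to q's permutation.
    order = sorted(range(len(db)), key=lambda j: footrule(index[j], q_perm))
    budget = max(k, int(fraction * len(db)))
    candidates = [db[j] for j in order[:budget]]
    # Only the candidates are compared with the real (expensive) distance.
    return sorted(candidates, key=lambda u: dist(q, u))[:k]


# Hypothetical data.
permutants = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
db = [(1.0, 0.0), (9.0, 1.0), (1.0, 9.0), (2.0, 1.0)]
index = [permutation(u, permutants) for u in db]
answer = approximate_knn(db, index, permutants, (2.0, 0.5), k=1, fraction=0.5)
```

The result is probabilistic: an element whose permutation happens to differ from that of q can be missed when the budget is small, which is the trade-off between the percentage retrieved and the percentage of the database compared.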
Experiments
[Plot: % retrieved]
• 93% retrieved, comparing 10% of database
• Pivot based algorithm: retrieved 48%
• 90% retrieved, comparing 60% of database
Experiments
[Plot: % retrieved]
• 100% retrieved, comparing 15% of database
• 100% retrieved, comparing 90% of database
How good is our prediction?
Dimension 256, using 256 pivots
[Plot: retrieved vs. percentage of the database compared. Annotations: "Metric algorithms are using one of them"; "Similarities between permutations"; "Almost the same value"]
Conclusion
• A new probabilistic algorithm for proximity searching in metric spaces.
• Our technique is based on permutations.
• Close elements will have similar permutations.
• This technique is the fastest known algorithm for high dimensions.
• Permutations are good predictors.
Future Work
• Can non-metric spaces be tackled with this technique?
• An approximate all-K-nearest-neighbor algorithm.
• Improving other metric indexes.
Thank you