Efficient Approximate Search on String Collections
Part II
Marios Hadjieleftheriou
Chen Li
Overview
Sketch based algorithms
Compression
Selectivity estimation
Transformations/Synonyms
Selection Queries Using Sketch Based Algorithms
What is a Sketch
An approximate representation of the string
With size much smaller than the string
That can be used to upper bound similarity
(or lower bound the distance)
String s has sketch sig(s)
If Simsig(sig(s), sig(t)) < θ, prune t
And Simsig is much more efficient to compute
than the actual similarity of s and t
Using Sketches for Selection Queries
Naïve approach:
Scan all sketches and identify candidates
Verify candidates
Index sketches:
Inverted index
LSH (Gionis et al.)
The inverted sketch hash table (Chakrabarti et al.)
Known sketches
Prefix filter (CGK06): Jaccard, Cosine, Edit distance
Mismatch filter (XWL08): Edit distance
Minhash (BCFM98): Jaccard
PartEnum (AGK06): Hamming, Jaccard
Prefix Filter
Construction:
Let string s be a set of q-grams:
s = {q4, q1, q2, q7}
Sort the set (e.g., lexicographically):
s = {q1, q2, q4, q7}
Prefix sketch sig(s) = {q1, q2}
Use the sketch for filtering:
If |s ∩ t| ≥ θ then sig(s) ∩ sig(t) ≠ ∅
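As a concrete illustration, the construction and filter above can be sketched in a few lines of Python (an illustrative sketch, not the CGK06 implementation; function names are made up):

```python
def prefix_sketch(qgrams, theta):
    """Sort the q-gram set and keep the first |s| - theta + 1 elements.

    Pigeonhole: if |s ∩ t| >= theta, the prefixes of s and t must share
    at least one q-gram, so disjoint prefixes allow safe pruning.
    """
    ordered = sorted(qgrams)  # any fixed global order works
    return set(ordered[:len(ordered) - theta + 1])

def may_overlap(sig_s, sig_t):
    # prune the pair only when the prefix sketches are disjoint
    return bool(sig_s & sig_t)

s = {"q4", "q1", "q2", "q7"}
sig_s = prefix_sketch(s, theta=3)  # the slide's example: {"q1", "q2"}
```

Any global order on q-grams works; lexicographic order is just the simplest choice.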
Example
Sets of size 8:
s = {q1, q2, q4, q6, q8, q9, q10, q12}
t = {q1, q2, q5, q6, q8, q10, q12, q14}
(Figure: number line q1 … q15 with the elements of s and t marked.)
s ∩ t = {q1, q2, q6, q8, q10, q12}, |s ∩ t| = 6
For any subset t' ⊆ t of size 3: s ∩ t' ≠ ∅
In the worst case we choose q5, q14, and a third element, which must come from s ∩ t
In general: if |s ∩ t| ≥ θ, then every t' ⊆ t s.t. |t'| ≥ |s| - θ + 1 satisfies t' ∩ s ≠ ∅
Example continued
Instead of taking a subset of t, we sort and take prefixes from both s and t:
pf(s) = {q1, q2, q4}
pf(t) = {q1, q2, q5}
If |s ∩ t| ≥ 6 then pf(s) ∩ pf(t) ≠ ∅
Why is that true?
In the best case we are left with at most 5 matching elements beyond the elements in the sketch, so at least one of the 6 common elements must fall in both prefixes
Generalize to Weighted Sets
Example with weighted vectors:
(Figure: s and t drawn as weight vectors over q-grams q1 … q14; s has nonzero weights on q1, q2, q5, …, t on q2, q3, q5, …)
w(s ∩ t) ≥ θ   (where w(s ∩ t) = Σ_{q ∈ s ∩ t} w(q))
Sort by weights (not lexicographically anymore)
Keep prefix pf(s) s.t. w[pf(s)] ≥ w(s) - α
(Figure: s split into a prefix pf(s) of weight w(s) - α and a suffix sf(s) of weight α, i.e., Σ_{q ∈ sf(s)} w(q) = α)
Continued
Best case: w[sf(s) ∩ sf(t)] = α
w(s ∩ t) = w[pf(s) ∩ pf(t)] + w[sf(s) ∩ sf(t)]
In other words, the suffixes match perfectly
Consider the prefix and the suffix separately:
w(s ∩ t) ≥ θ ⇒
w[pf(s) ∩ pf(t)] + w[sf(s) ∩ sf(t)] ≥ θ ⇒
w[pf(s) ∩ pf(t)] ≥ θ - w[sf(s) ∩ sf(t)]
To avoid false negatives, minimize the rhs:
w[pf(s) ∩ pf(t)] ≥ θ - α
Properties
w[pf(s) ∩ pf(t)] ≥ θ – α
Hence we need α ≤ θ for every supported query threshold
Hence set α = θmin
Small θmin ⇒ long prefix ⇒ large sketch
(Figure: prefix pf(s) of weight w(s) - α, suffix sf(s) of weight α.)
For short strings, keep the whole string
Prefix sketches are easy to index:
Use an inverted index
How do I Choose α?
I need the test to involve only prefix intersections:
|pf(s) ∩ pf(t)| ≥ 1, i.e., w[pf(s) ∩ pf(t)] > 0
This follows from w[pf(s) ∩ pf(t)] ≥ θ - α by setting θ = α
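A minimal sketch of the weighted prefix construction, assuming q-grams are sorted by decreasing weight (function names are illustrative):

```python
def weighted_prefix(weights, alpha):
    """Sort q-grams by decreasing weight and keep the shortest prefix
    of weight >= w(s) - alpha; the discarded suffix then weighs <= alpha."""
    total = sum(weights.values())
    prefix, acc = set(), 0.0
    for q, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if acc >= total - alpha:
            break  # suffix weight is already down to <= alpha
        prefix.add(q)
        acc += w
    return prefix

pf = weighted_prefix({"q1": 5.0, "q2": 3.0, "q3": 1.0, "q4": 1.0}, alpha=2.0)
```

With α = 2 the two light q-grams (total weight 2 ≤ α) fall into the suffix and only the heavy ones survive in the prefix.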
Extend to Jaccard
Jaccard(s, t) = w(s ∩ t) / w(s ∪ t) ≥ θ
⇒ w(s ∩ t) ≥ θ · w(s ∪ t)
w(s ∪ t) = w(s) + w(t) – w(s ∩ t)
…
w[pf(s) ∩ pf(t)] ≥ β - w[sf(s) ∩ sf(t)]
β = θ / (1 + θ) · [w(s) + w(t)]
To avoid false negatives:
w[pf(s) ∩ pf(t)] ≥ β - α
Technicality
w[pf(s) ∩ pf(t)] ≥ β – α
β = θ / (1 + θ) · [w(s) + w(t)]
β depends on w(s), which is unknown at prefix construction time
Use length filtering:
θ · w(t) ≤ w(s) ≤ w(t) / θ
Extend to Edit Distance
Let string s be a set of q-grams:
s = {q11, q3, q67, q4}
Now the absolute position of q-grams
matters:
Sort the set (e.g., lexicographically) but maintain positional information:
s = {(q3, 2), (q4, 4), (q11, 1), (q67, 3)}
Prefix sketch sig(s) = {(q3, 2), (q4, 4)}
Edit Distance Continued
ed(s, t) ≤ θ:
Length filter: abs(|s| - |t|) ≤ θ
Position filter: common q-grams must have matching positions (within θ)
Count filter: s and t must have at least
β = [max(|s|, |t|) – q + 1] – q·θ
q-grams in common
s = “Hello” has 5 - 2 + 1 = 4 2-grams
One edit affects at most q q-grams
“Hello” with 1 edit: at most 2 2-grams are affected
Edit Distance Candidates
Boils down to:
1. Check the string lengths
2. Check the positions of matching q-grams
3. Check the intersection size: |s ∩ t| ≥ β
Very similar to Jaccard
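The three checks can be combined into one pairwise candidate test. This is an illustrative sketch; a real system would evaluate the count filter through the inverted index rather than a pairwise loop:

```python
def ed_candidate(s, t, q, theta):
    """Cheap pruning for ed(s, t) <= theta using positional q-grams."""
    if abs(len(s) - len(t)) > theta:
        return False                                   # 1. length filter
    grams_s = [(s[i:i + q], i) for i in range(len(s) - q + 1)]
    grams_t = [(t[i:i + q], i) for i in range(len(t) - q + 1)]
    beta = max(len(s), len(t)) - q + 1 - q * theta     # 3. count filter bound
    matched, used = 0, set()
    for g, i in grams_s:
        for jdx, (h, j) in enumerate(grams_t):
            # 2. position filter: positions must agree within theta
            if jdx not in used and g == h and abs(i - j) <= theta:
                matched += 1
                used.add(jdx)
                break
    return matched >= beta   # beta <= 0: very short strings trivially pass
```

For s = "Hello", t = "Hallo", q = 2, θ = 1: β = 2, and the shared positional 2-grams "ll" and "lo" satisfy the count filter, so the pair survives for verification.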
Constructing the Prefix
(Figure: the (|s| - q + 1) q-grams of s split into a prefix pf(s) and a suffix sf(s) of size α.)
A total of (|s| - q + 1) q-grams
|s ∩ t| ≥ max(|s|, |t|) – q + 1 – q·θ
…
|pf(s) ∩ pf(t)| ≥ β – α
β = max(|s|, |t|) – q + 1 – q·θ
Choosing α
|pf(s) ∩ pf(t)| ≥ β – α
β = max(|s|, |t|) – q + 1 – q·θ
Set α = β, so the condition becomes |pf(s) ∩ pf(t)| ≥ 1
|pf(s)| ≥ (|s| - q + 1) - α
⇒ |pf(s)| = q·θ + 1 q-grams
If ed(s, t) ≤ θ then pf(s) ∩ pf(t) ≠ ∅
Pros/Cons
Provides a loose bound
Too many candidates
Makes sense if strings are long
Easy to construct, easy to compare
Mismatch Filter
When dealing with edit distance:
Position of mismatching q-grams within pf(s),
pf(t) conveys a lot of information
Example:
Clustered edits:
s = “submit by Dec.”
t = “submit by Sep.”
4 mismatching 2-grams
2 edits can fix all of them
Non-clustered edits:
s = “sabmit be Set.”
t = “submit by Sep.”
6 mismatching 2-grams
Need 3 edits to fix them
Mismatch Filter Continued
What is the minimum number of edit operations that could have caused the mismatching q-grams between s and t?
This number is a lower bound on ed(s, t)
It equals the minimum number of edit operations it takes to destroy every mismatching q-gram
We can compute it using a greedy algorithm
We need to sort q-grams by position first (n log n)
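The greedy computation can be sketched as follows, assuming the input is the list of start positions of the mismatching q-grams (an illustration of the idea, not the Ed-Join code):

```python
def min_edits_to_destroy(mismatch_positions, q):
    """Greedy lower bound on ed(s, t).

    One edit at a single character destroys every q-gram whose span
    contains that character, so place each edit on the last character
    of the leftmost surviving q-gram: that hits all q-grams starting
    in [p, p + q - 1].
    """
    edits, covered_until = 0, -1
    for p in sorted(mismatch_positions):  # sort by start position first
        if p > covered_until:
            edits += 1
            covered_until = p + q - 1
    return edits
```

A clustered run of 4 mismatching 2-grams at positions 10..13 needs 2 edits; three separated pairs (e.g., at 1, 2, 8, 9, 11, 12) need 3, matching the two examples above.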
Mismatch Condition
Fourth edit distance pruning condition:
4. Mismatched q-grams in prefixes must be destroyable with at most θ edits
Pros/Cons
Much tighter bound
Expensive (sorting), but prefixes relatively
short
Needs long prefixes to make a difference
Minhash
So far we sort q-grams
What if we hash instead?
Minhash construction:
Given a string s = {q1, …, qm}
Use k functions h1, …, hk from an independent family of hash functions, hi: q → [0, 1]
Hash s k times; each time keep the q-gram that hashes to the smallest value
sig(s) = {argmin_q h1(q), argmin_q h2(q), …, argmin_q hk(q)}
How to use minhash
Example:
s = {q4, q1, q2, q7}
h1(s) = {0.01, 0.87, 0.003, 0.562}
h2(s) = {0.23, 0.15, 0.93, 0.62}
sig(s) = {0.003, 0.15}
Given two sketches sig(s), sig(t):
Jaccard(s, t) is estimated by the percentage of hash values in sig(s) and sig(t) that match
Probabilistic: (ε, δ)-guarantees ⇒ false negatives
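A minimal minhash sketch in Python; deriving the k hash functions from SHA-256 is an illustrative assumption, not something the slides prescribe:

```python
import hashlib

def h(i, qgram):
    # i-th hash function mapped into [0, 1) (illustrative construction)
    d = hashlib.sha256(f"{i}:{qgram}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2.0**64

def minhash_sig(qgrams, k):
    # keep, for each of the k hash functions, the smallest value over the set
    return [min(h(i, g) for g in qgrams) for i in range(k)]

def jaccard_estimate(sig_s, sig_t):
    # fraction of sketch coordinates on which the two signatures agree
    return sum(a == b for a, b in zip(sig_s, sig_t)) / len(sig_s)
```

Each coordinate matches with probability exactly Jaccard(s, t), which is why the match fraction is an unbiased estimator.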
Pros/Cons
Has false negatives
To drive errors down, the sketch has to be pretty large ⇒ suited to long strings
Will give meaningful estimates only if the actual similarity between two strings is large ⇒ good only for large θ
PartEnum
Lower bounds Hamming distance:
Jaccard(s, t) ≥ θ ⇒ H(s, t) ≤ 2|s| (1 – θ) / (1 + θ)
Partitioning strategy based on the pigeonhole principle:
Express strings as vectors
Partition vectors into θ + 1 partitions
If H(s, t) ≤ θ then at least one partition has Hamming distance zero
To boost accuracy, create all combinations of possible partitions
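The pigeonhole test can be sketched as follows; equal-width contiguous partitions are an illustrative choice, and PartEnum's actual partitioning and enumeration scheme is more involved:

```python
def partition_match(s_vec, t_vec, theta):
    """Pigeonhole filter: if H(s, t) <= theta, splitting the dimensions
    into theta + 1 partitions leaves at least one partition on which
    the two vectors agree exactly."""
    n = len(s_vec)
    bounds = [i * n // (theta + 1) for i in range(theta + 2)]
    return any(s_vec[bounds[i]:bounds[i + 1]] == t_vec[bounds[i]:bounds[i + 1]]
               for i in range(theta + 1))
```

A pair with Hamming distance ≤ θ always passes; a pair whose differences spread across every partition is pruned.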
Example
(Figure: the vector is cut into partitions sig1, sig2, …; enumeration produces combined signatures such as sig3; the final sketch is sig(s) = h(sig1) ∘ h(sig2) ∘ …)
Pros/Cons
Gives guarantees
Fairly large sketch
Hard to tune three parameters
Actual data affects performance
Compression
(BJL+09)
A Global Approach
For disk resident lists:
Cost of disk I/O vs Decompression tradeoff
Integer compression
Golomb, Delta coding
Sorting based on non-integer weights??
For main memory resident lists:
Lossless compression not useful
Design lossy schemes
Simple strategies
Discard lists:
Random, Longest, Cost-based
Discarding lists is a tug-of-war:
Reduces candidates: ones that appear only in the discarded lists disappear
Increases candidates: a looser threshold θ is needed to account for discarded lists
Combine lists:
Find similar lists and keep only their union
Combining Lists
Discovering candidates:
Lists with high Jaccard containment/similarity
Avoid multi-way Jaccard computation:
Use minhash to estimate Jaccard
Use LSH to discover clusters
Combining:
Use a cost-based algorithm driven by a query workload:
Size reduction
Query time reduction
When we meet both budgets we stop
General Observation
V-grams, sketches and compression use the
distribution of q-grams to optimize
Zipf distribution:
A small number of lists are very long
Those lists are fairly unimportant in terms of string similarity:
A q-gram is meaningless if it is contained in almost all strings
Selectivity Estimation for Selection Queries
The Problem
Estimate the number of strings with:
Edit distance smaller than θ
Cosine similarity higher than θ
Jaccard, Hamming, etc…
Issues:
Estimation accuracy
Size of estimator
Cost of estimation
Flavors
Edit distance:
Based on clustering (JL05)
Based on min-hash (MBK+07)
Based on wild-card q-grams (LNS07)
Cosine similarity:
Based on sampling (HYK+08)
Edit Distance
Problem:
Given query string s
Estimate the number of strings t ∈ D
Such that ed(s, t) ≤ θ
Clustering - Sepia
Partition strings using clustering:
Store per cluster histograms:
Number of strings within edit distance 0,1,…,θ from the cluster
center
Compute global dataset statistics:
Enables pruning of whole clusters
Use a training query set to compute frequency of data strings
within edit distance 0,1,…,θ from each query
Given query:
Use cluster centers, histograms and dataset statistics to estimate
selectivity
Minhash - VSol
We can use Minhash to:
Estimate Jaccard(s, t) = |s ∩ t| / |s ∪ t|
Estimate the size of a set |s|
Estimate the size of the union |s ∪ t|
VSol Estimator
Construct one inverted list per q-gram in D
and compute the minhash sketch of each list:
(Figure: inverted lists q1 → 1, 5, …; q2 → 3, 5, …; …; q10 → 1, 8, …, 14, 25, 43; each list is summarized by its minhash sketch.)
Selectivity Estimation
Use the edit distance count filter:
If ed(s, t) ≤ θ, then s and t share at least
β = max(|s|, |t|) - q + 1 – q·θ
q-grams
Given query t = {q1, …, qm}:
We have m inverted lists
Any string contained in the intersection of at least β of these lists passes the count filter
The answer is the size of the union of all non-empty β-intersections (there are m choose β intersections)
Example
θ = 2, q = 3, |t| = 14 ⇒ β = 6
(Figure: the inverted lists q1 → 1, 5, …; q2 → 3, 5, …; …; q14 → 1, 8, …, 14, 25, 43 for the q-grams of t.)
Look at all subsets of size 6:
A = |∪_{{i1, …, i6}} (t_{i1} ∩ t_{i2} ∩ … ∩ t_{i6})|
The m-β Similarity
We do not need to consider all subsets
individually
There is a closed form estimation
formula that uses minhash
Drawback:
Will overestimate results since many β-intersections produce duplicate ids
OptEQ – wild-card q-grams
Use extended q-grams:
Introduce wild-card symbol ‘?’
E.g., “ab?” can be:
“aba”, “abb”, “abc”, …
Build an extended q-gram table:
Extract all 1-grams, 2-grams, …, q-grams
Generalize to extended 2-grams, …, q-grams
Maintain an extended q-grams/frequency
hashtable
Example
Dataset strings: sigmod, vldb, icde, …

q-gram table:

  q-gram | Frequency
  ab     | 10
  bc     | 15
  de     | 4
  ef     | 1
  gh     | 21
  hi     | 2
  …      | …
  ?b     | 13
  a?     | 17
  ?c     | 23
  …      | …
  abc    | 5
  def    | 2
  …      | …
Assuming Replacements Only
Given query q=“abcd”
θ=2
There are 6 base strings:
“??cd”, “?b?d”, “?bc?”, “a??d”, “a?c?”, “ab??”
Query answer:
S1 = {s ∈ D: s matches “??cd”}, S2 = {s ∈ D: s matches “?b?d”}, S3 = {…}, …, S6 = {…}
A = |S1 ∪ S2 ∪ … ∪ S6| = Σ_{1 ≤ n ≤ 6} (-1)^{n-1} Σ_{i1 < … < in} |S_{i1} ∩ … ∩ S_{in}|
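On a toy dataset the base strings and the exact union size can be computed directly (illustrative code; OptEQ instead estimates this union from q-gram table frequencies):

```python
from itertools import combinations

def base_strings(query, theta):
    """All ways of masking theta positions of the query with '?'."""
    out = []
    for pos in combinations(range(len(query)), theta):
        chars = list(query)
        for p in pos:
            chars[p] = "?"
        out.append("".join(chars))
    return out

def matches(s, pattern):
    # '?' matches any single character; lengths must agree (replacements only)
    return len(s) == len(pattern) and all(p in ("?", c) for p, c in zip(pattern, s))

def answer_size(data, query, theta):
    # exact |S1 ∪ … ∪ Sn|: count strings matching at least one base string
    pats = base_strings(query, theta)
    return sum(any(matches(s, p) for p in pats) for s in data)
```

For query "abcd" with θ = 2 this reproduces exactly the six base strings listed above.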
Replacement Intersection Lattice
A = Σ_{1 ≤ n ≤ 6} (-1)^{n-1} Σ |S_{i1} ∩ … ∩ S_{in}|
Need to evaluate the size of all 2-intersections, 3-intersections, …, 6-intersections
Use frequencies from the q-gram table to compute the sum A
Exponential number of intersections
But ... there is well-defined structure
Replacement Lattice
Build the replacement lattice:
2 ‘?’: ??cd, ?b?d, ?bc?, a??d, a?c?, ab??
1 ‘?’: ?bcd, a?cd, ab?d, abc?
0 ‘?’: abcd
Many intersections are empty
Others produce the same results:
We need to count everything only once
General Formulas
Similar reasoning applies for:
r replacements
d deletions
Other combinations are difficult:
Multiple insertions
Combinations of insertions/replacements
But … we can generate the corresponding
lattice algorithmically!
Expensive but possible
Hashed Sampling
Used to estimate selectivity of TF/IDF, BM25,
DICE
Main idea:
Take a sample of the inverted index
Simply answer the query on the sample and scale up the result
Has high variance
We can do better than that
Visual Example
(Figure: inverted lists q1 → 1, 5, …; q2 → 3, 5, …; …; qn → 1, 8, …, 14, 25, 43 are reduced to sampled inverted lists; the query is answered on the sample and scaled up.)
Construction
Draw samples deterministically:
Use a hash function h: N → [0, 100]
Keep ids that hash to values smaller than φ
This is called a bottom-k sketch
Invariant:
If a given id is sampled in one list, it will always be sampled in all other lists that contain it
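The deterministic sampling step can be sketched as follows (the SHA-256-based hash is an illustrative stand-in for h):

```python
import hashlib

def hash_to_100(idx):
    # deterministic hash of an id into [0, 100)
    d = hashlib.sha256(str(idx).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2.0**64 * 100

def sample_list(inverted_list, phi):
    """Keep exactly the ids hashing below phi. Because the decision
    depends only on the id, an id sampled in one list is sampled in
    every list that contains it (the invariant above)."""
    return [i for i in inverted_list if hash_to_100(i) < phi]
```

Whether id 5 survives is a property of id 5 alone, not of the list it sits in; that is the whole point of hashing instead of random sampling.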
Example
Any similarity function can be computed correctly using the sample:
Not true for simple random sampling
(Figure: id 5 appears in both lists q1 and q2; random sampling may keep it in the sample of one list but not the other, while hashed sampling keeps it in both samples or in neither.)
Selectivity Estimation
Any union of sampled lists is a φ% random sample
Given query t = {q1, …, qm}:
A = |As| · |q1 ∪ … ∪ qm| / |qs1 ∪ … ∪ qsm|
As is the query answer size from the sample
The fraction is the actual scale-up factor
But there are duplicates in these unions! We need to know:
The distinct number of ids in q1 ∪ … ∪ qm
The distinct number of ids in qs1 ∪ … ∪ qsm
Count Distinct
Distinct |qs1 ∪ … ∪ qsm| is easy:
Scan the sampled lists
Distinct |q1 ∪ … ∪ qm| is hard:
Scanning the lists is the same as computing the exact answer to the query … naively
We are lucky:
Each sampled list doubles as a bottom-k sketch by construction!
We can use the list samples to estimate the distinct |q1 ∪ … ∪ qm|
The Bottom-k Sketch
It is used to estimate the distinct size of arbitrary set unions (serving the same purpose as the FM sketch):
Take a hash function h: N → [0, 100]
Hash each element of the set
The r-th smallest hash value hr yields an unbiased estimator of the distinct count:
(Figure: r hash values fall in [0, hr]; extrapolating that density over the full range [0, 100] gives the distinct-count estimate.)
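A minimal sketch of the estimator, assuming the same hash into [0, 100); the (r - 1) · 100 / h_r form is one standard unbiased variant of the order-statistic estimator:

```python
import hashlib

def hash_to_100(x):
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2.0**64 * 100  # into [0, 100)

def distinct_estimate(ids, r):
    """If fewer than r distinct hash values are seen, count exactly;
    otherwise extrapolate: roughly (r - 1) distinct ids fall in every
    h_r units of hash space."""
    hs = sorted({hash_to_100(i) for i in ids})
    if len(hs) < r:
        return float(len(hs))
    return (r - 1) * 100.0 / hs[r - 1]
```

Duplicates collapse automatically because equal ids hash to the same value, which is exactly what makes the sketch safe for unions of overlapping lists.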
Transformations/Synonyms
(ACGK08)
60
Transformations
No similarity function can be cognizant of
domain-dependent variations
Transformation rules should be provided
using a declarative framework
We derive different rules for different domains:
Addresses, names, affiliations, etc.
Rules have knowledge of internal structure:
Address: Department, School, Road, City, State
Name: Prefix, First name, Middle name, Last name
Observations
Variations are orthogonal to each other
Variations have general structure
Dept. of Computer Science, Stanford University,
California
We can combine any variation of the three
components and get the same affiliation
We can use a simple generative rule to generate
variations of all addresses
Variations are specific to a particular entity
California, CA
Need to incorporate external knowledge
Augmented Generative Grammar
G: a set of rules, predicates, and actions

  Rule                                        | Predicate    | Action
  <name> → <prefix> <first> <middle> <last>   |              | first; last
  <name> → <last>, <first> <middle>           |              | first; last
  <first> → <letter>.                         |              | letter
  <first> → F                                 | F in Fnames  | F

Variable F ranges over a fixed set of values, for example all names in the database
Given an input record r and G, we derive “clean” variations of r (there might be many)
Efficiency
We do not need to generate and store all
record transformations
Generate a combined sketch for all variations (the
union of sketches)
Transformations have high overlap, hence the
sketch will be small
Generate a derived grammar on the fly
Replace all variables with constants from the
database that are related to r (e.g., they are
substrings of r or similar to r)
Conclusion
Conclusion
Approximate selection queries have very
important applications
Not supported very well in current systems
(think of Google Suggest)
Work on approximate selections has matured
greatly within the past 5 years
Expect wide adoption soon!
Thank you!
References
[AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav
Kaushik. VLDB 2006
[ACGK08] Incorporating string transformations in record matching. Arvind Arasu,
Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008
[BK02] Adaptive intersection and t-threshold problems. Jérémy Barbay, Claire
Kenyon. SODA 2002
[BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String
Search. Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu. ICDE 2009
[BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar,
Alan M. Frieze, Michael Mitzenmacher. STOC 1998
[CGG+05] Data cleaning in Microsoft SQL Server 2005. Surajit Chaudhuri, Kris
Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis.
SIGMOD 2005
[CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit
Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE 2006
[CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik
Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD 2008
[HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios
Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008
References
[HYK+08] Hashed samples: selectivity estimators for set similarity selection queries.
Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008.
[JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin,
Chen Li. VLDB 2005.
[JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen
Li, and Jianhua Feng. WWW 2009
[JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large
Databases. Liang Jin, Chen Li, Rares Vernica. VLDB Journal 2008
[KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita
Sarawagi, Divesh Srivastava. SIGMOD 2006.
[LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches.
Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008.
[LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit
Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007
[LWY07] VGRAM: Improving Performance of Approximate Queries on String
Collections Using Variable-Length Grams, Chen Li, Bin Wang, and Xiaochun Yang.
VLDB 2007
[MBK+07] Estimating the selectivity of approximate string queries. Arturas Mazeika,
Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007
References
[SK04] Efficient set joins on similarity predicates. Sunita Sarawagi, Alok Kirpal.
SIGMOD 2004
[XWL08] Ed-Join: an efficient algorithm for similarity joins with edit distance
constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008
[XWL+08] Efficient similarity joins for near duplicate detection. Chuan Xiao, Wei
Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008
[YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to
Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, and Chen Li.
SIGMOD 2008