Efficient Approximate Search on String Collections, Part II

Marios Hadjieleftheriou
Chen Li
Donald Bren School of Information and Computer Sciences
Overview

- Sketch-based algorithms
- Compression
- Selectivity estimation
- Transformations/Synonyms
Selection Queries Using Sketch-Based Algorithms

What is a Sketch?

- An approximate representation of the string
  - With size much smaller than the string
  - That can be used to upper bound the similarity (or lower bound the distance)
- String s has sketch sig(s)
  - If Simsig(sig(s), sig(t)) < θ, prune t
  - Simsig is much more efficient to compute than the actual similarity of s and t
Using Sketches for Selection Queries

- Naïve approach:
  - Scan all sketches and identify candidates
  - Verify candidates
- Index sketches:
  - Inverted index
  - LSH (Gionis et al.)
  - The inverted sketch hash table (Chakrabarti et al.)
Known Sketches

- Prefix filter (CGK06): Jaccard, Cosine, Edit distance
- Mismatch filter (XWL08): Edit distance
- Minhash (BCFM98): Jaccard
- PartEnum (AGK06): Hamming, Jaccard
Prefix Filter

- Construction:
  - Let string s be a set of q-grams: s = {q4, q1, q2, q7}
  - Sort the set (e.g., lexicographically): s = {q1, q2, q4, q7}
  - Prefix sketch sig(s) = {q1, q2}
- Use the sketch for filtering:
  - If |s ∩ t| ≥ θ then sig(s) ∩ sig(t) ≠ ∅
Example

- Sets of size 8:
  - s = {q1, q2, q4, q6, q8, q9, q10, q12}
  - t = {q1, q2, q5, q6, q8, q10, q12, q14}
- s ∩ t = {q1, q2, q6, q8, q10, q12}, |s ∩ t| = 6
- For any subset t' ⊆ t of size 3: s ∩ t' ≠ ∅
  - In the worst case we choose q5, q14, and one more element, which must be a common one
- In general: if |s ∩ t| ≥ θ, then for every t' ⊆ t with |t'| ≥ |s| - θ + 1, t' ∩ s ≠ ∅
Example Continued

- Instead of taking a subset of t, we sort and take prefixes of size |s| - θ + 1 = 3 from both s and t:
  - pf(s) = {q1, q2, q4}
  - pf(t) = {q1, q2, q5}
- If |s ∩ t| ≥ 6 then pf(s) ∩ pf(t) ≠ ∅
- Why is that true?
  - In the best case, at most 5 matching elements exist beyond the elements in the two prefixes, so at least one match must fall inside the prefixes
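The unweighted filter above can be sketched in a few lines of Python; the helper names and the zero-padded q-gram labels are illustrative assumptions, not the original notation.

```python
# A minimal sketch of the prefix filter for set overlap; theta is the
# required overlap |s ∩ t|. Helper names are illustrative.

def prefix(s, theta):
    """Keep the first |s| - theta + 1 elements of the globally sorted set."""
    items = sorted(s)
    return set(items[:max(len(items) - theta + 1, 0)])

def may_match(s, t, theta):
    """False only if the prefix filter proves |s ∩ t| < theta."""
    return bool(prefix(s, theta) & prefix(t, theta))

# The sets from the example above (zero-padded so lexicographic order
# matches numeric order):
s = {"q01", "q02", "q04", "q06", "q08", "q09", "q10", "q12"}
t = {"q01", "q02", "q05", "q06", "q08", "q10", "q12", "q14"}
# theta = 6: pf(s) = {q01, q02, q04}, pf(t) = {q01, q02, q05}; they
# overlap, so t survives filtering and goes on to exact verification.
```

Candidates that survive the sketch test are then verified against the full sets.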
Generalize to Weighted Sets

- Example with weighted vectors s and t over q-grams q1, …, q14, with weights w1 ≥ w2 ≥ … ≥ w14
- Threshold on the weight of the intersection: w(s ∩ t) ≥ θ, where w(s ∩ t) = Σ_{q ∈ s ∩ t} w(q)
- Sort by weights (not lexicographically anymore)
- Keep the prefix pf(s) s.t. w[pf(s)] ≥ w(s) - α
  - Equivalently, the suffix sf(s) satisfies Σ_{q ∈ sf(s)} w(q) = α
Continued

- Best case: w[sf(s) ∩ sf(t)] = α
  - In other words, the suffixes match perfectly
- Consider the prefix and the suffix separately:
  - w(s ∩ t) = w[pf(s) ∩ pf(t)] + w[sf(s) ∩ sf(t)]
- w(s ∩ t) ≥ θ ⇒
  w[pf(s) ∩ pf(t)] + w[sf(s) ∩ sf(t)] ≥ θ ⇒
  w[pf(s) ∩ pf(t)] ≥ θ - w[sf(s) ∩ sf(t)]
- To avoid false negatives, minimize the right-hand side (w[sf(s) ∩ sf(t)] ≤ α):
  - Keep any t with w[pf(s) ∩ pf(t)] ≥ θ - α
Properties

- w[pf(s) ∩ pf(t)] ≥ θ – α
  - For the filter to be useful we need α ≤ θ
  - Hence α = θmin, the smallest threshold we want to support
  - Small θmin ⇒ long prefix ⇒ large sketch
- For short strings, keep the whole string
- Prefix sketches are easy to index
  - Use an inverted index
How do I Choose α?

- I need pf(s) ∩ pf(t) ≠ ∅, i.e., w[pf(s) ∩ pf(t)] > 0
- Setting α = θ makes the guaranteed bound θ - α = 0, so every true answer has a nonempty prefix intersection
Extend to Jaccard

- Jaccard(s, t) = w(s ∩ t) / w(s ∪ t) ≥ θ ⇒
  w(s ∩ t) ≥ θ w(s ∪ t)
- w(s ∪ t) = w(s) + w(t) – w(s ∩ t) ⇒
  w(s ∩ t) ≥ β, where β = θ / (1 + θ) [w(s) + w(t)]
- As before: w[pf(s) ∩ pf(t)] ≥ β - w[sf(s) ∩ sf(t)]
- To avoid false negatives: w[pf(s) ∩ pf(t)] ≥ β - α
Technicality

- w[pf(s) ∩ pf(t)] ≥ β – α, with β = θ / (1 + θ) [w(s) + w(t)]
- β depends on w(s), which is unknown at prefix construction time
- Use length filtering:
  - θ w(t) ≤ w(s) ≤ w(t) / θ
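The threshold rewrite and the length filter above can be captured directly; a minimal sketch, assuming the total set weights w(s), w(t) are precomputed (function names are illustrative):

```python
# The two necessary conditions for Jaccard(s, t) >= theta, as derived
# above. Illustrative sketch; ws = w(s), wt = w(t) are total weights.

def jaccard_intersection_bound(ws, wt, theta):
    """Minimum required w(s ∩ t): beta = theta / (1 + theta) * (ws + wt)."""
    return theta / (1.0 + theta) * (ws + wt)

def passes_length_filter(ws, wt, theta):
    """Necessary condition: theta * w(t) <= w(s) <= w(t) / theta."""
    return theta * wt <= ws <= wt / theta
```

A sanity check on the bound: for identical sets (ws = wt = 10, θ = 1) the required intersection weight is the full 10, as expected.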
Extend to Edit Distance

- Let string s be a set of q-grams: s = {q11, q3, q67, q4}
- Now the absolute position of q-grams matters:
  - Sort the set (e.g., lexicographically) but maintain positional information:
    s = {(q3, 2), (q4, 4), (q11, 1), (q67, 3)}
  - Prefix sketch sig(s) = {(q3, 2), (q4, 4)}
Edit Distance Continued

- ed(s, t) ≤ θ implies:
  - Length filter: abs(|s| - |t|) ≤ θ
  - Position filter: common q-grams must have matching positions (within θ)
  - Count filter: s and t must have at least β = [max(|s|, |t|) – q + 1] – qθ q-grams in common
- Intuition behind the count filter:
  - s = "Hello" has 5 - 2 + 1 = 4 2-grams
  - One edit affects at most q q-grams; for "Hello", 1 edit affects at most 2 2-grams
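The three filters can be sketched together; this is an illustrative pairwise implementation (a real system would evaluate the filters through the inverted index instead):

```python
# The length, position, and count filters above, as one check.
# Illustrative sketch; names are not from the original papers.

def qgrams(s, q):
    """Positional q-grams of s: (gram, start position) pairs."""
    return [(s[i:i + q], i) for i in range(len(s) - q + 1)]

def passes_filters(s, t, q, theta):
    # 1. Length filter.
    if abs(len(s) - len(t)) > theta:
        return False
    # 2 + 3. Position and count filters: count common q-grams whose
    # positions differ by at most theta, each one matched at most once.
    gt = qgrams(t, q)
    used = [False] * len(gt)
    matches = 0
    for g, i in qgrams(s, q):
        for k, (h, j) in enumerate(gt):
            if not used[k] and g == h and abs(i - j) <= theta:
                used[k] = True
                matches += 1
                break
    beta = (max(len(s), len(t)) - q + 1) - q * theta
    return matches >= beta
```

Strings passing all three checks are candidates; the actual edit distance is computed only for them.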
Edit Distance Candidates

- Boils down to:
  1. Check the string lengths
  2. Check the positions of matching q-grams
  3. Check the intersection size: |s ∩ t| ≥ β
- Very similar to Jaccard
Constructing the Prefix

- A total of (|s| - q + 1) q-grams; pf(s) keeps (|s| - q + 1) - α of them, sf(s) the remaining α
- Count filter: |s ∩ t| ≥ β, where β = max(|s|, |t|) – q + 1 – qθ
- As before: |pf(s) ∩ pf(t)| ≥ β – α
Choosing α

- |pf(s) ∩ pf(t)| ≥ β – α, with β = max(|s|, |t|) – q + 1 – qθ
- Set α = β - 1, so the guaranteed prefix overlap is at least 1:
  - |pf(s)| = (|s| - q + 1) - α = qθ + 1 q-grams (when |s| ≥ |t|)
  - If ed(s, t) ≤ θ then pf(s) ∩ pf(t) ≠ ∅
Pros/Cons

- Provides a loose bound
  - Too many candidates
  - Makes sense if strings are long
- Easy to construct, easy to compare
Mismatch Filter

- When dealing with edit distance:
  - The positions of mismatching q-grams within pf(s), pf(t) convey a lot of information
- Example:
  - Clustered edits:
    s = "submit by Dec.", t = "submit by Sep."
    4 mismatching 2-grams; 2 edits can fix all of them
  - Non-clustered edits:
    s = "sabmit be Set.", t = "submit by Sep."
    6 mismatching 2-grams; need 3 edits to fix them
Mismatch Filter Continued

- What is the minimum number of edit operations that could cause the mismatching q-grams between s and t?
  - This number is a lower bound on ed(s, t)
  - It equals the minimum number of edit operations it takes to destroy every mismatching q-gram
  - We can compute it with a greedy algorithm
  - We need to sort q-grams by position first (O(n log n))
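The greedy computation can be viewed as an interval point cover: each mismatching q-gram starting at position i spans [i, i+q-1], and one edit placed anywhere inside the span destroys it. A minimal sketch, assuming the mismatching start positions are already known:

```python
# Greedy lower bound from mismatching q-grams, a sketch. Each
# mismatching q-gram starting at i spans [i, i + q - 1]; one edit at a
# position inside the span destroys it, so the bound is a minimum point
# cover of the spans (greedy is optimal here because the equal-length
# spans, sorted by start, are also sorted by right endpoint).

def mismatch_lower_bound(positions, q):
    """Minimum number of edits needed to touch every mismatching q-gram."""
    edits = 0
    last_edit = -1  # rightmost position where an edit was placed
    for i in sorted(positions):
        if last_edit < i:            # span [i, i + q - 1] not yet covered
            last_edit = i + q - 1    # place the edit as far right as possible
            edits += 1
    return edits
```

Four consecutive mismatching 2-grams (the clustered example) need only 2 edits; three far-apart ones need 3, matching the examples above.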
Mismatch Condition

- Fourth edit distance pruning condition:
  4. Mismatched q-grams in the prefixes must be destroyable with at most θ edits
Pros/Cons

- Much tighter bound
- Expensive (sorting), but prefixes are relatively short
- Needs long prefixes to make a difference
Minhash

- So far we sort q-grams
  - What if we hash instead?
- Minhash construction:
  - Given a string s = {q1, …, qm}
  - Use k functions h1, …, hk from an independent family of hash functions, hi: q → [0, 1]
  - Hash s, k times, and each time keep the q-gram that hashes to the smallest value
  - sig(s) = {argmin_q h1(q), …, argmin_q hk(q)}
How to Use Minhash

- Example:
  - s = {q4, q1, q2, q7}
  - h1(s) = {0.01, 0.87, 0.003, 0.562}
  - h2(s) = {0.23, 0.15, 0.93, 0.62}
  - sig(s) = {0.003, 0.15}
- Given two sketches sig(s), sig(t):
  - Jaccard(s, t) is estimated as the percentage of hash values in sig(s) and sig(t) that match
- Probabilistic: (ε, δ)-guarantees ⇒ false negatives
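A minimal minhash sketch in Python; salted SHA-256 stands in for a truly min-wise independent hash family, which is an illustrative assumption:

```python
# Minhash sketch and Jaccard estimation, a sketch. Salted SHA-256
# plays the role of the independent hash family h1, ..., hk.
import hashlib

def h(q, i):
    """i-th hash function, mapping a q-gram to [0, 1)."""
    d = hashlib.sha256(f"{i}:{q}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def minhash(s, k):
    """Sketch: the smallest hash value under each of the k functions."""
    return [min(h(q, i) for q in s) for i in range(k)]

def jaccard_estimate(sig_s, sig_t):
    """Fraction of sketch positions on which the two sketches agree."""
    return sum(a == b for a, b in zip(sig_s, sig_t)) / len(sig_s)

s = set("abcdefgh")
t = set("abcdefxy")          # true Jaccard(s, t) = 6/10 = 0.6
est = jaccard_estimate(minhash(s, 200), minhash(t, 200))
```

Two sketches agree at a position exactly when a common element achieves the minimum under that function, which happens with probability equal to the Jaccard similarity; the agreement fraction is therefore an unbiased estimate.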
Pros/Cons

- Has false negatives
- To drive errors down, the sketch has to be pretty large
  - Suitable only for long strings
- Gives meaningful estimates only if the actual similarity between two strings is large
  - Good only for large θ
PartEnum

- Lower bounds Hamming distance:
  - Jaccard(s, t) ≥ θ ⇒ H(s, t) ≤ 2|s| (1 – θ) / (1 + θ)
- Partitioning strategy based on the pigeonhole principle:
  - Express strings as vectors
  - Partition vectors into θ + 1 partitions
  - If H(s, t) ≤ θ then at least one partition has Hamming distance zero
- To boost accuracy, create all combinations of possible partitions
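The pigeonhole step can be sketched on binary vectors; the split into θ+1 contiguous chunks below is an illustrative choice (PartEnum itself uses a specific two-level partition-and-enumerate scheme):

```python
# The pigeonhole step behind PartEnum, a sketch: split a vector into
# theta + 1 contiguous chunks; if H(s, t) <= theta, the theta edits
# cannot touch every chunk, so at least one chunk matches exactly.

def partitions(v, theta):
    """Split vector v into theta + 1 roughly equal contiguous chunks."""
    n, k = len(v), theta + 1
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [tuple(v[bounds[i]:bounds[i + 1]]) for i in range(k)]

def may_match(s, t, theta):
    """True unless every chunk differs (which forces H(s, t) > theta)."""
    return any(ps == pt for ps, pt in zip(partitions(s, theta),
                                          partitions(t, theta)))
```

Each chunk can then be hashed and indexed, so candidates are found by equality lookups on chunk signatures rather than by scanning.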
Example

- Partition the vector, then enumerate combinations of partitions (sig1, sig2, sig3, …) and hash each one:
  - sig(s) = {h(sig1), h(sig2), …}
Pros/Cons

- Gives guarantees
- Fairly large sketch
- Hard to tune three parameters
  - Actual data affects performance
Compression (BJL+09)
A Global Approach

- For disk-resident lists:
  - Tradeoff between disk I/O cost and decompression cost
  - Integer compression (Golomb, delta coding)
  - But how do we handle sorting based on non-integer weights?
- For main-memory-resident lists:
  - Lossless compression is not useful
  - Design lossy schemes
Simple Strategies

- Discard lists:
  - Random, longest, cost-based
  - Discarding lists is a tug-of-war:
    - Reduces candidates: those that appear only in the discarded lists disappear
    - Increases candidates: a looser threshold θ is needed to account for discarded lists
- Combine lists:
  - Find similar lists and keep only their union
Combining Lists

- Discovering candidates:
  - Lists with high Jaccard containment/similarity
  - Avoid multi-way Jaccard computation:
    - Use minhash to estimate Jaccard
    - Use LSH to discover clusters
- Combining:
  - Use a cost-based algorithm driven by a query workload:
    - Size reduction
    - Query time reduction
  - Stop when both budgets are met
General Observation

- VGRAMs, sketches, and compression all exploit the distribution of q-grams to optimize:
  - Zipf distribution
  - A small number of lists are very long
  - Those lists are fairly unimportant in terms of string similarity
    - A q-gram is meaningless if it is contained in almost all strings
Selectivity Estimation for Selection Queries

The Problem

- Estimate the number of strings with:
  - Edit distance smaller than θ
  - Cosine similarity higher than θ
  - Jaccard, Hamming, etc.
- Issues:
  - Estimation accuracy
  - Size of the estimator
  - Cost of estimation
Flavors

- Edit distance:
  - Based on clustering (JL05)
  - Based on minhash (MBK+07)
  - Based on wild-card q-grams (LNS07)
- Cosine similarity:
  - Based on sampling (HYK+08)
Edit Distance

- Problem:
  - Given query string s
  - Estimate the number of strings t ∈ D such that ed(s, t) ≤ θ
Clustering - Sepia

- Partition strings using clustering
- Store per-cluster histograms:
  - Number of strings within edit distance 0, 1, …, θ of the cluster center
  - Enables pruning of whole clusters
- Compute global dataset statistics:
  - Use a training query set to compute the frequency of data strings within edit distance 0, 1, …, θ of each query
- Given a query:
  - Use cluster centers, histograms, and dataset statistics to estimate selectivity
Minhash - VSol

- We can use minhash to:
  - Estimate Jaccard(s, t) = |s ∩ t| / |s ∪ t|
  - Estimate the size of a set |s|
  - Estimate the size of the union |s ∪ t|
VSol Estimator

- Construct one inverted list per q-gram in D and compute the minhash sketch of each list
  - (Figure: inverted lists for q1, q2, …, q10 holding string ids, each with its minhash sketch)
Selectivity Estimation

- Use the edit distance count filter:
  - If ed(s, t) ≤ θ, then s and t share at least β = max(|s|, |t|) - q + 1 – qθ q-grams
- Given query t = {q1, …, qm}:
  - We have m inverted lists
  - Any string contained in the intersection of at least β of these lists passes the count filter
  - The answer is the size of the union of all non-empty β-intersections (there are m choose β intersections)
Example

- θ = 2, q = 3, |t| = 14 ⇒ β = 6
  - (Figure: the inverted lists of t's q-grams q1, …, q14, each holding string ids)
- Look at all subsets of size 6:
  - A = |∪_{{i1, …, i6}} (t_{i1} ∩ t_{i2} ∩ … ∩ t_{i6})|, over all size-6 subsets of the lists
The m-β Similarity

- We do not need to consider all subsets individually
  - There is a closed-form estimation formula that uses minhash
- Drawback:
  - Will overestimate results, since many β-intersections produce duplicates
OptEQ – Wild-Card q-grams

- Use extended q-grams:
  - Introduce the wild-card symbol '?'
  - E.g., "ab?" can be "aba", "abb", "abc", …
- Build an extended q-gram table:
  - Extract all 1-grams, 2-grams, …, q-grams
  - Generalize to extended 2-grams, …, q-grams
  - Maintain an extended q-gram/frequency hash table
Example

- Dataset strings: sigmod, vldb, icde, …
- q-gram table:

  q-gram | frequency
  ab     | 10
  bc     | 15
  de     | 4
  ef     | 1
  gh     | 21
  hi     | 2
  …      | …
  ?b     | 13
  a?     | 17
  ?c     | 23
  …      | …
  abc    | 5
  def    | 2
  …      | …
Assuming Replacements Only

- Given query q = "abcd" and θ = 2
- There are 6 base strings:
  - "??cd", "?b?d", "?bc?", "a??d", "a?c?", "ab??"
- Query answer:
  - S1 = {s ∈ D: s matches "??cd"}, S2 = {s ∈ D: s matches "?b?d"}, S3 = {…}, …, S6 = {…}
  - By inclusion-exclusion:
    A = |S1 ∪ S2 ∪ … ∪ S6| = Σ_{n=1}^{6} (-1)^{n-1} Σ_{i1 < … < in} |S_{i1} ∩ … ∩ S_{in}|
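Enumerating the base strings is a direct application of choosing θ positions to wildcard; the matching helper and the toy dataset below are illustrative additions:

```python
# Base strings for the replacements-only case, a sketch: with
# threshold theta, wildcard every choice of theta positions.
from itertools import combinations

def base_strings(query, theta):
    out = []
    for positions in combinations(range(len(query)), theta):
        chars = list(query)
        for p in positions:
            chars[p] = "?"
        out.append("".join(chars))
    return out

def matches(s, pattern):
    """True iff s matches the wildcard pattern position by position."""
    return len(s) == len(pattern) and all(p in ("?", c)
                                          for p, c in zip(pattern, s))

data = ["abcd", "abce", "xbcd", "axcd", "wxyz"]
# Strings within 2 replacements of "abcd" = union of base-string matches.
answer = {s for s in data if any(matches(s, b)
                                 for b in base_strings("abcd", 2))}
```

OptEQ avoids scanning the data like this: it evaluates the same union through the inclusion-exclusion sum, reading only frequencies from the extended q-gram table.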
Replacement Intersection Lattice

- A = Σ_{n=1}^{6} (-1)^{n-1} Σ_{i1 < … < in} |S_{i1} ∩ … ∩ S_{in}|
- Need to evaluate the size of all 2-intersections, 3-intersections, …, 6-intersections
- Use frequencies from the q-gram table to compute the sum A
- Exponential number of intersections
- But there is a well-defined structure
Replacement Lattice

- Build the replacement lattice:
  - 2 '?': ??cd, ?b?d, ?bc?, a??d, a?c?, ab??
  - 1 '?': ?bcd, a?cd, ab?d, abc?
  - 0 '?': abcd
- Many intersections are empty
- Others produce the same results
  - We need to count everything only once
General Formulas

- Similar reasoning works for:
  - r replacements
  - d deletions
- Other combinations are difficult:
  - Multiple insertions
  - Combinations of insertions/replacements
- But we can generate the corresponding lattice algorithmically!
  - Expensive but possible
Hashed Sampling

- Used to estimate the selectivity of TF/IDF, BM25, DICE
- Main idea:
  - Take a sample of the inverted index
  - Simply answer the query on the sample and scale up the result
    - Has high variance
    - We can do better than that
Visual Example

- (Figure: inverted lists for q1, q2, …, qn are sampled; the query is answered on the sampled lists and the result is scaled up)
Construction

- Draw samples deterministically:
  - Use a hash function h: N → [0, 100]
  - Keep ids that hash to values smaller than φ
  - This is called a bottom-k sketch
- Invariant:
  - If a given id is sampled in one list, it will always be sampled in every other list that contains it
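A minimal sketch of this construction; SHA-256 stands in for the hash function h, an assumption for illustration:

```python
# Deterministic hashed sampling, a sketch: an id is kept iff its hash
# value falls below phi, so the same ids are sampled from every list
# that contains them. SHA-256 plays the role of h: N -> [0, 100).
import hashlib

def hval(sid):
    """Hash an id to a value in [0, 100)."""
    d = hashlib.sha256(str(sid).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64 * 100

def sample(inverted_list, phi):
    """Keep the ids hashing below phi (a phi% expected sample)."""
    return [sid for sid in inverted_list if hval(sid) < phi]
```

Because membership in the sample depends only on the id, any two sampled lists agree on exactly which shared ids they keep, which is the invariant above.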
Example

- Any similarity function can be computed correctly on the sample
  - Not true for simple random sampling
- (Figure: the same lists q1, q2 sampled randomly vs. by hashing; random sampling keeps different ids in each list, while hashed sampling keeps a consistent set of ids across all lists)
Selectivity Estimation

- Any union of sampled lists is a φ% random sample
- Given query t = {q1, …, qm}:
  - A = |As| · |q1 ∪ … ∪ qm| / |qs1 ∪ … ∪ qsm|, where:
    - As is the query answer computed from the sample
    - The fraction is the actual scale-up factor
  - But there are duplicates in these unions! We need to know:
    - The distinct number of ids in q1 ∪ … ∪ qm
    - The distinct number of ids in qs1 ∪ … ∪ qsm
Count Distinct

- Distinct |qs1 ∪ … ∪ qsm| is easy:
  - Scan the sampled lists
- Distinct |q1 ∪ … ∪ qm| is hard:
  - Scanning the lists is the same as computing the exact answer to the query … naively
- We are lucky:
  - Each sampled list doubles as a bottom-k sketch by construction!
  - We can use the list samples to estimate the distinct |q1 ∪ … ∪ qm|
The Bottom-k Sketch

- Used to estimate the distinct size of arbitrary set unions (much like the FM sketch):
  - Take a hash function h: N → [0, 100]
  - Hash each element of the set
  - The r-th smallest hash value h_r is the basis of an unbiased estimator of the count distinct: with D distinct elements, h_r lands near r · 100 / D
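A sketch of the estimator, assuming hashes uniform on [0, 100): the k-th smallest hash value h_k of D distinct elements sits near k·100/D, giving the classic KMV estimate D ≈ (k-1)·100/h_k. SHA-256 again stands in for the hash function.

```python
# Bottom-k (KMV) distinct-count estimation, a sketch.
import hashlib

def hval(x):
    """Hash a value to [0, 100)."""
    d = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64 * 100

def distinct_estimate(values, k):
    """Estimate the number of distinct values from the k-th smallest hash."""
    hs = sorted({hval(v) for v in values})
    if len(hs) < k:
        return len(hs)               # fewer than k distinct values: exact
    return (k - 1) * 100 / hs[k - 1]
```

Since the hash of an element is the same in every list, the bottom-k sketch of a union of lists is obtained by merging the per-list sketches, which is exactly how the distinct size of q1 ∪ … ∪ qm is estimated above.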
Transformations/Synonyms (ACGK08)

Transformations

- No similarity function can be cognizant of domain-dependent variations
- Transformation rules should be provided using a declarative framework
  - We derive different rules for different domains
    - Addresses, names, affiliations, etc.
  - Rules have knowledge of internal structure
    - Address: Department, School, Road, City, State
    - Name: Prefix, First name, Middle name, Last name
Observations

- Variations are orthogonal to each other
  - Dept. of Computer Science, Stanford University, California
  - We can combine any variation of the three components and get the same affiliation
- Variations have general structure
  - We can use a simple generative rule to generate variations of all addresses
- Variations are specific to a particular entity
  - California, CA
  - Need to incorporate external knowledge
Augmented Generative Grammar

- G: a set of rules, predicates, and actions

  Rule                                      | Predicate   | Action
  <name> → <prefix> <first> <middle> <last> |             | first; last
  <name> → <last>, <first> <middle>         |             | first; last
  <first> → <letter>.                       |             | letter
  <first> → F                               | F in Fnames | F

- Variable F ranges over a fixed set of values
  - For example, all names in the database
- Given an input record r and G, we derive "clean" variations of r (there might be many)
Efficiency

- We do not need to generate and store all record transformations
  - Generate a combined sketch for all variations (the union of the sketches)
  - Transformations have high overlap, hence the sketch will be small
- Generate a derived grammar on the fly
  - Replace all variables with constants from the database that are related to r (e.g., they are substrings of r or similar to r)
Conclusion

- Approximate selection queries have very important applications
- Not supported very well in current systems (think of Google Suggest)
- Work on approximate selections has matured greatly within the past 5 years
- Expect wide adoption soon!

Thank you!
References

- [AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. VLDB 2006
- [ACGK08] Incorporating String Transformations in Record Matching. Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008
- [BK02] Adaptive Intersection and t-Threshold Problems. Jérémy Barbay, Claire Kenyon. SODA 2002
- [BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu. ICDE 2009
- [BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. STOC 1998
- [CGG+05] Data Cleaning in Microsoft SQL Server 2005. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005
- [CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE 2006
- [CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD 2008
- [HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008
- [HYK+08] Hashed Samples: Selectivity Estimators for Set Similarity Selection Queries. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008
- [JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin, Chen Li. VLDB 2005
- [JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen Li, Jianhua Feng. WWW 2009
- [JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDB Journal 2008
- [KSS06] Record Linkage: Similarity Measures and Algorithms. Nick Koudas, Sunita Sarawagi, Divesh Srivastava. SIGMOD 2006
- [LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu, Yiming Lu. ICDE 2008
- [LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007
- [LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. Chen Li, Bin Wang, Xiaochun Yang. VLDB 2007
- [MBK+07] Estimating the Selectivity of Approximate String Queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007
- [SK04] Efficient Set Joins on Similarity Predicates. Sunita Sarawagi, Alok Kirpal. SIGMOD 2004
- [XWL08] Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008
- [XWL+08] Efficient Similarity Joins for Near Duplicate Detection. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008
- [YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li. SIGMOD 2008