Indexing Text Data under Space Constraints
Bijit Hore, Hakan Hacigumus, Bala Iyer, Sharad Mehrotra
1
Introduction
We want to design an efficient indexing technique to support
pattern matching queries over string data
We focus on LIKE queries in SQL:
SELECT * FROM R WHERE R.A LIKE 'dat_';
SELECT * FROM R WHERE R.A LIKE 'd%';
Contribution:
A q-gram based index for efficient pattern matching
q-gram: any string of q symbols over the alphabet Σ
2
The basic approach
Given (initially):
Q: A typical workload of query patterns
R: The set of record strings to be indexed
Generic approach to evaluate LIKE queries:
1. Generate a set of candidate grams G
2. Select an “appropriate” set I ⊆ G
3. Create an index using I, where every g ∈ I
has a pointer to r ∈ R iff r ∋ g
4. Given a query q, get all g ∈ I with q ∋ g and return the
intersection of their pointer lists
5. Discard the false positives from the returned list
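Steps 3 to 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`like_to_regex`, `build_index`, `evaluate`) are hypothetical, pointer lists are modelled as in-memory sets of record ids, and matching is case-sensitive.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern ('%' = any run, '_' = one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def build_index(records, grams):
    """Step 3: every gram g in I points to the ids of the records containing g."""
    return {g: {i for i, r in enumerate(records) if g in r} for g in grams}

def evaluate(pattern, records, index):
    """Steps 4-5: intersect pointer lists, then discard false positives."""
    # A gram is usable only if it lies inside one literal piece of the pattern.
    literals = [s for s in re.split(r"[%_]", pattern) if s]
    usable = [g for g in index if any(g in lit for lit in literals)]
    candidates = set(range(len(records)))
    for g in usable:
        candidates &= index[g]
    # Verification pass: keep only the records that really match the pattern.
    rx = like_to_regex(pattern)
    return [records[i] for i in sorted(candidates) if rx.fullmatch(records[i])]
```

When no indexed gram occurs in the pattern, the candidate set stays at all of R and the verification pass does all the work, which is why the choice of I matters.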
3
Research Issues
1. Generating an appropriate set of candidate grams
G relevant to workload Q
*2. Choosing an optimal index set I ⊆ G that minimizes
the # false positives over Q
3. Data structures / Query processing methodology
4
Outline
Introduction
Optimal Gram Selection
Complexity and optimizations
1. A parallel algorithm for gram selection
2. Workload reduction
Experiments & Conclusion
5
Visualizing the Q-G-R relations
[Figure: tripartite graph linking records R, candidate grams G, and queries Q]
Records R:
r1: San Jose
r2: Los Angeles, International
r3: John F Kennedy, New York
r4: La Guardia, New York
r5: Newark
r6: San Francisco
r7: Oakland
Queries Q:
q1: _an%
q2: %York%
q3: San Franc%
q4: %kane%
Candidate grams G:
g1: an
g2: or
g3: ne
Benefit(g) = (# queries ∋ g) × (# records ∌ g)
Gram “ne” covers the pairs {(q4,r1), (q4,r2), (q4,r6), (q4,r7)}
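The benefit of a gram on this example can be checked with a short sketch. The helper names `cover` and `benefit` are hypothetical, and the comparison is assumed to be case-insensitive, since that is what reproduces the four pairs listed for “ne”.

```python
records = {
    "r1": "San Jose",
    "r2": "Los Angeles, International",
    "r3": "John F Kennedy, New York",
    "r4": "La Guardia, New York",
    "r5": "Newark",
    "r6": "San Francisco",
    "r7": "Oakland",
}
queries = {"q1": "_an%", "q2": "%York%", "q3": "San Franc%", "q4": "%kane%"}

def cover(gram, queries, records):
    """All (q, r) pairs such that q contains the gram and r does not."""
    return {
        (q, r)
        for q, pattern in queries.items() if gram in pattern.lower()
        for r, text in records.items() if gram not in text.lower()
    }

def benefit(gram, queries, records):
    """Benefit(g) = (# queries containing g) x (# records not containing g)."""
    n_queries = sum(gram in pattern.lower() for pattern in queries.values())
    n_records = sum(gram not in text.lower() for text in records.values())
    return n_queries * n_records

print(sorted(cover("ne", queries, records)))
print(benefit("ne", queries, records))
```

Each covered pair is a record that the gram lets the index rule out for that query without reading the record.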
6
Optimal gram selection for index
Given:
3 sets Q, G, R (workload, candidates, records)
Weight function weight(q): Q → ℜ
Cost function cost(g): G → ℜ
Budget constraint M
Define a map cover(g): G → 2^(Q × R),
the set of all (q,r) pairs s.t. q ∋ g and r ∌ g
(weight(q,r) = weight(q))
Formal definition:
BestIndex(Q,R,G,M) = Imax ⊆ G, where
weight(Imax) is maximized over all I ⊆ G &
cost(Imax) ≤ M
7
Benefit of a gram
Benefit(g) = (# queries ∋ g) × (# records ∌ g)
A greedy heuristic for top-k grams that does not work:
Include the k grams with the largest individual
benefits in I
Cause:
A gram g* might have high individual benefit
BUT
In presence of other grams in I, g* might not prune any
new records for any of the queries ∋ g*
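A small illustrative coverage table makes the failure concrete. In the sketch below the column indices 0 to 16 stand for (q,r) pairs and each gram is given as the set of pairs it covers; the data and the helper `covered` are assumptions chosen for illustration.

```python
from itertools import combinations

# Pairs covered by each candidate gram (indices stand for (q, r) pairs).
cover = {
    "g1": {0, 1, 2, 8, 9, 10, 12, 13, 14},
    "g2": {3, 4, 5, 6, 7},
    "g3": {0, 1, 2, 11, 13, 14, 15, 16},
}

def covered(grams):
    """Union of the pairs covered by a collection of grams."""
    return set().union(*(cover[g] for g in grams))

# Top-k by individual benefit: sort by number of covered pairs, keep k = 2.
top2 = sorted(cover, key=lambda g: len(cover[g]), reverse=True)[:2]

# Exhaustive search over all pairs of grams, feasible only on toy inputs.
best2 = max(combinations(cover, 2), key=lambda pair: len(covered(pair)))

print(top2, len(covered(top2)))
print(best2, len(covered(best2)))
```

Here g3 has the second-highest individual benefit, but most of what it covers is already covered by g1, so pairing g1 with the individually weaker g2 covers more pairs.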
8
NP-hardness & an approximation algorithm
BestIndex problem is NP-Hard (reduction from set cover)
Define:
benefit(g, I) = total weight of new (q,r) pairs covered
by g (not already covered by some gram in I)
$\mathrm{utility}(g, I) = \dfrac{\mathrm{benefit}(g, I)}{\mathrm{cost}(g)}$
Heuristic: in every iteration add the gram with the
highest utility, until the cost budget is exhausted
The greedy heuristic gives a $0.5\,(1 - \frac{1}{e})$-approximation of the optimum [2]
9
BestIndex algorithm: Example
Example: choose top 2 grams for the G-(Q,R) matrix shown below :
G \ (q,r) | q1,r3 q1,r4 q1,r5 q2,r1 q2,r2 q2,r5 q2,r6 q2,r7 q3,r3 q3,r4 q3,r5 q4,r1 q4,r3 q4,r4 q4,r5 q4,r6 q4,r7
g1        |   1     1     1     0     0     0     0     0     1     1     1     0     1     1     1     0     0
g2        |   0     0     0     1     1     1     1     1     0     0     0     0     0     0     0     0     0
g3        |   1     1     1     0     0     0     0     0     0     0     0     1     0     1     1     1     1
• First iteration: compute utility of all candidates
• cost(g1) = 3; cost(g2) = 2; cost(g3) = 3
• utility(g1) = 9/3 = 3; utility(g2) = 5/2 = 2.5; utility(g3) = 8/3 ≈ 2.67
• Top gram is g1 (utility(g1) > utility(g3) > utility(g2))
• Second iteration:
• utilities change: g3 now covers only 3 new pairs, (q4,r1), (q4,r6) and (q4,r7), so utility(g2) = 5/2 > utility(g3) = 3/3 ⇒ choose g2
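The two iterations can be replayed with a minimal sketch of the utility-driven greedy heuristic. The function name `greedy_best_index`, the unit weight per (q,r) pair, and the budget of 6 (chosen so that two grams fit but not all three) are assumptions for illustration; this is the plain heuristic only, not the full procedure analysed in [2].

```python
# One row per gram over the 17 (q, r) columns of the example, in column order.
rows = {
    "g1": "11100000111011100",
    "g2": "00011111000000000",
    "g3": "11100000000101111",
}
cover = {g: {i for i, bit in enumerate(row) if bit == "1"} for g, row in rows.items()}
cost = {"g1": 3, "g2": 2, "g3": 3}

def greedy_best_index(cover, cost, budget):
    """Repeatedly add the affordable gram with the highest utility."""
    chosen, covered, spent = [], set(), 0
    while True:
        best, best_utility = None, 0.0
        for g, pairs in cover.items():
            if g in chosen or spent + cost[g] > budget:
                continue
            # benefit(g, I) counts only the pairs not yet covered by chosen grams.
            utility = len(pairs - covered) / cost[g]
            if utility > best_utility:
                best, best_utility = g, utility
        if best is None:  # nothing affordable adds new coverage
            return chosen
        chosen.append(best)
        covered |= cover[best]
        spent += cost[best]

print(greedy_best_index(cover, cost, budget=6))
```

Recomputing `pairs - covered` in every round is what makes the utilities change between iterations.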
10
Compact representation of G-(Q,R) relations
Instead of storing the full G-(Q,R) matrix S, store two smaller bit matrices M1 and M2 and derive the entries of S on demand:
S[g][(q,r)] = (~M1[g][r]) & M2[g][q]
M1 (G × R): M1[g][r] = 1 iff record r contains gram g
G \ R | r1 r2 r3 r4 r5 r6 r7
g1    |  1  0  0  0  0  1  1
g2    |  0  0  1  1  0  0  0
g3    |  0  0  1  1  1  0  0
M2 (G × Q): M2[g][q] = 1 iff query q contains gram g
G \ Q | q1 q2 q3 q4
g1    |  1  0  1  1
g2    |  0  1  0  0
g3    |  0  0  0  1
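The formula S[g][(q,r)] = (~M1[g][r]) & M2[g][q] can be exercised directly. The sketch below uses the M1 and M2 entries as read from the slide, with 0-based indices for queries and records; the helper names `covers` and `cover_set` are hypothetical.

```python
# M1[g][r] = 1 iff record r contains gram g (columns r1..r7).
M1 = {
    "g1": [1, 0, 0, 0, 0, 1, 1],
    "g2": [0, 0, 1, 1, 0, 0, 0],
    "g3": [0, 0, 1, 1, 1, 0, 0],
}
# M2[g][q] = 1 iff query q contains gram g (columns q1..q4).
M2 = {
    "g1": [1, 0, 1, 1],
    "g2": [0, 1, 0, 0],
    "g3": [0, 0, 0, 1],
}

def covers(g, q, r):
    """S[g][(q, r)]: gram g is in query q and not in record r."""
    return (1 - M1[g][r]) & M2[g][q]

def cover_set(g):
    """All (q, r) index pairs covered by gram g, derived on demand."""
    n_queries, n_records = len(M2[g]), len(M1[g])
    return {(q, r) for q in range(n_queries) for r in range(n_records) if covers(g, q, r)}

print(sorted(cover_set("g3")))
```

For this toy instance the two matrices hold 3 × (7 + 4) = 33 bits, against 3 × 7 × 4 = 84 bits for a fully materialised S.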
11
Complexity of an iteration
G-(Q,R) matrix after g1 is selected (the g3 row now marks only the pairs not already covered):
G \ (q,r) | q1,r3 q1,r4 q1,r5 q2,r1 q2,r2 q2,r5 q2,r6 q2,r7 q3,r3 q3,r4 q3,r5 q4,r1 q4,r3 q4,r4 q4,r5 q4,r6 q4,r7
g1        |   1     1     1     0     0     0     0     0     1     1     1     0     1     1     1     0     0
g2        |   0     0     0     1     1     1     1     1     0     0     0     0     0     0     0     0     0
g3        |   0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     1     1
Complexity of a single iteration: O(|Q|*|R|*|G|) !
12
Complexity of the naïve algorithm
Time complexity (worst case) = O(|I|*|R|*|Q|*|G|)
for choosing a set I of indexing grams
Space complexity = O(|G|*|Q| + |G|*|R|)
for the matrices M1 and M2
The naïve algorithm scales poorly with problem size
Explore the following optimization approaches:
Pre-processing: pruning, auxiliary data structures
Parallelization
Workload compression
14
Parallelizing the BestIndex algorithm
[Figure sequence: the Q-G-R graph (records r1-r7, grams g1-g3, queries q1-q4) is split into per-query sub-problems]
Sub-problem shown for q2: gram g2; records r3, r4
Sub-problem shown for q3: gram g1; records r1, r2, r6, r7
Sub-problem shown for q4: grams g1, g3; records r1-r7
Complexity of each gram-selection iteration reduces from
O(|Q|*|R|*|G|) to O(|Q|*|R|)
18
Workload reduction
Parallel algorithm complexity: O(|I|*|Q|*|R|) (worst case)
Explore ways of reducing the workload Q while trying
to minimize loss of quality (similar to [4])
Our approach:
1. Define appropriate distance measures between
queries
2. Use k-median clustering to form k query clusters
3. Fold all queries in a cluster onto the median query
4. These k medians form the compressed workload Q’
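The four steps can be sketched with a simple alternating k-median (k-medoids style) loop. Everything named here is an assumption for illustration: `jaccard_distance` over character 2-grams merely stands in for the distance measures, `compress_workload` is a hypothetical helper, and folding is modelled by giving each median a weight equal to the number of queries in its cluster.

```python
def bigrams(s):
    """Set of character 2-grams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard_distance(a, b):
    """Stand-in distance: 1 - |common 2-grams| / |all 2-grams|."""
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    return 1.0 - len(ga & gb) / len(union) if union else 0.0

def compress_workload(queries, k, dist, max_rounds=20):
    """Cluster queries around k medians; return {median: folded weight}."""
    medians = list(dict.fromkeys(queries))[:k]  # naive start: first k distinct queries
    for _ in range(max_rounds):
        # Assignment step: each query joins its nearest median.
        clusters = {m: [] for m in medians}
        for q in queries:
            nearest = min(medians, key=lambda m: dist(q, m))
            clusters[nearest].append(q)
        # Update step: the new median minimises the total distance within its cluster.
        new_medians = [
            min(members, key=lambda c: sum(dist(q, c) for q in members))
            for members in clusters.values() if members
        ]
        if set(new_medians) == set(medians):
            break
        medians = new_medians
    return {m: len(members) for m, members in clusters.items() if members}

print(compress_workload(["york", "yorks", "kane", "kanes"], 2, jaccard_distance))
```

The returned medians form Q’; how much index quality is lost depends on how well the distance reflects the effect of folding on gram benefits.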
20
Family of MaxDevDist measures
MaxDevDist(q1,q2) assumes q1
is folded onto q2
Folding affects benefits of
grams in:
(G1-G2) ∪ (G2–G1)
Gi is the set of grams in qi
Variants proposed,
proportional to:
1. | R’((G1-G2) ∪ (G2–G1)) |
2. 1 / | R’(G2 )|
3. 1 / | R’(G1 ∩ G2 )|
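1) $\mathrm{MaxDevDist}_1(q_1, q_2) = \dfrac{\sum_{g \in (G_1 - G_2) \cup (G_2 - G_1)} |R'(g)|}{\sum_{g' \in G_1 \cap G_2} |R'(g')|}$
2) $\mathrm{MaxDevDist}_2(q_1, q_2) = \dfrac{\sum_{g \in (G_1 - G_2) \cup (G_2 - G_1)} |R'(g)|}{\sum_{g' \in G_2} |R'(g')|}$
3) $\mathrm{MaxDevDist}_3(q_1, q_2) = \sum_{g \in (G_1 - G_2) \cup (G_2 - G_1)} |R'(g)|$
R’(g) = set of records not containing g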
21
Candidate set generation
Generate candidate set of grams G using Q:
Build a suffix tree by inserting suffixes of all q ∈ Q
Set of all path-labels → G0
Retain shortest, mutually distinguishable prefixes of
the path-labels in G0 with selectivity < sthresh → G
[Figure: suffix tree with edge labels such as an, n, r, e$, c$, k$, anc$, ork$, Franc$, kane$, York$]
Suffix tree on Q = { _an%, %York%, %Franc%, %kane }
G0 = { an, anc, ane, e, Franc, kane, n, ne, ork, r, rk, ranc, York }
22
Experimental results
Data Sets:
The “Digital Bibliography & Library Project” (DBLP)
2 string attributes: <author, publication>
|R| ≈ 305,000 records
|Q| = 1000, 2000, 3000 & 4000 (author last-names)
|G| ≈ 4K, 9K, 12K, 15K for the respective query sets
Performance metric:
Average Relative Error (ARE):
$\mathrm{ARE} = \dfrac{1}{|Q|} \sum_{q \in Q} \dfrac{\#\,\text{false positives retrieved for } q}{\text{total } \#\,\text{records retrieved for } q}$
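As a quick sanity check of the metric, ARE can be computed from per-query counts. The function name and input layout are hypothetical, and treating a query that retrieves nothing as contributing zero error is an assumption.

```python
def average_relative_error(per_query):
    """per_query: list of (false_positives, total_retrieved), one entry per query."""
    ratios = [fp / total for fp, total in per_query if total > 0]
    # Divide by |Q|, so queries with no retrieved records count as zero error.
    return sum(ratios) / len(per_query)

# Three queries retrieving 4, 2 and 6 records with 1, 0 and 3 false positives.
print(average_relative_error([(1, 4), (0, 2), (3, 6)]))
```

Because each query contributes its own ratio, a query with few retrieved records weighs as much as one with many.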
24
Performance
[Figure: three panels, "Workload vs Performance", for index sizes 2MB, 4MB and 8MB; x-axis: workload size (1000 to 4000); y-axis: average relative error; series: BestIndex and FREE]
• Plots compare the performance of our index with that of FREE [1] (we plot the average relative error)
• FREE does not consider any query model, unlike BestIndex
• FREE generates all grams up to a certain length and uses a cut-off selectivity for discarding candidates
25
Resilience to deviation from workload
[Figure: "Deviation vs Performance (|Q| = 4000, Index size = 8MB)"; x-axis: % retained in test set from original workload (100 down to 0); y-axis: average relative error; series: BestIndex and FREE]
• We test the resilience of our index by deviating the test query
set from the workload that is used to build it
26
Workload Reduction
[Figure: "Performance of Distance functions (|Q| = 1000)"; x-axis: reduced workload size (|Q'|); y-axis: Aggregate Proportion of Error on Q; series: MaxDev_2, MaxDev_3, Edit distance, Random sample]
• MaxDevDist_2 performs the best:
$\mathrm{MaxDevDist}_2(q_1, q_2) = \dfrac{\sum_{g \in (G_1 - G_2) \cup (G_2 - G_1)} |R'(g)|}{\sum_{g' \in G_2} |R'(g')|}$
• Random sample is worst !
27
Conclusions
We show that “Optimal gram selection for indexing in
presence of workload & storage constraint” is NP-hard
Adapt a $0.5\,(1 - \frac{1}{e})$-approximation algorithm to select grams → BestIndex
Speedup through parallelization of code
Explore workload reduction techniques
Experimental results comparing with previous approaches:
BestIndex is superior !
28
References
1. Cho, J., Rajagopalan, S., “A Fast Regular Expression Indexing Engine”, ICDE 2002
2. Khuller, S., Moss, A., Naor, J., “The Budgeted Maximum Coverage Problem”, IPL, vol. 70
3. Ukkonen, E., “On-line Construction of Suffix Trees”, Algorithmica
4. Chaudhuri, S., Gupta, A. K., Narasayya, V., “Compressing SQL Workloads”, ACM SIGMOD ’02
29
Thank You !
30
BestIndex algorithm
BestIndex-Naive(Q,R,G,M)
I = ∅
while some (q, r) uncovered AND space available
    for every gram g ∈ G \ I set benefit[g] = 0
    for every uncovered (qk, rj)                  … this loop nest is O(|Q|*|R|*|G|)
        for every candidate gi not in I
            if (gi covers (qk, rj)) then
                benefit[gi] = benefit[gi] + 1
    if (∃ a g with benefit[g] > 0) then
        for every candidate g
            utility[g] = benefit[g] / cost(g)
    else EXIT
    I = I ∪ {gmax}, where gmax has maximum utility
end
31
Pre-processing optimizations
Pruning:
Discard frequent grams from G
(we pruned all grams with selectivity ≥ 0.1)
Auxiliary data structure:
Observation: to compute the benefit of a gram for a query qk:
1. Need only the grams contained in qk → G(qk)
2. Need only the set of records spanned by G(qk)
To allow such selective access for benefit computation, create 2
lists: Q-G-list and G-R-list
32
The auxiliary data structures
[Figure: two adjacency lists]
Q-G-list: for each query qk (q1 … q|Q|), the list G(qk) of candidate grams contained in qk
G-R-list: for each gram gi (g1 … g|G|), the list R(gi) of records containing gi
Size is a small …
G-R-list helps in parallelizing the problem
Q-G-list reduces the complexity of each gram-selection iteration from
O(|Q|*|R|*|G|) to O(|Q|*|R|)
33
Parallel BestIndex algorithm
Parallelizable BestIndex(Q,R,G,M)
    Partition the original problem: P1 … P|Q|
    While (budget not filled & not all (q,r) covered)
        For all g ∈ G \ I: benefitglobal(g) ← 0
        For each sub-problem Pi
            Compute local benefits benefiti(g) for all g ∈ Gi
            benefitglobal(g) ← benefitglobal(g) + benefiti(g)
        I ← I ∪ {g*}, where g* has the highest global utility
    Return I
Time complexity: O(|I|*(|R1|+…+|R|Q||)) = O(|I|*|Q|*Avg(|Ri|))
For our data Avg(|Ri|) ≈ 0.17|R| (≈ 6 times faster than the naïve algorithm with basic
optimizations, even for sequential execution)
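The partitioned loop can be sketched sequentially as follows. This is a simplified illustration under stated assumptions: one sub-problem per query, a Q-G-list and G-R-list given as plain dictionaries, unit weights, and a per-query set `alive` of records not yet pruned (a bookkeeping choice of this sketch, not necessarily the data layout used in the experiments). The per-query loop body is independent across queries, so it is the part that could be farmed out to workers.

```python
def best_index_by_subproblem(qg_list, gr_list, n_records, cost, budget):
    """Greedy gram selection where benefits are summed over per-query sub-problems."""
    # alive[q]: records that the grams chosen so far have not pruned for q.
    alive = {q: set(range(n_records)) for q in qg_list}
    chosen, spent = [], 0
    while True:
        global_benefit = {}
        for q, grams in qg_list.items():          # one sub-problem per query
            for g in grams:
                if g not in chosen:
                    local = len(alive[q] - gr_list[g])   # records g would newly prune
                    global_benefit[g] = global_benefit.get(g, 0) + local
        utility = {
            g: b / cost[g]
            for g, b in global_benefit.items()
            if b > 0 and spent + cost[g] <= budget
        }
        if not utility:
            return chosen
        best = max(utility, key=utility.get)
        chosen.append(best)
        spent += cost[best]
        for q, grams in qg_list.items():          # only queries containing the gram change
            if best in grams:
                alive[q] &= gr_list[best]

qg_list = {"q1": ["g1"], "q2": ["g2"], "q3": ["g1"], "q4": ["g1", "g3"]}
gr_list = {"g1": {0, 5, 6}, "g2": {2, 3}, "g3": {2, 3, 4}}
cost = {"g1": 1, "g2": 1, "g3": 1}
print(best_index_by_subproblem(qg_list, gr_list, 7, cost, budget=2))
```

Only the grams listed for a query are ever touched in its sub-problem, which is where the saving over scanning all of G per (q,r) pair comes from.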
34
Distance measures
Maximum Deviation Distance measures (MaxDevDist)
Let Ibest ← BestIndex(Q) & I’ ← BestIndex(Q’)
MaxDevDist tries to capture the notion of
“how different is I’ from Ibest”
Intuition:
The difference between Ibest & I’ depends on benefit(g)
computed for candidate grams in each case
benefit(g) ∝ |R’(g)|, where R’(g) = set of records
not containing g
35
Distance measure (example)
Q = {q1 , q2} = {ab , bb}
G1 = {a, b, ab}; G2 = {b, bb}
G1 – G2 = {a, ab}; G2 – G1 = {bb};
G1 ∩ G2 = {b}
(G1 – G2) ∪ (G2 – G1) = {a, ab, bb}
Let |R| = 10:
gram g | |R(g)| | |R’(g)|
a      |   7    |   3
b      |   5    |   5
ab     |   2    |   8
bb     |   1    |   9
$\mathrm{MaxDevDist}_1(q_1, q_2) = \dfrac{\sum_{g \in (G_1 - G_2) \cup (G_2 - G_1)} |R'(g)|}{\sum_{g' \in G_1 \cap G_2} |R'(g')|} = \dfrac{|R'(a)| + |R'(ab)| + |R'(bb)|}{|R'(b)|} = \dfrac{3 + 8 + 9}{5} = 4.0$
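The arithmetic of the example can be reproduced with set operations. The function names are hypothetical, and the counts |R’(g)| = 3, 5, 8, 9 for a, b, ab, bb are the ones used in the worked formula.

```python
def max_dev_dist1(g1, g2, r_prime):
    """Sum of |R'(g)| over the symmetric difference, divided by the sum over G1 ∩ G2."""
    return sum(r_prime[g] for g in g1 ^ g2) / sum(r_prime[g] for g in g1 & g2)

def max_dev_dist2(g1, g2, r_prime):
    """Same numerator, divided by the sum of |R'(g')| over G2."""
    return sum(r_prime[g] for g in g1 ^ g2) / sum(r_prime[g] for g in g2)

def max_dev_dist3(g1, g2, r_prime):
    """Numerator only: sum of |R'(g)| over the symmetric difference."""
    return sum(r_prime[g] for g in g1 ^ g2)

G1 = {"a", "b", "ab"}          # grams of q1 = ab
G2 = {"b", "bb"}               # grams of q2 = bb
r_prime = {"a": 3, "b": 5, "ab": 8, "bb": 9}   # |R'(g)|, records not containing g

print(max_dev_dist1(G1, G2, r_prime))
```

Variant 1 is undefined when the two queries share no gram; how that case is handled is not specified here.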
36
A measure of quality of Index
Quality of index I w.r.t workload Q can be
measured by: Aggregate Proportion of Error (APE)
$\mathrm{APE}(Q, I) = \dfrac{\sum_{q \in Q} \#\,\text{false positives returned for } q \text{ using } I}{\sum_{q \in Q} \#\,\text{true strings containing } q}$
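APE is a ratio of sums rather than a mean of per-query ratios, which a short sketch makes explicit. The function name and the (false positives, true matches) input layout are hypothetical.

```python
def aggregate_proportion_of_error(per_query):
    """per_query: list of (false_positives, true_matches), one entry per query."""
    total_false_positives = sum(fp for fp, _ in per_query)
    total_true_matches = sum(true for _, true in per_query)
    return total_false_positives / total_true_matches

# Three queries with 1, 0 and 4 false positives against 3, 2 and 5 true matches.
print(aggregate_proportion_of_error([(1, 3), (0, 2), (4, 5)]))
```

Since the denominator counts true matches rather than retrieved records, the value can exceed 1 when false positives outnumber true answers.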
37