
Reducing the Space Requirement of LZ-index
Diego Arroyuelo (1), Gonzalo Navarro (1), and Kunihiko Sadakane (2)
(1) Dept. of Computer Science, Univ. of Chile
(2) Dept. of Computer Science and Communication Engineering, Kyushu Univ.
Barcelona – July 7, 2006
Outline
Introduction
The LZ-index (A Review)
LZ-index as a Navigation Scheme
Suffix-Links in the Reverse Trie
xbw LZ-index
Displaying Text Substrings
Conclusions
Problem definition
The full-text search problem: find the occ occurrences of a pattern P[1…m] in a text T[1…u]
To provide fast access to T while requiring little space, we use compressed full-text self-indexes, which:
  replace T and in addition give indexed access to it, and
  take space proportional to the compressed text size (O(uH_k(T)) bits)
H_k(T) is the k-th order empirical entropy of T, and σ is the alphabet size:
  H_k(T) ≤ H_{k-1}(T) ≤ … ≤ H_0(T) ≤ log σ
Main motivation: to store the indexes of very large texts entirely in main memory
Our results
LZ-index [Navarro, 2004]
  Space: 4uH_k(T) + o(u log σ) bits, k = o(log_σ u)
  Reporting: O(m^3 log σ + (m + occ) log u)
  Displaying: O(l log σ)
  The main drawback of LZ-index is the factor 4 in the space complexity
Our Results
  Space: (2+ε)uH_k(T) + o(u log σ) bits, for any constant 0 < ε < 1
  Reporting: O(m^2 log m + (m + occ) log u)
  Displaying: O(l / log_σ u) (optimal)
But also
  Space: (1+ε)uH_k(T) + o(u log σ) bits
  Reporting: O(m^2) (average case), for m ≥ 2 log_σ u
LZ-index is faster to report and to display (very important for a self-index!)
Our results in context
Our data structures:
  Size O(uH_k(T)) bits
  O(log u) time per occurrence reported, if σ = Θ(polylog(u))
There are competing schemes with the same or better complexity for reporting
The case σ = Θ(polylog(u)) represents moderate-size alphabets and is very common in practice, but does not fit in competing schemes
The LZ-index (a review)
[Figure: the components of the LZ-index, built on the LZ78 parsing of T: LZTrie, RevTrie, Node, and Range]
We don’t need to store the text!
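The LZ78 parsing that the index is built on can be illustrated with a small sketch. This is not the authors' implementation: the function names, the dictionary-based trie, and the assumption that the text ends with a terminator "$" occurring nowhere else are all chosen here for illustration.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Entry k-1 of the result describes phrase k as (parent_id, symbol):
    phrase k is phrase parent_id followed by symbol, and phrase 0 is empty.
    Assumes text ends with a terminator that occurs nowhere else, so the
    last phrase is always complete.
    """
    trie = {}       # (parent_id, symbol) -> phrase id, i.e. the LZTrie arcs
    phrases = []
    node = 0        # current LZTrie node; 0 is the root
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # still inside a known phrase
        else:
            phrases.append((node, ch))        # new phrase = known phrase + ch
            trie[(node, ch)] = len(phrases)
            node = 0                          # restart from the root
    return phrases


def phrase_string(phrases, k):
    """Spell phrase k by walking parent pointers up to the root."""
    symbols = []
    while k != 0:
        k, ch = phrases[k - 1]
        symbols.append(ch)
    return "".join(reversed(symbols))


text = "ababbabab$"
phrases = lz78_parse(text)
print([phrase_string(phrases, k) for k in range(1, len(phrases) + 1)])
# prints ['a', 'b', 'ab', 'ba', 'bab', '$']
```

Concatenating the phrases in order gives back the text, which is why the index can replace T.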
Succinct representation of the data structures
Assume n is the number of phrases in the LZ78 parsing of T
LZTrie:
  par: the balanced parentheses representation of LZTrie (2n + o(n) bits)
  lets: the symbols labelling the arcs of LZTrie, in preorder (n log σ bits)
  ids: the phrase identifiers in preorder (n log n bits)
RevTrie:
  rpar: the balanced parentheses representation of RevTrie (4n + o(n) bits)
  rids: the phrase identifiers in preorder (n log n bits)
Node: an array requiring n log(2n) = n log n + n bits
Range: implemented using [Chazelle, 1988], requiring n log n (1 + o(1)) bits
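As a rough illustration of the three LZTrie sequences, the sketch below derives par, lets and ids from a list of LZ78 phrases given as (parent_id, symbol) pairs. The input format, the function name, and the choice to visit children in symbol order are assumptions made here; a real index would store these sequences as bit vectors and packed arrays rather than Python lists.

```python
def lztrie_sequences(phrases):
    """Return (par, lets, ids) for the trie of the given LZ78 phrases.

    phrases[k-1] = (parent_id, symbol) describes phrase k; node 0 is the root.
    """
    children = {}
    for k, (parent, ch) in enumerate(phrases, start=1):
        children.setdefault(parent, []).append((ch, k))

    par, lets, ids = [], [], []

    def visit(node, symbol):
        par.append("(")                 # open the node in preorder
        lets.append(symbol)             # symbol on the arc leading to it
        ids.append(node)                # phrase identifier of the node
        for ch, kid in sorted(children.get(node, [])):
            visit(kid, ch)
        par.append(")")                 # close it once its subtree is done

    visit(0, "")                        # the root has no incoming symbol
    return "".join(par), lets, ids


# Phrases a, b, ab, ba, bab, $ (the LZ78 parsing of "ababbabab$").
phrases = [(0, "a"), (0, "b"), (1, "b"), (2, "a"), (4, "b"), (0, "$")]
par, lets, ids = lztrie_sequences(phrases)
print(par)    # (()(())((())))
print(lets)   # ['', '$', 'a', 'b', 'b', 'a', 'b']
print(ids)    # [0, 6, 1, 3, 2, 4, 5]
```

The recursion is only suitable for small examples; a deep trie would need an explicit stack.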
Succinct representation of the data structures
We have four n log n-bit terms (ids, rids, Node, and Range)
As n log n = uH_k(T) + o(u log σ) for k = o(log_σ u), the LZ-index requires
  4n log n (1 + o(1)) = 4uH_k(T) + o(u log σ) bits, for k = o(log_σ u)
To reduce the space requirement we must reduce the number of n log n-bit terms in the index
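Search Algorithm
Occurrences of Type 1: the occurrence lies inside a single phrase
Occurrences of Type 2: the occurrence spans two consecutive phrases
Occurrences of Type 3: the occurrence spans three or more phrases
[Figure: an occurrence of P laid over the consecutive phrases B_{k-1} B_k … B_l B_{l+1}]
Reporting time: O(m^3 log σ + (m + occ) log n)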
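The three occurrence types depend only on how many phrases an occurrence overlaps: one (type 1), two (type 2), or three or more (type 3), following the usual LZ-index classification. A minimal sketch, assuming the phrase starting positions are available as a sorted list of 0-based offsets (the function name and input format are illustrative, not part of the index):

```python
from bisect import bisect_right


def occurrence_type(phrase_starts, pos, m):
    """Classify the occurrence T[pos : pos + m] as type 1, 2 or 3.

    phrase_starts holds the sorted 0-based starting positions of the phrases.
    """
    first = bisect_right(phrase_starts, pos) - 1          # phrase holding the first symbol
    last = bisect_right(phrase_starts, pos + m - 1) - 1   # phrase holding the last symbol
    spanned = last - first + 1
    return min(spanned, 3)                                # 3 stands for "three or more"


# T = "ababbabab$" parsed as a | b | ab | ba | bab | $
starts = [0, 1, 2, 4, 6, 9]
print(occurrence_type(starts, 2, 2))   # "ab" inside phrase "ab": 1
print(occurrence_type(starts, 0, 2))   # "ab" across "a" and "b": 2
print(occurrence_type(starts, 1, 4))   # "babb" across "b", "ab", "ba": 3
```

The index itself never scans the text like this; the sketch only makes the case analysis concrete.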
Solving Occurrences of Type 1
Shortest possible LZ78 phrases containing P: by LZ78, P is a suffix of such phrases
[Figure: LZTrie, with the subtrees containing occurrences of type 1 hanging from nodes whose phrases end in P]
Solving Occurrences of Type 1
As P is a suffix of such phrases, P^r is a prefix of the corresponding reverse phrases
We need the Reverse Trie (RevTrie) to solve this problem
[Figure: P^r searched in RevTrie, and the corresponding nodes whose phrases end in P in LZTrie]
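The reasoning for type 1 can be checked on plain strings. The sketch below is a brute-force stand-in for the two tries, with names chosen here for illustration: it finds the phrases ending in P by testing whether the reversed phrase starts with P^r (the RevTrie step), then collects every phrase that has one of them as a prefix (the LZTrie subtrees). It relies on the LZ78 property that every prefix of a phrase is itself a phrase.

```python
def type1_phrases(phrase_strings, pattern):
    """Return the 1-based ids of the phrases that contain pattern."""
    rev_pattern = pattern[::-1]
    # RevTrie step: phrases ending in P are reversed phrases starting with P^r.
    subtree_roots = [s for s in phrase_strings if s[::-1].startswith(rev_pattern)]
    # LZTrie step: a descendant of such a node has that phrase as a prefix.
    return [k for k, s in enumerate(phrase_strings, start=1)
            if any(s.startswith(root) for root in subtree_roots)]


phrases = ["a", "b", "ab", "ba", "bab", "$"]   # LZ78 parsing of "ababbabab$"
print(type1_phrases(phrases, "ab"))            # [3, 5]
print(type1_phrases(phrases, "b"))             # [2, 3, 4, 5]
```

The real structures replace both scans by a trie descent and a subtree traversal.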
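Solving Occurrences of Type 2
[Figure: P split as P1 P2; P1^r is searched in RevTrie and P2 in LZTrie, and the two resulting subtrees give the preorder ranges [x, y] and [x', y']]
Search for [x, y] × [x', y'] in Range
For every pair (k, k+1) found, report k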
LZ-index as a Navigation Scheme
In practice Range is replaced by RNode (phrase id → RevTrie node)
Occurrences of type 2:
[Figure: P split as P1 P2; P1^r in RevTrie and P2 in LZTrie, connected through Node and RNode]
We have no worst-case guarantees at search time
Average time for type 2 occs: O(n/σ^{m/2}) (O(1) for m ≥ 2 log_σ n)
Original Navigation Scheme

When we replace Range by RNode, we get a “navigation” scheme
But the scheme is redundant…
We study how to reduce the redundancy in the LZ-index
Alternative Navigation Scheme
Inverse permutations represented with Munro et al.
Space requirement: (2+ε)uH_k + o(u log σ) bits
Search algorithm remains the same…
O(m^2) (average case), for m ≥ 2 log_σ n
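The idea behind storing a permutation together with a cheap inverse can be sketched as follows. This is a simplified illustration in the spirit of the shortcut technique, not the exact structure used in the index: every t-th element of each cycle keeps a pointer t positions back, so the inverse is found with O(t) applications of the permutation. The names and the dictionary holding the pointers are assumptions made here.

```python
def build_shortcuts(perm, t):
    """Store a backward pointer at every t-th element of each cycle of perm."""
    back = {}
    seen = [False] * len(perm)
    for start in range(len(perm)):
        if seen[start]:
            continue
        cycle = []
        j = start
        while not seen[j]:
            seen[j] = True
            cycle.append(j)
            j = perm[j]
        if len(cycle) > t:                    # short cycles need no shortcuts
            for idx in range(0, len(cycle), t):
                back[cycle[idx]] = cycle[(idx - t) % len(cycle)]
    return back


def inverse(perm, back, i):
    """Return j with perm[j] == i using O(t) applications of perm."""
    j = i
    jumped = False
    while perm[j] != i:
        if not jumped and j in back:
            j = back[j]                       # jump t positions back in the cycle
            jumped = True
        else:
            j = perm[j]                       # otherwise keep walking forward
    return j


perm = [3, 0, 4, 1, 5, 6, 7, 8, 9, 2]
back = build_shortcuts(perm, 3)
print([inverse(perm, back, i) for i in range(len(perm))])
# prints [1, 3, 9, 0, 2, 4, 5, 6, 7, 8]
```

With t proportional to 1/ε, only about an ε fraction of the elements carry an extra pointer, which is where the ε term in the space bound comes from.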
Suffix Links in RevTrie
Can we reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits?
Can we reduce the space requirement while retaining worst-case guarantees in the search process?
We are going to compress the R mapping
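Suffix Links in RevTrie
Definition 1: We define function φ as a suffix link in RevTrie
  φ(i) = R^{-1}(parent_LZ(R[i]))
[Figure: LZTrie and RevTrie; RevTrie node i is reached by a followed by x^r, its LZTrie node R[i] by x followed by a, and φ(i) is the RevTrie node reached by x^r]
If we follow a suffix link in RevTrie, we are “going to the parent” in LZTrie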
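Definition 1 can be tried out on a toy phrase set. The sketch assumes that R maps the RevTrie position of a phrase to its LZTrie position, and it uses lexicographic ranks of the phrases and of the reversed phrases in place of real preorder numbers. The empty root phrase is included and treated as its own parent. All names are illustrative.

```python
def suffix_links(phrases):
    """Return (by_rev, R, phi) for a prefix-closed set of phrases that includes ""."""
    by_lz = sorted(phrases)                              # LZTrie order
    by_rev = sorted(phrases, key=lambda s: s[::-1])      # RevTrie order
    lz_pos = {s: i for i, s in enumerate(by_lz)}
    rev_pos = {s: i for i, s in enumerate(by_rev)}

    R = [lz_pos[s] for s in by_rev]                      # RevTrie position -> LZTrie position
    R_inv = [rev_pos[s] for s in by_lz]                  # LZTrie position -> RevTrie position
    parent_lz = [lz_pos[s[:-1]] for s in by_lz]          # drop the last symbol; "" stays ""

    phi = [R_inv[parent_lz[R[i]]] for i in range(len(R))]
    return by_rev, R, phi


phrases = ["", "a", "b", "ab", "ba", "bab", "$"]
by_rev, R, phi = suffix_links(phrases)
print(R)      # [0, 1, 2, 5, 4, 3, 6]
print(phi)    # [0, 0, 0, 4, 0, 2, 3]
```

For every position i, the reversed phrase at φ(i) is the reversed phrase at i without its first symbol, which is exactly the suffix-link behaviour.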
Suffix Links in RevTrie
[Figure: worked example for R[11], showing the arrays φ and L over positions 0 to 17]
Suffix Links in RevTrie
We can compute R using φ
But what is the difference in space requirement? (Both R and φ require, in principle, n log n bits)
We can prove the following lemma for function φ
Suffix Links in RevTrie
We replace the n log n-bit representation of R by a representation of φ requiring
  nH_0(lets) + O(n log log σ) + O(σ log σ) + n + o(n) bits
To compute R in O(1/ε) time we store εn values of R, requiring εn log n extra bits
R^{-1} can be computed in O(1/ε^2) time
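How R can be recovered from φ plus a few stored values can be sketched on plain strings. The sketch assumes that R maps a RevTrie position to the LZTrie position of the same phrase, follows φ until a stored value is found, and then descends in the LZTrie by the symbols collected on the way. The sampling rule used here (store R for phrases whose length is a multiple of t) is a simplification chosen for illustration and does not guarantee the εn bound on stored values; all names are hypothetical.

```python
def build(phrases, t):
    """Build phi, per-node symbols, LZTrie children and sampled R values.

    phrases must be prefix-closed and include the empty root phrase "".
    """
    by_lz = sorted(phrases)                              # LZTrie order
    by_rev = sorted(phrases, key=lambda s: s[::-1])      # RevTrie order
    lz_pos = {s: i for i, s in enumerate(by_lz)}
    rev_pos = {s: i for i, s in enumerate(by_rev)}

    phi = [rev_pos[s[:-1]] for s in by_rev]              # suffix link of each RevTrie position
    first_sym = [s[-1:] for s in by_rev]                 # first symbol of the reversed phrase
    child = {(lz_pos[s[:-1]], s[-1]): lz_pos[s] for s in phrases if s}
    sampled = {rev_pos[s]: lz_pos[s] for s in phrases if len(s) % t == 0}
    return phi, first_sym, child, sampled


def compute_R(i, phi, first_sym, child, sampled):
    """Recover R[i] with fewer than t suffix-link steps."""
    path = []
    while i not in sampled:            # the root (length 0) is always sampled
        path.append(first_sym[i])
        i = phi[i]                     # go to the parent phrase
    node = sampled[i]
    for sym in reversed(path):         # descend back down in the LZTrie
        node = child[(node, sym)]
    return node


phrases = ["", "a", "b", "ab", "ba", "bab", "$"]
phi, first_sym, child, sampled = build(phrases, 2)
print([compute_R(i, phi, first_sym, child, sampled) for i in range(len(phrases))])
# prints [0, 1, 2, 5, 4, 3, 6]
```

The result matches the full R mapping even though only the entries for "", "ab" and "ba" are stored.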
Suffix Links in RevTrie
Yes, we can reduce the space requirement of LZ-index to (1+ε)uH_k + o(u log σ) bits
Suffix Links in RevTrie
We can add Range to get worst-case guarantees in the search process, requiring n log n extra bits
Yes, we can reduce the space requirement of LZ-index to (2+ε)uH_k + o(u log σ) bits, retaining worst-case guarantees at search time
xbw LZ-index
The xbw transform [Ferragina et al., 2005] is a succinct tree representation requiring 2n log σ + O(n) bits and allowing operations:
  parent (O(1) time)
  child(x, i) (O(1) time)
  child(x, a) (O(1) time)
  Subpath queries (O(m) time)
[Figure: subpath search with string P]
As we can perform prefix and suffix searching, we can do the work of both LZTrie and RevTrie with the xbw alone!
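xbw LZ-index
[Figure: balanced parentheses LZTrie (()()())()(()())(()) with ids, indexed by preorder positions; xbw LZTrie (S_last, S_α), indexed by xbw positions; Range; Pos and Pos^{-1} map a position i between preorder positions and xbw positions]
In principle: (3+ε)uH_k(T) + o(u log σ) bits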
xbw LZ-index
(2+ε)uH_k(T) + o(u log σ) bits
[Figure: balanced parentheses LZTrie (()()())()(()())(()) with ids; xbw LZTrie (S_last, S_α); Pos[i] obtained through the sampled array Pos', with positions i and j]
We store one out of O(1/ε) values of Pos
xbw LZ-index
Occurrences of Type 1: using the xbw (subpath search with P^r), and then mapping to the parentheses LZTrie
Occurrences of Type 2: subpath search for P1^r and search (using child from the root) for P2. Then use the corresponding xbw and preorder ranges to search in Range
Occurrences of Type 3: mostly as with the original LZ-index
Occurrences of Type 2 can be solved as Occurrences of Type 3 (we don’t need Range!)
We have achieved Theorems 1 and 2 with radically different means!!
Displaying text substrings

The approach of [Sadakane and Grossi, 2006] to display any text substring of length Θ(log_σ u) in constant time can be adapted to our indexes
Conclusions
We have studied the reduction of the space requirement of LZ-index
Two different approaches:
  Navigational scheme
  xbw + bp LZTrie
In either case we achieve (2+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u)
The search time is improved to O(m^2 log m + (m + occ) log n) (worst case)
Conclusions
We also define indexes requiring (1+ε)uH_k(T) + o(u log σ) bits to index T[1…u], k = o(log_σ u)
  O(m^2) average-case time if m ≥ 2 log_σ n
The time to display a context of length l around any text position is also improved to the optimal O(l / log_σ u)
We also remove some restrictions of the original LZ-index (see the paper)
Questions?
Contact
[email protected]
Thanks!