A High-speed Digital Search Algorithm by an Improved Double Array
Abstract
The HDS (high-speed digital search) method is based on
the trie memory structure first presented by E. Fredkin
in 1960 [1]. Johnson and Aoe improved the array structure
of the original method and developed it into an efficient
and practical retrieval method using the double array
[2][3][5]. Because of its time and space efficiency, the
HDS method is widely used in areas such as bibliographic
search, language analysis and speech recognition. Since
Aoe first proposed the HDS method using the double array,
a number of improvements to the method have been
presented [5-6]. Previous studies of the HDS method used
the original condition for constructing the double array
without any change. In this paper, we modify the second
equation in the condition for constructing the double
array so that each element of the array CHECK needs only
enough space to distinguish |I| values, where I is the
set of input symbols. We give the high-speed digital
search algorithm using the improved double array.
1. Pattern Matching Machines
Let $I = \{a_1, a_2, \ldots, a_n\}$ be a set of input symbols,
let $I^*$ be the set of all finite strings on $I$, and let
$K \subseteq I^*$ be a set of keywords. We give the definition
of a pattern matching machine of $K$ [2].
Definition 1. Let $M$ be a 5-tuple $(S, I^*, g, s_1, U)$, where
. $S$ is a finite set of states
. $s_1$ is the initial state
. $U \subseteq S$ is a set of output states
. $g$ is the state transition function which maps
$S \times I^*$ to $S \cup \{0\}$.
If there exists $x \in I^*$ such that $g(s_1, x) = s'$,
$s' \in U$, then it is said that $x$ is accepted by $M$. The
set of strings accepted by $M$ is denoted by $L(M)$.
If $K = L(M)$, then we call $M$ a pattern matching
machine of $K$. From now on, the state $s_r$ is
denoted by $r$ for short, so the initial state is denoted by 1.
Definition 2. For a state $r \in S$, we define the input
degree indeg(r) and the output degree outdeg(r) as
follows [3].
$\mathrm{indeg}(r) = |\{(t, a) \mid g(t, a) = r,\ a \in I,\ t \in S\}|$
$\mathrm{outdeg}(r) = |\{(r, a) \mid g(r, a) \in S,\ a \in I\}|$
A state whose output degree is zero is called a final
state. We denote the set of all final states by $F$.
Definition 3. If the machine $M$ satisfies the following
conditions, $M$ is called a T-type machine.
1. $\forall r \in S$, $r = 1 \Rightarrow \mathrm{indeg}(r) = 0$
2. $\forall r \in S$, $r \neq 1 \Rightarrow \mathrm{indeg}(r) = 1$
3. Any state $r \in S$ can be reached from the
state 1 by the transition function $g$.
Let $I_\#$ be the set $I \cup \{\#\}$ and let $I_\#^*$ be the set
$\{x\# \mid x \in I^*\}$. And let $K_\#$ be the set
$\{x\# \mid x \in K\}$. Then the set of output states $U$
coincides with the set of final states $F$. From now on,
we denote $I_\#^*$, $I_\#$ and $K_\#$ as $I^*$, $I$, $K$, respectively.
Suppose that the pattern matching machine $M$ is a T-type machine.
2. An improved data structure
Here, we improve the double array structure of the
HDS method so that it becomes more efficient in space.
The authors of [2] identified the following two problems
1. For a state $r$, how can we determine that $r$ is an
element of the set $S_p$?
2. How can we get the position of STR[$r$] on TAIL?
and defined $A$, the condition for constructing a
double array (not described here).
We modify the condition $A$ as follows.
Definition 4. ($A'$: the condition for constructing a
double array)
($A'$-1) $\mathrm{CHECK}[\mathrm{BASE}[r] + a] = a$ iff
$g(r, a) = t$, $a \in I$, $r \in S_M \cup S_p$.
($A'$-2) $\mathrm{BASE}[r] < 0$ iff $r \in S_p$.
($A'$-3) $p = -\mathrm{BASE}[r]$ and $\mathrm{TAIL}[p] = b_1$,
$\mathrm{TAIL}[p+1] = b_2$, …, $\mathrm{TAIL}[p+m-1] = b_m$
iff $r \in S_p$ and $\mathrm{STR}[r] = b_1 b_2 \ldots b_m$ ($m \ge 1$, $n \ge m$).
($A'$-4) For $i$ ($0 \le i \le \mathrm{MAX}$), it holds that
$\mathrm{BASE}[I_i] = -I_{i+1}$, $\mathrm{G\_HEAD} = I_i$, where MAX
indicates the size of the double array and G_HEAD
is the variable which represents the initial position of
the array.
($A'$-5) For any $r_1 \neq r_2$, it holds that
$\mathrm{BASE}[r_1] \neq \mathrm{BASE}[r_2]$ iff $\mathrm{BASE}[r_1] > 0$ and
$\mathrm{BASE}[r_2] > 0$, $r_1, r_2 \in S_p \cup S_M$.
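To make the conditions concrete, the following is a minimal Python sketch of how a transition is read under ($A'$-1) and how STR[r] is read under ($A'$-2) and ($A'$-3). The arrays are built by hand for the hypothetical keyword set {"ab#", "acd#"}; the symbol codes in CODE, the helper names g and str_of, and the convention that a TAIL string ends at '#' are assumptions of this sketch, not part of the paper.

```python
# Hypothetical symbol codes: '#' is the end marker, letters follow it.
CODE = {"#": 1, "a": 2, "b": 3, "c": 4, "d": 5}

# Hand-built arrays for K = {"ab#", "acd#"}; index 0 is unused, state 1 is initial.
BASE  = [0, 1, 0, 2, 0, -1, -2]
CHECK = [0, 0, 0, 2, 0,  3,  4]
TAIL  = [None, "#", "d", "#"]


def g(r, a):
    """Transition under (A'-1): return the next state, or 0 if undefined."""
    if BASE[r] <= 0:              # r is in S_p (or unused): nothing leaves it
        return 0
    t = BASE[r] + CODE[a]
    if t < len(CHECK) and CHECK[t] == CODE[a]:
        return t
    return 0


def str_of(r):
    """STR[r] under (A'-2)/(A'-3): read TAIL from p = -BASE[r] up to '#'."""
    assert BASE[r] < 0, "r must be in S_p"
    p = -BASE[r]
    end = TAIL.index("#", p)      # assumed convention: strings end at '#'
    return "".join(TAIL[p:end + 1])
```

Here CHECK stores the input symbol rather than a parent state number, so two states with the same positive BASE value would be indistinguishable; under the assumptions of this sketch, that appears to be what ($A'$-5) rules out.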
An example of the double array and the array
TAIL that satisfies the condition $A'$ is given
below.
Fig. An example of the double array (BASE, CHECK) and the
array TAIL (pos=19)
3. HDS Algorithms using the improved double
array
1) Search Algorithm
The search algorithm using the improved double array
is given as Algorithm 1. In the algorithm, the variable MAX
indicates the current size of the double array and
S_TEMP is used to store the string STR[r] determined
from the array TAIL. FETCH_STR(p) returns
the string stored at position p of TAIL.
STR_CMP(y,z) returns -1 if the two strings are the
same; otherwise it returns the length of their common
prefix. The input string is $x = a_1 a_2 \ldots a_n$.
Algorithm 1
begin
      r ← 1; h ← 0
      repeat
        h ← h + 1; t ← BASE[r] + a_h;
(1-1)   if (t > MAX) or CHECK[t] ≠ a_h
          then return (FALSE) else r ← t
      until BASE[r] < 0
(1-2) if a_h = # then return (TRUE)
        else S_TEMP ← FETCH_STR(-BASE[r]);
(1-3) if STR_CMP(a_{h+1} a_{h+2} ... a_n, S_TEMP) = -1
        then return (TRUE)
        else return (FALSE)
end
Theorem 1. If $x \in K$, then Algorithm 1 returns
TRUE, and if $x \notin K$, then Algorithm 1 returns
FALSE.
Proof. Given $x \in K$, from the condition $A'$, there
exists a series of state transitions such that
$g(r_i, a_i) = r_{i+1}$, $r_{i+1} \in S_p \cup S_M$, $1 \le i \le n$.
a) The case of $r_{i+1} \in S_M$
From the condition ($A'$-1), it follows that
$\mathrm{BASE}[r_i] > 0$, $\mathrm{BASE}[r_i] + a_i = r_{i+1}$, $\mathrm{CHECK}[r_{i+1}] = a_i$   (1)
Therefore, Algorithm 1 makes the transition correctly
from the state $r_i \in S_M$ to the state $r_{i+1} \in S_M$ for the
input symbol $a_i$.
b) The case of $r_{i+1} \in S_p$
The line (1-2) of Algorithm 1 has two options. If
$a_i = \#$, then TRUE is returned in the step (1-2) of
Algorithm 1 according to the condition ($A'$-1). If
$a_i \neq \#$, then, according to the conditions ($A'$-2)
and ($A'$-3), for any $p$ such that
$\mathrm{BASE}[r_{i+1}] < 0$, $p = -\mathrm{BASE}[r_{i+1}]$   (2)
it follows that
$\mathrm{TAIL}[p] = a_{i+1}$, $\mathrm{TAIL}[p+1] = a_{i+2}$, …,
$\mathrm{TAIL}[p+n-i-1] = a_n$. Thus, TRUE is returned in
the step (1-3). Therefore, $x \in L(M)$ is valid in the
two cases.
Next, suppose $x = a_1 a_2 \ldots a_n \notin K$. Then, according
to the condition $A'$ and by Definition 3, it holds
that $\exists i \le n$, $r_i \in S_M$, $g(r_i, a_i) = r_{i+1}$,
$r_{i+1} \in S_P \cup \{0\}$.
In the case of $r_{i+1} = 0$, according to the conditions
($A'$-1) and ($A'$-5), FALSE is returned in the step (1-1)
of Algorithm 1. Thus, $x \notin L(M)$.
In the case of $r_{i+1} \in S_P$, according to the conditions
($A'$-2) and ($A'$-3), since
$\mathrm{STR}[r_{i+1}] = b_1 b_2 \ldots b_z$
does not coincide with $y = a_{i+1} a_{i+2} \ldots a_n$, the
suffix of the input string $x$, FALSE is returned in the
step (1-3) of Algorithm 1. Thus, $x \notin L(M)$.
Therefore, Algorithm 1 satisfies $K = L(M)$ for $K$,
the set of input strings.
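The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the arrays are hand-built for the hypothetical keyword set {"ab#", "acd#"}, the symbol codes in CODE are assumed, and the guard for an exhausted or unknown input symbol is an addition that Algorithm 1 does not contain.

```python
# Hypothetical symbol codes and hand-built arrays for K = {"ab#", "acd#"}.
CODE = {"#": 1, "a": 2, "b": 3, "c": 4, "d": 5}
BASE  = [0, 1, 0, 2, 0, -1, -2]
CHECK = [0, 0, 0, 2, 0,  3,  4]
TAIL  = [None, "#", "d", "#"]


def fetch_str(p):
    """FETCH_STR: the string stored in TAIL from position p up to '#'."""
    end = TAIL.index("#", p)
    return "".join(TAIL[p:end + 1])


def search(x):
    """Return True iff x (which must end with '#') is a stored keyword."""
    r, h = 1, 0
    while True:
        if h >= len(x) or x[h] not in CODE:    # guard added for this sketch
            return False
        a = CODE[x[h]]
        h += 1
        t = BASE[r] + a
        if t >= len(CHECK) or CHECK[t] != a:   # (1-1)
            return False
        r = t
        if BASE[r] < 0:                        # until BASE[r] < 0
            break
    if x[h - 1] == "#":                        # (1-2)
        return True
    return x[h:] == fetch_str(-BASE[r])        # (1-3)
```

With these arrays, search("ab#") and search("acd#") succeed, while search("ac#") fails in step (1-3) and search("b#") fails in step (1-1).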
2) Insertion algorithm
The algorithm for inserting a string into the double array
is given by modifying (1-1) and (1-3) in Algorithm 1.
Algorithm 2. (Insertion Algorithm)
a) [Modifying return(FALSE) in (1-1)]
begin /* ay = a_h a_{h+1} ... a_n */
  A_INSERT(r, ay);
  return FALSE
end
b) [Modifying return(FALSE) in (1-3)]
begin /* b : common prefix */
  B_INSERT(r, b, cy, dz);
  return FALSE
end
The procedures A_INSERT, B_INSERT, MODIFY and
INS_STR are given as follows.
PROCEDURE A_INSERT(r, ax)
begin
(a-1)   t ← BASE[r] + a
(a-2)   if CHECK[t] ≠ 0 then
        begin
(a-2-1)   LIST ← SET_LIST(r)
(a-2-2)   MODIFY(a, LIST);
        end
(a-3)   BASE[t] ← 0; CHECK[t] ← 0;
        INS_STR(r, ax, pos);
end
PROCEDURE A_INSERT inserts the string x when
it is known on the double array that $x \notin K$.
PROCEDURE B_INSERT(r, b, cy, dz)
begin /* b = b_1 b_2 ... b_k */
(b-1)   old_pos ← -BASE[r]
        for i = 1 to STR_LEN(b) do
        begin
(b-1-1)   BASE[r] ← X_CHECK({b_i});
(b-1-2)   CHECK[BASE[r] + b_i] ← b_i;
(b-1-3)   r ← BASE[r] + b_i
        end
(b-2)   BASE[r] ← X_CHECK({c, d})
(b-3)   INS_STR(r, dz, old_pos);
(b-4)   INS_STR(r, cy, pos);
end
PROCEDURE B_INSERT inserts the string x when
it is known on the array TAIL that $x \notin K$.
PROCEDURE MODIFY(a, LIST)
begin
(m-1)   old_base ← BASE[r];
(m-2)   BASE[r] ← X_CHECK(LIST ∪ {a});
        for C ∈ LIST do
        begin
(m-2-1)   t ← old_base + C; t' ← BASE[r] + C;
(m-2-2)   CHECK[t'] ← CHECK[t];
(m-2-3)   BASE[t'] ← BASE[t];
        end
(m-3)   (Processing for unused elements)
end
PROCEDURE MODIFY is a procedure to avoid the
overlapping when $r \in S_M$, $a \in I$,
$g(r, a) = t$, $t \in S_p \cup S_M$.
PROCEDURE INS_STR(r, ey, d_pos)
begin
(s-1)   t ← BASE[r] + e; CHECK[t] ← e;
(s-2)   BASE[t] ← -d_pos;
(s-3)   pos ← SET_STR(d_pos, y);
(s-4)   if (t > MAX) then MAX ← t;
(s-5)   (Processing for unused elements)
end
PROCEDURE INS_STR is the procedure that records the
state transition
such that $r \in S_M$, $a \in I$, $g(r, a) = t$, $t \in S_p \cup S_M$ in the
double array and also records the string STR[r] in the
array TAIL.
Functions and variables used in the above procedures
are given as follows:
pos indicates the end position of the array TAIL.
N(LIST) is a function that returns |Q|.
X_CHECK(LIST) is a function that finds q such that
$\forall C \in Q$, $\mathrm{CHECK}[q + C] = 0$ and returns q, the
minimum that satisfies the condition (A'-5).
SET_STR(p,y) is a function that records the string y at the
place in the array TAIL indicated by p and returns pos.
SET_LIST(r) is a function that returns Q for the state
$r \in S_M$.
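The roles of X_CHECK, SET_LIST and MODIFY can be illustrated with a short Python sketch. It rests on several assumptions that the paper does not state: unused cells are recognized by CHECK = 0 (the list of unused elements from (A'-4) is ignored), the arrays grow on demand, a BASE value is rejected when another state already uses it (one reading of (A'-5)), and the old cells are cleared inside the relocation loop. The parameter names and the explicit r argument are hypothetical.

```python
def x_check(codes, base, check):
    """Smallest q >= 1 whose cells q + c are all free and which no state
    already uses as BASE (the reading of (A'-5) assumed in this sketch)."""
    used_bases = {b for b in base if b > 0}
    q = 1
    while True:
        cells_free = all(q + c >= len(check) or check[q + c] == 0
                         for c in codes)
        if cells_free and q not in used_bases:
            return q
        q += 1


def set_list(r, base, check, alphabet):
    """Codes c with a transition from r: the cell BASE[r] + c holds c."""
    return [c for c in alphabet
            if base[r] + c < len(check) and check[base[r] + c] == c]


def modify(r, a, base, check, alphabet):
    """Move the transitions of r to a new BASE[r] that leaves room for a."""
    codes = set_list(r, base, check, alphabet)
    old_base = base[r]                                # (m-1)
    base[r] = x_check(codes + [a], base, check)       # (m-2)
    for c in codes:
        t, t_new = old_base + c, base[r] + c          # (m-2-1)
        while len(base) <= t_new:                     # grow the arrays
            base.append(0)
            check.append(0)
        check[t_new] = check[t]                       # (m-2-2)
        base[t_new] = base[t]                         # (m-2-3)
        base[t], check[t] = 0, 0                      # free the old cell


# Hand-built arrays for the hypothetical keyword set {"ab#", "acd#"}.
BASE  = [0, 1, 0, 2, 0, -1, -2]
CHECK = [0, 0, 0, 2, 0,  3,  4]
# A new transition on code 4 from state 1 would hit cell 5, which is taken.
modify(1, 4, BASE, CHECK, alphabet=range(1, 6))
```

In this example the state reached by code 2 moves from cell 3 to cell 7, while its own transitions in cells 5 and 6 stay untouched, because CHECK holds the input symbol rather than the number of the parent state.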
Theorem 2. After inserting a string $x \notin K$ into the
double array D by Algorithm 2, D satisfies the
condition $A'$.
Proof. We prove the theorem for the two procedures
A_INSERT and B_INSERT.
a) Procedure A_INSERT
If the condition of the line (a-2) is not satisfied, then
there is no overlapping state; therefore, according to
the conditions (A'-1) and (A'-2), the information
related to the state transition $g(r, a) = t$, $t \in S_p$ is
stored in the double array. And according to the line
(s-3), STR[t] is stored in the array TAIL. Since
$t \in S_P$, the conditions (A'-4) and (A'-5) are still
satisfied; therefore the condition A' is satisfied for
$t \in S_P$.
Next, we show that the double array satisfies the
condition A' after storing the state transition
$g(r, a) = t \in S_P$ when the condition of the
line (a-2) is satisfied, that is, when there exists an
overlapping state.
- Defining the new state transition by BASE[r]
The line (m-2) in the procedure MODIFY, called
in the line (a-2-2), defines a new BASE[r]
which satisfies the condition (A'-5) for $r \in S_M$.
The lines (m-2-1) and (m-2-2) change the old state
transitions according to the new BASE[r], rather than
proceeding as in [3-4]. The lines (m-2-1) and (m-2-2)
make no change to the state transitions when there is no
overlapping state. After the overlapping state has been
processed by the procedure MODIFY, the
lines (s-1), (s-2), and (s-3) in the procedure
INS_STR correctly store the information about the
state transition $g(r, a) = t \in S_P$ for the new BASE[r],
so that the double array satisfies the conditions (A'-1),
(A'-3), and (A'-5).
- Deleting the information about the old state
transition
When we process to avoid the overlapping state, in the
line (m-3), the information about the old BASE[r]
is initialized to zero and, at the same time, the unused
elements are added to the list of unused elements.
Thus, the double array satisfies the condition (A'-4)
after inserting a string.
b) Procedure B_INSERT
The lines (b-1-1)-(b-1-3) record the state transitions
leaving the state r among the state transitions related to
the common prefix b of the string STR[r] = bdz for the
current state $r \in S_P$ and x' = bcy, the rest of the
input string. At this time, the line (b-1-1) calls
the function X_CHECK and determines
$\mathrm{BASE}[r_{b_i}]$ for $b_i$ ($i = 1, 2, \ldots, k$) so that the
condition (A'-5) is satisfied. The lines
(b-1-2) and (b-1-3) record the state
transitions for $b_i$ in the double array according to the
condition (A'-1).
The line (b-2) determines $\mathrm{BASE}[r_i]$, which
satisfies the condition (A'-5) for the state $r_i$ such that
$i \le n$, $g(r_{i-1}, b_k) = r_i \in S_M$, using the function
X_CHECK. The lines (b-3) and (b-4)
record the information about the state transitions
$g(r_i, d) \in S_P$, $g(r_i, c) \in S_P$ in the double array
and in the array TAIL according to the conditions
(A'-1)-(A'-3) using the PROCEDURE INS_STR. Then,
the lines (b-3) and (b-4) rearrange the
unused elements in the step (s-5) of the PROCEDURE
INS_STR according to the condition (A'-4). Therefore,
after inserting the input string $x \notin K$, the double
array satisfies the condition A'.
3) Deletion algorithm
The algorithm for deleting a string from the double
array is obtained by modifying return TRUE in (1-2) and
(1-3) of Algorithm 1.
Modifying return TRUE in (1-2) and (1-3)
begin /* r ∈ S_p */
  BASE[r] ← 0; CHECK[r] ← 0
  return TRUE
end
Theorem 3. After deleting a string $x \in K$ from the
double array D by the deletion algorithm above, D
satisfies the condition $A'$.
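A minimal Python sketch of this deletion step is given below. It reuses the hand-built arrays for the hypothetical keyword set {"ab#", "acd#"}; the symbol codes and the helper find_leaf are assumptions of the sketch. It only clears the cell of the state in $S_p$, as the modified steps do, and does not reclaim the TAIL string or states that are left without transitions.

```python
# Hypothetical symbol codes and hand-built arrays for K = {"ab#", "acd#"}.
CODE = {"#": 1, "a": 2, "b": 3, "c": 4, "d": 5}
BASE  = [0, 1, 0, 2, 0, -1, -2]
CHECK = [0, 0, 0, 2, 0,  3,  4]
TAIL  = [None, "#", "d", "#"]


def find_leaf(x):
    """Return the state in S_p at which x is accepted, or 0 if x is not stored."""
    r, h = 1, 0
    while BASE[r] >= 0:
        if h >= len(x) or x[h] not in CODE:
            return 0
        a = CODE[x[h]]
        t = BASE[r] + a
        if t >= len(CHECK) or CHECK[t] != a:    # (1-1)
            return 0
        r, h = t, h + 1
    if x[h - 1] == "#":                         # (1-2)
        return r
    p = -BASE[r]
    end = TAIL.index("#", p)
    stored = "".join(TAIL[p:end + 1])
    return r if x[h:] == stored else 0          # (1-3)


def delete(x):
    """Clear the cell of the S_p state for x; return True iff x was stored."""
    r = find_leaf(x)
    if r == 0:
        return False
    BASE[r] = 0
    CHECK[r] = 0
    return True
```

After delete("ab#"), cell 5 is free again (CHECK[5] = 0), so a second search for "ab#" fails in step (1-1), while "acd#" is still found.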
4. Evaluation
In our evaluation, we use the following terms.
n (= |K|): the number of input strings;
e: the number of input symbols;
l: the size of the double array;
u: the size of the double array for K, the set of input
strings (the number of states of $S_M \cup S_P$);
m: the number of unused elements.
It holds that $\exists k, h > 0$: $u = kn$, $m = he$, $l = u + m$.
Let $C_{ext}$ be the size of the array CHECK in the
existing methods, and let $C_{new}$ be the size of the
array CHECK in our method. We assume that
$2^{t-1} < l \le 2^t$ and $2^{z-1} < e \le 2^z$.
Then, we have
$C_{ext} = tl$,   (1)
$C_{new} = zl$.   (2)
Since $\log_2 l \le t < \log_2 l + 1$ and
$l = u + m = kn + he$, it follows from (1) that t increases
as n increases, growing as $\log_2 n$. In the equation (2),
on the other hand, z is a constant that does not depend on n.
Therefore, it holds that
$\lim_{n \to \infty} \frac{C_{ext}}{C_{new}} = \lim_{n \to \infty} \frac{tl}{zl} = \infty$.
The above equation says that as n increases, the
storage space of the array CHECK in the
existing methods becomes much greater than that
of our method.
On the other hand, the time cost of the search operation
in our method is the same as in the
existing methods. Since it depends on the time needed
for running the procedure MODIFY, the time
complexity of the insertion operation is at most
$O(e(he + 1))$. In the existing methods, the time
complexity of the insertion operation is
$O(e^2(h + 1))$, so there is little difference
between the two methods in the time complexity of
insertion. The time complexity of the deletion
operation is at most $O(k + m) = O(k + he)$.
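As a worked illustration of equations (1) and (2), the following Python sketch computes $C_{ext}$ and $C_{new}$ in bits for given l and e. The function names and the sample values l = 1,000,000 and e = 256 are hypothetical and are not taken from the paper.

```python
def bits_for(n):
    """Smallest t with 2**(t-1) < n <= 2**t, for n >= 2."""
    return (n - 1).bit_length()


def check_sizes(l, e):
    """Return (C_ext, C_new) in bits for array size l and alphabet size e."""
    t = bits_for(l)   # each CHECK element holds a state number
    z = bits_for(e)   # each CHECK element holds an input symbol
    return t * l, z * l


c_ext, c_new = check_sizes(1_000_000, 256)
```

For these sample values t = 20 and z = 8, so the array CHECK takes 20,000,000 bits in the existing layout against 8,000,000 bits in the modified one, a ratio of 2.5 that keeps growing as l grows.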
5. Conclusion
In this paper, we have presented a method of reducing
the storage space needed for the double array
in the HDS method. By using our method, one can
reduce the storage space needed for applying
HDS without degrading its fast retrieval. We
modified the second equation in the condition for
constructing the double array so that each element of
the array CHECK needs only enough space to distinguish
|I| values, where I is the set of input symbols. Our
evaluation shows that the presented technique is useful
in reducing the space needed for applying the HDS method
while keeping the search fast.
References
[1] E. Fredkin. Trie memory. Communications of the
ACM, Vol. 3, No. 9, 1960.
[2] J. Aoe. Retrieval of natural language dictionaries:
a high-speed digital search algorithm using a double
array (in Japanese). bit, Vol. 21, No. 6, 1990.
[3] K. Morimoto et al. A dictionary retrieval algorithm
using two tries (in Japanese). IEICE Transactions,
J76-D-II, No. 11, 1993.
[4] J. Aoe. An efficient digital search algorithm by
using a double array. IEEE Transactions on Software
Engineering SE-15(9) (1989) 1066-1077.
[5] Masafumi Koyama, Masaki Oono, Kazuhiro
Morita, Jun-ichi Aoe. A fast and compact technique of
implementing transition tables for finite state
automata. Information Sciences 129 (2000) 141-154.
[6] Shoji Mizobuchi, Toru Sumitomo, Masao Fuketa,
Jun-ichi Aoe. An efficient representation for
implementing finite state machines based on the
double-array. Information Sciences 129 (2000) 119-139.