A compact sparse matrix representation using random hash functions

Data & Knowledge Engineering 32 (2000) 29-49
www.elsevier.com/locate/datak

Ji-Han Jiang a, Chin-Chen Chang a, Tung-Shou Chen b,*

a Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan 62107, ROC
b Department of Computer Science and Information Management, Providence University, Shalu, Taichung County, Taiwan 43301, ROC

Received 17 February 1998; received in revised form 6 October 1998; accepted 16 April 1999
Abstract

In this paper, a practical method is presented that allows for the compact representation of sparse matrices. We employ random hash functions and apply the rehash technique to the compression of sparse matrices. Using our method, a large-scale sparse matrix can be compressed into a few condensed tables. The zero elements of the original matrix can be determined directly from these condensed tables, and the values of the nonzero elements can be recovered in row major order. Moreover, the space occupied by these condensed tables is small. Although the elements cannot be referenced directly, the compression result can be transmitted progressively. Performance evaluation shows that our method achieves an effective improvement in the compression of randomly distributed sparse matrices. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Data compression; Hash function; Matrix compression; Rehash; Sparse matrix; Progressive transmission; Data filtering
1. Introduction
A matrix is sparse if the number of its nonzero elements is much smaller than the size of the matrix itself. Sparse matrices are frequently used to describe two-dimensional arrays, sparse tables [1,13,15,3], handwritten signature maps [14], and Chinese character patterns [6,11]. Storing sparse matrices and sparse tables is an important problem of data processing. A sparse table is used to permit efficient processing of membership queries. For sparse matrices, not only must membership queries be supported, but the original positions of the elements (i.e., the indices of the elements) must also be preserved. The indices are important for the elements of a sparse matrix. Many methods can be used to store sparse tables [13,15], but they cannot be applied to compress sparse matrices effectively.

Consider an n1 × n2 matrix. We can simply store the matrix in a two-dimensional array of size n1 × n2. This method is direct enough but not economically reasonable, especially for a large-scale
sparse matrix. In order to achieve more effective storage usage and a more reasonable transmission cost for sparse matrices, a more effective data compression technique is necessary. Many methods can be used to compress sparse matrices, but most of them are designed for special applications (e.g., handwritten signature maps and Chinese character patterns) under particular data distribution patterns. For example, the bit map of a Chinese character pattern contains many large homogeneous regions. Moreover, these methods use complex preprocessing to yield good compression results. We call a matrix randomly distributed if its nonzero elements are distributed randomly throughout the matrix. For the compression of randomly distributed sparse matrices, the nonzero-term method (NM) [5] and Ziegler's method (ZM) [16] are the best-known methods. In addition, Chang et al. [2] have proposed a rehash method (RM) to compress sparse binary-matrices.

NM is simple and effective for extremely sparse matrices, but it requires a linear search over the nonzero elements. ZM makes use of left-shift and merging operations to compress sparse matrices. However, the compression result of ZM is related to the dimension of the input matrix, and the selection of shift numbers is time-consuming. As for RM, it applies the rehash technique to sparse binary-matrices, whose elements are limited to 0 or 1 only. Unlike NM and ZM, RM cannot be applied to general (non-binary) sparse matrices. To avoid the drawbacks of NM, ZM, and RM, we need another effective method to compress sparse matrices.

In the past, methods for compressing sparse matrices have focused on achieving more effective storage usage and more efficient processing of membership queries. In some environments, however, the receiver wants the incoming data to be filtered in compressed form or to be transmitted progressively, especially when the receiver is power restricted, such as a mobile client or a local user on the Internet. Besides the efficiency of storage usage and membership query processing, the applicability of progressive transmission and data filtering must therefore also be considered.
In this paper, we extend the rehash technique of RM and propose a new method to compress sparse matrices. For each sparse matrix, we use one random hash function to map its elements into a small table named the hash indicator table (HIT). The elements mapped to an entry of the HIT are successful if they are all zero or all nonzero. Otherwise, they are unsuccessful. The HIT is used to identify the successful elements, and the values of the nonzero ones are stored in nonzero lists. As for the unsuccessful elements, they must be rehashed by other random hash functions until all the elements of the sparse matrix are successful. We collect the nonzero lists and put them into an element collection table (ECT) to reduce the storage space of the nonzero lists. The HIT and the ECT are our compression result.

The storage spaces of both the HIT and the ECT have been shown to be small [2]. The zero elements of the original matrix can be determined directly through these condensed tables, but the values of the nonzero elements need to be recovered in row major order (i.e., by decompressing sequentially). Therefore, our method is most suitable for sparse matrices that are compressed only for off-line storage or transmission and fully decompressed before being accessed. It can also be used for progressive transmission and data filtering. Since the hashing result of each hash function is stored in the HIT separately, the HIT can be transmitted progressively. Moreover, the HIT, or the partial HIT relevant to a query, can act as a filter of incoming data. Our method can be used in systems that require progressive transmission of a great number of large static tables or images.
The rest of this paper is organized as follows. In Section 2, we give a quick review of NM, ZM, and RM. Section 3 introduces our method and shows how random hash functions are used in matrix compression; the compression and decompression algorithms based on random hash functions are also presented in this section. The performance analysis of our method is given in Section 4. Finally, conclusions are given in Section 5.
2. Previous works
2.1. Nonzero-term method (NM)
NM uses triples of the form (row-index, column-index, value) to store the nonzero elements of a sparse matrix. Consider an n1 × n2 sparse matrix and suppose it has t nonzero elements. NM requires t triples to store this sparse matrix. It is possible that the storage space for these triples is larger than that of the original matrix itself, in which case NM cannot compress the sparse matrix. So, NM fits only the compression of extremely sparse matrices. For example, let M1 be the 4 × 4 sparse matrix

M1 = | 0 0 3 0 |
     | 5 0 0 1 |
     | 0 0 2 0 |
     | 0 0 6 0 |

M1 can be represented by NM as follows:

(1, 3, 3),
(2, 1, 5),
(2, 4, 1),
(3, 3, 2), and
(4, 3, 6).
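As an illustration (ours, not part of the original presentation), the following Python sketch builds the NM triples for the matrix M1 above; the function and variable names are ours.

```python
def nm_compress(matrix):
    """Return the (row, col, value) triples of all nonzero elements (NM)."""
    triples = []
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                triples.append((i, j, value))
    return triples

M1 = [
    [0, 0, 3, 0],
    [5, 0, 0, 1],
    [0, 0, 2, 0],
    [0, 0, 6, 0],
]

print(nm_compress(M1))
# [(1, 3, 3), (2, 1, 5), (2, 4, 1), (3, 3, 2), (4, 3, 6)]
```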
2.2. Ziegler's method (ZM)
ZM uses a sequence of left-shift and merging operations to compress sparse matrices. A linear array R is used to record the current compression result. Initially, ZM assigns the first row of the sparse matrix to R. Next, it compresses the sparse matrix row by row. Consider a row of the sparse matrix. ZM shifts it left until each nonzero element of this row matches a zero element of R. If necessary, ZM adds extra zero elements to the head of R. Furthermore, ZM requires that the nonzero elements of this row be as close as possible to those of R. This process is called the left-shift operation of ZM. After this process, ZM merges this row into R (the merging operation). In the same way, ZM shifts the other rows of the sparse matrix and merges them into R. Finally,
the linear array R is the compression result of the input sparse matrix. Now, consider the example above and suppose we use ZM to compress M1. Then,

Step 0: Initially, ZM assigns the first row of M1 to R (i.e., R = [0, 0, 3, 0]).

Step 1: Since each nonzero element of the second row matches a zero element of R, ZM merges it with R directly. Therefore, the compression result of the first two rows of M1 is [5, 0, 3, 1].

Step 2: ZM shifts the third row of M1 to the left by one location and merges it with R. In the following, the merging operation is shown by aligning the shifted row under R and writing the merged result below.

    5 0 3 1        (current R)
  0 0 2 0          (the third row, shifted left by one)
  0 5 2 3 1        (merged result)
Step 3: ZM shifts the fourth row to the left by three locations and merges it with [0, 5, 2, 3, 1]. Then, M1 is compressed as:

      0 5 2 3 1    (current R)
  0 0 6 0          (the fourth row, shifted left by three)
  0 0 6 5 2 3 1    (merged result)
So, the final R is [0, 0, 6, 5, 2, 3, 1] (i.e., the merging result of M1 is [0, 0, 6, 5, 2, 3, 1]). In order to decompress M1 correctly, ZM also needs to record the shift number of each row and the addresses of the nonzero terms in the sparse matrix. ZM employs an extra row F to keep the shift number of each row as well as a binary-matrix B1 to record the addresses of the nonzero terms in M1. In this example, F = [0, 0, 1, 3] and

B1 = | 0 0 1 0 |
     | 1 0 0 1 |
     | 0 0 1 0 |
     | 0 0 1 0 |
Therefore, we need seven integers to record the merging result, four integers to keep the shift numbers, and a 4 × 4 binary-matrix to indicate the addresses of all nonzero elements in M1. However, the addresses of the nonzero elements can be stored with a more effective method [1]. We can use a linear array L, parallel to R, to specify the row indices of the nonzero elements (i.e., L indicates which nonzero elements belong to which row). So, the compression result of M1 is represented as:

R = [0, 0, 6, 5, 2, 3, 1],
L = [0, 0, 4, 2, 3, 1, 2], and
F = [0, 0, 1, 3].
For instance, the entries of L in positions 4 and 7 are both "2", denoting that the nonzero elements "5" and "1" in R belong to the second row of M1. Furthermore, the second entry of F is "0" (i.e., the second row was not shifted), so the column indices of "5" and "1" are "1" and "4", respectively. Note that the lengths of L and R are affected by the merging result and the number of nonzero elements (i.e., t). If the matrix is not very sparse, or the storage space for L and R is much greater than t, then using L to represent the addresses becomes uneconomical.

The compression result of ZM depends on the dimension of the input matrix and on the result of the merging operations. When n1 is much greater than n2, the storage space of F is much greater than in the case n1 = n2. The compression result also depends on the distribution of the nonzero elements. R is not very compact, and the process of determining the shift numbers is time-consuming. Thus, the compression of ZM is ineffective and somewhat complex.
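The left-shift and merging operations can be sketched as follows. This is a simplified Python illustration of ours, assuming the shift chosen for each row is the smallest one, measured from the first row's origin, that places every nonzero element of the row on a zero entry of R; the L array and the decompression side are omitted.

```python
def collides(R, head, row, shift):
    """True if some nonzero of `row`, shifted left by `shift` (relative to the
    first row's origin), lands on a nonzero entry of R. `head` counts the
    zeros already prepended to R."""
    for j, value in enumerate(row):
        idx = j - shift + head              # 0-based index into the current R
        if value != 0 and 0 <= idx < len(R) and R[idx] != 0:
            return True
    return False

def zm_compress(matrix):
    R = list(matrix[0])
    head = 0                                # how far R's origin has moved left
    F = [0]                                 # shift number of each row
    for row in matrix[1:]:
        shift = 0
        while collides(R, head, row, shift):
            shift += 1
        pad = max(0, shift - head)          # zeros to prepend so the row fits
        R = [0] * pad + R
        head += pad
        for j, value in enumerate(row):
            if value != 0:
                R[j - shift + head] = value
        F.append(shift)
    return R, F

M1 = [[0, 0, 3, 0], [5, 0, 0, 1], [0, 0, 2, 0], [0, 0, 6, 0]]
print(zm_compress(M1))    # ([0, 0, 6, 5, 2, 3, 1], [0, 0, 1, 3])
```

Run on M1, the sketch reproduces the R = [0, 0, 6, 5, 2, 3, 1] and F = [0, 0, 1, 3] of the example above.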
2.3. Rehash method (RM)
The rehash method (RM) is an effective method for compressing sparse binary-matrices. RM employs many random hash functions; they map the elements of a sparse binary-matrix into a small table. We call this method the rehash method and call the small table the hash indicator table (HIT). Consider a binary-matrix B and suppose we want to compress it with RM, with the HIT containing m entries. RM first applies one random hash function and maps all the elements of B into the m entries of the HIT. Consider entry i of the HIT. If all the elements mapped into it have the same value (zero or one), then we call this entry and these elements successful. These successful elements do not need to be rehashed again; RM records all the successful entries and the values of the elements mapped to them (i.e., "0" for zero and "1" for one) in the HIT. As for the unsuccessful elements, RM has to rehash them with other random hash functions. The process is repeated until all the elements of B are successful. In the following, we use F_N^m to denote the set of random hash functions that randomly map N keys into an address space of size m.

For example, consider a 4 × 4 (i.e., N = 16) binary-matrix B2 with elements eij, where 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4:

B2 = | 0 0 1 0 |
     | 1 0 0 1 |
     | 0 0 1 0 |
     | 0 0 1 0 |

Suppose we use RM to compress B2. Let the size of the HIT be four (i.e., m = 4). Assume that h1, h2, and h3 are three random hash functions selected from F_N^m. Their hashing results are listed in Table 1. In this table, hash values in parentheses belong to elements that have already been successfully mapped by a previous hash function, so these entries can be ignored by the following hash functions.

In this example, only the four elements e14, e23, e34, and e44 are mapped to the second entry of the HIT by h1. The values of e14, e23, e34, and e44 are all zero. So, the elements mapped to the second entry are successful after being hashed by h1. In the same way, we find that the elements mapped to entries {1, 4} and {2, 3} of the HIT are successful after being hashed by h2 and h3, respectively. Since all the elements of B2 are successful after the hashing operations of h1, h2, and h3, the total number of random hash functions applied in this example is three.
Table 1
The hashing results of hash functions h1, h2, and h3 (hash values in parentheses belong to elements already successfully mapped by an earlier hash function and are ignored)

Element  Index (kij)  Value  h1   h2    h3
e11       1           0      3    1    (2)
e12       2           0      4    3     3
e13       3           1      1    4    (4)
e14       4           0      2   (2)   (2)
e21       5           1      4    4    (1)
e22       6           0      1    3     3
e23       7           0      2   (1)   (4)
e24       8           1      3    3     2
e31       9           0      3    2     3
e32      10           0      4    1    (1)
e33      11           1      4    2     2
e34      12           0      2   (2)   (1)
e41      13           0      1    1    (4)
e42      14           0      3    3     3
e43      15           1      1    4    (1)
e44      16           0      2   (3)   (4)
During compression, RM employs an HIT to record the compression result. For each entry of this HIT, the binary field fℓ denotes the hashing result of hℓ (i.e., fℓ = "1" if all the elements mapped to the entry are successful; otherwise, fℓ = "0"). The binary field vℓ is used to record the value of the successful elements mapped to the entry. The subscript ℓ denotes the ℓ-th hash function used in RM. Consider the example above again. After applying h1, the four elements e14, e23, e34, and e44 are successful. They are mapped to the second entry of the HIT. So, we set f1 and v1 of the second entry to "1" and "0", respectively, while the f1 values of the other entries are set to "0". The values of f2, v2, f3, and v3 of the other entries can be set in the same way. Finally, the compression result of the matrix B2 is shown in Fig. 1. If the value of a field is denoted by "*", the value can be set to either "1" or "0", and the setting does not affect the compression result (i.e., don't care).

Now, we use an example to describe the decompression process of RM. Say we want to restore the value of e43. First, we find that e43 is mapped to the first entry of the HIT by h1, and the f1 of this entry is "0". That is, e43 is not successfully mapped after being hashed by h1. We then continue to check the hashing results of e43 until it is mapped successfully. Since e43 is mapped to the fourth entry of the HIT by h2, and the f2 of this entry is "1", e43 is successful after being hashed by h2. Therefore, the value of e43 is equal to the v2 of the fourth entry of the HIT (i.e., e43 = 1).

The number of random hash functions applied in RM has been shown to be small [2]. Thus the execution time of RM is short, and the compression result (i.e., the HIT) of RM is also small. Overall, RM is efficient and effective in terms of execution time and compression ratio. It can, however, compress sparse binary-matrices only. Furthermore, RM is only appropriate for compressing very sparse matrices [2].

Fig. 1. The HIT of Example 1 (*: don't care).
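For illustration, a minimal Python sketch of RM is given below. It is ours, not the authors' implementation: the random hash functions are simulated with a seeded pseudo-random generator (so the individual hash values differ from those hand-listed in Table 1), and the number of rehash levels is capped; only the mechanism is reproduced.

```python
import random

def make_hash(seed, m):
    """A reproducible pseudo-random hash from integer keys onto {1, ..., m}."""
    return lambda key: random.Random(seed * 1_000_003 + key).randint(1, m)

def rm_compress(bits, m):
    """bits maps each key 1..N to 0 or 1. Returns the HIT as one (f, v) table
    per hash function, where f marks a successful entry and v its value."""
    pending = set(bits)
    hit, level = [], 0
    while pending and level < 64:            # cap to keep the sketch finite
        level += 1
        h = make_hash(level, m)
        buckets = {}
        for k in pending:
            buckets.setdefault(h(k), set()).add(k)
        table = [(0, 0)] * m
        for p, keys in buckets.items():
            values = {bits[k] for k in keys}
            if len(values) == 1:             # all zero or all one: successful
                table[p - 1] = (1, values.pop())
                pending -= keys
        hit.append(table)
    return hit

def rm_lookup(hit, m, key):
    """Rehash `key` level by level until a successful entry is found."""
    for level, table in enumerate(hit, start=1):
        f, v = table[make_hash(level, m)(key) - 1]
        if f == 1:
            return v

B2 = {k: v for k, v in enumerate(
    [0, 0, 1, 0,  1, 0, 0, 1,  0, 0, 1, 0,  0, 0, 1, 0], start=1)}
hit = rm_compress(B2, m=4)
print(len(hit), [rm_lookup(hit, 4, k) for k in sorted(B2)])
# prints the number of hash functions used and the recovered bits of B2
```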
3. Our method
Apparently, NM and ZM are designed to compress sparse matrices. However, under some circumstances, as mentioned in Section 2, they do not perform very effectively. Thus, we need another effective method that avoids these obstacles. In this section, we provide such a method. Basically, our method extends the rehash technique of RM [2]. Using this method, any randomly distributed sparse matrix can be compressed into a few condensed tables.
3.1. Basic concept
Let M = [eij]_{n1×n2} be a randomly distributed sparse matrix of size N (i.e., N = n1 × n2), where 1 ≤ i ≤ n1 and 1 ≤ j ≤ n2. The element of M located in row i and column j is denoted by eij. Assume that there are t nonzero elements in M (t is much less than N). The proposed method processes the elements of M in a linear order. First, a row major arrangement [7] is adopted to put the elements of M in a linear order described by a linear ordering set. Let kij denote the index of element eij (i.e., kij indexes eij). Then kij = (i - 1) × n2 + j. Obviously, this is a one-to-one mapping (i.e., kij ≠ ki'j' if i ≠ i' or j ≠ j'), so each element eij has a unique index kij. Let S denote the set of indices {kij}, where 1 ≤ i ≤ n1 and 1 ≤ j ≤ n2 (i.e., S = {1, 2, ..., N}). Next, a random hash function h1 selected from F_N^m is employed to map the indices of M into an address space of size m. Here S acts as the key space of h1. In this way, the elements of M are processed in row major order.
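As a worked instance (ours): in a 4 × 4 matrix (n2 = 4), element e23 receives the index k23 = (2 - 1) × 4 + 3 = 7, which is exactly the index listed for e23 in Table 2 of Section 3.2.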
The hash function h1 can be represented by the list of values h1(kij), where 1 ≤ h1(kij) ≤ m, 1 ≤ i ≤ n1, and 1 ≤ j ≤ n2. Let Cp = {kij, ki'j', ..., ki''j''} denote the set of indices mapped to location p by h1 (i.e., h1(kij) = h1(ki'j') = ... = h1(ki''j'') = p). If all of the indices in Cp indicate zero elements (or all indicate nonzero elements), then p is a successfully mapped location (SML), and kij, ki'j', ..., ki''j'' are successfully mapped indices (SMIs). Otherwise, p is an unsuccessfully mapped location (UML), and kij, ki'j', ..., ki''j'' are marked as unsuccessfully mapped indices (UMIs). A binary attribute b is used to indicate the content of an SML: b = 0 (b = 1) means that all of the indices mapped to this SML point to zero elements (nonzero elements), and the SML is denoted SML0 (SML1).

The proposed method employs a small table of m entries to record the hashing result of these N indices. This small table is named the hash indicator table (HIT). In the HIT, the SMLs are indicated and the values of the nonzero elements are stored in linear lists. The basic structure of an HIT is shown in Fig. 2. Each entry of an HIT contains two binary fields, F and A. Consider entry p of an HIT. The field F is set to "1" if location p is an SML after the hashing of h1; otherwise, this field is set to "0". In addition, the field A is set to "1" (or "0") if the attribute b of this SML is 1 (or 0). As for UMLs, the values in field A are meaningless (i.e., they could be "1" or "0"). In Fig. 2, locations p1 and p2 are an SML0 and an SML1, respectively, so the (F, A) pairs of p1 and p2 are ("1", "0") and ("1", "1"), respectively. The field A of p1 is "0", so all the elements mapped to location p1 by their indices are zeroes; that is, the SMIs in Cp1 indicate {0, 0, ..., 0}, and the nonzero list of p1 is a null list. On the other hand, the field A of p2 is "1". Accordingly, Cp2 = {kij, ki'j', ..., ki''j''} indicates the set {eij, ei'j', ..., ei''j''}, where all the elements are nonzero. In this case, the values of eij, ei'j', ..., ei''j'' are stored in the nonzero list of p2. Chang et al. [2] have shown that it is almost impossible to hash every kij successfully using only one hash function. So, we need other hash functions to compress sparse matrices.
Fig. 2. The basic structure of an HIT.
3.2. Our compression method
In this paper, the rehash technique of RM is extended to compress sparse matrices. Using a set of different hash functions, this technique repeatedly rehashes the UMIs of S until no UMI remains. The basic structure illustrated in Fig. 2 is only adequate for recording the hashing result of one hash function. The HIT we use is therefore enlarged so as to record the attribute b of each SML and to indicate the hash function that makes it successful. The rehash model of the proposed method is described in Fig. 3.

In Fig. 3(a), the number of entries in the HIT is again m, and the hash functions are selected from F_N^m. Each entry of the HIT is connected with an FIFO (first in, first out) nonzero list. The entries of the HIT are composed of two binary strings: one is the function indicator string (FIS), and the other is the attribute indicator string (AIS). The structure of entry p of an HIT, denoted HIT[p], is shown in Fig. 3(b). Here the values of fℓ and bℓ are 0 or 1 (ℓ = 1, 2, ..., r), and r denotes the number of random hash functions needed in the compression process. The FIS and the AIS are used to maintain the whole index information of the original matrix.
Suppose a sparse matrix is to be compressed with our method. For convenience, let HIT[p].fℓ and HIT[p].bℓ denote the values of fℓ and bℓ of HIT[p], respectively. HIT[p].fℓ and HIT[p].bℓ are defined as follows. Assume that HIT[p] is an SML with attribute b after the hashing of hℓ, and let Cp = {kij, ki'j', ..., ki''j''} denote the set of SMIs mapped to HIT[p] (i.e., kij, ki'j', ..., ki''j'' index eij, ei'j', ..., ei''j''). Then,

(1) HIT[p].fℓ is set to "1". Also,
(2) if b = 1, HIT[p].bℓ is set to "1" and eij, ei'j', ..., ei''j'' are added to the nonzero list Lp in order; if b = 0, HIT[p].bℓ is set to "0".

Otherwise, if HIT[p] is a UML, HIT[p].fℓ is set to "0" and HIT[p].bℓ is meaningless (i.e., don't care). Note that Lp remains a null list if location p is always a UML or an SML0 after being hashed by the r hash functions. Briefly, the setting of HIT[p].fℓ and HIT[p].bℓ is summarized in Fig. 3(c).
Most of the elements in a sparse matrix are zeroes, so the majority of SMIs point to zero elements (i.e., most of the successful elements are zeroes) at the beginning of the compression process. That is, very few indices that stand for nonzero elements are SMIs. Let si_ℓ^1 denote the number of SMIs that index nonzero elements after being hashed by hℓ. Chang et al. [2] suggest that these si_ℓ^1 nonzero elements can be ignored if the ratio si_ℓ^1/t is less than a predefined threshold value TH; in that case the si_ℓ^1 nonzero elements must be rehashed by another hash function, and HIT[p].fℓ is set to "0". Assume that the ratios si_ℓ^1/t for 1 ≤ ℓ ≤ r0 are not greater than TH until the (r0 + 1)-th hash function is applied to M. Accordingly, the leading r0 bits of the AISs of the HIT (i.e., the bit-string b1 b2 ... b_{r0}) are all "0". Therefore, we can delete the bit-strings f1 f2 ... f_{r0} and b1 b2 ... b_{r0} of each entry of the HIT and store the HIT in a more compact form. We only need to keep the number r0 and the reduced HIT, where each entry contains 2 × (r - r0) bits for the FIS and the AIS. As for the values in the nonzero lists, they can be stored more compactly if they are collected in a linear array. Instead of keeping pointers to the nonzero lists, we only have to maintain an offset in each entry of the HIT. This is shown in Fig. 4(a), and the entry of the modified HIT is illustrated in Fig. 4(b).

Fig. 3. The rehash model of our method and the structure of an HIT. (a) The rehash model. (b) The entry p of the HIT. (c) The values of HIT[p].fℓ and HIT[p].bℓ.
In Fig. 4(a), we adopt an element collection table (ECT), instead of m nonzero lists, to store the nonzero elements. The nonzero elements of these m lists are stored in the ECT in the order of L1, L2, ..., Lm. The OFF field of the HIT denotes the offset of each nonzero list within the ECT. For a sparse matrix, the combination of the HIT, the ECT, r, and r0 is our final compression result, and we can use the information they keep to reconstruct the original matrix.
Fig. 4. The modified HIT and the ECT, and the entry of the modified HIT. (a) The modified HIT and the ECT. (b) The entry of the modified HIT.
We should emphasize here that although Chang et al. [2] also used an HIT to record the hashing results, our HIT structure is by no means the same as theirs. In our HIT, we use the field OFF as an offset into the corresponding nonzero list, and we use the data structure ECT to hold all the nonzero elements; neither is used in [2].

Example 1 illustrates how to compress a sparse matrix into the HIT and the ECT.
Example 1 (Compression). Let h1, h2, and h3 be three random hash functions selected from F_16^4 (i.e., m = 4). Their corresponding hash values are listed in Table 2. In this table, the value p of the entry (hℓ, kij) denotes that hℓ(kij) = p, where 1 ≤ ℓ ≤ 3 and 1 ≤ i, j ≤ 4. Hash values in parentheses belong to elements that were already successfully mapped by a previous hash function.
Table 2
The hash functions h1, h2, and h3 of Example 1 (hash values in parentheses belong to elements already successfully mapped by an earlier hash function and are ignored)

Element  Index (kij)  Value  h1   h2    h3
e11       1           0      3    1    (2)
e12       2           0      4    3     3
e13       3           3      1    4    (4)
e14       4           0      2   (2)   (2)
e21       5           5      4    4    (1)
e22       6           0      1    3     3
e23       7           0      2   (1)   (4)
e24       8           1      3    3     2
e31       9           0      3    2     3
e32      10           0      4    1    (1)
e33      11           2      4    2     2
e34      12           0      2   (2)   (1)
e41      13           0      1    1    (4)
e42      14           0      3    3     3
e43      15           6      1    4    (1)
e44      16           0      2   (3)   (4)
In this example, four indices are mapped to location 2 by h1, namely 4, 7, 12, and 16, and they index the elements 0, 0, 0, and 0, respectively. Therefore, the second location is an SML0, and 4, 7, 12, and 16 are SMIs after being hashed by h1. Obviously, the other locations are UMLs, and the other indices are UMIs. In the same way, we find that locations 1 and 4 are SMLs after the hashing of h2, and locations 2 and 3 are SMLs after the hashing of h3. Since all indices become SMIs after the hashing of h1, h2, and h3, the total number of random hash functions is three (i.e., r = 3). The hashing results of h1, h2, and h3 are summarized in Table 3.

The second location is an SML0 after being hashed by h1, so we set HIT[2].f1 to "1" and HIT[2].b1 to "0". After applying h2, the first and the fourth locations are an SML0 and an SML1, respectively. So, we set HIT[1].f2 and HIT[4].f2 to "1", and set HIT[1].b2 and HIT[4].b2 to "0" and "1", respectively. At the same time, the set of elements {3, 5, 6} indexed by C4 = {3, 5, 15} is added to L4 in the order 3, 5, 6. The other fields of the HIT can be set in the same way. The HIT and the nonzero lists for M1 are shown in Fig. 5.

Next, we store the four nonzero lists in the ECT in the order L1, L2, L3, L4. In the ECT, the offsets of L1, L2, L3, and L4 are 0, 1, 0, and 3, respectively. Therefore, we set the values of the OFF fields in the HIT to 0, 1, 0, and 3, respectively. The final compression result of Example 1 is shown in Fig. 6. Here we set TH = 0, so r0 = 0. In this example, m = 4, r = 3, r0 = 0, and t = 5. Assume that we need EL bits to store a nonzero element. Following Eq. (8), which we will derive in Section 4, the memory space occupied by the final compression result of M1 is [2 × (3 - 0) + ⌈log2 5⌉] × 4 + (5 + 2) × EL bits.
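To make the count concrete (a numerical evaluation of ours, assuming EL = 16 bits as in the experiments of Section 4.2): since ⌈log2 5⌉ = 3, the compressed size is (2 × 3 + 3) × 4 + 7 × 16 = 36 + 112 = 148 bits, whereas storing M1 directly as sixteen 16-bit elements takes 256 bits.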
3.3. Decompression
This subsection describes how to restore (i.e., decompress) the elements of a sparse matrix. Let the size of the original matrix be N (= n1 × n2), and let the compression result consist of an HIT, an ECT, r, and r0.
Table 3
The SMLs and their attributes after hashing by h1, h2, and h3

Hash function  SML  Attribute (b)  SMIs           Indexed values
h1             2    0              4, 7, 12, 16   0, 0, 0, 0
h2             1    0              1, 10, 13      0, 0, 0
h2             4    1              3, 5, 15       3, 5, 6
h3             2    1              8, 11          1, 2
h3             3    0              2, 6, 9, 14    0, 0, 0, 0
Fig. 5. The HIT and nonzero lists of Example 1.
Fig. 6. The final HIT and ECT of Example 1.
Assume that F_N^m already exists at both the sender and the receiver. In the following, we use HIT[p].O to denote the value of the OFF field of HIT[p]. The elements of M can be restored in the order e11, e12, ..., e_{n1 n2} (i.e., in row major order). To recover eij, we sequentially apply the hash functions h_{r0+1}, h_{r0+2}, ..., h_r to kij until HIT[hℓ(kij)].f_{ℓ-r0} = "1", where r0 + 1 ≤ ℓ ≤ r. Then we check the value of HIT[hℓ(kij)].b_{ℓ-r0}. The value of eij is ECT[HIT[hℓ(kij)].O] if HIT[hℓ(kij)].b_{ℓ-r0} equals "1"; otherwise, eij is a zero element.

After recovering eij, we must increase the value of HIT[hℓ(kij)].O by one if eij ≠ 0. We ignore the hashing results of the leading r0 hash functions; thus, if HIT[hℓ(kij)].f_{ℓ-r0} is "0" for every ℓ = r0 + 1, r0 + 2, ..., r, then the value of eij is zero. Generally, the value of r is small [2], so the decompression of each eij is efficient. Thus, we can quickly recover each element of the original matrix from the HIT and the ECT. That is, we can decompress the whole matrix completely and correctly.
Example 2 (Decompression). Let us follow the results of Example 1 and see how the elements of M1 are recovered. We decompress the elements of M1 in the order e11, e12, ..., e44. Consider e11 first. Its index is k11 = 1. We apply h1, h2, and h3 to k11 in turn until HIT[hℓ(1)].fℓ = "1", where ℓ = 1, 2, or 3. Note that HIT[h1(1)].f1 = HIT[3].f1 = "0" but HIT[h2(1)].f2 = HIT[1].f2 = "1". We then check the value of HIT[h2(1)].b2. Here HIT[h2(1)].b2 = HIT[1].b2 = "0". Hence, k11 indexes a zero element; that is, the element e11 of M1 is zero. Next, let us recover e12. Since HIT[h1(2)].f1 = HIT[4].f1 = "0", HIT[h2(2)].f2 = HIT[3].f2 = "0", HIT[h3(2)].f3 = HIT[3].f3 = "1", and HIT[h3(2)].b3 = HIT[3].b3 = "0", k12 also indexes a zero element. After the reconstruction of e11 and e12, we recover e13. We find that HIT[h1(3)].f1 = HIT[1].f1 = "0", HIT[h2(3)].f2 = HIT[4].f2 = "1", and HIT[h2(3)].b2 = HIT[4].b2 = "1". Thus, the value of e13 is ECT[HIT[4].O] = ECT[3] = 3. Note that, after recovering e13, we must increase the offset value of HIT[4] by one, so the new value of HIT[4].O equals 4. In this way, we can recover all the elements e11, e12, ..., e44 to be 0, 0, 3, 0, 5, 0, 0, 1, 0, 0, 2, 0, 0, 0, 6, and 0, respectively.
3.4. Random hash functions
In order to compress sparse matrices, we need a family of hash functions {hℓ(k)}, where hℓ ∈ F_N^m maps a key k in the set S = {1, 2, ..., N} into the range 1 to m randomly. In our method, each hℓ should provide a different random mapping from the set S to {1, 2, ..., m}. Some criteria for the selection of appropriate hash functions have been discussed in [8]; for example, Knuth suggests that division hashing behaves well when the divisor is a prime number (i.e., m is a prime). There are also families of hash functions that are very small and still have good randomness properties [13,15,11,12,10]. In addition, a pseudo random number generator generates a sequence of random numbers when given a seed. For convenience, we can use the pseudo random number generator of the C language to generate random numbers in the range 1 to m and use it as a random hash function. Accordingly, we may implement a set of random hash functions {h1, h2, ..., hℓ} by setting the seeds of the pseudo random number generator to 1, 2, ..., and ℓ, respectively.
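A minimal sketch of such a family (in Python rather than C; the names and the table-based construction are ours) seeds a pseudo random number generator with ℓ and draws one value per key. The resulting hash values are illustrative only and do not reproduce the hand-chosen values of Table 2.

```python
import random

def make_hash_family(count, N, m):
    """Build `count` reproducible pseudo-random hash functions from
    {1, ..., N} onto {1, ..., m}; the l-th one is seeded with l."""
    family = []
    for l in range(1, count + 1):
        rng = random.Random(l)                     # seed with l, as suggested
        table = [rng.randint(1, m) for _ in range(N)]
        family.append({k: table[k - 1] for k in range(1, N + 1)})
    return family

h = make_hash_family(3, N=16, m=4)
print(h[0][7], h[1][7], h[2][7])                   # h1(7), h2(7), h3(7)
```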
3.5. The applications
In our method, the sparse matrix is compressed by a sequence of hash functions, and the content of the matrix is added to the HIT gradually: as more hash functions are used, more indices of the elements are successfully mapped. Furthermore, the hashing result of each hash function is stored in the HIT separately (i.e., the HIT grows row by row). Therefore, the HIT can be transmitted progressively. Moreover, the HIT, or the partial HIT relevant to a given query, can act as a filter of matrices in compressed form.

There are two ways to use our method for progressive transmission and data filtering. One policy is that the sender sends the HIT progressively; at the same time, the receiver accepts the partial HIT of the matrix and checks it against its request. If it does not match the request, the receiver stops the transmission of the incoming compressed matrix; otherwise, the transmission and verification continue. The alternative policy is that the HITs of the matrices are filtered according to the receiver's request before transmission. In either case, the incoming data can be filtered in compressed form or transmitted progressively.
3.6. The Algorithms
Let M = [eij]_{n1×n2} be a sparse matrix with t nonzero elements. We can compress and decompress M with a set of hash functions {h1, h2, ...} selected from F_N^m. The algorithms are described as follows.
Algorithm 1 (Compression)
Input: a matrix M = [eij]_{n1×n2} and a threshold TH.
Output: r, r0, an HIT, and an ECT.
Begin
  Step 1. Initialize the HIT and the ECT;
    Map the indices linearly (i.e., S = {k11, k12, ..., k_{n1 n2}} = {1, 2, ..., N});
    Let ℓ = 0, r0 = 0, and r = 0;
  Step 2. While UMIs exist (i.e., S is not empty)
    ℓ = ℓ + 1;
    Use hℓ to hash the keys in S onto the HIT;
    /* Count the number of nonzero SMIs after being hashed by hℓ */
    Compute si_ℓ^1;
    If (si_ℓ^1 / t ≥ TH)
      Record the hashing results in the HIT and insert the nonzero elements into Lp;
      S = S - {SMIs after being hashed by hℓ};
    Else
      r0 = r0 + 1;
      S = S - {SMIs after being hashed by hℓ that index zeros};
    EndIf
  EndWhile
  r = ℓ;
  /* Construct the ECT and the OFF fields of the HIT */
  offset = 1;
  For each nonzero list Lp
    If Lp is not null
      Insert each element of Lp into the ECT;
      HIT[p].O = offset;
      offset = offset + (length of Lp);
    Else HIT[p].O = 0;
    EndIf
  EndFor
End
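The following Python sketch (ours, not the authors' code) implements Algorithm 1 with the threshold TH fixed to 0, so r0 = 0 and every level's hashing result is recorded; the hash functions are supplied as dictionaries from the row-major index to a location in 1..m. Run with the hash values of Table 2, it reproduces the ECT and the OFF values of Example 1.

```python
def compress(matrix, hashes, m):
    """Sketch of Algorithm 1 with TH = 0 (so r0 = 0). Each HIT entry is the
    list [f1, b1, ..., fr, br, OFF]; OFF is 1-based, 0 marks a null list."""
    n1, n2 = len(matrix), len(matrix[0])
    value = {(i - 1) * n2 + j: matrix[i - 1][j - 1]
             for i in range(1, n1 + 1) for j in range(1, n2 + 1)}
    S = set(value)                               # indices still unsuccessful
    levels = []                                  # one {p: (b, nonzeros)} per hash
    for h in hashes:
        if not S:
            break
        buckets = {}
        for k in sorted(S):                      # keep row-major order in the lists
            buckets.setdefault(h[k], []).append(k)
        level = {}
        for p, ks in buckets.items():
            if all(value[k] == 0 for k in ks):   # SML0
                level[p] = (0, [])
                S.difference_update(ks)
            elif all(value[k] != 0 for k in ks): # SML1
                level[p] = (1, [value[k] for k in ks])
                S.difference_update(ks)
        levels.append(level)
    if S:
        raise ValueError("more hash functions are needed")
    r = len(levels)
    HIT = [[0] * (2 * r) + [0] for _ in range(m)]
    lists = [[] for _ in range(m)]
    for idx, level in enumerate(levels):
        for p, (b, nonzeros) in level.items():
            HIT[p - 1][2 * idx] = 1
            HIT[p - 1][2 * idx + 1] = b
            lists[p - 1].extend(nonzeros)
    ECT, offset = [], 1
    for p in range(m):
        if lists[p]:
            HIT[p][-1] = offset
            ECT.extend(lists[p])
            offset += len(lists[p])
    return r, HIT, ECT

h1 = {1: 3, 2: 4, 3: 1, 4: 2, 5: 4, 6: 1, 7: 2, 8: 3,
      9: 3, 10: 4, 11: 4, 12: 2, 13: 1, 14: 3, 15: 1, 16: 2}
h2 = {1: 1, 2: 3, 3: 4, 4: 2, 5: 4, 6: 3, 7: 1, 8: 3,
      9: 2, 10: 1, 11: 2, 12: 2, 13: 1, 14: 3, 15: 4, 16: 3}
h3 = {1: 2, 2: 3, 3: 4, 4: 2, 5: 1, 6: 3, 7: 4, 8: 2,
      9: 3, 10: 1, 11: 2, 12: 1, 13: 4, 14: 3, 15: 1, 16: 4}

M1 = [[0, 0, 3, 0], [5, 0, 0, 1], [0, 0, 2, 0], [0, 0, 6, 0]]
r, HIT, ECT = compress(M1, [h1, h2, h3], m=4)
print(r, ECT, [entry[-1] for entry in HIT])
# 3 [1, 2, 3, 5, 6] [0, 1, 0, 3]  -- the ECT and OFF values described in Example 1
```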
Algorithm 2 (Decompression)
Input: r, r0, the HIT, the ECT, and the size of the original matrix M (i.e., n1 × n2).
Output: a matrix M = [eij]_{n1×n2}.
Begin
  Construct S (linear mapping of the indices);
  For each element eij, in row major order:
    ℓ = r0 + 1;
    While (HIT[hℓ(kij)].f_{ℓ-r0} = "0" and ℓ ≤ r)
      ℓ = ℓ + 1;
    EndWhile
    If (ℓ ≤ r and HIT[hℓ(kij)].b_{ℓ-r0} = "1")
      eij = ECT[HIT[hℓ(kij)].O];
      /* Modify the offset value */
      HIT[hℓ(kij)].O = HIT[hℓ(kij)].O + 1;
    Else
      eij = 0;
    EndIf
  EndFor
End
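A matching Python sketch of Algorithm 2 is given below (ours). The HIT of Example 1 is written out explicitly, with don't-care bits set to 0 and the OFF field following the paper's convention (1-based position in the ECT, 0 for a null list); running it reconstructs M1.

```python
h1 = {1: 3, 2: 4, 3: 1, 4: 2, 5: 4, 6: 1, 7: 2, 8: 3,
      9: 3, 10: 4, 11: 4, 12: 2, 13: 1, 14: 3, 15: 1, 16: 2}
h2 = {1: 1, 2: 3, 3: 4, 4: 2, 5: 4, 6: 3, 7: 1, 8: 3,
      9: 2, 10: 1, 11: 2, 12: 2, 13: 1, 14: 3, 15: 4, 16: 3}
h3 = {1: 2, 2: 3, 3: 4, 4: 2, 5: 1, 6: 3, 7: 4, 8: 2,
      9: 3, 10: 1, 11: 2, 12: 1, 13: 4, 14: 3, 15: 1, 16: 4}

HIT = [[0, 0, 1, 0, 0, 0, 0],     # entry 1: SML0 under h2
       [1, 0, 0, 0, 1, 1, 1],     # entry 2: SML0 under h1, SML1 under h3
       [0, 0, 0, 0, 1, 0, 0],     # entry 3: SML0 under h3
       [0, 0, 1, 1, 0, 0, 3]]     # entry 4: SML1 under h2
ECT = [1, 2, 3, 5, 6]

def decompress(HIT, ECT, hashes, r0, n1, n2):
    """Recover the matrix in row major order (sketch of Algorithm 2)."""
    off = [entry[-1] for entry in HIT]          # working copies of the OFF fields
    M = [[0] * n2 for _ in range(n1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            k = (i - 1) * n2 + j                # row-major index of e_ij
            for level in range(r0, len(hashes)):
                p = hashes[level][k]
                entry = HIT[p - 1]
                f, b = entry[2 * (level - r0)], entry[2 * (level - r0) + 1]
                if f == 1:
                    if b == 1:                  # nonzero: read it from the ECT
                        M[i - 1][j - 1] = ECT[off[p - 1] - 1]
                        off[p - 1] += 1
                    break                       # zero if b == 0; stop either way
    return M

for row in decompress(HIT, ECT, [h1, h2, h3], r0=0, n1=4, n2=4):
    print(row)
# [0, 0, 3, 0] / [5, 0, 0, 1] / [0, 0, 2, 0] / [0, 0, 6, 0]
```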
4. Performance analysis
Chang et al. [2] have proposed an analysis model for RM. In this section, we follow this model to analyze the performance of our method. We assume that a random hash function maps all the given keys uniformly over the m entries of the HIT. The notation used in this section is identical to that specified in [2].
4.1. The analysis of the number of hash functions
We now derive equations that determine the expected number of hash functions needed; these equations are then used to evaluate the expected storage space of the compression result. Given a sparse matrix M = [eij]_{n1×n2} with t0 nonzero elements, let S0 be the set of indices of the eij's, and let N0 be the size of M (i.e., N0 = n1 × n2). Here the subscript 0 of t0, N0, and S0 denotes the initial value. Suppose h1 is the first random hash function selected from F_N^m. After applying h1 to S0, there are N0/m keys mapped to the same location on average. Let p_1^b denote the probability that an entry of the HIT is an SML with attribute b after the hashing of h1, where b = 0 or 1. The probability that a key in S0 indexes a nonzero element is t0/N0, and the probability that it indexes a zero element is (1 - t0/N0). So, p_1^1 = (t0/N0)^{N0/m} and p_1^0 = (1 - t0/N0)^{N0/m}. Let sl_1^b and si_1^b be the numbers of SMLs and SMIs after the hashing of h1, respectively, and let esl_1^b and esi_1^b denote their expected values. Then esl_1^1 = m × p_1^1 = m × (t0/N0)^{N0/m} and esl_1^0 = m × p_1^0 = m × (1 - t0/N0)^{N0/m}. Furthermore, since there are N0/m keys in each SML on average, the values of esi_1^1 and esi_1^0 can be estimated by Eqs. (1) and (2):

$$esi_1^1 = \frac{N_0}{m}\, esl_1^1 = \frac{N_0}{m}\, m \left(\frac{t_0}{N_0}\right)^{N_0/m} = N_0 \left(\frac{t_0}{N_0}\right)^{N_0/m}, \qquad (1)$$

$$esi_1^0 = \frac{N_0}{m}\, esl_1^0 = \frac{N_0}{m}\, m \left(1 - \frac{t_0}{N_0}\right)^{N_0/m} = N_0 \left(1 - \frac{t_0}{N_0}\right)^{N_0/m}. \qquad (2)$$
Assume that the number of UMIs that index nonzero elements is t1 after being hashed by h1. Then t1 equals t0 - esi_1^1 if esi_1^1/t0 is not less than TH; otherwise, t1 equals t0. So, the value of t1 can be estimated by

$$t_1 = \begin{cases} t_0 - N_0 \left(\dfrac{t_0}{N_0}\right)^{N_0/m} & \text{if } N_0 \left(\dfrac{t_0}{N_0}\right)^{N_0/m} \Big/\, t_0 \geq TH, \\[2ex] t_0 & \text{otherwise.} \end{cases} \qquad (3)$$

After being hashed by h1, the set of UMIs and the number of UMIs are denoted by S1 and N1, respectively. Here S1 is set to S0 - {SMIs after being hashed by h1}, and the value of N1 becomes N0 - N0(1 - t0/N0)^{N0/m} - (t0 - t1).
Sequentially, we apply h1 to S0, h2 to S1, h3 to S2, and so forth. Suppose h_i is the i-th random hash function selected from F_N^m. Then we obtain Eqs. (4)-(6). Using Eqs. (4)-(6), we can derive r0 and r for the given N, t, and m, where r0 is the minimal number of hash functions such that esi_{r0}^1/t0 ≥ TH, and r is the number of hash functions required so that N_r ≤ 0. Let r_avg denote the average number of hash functions applied to each key; r_avg represents the average execution time consumed by our method and can be computed by Eq. (7). The time efficiency of our method depends on r, r0, and r_avg.
$$t_i = \begin{cases} t_{i-1} - N_{i-1}\left(\dfrac{t_{i-1}}{N_{i-1}}\right)^{N_{i-1}/m} & \text{if } N_{i-1}\left(\dfrac{t_{i-1}}{N_{i-1}}\right)^{N_{i-1}/m} \Big/\, t_0 \geq TH \ \text{or} \ i \geq r_0, \\[2ex] t_{i-1} & \text{otherwise,} \end{cases} \qquad (4)$$

$$S_i = S_{i-1} - \{\text{all successful keys after being hashed by } h_i\}, \qquad (5)$$

$$N_i = N_{i-1} - N_{i-1}\left(1 - \frac{t_{i-1}}{N_{i-1}}\right)^{N_{i-1}/m} - (t_{i-1} - t_i), \qquad (6)$$

$$r_{avg} = \frac{1}{N}\left\{(r - r_0)\,(N_0 - N_{r_0}) + \sum_{i=r_0+1}^{r} i\,(N_{i-1} - N_i)\right\}. \qquad (7)$$
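As an illustration of how Eqs. (3)-(7) can be evaluated, the following Python sketch (ours) iterates the recurrences. Two points are assumptions on our part: TH is treated as a plain ratio (0.1 here), and the iteration stops once the expected number of remaining unsuccessful keys falls below one, which is how we read the condition N_r ≤ 0. The sketch is not claimed to reproduce the exact curves of Figs. 7-9.

```python
def expected_hash_count(N0, t0, m, TH=0.1):
    """Iterate Eqs. (3)-(6) and return (r, r0, r_avg) per Eq. (7)."""
    N, t = float(N0), float(t0)
    r = r0 = 0
    history = [N]                              # N_0, N_1, ..., N_r
    below_threshold = True                     # still within the leading r0 levels
    while N >= 1.0:
        r += 1
        d = t / N
        # clamped: the approximation can overshoot once N drops below m
        zero_smis = min(N - t, N * (1.0 - d) ** (N / m))
        nonzero_smis = min(t, N * d ** (N / m))
        if below_threshold and nonzero_smis / t0 < TH:
            r0 += 1                            # discard this level's nonzero SMLs
            t_next = t
        else:
            below_threshold = False
            t_next = t - nonzero_smis
        N = N - zero_smis - (t - t_next)
        t = t_next
        history.append(N)
    r_avg = ((r - r0) * (N0 - history[r0]) +
             sum(i * (history[i - 1] - history[i])
                 for i in range(r0 + 1, r + 1))) / N0
    return r, r0, r_avg

# e.g. N = 40000, d = 0.05 (t = 2000), m = 0.6 x t = 1200
print(expected_hash_count(40000, 2000, 1200))
```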
Let d denote the ratio between t and N of a matrix, namely the density (i.e., d = t/N). The performance of our method is related to d, N, m, and t. To show the efficiency of our method, we assume that N = 40000, m = 0.4 × t, 0.6 × t, 0.8 × t, and t, and the threshold value TH = 0.1 × t. We estimate r, r0, and r_avg numerically under these conditions; the results are shown in Figs. 7-9, respectively.

Fig. 7. The number of hash functions needed for compressing a matrix.
Fig. 8. The maximal number of hash functions that make the ratio (esi_{r0}^1/t) ≤ 0.1 × t.
Fig. 9. The average number of hash functions applied for each element in a matrix.

Observing Figs. 7 and 8, we find that the values of r and r0 increase when d decreases. This is because the value of esi_i^1 = N_{i-1}(t_{i-1}/N_{i-1})^{N_{i-1}/m} is affected by t_{i-1}/N_{i-1}. For a smaller d, the value of si_i^1 (i.e., esi_i^1 < 1) is near zero at the beginning of the compression process, where i = 1, 2, ..., r0. Hence, more random hash functions are required to accomplish the compression process. On the other hand, the values of r and r0 are also affected by m, especially when m = 0.4 × t. This means that, if the size of the HIT is too small, there can be few SMLs (i.e., the values of sl_i^1 and sl_i^0 are very small) after the application of a hash function.
Fig. 9 shows the average number of hash functions applied to an element of the matrix. The value of r_avg approaches a constant for m = 0.4 × t, 0.6 × t, 0.8 × t, and t. Also, the value of r_avg becomes larger as m becomes smaller. We can see that our method is efficient: for instance, only 5.7 and 3.7 hash functions on average are needed to decompress an element when m = 0.6 × t and 0.8 × t, respectively.
4.2. The analysis of the compression results
According to the results of Section 4.1, we now compare the compression results of our method with those of NM and ZM. In the following, we use a variable T to denote the total compressed space of a matrix. In our method, T is the storage space for the HIT, the ECT, r, and r0. Using Eqs. (4)-(6), we can compute r0 and r. The entries of an HIT are composed of an FIS, an AIS, and an OFF. The total length of the FIS plus the AIS equals 2 × (r - r0), and the length of the OFF is ⌈log2 t⌉. Thus, the storage space of the HIT is [2 × (r - r0) + ⌈log2 t⌉] × m bits. Assume that we need EL bits to store a nonzero element. Then we need t × EL bits to store the nonzero elements of M, and hence (t + 2) × EL bits to store the ECT, r, and r0. Thus, the compressed space of M in our method is [2 × (r - r0) + ⌈log2 t⌉] × m + (t + 2) × EL bits.
As for ZM, T_ZM equals the sum of the storage spaces of R, F, and L. Here the subscript of T denotes the compression method. R, F, and L require ||R|| integers, n1 × ⌈log2 ||R||⌉ bits, and ||R|| × ⌈log2 n1⌉ bits, respectively, where ||X|| denotes the length of X. In the case of NM, the addresses of the nonzero elements and their values must be stored; ⌈log2 N⌉ bits are required to describe an address if the size of the matrix is N. The total compressed spaces of M produced by our method, by ZM, and by NM are, respectively,

$$T = [2 (r - r_0) + \lceil \log_2 t \rceil]\, m + (t + 2)\, EL \ \text{bits}, \qquad (8)$$

$$T_{ZM} = \lVert R\rVert\, (EL + \lceil \log_2 n_1 \rceil) + n_1 \lceil \log_2 \lVert R\rVert \rceil \ \text{bits}, \qquad (9)$$

$$T_{NM} = (\lceil \log_2 N \rceil + EL)\, t \ \text{bits}. \qquad (10)$$

The compression performance is measured in terms of the compression ratio (CR), defined as

$$CR = \frac{\text{the original space of } M}{\text{the total compressed space of } M}. \qquad (11)$$
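For a concrete check (ours), the sketch below evaluates Eqs. (8)-(11) with EL = 16 bits, taking the original space of M as N × EL bits. The NM and ZM values for d = 0.01 on a 200 × 200 matrix agree with the first columns of Tables 5 and 4; the value for our method uses a hypothetical r - r0, since in practice r would come from the model of Section 4.1.

```python
import math

def cr_ours(N, t, m, r, r0, EL=16):
    T = (2 * (r - r0) + math.ceil(math.log2(t))) * m + (t + 2) * EL        # Eq. (8)
    return N * EL / T                                                       # Eq. (11)

def cr_zm(N, n1, R_len, EL=16):
    T = R_len * (EL + math.ceil(math.log2(n1))) + n1 * math.ceil(math.log2(R_len))  # Eq. (9)
    return N * EL / T

def cr_nm(N, t, EL=16):
    T = (math.ceil(math.log2(N)) + EL) * t                                  # Eq. (10)
    return N * EL / T

# d = 0.01 on a 200 x 200 matrix (N = 40000, t = 400):
print(round(cr_nm(40000, 400), 1))                       # 50.0, as in Table 5
print(round(cr_zm(40000, n1=200, R_len=480), 1))         # about 48.0 (||R|| = 1.2t), as in Table 4
print(round(cr_ours(40000, 400, m=160, r=7, r0=0), 1))   # with hypothetical r = 7, r0 = 0
```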
The greater the CR a method achieves, the better its compression performance. The compression ratios of our method, ZM, and NM for various matrix densities are compared in Tables 4 and 5. In the experiments, we assume that EL is 16 bits. The values of T_ZM are evaluated with ||R|| set to 1.2 × t, 1.6 × t, and 2 × t. From Table 4, we see that our method achieves better compression performance than ZM. In Table 5, the compression ratio of our method is higher than that of NM. Furthermore, the proposed method achieves better compression performance for low densities.

Our experiments seem to indicate that ZM performs a little better than our method in the case of ||R|| = 1.2 × t and m = 0.8 × t. As a matter of fact, this case almost never happens, because the compression results of ZM are affected by the efficiency of its merging operations and the distribution of the nonzero elements, so ||R|| will be larger than t. Moreover, the merging operations of ZM are time-consuming and not easily implemented.
Table 4
The compression ratios (CRs) of our method and ZM

Method                      Density (d)
                            0.01    0.05    0.1     0.15    0.2
ZM (||R|| = 1.2 × t)        48.0    10.7    5.4     3.6     2.7
ZM (||R|| = 1.6 × t)        36.9     8.1    4.1     2.7     2.1
ZM (||R|| = 2 × t)          30.2     6.5    3.3     2.2     1.7
Our method (m = 0.4 × t)    63.5    12.3    6.1     4.0     3.0
Our method (m = 0.6 × t)    58.4    11.2    5.7     3.6     2.7
Our method (m = 0.8 × t)    54.1    10.3    5.0     3.4     2.4
Table 5
The compression ratios (CRs) of our method and NM

Method                      Density (d)
                            0.01    0.03    0.05    0.07    0.09    0.11
NM                          50.0    16.7    10.0    7.1     5.6     4.5
Our method (m = 0.4 × t)    63.5    20.5    12.3    8.7     6.7     5.4
Our method (m = 0.6 × t)    58.4    19.5    11.2    7.8     6.1     4.9
Our method (m = 0.8 × t)    54.1    18.0    10.3    7.1     5.6     4.7
On the other hand, our method is simply based on the rehash technique. It can be implemented with ease, and its compression results are independent of the distribution of the nonzero elements.
5. Conclusions
In this paper, we have proposed a new method to compress sparse matrices based on the rehash method (RM). RM is an efficient method for compressing sparse binary-matrices, but it cannot be applied to general sparse matrices directly. The purpose of our method is thus to extend the technique of RM and to store any sparse matrix in small tables. According to our analysis, this extension is effective: it not only compresses sparse matrices, but also inherits all the benefits of RM. The number of random hash functions used in our method is always limited, as in RM. The execution time and the compression ratio of our method are therefore satisfactory. Moreover, our method can be applied to progressive transmission and data filtering.

Our method is, however, not suitable for the compression of sparse binary-matrices, since the OFF fields and the ECT would then contain too much redundancy. That is, it cannot be regarded as a generalized RM. To compress sparse binary-matrices, RM remains the proper method.
6. For further reading
[4]
References

[1] A.V. Aho, J.D. Ullman, Principles of Compiler Design, Addison-Wesley, Reading, MA, 1977.
[2] C.C. Chang, J.H. Jiang, T.S. Chen, Rehash method: a compression technique of sparse binary-matrices, Technical Report, Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan, 1996.
[3] C.C. Chang, T.C. Wu, A letter-oriented perfect hashing scheme based upon sparse table compression, Software - Practice and Experience 21 (1) (1991) 35-49.
[4] M.W. Du, T.M. Hsieh, K.F. Jea, D.W. Shieh, The study of a new perfect hash function, IEEE Transactions on Software Engineering SE-9 (3) (1983) 305-313.
[5] E. Horowitz, S. Sahni, Fundamentals of Data Structures, Computer Science Press, Potomac, MD, 1976.
[6] R.H. Ju, I.C. Jou, M.K. Tsay, Global study on data compression techniques for digital Chinese character patterns, IEEE Proceedings-E 139 (1) (1992) 1-8.
[7] D.E. Knuth, The Art of Computer Programming, vol. 1: Fundamental Algorithms, Addison-Wesley, Reading, MA, 1973.
[8] D.E. Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison-Wesley, Reading, MA, 1973.
[10] J.W. Miller, Random access from compressed data sets with perfect value hashing, in: Proceedings of the 1995 IEEE International Symposium on Information Theory, BC, Canada, 1995, p. 454.
[11] M. Nagao, Data compression of Chinese patterns, Proceedings of the IEEE 68 (1980) 818-829.
[12] A. Siegel, On universal classes of fast high performance hash functions, their time-space tradeoff, and their applications, in: Proceedings of the IEEE Annual Symposium on Foundations of Computer Science, NC, USA, 1989, pp. 20-25.
[13] R.E. Tarjan, A.C. Yao, Storing a sparse table, Communications of the ACM 22 (11) (1979) 606-611.
[14] L. Yang, B.K. Widjaja, R. Prasad, Application of hidden Markov models for signature verification, Pattern Recognition 28 (2) (1995) 161-170.
[15] A.C. Yao, Should tables be sorted?, Journal of the Association for Computing Machinery 28 (3) (1981) 615-628.
[16] S.F. Ziegler, Small faster table driven parser, Madison Academic Computing Center, University of Wisconsin, Madison, WI, 1977.
Ji-Han Jiang was born in Chiayi, Taiwan, Republic of China, on May 17, 1968. He received his B.S. degree in 1992 from Tatung Institute of Technology and his M.S. degree in 1994 from National Chung Cheng University, both in Information Engineering and Computer Science. He is currently a Ph.D. candidate in the Department of Computer Science and Information Engineering at National Chung Cheng University, Chiayi, Taiwan. His research interests include data engineering, data compression, and image compression.
Chin-Chen Chang was born in Taichung, Taiwan, Republic of China, on November 12, 1954. He received his B.S. degree in applied mathematics in 1977 and his M.S. degree in computer and decision sciences in 1979 from National Tsing Hua University, Hsinchu, Taiwan. He received his Ph.D. degree in computer engineering in 1982 from National Chiao Tung University, Hsinchu, Taiwan. During the academic years 1980-83, he was on the faculty of the Department of Computer Engineering at National Chiao Tung University. From 1983 to 1989, he was on the faculty of the Institute of Applied Mathematics, National Chung Hsing University, Taichung, Taiwan. From August 1989 to July 1992, he was the head and a professor of the Institute of Computer Science and Information Engineering at National Chung Cheng University, Chiayi, Taiwan. From August 1993 to July 1995, he was the Dean of the College of Engineering at National Chung Cheng University. Since August 1995, he has been the Dean of Academic Affairs at National Chung Cheng University, and since September 1996 he has also been the Acting President of the same university. In addition, he has served as a consultant to several research institutes and government departments. His current research interests include database design, computer cryptography, coding theory, and data structures.
Tung-Shou Chen was born in Taichung, Taiwan, Republic of China, on October 14, 1964. He received the B.S. and Ph.D. degrees from National Chiao Tung University in 1986 and 1992, respectively, both in Computer Science and Information Engineering. He served at the computer center of the Chinese Army Infantry School, Taiwan, from 1992 to 1994. During the academic years 1994-97, he was on the faculty of the Department of Information Management at National Chin-Yi Institute of Technology, Taichung, Taiwan. Since August 1998, he has been the Dean of Student Affairs and a professor of the Department of Computer Science and Information Management at Providence University. His current research interests include data structures, image cryptosystems, and image compression.