CHAPTER 3
QUICK TEXT RETRIEVAL ALGORITHM SUPPORTING
SYNONYMS BASED ON FUZZY LOGIC
Traditional Information Retrieval techniques are becoming inadequate for the increasingly vast
amounts of text data. In this chapter, we present a method of query processing that retrieves
not only the documents containing the query terms but also the documents containing their
synonyms. The method performs query processing by retrieving and scanning the inverted index
document lists. We show that the query response time for conjunctive Boolean queries can be
dramatically reduced, at a cost in disk storage, by applying the range partition feature of
Oracle to reduce the primary memory required for looking up the inverted lists. The proposed
method is based on fuzzy relations and fuzzy reasoning to retrieve only the top-ranking
documents from the database. Suffix tree clustering is used to group similar documents.
3.1 Inverted Index Files
A user accesses the Information Retrieval system by submitting a query; the Information
Retrieval system then tries to retrieve all documents that are "relevant" to the query. For
this, the documents contained in the archive are analyzed to provide a formal
representation of their contents through inverted indexing, where a surrogate describing
the document is stored in an index, while the document itself is stored in the collection or
archive. An index is a structure that is used to map from queryable entities to indexed
items. For example, in a database system an index is used to map from entities such as a
name or a bank account to records containing data about those entities.
To gain the speed benefits of indexing at retrieval time, we have to build the inverted
index in advance [96], [112]. The major steps to build the inverted index are:
1. Collect the documents to be indexed.
2. Tokenize the text, turning each document into a list of tokens.
3. Do linguistic preprocessing, producing a list of normalized tokens, treated as index
terms.
4. Create an inverted index, consisting of a dictionary and postings lists. The dictionary
contains the unique index terms appearing in the document corpus. The postings list
corresponding to each term in the dictionary contains the IDs of the documents containing
that term.
The earlier stages of processing, that is, steps 1-3, are already discussed in chapter 1,
section 1.3.2. In this section, we examine how to build a basic sorted inverted index.
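To make the construction concrete, the following is a minimal sketch of steps 2-4 in Python (the tokenization and normalization here are deliberately crude stand-ins for the linguistic preprocessing of section 1.3.2):

    from collections import defaultdict

    def build_inverted_index(docs):
        """docs: mapping docID -> text. Returns {term: [(docID, frequency), ...]}."""
        postings = defaultdict(dict)              # term -> {docID: within-document frequency}
        for doc_id in sorted(docs):               # docIDs are assigned in encounter order
            # steps 2-3: crude tokenization and normalization (placeholder for real preprocessing)
            tokens = [t.strip('.,!?').lower() for t in docs[doc_id].split()]
            for term in filter(None, tokens):
                postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
        # step 4: dictionary sorted alphabetically, postings sorted by docID
        return {term: sorted(freqs.items()) for term, freqs in sorted(postings.items())}

    index = build_inverted_index({1: "cat ate cheese", 2: "mouse ate cheese too"})
    # -> {'ate': [(1, 1), (2, 1)], 'cat': [(1, 1)], 'cheese': [(1, 1), (2, 1)],
    #     'mouse': [(2, 1)], 'too': [(2, 1)]}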
Each document in the document corpus is assigned a unique serial number, known as the
document identifier (docID). During index construction, successive integers are assigned
to each new document when it is first encountered. The input to indexing is a list of
normalized tokens for each document, containing list of pairs of term and docID. The core
indexing step is to arrange this pair list sorted alphabetically. Multiple occurrences of same
term from same document are merged to obtain a list of unique terms appearing in the
document. Instances of the same term from different documents are group together to
create a dictionary maintaining postings lists for each unique term in the dictionary. Since
a term generally occurs in a number of documents, this data organization reduces the
storage requirements of the index. The dictionary also records some statistics, such as the
number of documents which contain each term (the document frequency). The postings are
secondarily sorted by docID. This provides the basis for efficient query processing. This
inverted index structure is the most efficient structure for supporting ad hoc text search. In
the resulting index, the dictionary is commonly kept in memory, while postings lists are
normally kept on disk.
There are two alternatives available for maintaining the postings lists - Singly linked lists
or variable length arrays.
•   Singly linked lists allow cheap insertion of documents into postings lists (following
    updates, such as when crawling the web again for updated documents), and
    naturally extend to more advanced indexing strategies such as skip lists, which
    require additional pointers.
•   Variable length arrays win in space requirements by avoiding the overhead for
    pointers, and in time requirements because their use of contiguous memory
    increases speed on modern processors with memory caches. Extra pointers can in
    practice be encoded into the lists as offsets.
If updates are relatively infrequent, variable length arrays will be more compact and faster
to traverse. A hybrid scheme can also be used, with a linked list of fixed-length arrays for
each term. When postings lists are stored on disk, they are stored (perhaps compressed) as
each term. When postings lists are stored on disk, they are stored (perhaps compressed) as
a contiguous run of postings without explicit pointers, so as to minimize the size of the
postings list and the number of disk seeks to read a postings list into memory.
The concept of inverted index file is explained through an example given below. Consider
two documents of the document corpus – Doc 1 and Doc 2 having the text content shown
below:
Doc 1
Mahatma Gandhi's birthday is celebrated on 2 Oct. He is the father of India.
He is a freedom fighter.
Doc 2
Indira Gandhi is the mother of India.
She is a politician of great personality.
Given below are the steps followed to create the inverted index file for these two documents.
1. Collect the documents to be indexed (Doc 1 and Doc 2 above).
2. Tokenize the text, turning each document into a list of tokens (Mahatma, Gandhi's, birthday, is, ...).
3. Do linguistic preprocessing, producing a list of normalized tokens treated as index terms (Mahatma, Gandhi, celebrate, birthday, ...).
Figure 3.1: The process of generating normalized tokens
4. Build an index by sorting and grouping. A fragment of the resulting index is shown in Figure 3.2: the dictionary stores each index term with its document frequency, and each postings entry gives a (doc no., freq.) pair.

    Index term (freq.)    Postings (doc no., freq.)
    Mahatma (1)       ->  (1, 1)
    Gandhi  (2)       ->  (1, 1), (2, 1)
    India   (2)       ->  (1, 1), (2, 1)
    ...

Figure 3.2: Inverted index file (dictionary and postings)
Inverted index file has two parts - dictionary and postings lists. Dictionary stores the
terms, and has a pointer to the postings list for each term. The dictionary is commonly kept
in memory, with pointers to each postings list, which is stored on disk. The dictionary commonly also
stores other summary information, such as the document frequency of each term. Each
posting list stores the list of documents in which a term occurs, and may store other
information such as term frequency of each term in the document.
The principal costs of accessing the inverted lists during retrieval are:
•   the space needed in Random Access Memory to maintain the inverted lists,
•   the time required to process the inverted lists, since they maintain information about a
    large number of documents that are potential answers, and
•   the disk accesses: a query with many terms requires more accesses into the inverted file,
    and more time must be spent merging the lists.
To handle the problems listed above, compression techniques are used to reduce the file sizes
considerably [150]. This in turn reduces the required storage size and transfer channel
capacity. Especially in systems where memory is at a premium, compression can make the
difference between an infeasible system and an implementable one. Without compression, if
four bytes are allocated for storing the document number and two bytes for the document
frequency, then six bytes are needed for each (document, document-frequency) pair. Using
compression, the space required can be reduced to about one byte per pair. In section 3.2,
we discuss a few compression algorithms, especially for compressing integer values.
The goal of an Information Retrieval System (IRS) is to retrieve information considered
pertinent to a user's query. However, the nature of this goal is not deterministic, since
uncertainty and vagueness are present in many different parts of the retrieval process. The
user's expression of his/her information need in a query is uncertain and often vague, the
representation of a document's informative content is uncertain, and so is the process by
which a query representation is matched to a document representation. The effectiveness
of an IRS is therefore crucially related to the system's capability to deal with the
vagueness and uncertainty of the retrieval process. Commercially available IRSs generally
ignore these aspects; they oversimplify both the representation of the documents' content
and the user-system interaction. In section 3.3, we study techniques for dealing with this
vagueness and uncertainty. For that, we discuss the approach to Soft Information
Retrieval, in particular the approach that makes use of Fuzzy Set Theory.
Since document retrieval systems return long lists of ranked documents that users are
forced to sift through to find relevant documents, clustering can be applied to group the
set of ranked documents returned in response to a query. In section 3.4, we discuss the
Suffix tree clustering algorithm for grouping the ranked documents.
3.2 Inverted File Compression Techniques
Dictionary and the inverted index are the central data structures in Information Retrieval.
A number of compression techniques for dictionary and inverted index are employed that
are essential for efficient IR systems. There are two more subtle benefits of compression:
1) Efficient and maximum utilization of cache
Search systems use some parts of the dictionary and the index much more than others. For
example, if the postings list of a frequently used query term t is cached in memory, then
the computations necessary for responding to the one-term query t can be entirely done in
memory. With compression, much more information can be fitted into main memory.
Instead of having to expend a disk seek when processing a query containing t, its postings
list can be accessed in memory and decompressed. There are simple and efficient
decompression methods, so the penalty of having to decompress the postings list is
small. As a result, the response time of the IR system can be decreased substantially.
2) Faster transfer of data from disk to memory
Efficient decompression algorithms run so fast on modern hardware that the total time of
transferring a compressed chunk of data from disk and then decompressing it is usually
less than transferring the same chunk of data in uncompressed form. For instance, it can
reduce input/output (I/O) time by loading a much smaller compressed postings list, even
when the cost of decompression is added. So, in most cases, the retrieval system runs
faster on compressed postings lists than on uncompressed postings lists.
Basically compression algorithms can be crudely divided into four groups: block-to-block
codes, block-to-variable codes, variable-to-block codes, and variable-to-variable codes
[103].
a) Block-to-block codes - These codes take a specific number of bits at a time from
the input and emit a specific number of bits as a result. If all of the symbols in the
input alphabet (in the case of bytes, all values from 0 to 255) are used, the output
alphabet must be the same size as the input alphabet, i.e. uses the same number of
bits. Otherwise it could not represent all arbitrary messages. Obviously, this kind of
code does not give any compression, but it allows a transformation to be performed
on the data, which may make the data more easily compressible, or which separates
out the essential information for lossy compression. For example, the discrete cosine
transform (DCT) belongs to this group. It does not really compress anything, as it
takes in a matrix of values and produces a matrix of equal size as output, but the
resulting values hold the information in a more compact form.
b) Block-to-variable codes - They use a variable number of output bits for each input
symbol. All statistical data compression systems, such as symbol ranking, Huffman
coding, Shannon-Fano coding, and arithmetic coding belong to this group. The
idea is to assign shorter codes for symbols that occur often, and longer codes for
symbols that occur rarely. This provides a reduction in the average code length,
and thus lossless compression. But Block-to-variable codes have drawbacks. The
first drawback is the amount of memory needed to store the probability tables: the
frequency of each character encountered must be accounted for, and in the case of
Huffman encoding, the Huffman tree needs to be recreated. The encoding and
decoding themselves also take time.
c) Variable-to-block codes - They use a fixed-length output code to represent a
variable-length part of the input. Variable-to-block code methods are also called
free-parse ones, because there is no predefined way to divide the input message
into encodable parts (i.e. strings that will be replaced by shorter codes).
Substitutional compressors belong to this group. Substitutional compressors work
by trying to replace strings in the input data with shorter codes. Lempel-Ziv
methods contain two main groups: LZ77 and LZ78. In 1977 Ziv and Lempel
proposed a lossless compression method LZ77 that replaces phrases in the data
stream by a reference to a previous occurrence of the phrase. LZ77-type
compressors use a history buffer, which contains a fixed amount of symbols
output/seen so far. The compressor reads symbols from the input into a look-ahead
buffer and tries to find the longest possible match in the history buffer. The
length of the matched string and its location in the buffer (offset from the current
position) are written to the output. If there is no suitable match, the next input
symbol is sent as a literal symbol. The basic scheme is a variable-to-block code. A
variable-length piece of the message is represented by a constant amount of bits:
the match length and the match offset. Run length encoding also belongs to this
group. To represent each inverted list, the series of differences between successive
document numbers is stored as a list of run-lengths or d-gaps. Rather than compressing
the series of item numbers in an inverted file entry directly, it is convenient to compress
this run-length encoding, that is, the series of differences between successive numbers.
For example, consider an inverted file entry storing the document numbers
(4, 5, 8, 10, 11, 17, ...). Using the run-length encoding, it actually stores
(4, 1, 3, 2, 1, 6, ...). This does not in itself yield any compression, but it does
expose patterns that can be exploited for compression purposes [96], [149]. One
consequence of this representation is that small gaps are common, since frequent
words must of necessity give rise to many small gaps. Hence, a variable-length
encoding of the integers in which small values are stored more succinctly than long
values can achieve a more economical overall representation than the more usual
flat binary encoding. (A small sketch of d-gap encoding is given after this list.)
d) Variable-to-variable codes - The compression algorithms in this category are
mostly hybrids or concatenations of the previously described compressors.
Compression is effective in direct proportion to the extent that it eliminates
obvious patterns in the data, so if the first compression step is any good, it will
leave little for the second step to exploit. Combining multiple compression methods
is only helpful when the methods are specifically chosen to be complementary.
Randomly concatenating algorithms rarely produces good results.
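As a small illustration of the d-gap representation described under the variable-to-block codes above, the following Python sketch converts a sorted postings list into gaps and back:

    def to_dgaps(doc_ids):
        """Convert a sorted list of document numbers into d-gaps (the first value is kept as-is)."""
        if not doc_ids:
            return []
        return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def from_dgaps(gaps):
        """Recover the original document numbers by taking a running sum of the gaps."""
        doc_ids, total = [], 0
        for gap in gaps:
            total += gap
            doc_ids.append(total)
        return doc_ids

    assert to_dgaps([4, 5, 8, 10, 11, 17]) == [4, 1, 3, 2, 1, 6]
    assert from_dgaps([4, 1, 3, 2, 1, 6]) == [4, 5, 8, 10, 11, 17]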
Many compression algorithms need to represent integer values at some point. Next, we
discuss a few compression algorithms for integers. The selection among these encoding methods
depends on the distribution and the range of the values.
3.2.1 Compression algorithms for Integers
A number of compression algorithms for integers [44] are discussed below.
a) Fixed, Linear - If the values are evenly distributed throughout the whole range, a
direct binary representation is the optimal choice. The number of bits needed of
course depends on the range. If the range is not a power of two, some tweaking can
be done to the code to get nearer the theoretical optimum log2 (range) bits per value.
Table 3.1: Fixed linear prefix-adjusted codes with minimum average number of bits
(H = 2.585, L = 2.666 for a flat distribution)

    Value   Binary   Adjusted 1   Adjusted 2
    0       000      00           000
    1       001      01           001
    2       010      100          010
    3       011      101          011
    4       100      110          10
    5       101      111          11
Table 3.1 shows two different versions of how the adjustment could be done for a
code that has to represent 6 different values with the minimum average number of
bits. As can be seen, they are still both prefix codes, i.e. it is possible to (easily)
decode them.
If there is no definite upper limit to the integer value, direct binary code cannot be
used and one of the following codes must be selected.
b) Elias Gamma Code - The Elias gamma code [51] assumes that smaller integer
values are more probable. In fact it assumes (or benefits from) a proportionally
decreasing distribution. Values that use n bits should be twice as probable as values
that use n+1 bits. The γ code maps an integer x onto the binary representation of x prefaced
by floor(log2 x) zeros. The binary value of x is expressed in as few bits as possible,
and therefore begins with a 1, which serves to delimit the prefix. The result is an
instantaneously decodable code, since the total length of a codeword is exactly one
greater than twice the number of zeros in the prefix. Therefore, as soon as the first
1 of a codeword is encountered, its length is known. An integer of N significant
bits is represented in 2N-1 bits; equivalently, an integer n is represented by
2*floor(log2 n)+1 bits, so we can say that an Elias gamma code represents an integer n
in about 2 log2 n + 1 bits. (A small encoder/decoder sketch in Python follows Table 3.2.)
Table 3.2: Elias Gamma Code Representation

    Gamma code        Integer    Bits
    1                 1          1
    01x               2-3        3
    001xx             4-7        5
    0001xxx           8-15       7
    00001xxxx         16-31      9
    000001xxxxx       32-63      11
    0000001xxxxxx     64-127     13
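The following is a minimal encoder/decoder sketch for the Elias gamma code; bits are represented as Python strings for readability, whereas a real implementation would pack them:

    def gamma_encode(n):
        """Elias gamma code of n >= 1: floor(log2 n) zeros followed by the binary form of n."""
        binary = bin(n)[2:]                      # binary representation, always starts with 1
        return '0' * (len(binary) - 1) + binary

    def gamma_decode(bits, pos=0):
        """Decode one gamma-coded integer starting at bit position pos; return (value, next position)."""
        zeros = 0
        while bits[pos] == '0':                  # count the prefix zeros
            zeros += 1
            pos += 1
        value = int(bits[pos:pos + zeros + 1], 2)
        return value, pos + zeros + 1

    assert gamma_encode(1) == '1'                # 1 bit, as in Table 3.2
    assert gamma_encode(5) == '00101'            # 5 bits for the range 4-7
    assert gamma_decode(gamma_encode(13))[0] == 13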
c) Elias Delta Code - The Elias Delta Code is an extension of the gamma code. This
code assumes a little more traditional value distribution. The first part of the code
is a gamma code, which tells how many more bits to get (one less than the gamma
code value).
Table 3.3: Elias Delta Code Representation

    Delta code        Integer    Bits
    1                 1          1
    010x              2-3        4
    011xx             4-7        5
    00100xxx          8-15       8
    00101xxxx         16-31      9
    00110xxxxx        32-63      10
    00111xxxxxx       64-127     11
The delta code is better than the gamma code for big values, as it is asymptotically
optimal (the expected codeword length approaches a constant times the entropy as the
entropy approaches infinity), which the gamma code is not. What this means is that
the extra bits needed to indicate where the code ends become a smaller and smaller
proportion of the total bits as we encode bigger and bigger numbers. The gamma
code is better for greatly skewed value distributions (a lot of small values).
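A sketch of the Elias delta encoder, built on the gamma encoder above: the gamma part encodes the number of significant bits of n, and the remaining bits of n follow without their leading 1.

    def delta_encode(n):
        """Elias delta code of n >= 1: gamma-code the bit length of n, then append n without its leading 1."""
        binary = bin(n)[2:]
        return gamma_encode(len(binary)) + binary[1:]

    assert delta_encode(1) == '1'                # 1 bit, as in Table 3.3
    assert delta_encode(5) == '01101'            # 5 bits for the range 4-7
    assert delta_encode(16) == '001010000'       # 9 bits for the range 16-31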
d) Fibonacci Code - The Fibonacci code is another variable length code where smaller
integers get shorter codes. The code ends with two one-bits, and the value is the
sum of the corresponding Fibonacci values for the bits that are set (except the last
one-bit, which ends the code).
Table 3.4: Fibonacci Code Representation
(the bit positions correspond to the Fibonacci weights 1, 2, 3, 5, 8, 13, 21, 34, 55, 89; the final (1) terminates the code)

    1 (1)               = 1
    0 1 (1)             = 2
    0 0 1 (1)           = 3
    1 0 1 (1)           = 4
    0 0 0 1 (1)         = 5
    1 0 0 1 (1)         = 6
    0 1 0 1 (1)         = 7
    0 0 0 0 1 (1)       = 8
    ...
    1 0 1 0 1 (1)       = 12
    0 0 0 0 0 1 (1)     = 13
    ...
    0 1 0 1 0 1 (1)     = 20
    0 0 0 0 0 0 1 (1)   = 21
    ...
    1 0 0 1 0 0 1 (1)   = 27
Note that because the code does not have two successive one-bits until the end
mark, the code density may seem quite poor compared to the other codes, and it is,
if most of the values are small, (1-3). On the other hand, it also makes the code
very robust by localizing and containing possible errors. Although if the Fibonacci
code is used as a part of a larger system, this robustness may not help much as it
loses the synchronization in the upper level anyway. Most adaptive methods
cannot recover from any errors, whether they are detected or not. Even in LZ77 the
errors can be inherited infinitely far into the future.
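A sketch of a Fibonacci encoder along the lines of Table 3.4: the codeword marks which Fibonacci weights (1, 2, 3, 5, 8, ...) sum to n, written from the smallest weight upward and terminated by an extra 1-bit.

    def fibonacci_encode(n):
        """Fibonacci code of n >= 1: Zeckendorf bits from weight 1 upward, plus a closing 1-bit."""
        fibs = [1, 2]
        while fibs[-1] <= n:
            fibs.append(fibs[-1] + fibs[-2])
        bits = ['0'] * (len(fibs) - 1)
        remainder = n
        for i in range(len(fibs) - 2, -1, -1):   # greedily pick weights, largest first
            if fibs[i] <= remainder:
                bits[i] = '1'
                remainder -= fibs[i]
        return ''.join(bits) + '1'               # the largest weight is always used, so the code ends in "11"

    assert fibonacci_encode(4) == '1011'         # 1 + 3, then the closing 1 (row "1 0 1 (1)" in Table 3.4)
    assert fibonacci_encode(27) == '10010011'    # 1 + 5 + 21, then the closing 1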
Table 3.5: Comparison between delta, gamma and Fibonacci code lengths

    Number(s)    Gamma    Delta    Fibonacci
    1            1        1        2.0
    2-3          3        4        3.5
    4-7          5        5        4.8
    8-15         7        8        6.4
    16-31        9        9        7.9
    32-63        11       10       9.2
    64-127       13       11       10.6
The comparison shows that if even half of the values are in the range 1..7 (and
other values relatively near this range), the Elias gamma code wins by a handsome
margin. The best part is that the Gamma code is much simpler to decode and does
not need additional memory.
3.3 Fuzzy Information Retrieval
The term fuzzy Information Retrieval refers to methods of Information Retrieval that are
based upon the theory of fuzzy sets. These methods are increasingly recognized as more
realistic than the various classical methods of Information Retrieval [67], [81].
The problem of Information Retrieval involves two finite crisp sets: a set of recognized
index terms, X = {x1, x2, ..., xn}, and a set of relevant documents, Y = {y1, y2, ..., ym}.    (3.1)
Although these sets change whenever new documents are added to systems or new index
terms are recognized (or, possibly, when some documents or index terms are discarded),
they are fixed for each particular inquiry.
In fuzzy Information Retrieval, a fuzzy relation expresses the relevance of index terms to
individual documents, R: X × Y → [0, 1], such that the membership value R(xi, yj) specifies
for each xi ∈ X and each yj ∈ Y the grade of relevance of index term xi to document yj. One
way of determining the grades objectively is to define them in an appropriate way in terms
of the frequency of occurrence of the individual index term in the document involved relative
to the size of the document.
Another important relation in fuzzy Information Retrieval is called a fuzzy thesaurus T. It
is composed of a number of semantic relations (equivalence E, inclusion I, association A)
that cover every possible semantic entity. T is basically a reflexive fuzzy relation defined
on X². For each pair of index terms (xi, xk) ∈ X², T(xi, xk) expresses the degree of
association of xi with xk; that is, the degree to which the meaning of index term xk is
compatible with the meaning of the given index term xi. By using the thesaurus, the user
query can be expanded to contain all the associated semantic entities. The expanded query
is expected to retrieve more relevant documents, because of the higher probability that the
annotator has included one of the associated entities in the description. Use of the
equivalence relation E, which is similar to the MPEG-7 equivalentTo and identifiedWith
relations, expands the query to contain terms that are, in one sense or another, synonyms [2],
[81].
For example, the E relation of the thesaurus, constructed with the EquivalentTo and
IdentifiedWith relations:

Re = EquivalentTo ∪ IdentifiedWith =
    {(soccerGame - object, soccerGame - event, 1),
     (football - event, soccerGame - event, 0.8),
     (football - event, AmerFootball - event, 0.8)}
Now, let A denote the fuzzy set representing a particular inquiry. Then, by composing A
with the fuzzy thesaurus T, a new fuzzy set on X (say, set B) is obtained, which represents
an augmented inquiry (i.e., augmented by associated index terms). That is,

A º T = B,    (3.2)

where º is usually understood to be the max-min composition, so that

B(xj) = max_{xi ∈ X} min[A(xi), T(xi, xj)]  for all xj ∈ X.    (3.3)

The retrieved list of documents, expressed by a fuzzy set D defined on Y, is then obtained
by composing the augmented inquiry, expressed by fuzzy set B, with the relevance relation
R. That is,

B º R = D,    (3.4)

where º is again the max-min composition, so that

D(yj) = max_{xi ∈ X} min[B(xi), R(xi, yj)]  for all yj ∈ Y.    (3.5)
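Equations (3.3) and (3.5) are instances of the same max-min composition, which can be written directly. The sketch below assumes fuzzy sets are held as {element: grade} dictionaries and fuzzy relations as {(x, y): grade} dictionaries; the element names are illustrative.

    def max_min_composition(fuzzy_set, relation):
        """Max-min composition: result(y) = max over x of min(fuzzy_set(x), relation(x, y))."""
        result = {}
        for (x, y), grade in relation.items():
            candidate = min(fuzzy_set.get(x, 0.0), grade)
            result[y] = max(result.get(y, 0.0), candidate)
        return result

    # augmented inquiry B = A o T; the documents D = B o R are computed the same way
    A = {'x1': 0.6, 'x2': 0.8, 'x3': 1.0}
    T = {('x1', 'x1'): 1.0, ('x2', 'x2'): 1.0, ('x2', 'x4'): 0.9, ('x3', 'x3'): 1.0}
    B = max_min_composition(A, T)                # {'x1': 0.6, 'x2': 0.8, 'x4': 0.8, 'x3': 1.0}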
The document retrieval systems described above return long lists of ranked documents that users
are forced to sift through to find relevant documents. The majority of today's Web search
engines (e.g., Excite, AltaVista) follow this paradigm. Web search engines are also
characterized by extremely low precision. The low precision of the Web search engines,
coupled with the ranked list presentation, makes it hard for users to find the information
they are looking for. Therefore, clustering has to be applied to group the set of ranked
documents returned in response to a query. In the following section 3.4, the Suffix tree
clustering algorithm is discussed for grouping the ranked documents.
3.4 Algorithm of Suffix Tree Clustering (STC)
Suffix Tree Clustering [90], [147] is a novel, incremental, O(n) time algorithm designed to
meet the requirements of post-retrieval clustering of web search results. STC does not treat
a document as a set of words but rather as a string, making use of proximity information
between words. STC relies on a suffix tree to efficiently identify sets of documents that
share common phrases and uses this information to create clusters and to succinctly
summarize their contents for users.
STC has three logical steps: (a) document "cleaning", (b) identifying base clusters using a
suffix tree, and (c) combining these base clusters into clusters.
a) Document "Cleaning" Step
In this step, the string of text representing each document is transformed using a
light-stemming algorithm (deleting word prefixes and suffixes and reducing plural
to singular). Sentence boundaries (identified via punctuation and HTML tags) are
marked and non-word tokens (such as numbers, HTML tags and most punctuation)
are stripped. The original document strings are kept, as well as pointers from the
beginning of each word in the transformed string to its position in the original
string. This enables the system to identify key phrases in the transformed string and to display
the original text for enhanced user readability.
b) Identifying Base Clusters Step
A base cluster is defined as a set of documents that share a common phrase. A
phrase is an ordered sequence of one or more words. The identification of base
clusters can be viewed as the creation of an inverted index of phrases for our
document collection. This is done efficiently using a data structure called a suffix
tree. This structure can be constructed in time linear with the size of the collection,
and can be constructed incrementally as the documents are being read.
A suffix tree of a string S is a compact trie containing all the suffixes of S.
Documents are treated as strings of words, not characters, thus suffixes contain
one or more whole words. In more precise terms:
i.   A suffix tree is a rooted, directed tree.
ii.  Each internal node has at least 2 children.
iii. Each edge is labeled with a non-empty sub-string of S (hence it is a trie). The
     label of a node is defined to be the concatenation of the edge-labels on the path
     from the root to that node.
iv.  No two edges out of the same node can have edge-labels that begin with the
     same word (hence it is compact).
v.   For each suffix s of S, there exists a suffix-node whose label equals s.
Each suffix-node is marked to designate from which string (or strings) it originated
(i.e., the label of that suffix node is a suffix of that string). In this application, the
suffix tree is constructed from all the sentences of all the documents in the
collection.
Table 3.6: Six nodes and their corresponding base clusters from the example shown in Figure 3.3

    Node    Phrase        Documents
    a       cat ate       1, 3
    b       ate           1, 2, 3
    c       cheese        1, 2
    d       mouse         2, 3
    e       too           2, 3
    f       ate cheese    1, 2
Each node represents a group of documents and their common phrase. Each base
cluster is assigned a score that is a function of the number of documents it
contains and the words that make up its phrase.
The score s(B) of base cluster B with phrase P is given by s(B) = |B| · f(|P|),
where |B| is the number of documents in base cluster B, and |P| is the number of
words in P that have a non-zero score (i.e., the effective length of the phrase).
Words appearing in the stoplist, or that appear in too few (3 or fewer) or too many
(more than 40% of the collection) documents, receive a score of zero. The function
f penalizes single-word phrases, is linear for phrases that are two to six words long,
and becomes constant for longer phrases.
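A hedged sketch of this scoring: the stoplist handling, the document-frequency bounds and the general shape of f follow the text, but the exact numeric penalty for single-word phrases is an assumption.

    def word_score(word, doc_freq, n_docs, stoplist):
        """A word scores 0 if it is a stop word, appears in 3 or fewer documents,
        or appears in more than 40% of the collection; otherwise it scores 1."""
        if word in stoplist or doc_freq <= 3 or doc_freq > 0.4 * n_docs:
            return 0
        return 1

    def base_cluster_score(num_docs, effective_phrase_len):
        """s(B) = |B| * f(|P|): f penalizes single-word phrases, is linear for
        phrases of two to six words, and becomes constant for longer phrases."""
        if effective_phrase_len <= 1:
            f = 0.5                              # assumed penalty for single-word phrases
        elif effective_phrase_len <= 6:
            f = float(effective_phrase_len)      # linear region
        else:
            f = 6.0                              # constant for longer phrases
        return num_docs * f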
Figure 3.3: The suffix tree of the strings "cat ate cheese", "mouse ate cheese too" and
"cat ate mouse too" (internal nodes a-f correspond to the base clusters of Table 3.6)
c) Combining Base Clusters Step
Documents may share more than one phrase. As a result, the document sets of
distinct base clusters may overlap and may even be identical. To avoid the
proliferation of nearly identical clusters, the third step of the algorithm merges
base clusters with a high overlap in their document sets (phrases are not considered
in this step). A binary similarity measure is defined between base clusters based on
the overlap of their document sets. Given two base clusters Bm and Bn, with sizes
|Bm| and |Bn| respectively, and |Bm ∩ Bn| representing the number of documents
common to both base clusters, we define the similarity of Bm and Bn to be 1 iff

|Bm ∩ Bn| / |Bm| > α  and  |Bm ∩ Bn| / |Bn| > α.

Otherwise, their similarity is defined to be 0.
Next, consider the base cluster graph, where nodes are base clusters, and two nodes
are connected if the two base clusters have a similarity of 1. A cluster is defined as
being a connected component in the base cluster graph. Each cluster contains the
union of the documents of all its base clusters.
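The merge step can be sketched as follows: base clusters are similar (similarity 1) when both overlap ratios exceed α, and the final clusters are the connected components of the resulting graph (α = 0.6 is used here as an example value).

    def similar(docs_m, docs_n, alpha=0.6):
        """Similarity is 1 iff the overlap exceeds alpha relative to both cluster sizes."""
        overlap = len(docs_m & docs_n)
        return overlap / len(docs_m) > alpha and overlap / len(docs_n) > alpha

    def merge_base_clusters(base_clusters, alpha=0.6):
        """base_clusters: {node: set(doc_ids)}. Returns the final clusters as lists of nodes
        (connected components of the base-cluster graph)."""
        nodes = list(base_clusters)
        parent = {n: n for n in nodes}                 # union-find over base clusters
        def find(n):
            while parent[n] != n:
                parent[n] = parent[parent[n]]
                n = parent[n]
            return n
        for i, m in enumerate(nodes):
            for n in nodes[i + 1:]:
                if similar(base_clusters[m], base_clusters[n], alpha):
                    parent[find(m)] = find(n)          # connect similar base clusters
        clusters = {}
        for n in nodes:
            clusters.setdefault(find(n), []).append(n)
        return list(clusters.values())

    # the example of Table 3.6: one connected component, hence one cluster
    base = {'a': {1, 3}, 'b': {1, 2, 3}, 'c': {1, 2}, 'd': {2, 3}, 'e': {2, 3}, 'f': {1, 2}}
    print(merge_base_clusters(base))                   # -> [['a', 'b', 'c', 'd', 'e', 'f']]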
Table 3.7: Nodes a and b from Table 3.6 with their corresponding base clusters, used to find the similarity between them

          Node    Phrase     Documents
    Bm:   a       cat ate    1, 3
    Bn:   b       ate        1, 2, 3

    |Bm| = 2, |Bn| = 3
    |Bm ∩ Bn| / |Bm| = 2/2 = 1
    |Bm ∩ Bn| / |Bn| = 2/3 = 0.67
    Similarity = 1
Figure 3.4: The base cluster graph of the example given in Figure 3.3 and in Table 3.6, with α = 0.6.
(In this example there is one connected component, therefore one cluster.)
The STC algorithm is incremental. As each document arrives from the Web, it is
"cleaned" and added to the suffix tree. Each node that is updated (or created) as a
result of this is tagged. The relevant base clusters are then updated, and the
similarity of these base clusters to the rest of the base clusters is recalculated.
Finally, it is checked whether the changes in the base cluster graph result in any
changes to the final clusters.
The goal of a clustering algorithm in this domain is to group each document with
others sharing a common topic, but not necessarily to partition the collection. The
STC algorithm does not require the user to specify the required number of clusters.
It does, on the other hand, require the specification of the threshold used to
determine the similarity between base clusters. However, it has been found that the
performance of STC is not very sensitive to this threshold, unlike Agglomerative
Hierarchical Clustering (AHC) algorithms, which show extreme sensitivity to the
number of clusters required.
3.5 Proposed Method
Based on the approach of Moffat and Zobel [96], we propose a retrieval system that is both
quick in operation and economical in disk storage space. The proposed method stores the
lexicon using the range partition feature of Oracle, which allows the lexicon to be accessed
from disk in blocks, by defining a range partition on the first letter of each lexicon word.
To retrieve the words of the lexicon that match a query term, only the partition
corresponding to the first letter of that query term is brought into main memory from disk,
and the words (along with their synonyms) in the partition are identified via the index
defined on the lexicon database. Using this technique, the entire lexicon database is not
required to be in memory during the query search. Only very few, small disk accesses are
required (at most equal to the number of query terms, if all query terms have distinct first
letters) to obtain the relevant information from the stored lexicon database. The proposed
method associates the semantic entities (concepts, objects, events) with the aid of a fuzzy
thesaurus. It is based on fuzzy relations and fuzzy reasoning.
3.5.1 Document Database Structure
Here, we discuss the structure of the inverted index files used by our methodology. These
inverted index files are maintained as tables in an Oracle database.
The first table is the term_table, which has the following three fields: the first field stores
the beginning letter of each word; the second field stores the word itself; the third field
maintains the posting_list of all the synonyms of the word specified in the second field,
along with their fuzzy thesaurus association values. The table is sorted on the second field
and an index is defined on it. A range partition is defined on the first field.
Figure 3.5: Compressed inverted index file structure showing range partitioning: a compressed,
range-partitioned dictionary of words with a posting list containing synonyms, a compressed,
range-partitioned vocabulary of words with a posting list containing document_ids, and the
text document collection stored on disk
The second table, named word_document_table, has three fields: the first field stores the
beginning letter of the word (the table is range partitioned on it); the second field stores
the word_id (the table is sorted and indexed on it); the third field maintains the
posting_list referring to the document_ids of all the documents containing the respective
word. Along with each document_id, it also records the frequency of occurrence of the word
in that document relative to the document size. The Elias γ code is used to represent the
document_ids and word_ids in the table. This table is also maintained on disk.
The third table is the document_table, with two fields: the first field stores the
document_ids (sorted and indexed on document_id); the second field is an address field
holding the location of the document on disk, obtained from the File Allocation Table. This
table is brought into main memory to look up document addresses when documents are referenced.
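To make the three structures concrete, the following is an illustrative in-memory mock-up in Python. The field layout follows the text, but the synonym grades, frequencies and file paths are made-up example values; the second-level keys stand in for the indexed word/word_id, and the actual Oracle DDL and partition definitions are omitted.

    # term_table: range-partitioned on the first letter of each word; the posting_list
    # holds the word's synonyms with their fuzzy thesaurus association values
    term_table = {
        'd': {'database':   [('databank', 1.0), ('data', 0.9)]},
        'r': {'relational': [('relationship', 0.9), ('narrative', 0.1)]},
    }

    # word_document_table: range-partitioned on the first letter; each posting is a
    # (document_id, relative frequency) pair; on disk the ids would be Elias gamma coded
    word_document_table = {
        'd': {'database':   [(1, 0.7), (2, 0.3), (3, 1.0)]},
        'r': {'relational': [(1, 0.2), (2, 1.0), (3, 0.3)]},
    }

    # document_table: document_id -> location of the document on disk
    document_table = {1: '/docs/y1.txt', 2: '/docs/y2.txt', 3: '/docs/y3.txt'}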
3.5.2 Quick text retrieval algorithm
In this section, we describe the retrieval algorithm based on the structure outlined in
section 3.5.1; the retrieved ranked documents are then grouped by clustering. The steps of
the algorithm are given below, followed by a minimal sketch.
1. Consider a query Q. Ignore all the stop words from the search query. The final query is
represented by the set A.
2. For each query term t,
a. if the target partition of the term_table is not in main memory, then bring it into
memory from disk
b. search the partition term_table to locate t
c. record the posting_list (synonyms with their fuzzy thesaurus association)
for term t, in the final list X.
3. Identify the fuzzy thesaurus T for each pair of index terms (xi, xk) ∈ X²,
where T(xi, xk) expresses the degree of association of xi with xk.
4. Compute B ← A º T
5. For each term t ∈ X
a. if the target partition of the word_document_table is not in main memory, then bring it
into memory from disk
b. Search the partition word_document table to get word_id for term t
c. Record the posting_list (for each term t, get Elias γ codes representing
document_id along with their term_frequency value)
d. Decompress to get the actual document_ids
e. Record the document_ids to the final relation R along with their
term_frequency
6. Compute D ← B º R
7. Inspect only those document_ids captured by some α-cut of D
8. Bring the document_table into memory
9. For each d ∈ D
a. look up the address of document d in document_table
b. retrieve document d
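A minimal end-to-end sketch of steps 1-9, reusing the mock tables from section 3.5.1 and the max_min_composition helper from section 3.3; partition loading, Elias decoding and the document fetch are reduced to dictionary look-ups here, so this only illustrates the control flow.

    def quick_retrieve(query_terms, A, alpha_cut=0.5):
        """query_terms: stop-word-free query terms; A: {term: grade} for the inquiry."""
        # steps 2-3: look up each term's partition and collect thesaurus associations
        T = {}
        for t in query_terms:
            partition = term_table.get(t[0], {})          # only this partition is needed in memory
            T[(t, t)] = 1.0                               # a term is fully associated with itself
            for synonym, grade in partition.get(t, []):
                T[(t, synonym)] = grade
        # step 4: augmented inquiry
        B = max_min_composition(A, T)
        # step 5: collect (term, document) relevance from the word_document partitions
        R = {}
        for term in B:
            partition = word_document_table.get(term[0], {})
            for doc_id, freq in partition.get(term, []):
                R[(term, doc_id)] = freq
        # step 6: fuzzy set of candidate documents
        D = max_min_composition(B, R)
        # steps 7-9: keep documents above the alpha-cut and look up their disk addresses
        return {doc_id: document_table.get(doc_id)
                for doc_id, grade in D.items() if grade >= alpha_cut}

    hits = quick_retrieve(['relational', 'database'], {'relational': 0.8, 'database': 1.0})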
The retrieved documents are ranked but not grouped. Therefore, clustering has to be
applied to the set of documents returned in response to a query. Here, we use the Suffix
tree clustering method to group the retrieved documents [90], [147].
The following steps are performed to group the retrieved documents:
1. Perform document "Cleaning" to delete word prefixes and suffixes and to reduce
plural to singular
2. Identify Base Clusters. Each base cluster is assigned a score that is a function of
the number of documents it contains, and the words that make up its phrase. The
score s(B) of base cluster B with phrase P is given by:
s(B) = |B|.f(|p|)
where |B| is the number of documents in base cluster B, and |P| is the number of
words in P that have a non-zero score
3. Combine Base Clusters with common documents. Given two base clusters Bm and
Bn, with sizes |Bm| and |Bn| respectively, and |Bm∩Bn| representing the number of
88
documents common to both base clusters, we define the similarity of Bm and Bn to
be 1 iff:
|Bm∩Bn|/|Bm| > α and |Bm∩Bn|/|Bn| > α
Otherwise, their similarity is defined to be 0.
3.6 Discussion
To illustrate the above process, let us consider a very simple example, in which the inquiry
involves only a few index terms.
Consider the simple inquiry "Characteristics of relational database".
Ignoring the stop-word "of", the final inquiry involves only the following three index
terms: x1 = characteristics, x2 = relational, x3 = database.
That is, A = {x1, x2, x3} is the support of the fuzzy set A expressing the inquiry. Assume that
the vector representation of A over (x1, x2, x3) is
A = [0.6  0.8  1.0].
Now we look up the term_table to retrieve all the synonyms of term x2, obtaining the terms
relationship and narrative. Repeating this for each term xj of A, we get the final list X with
the additional terms:
x4 = relationship
x5 = narrative
x6 = distinctive
x7 = databank
x8 = data
Assume further that the relevant part of the fuzzy thesaurus (restricted to the support of A
and to the nonzero columns of the equivalence relation) is given by the matrix in Table 3.8.
Table 3.8: Term-term matrix showing the degree of similarity between terms

    T =     x1    x2    x3    x4    x5    x6    x7    x8
    x1      1     0     0     0     0.4   0.8   0     0
    x2      0     1     0.6   0.9   0.1   0     0.4   0.7
    x3      0     0.6   1     0     0     0     1     0.9
Then, by equations (3.2) and (3.3), the composition A º T results in fuzzy set B, which
represents the augmented inquiry; its vector form over (x1, x2, ..., x8) is

    B = [0.6  0.8  1.0  0.8  0.4  0.6  1.0  0.9].
Now we look up the word_document_table to get the document_ids of all the documents that
contain xj, where j = 1 to 8 (see Table 3.9). Here y1, y2, ..., y9 are the only documents
related to the index terms x1, x2, ..., x8, and the entry (xi, yk) in R shows the frequency of
occurrence of the i-th word in the k-th document relative to the size of the document.
By equations (3.4) and (3.5), the composition B º R results in fuzzy set D, which
characterizes the retrieved documents; its vector form over (y1, y2, ..., y9) is

    D = [0.9  0.8  1.0  0.4  0.5  0.6  0.6  0.3  0.7].

We can now decide to retrieve only those documents captured by the 0.5-cut of D.
The relevant part of the relevance relation R (restricted to the support of B and to the
nonzero columns) is given by the matrix in Table 3.9.
Table 3.9: Term-document matrix showing the frequency of terms assigned to documents

    R =     y1    y2    y3    y4    y5    y6    y7    y8    y9
    x1      1.0   0.2   0.7   0     0     0.8   0     0     0
    x2      0.2   1.0   0.3   0     0     0.2   0.6   0     0.2
    x3      0.7   0.3   1.0   0.4   0.2   0     0.3   0.1   0
    x4      0.7   0     0     0.1   0     0.2   0.5   0     0
    x5      0.6   0     0     0     0.1   0     0.8   0.1   0
    x6      0.8   0.2   0.1   0     0.1   1.0   0.2   0     0
    x7      0.5   0     0     0.4   0.5   0     0     0.1   0.3
    x8      0.9   0.2   0.1   0     0     0     0.5   0.3   0.7
Let us now apply suffix tree clustering to group the documents.
Figure 3.6: The suffix tree of the strings "Relational Database Characteristics", "Features of
Database" and "Relational Database Features and Purpose" (internal nodes a-f correspond to the
base clusters of Table 3.10)
Using Figure 3.6, we represent the documents with their base clusters in tabular form.
Table 3.10: Six nodes from the example shown in Figure 3.6 and their corresponding base clusters

    Node    Phrase                      Documents
    a       relational database         1, 3
    b       database                    1, 2, 3
    c       characteristics             1, 2
    d       features                    2, 3
    e       purpose                     2, 3
    f       database characteristics    1, 2
Given two base clusters Bm and Bn for nodes a and b, with sizes |Bm| and |Bn| respectively,
and |Bm ∩ Bn| representing the number of documents common to both base clusters, we
compute the similarity of Bm and Bn as follows:

          Node    Phrase                 Documents
    Bm:   a       relational database    1, 3
    Bn:   b       database               1, 2, 3

    |Bm| = 2, |Bn| = 3
    |Bm ∩ Bn| / |Bm| = 2/2 = 1
    |Bm ∩ Bn| / |Bn| = 2/3 = 0.67
    Similarity = 1, since α = 0.6.
Figure 3.7: The base cluster graph of the example given in Figure 3.6 and in Table 3.10
Using the algorithm presented here, an improvement is obtained in the speed of text document
retrieval. By using the range partition feature of Oracle, the Random Access Memory
requirement is reduced considerably, as the inverted files are held on secondary storage and
only the required partitions are brought into main memory. Fuzzy logic is applied to retrieve
selected documents, and then suffix tree clustering is used to group the ranked retrieved
documents.
3.7 Conclusion of this Chapter
A compressed, partitioned, in-memory inverted index to store the index terms of text documents
is proposed, which enables fast searching while requiring less main memory. Through the use of
compression and range partitioning, a small, fast representation of the sorted lexicon of
interest is achieved. Applying the range partition feature to the proposed data structure means
that only the few partitions relevant to a given query (2-5 words) need to be retrieved, and
hence only a small amount of relevant data has to be decompressed. The sorted lexicon also
permits rapid binary search for matching strings. Conjunctive queries are easily handled
through fuzzy logic: for an AND operation, the documents with a high α-cut value (threshold
limit) in fuzzy set D are retrieved, while for an OR operation the documents with a non-zero
α-cut value are retrieved. The proposed method also retrieves documents containing synonyms of
the search query terms.