Developing a specialized directory system by automatically classifying Web documents

Young Mee Chung
Yonsei University, Seoul, South Korea

Young-Hee Noh
Ewha Womans University, Seoul, South Korea

Received 10 May 2002
Revised 13 December 2002

Journal of Information Science, 29 (2) 2003, pp. 117–126

Abstract.
This study developed a specialized directory system using an automatic classification technique. Economics was selected as the subject field for the classification experiments with Web documents. The classification scheme of the directory follows the DDC, and subject terms representing each class number or subject category were selected from the DDC table to construct a representative term dictionary. In collecting and classifying the Web documents, various strategies were tested in order to find the optimal thresholds. In the classification experiments, Web documents in economics were classified into a total of 757 hierarchical subject categories built from the DDC scheme. The first and second experiments using the representative term dictionary resulted in relatively high precision ratios of 77 and 60%, respectively. The third experiment employing a machine learning-based k-nearest neighbours (kNN) classifier in a closed experimental setting achieved a precision ratio of 96%. This implies that it is possible to enhance the classification performance by applying a hybrid method combining a dictionary-based technique and a kNN classifier.

Correspondence to: Young Mee Chung, Department of Library and Information Science, Yonsei University, 134 Shinchon-Dong, Seodaemun-Gu, Seoul, Korea. E-mail: [email protected]
1. Introduction
Hundreds of thousands of new sites are appearing on
the Internet daily and the increase of Web documents
provided by Internet sites is extraordinary. Not only
the quantity but also the quality of these documents
has been improving from the early stages of the Internet
age. As a result, researchers and research institutes are
increasingly using Web documents for scholarly
purposes.
The well-known retrieval tools of the Web are search
engines and directory services. The search engines that
mainly perform keyword searching are continually
improving with the addition of new functions for more
effective retrieval. However, the search results are still
far from being satisfactory. On the other hand, the
directory services provide more satisfactory results to
users by manually reviewing and classifying documents before retrieval. Nevertheless, it is almost
impossible for human editors to keep classifying the
immensely increasing amount of Web documents.
Subject gateways like SOSIG (Social Science Information Gateway) also provide organized access to
Internet resources, usually in a specialized subject
area. Subject gateways are similar to specialized
directories in that they maintain a list of quality
Internet resources reviewed by human experts.
Besides, many academic subject portals being operated
by university libraries like Cleveland State University
Library also give access to Web documents by pulling
together free Internet resources and subscription-based
sources such as databases, electronic journals and
other digitized collections.
A way of improving the effectiveness of retrieval is to
group similar documents by clustering algorithms.
Document clustering can be used for automatic
classification of documents as in the SONIA project
of Stanford University [1] as well as for partitioning a
group of documents falling within the same subject
category into several sub-groups as in the OCLC’s
SCORPION project [2]. Especially in the Web environment where enormous numbers of documents are
being retrieved, generating small homogeneous groups
each containing more similar documents may increase
the retrieval effectiveness [3]. For example, the Northern Light search engine clusters retrieved documents
with similar topics into ‘custom search folders’,
enabling users to access the most appropriate document subset. The ‘Find similar’ feature that most
search engines provide as a search aid is also based
on the clustering concept.
Similarly, in directory services automatic classification or text categorization techniques based on a priori
classification schemes can be used to automatically
assign documents to relevant subject categories, greatly
reducing the work of human experts. Especially in a
research library environment where a specialized
directory system is needed, a new information service
can be accomplished without the use of additional
manpower by employing an appropriate document
classification method.
In this study, we developed a specialized directory
system that automates the collecting, indexing and
classification processes of Web documents. Web documents were gathered by a Web robot and then
automatically indexed. After indexing, automatic
classification of the documents was performed using
the subject term dictionary built from the DDC table.
In an effort to build a more effective directory
system, three classification experiments were carried
out as described below:
. A subject specialist extracted subject terms representing each subject category or class number under economics, using the explanatory note and relative index parts of the DDC table. The documents gathered by a Web robot were automatically indexed to make index terms represent each document. A Web document was then classified into an appropriate subject category after computing similarities between the document and every subject category.
. Additional representative terms for each subject category were derived from MARC records of library materials that had been manually classified into each category in a real library setting. For this purpose, the DDC numbers in a MARC record were used to extract subject terms from the title and subject headings in the same record. The terms that appeared in the title and the subject headings field were added to the representative term dictionary. Then the same method of classification was used as in the first experiment.
. For the purpose of enhancing the performance of the dictionary-based classification technique, a kNN (k-nearest neighbours) classifier, a very effective machine learning classifier [4, 5], was applied to the pre-classified Web documents. For the efficiency of text categorization, document frequency was used as a feature selection method to remove non-informative terms according to corpus statistics.
2. Developing a specialized directory system
2.1. System outline
Many people use general Internet directory systems
such as Yahoo! in order to meet their information
needs. Most of the general directory systems do not
apply standard classification schemes like DDC or UDC
but use specific schemes designed for Internet information resources. However, specialized directory systems
that focus on scientific information tend to use
standard library schemes. McKiernan [6] surveyed
sites that provide scientific information on the Internet
and found that two sites classify resources in alphabetical order, one in numerical order, 20 by DDC, and
three by UDC.
Most of the Internet directory systems, whether
general or specialized, classify or categorize information resources by editors or surfers. However, Cora [7],
a specialized retrieval engine in the computer science
field, uses an extended Naive Bayes classifier to
automatically classify documents into 75 hierarchical
subject categories. The performance of the classification method is reported to have an accuracy of 66% [8].
In this study economics was selected as the sample
subject field to construct a specialized directory system
through an automatic classification process. The
classification scheme of the directory follows the
DDC, and subject terms representing each class
number or subject category were selected from the
DDC table. A representative term dictionary was built
with category entries, each consisting of a class number, a
subject category label and corresponding subject terms.
The classification scheme and term dictionary need to
be updated periodically in order to reflect new
concepts appearing in incoming Web documents.

Fig. 1. Overview of the specialized directory system.

The overall structure of the specialized directory
system constructed in this study is shown in Fig. 1.
The workflow of the system can be described as
follows:
. To gather Internet documents by a Web robot, initial URLs of Web sites dealing with economics resources are designated for the Web robot to visit;
. the documents gathered by the Web robot are stored in a temporary database;
. the document indexer extracts, from each document in the temporary database, URL information for the next visit and other indexing information to be used in the automatic classification process;
. according to the designated automatic classification algorithm, each Web document subjected to classification is compared with the representative term dictionary and assigned to the most relevant category;
. a subject specialist verifies the result of classification and saves it in the directory database if correctly assigned. This verification process may be omitted if 100% automatic classification is desired.

2.2. Classificatory structure of the directory

Since specialized directories deal mainly with scholarly information, unlike general Internet directory services, the use of such standard classification schemes as the DDC and the UDC is encouraged. To develop a specialized directory in the economics field, the DDC scheme was adopted, and the subject headings that correspond to class numbers in the DDC table were used as subject category labels.
DDC is a decimal classification scheme that divides
all subjects into 10 categories from 0 to 9. Each main
class is further divided into 10 divisions, and the
divisions are once again divided into 10 sections.
Since this study deals with the field of economics, it
uses the ‘Economics’ (330) subdivision under the
‘Social Sciences’ (300) main class. A total of 757
subject categories were collected from the Economics
subdivision, and subject terms representing each
subject category were selected. A representative term
dictionary built from these data was then used in
automatically classifying incoming Web documents.
Each record of the representative term dictionary
includes a class number, a subject category label and
subject terms for the subject category. A sample record
for class number 334.7 is given below:
[334.7] * Benefit societies [benefit societies, benevolent
societies, friendly societies, mutual aid societies, provident societies]
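A record of this kind can be sketched as a small data structure. The fragment below is our own illustration (not the system's actual implementation), using the 334.7 entry shown above; the class names are assumptions only in their Python spelling.

```python
# A minimal sketch of a representative term dictionary record.
# The 334.7 entry is taken from the sample record in the text.
from dataclasses import dataclass, field

@dataclass
class CategoryRecord:
    class_number: str                              # DDC class number, e.g. "334.7"
    label: str                                     # subject category label
    terms: set = field(default_factory=set)        # representative subject terms

benefit_societies = CategoryRecord(
    class_number="334.7",
    label="Benefit societies",
    terms={"benefit societies", "benevolent societies", "friendly societies",
           "mutual aid societies", "provident societies"},
)

# The dictionary itself maps class numbers to records.
term_dictionary = {benefit_societies.class_number: benefit_societies}
```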
The starting structure of the directory consists of nine
levels and the class number structure for each level is
shown in Table 1. Thus, the top subject category, i.e.
economics subdivision, is level 1 and can be subdivided up to level 9.
The number of levels in the directory hierarchy is
decided after considering the effectiveness of the
system.

Table 1
Hierarchy of subject categories

Levels of subject category   Class number structure
Level 1                      330
Level 2                      33*
Level 3                      33*.*
Level 4                      33*.**
Level 5                      33*.***
Level 6                      33*.****
Level 7                      33*.*****
Level 8                      33*.******
Level 9                      33*.*******

The performance evaluation of the results of
classification experiments guides the selection of the
hierarchy depth. The level depth up to level 9 is not
unusual in other directories. For example, subject
categories exceeding 10 levels can be seen in Yahoo!.
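The mapping from a class number to its directory level in Table 1 is mechanical: "330" is level 1, any other three-digit 33x number is level 2, and each decimal digit adds one level. The helper below is our own illustration of that structure, not code from the system.

```python
# Derive the directory level of an Economics (33x) DDC class number,
# following the class number structure in Table 1.
def directory_level(class_number: str) -> int:
    if not class_number.startswith("33"):
        raise ValueError("not an Economics (33x) class number")
    if class_number == "330":
        return 1                                   # top category: Economics
    integer_part, _, decimal_part = class_number.partition(".")
    return 2 + len(decimal_part)                   # one level per decimal digit

print(directory_level("330"))    # 1
print(directory_level("334"))    # 2
print(directory_level("334.7"))  # 3
```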
2.3. Collecting Web documents
In general, Web documents on the Internet are
collected by a Web robot. Web robots gather documents
by periodically navigating domestic and foreign Web
sites on a URL list. Often the URL list of related sites is
drawn up to designate the starting URLs. The well-known sites in a specific field or the URLs appearing
on Usenet Newsgroups are usually put on the list.
Other options include using sites where mailing list
documents are stored or designating the server of a
well-known Web directory system.
This study sought out Web sites of well-known
research institutes in the economics field to come up
with 50 starting URLs. The URL list includes 37
domestic sites and 13 foreign sites such as the Asian
Development Bank and International Monetary Fund.
Beginning with the starting URLs, the collected
documents were temporarily saved on a local database
to extract new URL information. This is a process of
acquiring new URL information from already visited
Web sites. The Web robot examined acquired domain
names, excluded overlapping sites, and checked
whether the URL had already been visited, in order to
prevent the retrieval of duplicate Web documents. The
Web robot developed for this study followed a
standard for robot exclusion that specified the method
to exclude robots from a server [9].
There are two ways for a Web robot to search
hierarchically linked Internet documents, breadth-first
search and depth-first search. The Web robot in our
directory system was designed to first retrieve one host
and then move on to the next by the depth-first search.
The depth designated for visits was ‘+10’. By
designating ‘+10’, the robot visited the 10 upper sites
and 10 lower sites from the starting URL. If ‘0’ was
designated, it would retrieve all references both above
and below on the starting URL.
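The traversal described above can be sketched as a depth-limited, depth-first crawl that honours the robot exclusion standard. This is a simplified illustration under our own assumptions: the fetch function and link extractor are injected so the sketch stays self-contained, the '+10' depth convention is reduced to a plain maximum depth, and the temporary database is an in-memory dict.

```python
# A minimal sketch of a depth-first, depth-limited Web robot that checks
# robots.txt rules before visiting a URL and skips already-seen URLs.
from urllib import robotparser

def crawl(start_url, fetch, extract_links, robots, max_depth=10):
    """Depth-first crawl from start_url, skipping disallowed or seen URLs."""
    seen, pages = set(), {}

    def visit(url, depth):
        if depth > max_depth or url in seen or not robots.can_fetch("*", url):
            return
        seen.add(url)
        pages[url] = fetch(url)              # store document in the temporary DB
        for link in extract_links(pages[url]):
            visit(link, depth + 1)           # depth-first: recurse before siblings

    visit(start_url, 0)
    return pages

# Tiny in-memory "web" for demonstration (hypothetical URLs).
web = {"http://a/": "see http://a/x", "http://a/x": "end"}
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow:"])     # empty Disallow: allow everything
pages = crawl("http://a/", web.get,
              lambda doc: [w for w in doc.split() if w.startswith("http")], rp)
```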
2.4. Automatic indexing and classification of Web
documents
In order to classify the gathered Web documents into a
specific subject category in the directory, the index
terms need to be extracted. In the indexing process, a
commercial indexing system based on a stop word list
and morphological analyser was used. Index terms
were extracted from the full text of documents
including title, subtitle, hyperlink anchor and words
in bold or italics. However, to minimize the indexing
overhead in the directory system, file names, file
directory paths, e-mail addresses, host names and
HTML tags were excluded from the indexing. For each
index term, term frequency (TF) within a specific
document and document frequency (DF), that is the
number of documents where a term appears, were
calculated. Each index term was assigned a standardized term frequency weight of 1 + log TF.
After indexing the gathered Web documents, the
similarity between a specific document and subject
categories of the directory was measured by comparing
the assigned index terms and category representative
terms. As previously mentioned, each subject category
was represented by subject terms. The cosine coefficient was used to measure the similarity between a
specific Web document i and a subject category j as
follows:
S(D_i, C_j) = Σ_k (t_ik × c_jk) / √( Σ_k (t_ik)² × Σ_k (c_jk)² )

where t_ik = weight of term k within document i (index
term weight) and c_jk = weight of representative term k
in subject category j (set to 1 if it exists).
If the similarity between an input document and a
subject category exceeds a certain threshold value, the
document may be classified into the corresponding
subject category. In this study, the threshold value
varied in four steps, from 0.1 to 0.4. Here, one
document was assigned to only one subject category
with the highest similarity value, although there could
be multiple categories for which the similarity
exceeded the threshold.
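The classification step described in this section can be sketched compactly: weight index terms by 1 + log TF, compute the cosine similarity against each category's representative terms (whose weights are all 1), and assign the single best category if its score exceeds the threshold. This is our own illustration of the method; the example terms and class numbers are hypothetical.

```python
# Dictionary-based classification: cosine similarity between a document's
# weighted index terms and each category's representative term set.
import math

def tf_weight(tf: int) -> float:
    return 1 + math.log(tf) if tf > 0 else 0.0

def cosine(doc_weights: dict, category_terms: set) -> float:
    # c_jk is 1 for every representative term, so the dot product reduces
    # to summing the weights of terms the document shares with the category.
    shared = sum(w for t, w in doc_weights.items() if t in category_terms)
    norm = math.sqrt(sum(w * w for w in doc_weights.values()) * len(category_terms))
    return shared / norm if norm else 0.0

def classify(doc_tf: dict, dictionary: dict, threshold: float = 0.1):
    weights = {t: tf_weight(f) for t, f in doc_tf.items()}
    best = max(dictionary, key=lambda c: cosine(weights, dictionary[c]))
    score = cosine(weights, dictionary[best])
    return best if score >= threshold else None   # one category per document

dictionary = {"334.7": {"benefit societies", "friendly societies"},
              "332.4": {"money", "monetary policy"}}
doc = {"money": 3, "monetary": 1, "monetary policy": 2}
print(classify(doc, dictionary))  # → 332.4
```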
3. Automatic classification experiments
3.1. Outline of the experiments
In order to achieve the best performance in automatically classifying the collected Web documents in our
directory system, classification experiments were
carried out in three different settings. The number of
test documents for the first and second experiments
was 6743. In the third experiment, a total of 2512
documents including 1889 training documents and
623 test documents were used.
In the first and second experiments, the threshold
values for the term frequency of index terms and the
Table 2
Twelve threshold conditions for directory building

            dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12
TF          1      1      1      1      2      2      2      2      3      3       3       3
Similarity  0.1    0.2    0.3    0.4    0.1    0.2    0.3    0.4    0.1    0.2     0.3     0.4
similarity between an input document and subject
categories varied to construct 12 different experimental
directories. The term frequency threshold of index terms was
varied from 1 to 3, generating three conditions. As
mentioned before, the similarity threshold value was
also varied from 0.1 to 0.4, generating four conditions. In
total, there were 12 threshold conditions, resulting in
12 different directories, as shown in Table 2.
In the performance evaluation, three measures were used:
classification precision, degree of dispersion and degree
of hierarchical concentration. To measure the performance
of binary classification, the result of the category
assignments is usually evaluated using a two-way
contingency table for each subject category, and average
precision across categories is computed [1]. In this study,
classification precision is an averaged precision ratio that
computes the percentage of correctly assigned documents
among the total documents assigned to a specific category.
The classification precision formula for each subject
category is as follows:

Classification Precision = (Number of correctly assigned documents) / (Total number of documents assigned to the category)

The degree of dispersion measures whether the documents
gathered by the Web robot are evenly assigned across the
subject categories of the directory, without concentrating
in a specific category. The formula is as follows:

Degree of Dispersion = (Number of categories to which documents were assigned) / (Total number of categories)

The degree of hierarchical concentration estimates the
percentage of documents assigned to the subject
categories at each level among the total number of
classified documents. It is calculated by the following
formula:

Degree of Hierarchical Concentration = (Number of documents classified at a level) / (Total number of documents classified)
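The three measures above can be expressed as small functions over a list of (assigned category, correct category, level) records. This is our own illustration of the formulas, not code from the study; the sample records are hypothetical.

```python
# The three evaluation measures, computed over per-document records of the
# form (assigned_category, correct_category, level).
def classification_precision(records, category):
    assigned = [r for r in records if r[0] == category]
    if not assigned:
        return 0.0
    return sum(1 for a, c, _ in assigned if a == c) / len(assigned)

def degree_of_dispersion(records, total_categories):
    # fraction of categories that actually received at least one document
    return len({a for a, _, _ in records}) / total_categories

def hierarchical_concentration(records, level):
    # fraction of classified documents that landed at the given level
    return sum(1 for *_, lv in records if lv == level) / len(records)

records = [("332.4", "332.4", 3), ("332.4", "334.7", 3), ("330", "330", 1)]
print(classification_precision(records, "332.4"))   # 0.5
print(hierarchical_concentration(records, 3))
```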
3.2. Results of the first and second experiments
Variation of thresholding conditions. Tables 3 and 4
show that, as the similarity threshold gets higher, the
number of documents being classified into each subject
category decreases in both experiments. On the other
hand, it appears that the frequency threshold of index
terms does not significantly affect the classification
result. In the cases of dir 1, dir 5 and dir 9, where the
similarity threshold was set at 0.1, the number of
documents classified into the subject categories in both
experiments exceeded 4694 documents in the three
directories.
However, for dir 4, dir 8, dir 12, where the similarity
threshold was set at 0.4, the actual number of
documents classified into subject categories was 87,
163 and 407, respectively, for the first experiment and
63, 122 and 249 for the second experiment. This shows
a very low rate of category assignment.
When the documents assigned to each category were
examined by level, the greatest number of documents
was assigned to categories in levels 1, 4 and 5 in the
first experiment. However, in the second experiment,
few documents were assigned to level 1, while many
documents were assigned to levels 6, 5 and 4 in the
listed order.
The subject category of level 1 is ‘Economics’ with
the class number 330. In the first experiment, the
representative terms for the category are economics
and its corresponding Korean term. This caused many
documents to be assigned into that category when
applying the low threshold such as 0.1 or 0.2. Since the
directory is a specialized one in the economics field, it
is not desirable to have many documents in the top
level. In the second experiment where nearly 100
representative terms were given for the top level
category, a dramatic decrease in the number of
documents assigned to the top level can be noticed.
On the other hand, level 9 contains very few
documents, implying that this level may be too specific
to be included in the directory.
Analyses of categorical dispersion and hierarchical
concentration. It is preferable to design the architecture of a directory so that documents are evenly
Table 3
Result of the first experiment

Level   dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12  Total   Average
1       1315   1215   19     1      1285   1253   1188   18     1272   1418    1187    62      10,233  852.75
2       99     35     6      0      71     32     6      0      53     32      7       1       342     34.20
3       305    92     5      0      247    116    18     5      247    126     32      7       1200    109.09
4       1930   804    131    34     1771   1106   241    63     1566   1120    369     125     9260    771.67
5       1884   724    133    47     1683   993    233    64     1601   1139    389     127     9017    751.42
6       744    304    57     5      585    366    108    13     527    388     168     72      3337    278.08
7       337    71     4      0      179    87     23     0      163    100     34      13      1011    101.10
8       14     3      0      0      8      7      0      0      5      4       2       0       43      6.14
9       1      0      0      0      0      0      0      0      0      0       0       0       1       1.00
Total   6629   3248   355    87     5829   3960   1817   163    5434   4327    2188    407     34,444  2870.33
dispersed across all the subject categories. The number of
categories into which documents were actually assigned,
out of the entire 757 subject categories in the directory,
was counted. The average number of such categories
for the two experiments was 171.42 and 185.5,
respectively. This means that only about a quarter of the
subject categories were used to classify the collected
Web documents.
The degree of dispersions computed for 12 directory
conditions is shown in Table 5 and graphically in
Fig. 2. The average dispersion ratio was 23% for the
first experiment and 25% for the second.
The degree of hierarchical concentration was also
computed for every level of 12 directory conditions. The
average degree of concentration for each level is
presented in Fig. 3. The degree of hierarchical
concentration is relatively high in levels 1, 4, 5 and 6 for the first
experiment and in levels 4, 5 and 6 for the second.
Evaluation of classification accuracy. The classification accuracy was evaluated using the precision ratio
as stated in 3.1. It was analysed by level as well as by
directory condition. In calculating the average precision, level 9 was excluded because it contained too few
documents to exist as a subject category in our
directory.
Table 6 gives the precision ratios by level for the first
and second experiments. The average precision was
about 77% for the first experiment and about 60% for
the second. Level 1, containing the broadest subject
category, achieved a precision ratio of 1.0, as expected.
When excluding the top level, the precision ratios
decreased to 73 and 58%, respectively.
Table 4
Result of the second experiment

Level   dir 1  dir 2  dir 3  dir 4  dir 5  dir 6  dir 7  dir 8  dir 9  dir 10  dir 11  dir 12  Total   Average
1       7      1      0      0      2      0      0      0      0      0       0       0       10      0.83
2       32     15     0      0      9      5      0      0      3      1       1       1       67      5.58
3       397    164    6      0      198    126    7      0      193    102     13      0       1206    100.50
4       1264   603    47     3      1307   770    126    14     1130   691     185     38      6178    514.83
5       1680   713    131    46     1739   970    218    57     1458   1080    376     116     8584    715.33
6       2181   1722   109    14     2106   1794   187    51     1745   1952    1384    87      13,332  1111.00
7       120    32     2      0      145    56     10     0      138    81      19      6       609     50.75
8       32     10     0      0      29     16     2      0      27     23      7       1       147     12.25
9       2      0      0      0      0      0      0      0      0      0       0       0       2       0.17
Total   5715   3260   295    63     5535   3737   550    122    4694   3930    1985    249     30,135  2511.25
Table 5
Degree of categorical dispersion

          First experiment   Second experiment
dir 1     0.46               0.51
dir 2     0.23               0.29
dir 3     0.08               0.07
dir 4     0.02               0.02
dir 5     0.43               0.49
dir 6     0.30               0.33
dir 7     0.13               0.12
dir 8     0.04               0.03
dir 9     0.43               0.47
dir 10    0.34               0.35
dir 11    0.18               0.18
dir 12    0.10               0.07
Average   0.23               0.25
Fig. 2. Comparison of categorical dispersion.
Fig. 3. Degree of hierarchical concentration by level.
Table 7 shows the classification precision of the
directory according to the different directory conditions. In both experiments, the performance was
enhanced with the increase of the similarity threshold
from 0.1 to 0.3. When the threshold was set at 0.4, there
were several directory conditions where the performance dropped. However, the frequency threshold
seemed to have little impact on the performance of
automatic classification.
Discussion on the first and second experiments. The
second experiment sought to improve the classification
performance by adding the terms from titles, subject
headings and other fields in MARC records to the
representative terms drawn from the DDC table.
Although the performance of the automatic classification became worse in the second experiment, the
percentage of the documents assigned to the top-level
category decreased from 26.63 to 0.02%. Therefore, in
terms of the specificity of classification and the
hierarchical concentration, the second experiment
gave more satisfactory results.
When looking at the classification results in the
directory system varying threshold conditions, it was
found that the actual number of documents being
assigned into subject categories decreased when the
similarity threshold between a document and subject
categories became higher. Similarly, the actual number
of categories where documents were assigned also
decreased when the similarity threshold became lower.
This implies that an appropriate level of the similarity
threshold needs to be ascertained for any Web
directory to be effective.
In both the first and second experiments, when
excluding level 9 where too few documents were
assigned due to over-segmentation, the directory
achieved a relatively high classification performance
with precision ratios of 77 and 60%, respectively.
However, when using a subject term dictionary built
from the DDC table, as in the first experimental setting,
it is desirable to add more general terms to the
representative terms of categories in the top level.
The analysis of the classification results according to
the threshold condition indicates that the performance
improves with higher similarity threshold from 0.1 up
to 0.3. In contrast, the term frequency threshold
varying from 1 to 3 seems to have little influence on
the performance as well as on the number of documents being actually classified.
Table 6
Comparison of classification precisions by level

Level                         First experiment   Second experiment
1                             1.00               1.00
2                             0.97               0.14
3                             0.70               0.56
4                             0.78               0.65
5                             0.67               0.55
6                             0.69               0.67
7                             0.62               0.55
8                             0.71               0.80
Average                       0.77               0.60
Average (excluding level 1)   0.73               0.58
Table 7
Comparison of classification precisions by directory conditions

          First experiment   Second experiment
dir 1     0.72               0.55
dir 2     0.72               0.60
dir 3     0.75               0.69
dir 4     0.86               0.52
dir 5     0.75               0.63
dir 6     0.77               0.67
dir 7     0.81               0.68
dir 8     0.70               0.72
dir 9     0.69               0.46
dir 10    0.81               0.52
dir 11    0.82               0.61
dir 12    0.85               0.51
Average   0.77               0.60
3.3. The third experiment to enhance the classification
performance
Test collection. Of the 12 directories evaluated in the
first experiment, dir 1, with a frequency threshold of 1
and a similarity threshold of 0.1, was used for the third
experiment. dir 1 contained a total of 757 subject
categories, of which 386 contained more than one
document. Among the documents actually classified in
dir 1, 2512 documents with category labels were used
for the third experiment. This test collection was
divided into two subsets with a ratio of 7 to 3, resulting
in 1889 training documents and 623 test documents.
Classification process and result. In the third
experiment, a kNN classifier was used to categorize
the pre-classified test documents by learning from the
training documents. The classifying process of the
kNN classifier was as follows [10]:
1. First, given a test document, k nearest documents
were retrieved from the training documents. For
the calculation of similarity between each retrieved
training document Dj and the test document Dx,
the following cosine coefficient formula was used.
In this experiment, the k value was varied to 10, 20
and 30 for the purpose of finding the optimal k. As
mentioned earlier, the term weights, txk and tjk,
were in the form of 1 + log TF.

   Sim(Dx, Dj) = Σ_k (t_xk × t_jk) / √( Σ_k (t_xk)² × Σ_k (t_jk)² )

2. For the purpose of more efficient processing,
feature reduction was applied in representing each
document to be classified. Document frequency,
which proved as effective as information gain or
the chi-square test in a previous study [11], was
used as the feature selection method. Based on
previous studies containing feature selection
experiments [4, 11], the top 20% of the terms with
the highest document frequency were used to
represent each document.

3. After computing the relevance score of each
subject category on the basis of the cosine similarity
values obtained in step 1, the test document was
assigned to the category with the highest relevance
score. As shown in the relevance formula below,
the similarity of each nearest neighbour Dj to the
test document was multiplied by the conditional
probability of the document Dj being in each
category Ck, and summed to give the relevance
score.

   rel(Ck | Dx) ∝ Σ_{Dj ∈ k top documents} sim(Dx, Dj) · p(Ck | Dj)

Table 8
Categorization performance of a kNN classifier

k value   Recall ratio   Precision ratio   Accuracy
10        0.9015         0.9486            0.9984
20        0.9010         0.9658            0.9984
30        0.9009         0.9752            0.9984
Average   0.9011         0.9632            0.9984
Table 8 shows the performance of the kNN classifier
when k was set at 10, 20 and 30 respectively. The
average precision was as high as 96.32% with a slight
increase in precision when k was larger. It can be seen
from the table that the k value hardly affects the
categorization performance in this experiment.
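The three steps above can be sketched compactly. This is our own illustration under simplifying assumptions: each training document carries a single label, so p(Ck | Dj) is 1 for that label and 0 otherwise, and DF-based feature selection is reduced to keeping the top 20% most frequent terms; the example documents and class numbers are hypothetical.

```python
# A compact sketch of the kNN categorization: 1 + log TF weights, cosine
# similarity to the k nearest training documents, and a relevance score that
# sums similarity times category membership.
import math
from collections import Counter

def weights(tf: Counter) -> dict:
    return {t: 1 + math.log(f) for t, f in tf.items() if f > 0}

def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def df_select(docs, keep=0.2):
    """Keep only the terms in the top `keep` fraction by document frequency."""
    df = Counter(t for tf in docs for t in tf)
    top = {t for t, _ in df.most_common(max(1, int(len(df) * keep)))}
    return [Counter({t: f for t, f in tf.items() if t in top}) for tf in docs]

def knn_classify(test_tf, training, k=10):
    """training: list of (term-frequency Counter, category label)."""
    tw = weights(test_tf)
    neighbours = sorted(((cosine(tw, weights(tf)), label) for tf, label in training),
                        reverse=True)[:k]
    scores = Counter()
    for sim, label in neighbours:
        scores[label] += sim              # sim(Dx, Dj) * p(Ck | Dj), with p = 1
    return scores.most_common(1)[0][0]

training = [(Counter(money=3, bank=2), "332.1"),
            (Counter(money=1, policy=2), "332.4"),
            (Counter(cooperative=4), "334")]
print(knn_classify(Counter(money=2, bank=1), training, k=2))  # → 332.1
```

In practice `df_select` would be applied to the document representations before training and classification, mirroring step 2.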
4. Conclusions and suggestions
In developing a specialized Internet directory system
in the field of economics, automatic classification
experiments were performed to construct a more
effective system. The DDC scheme was adopted as a
classificatory architecture for the directory and a
dictionary-based classification method employed to
automatically categorize Web documents being input
to the directory system. The first classification experiment that used a representative term dictionary
achieved a relatively high precision of 77%. Although
this level of performance is acceptable in a real system
environment, human subject specialists may verify the
classification results and reassign incorrectly classified
documents without much effort. This process will
create a high performance directory similar to directory
services managed by human editors.
However, it was assumed that the classification
performance would also improve if a machine learning
categorization technique was applied as a second step
to refine the classification result. When a kNN classifier
was employed to reclassify the test documents, the
classification precision was enhanced up to 96%.
Although this extremely high precision was possible
because the classifier was applied to the pre-classified
documents in our directory, the experimental result
suggests a way to optimize the classification performance in a specialized directory system. The suggestions to be made from the experimental results of this
study are as follows:
. Construct a specialized directory system with a
well-established classification scheme such as DDC
and employ an automatic classification procedure
using a subject term dictionary to classify collected
documents.
. To improve classification performance, select a
certain number of correctly classified documents
from every subject category during the verifying
process of classified documents and use them as
training documents for a kNN classifier. Reclassify
the collected documents in the directory using the
kNN classifier.
. Once a sufficient number of Web documents are
collected and classified for a certain period of time,
it is also possible to activate a kNN classifier to
categorize newly collected documents by providing
some correctly classified documents as training
documents.
. When gathering documents in a specific subject
field from the Web and classifying them into a
directory, the term frequency threshold for input
documents as well as the similarity threshold for
document-category matching need to be optimized.
References
[1] M. Sahami et al., SONIA: a service for organizing
networked information autonomously, Digital Libraries
98. Available at: http://robotics.stanford.edu/users/
sahami/papers-dir/98-sonia.ps (posted 1998, access
date: 25 March 1999).
[2] The Scorpion Project. Available at: http://orc.rsch.oclc.org:6109/ (posted 1998, access date: 20 March 1999).
[3] O. Zamir and O. Etzioni, Web document clustering: a
feasibility demonstration, ACM SIGIR ‘98 (1998) 46–54.
[4] Y. Yang, An evaluation of statistical approaches to text
categorization, Information Retrieval 1 (1999) 69–90.
[5] Y. Yang and X. Liu, A re-examination of text categorization methods, ACM SIGIR ‘99 (1999) 42–49.
[6] G. McKiernan, Beyond Bookmarks: Schemes for
Organizing the Web. Available at: www.public.iastate.edu/~CYBERSTACKS/CTW.htm (posted 1999,
access date: 24 February 1999).
[7] Cora: Computer Science Research Paper Search Engine.
Available at: www.whizbang.com (access date: 20 May
2001).
[8] A. McCallum et al., A machine learning approach to
building domain-specific search engines, The Sixteenth
International Joint Conference on Artificial Intelligence
(IJCAI-99) (1999) 662–667.
[9] M. Koster, A Standard for Robot Exclusion. Available at:
www.demon.co.uk/pages/knowbots/norobots.html
(posted 1997, access date: 2 April 1999).
[10] L.S. Larkey and W.B. Croft, Combining classifiers in text
categorization, SIGIR '96 (1996) 287–297.
[11] Y. Yang and J.O. Pedersen, A comparative study on
feature selection in text categorization, Proceedings of
the Fourteenth International Conference on Machine
Learning (ICML ‘97) (1997) 412–420.