AUTOMATED DISCOVERY OF CONCEPTS FROM TEXT

ONG SIOU CHIN
A thesis submitted
in fulfillment of the requirements for the degree of Master of Science
Faculty of Computer Science and Information Technology
UNIVERSITI MALAYSIA SARAWAK
2008
Acknowledgements
My first word of gratitude goes to my parents for their unconditional encouragement, patience and understanding.
I would also like to extend my appreciation and admiration to my supervisors, Associate Professor Narayanan Kulathuramaiyer and Associate Professor Dr. Alvin Yeo Wee, for their excellent guidance and their infinite patience.
My acknowledgement also goes to the Unimas Postgraduate Fellowship (Zamalah Pascasiswazah Unimas/ZPU) for the financial support provided.
To my friends, I would like to thank them for their help and sharing of ideas during the progress of my research.
Finally, I am grateful and appreciative to my siblings and Tat Sin for their support, which kept me pursuing my goals, especially during hard times.
Abstract
Concept discovery is a process of exploring new concepts by forming connections or relationships between terms. Existing mechanisms for concept discovery tend to pick up all possible relationships between terms in a document based on the sentence structure of the terms identified. Also, the concepts and relationships discovered tend to be cluttered and difficult to expand incrementally. A framework is proposed for automatic concept discovery from a large collection of documents.
The proposed framework combines machine learning and semantic modelling in enabling the discovery process. The framework enables the extraction of concepts and the formulation of a structured organization and labelling of concepts to be used by text processing applications such as text summarization and text categorization. A mechanism for automatically inducing a set of words that captures the meaning of a collection of documents has been developed. The WordNet lexical database is used to extract root meanings of terms and to determine relationships amongst these terms.
The proposed framework, based on an integration of statistical, machine learning and semantic
approaches, consists of five phases: feature selection, clustering, word sense disambiguation,
semantic associations and automatic labelling. Distributional clustering groups together words that have similar probability distributions. Word sense disambiguation is then used to emphasize key "senses", based on the co-occurrence of similar synsets across terms in a cluster. Semantic association is then performed to visualise dependency relationships in the form of predicate rules. Subsequently, automatic labelling derives a label for each cluster by considering the branching factor of each term and the scoring obtained in the sense disambiguation phase.
The experiments carried out have demonstrated the capability of the proposed framework in performing an automatic characterisation of document collections. The research has presented many possibilities for future knowledge discovery systems.
Abstrak
Concept discovery is a process of forming new concepts by establishing relationships between words. Existing concept discovery mechanisms often select all possible relationships between words by considering the roles of those words based on the sentence structure in a document. In addition, the concepts and relationships produced are disorganised and difficult to expand. A framework has been proposed for the automatic discovery of concepts from a collection of documents.
The proposed framework combines machine learning modelling and semantic modelling for the concept discovery process. The framework enables the extraction of concepts, so that the structured organisation and labelled concepts can be used in text processing applications such as text summarization and text classification. A mechanism has been built to automatically induce a set of words that is able to portray the meaning of a collection of documents. The WordNet lexical database has been used to extract the meanings of words and the semantic relationships between words.
The framework, which integrates statistical, machine learning and semantic methods, comprises five phases: feature selection, clustering, word sense disambiguation, semantic association and automatic labelling. The distribution-based clustering used assigns words that have the same distribution to a cluster. Next, word sense disambiguation is used to emphasise the important word senses based on the occurrence of similar word senses within a cluster. Semantic association is carried out to display dependency relationships in the form of predicate rules. Subsequently, a label for each cluster is generated automatically by considering the branching factor and the score from the contextual sense disambiguation phase for each word.
The experiments carried out have shown the capability of this framework for characterising document collections. This research has presented possibilities for future knowledge discovery systems.
Table of Contents
Acknowledgements
Abstract
Abstrak
Table of Contents
List of Figures
List of Tables
List of Papers
Chapter 1 Introduction
1.1 Introduction to Concept Discovery
1.2 Research Problem
1.3 Research Objectives
1.4 Scope
1.5 Chapters Overview
Chapter 2 Literature Review
2.1 Introduction
2.2 Concept and Concept Discovery
2.3 Ontology
2.3.1 WordNet
2.3.2 Suggested Upper Merged Ontology (SUMO)
2.4 Feature Selection
2.4.1 Mutual Information (MI)
2.4.2 Information Gain (IG)
2.4.3 χ²-Test (CHI)
2.4.4 WordNet Lexical Semantic (WLS)
2.4.5 Evaluation Measures for Feature Selection
2.5 Clustering
2.5.1 K-Means
2.5.2 Clustering by Committee (CBC)
2.5.3 Distributional Clustering
2.5.4 Evaluation Measures for Clustering
2.6 Semantic Similarity
2.6.1 The Resnik Measure
2.6.2 The Jiang and Conrath Measure
2.6.3 The Seco Measure
2.6.4 Evaluation Measures for Semantic Similarity
2.7 Word Sense Disambiguation
2.7.1 Yarowsky Approach
2.7.2 Conceptual Density Approach
2.7.3 Semantic Relatedness Approach
2.7.5 Evaluation Measures for Word Sense Disambiguation
2.8 Concept Discovery Framework
2.8.1 General Architecture for Text Engineering (GATE)
2.8.2 PARMENIDES
2.8.3 Ontological Semantics (OntoSem)
2.8.4 OntoLearn
2.8.5 Conceptual Clustering using Formal Concept Analysis (FCA)
2.8.6 Evaluation Measures for Concept Discovery
2.9 Summary
Chapter 3 Framework Conceptual Design
3.1 Introduction
3.2 Framework for Concept Discovery
3.2.1 Document Pre-Processing
3.2.2 Feature Selection
3.2.3 Clustering
3.2.4 Word Sense Disambiguation
3.2.5 Semantic Association
3.2.6 Automatic Labelling
3.3 Evaluation
3.3.1 Automated Text Categorization
3.3.2 Incremental Learning
3.4 Summary
Chapter 4 Implementation
4.1 Introduction
4.2 Framework for Concept Discovery
4.2.1 Document Pre-Processing
4.2.2 Feature Selection
4.2.3 Clustering
4.2.4 Word Sense Disambiguation
4.2.5 Semantic Association
4.2.6 Automatic Labelling
4.3 Evaluation
4.3.1 Automated Text Categorization
4.3.2 Incremental Learning
4.4 Summary
Chapter 5 Result and Analysis
5.1 Introduction
5.2 Dataset
5.3 Contextual Sense Disambiguation
5.3.1 Quantitative Approach
5.3.2 Baseline Approach
5.3.3 Discussion
5.4 Concept Discovery and Automatically Labelling
5.4.1 Concept Discovery and Automatically Labelling Result
5.4.2 Quantitative Approach
5.4.3 Discussion
5.5 Context Representation
5.5.1 Automated Text Categorization
5.5.2 Results and Analysis
5.5.3 Discussion
5.6 Incremental Learning
5.6.1 Quantitative Approach
5.6.2 Discussion
5.7 Summary
Chapter 6 Discussion
6.1 Discussion of the Proposed Framework
6.2 Summary
Chapter 7 Conclusion
7.1 Introduction
7.2 Research Contributions
7.3 Limitations
7.4 Future Works
7.4.1 Human and System Interaction
7.5 Summary
References
Appendix A Stop Word List
Appendix B Index File Format
Appendix C Concept Discovery Result-Reuters (Top 10 categories for Reuters-21578)
Appendix D Concept Discovery Result (Two categories from SITE)
Appendix E Conceptual Sense Disambiguation Result (dataset used for evaluation)
Appendix F Contextual Sense Disambiguation Evaluation Form
Appendix G Concept Discovery Evaluation Form
Appendix H Incremental Learning Evaluation Form
Appendix I Document 17760.txt from category "acq"
List of Figures
Figure 1.1 Concept discovery (extracted from Smith and Humphreys, 2004)
Figure 1.2 Outline of dissertation
Figure 2.1 Outline of areas covered in the literature review
Figure 2.2 The different perspectives of concept
Figure 2.3 Ontology Learning Task
Figure 2.4 Top level of SUMO concepts
Figure 2.5 Dependency of SUMO, MILO and domain ontologies
Figure 2.6 Overview of clustering algorithm (Downs and Barnard, 2002)
Figure 2.7 Distributional clustering process
Figure 3.1 Outline of Chapter 3
Figure 3.2 A framework for concept discovery
Figure 3.3 Document pre-processing
Figure 3.4 Feature Selection
Figure 3.5 Classification accuracy from Reuter's dataset (Baker and McCallum, 1998)
Figure 3.6 Distributional clustering
Figure 3.7 Contextual Sense Disambiguation
Figure 3.8 Semantic Associations
Figure 3.9 Possible patterns of words (and intersection) are related
Figure 3.10 Automatic Labelling
Figure 3.11 Framework for text categorization
Figure 3.12 Framework for incremental learning
Figure 4.1 Outline of Chapter 4
Figure 4.2 An extract of the Reuters-21578 index file
Figure 4.3 Algorithm for distributional clustering
Figure 4.4 Algorithm for contextual sense disambiguation
Figure 4.5 Algorithm for semantic association
Figure 4.6 Algorithm for forming dependency relationship
Figure 4.7 Algorithm for automatic labelling
Figure 4.8 Extraction of ARFF file for category acq
Figure 4.9 Algorithm for incremental learning
Figure 5.1 Outline of Chapter 5
Figure 5.2 Concept map for cluster cornC2
Figure 5.3 Concept map for cluster 010
Figure 5.4 Concept map 1 used for evaluation
Figure 5.5 Concept map 3 used for evaluation
Figure 5.6 Cluster distribution for 10 categories
Figure 5.7 Micro-averaged f-measure of different compression rates (SFS) and IG
Figure 5.8 Macro-averaged f-measure of different compression rates (SFS) and IG
Figure 5.9 Optimum compression rate for Reuters using SFS
Figure 5.10 Extracted from 17760.txt, category "acq"
Figure 5.11 Concept map for cluster acqC2
Figure 5.12 Addition
Figure 5.13 Different perspective of concept addition - board
Figure 5.14 Different perspective of concept addition - chairman, chief, leader
Figure 6.1 Context summarization for ten categories in Reuters
Figure 6.2 Context summarization of "earn", "corn", "crude" and "acq"
Figure 6.3 Context summarization of "corn", "grain" and "wheat"
Figure 6.4 15618.txt, category "wheat"
Figure 6.5 Category membership of 15618.txt, category "grain"
Figure 7.1 Iterative update in incremental ontology learning
Figure 7.2 Proposed process flow for the incremental ontology learning
Figure 7.3 Comparison of concepts discovered with and without instantiation of new concept
List of Tables
Table 2.1 Semantic relations in WordNet (Miller, 1995)
Table 2.2 Probability interpretation (Based on Debole and Sebastiani, 2003)
Table 2.3 Contingency table for binary categorization
Table 2.4 Formulas for computing performance metric (Witten and Frank, 2005)
Table 2.5 Summary of related works in word sense disambiguation
Table 2.6 Processing resources in ANNIE (adapted from (Cunningham, 2002))
Table 5.1 The number of categories in each category set and the number of categories with at least 1 occurrence and 20 occurrences (extracted from Lewis, 1997)
Table 5.2 Qualitative approach as compared to baseline approach experimental result
Table 5.3 Word senses for loss
Table 5.4 Case 1; shell from cluster crudeC2
Table 5.5 Case 2; turkey from cluster shipC2
Table 5.6 Result summary of cluster cornC2
Table 5.7 Result of correlation of concept maps and documents
Table 5.8 Result of label suitability
Table 5.9 F-measure comparison between IG, FSC and SFS
Table 5.10 Compression rate of feature sets selected using SFS
Table 5.11 Compression rate based on cluster-feature ratio, using SFS
Table 5.12 Words selected for category "earn" for IG and SFS
Table 5.13 Words selected for category "acq" for IG and SFS
Table 5.14 Words selected for category "interest" for IG and SFS
Table 5.15 Result of correlation of concept maps and document
Table 5.16 Result of new node addition
List of Papers
Ong, S.C., Kulathuramaiyer, N. and Yeo, A. W. (2007). Conceptual Sense Disambiguation. In Proceedings of the 2007 Conference of Information Technology in Asia (CITA07), Sarawak, Malaysia.
Kulathuramaiyer, N. and Ong, S.C. (2006). Structured Schema Extraction for Ontology Learning. The WICI International Workshop on Web Intelligence (WI) meets Brain Informatics (BI) (WImBI 2006), International WIC Institute, Beijing University of Technology, Beijing, China.
Ong, S.C., Kulathuramaiyer, N. and Yeo, A. W. (2006). Automatic Discovery of Concepts from Text. In Proceedings of the 2006 IEEE / WIC / ACM International Conference on Web Intelligence, Hong Kong, China.
Ong, S.C., Kulathuramaiyer, N. and Yeo, A. W. (2006). Discovery of Meaning from Text. In Proceedings of the Workshop in Language, Artificial Intelligence and Computer Science for Natural Language Processing applications (LAICS-NLP), Bangkok, Thailand.
Chapter 1 Introduction
1.1 Introduction to Concept Discovery
The World Wide Web (WWW) is a huge network of data for information retrieval and information sharing. Although the WWW is a huge success and has an impact on everyday life, it is geared towards human readers only. A computer understands only its layout and its structure but does not understand the content and the semantics. Besides, a specific mapping between data model and data source has to be developed for most IT applications. This data integration causes a bottleneck in application development. If the semantic descriptors of data sources are machine-readable, the process of mapping could be done at least semi-automatically (Harmelen, 2004). The Semantic Web has been proposed in an effort to overcome this problem.
The Semantic Web, a vision of Tim Berners-Lee, is an extension of the WWW, where semantic approaches will be integrated into textual data to improve human-computer co-operation and ease the knowledge acquisition process. Instead of information retrieval, the Semantic Web focuses on knowledge acquisition, where results of the search will be semantically related to the keyword used. In the Semantic Web, a computer will also be able to infer, draw conclusions and further progress to act on the information obtained as requested by the user. To provide support for knowledge acquisition, the Semantic Web relies on an ontology to close the semantic gap.
An ontology is a data model that represents a domain and is used to reason about the objects in a particular domain and the relations between them (Wikipedia, 2006). As an ontology is machine understandable, it can act as a medium or interface to provide a common understanding of knowledge between human and machine (i.e. human and software agents). Components of an ontology include classes or concepts, relations, properties or roles, restrictions or rules, and instances. An ontology consisting of concepts and relations only is considered a "light-weight ontology", while an ontology that includes all five components mentioned is considered a "heavy-weight ontology".
The Semantic Web has however been a challenge to achieve. Integration of knowledge structures is not easy as the Semantic Web is being constructed in a bottom-up manner. Hence, ontology engineering or ontology learning is crucial in the formation of the Semantic Web, considering the large amount of knowledge that is required to make the new web work. As described by Omelayenko (2001), ontology learning tasks can be divided into ontology acquisition and ontology maintenance tasks. These tasks involve the creation of new concepts or an ontology and the maintenance of an existing ontology. Currently, ontology engineering is predominantly done manually by knowledge engineers and domain experts. Manual crafting of the ontology can be time consuming and will require a tremendous amount of human effort. Automation of the process is therefore required to support the creation of ontologies.
Fully automated ontology learning will however be a major challenge. Automating light-weight ontology formation will therefore be a good start to move towards the vision of automating a fully-loaded, heavy-weight ontology. This can be done by exploring concept discovery, which is a process of exploring new concepts based on connections or relationships between objects. A sufficient set of linked concepts will form a light-weight ontology.
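As a rough illustration of the light-weight ontology described above, the following Python sketch stores only concepts and the relations that link them. The class name, the relation label `is_a` and the toy terms are hypothetical choices made for this example; they are not taken from the framework itself.

```python
from dataclasses import dataclass, field


@dataclass
class LightweightOntology:
    """Hypothetical container holding only concepts and relations."""
    concepts: set = field(default_factory=set)
    relations: set = field(default_factory=set)  # (source, relation, target) triples

    def add_relation(self, source, relation, target):
        # Linking two terms also registers both of them as concepts.
        self.concepts.update((source, target))
        self.relations.add((source, relation, target))

    def related(self, concept):
        # Return every concept that the given concept points to.
        return {t for s, _, t in self.relations if s == concept}


# Toy data, assumed for illustration only.
onto = LightweightOntology()
onto.add_relation("wheat", "is_a", "grain")
onto.add_relation("corn", "is_a", "grain")
```

A heavy-weight ontology would additionally carry properties, restrictions and instances, which this sketch deliberately leaves out.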
1.2 Research Problem
[Figure 1.1: Concept discovery (extracted from Leximancer, undated)]
The concept discovery process is employed for knowledge acquisition and to extract the significant part of information. Generally, concept discovery is done by using a natural language processing (NLP) approach and relies on the sentence structure of an input text (Cunningham, 2005). This approach can be extremely knowledge intensive, requiring a large effort in knowledge acquisition. This research, as such, will focus on a hybrid of statistical, machine learning and semantic approaches to discover concepts from a large collection of text.
Although conventional concept discovery works are able to extract concepts and their relationships, the concept maps formed tend to be simplistic, merely representing the co-occurrence of terms within documents (refer to Figure 1.1). Therefore, there is a need for structuring and organizing the concepts. A good way of structuring the concepts is by building a knowledge frame for the base concepts.
Since the information that is available is growing dynamically, concept discovery should also support incremental discovery. For example, instances of base concepts can be acquired from a new input document incrementally to build up the knowledge representation and organization. Hence, the resulting concept map becomes more manageable and more organized. It is able to delineate boundaries between concepts, based on information about usage and relevance. New ways need to be incorporated to automatically discover and visualise deeper structures.
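To make the notion of a co-occurrence-based concept map concrete, the short Python sketch below counts how often pairs of terms appear in the same document. It is only an illustration of the simplistic kind of map discussed above, not the method proposed in this research; the function name `cooccurrence_map` and the toy documents are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_map(documents):
    """Count how often each unordered pair of terms shares a document."""
    pair_counts = Counter()
    for tokens in documents:
        # Use distinct, sorted terms so each pair is counted once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Toy documents, assumed for illustration only.
docs = [
    ["company", "shares", "offer"],
    ["company", "shares", "stake"],
    ["wheat", "grain", "tonnes"],
]
links = cooccurrence_map(docs)
```

Every pair that shares a document becomes a link, which is why such maps grow cluttered quickly and carry no information about the kind of relationship involved.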
Due to the bottom-up development of the Semantic Web, a single concept cannot be directly mapped across ontologies. Thus in each development, the mapping of data model and data source across ontologies is required. A framework for concept discovery that is able to learn incrementally is therefore necessary to overcome this problem. New concepts and mappings can be added incrementally to the ontology as the content of the web grows, hence enhancing reusability.
Since an ontology is meant for developing a common understanding and knowledge sharing, lexical ontologies such as WordNet will be explored to ensure concepts learned are unambiguous and normalised.
1.3 Research Objectives
As knowledge acquisition continues to be a bottleneck in the development of most systems, this research has proposed both a systematic means of automatically acquiring knowledge and the possibility of adding value to it.
The prime focus of this research is to develop a framework that automates the concept discovery process based on a large corpus. A hybrid of statistical, machine learning and semantic approaches for concept discovery will also be explored. The specific objectives of this research are outlined below:
o To explore a framework for a structured concept discovery.
o To identify techniques that can be applied to automatically discover and represent these concepts, from a large collection of documents.
o To characterize the concepts discovered by
  o Assigning labels to clusters
  o Assigning meaning to words. Hence, to explore the possibility of extending word sense disambiguation in a collocation, to a larger discourse or in a context.
o To explore the means of performing incremental learning of concepts.
1.4 Scope
This research focuses on a framework for concept discovery that incorporates a combination of various approaches. The research employs Information Gain (IG) as a means of feature selection and distributional clustering as a means of concept structuring. Previous experiments have shown that IG is able to perform consistently across datasets (Bong and Kulathuramaiyer, 2004).
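For readers unfamiliar with Information Gain, the Python sketch below scores a term by how much knowing its presence or absence reduces the entropy of the category labels. It uses the common entropy-based definition for a binary term-presence feature; whether the experiments here use exactly this variant is an assumption, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(term, documents):
    """documents: list of (set_of_terms, category_label) pairs."""
    all_labels = [label for _, label in documents]
    with_term = [label for terms, label in documents if term in terms]
    without_term = [label for terms, label in documents if term not in terms]
    n = len(documents)
    # Expected entropy after splitting the documents on presence of the term.
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(all_labels) - conditional


# Toy labelled documents, assumed for illustration only.
docs = [
    ({"shares", "company"}, "acq"),
    ({"shares", "stake"}, "acq"),
    ({"wheat", "tonnes"}, "grain"),
    ({"wheat", "company"}, "grain"),
]
ig_shares = information_gain("shares", docs)
ig_company = information_gain("company", docs)
```

In this toy collection "shares" separates the two categories perfectly while "company" appears equally in both, so feature selection would keep the former and discard the latter.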
In addition, WordNet will be employed for word sense disambiguation. WordNet will also be used as a means of determining concept relationships. The framework proposed need not however be limited to WordNet for concept discovery. Other ontologies, such as Goi-Taikei or OntoSem's ontology, can be explored as an alternative. Our experiments have however mainly employed WordNet to demonstrate the extent of concept discovery and organization. Other sources could then be used to further refine the patterns discovered. The use of WordNet is also appropriate for this research as human intervention at the final step is needed anyway. As such, a 'perfect' concept discovery is not required.
Experiments and evaluations for this study are mainly done by using the Reuters-21578 dataset. The training and testing set is compiled from the top ten categories of Topic. WordNet version 2.0 is used as the only knowledge resource for the semantic approaches employed. The syntactic category used is nouns; hence the relationships that are discovered will also be based on nouns. Nouns are explored as they are the major category in WordNet, where almost 70% of the database consists of nouns.
1.5 Chapters Overview
This chapter provides an overview of this dissertation. The following chapter, Chapter 2, discusses and reviews the current work in related areas, which includes semantic approaches and machine learning approaches that are potentially important for this research.
The evaluation