design and implementation of the colex proposed

CHAPTER V
DESIGN AND IMPLEMENTATION OF THE COLEX
PROPOSED
5.1. INTRODUCTION
This chapter discusses the structural design and implementation of the proposed COLEX.
It examines the organization of lexical entries and the file handling for their
storage, retrieval and deletion. An object-oriented design is adopted for the
development of this system. The major objects identified during the design are noun,
verb, postposition, and modifier. These objects are coordinated through the main
module. The data are organized with a B-Tree of order 10. Five files keep the
information needed: word.ind (index file), noun.dat (noun file), verb.dat (verb file),
mdfy.dat (adjective and adverb file), and prst.dat (postposition and other
indeclinables file). To delete an entry, a "$" sign is stored in the index file in
the place of the deleted item. To release the space occupied by deleted items, an
executable file, FD.EXE, has been developed; it is called from the main module through
the DOS system. The programme code is given in Appendix B.
5.2. THE STRUCTURE FOR THE PRESENT LEXICON SYSTEM
The discussions in the previous chapters made it clear that an NLP system needs the
relevant knowledge types. The proposed system therefore provides the following fields
of knowledge:
1. Syntactic category of word
2. Subcategory
3. Semantic knowledge
4. Target equivalent
5. Pragmatic and other information (etymology, synonym/antonym, etc.)
6. Paradigm forms for both languages
With these in mind, a structure is proposed for a bilingual Malayalam-Hindi lexicon
with the following lexical information.
Head word in source language (Malayalam) [field 1]
Category (noun, verb, modifier (adjective/adverb), postposition) [field 2]
Sub-cat. (abstract/concrete, transitive/intransitive, quality/numeral,
temporal/spatial, etc.) [field 3]
Semantic attributes: argument structure (1/2/3 argument, subj/object,
animate/inanimate, etc.) [field 4]
Inflectional (paradigm) forms (infinitive/tense forms: past/present/future,
plural/case forms/participial forms, etc.); derivational forms (verbal noun,
adjective, verb, etc.) [field 5]
Compounds/phrases [field 6]
Synonym/antonym word(s) [field 7]
Target equivalent (Hindi) and gender (if noun) [field 8]
T-Inflectional forms (infinitive/tense forms (with GNP), plural/case
forms/participial forms, etc.); derivational forms (verbal noun, verb,
adjective, etc.)
Example
Mal. head word L1 > paTi-kku 'study' [Field 1]
verb (Tran.) [Field 2]
Arg 2 [Sub (Animate)/obj] [Field 3]
paTi-kku-ka/ paTi-ccu/ paTi-kku-nnu/ paTi-kk-um/ paTi-ppi-kku-nnu. [Field 4]
paTi-ttam/ paT-anam/ paTippu, paTi-cc-a-van/val, [Field 5]
paTippumuri [Field 6]
Hindi L2 > paT- [Field 7]
paT-na/ paT-aa/ paT-ttaa/ paT-uum +GNP, paT-aana [Field 8]
paT-aa-yi, paTnevaala, [Field 9]
paTaayi_kamara [Field 10]
The target equivalent is given as one of the fields, and the source and target words
have the same category and argument structure(s). All the inflectional forms are given
in a regular pattern: for a verb, first the infinitive, then the tense forms (past,
present, future) and the causative; or, as applicable, the plural, case forms, etc.
Inflectional forms can either be entered on-line from the keyboard or be generated
through the morphological analyzer. Since Malayalam does not show any GNP (gender,
number, person) ending, which is important in Hindi, this information must be provided
for the latter, such as the gender specification for nouns. The derivational forms,
compound forms, etc. come in the next fields.
The following are the general attributes which each lexical unit carries.
1. Category specification.
2. Specification of subcategory.
3. Defining semantic features.
4. Related semantic categories.
5. Syntactically related paradigms.
In this system, four major lexical categories are used for convenience: noun, verb,
modifier and indeclinables. Paradigm forms and phrase or compound forms derived
from the first two categories also find a place. As the system is provided with a partial
morphological analyzer, the base or root morph is used as the basic unit, along with a
pattern specification. The bound morphemes, i.e., suffixes and prefixes, are specified
with separate identification. An entry has at least ten fields, i.e., base form, category,
subcategory, argument structure and semantic and pragmatic tags, inflectional and
derivational forms, synonym/antonym words and etymology.
Each category carries different information: syntactico-semantic information,
including the inflectional and derivational forms and the compound and phrasal
formations. The following are the issues concerned with each category.
I. Noun dictionary
1. Syntactic and semantic subcategory (proper noun, pronoun, numeral,
common and abstract noun, etc.)
2. Plural/case forms (all cases)/gender forms, etc.
3. Compound forms and samasas.
4. Semantic equivalents and the gender specification in the target language.
II. Verb dictionary
1. Grammatical sub-category forms such as transitive-intransitive, voice,
causative, regular or irregular, etc.
2. Argument structure/theta role, etc. (case identification, such as subject
and object, etc.)
3. Tense/Aspect/Mood forms (paradigms).
4. Derivations and compounds.
III. Modifier dictionary (includes adjectives and adverbs)
1. Adjectival modifiers, sub-classified as cardinals, ordinals, approximates,
multiplicatives, fractional collectives, measurements, etc.
2. Adverbial modifiers of time, place, manner, direction, etc.
IV. Indeclinables (case specifiers, coordinators, postpositions, etc.)
1. Case denotatives
2. Co-ordinatives
5.3. SCHEMATA OF THE SYSTEM
Different modules are used for different purposes. The step-by-step operations for
the different functions follow.
A simple schematic representation of the lexicon is shown below.
[Fig. 1. Schematic representation of the lexicon; recoverable labels: Lexicon, Data, Index, Verb]
The system is developed with the object-oriented approach in C++ (Turbo C++ Ver
3.0) in a DOS environment. C++ was selected because it is among the most commonly
used programming languages and is rapidly gaining ground. It supports the direct
implementation of abstract data types, and its expandability (reusability), reliability
and readability make modification and maintenance of the system easier.
5.4. FILES AND DATA STRUCTURES
In this package there are five permanent files which keep the database information.
Four of them have the '*.dat' extension; these files are the database for the lexicon.
The fifth file, 'word.ind', is used for building the B-tree, which enables the user
to access the database. The data file of our lexicon consists of entries such as key,
category and associated information. The key word, the category information and
an index are the three fields used in the storage and search of a particular item.
Associated information such as subcategory, etymological tagging, target equivalent,
etc. is stored in a different place. The specific information for the major categories
is: for verb, argument structure (tran-intra), the thematic roles and paradigm forms
(tense: past/present/future, aspect and mood); for noun, subcategory information like
animate/inanimate, abstract/common, personal/pronoun, etc. and paradigms
(number, gender and case forms), etc.; for adjectives, the derived nouns; and for
adverbs, temporal, spatial, etc. The paradigms of the source and target equivalents are
given one-to-one. For modifiers and indeclinables, the information comprises
subcategory, etymology, and target equivalent.
[Figure: data structure of an index entry, with the fields key word, category and index]
The objects and the classes are word, noun, verb, modifier, postposition. The class
structure of each category can be defined as given below:
class noun {
    char x[7],          // key word
         hum_nhum,      // <human-nonhuman>
         etm[3],        // etymology
         mea[8],        // target meaning
         sy[7], at[7],  // synonym, antonym
         pard[5][9],    // source paradigms
         tpd[5][9];     // target paradigms
};
class verb {
    char x[7],          // key word
         tra_intr,      // <intransitive, transitive>
         etm[3],        // etymology
         mea[8],        // target meaning
         sy[7],         // synonym
         at[7],         // antonym
         pard[5][9],    // paradigms
         tpd[5][9],     // target paradigms
         arg[7];        // argument structure
};
class mdfy {
    char x[7],          // key word
         adj_adv,       // adjective/adverb
         etm[3],        // etymology
         mea[8],        // target meaning
         sy[7], at[7];  // synonym/antonym
};
class prep {
    char x[7],          // key word
         etm[3],        // etymology
         mea[8],        // target meaning
         sy[7], at[7];  // synonym/antonym
};
As noted earlier, the system has four data files for the major categories, besides a
common indicator file. A noun data entry takes 131 bytes, a verb entry 152 bytes, a
modifier entry 32 bytes, and a postposition entry 22 bytes. An item in the indicator
file takes 8 bytes.
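Because every entry in a data file has a fixed size, an index can be turned into a byte offset by multiplying it by the record size. The following is a minimal sketch of that idea in standard C++, not the Turbo C++ 3.0 code of Appendix B. The names `recordOffset`, `writeRecord` and `readRecord` are hypothetical, and it is an assumption that the index counts records from zero.

```cpp
#include <cassert>
#include <fstream>
#include <type_traits>

// Byte offset of the record with the given zero-based index (assumed numbering).
template <class Record>
std::streamoff recordOffset(long index) {
    assert(index >= 0);
    return static_cast<std::streamoff>(index) *
           static_cast<std::streamoff>(sizeof(Record));
}

// Overwrites (or appends) the record stored at `index`.
template <class Record>
bool writeRecord(std::fstream& file, long index, const Record& rec) {
    static_assert(std::is_trivially_copyable<Record>::value,
                  "record must be a plain fixed-size structure");
    file.clear();                                   // drop any earlier EOF state
    file.seekp(recordOffset<Record>(index), std::ios::beg);
    file.write(reinterpret_cast<const char*>(&rec), sizeof(Record));
    file.flush();
    return file.good();
}

// Reads the record stored at `index`; returns false if it is not in the file.
template <class Record>
bool readRecord(std::fstream& file, long index, Record& rec) {
    static_assert(std::is_trivially_copyable<Record>::value,
                  "record must be a plain fixed-size structure");
    file.clear();
    file.seekg(recordOffset<Record>(index), std::ios::beg);
    file.read(reinterpret_cast<char*>(&rec), sizeof(Record));
    return file.gcount() == static_cast<std::streamsize>(sizeof(Record));
}
```

With one such file per category, a stored index is enough to reach an entry without scanning the file. The stream is expected to be opened in binary mode for both reading and writing.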
5.5. SEARCH MECHANISM
The B-Tree algorithm has been implemented to perform the search in the lexicon. The
keys appear in lexicographical order from left to right. Each key provides an index to
the related record in the database: at each node of the B-tree, a key and the associated
index to the database are stored. To find a record in the database, one first locates
its key, which gives the corresponding index to the record. The index indicator is
separate for each category. The structure of a node is shown as follows. The order of
the B-tree is 10. The number of keys in a node is between 5 (the order divided by 2)
and 10, with the exception of the root node, which may contain as few as one key.
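The occupancy rule can be made concrete with a node layout. The sketch below is only an assumed layout: the field names, the use of file positions for child links and the helper `validKeyCount` are hypothetical. Only the order (10), the minimum of 5 keys and the 7-character key width (from `char x[7]` in the class listings) come from the text.

```cpp
#include <cassert>

const int ORDER    = 10;         // maximum number of keys in a node (from the text)
const int MIN_KEYS = ORDER / 2;  // minimum number of keys in a non-root node
const int KEY_LEN  = 7;          // key width, as in `char x[7]`

struct BTreeNode {
    int  count;                  // number of keys currently in use
    char keys[ORDER][KEY_LEN];   // keys in lexicographical order
    char category[ORDER];        // category of each key (assumed: 'n', 'v', 'm', 'p')
    long record[ORDER];          // index of each key's record in its data file
    long child[ORDER + 1];       // positions of child nodes; -1 where there is none
};

// True when the node respects the occupancy rule: 5 to 10 keys in general,
// 1 to 10 keys for the root.
bool validKeyCount(const BTreeNode& node, bool isRoot) {
    assert(node.count >= 0);
    const int lowest = isRoot ? 1 : MIN_KEYS;
    return node.count >= lowest && node.count <= ORDER;
}
```

A node with `count` keys has `count + 1` child links, so a search that falls between two keys always has a link to follow.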
5.5.1. Retrieval Mechanism
Once the lexical word and its category are given for retrieval, the search mechanism
is used to load the stored data into the appropriate object. The information about
the lexical item remains available to the user within the object as long as the object
is not overwritten with other information or deleted. Any necessary update can be
made on the object, and the updated object is then stored back into the database.
5.6. DESCRIPTION OF THE MODULES
When the system is executed, it first asks for the 'key' word, followed by its
category. Once the key and the category are given, it asks for confirmation to 'save or
not'. If the answer is 'yes', it selects the format of the respective category for
storing the data and asks for further information such as subcategory, semantic
attributes, etymology, possible paradigms, synonyms, antonyms, and the target
equivalent along with the derivative forms. Semantic attributes are asked for
consecutively: the number of arguments for verbs/adjectives; relational binary
classifications like animate/inanimate, male/female, concrete/abstract, etc. for
nouns; and temporal/comparative/state, etc. for modifiers.
The modules explained by the algorithms are as follows:
1. Main module (lexicon.cpp)
2. Verb module (verb.cpp)
3. Noun module (noun.cpp)
4. Modifier module (mdfy.cpp)
5. Module for post-position and indeclinables (prep.cpp)
6. Module for file handling (btree.cpp)
7. Module for file updation (fd.cpp)
8. Morphological analyser (manl.cpp)
9. Morphological generator module (mgen.cpp)
The morphological analyser module has sub-modules such as:
Module for noun
a. case forms (case.cpp)
b. plural forms (pl.cpp)
c. gender forms (gend.cpp)
Module for verb
a. tense
i. past forms (past.cpp)
ii. non-past forms (prst.cpp)
b. aspects & moods (pard.cpp)
c. participles (part.cpp)
The system has different modules because the content of each category is different.
Modules are implemented dependently and the main module controls the rest.
The working of the modules is explained by the following algorithms.
Main module
Step 1.  Construct the B-Tree.
Step 2.  Set the options (Store / Read and update / Delete / Index / Quit).
Step 3.  If quit, GOTO Step 10.
Step 4.  Take the input word followed by the category from the user.
Step 5.  Check the lexical item in the database.
Step 6.  If present:
             if the option is store, report the existence and GOTO Step 2;
             if the option is read/update, GOTO Step 8;
             if the option is delete, GOTO Step 9.
         If not present:
             if the option is store, collect the corresponding lexical
             information and GOTO Step 7;
             if the option is read/update, report failure and GOTO Step 2;
             if the option is delete, report failure and GOTO Step 2.
Step 7.  Save the data and GOTO Step 2.
Step 8.  Display the data, make the necessary changes and GOTO Step 7.
Step 9.  Remove the item from the B-tree, store "$" in place of the deleted
         item in the index file, and GOTO Step 2.
Step 10. Exit from the system.
For noun category
    if cat is 'n' at Step 4 of the main module,
    select the noun data-structure and store further information.
For verb category
    if cat is 'v' at Step 4 of the main module,
    select the verb data-structure and store further information.
For modifier
    if cat is 'm' at Step 4 of the main module,
    select the modifier (adjective/adverb) data-structure and store further
    information.
For post-position and indeclinables
    if cat is 'p' at Step 4 of the main module,
    select the postposition/indeclinable data-structure and store further
    information.
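The decision taken at Step 6 and the choice of data file by category can be sketched as two small functions. This is an illustrative sketch, not the code of lexicon.cpp: the enumerator names and both function names are hypothetical, the numbering of the menu options is assumed, and the file names are the ones listed in Section 5.1.

```cpp
#include <cassert>

// Menu options; the numeric order is an assumption.
enum Option { STORE = 1, READ_UPDATE, DELETE_ITEM, INDEX, QUIT };

// What the main module does next, following Step 6.
enum Action {
    REPORT_EXISTS,      // store requested, item already present -> Step 2
    COLLECT_AND_SAVE,   // store requested, item absent          -> Step 7
    DISPLAY_AND_EDIT,   // read/update requested, item present   -> Step 8
    REMOVE_ITEM,        // delete requested, item present        -> Step 9
    REPORT_FAILURE      // read/update or delete, item absent    -> Step 2
};

// Step 6: combine the chosen option with the result of the lookup.
Action decide(Option opt, bool present) {
    assert(opt == STORE || opt == READ_UPDATE || opt == DELETE_ITEM);
    if (present) {
        if (opt == STORE) return REPORT_EXISTS;
        if (opt == READ_UPDATE) return DISPLAY_AND_EDIT;
        return REMOVE_ITEM;
    }
    return opt == STORE ? COLLECT_AND_SAVE : REPORT_FAILURE;
}

// Category letter entered at Step 4 -> data file holding that category.
const char* dataFileFor(char cat) {
    switch (cat) {
        case 'n': return "noun.dat";
        case 'v': return "verb.dat";
        case 'm': return "mdfy.dat";
        case 'p': return "prst.dat";
        default:  return nullptr;   // unknown category
    }
}
```

Keeping the decision separate from the input and output code makes each branch of Step 6 easy to check on its own.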
The lexical editor allows the user to enter the data in the following format:
first the root word ('key') in Malayalam, then the category, and so on.
'key'
category
subcategory and semantic attr.
paradigms
inflected forms
derivations
compounds /phrases
etymology
synonyms/antonym
target equivalent
target paradigms
The search operation by the B-tree method and the searching algorithm are given below.
Search:
1. get the key word
2. get the category
3. go to the database and check whether the entry is there or not
4. if yes, retrieve and display the key and the related information
5. if not, say "data not in the database, want to save or not"
The B-tree search can be put in the following way:
step 1: current node := root
step 2: lower := 1 and upper := max key
step 3: if current node = nil
            report failure
            terminate search
step 4: if upper >= lower then
            middle := (upper + lower) / 2
            key := current node [middle]
            if x = key then   { x is the key to be searched; key is the middle key }
                report success
                terminate search
            else
                if x < key then upper := middle - 1
                else lower := middle + 1
            goto step 4
step 5: current node := node pointed to by the pointer between upper and lower
        goto step 2
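The search steps above can be sketched in C++ as follows. This is a minimal in-memory sketch under stated assumptions: the `Node` type with `std::string` keys and pointer links is hypothetical (the system itself keeps its nodes in word.ind), positions are counted from 0 rather than 1, and the function returns -1 where the algorithm reports failure.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Node {
    std::vector<std::string> keys;   // keys in lexicographical order
    std::vector<long> record;        // record index stored with each key
    std::vector<Node*> child;        // empty in a leaf, else keys.size() + 1 links
};

// Returns the record index stored with x, or -1 if x is not in the tree.
long search(const Node* current, const std::string& x) {
    while (current != nullptr) {                       // step 3: nil means failure
        assert(current->child.empty() ||
               current->child.size() == current->keys.size() + 1);
        int lower = 0;                                 // step 2 (0-based here)
        int upper = static_cast<int>(current->keys.size()) - 1;
        while (upper >= lower) {                       // step 4: binary search
            const int middle = (upper + lower) / 2;
            const std::string& key = current->keys[middle];
            if (x == key) return current->record[middle];
            if (x < key) upper = middle - 1;
            else         lower = middle + 1;
        }
        // step 5: lower == upper + 1, so child[lower] is the link that lies
        // between the two keys surrounding x.
        current = current->child.empty() ? nullptr : current->child[lower];
    }
    return -1;
}
```

Each node costs at most about four comparisons when it holds ten keys, and the number of nodes visited equals the height of the tree.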
Read and Update
1. get the key lexical item
2. get the category
3. check in the lexical data
4. if present, report and display
5. if not, say "want to save or not y/n"
Insertion
step 1: current node := root
step 2: call search algorithm
        if search succeeds { report 'key already exists'
        and terminate insert }
step 3: if the node is not a full node then insert the key at the appropriate
        place and terminate.
step 4: split the node into two nodes. If the node is not the root, then move
        the middle key to the parent node.
        Else, make a new root containing only the middle key.
step 5: goto step 3
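The split described in steps 3 to 5 can be sketched as a recursive insertion. Two points differ from the steps as written and should be read as assumptions of this sketch: the key is inserted first and the node is split only when it then holds 11 keys, and the tree is kept in memory with hypothetical `Node` and `Split` types instead of being stored in word.ind.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

const std::size_t MAX_KEYS = 10;   // order of the B-tree stated in the text

struct Node {
    std::vector<std::string> keys;              // sorted keys
    std::vector<std::unique_ptr<Node>> child;   // empty in a leaf
    bool leaf() const { return child.empty(); }
};

struct Split {                     // what a split hands up to the parent
    std::string middle;
    std::unique_ptr<Node> right;
};

// Inserts key into the subtree at n. Returns true when n overflowed and was
// split; `up` then carries the middle key and the new right-hand node.
bool insertInto(Node& n, const std::string& key, Split& up, bool& duplicate) {
    const std::size_t pos =
        std::lower_bound(n.keys.begin(), n.keys.end(), key) - n.keys.begin();
    if (pos < n.keys.size() && n.keys[pos] == key) {   // 'key already exists'
        duplicate = true;
        return false;
    }
    if (n.leaf()) {
        n.keys.insert(n.keys.begin() + pos, key);
    } else {
        Split fromChild;
        if (!insertInto(*n.child[pos], key, fromChild, duplicate)) return false;
        n.keys.insert(n.keys.begin() + pos, fromChild.middle);
        n.child.insert(n.child.begin() + pos + 1, std::move(fromChild.right));
    }
    if (n.keys.size() <= MAX_KEYS) return false;       // no overflow
    assert(n.keys.size() == MAX_KEYS + 1);
    const std::size_t mid = n.keys.size() / 2;         // middle key moves up
    up.middle = n.keys[mid];
    up.right = std::unique_ptr<Node>(new Node);
    up.right->keys.assign(n.keys.begin() + mid + 1, n.keys.end());
    n.keys.resize(mid);
    if (!n.leaf()) {
        for (std::size_t i = mid + 1; i < n.child.size(); ++i)
            up.right->child.push_back(std::move(n.child[i]));
        n.child.resize(mid + 1);
    }
    return true;
}

// Returns false if the key was already present, true if it was inserted.
bool insert(std::unique_ptr<Node>& root, const std::string& key) {
    if (!root) root = std::unique_ptr<Node>(new Node);
    Split up;
    bool duplicate = false;
    if (insertInto(*root, key, up, duplicate)) {       // the root itself split
        std::unique_ptr<Node> newRoot(new Node);       // new root: middle key only
        newRoot->keys.push_back(up.middle);
        newRoot->child.push_back(std::move(root));
        newRoot->child.push_back(std::move(up.right));
        root = std::move(newRoot);
    }
    return !duplicate;
}
```

Splitting an 11-key node leaves 5 keys on each side of the middle key, which matches the minimum occupancy given in Section 5.5.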
Modification
step 1: current node := root
step 2: call search algorithm
        if search fails then
        report 'key does not exist' and terminate modification
step 3: make appropriate modifications.
Deletion
step 1: current node := root
step 2: call search algorithm
        if search fails { report 'key does not exist'
        and terminate delete }
step 3: remove the key from the node
step 4: if the number of keys in that node is less than half the B-tree order,
        go to step 5.
step 5: rearrange the nodes if required.
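Besides rearranging the tree, the main module stores "$" in the index file in place of a deleted item, and FD.EXE later releases the space. A minimal sketch of that two-stage scheme follows. The `IndexEntry` layout and all function names are hypothetical, the index is held in a vector instead of word.ind, and placing the marker in the first character of the key is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <vector>

struct IndexEntry {      // assumed layout, not the 8-byte format of word.ind
    char key[7];         // key word, or "$" once the item is deleted
    char category;       // 'n', 'v', 'm' or 'p'
    long record;         // index of the entry in its data file
};

const char DELETED_MARK = '$';

IndexEntry makeEntry(const char* key, char category, long record) {
    assert(key != nullptr);
    IndexEntry e;
    std::memset(e.key, 0, sizeof e.key);
    std::strncpy(e.key, key, sizeof e.key - 1);   // keeps the key NUL-terminated
    e.category = category;
    e.record = record;
    return e;
}

// Stage 1 (main module): overwrite the key with the "$" marker.
void markDeleted(IndexEntry& e) {
    std::memset(e.key, 0, sizeof e.key);
    e.key[0] = DELETED_MARK;
}

bool isDeleted(const IndexEntry& e) { return e.key[0] == DELETED_MARK; }

// Stage 2 (the role played by FD.EXE): drop the marked entries and
// return how many slots were released.
std::size_t compact(std::vector<IndexEntry>& index) {
    const std::size_t before = index.size();
    index.erase(std::remove_if(index.begin(), index.end(), isDeleted),
                index.end());
    return before - index.size();
}
```

Marking is cheap and leaves the file layout untouched, so the more expensive compaction can be run separately when enough entries have been deleted.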
The starting menu offers the user five options: Store, Read or Update,
Delete, Index and Exit. Integers 1-5 are taken as the input choice.
Data Files Stored
The following are the lexical data files produced/used by the system.
1. Word.ind (Index data)
2. Verb.dat (Verb data)
3. Noun.dat (Noun data)
4. Modf.dat (Modifier data)
5. PP.dat (Post-position and indeclinables)
One can get and control the information according to one's requirements, for example,
get the data in text form for printing or other human use. Sample lexical data along
with the menu options are provided in Appendix C.
A brief overview of the advantages, drawbacks, data limits and prospects of this
system is given in the concluding chapter.
