UNL-AUB-2011

Universal Networking Language Thesis Part II
An “enconverter”, which converts Bengali text into UNL, plays a core role in the
UNL system. It is essential that the enconverter be capable of expressing information in
UNL with very high accuracy. It consists of a word dictionary and conversion rules for a
language. The engine itself is language-independent software that is applicable to any
language. It takes Bengali text as input and generates the target UNL expressions with the
help of various database files such as lexicon files and morphological rule files.
The en-converter can be logically partitioned into three phases:
1) syntax planning phase
2) case marking phase
3) morphology phase
The syntax planning phase aims to generate the proper sequence of words for the
target sentence. This phase first reads the input file and converts it into a semantic-net-like
structure known as a node-net. A node-net is a directed acyclic graph (DAG) that
represents the sentence. We use lexicon files to map the UWs to
target language words. Once the node-net has been generated, the problem of syntax plan
generation reduces to the problem of DAG traversal. Proper traversal of the node-net
generates the syntax plan of the target sentence.
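The reduction of syntax planning to DAG traversal can be illustrated with a minimal Python sketch. The node-net layout, the relation ordering and the function name are assumptions made for illustration only; they are not taken from the system described here.

```python
# Hypothetical node-net for "Rahim saw Poly": each node maps a relation
# label to a child node. The layout is an assumption for illustration.
node_net = {
    "see": {"agt": "Rahim", "obj": "Poly"},
    "Rahim": {},
    "Poly": {},
}

# Assumed ordering for a verb-final language: agent, then object, then verb.
RELATION_ORDER = ["agt", "obj"]


def syntax_plan(net, entry):
    """Depth-first traversal that emits children before their parent."""
    plan = []
    visited = set()

    def visit(node):
        if node in visited:  # a DAG node may be reachable from several parents
            return
        visited.add(node)
        children = net.get(node, {})
        # Only relations listed in RELATION_ORDER are followed in this sketch.
        for rel in RELATION_ORDER:
            if rel in children:
                visit(children[rel])
        plan.append(node)

    visit(entry)
    return plan


print(syntax_plan(node_net, "see"))
```

Visiting children before the parent places the verb last, which matches the subject-object-verb order of Bengali; a real syntax planner would need ordering rules for every relation label.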
The syntax planning phase generates a sequence of words, which by itself cannot express the
complete content of the sentence. This syntax plan needs to be processed by the case
marking phase, which applies the proper case marker for each relation. The output of the case
marking phase is then processed by the morphology phase. The morphology phase gives
the final form of the target sentence.
Here we discuss the structure of the Bengali sentence and, based on this, describe the basic
idea of the generator system.
Enconverter:
An "enconverter" is software that automatically or interactively enconverts natural
language text into UNL. UNU/IAS developed enconversion software called
"EnCo", which constitutes an enconverter together with a word dictionary, a co-occurrence
dictionary and conversion rules for a language. "EnCo" is language-independent
software, so it is applicable to any language.
Because an "enconverter" generates UNL from natural languages, it enables people to make
UNL documents without any knowledge of UNL. This means that users of the UNL
system do not need to learn UNL. This makes UNL quite different from Esperanto, for
instance.
Deconverter:
A "deconverter" is software that automatically deconverts UNL into native languages. It
is important that it achieves high-quality and correct results. It is also important that the
basic architecture of the "deconverter" is widely shared throughout the world, in order to
treat all languages with the same quality and precision standards. Technology developed
for one language can be applied to other languages as long as the architecture is shared. A
"deconverter", which generates natural language from UNL, plays a core role in the UNL
system. It is essential that the "deconverter" be capable of expressing UNL information
with very high accuracy.
It follows that information, once composed in UNL, can be understood in any language
as long as there is a "deconverter" for that language.
Bangla Sentence Structure and Representation
One of the main motivations towards the generation system is the subject-object-verb
structure of Bengali as against subject-verb-object structure of English.
E.g.:
Rahim saw Poly
Subject – verb – object
Rahim poly ke dekhechilo
Subject - object – verb
Simple UNL Representations
Simple sentences in UNL are represented through a node-net by taking into consideration
all the relations they have in the UNL expression.
Consider the example: Rahim lives in Dhaka
Multiple UNL Representations
Clausal sentences can be represented in more than one way in UNL, viz. either using a
scope node (Compound-UW) or using multiple parents on some of the nodes.
Consider the same example:
"Rahim saw Poly who lives in Dhaka".
This is the first representation, called the hyper-graph representation:
The second way to represent this sentence is using multiple parents:
Both are valid UNL representations of the same sentence. In this representation, the node
"live" has no parents, and starting from the node marked "entry" we cannot reach this
node (live) by following only child pointers. Such nodes are called "Orphans".
However, this is not the case with compound sentences.
For example, "Karim likes singing and dancing".
It has only one representation.
Based on this observation the UNL representation of a sentence can be classified in the
following way:
The Case of aoj-Parents:
Multiple parents can also be encountered in the case of the aoj relation, as in the following
example: "American novelist won Nobel Prize".
Its UNL representation is:
Here, even though we do not have a clausal or compound sentence, we see a case of multiple
parents. This is exceptional behavior and is not shown by any other relation.
Bangla – English Dictionary
A Bangla to English dictionary is the source for building a Bangla to UNL dictionary, as universal
words are English words mandated by UNL. Such dictionaries also provide all attributes
along with the meaning of a word. Each entry in the dictionary is written in the following format
[1]:
[HW] {ID} “UW” (ATTRIBUTE1, ATTRIBUTE2 . . .) <FLG, FRE, PRI>
Here,
HW <- Head Word (Bangla word)
ID <- Identification of Head Word (omittable)
UW <- Universal Word
ATTRIBUTE <- Attribute of the HW
FLG <- Language Flag
FRE <- Frequency of Head Word
PRI <- Priority of Head Word
Some example entries of dictionary for Bangla are given below:
[shahor] {} “city(icl>region)” (N, PLACE) <B, 0, 0>
[prochur] {} “huge(icl>big)” (ADJ) <B, 0, 0>
[karim] {} “person(icl>name)” (N, PERSON) <B, 0, 0>
Here the attributes,
N stands for Noun
PLACE stands for place
ADJ stands for Adjective
FLG field entry is B which stands for Bangla
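To make the entry format concrete, the following Python sketch parses one dictionary line into its fields. It is only an illustration: the regular expression, the function name `parse_entry` and the use of straight double quotes around the UW (instead of typographic quotes) are assumptions, not part of any UNL tool.

```python
import re

# Assumed line layout: [HW] {ID} "UW" (ATTR1, ATTR2, ...) <FLG, FRE, PRI>
ENTRY_RE = re.compile(
    r'\[(?P<hw>[^\]]*)\]\s*'    # [HW]  head word
    r'\{(?P<id>[^}]*)\}\s*'     # {ID}  identification, may be empty
    r'"(?P<uw>[^"]*)"\s*'       # "UW"  universal word
    r'\((?P<attrs>[^)]*)\)\s*'  # (..)  grammatical attributes
    r'<(?P<flags>[^>]*)>'       # <FLG, FRE, PRI>
)


def parse_entry(line):
    """Split one dictionary entry into a dict of named fields."""
    match = ENTRY_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    flg, fre, pri = [part.strip() for part in match.group("flags").split(",")]
    return {
        "headword": match.group("hw").strip(),
        "id": match.group("id").strip(),
        "uw": match.group("uw").strip(),
        "attributes": [a.strip() for a in match.group("attrs").split(",") if a.strip()],
        "flag": flg,
        "frequency": int(fre),
        "priority": int(pri),
    }


entry = parse_entry('[shahor] {} "city(icl>region)" (N, PLACE) <B, 0, 0>')
print(entry["headword"], entry["uw"], entry["attributes"])
```

Keeping the attributes as a list makes it easy for a rule to test conditions such as "is N" or "is PLACE" on a candidate word.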
Bangla to UNL En-Conversion
An EnConverter is software that automatically or interactively enconverts
natural language text into UNL. As it generates UNL from natural languages, it enables
people to make UNL documents without any knowledge of UNL. This means that users
of the UNL system do not need to learn UNL.
EnConverter is a language independent parser that provides synchronously a
framework for morphological, syntactic and semantic analysis. It would be impossible
to solve an ambiguity in morphological analysis without the use of syntactic or semantic
information. Also, it would be impossible to solve an ambiguity in syntactic analysis
without the use of semantic information.
EnConverter generates UNL expressions from sentences (or lists of words of
sentences) of the Bengali language by applying enconversion rules. In addition to the
fundamental function of enconversion, it checks the formats of rules and outputs
messages for any errors. It also outputs the information required for each stage of
enconversion at different levels. With these facilities, a rule developer can easily develop
and improve rules by using EnConverter.
First, EnConverter converts or loads the enconversion rules; the rule checker works while
the rules are being converted. Once the rules have been converted, they are stored automatically and can be used
directly the next time without rule conversion.
-- Convert or load the rules
Secondly, it inputs a string or a list of morphemes/words of a sentence of the Bengali
language.
-- Input a Bengali sentence
Then, it starts to apply rules to the Node-list from the initial state. The process of rule application is to find a
suitable rule and to take actions or operate on the Node-list in order to create the syntactic
functions and the UNL network using the nodes in the Analysis Windows. If a string
appears in a window, the system retrieves the Word Dictionary and applies the rule to
the candidate word entries. If a word satisfies the conditions required for
the window of a rule, this word is selected and the rule application succeeds. This process
continues until the syntactic functions and the UNL network are completed and only
the entry node remains in the Node-list.
-- Apply the rules and retrieve the Word Dictionary
Finally, it outputs the UNL network (Node-net) to the output file in the binary relation
format of UNL expressions.
-- Output the UNL expressions
With the exception of the first process of rule conversion and loading, once EnConverter
starts to work, it repeats the other processes for all input sentences.
EnConverter analyses a sentence using the Word Dictionary, Knowledge Base,
and Enconversion Rules. It retrieves relevant dictionary entries from the Word
Dictionary, operates on nodes in the Node-list by applying Enconversion Rules, and
generates semantic networks of UNL by consulting the Knowledge Base.
[Figure: EnConverter process flow. Input a Bengali sentence; apply the Bengali Enconversion Rules and retrieve the Bangla Word Dictionary, consulting the Knowledge Base; output the UNL expressions.]
The word entries of the Bengali language are stored in the Word Dictionary. Each entry
of the Word Dictionary is composed of three kinds of elements: the Headword, the
Universal Word (UW) and the Grammatical Attributes. A headword is the notation (surface form)
of a Bengali word that occurs in the input sentence, and it is used as a
trigger for obtaining equivalent UWs from the Word Dictionary in enconversion. A UW
expresses the meaning of the word and is used in creating the UNL networks (UNL
expressions) of the output. Grammatical Attributes are the information on how the word
behaves in a sentence, and they are used in enconversion rules.
Bengali Language specific and language independent information
All possible relations between each pair of UWs are defined in the UNL
Knowledge Base (KB) using the UW system, a kind of hierarchy of UWs, with certainty
values. When a relation is being established between two UWs by applying an
enconversion rule, EnConverter consults with the UNL KB. If the relationship is
approved, EnConverter will establish the relation between the two UWs (i.e., it will
connect the two UWs using the relation label) and the rule application succeeds. If the
relationship is not approved, no relation will be established between the two UWs and the
rule application fails. To utilize the KB function, all the UWs used in a native language
must be linked in the UNL KB.
For example, for the binary relation agt, the Knowledge Base requires that
UW1 – should be an action (do), and
UW2 – should be a thing.
E.g. ‘sha douray’ (he runs)
‘sha’ - a thing
‘douray’ – do (action)
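The approval step can be sketched in Python as a lookup against a small UW hierarchy. The hierarchy, the constraint table and the function names below are hypothetical stand-ins for the UNL KB, and the certainty values mentioned above are left out for brevity.

```python
# Hypothetical fragment of the UW hierarchy: each UW points to its parent.
UW_PARENT = {
    "run(icl>do)": "do",
    "he(icl>person)": "person",
    "person": "thing",
}

# Hypothetical KB constraints: relation -> (class required of UW1, of UW2).
KB_CONSTRAINTS = {
    "agt": ("do", "thing"),
}


def is_a(uw, cls):
    """Walk up the hierarchy from uw and report whether cls is reached."""
    while uw is not None:
        if uw == cls:
            return True
        uw = UW_PARENT.get(uw)
    return False


def approve(relation, uw1, uw2):
    """Return True if the KB allows relation(uw1, uw2)."""
    if relation not in KB_CONSTRAINTS:
        return False
    cls1, cls2 = KB_CONSTRAINTS[relation]
    return is_a(uw1, cls1) and is_a(uw2, cls2)


print(approve("agt", "run(icl>do)", "he(icl>person)"))
print(approve("agt", "he(icl>person)", "run(icl>do)"))
```

With this table the first call is approved because "run" is a kind of "do" and "he" is a kind of "thing"; the second call is rejected because the arguments are swapped, so the rule application would fail.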
An Enconversion Rule is composed of Conditions for the nodes placed on the
Analysis Windows and Actions and/or Operations for those nodes. Such
enconversion rules describe the kinds of actions and/or operations that should be
carried out for all phenomena of a language, and under what conditions. EnConverter
finds the most suitable rule each time and creates a UNL expression. The set of UNL
expressions of a sentence is completed once all the
necessary rules have been applied.
Basically, the Enconverter needs certain information from the input sentences. This
information is available at various linguistic levels: the morphological, syntactic or
semantic level. The amount and type of information available at each level is largely
dependent on the characteristics of the language. This means that the design of the
Enconverter is determined by what information UNL needs and by the nature of the
language, which decides the type of information that can be extracted from the various
linguistic levels. UNL has separate concepts for noun, verb, adjective and adverb; in other
words, there is a need to syntactically categorize the words of the sentence. For a noun, the
attribute to be included in the concept definition in UNL is number. For a verb, the
concept definition requires a tense marker. The next important part of UNL is the
definition of relations. These include case relations of noun concepts with the
corresponding verbs, association of adverbial components with verb definitions, and
association of adjectival components with noun definitions.
In this paper, the extraction of the information needed to build the UNL
structures from the various linguistic levels of the Bengali language is discussed.
MORPHOLOGICAL ANALYSIS OF BANGLA WORDS
There are five types of words (parts of speech) in Bangla sentences. So we consider the
following five types of morphologies.
A. Noun morphology:
Bangla nouns have a strong, regular inflectional morphology based on case.
The case of a noun may be nominative (“chele”, boy), accusative (“cheleke”, to the boy),
genitive (“chele-r”, of the boy) and so on. Gender and number are also important for
identifying the proper categories of nouns. Number may be singular (“boi”, book) or plural
(“boi-gulo”, books). The gender of a noun can be masculine (“vai”, brother), feminine
(“bon”, sister), common (“shishu”, child) or neuter (“kolom”, pen) [8]. So, some
dictionary entries may look like,
[Chele] {} “boy(icl>person)” (N) <B, 0, 0>
[ke] “ke” (NMORP) <B, 0, 0>
[r] “r” (NMORP) <B, 0, 0>
B. Adjective morphology:
Here we consider the Bangla words “shahosh” and “valo”, meaning “bravery” and “good” in
English respectively. From the first word we get “shahosh-i”, “shahosh-er” etc., and from
the second word we get “valo-ke”, “valo-ta” etc. So, we may have,
[Shahosh] {} “bravery(iof>quality)” (N) <B, 0, 0>
[Valo] {} “good(iof>quality)” (N) <B, 0, 0>
[i] “i” (ADJMORP) <B, 0, 0>
[ta] “ta” (ADJMORP) <B, 0, 0>
C. Pronoun morphology:
We can consider the word root “taha” (he/she). From it we get “taha-ke”, “taha-ra”,
“taha-der” etc. So, we may have,
[Taha] {} “he(iof>person)” (PRP) <B, 0, 0>
[ke] “ke” (PROMORP) <B, 0, 0>
[der] “der” (PROMORP) <B, 0, 0>
D. Preposition morphology:
Prepositions like “ebong” (and), “jonno” (for) etc. used in Bangla are also called
indeclinable. They do not undergo any morphological change. So morphologically we
can represent them as
[ebong] {} “and” (PRE) <B, 0, 0>
[jonno] {} “for” (PRE) <B, 0, 0>
E. Verb morphology:
The diversity of verb morphology in Bangla is very significant [8]. For example, if we
consider “kha” (eat) as a root word, then after adding “be” we get the word “khabe”, which
means will eat (for second and third person). Similarly, after adding “i” we get the word
“khai”, which means eat (for first person). Here one word represents the future indefinite
tense of the root word “kha” and the other represents the present indefinite tense, for a
different person. Therefore, by morphological analysis, we get the grammatical attributes
of the main word and other attributes. For this reason, we have applied morphological
analysis for different persons with different transformations to find the actual
meaning of the word. Some examples of the morphological structure of Bangla-UNL
dictionary entries are given below [11, 12]:
[kor] {} “do(icl>do)” (List of Semantic and Syntactic
Attributes) <B, 0, 0>, where “kor” is the LCLU (Longest Common Lexical
Unit) of kor-ebe, kor-che, kor-eteche, kor-chilo and kor-a, and a verb morphology entry is:
[-ebe] “ebe” (VMORP, FUTURE) <B, 0, 0>
The Bangla word “ja” means “go”. Some possible transformations are [3, 9]:
[ja] {} “go(icl>do)” (V, @present) <B, 0, 0>
For third person:
[-e] {} “go(icl>do)” (V, @present) <B, 0, 0>
[giechilo] {} “go(icl>do)” (V, @past) <B, 0, 0>
[-be] {} “go(icl>do)” (V, @future) <B, 0, 0>
...
For second person:
[-o] {} “go(icl>do)” (V, @present) <B, 0, 0>
[-n] {} “go(icl>do)” (V, @present) <B, 0, 0>
[giechilen] {} “go(icl>do)” (V, @past) <B, 0, 0>
[-ben] {} “go(icl>do)” (V, @future) <B, 0, 0>
For first person:
[-i] {} “go(icl>do)” (V, @present) <B, 0, 0>
[-giechi] {} “go(icl>do)” (V, @past) <B, 0, 0>
[-bo] {} “go(icl>do)” (V, @future) <B, 0, 0>
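The root-plus-suffix analysis described for verbs can be sketched as a simple table lookup in Python. The tables are small hypothetical excerpts built from the suffixes listed above; the UW for “kha” is an assumption (the text only glosses it as “eat”), and irregular forms such as “giechilo” are not handled.

```python
# Roots and their UWs. "go(icl>do)" comes from the entries above;
# "eat(icl>do)" is an assumed UW for the root "kha".
ROOTS = {
    "ja": "go(icl>do)",
    "kha": "eat(icl>do)",
}

# Suffix -> (tense attribute, person), following the lists above.
# Note: the text also allows "be" for second person; the list puts it under third.
SUFFIXES = {
    "e": ("@present", "third"),
    "be": ("@future", "third"),
    "o": ("@present", "second"),
    "ben": ("@future", "second"),
    "i": ("@present", "first"),
    "bo": ("@future", "first"),
}


def analyze(word):
    """Split word into a known root and suffix; return (UW, tense, person) or None."""
    for root, uw in ROOTS.items():
        suffix = word[len(root):]
        if word.startswith(root) and suffix in SUFFIXES:
            tense, person = SUFFIXES[suffix]
            return uw, tense, person
    return None


print(analyze("khabe"))
print(analyze("khai"))
print(analyze("jaben"))
```

A word that does not decompose into a listed root and suffix yields `None`, which is where a fuller analyzer would fall back to whole-word dictionary entries.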
SYNTACTIC-SEMANTIC ANALYSIS OF BANGLA WEB DOCUMENTS
The syntax and semantics of any natural language are complex and ambiguous, and natural
language processing (NLP) is a long-term goal of researchers in the 21st century.
UNL is an attempt to address machine translation (MT) by expressing the syntax and
semantics of languages in a unified way through a rich set of syntactic and semantic
attributes, as well as using a knowledge base (KB) to represent information, i.e. meaning,
sentence by sentence. Sentence information is represented as a hyper-graph having
Universal Words (UWs) as nodes and relations as arcs. A simple natural language
sentence is expressed as a long list of relations between concepts to capture syntactic and
semantic information. We have done limited work so far to build a UNL-Bangla system.
We explain our effort with a simple Bangla sentence below.
Alal ebong Dulal bazar-e ja-chhe
Bangla to UNL translation explaining how syntax and semantics are captured in UNL is
as follows.
/<</[Alal]/[ebong]/”Dulal bazar-e ja-chhe”/>>/
The proper noun “Alal” and the conjunction “ebong” can be combined and an “and”
relation can be initiated.
/<</[Alal ebong]/[Dulal]/”bazar-e ja-chhe”/>>/
After the “and” relation is made between “Alal” and “Dulal”, “Alal” and “ebong Dulal” are
deleted (reduced) and a scope ID is added, say, 01.
The current sentence in the analysis windows is as follows:
/<</1:01/[bazar-e]/” ja-chhe”/>>/
Then, the analyzer looks ahead further right, beyond the noun phrase “bazar-e”, to get the
verb “ja”. As the noun “bazar” can be resolved with the verb “ja”, a ‘reduce’ action takes
place: an “obj” relation is created
between the noun “bazar-e” and the verb “ja-chhe”, and “bazar-e” is
deleted.
/<</ [:01]/ [ja-chhe]/>>/
Next, “agt” relation is created between the scope 01 and the verb “ja-chhe”. The analysis
windows will have
/<</ ja-chhe/[>>]/
A final right shift is performed that attaches the attribute @entry to the last word, “ja-chhe”,
left in the node list. The Bangla to UNL dictionary is searched to replace “ja-chhe”
by the UW “go(icl>do).@present.@cont”. A verb is the main word of a sentence and
most of the relations are created involving it. The UNL output is:
and:01(Dulal(icl>male), Alal(icl>male))
obj(go(icl>do).@entry.@present.@cont, …(icl>place))
agt(go(icl>do).@entry.@present.@cont, :01)
The UNL parser can translate the UNL sentence to
English as follows:
Alal and Dulal is going to bazar.
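The sequence of reductions in this walkthrough can be mimicked with a very small Python sketch. It hard-codes three assumed rules that fit this one sentence pattern (a conjunction, an object, an agent) and works on surface tokens rather than UWs; it is not the EnConverter rule engine, and the function name and rule order are illustrative assumptions.

```python
def enconvert(tokens):
    """Reduce a verb-final token list to its entry node, collecting relations."""
    nodes = list(tokens)
    relations = []

    # Assumed rule 1: "X ebong Y" -> and:01(Y, X); the three nodes
    # collapse into the scope node ":01".
    if "ebong" in nodes[1:-1]:
        i = nodes.index("ebong", 1)
        relations.append(("and:01", nodes[i + 1], nodes[i - 1]))
        nodes[i - 1:i + 2] = [":01"]

    # The last node is taken to be the verb (Bengali is verb-final).
    verb = nodes[-1]

    # Assumed rule 2: the node directly left of the verb -> obj(verb, node),
    # and that node is deleted from the node list.
    if len(nodes) > 2:
        relations.append(("obj", verb, nodes.pop(-2)))

    # Assumed rule 3: the one remaining left node -> agt(verb, node).
    if len(nodes) == 2:
        relations.append(("agt", verb, nodes.pop(0)))

    return nodes, relations


nodes, relations = enconvert(["Alal", "ebong", "Dulal", "bazar-e", "ja-chhe"])
print(nodes)
for rel in relations:
    print(rel)
```

After the three reductions only the verb remains in the node list, which corresponds to the point at which the entry attribute is attached and the dictionary lookup replaces surface forms with UWs.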
Syntactic functional grouping based on knowledge base of UNL
The free word order of the language is made possible by two mechanisms, namely richness in
morphology and the cues available that enable syntactic grouping of words. Though
morphological endings allow the thematic cases of the nouns to be determined, the complete
binary relation can be constructed only with the help of syntactic functional groupings.
The syntactic grouping, or the job of the parser in the case of the Bangla EnConverter,
is to obtain the information needed to fill the binary relations obtained from the morphological
analyzer phase. Though the morphological level does provide information about the
possible case relations between the main verb and noun components, there is ambiguity in
some cases. This ambiguity is resolved using a combination of information obtained at the
syntactic and semantic levels. In this paper the parser has been designed to extract the
required syntactic and semantic information in order to build the UNL relations. The
parser mainly helps in deciding on one among the thematic cases suggested by the
morphological analyzer and in forming the corresponding case-based UNL binary relation.
Take the example of the sentence ‘Sha dokaner dike dour diyechilo’ (he ran to the shop).
Fig 2. Syntactical structure of the sentence ‘sha dokaner dike dour diyechilo’
Fig 3. Possible binary relation for the sentence ‘sha dokaner dike dour diyechilo’
Fig 4. Equivalent Possible binary relation in UNL format
The noun without a case ending is normally the agent, in this case ‘Sha’. The word
‘Dour Diyechilo’ is the action verb (run), which has been morphologically analyzed and
obtained from the morphological analyzer. So the two words ‘Sha’ and ‘Dour Diyechilo’ are
linked with the binary relation ‘agt’. The binary relation ‘agt’ can be assigned in several
ways: if the sentence contains only one noun in the nominative case, or the UNL dictionary contains
person information for a word, or the subject agrees with the verb, then the ‘agt’ relation is
assigned. The word dictionary, in contrast, specifies ‘Dokan’ to be a location.
The words ‘Dokan er Dike’ and ‘Dour Diyechilo’ are therefore linked with the binary relation ‘plt’
(final place). For the ‘plt’ relation,
UW1 - is an event or state, and
UW2 - is a place or thing defining a place (locative case).
The equivalent UNL format of the above sentence is as follows.
[S]
{unl}
[W]
he(icl>person).@generic : 0
shop(icl>place).@generic : 1
run(icl>do).@past.@entry : 2
[/W]
[R]
2 agt 0
2 plt 1
[/R]
{/unl}
[/S]
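The numbered word list and relation list shown above can be expanded into explicit binary relations with a short Python sketch. The parser below assumes exactly the [W]/[R] layout used in this example; the function names are illustrative and not part of any UNL tool.

```python
UNL_TEXT = """[S]
{unl}
[W]
he(icl>person).@generic : 0
shop(icl>place).@generic : 1
run(icl>do).@past.@entry : 2
[/W]
[R]
2 agt 0
2 plt 1
[/R]
{/unl}
[/S]"""


def parse_unl(text):
    """Return ({index: UW}, [(relation, source index, target index)])."""
    words, relations = {}, []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in ("[W]", "[R]"):
            section = line
        elif line in ("[/W]", "[/R]"):
            section = None
        elif section == "[W]" and line:
            # Split on the last colon: UWs themselves contain no colon here.
            uw, _, index = line.rpartition(":")
            words[int(index)] = uw.strip()
        elif section == "[R]" and line:
            source, relation, target = line.split()
            relations.append((relation, int(source), int(target)))
    return words, relations


def to_binary(words, relations):
    """Render each relation as relation(UW1, UW2)."""
    return [f"{rel}({words[a]}, {words[b]})" for rel, a, b in relations]


words, relations = parse_unl(UNL_TEXT)
for line in to_binary(words, relations):
    print(line)
```

Each printed line pairs the verb node with one of its arguments, which makes it easy to compare the listing against the agt and plt constraints discussed above.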
The morphological analyzer detects adverbs and adjectives; however, the
intensifiers corresponding to these are determined by their position in the sentence. Thus
the parser is needed to identify the adjectives or adverbs to which the intensifiers are
attached in order to form the appropriate UNL modifier relation.
Take the example sentence ‘sha khub shundor vabe dowriyechilo’ (he ran very beautifully).
Here the intensifier ‘khub’ comes just before the adverb ‘shundor’, so this
intensifier modifies the adverb. These words are linked with the ‘mod’ relation. In a
sentence, an intensifier normally occurs before an adjective or adverb.
The UNL equivalent of the above sentence is
[S]
{unl}
[W]
he(icl>person).@generic : 0
very(icl>concept).@generic : 1
run(icl>do).@past.@entry : 2
beauty(aoj>thing) : 3
[/W]
[R]
2 agt 0
3 man 2
1 mod 3
[/R]
{/unl}
[/S]
Here the word ‘Sha’ is in the nominative case and it agrees with the
verb ‘Dour’, so these two words are linked with the ‘agt’ relation. The phrase ‘Shundor Vabe’ is an
adverb and it defines the characteristics of the event ‘Dour’, so these words are linked
with the relation ‘man’. The intensifier ‘Khub’ modifies the adverb ‘Shundor’, so
these two words are linked with the relation ‘mod’.
Participles in Bengali may function as postpositions when occurring with nouns or as
adverbial participles when occurring with verbs. At the morphological level, if the
participle is attached to the word, then the function of the participle is identified. However,
in Bengali it is possible for the participle to occur in isolation. In this case the parser is
needed to decide on the function of the participle and to form the corresponding
UNL relation.
Take the example sentence
‘mazai vanta pootu avaL paaTinaaL’ (when the rain came she sang)
In this sentence the phrase ‘mazai vanta pootu’ is an adverbial phrase, because it
modifies the verb ‘paTu’.
[S]
{unl}
[W]
she(icl>person).@generic : 0
sing(icl>do).@past.@entry : 1
come(icl>do).@adv.@generic : 2
rain(icl>thing).@generic : 3
[/W]
[R]
1 agt 0
1 tim 2
2 mod 3
[/R]
{/unl}
[/S]
Take the example sentence
‘gotkal j lokti gan geyechilo sha gram e fire gelo’ (the man who sang yesterday went to the
village)
The UNL format of the above sentence is as follows.
[S]
{unl}
[W]
man(icl>person).@generic : 0
go(icl>do).@past.@entry : 1
village(icl>place).@generic : 2
yesterday(icl>time).@generic : 3
sing(icl>adj).@generic : 4
[/W]
[R]
1 agt 0
1 plc 2
0 mod 4
4 tim 3
[/R]
{/unl}
[/S]
In the same way, for all the binary relations, the UNL Knowledge Base (KB) stores all the
information about the relation.
Endings in Bengali words are used to find most of the binary relations. Endings,
together with syntactic functional grouping and semantic information, are used to identify the
correct binary relation.
Ambiguity Handling
Generally there are two kinds of ambiguity, local ambiguity and global
ambiguity, when dealing with natural language sentences. Local ambiguity means that
part of a sentence can have more than one interpretation, but not the whole sentence.
Global ambiguity means that the whole sentence can have more than one interpretation.
Local ambiguity can sometimes be resolved by syntactic analysis. In a syntactic
ambiguity, a sentence can have more than one parse. Each parse could have
a different meaning.
In some languages the agent of the action is determined by its position in the
sentence. In Bengali the agent of the main action is determined by its person, number and
gender agreement with the main verb. This gives a way to determine the agent of the verb in
case of ambiguity, for example:
‘NaaTakam avarkaL paarttaarkaL’ (They saw the drama) “aamra natok gulo dekhechilam”
Here the two nouns (‘natok gulo’ and ‘aamra’) do not have case attachment.
Hence there is an ambiguity in determining the agent. Here the information obtained from
the morphological analyzer for
‘aamra’ > ‘ami’ (root) + ‘ra’ (plural marker)
‘natok’ > ‘natok’ (root) + ‘gulo’ (plural marker)
shows that two words have person, number and gender agreement. Hence the word
‘aamra’ is determined to be the agent. Thus the morphological information enables the
disambiguation of the agent case.
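Agreement-based agent selection can be sketched in Python as a feature comparison between each candidate noun and the verb. The feature values below are assumptions made for illustration (first person for ‘aamra’ and for the verb ending, third person for ‘natok gulo’); they are not output of the morphological analyzer described here.

```python
# Hypothetical morphological analyses of the two case-less nouns.
CANDIDATES = {
    "aamra": {"person": 1, "number": "plural"},
    "natok gulo": {"person": 3, "number": "plural"},
}

# Hypothetical agreement features carried by the verb "dekhechilam".
VERB_FEATURES = {"person": 1}


def pick_agent(candidates, verb_features):
    """Return the single noun whose features match the verb, else None."""
    matches = [
        word
        for word, features in candidates.items()
        if all(features.get(key) == value for key, value in verb_features.items())
    ]
    return matches[0] if len(matches) == 1 else None


print(pick_agent(CANDIDATES, VERB_FEATURES))
```

Returning `None` when zero or several nouns match keeps the remaining ambiguity visible, so that syntactic or semantic information can be consulted instead.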
