IJCST Vol. 2, Issue 4, Oct. - Dec. 2011
ISSN : 0976-8491 (Online) | ISSN : 2229-4333(Print)
Challenges During Construction of Punjabi Thesaurus and
their Possible Solutions
1Aarti Tayal, 2Dharam Veer Sharma
1Dept. of Computer Science, Guru Ram Das Institute of Engineering & Technology, Lehra Bega,
Bathinda, Punjab, India
2Dept. of Computer Science, Punjabi University, Patiala, Punjab, India
Abstract
A thesaurus is a basic necessity for composing text in any
language. Though considerable work has been done in this
area for English and related languages, the Indian language
scenario presents a relatively more complex and uphill task.
Punjabi is the world’s 12th most widely spoken language, yet
very little work has been done in this field. In this
paper, we present various challenges encountered
during construction of a Punjabi language thesaurus, whether
they relate to the Punjabi language itself or to the design
and implementation of the thesaurus. We also try to provide
possible solutions to these challenges.
I. Introduction
A. What is a Thesaurus?
A thesaurus is designed to help users with exact and
nuanced word choice. Thesauruses usually have one of two
organizational forms: either they’re organized alphabetically,
in what is called dictionary form, or thematically, so that words
with similar meanings are grouped together. Once the writer locates
the first word in the document, he will find a great range of options
from which to select the word best suited to his purpose [2].
B. Main Steps Performed by a Thesaurus
1. Take a word from the document as input.
2. Search for the input word in the stored database.
3. If a match is found, give the list of synonyms and antonyms
as output to the user.
4. Otherwise, report that there are no suggestions.
5. Repeat the above steps for other words as well.
Even though this appears very simple at first glance,
designing a thesaurus for Indian languages such as Punjabi
poses many new challenges not found in English, which
complicate the design of the Punjabi thesaurus [3]. The Punjabi
language differs greatly from Western languages in its phonetic
properties and grammatical rules.
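The lookup steps listed in this section can be sketched in a few lines of Python. This is only a minimal illustration: the in-memory dictionary stands in for the stored database, and the names THESAURUS, suggest and suggest_for_document are hypothetical.

```python
# Toy in-memory "database". The single entry is the example pair used
# later in this paper; a real thesaurus would load its data from storage.
THESAURUS = {
    "ਉਜਾਲਾ": {"synonyms": ["ਪ੍ਰਕਾਸ਼"], "antonyms": []},
}


def suggest(word, database=THESAURUS):
    """Steps 2-4: search for the word and return its suggestions, if any."""
    entry = database.get(word)
    if entry is None:
        return []  # step 4: no match, so no suggestions
    return entry["synonyms"] + entry["antonyms"]  # step 3


def suggest_for_document(words, database=THESAURUS):
    """Step 5: repeat the lookup for every word taken from the document."""
    return {word: suggest(word, database) for word in words}
```

The sketch assumes that every word arrives already encoded in the same way as the database keys, which is exactly the assumption that the following sections show to be difficult for Punjabi.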
II. Advantages of Developing a Thesaurus for Indian
Languages
1. A thesaurus reduces all the extra information found in a
dictionary to a few simple word options that are easy to
find and consider [4]. Dictionaries are great for looking up
the meaning of a word. But they also provide etymologies
(word origins), part of speech, pronunciation, and several
meanings that a user must wade through to find optional
usages. A thesaurus, on the other hand, lists synonyms, or
other words with a similar meaning, for quick selection.
2. A thesaurus helps a writer avoid repetition [5]. When we
write in a hurry, it’s easy to use the same word several
times in the same document. That can sound monotonous
to those who will have to read the document. Looking up a
term in a thesaurus will provide functional options, so that
you can choose a different word to convey your meaning
without confusing or boring your readers.
3. In some versions of a thesaurus, a synonym, or similar
word, is listed with each entry. For example, the word “ਉਜਾਲਾ” might
have a listed synonym of “ਪ੍ਰਕਾਸ਼” or another word that
helps to show the precise meaning of the original word and
its synonym. We learn a great deal through similarity. Occasionally
we aren’t sure of the word we’re looking for, so looking up
a potential synonym can enable us to find one with
a more specific meaning.
4. Using a thesaurus routinely can help to expand a writer’s
vocabulary [5]. If one writes often, one can easily get into
a rut of using similar words, expressions, terms, and
phrases over and over. At the time of editing, take a few
moments to look up the most frequently used or key words
of the document to find meaningful substitutes. Looking
over the list of possibilities, one easily gains an understanding
of similar words and can perhaps add these to memory for
future reference.
III. Problems Encountered While Working with the Punjabi
Language
1. There is no standardization of Punjabi keyboard layouts.
More than forty keyboard layouts and more than
500 fonts are in common use, which means that the same
Punjabi word can be stored internally in forty different ways.
For example, the word ਪੰਜਾਬੀ is stored internally with the following
key maps in different fonts [6, 7].
Table 1: Key map of word ਪੰਜਾਬੀ in different fonts
Font Name         Key Map
Akhar             pMjwbI
Amrit-Lipi2       pMj`bI
Anandpur Sahib    pμj;bI
Asees             Gzikph
Satluj            ê³ÜÅìÆ
Sukhmani          P^JABI
The thesaurus has to deal with each of these cases separately
and read the whole word. Even in the same font, a character
can be typed and stored in more than one way.
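One way to deal with these cases is to convert every font-specific key sequence into Unicode before the database lookup. The sketch below assumes a per-font character table. The two partial tables are inferred only from the key maps this paper gives for Akhar and Asees; a real converter would need the complete table of every supported font, and FONT_KEYMAPS and to_unicode are hypothetical names.

```python
# Partial key maps inferred from the key maps of the word for "Punjabi"
# given in this paper. They are illustrative, not complete font tables.
FONT_KEYMAPS = {
    "Akhar": {
        "p": "\u0A2A",  # letter pa
        "M": "\u0A70",  # tippi (nasal sign)
        "j": "\u0A1C",  # letter ja
        "w": "\u0A3E",  # kanna (vowel sign aa)
        "b": "\u0A2C",  # letter ba
        "I": "\u0A40",  # bihari (vowel sign ii)
    },
    "Asees": {
        "G": "\u0A2A", "z": "\u0A70", "i": "\u0A1C",
        "k": "\u0A3E", "p": "\u0A2C", "h": "\u0A40",
    },
}


def to_unicode(keys, font):
    """Translate a font-specific key sequence to Unicode, one key at a time."""
    keymap = FONT_KEYMAPS[font]
    try:
        return "".join(keymap[key] for key in keys)
    except KeyError as exc:
        raise ValueError(f"no mapping for {exc.args[0]!r} in font {font}") from None
```

With these tables, to_unicode("pMjwbI", "Akhar") and to_unicode("Gzikph", "Asees") yield the same Unicode string. A one-to-one mapping is a simplification: it does not handle keys that produce several code points, or signs that are typed before the consonant but stored after it.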
2. The Punjabi language is tonal in nature. One of the unique
features of Punjabi among modern South Asian
languages is the presence of pitch contours. These change
the meaning of a word depending on the way it sounds. In
technical terms these are called ‘tones’, and they are of three
types: low, high and level.
w w w. i j c s t. c o m
Table 2: Example of one word having different tones
Tone     Word    Transliteration    Meaning
Low      ਘੜੀ     ghaṛī              watch
Level    ਕੜੀ     kaṛī               link of a chain
High     ਕੜ੍ਹੀ    kaṛhī              turmeric curry
3. Punjabi typing is much more complex than
English typing, as 57 characters have to be typed on the
standard QWERTY keyboard. One has to memorize the Punjabi
characters corresponding to the English keys, search out
each character, and then decide whether to type it with
or without SHIFT.
4. Unlike English, there is no well-defined word boundary for
Punjabi words written in different Punjabi fonts. For example,
in the Asees font the following punctuation marks are encoded
as Punjabi characters and are thus part of the word ( ‘ “ + /
: ; ? [ ] \ { } ). But many other fonts, such as Akhar and
Satluj, do not encode these punctuation marks
as Punjabi characters. So the extraction of word boundaries is
font dependent in the case of Punjabi. In English and in many
other languages, special characters and delimiters separate
one word from another. But in various Punjabi fonts
these special characters and delimiters
represent letters within a word. This is clear from the following
table.
Table 3: Key maps of words containing delimiters and special
characters in different fonts
Word      Font Name         Key Map
ਉਦਾਸ      Asees             Tüdk;
ਬੇਚੈਨ      Asees             p/u?B
ਪੰਜਾਬੀ     Anandpur Sahib    pμj;bI
ਪੰਜਾਬੀ     Asees             Gzikph
ਪੰਜਾਬੀ     Satluj            ê³ÜÅìÆ
ਪੰਜਾਬੀ     Sukhmani          P^JABI
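Font-dependent word boundaries can be handled by keeping, for each font, the set of punctuation characters that the font uses as letters. The following sketch assumes the Asees set quoted in this section (with plain ASCII quotes substituted for the curly ones) and treats every other font as having no such characters. FONT_LETTER_PUNCTUATION and split_words are hypothetical names.

```python
import re

# Punctuation characters that a font encodes as Punjabi letters.
# Only the Asees list comes from this paper; ASCII quotes are an assumption.
FONT_LETTER_PUNCTUATION = {
    "Asees": "'\"+/:;?[]\\{}",
}


def split_words(text, font):
    """Split text into words, keeping font-specific letter punctuation."""
    extra = FONT_LETTER_PUNCTUATION.get(font, "")
    pattern = r"[\w" + re.escape(extra) + r"]+"
    return re.findall(pattern, text)
```

For the Asees text "p/u?B Gzikph" this returns the two complete key sequences, whereas the same text processed with the rules of a font such as Akhar is cut at the slash and the question mark.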
5. Punjabi is not written in a linear fashion. The structure of
Gurmukhi, the script for Punjabi, is non-linear, i.e. besides the
41 consonants of the language, there are other symbols such as
Laga, Lagakhar etc., which are used to represent the phonetic
structure of the word. These symbols decorate the
consonant. For example, in the word ‘COMPUTE’ as written in
English, the character ‘O’ is called the colleague of ‘C’, ‘M’ is
called the colleague of ‘O’ and so on, but in Punjabi it is
written as ‘ਕੰਪਿਊਟ’, where the character ਕ is said to be wearing a
cap, ਪ is holding a stick and ੳ is wearing shoes.
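In Unicode text this non-linear structure shows up as base characters followed by dependent signs. A minimal sketch, assuming the word is stored in Unicode and using only the standard unicodedata module, groups each base character with the signs that decorate it. The function name clusters is hypothetical.

```python
import unicodedata


def clusters(word):
    """Group each base character with the dependent signs that follow it."""
    result = []
    for ch in word:
        # General categories starting with "M" are combining marks.
        if result and unicodedata.category(ch).startswith("M"):
            result[-1] += ch
        else:
            result.append(ch)
    return result
```

Applied to the Unicode form of the example word above, this gives four clusters: the first consonant with its nasal sign, the second consonant with its vowel sign, and two characters that stand alone. It is a simplification of full grapheme segmentation, which also has to handle conjuncts and joiner characters.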
6. There is no standardization of Punjabi spellings. A word may
be spelled in more than one way and all the forms may be
acceptable. This problem exists mainly because Punjabi has many
dialects, spoken in different sub-regions of greater
Punjab, among them Majhi, Malwi, Doabi,
Pothohari, Jhangvi and Multani. Speakers of one dialect
pronounce a word differently from speakers of
another dialect. The problem arises when they write words
as they actually pronounce them. Following are some words whose
pronunciations differ by only one sound
but which are spelled differently according to pronunciation.
Such pairs are spelling variants of the same word.
- ਆਪਣਾ, ਆਪਨਾ
- ਬਿਪਰੀਤ, ਵਿਪਰੀਤ
- ਗੂੜਾ, ਗੂੜ੍ਹਾ
- ਹਨੇਰਾ, ਹਨ੍ਹੇਰਾ
- ਅਨੋਖਾ, ਅਨੌਖਾ
- ਆਰੰਭ, ਅਰੰਭ
- ਵਰਕ, ਬਰਕ
- ਜਤਨ, ਯਤਨ etc.
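One possible way to cope with such variants is to map every accepted spelling to a single form before the lookup. The sketch below uses some of the pairs listed above. Treating the first spelling of each pair as the canonical one is an arbitrary assumption made for illustration, and VARIANT_GROUPS and canonical are hypothetical names.

```python
# Spelling variants taken from the list above. The choice of the first
# spelling as the canonical form is arbitrary.
VARIANT_GROUPS = [
    ("ਆਪਣਾ", "ਆਪਨਾ"),
    ("ਬਿਪਰੀਤ", "ਵਿਪਰੀਤ"),
    ("ਗੂੜਾ", "ਗੂੜ੍ਹਾ"),
    ("ਹਨੇਰਾ", "ਹਨ੍ਹੇਰਾ"),
    ("ਵਰਕ", "ਬਰਕ"),
    ("ਜਤਨ", "ਯਤਨ"),
]

# Map every spelling, including the canonical one, to the canonical form.
CANONICAL = {variant: group[0] for group in VARIANT_GROUPS for variant in group}


def canonical(word):
    """Return the canonical spelling, or the word itself if it is unknown."""
    return CANONICAL.get(word, word)
```

The thesaurus database would then store only canonical spellings, and both the input word and the database keys would pass through canonical before comparison.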
7. In some Punjabi fonts, characters such
as bindi, lava, onkar, dulainkar etc. have zero width, so if
a user mistakenly enters such a character more than once,
only a single entry is visible. The following table shows words
that take a single key entry in some fonts but
more than one entry in others.
Table 4: The same words use different key entries in
different fonts
Word    Font Name      ASCII Code
ਨੂੰ      Nanak          96;124;121
ਨੂੰ      Merapunjab     126
ਉ       e-Panjabi30    85
ਊ       EKTA-DUNIYA    84;91
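Invisible duplicate entries of zero-width characters can be removed before a word is looked up. The following sketch assumes that the text has already been converted to Unicode, where such signs are non-spacing marks, and drops a mark when it immediately repeats the previous character. The function name collapse_repeated_marks is hypothetical.

```python
import unicodedata


def collapse_repeated_marks(text):
    """Drop a non-spacing mark that directly repeats the previous character."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch) == "Mn"
        if is_mark and out and out[-1] == ch:
            continue  # invisible duplicate entry, skip it
        out.append(ch)
    return "".join(out)
```

For font-encoded text the same idea applies, but the set of zero-width key codes has to be taken from each font's own table.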
IV. Some Challenges Faced during Design of Punjabi
Thesaurus and their Possible Solutions
1. The biggest challenge in constructing a thesaurus is
identifying words that are semantically related to one another.
Manual construction of thesauri is a tedious and time-consuming
task. No Punjabi thesaurus is available in the market as
yet, which makes it very difficult to collect Punjabi words with
their synonyms and antonyms. The only
way is first to collect words from Punjabi books of various
courses and store them in a database. But Punjabi books provide very
limited data, so building a large database of words becomes a major
challenge during the design of the thesaurus.
2. There are various ways to store the collected data. In the first
manner, each word is stored in its own row along with the class
it shares with its synonyms.
Table 5: Storage manner of synonyms in database
Word        Class
ਉਦਾਸ        c1
ਚਿੰਤਾਤੁਰ     c1
ਉਪਰਾਮ       c1
ਨਿਰਾਸ਼       c1
ਪਰੇਸ਼ਾਨ      c1
ਨਾਖੁਸ਼       c1
ਉਜੱਡ        c2
ਗੰਵਾਰ        c2
ਬਗਲੋਲ       c2
ਝੁੱਡੂ         c2
In the second manner, all the synonyms that have the same class are stored
in one row.
Table 6: Another manner to store synonyms in database
Words                                                                  Class
ਉਦਾਸ,ਚਿੰਤਾਤੁਰ,ਉਪਰਾਮ,ਨਿਰਾਸ਼,ਪਰੇਸ਼ਾਨ,ਨਾਖੁਸ਼,ਦੁਖੀ,ਫਿਕਰਮੰਦ,ਉਚਾਟ,ਮਾਯੂਸ,ਅਪ੍ਰਸੰਨ    c1
ਬਿਕਲ,ਔਖਾ,ਬੇਚੈਨ,ਉਤਾਵਲਾ,ਵਿਕਲ,ਵਿਆਕੁਲ,ਬੇਆਰਾਮ                              c2
The first manner takes more storage space than the second, but
searching for a word is easier in the first.
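The trade-off between the two layouts can be tried out with a small in-memory SQLite database. The sketch below is only an illustration under assumed table and column names (per_word, per_class, class_id); it uses a few words from Table 5 and is not the schema of an actual Punjabi thesaurus.

```python
import sqlite3

# A few words from Table 5, grouped by class.
CLASSES = {
    "c1": ["ਉਦਾਸ", "ਨਿਰਾਸ਼", "ਪਰੇਸ਼ਾਨ"],
    "c2": ["ਉਜੱਡ", "ਗੰਵਾਰ"],
}

conn = sqlite3.connect(":memory:")

# First manner: one row per word, so the word column can be indexed.
conn.execute("CREATE TABLE per_word (word TEXT, class_id TEXT)")
conn.execute("CREATE INDEX idx_word ON per_word (word)")
conn.executemany(
    "INSERT INTO per_word VALUES (?, ?)",
    [(w, c) for c, words in CLASSES.items() for w in words],
)

# Second manner: one row per class, synonyms joined by commas.
conn.execute("CREATE TABLE per_class (words TEXT, class_id TEXT)")
conn.executemany(
    "INSERT INTO per_class VALUES (?, ?)",
    [(",".join(words), c) for c, words in CLASSES.items()],
)


def synonyms_per_word(word):
    """Exact-match lookup on the word, then fetch its class mates."""
    rows = conn.execute(
        "SELECT b.word FROM per_word a JOIN per_word b "
        "ON a.class_id = b.class_id WHERE a.word = ? AND b.word <> ?",
        (word, word),
    ).fetchall()
    return sorted(row[0] for row in rows)


def synonyms_per_class(word):
    """Every row has to be split and scanned to find the word."""
    for (words,) in conn.execute("SELECT words FROM per_class"):
        members = words.split(",")
        if word in members:
            return sorted(m for m in members if m != word)
    return []
```

Both functions return the same synonyms. The first layout repeats the class label in every row but allows a direct, indexable match on the word, while the second is more compact but requires each row to be split and scanned.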
3. To construct a thesaurus successfully, we need to include
not only words but also their meanings, to avoid confusion
when words with the same spelling have different meanings.
The different meanings of words with the same spelling are treated
as different concepts.
Table 7: Example of words with more than one context
Word             Synonyms
ਭਾਗ (“part”)     ਹਿੱਸਾ, ਅੰਸ਼, ਅੰਗ, ਟੋਟਾ
ਭਾਗ (“luck”)     ਕਿਸਮਤ, ਲੇਖ, ਨਸੀਬ
ਠੀਕ (“well”)     ਦਰੁਸਤ
ਠੀਕ (“right”)    ਸਹੀ, ਉਚਿਤ
As in the above example, the word ਭਾਗ has the same letters but
different meanings, so when the user consults the thesaurus
it becomes difficult to provide the right synonym based on
the word alone.
There are two ways to provide output to the user in this case. Either
give all synonyms matching the input Punjabi word in one
suggestion list, regardless of the number of contexts, or store words along
with their synonyms and antonyms under contexts, so that when
a Punjabi word matches words in the database, the
contexts are shown first, the user chooses the
context that suits the sentence, and finally chooses a
word from the suggestion list under the selected context.
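Both output options can be sketched with a nested mapping from word to context to synonyms. The data below is taken from Table 7, with the English glosses used as context labels. The names SENSES, contexts and suggestions are hypothetical.

```python
# Entries from Table 7; the English glosses act as context labels.
SENSES = {
    "ਭਾਗ": {
        "part": ["ਹਿੱਸਾ", "ਅੰਸ਼", "ਅੰਗ", "ਟੋਟਾ"],
        "luck": ["ਕਿਸਮਤ", "ਲੇਖ", "ਨਸੀਬ"],
    },
    "ਠੀਕ": {
        "well": ["ਦਰੁਸਤ"],
        "right": ["ਸਹੀ", "ਉਚਿਤ"],
    },
}


def contexts(word):
    """Second option, first stage: list the contexts of a word."""
    return list(SENSES.get(word, {}))


def suggestions(word, context=None):
    """With no context, return every synonym (first option);
    with a context, return only the synonyms stored under it."""
    senses = SENSES.get(word, {})
    if context is None:
        return [syn for syns in senses.values() for syn in syns]
    return list(senses.get(context, []))
```

A user interface built on the second option would first show the result of contexts and then call suggestions with the context the user picked.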
V. Problems Faced during Implementation of the Punjabi
Thesaurus
Once the Punjabi thesaurus is built, the main tasks left in the
implementation phase are selection of a word from the suggestion
list and its replacement.
1. Problem during Selection of Word from Suggestion
List
Only the user knows which word from the suggestion list best
conveys the meaning of the sentence, so it is the user’s task to
select the word. Problems arise when there
is more than one context. As the example in Table
7 shows, a word may have more than one context.
In that case the suggestion list consists of contexts first
rather than a list of possible synonyms and antonyms. The user
first has to select the context that fits the sentence and
then select a synonym or antonym under the selected context.
2. Problems during Replacement of the Input Punjabi Word
with the Selected Word
During replacement, if the input Punjabi word (for which the user
wants a list of synonyms and antonyms) is not in Unicode but
in some other font, the replacement process is not simple. If the
word consists of special characters, delimiters, letters and
numbers, it becomes very difficult to get the right result because
delimiters are not considered part of the word. The thesaurus
application therefore needs some way to read such words
successfully. An example makes this clearer.
Suppose the word ਖ਼ਰੂਦ is written in a document in the Asees font, so
that its stored key sequence contains delimiters, just as p/u?B
(the key map of ਬੇਚੈਨ in Table 3) consists of a slash, a question
mark and three English letters. The problem comes when the option
selected from the suggestion list replaces the entered Punjabi
word ਖ਼ਰੂਦ. The selected option must be converted to the Asees
font and then substituted for ਖ਼ਰੂਦ, but this fails because
the entered word contains delimiters. Suppose the suggestion
list consists of the words ਉਪੱਦਰ, ਹੰਗਾਮਾ, ਹੁੱਲੜ etc. and the user wants
to replace ਖ਼ਰੂਦ with ਹੰਗਾਮਾ; after replacement the text
becomes ਖ਼ਹੰਗਾਮਾਰੂਦ or ਖ਼ਰੂਹੰਗਾਮਾਦ, depending upon how the user
selected the word when requesting synonyms or antonyms. That is,
the word is only partially replaced. The solution is to select the
whole word before replacement, either automatically or by
requiring the user to extend the selection until a space is
encountered. If a special character or delimiter is met during
selection, the selection should continue past it; the user will
then get the correct suggestion list.
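The whole-word selection described above can be sketched as follows, assuming that only whitespace ends a word, so that delimiters inside the key sequence stay part of the selection. The names expand_to_word and replace_word are hypothetical, and the replacement string is assumed to be already converted to the font of the document.

```python
def expand_to_word(text, start, end):
    """Grow the selection [start, end) until whitespace or the text edge."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


def replace_word(text, start, end, replacement):
    """Replace the whole word around a possibly partial selection."""
    start, end = expand_to_word(text, start, end)
    return text[:start] + replacement + text[end:]
```

For the text "ab p/u?B cd", a selection that covers only the letter u is widened to the full key sequence p/u?B, so the replacement is complete rather than partial.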
VI. Conclusion
This paper is divided into five main sections. Section one
gave an introduction to the thesaurus and the steps it
performs. In the second section we presented some advantages
of developing a thesaurus for Indian languages and
compared the thesaurus with the dictionary. In the next three
sections we discussed in detail problems related to the Punjabi
language and challenges during design and implementation of a Punjabi
thesaurus. We also tried to provide possible solutions
to these challenges.
References
[1] Rupinder Kaur, R.K.Sharma, Suman Preet, Parteek Bhatia,
2010, "Punjabi Wordnet Relations and Categorization of
Synsets", Thapar University, Patiala.
[2] [Online] Available: http://www.wisegeek.com/what-is-athesaurus.htm, accessed on 31 May 2011.
[3] G S Lehal, “Design and Implementation of Punjabi Spell
Checker”, International Journal of Systemics, Cybernetics
and Informatics, pp. 70-75 (Jan 2007).
[4] Jayanta Chatterjee, T. V. Prabhakar, “On to Action–Building
a Digital Ecosystem for Knowledge Diffusion in Rural
India”, Knowledge Management: Nurturing Culture,
Innovation and Technology, pp. 401-416, 2005.
[5] [Online] Available: http://www.essortment.com/usethesaurus-34508.html, accessed on 20 May 2011.
[6] G S Lehal, “Design and Implementation of Punjabi Spell
Checker”, International Journal of Systemics, Cybernetics
and Informatics, pp. 70-75, 2007.
[7] [Online] Available: http://www.learnpunjabi.org/intro1.asp,
accessed on 15 May 2011.
Aarti Tayal, M. Tech., an alumna of the
Department of Computer Science, Punjabi
University, Patiala, is currently working
as a lecturer in the Department of Computer
Science and Engineering, Guru Ram Das
Institute of Engg. & Technology, Lehra
Bega, Bathinda. She has research interests
in Natural Language Processing and has
published 5 papers in reputed journals and
conferences.
Dr. Dharam Veer Sharma, Ph. D., MCA, is
an alumnus of the Department of Computer
Science, Punjabi University, Patiala, India
and is presently serving there as Assistant
Professor. He has research interests
in Optical Character Recognition and
Natural Language Processing. He has
more than 40 research publications in
reputed journals and conferences.
International Journal of Computer Science & Technology 173