Vulnerability measure of a English sentence

Scientific Voyage, Volume - 2, Issue - 2, Date Of Publication - May, 2016
Dinabandhu Bhandari^1, Sumon Ghosh^1, and S. K. Venkatesan^2
1 Heritage Institute of Technology, Kolkata
[email protected]; [email protected]
2 TNQ Books and Journals, Chennai, India
[email protected]
Abstract. A sentence is a group of words that must express a complete idea, sense or meaning.
The meaning of a sentence depends primarily on the use of appropriate words. Confusion
arises when there are multiple options (synonyms) available for a word. In this work, we
introduce a vulnerability measure of a sentence that tries to define the amount of ambiguity
due to the various options available for the words used in the sentence. This measure helps
authors and copy editors identify the vulnerable sentences in a document. Though the
selection of appropriate words depends on the context, here we have considered only the
synonyms of the words in defining the measure. The context will be considered in
subsequent investigations.
Keywords: Uncertainty measure; Entropy; Ill-formed sentence; Well-formed sentence; Sentence vulnerability
1 Introduction
Reading a document and feeling satisfied with its quality comes naturally to human beings.
When we read a novel, a story, an essay or even an editorial, we perceive satisfaction in
terms of the ease of writing or flow of the document, the quality of the language and the
information content. Obviously, different people consider different sets of characteristics to
measure the quality of a document, depending on their objective and on their knowledge and
experience in the relevant topic or subject. When an academician reads a recently published
article in a journal or proceedings, the reader expects to become aware of the state of the
art of the subject and to enhance his/her knowledge of the topic. Among all these, the
primary criterion of a quality document is its language: how well it is written.
A very basic aspect of an article is its language. More specifically, the very first feature
considered while judging an article is the way it has been written, that is, how good its
language is. Good use of words makes an article easy to read, understand and interpret. The
quality of an article is measured mainly on two aspects:
1. Syntactic: The way in which words and punctuation are used and arranged to form
phrases, clauses and sentences. It is quite straightforward to measure the syntactic quality
of an article. The following characteristics indicate, to a great extent, the syntactic
quality.
(a) Grammar
(b) Use of conjunctive adverbs, articles
(c) Repeated words
(d) Punctuation
2. Semantic: Semantics is the study of the meaning of language. It also deals with varieties
and changes in the meaning of words, phrases, sentences and text. Examining the semantic
aspect of an article is comparatively more challenging than examining the syntactic aspect.
However, in our experience, a correct sentence can sometimes be rewritten in an improved
fashion to express the feeling of the author properly. Such improvements arise mainly from
the use of a proper verb, noun, conjunctive adverb or adjective.
Though there is no unique way to write a sentence to express a feeling, one can easily
identify a well-formed sentence. It has been found that experienced authors use the right
words at the right places, whereas less experienced or new authors sometimes struggle to
choose the right word to express their feelings. When multiple choices are available for a
word in his/her sentence, uncertainty arises in the author's mind. This is one of the primary
reasons for writing an ill-formed sentence. The uncertainty or ambiguity in the author's mind
reduces as he/she gains experience. Over time, the author tries to take into consideration
the neighboring sentences, paragraphs and even the whole document or article. It is clear
that the uncertainty reduces with the enhancement of the author's knowledge and experience.
In the era of digitization, people have made various attempts to judge the quality of a
document with respect to its language and to develop systems that would automatically
improve the quality of the document. Significant research is in progress to develop automatic
copy editing systems adopting machine learning techniques. To the authors' knowledge, the
first such attempt dates back to 1951, when C. E. Shannon defined the entropy of printed
English and predicted letters when the preceding text is known [1]. In [2], Gelbukh and
Bolshakov worked on the automatic correction of words in a sentence using a dictionary of
one-letter-distant literal paronyms. Researchers have also worked on the correction of
malapropisms in a sentence [3, 4]. In [5], Richard H. Granger worked on the detection and
correction of errors during the understanding of syntactically and semantically ill-formed
text using the NOMAD system.
In this article, an attempt has been made to define a vulnerability measure that computes
the uncertainty involved in a sentence due to the multiple synonyms (options) available for
the words used in it. This uncertainty considers only the local information. The measure can
be used to find how vulnerable a sentence is while verifying the quality of the sentence. So
far, it has been used to alert copy editors to the vulnerability of sentences so that they
can give those sentences more attention while editing a document. The experiment has been
well appreciated by the editors of a renowned publishing company.
A brief discussion on the quality of sentences is provided in section 2. Section 3
introduces the vulnerability measure of a sentence in a document. The experimental results
are presented in section 4.
2 Quality of a sentence in a document
Sentences of a document should have proper meaning; otherwise the meaning of the document
would not be clear. In general, if the sentences are not constructed properly, readers do
not understand what the authors mean. Native speakers of English already have an instinctive
knowledge of its grammar and of the appropriate usage of words. It is this knowledge which
enables one to distinguish a well-formed English sentence from one which is clearly
ill-formed. For example, native speakers know that the following sentence is well-formed:
Sumon plays the piano. Native speakers can produce and understand a sentence like this
without even thinking about its grammar. Conversely, in the course of everyday
communication, no native speaker would ever produce this: Piano plays Sumon. One recognizes
instinctively that there is something very wrong with this sentence, because it doesn't make
sense. Another aspect of constructing a well-formed sentence is the selection of appropriate
words. For example, consider the following two sentences: 1) Sumon scored 90% in
Mathematics. 2) Sumon obtained 90% marks in Mathematics. Both sentences are grammatically
correct and readers will have no problem understanding the information they convey. However,
in this context, the second one is the appropriate one. Therefore, it is very important to
select the correct word from several synonyms considering the context.
1. Ill-formed: A sentence that is grammatically correct but whose words are not selected
properly is known as an ill-formed sentence. For example,
(a) There is interest among young people about the history of India.
(b) When Buddha was born, there was need of a spiritual leader in India.
2. Well-formed: A sentence that is grammatically correct and whose words are well selected
is known as a well-formed sentence. For example,
(a) Young people have many interests about the history of India.
(b) At the time Buddha was born, India was in need of a great spiritual leader.
In general, when the author is a novice without much experience, every option is equally
likely to be used. As the author gains experience, the probability of using each option
varies. Let us first consider a hypothetical scenario where a computer (a dumb person) writes
a sentence from a given set of words. By following the rules of grammar one can
automatically position a part of speech in the sentence, but confusion arises when multiple
options (synonyms) are available for a particular word. In this case, the computer randomly
selects a word from the set of possible options. This ambiguity results in an ill-formed
sentence. In this work, an attempt has been made to propose a measure that quantitatively
judges the correctness of a sentence. In other words, the measure quantifies the
appropriateness of the use of words in forming a sentence. The proposed measure actually
tries to quantify the uncertainty or ambiguity present in a sentence.
As mentioned earlier, the uncertainty arises due to the lack of knowledge in choosing an
appropriate word among several options available to the author. The ambiguity then reduces
as the author gains experience and takes into account the information available in the
neighboring sentences, paragraphs and the document. Likewise, the proposed measure should
also be adaptive enough to take into account the associated information. In the following
section, we define a vulnerability measure based on the entropy measure defined by Shannon.
3 Vulnerability measure of sentences in a document
3.1 Vulnerability measure due to synonyms
The vulnerability measure is defined based on the ambiguity or uncertainty due to the use
of a word that has multiple synonyms. It is assumed that the sentences are grammatically
correct and that any sentence that is not well formed is so due to a lack of expertise in
the use of the proper word. Note that the vulnerability of a sentence depends on the options
available for each word and the number of words used in the sentence. Let a sentence $S$
consist of $n$ words $w_1, w_2, \dots, w_n$. Let the word $w_i$ have $m_i$ options
(synonyms). Then the vulnerability measure $V(w_i)$ should be a function of $m_i$ having the
following properties:
1. the measure increases with $m_i$ and
2. the measure is maximum when all the options are equally likely, i.e., have the same
probability of use.
Let $p_{ij}$ be the probability of selecting option $j$ among the $m_i$ available options.
When the options are equally likely, $p_{ij} = \frac{1}{m_i}$.
Now, following the definition of Shannon's entropy [6], we define the vulnerability measure
corresponding to a specific word ($w_i$) as
$$V(w_i) = \sum_{j=1}^{m_i} -p_{ij} \log(p_{ij}) \qquad (1)$$
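As a quick sanity check on property 2, and not part of the original derivation, the equally likely case can be worked out directly from (1) by substituting $p_{ij} = 1/m_i$:

```latex
V(w_i) = \sum_{j=1}^{m_i} -\frac{1}{m_i} \log\!\left(\frac{1}{m_i}\right)
       = m_i \cdot \frac{1}{m_i} \log(m_i)
       = \log(m_i)
```

Under this assumption a word with a single option ($m_i = 1$) contributes zero vulnerability, and the contribution grows logarithmically with the number of options.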
For a sentence consisting of $n$ words the vulnerability measure can be defined as
$$V(S) = \sum_{i=1}^{n} V(w_i) = \sum_{i=1}^{n} \sum_{j=1}^{m_i} -p_{ij} \log(p_{ij}) \qquad (2)$$
Clearly, the vulnerability measure $V(S)$ of a sentence depends on two factors:
1. the number of options available for each word and their probabilities of use and
2. the number of words in the sentence.
These are in concurrence with our understanding of writing English sentences. An
experienced writer tends to write short sentences, as a sentence with many words is prone
to be wrong. On the other hand, a sentence constructed of words having several synonyms can
also be error prone. The measure therefore agrees with the general perception of English
writing. To account for the second factor, one can define the average vulnerability measure
simply as
$$V_A(S) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} -p_{ij} \log(p_{ij}) \qquad (3)$$
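The three definitions (1), (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the helper `uniform`, and the example option counts (3, 5 and 7) are assumptions made here, and the natural logarithm is assumed as the base.

```python
import math
from typing import List, Sequence


def word_vulnerability(probs: Sequence[float]) -> float:
    """Eq. (1): Shannon entropy of the option probabilities of one word."""
    # Options with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sentence_vulnerability(option_probs: Sequence[Sequence[float]]) -> float:
    """Eq. (2): sum of the word vulnerabilities over the sentence."""
    return sum(word_vulnerability(probs) for probs in option_probs)


def average_vulnerability(option_probs: Sequence[Sequence[float]]) -> float:
    """Eq. (3): sentence vulnerability divided by the number of words."""
    n = len(option_probs)
    return sentence_vulnerability(option_probs) / n if n else 0.0


def uniform(m: int) -> List[float]:
    """Hypothetical helper: m equally likely options (the novice-author case)."""
    return [1.0 / m] * m


# Hypothetical sentence of three words with 3, 5 and 7 options respectively.
sentence = [uniform(3), uniform(5), uniform(7)]
total = sentence_vulnerability(sentence)
average = average_vulnerability(sentence)
```

With equally likely options the total reduces to log(3) + log(5) + log(7) = log(105). Passing non-uniform probabilities, for example to model an experienced author, lowers each word's contribution.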
While judging a sentence, one has to check the overall vulnerability measure defined in (2);
the average vulnerability can be computed using (3). If the vulnerability measure of a
sentence exceeds some specified value (threshold), one can consider the sentence a complex
one. Similarly, if the average vulnerability measure exceeds some specified value, the words
used in constructing the sentence have more options, thereby making the sentence more
ambiguous.
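The two threshold tests can be sketched as follows. The paper does not state threshold values, so the thresholds, the scores, the `Score` type and the flag labels below are all hypothetical and serve only to show the decision logic.

```python
from typing import Dict, List, NamedTuple


class Score(NamedTuple):
    total: float    # V(S) as in Eq. (2)
    average: float  # V_A(S) as in Eq. (3)


def flag_sentences(
    scores: Dict[int, Score],
    total_threshold: float,
    average_threshold: float,
) -> Dict[int, List[str]]:
    """Return, per line number, the reasons a sentence is flagged (if any)."""
    flags: Dict[int, List[str]] = {}
    for line_no, score in scores.items():
        reasons = []
        if score.total > total_threshold:
            reasons.append("complex")          # overall measure too high
        if score.average > average_threshold:
            reasons.append("ambiguous words")  # words have many options
        if reasons:
            flags[line_no] = reasons
    return flags


# Hypothetical scores and thresholds, chosen only for illustration.
scores = {1: Score(10.5, 1.5), 2: Score(3.0, 1.5), 3: Score(10.8, 2.2)}
flags = flag_sentences(scores, total_threshold=10.0, average_threshold=2.0)
```

Keeping the two tests separate lets an editor distinguish a sentence that is merely long from one whose individual words carry many alternatives.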
4 Results and discussion
To demonstrate the efficacy of the proposed vulnerability measure, we have considered
multiple sets of documents. In the process, the entropy of each sentence is calculated and
the vulnerable sentences are identified. Here, complexity refers to how vulnerable a
sentence is, and vulnerability refers to how error prone a sentence is in terms of its
construction. The results are then verified by expert copy editors. The following tables
show the sentences and their measures for some paragraphs. As examples, the vulnerability
measures of the sentences of two paragraphs (viz. Paragraph 1 and Paragraph 2), presented
below, are calculated and shown in Table 1 and Table 2.
Paragraph 1: The distortions of Dharma that we now a days see all around us are
largely the result of foreign education. The English word religion has substantially contributed
in distorting the pure meaning of dharma. The British first heard the word when they came
to India. They had no word of an equally comprehensive meaning. So they translated it as
religion. Such translation has distorted the meaning of Indian words. Dharma is a very wide
term which includes many religions. A religion is a system of worship, and there are many
such systems prevailing in this land. But in spite of these systems and sects dharma is one.
Dharma is that which is good for everyone irrespective of the system of worship and can open
up for him the way to salvation.
| Line No. | Vulnerable Words | Number of Words | Vulnerability Measure | Average Vulnerability |
|---|---|---|---|---|
| 1 | distortions, Dharma, see, largely, result, foreign, education. | 7 | 10.6736 | 1.524799 |
| 2 | english, word, religion, substantially, contributed, distorting, pure, meaning, dharma. | 9 | 12.820176 | 1.424464 |
| 3 | british, first, heard, word, came. | 5 | 10.760537 | 2.152107 |
| 4 | word, equally, comprehensive, meaning. | 4 | 6.492239 | 1.623059 |
| 5 | translated, religion. | 2 | 2.995732 | 1.497866 |
| 6 | translation, distorted, meaning, indian, words. | 5 | 10.709963 | 2.141992 |
| 7 | dharma, wide, term, includes, many, religions. | 6 | 6.556778 | 1.092796 |
| 8 | religion, system, worship, many, such, systems, prevailing, land. | 8 | 12.280530 | 1.535066 |
| 9 | spite, systems, sects, dharma, one. | 5 | 6.186208 | 1.237241 |
| 10 | dharma, good, everyone, irrespective, system, worship, way, | 8 | 11.053743 | 1.381717 |

Table 1. Vulnerability measures of the sentences of Paragraph 1
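The relationship between the last two columns of Table 1 follows from (2) and (3): the average vulnerability is the vulnerability measure divided by the number of words. The short check below confirms this on the reported figures. The variable and function names are illustrative, and the tolerance is an assumption that allows for the six-decimal precision of the table.

```python
# (number of words, vulnerability measure, average vulnerability)
# for lines 1 to 10, copied from Table 1.
table_1 = [
    (7, 10.6736, 1.524799),
    (9, 12.820176, 1.424464),
    (5, 10.760537, 2.152107),
    (4, 6.492239, 1.623059),
    (2, 2.995732, 1.497866),
    (5, 10.709963, 2.141992),
    (6, 6.556778, 1.092796),
    (8, 12.280530, 1.535066),
    (5, 6.186208, 1.237241),
    (8, 11.053743, 1.381717),
]


def max_average_error(rows):
    """Largest gap between the reported average and measure / word count."""
    return max(abs(total / n - avg) for n, total, avg in rows)


worst = max_average_error(table_1)
```

Every row agrees to within the precision of the printed values.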
Note that the vulnerability measures of the first and third sentences are almost equal,
though the first has more words. Likewise, the vulnerability measures of the third and sixth
sentences are nearly the same, and so are their word counts. The third sentence has a higher
average vulnerability than the first because the words used in it have more synonyms.
Paragraph 2: God cannot act contrary to dharma. If he does then he is not omnipotent.
Adharma is a characteristic of weakness not of strength. If fire instead of radiating heat, dies
out, it is no longer strong. Strengthles not in unrestrained behavior, but in well regulated
action. Therefore god who is omnipotent is also self regulated and consequently fully in tune
with dharma. God descends in the human body to destroy adharma and reestablish dharma,
not to act on passing whims and fancies. Hence even god can do everything but act contrary
to dharma. But for the risk of being misunderstood, one can say that dharma is even greater
than God. The universe is sustained because He acts according to dharma.
In the second paragraph, the sixth sentence is less vulnerable because the words used in it
have fewer synonyms.
Normally, the vulnerability measure depends on the length of the sentence and the words used
in it. If a sentence carries many words and those words have many synonyms, the sentence is
more vulnerable, so such sentences are expected to be more error prone.
In the first paragraph, consider the sentence 'The British first heard the word when they
came to India' and the sentence 'But in spite of these systems and sects dharma is one'. In
both cases the number of vulnerable words is 5, but the vulnerability measure of the first
sentence (10.760537) is greater than that of the second (6.186208). The words used in the
first sentence have more synonyms than those in the second, so the possibility of errors in
the first sentence is also much higher.
| Line No. | Vulnerable Words | Number of Words | Vulnerability Measure | Average Vulnerability |
|---|---|---|---|---|
| 1 | cannot, act, contrary, dharma. | 4 | 4.653960 | 1.163490 |
| 2 | does, omnipotent. | 2 | 2.708050 | 1.354025 |
| 3 | adharma, characteristic, weakness, strength. | 4 | 5.416100 | 1.354025 |
| 4 | fire, instead, radiating, heat, dies, out, longer, strong. | 8 | 18.302599 | 2.287824 |
| 5 | strengthles, unrestrained, behaviour, well, regulated, action. | 6 | 9.852615 | 1.642102 |
| 6 | omnipotent, self, regulated, consequently, fully, tune, dharma. | 7 | 6.291569 | 0.898795 |
| 7 | descends, human, body, destroy, adharma, reestablish, dharma, act, passing, whims, fancies. | 11 | 15.109689 | 1.373608 |
| 8 | even, god, do, everything, act, contrary, dharma. | 7 | 11.451900 | 1.635985 |
| 9 | risk, being, misunderstood, say, dharma, even, greater, God. | 8 | 13.089509 | 1.636188 |
| 10 | universe, sustained, because, act, according, dharma. | 6 | 7.390181 | 1.231696 |

Table 2. Vulnerability measures of the sentences of Paragraph 2
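One way to read the figures in Table 2, offered here as an interpretation rather than something the paper states: if every option is equally likely and the natural logarithm is used, (2) reduces to the sum of log(m_i), i.e. the logarithm of the product of the option counts. Exponentiating a reported measure should then give a value close to an integer, the number of ways the sentence could be assembled from the available options. The sketch below applies this to four rows; the function and variable names are illustrative.

```python
import math

# Vulnerability measures reported in Table 2 for lines 1, 2, 3 and 6.
reported = {1: 4.653960, 2: 2.708050, 3: 5.416100, 6: 6.291569}


def implied_combinations(total: float) -> float:
    """exp(V(S)): the product of option counts under the uniform,
    natural-log assumption (an interpretation, not stated in the paper)."""
    return math.exp(total)


implied = {line: implied_combinations(v) for line, v in reported.items()}
rounded = {line: round(value) for line, value in implied.items()}
```

For these rows the exponentiated values fall within 0.001 of 105, 15, 225 and 540 respectively, which is consistent with the equally likely case but does not by itself confirm how the probabilities were chosen.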
5 Conclusion
The uncertainty that arises due to the multiple synonyms of the words used in forming a
sentence has been measured. The measure is then used to identify vulnerable sentences. The
current work does not consider the context in selecting the appropriate word among several
options (synonyms). Note that the probability of using a particular word varies depending on
the context. Further investigation can be performed on deciding the probabilities
considering the context, which would be more useful in solving real-life problems.
6 Acknowledgement
The authors sincerely acknowledge TNQ Books and Journals for providing relevant data to
accomplish this work.
References
1. C. E. Shannon, “Prediction and entropy of printed English,” Bell System Technical Journal, vol. 30,
no. 1, pp. 50–64, January 1951.
2. A. Gelbukh and I. A. Bolshakov, Advances in Web Intelligence. Berlin Heidelberg: Springer,
2004, ch. On Correction of Semantic Errors in Natural Language Texts with a Dictionary of
Literal Paronyms, pp. 105–114.
3. I. A. Bolshakov and A. Gelbukh, “On detection of malapropisms by multistage collocation testing,”
in 8th Intern. Conference on Applications of Natural Language to Information Systems NLDB2003,
A. Dusterhoft and B. Talheim, Eds., Bonn, 2003, pp. 28–41.
4. G. Hirst and D. St-Onge, “Lexical chains as representation of context for detection and corrections
of malapropisms,” in WordNet: An Electronic Lexical Database, C. Fellbaum, Ed. The MIT Press.
5. R. H. Granger, The NOMAD System: Expectation-Based Detection and Correction of Errors
during Understanding of Syntactically and Semantically Ill-Formed Text.
6. C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal,
vol. 27, no. 3, pp. 379–423, July–October 1948.
ISSN No: 2395-5546