Scientific Voyage, Volume 2, Issue 2, Date of Publication: May 2016

Vulnerability measure of an English sentence

Dinabandhu Bhandari1, Sumon Ghosh1, and S. K. Venkatesan2
1 Heritage Institute of Technology, Kolkata
[email protected]; [email protected]
2 TNQ Books and Journals, Chennai, India
[email protected]

Abstract. A sentence is a group of words that must express a complete idea, sense or meaning. The meaning of a sentence depends primarily on the use of appropriate words. Confusion arises when multiple options (synonyms) are available for a word. In this work, we introduce a vulnerability measure of a sentence that quantifies the ambiguity due to the various options available for the words used in the sentence. This measure helps authors and copy editors identify the vulnerable sentences in a document. Though the selection of appropriate words depends on the context, here we have considered only the synonyms of the words in defining the measure. The context will be considered in subsequent investigations.

Keywords: Uncertainty measure; Entropy; Ill-formed sentence; Well-formed sentence; Sentence vulnerability

1 Introduction

Reading a document and feeling satisfied with its quality is natural to human beings. When we read a novel, a story, an essay or even an editorial, we perceive satisfaction in terms of the ease of writing or flow of the document, the quality of the language and the information content. Naturally, different readers consider different sets of characteristics to judge the quality of a document, depending on their objective and on their knowledge and experience in the relevant topic or subject. When an academician reads a recently published article in a journal or proceedings, the reader expects to become aware of the state of the art of the subject and to enhance his/her knowledge of the topic.
Among all these, the primary criterion of a quality document is its language: how well it is written. More specifically, the very first feature considered while judging an article is the way it has been written. Good use of words makes an article easy to read, understand and interpret. The quality of an article is measured mainly on two aspects:

1. Syntactic: The way in which words and punctuation are used and arranged to form phrases, clauses and sentences. It is quite straightforward to measure the syntactic quality of an article. The following characteristics indicate, to a great extent, the syntactic quality.
(a) Grammar
(b) Use of conjunctive adverbs, articles
(c) Repeated words
(d) Punctuation
2. Semantic: The study of the meaning of language. It also deals with varieties and changes in the meaning of words, phrases, sentences and text.

Examining the semantic aspect of an article is comparatively more challenging than examining the syntactic aspect.

ISSN No: 2395-5546

However, in our experience, a correct sentence can sometimes be rewritten in an improved fashion to express the author's intended meaning properly. Such improvements are observed mainly in the choice of a proper verb, noun, conjunctive adverb or adjective. Though there is no unique way to write a sentence to express an idea, one can easily identify a well-formed sentence. Experienced authors use the right words at the right places, whereas less experienced or new authors sometimes struggle to choose the right word to express their ideas. When multiple choices are available for a word in a sentence, uncertainty arises in the author's mind. This is one of the primary reasons for writing an ill-formed sentence.
The uncertainty or ambiguity in the author's mind reduces as he/she gains experience. Over time, the author tries to take into consideration the neighboring sentences, paragraphs and even the whole document or article. Clearly, the uncertainty reduces as the author's knowledge and experience grow.

In the era of digitization, various attempts have been made to judge the quality of a document with respect to its language and to develop systems that would automatically improve the quality of the document. Significant research is in progress to develop automatic copy editing systems adopting machine learning techniques. To the best of the authors' knowledge, the first such attempt dates back to 1951, when C. E. Shannon defined the entropy of printed English and predicted letters when the preceding text is known [1]. In [2], Gelbukh and Bolshakov worked on the automatic correction of words in a sentence using a dictionary of one-letter-distant literal paronyms. Researchers have also worked on the correction of malapropisms in a sentence [3, 4]. In [5], Richard H. Granger worked on the detection and correction of errors during the understanding of syntactically and semantically ill-formed text using the NOMAD system.

In this article, an attempt has been made to define a vulnerability measure that computes the uncertainty involved in a sentence due to the multiple synonyms (options) available for the words used in it. This uncertainty considers only local information. The measure can be used to find how vulnerable a sentence is while verifying its quality. So far, it has been used to alert copy editors to the vulnerability of sentences so that they can pay closer attention to them while editing a document. The experiment was well appreciated by the editors of a renowned publishing company.

A brief discussion on the quality of sentences is provided in section 2. Section 3 introduces the vulnerability measure of a sentence in a document.
The experimental results are presented in section 4.

2 Quality of a sentence in a document

The sentences of a document should have proper meaning; otherwise the meaning of the document would not be clear. In general, if the sentences are not constructed properly, readers do not understand what the authors mean. Native speakers of English already have an instinctive knowledge of its grammar and of the appropriate usage of words. It is this knowledge which enables one to distinguish a well-formed English sentence from one which is clearly ill-formed. For example, native speakers know that the following sentence is well-formed:

Sumon plays the piano.

Native speakers can produce and understand a sentence like this without even thinking about its grammar. Conversely, in the course of everyday communication, no native speaker would ever produce this:

piano plays sumon.

One recognizes instinctively that there is something very wrong with this sentence, because it doesn't make sense.

Another aspect of constructing a well-formed sentence is the selection of appropriate words. For example, consider the following two sentences:

1) Sumon scored 90% in Mathematics.
2) Sumon obtained 90% marks in Mathematics.

Note that both sentences are grammatically correct and readers will have no problem understanding the information they convey. However, in this context, the second one is the appropriate one. Therefore, it is very important to select the correct word from several synonyms considering the context.

1. Ill formed: A sentence that is grammatically correct but whose words are not selected properly is known as an ill-formed sentence. For example,
(a) There is interest among young people about the history of India.
(b) When Buddha was born, there was need of a spiritual leader in India.
2. Well formed: A sentence that is grammatically correct and whose words are well selected is known as a well-formed sentence. For example,
(a) Young people have many interests about the history of India.
(b) At the time Buddha was born, India was in need of a great spiritual leader.

In general, when the author is a novice without much experience, every option is equally likely to be used. As the author gains experience, the probabilities of the options begin to differ. Let us first consider a hypothetical scenario where a computer (dumb person) writes a sentence from a given set of words. By following the rules of grammar, one can automatically position each part of speech in the sentence, but confusion arises when multiple options (synonyms) are available for a particular word. In this case, the computer randomly selects a word from the set of possible options. This ambiguity results in an ill-formed sentence.

In this work, an attempt has been made to propose a measure that quantitatively judges the correctness of a sentence. In other words, the measure quantifies the appropriateness of the use of words in forming a sentence. The proposed measure tries to quantify the uncertainty or ambiguity present in a sentence. As mentioned earlier, the uncertainty arises due to the lack of knowledge in choosing an appropriate word among the several options available to the author. The ambiguity then reduces as the author gains experience and takes into account the information available in the neighboring sentences, paragraphs and the document. Likewise, the proposed measure should be adaptive enough to take the associated information into account. In the following section, we define a vulnerability measure based on the entropy measure defined by Shannon.
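The hypothetical scenario described earlier in this section, in which a computer picks among the available synonyms at random, can be sketched in Python as follows. This is only an illustration under stated assumptions: the synonym table `HYPOTHETICAL_OPTIONS` and the function `naive_write` are invented for this sketch and are not taken from the original work.

```python
import random

# Hypothetical synonym table, invented for illustration only.
# A word that is not listed is treated as having a single option: itself.
HYPOTHETICAL_OPTIONS = {
    "scored": ["scored", "obtained", "got", "secured"],
    "marks": ["marks", "points", "grades"],
}


def naive_write(words, options, rng):
    """Replace each word by a uniformly random choice among its options.

    This mimics a novice (or a "dumb" computer) for whom every option
    is equally likely, regardless of the context.
    """
    return [rng.choice(options.get(word, [word])) for word in words]


rng = random.Random(0)  # fixed seed so that a run is repeatable
sentence = ["Sumon", "scored", "90%", "marks", "in", "Mathematics"]
print(" ".join(naive_write(sentence, HYPOTHETICAL_OPTIONS, rng)))
```

Words with a single option are always reproduced unchanged, so only the words with several synonyms contribute to the ambiguity of the generated sentence.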
3 Vulnerability measure of sentences in a document

3.1 Vulnerability measure due to synonyms

The vulnerability measure is defined based on the ambiguity or uncertainty due to the use of a word that has multiple synonyms. It is assumed that the sentences are grammatically correct and that they fail to be well formed only because of a lack of expertise in the use of proper words. Note that the vulnerability of a sentence depends on the options available for each word and on the number of words used in the sentence.

Let a sentence $S$ consist of $n$ words $w_1, w_2, \cdots, w_n$. Let the word $w_i$ have $m_i$ options (synonyms). Then the vulnerability measure $V(w_i)$ should be a function of $m_i$ having the following properties:

1. the measure increases with $m_i$, and
2. the measure is maximum when all the options are equally likely, i.e., have the same probability of use.

Let $p_{ij}$ be the probability of selecting the $j$th option among the $m_i$ available options. When the options are equally likely, $p_{ij} = \frac{1}{m_i}$. Now, following the definition of Shannon's entropy [6], we define the vulnerability measure corresponding to a specific word $w_i$ as

\[ V(w_i) = -\sum_{j=1}^{m_i} p_{ij} \log(p_{ij}) \tag{1} \]

For a sentence consisting of $n$ words, the vulnerability measure can be defined as

\[ V(S) = \sum_{i=1}^{n} V(w_i) = -\sum_{i=1}^{n} \sum_{j=1}^{m_i} p_{ij} \log(p_{ij}) \tag{2} \]

Clearly, the vulnerability measure $V(S)$ of a sentence depends on two factors:

1. the number of options available for a word and their probability of use, and
2. the number of words in the sentence.

These agree with our understanding of writing English sentences. An experienced writer tends to write short sentences, as a sentence with many words is prone to be wrong. Likewise, a sentence constructed of words having several synonyms can also be error prone. The measure therefore satisfies the general perception of English writing.
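The word-level and sentence-level measures in (1) and (2) can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. It assumes the natural logarithm (the text does not state the base) and equally likely options whenever no probabilities are supplied; the function names `word_vulnerability` and `sentence_vulnerability` are hypothetical.

```python
import math
from typing import Optional, Sequence


def word_vulnerability(num_options: int,
                       probs: Optional[Sequence[float]] = None) -> float:
    """Equation (1): entropy over the m_i options of a single word.

    If no probabilities are given, all options are assumed equally
    likely (p_ij = 1 / m_i), which is the novice-author case.
    """
    if num_options < 1:
        raise ValueError("a word needs at least one option (itself)")
    if probs is None:
        probs = [1.0 / num_options] * num_options
    if len(probs) != num_options:
        raise ValueError("one probability is required per option")
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Terms with p = 0 contribute nothing and are skipped to avoid log(0).
    return sum(-p * math.log(p) for p in probs if p > 0)


def sentence_vulnerability(option_counts: Sequence[int]) -> float:
    """Equation (2): sum of the word measures, with equally likely options."""
    return sum(word_vulnerability(m) for m in option_counts)


# Assumed example: two words with 4 and 5 options respectively.
print(sentence_vulnerability([4, 5]))  # log(4) + log(5) = log(20), about 2.9957
```

With equally likely options, the word measure reduces to $\log(m_i)$, so it grows with the number of options, and any unequal distribution over the same options gives a smaller value.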
To account for the second factor, one can define the average vulnerability measure simply as

\[ V_A(S) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_i} p_{ij} \log(p_{ij}) \tag{3} \]

While judging a sentence, one should check the overall vulnerability measure defined in (2); the average vulnerability can be computed using (3). If the vulnerability measure of a sentence exceeds some specified value (threshold), one can consider the sentence a complex one. Similarly, if the average vulnerability measure exceeds some specified value, the words used in constructing the sentence have more options, thereby making the sentence more ambiguous.

4 Result and discussion

To demonstrate the efficacy of the proposed vulnerability measure, we have considered multiple sets of documents. In the process, the entropy of each sentence is calculated and the vulnerable sentences are identified. Here, complexity means how vulnerable a sentence is, and vulnerability means how error prone a sentence is in terms of its construction. The results are then verified by expert copy editors. The following tables show the sentences of some paragraphs and their measures. As examples, the vulnerability measures of the sentences of two paragraphs (viz. Paragraph 1 and Paragraph 2), presented below, are calculated and shown in table 1 and table 2.

Paragraph 1: The distortions of Dharma that we now a days see all around us are largely the result of foreign education. The English word religion has substantially contributed in distorting the pure meaning of dharma. The British first heard the word when they came to India. They had no word of an equally comprehensive meaning. So they translated it as religion. Such translation has distorted the meaning of Indian words. Dharma is a very wide term which includes many religions. A religion is a system of worship, and there are many such systems prevailing in this land. But in spite of these systems and sects dharma is one.
Dharma is that which is good for everyone irrespective of the system of worship and can open up for him the way to salvation.

| Line No. | Vulnerable Words | Number of Words | Vulnerability Measure | Average Vulnerability |
|---|---|---|---|---|
| 1 | distortions, Dharma, see, largely, result, foreign, education. | 7 | 10.6736 | 1.524799 |
| 2 | english, word, religion, substantially, contributed, distorting, pure, meaning, dharma. | 9 | 12.820176 | 1.424464 |
| 3 | british, first, heard, word, came. | 5 | 10.760537 | 2.152107 |
| 4 | word, equally, comprehensive, meaning. | 4 | 6.492239 | 1.623059 |
| 5 | translated, religion. | 2 | 2.995732 | 1.497866 |
| 6 | translation, distorted, meaning, indian, words. | 5 | 10.709963 | 2.141992 |
| 7 | dharma, wide, term, includes, many, religions. | 6 | 6.556778 | 1.092796 |
| 8 | religion, system, worship, many, such, systems, prevailing, land. | 8 | 12.280530 | 1.535066 |
| 9 | spite, systems, sects, dharma, one. | 5 | 6.186208 | 1.237241 |
| 10 | dharma, good, everyone, irrespective, system, worship, way, | 8 | 11.053743 | 1.381717 |

Table 1. Vulnerability measures of the sentences of Paragraph 1

Note that the vulnerability measures of the first and third sentences are almost equal, though the first has more words. Likewise, the third and sixth sentences have the same number of words and nearly the same vulnerability measure; the measure of the third sentence is slightly higher because the words used in it have more synonyms.

Paragraph 2: God cannot act contrary to dharma. If he does then he is not omnipotent. Adharma is a characteristic of weakness not of strength. If fire instead of radiating heat, dies out, it is no longer strong. Strength lies not in unrestrained behavior, but in well regulated action. Therefore god who is omnipotent is also self regulated and consequently fully in tune with dharma. God descends in the human body to destroy adharma and reestablish dharma, not to act on passing whims and fancies.
Hence even god can do everything but act contrary to dharma. But for the risk of being misunderstood, one can say that dharma is even greater than God. The universe is sustained because He acts according to dharma.

| Line No. | Vulnerable Words | Number of Words | Vulnerability Measure | Average Vulnerability |
|---|---|---|---|---|
| 1 | cannot, act, contrary, dharma. | 4 | 4.653960 | 1.163490 |
| 2 | does, omnipotent. | 2 | 2.708050 | 1.354025 |
| 3 | adharma, characteristic, weakness, strength. | 4 | 5.416100 | 1.354025 |
| 4 | fire, instead, radiating, heat, dies, out, longer, strong. | 8 | 18.302599 | 2.287824 |
| 5 | strengthles, unrestrained, behaviour, well, regulated, action. | 6 | 9.852615 | 1.642102 |
| 6 | omnipotent, self, regulated, consequently, fully, tune, dharma. | 7 | 6.291569 | 0.898795 |
| 7 | descends, human, body, destroy, adharma, reestablish, dharma, act, passing, whims, fancies. | 11 | 15.109689 | 1.373608 |
| 8 | even, god, do, everything, act, contrary, dharma. | 7 | 11.451900 | 1.635985 |
| 9 | risk, being, misunderstood, say, dharma, even, greater, God. | 8 | 13.089509 | 1.636188 |
| 10 | universe, sustained, because, act, according, dharma. | 6 | 7.390181 | 1.231696 |

Table 2. Vulnerability measures of the sentences of Paragraph 2

In the second paragraph, the sixth sentence is less vulnerable because the words used in it have fewer synonyms. In general, the vulnerability measure depends on the length of the sentence and the words used in it. If a sentence carries many words and those words have many synonyms, the sentence is more vulnerable, so it is expected to be more error prone. In the first paragraph, consider the sentence 'The British first heard the word when they came to India' and the sentence 'But in spite of these systems and sects dharma is one'. In both cases the number of vulnerable words is 5, but the vulnerability measure of the first sentence (10.760537) is greater than that of the second (6.186208), because the words used in the first sentence have more synonyms than those used in the second. The possibility of errors is therefore much higher in the first sentence than in the second.

5 Conclusion

The uncertainty that arises due to the multiple synonyms of the words used in forming a sentence has been measured. The measure is then used to identify the vulnerable sentences. The current work does not consider the context in selecting the appropriate word among several options (synonyms). Note that the probability of using a particular word varies with the context. Further investigation can determine the probabilities by taking the context into account, which would be more useful in solving real-life problems.

6 Acknowledgement

The authors sincerely acknowledge TNQ Books and Journals for providing relevant data to accomplish this work.

References

1. C. E. Shannon, “Prediction and entropy of printed English,” Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, January 1951.
2. A. Gelbukh and I. A. Bolshakov, Advances in Web Intelligence. Berlin Heidelberg: Springer, 2004, ch. On Correction of Semantic Errors in Natural Language Texts with a Dictionary of Literal Paronyms, pp. 105–114.
3. I. A. Bolshakov and A. Gelbukh, “On detection of malapropisms by multistage collocation testing,” in 8th Intern. Conference on Applications of Natural Language to Information Systems NLDB2003, A. Dusterhoft and B. Talheim, Eds., Bonn, 2003, pp. 28–41.
4. G. Hirst and D. St-Onge, “Lexical chains as representation of context for detection and corrections of malapropisms,” in WordNet: An Electronic Lexical Database, C. Fellbaum, Ed. The MIT Press.
5. R. H. Granger, The NOMAD System: Expectation-Based Detection and Correction of Errors during Understanding of Syntactically and Semantically Ill-Formed Text.
6. C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, July-October 1948.