12th WSEAS International Conference on COMPUTERS, Heraklion, Greece, July 23-25, 2008
ISSN: 1790-5109, ISBN: 978-960-6766-85-5

Text Watermarking by Syntactic Analysis

Mi-Young Kim
School of Computer Science and Engineering, Sungshin Women’s University, Dongseon-dong 3-ga, Seongbuk-gu, Seoul, 136-742, South Korea

Abstract: - This paper explores a method of text watermarking for Korean based on syntactic analysis. The proposed method is useful for agglutinative languages, such as Korean and Turkish, whose syntactic constituent order is relatively free. Our natural language watermarking method consists of several steps. First, we construct a syntactic dependency tree of the input text. Next, we choose target syntactic constituents to move. Then, we embed watermark bits: if the watermark bit does not coincide with the movement bit of the target constituent, we move the syntactic constituent in the syntactic tree. Finally, from the modified syntactic tree, we obtain a marked text. The experimental results show that the coverage of our method is 75%, which exceeds that of previous systems, and that the rate of unnatural sentences in the marked text is lower than that of previous systems. Although the information-hiding capacity is lower than that of previous systems, our marked text keeps the same style and carries the same information without semantic distortion.

Key-Words: - Natural language watermarking, syntactic analysis, text watermarking, information hiding

1 Introduction

Natural language watermarking is an emerging technique at the intersection of natural language processing and security technologies. It aims at embedding additional information in the text itself, with the goals of subliminal communication and hidden information transport, of content and authorship authentication, and of enriching the text with metadata [2]. Watermarking techniques have been explored extensively for multimedia documents in the last decade [1]. In contrast, studies on natural language watermarking are just starting. In [3,4,6], synonym substitution techniques for watermarking are addressed and various attack scenarios are described. In [8], Atallah et al. use a quadratic residue technique to insert a watermark into a given text via synonym substitution. The ambiguity that synonym substitution induces in word precision led Topkara et al. [12] to syntax-based natural language watermarking. Their technique focuses on syntactic sentence paraphrasing. It turns out that the syntactic approach offers the most prolific set of text watermarking tools. We conclude that syntactic modification is useful for text watermarking without semantic distortion when compared with alternative tools such as synonym substitution.

Note that Korean, as an agglutinative language, differs significantly from Indo-European languages such as English with respect to its free word order. For this reason, we believe that Korean and other agglutinative languages provide a good ground for text watermarking based on syntactic constituent movement. This paper proposes Korean text watermarking by syntactic analysis. We embed a watermark in the original text, creating a marked text that preserves the meaning, and we move only subject constituents.

This paper is organized as follows. Section 2 presents previous work on text watermarking. Section 3 explains our method for Korean text watermarking in detail. Section 4 describes the data used for our experiments and presents experimental results that demonstrate the effectiveness of our method for Korean text watermarking. Finally, we provide our conclusions in Section 5.

2 Previous Work

M. Atallah et al. [7,8] proposed a technique for information hiding in natural language text. Moreover, they established the basic technique for embedding a resilient watermark in text by combining a number of information assurance and security techniques with the advanced methods and resources of natural language processing. A semantically based scheme significantly improves the information-hiding capacity of English text by modifying the granularity of meaning of individual terms and sentences. However, this scheme is suitable only for English, and it was merely conceptual.

A technique for embedding secret data without greatly changing the meaning of a text replaces words in the cover text with synonyms [3,4,6]. However, synonym substitution degrades documents in which delicate nuance matters, such as literary works. There are also cases in which a wrong word is selected from among many synonym candidates. Moreover, the method needs a large synonym dictionary and a huge collocation database [13].

Some methods address text watermarking for agglutinative languages. H. M. Meral et al. [2] proposed 21 syntactic tools for Turkish text watermarking, and O. Takizawa et al. [13] suggested adjusting new-line positions for Japanese text. The latter method has the limitations that the message sender and recipient must share the same secret rule table and that the total embedding rate is too low.

Topkara et al. [12] also proposed syntax-based natural language watermarking using syntactic sentence paraphrasing. They argue that the syntactic approach is useful for natural language watermarking without semantic distortion. We conclude that text watermarking based on the syntactic tree is effective. Therefore, using the free word order of Korean, we perform text watermarking by syntactic constituent movement. A syntactic constituent that functions as a subject is selected as the target constituent for movement. The detailed method is explained in the next section.

3 Text Watermarking by Syntactic Analysis

Watermark embedding involves several steps. First, we perform syntactic parsing and obtain a syntactic dependency tree. Next, we choose target syntactic constituents for movement in a sentence and determine the moving direction. After that, we embed watermark bits and determine whether the movement bit corresponds to the watermark bit. If it does not, we move the target constituent in the determined direction. Finally, from the modified syntactic tree, we generate a marked sentence. We explain each procedure in detail.

3.1 Syntactic analysis

To perform syntactic analysis, we use the Korean syntactic dependency parser of M. Y. Kim et al. [5]. A syntactic dependency parser determines the syntactic relations between the words in a sentence. Fig. 1 shows an example of a syntactic dependency tree. In the tree, a parent node functions as the syntactic governor of its child nodes, and a child node functions as the syntactic dependent of its parent node.

Example sentence: Yeoreumcheol yubaeksaek-eui guiyeo-un ggot-i pi-myeon ontong geu jubyeon-eun gammiro-un hyanggi-ro byeolcheonji-reol iru-bnida. (In summer, when white and cute flowers bloom, the sweet perfume of the flowers makes their surroundings a beautiful world.)
[Figure: dependency tree of the example sentence, rooted at iru-bnida (make), with edges labeled conjunctive, adverbial, subject, adnominal, and object. (a) and (b) mark the positions to which the subject jubyeon-eun (surroundings) can move, and (c) marks the position to which the subject ggot-i (flower) can move.]
Fig. 1. Example of a syntactic dependency tree
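The governor/dependent relation described above can be illustrated with a minimal sketch. The head-index table below is a reconstruction of the Fig. 1 example from the labels in the figure and the discussion in the text. It is an assumption made for illustration, not output of the parser in [5], and the names TREE, governor, dependents, and subjects are hypothetical.

```python
# Each entry is (word, index of its governor, relation to the governor).
# Reconstructed by hand from the Fig. 1 example; an illustrative assumption,
# not the output of the dependency parser cited as [5].
TREE = [
    ("yeoreumcheol", 4, "adverbial"),        # 0  summer
    ("yubaeksaek-eui", 3, "adnominal"),      # 1  white
    ("guiyeo-un", 3, "adnominal"),           # 2  cute
    ("ggot-i", 4, "subject"),                # 3  flower
    ("pi-myeon", 11, "conjunctive"),         # 4  bloom
    ("ontong", 11, "adverbial"),             # 5  whole
    ("geu", 7, "adnominal"),                 # 6  the
    ("jubyeon-eun", 11, "subject"),          # 7  surroundings
    ("gammiro-un", 9, "adnominal"),          # 8  sweet
    ("hyanggi-ro", 11, "adverbial"),         # 9  with perfume
    ("byeolcheonji-reol", 11, "object"),     # 10 beautiful world
    ("iru-bnida", -1, "root"),               # 11 make
]


def governor(i):
    """Return the word that node i depends on, or None for the root."""
    head = TREE[i][1]
    return None if head < 0 else TREE[head][0]


def dependents(i):
    """Return the words whose governor is node i, in surface order."""
    return [word for word, head, _ in TREE if head == i]


def subjects():
    """Return the words attached to their governor as 'subject'."""
    return [word for word, _, rel in TREE if rel == "subject"]


if __name__ == "__main__":
    root = next(i for i, (_, head, _) in enumerate(TREE) if head < 0)
    print("dependents of root:", dependents(root))
    print("governor of ggot-i:", governor(3))
    print("subjects:", subjects())
```

In this representation the siblings of a node are simply the other dependents of its governor, which is the information the constituent selection step relies on.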
3.2 Selection of syntactic constituents for movement

A sentence basically consists of a subject and a predicate, so a sentence usually contains at least one subject. We therefore select subject constituents for movement. A selected target subject constituent moves to the position of one of its siblings, so it must have at least one sibling that satisfies the following conditions.
1. The sibling should not be adnominal.
2. The sibling should not be a conjunctive adverbial.
An adnominal constituent must be adjacent to the nominal constituent that follows it, so we cannot change its position. Similarly, a conjunctive adverbial is usually located in the first position of a sentence, so we cannot move it either. Hence, to move a syntactic constituent to the position of its sibling, the sibling must be neither adnominal nor a conjunctive adverbial.

In Fig. 1, there are two subjects, ‘jubyeon-eun’ (surroundings) and ‘ggot-i’ (flower). Both can be target constituents. A target constituent can move in one of two directions, ‘left’ or ‘right’. ‘Left’ means that it exchanges position with the nearest left sibling that is neither adnominal nor a conjunctive adverbial. ‘Right’ means that it exchanges position with the nearest right sibling. The subject ‘jubyeon-eun’ (surroundings) has two candidate positions, (a) and (b). The subject ‘ggot-i’ (flower) has only position (c), because it has only one sibling.

If a subject can move both left and right, we select the nearer position. In the surface sentence, one word (‘ontong’) lies between (a) and the original subject position, and two words (‘gammiro-un’ and ‘hyanggi-ro’) lie between (b) and the original subject position. We therefore determine that the subject ‘jubyeon-eun’ (surroundings) moves to (a), not (b), because (a) is nearer to the original subject position. If the distances to the left and to the right are equal, the subject moves to the right position.

We assign a movement bit to each subject constituent, to be compared with the watermark bit in the next procedure. If a subject is selected as a target constituent to move, its movement bit is ‘1’; otherwise, its movement bit is ‘0’.

3.3 Embedding watermark bits

After determining the target constituents to move, we embed watermark bits. If a sentence has target subject constituents to move, we embed one watermark bit for each subject constituent in the sentence. If a subject constituent does not belong to the target constituents, its corresponding watermark bit becomes ‘1’. Otherwise, the watermark bit is embedded randomly. The target constituent is moved if the watermark message bit to be embedded differs from the movement bit; otherwise, no change is made to the syntactic tree.

When moving the target constituent, we also move all of its child nodes. In Fig. 2, the node ‘geu’ (the), the child of the target subject node ‘jubyeon-eun’ (surroundings), is moved with it. Similarly, the nodes ‘yubaeksaek-eui’ (white) and ‘guiyeo-un’ (cute) move with their parent node ‘ggot-i’ (flower). After the target constituent movement, we obtain the converted parse tree shown in Fig. 2. Then we recombine the nodes of the parse tree and obtain the marked sentence shown at the bottom of Fig. 2.

Original sentence: Yeoreumcheol yubaeksaek-eui guiyeo-un ggot-i pi-myeon ontong geu jubyeon-eun gammiro-un hyanggi-ro byeolcheonji-reol iru-bnida. (In summer, when white and cute flowers bloom, the sweet perfume of the flowers makes their surroundings a beautiful world.)
[Figure: the dependency tree of Fig. 1 after movement. Under pi-myeon (bloom), the subject ggot-i (flower) with its adnominals now precedes yeoreumcheol (summer); under iru-bnida (make), the subject jubyeon-eun (surroundings) with geu (the) now precedes ontong (whole).]
Marked sentence: Yubaeksaek-eui guiyeo-un ggot-i yeoreumcheol pi-myeon geu jubyeon-eun ontong gammiro-un hyanggi-ro byeolcheonji-reol iru-bnida.
Fig. 2. The converted tree after syntactic subject movement

4 Experimental Results

We used 2,080 sentences from the corpus of Matec99 (Morphological Analyzer and Tagger Evaluation Contest in Korean). As shown in Table 1, the average number of words per sentence is 15.67.

Table 1 Performance of our system
  Number of sentences                                       2,080
  Average number of words per sentence                      15.67
  Sentences selected for embedding watermark bit string     75%
  Unsuitable sentences among marked sentences               29.29%
  Unsuitable sentences among non-transformed sentences      14.81%

M. Topkara et al. [3,12] measured performance using BLEU [10] and NIST [11], which are machine translation evaluation methods. Using those systems for sentence-by-sentence distance evaluation is neither sufficient nor accurate for evaluating natural language watermarking. BLEU is very sensitive to the precision of words and their positions in the generated sentence. Some transformations (e.g., passivization, syntactic constituent movement) change the word order heavily while keeping the meaning very close to the original. A better way of evaluating the distortion made by a natural language watermarking system is to measure the distortion at the full-text level. We therefore use subjective human evaluation, as H. M. Meral et al. [2] did. In this evaluation, humans assess the texts and show their reactions through editing attempts. The subjects are given marked texts and asked to edit them for improved intelligibility and style. This is a blind test because the subjects are not aware that text watermarking has taken place. Three human evaluators checked the 2,080 sentences for unnatural word arrangements. Table 1 shows the rate of unsuitable sentences among marked sentences and among untransformed sentences. It is also interesting to note that untransformed sentences received edit hits at a rate of 14.81%, implying that editing is not a crucial problem for text watermarking.

In Table 1, 75% of the sentences are selected for embedding watermark bits. As shown in Table 2, this coverage of 75% exceeds that of Y. L. Chiang et al. [9]. The average edit rate is 25.67%, which is better than that of H. M. Meral et al. [2], as shown in Table 3.

Table 2 Comparison of coverage and information-hiding capacity
                                                                 Y. L. Chiang [9]   Our system
  Coverage of sentences selected for embedding watermark bit    6.7%               75%
  Relation of information-hiding capacity                       1:6.6              1:14.05

Table 3 Comparison of average edit rate
                      H. M. Meral [2]   Our system
  Average edit (%)    55.9%             25.67%

The relation of information-hiding capacity indicates how many terms of text are needed to hide one watermark bit. In Table 2, the value 1:6.6 for the system of Y. L. Chiang et al. [9] means that one watermark bit can be hidden for every 6.6 terms of text. In our experiment, the relation of capacity is 1:14.05, which is worse than that of Y. L. Chiang et al. [9]. However, our marked text keeps the same style and carries the same information without semantic distortion. We conclude that our natural language watermarking based on syntactic constituent movement shows reasonable performance without semantic or stylistic distortion.
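The selection and embedding rules of Sections 3.2 and 3.3 can be sketched as follows, using the example sentence of Fig. 1. This is a minimal sketch under stated assumptions, not the implementation used in the experiments. The tree is reconstructed by hand from the figure and the text; the names Node, BLOCKED, swap_target, targets, and embed are hypothetical. Sentences are rebuilt by emitting every dependent before its governor, which matches the example but is an assumption about word order. Distance is counted as the number of words in the sibling subtrees that the exchange spans. The sketch treats the figure's "conjunctive" label as the conjunctive adverbial category, and it ignores subjects that are not target constituents.

```python
from dataclasses import dataclass, field

# Relations of siblings that must not be exchanged with a subject.
# Mapping the figure's "conjunctive" label to the "conjunctive adverbial"
# category of Section 3.2 is an assumption.
BLOCKED = {"adnominal", "conjunctive"}


@dataclass
class Node:
    word: str
    rel: str = "root"
    children: list = field(default_factory=list)


def size(n):
    """Number of words in the subtree rooted at n."""
    return 1 + sum(size(c) for c in n.children)


def linearize(n):
    """Rebuild the word sequence, emitting dependents before their governor."""
    out = []
    for c in n.children:
        out.extend(linearize(c))
    out.append(n.word)
    return out


def swap_target(parent, i):
    """Index of the sibling that subject parent.children[i] exchanges with."""
    sibs, best = parent.children, None
    for step in (1, -1):  # right is tried first, so a tie goes to the right
        dist, j = 0, i + step
        while 0 <= j < len(sibs):
            dist += size(sibs[j])
            if sibs[j].rel not in BLOCKED:
                if best is None or dist < best[0]:
                    best = (dist, j)
                break
            j += step
    return None if best is None else best[1]


def targets(root):
    """(governor, subject) pairs of movable subjects, in surface order."""
    found = []

    def visit(n):
        for k, c in enumerate(n.children):
            visit(c)
            if c.rel == "subject" and swap_target(n, k) is not None:
                found.append((n, c))

    visit(root)
    return found


def embed(root, bits):
    """Move each target subject whose watermark bit differs from its movement bit."""
    for (parent, subj), bit in zip(targets(root), bits):
        movement_bit = 1  # every target subject has movement bit 1
        if bit != movement_bit:
            i = parent.children.index(subj)
            j = swap_target(parent, i)
            sibs = parent.children
            sibs[i], sibs[j] = sibs[j], sibs[i]  # child nodes move with the subject
    return " ".join(linearize(root))


def example_tree():
    """Hand-built tree for the Fig. 1 sentence (an illustrative reconstruction)."""
    N = Node
    return N("iru-bnida", "root", [
        N("pi-myeon", "conjunctive", [
            N("yeoreumcheol", "adverbial"),
            N("ggot-i", "subject", [N("yubaeksaek-eui", "adnominal"),
                                    N("guiyeo-un", "adnominal")]),
        ]),
        N("ontong", "adverbial"),
        N("jubyeon-eun", "subject", [N("geu", "adnominal")]),
        N("hyanggi-ro", "adverbial", [N("gammiro-un", "adnominal")]),
        N("byeolcheonji-reol", "object"),
    ])


if __name__ == "__main__":
    print(embed(example_tree(), [0, 0]))
    print(embed(example_tree(), [1, 1]))
```

With the bits [0, 0], both subjects differ from their movement bit and are exchanged with their nearest eligible sibling, which yields the word order of the marked sentence in Fig. 2 (in lowercase). With [1, 1], the tree is left unchanged and the original word order is returned.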
5 Conclusion

We propose natural language watermarking for Korean based on syntactic analysis. Exploiting the fact that an agglutinative language permits free word order, we perform syntactic constituent movement in a syntactic tree. Because a sentence usually contains at least one subject, we choose subject constituents for movement. The overall procedure consists of several steps. First, we perform syntactic dependency parsing of the original text. Next, we choose target subject constituents for movement. Then, we embed watermark bits. If the movement bit of a target constituent does not correspond to its watermark bit, we perform the movement. Finally, from the modified parse tree, we obtain the marked text. The experimental results show that the coverage of our method is 75% and the relation of information-hiding capacity is 1:14.05. In addition, the edit rate is 25.67%, which is better than that of H. M. Meral et al. [2]. We conclude that our syntactic tree-based method using subject movement is useful for watermarking Korean text. Besides Korean, other agglutinative languages also allow syntactic constituents to move freely within some limited boundaries. Using this characteristic, we will try to apply our method to other languages to demonstrate that it is effective for them as well.

Acknowledgements

This work was supported by the Sungshin Women’s University Research Grant of 2008.

References:
[1] I. Cox, M. L. Miller, J. A. Bloom, “Digital Watermarking”, Morgan Kaufmann, 2002
[2] H. M. Meral, E. Sevinc, E. Unkar, B. Sankur, A. S. Ozsoy, T. Gungor, “Syntactic tools for text watermarking”, In Proc. of the SPIE International Conference on Security, Steganography, and Watermarking of Multimedia Contents, 2007
[3] M. Topkara, C. M. Taskiran, E. J. Delp, “Natural language watermarking”, SPIE Conf. on Security, Steganography and Watermarking of Multimedia Contents, 2005
[4] C. M. Taskiran, M. Topkara, E. J. Delp, “Attacks on linguistic steganography systems using text analysis”, SPIE Conf. on Security, Steganography and Watermarking of Multimedia Contents, pp. 313-336, 2006
[5] Mi-Young Kim, Sin-Jae Kang, Jong-Hyeok Lee, “Resolving Ambiguity in Inter-chunk Dependency Parsing”, NLPRS, pp. 263-270, 2001
[6] U. Topkara, M. Topkara, M. J. Atallah, “The Hiding Virtues of Ambiguity: Quantifiably Resilient Watermarking of Natural Language Text through Synonym Substitutions”, In Proc. of ACM Multimedia and Security Conference, 2006
[7] M. Atallah, V. Raskin, C. F. Hempelmann, M. Karahan, R. Sion, K. E. Triezenberg, U. Topkara, “Natural language watermarking and tamperproofing”, Lecture Notes in Computer Science, 2002
[8] M. J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, S. Naik, “Natural language watermarking: design, analysis, and proof-of-concept implementation”, In Proc. of the International Information Hiding Workshop, 2001
[9] Y. L. Chiang, L. P. Chang, W. T. Hsieh, W. C. Chen, “Natural language watermarking using semantic substitution for Chinese text”, Lecture Notes in Computer Science, pp. 129-140, 2004
[10] K. Papineni, S. Roukos, T. Ward, W. Zhu, “BLEU: a method for automatic evaluation of machine translation”, In Proc. of 40th Annual Meeting of the ACL, 2002
[11] National Institute of Standards and Technology, Machine translation benchmark tests provided by the National Institute of Standards and Technology, http://www.nist.gov/speech/tests/mt/resources/
[12] M. Topkara, U. Topkara, M. J. Atallah, “Words are not enough: sentence level natural language watermarking”, In Proc. of 4th ACM International Workshop on Content Protection and Security (in conjunction with ACM Multimedia), 2006
[13] Osamu Takizawa, Kyoko Makino, Tsutomu Matsumoto, Hiroshi Nakagawa, Ichiro Murase, “Method of Hiding Information in Agglutinative Language Documents Using Adjustment to New Line Positions”, KES (3), pp. 1039-1048, 2005