Text Watermarking by Syntactic Analysis

12th WSEAS International Conference on COMPUTERS, Heraklion, Greece, July 23-25, 2008
Mi-Young Kim
School of Computer Science and Engineering, Sungshin Women’s University,
Dongseon-dong 3-ga, Seongbuk-gu, Seoul, 136-742
South Korea
Abstract: - This paper explores a method of text watermarking for Korean based on syntactic analysis. The
proposed method is useful for agglutinative languages, such as Korean and Turkish, in which the order of
syntactic constituents is relatively free. Our natural language watermarking method consists of several steps.
First, we construct a syntactic dependency tree of the input text. Next, we choose target syntactic constituents
to move. Then, we embed watermark bits. If the watermark bit does not coincide with the movement bit of the
target constituent, we move the syntactic constituent in the syntactic tree. Finally, from the modified syntactic
tree, we obtain a marked text. The experimental results show that the coverage of our method is 75%, which
outperforms that of previous systems, and that the rate of unnatural sentences in the marked text is lower than
that of previous systems. Although the information-hiding capacity is lower than that of previous systems, our
marked text keeps the same style and carries the same information without semantic distortion.
Key-Words: - Natural language watermarking, syntactic analysis, text watermarking, information hiding
1 Introduction
Natural language watermarking is an emerging technique at the intersection of natural language
processing and security technologies. It aims at embedding additional information in the text itself,
with the goals of subliminal communication and hidden information transport, of content and
authorship authentication, and of enriching the text with metadata [2]. Watermarking techniques have
been explored extensively for multimedia documents in the last decade [1]. In contrast, studies on
natural language watermarking are just starting.
In [3,4,6], techniques of synonym substitution for watermarking have been addressed and various
attack scenarios have been described. In [8], Atallah et al. attempted to use a quadratic residues
technique to insert a watermark into a given text via synonym substitution. The ambiguity induced on
word precision by the synonym substitution technique led Topkara et al. [12] to syntax-based natural
language watermarking. Their technique basically focuses on syntactic sentence paraphrasing. It
turns out that the syntactic approach offers the most prolific set of text watermarking tools. We
conclude that syntactic modification is useful for text watermarking without semantic distortion when
compared with alternative tools such as synonym substitution. Note that Korean, as an agglutinative
language, differs significantly from Indo-European languages such as English with respect to its free
word order. For this reason, we believe that Korean and other agglutinative languages provide a good
ground for text watermarking based on syntactic constituent movement.
This paper proposes Korean text watermarking by syntactic analysis. We embed a watermark in the
original text, creating a ciphertext that preserves the meaning, and we move only subject constituents.
This paper is organized as follows. Section 2 presents previous work on text watermarking. Section 3
explains our method for Korean text watermarking in detail. Section 4 describes the data used for our
experiments and shows experimental results that demonstrate that our method is effective for Korean
text watermarking. Finally, we provide our conclusions in Section 5.
2 Previous Work
M. Atallah et al. [7,8] proposed a technique for information hiding in natural language text. Moreover,
they established the basic technique for embedding a resilient watermark in text by combining a
number of information assurance and security techniques with the advanced methods and resources of
natural language processing. A semantically based scheme significantly improves the
information-hiding capacity of English text by modifying the granularity of meaning of individual
terms and sentences. However, this scheme is suitable only for English, and it was merely conceptual.
A technique for embedding secret data without greatly changing the meaning of a text has been
proposed, in which words in the cover text are replaced with synonyms [3,4,6]. However, synonym
substitution degrades documents in which delicate nuance is important, such as literary works. There
are also cases in which a wrong word is selected from among many synonym candidates. Moreover, the
method needs a large synonym dictionary and a huge collocation database [13].
Some methods have been proposed for text watermarking in agglutinative languages. H. M. Meral et
al. [2] proposed 21 syntactic tools for Turkish text watermarking, and O. Takizawa et al. [13]
suggested adjusting new line positions for Japanese text. The latter method has the limitations that
the message sender and recipient must share the same secret rule table, and that the total embedding
rate is too low.
Topkara et al. [12] also proposed syntax-based natural language watermarking using syntactic
sentence paraphrasing. They argue that the syntactic approach is useful for natural language
watermarking without semantic distortion.
We conclude that text watermarking based on the syntactic tree is effective. So, using the free word
order of Korean, we perform text watermarking by syntactic constituent movement. A syntactic
constituent that functions as a subject is selected as the target constituent for movement. The
detailed method is explained in the next section.

Example sentence: Yeoreumcheol yubaeksaek-eui guiyeo-un ggot-i pi-myeon ontong geu jubyeon-eun
gammiro-un hyanggi-ro byeolcheonji-reol iru-bnida.
(In summer, when white and cute flowers bloom, the sweet perfume of the flowers makes their
surroundings a beautiful world.)
iru-bnida (make)
    pi-myeon (bloom) [conjunctive adverbial]
        yeoreumcheol (summer) [adverbial], candidate position (c)
        ggot-i (flower) [subject]
            yubaeksaek-eui (white) [adnominal]
            guiyeo-un (cute) [adnominal]
    ontong (whole) [adverbial], candidate position (a)
    jubyeon-eun (surroundings) [subject]
        geu (the) [adnominal]
    hyanggi-ro (with perfume) [adverbial], candidate position (b)
        gammiro-un (sweet) [adnominal]
    byeolcheonji-reol (beautiful world) [object]
Fig. 1. Example of a syntactic dependency tree
3 Text Watermarking by Syntactic Analysis
Watermark embedding involves several steps. First, we perform syntactic parsing and obtain a
syntactic dependency tree. Next, we choose the target syntactic constituents for movement in a
sentence and determine the direction of movement. After that, we embed the watermark bits and check
whether the movement bit of each target constituent corresponds to its watermark bit. If it does not,
we move the target constituent in the determined direction. Finally, from the modified syntactic tree,
we generate a marked sentence.
We explain each procedure in detail.
3.1 Syntactic analysis
To perform syntactic analysis, we use the Korean syntactic dependency parser of M. Y. Kim et al. [5].
A syntactic dependency parser determines the syntactic relations between the words in a sentence.
Fig. 1 shows an example of a syntactic dependency tree. In the tree, a parent node functions as the
syntactic governor of its child nodes, and a child node functions as a syntactic dependent of its
parent node.
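As an illustration of the governor/dependent structure described above, the tree of Fig. 1 could be represented as follows. This is a minimal sketch, not the parser of [5]: the `Node` class and the helpers `n` and `linearize` are hypothetical names, the relation labels are read off Fig. 1, and the head-final ordering (every dependent precedes its governor) is an assumption that holds for this example sentence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One word of the sentence (hypothetical structure, not the parser's API)."""
    word: str
    relation: str  # syntactic relation to the governor (parent node)
    children: List["Node"] = field(default_factory=list)  # dependents, in surface order


def n(word: str, relation: str, *children: Node) -> Node:
    """Shorthand constructor used only to keep the example compact."""
    return Node(word, relation, list(children))


def linearize(node: Node) -> List[str]:
    """Emit the words of a subtree, assuming dependents precede their governor."""
    words: List[str] = []
    for child in node.children:
        words.extend(linearize(child))
    words.append(node.word)
    return words


# Dependency tree of the example sentence in Fig. 1.
fig1 = n("iru-bnida", "root",
         n("pi-myeon", "conjunctive adverbial",
           n("yeoreumcheol", "adverbial"),
           n("ggot-i", "subject",
             n("yubaeksaek-eui", "adnominal"),
             n("guiyeo-un", "adnominal"))),
         n("ontong", "adverbial"),
         n("jubyeon-eun", "subject", n("geu", "adnominal")),
         n("hyanggi-ro", "adverbial", n("gammiro-un", "adnominal")),
         n("byeolcheonji-reol", "object"))

# Reading the tree back in surface order reproduces the example sentence.
sentence = " ".join(linearize(fig1))
print(sentence)
```

Keeping the dependents of each node in surface order means that a constituent movement can later be expressed as a reordering of one `children` list, with the subtree of the moved node following automatically.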
iru-bnida (make)
    pi-myeon (bloom) [conjunctive adverbial]
        ggot-i (flower) [subject]
            yubaeksaek-eui (white) [adnominal]
            guiyeo-un (cute) [adnominal]
        yeoreumcheol (summer) [adverbial]
    jubyeon-eun (surroundings) [subject]
        geu (the) [adnominal]
    ontong (whole) [adverbial]
    hyanggi-ro (with perfume) [adverbial]
        gammiro-un (sweet) [adnominal]
    byeolcheonji-reol (beautiful world) [object]
Marked sentence: Yubaeksaek-eui guiyeo-un ggot-i yeoreumcheol pi-myeon geu jubyeon-eun ontong
gammiro-un hyanggi-ro byeolcheonji-reol iru-bnida.
Fig. 2. The converted tree after syntactic subject movement

3.2 Selection of syntactic constituents for movement
A sentence basically consists of a subject and a predicate, so a sentence usually contains at least
one subject. We therefore select subject constituents as the constituents to move. A selected target
subject moves to the position of one of its siblings, so it must have at least one sibling that
satisfies the following conditions.
1. The sibling should not be adnominal.
2. The sibling should not be a conjunctive adverbial.
An adnominal constituent must be adjacent to the nominal constituent that follows it, so we cannot
change its position. Similarly, a conjunctive adverbial usually occupies the first position in a
sentence, so we cannot move it either. A subject can therefore exchange positions only with a sibling
that is neither adnominal nor a conjunctive adverbial.
In Fig. 1, there are two subjects, ‘jubyeon-eun’ (surroundings) and ‘ggot-i’ (flower). Both of them
can be target constituents. A target constituent can move in one of two directions, ‘left’ or ‘right’.
‘Left’ means that it exchanges positions with the nearest left sibling that is neither adnominal nor a
conjunctive adverbial. ‘Right’ means that it exchanges positions with the nearest right sibling that
satisfies the same conditions. The subject ‘jubyeon-eun’ (surroundings) has two candidate positions,
(a) and (b). The subject ‘ggot-i’ (flower) has only one candidate position, (c), because it has only
one sibling. If a subject has candidate positions on both the left and the right, we select the nearer
one. In the surface sentence, one word (‘ontong’) lies between (a) and the original subject position,
and two words (‘gammiro-un’ and ‘hyanggi-ro’) lie between (b) and the original subject position. So
the subject ‘jubyeon-eun’ (surroundings) moves to (a), not (b), because (a) is nearer to the original
subject position. If the distances to the left and to the right are the same, the subject moves to the
right position.
We assign a movement bit to each subject constituent, to be compared with the watermark bit in the
next procedure. If a subject is selected as a target constituent to move, its movement bit is ‘1’;
otherwise, its movement bit is ‘0’.
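The selection rules of Section 3.2 can be sketched as follows. This is an illustrative reading rather than the author's implementation: `Node`, `n`, `size`, `nearest`, `choose_move` and `select_targets` are hypothetical names, and the distance to a candidate position is assumed to be the number of words in the sibling subtrees lying between that position and the subject, which reproduces the counts given for (a) and (b).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Siblings with these relations cannot be exchanged with a subject (conditions 1 and 2).
BLOCKED = {"adnominal", "conjunctive adverbial"}


@dataclass
class Node:
    word: str
    relation: str
    children: List["Node"] = field(default_factory=list)  # dependents, in surface order


def n(word: str, relation: str, *children: Node) -> Node:
    return Node(word, relation, list(children))


def size(node: Node) -> int:
    """Number of words in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)


def nearest(siblings: List[Node], idx: int, step: int) -> Optional[Tuple[int, int]]:
    """Nearest eligible sibling in one direction as (distance in words, index), or None."""
    dist, j = 0, idx + step
    while 0 <= j < len(siblings):
        dist += size(siblings[j])  # assumed distance measure: words passed over
        if siblings[j].relation not in BLOCKED:
            return dist, j
        j += step
    return None


def choose_move(siblings: List[Node], idx: int) -> Optional[Tuple[str, int]]:
    """Pick the nearer candidate position; a tie goes to the right."""
    left, right = nearest(siblings, idx, -1), nearest(siblings, idx, +1)
    if left is None and right is None:
        return None  # no eligible sibling: movement bit 0
    if right is not None and (left is None or right[0] <= left[0]):
        return "right", right[1]
    return "left", left[1]


def select_targets(root: Node) -> Dict[str, Tuple[str, str]]:
    """Map each target subject (movement bit 1) to (direction, sibling it exchanges with)."""
    targets: Dict[str, Tuple[str, str]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node.children):
            if child.relation == "subject":
                move = choose_move(node.children, i)
                if move is not None:
                    targets[child.word] = (move[0], node.children[move[1]].word)
            stack.append(child)
    return targets


# Tree of Fig. 1 (relation labels read off the figure).
fig1 = n("iru-bnida", "root",
         n("pi-myeon", "conjunctive adverbial",
           n("yeoreumcheol", "adverbial"),
           n("ggot-i", "subject",
             n("yubaeksaek-eui", "adnominal"), n("guiyeo-un", "adnominal"))),
         n("ontong", "adverbial"),
         n("jubyeon-eun", "subject", n("geu", "adnominal")),
         n("hyanggi-ro", "adverbial", n("gammiro-un", "adnominal")),
         n("byeolcheonji-reol", "object"))

targets = select_targets(fig1)
print(targets)
```

Under these assumptions both subjects of Fig. 1 are selected: ‘jubyeon-eun’ exchanges with ‘ontong’ (position (a), distance 1 against 2 for (b)) and ‘ggot-i’ exchanges with ‘yeoreumcheol’ (position (c)), so both receive movement bit ‘1’.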
3.3 Embedding watermark bits
After determining the target constituents to move, we embed the watermark bits. If a sentence has
target subject constituents to move, we embed one watermark bit for each subject constituent in the
sentence. If a subject constituent does not belong to the target constituents, its corresponding
watermark bit becomes ‘1’. Otherwise, the watermark bit is embedded randomly. The target constituent
is moved if the watermark message bit to be embedded differs from the movement bit; otherwise, no
change is made in the syntactic tree.
When moving a target constituent, we also move all of its child nodes. In Fig. 2, the node ‘geu’ (the),
the child node of the target subject node ‘jubyeon-eun’ (surroundings), is also moved. Similarly, the
nodes ‘yubaeksaek-eui’ (white) and ‘guiyeo-un’ (cute) are moved together with their parent node
‘ggot-i’ (flower). After performing the target constituent movement, we obtain the converted parse
tree shown in Fig. 2. Then, we recombine the nodes of the parse tree and obtain the marked sentence
shown at the bottom of Fig. 2.
4 Experimental Results
We used 2,080 sentences from the corpus of Matec99 (Morphological Analyzer and Tagger Evaluation
Contest in Korean). As shown in Table 1, the average number of words per sentence is 15.67.

Table 1 Performance of our system
    Number of sentences                                          2,080
    Average number of words per sentence                         15.67
    Sentences selected for embedding watermark bit string (%)    75%
    Unsuitable sentences among marked sentences                  29.29%
    Unsuitable sentences among non-transformed sentences         14.81%

Table 2 Comparison of performance for coverage and information-hiding capacity
                                                                    Y. L. Chiang [9]    Our system
    Coverage of sentences selected for embedding watermark bit (%)  6.7%                75%
    Relation of information-hiding capacity                         1:6.6               1:14.05

Table 3 Comparison of performance for average edit rate
                         H. M. Meral [2]    Our system
    Average edit (%)     55.9%              25.67%

M. Topkara et al. [3,12] measured performance using BLEU [10] and NIST [11], which are machine
translation evaluation methods. Using those systems for sentence-by-sentence distance evaluation is
neither sufficient nor accurate for the task of evaluating natural language watermarking. BLEU is very
sensitive to the precision of words and their positions in the generated sentence. Some
transformations (e.g. passivization, syntactic constituent movement) change the word order heavily
while keeping the meaning very close to the original. A better way of evaluating the distortion
introduced by a natural language watermarking system is to measure the distortion at the full text
level. So we use the subjective human evaluation that H. M. Meral et al. [2] used. In this evaluation,
human subjects read the texts and show their reactions by editing attempts. The subjects are given
marked texts and asked to edit them for improved intelligibility and style. This is a blind test
because the subjects are not aware that text watermarking has taken place. Three human evaluators
checked the 2,080 sentences for unnatural word arrangements.
Table 1 shows the rate of unsuitable sentences among marked sentences and the rate among
untransformed sentences. It is interesting to note that sentences which have not been transformed
also received edit hits, at a rate of 14.81%, implying that editing is not a crucial problem for text
watermarking.
As shown in Tables 1 and 2, 75% of the sentences are selected for embedding watermark bits. This
coverage outperforms that of Y. L. Chiang et al. [9]. The average edit rate is 25.67%, which is a
better result than that of H. M. Meral et al. [2], as shown in Table 3.
The relation of information-hiding capacity indicates how many terms of text are needed to hide one
bit of watermark. In Table 2, the value 1:6.6 for the system of Y. L. Chiang et al. [9] means that one
bit of watermark can be hidden for every 6.6 terms of text. In our experiment, the relation of
capacity is 1:14.05, which is worse than that of Y. L. Chiang et al. [9]. However, our marked text
keeps the same style, and it also carries the same information without semantic distortion.
We conclude that our natural language watermarking based on syntactic constituent movement shows
reasonable performance without semantic and stylistic distortion.
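To make the embedding rule of Section 3.3 and the capacity ratio reported above concrete, the following sketch applies the rule to the example of Fig. 1. It rests on assumptions: `Node`, `n`, `linearize`, `fig1_tree` and `embed` are hypothetical names, the targets and their exchange partners are hard-coded from the discussion in Section 3.2, subjects that are not targets are left out, and the final estimate treats the "terms" of the capacity ratio as words, so it is a rough figure rather than a reported result.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    word: str
    relation: str
    children: List["Node"] = field(default_factory=list)  # dependents, in surface order


def n(word: str, relation: str, *children: Node) -> Node:
    return Node(word, relation, list(children))


def linearize(node: Node) -> List[str]:
    """Emit a subtree in surface order, assuming dependents precede their governor."""
    words: List[str] = []
    for child in node.children:
        words.extend(linearize(child))
    words.append(node.word)
    return words


def fig1_tree() -> Tuple[Node, List[Tuple[Node, int, int]]]:
    """Tree of Fig. 1 plus its targets as (parent, subject index, sibling index)."""
    pi = n("pi-myeon", "conjunctive adverbial",
           n("yeoreumcheol", "adverbial"),
           n("ggot-i", "subject",
             n("yubaeksaek-eui", "adnominal"), n("guiyeo-un", "adnominal")))
    root = n("iru-bnida", "root", pi,
             n("ontong", "adverbial"),
             n("jubyeon-eun", "subject", n("geu", "adnominal")),
             n("hyanggi-ro", "adverbial", n("gammiro-un", "adnominal")),
             n("byeolcheonji-reol", "object"))
    # ggot-i exchanges with yeoreumcheol (c); jubyeon-eun exchanges with ontong (a).
    return root, [(pi, 1, 0), (root, 2, 1)]


MOVEMENT_BIT = 1  # movement bit of a subject selected as a target


def embed(targets: List[Tuple[Node, int, int]], bits: List[int]) -> int:
    """Move a target only when its watermark bit differs from its movement bit."""
    moved = 0
    for (parent, i, j), bit in zip(targets, bits):
        if bit != MOVEMENT_BIT:
            kids = parent.children
            kids[i], kids[j] = kids[j], kids[i]  # whole subtrees are exchanged
            moved += 1
    return moved


root, targets = fig1_tree()
moved = embed(targets, [0, 0])  # both bits differ from the movement bit
marked = " ".join(linearize(root))
print(moved, marked)

# Rough capacity estimate from the reported figures (assumes one term = one word).
total_words = 2080 * 15.67
estimated_bits = total_words / 14.05
print(round(estimated_bits))
```

With the watermark bits [0, 0], both subjects move and the output matches the marked sentence of Fig. 2; with [1, 1], the tree is left unchanged. Exchanging whole `Node` objects is what carries the child nodes (‘geu’, ‘yubaeksaek-eui’, ‘guiyeo-un’) along with their parents.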
5 Conclusion
We propose natural language watermarking for Korean based on syntactic analysis. Exploiting the fact
that agglutinative languages permit free word order, we perform syntactic constituent movement in a
syntactic tree. Because at least one subject usually exists in a sentence, we choose subject
constituents for movement. The overall procedure consists of several steps. First, we perform
syntactic dependency parsing of the original text. Next, we choose target subject constituents for
movement. Then, we embed watermark bits. If the movement bit of a target constituent does not
correspond to its watermark bit, we perform the movement. Finally, from the modified parse tree, we
obtain the marked text.
The experimental results show that the coverage of our method is 75% and the relation of
information-hiding capacity is 1:14.05. In addition, the editing rate is 25.67%, which is better than
that of H. M. Meral et al. [2].
We conclude that our syntactic tree-based method using syntactic subject movement is useful for
watermarking Korean text. Besides Korean, other agglutinative languages also allow syntactic
constituents to move freely within some limited boundaries. Using this characteristic, we will try to
apply our method to other languages to demonstrate that it is effective for them as well.
Acknowledgements
This work was supported by the Sungshin Women’s University Research Grant of 2008.
References:
[1] I. Cox, M. L. Miller, J. A. Bloom, “Digital Watermarking”, Morgan Kaufmann, 2002
[2] H. M. Meral, E. Sevinc, E. Unkar, B. Sankur, A. S. Ozsoy, T. Gungor, “Syntactic tools for text
watermarking”, In Proc. of the SPIE International Conference on Security, Steganography, and
Watermarking of Multimedia Contents, 2007
[3] M. Topkara, C. M. Taskiran, E. J. Delp, “Natural language watermarking”, SPIE Conf. on Security,
Steganography and Watermarking of Multimedia Contents, 2005
[4] C. M. Taskiran, M. Topkara, E. J. Delp, “Attacks on linguistic steganography systems using text
analysis”, SPIE Conf. on Security, Steganography and Watermarking of Multimedia Contents,
pp. 313-336, 2006
[5] Mi-Young Kim, Sin-Jae Kang, Jong-Hyeok Lee, “Resolving Ambiguity in Inter-chunk Dependency
Parsing”, NLPRS, pp. 263-270, 2001
[6] U. Topkara, M. Topkara, M. J. Atallah, “The Hiding Virtues of Ambiguity: Quantifiably Resilient
Watermarking of Natural Language Text through Synonym Substitutions”, In Proc. of ACM
Multimedia and Security Conference, 2006
[7] M. Atallah, V. Raskin, C. F. Hempelmann, M. Karahan, R. Sion, K. E. Triezenberg, U. Topkara,
“Natural language watermarking and tamperproofing”, Lecture Notes in Computer Science, 2002
[8] M. J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, S. Naik,
“Natural language watermarking: design, analysis, and proof-of-concept implementation”, In Proc.
of the International Information Hiding Workshop, 2001
[9] Y. L. Chiang, L. P. Chang, W. T. Hsieh, W. C. Chen, “Natural language watermarking using
semantic substitution for Chinese text”, Lecture Notes in Computer Science, pp. 129-140, 2004
[10] K. Papineni, S. Roukos, T. Ward, W. Zhu, “BLEU: a method for automatic evaluation of machine
translation”, In Proc. of 40th Annual Meeting of the ACL, 2002
[11] National Institute of Standards and Technology, Machine translation benchmark tests provided
by the National Institute of Standards and Technology,
http://www.nist.gov/speech/tests/mt/resources/
[12] M. Topkara, U. Topkara, M. J. Atallah, “Words are not enough: sentence level natural language
watermarking”, In Proc. of the 4th ACM International Workshop on Content Protection and Security
(in conjunction with ACM Multimedia), 2006
[13] Osamu Takizawa, Kyoko Makino, Tsutomu Matsumoto, Hiroshi Nakagawa, Ichiro Murase, “Method
of Hiding Information in Agglutinative Language Documents Using Adjustment to New Line
Positions”, KES (3), pp. 1039-1048, 2005