Proceedings of the 39th Hawaii International Conference on System Sciences - 2006
Exploiting Linguistic Features in Lexical Steganography:
Design and Proof-of-Concept Implementation
Vineeta Chand and C. Orhan Orgun
University of California, Davis
[email protected] [email protected]
Abstract
This paper develops a linguistically robust
encryption, LUNABEL, which converts a message into
semantically innocuous text. Drawing upon linguistic
criteria, LUNABEL uses word replacement, with
substitution classes based on traditional word
replacement features (syntactic categories and subcategories), as well as features under-exploited in
earlier works: semantic criteria, graphotactic
structure, inflectional class and frequency statistics.
The original message is further hidden through the
use of cover texts—within these, LUNABEL retains all
function words and targets specific classes of content
words for replacement, creating text which preserves
the syntactic structure and semantic context of the
original cover text. LUNABEL takes advantage of cover
text styles which are not expected to be necessarily
comprehensible to the general public, making any
semantic anomalies more opaque. This line of work
has the promise of creating encrypted texts which are
less detectable than earlier steganographic efforts.
1. Introduction
While current encryption techniques are
sufficiently advanced to make code-breaking
practically impossible, one major drawback of current
encryption methods is the ease in identifying an
encrypted text—they do not resemble natural text in
any way. Steganography attempts to answer this need,
acting to conceal the message's existence, in order to
transmit encrypted messages without arousing
suspicion. The chance that messages will be intercepted, which can lead to broken codes, is dramatically reduced with effective steganographic methodology.
In some cases, just the fact that correspondents are
exchanging encrypted messages might be more
information than one wants to give away—this is the
major advantage of steganography. The classic
example is lovers writing each other innocent-looking
notes—it wouldn't do if their parents knew they were
writing each other encrypted messages, even if they
had no way of breaking the code. Another recent case
was a political prisoner in Turkey sending letters to
his friends outside and telling them that he was being
tortured. If he had attempted to send a letter that
looked like an encryption, the letter would never have
been delivered. Again, whether the code is breakable
is secondary; the main point is not wanting people to
know about the exchange of encrypted messages. In
these examples, encryption was performed in creative
ways, but regular use of steganography requires a
robust algorithmic technique.
Information hiding has taken one form in image-based steganography, utilizing minimal changes in
pixels or watermarking techniques. While text-based
messages have also been used within image-based
maneuvers, by modifying the white space between
letters and by minutely changing the fonts, this has
proved less fruitful because text can be retyped and is
often altered in the conversion from one program
version or platform to another. Proving more
productive, as well as resistant to the difficulties
surrounding the re-typing of text-based messages is
lexical steganography, which uses linguistic structures
to disguise encryption of text messages such that the
appearance of the message remains semantically and
syntactically innocent.
This paper develops LUNABEL, a linguistically-informed alternative to existing text-based word
replacement steganographic systems (NICETEXT [5],
Tyrannosaurus Lex [13], WordNet [11]). Using
myriad word and phrase categories recognized in
linguistics as well as other linguistically relevant
criteria, this model creates more cohesive and
semantically plausible results than earlier efforts.
Earlier text-based steganographic systems, while
more natural than random alpha-numeric sequences
because of their use of words framed in a sentence
format, still fall short of the goal: the text produced is
0-7695-2507-5/06/$20.00 (C) 2006 IEEE
sufficiently unnatural in appearance to the human eye
to warrant further inspection, which defeats the goal
of steganography.
LUNABEL improves upon existing text-based
steganographic techniques through the inclusion of
linguistic criteria into the program's word replacement
method, producing semantically and syntactically
reasonable text. This paper explores the design
choices underlying LUNABEL as well as the
substitution classes of words, focusing on their
adherence to linguistic features beyond basic word
classes (noun, verb, adjective, determiner, etc.) and
syntactic features (sentence frames), and concludes
with a discussion of linguistic robustness.
2. Past Research
Lexical steganography has had three main veins of
research: watermarking techniques [1] that manipulate
sentences through syntactic transformations, word
replacement systems both with and without cover
texts, and context-free grammars such as NICETEXT.
2.1. Watermarking
Atallah et al. [1] watermark texts by manipulating
and exploiting the syntax (formal word order and
grammatical voice) of sentences. Through common
generative transformations (clefting (2), adjunct
fronting (3), passivization (4), adverbial insertion (5)),
the syntax of each sentence is altered:
1. The lion ate the food yesterday. (original sentence)
2. It was the lion that ate the food yesterday.
3. Yesterday, the lion ate the food.
4. The food was eaten by the lion yesterday.
5. Surprisingly, the lion ate the food yesterday.
While this broadly preserves the meaning of each
sentence, their argument that this approach will
withstand translation into other languages, as well as
appear as innocuous text, is not strong.
Covering the first point, adverbs and adjunct
clauses which may appear in multiple syntactic
positions within English (e.g., in (5), ‘yesterday’ and
‘surprisingly’ can be placed at the beginning or end of
the sentence as well as directly before the verb) do not
necessarily have this same range of movement in
other languages with differing word order rigidity.
Also confounding is English’s notorious strictness in
its subject, object and verb word order, in marked
contrast to its treatment of adverb and adjunct clauses.
Because of these features, the range of possible
transformations within English does not have a
dependable one to one correlation in most other
languages, Indo-European or otherwise.
Additionally, the more serious claim when
examining the English version of Atallah et al.'s
watermarked text, that transformations will not affect
the semantics of a text, reflects a generative bias, and
is not necessarily true. Newer theories of language
argue for the interconnectedness of the semantic and
syntactic levels [9, 10], demonstrating that the
syntactic pattern is itself inherently meaningful.
Furthermore, statistically, various syntactic structures
(word orders) are not equal in distribution: different
genres of text have wildly different syntactic
structures, and replacing such structures freely could
create a text which is trivially broken by statistical
methods—a security threat to the program.
Another problem for their claim of encryption
integrity within translation arises with a closer
examination of syntactic structure cross-linguistically.
The use of passivization, for example, varies cross-linguistically: in English it is appropriate to say (6),
while in Hindi the same concept would be phrased,
when translated directly into English, as (7).
6. I was late.
7. To me lateness happened.
Translation, either based on semantics or word-by-word, will not maintain the syntactic structures
Atallah et al. assume. Their approach, while more
linguistically sophisticated than earlier work, is
clearly not without problems. More importantly, its
focus is centered on defeating word frequency
statistics and NLP software which would test the
semantics of the text, making little reference to how
such text would fare with a human inspection. Our
program is focused towards defeating the casual
human inspector, negating the possibility that the text
would raise suspicion and be passed into NLP and
statistical programs for further inspection.
2.2. Word Replacement
Attempts have been made to focus on synonymous
words as potentials for word replacement (WordNet
[11], Tyrannosaurus Lex [13]). However, this is rather
problematic from a linguistic standpoint, as the
number of true synonyms in English is relatively
small to nonexistent (a commonly cited synonymous
pair in American English is sofa/couch, but even this
is only valid in some areas of the United States).
Unfortunately, the notion of synonymy is vague, often
characterized as a similarity in meaning and the
ability to replace one term with another. A closer
examination of commonly perceived synonyms often
reveals differences in terms of their syntactic
distributions and word frequencies. Additionally,
most words have more than one sense. It is rare that
two terms will overlap completely in all of their
senses, complicating the replace-all plan used in
earlier steganographic works.
Another complication is that many synonymous
pairs actually come from different dialects (i.e.,
regionally defined varieties of language) or registers
(i.e., language varieties used in particular social
settings); their range of underlying implications is
disparate and often based on these variations in
register and dialect. For example, the terms violinist
and fiddler are commonly cited synonyms, however,
they are poor replacement choices. Compare the
original sentence, (8), with (9), synonymous term
replacement:
8. Itzhak Perlman, one of the most famous violinists,
performed recently at Carnegie Hall.
9. *Itzhak Perlman, one of the most famous fiddlers,
performed recently at Carnegie Hall.
Example (9) demonstrates that semantically similar
terms, in this case terms for people who play a small
fretless, bowed string instrument, are not necessarily
replaceable, due to underlying differences in
connotation—Itzhak Perlman is not a famous fiddler.
To say that he is one implies that he plays a style of
music other than Western classical music (perhaps
Bluegrass), and could be interpreted as a derogatory
reflection on his sophistication. Oftentimes this
register difference for synonymous terms is
problematic for wholesale replacement schemes,
which ignore register as well as genre specific word
frequencies, making replaced terms stand out with
respect to the text as a whole.
A further impediment to synonym based word
replacement is graphotactic structure (i.e., legal and
illegal letter combinations)—to cite one example, the
English indeterminate articles a and an are used in a
phonologically rule-based pattern, with a preceding
consonant initial terms and an preceding vowel initial
terms:
/a/ → [a] / __#C e.g., a banana
/a/ → [an] / __#V e.g., an apple
Exchanging apple with banana would thus
produce the ill-formed phrase *an banana. These
factors, taken collectively, severely affect the flow,
semantics, and readability of a text altered based on
"synonymous" word replacement. In short, the notion
of replacement based on synonyms is oversimplifying
language.
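The graphotactic constraint just described can be checked mechanically. The sketch below is illustrative, not part of any cited system, and uses a letter-based approximation of the a/an rule (the distinction is actually phonological, cf. an hour, a unicorn, which a letter test gets wrong):

```python
# Letter-based approximation of the a/an constraint: a replacement noun
# must preserve the well-formedness of a preceding indefinite article.
def article_ok(article: str, noun: str) -> bool:
    """True if the article/noun pair is graphotactically well-formed."""
    vowel_initial = noun[0].lower() in "aeiou"
    # "an" goes with vowel-initial nouns, "a" with consonant-initial ones.
    return (article == "an") == vowel_initial
```

A replacement scheme that ignores this check can turn "an apple" into the ill-formed *"an banana"; keeping vowel-initial and consonant-initial nouns in separate substitution classes avoids the problem entirely.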
Word replacement based on synonymy, while
sharing a focus with LUNABEL towards defeating
human inspection, is not linguistically sophisticated
and has yet to produce encrypted naturalistic text.
2.3. Context-free Grammars
The third approach used in lexical steganography
exploits the syntactic structures of a text, using a part-of-speech tagger to place new words in old frames, in
order to create new encrypted text. Past research in
this arena has produced software functionally similar
to LUNABEL, in that a message is encoded into an
innocuous looking text [3]. However, while
improving upon earlier synonym-based word
replacement programs, NICETEXT has certain
drawbacks which limit its potential, and distinguish it
in appearance and methodology from LUNABEL.
NICETEXT uses the cover text simply as a source of
syntactic patterns: by running the cover text through a
part-of-speech tagger, NICETEXT obtains a set of
"sentence frames," e.g. [(noun) (verb) (prep) (det)
(noun)] for ‘I sat in the tree.’ It also compiles a
lexicon of words found in the cover text via part-of-speech tags, with each word in the lexicon associated
(arbitrarily) to either of the binary digits 0 or 1. In
encryption, the plain text message is converted into a
sequence of binary digits. A random sentence frame is
chosen and the part of speech tags in it are replaced
by words in the lexicon according to the sequence of
binary digits.
Based on a linguistically unsophisticated model of
language, (part of speech tags are useful in syntactic
parsing and similar tasks, but inadequate for
semantically plausible word replacement—an
indication of this may be found in the wildly differing
numbers of part of speech tags used in taggers
developed for different purposes: the Penn Treebank
tagset of 45 tags vs. the C7 tagset of 146 tags [8]),
NICETEXT is ill prepared to deal with ambiguous
lexical items (part of speech ambiguity—words with
two possible syntactic categories, like share, which
functions as a noun and a verb—is especially
disastrous), and items which cannot be interpreted
literally (idioms and metaphors). Furthermore,
function and content words are treated alike in
NICETEXT. Since function words often have more to do
with syntactic relationships (e.g., passive agent by;
infinitival to) than semantics, subjecting them to word
replacement often results in ungrammatical and
semantically anomalous sentences. Compare (10),
NICETEXT's cover text with (11), encryption based on
NICETEXT.
10. NICETEXT Original text (unaltered), from John F.
Kennedy's 1962 inaugural speech
We observe today not a victory of party but a
celebration of freedom. . . symbolizing an end as well
as a beginning. . . signifying renewal as well as
change for I have sworn before you and Almighty God
the same solemn oath our forbears prescribed nearly
a century and three-quarters ago. The world is very
different now, for man holds in his mortal hands the
power to abolish all forms of human poverty and all
forms of human life. (...)
11. NICETEXT -Encrypted version of above text
My area origins of the suspicion... oppose much what
America will be before you, before what asunder we
would be for the Poverty inside Man. Yet will it do
almighty off the first two south votes... or on the
course at this administration, whether even deadly off
our suspicion by this peace. To those young votes
what proud nor alike origins we comfort: we house
the heritage past hostile americans. (…)
Regarding the semantics of (11), Atallah et al. [1]
comment that NICETEXT, and similar programs, are
"…context-free grammars that generate primitive but
less conspicuously meaningless texts, in which each
individual sentence may be sort of meaningful, even if
rather unrealistically simple…" Clearly, there is room
for improvement in the semantics of the texts
produced by earlier steganographic attempts.
Another factor worth considering is the density of
encryption within the cover text. Ideally, the cover
text should work to hide the word frequencies and
syntactic structure of the hidden plain text message.
Steganographic goals encourage sparse encryption,
which does not alter a majority of the text by the word
replacement. NICETEXT encryption is maximally
dense—every word within the final encrypted cover
text is conveying hidden information. Given that each
encrypted word is part of the original information
bearing message and common word usage patterns are
unavoidable, this is problematic for the original
steganographic intent: avoiding detection and
producing naturalistic text.
All of the shortcomings detailed above make the
fact of encryption obvious, as the syntax and
semantics of the cover text are unnatural enough to
draw attention to the message, whether examined by
AI/NLP software or by a human. This is
acknowledged in the original authors' reflection on
NICETEXT, "Although the initial NICETEXT approach
was an improvement, it was not as effective in
producing text that was 'believable' by a human" [5].
NICETEXT II [5], the successor to NICETEXT, is
based on a context-free grammar instead of sentence
frames, and is focused on the ability to build an
infinitely large set of sentences based on a dictionary
of tagged words. This focus on infinite sentence
generation still leaves NICETEXT II as open to the
density critique as NICETEXT. Additionally, the
semantic plausibility of the sentences has not
increased, nor is there any relevance or relation
between sequential sentences, rendering the resulting
text contextless. These issues can be seen in (12),
which uses the sentence frames from the original text
found in (10), to produce the following sample
sentence:
12. The prison is seldom decreaseless tediously, till
sergeant outcourts in his feline heralds the stampede
to operate all practices among interscapular stile
inasmuch all tailers underneath indigo pasture.
While this sentence might be considered
syntactically reasonable, it is clearly not semantically
wellformed, and hence open to further inspection and
detection. It is reminiscent of Chomsky's [6] famous
sentence (13), in which he highlighted syntactic
wellformedness without semantic wellformedness, i.e.
grammatically reasonable nonsense:
13. Colorless green ideas sleep furiously.
While Chomsky's accompanying generative
syntactic theory has had many incarnations in the last
40 years, his point made with this sentence holds true.
As this review has demonstrated, NICETEXT I and
II are concerned only with syntactic wellformedness,
allowing for grammatically correct but semantically
anomalous text. Given the goal of creating innocuous
text, failing to consider semantic wellformedness
leaves NICETEXT I and II vulnerable to detection.
While the dominant focus has been creating
naturalistic text which would fool a human, it can be
gathered from surveying the goals and steganographic
processes of existing systems that there is much work
still to be done. The texts produced by all of these
earlier systems, while potentially more innocuous
than an alpha-numeric stream of digits, are clearly
unnatural and likely to arouse suspicion, either with
statistical or NLP software, or human inspection.
Linguistic features beyond syntactic structure and
synonymy clearly need to be accounted for in order to
produce natural looking text, and less dense
encryption is critical for future work. The current
project addresses these further needs.
3. Current Work
3.1. Introduction
Our technique improves upon earlier research with
its use of a cover text, a more informed and less dense
means of disguising the message, and draws upon
linguistically significant criteria and selectivity in
word replacement, especially in dealing with
problematic and ambiguous words. Our cover text
differs in function from earlier works in two pivotal
ways: the majority of words within the cover text are
maintained (unlike [3] and [4]) and only individual
words are replaced, distinct from the manipulation of
syntactic structures [1]. Encryption is based on word
replacement, with replacement word lists based on
part of speech categories and sub-categories
(especially for verbs), semantic criteria, graphotactic
structure (for nouns, to handle the a~an allomorphy of
the indefinite article), inflectional class (regular vs.
irregular plurals, past tense forms, etc.) and word
frequency statistics. The primary goal is to create
naturalistic text which passes human inspection: if
human inspection does not raise suspicions, there is
less chance that the text will be scrutinized further
with statistical and NLP software.
Given our system of encryption and given that it is
hard to replace function words (see section 3.4) and,
to a varying extent, highly ambiguous words, without
also affecting the syntax of a text, it is not necessary,
nor, we argue, linguistically valid, to replace every
word. Instead, LUNABEL replaces only those words
that are in one of the specified substitution classes and
leave other (mostly function) words unchanged in the
cover text. As a result, the encrypted message more
closely reflects the syntax and lexical frequency
characteristics of the cover text, which makes its
appearance more natural under close scrutiny.
Key to this endeavor is the use of cover texts
which are not expected to be comprehensible to the
general public, making possible discrepancies more
opaque within the context of the larger message.
Specifically, we have focused on the writing style
found in "readme" files which accompany software
packages. Using this genre is advantageous for many
reasons: it complicates detection both from the
AI/NLP software side and the human inspection side
(this is elaborated on in section 3.3). While frequency statistics are taken into account, within this paper the focus is on creating semantically reasonable text rather than on statistically appropriate word frequencies.
3.2. Overview of LUNABEL
The encryption scheme is two-part; the first step
converts the plain text message (14) into a sequence
of hexadecimal digits (15). This is done by taking the
ASCII code of each character and expressing it as a
pair of hexadecimal digits.
14. Plain text
I had a cat and the cat pleased me
I fed my cat under yonder tree.
Cat went fiddle-dee-dee, fiddle-dee-dee.
I had a dog and the dog pleased me
I fed my dog under yonder tree.
Dog went bawa, bawa,
cat went fiddle-dee-dee, fiddle-dee-dee.
15. Plain text converted to sequence of hexadecimal
digits (via LUNABEL)
4 9 2 0 6 8 6 1 6 4 2 0 6 1 2 0 6 3 6 1 7 4 2 0 6 1 6 14 6
4 2 0 7 4 6 8 6 5 2 0 6 3 6 1 7 4 2 0 7 0 6 12 6 5 6 1 7
3 6 5 6 4 2 0 6 13 6 5 2 14 0 10 4 9 2 0 6 6 6 5 6 4 2 0
6 13 7 9 2 0 6 3 6 1 7 4 2 0 7 5 6 14 6 4 6 5 7 2 2 0 7 9
6 15 6 14 6 4 6 5 7 2 2 0 7 4 7 2 6 5 6 5 2 14 0 10 4 3
6 1 7 4 2 0 7 7 6 5 6 14 7 4 2 0 6 6 6 9 6 4 6 4 6 12 6 5
2 13 6 4 6 5 6 5 2 13 6 4 6 5 6 5 2 12 2 0 6 6 6 9 6 4 6
4 6 12 6 5 2 13 6 4 6 5 6 5 2 13 6 4 6 5 6 5 2 14 0 10 0
10 4 9 2…
Thus, each ASCII character in the original text
will be encoded by two integers between zero and 15
(two hexadecimal digits), and ultimately encrypted by
two word replacements. In contrast, NICETEXT uses 8
binary digits for each character and therefore 8 word
replacements encrypt one plain text character. The
reasoning behind the encryption scheme used here is
discussed in Linguistic Robustness, section 3.6.
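The conversion step can be sketched as follows. The function names are hypothetical, chosen for illustration; the paper's own output in (15) prints digits above 9 as two-digit decimal numbers:

```python
# Each ASCII character becomes two hexadecimal digits (integers 0-15):
# the high nibble followed by the low nibble of its character code.
def to_hex_digits(plain: str) -> list:
    digits = []
    for ch in plain:
        code = ord(ch)            # ASCII code, e.g. 'I' -> 73 (0x49)
        digits.append(code // 16)  # high nibble -> 4
        digits.append(code % 16)   # low nibble  -> 9
    return digits

def from_hex_digits(digits: list) -> str:
    # Pair the digits back up and rebuild each character.
    return "".join(chr(hi * 16 + lo)
                   for hi, lo in zip(digits[::2], digits[1::2]))
```

Each digit will later be encrypted by one word replacement, so one plain-text character costs two replacements.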
Another piece of text, the "cover text," is then
brought in (the impetus behind this text selection is
covered in section 3.3):
16. Excerpt from Cover Text: (readme.txt file
distributed with GSview® software package)
Features include:
- View pages in arbitrary order (Next, Previous,
Goto).
- Selectable display resolution, depth, alpha.
- Page size is automatically selected from DSC
comments or can be selected using the menu.
- Orientation can be automatically selected from
DSC comments or can be selected using the
menu (Portrait, Landscape).
Words are replaced in this cover text with other
words, according to the above sequence of plain-text
numbers manipulated through an encryption key,
which provides a pseudo-random sequence of
integers. Here is the implementation of this:
1. We take the first number in the hex-dig file, in this
case 9.
2. We consult our encryption key to obtain an integer,
let's say 8.
3. We add these numbers together to get a new
number n, in this case 17, which corresponds to 1 (17
mod 16).
4. Then, we take the first word in the cover text. If this
word is not in any of our word lists, we skip words
until we come to a word that is (in this example,
“include” is the first such word).
5. Once we find such a word, we look up the word list
it's a member of.
6. We replace the cover text word with the nth word of
the list, which here happens to be “change.”
7. We repeat this procedure for each hex-dig in the
converted plain text.
8. (Minor details). If plain text is done and some
cover text remains, we copy the remainder intact. If
cover text runs out, we issue an error message and
optionally start again from the beginning of the cover
text, appended to the already-encrypted chunk.
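Steps 1-8 above can be sketched as follows. The word list is the first class from (17); the key stream and cover words are illustrative, and list positions are 0-based here, whereas the paper's worked example counts words from 1:

```python
# A minimal sketch of the LUNABEL replacement loop (steps 1-8).
WORD_LISTS = [
    ["change", "alter", "configure", "add", "include", "exclude", "insert",
     "restore", "delete", "edit", "write", "modify", "manipulate", "toggle",
     "clear", "rewrite"],
]

def which_list(word):
    """Return the substitution class containing word, or None."""
    for wl in WORD_LISTS:
        if word in wl:
            return wl
    return None

def encrypt(hex_digits, key_stream, cover_words):
    out, pos = [], 0
    for d, k in zip(hex_digits, key_stream):
        n = (d + k) % 16                        # steps 1-3
        while pos < len(cover_words):           # step 4: skip non-list words
            wl = which_list(cover_words[pos])
            if wl is not None:
                break
            out.append(cover_words[pos])
            pos += 1
        else:
            raise ValueError("cover text ran out")  # step 8 error case
        out.append(wl[n])                       # steps 5-6: nth word (0-based)
        pos += 1
    out.extend(cover_words[pos:])               # step 8: copy the remainder
    return " ".join(out)
```

With the 0-based convention, digit 9 and key 8 select position 1 ("alter") rather than the paper's first word ("change"); either convention works as long as encryption and decryption agree.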
The replacement words are taken from numbered lists
of words, substitution classes, which have been
compiled specifically for this purpose:
17. Two examples of substitution classes
word_list([change, alter, configure, add, include,
exclude, insert, restore, delete, edit, write, modify,
manipulate, toggle, clear, rewrite])
word_list([run, eject, send, choose, save, visit, view,
scroll, define, chat, ftp, telnet, transmit, enter, exit,
close])
Word replacement focuses on linguistically
significant features (detailed in section 3.4) beyond
syntactic similarity and synonymy, also heeding
distinctions in graphotactic form, content versus
function grouping, syntactic categories and subcategories, semantic criteria, inflectional class and
word frequency statistics, with the result being
encrypted text which is reasonably semantically and
syntactically innocuous:
18. Encrypted text
Features change:
- Close pages in arbitrary upgrade (Trailing, Last,
Goto).
- Selectable broadcast resolution, depth, alpha.
- Use size is instantaneously selected from DSC
comments or can be selected using the source.
- Orientation can be perpetually selected from
DSC comments or can be selected using the table
(Portrait, Landscape).
Decryption by the recipient requires the key and
the word lists, but not the cover text. In decryption,
we compare each word in the encrypted message
against our word lists. Once we find a word that is in
one of our word lists, we subtract the pseudo-random integer obtained from the encryption key from the position of the found word in its word list (mod 16). Iteration of this procedure through the encrypted
message results in a sequence of hexadecimal digits,
which, in pairs, correspond to the ASCII codes of
characters forming the plain (decrypted) message.
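Decryption inverts the encryption step, recovering each hexadecimal digit from the found word's list position and the key integer (mod 16). A minimal sketch, using an illustrative word list and 0-based list positions:

```python
# Decryption needs only the key stream and the word lists: non-list
# words are skipped, and each list word yields one hexadecimal digit.
WORD_LISTS = [
    ["change", "alter", "configure", "add", "include", "exclude", "insert",
     "restore", "delete", "edit", "write", "modify", "manipulate", "toggle",
     "clear", "rewrite"],
]

def which_list(word):
    for wl in WORD_LISTS:
        if word in wl:
            return wl
    return None

def decrypt(encrypted_words, key_stream):
    digits, keys = [], iter(key_stream)
    for word in encrypted_words:
        wl = which_list(word)
        if wl is None:
            continue                            # carries no hidden data
        # Invert the encryption addition: digit = (position - key) mod 16.
        digits.append((wl.index(word) - next(keys)) % 16)
    return digits
```

Pairing the recovered digits then yields the ASCII codes of the plain-text characters.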
3.3. The Cover Text
We have chosen to use a particular style of cover
texts within this demonstration of LUNABEL, that of
"readme" documents which generally accompany
software installations. This style of cover text was
chosen for a number of reasons, and is a key to the
success of this style of lexical steganography.
While written text such as stories, news articles
and novels have a very fluid style of prose, they do
not represent, in syntax, word frequencies, or fluency,
the entirety or norms of written text. Many other
styles of writing are considerably less fluid, less
coherent, and less intelligible to the general public,
while still commonly found. Framing the steganography within this type of genre is to our
advantage, because we can exploit the expected
discrepancies between this style of writing and more
formal, published text, as an added measure of
security. Word choice differences that might appear
odd or novel in a spoken conversation are easily
attributed to the style and characteristics of this type
of text, further masking the word replacement system.
Using this genre complicates detection both from
the AI/NLP software side and the human inspection
side. This genre has less developed norms of writing
stylistics, resulting in a wide range of variation within
the genre. Given this, statistical software working
within this genre requires a much wider range of
acceptability for word-clustering frequencies and
syntactic structures. Additionally, human inspection is
complicated by the range of words co-opted from
typical uses within software installation and program
details, the frequent use of incomplete sentences, and
the opaqueness typical of directions and descriptions
within "readme" documents. Other useful (and
reasonably opaque) styles are recipes, classified ads,
alcohol/wine descriptions, net blogs, English as a
Second Language (ESL) essays, life interviews,
programming code, and furniture assembly
instructions.
In addition to the benefits of using this style of
text, the use of cover texts allows for sparse
encryption. Only a minority of the words are
earmarked for replacement, allowing the majority of
the text to remain unchanged. Given this, one can take
advantage of the syntactic pattern provided by the
cover text. Using the cover text's syntactic pattern
lends additional security because the encrypted
message reflects the structure of the cover text, not of
the original message. Our system of sparse
replacement also allows the contextual frame of the
cover text to remain in place, such that the sentences
make sense when read as a whole text, as compared to
the product of a random sentence generator (refer to
(12) for an example of NICETEXT random sentence
generation).
3.4. The Word Lists
The word lists are manually built up from a corpus
of cover texts written in a particular style, in this case
the style of "readme" files. Within LUNABEL, each
word list has 16 words in it which have similar word
frequencies, graphotactic structure, syntactic subcategory, etc. within the corpus of data. Which
numbered word in the list is used to replace the
original word depends on the manner of encryption
used (see Section 3.6 for a further explanation); we
have used a simple encryption based on an arbitrary
user-supplied sequence of integers (akin to a “one-time pad” familiar from WWII encryption methods).
Syntactic categories are typically defined as
substitution classes: in replacing a word with another
word of the same category, the grammaticality of the
sentence should be unaffected. However, part of
speech (PoS) categories define substitution classes
only vaguely. In detailed syntactic work, elaborate
subcategorization—e.g., intransitive, transitive, ditransitive verbs, etc.—is needed. Once semantic
plausibility is added as a desideratum, semantic
factors have to be considered in defining substitution
classes (e.g., animate vs. inanimate nouns, verbs with
agent vs. experiencer subjects, mass vs. count nouns,
etc.). Additionally, heeding the issues past
steganographic programs have encountered with
synonymy, the semantic range of these word lists is
not limited to synonyms—indeed, it would be
impossible to find 15 synonyms for any word! Rather
than synonymy, the important criterion is usability in
similar syntactic and semantic contexts.
Our sparse word replacement is due, in part, to our
criteria for the word lists. Some categories of words
are more easily replaced within a text than others. As
discussed earlier, content and function words
highlight this distinction. Content words work to
create the meaning within the sentence (e.g., nouns
cat, peace; verbs sneeze, send), while function words
act as the glue which binds the concepts together in a
particular meaningful fashion (e.g., passive agent by;
infinitival to). Since function words often have more
to do with syntactic relationships than semantics,
replacing them is not productive in the quest for
innocuous sentences, nor is it linguistically valid.
Compare (19), with (20), which replaces the function
word by with another preposition, up:
19. The car was washed by the school kids.
20. *The car was washed up the school kids.
All content words are, in principle, candidates for
replacement, but further choices are made about
viable replacements: sparseness is not imposed by the
syntax, but by semantic and pragmatic considerations.
Additionally, ambiguity and related problems are
avoided within LUNABEL by using only easily handled
words as encrypted information bearers.
Of the words we consider candidates for
replacement, there are further subdivisions into word
lists based on the following criteria: word frequencies,
phonetic features, number classes, inflectional
features, and word categories and sub-categories.
Thus, cat and ant are placed in different word lists
based on their phonetic features; cat with other
consonant initial terms, and ant with vowel initial
terms, in order to produce grammatically correct word
combinations like a cat/an ant. A word which has a
limited or minimal frequency within the context of the
larger corpus of data is excluded from the word lists,
and higher frequency words are grouped with words
of similar frequency. Plural nouns are grouped
separately from singular nouns, with nouns which are
ambiguous with respect to number (e.g., one fish, two
fish) also grouped separately. Verbs are grouped
according to their inflectional features, creating word
lists which respect paradigms marking tense and
number. These groupings prevent replacements which
create number inconsistencies on the order of:
21. *The cat were…
22. *The papers is…
Verbs are also grouped according to their syntactic
subclasses of intransitive, transitive, ditransitive, etc.
to avoid sentences such as the following:
23. *The cat sent.
24. *The girl slept a toy to her sister.
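These feature-based groupings can be sketched by keying each word on a tuple of its features; the tiny lexicon and the particular feature values below are illustrative assumptions, not LUNABEL's actual data.

```python
from collections import defaultdict

# Illustrative sketch: words that agree on every feature (part of speech,
# subcategory, number, initial-sound class, frequency band) fall into the
# same substitution class, so swapping within a class preserves
# grammaticality (a cat / an ant, sent vs. slept, etc.).

LEXICON = [
    # (word, pos, subclass, number/form, frequency band)
    ("cat",   "noun", "count",        "sg",   "mid"),
    ("ant",   "noun", "count",        "sg",   "mid"),
    ("dog",   "noun", "count",        "sg",   "mid"),
    ("send",  "verb", "ditransitive", "base", "mid"),
    ("sleep", "verb", "intransitive", "base", "mid"),
]

def initial_class(word: str) -> str:
    """Vowel- vs. consonant-initial, for the a/an distinction."""
    return "vowel" if word[0] in "aeiou" else "consonant"

classes = defaultdict(list)
for word, pos, sub, num, freq in LEXICON:
    classes[(pos, sub, num, initial_class(word), freq)].append(word)

# cat and dog share a class; ant lands in a separate (vowel-initial)
# class, and the two verbs are kept apart by transitivity, blocking
# ungrammatical results like (21)-(24).
```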
The word lists are built in the following fashion: the corpus contains 46,000 words, from which word frequency statistics have been extracted. The word process appears 18 times within that corpus. Because it
appears as a noun in this genre, it needs to heed the
a/an distinction, as it can directly follow an indefinite
determiner. The term section occurs 23 times, follows
the same graphotactic structure, and occurs in both of
the same syntactic positions. Both terms can be used as transitive verbs describing a manipulation of data, and both can be used nominally to highlight an area or aspect of something (generally a program, code, or location on the computer, within this genre of texts). While these words are not synonyms, or even near synonyms, and hence would not be interchangeable under earlier synonym-based word-replacement systems, synonymy is not required within LUNABEL. Building up
the wordlist continues until 16 terms are found,
factoring all of the above features. These criteria
require a sizable corpus in order to produce
comprehensive word lists. Given all of these criteria,
the result is a syntactically reasonable and
semantically innocuous text with unremarkable word
frequencies when compared to other texts in a corpus
of that style.
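Since each finished word list holds 16 terms, one replacement can carry log2(16) = 4 bits. A minimal sketch of filling such a list from frequency-filtered candidates follows; the frequency threshold and the candidate pool are illustrative assumptions, not the paper's actual values.

```python
import math

LIST_SIZE = 16                             # terms per word list (from the paper)
BITS_PER_SLOT = int(math.log2(LIST_SIZE))  # 4 bits carried per replaced word

MIN_FREQ = 10  # illustrative floor: rare words are excluded from lists

def build_word_list(candidates: dict[str, int]) -> list[str]:
    """Keep candidates above the frequency floor, most frequent first,
    up to LIST_SIZE terms. A list is only usable once it reaches the
    full 16 terms, which is why a sizable corpus is required."""
    frequent = [w for w, f in sorted(candidates.items(),
                                     key=lambda kv: -kv[1])
                if f >= MIN_FREQ]
    return frequent[:LIST_SIZE]

# The corpus counts for 'process' (18) and 'section' (23) come from the
# text; 'rare' is an invented low-frequency word that gets filtered out.
build_word_list({"process": 18, "section": 23, "rare": 2})
```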
3.5 Encryption Density
The size requirement for cover texts is naturally
affected by LUNABEL’s sparse substitution. However,
the size of the word lists is also a factor, in that the
more words are considered viable for replacement, the
less cover text is necessary. Essentially, the ratio of
words in substitution classes versus non-replacement
words determines both how sparse the encryption is
and the minimum length of the cover text.
While we have discussed how only content words are used, the system is in fact even more selective than that: only a small subset of content words is marked for replacement. This also contributes to the sparseness of the encryption.
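The ratio described above can be made concrete with some back-of-the-envelope arithmetic; the sample numbers below are illustrative, not measured properties of LUNABEL.

```python
import math

def min_cover_words(message_bytes: int, list_size: int,
                    replaceable_fraction: float) -> int:
    """Rough lower bound on cover text length (in words): each replaced
    word encodes log2(list_size) bits, and only a fraction of the cover
    text's words belong to substitution classes at all."""
    bits_needed = message_bytes * 8
    bits_per_replacement = math.log2(list_size)
    slots_needed = math.ceil(bits_needed / bits_per_replacement)
    return math.ceil(slots_needed / replaceable_fraction)

# A 20-byte message, 16-term lists, one cover word in 16 replaceable:
min_cover_words(20, 16, 0.0625)  # -> 640 cover words needed
```

Doubling the fraction of replaceable words halves the required cover text, which is the inverse relationship the section describes.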
3.6 Linguistic Robustness
While cryptography systems are typically rated and valued based on statistical modeling and testing, such testing is less relevant for assessing this system's contribution to lexical steganography. One can use
any encryption scheme one wants in lieu of the
method in Step 2 of the word replacement process,
and still use the rest of the described technique to hide
the encrypted message. Testing this system through traditional formats would thus measure our particular encryption method, yielding results that do not reflect the linguistic sophistication of the system. Instead, we argue that linguistic robustness,
an impressionistic property, offers a more meaningful
evaluation. Within this rubric, accounting for a higher
number of linguistic features and producing
semantically and syntactically reasonable text within
a larger context or genre is valued.
LUNABEL strives for linguistic robustness,
achieved by replacing only content (not function)
words and assembling word lists based not just on
part of speech, but also on subcategorization,
semantic properties, and word frequencies. The
system outlined in this paper is linguistically more
robust in combining both syntactic and semantic
criteria than other systems we have discussed.
The cover text encourages this deception, having been chosen specifically for its unintelligibility and its odd assortment of syntactic structures. While readme files are prolific in modern computing (a quick search of the first author's hard drive found 67 of them), they do not necessarily follow full-sentence syntactic patterns throughout, and they are not necessarily comprehensible to someone unfamiliar with the style. In short, the pragmatics of the cover text do not undermine the steganographic intent but instead serve it, building on the reader's confusion and inattention to detail with such texts.
Testing the strength of the encryption is beside the point, and unresponsive to the object of this project, which is to develop a linguistically sophisticated steganographic system. Further encryption can be piggybacked onto the system by applying any encryption technique to the original message before it is passed into this system, or by using a more sophisticated scheme for converting the original message into the numeric stream that drives word replacement.
The value of this system is that it uses linguistic
features including word frequencies, content vs.
function words, word categories and sub-categories,
phonetic features, inflectional classes and number
features to replace words within a sparsely encoded
cover text. The cover texts, drawn from a genre of
opaque literature with a broad range of possible
syntactic structures and semantically and syntactically
atypical word usage, further disguise possible
encryption. This system, exploiting syntactic, semantic, and phonological knowledge, allows enhanced linguistic robustness in lexical steganography.
4. Discussion and Limitations
LUNABEL, although the first linguistically robust
steganographic tool, is not without its limitations.
4.1 Good Cover Texts Help All Equally?
The use of opaque cover texts, while currently
unique to LUNABEL, also holds promise for other
lexical steganographic efforts. However, different
programs use cover texts in different ways: by replacing all words within the cover text [4, 5], or by manipulating the syntax of the cover text [1], rather than by selectively replacing single words (LUNABEL). The style of readme files, while less rigid than other genres, does have expected syntactic patterns, complicating the syntax-manipulation approach [1]; and their semantic opacity, which LUNABEL exploits, is marred by wholesale word replacement, which maintains neither the particular genre of text nor the particular topic of the original cover text [4, 5].
Another point of consideration is the level of
commitment each program has to a particular cover
text document: earlier programs [1,4,5] are bound to a
single cover text once its syntactic patterns have been
extracted, while LUNABEL is only restricted to a
particular genre of texts once word lists are created,
with any text within that genre available for possible
use as the cover text in a particular instance of
message transmission.
Programs which rely on already-constructed, mammoth-sized synonym lists are also hampered in their use of the readme genre in particular, because the technology discussed within a readme is constantly changing as new technology (and hence jargon) is created and distributed. The constant flux of technology-related language is harder to capture in a program whose synonym lists are unwieldy. The potential benefit of using readme files for these other approaches is thus uncertain.
4.2 Environmental Factors
LUNABEL’s cover texts are also complicated by situations in which such cover texts are not ‘natural,’ and hence elicit suspicion based simply on their genre. However, this risk can be, and has been, reduced by adding further styles of cover text (e.g., recipes, classified ads, instruction manuals, interviews) and accompanying word lists, which will continue to grow with time.
Further, messages can be sent and received, or
posted to web sites within the appropriate genre. For
many of the suggested genres of cover texts, it is
simple to find web forums that work only within the
one genre, further disguising the encrypted cover text.
From there, the text can be passively read, and
manipulated, by the appropriate recipient, while other
naïve individuals view the text without suspicion.
In this increasingly internet-savvy age, the range of styles the average user accesses via web searches is growing, further aiding the innocuousness of LUNABEL-generated texts.
To this end, word lists for a second cover text
style, recipes, have been created, and are briefly
demonstrated here with an ingredient list that
typically prefaces recipes:
25. Excerpt from recipe for Seafood Croquettes:
3 tb Unsalted butter
1/2 sm Onion -- finely minced
1/2 Celery stalk -- finely diced
3 tb All-purpose flour
1/3 c Milk
1/4 ts Ground nutmeg
Bread crumbs
1 lb Cooked fish and/or shellfish -- such as salmon, shrimp, scallops, white fish, or a combination
1 tb Finely chopped parsley
1 tb Chopped chives
1 t Salt -- or as desired
1/4 ts Cayenne pepper -- as desired
Flavorless cooking oil
26. Encrypted text of same recipe ingredient list:
3 tb Unsalted butter
1/2 largish Onion -- tightly quartered
1/2 Anise root -- coarsely grated
3 tb All-purpose flour
2/3 c Milk
1/4 oz Ground cinnamon
Bread crumbs
3 pint Cooked fish and/or shellfish -- such as grouper, eel, scallops, golden fish, or a combination
2 tb Evenly bruised marjoram
1 tb Separated chives
1 t Salt -- or as desired
1/4 pn Turmeric spike -- as desired
Flavorless cooking oil
Recipes are an interesting, and relatively innocuous, target for word replacement, in large part because of the expectations behind recipes. If someone prefers a recipe with a slightly different flavor, ingredient list, or ingredient quantities, the recipe is open to alteration: there is no ‘right’ version of a recipe, and generally no author with special authority over it. This results in web sites and cookbooks with
massive databases of recipes, some wildly different,
some with only minute differences. Within this,
recipes manipulated into encrypted messages via
LUNABEL are fairly innocuous.
Recipes, as well as classified ads, have a benefit
over readme files; their size is malleable. Appending a
series of recipes (or ads) together looks, under
average perusal, like a cookbook (or a listing forum
from an online news service). It is relatively easy to
develop word lists for a new genre of text, which
expands the possibility for different genres and styles.
4.3 Cover Text Size Requirements
LUNABEL has a larger cover text size requirement than earlier efforts (approximately four times as large as NICETEXT II), in large part because of its sparse substitution. The exact ratio of cover text to original message size is not constant, however; it inversely depends on the size of the word lists (as the number of word lists increases, the size requirement of the cover text drops).
5. Conclusion
With the features enumerated in this paper, it has proven possible to refine the simple word replacement cryptosystems of the past into a much more robust and linguistically sophisticated program, LUNABEL. This system improves upon earlier word replacement programs like NICETEXT in its focus on producing semantically innocuous text, its adherence to a range of further linguistically significant criteria, and its exploitation of cover text genres which are syntactically and semantically abnormal and prone to opaque language styles. This line of work has the advantage of creating encrypted cover texts which are likely to appear innocent to humans in a casual scan.
6. References
[1] Atallah, M.J., V. Raskin, M. Crogan, C.F. Hempelmann, F. Kerschbaum, D. Mohamed, and S. Naik. “Natural Language Watermarking: Design, Analysis, and a Proof-of-Concept Implementation.” In I. S. Moskowitz (ed.), Information Hiding: 4th International Workshop, IH 2001, Pittsburgh, PA, USA. Springer-Verlag: Berlin Heidelberg. April 2001. 185-199.
[2] Bergmair, Richard. “HARMLESS - A First Glimpse at the Literature.” http://bergmair.cjb.net/pro/towlingsteglitrev-rep.www/. 2003.
[3] Chapman, Mark. “NICETEXT.” http://www.ctgi.net/NICETEXT. 1997.
[4] Chapman, Mark. “Hiding the Hidden: A Software System for Concealing Ciphertext as Innocuous Text.” http://www.NICETEXT.com/NICETEXT/doc/thesis.pdf. 1997.
[5] Chapman, Mark, George Davida and Marc Rennhard. “A Practical and Effective Approach to Large-Scale Automated Linguistic Steganography.” Lecture Notes in Computer Science, Volume 2200, Springer-Verlag: Berlin Heidelberg. Jan 2001. 156-167.
[6] Chomsky, Noam. Syntactic Structures. Mouton: The Hague. 1957.
[7] Fellbaum, Christiane (Ed.). WordNet: A Lexical Database for the English Language. MIT Press: Cambridge. 1998.
[8] Jurafsky, Daniel, and Martin, James. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall: Upper Saddle River, NJ. 2000.
[9] Langacker, Ronald. “Space grammar, analyzability, and the English passive.” Language 58. 1982. 22-80.
[10] Langacker, Ronald. Foundations of Cognitive Grammar. Vol. 1: Theoretical Prerequisites. Stanford University Press: Stanford. 1987.
[11] Miller, George. “WordNet: A Lexical Database for the English Language.” http://www.cogsci.princeton.edu/~wn/
[12] Orgun, Orhan. “Linguistic Steganography Project Report.” Unpublished M.S. University of California, Davis. 1999.
[13] Winstein, Keith. “Lexical steganography through adaptive modulation of the word choice hash.” http://alumni.imsa.edu/~keithw/tlex/lsteg.ps. Ms.