Proceedings of the 39th Hawaii International Conference on System Sciences - 2006

Exploiting Linguistic Features in Lexical Steganography: Design and Proof-of-Concept Implementation

Vineeta Chand and C. Orhan Orgun
University of California, Davis
[email protected] [email protected]

Abstract

This paper develops a linguistically robust encryption system, LUNABEL, which converts a message into semantically innocuous text. Drawing upon linguistic criteria, LUNABEL uses word replacement, with substitution classes based on traditional word replacement features (syntactic categories and subcategories), as well as features under-exploited in earlier work: semantic criteria, graphotactic structure, inflectional class and frequency statistics. The original message is further hidden through the use of cover texts—within these, LUNABEL retains all function words and targets specific classes of content words for replacement, creating text which preserves the syntactic structure and semantic context of the original cover text. LUNABEL takes advantage of cover text styles which are not necessarily expected to be comprehensible to the general public, making any semantic anomalies more opaque. This line of work promises encrypted texts which are less detectable than earlier steganographic efforts.

1. Introduction

While current encryption techniques are sufficiently advanced to make code-breaking practically impossible, one major drawback of current encryption methods is the ease of identifying an encrypted text—they do not resemble natural text in any way. Steganography attempts to answer this need, acting to conceal the message's existence in order to transmit encrypted messages without arousing suspicion. The chances that messages will be intercepted, which can lead to broken codes, are dramatically reduced with effective steganographic methodology.
In some cases, just the fact that correspondents are exchanging encrypted messages might be more information than one wants to give away—this is the major advantage of steganography. The classic example is lovers writing each other innocent-looking notes—it would not do if their parents knew they were writing each other encrypted messages, even if the parents had no way of breaking the code. Another recent case was a political prisoner in Turkey sending letters to his friends outside, telling them that he was being tortured. If he had attempted to send a letter that looked like an encryption, the letter would never have been delivered. Again, whether the code is breakable is secondary; the main point is not wanting people to know about the exchange of encrypted messages. In these examples, encryption was performed in creative ways, but regular use of steganography requires a robust algorithmic technique.

Information hiding has taken one form in image-based steganography, utilizing minimal changes in pixels or watermarking techniques. While text-based messages have also been used within image-based maneuvers, by modifying the white space between letters and by minutely changing the fonts, this has proved less fruitful because text can be retyped and is often altered in the conversion from one program version or platform to another. Proving more productive, as well as resistant to the difficulties surrounding the re-typing of text-based messages, is lexical steganography, which uses linguistic structures to disguise the encryption of text messages such that the appearance of the message remains semantically and syntactically innocent.

This paper develops LUNABEL, a linguistically informed alternative to existing text-based word replacement steganographic systems (NICETEXT [5], Tyrannosaurus Lex [13], WordNet [11]).
Using myriad word and phrase categories recognized in linguistics as well as other linguistically relevant criteria, this model creates more cohesive and semantically plausible results than earlier efforts. Earlier text-based steganographic systems, while more natural than random alpha-numeric sequences because of their use of words framed in a sentence format, still fall short of the goal: the text produced is sufficiently unnatural in appearance to the human eye to warrant further inspection, which defeats the goal of steganography. LUNABEL improves upon existing text-based steganographic techniques through the inclusion of linguistic criteria in the program's word replacement method, producing semantically and syntactically reasonable text.

This paper explores the design choices underlying LUNABEL as well as the substitution classes of words, focusing on their adherence to linguistic features beyond basic word classes (noun, verb, adjective, determiner, etc.) and syntactic features (sentence frames), and concludes with a discussion of linguistic robustness.

2. Past Research

Lexical steganography has had three main veins of research: watermarking techniques [1] that manipulate sentences through syntactic transformations, word replacement systems both with and without cover texts, and context-free grammars such as NICETEXT.

2.1. Watermarking

Atallah et al. [1] watermark texts by manipulating and exploiting the syntax (formal word order and grammatical voice) of sentences. Through common generative transformations (clefting (2), adjunct fronting (3), passivization (4), adverbial insertion (5)), the syntax of each sentence is altered:

1. The lion ate the food yesterday. (original sentence)
2. It was the lion that ate the food yesterday.
3. Yesterday, the lion ate the food.
4. The food was eaten by the lion yesterday.
5.
Surprisingly, the lion ate the food yesterday.

While this broadly preserves the meaning of each sentence, their argument that this approach will withstand translation into other languages, as well as appear as innocuous text, is not strong. On the first point, adverbs and adjunct clauses which may appear in multiple syntactic positions within English (e.g., in (5), 'yesterday' and 'surprisingly' can be placed at the beginning or end of the sentence as well as directly before the verb) do not necessarily have this same range of movement in other languages with differing word order rigidity. Also confounding is English's notorious strictness in its subject, object and verb word order, in marked contrast to its treatment of adverbs and adjunct clauses. Because of these features, the range of possible transformations within English does not have a dependable one-to-one correlation in most other languages, Indo-European or otherwise.

Additionally, the more serious claim, apparent when examining the English version of Atallah et al.'s watermarked text, that transformations will not affect the semantics of a text, reflects a generative bias and is not necessarily true. Newer theories of language argue for the interconnectedness of the semantic and syntactic levels [9, 10], demonstrating that the syntactic pattern is itself inherently meaningful. Furthermore, statistically, various syntactic structures (word orders) are not equal in distribution: different genres of text have wildly different syntactic structures, and replacing such structures freely could create a text which is trivially broken by statistical methods—a security threat to the program. Another problem for their claim of encryption integrity within translation arises with a closer examination of syntactic structure cross-linguistically.
The use of passivization, for example, varies cross-linguistically: in English it is appropriate to say (6), while in Hindi the same concept would be phrased, when translated directly into English, as (7).

6. I was late.
7. To me lateness happened.

Translation, either based on semantics or word-by-word, will not maintain the syntactic structures Atallah et al. assume. Their approach, while more linguistically sophisticated than earlier work, is clearly not without problems. More importantly, its focus is centered on defeating word frequency statistics and NLP software which would test the semantics of the text, making little reference to how such text would fare under human inspection. Our program is focused on defeating the casual human inspector, negating the possibility that the text would raise suspicion and be passed on to NLP and statistical programs for further inspection.

2.2. Word Replacement

Attempts have been made to focus on synonymous words as candidates for word replacement (WordNet [11], Tyrannosaurus Lex [13]). However, this is rather problematic from a linguistic standpoint, as the number of true synonyms in English is relatively small to nonexistent (a commonly cited synonymous pair in American English is sofa/couch, but even this is only valid in some areas of the United States). Unfortunately, the notion of synonymy is vague, often characterized as a similarity in meaning and the ability to replace one term with another. A closer examination of commonly perceived synonyms often reveals differences in terms of their syntactic distributions and word frequencies. Additionally, most words have more than one sense. It is rare that two terms will overlap completely in all of their senses, complicating the replace-all plan used in earlier steganographic works.
Another complication is that many synonymous pairs actually come from different dialects (i.e., regionally defined varieties of language) or registers (i.e., language varieties used in particular social settings); their range of underlying implications is disparate and often based on these variations in register and dialect. For example, the terms violinist and fiddler are commonly cited synonyms; however, they are poor replacement choices. Compare the original sentence, (8), with (9), synonymous term replacement:

8. Itzhak Perlman, one of the most famous violinists, performed recently at Carnegie Hall.
9. *Itzhak Perlman, one of the most famous fiddlers, performed recently at Carnegie Hall.

Example (9) demonstrates that semantically similar terms, in this case terms for people who play a small fretless, bowed string instrument, are not necessarily replaceable, due to underlying differences in connotation—Itzhak Perlman is not a famous fiddler. To say that he is one implies that he plays a style of music other than Western classical music (perhaps Bluegrass), and could be interpreted as a derogatory reflection on his sophistication. Oftentimes this register difference between synonymous terms is problematic for wholesale replacement schemes, which ignore register as well as genre-specific word frequencies, making replaced terms stand out with respect to the text as a whole.

A further impediment to synonym-based word replacement is graphotactic structure (i.e., legal and illegal letter combinations). To cite one example, the English indefinite articles a and an are used in a phonologically rule-based pattern, with a preceding consonant-initial terms and an preceding vowel-initial terms:

/a/ → [a] / __ #C  e.g. a banana
    → [an] / __ #V  e.g. an apple

Exchanging apple with banana would thus produce the ill-formed phrase *an banana. These factors, taken collectively, severely affect the flow, semantics, and readability of a text altered based on "synonymous" word replacement.
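The a/an alternation just described can be checked mechanically. The following Python sketch is our illustration, not part of any of the systems under discussion; it applies only the simplified, spelling-based version of the rule:

```python
def indefinite_article(word):
    """Pick the English indefinite article using the simplified,
    spelling-based rule above: "an" before vowel-initial words,
    "a" before consonant-initial words.
    (The real rule is phonological: "an hour" but "a unicorn",
    which a purely letter-based check cannot capture.)"""
    return "an" if word[:1].lower() in ("a", "e", "i", "o", "u") else "a"

# Swapping apple for banana without re-checking the article
# is what produces the ill-formed *an banana.
```

A replacement system that exchanges vowel-initial and consonant-initial nouns must re-run such a check, or, as LUNABEL does, simply keep the two classes in separate substitution lists.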
In short, the notion of replacement based on synonyms oversimplifies language. Word replacement based on synonymy, while sharing with LUNABEL a focus on defeating human inspection, is not linguistically sophisticated and has yet to produce naturalistic encrypted text.

2.3. Context-free Grammars

The third approach used in lexical steganography exploits the syntactic structures of a text, using a part-of-speech tagger to place new words in old frames, in order to create new encrypted text. Past research in this arena has produced software functionally similar to LUNABEL, in that a message is encoded into an innocuous-looking text [3]. However, while improving upon earlier synonym-based word replacement programs, NICETEXT has certain drawbacks which limit its potential, and distinguish it in appearance and methodology from LUNABEL.

NICETEXT uses the cover text simply as a source of syntactic patterns: by running the cover text through a part-of-speech tagger, NICETEXT obtains a set of "sentence frames," e.g. [(noun) (verb) (prep) (det) (noun)] for 'I sat in the tree.' It also compiles a lexicon of words found in the cover text via part-of-speech tags, with each word in the lexicon associated (arbitrarily) with either of the binary digits 0 or 1. In encryption, the plain text message is converted into a sequence of binary digits. A random sentence frame is chosen and the part-of-speech tags in it are replaced by words from the lexicon according to the sequence of binary digits.

Based on a linguistically unsophisticated model of language (part-of-speech tags are useful in syntactic parsing and similar tasks, but inadequate for semantically plausible word replacement—an indication of this may be found in the wildly differing numbers of part-of-speech tags used in taggers developed for different purposes: the Penn Treebank tagset of 45 tags vs.
the C7 tagset of 146 tags [8]), NICETEXT is ill-prepared to deal with ambiguous lexical items (part-of-speech ambiguity—words with two possible syntactic categories, like share, which functions as both a noun and a verb—is especially disastrous), and items which cannot be interpreted literally (idioms and metaphors).

Furthermore, function and content words are treated alike in NICETEXT. Since function words often have more to do with syntactic relationships (e.g., passive agent by; infinitival to) than semantics, subjecting them to word replacement often results in ungrammatical and semantically anomalous sentences. Compare (10), NICETEXT's cover text, with (11), encryption based on NICETEXT.

10. NICETEXT original text (unaltered), from John F. Kennedy's 1961 inaugural speech:

We observe today not a victory of party but a celebration of freedom. . . symbolizing an end as well as a beginning. . . signifying renewal as well as change for I have sworn before you and Almighty God the same solemn oath our forbears prescribed nearly a century and three-quarters ago. The world is very different now, for man holds in his mortal hands the power to abolish all forms of human poverty and all forms of human life. (...)

11. NICETEXT encrypted version of the above text:

My area origins of the suspicion... oppose much what America will be before you, before what asunder we would be for the Poverty inside Man. Yet will it do almighty off the first two south votes... or on the course at this administration, whether even deadly off our suspicion by this peace. To those young votes what proud nor alike origins we comfort: we house the heritage past hostile americans. (...)

Regarding the semantics of (11), Atallah et al.
[1] comment that NICETEXT and similar programs are "…context-free grammars that generate primitive but less conspicuously meaningless texts, in which each individual sentence may be sort of meaningful, even if rather unrealistically simple…" Clearly, there is room for improvement in the semantics of the texts produced by earlier steganographic attempts.

Another factor worth considering is the density of encryption within the cover text. Ideally, the cover text should work to hide the word frequencies and syntactic structure of the hidden plain text message. Steganographic goals encourage sparse encryption, which leaves the majority of the text unaltered by word replacement. NICETEXT encryption is maximally dense—every word within the final encrypted cover text conveys hidden information. Given that each encrypted word is part of the original information-bearing message and common word usage patterns are unavoidable, this is problematic for the original steganographic intent: avoiding detection and producing naturalistic text.

All of the shortcomings detailed above make the fact of encryption obvious, as the syntax and semantics of the cover text are unnatural enough to draw attention to the message, whether examined by AI/NLP software or by a human. This is acknowledged in the original authors' reflection on NICETEXT: "Although the initial NICETEXT approach was an improvement, it was not as effective in producing text that was 'believable' by a human" [5]. NICETEXT II [5], the successor to NICETEXT, is based on a context-free grammar instead of sentence frames, and is focused on the ability to build an infinitely large set of sentences based on a dictionary of tagged words. This focus on infinite sentence generation still leaves NICETEXT II as open to the density critique as NICETEXT.
Additionally, the semantic plausibility of the sentences has not increased, nor is there any relevance or relation between sequential sentences, rendering the resulting text contextless. These issues can be seen in (12), which uses the sentence frames from the original text found in (10) to produce the following sample sentence:

12. The prison is seldom decreaseless tediously, till sergeant outcourts in his feline heralds the stampede to operate all practices among interscapular stile inasmuch all tailers underneath indigo pasture.

While this sentence might be considered syntactically reasonable, it is clearly not semantically well-formed, and hence open to further inspection and detection. It is reminiscent of Chomsky's [6] famous sentence (13), in which he highlighted syntactic well-formedness without semantic well-formedness, i.e. grammatically reasonable nonsense:

13. Colorless green ideas sleep furiously.

While Chomsky's accompanying generative syntactic theory has had many incarnations in the last 40 years, the point made with this sentence holds true. As this review has demonstrated, NICETEXT I and II are concerned only with syntactic well-formedness, allowing for grammatically correct but semantically anomalous text. Given the goal of creating innocuous text, failing to consider semantic well-formedness leaves NICETEXT I and II vulnerable to detection.

While the dominant focus has been creating naturalistic text which would fool a human, it can be gathered from surveying the goals and steganographic processes of existing systems that there is much work still to be done. The texts produced by all of these earlier systems, while potentially more innocuous than an alpha-numeric stream of digits, are clearly unnatural and likely to arouse suspicion, whether through statistical or NLP software, or human inspection.
Linguistic features beyond syntactic structure and synonymy clearly need to be accounted for in order to produce natural-looking text, and less dense encryption is critical for future work. The current project addresses these further needs.

3. Current Work

3.1. Introduction

Our technique improves upon earlier research with its use of a cover text, a more informed and less dense means of disguising the message, and draws upon linguistically significant criteria and selectivity in word replacement, especially in dealing with problematic and ambiguous words. Our cover text differs in function from earlier works in two pivotal ways: the majority of words within the cover text are maintained (unlike [3] and [4]) and only individual words are replaced, distinct from the manipulation of syntactic structures [1]. Encryption is based on word replacement, with replacement word lists based on part-of-speech categories and sub-categories (especially for verbs), semantic criteria, graphotactic structure (for nouns, to handle the a~an allomorphy of the indefinite article), inflectional class (regular vs. irregular plurals, past tense forms, etc.) and word frequency statistics.

The primary goal is to create naturalistic text which passes human inspection: if human inspection does not raise suspicions, there is less chance that the text will be scrutinized further with statistical and NLP software. Given our system of encryption, and given that it is hard to replace function words (see section 3.4) and, to a varying extent, highly ambiguous words without also affecting the syntax of a text, it is neither necessary nor, we argue, linguistically valid to replace every word. Instead, LUNABEL replaces only those words that are in one of the specified substitution classes and leaves other (mostly function) words unchanged in the cover text.
As a result, the encrypted message more closely reflects the syntax and lexical frequency characteristics of the cover text, which makes its appearance more natural under close scrutiny. Key to this endeavor is the use of cover texts which are not expected to be comprehensible to the general public, making possible discrepancies more opaque within the context of the larger message. Specifically, we have focused on the writing style found in "readme" files which accompany software packages. Using this genre is advantageous for many reasons: it complicates detection both on the AI/NLP software side and the human inspection side (this is elaborated on in section 3.3). While frequency statistics are taken into account, within this paper the focus is centered more on creating semantically reasonable text than on statistically appropriate word frequencies.

3.2. Overview of LUNABEL

The encryption scheme is two-part; the first step converts the plain text message (14) into a sequence of hexadecimal digits (15). This is done by taking the ASCII code of each character and expressing it as a pair of hexadecimal digits.

14. Plain text

I had a cat and the cat pleased me I fed my cat under yonder tree. Cat went fiddle-dee-dee, fiddle-dee-dee. I had a dog and the dog pleased me I fed my dog under yonder tree. Dog went bawa, bawa, cat went fiddle-dee-dee, fiddle-dee-dee.

15.
Plain text converted to a sequence of hexadecimal digits (via LUNABEL):

9 2 0 6 8 6 1 6 4 2 0 6 1 2 0 6 3 6 1 7 4 2 0 6 1 6 14 6 4 2 0 7 4 6 8 6 5 2 0 6 3 6 1 7 4 2 0 7 0 6 12 6 5 6 1 7 3 6 5 6 4 2 0 6 13 6 5 2 14 0 10 4 9 2 0 6 6 6 5 6 4 2 0 6 13 7 9 2 0 6 3 6 1 7 4 2 0 7 5 6 14 6 4 6 5 7 2 2 0 7 9 6 15 6 14 6 4 6 5 7 2 2 0 7 4 7 2 6 5 6 5 2 14 0 10 4 3 6 1 7 4 2 0 7 7 6 5 6 14 7 4 2 0 6 6 6 9 6 4 6 4 6 12 6 5 2 13 6 4 6 5 6 5 2 13 6 4 6 5 6 5 2 12 2 0 6 6 6 9 6 4 6 4 6 12 6 5 2 13 6 4 6 5 6 5 2 13 6 4 6 5 6 5 2 14 0 10 0 10 4 9 2…

Thus, each ASCII character in the original text will be encoded by two integers between zero and 15 (two hexadecimal digits), and ultimately encrypted by two word replacements. In contrast, NICETEXT uses 8 binary digits for each character, and therefore 8 word replacements encrypt one plain text character. The reasoning behind the encryption scheme used here is discussed in Linguistic Robustness, section 3.6.

Another piece of text, the "cover text," is then brought in (the impetus behind this text selection is covered in section 3.3):

16. Excerpt from cover text (readme.txt file distributed with the GSview® software package):

Features include:
- View pages in arbitrary order (Next, Previous, Goto).
- Selectable display resolution, depth, alpha.
- Page size is automatically selected from DSC comments or can be selected using the menu.
- Orientation can be automatically selected from DSC comments or can be selected using the menu (Portrait, Landscape).

Words are replaced in this cover text with other words, according to the above sequence of plain-text numbers manipulated through an encryption key, which provides a pseudo-random sequence of integers. Here is the implementation of this:

1. We take the first number in the hex-dig file, in this case 9.
2. We consult our encryption key to obtain an integer, let's say 8.
3.
We add these numbers together to get a new number n, in this case 17, which corresponds to 1 (17 mod 16).
4. Then, we take the first word in the cover text. If this word is not in any of our word lists, we skip words until we come to a word that is (in this example, "include" is the first such word).
5. Once we find such a word, we look up the word list it is a member of.
6. We replace the cover text word with the nth word of the list, which here happens to be "change."
7. We repeat this procedure for each hex-dig in the converted plain text.
8. (Minor details.) If the plain text is done and some cover text remains, we copy the remainder intact. If the cover text runs out, we issue an error message and optionally start again from the beginning of the cover text, appended to the already-encrypted chunk.

The replacement words are taken from numbered lists of words, substitution classes, which have been compiled specifically for this purpose:

17. Two examples of substitution classes

word_list([change, alter, configure, add, include, exclude, insert, restore, delete, edit, write, modify, manipulate, toggle, clear, rewrite])
word_list([run, eject, send, choose, save, visit, view, scroll, define, chat, ftp, telnet, transmit, enter, exit, close])

Word replacement focuses on linguistically significant features (detailed in section 3.4) beyond syntactic similarity and synonymy, also heeding distinctions in graphotactic form, content versus function grouping, syntactic categories and subcategories, semantic criteria, inflectional class and word frequency statistics, with the result being encrypted text which is reasonably semantically and syntactically innocuous:

18. Encrypted text

Features change:
- Close pages in arbitrary upgrade (Trailing, Last, Goto).
- Selectable broadcast resolution, depth, alpha.
- Use size is instantaneously selected from DSC comments or can be selected using the source.
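The replacement procedure described above can be sketched as follows. This is our illustrative reconstruction, not the LUNABEL source: the function names, the toy key stream, and the cover words in the usage example are invented, and list indexing here is zero-based (the worked example in the text counts list positions from one). Decryption is shown inverting the addition by modular subtraction.

```python
# Sketch of the LUNABEL word-replacement scheme; naming and indexing
# conventions are ours, not the original implementation's.

# Two 16-word substitution classes, copied from example (17).
WORD_LISTS = [
    ["change", "alter", "configure", "add", "include", "exclude",
     "insert", "restore", "delete", "edit", "write", "modify",
     "manipulate", "toggle", "clear", "rewrite"],
    ["run", "eject", "send", "choose", "save", "visit", "view", "scroll",
     "define", "chat", "ftp", "telnet", "transmit", "enter", "exit",
     "close"],
]
# Map each replaceable word to (which list it is in, its position).
WORD_INDEX = {w: (li, pos)
              for li, lst in enumerate(WORD_LISTS)
              for pos, w in enumerate(lst)}

def to_hex_digits(plaintext):
    """Step 1: each ASCII character becomes two integers in 0..15."""
    digits = []
    for ch in plaintext:
        digits += [ord(ch) // 16, ord(ch) % 16]
    return digits

def encrypt(plaintext, cover_words, key_stream):
    digits = to_hex_digits(plaintext)
    out, d = [], 0
    for word in cover_words:
        if d < len(digits) and word in WORD_INDEX:
            li, _ = WORD_INDEX[word]
            n = (digits[d] + next(key_stream)) % 16  # steps 2-3
            out.append(WORD_LISTS[li][n])            # steps 5-6
            d += 1
        else:
            out.append(word)  # steps 4 and 8: non-list words pass through
    if d < len(digits):
        raise ValueError("cover text ran out before the message was encoded")
    return out

def decrypt(stego_words, key_stream):
    """Invert encryption: modular subtraction recovers each hex digit.
    Simplification: every list word is assumed to carry data, so the
    leftover cover text must contain no list words."""
    digits = [(WORD_INDEX[w][1] - next(key_stream)) % 16
              for w in stego_words if w in WORD_INDEX]
    return "".join(chr(hi * 16 + lo)
                   for hi, lo in zip(digits[::2], digits[1::2]))
```

For instance, encrypting "Hi" (hex digits 4 8 6 9) against a toy cover containing the list words include, run, change and view, with key stream 8, 3, 5, 1, replaces "include" with "manipulate" ((4 + 8) mod 16 = 12), and decrypting with the same key stream recovers "Hi".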
- Orientation can be perpetually selected from DSC comments or can be selected using the table (Portrait, Landscape).

Decryption by the recipient requires the key and the word lists, but not the cover text. In decryption, we compare each word in the encrypted message against our word lists. Once we find a word that is in one of our word lists, we subtract the pseudo-random integer obtained from the encryption key from the position of the found word in its word list (modulo 16). Iteration of this procedure through the encrypted message results in a sequence of hexadecimal digits which, in pairs, correspond to the ASCII codes of the characters forming the plain (decrypted) message.

3.3. The Cover Text

We have chosen to use a particular style of cover text within this demonstration of LUNABEL, that of "readme" documents which generally accompany software installations. This style of cover text was chosen for a number of reasons, and is key to the success of this style of lexical steganography. While written texts such as stories, news articles and novels have a very fluid style of prose, they do not represent, in syntax, word frequencies, or fluency, the entirety or norms of written text. Many other styles of writing are considerably less fluid, less coherent, and less intelligible to the general public, while still commonly found. Framing the steganography within this type of genre is to our advantage, because we can exploit the expected discrepancies between this style of writing and more formal, published text as an added measure of security. Word choice differences that might appear odd or novel in a spoken conversation are easily attributed to the style and characteristics of this type of text, further masking the word replacement system. Using this genre complicates detection both on the AI/NLP software side and the human inspection side. The genre has less developed norms of writing stylistics, resulting in a wide range of variation within it.
Given this, statistical software working within this genre requires a much wider range of acceptability for word-clustering frequencies and syntactic structures. Additionally, human inspection is complicated by the range of words co-opted from typical uses within software installation and program details, the frequent use of incomplete sentences, and the opaqueness typical of directions and descriptions within "readme" documents. Other useful (and reasonably opaque) styles are recipes, classified ads, alcohol/wine descriptions, net blogs, English as a Second Language (ESL) essays, life interviews, programming code, and furniture assembly instructions.

In addition to the benefits of using this style of text, the use of cover texts allows for sparse encryption. Only a minority of the words are earmarked for replacement, allowing the majority of the text to remain unchanged. Given this, one can take advantage of the syntactic pattern provided by the cover text. Using the cover text's syntactic pattern lends additional security because the encrypted message reflects the structure of the cover text, not of the original message. Our system of sparse replacement also allows the contextual frame of the cover text to remain in place, such that the sentences make sense when read as a whole text, as compared to the product of a random sentence generator (refer to (12) for an example of NICETEXT random sentence generation).

3.4. The Word Lists

The word lists are manually built up from a corpus of cover texts written in a particular style, in this case the style of "readme" files. Within LUNABEL, each word list contains 16 words which have similar word frequencies, graphotactic structure, syntactic subcategory, etc. within the corpus of data.
Which numbered word in the list is used to replace the original word depends on the manner of encryption used (see section 3.6 for a further explanation); we have used a simple encryption based on an arbitrary user-supplied sequence of integers (akin to the "one-time pad" familiar from WWII encryption methods).

Syntactic categories are typically defined as substitution classes: in replacing a word with another word of the same category, the grammaticality of the sentence should be unaffected. However, part-of-speech (PoS) categories define substitution classes only vaguely. In detailed syntactic work, elaborate subcategorization—e.g., intransitive, transitive, ditransitive verbs, etc.—is needed. Once semantic plausibility is added as a desideratum, semantic factors have to be considered in defining substitution classes (e.g., animate vs. inanimate nouns, verbs with agent vs. experiencer subjects, mass vs. count nouns, etc.). Additionally, heeding the issues past steganographic programs have encountered with synonymy, the semantic range of these word lists is not limited to synonyms—indeed, it would be impossible to find 15 synonyms for any word! Rather than synonymy, the important criterion is usability in similar syntactic and semantic contexts.

Our sparse word replacement is due, in part, to our criteria for the word lists. Some categories of words are more easily replaced within a text than others. As discussed earlier, content and function words highlight this distinction. Content words work to create the meaning within the sentence (e.g., nouns cat, peace; verbs sneeze, send), while function words act as the glue which binds the concepts together in a particular meaningful fashion (e.g., passive agent by; infinitival to). Since function words often have more to do with syntactic relationships than semantics, replacing them is not productive in the quest for innocuous sentences, nor is it linguistically valid.
Compare (19) with (20), which replaces the function word by with another preposition, up:

19. The car was washed by the school kids.
20. *The car was washed up the school kids.

All content words are, in principle, candidates for replacement, but further choices are made about viable replacements: sparseness is not imposed by the syntax, but by semantic and pragmatic considerations. Additionally, ambiguity and related problems are avoided within LUNABEL by using only easily handled words as encrypted information bearers. Of the words we consider candidates for replacement, there are further subdivisions into word lists based on the following criteria: word frequencies, phonetic features, number classes, inflectional features, and word categories and sub-categories. Thus, cat and ant are placed in different word lists based on their phonetic features: cat with other consonant-initial terms, and ant with vowel-initial terms, in order to produce grammatically correct word combinations like a cat/an ant. A word which has a limited or minimal frequency within the context of the larger corpus of data is excluded from the word lists, and higher frequency words are grouped with words of similar frequency. Plural nouns are grouped separately from singular nouns, with nouns which are ambiguous with respect to number (e.g., one fish, two fish) also grouped separately. Verbs are grouped according to their inflectional features, creating word lists which respect paradigms marking tense and number. These groupings prevent replacements which create number inconsistencies on the order of:

21. *The cat were…
22. *The papers is…

Verbs are also grouped according to their syntactic subclasses of intransitive, transitive, ditransitive, etc. to avoid sentences such as the following:

23. *The cat sent.
24. *The girl slept a toy to her sister.

The building of the word lists proceeds in this fashion: the corpus has 46,000 words, from which word frequency statistics have been extracted.
The word process appears 18 times within that corpus. Because it appears as a noun in this genre, it needs to heed the a/an distinction, as it can directly follow an indefinite determiner. The term section occurs 23 times, follows the same graphotactic structure, and occurs in both of the same syntactic positions. Both terms can be used verbally as transitive actions manipulating data, and both can be used nominally to highlight an area or aspect of something (generally a program, code, or location on the computer, within this genre of texts). While these words are not synonyms, or even near synonyms, and hence would not be replaceable under earlier synonym-based word-replacement systems, such is not important within LUNABEL. Building up the word list continues until 16 terms are found, factoring in all of the above features. These criteria require a sizable corpus in order to produce comprehensive word lists. Given all of these criteria, the result is a syntactically reasonable and semantically innocuous text with unremarkable word frequencies when compared to other texts in a corpus of that style.

3.5 Encryption Density

The size requirement for cover texts is naturally affected by LUNABEL's sparse substitution. However, the size of the word lists is also a factor, in that the more words are considered viable for replacement, the less cover text is necessary. Essentially, the ratio of words in substitution classes versus non-replacement words determines both how sparse the encryption is and the minimum length of the cover text. While we have discussed how only content words are used, the system is in fact even more selective than that, with only a small subset of content words even marked for replacement. This also contributes to the sparse replacement.
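The relationship in Section 3.5 between word-list size, replacement density, and minimum cover text length can be made concrete with a back-of-the-envelope estimate; the parameters below are illustrative assumptions, not measurements from the paper.

```python
import math

def min_cover_words(message_bits, replaceable_ratio, class_size=16):
    """Each replaced word carries log2(class_size) bits, and only a
    fraction of the cover text's words are marked as replaceable."""
    bits_per_slot = math.log2(class_size)
    slots_needed = math.ceil(message_bits / bits_per_slot)
    return math.ceil(slots_needed / replaceable_ratio)

# A 128-bit payload with 16-member word lists (4 bits per replaced word)
# and 5% of cover words replaceable needs 32 slots, hence 640 cover words.
print(min_cover_words(128, 0.05))  # -> 640
```

The formula shows both levers at once: larger classes raise the bits carried per slot, while a higher replaceable ratio shrinks the cover text needed for a given number of slots.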
3.6 Linguistic Robustness

While cryptography systems are typically rated and valued based on statistical modeling and testing, such testing is less relevant for valuing this system's contribution to lexical steganography. One can use any encryption scheme one wants in lieu of the method in Step 2 of the word replacement process, and still use the rest of the described technique to hide the encrypted message. Testing this system through traditional formats would necessarily reflect our encryption method, thereby rendering results unreflective of the linguistic sophistication of the system. Instead, we argue that linguistic robustness, an impressionistic property, offers a more meaningful evaluation. Within this rubric, accounting for a higher number of linguistic features and producing semantically and syntactically reasonable text within a larger context or genre is valued. LUNABEL strives for linguistic robustness, achieved by replacing only content (not function) words and assembling word lists based not just on part of speech, but also on subcategorization, semantic properties, and word frequencies. The system outlined in this paper is linguistically more robust in combining both syntactic and semantic criteria than other systems we have discussed. The cover text encourages this deception, having been chosen specifically for unintelligibility and its odd assortment of syntactic structures. While readme files are prolific in modern computing (a quick search of the first author's hard drive found 67 readme files), they do not necessarily follow full-sentence syntactic patterns throughout the text, and they are not necessarily comprehensible to someone unfamiliar with the style. In short, the pragmatics of the cover text do not undermine the original steganographic intent, and instead exploit it, building on the reader's confusion and inattention to detail with such texts.
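As noted above, any cipher can stand in for the simple scheme of Step 2. A minimal sketch of that layering, with a toy XOR cipher as an illustrative stand-in (not LUNABEL's method) for whatever real encryption one applies first, followed by unpacking the ciphertext into 4-bit digits, one per 16-member word list:

```python
def to_digits(data, bits=4):
    """Unpack bytes, big-endian, into groups of `bits` bits; 16-member
    word lists consume one 4-bit digit per replaced word."""
    mask = (1 << bits) - 1
    digits = []
    for byte in data:
        for shift in range(8 - bits, -1, -bits):
            digits.append((byte >> shift) & mask)
    return digits

# A stand-in XOR cipher; any real cipher could be layered here instead.
plaintext = b"hi"
key = b"\x5a\x5a"
ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
print(to_digits(ciphertext))  # -> [3, 2, 3, 3]
```

Because the word-replacement layer only ever sees a numeric stream, swapping in a stronger cipher leaves the steganographic machinery untouched.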
Testing the strength of the encryption is beside the point of this project, whose object is to develop a linguistically sophisticated steganographic system. Further encryption can be piggybacked onto the system by applying any encryption technique to the original message before it is passed into this system, as can a more sophisticated scheme for converting the original message to a numeric stream. The value of this system is that it uses linguistic features, including word frequencies, content vs. function words, word categories and sub-categories, phonetic features, inflectional classes, and number features, to replace words within a sparsely encoded cover text. The cover texts, drawn from a genre of opaque literature with a broad range of possible syntactic structures and semantically and syntactically atypical word usage, further disguise possible encryption. This system, exploiting syntactic, semantic, and phonological knowledge, allows enhanced linguistic robustness in lexical steganography.

4. Discussion and Limitations

LUNABEL, although the first linguistically robust steganographic tool, is not without limitations.

4.1 Good Cover Texts Help All Equally?

The use of opaque cover texts, while currently unique to LUNABEL, also holds promise for other lexical steganographic efforts. However, different programs use cover texts in different ways: some replace all words within the cover text [4, 5], others manipulate the syntax of the cover text [1]; none selectively replaces single words as LUNABEL does.
The style of readme files, while less rigid than other genres, does have expected syntactic patterns, complicating the syntax-manipulation approach [1]. Their semantic opacity, which LUNABEL exploits, is likewise marred by wholesale word replacement, which maintains neither the genre nor the topic of the original cover text [4, 5]. Another point of consideration is the level of commitment each program has to a particular cover text document: earlier programs [1, 4, 5] are bound to a single cover text once its syntactic patterns have been extracted, while LUNABEL is restricted only to a particular genre of texts once word lists are created; any text within that genre is available for use as the cover text in a particular instance of message transmission. Programs which rely on already-constructed, mammoth-sized synonym lists are also hampered in their use of the readme genre in particular, because the technology discussed within a readme is constantly changing as new technology (and hence jargon) is created and distributed. This constant flux of technology-related language is harder to capture within a program whose synonym lists are unwieldy. The potential benefit of using readme files for these other approaches is thus uncertain.

4.2 Environmental Factors

LUNABEL's cover texts are also complicated by situations in which such types of cover texts are not 'natural,' and hence elicit suspicion based simply on the genre of cover text. However, this problem can be, and has been, reduced by adding further styles of cover texts (e.g., recipes, classified ads, instruction manuals, interviews) and accompanying word lists, which will continue to grow with time. Further, messages can be sent and received, or posted to web sites, within the appropriate genre.
For many of the suggested genres of cover texts, it is simple to find web forums that work only within a single genre, further disguising the encrypted cover text. From there, the text can be passively read, and manipulated, by the appropriate recipient, while other naïve individuals view the text without suspicion. In this increasingly internet-savvy age, the range of styles the average user accesses via web searches is increasing, further aiding the innocuousness of LUNABEL-generated texts. To this end, word lists for a second cover text style, recipes, have been created, and are briefly demonstrated here with an ingredient list of the sort that typically prefaces recipes:

25. Excerpt from recipe for Seafood Croquettes:
3 tb Unsalted butter
1/2 sm Onion -- finely minced
1/2 Celery stalk -- finely diced
3 tb All-purpose flour
1/3 c Milk
1/4 ts Ground nutmeg
Bread crumbs
1 lb Cooked fish and/or shellfish
- such as salmon, shrimp,
- scallops, white fish,
- or a combination
1 tb Finely chopped parsley
1 tb Chopped chives
1 t Salt -- or as desired
1/4 ts Cayenne pepper -- as desired
Flavorless cooking oil

26. Encrypted text of same recipe ingredient list:
3 tb Unsalted butter
1/2 largish Onion -- tightly quartered
1/2 Anise root -- coarsely grated
3 tb All-purpose flour
2/3 c Milk
1/4 oz Ground cinnamon
Bread crumbs
3 pint Cooked fish and/or shellfish
- such as grouper, eel,
- scallops, golden fish,
- or a combination
2 tb Evenly bruised marjoram
1 tb Separated chives
1 t Salt -- or as desired
1/4 pn Turmeric spike -- as desired
Flavorless cooking oil

Recipes are interesting, and relatively innocuous for word replacement, based in large part on the expectations behind recipes. If someone prefers a recipe with a slightly different flavor, ingredient list, or ingredient quantities, the recipe is open to alterations: there is no 'right' version of a recipe, and generally no author with the authority to forbid altering one.
This results in web sites and cookbooks with massive databases of recipes, some wildly different, some with only minute differences. Within this, recipes manipulated into encrypted messages via LUNABEL are fairly innocuous. Recipes, as well as classified ads, have a benefit over readme files: their size is malleable. Appending a series of recipes (or ads) together looks, under average perusal, like a cookbook (or a listing forum from an online news service). It is relatively easy to develop word lists for a new genre of text, which expands the possibility for different genres and styles.

4.3 Cover Text Size Requirements

LUNABEL has a larger cover text size requirement than earlier efforts (approximately four times as large as NICETEXT II), in large part because of its sparse substitution. The exact ratio of cover text to original message size is not constant, however; it depends inversely on the size of the word lists (as the number of word lists increases, the size requirement of the cover text drops).

5. Conclusion

With the features enumerated in this paper, it has proven possible to refine the simple word replacement cryptosystems of the past into a much more robust and linguistically sophisticated program, LUNABEL. This system improves upon earlier word replacement programs like NICETEXT in its focus on producing semantically innocuous text, its adherence to a range of further linguistically significant criteria, and its exploitation of cover text genres which are syntactically and semantically abnormal and prone to opaque language styles. This line of work has the advantage of creating encrypted cover texts which are likely to appear innocent to humans in a casual scan.

6. References

[1] Atallah, M.J., V. Raskin, M. Crogan, C.F. Hempelmann, F. Kerschbaum, D. Mohamed, and S. Naik. "Natural Language Watermarking: Design, Analysis, and a Proof-of-Concept Implementation." In I. S. Moskowitz (ed.), Information Hiding: 4th International Workshop, IH 2001, Pittsburgh, PA, USA. Springer-Verlag: Berlin Heidelberg. April 2001. 185-199.

[2] Bergmair, Richard. "HARMLESS - A First Glimpse at the Literature." http://bergmair.cjb.net/pro/towlingsteglitrev-rep.www/. 2003.

[3] Chapman, Mark. "NICETEXT." http://www.ctgi.net/NICETEXT. 1997.

[4] Chapman, Mark. "Hiding the Hidden: A Software System for Concealing Ciphertext as Innocuous Text." http://www.NICETEXT.com/NICETEXT/doc/thesis.pdf. 1997.

[5] Chapman, Mark, George Davida, and Marc Rennhard. "A Practical and Effective Approach to Large-Scale Automated Linguistic Steganography." Lecture Notes in Computer Science, Volume 2200. Springer-Verlag: Berlin Heidelberg. Jan 2001. 156-167.

[6] Chomsky, Noam. Syntactic Structures. Mouton: The Hague. 1957.

[7] Fellbaum, Christiane (Ed.). WordNet: A Lexical Database for the English Language. MIT Press: Cambridge. 1998.

[8] Jurafsky, Daniel, and Martin, James. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall: Upper Saddle River, NJ. 2000.

[9] Langacker, Ronald. "Space grammar, analyzability, and the English passive." Language 58. 1982. 22-80.

[10] Langacker, Ronald. Foundations of Cognitive Grammar. Vol. 1: Theoretical Prerequisites. Stanford University Press: Stanford. 1987.

[11] Miller, George. "WordNet: A Lexical Database for the English Language." http://www.cogsci.princeton.edu/~wn/

[12] Orgun, Orhan. "Linguistic Steganography Project Report." Unpublished M.S., University of California, Davis. 1999.

[13] Winstein, Keith. "Lexical steganography through adaptive modulation of the word choice hash." http://alumni.imsa.edu/~keithw/tlex/lsteg.ps. Ms.