Linguistic Steganography Detection Using
Statistical Characteristics of Correlations
between Words
Zhili Chen*, Liusheng Huang, Zhenshan Yu, Wei Yang,
Lingjun Li, Xueling Zheng, and Xinxin Zhao
National High Performance Computing Center at Hefei,
Department of Computer Science and Technology,
University of Science and Technology of China,
Hefei, Anhui 230027, China
zlchen3@mail.ustc.edu.cn
Abstract. Linguistic steganography is a branch of Information Hiding
(IH) using written natural language to conceal secret messages. It plays
an important role in the Information Security (IS) area. Previous work on
linguistic steganography mainly focused on steganography itself, and there
was little research on attacks against it. In this paper, a novel statistical
algorithm for linguistic steganography detection is presented. We
use the statistical characteristics of correlations between the general service
words gathered in a dictionary to classify the given text segments
into stego-text segments and normal text segments. In the experiment of
blindly detecting the three different linguistic steganography approaches:
Markov-Chain-Based, NICETEXT and TEXTO, the total accuracy of
discovering stego-text segments and normal text segments is found to
be 97.19%. Our results show that the linguistic steganalysis based on
correlations between words is promising.
1 Introduction
As text-based Internet information and information dissemination media, such
as e-mail, blog and text messaging, are rising rapidly in people’s lives today, the
importance and volume of text data are increasing at an accelerating pace. This
growth in the significance of digital text in turn creates increased concerns
about using text media as covert channels of communication. One such
important covert means of communication is known as linguistic steganography.
Linguistic steganography makes use of written natural language to conceal secret
messages. The whole idea is to hide the very presence of the real messages.
Linguistic steganography algorithms embed messages into a cover text in a covert
manner such that the presence of the embedded messages in the resulting stego-
text cannot be easily discovered by anyone except the intended recipient.
Previous work on linguistic steganography was mainly focused on how to
hide messages. One method of modifying text for embedding a message is to
substitute selected words by their synonyms so that the meaning of the modified
K. Solanki, K. Sullivan, and U. Madhow (Eds.): IH 2008, LNCS 5284, pp. 224–235, 2008.
© Springer-Verlag Berlin Heidelberg 2008
sentences is preserved as much as possible. A steganography approach that is
based on synonym substitution is the system proposed by Winstein [1]. There
are some other approaches; among them, NICETEXT and TEXTO are the most
famous.
The NICETEXT system generates natural-looking cover text by using a mixture of
word substitution and Probabilistic Context-Free Grammars (PCFGs) ([2], [3]).
There are a dictionary table and a style template in the system. The style
template can be generated by using a PCFG or a sample text. The dictionary is
used to randomly generate sequences of words, while the style template selects
natural sequences of parts of speech, controlling the generation of words,
capitalization, punctuation, and white space. The NICETEXT system is intended
to protect the privacy of cryptograms from detection by censors.
TEXTO is a linguistic steganography program designed for transforming
uuencoded or PGP ascii-armoured ASCII data into English sentences [4]. It is
used to facilitate the exchange of binary data, especially encrypted data. TEXTO
works just like a simple substitution cipher, with each of the 64 ASCII symbols
used by PGP ASCII armour or uuencode replaced by an English
word. Not all of the words in the resulting text are significant; only the nouns,
verbs, adjectives, and adverbs are used to fill in the preset sentence structures.
Punctuation and “connecting” words (or any other words not in the dictionary)
are ignored.
Markov-Chain-Based is another linguistic steganography approach proposed
by [5]. The approach regards text generation as signal transmission from a
Markov signal source. It builds a state transfer chart of the Markov signal source
from a sample text. A part of a state transfer chart, with branches tagged by equal
probabilities that are represented by one or more bits, is illustrated in Fig. 1.
Then the approach uses the chart to generate cover text according to the secret
messages.
The approaches described above generate innocuous-looking stego-text to deceive
attackers. However, they have some drawbacks. For example, the first
approach sometimes substitutes synonyms that do not agree with correct
English usage or with the genre and author style of the given text. And the latter
three approaches are detectable by a human warden, as the stego-text they
generate does not have a natural, coherent and complete sense. They can be used
in communication channels where only computers act as attackers.
A few detection methods have been proposed making use of the drawbacks
discussed. The paper [6] brought forward an attack against systems based on
synonym substitution, especially the system presented by Winstein. The 3-gram
language model was used in the attack. The experimental accuracy of this method
on classification of steganographically modified sentences was 84.9% and that
of unmodified sentences was 38.6%. Another detection method, inspired by
the design ideas of the conception chart, was proposed by the paper [7], using a
measurement of correlation between sentences. The accuracy of the simulated
detection using this method was 76%. The two methods fall short of the accuracy
that practical detection applications require. In addition, the first method
requires a great deal of computation to calculate the large number of parameters
of the 3-gram language model, and the second one requires a database of rules
that takes a lot of work to build.

Fig. 1. A part of a tagged state transfer chart
Our research examines the drawbacks of the last three steganography approaches,
aiming to accurately detect the application of these approaches in given text
segments and to bring forward a blind detection method for linguistic
steganography that generates cover texts. We have developed a novel, efficient
and accurate detection algorithm that uses the statistical characteristics of the
correlations between the general service words gathered in a dictionary to
distinguish between stego-text segments and normal text segments.
2 Important Notions
2.1 N-Window Mutual Information (N-WMI)
In the area of statistical Natural Language Processing (NLP), an information-
theoretic measurement for discovering interesting collocations is Mutual
Information (MI) [8]. MI is originally defined between two particular events x and
y. In the case of NLP, the MI of two particular words x and y is defined as follows:

    MI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
             = log2 [ P(x|y) / P(x) ]
             = log2 [ P(y|x) / P(y) ]                                    (1)

Here, P(x, y), P(x) and P(y) are the occurrence probabilities of "xy", "x" and
"y" in the given text. In our case, we regard these probabilities as the occurrence
probabilities of the word pairs "xy", "x?" and "?y" in the given text, respectively,
where "?" represents any word.
In natural language, collocation is usually defined as an expression consisting
of two or more sequential words. In our case, we will investigate pairs of words
within a certain distance. With the distance constraint, we introduce some
definitions as follows.

Fig. 2. An illustration of 3-WWP
N-Window Word Pair (N-WWP): Any pair of words in the same sentence
with a distance less than N (N is an integer greater than 1). Here, the distance of
a pair of words equals the number of words between them plus 1.
Note that an N-WWP is order-related. In Fig. 2, the numbered boxes represent
the words in a sentence and the variable d represents the distance of the word pair.
The 3-WWPs in the sentence are illustrated in the figure by arrowed, folded
lines. Hereafter, we will denote the N-WWP "xy" as ⟨x, y⟩.
N-Window Collocation (N-WC): An N-WWP with frequent occurrence. In
some sense, our detection results are partially determined by the distribution of
N-WCs in a given text segment, as we will see later.
N-Window Mutual Information (N-WMI): We use the MI of an N-WWP
to measure its occurrence. This MI is called the N-Window Mutual Information
(N-WMI) of the words in the word pair. Thus, an N-WWP is an N-WC if its
N-WMI is greater than a particular value.
With the definition of N-WMI, we can use equation (1) to evaluate the N-WMI
of words x and y in a particular text segment. Given a certain text segment, let
the counts of occurrences of any N-WWP and of the N-WWPs ⟨x, y⟩, ⟨x, ?⟩ and
⟨?, y⟩ be C, Cxy, Cx and Cy, respectively, and denote the N-WMI value by MIN.
Then the evaluation is as follows:

    MIN(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
              = log2 [ (Cxy / C) / ((Cx / C)(Cy / C)) ]
              = log2 [ (C · Cxy) / (Cx · Cy) ]                           (2)
Because of the significance of N-WMI in our detection algorithm, we will give a
further explanation with an example. Given the sentence "We were fourteen in
all, and all young men.", let us evaluate the 4-WMI of ⟨in, all⟩. All 4-WWPs in the
sentence are as follows: ⟨we, were⟩, ⟨we, fourteen⟩, ⟨we, in⟩, ⟨were, fourteen⟩,
⟨were, in⟩, ⟨fourteen, in⟩, ⟨were, all⟩, ⟨fourteen, all⟩, ⟨in, all⟩, ⟨fourteen, and⟩,
⟨in, and⟩, ⟨all, and⟩, ⟨in, all⟩, ⟨all, all⟩, ⟨and, all⟩, ⟨all, young⟩, ⟨and, young⟩,
⟨all, young⟩, ⟨and, men⟩, ⟨all, men⟩, ⟨young, men⟩. We get C = 21, Cin,all = 2,
Cin = 3, Call = 6. Then we have:

    MI4(in, all) = log2 [ (C · Cin,all) / (Cin · Call) ]
                 = log2 [ (21 × 2) / (3 × 6) ]
                 = 1.2224
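This counting scheme is easy to sketch in code. The following Python snippet is our own illustration, not the paper's implementation; the function name and the pre-tokenized, punctuation-stripped sentence are assumptions. It reproduces the worked example above:

```python
import math

def n_wmi(words, x, y, n=4):
    """N-WMI of the ordered pair <x, y> within a single tokenized sentence.

    An N-WWP is an ordered pair of words whose distance is less than N,
    i.e. whose index difference is at most N - 1.
    """
    pairs = [(words[i], words[j])
             for i in range(len(words))
             for j in range(i + 1, min(i + n, len(words)))]
    c = len(pairs)                               # C: total N-WWP count
    c_xy = sum(1 for p in pairs if p == (x, y))  # count of <x, y>
    c_x = sum(1 for a, _ in pairs if a == x)     # count of <x, ?>
    c_y = sum(1 for _, b in pairs if b == y)     # count of <?, y>
    return math.log2(c * c_xy / (c_x * c_y))     # equation (2)

sentence = "we were fourteen in all and all young men".split()
print(round(n_wmi(sentence, "in", "all"), 4))  # 1.2224
```

The enumeration yields the same counts as the example: C = 21, Cin,all = 2, Cin = 3, Call = 6.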
2.2 N-Window Variance of Mutual Information (N-WVMI)
Suppose D is the general service word dictionary and M is the count of words
in D; then we can get M × M different pairs of words from the dictionary D.
In any given text segment, we can calculate the N-WMI value of each different
word pair in dictionary D and get an N-WMI matrix. However, it is probable
that some items of the matrix have no values because their corresponding pairs
of words are absent from the given text segment; that is to say, these items are
not present. We denote the N-WMI matrix of the training corpus as TM×M,
and that of a sample text segment as SM×M. Therefore, when all items of both
TM×M and SM×M are present, we define the N-Window Variance of Mutual
Information (N-WVMI) as:

    V = (1 / (M × M)) Σ_{i=1..M} Σ_{j=1..M} (Sij − Tij)²                 (3)

When either an item of SM×M or its corresponding item of TM×M does not
exist, we say that the pair of items is not present; otherwise it is present. For
example, if either Sij or Tij does not exist, we say the pair of items in position
(i, j) is not present. Suppose I pairs of items are present; then we evaluate the
N-WVMI as:

    V = (1 / I) Σ_{i=1..M} Σ_{j=1..M} (Sij − Tij)² δ(i, j)               (4)

Here, δ(i, j) = 1 if both Sij and Tij are present, and δ(i, j) = 0 otherwise. When
I = M × M, equation (4) turns into equation (3).
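A minimal sketch of the N-WVMI computation, assuming the sparse matrices are stored as Python dicts keyed by word-pair position (a representation of our choosing, not specified in the paper):

```python
def n_wvmi(S, T):
    """N-WVMI over the item pairs present in both matrices (equation (4)).

    S and T are sparse N-WMI matrices stored as dicts mapping a position
    (i, j) to an N-WMI value; a missing key is an absent item. When all
    M * M positions are present in both, this is exactly equation (3).
    """
    present = [k for k in S if k in T]  # positions where delta(i, j) = 1
    if not present:                     # no comparable items at all
        return 0.0
    return sum((S[k] - T[k]) ** 2 for k in present) / len(present)

S = {(0, 0): 1.5, (0, 1): -0.5, (1, 1): 2.0}
T = {(0, 0): 1.0, (1, 1): 2.0, (1, 0): 0.3}
print(n_wvmi(S, T))  # 0.125 = ((1.5 - 1.0)**2 + (2.0 - 2.0)**2) / 2
```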
2.3 Partial Average Distance (PAD)
The N-WVMI is defined to distinguish in principle between the statistical
characteristics of normal text segments and stego-text segments. But a more
precise statistical variable is necessary for accurate detection. Therefore, the
Partial Average Distance (PAD) of the two N-WMI matrices SM×M and TM×M
is defined as follows:

    Dα,K = (1 / K) Σ_{i=1..M} Σ_{j=1..M} |Sij − Tij| [|Sij − Tij| > α] λK(i, j)    (5)

In this equation, α represents a threshold on the distance between two N-WMI
values, and K means that only the first K greatest items of SM×M are considered.
As we can see, equation (5) averages the differences between those items of SM×M
and TM×M with great N-WMI values and great distances, as these items well
represent the statistical characteristics of the two types of text segments. The
expressions [|Sij − Tij| > α] and λK(i, j) are evaluated as:

    [|Sij − Tij| > α] = 1 if |Sij − Tij| > α, and 0 otherwise

    λK(i, j) = 1 if Sij is among the first K greatest items, and 0 otherwise
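Under the same sparse-dict representation, the PAD can be sketched as follows. The paper does not specify how an absent Tij interacts with λK, so we assume such positions simply contribute nothing to the sum:

```python
def pad(S, T, alpha=2.0, K=100):
    """Partial Average Distance D_{alpha,K} of equation (5).

    Only the K greatest items of S are considered (lambda_K), and of those
    only the ones whose distance to the corresponding T item exceeds alpha
    (the Iverson bracket) contribute; the sum is averaged over K.
    """
    top_k = sorted(S, key=S.get, reverse=True)[:K]  # lambda_K(i, j) = 1
    total = 0.0
    for k in top_k:
        if k in T and abs(S[k] - T[k]) > alpha:     # [|Sij - Tij| > alpha]
            total += abs(S[k] - T[k])
    return total / K

S = {(0, 0): 9.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 0.2}
T = {(0, 0): 2.0, (0, 1): 4.5, (1, 0): 1.1, (1, 1): 0.1}
print(pad(S, T, alpha=2.0, K=2))  # 3.5 = |9.0 - 2.0| / 2
```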
3 Method
In natural language, normal text has many inherent statistical characteristics
that can’t be provided by text generated by linguistic steganography approaches
we investigate. Here is something we have observed: there is a strong correla-
tion between words in the same sentence in normal text, but the correlation is
weakened a lot in the generated text. The reason is that a normal sentence has a
natural, coherent and complete sense, while a generated sentence does not. For
example, a sentence beginning with "She is a ..." more likely reads "She
is a woman teacher", or "She is a beautiful actress", or "She is a mother", and
so on in normal text. But "She is a man", or "She is a good actor", or
"She is a father" would plausibly appear only in generated text. This shows that
the word “she” has a strong correlation with “woman”, “actress” and “mother”,
but it has a weak correlation with “man”, “actor” and “father”. Therefore, we
probably evaluate the N-WMI of “she” and “woman” or “actress” or “mother”
with a greater value than that of “she” and “man” or “actor” or “father”. In
our research, we use N-WMI to measure the strength of correlation between two
words.
In our experiment, we build a corpus from the novels written by Charles
Dickens, a great English novelist in the Victorian period. We name this cor-
pus Charles-Dickens-Corpus. We build another corpus from novels written by
a few novelists whose last names begin with the letter "s", and name it S-Corpus.
Finally, we build a third corpus from the cover text generated by the lin-
guistic steganography algorithms we investigated: NICETEXT, TEXTO and
Markov-Chain-Based, calling it Bad-Corpus. We then build the training cor-
pus from Charles-Dickens-Corpus, the good testing sample set from S-Corpus
and the bad testing sample set from Bad-Corpus. The training corpus con-
sists of about 400 text documents amounting to a size of more than 10MB.
There are 184 text documents in the good testing sample set and 422 text
documents in the bad testing sample set. The commonly used word dictionary D
contains the 2000 words (that is, M = 2000) most widely used in English. We let
N = 4; that is, we use 4-WMI. Thereafter, the following procedure has been
employed.
First, we process the training corpus as one large text segment to get the training
N-WMI matrix TM×M using dictionary D. Our program reads every text
document, splits it into sentences, counts the numbers of occurrences of all the
N-WWPs in D that are contained in the training corpus, and gets C. Furthermore,
for every N-WWP, we get its Cxy, Cx and Cy along the way. Then
we can evaluate the N-WMI of every N-WWP with equation (2) and obtain
the training N-WMI matrix TM×M. In this step, we can store TM×M to disk
for later use, so if the related configuration parameters are not altered, we can
simply read TM×M from the disk in this step for subsequent sample text segment
detections.
Second, we process a sample text segment to get the sample N-WMI matrix
SM×M. The procedure is similar to the first step, but in this step our program
reads only a text segment of a certain size, such as 10 kB, from every sample text
document. In this way, we can control the size of the sample text segment.
Third, we evaluate the N-WVMI value V of the sample text segment using
SM×M and TM×M with equation (3) or (4). Some attention has to be paid in this
step. If some pairs of items are absent from the matrices SM×M and TM×M, we
use equation (4); that is to say, we calculate the variance over only the I pairs of
items that are present in both matrices SM×M and TM×M. Otherwise,
equation (4) turns into equation (3). In this step, another variable, the PAD value
Dα,K, is calculated by equation (5). This variable is a useful auxiliary
classification feature in addition to the N-WVMI value V. In the experiment, we
let α = 2 and K = 100, so we calculate D2,100.
Finally, we use an SVM classifier to classify the sample text segment as a stego-text
segment or a normal text segment according to the values V and D2,100.
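The four steps above can be sketched end to end in Python. This is our own illustrative sketch, not the authors' code: the helper names, the tiny dictionary, the tokenization, and the simple threshold rule standing in for the trained LIBSVM classifier are all assumptions:

```python
import math
import re
from collections import Counter

def wmi_matrix(text, dictionary, n=4):
    """Build a sparse N-WMI matrix over dictionary word pairs (equation (2))."""
    vocab = set(dictionary)
    c = 0
    c_xy, c_x, c_y = Counter(), Counter(), Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words)):
            for j in range(i + 1, min(i + n, len(words))):
                c += 1                       # every N-WWP contributes to C
                x, y = words[i], words[j]
                if x in vocab:
                    c_x[x] += 1              # <x, ?>
                if y in vocab:
                    c_y[y] += 1              # <?, y>
                if x in vocab and y in vocab:
                    c_xy[(x, y)] += 1        # <x, y>
    return {p: math.log2(c * k / (c_x[p[0]] * c_y[p[1]]))
            for p, k in c_xy.items()}

def detect(sample_text, T, dictionary, v_threshold=1.0):
    """Classify one sample segment; a threshold on V stands in for the SVM."""
    S = wmi_matrix(sample_text, dictionary)
    present = [k for k in S if k in T]       # pairs usable by equation (4)
    v = (sum((S[k] - T[k]) ** 2 for k in present) / len(present)
         if present else 0.0)
    return "stego" if v > v_threshold else "normal"

dictionary = ["she", "is", "a", "woman", "teacher"]
T = wmi_matrix("She is a woman teacher. She is a woman teacher.", dictionary)
print(detect("She is a woman teacher.", T, dictionary))  # normal
```

A segment whose pair statistics match the training corpus yields V near zero and is labeled normal; the real system replaces the threshold rule with an SVM trained on both V and D2,100.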
Fig. 3 shows the flow of the detection procedure. The solid arrows
represent the flow of data, while the dashed arrow indicates that
nothing is transferred or processed if TM×M has already been stored. The
thick dashed rectangle indicates the whole detection system. Obviously, there are
two key flows in the system: training and testing. The training process is not
always required before the testing process. Once the training process is completed,
it need not be repeated in subsequent detections unless some configuration
parameters are changed. The testing process contains two steps, which evaluate
the sample N-WMI matrix and the classification features, the N-WVMI and PAD
values, respectively, ending with an output from an SVM classifier [9] that
indicates whether the testing sample text segment is stego-text.
Fig. 3. Flow of the detection procedure
Table 1. Sample text set and detection results

Type       Generator            Samples  Success  Failure  Accuracy
Good Set   -                    184      178      6        -
Bad Set    Markov-Chain-Based   100      89       11       94.01%
           NICETEXT             212      212      0        98.48%
           TEXTO                110      110      0        97.96%
Total                           606      589      17       97.19%
4 Results and Discussion
In our experiment, a total of 606 testing sample text segments, each with a size of
20 kB, are detected by using their 4-WVMIs. The composition of the sample text
set is presented in Table 1. Using an SVM classifier, we get a total accuracy of
97.19%. Note that the accuracy of each linguistic steganography method (denoted
by LSMethod) is computed as follows:

    Accuracy(LSMethod) = (SUC(GoodSet) + SUC(LSMethod)) / (SAM(GoodSet) + SAM(LSMethod))

where SUC represents the number of successfully detected text segments and SAM
represents the number of sample text segments.
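This formula can be checked against the counts in Table 1 (SUC(GoodSet) = 178, SAM(GoodSet) = 184); the small script below is our own verification, not part of the paper:

```python
def accuracy(suc_good, sam_good, suc_method, sam_method):
    """Per-method accuracy: the good set is pooled with each method's bad set."""
    return 100 * (suc_good + suc_method) / (sam_good + sam_method)

table = {  # (success, samples) per generator, taken from Table 1
    "Markov-Chain-Based": (89, 100),
    "NICETEXT": (212, 212),
    "TEXTO": (110, 110),
}
for name, (suc, sam) in table.items():
    print(name, round(accuracy(178, 184, suc, sam), 2))
# Markov-Chain-Based 94.01
# NICETEXT 98.48
# TEXTO 97.96
total = round(100 * (178 + 89 + 212 + 110) / (184 + 100 + 212 + 110), 2)
print(total)  # 97.19
```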
In Table 1, we can see that the accuracy of detecting stego-text segments
generated by Markov-Chain-Based is noticeably lower than for the
other two algorithms. The probable reason is that the Markov-Chain-Based
method sometimes embeds secret messages by adding white space between
consecutive words in the sample text, copying these words as generated words
rather than generating new words, when there is only one branch in the state
transfer chart. For example, a text segment generated by Markov-Chain-Based
reads as follows:
“...I’ll wait a year, according to the forest to tell each other than a brown
thrush sang against a tree, held his mouth shut and shook it out, the elder
Ammon suggested sending for Polly. ...”
We can see that the algorithm adds white space between the words "according"
and "to", between the words "sending" and "for", and so on in this text segment;
these words are copied directly from the sample text into the generated text.
This preserves more properties of normal text.
Fig. 4 shows the testing results of all testing samples. Fig. 5 - Fig. 7 show the
testing results of testing samples generated by Markov-Chain-Based, NICETEXT
and TEXTO respectively. As discussed above, the accuracy of detecting the
Markov-Chain-Based method appears slightly lower. The results of detecting the
other two algorithms appear ideal with a text segment size of 20 kB. But when the
segment sizes are smaller than 5 kB, such as 2 kB, the accuracies decrease
noticeably. This is determined by the characteristics of the statistical algorithm, so
sample text segments with a size greater than 5 kB are recommended.
Fig. 4. Testing results of all testing samples
Fig. 5. Testing results of testing samples generated by Markov-Chain-Based
In addition, our algorithm is time-efficient, although we have not measured
it rigorously: it takes about one minute to complete our detection of more than 600
sample text segments.
Fig. 6. Testing results of testing samples generated by NICETEXT
Fig. 7. Testing results of testing samples generated by TEXTO
Overall, the results of our research appear quite promising. We have accurately
detected the three aforementioned linguistic steganography methods in a blind
way, and our method may be suitable for detecting other or new linguistic
steganography methods that generate natural-looking cover text. For other
linguistic steganography methods, such as synonym-substitution-based or
translation-based steganography, detection based on the characteristics of
correlations between words may still work; that is also our future work.
5 Conclusion
In this paper, a statistical linguistic steganography detection algorithm has been
presented. We use the statistical characteristics of the correlations between the
general service words that are gathered in a dictionary to classify the given text
segments into stego-text segments and normal text segments. The strength of the
correlation is measured by N-window mutual information (N-WMI). The total
accuracy is as high as 97.19%. The accuracies of blindly detecting the three
different linguistic steganography approaches, Markov-Chain-Based, NICETEXT
and TEXTO, are 94.01%, 98.48% and 97.96% respectively.
Our research mainly focuses on detecting linguistic steganography that embeds
secret messages by generating cover text. But it is easy to modify our general
service word dictionary to suit the detection of synonym-substitution-based
algorithms and other linguistic steganography methods that modify the content
of the cover text. Therefore, our algorithm is widely applicable in linguistic
steganalysis.
Many interesting and new challenges are involved in the analysis of linguistic
steganography algorithms, called linguistic steganalysis, which has little
or no counterpart in other media domains such as image or video. Linguistic
steganalysis performance strongly depends on many factors such as the length of
the hidden message and the way to generate a cover text. However, our research
shows that the linguistic steganalysis based on correlations between words is
promising.
Acknowledgement
This work was supported by the NSF of China (Grant Nos. 60773032 and
60703071), the Ph.D. Program Foundation of the Ministry of Education
of China (No. 20060358014), the Natural Science Foundation of Jiangsu
Province of China (No. BK2007060), and the Anhui Provincial Natural Science
Foundation (No. 070412043).
References
1. Winstein, K.: Lexical steganography through adaptive modulation of the word
choice hash, http://alumni.imsa.edu/∼keithw/tlex/lsteg.ps
2. Chapman, M.: Hiding the Hidden: A Software System for Concealing Ciphertext as
Innocuous Text (1997), http://www.NICETEXT.com/NICETEXT/doc/thesis.pdf
3. Chapman, M., Davida, G., Rennhard, M.: A Practical and Effective Approach
to Large-Scale Automated Linguistic Steganography. In: Davida, G.I., Frankel, Y.
(eds.) ISC 2001. LNCS, vol. 2200, pp. 156–167. Springer, Heidelberg (2001)
4. Maher, K.: TEXTO,
ftp://ftp.funet.fi/pub/crypt/steganography/texto.tar.gz
5. Shu-feng, W., Liu-sheng, H.: Research on Information Hiding. Master's thesis,
University of Science and Technology of China (2003)
6. Taskiran, C., Topkara, U., Topkara, M., et al.: Attacks on lexical natural language
steganography systems. In: Proceedings of SPIE (2006)
7. Ji-jun, Z., Zhu, Y., Xin-xin, N., et al.: Research on the detecting algorithm of text
document information hiding. Journal on Communications 25(12), 97–101 (2004)
8. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language
Processing. Publishing House of Electronics Industry, Beijing (January 2005)
9. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001),
http://www.csie.ntu.edu.tw/∼cjlin/libsvm