Linguistic Steganography Detection Using
Statistical Characteristics of Correlations
between Words
Zhili Chen*, Liusheng Huang, Zhenshan Yu, Wei Yang,
Lingjun Li, Xueling Zheng, and Xinxin Zhao
National High Performance Computing Center at Hefei,
Department of Computer Science and Technology,
University of Science and Technology of China,
Hefei, Anhui 230027, China
zlchen3@mail.ustc.edu.cn
Abstract. Linguistic steganography is a branch of Information Hiding
(IH) using written natural language to conceal secret messages. It plays
an important role in Information Security (IS) area. Previous work on
linguistic steganography was mainly focused on steganography and there
were few researches on attacks against it. In this paper, a novel statis-
tical algorithm for linguistic steganography detection is presented. We
use the statistical characteristics of correlations between the general ser-
vice words gathered in a dictionary to classify the given text segments
into stego-text segments and normal text segments. In the experiment of
blindly detecting the three different linguistic steganography approaches:
Markov-Chain-Based, NICETEXT and TEXTO, the total accuracy of
discovering stego-text segments and normal text segments is found to
be 97.19%. Our results show that the linguistic steganalysis based on
correlations between words is promising.
1 Introduction
As text-based Internet information and dissemination media, such as e-mail, blogs and text messaging, rise rapidly in people's lives today, the importance and size of text data are increasing at an accelerating pace. This growth in the significance of digital text in turn raises concerns about the use of text media as a covert channel of communication. One such important covert means of communication is known as linguistic steganography. Linguistic steganography makes use of written natural language to conceal secret messages. The whole idea is to hide the very presence of the real messages. Linguistic steganography algorithms embed messages into a cover text in a covert manner such that the presence of the embedded messages in the resulting stego-text cannot easily be discovered by anyone except the intended recipient.
Previous work on linguistic steganography was mainly focused on how to hide messages. One method of modifying text to embed a message is to substitute selected words with their synonyms so that the meaning of the modified sentences is preserved as much as possible. A steganography approach based on synonym substitution is the system proposed by Winstein [1]. There are some other approaches; among them, NICETEXT and TEXTO are the most famous.

K. Solanki, K. Sullivan, and U. Madhow (Eds.): IH 2008, LNCS 5284, pp. 224–235, 2008.
© Springer-Verlag Berlin Heidelberg 2008
The NICETEXT system generates natural-looking cover text using a mixture of word substitution and Probabilistic Context-Free Grammars (PCFG) ([2], [3]). The system contains a dictionary table and a style template. The style template can be generated using a PCFG or a sample text. The dictionary is used to randomly generate sequences of words, while the style template selects natural sequences of parts of speech and controls word generation, capitalization, punctuation, and white space. The NICETEXT system is intended to protect the privacy of cryptograms from detection by censors.
TEXTO is a linguistic steganography program designed for transforming
uuencoded or PGP ascii-armoured ASCII data into English sentences [4]. It is
used to facilitate the exchange of binary data, especially encrypted data. TEXTO
works much like a simple substitution cipher, with each of the 64 ASCII symbols used by PGP ASCII armour or uuencode replaced by an English word. Not all of the words in the resulting text are significant: only the nouns, verbs, adjectives, and adverbs are used to fill in the preset sentence structures. Punctuation and "connecting" words (or any other words not in the dictionary) are ignored.
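As a rough illustration of this substitution step, consider the following toy sketch (the word list and mapping are hypothetical, not the actual TEXTO dictionary or templates): each of the 64 armour symbols is replaced by a fixed English word.

```python
import string

# The 64 symbols used by PGP ASCII armour / uuencode-style encodings.
symbols = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
words = [f"word{i}" for i in range(64)]      # hypothetical word dictionary
table = dict(zip(symbols, words))

def encode(armoured: str) -> str:
    """Replace each armour symbol by its dictionary word."""
    return " ".join(table[c] for c in armoured)

print(encode("Ab+"))   # word0 word27 word62
```

A real system would then slot these words into preset sentence structures, padding with connecting words that the decoder ignores.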
Markov-Chain-Based is another linguistic steganography approach proposed
by [5]. The approach regards text generation as signal transmission from a
Markov signal source. It builds a state transfer chart of the Markov signal source
from a sample text. A part of such a chart, with equal-probability branches tagged by one or more bits, is illustrated in Fig. 1. The approach then uses the chart to generate cover text according to the secret messages.
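The embedding step can be sketched as follows, assuming a toy state transfer chart in which each state has two equal-probability branches, so each transition encodes one secret bit (the chart and words here are hypothetical):

```python
# Toy state transfer chart: at each state, the secret bit selects the branch.
chart = {
    "she": {"0": "is", "1": "was"},
    "is":  {"0": "a",  "1": "the"},
    "was": {"0": "a",  "1": "the"},
}

def embed(bits, start="she"):
    """Walk the chart, consuming one secret bit per transition."""
    out, state = [start], start
    for b in bits:
        state = chart[state][b]
        out.append(state)
        if state not in chart:   # reached a state with no outgoing branches
            break
    return " ".join(out)

print(embed("10"))   # she was a
```

The recipient, holding the same chart, recovers the bits by observing which branch each generated word corresponds to.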
The approaches described above generate innocuous-looking stego-text to deceive attackers. However, they have some drawbacks. For example, the first approach sometimes substitutes synonyms that do not agree with correct English usage or with the genre and author style of the given text. The latter three approaches are detectable by a human warden, because the stego-text they generate does not have a natural, coherent and complete sense. They can be used in communication channels where only computers act as attackers.
A few detection methods exploiting the drawbacks discussed above have been proposed. The paper [6] brought forward an attack against systems based on synonym substitution, especially the system presented by Winstein. A 3-gram language model was used in the attack. The experimental accuracy of this method on classifying steganographically modified sentences was 84.9%, and that on unmodified sentences was 38.6%. Another detection method, inspired by the design ideas of conception charts, was proposed in the paper [7] using the
measurement of correlation between sentences. The accuracy of the simulated detection using this method was 76%. Both methods fall short of the accuracy that practical detection applications require. In addition, the first method requires a great deal of computation to estimate the large number of parameters of the 3-gram language model, and the second requires a laboriously built database of rules.

Fig. 1. A part of a tagged state transfer chart
Our research examines the drawbacks of the last three steganography approaches, aiming to accurately detect their application in given text segments and to bring forward a blind detection method for linguistic steganography that generates cover text. We have developed a novel, efficient and accurate detection algorithm that uses the statistical characteristics of the correlations between the general service words gathered in a dictionary to distinguish between stego-text segments and normal text segments.
2 Important Notions
2.1 N-Window Mutual Information (N-WMI)
In the area of statistical Natural Language Processing (NLP), an information-theoretic measurement for discovering interesting collocations is Mutual Information (MI) [8]. MI is originally defined between two particular events x and y. In the case of NLP, the MI of two particular words x and y is defined as follows:
MI(x, y) = log_2 [P(x, y) / (P(x) P(y))] = log_2 [P(x|y) / P(x)] = log_2 [P(y|x) / P(y)]    (1)
Here, P(x, y), P(x) and P(y) are the occurrence probabilities of "xy", "x" and "y" in the given text. In our case, we regard these probabilities as the occurrence probabilities of the word pairs "xy", "x?" and "?y" in the given text, respectively, where "?" represents any word.
In natural language, collocation is usually defined as an expression consisting
of two or more sequential words. In our case, we will investigate pairs of words
Fig. 2. An illustration of 3-WWP
within a certain distance. With the distance constraint, we introduce some def-
initions as follows.
N-Window Word Pair (N-WWP): Any pair of words in the same sentence with a distance less than N (N is an integer greater than 1). Here, the distance of a pair of words equals the number of words between the two words plus 1. Note that an N-WWP is order-related. In Fig. 2, the numbered boxes represent the words in a sentence and the variable d represents the distance of the word pair. The 3-WWPs in the sentence are illustrated in the figure by arrowed, folded lines. Hereafter, we will denote the N-WWP "xy" as ⟨x, y⟩.
N-Window Collocation (N-WC): An N-WWP with frequent occurrence. In some sense, our detection results are partially determined by the distribution of N-WCs in a given text segment, as we will see later.
N-Window Mutual Information (N-WMI): We use the MI of an N-WWP
to measure its occurrence. This MI is called N-Window Mutual Information (N-
WMI) of the words in the word pair. Therefore, an N-WWP is an N-WC if its
N-WMI is greater than a particular value.
With the definition of N-WMI, we can use equation (1) to evaluate the N-WMI of words x and y in a particular text segment. Given a certain text segment, let the counts of occurrences of all N-WWPs, and of the N-WWPs ⟨x, y⟩, ⟨x, ?⟩ and ⟨?, y⟩, be C, C_xy, C_x and C_y, respectively; denoting the N-WMI value by MI_N, the evaluation is as follows:
MI_N(x, y) = log_2 [P(x, y) / (P(x) P(y))] = log_2 [(C_xy / C) / ((C_x / C)(C_y / C))] = log_2 [C · C_xy / (C_x · C_y)]    (2)
Because of the significance of N-WMI in our detection algorithm, we give a further explanation with an example. Given the sentence "We were fourteen in all, and all young men.", let us evaluate the 4-WMI of ⟨in, all⟩. All 4-WWPs in the sentence are as follows: ⟨we, were⟩, ⟨we, fourteen⟩, ⟨we, in⟩, ⟨were, fourteen⟩, ⟨were, in⟩, ⟨fourteen, in⟩, ⟨were, all⟩, ⟨fourteen, all⟩, ⟨in, all⟩, ⟨fourteen, and⟩, ⟨in, and⟩, ⟨all, and⟩, ⟨in, all⟩, ⟨all, all⟩, ⟨and, all⟩, ⟨all, young⟩, ⟨and, young⟩, ⟨all, young⟩, ⟨and, men⟩, ⟨all, men⟩, ⟨young, men⟩. We get C = 21, C_{in,all} = 2, C_in = 3, C_all = 6. Then we have:

MI_4(in, all) = log_2 [C · C_{in,all} / (C_in · C_all)] = log_2 [(21 × 2) / (3 × 6)] = 1.2224
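The worked example above can be reproduced with a short script (window_pairs and n_wmi are illustrative names; a minimal sketch of the counting described by equation (2)):

```python
import math
import re

def window_pairs(words, n):
    # All ordered pairs in one sentence whose distance
    # (words between them, plus 1) is less than n.
    return [(words[i], words[j])
            for i in range(len(words))
            for j in range(i + 1, min(i + n, len(words)))]

def n_wmi(words, x, y, n):
    # Equation (2): MI_N(x, y) = log2(C * C_xy / (C_x * C_y)).
    pairs = window_pairs(words, n)
    c = len(pairs)                               # C: all N-WWPs
    c_xy = sum(1 for p in pairs if p == (x, y))  # <x, y>
    c_x = sum(1 for a, _ in pairs if a == x)     # <x, ?>
    c_y = sum(1 for _, b in pairs if b == y)     # <?, y>
    return math.log2(c * c_xy / (c_x * c_y))

words = re.findall(r"[a-z']+", "We were fourteen in all, and all young men.".lower())
print(round(n_wmi(words, "in", "all", 4), 4))   # 1.2224
```

The script counts C = 21 window pairs for the nine-word sentence, matching the enumeration above.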
2.2 N-Window Variance of Mutual Information (N-WVMI)
Suppose D is the general service word dictionary and M is the count of words in D; then we can get M × M different pairs of words from the dictionary D. In any given text segment, we can calculate the N-WMI value of each different word pair in dictionary D and obtain an N-WMI matrix. However, it is probable that some items of the matrix have no values, because their corresponding pairs of words are absent from the given text segment; that is to say, these items are not present. We denote the N-WMI matrix of the training corpus as T_{M×M}, and that of a sample text segment as S_{M×M}. Therefore, when all items of both T_{M×M} and S_{M×M} are present, we define the N-Window Variance of Mutual Information (N-WVMI) as:
V = (1 / (M × M)) Σ_i Σ_j (S_ij − T_ij)²    (3)
When either an item of S_{M×M} or its corresponding item of T_{M×M} does not exist, we say that the pair of items is not present; otherwise it is present. For example, if either S_ij or T_ij does not exist, we say the pair of items in position (i, j) is not present. Supposing that I pairs of items are present, we evaluate the N-WVMI as:
V = (1 / I) Σ_i Σ_j (S_ij − T_ij)² δ(i, j)    (4)
Here, δ(i, j) = 1 if both S_ij and T_ij are present, and δ(i, j) = 0 otherwise. When I = M × M, equation (4) turns into equation (3).
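A minimal sketch of equations (3)/(4), assuming the two N-WMI matrices are represented as arrays with NaN marking absent items (the small matrices here are hypothetical examples, not real N-WMI values):

```python
import numpy as np

def n_wvmi(S, T):
    # Equations (3)/(4): mean squared difference over the pairs of
    # items present in both matrices; NaN marks an absent item.
    present = ~np.isnan(S) & ~np.isnan(T)      # delta(i, j)
    return np.sum((S[present] - T[present]) ** 2) / present.sum()

S = np.array([[1.0, np.nan], [0.5, 2.0]])
T = np.array([[0.0, 1.0], [np.nan, 1.0]])
print(n_wvmi(S, T))   # ((1-0)^2 + (2-1)^2) / 2 = 1.0
```

When no item is NaN, present.sum() equals M × M and the function reduces to equation (3).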
2.3 Partial Average Distance (PAD)
The N-WVMI is defined to distinguish in principle between the statistical characteristics of normal text segments and stego-text segments. But a more precise statistical variable is necessary for accurate detection. Therefore, the Partial Average Distance (PAD) of the two N-WMI matrices S_{M×M} and T_{M×M} is defined as follows:
D_{α,K} = (1 / K) Σ_i Σ_j |S_ij − T_ij| · [|S_ij − T_ij|] · λ_K(i, j)    (5)
In this equation, α represents a threshold on the distance between two N-WMI values, and K indicates that only the K greatest items of S_{M×M} are considered. As we can see, equation (5) averages the differences of those items of S_{M×M} and T_{M×M} with great N-WMI values and great distances, as these items well represent the statistical characteristics of the two types of text segments. The expressions [|S_ij − T_ij|] and λ_K(i, j) are evaluated as:

[|S_ij − T_ij|] = 1 if |S_ij − T_ij| > α, and 0 otherwise
λ_K(i, j) = 1 if S_ij is among the K greatest items, and 0 otherwise
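Equation (5) can likewise be sketched directly (pad is an illustrative name; the example matrices are hypothetical and all items are assumed present):

```python
import numpy as np

def pad(S, T, alpha, k):
    # Equation (5): average |S_ij - T_ij| over the k greatest items of S
    # whose distance to the corresponding item of T exceeds alpha.
    diff = np.abs(S - T)
    exceeds = diff > alpha                                # [|S_ij - T_ij|]
    top_k = np.zeros(S.shape, dtype=bool)                 # lambda_K(i, j)
    top_k.flat[np.argsort(S, axis=None)[::-1][:k]] = True
    return diff[exceeds & top_k].sum() / k

S = np.array([[5.0, 1.0], [3.0, 4.0]])
T = np.array([[1.0, 1.0], [1.0, 4.0]])
print(pad(S, T, alpha=2, k=2))   # only S_00 qualifies: |5-1| / 2 = 2.0
```

Note that the divisor is K, not the number of qualifying items, matching the definition above.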
3 Method
In natural language, normal text has many inherent statistical characteristics that cannot be provided by the text generated by the linguistic steganography approaches we investigate. Here is something we have observed: there is a strong correlation between words in the same sentence in normal text, but this correlation is weakened a lot in generated text. The reason is that a normal sentence has a natural, coherent and complete sense, while a generated sentence does not. For example, a sentence beginning with "She is a ..." more likely reads "She is a woman teacher", or "She is a beautiful actress", or "She is a mother" in normal text, whereas "She is a man", or "She is a good actor", or "She is a father" will probably appear only in generated text. This shows that the word "she" has a strong correlation with "woman", "actress" and "mother", but a weak correlation with "man", "actor" and "father". Therefore, we will probably evaluate the N-WMI of "she" and "woman", "actress" or "mother" with a greater value than that of "she" and "man", "actor" or "father". In our research, we use N-WMI to measure the strength of the correlation between two words.
In our experiment, we build a corpus from the novels written by Charles Dickens, a great English novelist of the Victorian period. We name this corpus Charles-Dickens-Corpus. We build another corpus from novels written by a few novelists whose last names begin with the letter "S" and name it S-Corpus. Finally, we build a third corpus from the cover text generated by the linguistic steganography algorithms we investigated: NICETEXT, TEXTO and Markov-Chain-Based, calling it Bad-Corpus. We then build the training corpus from Charles-Dickens-Corpus, the good testing sample set from S-Corpus and the bad testing sample set from Bad-Corpus. The training corpus consists of about 400 text documents amounting to a size of more than 10 MB. There are 184 text documents in the good testing sample set and 422 text documents in the bad testing sample set. The commonly used word dictionary D collects the 2000 words most widely used in English (that is, M = 2000). We let N = 4, that is, we use 4-WMI. Thereafter, the following procedure is employed.
First, we process the training corpus as one large text segment to get the training N-WMI matrix T_{M×M} using dictionary D. Our program reads every text document, splits it into sentences, counts the occurrences of all the N-WWPs over D that are contained in the training corpus, and gets C. Furthermore, for every N-WWP we obtain its C_xy, C_x and C_y along the way. Then we can evaluate the N-WMI of every N-WWP with equation (2) and obtain the training N-WMI matrix T_{M×M}. In this step, we can store T_{M×M} to disk for later use, so if the related configuration parameters are not altered, we can simply read T_{M×M} from disk in this step for subsequent sample text segment detection.
Second, we process a sample text segment to get the sample N-WMI matrix S_{M×M}. The procedure is similar to the first step, but here our program reads only a text segment of a certain size, such as 10 kB, from every sample text document. In this way, we can control the size of the sample text segment.

Third, we evaluate the N-WVMI value V of the sample text segment using S_{M×M} and T_{M×M} with equation (3) or (4). Some attention has to be paid in this step: if some pairs of items are absent from the matrices S_{M×M} and T_{M×M}, we use equation (4); that is, we calculate the variance over only the I pairs of items that are present in both matrices. Otherwise, when all pairs of items are present, equation (4) turns into equation (3). In this step, another variable, the PAD value D_{α,K}, is calculated by equation (5). This variable is a useful auxiliary classification feature in addition to the N-WVMI value V. In the experiment, we let α = 2 and K = 100, so we calculate D_{2,100}.

Finally, we use an SVM classifier to classify the sample text segment as a stego-text segment or a normal text segment according to the values V and D_{2,100}.
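The final step can be sketched as follows, assuming the two features V and D_{2,100} have already been computed per segment; the feature values below are hypothetical, and scikit-learn's SVC (which wraps LIBSVM [9]) stands in for the classifier:

```python
from sklearn.svm import SVC

# Hypothetical [V, D_2,100] feature vectors for labelled training segments.
X_train = [[0.8, 1.2], [0.9, 1.5], [2.4, 3.8], [2.9, 4.1]]
y_train = [0, 0, 1, 1]              # 0 = normal text, 1 = stego-text

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict([[2.6, 3.9]]))    # classified as stego-text
```

Only two features per segment are needed, which keeps both training and classification cheap.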
Fig. 3 shows the flow of the detection procedure. The solid arrows represent the flow of data, while the dashed arrow indicates that nothing is transferred or processed if T_{M×M} has already been stored. The thick dashed rectangle encloses the whole detection system. Obviously, there are two key flows in the system: training and testing. The training process is not always required before the testing process: once the training process is completed, it need not be repeated in subsequent detections unless some configuration parameters are changed. The testing process contains two steps that evaluate the sample N-WMI matrix and the classification features, the N-WVMI and PAD values, respectively, ending with an output from an SVM classifier [9] that indicates whether the testing sample text segment is stego-text.
Fig. 3. Flow of the detection procedure
Table 1. Sample text set and detection results

Type      Generator            Sample  Success  Failure  Accuracy
Good Set                       184     178      6
Bad Set   Markov-Chain-Based   100     89       11       94.01%
          NICETEXT             212     212      0        98.48%
          TEXTO                110     110      0        97.96%
Total                          606     589      17       97.19%
4 Results and Discussion
In our experiment, a total of 606 testing sample text segments, each of size 20 kB, are detected using their 4-WVMIs. The composition of the sample text set is presented in Table 1. Using an SVM classifier, we obtain a total accuracy of 97.19%. Note that the accuracy for each linguistic steganography method (denoted by LSMethod) is computed as follows:

Accuracy(LSMethod) = [SUC(GoodSet) + SUC(LSMethod)] / [SAM(GoodSet) + SAM(LSMethod)]

where SUC represents the number of successfully classified text segments, and SAM represents the number of sample text segments.
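With the success and sample counts from Table 1, the per-method accuracies can be reproduced directly:

```python
# Success/sample counts taken from Table 1.
good = {"suc": 178, "sam": 184}
methods = {
    "Markov-Chain-Based": {"suc": 89,  "sam": 100},
    "NICETEXT":           {"suc": 212, "sam": 212},
    "TEXTO":              {"suc": 110, "sam": 110},
}

# Accuracy(LSMethod) = (SUC(GoodSet) + SUC(LSMethod)) / (SAM(GoodSet) + SAM(LSMethod))
for name, m in methods.items():
    acc = (good["suc"] + m["suc"]) / (good["sam"] + m["sam"])
    print(f"{name}: {acc:.2%}")
# Markov-Chain-Based: 94.01%
# NICETEXT: 98.48%
# TEXTO: 97.96%
```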
From Table 1 we can see that the accuracy of detecting stego-text segments generated by Markov-Chain-Based is clearly lower than for those generated by the other two algorithms. The probable reason is that the Markov-Chain-Based method sometimes embeds secret messages by adding white space between successive words in the sample text, copying those words as generated words rather than generating new words, when there is only one branch in the state transfer chart. For example, a text segment generated by Markov-Chain-Based reads as follows:

"...I'll wait a year, according to the forest to tell each other than a brown thrush sang against a tree, held his mouth shut and shook it out, the elder Ammon suggested sending for Polly. ..."
We can see that the algorithm adds white space between the words "according" and "to", between the words "sending" and "for", and so on in the text segment, and that these words are copied into the generated text directly from the sample text. This preserves more properties of normal text.
Fig. 4 shows the testing results for all testing samples. Fig. 5 - Fig. 7 show the testing results for the samples generated by Markov-Chain-Based, NICETEXT and TEXTO respectively. As discussed above, the accuracy of detecting the Markov-Chain-Based method is slightly lower. The results of detecting the other two algorithms appear ideal at a text segment size of 20 kB. But when the segment sizes are smaller than 5 kB, such as 2 kB, the accuracies decrease noticeably. This is determined by the characteristics of the statistical algorithm, so sample text segments with a size greater than 5 kB are recommended.
Fig. 4. Testing results of all testing samples
Fig. 5. Testing results of testing samples generated by Markov-Chain-Based
In addition, our algorithm is time-efficient, although we have not measured it rigorously: it takes about one minute to complete the detection of more than 600 sample text segments.
Fig. 6. Testing results of testing samples generated by NICETEXT
Fig. 7. Testing results of testing samples generated by TEXTO
Overall, the results of our research appear quite promising. We have accurately detected the three aforementioned linguistic steganography methods in a blind way, and our method may be suitable for detecting other or new linguistic steganography methods that generate natural-looking cover text. For other linguistic steganography methods, such as Synonym-Substitution-Based or translation-based steganography, detection based on the characteristics of correlations between words may still work; investigating this is our future work.
5 Conclusion
In this paper, a statistical linguistic steganography detection algorithm has been
presented. We use the statistical characteristics of the correlations between the
general service words that are gathered in a dictionary to classify the given text
segments into stego-text segments and normal text segments. The strength of the
correlation is measured by N-window mutual information (N-WMI). The total
accuracy is as high as 97.19%. The accuracies of blindly detecting these three dif-
ferent linguistic steganography approaches: Markov-Chain-Based, NICETEXT
and TEXTO are 94.01%, 98.48% and 97.96% respectively.
Our research mainly focuses on detecting linguistic steganography that embeds secret messages by generating cover text. But it is easy to modify our general service word dictionary to fit the detection of the Synonym-Substitution-Based algorithm and of other linguistic steganography methods that modify the content of the cover text. Therefore, our algorithm is widely applicable in linguistic steganalysis.
Many interesting new challenges are involved in the analysis of linguistic steganography algorithms, called linguistic steganalysis, which has little or no counterpart in other media domains such as image or video. Linguistic steganalysis performance depends strongly on many factors, such as the length of the hidden message and the way the cover text is generated. However, our research shows that linguistic steganalysis based on correlations between words is promising.
Acknowledgement
This work was supported by the NSF of China (Grant Nos. 60773032 and
60703071 respectively), the Ph.D. Program Foundation of Ministry of Educa-
tion of China (No. 20060358014), the Natural Science Foundation of Jiangsu
Province of China (No. BK2007060), and the Anhui Provincial Natural Science
Foundation (No. 070412043).
References
1. Winstein, K.: Lexical steganography through adaptive modulation of the word
choice hash, http://alumni.imsa.edu/keithw/tlex/lsteg.ps
2. Chapman, M.: Hiding the Hidden: A Software System for Concealing Ciphertext as
Innocuous Text (1997), http://www.NICETEXT.com/NICETEXT/doc/thesis.pdf
3. Chapman, M., Davida, G., Rennhard, M.: A Practical and Effective Approach
to Large-Scale Automated Linguistic Steganography. In: Davida, G.I., Frankel, Y.
(eds.) ISC 2001. LNCS, vol. 2200, pp. 156–167. Springer, Heidelberg (2001)
4. Maher, K.: TEXTO,
ftp://ftp.funet.fi/pub/crypt/steganography/texto.tar.gz
5. Shu-feng, W., Liu-sheng, H.: Research on Information Hiding. Master's thesis, University of Science and Technology of China (2003)
6. Taskiran, C., Topkara, U., Topkara, M., et al.: Attacks on lexical natural language
steganography systems. In: Proceedings of SPIE (2006)
7. Ji-jun, Z., Zhu, Y., Xin-xin, N., et al.: Research on the detecting algorithm of text
document information hiding. Journal on Communications 25(12), 97–101 (2004)
8. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. Beijing: Publishing House of Electronics Industry (January 2005)
9. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
... Even then, n-gram frequencies will likely differ (e.g. see [6], [7], [8], [16]); for example, even if "up" and "down" have comparable frequencies, the bigrams (sequences of two words) "make up" and "make down" do not. The problem is, find-ing substitution classes is hard enough even without attention to token frequencies. ...
... Bigram Analysis -It has been noted in the literature that steganographic texts may be particularly vulnerable to n-gram analysis ( [6], [7], [8], [16]). While it is relatively easy to match single word frequencies to those found in natural text, it is much more challenging to do the same for word pair (and n-tuplet) frequencies. ...
Article
A common pitfall of existing encryption procedures using lexical (text-based) steganography is the fact that the encrypted text may be recognized as such by someone who intercepts it. We introduce a new procedure which combines an automated algorithm with human input. The resulting texts are novel and therefore not searchable or otherwise easily recognized as encoding a hidden message
... On the contrary, steganalysis aims to detect a stegotext from a cover text [3] by extracting some text features such as word distribution [12], entropy [13], the correlation between words [14], and other features [15], [16]. Then these features are analyzed to determine whether the text contains a secret message. ...
... In [14], the statistical characteristics of correlations between the general service words such as n-window mutual information are used to classify the given text segments into stego-text segments and normal text segments, by an SVM classifier. The classifier accuracy reached 94.01%, 98.48%, and 97.96% for MC-based English stego-texts, NICETEXT, and TEXTO for 20 KB size texts. ...
... Although these kind of methods have high payload and are easy to deceive the detection of human eyes, an obvious disadvantage is that they cannot resist the OCR-based attack and the statistic-based detection. [22]. Since then, in order to improve the security of steganography, modification-based text steganography was proposed, which embedded secret information by modifying and replacing the text content with different granularity, such as synonym substitution [22]- [25] and syntactic change [26]. ...
... [22]. Since then, in order to improve the security of steganography, modification-based text steganography was proposed, which embedded secret information by modifying and replacing the text content with different granularity, such as synonym substitution [22]- [25] and syntactic change [26]. Because these methods directly use the content of text to embed information, they can resist OCR attack. ...
Article
Full-text available
Text has become one of the most extensively used digital media in Internet, which provides steganography an effective carrier to realize confidential message hiding. Nowadays, generation-based linguistic steganography has made a significant breakthrough due to the progress of deep learning. However, previous methods based on recurrent neural network have two deviations including exposure bias and embedding deviation, which seriously destroys the security of steganography. In this paper, we propose a novel linguistic steganographic model based on adaptive probability distribution and generative adversarial network, which achieves the goal of hiding secret messages in the generated text while guaranteeing high security performance. First, the steganographic generator is trained by using generative adversarial network to effectively tackle the exposure bias, and then the candidate pool is obtained by a probability similarity function at each time step, which alleviates the embedding deviation through dynamically maintaining the diversity of probability distribution. Third, to further improve the security, a novel strategy that conducts information embedding during model training is put forward. We design various experiments from different aspects to verify the performance of the proposed model, including imperceptibility, statistical distribution, anti-steganalysis ability. The experimental results demonstrate that our proposed model outperforms the current state-of-the-art steganographic schemes.
... For instance, if the sentiment within a document suddenly changes without apparent reason, it may indicate an attempt to conceal information. • Steganography Detection [22]: Steganography is hiding information within other information. Although commonly associated with images, it can also be applied to text. ...
Article
Full-text available
Transparency in financial reporting is crucial for maintaining trust in financial markets, yet fraudulent financial statements remain challenging to detect and prevent. This study introduces a novel approach to detecting financial statement fraud by applying sentiment analysis to analyse the textual data within financial reports. This research aims to identify patterns and anomalies that might indicate fraudulent activities by examining the language and sentiment expressed across multiple fiscal years. The study focuses on three companies known for financial statement fraud: Wirecard, Tesco, and Under Armour. Utilising Natural Language Processing (NLP) techniques, the research analyses polarity (positive or negative sentiment) and subjectivity (degree of personal opinion) within the financial statements, revealing intriguing patterns. Wirecard showed a consistent tone with a slight decrease in 2018, Tesco exhibited marked changes in the fraud year, and Under Armour presented subtler shifts during the fraud years. While the findings present promising trends, the study emphasises that sentiment analysis alone cannot definitively detect financial statement fraud. It provides insights into the tone and mood of the text but cannot reveal intentional deception or financial discrepancies. The results serve as supplementary information, enriching traditional financial analysis methods. This research contributes to the field by exploring the potential of sentiment analysis in financial fraud detection, offering a unique perspective that complements quantitative methods. It opens new avenues for investigation and underscores the need for an integrated, multidimensional approach to fraud detection.
... Another example of statistical detectors involves neighbor difference feature for word shift methods [146]. Many ML-based detectors for generative linguistic steganography methods have been developed in the recent years, such as SVM classifier in [147], Softmax classifier in [148], TS-RNN [149], or CNN [150]. ...
Preprint
Full-text available
A unified understanding of terms and their applicability is essential for every scientific discipline: steganography is no exception. Being divided into several domains (for instance, text steganography, digital media steganography, and network steganography), it is crucial to provide a unified terminology as well as a taxonomy that is not limited to some specific applications or areas. A prime attempt towards a unified understanding of terms was conducted in 2015 with the introduction of a pattern-based taxonomy for network steganography. Six years later, in 2021, the first work towards a pattern-based taxonomy for steganography was proposed. However, this initial attempt still faced several shortcomings, e.g., the lack of patterns for several steganography domains (the work mainly focused on network steganography and covert channels), various terminology issues, and the need of providing a tutorial on how the taxonomy can be used during engineering and scientific tasks, including the paper-writing process. As the consortium who published this initial 2021-study on steganography patterns, in this paper we present the first comprehensive pattern-based taxonomy tailored to fit all known domains of steganography, including smaller and emerging areas, such as filesystem steganography and cyber-physical systems steganography. Besides, to make our contribution more effective and promote the use of the taxonomy to advance research on steganography, we also provide a thorough tutorial on its utilization. Our pattern collection is available at https://patterns.ztt.hs-worms.de.
... According to how the carrier is obtained, text steganography methods can be divided into two types: modification-based steganography, which uses a carrier, and generation-based steganography, which does not [7]. The modification-based approach turns out to be easy to detect because of its explicit changes, and it cannot produce fluent steganographic text at scale [8]. ...
Preprint
Full-text available
Text steganography combined with natural language generation has become increasingly popular. Existing methods usually embed secret information in the generated words by controlling sampling during text generation. A candidate pool is constructed by a greedy strategy, and only words with high probability are encoded, which distorts the statistical properties of the text and seriously affects the security of the steganography. To reduce the influence of the candidate pool on the statistical imperceptibility of steganography, we propose a steganography method based on a new sampling strategy. Instead of consisting only of high-probability words, the candidate pool is built from words whose probability differs little from the actual sample of the language model, thus keeping consistency with the probability distribution of the language model. Moreover, we encode the candidate words according to their probability similarity with the target word, which further maintains the probability distribution. Experimental results show that the proposed method outperforms state-of-the-art steganographic methods in terms of security.
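The candidate-pool idea in this abstract can be sketched at a single generation step. Everything below is a toy stand-in: the distribution is hard-coded rather than produced by a language model, and `embed_bits` is a hypothetical helper, not the authors' algorithm.

```python
# One generation step: rank the vocabulary by how close each word's
# probability is to that of the word the model actually sampled, take
# the top 2**pool_bits as the candidate pool, and let the next secret
# bits select the emitted word by its index in that ranking.
def embed_bits(dist, sampled_word, bits, pool_bits=2):
    p_target = dist[sampled_word]
    ranked = sorted(dist, key=lambda w: abs(dist[w] - p_target))
    pool = ranked[: 2 ** pool_bits]      # similarity-ranked candidate pool
    index = int(bits[:pool_bits], 2)     # secret bits -> pool index
    return pool[index]

dist = {"the": 0.30, "a": 0.25, "this": 0.20, "that": 0.15, "one": 0.10}
word = embed_bits(dist, "a", "10")      # embeds the bits "10"
```

A receiver sharing the same model can invert the step: rebuild the same ranked pool and read the emitted word's index back out as bits.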
Chapter
The recent development of deep learning has made a significant breakthrough in linguistic generative steganography. Text has become one of the most intensely used communication carriers on the Internet, making it an efficient medium for concealing secret messages. Text steganography has long been used to protect the privacy and confidentiality of data transmitted over public channels. Steganography uses a carrier to embed data, generating a secret, unnoticed, and unremarkable message. Different techniques have been used to improve the security and quality of the generated steganographic text, such as the Markov model, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Transformers, Knowledge Graphs, and Variational Autoencoders (VAE). These techniques enhance the steganographic text's language model and conditional probability distribution. This paper provides a comparative analysis reviewing the key contributions of deep-learning-based generative linguistic steganography methods from different perspectives, such as text generation, encoding algorithm, and evaluation criteria. Keywords: Text steganography, Information hiding, Deep learning
Article
Full-text available
Text data forms the largest bulk of digital data that people encounter and exchange daily. For this reason, the potential use of text data as a covert channel for secret communication is an imminent concern. Even though information hiding in natural-language text has started to attract great interest, there has been no study on attacks against these applications. In this paper we examine the robustness of lexical steganography systems, using a universal steganalysis method based on language models and support vector machines to differentiate sentences modified by a lexical steganography algorithm from unmodified sentences. The experimental accuracy of our method on classification of steganographically modified sentences was 84.9%. On classification of isolated sentences we obtained a high recall rate, whereas the precision was low. Keywords: steganalysis, lexical steganography, natural language steganography, universal steganalysis, statis-
Conference Paper
Full-text available
Several automated techniques exist to transform ciphertext into text that “looks like” natural-language text while retaining the ability to recover the original ciphertext. This transformation changes the ciphertext so that it does not attract undue attention from, for example, attackers or agencies and organizations that might want to detect or censor encrypted communication. Although it is relatively easy to generate a small sample of quality text, it is challenging to generate large texts that are “meaningful” to a human reader and appear innocuous. This paper expands on a previous approach that used sentence models and large dictionaries of words classified by part of speech [7]. By using an “extensible contextual template” approach combined with a synonym-based replacement strategy, much more realistic text is generated than was possible with NICETEXT.
Article
An abstract is not available.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users easily use SVM as a tool. In this document, we present all of its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the necessary information.
Article
We present a system for protecting the privacy of cryptograms to avoid detection by censors. The system transforms ciphertext into innocuous text that can be transformed back into the original ciphertext. The expandable set of tools allows experimentation with custom dictionaries, automatic simulation of writing style, and the use of context-free grammars to control text generation. The scope of this paper is to provide an overview of the basic transformation processes and to demonstrate the quality of the generated text.
Research on the detecting algorithm of text document information hiding
  • Z Ji-Jun
  • Y Zhu
  • N Xin-Xin
Ji-jun, Z., Zhu, Y., Xin-xin, N., et al.: Research on the detecting algorithm of text document information hiding. Journal on Communications 25(12), 97-101 (2004)
Lexical steganography through adaptive modulation of the word choice hash
  • K Winstein
Winstein, K.: Lexical steganography through adaptive modulation of the word choice hash, http://alumni.imsa.edu/~keithw/tlex/lsteg.ps
A Practical and Effective Approach to Large-Scale Automated Linguistic Steganography
  • M Chapman
  • G Davida
  • M Rennhard
Chapman, M., Davida, G., Rennhard, M.: A Practical and Effective Approach to Large-Scale Automated Linguistic Steganography. In: Davida, G.I., Frankel, Y. (eds.) ISC 2001. LNCS, vol. 2200, pp. 156-167. Springer, Heidelberg (2001)