Linguistic Steganography Detection Using
Statistical Characteristics of Correlations
between Words
Zhili Chen*, Liusheng Huang, Zhenshan Yu, Wei Yang,
Lingjun Li, Xueling Zheng, and Xinxin Zhao
National High Performance Computing Center at Hefei,
Department of Computer Science and Technology,
University of Science and Technology of China,
Hefei, Anhui 230027, China
zlchen3@mail.ustc.edu.cn
Abstract. Linguistic steganography is a branch of Information Hiding
(IH) using written natural language to conceal secret messages. It plays
an important role in the Information Security (IS) area. Previous work on
linguistic steganography mainly focused on steganography itself, and there
was little research on attacks against it. In this paper, a novel statistical
algorithm for linguistic steganography detection is presented. We
use the statistical characteristics of correlations between the general service
words gathered in a dictionary to classify the given text segments
into stego-text segments and normal text segments. In the experiment of
blindly detecting the three different linguistic steganography approaches:
Markov-Chain-Based, NICETEXT and TEXTO, the total accuracy of
discovering stego-text segments and normal text segments is found to
be 97.19%. Our results show that the linguistic steganalysis based on
correlations between words is promising.
1 Introduction
As text-based Internet information and information dissemination media, such
as e-mail, blog and text messaging, are rising rapidly in people’s lives today, the
importance and volume of text data are increasing at an accelerating pace. This
growth in the significance of digital text in turn creates increased concerns
about using text media as covert channels of communication. One such
important covert means of communication is known as linguistic steganography.
Linguistic steganography makes use of written natural language to conceal secret
messages. The whole idea is to hide the very presence of the real messages.
Linguistic steganography algorithms embed messages into a cover text in a covert
manner such that the presence of the embedded messages in the resulting stego-
text cannot be easily discovered by anyone except the intended recipient.
Previous work on linguistic steganography was mainly focused on how to
hide messages. One method of modifying text for embedding a message is to
substitute selected words by their synonyms so that the meaning of the modified
K. Solanki, K. Sullivan, and U. Madhow (Eds.): IH 2008, LNCS 5284, pp. 224–235, 2008.
© Springer-Verlag Berlin Heidelberg 2008
sentences is preserved as much as possible. A steganography approach that is
based on synonym substitution is the system proposed by Winstein [1]. There
are some other approaches; among them, NICETEXT and TEXTO are the most
famous.
The NICETEXT system generates natural-looking cover text by using a mixture of
word substitution and Probabilistic Context-Free Grammars (PCFGs) ([2], [3]).
There are a dictionary table and a style template in the system. The style
template can be generated by using a PCFG or a sample text. The dictionary is
used to randomly generate sequences of words, while the style template selects
natural sequences of parts of speech, controlling the generation of words,
capitalization, punctuation, and white space. The NICETEXT system is intended
to protect the privacy of cryptograms from detection by censors.
TEXTO is a linguistic steganography program designed for transforming
uuencoded or PGP ascii-armoured ASCII data into English sentences [4]. It is
used to facilitate the exchange of binary data, especially encrypted data. TEXTO
works just like a simple substitution cipher, with each of the 64 ASCII symbols
used by PGP ASCII armour or uuencode replaced by an English
word. Not all of the words in the resulting text are significant; only the nouns,
verbs, adjectives, and adverbs are used to fill in the preset sentence structures.
Punctuation and “connecting” words (or any other words not in the dictionary)
are ignored.
Markov-Chain-Based is another linguistic steganography approach proposed
by [5]. The approach regards text generation as signal transmission from a
Markov signal source. It builds a state transfer chart of the Markov signal source
from a sample text. A part of a state transfer chart, with branches tagged by equal
probabilities that are represented by one or more bits, is illustrated in Fig. 1.
Then the approach uses the chart to generate cover text according to the secret
messages.
The approaches described above generate innocuous-looking stego-text to deceive
attackers. However, they have some drawbacks. For example, the first
approach sometimes substitutes synonyms that do not agree with correct
English usage or with the genre and author style of the given text. And the latter
three approaches are detectable by a human warden, as the stego-text they
generate does not have a natural, coherent and complete sense. They can be used
in communication channels where only computers act as attackers.
A few detection methods have been proposed making use of the drawbacks
discussed. The paper [6] brought forward an attack against systems based on
synonym substitution, especially the system presented by Winstein. The 3-gram
language model was used in the attack. The experimental accuracy of this method
on classification of steganographically modified sentences was 84.9% and that
of unmodified sentences was 38.6%. Another detection method, inspired by
the design ideas of the conception chart, was proposed by the paper [7], using a
measurement of correlation between sentences. The accuracy of the simulated
detection using this method was 76%. The two methods fall short of the accuracy
that practical detection applications require. In addition, the first method
requires a great deal of computation to calculate the large number of parameters
of the 3-gram language model, and the second one requires a database of rules
that takes a lot of work to build.

Fig. 1. A part of a tagged state transfer chart
Our research examines the drawbacks of the last three steganography approaches,
aiming to accurately detect the application of these approaches in given text
segments and to bring forward a blind detection method for linguistic
steganography that generates cover texts. We have developed a novel, efficient
and accurate detection algorithm that uses the statistical characteristics of the
correlations between the general service words gathered in a dictionary to
distinguish between stego-text segments and normal text segments.
2 Important Notions
2.1 N-Window Mutual Information (N-WMI)
In the area of statistical Natural Language Processing (NLP), an information-
theoretic measurement for discovering interesting collocations is Mutual
Information (MI) [8]. MI is originally defined between two particular events x and
y. In the case of NLP, the MI of two particular words x and y is defined as follows:

    MI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
             = log2 [ P(x|y) / P(x) ]
             = log2 [ P(y|x) / P(y) ]                                    (1)

Here, P(x, y), P(x) and P(y) are the occurrence probabilities of "xy", "x" and
"y" in the given text. In our case, we regard these probabilities as the occurrence
probabilities of the word pairs "xy", "x?" and "?y" in the given text, respectively,
where "?" represents any word.
In natural language, collocation is usually defined as an expression consisting
of two or more sequential words. In our case, we will investigate pairs of words
within a certain distance. With the distance constraint, we introduce some
definitions as follows.

Fig. 2. An illustration of 3-WWP
N-Window Word Pair (N-WWP): Any pair of words in the same sentence
with a distance less than N (N is an integer greater than 1). Here, the distance of
a pair of words equals the number of words between them plus 1.
Note that an N-WWP is order-related. In Fig. 2, the numbered boxes represent
the words in a sentence and the variable d represents the distance of the word pair.
The 3-WWPs in the sentence are illustrated in the figure by arrowed, folded
lines. Hereafter, we will denote the N-WWP "xy" as ⟨x, y⟩.
N-Window Collocation (N-WC): An N-WWP with frequent occurrence. In
some sense, our detection results are partially determined by the distribution of
N-WCs in a given text segment, as we will see later.
N-Window Mutual Information (N-WMI): We use the MI of an N-WWP
to measure its occurrence. This MI is called the N-Window Mutual Information
(N-WMI) of the words in the word pair. Thus, an N-WWP is an N-WC if its
N-WMI is greater than a particular value.
With the definition of N-WMI, we can use equation (1) to evaluate the N-WMI
of words x and y in a particular text segment. Given a certain text segment, let
the counts of occurrences of any N-WWP and of the N-WWPs ⟨x, y⟩, ⟨x, ?⟩ and
⟨?, y⟩ be C, Cxy, Cx and Cy, respectively, and denote the N-WMI value by MIN.
Then the evaluation is as follows:

    MIN(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
              = log2 [ (Cxy / C) / ((Cx / C)(Cy / C)) ]
              = log2 [ (C · Cxy) / (Cx · Cy) ]                           (2)
Because of the significance of N-WMI in our detection algorithm, we will give a
further explanation with an example. Given the sentence "We were fourteen in
all, and all young men.", let us evaluate the 4-WMI of ⟨in, all⟩. All 4-WWPs in the
sentence are as follows: ⟨we, were⟩, ⟨we, fourteen⟩, ⟨we, in⟩, ⟨were, fourteen⟩,
⟨were, in⟩, ⟨fourteen, in⟩, ⟨were, all⟩, ⟨fourteen, all⟩, ⟨in, all⟩, ⟨fourteen, and⟩,
⟨in, and⟩, ⟨all, and⟩, ⟨in, all⟩, ⟨all, all⟩, ⟨and, all⟩, ⟨all, young⟩, ⟨and, young⟩,
⟨all, young⟩, ⟨and, men⟩, ⟨all, men⟩, ⟨young, men⟩. We get C = 21, Cin,all = 2,
Cin = 3, Call = 6. Then we have:

    MI4(in, all) = log2 [ (C · Cin,all) / (Cin · Call) ]
                 = log2 [ (21 × 2) / (3 × 6) ]
                 = 1.2224
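This counting scheme is easy to sketch in code. The following Python snippet is our own illustration, not the paper's implementation; the function name and the pre-tokenized, punctuation-stripped sentence are assumptions. It reproduces the worked example above:

```python
import math

def n_wmi(words, x, y, n=4):
    """N-WMI of the ordered pair <x, y> within a single tokenized sentence.

    An N-WWP is an ordered pair of words whose distance is less than N,
    i.e. whose index difference is at most N - 1.
    """
    pairs = [(words[i], words[j])
             for i in range(len(words))
             for j in range(i + 1, min(i + n, len(words)))]
    c = len(pairs)                               # C: total N-WWP count
    c_xy = sum(1 for p in pairs if p == (x, y))  # count of <x, y>
    c_x = sum(1 for a, _ in pairs if a == x)     # count of <x, ?>
    c_y = sum(1 for _, b in pairs if b == y)     # count of <?, y>
    return math.log2(c * c_xy / (c_x * c_y))     # equation (2)

sentence = "we were fourteen in all and all young men".split()
print(round(n_wmi(sentence, "in", "all"), 4))  # 1.2224
```

The enumeration yields the same counts as the example: C = 21, Cin,all = 2, Cin = 3, Call = 6.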
2.2 N-Window Variance of Mutual Information (N-WVMI)
Suppose D is the general service word dictionary and M is the count of words
in D; then we can get M × M different pairs of words from the dictionary D.
In any given text segment, we can calculate the N-WMI value of each different
word pair in dictionary D and get an N-WMI matrix. However, it is probable
that some items of the matrix have no values because their corresponding pairs
of words are absent from the given text segment; that is to say, these items are
not present. We denote the N-WMI matrix of the training corpus as TM×M,
and that of a sample text segment as SM×M. Therefore, when all items of both
TM×M and SM×M are present, we define the N-Window Variance of Mutual
Information (N-WVMI) as:

    V = (1 / (M × M)) Σ_{i=1..M} Σ_{j=1..M} (Sij − Tij)²                 (3)

When either an item of SM×M or its corresponding item of TM×M does not
exist, we say that the pair of items is not present; otherwise it is present. For
example, if either Sij or Tij does not exist, we say the pair of items in position
(i, j) is not present. Suppose I pairs of items are present; then we evaluate the
N-WVMI as:

    V = (1 / I) Σ_{i=1..M} Σ_{j=1..M} (Sij − Tij)² δ(i, j)               (4)

Here, δ(i, j) = 1 if both Sij and Tij are present, and δ(i, j) = 0 otherwise. When
I = M × M, equation (4) turns into equation (3).
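A minimal sketch of the N-WVMI computation, assuming the sparse matrices are stored as Python dicts keyed by word-pair position (a representation of our choosing, not specified in the paper):

```python
def n_wvmi(S, T):
    """N-WVMI over the item pairs present in both matrices (equation (4)).

    S and T are sparse N-WMI matrices stored as dicts mapping a position
    (i, j) to an N-WMI value; a missing key is an absent item. When all
    M * M positions are present in both, this is exactly equation (3).
    """
    present = [k for k in S if k in T]  # positions where delta(i, j) = 1
    if not present:                     # no comparable items at all
        return 0.0
    return sum((S[k] - T[k]) ** 2 for k in present) / len(present)

S = {(0, 0): 1.5, (0, 1): -0.5, (1, 1): 2.0}
T = {(0, 0): 1.0, (1, 1): 2.0, (1, 0): 0.3}
print(n_wvmi(S, T))  # 0.125 = ((1.5 - 1.0)**2 + (2.0 - 2.0)**2) / 2
```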
2.3 Partial Average Distance (PAD)
The N-WVMI is defined to distinguish in principle between the statistical
characteristics of normal text segments and stego-text segments. But a more
precise statistical variable is necessary for accurate detection. Therefore, the
Partial Average Distance (PAD) of the two N-WMI matrices SM×M and TM×M
is defined as follows:

    Dα,K = (1 / K) Σ_{i=1..M} Σ_{j=1..M} |Sij − Tij| [|Sij − Tij| > α] λK(i, j)    (5)

In this equation, α represents a threshold on the distance between two N-WMI
values, and K means that only the first K greatest items of SM×M are considered.
As we can see, equation (5) averages the differences between those items of SM×M
and TM×M with great N-WMI values and great distances, as these items well
represent the statistical characteristics of the two types of text segments. The
expressions [|Sij − Tij| > α] and λK(i, j) are evaluated as:

    [|Sij − Tij| > α] = 1 if |Sij − Tij| > α, and 0 otherwise

    λK(i, j) = 1 if Sij is among the first K greatest items, and 0 otherwise
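Under the same sparse-dict representation, the PAD can be sketched as follows. The paper does not specify how an absent Tij interacts with λK, so we assume such positions simply contribute nothing to the sum:

```python
def pad(S, T, alpha=2.0, K=100):
    """Partial Average Distance D_{alpha,K} of equation (5).

    Only the K greatest items of S are considered (lambda_K), and of those
    only the ones whose distance to the corresponding T item exceeds alpha
    (the Iverson bracket) contribute; the sum is averaged over K.
    """
    top_k = sorted(S, key=S.get, reverse=True)[:K]  # lambda_K(i, j) = 1
    total = 0.0
    for k in top_k:
        if k in T and abs(S[k] - T[k]) > alpha:     # [|Sij - Tij| > alpha]
            total += abs(S[k] - T[k])
    return total / K

S = {(0, 0): 9.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 0.2}
T = {(0, 0): 2.0, (0, 1): 4.5, (1, 0): 1.1, (1, 1): 0.1}
print(pad(S, T, alpha=2.0, K=2))  # 3.5 = |9.0 - 2.0| / 2
```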
3 Method
In natural language, normal text has many inherent statistical characteristics
that can’t be provided by text generated by linguistic steganography approaches
we investigate. Here is something we have observed: there is a strong correla-
tion between words in the same sentence in normal text, but the correlation is
weakened a lot in the generated text. The reason is that a normal sentence has a
natural, coherent and complete sense, while a generated sentence does not. For
example, a sentence beginning with "She is a ..." more likely reads "She
is a woman teacher", or "She is a beautiful actress", or "She is a mother", and
so on in normal text. But "She is a man", or "She is a good actor", or
"She is a father" would plausibly appear only in generated text. This shows that
the word “she” has a strong correlation with “woman”, “actress” and “mother”,
but it has a weak correlation with “man”, “actor” and “father”. Therefore, we
probably evaluate the N-WMI of “she” and “woman” or “actress” or “mother”
with a greater value than that of “she” and “man” or “actor” or “father”. In
our research, we use N-WMI to measure the strength of correlation between two
words.
In our experiment, we build a corpus from the novels written by Charles
Dickens, a great English novelist in the Victorian period. We name this cor-
pus Charles-Dickens-Corpus. We build another corpus from novels written by
a few novelists whose last names begin with the letter "s", and name it S-Corpus.
Finally, we build a third corpus from the cover text generated by the lin-
guistic steganography algorithms we investigated: NICETEXT, TEXTO and
Markov-Chain-Based, calling it Bad-Corpus. We then build the training cor-
pus from Charles-Dickens-Corpus, the good testing sample set from S-Corpus
and the bad testing sample set from Bad-Corpus. The training corpus con-
sists of about 400 text documents amounting to a size of more than 10MB.
There are 184 text documents in the good testing sample set and 422 text
documents in the bad testing sample set. The commonly used word dictionary D
contains the 2000 words (that is, M = 2000) most widely used in English. We let
N = 4; that is, we use 4-WMI. Thereafter, the following procedure has been
employed.
First, we process the training corpus as one large text segment to get the training
N-WMI matrix TM×M using dictionary D. Our program reads every text
document, splits it into sentences, counts the numbers of occurrences of all the
N-WWPs in D that are contained in the training corpus, and gets C. Furthermore,
for every N-WWP, we get its Cxy, Cx and Cy along the way. Then
we can evaluate the N-WMI of every N-WWP with equation (2) and obtain
the training N-WMI matrix TM×M. In this step, we can store TM×M to disk
for later use, so if the related configuration parameters are not altered, we can
simply read TM×M from the disk in this step for subsequent sample text segment
detections.
Second, we process a sample text segment to get the sample N-WMI matrix
SM×M. The procedure is similar to the first step, but in this step our program
reads only a text segment of a certain size, such as 10 kB, from every sample text
document. In this way, we can control the size of the sample text segment.
Third, we evaluate the N-WVMI value V of the sample text segment using
SM×M and TM×M with equation (3) or (4). Some attention has to be paid in this
step. If some pairs of items are absent from the matrices SM×M and TM×M, we
use equation (4); that is to say, we calculate the variance over only the I pairs of
items that are present in both matrices SM×M and TM×M. Otherwise,
equation (4) turns into equation (3). In this step, another variable, the PAD value
Dα,K, is calculated by equation (5). This variable is a useful auxiliary
classification feature in addition to the N-WVMI value V. In the experiment, we
let α = 2 and K = 100, so we calculate D2,100.
Finally, we use an SVM classifier to classify the sample text segment as a stego-text
segment or a normal text segment according to the values V and D2,100.
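The four steps above can be sketched end to end in Python. This is our own illustrative sketch, not the authors' code: the helper names, the tiny dictionary, the tokenization, and the simple threshold rule standing in for the trained LIBSVM classifier are all assumptions:

```python
import math
import re
from collections import Counter

def wmi_matrix(text, dictionary, n=4):
    """Build a sparse N-WMI matrix over dictionary word pairs (equation (2))."""
    vocab = set(dictionary)
    c = 0
    c_xy, c_x, c_y = Counter(), Counter(), Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for i in range(len(words)):
            for j in range(i + 1, min(i + n, len(words))):
                c += 1                       # every N-WWP contributes to C
                x, y = words[i], words[j]
                if x in vocab:
                    c_x[x] += 1              # <x, ?>
                if y in vocab:
                    c_y[y] += 1              # <?, y>
                if x in vocab and y in vocab:
                    c_xy[(x, y)] += 1        # <x, y>
    return {p: math.log2(c * k / (c_x[p[0]] * c_y[p[1]]))
            for p, k in c_xy.items()}

def detect(sample_text, T, dictionary, v_threshold=1.0):
    """Classify one sample segment; a threshold on V stands in for the SVM."""
    S = wmi_matrix(sample_text, dictionary)
    present = [k for k in S if k in T]       # pairs usable by equation (4)
    v = (sum((S[k] - T[k]) ** 2 for k in present) / len(present)
         if present else 0.0)
    return "stego" if v > v_threshold else "normal"

dictionary = ["she", "is", "a", "woman", "teacher"]
T = wmi_matrix("She is a woman teacher. She is a woman teacher.", dictionary)
print(detect("She is a woman teacher.", T, dictionary))  # normal
```

A segment whose pair statistics match the training corpus yields V near zero and is labeled normal; the real system replaces the threshold rule with an SVM trained on both V and D2,100.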
Fig. 3 shows the flow of the detection procedure. The solid arrows
represent the flow of data, while the dashed arrow indicates that
nothing is transferred or processed if TM×M has already been stored. The
thick dashed rectangle indicates the whole detection system. Obviously, there are
two key flows in the system: training and testing. The training process is not
always required before the testing process. Once the training process is completed,
it need not be repeated in subsequent detections unless some configuration
parameters are changed. The testing process contains two steps, which evaluate
the sample N-WMI matrix and the classification features, the N-WVMI and PAD
values, respectively, ending with an output from an SVM classifier [9] that
indicates whether the testing sample text segment is stego-text.
Fig. 3. Flow of the detection procedure
Table 1. Sample text set and detection results

Type       Generator            Samples  Success  Failure  Accuracy
Good Set   -                    184      178      6        -
Bad Set    Markov-Chain-Based   100      89       11       94.01%
           NICETEXT             212      212      0        98.48%
           TEXTO                110      110      0        97.96%
Total                           606      589      17       97.19%
4 Results and Discussion
In our experiment, a total of 606 testing sample text segments, each with a size of
20 kB, are detected by using their 4-WVMIs. The composition of the sample text
set is presented in Table 1. Using an SVM classifier, we get a total accuracy of
97.19%. Note that the accuracy of each linguistic steganography method (denoted
by LSMethod) is computed as follows:

    Accuracy(LSMethod) = (SUC(GoodSet) + SUC(LSMethod)) / (SAM(GoodSet) + SAM(LSMethod))

where SUC represents the number of successfully detected text segments and SAM
represents the number of sample text segments.
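This formula can be checked against the counts in Table 1 (SUC(GoodSet) = 178, SAM(GoodSet) = 184); the small script below is our own verification, not part of the paper:

```python
def accuracy(suc_good, sam_good, suc_method, sam_method):
    """Per-method accuracy: the good set is pooled with each method's bad set."""
    return 100 * (suc_good + suc_method) / (sam_good + sam_method)

table = {  # (success, samples) per generator, taken from Table 1
    "Markov-Chain-Based": (89, 100),
    "NICETEXT": (212, 212),
    "TEXTO": (110, 110),
}
for name, (suc, sam) in table.items():
    print(name, round(accuracy(178, 184, suc, sam), 2))
# Markov-Chain-Based 94.01
# NICETEXT 98.48
# TEXTO 97.96
total = round(100 * (178 + 89 + 212 + 110) / (184 + 100 + 212 + 110), 2)
print(total)  # 97.19
```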
In Table 1, we can see that the accuracy of detecting stego-text segments
generated by Markov-Chain-Based is noticeably lower than for the
other two algorithms. The probable reason is that the Markov-Chain-Based
method sometimes embeds secret messages by adding white space between
consecutive words in the sample text, copying these words as generated words
rather than generating new words, when there is only one branch in the state
transfer chart. For example, a text segment generated by Markov-Chain-Based
reads as follows:
“...I’ll wait a year, according to the forest to tell each other than a brown
thrush sang against a tree, held his mouth shut and shook it out, the elder
Ammon suggested sending for Polly. ...”
We can see that the algorithm adds white space between the words "according"
and "to", between the words "sending" and "for", and so on in this text segment;
these words are copied directly from the sample text into the generated text.
This preserves more properties of normal text.
Fig. 4 shows the testing results of all testing samples. Fig. 5 - Fig. 7 show the
testing results of testing samples generated by Markov-Chain-Based, NICETEXT
and TEXTO respectively. As discussed above, the accuracy of detecting the
Markov-Chain-Based method appears slightly lower. The results of detecting the
other two algorithms appear ideal with a text segment size of 20 kB. But when the
segment sizes are smaller than 5 kB, such as 2 kB, the accuracies decrease
noticeably. This is determined by the characteristics of the statistical algorithm, so
sample text segments with a size greater than 5 kB are recommended.
Fig. 4. Testing results of all testing samples
Fig. 5. Testing results of testing samples generated by Markov-Chain-Based
In addition, our algorithm is time-efficient, although we have not measured
it rigorously: it takes about one minute to complete our detection of more than 600
sample text segments.
Fig. 6. Testing results of testing samples generated by NICETEXT
Fig. 7. Testing results of testing samples generated by TEXTO
Overall, the results of our research appear quite promising. We have accurately
detected the three aforementioned linguistic steganography methods in a blind
way, and our method may be suitable for detecting other or new linguistic
steganography methods that generate natural-looking cover text. For other
linguistic steganography methods, such as synonym-substitution-based or
translation-based steganography, detection based on the characteristics of
correlations between words may still work; that is also our future work.
5 Conclusion
In this paper, a statistical linguistic steganography detection algorithm has been
presented. We use the statistical characteristics of the correlations between the
general service words that are gathered in a dictionary to classify the given text
segments into stego-text segments and normal text segments. The strength of the
correlation is measured by N-window mutual information (N-WMI). The total
accuracy is as high as 97.19%. The accuracies of blindly detecting the three
different linguistic steganography approaches, Markov-Chain-Based, NICETEXT
and TEXTO, are 94.01%, 98.48% and 97.96% respectively.
Our research mainly focuses on detecting linguistic steganography that embeds
secret messages by generating cover text. But it is easy to modify our general
service word dictionary to suit the detection of synonym-substitution-based
algorithms and other linguistic steganography methods that modify the content
of the cover text. Therefore, our algorithm is widely applicable in linguistic
steganalysis.
Many interesting and new challenges are involved in the analysis of linguistic
steganography algorithms, called linguistic steganalysis, which has little
or no counterpart in other media domains such as image or video. Linguistic
steganalysis performance strongly depends on many factors such as the length of
the hidden message and the way to generate a cover text. However, our research
shows that the linguistic steganalysis based on correlations between words is
promising.
Acknowledgement
This work was supported by the NSF of China (Grant Nos. 60773032 and
60703071), the Ph.D. Program Foundation of the Ministry of Education
of China (No. 20060358014), the Natural Science Foundation of Jiangsu
Province of China (No. BK2007060), and the Anhui Provincial Natural Science
Foundation (No. 070412043).
References
1. Winstein, K.: Lexical steganography through adaptive modulation of the word
choice hash, http://alumni.imsa.edu/∼keithw/tlex/lsteg.ps
2. Chapman, M.: Hiding the Hidden: A Software System for Concealing Ciphertext as
Innocuous Text (1997), http://www.NICETEXT.com/NICETEXT/doc/thesis.pdf
3. Chapman, M., Davida, G., Rennhard, M.: A Practical and Effective Approach
to Large-Scale Automated Linguistic Steganography. In: Davida, G.I., Frankel, Y.
(eds.) ISC 2001. LNCS, vol. 2200, pp. 156–167. Springer, Heidelberg (2001)
4. Maher, K.: TEXTO,
ftp://ftp.funet.fi/pub/crypt/steganography/texto.tar.gz
5. Shu-feng, W., Liu-sheng, H.: Research on Information Hiding. Master's thesis,
University of Science and Technology of China (2003)
6. Taskiran, C., Topkara, U., Topkara, M., et al.: Attacks on lexical natural language
steganography systems. In: Proceedings of SPIE (2006)
7. Ji-jun, Z., Zhu, Y., Xin-xin, N., et al.: Research on the detecting algorithm of text
document information hiding. Journal on Communications 25(12), 97–101 (2004)
8. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language
Processing. Publishing House of Electronics Industry, Beijing (January 2005)
9. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines (2001),
http://www.csie.ntu.edu.tw/∼cjlin/libsvm