Conference Paper

IM Session Identification by Outlier Detection in Cross-correlation Functions

March 2015

DOI:10.13140/RG.2.1.3524.5602

Conference: 49th Annual Conference on Information Sciences and Systems (CISS)
At: Johns Hopkins University, Baltimore, MD, USA

Authors:

Saad Saleh

University of Groningen

Muhammad Usman Ilyas

University of Jeddah

Khawar Khurshid

National University of Sciences and Technology

IM Session Identification by Outlier Detection in Cross-correlation Functions

Saad Saleh∗, Muhammad U. Ilyas∗, Khawar Khurshid∗, Alex X. Liu‡ and Hayder Radha§

∗Dept. of Electrical Engineering, School of Electrical Engineering and Computer Science,

National University of Sciences and Technology, H-12, Islamabad – 44000, Pakistan

‡Dept. of Computer Science and Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA

§Dept. of Electrical and Computer Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA

Email: {saad.saleh, usman.ilyas, khawar.khurshid}@seecs.edu.pk∗,

alexliu@cse.msu.edu‡, radha@egr.msu.edu§

Abstract: The identification of encrypted Instant Messaging (IM) channels between users is made difficult by the presence of variable and high levels of uncorrelated background traffic. In this paper, we propose a novel Cross-correlation Outlier Detector (CCOD) to identify communicating end-users in a large group of users. Our technique uses traffic flow traces between individual users and the IM service provider’s data center. We evaluate the CCOD on a data set of Yahoo! IM traffic traces with an average SNR of −6.11 dB (the data set includes ground truth). Results show that our technique provides an 88% true positive (TP) rate, a 3% false positive (FP) rate and a 96% ROC area. The performance of previous correlation-based schemes on the same data set was limited to a 63% TP rate, a 4% FP rate and an 85% ROC area.

Keywords- Link de-anonymization; instant messaging; security; privacy

I. INTRODUCTION

A. Background and Motivation

Instant Messaging (IM) services are projected to reach 1.4 billion users worldwide by 2016 [1]. IM services provide mostly free, ubiquitous access, mobility and privacy. However, user privacy has been under attack for unlawful reasons in the past few years, e.g. the theft of data of 35 million users in Korea in 2013 [2]. Government agencies, including the National Security Agency (NSA), have also breached the privacy of millions of users [3]. The aim of our research is to assess the vulnerability of IM sessions to de-anonymization attacks (identifying who is communicating with whom) using only transport layer session traces. Such link de-anonymization is challenging for the following reasons: (1) IM messages are now often encrypted (only IP and TCP headers are visible, and even these become infeasible to log on a large scale), and (2) the IM data center establishes separate TCP connections with the source and destination users (at no time does a single packet contain the IP addresses of both end users). The complexity of de-anonymization increases further in the following practical scenarios: (1) multiple simultaneous message sessions by a user, (2) thousands of users communicating through the IM data center at any time, (3) duplicate packets due to retransmissions and (4) out-of-order packet delivery.

B. Limitations of Prior Art

Several prior works have focused on link de-anonymization. Time Series Correlation (TSC), the baseline approach, has a TP rate of 63% for a signal-to-noise ratio (SNR) of −6.11 dB [4]. Major factors for performance deterioration include delay, jitter, buffering, reordering, duplicate messages and server messages. In the area of de-anonymization of mix networks, several studies focused on the computation of mutual information between ingress and egress traffic flows. High time complexity and reliance on data that needs to be collected from multiple points inside the network are their major limitations. In social-network de-anonymization, data sparsity and membership information are used to de-anonymize networks. Here, the requirement of detailed user information becomes a major limiting factor. Several de-anonymization attempts have been made on the Tor network, but the major emphasis was the identification of traffic using various fingerprints. The use of correlation has been limited in these breaching attempts.

In our previous work, we showed that the correlation of wavelet-decomposed time series of users’ traffic traces can successfully breach IM session privacy [4]. In a recent study, we showed that the cause-effect relationship between packets appearing in two communicating (“talking”) users’ traffic traces can be leveraged to de-anonymize user sessions [5]. Time complexity was a major limiting factor for these approaches.

C. Proposed Approach

In this paper, we propose a novel Cross-correlation Outlier Detector (CCOD) to de-anonymize users’ IM sessions using only undirected transport layer traffic traces collected at the data center (or gateway), in the form of flow logs. Our idea leverages the limited delay between the appearance of a packet in a sending user’s traffic trace and its appearance in the receiving user’s traffic trace. We expect talking users to have a high cross-correlation statistic, while for non-talking users the appearance of packets in the same time slot is expected to be coincidental. To de-anonymize a user, we compute the cross-correlation function of the time series of her traffic trace, derived from her flow log, with the respective time series of all other users. Therefore, the time complexity of our approach for a single user is $\Theta(N)$, where $N$ is the number of IM users. Next, we estimate the distribution of the cross-correlation function at all non-zero time shifts. Finally, we apply a binary classifier to the value of the cross-correlation function at zero time shift and determine whether that value is drawn from the preceding distribution or not.

D. Experimental Results and Findings

We evaluated the performance of CCOD on a real-world Yahoo! IM data set collected from the greater New York area over a duration of 24 hours. Classification rates were estimated for Naïve Bayes, SVM and C4.5 classifiers. Results showed that CCOD provides up to 88% TP rate and 3% FP rate with 95% precision and 88% recall. Results also showed that low SNR has the worst effect on classifier performance. We compared CCOD with TSC and with the wavelet-based decomposition approach called COLD [4][5]. CCOD outperformed both TSC and COLD.

E. Key Contributions

Our major contributions are as follows:

1) We propose CCOD, a novel link de-anonymization approach for IM user sessions (see Section III).
2) We validated the approach on a real-world Yahoo! Messenger data set (see Section IV).
3) We compared the performance gains of our CCOD approach with previous state-of-the-art schemes (see Tab. II).

Paper Organization:

The rest of the paper is organized as follows: Section II presents the IM network scenario and de-anonymization challenges. Section III presents our proposed approach. Results of simulations over the Yahoo! data set are discussed in Section IV. Section V presents the related work. Finally, Section VI concludes the paper.

II. PROBLEM DESCRIPTION

When two users $u_1$ and $u_2$ communicate with each other via a relayed IM service, two TCP connections are established: one between user $u_1$ and the IM data center and another between user $u_2$ and the IM data center. All flows in the source user’s time series appear in the destination user’s time series with some delay and possible reordering. Fig. 1 shows the setup used for data collection of IM traffic traces. Port filtering is used to isolate IM traffic from other traffic to and from the data center. Users are identified by their IP addresses. Flow logs can be collected at two possible locations: (1) at the gateway to the IM data center, or (2) at the gateway or proxy server of a subnet. Logging at the IM data center can de-anonymize all sessions established through the data center. In contrast, traffic logged at the gateway of a subnet can only be used to de-anonymize conversations between users located inside the subnet¹. Traffic logging tools record a number of packet header fields, including source and destination IP addresses, source and destination port numbers, protocol versions, time stamps and packet sizes [7]. Owing to resource constraints, the flow logs in the data set collected at the data center contained only timestamps and user IDs / IP addresses [8]. We aim to answer the question, “Is it possible to link two communicating IM users from among a large set of users using only user-server flow logs?”

¹ In a recent case of a bomb hoax at Harvard University, the suspect was identified based upon the local subnet traffic at Harvard University [6].
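As an illustration of how a flow log containing only timestamps and user IDs could be turned into the per-user time series used in the rest of the paper, the following minimal sketch bins events into fixed-width time slots. The function name `build_time_series`, the `(timestamp, user_id)` tuple layout, the one-second default bin width and the toy log are all assumptions made for this example; the paper does not specify the file format or the binning procedure.

```python
import math
from collections import defaultdict


def build_time_series(events, duration_s, bin_s=1.0):
    """Bin (timestamp_seconds, user_id) events into per-user packet counts.

    Hypothetical helper: the tuple layout and the bin width are assumptions.
    """
    n_bins = math.ceil(duration_s / bin_s)
    series = defaultdict(lambda: [0] * n_bins)
    for timestamp, user in events:
        idx = int(timestamp // bin_s)
        if 0 <= idx < n_bins:  # ignore events outside the observation window
            series[user][idx] += 1
    return dict(series)


# Hypothetical toy log: seconds since the start of capture, anonymized user ID.
events = [(0.2, "a"), (0.9, "a"), (1.1, "b"), (3.5, "a"), (3.7, "b")]
series = build_time_series(events, duration_s=5)
```

Because the log is undirected, each count simply records that a packet was exchanged between that user and the data center in the given slot, regardless of direction.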

III. PROPOSED DE-ANONYMIZATION APPROACH

We propose a cross-correlation outlier detector to map an IM user to the user(s) she is communicating with. Let $X(t)$ and $Y(t)$ be the time series of packets sent to and received from the IM data center by two users $X$ and $Y$, respectively. Let $R_{XY}(t-k)$ denote the cross-correlation of flow log $X(t)$ and the flow log $Y(t-k)$ delayed by $k$ time units (in our case, seconds). The step-by-step procedure of our approach is as follows:

1) Compute $R_{XY}(0)$ of the flow logs $X(t)$ and $Y(t)$.
2) Compute $R_{XY}(t-k)$ of the flow logs $X(t)$ and $Y(t-k)$, where $k \in \{-Z, \dots, +Z\} \setminus \{0\}$, i.e. $k$ varies over the integers from $-Z$ to $+Z$, except 0.
3) Estimate the standard deviation $\sigma_{XY}$ of all $2 \times Z$ values $R_{XY}(t-k)$ from the previous step.
4) Compute the deviation of $R_{XY}(0)$ from the mean $\mu_{XY}$ of the distribution of all other values $R_{XY}(t-k)$, i.e. $R_{XY}(0) - \mu_{XY}$.
5) Normalize the deviation $R_{XY}(0) - \mu_{XY}$ by the standard deviation $\sigma_{XY}$, i.e. $(R_{XY}(0) - \mu_{XY})/\sigma_{XY}$.
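The five steps can be summarized in a few equations. The paper does not state an explicit formula for the cross-correlation, so the unnormalized sum-of-products form below is an assumption, as is the use of the population form (division by $2Z$) for the standard deviation. The cross-correlation is written here as a function of the lag $k$ alone, and the symbol $s_{XY}$ for the final statistic is introduced only for this illustration.

```latex
\begin{align}
R_{XY}(k)   &= \sum_{t} X(t)\, Y(t-k), \qquad k \in \{-Z, \dots, +Z\}, \\
\mu_{XY}    &= \frac{1}{2Z} \sum_{k \neq 0} R_{XY}(k), \\
\sigma_{XY} &= \sqrt{\frac{1}{2Z} \sum_{k \neq 0} \bigl( R_{XY}(k) - \mu_{XY} \bigr)^{2}}, \\
s_{XY}      &= \frac{R_{XY}(0) - \mu_{XY}}{\sigma_{XY}}.
\end{align}
```

Under these assumptions, $s_{XY}$ is a z-score of the zero-shift correlation measured against the correlations at all non-zero shifts.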

Our breaching strategy is to exploit the “high correlation” between undirected flow logs of communicating users, where the threshold for what constitutes high correlation is determined by the value of $\sigma_{XY}$ for each pair of users $X$ and $Y$. Without any time shift, talking users have a high cross-correlation. However, the server introduces a processing delay for packets, which decreases the correlation statistic. Although noise also decreases the correlation statistic, any increase in time shift decreases the correlation sharply. To increase the performance gains, we developed our own measure, which computes the deviation of the central value (the non-shifted correlation statistic) from the mean of all other values. The resulting metric is expressed in units of the standard deviation of all other values.
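The measure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the cross-correlation is taken as an unnormalized sum of products over the overlapping samples, the standard deviation is the population form, both series are assumed to have equal length, and the function names and the three toy traces are hypothetical.

```python
import math
import statistics


def cross_correlation(x, y, k):
    """Sum of x[t] * y[t - k] over the samples where both indices are valid."""
    n = len(x)
    return sum(x[t] * y[t - k] for t in range(max(0, k), min(n, n + k)))


def ccod_statistic(x, y, max_shift):
    """Return (R(0) - mu) / sigma, with mu and sigma taken over non-zero shifts."""
    r0 = cross_correlation(x, y, 0)
    shifted = [cross_correlation(x, y, k)
               for k in range(-max_shift, max_shift + 1) if k != 0]
    mu = statistics.fmean(shifted)
    sigma = statistics.pstdev(shifted)
    if sigma == 0:  # degenerate case: all shifted correlations are identical
        return 0.0 if r0 == mu else math.copysign(math.inf, r0 - mu)
    return (r0 - mu) / sigma


# Hypothetical per-second packet counts for one target and two candidates.
traces = {
    "target":   [1, 0, 3, 0, 0, 2, 0, 1, 0, 0],
    "partner":  [1, 1, 3, 0, 1, 2, 0, 1, 0, 1],  # same bursts plus background
    "stranger": [0, 2, 0, 1, 0, 0, 3, 0, 1, 0],  # unrelated activity
}

# One pass over all other users: N - 1 pairwise evaluations per target.
target = traces["target"]
scores = {user: ccod_statistic(target, series, max_shift=2)
          for user, series in traces.items() if user != "target"}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this toy example the candidate whose bursts coincide with the target's receives the larger statistic. In the paper the statistic is passed to a trained classifier rather than simply ranked, and the figures use shifts up to 20 seconds instead of the 2 used here.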

IV. EXPERIMENTAL RESULTS AND FINDINGS

Performance of the correlation outlier approach was tested on a data set collected from the Yahoo! IM service [8]. The entire data set was collected from the greater New York area over a duration of 24 hours. We were provided a flow log file (containing time and anonymized user IDs) and the ground truth user communication file for testing the performance of the proposed CCOD approach. Analysis of the flow log data set revealed a large amount of noise in the form of about 200 control and service messages, resulting in a mean SNR of −6.11 dB and a standard deviation of 5.81 dB. Fig. 2a and Fig. 2b show the correlation statistics with time shifts $-20 \le k \le +20$. Unlike non-talking pairs, all talking pairs show a large correlation statistic without any time shift. Any increase in time shift decreases the correlation statistic. Correlation does not drop to zero at most time shifts because of the high levels of background traffic / low SNR of the data set. A few pairs show large correlation statistics despite time shifts due to the variability of the data set. Using a number of time shifts large enough to estimate $\sigma_{XY}$ can remove this anomaly in classifier design. Fig. 3a and Fig. 3b present the probability density function of the computed $(R_{XY}(0) - \mu_{XY})/\sigma_{XY}$ for all talking and non-talking pairs. Results show that talking pairs have a larger spread than non-talking pairs. The data set used for the results in this section is of 10 minute duration, containing 1962 talking and 3000 non-talking user pairs.

Fig. 1: Collecting traffic around gateway and IM Data Center. [Figure not reproduced; recoverable labels: IM Data Center, Gateway, Internet, users u, and per-user plots of packet size versus time.]

Tab. I presents the classifier results. WEKA was used to design classifiers with 10-fold cross validation [9]. Classifiers showing promising results include Naïve Bayes, SVM and the C4.5 decision tree. Results show that the best performance is obtained for the C4.5 classifier, providing 88% TP rate, 3% FP rate, 95% precision, 88% recall and 96% ROC area.
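The reported C4.5 figures can be cross-checked against the class sizes of the data set (1962 talking and 3000 non-talking pairs). The sketch below assumes that the TP and FP rates refer to the talking class and that precision is computed as TP / (TP + FP); WEKA may report per-class or weighted-average values, so this is only a consistency check, and the function name is hypothetical.

```python
def rates_to_metrics(tp_rate, fp_rate, n_pos, n_neg):
    """Convert class-level rates into expected counts, precision and recall."""
    tp = tp_rate * n_pos          # expected number of true positives
    fp = fp_rate * n_neg          # expected number of false positives
    precision = tp / (tp + fp)
    return {"tp": tp, "fp": fp, "precision": precision, "recall": tp_rate}


# C4.5 rates as reported in the text, with the pair counts of the data set.
c45 = rates_to_metrics(tp_rate=0.88, fp_rate=0.03, n_pos=1962, n_neg=3000)
```

Under these assumptions the implied precision rounds to 0.95, which agrees with the value reported for C4.5.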

Tab. II presents the performance comparison of the previous TSC and COLD approaches with CCOD. CCOD provides 88% TP rate while TSC and COLD show only 63% and 55% TP rates, respectively. Similarly, CCOD shows a 3% FP rate while TSC and COLD show 4% and 3% FP rates, respectively. The other statistics follow a similar pattern. TSC and COLD showed poor de-anonymization statistics in comparison with CCOD.

V. RELATED WORK

A. Mix Network De-anonymization

Several studies have focused on the de-anonymization of mix networks. In [10], Troncoso and Danezis use a Markov Chain Monte Carlo engine to obtain the probabilities for link de-anonymization. The authors limited the network size to only 10 nodes and collected data from multiple points. In [11], the authors estimate the mutual information of all the information entering and leaving the mixers. Their algorithm is costly because of its large computation time. Time series correlation has been the most logical de-anonymization technique but it provides only 63% TP rate for IM networks [4] [5]. IM networks differ from mix networks due to the large number of control and server messages, delays, jitter, lack of network visibility and large number of users.

TABLE I: Performance of CCOD on Yahoo! data set.

Classifier    TP Rate  FP Rate  Precision  Recall  ROC
Naïve Bayes   0.61     0.04     0.92       0.61    0.91
SVM           0.89     0.04     0.94       0.89    0.93
C4.5          0.88     0.03     0.95       0.88    0.96

TABLE II: Performance comparison of CCOD with previous de-anonymization approaches [4][5].

Technique  TP    FP    Precision  Recall  ROC
TSC        0.63  0.04  0.76       0.63    0.85
COLD       0.55  0.03  0.78       0.55    0.84
CCOD       0.88  0.03  0.95       0.88    0.96

B. Social Network De-anonymization

Using people’s friendship graphs to de-anonymize networks has been a major focus of various studies. In [12], the authors calculate the distances in d-dimensional data sets of users to measure the user separation. In [13], the authors develop an identification algorithm to de-anonymize the users. In [14], [15], group membership information has been used to trace the users. All these approaches are ineffective for IM networks because they require connectivity and topological information of users.

C. Tor Network Deanonymization

A number of studies have focused on the deanonymization of the Tor network, which uses a distributed network architecture. The location of hidden services is identified by using route selection, timing signatures and delays [16][17][18]. Several studies focus on the identification of Tor traffic from other network traffic using traffic fingerprints, packet sizes and proxy nodes [19][20][21]. A large number of breaching attempts have been made using unpopular ports, circuit clogging and man-in-the-middle strategies [22][23][24]. “Correlation” based attacks have been used to identify Tor traffic by using delays, times and stream sizes [25][26][27]. However, no previous study has used the deviation of the non-shifted correlation from the mean value, like our scheme, to deanonymize the Tor network.

D. IM Network De-anonymization

In our study [4], we used a wavelet-based decomposition approach, COLD, to de-anonymize IM users. Correlation of coefficients was performed over the decomposed user time series at multiple frequency scales. Our experiments showed hit rates of 90% to 98% when the candidate set sizes varied from 10 to 20. However, COLD showed poor performance for a candidate set size of 1. In our recent study [5], we used causality-based detection approaches for link de-anonymization. That study showed a TP rate of 99% for the best-performing KCI causality-based approach. Large computation time and complexity were the major disadvantages of some causality-based approaches. In this paper, we focus on a novel correlation outlier approach with better performance and computation time.

Fig. 2: Correlation with time shift for (a) talking pairs, and (b) non-talking pairs. [Plots not reproduced; horizontal axis: time shift (k) from −20 to 20; vertical axis: correlation R(t).]

Fig. 3: CCOD statistics for (a) talking pairs, and (b) non-talking pairs. [Plots not reproduced; horizontal axis: [R(0) − μ] / σ from −10 to 100; vertical axis: PDF.]

VI. CONCLUSION

Our paper presents a novel Cross-correlation Outlier Detector (CCOD) approach to breach user sessions in encrypted IM networks using only flow log data. Our approach builds on the observation that talking pairs show a large correlation statistic without any time shift. Experiments on a real-world Yahoo! Messenger data set showed a maximum of 88% TP rate, 3% FP rate, 95% precision, 88% recall and 96% ROC area. The study suggests that CCOD provides better performance than the majority of previous approaches with much less computational complexity. In the future, we aim to deanonymize user links in the Tor network through real-world experiments using our proposed approaches.

ACKNOWLEDGMENT

This research is part of the project “Detecting covert links in instant messaging networks using flow level log data” supported by the National ICT R&D Fund, Pakistan. We are thankful to the people at the Yahoo! Webscope program for providing us the real-world data and assisting us in the estimation of users’ vulnerability in IM applications.

REFERENCES

[1] CMO Council. Engage at Every Stage: Using Mobile Relationship Marketing (MRM) to Put More Interaction in the Hands of the Customer Report, January 2012.

[2] D. Sancho. Large Data Breach in South Korea, Data of 35M Users Stolen, July 2013.

[3] E. Chabrow. NSA E-Spying: Bad Governance, October 2013.

[4] M. U. Ilyas, M. Z. Shafiq, A. X. Liu, and H. Radha. Who are you talking to? Breaching privacy in encrypted IM networks. In Intl. Conf. on Network Protocols, 2013.

[5] S. Saleh, M. Raja, M. Shahnawaz, M. U. Ilyas, K. Khurshid, M. Z. Shafiq, A. X. Liu, H. Radha, and S. S. Karande. Breaching IM session privacy using causality. In Global Telecommunications Conference (GLOBECOM), 2014.

[6] B. Crowley and S. Almasy. Harvard student charged in bomb hoax out on 100,000 dollars bail, December 2013.

[7] B. Claise. Cisco systems NetFlow services export version 9. Wikipedia, the free encyclopedia, October 2004.

[8] Yahoo! Webscope dataset ydata-ymessenger-client-server-protocol-events-v1-0. [http://research.yahoo.com/Academic Relations].

[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.

[10] C. Troncoso and G. Danezis. The Bayesian traffic analysis of mix networks. In ACM Computer and Communications Security, 2009.

[11] Y. Zhu, X. Fu, R. Bettati, and W. Zhao. Anonymity Analysis of Mix Networks against Flow-Correlation Attacks. In IEEE Global Communications Conference (GLOBECOM), 2005.

[12] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, 2008.

[13] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Symposium on Security and Privacy, 2009.

[14] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack to de-anonymize social network users. In IEEE Symposium on Security and Privacy, 2010.

[15] E. Zheleva and L. Getoor. To Join or not to Join: The Illusion of Privacy in Social Networks with Mixed Public and Private User Profiles. In World Wide Web (WWW) Conference, 2009.

[16] L. Overlier and P. Syverson. Locating hidden servers. In IEEE Symposium on Security and Privacy, 15 pp., 2006.

[17] J. A. Elices, F. Perez-Gonzalez, and C. Troncoso. Fingerprinting Tor’s hidden service log files using a timing channel. In IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6, 2011.

[18] K. Loesing, W. Sandmann, C. Wilms, and G. Wirtz. Performance measurements and statistics of Tor hidden services. In International Symposium on Applications and the Internet (SAINT), pages 1–7, 2008.

[19] X. Bai, Y. Zhang, and X. Niu. Traffic identification of Tor and Web-Mix. In Eighth International Conference on Intelligent Systems Design and Applications (ISDA), volume 1, pages 548–551, 2008.

[20] J. Barker, P. Hannay, and P. Szewczyk. Using traffic analysis to identify the second generation onion router. In IFIP 9th International Conference on Embedded and Ubiquitous Computing (EUC), pages 72–78, 2011.

[21] S. Chakravarty, A. Stavrou, and A. D. Keromytis. Identifying proxy nodes in a Tor anonymization circuit. In IEEE International Conference on Signal Image Technology and Internet Based Systems (SITIS), pages 633–639, 2008.

[22] M. A. Sulaiman and S. Zhioua. Attacking Tor through unpopular ports. In IEEE 33rd International Conference on Distributed Computing Systems Workshops (ICDCSW), pages 33–38, 2013.

[23] E. Chan-Tin, J. Shin, and J. Yu. Revisiting circuit clogging attacks on Tor. In Eighth International Conference on Availability, Reliability and Security (ARES), pages 131–140, 2013.

[24] X. Wang, J. Luo, M. Yang, and Z. Ling. A novel flow multiplication attack against Tor. In 13th International Conference on Computer Supported Cooperative Work in Design (CSCWD), pages 686–691, 2009.

[25] S. J. Murdoch and G. Danezis. Low-cost traffic analysis of Tor. In IEEE Symposium on Security and Privacy, pages 183–195, 2005.

[26] L. Zhang, J. Luo, M. Yang, and G. He. Application-level attack against Tor’s hidden service. In 6th International Conference on Pervasive Computing and Applications (ICPCA), pages 509–516, 2011.

[27] M. Song, G. Xiong, Z. Li, J. Peng, and L. Guo. A de-anonymize attack method based on traffic analysis. In 8th International ICST Conference on Communications and Networking in China (CHINACOM), pages 455–460, 2013.

NAVAL POSTGRADUATE SCHOOL MONTEREY, CALIFORNIA THESIS SUSCEPTIBILITY OF INSTANT MESSAGE METADATA TO TRAFFIC ANALYSIS

Preprint

Full-text available

Jun 2020

Instant Message (IM) applications are commonly used by both civilian and DoD personnel for both communication and collaboration. The web-based variants of these applications generally ride encrypted channels for message security. However, these channels may be vulnerable to keystroke timing attacks whereby textual content is determined by the timing of network traffic induced by keyboard events. An example of this induced traffic is the activity notifications common to many of these platforms, indicating when a conversant begins typing. Our aim is to determine whether the network traffic that carries this metadata enables recovering portions of the message or leaks information about the sender's identity. Using a combination of network packet capture analysis and local keystroke logging, we characterize traffic patterns of three widely used web-based IM platforms: Facebook Messaging, Google Hangouts, and IRC through the Kiwi IRC web client.

SSLSS: Semi-Supervised Learning-Based Steganalysis Scheme for Instant Voice Communication Network

Article

Full-text available

Nov 2018

In the field of instant voice communication steganalysis, the traditional detecting methods are mainly based on supervised learning scheme that results in a large amount of complex manual preprocessing training data set. The accuracy of supervised learning scheme can be easily destroyed by the difference between the distribution of training and testing data set in the actual voice application. The disadvantages of this method are obvious in the big data environment. In this regard, this paper initially introduced a novel semi-supervised hybrid learning detection model for the instant voice communication network. This provides the progress of manually annotating training data set, that has been removed to solve the problem of complex operations and poor applicability in classifier. Therefore, this model has a simpler structure and more extensive detection scopes with the huge amount of data. Then, we designed a multi-criteria fusion module, which can automatically generate the pseudo-label set from testing data set to train the classifier model. Thus, our scheme will not be affected by the distribution shift. In this module, we defined the confidence level and representative level to judge the feature vector for pseudo-labeled. Through the experimental analysis, the low bit-rate speech coding steganalysis (G.723.1/G.729/iLBC speech codecs) is analyzed on quantization index modulation (QIM), which are common codecs in instant voice communication network. The results show that our method has higher accuracy than un-supervised method. The proposed approach is less affected and more accurate than the previous supervised methods through the distribution of different training and testing data sets. The experiments also proved that our method can be deployed in the different kinds of instant voice communication (IVC) codec by considering huge amount of data set.

Network Traffic and User Behavior Analysis of Internet-Based Mobile Messaging Applications: A Case of WeChat

Conference Paper

Aug 2016

In this paper, we extract the characteristics of WeChat traffic and propose approaches to identify WeChat traffic in cellular data network. WeChat communication mechanisms are discussed. The traffic and usage pattern of Video Call service provided by WeChat are studied from massive traffic data using Spark, differently from previous methods. We analyze the features of WeChat Video Call service, a Voice over Internet Protocol (VoIP) application in three aspects, which are (i) daily/weekly usage pattern, (ii) traffic/usage distribution, (iii) conversation time distribution. Our analysis has two important features. Firstly, the massive mobile subscriber data we used in our experiments was collected from a commercial Internet Service Provider (ISP) covering an entire province in Northern China ensuring that the results reflect the real characteristics of service in question in cellular network. Secondly, we investigate that the WeChat Video Call usage times fit with the power law distribution. Our results are important for cellular network operators and service providers to realize WeChat traffic identification methods and user behavior of Video Call, which imply information for optimization of their services.

Breaching IM Session Privacy Using Causality

Conference Paper

Full-text available

Dec 2014

The breach of privacy in encrypted instant mes-senger (IM) service is a serious threat to user anonymity. Performance of previous de-anonymization strategies was limited to 65%. We perform network de-anonymization by taking advan-tage of the cause-effect relationship between sent and received packet streams and demonstrate this approach on a data set of Yahoo! IM service traffic traces. An investigation of various measures of causality shows that IM networks can be breached with a hit rate of 99%. A KCI Causality based approach alone can provide a true positive rate of about 97%. Individual performances of Granger, Zhang and IGCI causality are limited owing to the very low SNR of packet traces and variable network delays.

Fingerprinting Tor's hidden service log files using a timing channel

Article

Full-text available

Nov 2011

Hidden services are anonymously hosted services that can be accessed over Tor, an anonymity network. In this paper we present an attack that allows an entity to prove, once a machine suspect to host a hidden server has been confiscated, that such machine has in fact hosted a particular content. Our solution is based on leaving a timing channel fingerprint in the confiscated machine's log file. In order to be able to fingerprint the log server through Tor we first study the noise sources: the delay introduced by Tor and the log entries due to other users. We then describe our fingerprint method, and analytically determine the detection probability and the rate of false positives. Finally, we empirically validate our results.

Traffic Identification of Tor and Web-Mix

Conference Paper

Full-text available

Dec 2008

With the wide use of anonymity tools, both blocking and anti-blocking of these tools have become hot topics. And the traffic identifications of the corresponding tools are key issues of both blocking and anti-blocking. In this paper, we address on identifying Tor and Web-Mix traffics, which are two of the most famous anonymity tools. Taking advantage of the typical methods for traffic identification, we proposed a traffic identification scheme based on traffic fingerprint extraction and matching. The fingerprints comprise of the specific strings, packet length and frequency of the packets' sending time. The details of design and implementation of such traffic identification scheme for both Tor and Web-Mix are presented. The feasibility of the proposed scheme is shown by the simulation experiments results.

Who are you talking to? Breaching privacy in encrypted IM networks

Conference Paper

Oct 2013

We present a novel attack on relayed instant messaging (IM) traffic that allows an attacker to infer who's talking to whom with high accuracy. This attack only requires collection of packet header traces between users and IM servers for a short time period, where each packet in the trace goes from a user to an IM server or vice-versa. The specific goal of the attack is to accurately identify a candidate set of top-k users with whom a given user possibly talked to, while using only the information available in packet header traces (packet payloads cannot be used because they are mostly encrypted). Towards this end, we propose a wavelet-based scheme, called COmmunication Link De-anonymization (COLD), and evaluate its effectiveness using a real-world Yahoo! Messenger data set. The results of our experiments show that COLD achieves a hit rate of more than 90% for a candidate set size of 10. For slightly larger candidate set size of 20, COLD achieves almost 100% hit rate. In contrast, a baseline method using time series correlation could only achieve less than 5% hit rate for similar candidate set sizes.

Attacking Tor through Unpopular Ports

Conference Paper

Jul 2013

Anonymity systems try to conceal the relationship between the communicating entities in network communication. Popular systems, such as Tor and JAP, achieve anonymity by forwarding the traffic through a sequence of relays. In particular, the Tor protocol constructs a circuit of typically 3 relays such that no single relay knows both the source and destination of the traffic. A known attack on Tor consists of injecting a set of compromised relays and waiting until a Tor client picks two of them as entry (first) and exit (last) relays. With the currently large number of relays, this attack is no longer scalable. In this paper, we take advantage of the presence of unpopular ports in the Tor network to significantly increase the scalability of the attack: instead of injecting typical Tor relays (with the default exit policy), we inject only relays allowing unpopular ports. Since only a small fraction of Tor relays allow unpopular ports, the compromised relays will outnumber the valid ones. We show how Tor traffic can be redirected through unpopular ports. The experimental analysis shows that by injecting a relatively small number of compromised relays (30 pairs of relays) allowing a certain unpopular port, more than 50% of constructed circuits are compromised.

Revisiting Circuit Clogging Attacks on Tor

Conference Paper

Sep 2013

Tor is a popular anonymity-providing network used by over 500,000 users daily. The Tor network is made up of volunteer relays. To anonymously connect to a server, a user first creates a circuit, consisting of three relays, and routes traffic through these proxies before connecting to the server. The client is thus hidden from the server through three Tor proxies. If the three Tor proxies used by the client could be identified, the anonymity of the client would be reduced. One particular way of identifying the three Tor relays in a circuit is to perform a circuit clogging attack. This attack requires the client to connect to a malicious server (malicious content, such as an advertising frame, can be hosted on a popular server). The malicious server alternates between sending bursts of data and sending little traffic. During the burst period, the three relays used in the circuit will take longer to relay traffic due to the increase in processing time for the extra messages. If Tor relays are continuously monitored through network latency probes, an increase in network latency indicates that this Tor relay is likely being used in that circuit. We show, through experiments on the real Tor network, that the Tor relays in a circuit can be identified. A detection scheme is also proposed for clients to determine whether a circuit clogging attack is happening. The costs for both the attack and the detection mechanism are small and feasible in the current Tor network.

A de-anonymize attack method based on traffic analysis

Conference Paper

Aug 2013

While providing protection for users' privacy, anonymity networks have also been exploited by criminals to commit crimes anonymously. In this paper, we study the problem of how to break the unlinkability between senders and recipients in order to identify the source of anonymous traffic. Tor, the most widely deployed anonymity network, is selected as our target. We develop a de-anonymization attack method based on traffic analysis and choose {time, stream size} as features for the k-means algorithm to mine the association between the first-hop traffic and last-hop traffic of Tor. Experiments show that our method is effective against Tor.

Cisco Systems NetFlow Services Export Version 9

Article

Jan 2004

B. Claise

Application-level attack against Tor's hidden service

Article

Oct 2011

Tor has become one of the most popular overlay networks for anonymizing TCP traffic. The hidden service provided by Tor allows users to run a TCP server under a pseudonym, and its resources can be accessed without the operator's real identity being revealed. In this paper, we propose a novel HTTP-based application-level attack against Tor's hidden web service. Under the assumption that the entry of the suspected hidden server's circuit is occupied, we evaluate the time correlation between the web accesses and the generated traffic in the malicious onion router. Furthermore, we analyze the probability that the malicious onion routers occupy the entry of the hidden server's circuit when they advertise high bandwidth, which is the foundation of our attack. We conducted real-world experiments to evaluate our attack method. The empirical results demonstrate that the hidden service can be effectively and efficiently located.

Identifying Proxy Nodes in a Tor Anonymization Circuit

Conference Paper

Jan 2009

We present a novel, practical, and effective mechanism that exposes the identity of Tor relays participating in a given circuit. Such an attack can be used by malicious or compromised nodes to identify the rest of the circuit, or as the first step in a follow-on trace-back attack. Our intuition is that by modulating the bandwidth of an anonymous connection (e.g. when the destination server, its router, or an entry point is under our control), we create observable fluctuations that propagate through the Tor network and the Internet to the end-user's host. To that end, we employ LinkWidth, a novel bandwidth-estimation technique. LinkWidth enables network edge-attached entities to estimate the available bandwidth in an arbitrary Internet link without a cooperating peer host, router, or ISP. Our approach also does not require compromise of any Tor nodes. In a series of experiments against the Tor network, we show that we can accurately identify the network location of most participating Tor relays.
