IM Session Identification by Outlier Detection in
Cross-correlation Functions
Saad Saleh, Muhammad U. Ilyas, Khawar Khurshid, Alex X. Liu and Hayder Radha§
Dept. of Electrical Engineering, School of Electrical Engineering and Computer Science,
National University of Sciences and Technology, H-12, Islamabad – 44000, Pakistan
Dept. of Computer Science and Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA
§Dept. of Electrical and Computer Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA
Email: {saad.saleh, usman.ilyas, khawar.khurshid}@seecs.edu.pk,
alexliu@cse.msu.edu, radha@egr.msu.edu§
Abstract—The identification of encrypted Instant Messaging
(IM) channels between users is made difficult by the presence of
variable and high levels of uncorrelated background traffic. In
this paper, we propose a novel Cross-correlation Outlier Detector
(CCOD) to identify communicating end-users in a large group of
users. Our technique uses traffic flow traces between individual
users and IM service provider’s data center. We evaluate the
CCOD on a data set of Yahoo! IM traffic traces with an average
SNR of −6.11 dB (data set includes ground truth). Results show
that our technique achieves an 88% true positive (TP) rate, a 3%
false positive (FP) rate and a 96% ROC area. Performance of
the previous correlation-based schemes on the same data set was
limited to 63% TP rate, 4% FP rate and 85% ROC area.
Keywords- Link de-anonymization; instant messaging; se-
curity; privacy;
I. INTRODUCTION
A. Background and Motivation
Instant Messaging (IM) services are projected to reach
1.4 billion users worldwide by 2016 [1]. IM services provide
mostly free, ubiquitous access, mobility and privacy. However,
user privacy has been under attack for unlawful reasons in
the past few years, e.g. the theft of data of 35 million users
in Korea in 2013 [2]. Government agencies, including the
National Security Agency (NSA), have also breached the
privacy of millions of users [3]. The aim of our research is
to assess the vulnerability of IM sessions to de-anonymization
attacks (identifying who is communicating with whom) using
only transport layer session traces. Such link de-anonymization
is challenging for the following reasons: (1) IM messages
are now often encrypted (only IP and TCP headers are
visible, and even these are infeasible to log on a large scale),
(2) the IM data center establishes a separate TCP connection
with each of the source and destination users (no packet ever
contains the IP addresses of both end users).
The complexity of de-anonymization increases further in the
following practical scenarios: (1) multiple simultaneous message
sessions by a user, (2) thousands of users communicating
through the IM data center at any time, (3) duplicate packets due
to retransmissions and (4) out-of-order packet delivery.
B. Limitations of Prior Art
Several prior works have focused on link de-
anonymization. Time Series Correlation (TSC), the baseline
approach, has a TP rate of 63% at a signal-to-noise ratio
(SNR) of −6.11 dB [4]. Major factors for performance
deterioration include delay, jitter, buffering, reordering,
duplicate messages and server messages. In the area of
de-anonymization of mix networks, several studies have focused
on the computation of mutual information between ingress
and egress traffic flows. High time complexity and reliance
on data that needs to be collected from multiple points
inside the network are their major limitations. In social-network
de-anonymization, data sparsity and membership information
are used to de-anonymize networks. Here, the requirement of
detailed user information becomes a major limiting factor.
Several de-anonymization attempts have been made on the Tor
network, but the major emphasis was the identification of
traffic using various fingerprints. Correlation-based techniques
have seen only limited use in these breaching attempts.
In our previous works, we showed that the correlation
of wavelet decomposed time series of users’ traffic traces
can successfully breach IM session privacy [4]. In a recent
study, we showed that the cause-effect relationship between
packets appearing in two communicating (“talking”) users’
traffic traces can be leveraged to de-anonymize user sessions
[5]. Time complexity was a major limiting factor for these
approaches.
C. Proposed Approach
In this paper, we propose a novel Cross-correlation Outlier
Detector (CCOD) to de-anonymize users’ IM sessions using
only undirected transport layer traffic traces collected at the
data center (or gateway), in the form of flow logs. Our idea
leverages the limited delay between the appearance of a packet
in a sending user’s traffic trace and its appearance in the
receiving user’s traffic trace. We expect talking users to have
a high cross-correlation statistic, while for non-talking users
the appearance of packets in the same time slot is expected
to be coincidental. To de-anonymize a user we compute the
cross-correlation function of the time series of her traffic trace
derived from her flow log with the respective time series of all
other users. Therefore, the time-complexity of our approach is
Θ(N) per target user, where N is the number of IM users. Next, we estimate
the distribution of the cross-correlation function at all non-zero
time-shifts. Finally, we apply a binary classifier to the value of
the cross-correlation function at zero time shift and determine
whether that value is drawn from the preceding distribution or
not.
D. Experimental Results and Findings
We evaluated the performance of CCOD on a real world
Yahoo! IM based data set collected from the greater New York
area over a duration of 24 hours. Classification rates were
estimated for Naïve Bayes, SVM and C4.5 classifiers. Results
showed that CCOD provides up to an 88% TP rate and a 3% FP
rate with 95% precision and 88% recall. Low SNR had the
strongest adverse effect on classifier performance. We
compared CCOD with TSC and with the wavelet-based decomposition
approach called COLD [4][5]. The correlation outlier approach
outperformed both TSC and COLD.
E. Key Contributions
Our major contributions are as follows:
1) We propose CCOD, a novel link de-anonymization
approach for IM user sessions (see Section III).
2) We validated the approach on a real world Yahoo!
messenger data set (see Section IV).
3) We compared the performance gains of our CCOD
approach with previous state-of-the-art schemes (see
Tab. II).
Paper Organization:
The rest of the paper is organized as follows: Section II
presents the IM network scenario and de-anonymization
challenges. Section III presents our proposed approach.
Results on the Yahoo! data set are discussed
in Section IV. Section V presents the related work. Finally,
Section VI concludes the paper.
II. PROBLEM DESCRIPTION
When two users $u_1$ and $u_2$ communicate with each other
via a relayed IM service, two TCP connections are established:
one between user $u_1$ and the IM data center
and another between user $u_2$ and the IM data center.
Every flow in the source user's time series appears in the
destination user's time series
with some delay and possible reordering. Fig. 1 shows the
setup used for data collection of IM traffic traces. Port filtering
is used to isolate IM traffic from other traffic to/from the data
center. Users are identified by their IP addresses. Flow logs
can be collected at two possible locations: (1) at the gateway
to the IM data center, or (2) at the gateway or proxy server of
a subnet. Logging at the IM data center can de-anonymize all
sessions established through the data center. In contrast,
traffic logged at the gateway of a subnet can only be used to
de-anonymize conversations between users located inside the
subnet¹. Traffic logging tools record a number of packet header
fields, including source and destination
IP addresses, source and destination port numbers, protocol
versions, time stamps and packet sizes [7]. Owing to resource
constraints, the flow logs in the data set, which were
collected at the data center, contained only timestamps and user
IDs / IP addresses [8]. We aim to answer the question, “Is it
possible to link two communicating IM users from among a
large set of users using only user-server flow logs?”
¹ In a recent case of a bomb hoax at Harvard University, the suspect was
identified based upon the local subnet traffic at Harvard University [6].
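As an illustration of the input described in this section, the following minimal Python sketch bins a flow log of (timestamp, user ID) records into one packet-count time series per user. The function name `flow_log_to_series`, the record layout and the toy records are assumptions made for this example only; the paper does not specify its preprocessing code. The one-second slot width mirrors the time unit used for the shifts in Section III.

```python
from collections import defaultdict


def flow_log_to_series(records, start, duration, slot=1.0):
    """Bin (timestamp, user_id) records into per-user packet counts per slot.

    Hypothetical helper: `start` and `duration` are in seconds and `slot`
    is the slot width in seconds.
    """
    n_slots = int(duration / slot)
    series = defaultdict(lambda: [0] * n_slots)
    for timestamp, user_id in records:
        index = int((timestamp - start) // slot)
        if 0 <= index < n_slots:  # ignore records outside the window
            series[user_id][index] += 1
    return dict(series)


# Toy flow log (made-up values): timestamp in seconds, anonymized user ID.
records = [(0.2, "u1"), (0.9, "u2"), (1.1, "u1"), (1.4, "u1"), (3.7, "u2")]
series = flow_log_to_series(records, start=0.0, duration=5.0)
print(series["u1"])  # [1, 2, 0, 0, 0]
print(series["u2"])  # [1, 0, 0, 1, 0]
```

Because only timestamps and user IDs are available, the resulting series are undirected: a slot counts packets without distinguishing sent from received ones.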
III. PROPOSED DE-ANONYMIZATION APPROACH
We propose a cross-correlation outlier detector to map an
IM user to the user(s) she is communicating with. Let $X(t)$
and $Y(t)$ be the respective time series of packets sent to and
received from the IM data center by two users $X$ and $Y$,
respectively. Let $R_{XY}(t-k)$ denote the cross-correlation of
flow log $X(t)$ and the $k$ time unit (in our case, seconds) delayed
flow log $Y(t-k)$. The step-by-step procedure of
our approach is as follows:
1) Compute $R_{XY}(0)$ of the flow logs $X(t)$ and $Y(t)$.
2) Compute $R_{XY}(t-k)$ of the flow logs $X(t)$ and $Y(t-k)$,
where $k \in \mathbb{Z} \cap \{[-Z, +Z] \setminus 0\}$, i.e. $k$ varies from
$-Z$ to $+Z$, except 0.
3) Estimate the standard deviation $\sigma_{XY}$ of all $2 \times Z$
values $R_{XY}(t-k)$ from the previous step.
4) Compute the deviation of $R_{XY}(0)$ from the mean
$\mu_{XY}$ of the distribution of $R_{XY}(t-k)$ of all other
values, i.e. $R_{XY}(0) - \mu_{XY}$.
5) Normalize the deviation of $R_{XY}(0)$
($R_{XY}(0) - \mu_{XY}$) by the standard deviation $\sigma_{XY}$,
i.e. $(R_{XY}(0) - \mu_{XY}) / \sigma_{XY}$.
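The five steps above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes an unnormalized sum-of-products cross-correlation and the population standard deviation, and the function names, the toy traces and the parameter `max_shift` (playing the role of $Z$) are hypothetical.

```python
from statistics import mean, pstdev


def cross_correlation(x, y, k):
    """Unnormalized cross-correlation sum_t x[t] * y[t - k] (assumed form)."""
    n = len(x)
    return sum(x[t] * y[t - k] for t in range(n) if 0 <= t - k < n)


def ccod_statistic(x, y, max_shift):
    """Return (R(0) - mu) / sigma over the 2 * max_shift non-zero shifts."""
    r0 = cross_correlation(x, y, 0)                      # step 1
    others = [cross_correlation(x, y, k)                 # step 2
              for k in range(-max_shift, max_shift + 1) if k != 0]
    sigma = pstdev(others)                               # step 3
    deviation = r0 - mean(others)                        # step 4
    if sigma == 0:
        return 0.0  # flat profile: a convention chosen for this sketch
    return deviation / sigma                             # step 5


# Toy traces (made up): one packet-count value per one-second slot.
n = 60
bursts = {3, 10, 11, 25, 26, 40, 52}
x = [1 if t in bursts else 0 for t in range(n)]
# A "talking" partner repeats x's bursts and adds unrelated background.
y_talking = [x[t] + (1 if t % 9 == 0 else 0) for t in range(n)]
# A "non-talking" user only has unrelated background traffic.
y_other = [1 if t % 9 == 4 else 0 for t in range(n)]

z_talking = ccod_statistic(x, y_talking, max_shift=5)
z_other = ccod_statistic(x, y_other, max_shift=5)
print(z_talking > z_other)  # True
```

In this toy setting the talking pair yields a statistic several standard deviations above the mean of the shifted values, while the unrelated pair stays within one standard deviation.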
Our breaching strategy is to exploit the “high correlation”
between undirected flow logs of communicating users,
where the threshold for what constitutes high correlation is
determined by the value of $\sigma_{XY}$ for each pair of users
$X$ and $Y$. Without any time shift, talking users have high
cross-correlation. The server introduces a processing
delay for packets, which decreases the correlation statistics,
and noise decreases them further. Nevertheless, any increase
in time shift decreases the correlation sharply. To increase the
performance gains, we developed our own measure, which
computes the deviation of the central value (the non-shifted
correlation statistic) from the mean of all other values. The
resulting metric is expressed in units of the standard deviation
of all other values.
IV. EXPERIMENTAL RESULTS AND FINDINGS
Performance of the correlation outlier approach was tested
on a data set collected from Yahoo! IM service [8]. The
entire data was collected from the greater New York area over
a duration of 24 hours. We were provided a flow log file
(containing time and anonymized user IDs) and the ground
truth user communication file for testing the performance of
the proposed CCOD approach. Analysis of flow log data set
revealed substantial noise in the form of about 200 control
and service messages, resulting in a mean SNR of −6.11 dB with
a standard deviation of 5.81 dB. Fig. 2a and Fig. 2b show the
correlation statistics for time shifts $-20 \le k \le +20$. Unlike
non-talking pairs, all talking pairs show a large correlation
statistic at zero time shift. Any increase in time shift
decreases the correlation statistic. Correlation does not drop
to zero at most time shifts because of the high levels of
background traffic / low SNR of the data set. A few pairs
show large correlation statistics at non-zero time shifts due to the
variability of the data set. Using a number of time shifts
large enough to estimate $\sigma_{XY}$ can remove this anomaly in
classifier design. Fig. 3a and Fig. 3b present the probability
density function of the computed $(R_{XY}(0) - \mu_{XY}) / \sigma_{XY}$ for
all talking and non-talking pairs. Results show that talking
pairs have a larger spread than non-talking pairs. The data set
used for the results in this section is of 10 minute duration
containing 1962 talking and 3000 non-talking user pairs.
[Figure 1 diagram labels: IM Data Center, Gateway, Internet, users u; per-user traces of Packet Size versus Time.]
Fig. 1: Collecting traffic around gateway and IM Data Center.
Tab. I presents the classifier results. WEKA was used to
design classifiers with 10-fold cross validation [9]. Classifiers
showing promising results include Naïve Bayes, SVM and the
C4.5 decision tree. The best performance is
obtained with the C4.5 classifier, which provides an 88% TP rate,
a 3% FP rate, 95% precision, 88% recall and a 96% ROC area.
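As a rough consistency check on these figures, the reported precision and recall can be recomputed from the TP and FP rates and the class sizes. The sketch below assumes that the rates refer to the talking class as the positive class and apply to the full set of 1962 talking and 3000 non-talking pairs; the helper name `confusion_from_rates` is hypothetical.

```python
def confusion_from_rates(n_pos, n_neg, tp_rate, fp_rate):
    """Expected confusion-matrix counts implied by the given rates."""
    tp = tp_rate * n_pos   # talking pairs correctly flagged
    fn = n_pos - tp        # talking pairs missed
    fp = fp_rate * n_neg   # non-talking pairs wrongly flagged
    tn = n_neg - fp        # non-talking pairs correctly rejected
    return tp, fn, fp, tn


# Class sizes and C4.5 rates as reported in the text.
tp, fn, fp, tn = confusion_from_rates(1962, 3000, tp_rate=0.88, fp_rate=0.03)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(round(precision, 2), round(recall, 2))  # 0.95 0.88
```

Under these assumptions the recomputed values agree with the 95% precision and 88% recall reported for C4.5.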
Tab. II presents the performance comparison of the previous
TSC and COLD approaches with CCOD. CCOD provides an
88% TP rate while TSC and COLD show only 63% and 55%
TP rates, respectively. CCOD shows a 3% FP rate
while TSC and COLD show 4% and 3% FP rates, respectively.
CCOD also leads on precision, recall and ROC area. Overall, TSC
and COLD showed poorer de-anonymization statistics than CCOD.
V. RELATED WORK
A. Mix Network De-anonymization
Several studies have focused on the de-anonymization of
mix networks. In [10], Troncoso and Danezis use a Markov
Chain Monte Carlo engine to obtain the probabilities for link
de-anonymization. The authors limited the network size to only 10
nodes and collected data from multiple points. In [11], the authors
estimate the mutual information between the traffic entering
and leaving the mixers. Their algorithm requires
a large computation time. Time series correlation has been
the most natural de-anonymization technique, but it provides
only 63% TP rate for IM networks [4] [5]. IM networks
differ from mix networks due to the large number of control and
server messages, delays, jitter, lack of network visibility and
large number of users.
TABLE I: Performance of CCOD on Yahoo! data set.
Classifier     TP Rate   FP Rate   Precision   Recall   ROC
Naïve Bayes    0.61      0.04      0.92        0.61     0.91
SVM            0.89      0.04      0.94        0.89     0.93
C4.5           0.88      0.03      0.95        0.88     0.96
TABLE II: Performance comparison of CCOD with previous
de-anonymization approaches [4][5].
Technique      TP        FP        Precision   Recall   ROC
TSC            0.63      0.04      0.76        0.63     0.85
COLD           0.55      0.03      0.78        0.55     0.84
CCOD           0.88      0.03      0.95        0.88     0.96
B. Social Network De-anonymization
Using people’s friendship graphs to de-anonymize net-
works has been a major focus of various studies. In [12],
authors calculate the distances in d-dimensional data sets of
users to measure the user separation. In [13], authors develop
an identification algorithm to de-anonymize the users. In [14],
[15], group membership information has been used to trace the
users. These approaches are ineffective for IM networks
because they require connectivity and topological
information about users.
C. Tor Network Deanonymization
A number of studies have focused on the deanonymization
of the Tor network, which uses a distributed network
architecture. The location of hidden services is identified by using route
selection, timing signatures and delays [16][17][18]. Several
studies focus on distinguishing Tor traffic from other
network traffic using traffic fingerprints, packet sizes and proxy
nodes [19][20][21]. A large number of breaching attempts have
been made using unpopular ports, circuit clogging and
man-in-the-middle strategies [22][23][24]. Correlation-based attacks
have been used to identify Tor traffic by using delays, times and
stream sizes [25][26][27]. However, no previous study has used
the deviation of the zero-time-shift correlation from the mean
of the time-shifted correlations, as our scheme does, to
deanonymize the Tor network.
D. IM Network De-anonymization
In our study [4], we used the wavelet based decompo-
sition approach to de-anonymize IM users. Correlation of
coefficients was performed over the decomposed user time
series at multiple frequency scales. Our experiments showed
hit rates of 90% to 98% when the candidate set sizes varied
from 10 to 20. However, COLD showed poor performance
for a candidate set size of 1. In our recent study [5], we
used causality-based detection approaches for link
de-anonymization. Our study showed a TP rate of 99% for the best
performing KCI causality-based approach. Large computation
time and complexity were the major disadvantages of some
causality-based approaches. In this paper, we focus on a
novel correlation outlier approach with better performance and
computation time.
[Fig. 2 plots: x-axis Time Shift (k), from −20 to 20; y-axis Correlation R(t); panels (a) Talking pairs and (b) Non-talking pairs.]
Fig. 2: Correlation with time shift for (a) talking pairs, and (b) non-talking pairs.
[Fig. 3 plots: x-axis [R(0) − μ] / σ; y-axis PDF; panels (a) Talking pairs and (b) Non-talking pairs.]
Fig. 3: CCOD statistics for (a) talking pairs, and (b) non-talking pairs.
VI. CONCLUSION
Our paper presents a novel Cross-correlation Outlier Detector
(CCOD) approach to breach user sessions in encrypted IM
networks using only flow log data. Our approach builds on the
observation that talking pairs show large correlation statistics
at zero time shift. Experiments on a real-world Yahoo! Messenger data
set showed a maximum of 88% TP rate, 3% FP rate, 95%
precision, 88% recall and 96% ROC area. The study suggests that
CCOD provides better performance than the majority of previous
approaches with much less computational complexity. In the
future, we aim to deanonymize user links in the Tor network
through real-world experiments using our proposed approach.
ACKNOWLEDGMENT
This research is part of the project “Detecting covert links
in instant messaging networks using flow level log data” supported
by National ICT R&D Fund, Pakistan. We are thankful
to the people at the Yahoo! Webscope program for providing us
the real-world data and assisting us in the estimation of users'
vulnerability in IM applications.
REFERENCES
[1] CMO Council. Engage at Every Stage: Using Mobile Relationship
Marketing (MRM) to Put More Interaction in the Hands of the Customer
Report, January 2012.
[2] D. Sancho. Large Data Breach in South Korea, Data of 35M Users
Stolen, July 2013.
[3] E. Chabrow. NSA E-Spying: Bad Governance, October 2013.
[4] M. U. Ilyas, M. Z. Shafiq, A. X. Liu, and H. Radha. Who are you
talking to? Breaching privacy in encrypted IM networks. In Intl. Conf.
on Network Protocols, 2013.
[5] S. Saleh, M. Raja, M. Shahnawaz, M. U. Ilyas, K. Khurshid, M. Z.
Shafiq, A. X. Liu, H. Radha, and S. S. Karande. Breaching IM session
privacy using causality. In Global Telecommunications Conference
(GLOBECOM), 2014.
[6] B. Crowley and S. Almasy. Harvard student charged in bomb hoax out
on 100,000 dollars bail, December 2013.
[7] B. Claise. Cisco Systems NetFlow Services Export Version 9.
RFC 3954, October 2004.
[8] Yahoo! Webscope dataset ydata-ymessenger-client-server-protocol-
events-v1-0. [http://research.yahoo.com/Academic Relations].
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
Witten. The WEKA data mining software: an update. SIGKDD
Explorations, 11(1):10–18, 2009.
[10] C. Troncoso and G. Danezis. The Bayesian traffic analysis of mix
networks. In ACM Computer and Communications Security, 2009.
[11] Y. Zhu, X. Fu, R. Bettati, and W. Zhao. Anonymity Analysis of
Mix Networks against Flow-Correlation Attacks. In IEEE Global
Communications Conference (GLOBECOM), 2005.
[12] A. Narayanan and V. Shmatikov. Robust de-anonymization of large
sparse datasets. In IEEE Symposium on Security and Privacy, 2008.
[13] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In
IEEE Symposium on Security and Privacy, 2009.
[14] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack
to de-anonymize social network users. In IEEE Symposium on Security
and Privacy, 2010.
[15] E. Zheleva and L. Getoor. To Join or not to Join: The Illusion of Privacy
in Social Networks with Mixed Public and Private User Profiles. In
World Wide Web (WWW) Conference, 2009.
[16] Lasse Overlier and Paul Syverson. Locating hidden servers. In Security
and Privacy, 2006 IEEE Symposium on, pages 15–pp. IEEE, 2006.
[17] Juan A Elices, Fernando Perez-Gonzalez, and Carmela Troncoso. Fin-
gerprinting tor’s hidden service log files using a timing channel. In
Information Forensics and Security (WIFS), 2011 IEEE International
Workshop on, pages 1–6. IEEE, 2011.
[18] Karsten Loesing, Werner Sandmann, Christian Wilms, and Guido Wirtz.
Performance measurements and statistics of tor hidden services. In Ap-
plications and the Internet, 2008. SAINT 2008. International Symposium
on, pages 1–7. IEEE, 2008.
[19] Xuefeng Bai, Yong Zhang, and Xiamu Niu. Traffic identification of tor
and web-mix. In Intelligent Systems Design and Applications, 2008.
ISDA’08. Eighth International Conference on, volume 1, pages 548–
551. IEEE, 2008.
[20] John Barker, Peter Hannay, and Patryk Szewczyk. Using traffic analysis
to identify the second generation onion router. In Embedded and
Ubiquitous Computing (EUC), 2011 IFIP 9th International Conference
on, pages 72–78. IEEE, 2011.
[21] Sambuddho Chakravarty, Angelos Stavrou, and Angelos D Keromytis.
Identifying proxy nodes in a tor anonymization circuit. In Signal
Image Technology and Internet Based Systems, 2008. SITIS’08. IEEE
International Conference on, pages 633–639. IEEE, 2008.
[22] Muhammad Aliyu Sulaiman and Sami Zhioua. Attacking tor through
unpopular ports. In Distributed Computing Systems Workshops (ICD-
CSW), 2013 IEEE 33rd International Conference on, pages 33–38.
IEEE, 2013.
[23] Eric Chan-Tin, Jiyoung Shin, and Jiangmin Yu. Revisiting circuit
clogging attacks on tor. In Availability, Reliability and Security (ARES),
2013 Eighth International Conference on, pages 131–140. IEEE, 2013.
[24] Xiaogang Wang, Junzhou Luo, Ming Yang, and Zhen Ling. A
novel flow multiplication attack against tor. In Computer Supported
Cooperative Work in Design, 2009. CSCWD 2009. 13th International
Conference on, pages 686–691. IEEE, 2009.
[25] Steven J Murdoch and George Danezis. Low-cost traffic analysis of
tor. In Security and Privacy, 2005 IEEE Symposium on, pages 183–
195. IEEE, 2005.
[26] Lu Zhang, Junzhou Luo, Ming Yang, and Gaofeng He. Application-
level attack against tor’s hidden service. In Pervasive Computing and
Applications (ICPCA), 2011 6th International Conference on, pages
509–516. IEEE, 2011.
[27] Ming Song, Gang Xiong, Zhenzhen Li, Junrui Peng, and Li Guo. A de-
anonymize attack method based on traffic analysis. In Communications
and Networking in China (CHINACOM), 2013 8th International ICST
Conference on, pages 455–460. IEEE, 2013.
... Generally, within the IM architecture, the server will store the client data and route messages to the proper recipients [9]. When users communicate with each other via IM, there are TCP connections that are established with the IM server for each user [10]. Within a peer-to-peer IM architecture, the connections are made directly between the communicating peers where the only communications with the server are requests to set up the communications between the clients [11]. ...
Preprint
Full-text available
Instant Message (IM) applications are commonly used by both civilian and DoD personnel for both communication and collaboration. The web-based variants of these applications generally ride encrypted channels for message security. However, these channels may be vulnerable to keystroke timing attacks whereby textual content is determined by the timing of network traffic induced by keyboard events. An example of this induced traffic is the activity notifications common to many of these platforms, indicating when a conversant begins typing. Our aim is to determine whether the network traffic that carries this metadata enables recovering portions of the message or leaks information about the sender's identity. Using a combination of network packet capture analysis and local keystroke logging, we characterize traffic patterns of three widely used web-based IM platforms: Facebook Messaging, Google Hangouts, and IRC through the Kiwi IRC web client.
... At the same time, voice big data analysis technology is also mature and inextricably linked with mobile communication with the development of big data. The instant messaging (IM) is now the No.1 application used by netizen to get on the internet, especially instant voice communication (IVC) [4][5][6][7][8]. However, the lawbreakers and cyber attackers can easily get the data due to popularization of instant messaging. ...
Article
Full-text available
In the field of instant voice communication steganalysis, the traditional detecting methods are mainly based on supervised learning scheme that results in a large amount of complex manual preprocessing training data set. The accuracy of supervised learning scheme can be easily destroyed by the difference between the distribution of training and testing data set in the actual voice application. The disadvantages of this method are obvious in the big data environment. In this regard, this paper initially introduced a novel semi-supervised hybrid learning detection model for the instant voice communication network. This provides the progress of manually annotating training data set, that has been removed to solve the problem of complex operations and poor applicability in classifier. Therefore, this model has a simpler structure and more extensive detection scopes with the huge amount of data. Then, we designed a multi-criteria fusion module, which can automatically generate the pseudo-label set from testing data set to train the classifier model. Thus, our scheme will not be affected by the distribution shift. In this module, we defined the confidence level and representative level to judge the feature vector for pseudo-labeled. Through the experimental analysis, the low bit-rate speech coding steganalysis (G.723.1/G.729/iLBC speech codecs) is analyzed on quantization index modulation (QIM), which are common codecs in instant voice communication network. The results show that our method has higher accuracy than un-supervised method. The proposed approach is less affected and more accurate than the previous supervised methods through the distribution of different training and testing data sets. The experiments also proved that our method can be deployed in the different kinds of instant voice communication (IVC) codec by considering huge amount of data set.
Conference Paper
In this paper, we extract the characteristics of WeChat traffic and propose approaches to identify WeChat traffic in cellular data network. WeChat communication mechanisms are discussed. The traffic and usage pattern of Video Call service provided by WeChat are studied from massive traffic data using Spark, differently from previous methods. We analyze the features of WeChat Video Call service, a Voice over Internet Protocol (VoIP) application in three aspects, which are (i) daily/weekly usage pattern, (ii) traffic/usage distribution, (iii) conversation time distribution. Our analysis has two important features. Firstly, the massive mobile subscriber data we used in our experiments was collected from a commercial Internet Service Provider (ISP) covering an entire province in Northern China ensuring that the results reflect the real characteristics of service in question in cellular network. Secondly, we investigate that the WeChat Video Call usage times fit with the power law distribution. Our results are important for cellular network operators and service providers to realize WeChat traffic identification methods and user behavior of Video Call, which imply information for optimization of their services.
Conference Paper
Full-text available
The breach of privacy in encrypted instant mes-senger (IM) service is a serious threat to user anonymity. Performance of previous de-anonymization strategies was limited to 65%. We perform network de-anonymization by taking advan-tage of the cause-effect relationship between sent and received packet streams and demonstrate this approach on a data set of Yahoo! IM service traffic traces. An investigation of various measures of causality shows that IM networks can be breached with a hit rate of 99%. A KCI Causality based approach alone can provide a true positive rate of about 97%. Individual performances of Granger, Zhang and IGCI causality are limited owing to the very low SNR of packet traces and variable network delays.
Article
Full-text available
Hidden services are anonymously hosted services that can be accessed over Tor, an anonymity network. In this paper we present an attack that allows an entity to prove, once a machine suspect to host a hidden server has been confiscated, that such machine has in fact hosted a particular content. Our solution is based on leaving a timing channel fingerprint in the confiscated machine's log file. In order to be able to fingerprint the log server through Tor we first study the noise sources: the delay introduced by Tor and the log entries due to other users. We then describe our fingerprint method, and analytically determine the detection probability and the rate of false positives. Finally, we empirically validate our results.
Conference Paper
Full-text available
With the wide use of anonymity tools, both blocking and anti-blocking of these tools have become hot topics. And the traffic identifications of the corresponding tools are key issues of both blocking and anti-blocking. In this paper, we address on identifying Tor and Web-Mix traffics, which are two of the most famous anonymity tools. Taking advantage of the typical methods for traffic identification, we proposed a traffic identification scheme based on traffic fingerprint extraction and matching. The fingerprints comprise of the specific strings, packet length and frequency of the packets' sending time. The details of design and implementation of such traffic identification scheme for both Tor and Web-Mix are presented. The feasibility of the proposed scheme is shown by the simulation experiments results.
Conference Paper
We present a novel attack on relayed instant messaging (IM) traffic that allows an attacker to infer who's talking to whom with high accuracy. This attack only requires collection of packet header traces between users and IM servers for a short time period, where each packet in the trace goes from a user to an IM server or vice-versa. The specific goal of the attack is to accurately identify a candidate set of top-k users with whom a given user possibly talked to, while using only the information available in packet header traces (packet payloads cannot be used because they are mostly encrypted). Towards this end, we propose a wavelet-based scheme, called COmmunication Link De-anonymization (COLD), and evaluate its effectiveness using a real-world Yahoo! Messenger data set. The results of our experiments show that COLD achieves a hit rate of more than 90% for a candidate set size of 10. For slightly larger candidate set size of 20, COLD achieves almost 100% hit rate. In contrast, a baseline method using time series correlation could only achieve less than 5% hit rate for similar candidate set sizes.
Conference Paper
Anonymity systems try to conceal the relationship between the communicating entities in network communication. Popular systems, such as Tor and JAP, achieve anonymity by forwarding the traffic through a sequence of relays. In particular, Tor protocol constructs a circuit of typically 3 relays such as no single relay knows both the source and destination of the traffic. A known attack on Tor consists in injecting a set of compromised relays and wait until a Tor client picks two of them as entry (first) and exit (last) relays. With the currently large number of relays, this attack is not scalable anymore. In this paper, we take advantage of the presence of unpopular ports in Tor network to significantly increase the scalability of the attack: instead of injecting typical Tor relays (with the default exit policy), we inject only relays allowing unpopular ports. Since only a small fraction of Tor relays allow unpopular ports, the compromised relays will outnumber the valid ones. We show how Tor traffic can be redirected through unpopular ports. The experimental analysis shows that by injecting a relatively small number of compromised relays (30 pairs of relays) allowing a certain unpopular port, more than 50% of constructed circuits are compromised.
Conference Paper
Tor is a popular anonymity-providing network used by over 500,000 users daily. The Tor network is made up of volunteer relays. To anonymously connect to a server, a user first creates a circuit, consisting of three relays, and routes traffic through these proxies before connecting to the server. The client is thus hidden from the server through three Tor proxies. If the three Tor proxies used by the client could be identified, the anonymity of the client would be reduced. One particular way of identifying the three Tor relays in a circuit is to perform a circuit clogging attack. This attack requires the client to connect to a malicious server (malicious content, such as an advertising frame, can be hosted on a popular server). The malicious server alternates between sending bursts of data and sending little traffic. During the burst period, the three relays used in the circuit will take longer to relay traffic due to the increase in processing time for the extra messages. If Tor relays are continuously monitored through network latency probes, an increase in network latency indicates that this Tor relay is likely being used in that circuit. We show, through experiments on the real Tor network, that the Tor relays in a circuit can be identified. A detection scheme is also proposed for clients to determine whether a circuit clogging attack is happening. The costs for both the attack and the detection mechanism are small and feasible in the current Tor network.
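The relay-identification step described above can be sketched as a comparison of probe latencies between burst and quiet periods. This is an illustrative simplification under stated assumptions, not the authors' detection procedure: it assumes one latency probe per relay per time slot, a known burst schedule, and a hypothetical fixed threshold; the names `burst_latency_gap` and `suspected_relays` and all values are invented for the example.

```python
from statistics import mean


def burst_latency_gap(latencies, burst_mask):
    """Mean probe latency during burst slots minus mean latency during quiet slots."""
    burst = [lat for lat, in_burst in zip(latencies, burst_mask) if in_burst]
    quiet = [lat for lat, in_burst in zip(latencies, burst_mask) if not in_burst]
    return mean(burst) - mean(quiet)


def suspected_relays(probes, burst_mask, threshold_ms):
    """Relays whose latency rises by more than threshold_ms while the server is bursting."""
    return sorted(relay for relay, latencies in probes.items()
                  if burst_latency_gap(latencies, burst_mask) > threshold_ms)


# Assumption: the malicious server alternates two burst slots with two quiet slots.
burst_mask = [True, True, False, False, True, True, False, False]

# Assumption: latency probes in milliseconds, one per slot, for two monitored relays.
probes = {
    "relayA": [80, 85, 40, 42, 82, 79, 41, 39],  # latency tracks the bursts
    "relayB": [50, 49, 51, 50, 50, 52, 49, 51],  # latency unaffected by the bursts
}

suspects = suspected_relays(probes, burst_mask, threshold_ms=10)
```

A fixed threshold is the simplest possible decision rule; measurements on a live network would likely need a statistical test that accounts for ordinary latency variation.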
Conference Paper
While providing protection for users' privacy, anonymity networks have also been exploited by criminals to carry out crime anonymously. In this paper, we study the problem of how to break the unlinkability between senders and recipients in order to identify the source of anonymous traffic. Tor, the most widely deployed anonymity network, is selected as our target. We develop a de-anonymization attack method based on traffic analysis and choose {time, stream size} as the features for the k-means algorithm to mine the association between the first-hop traffic and last-hop traffic of Tor. Experiments show that our method is effective for Tor.
Article
Tor has become one of the most popular overlay networks for anonymizing TCP traffic. The hidden service feature provided by Tor allows users to run a TCP server under a pseudonym, so that its resources can be accessed without the operator's real identity being revealed. In this paper, we propose a novel HTTP-based application-level attack against Tor's hidden web service. Under the assumption that the entry of the suspected hidden server's circuit is occupied, we evaluate the time correlation between web accesses and the traffic generated in the malicious onion router. Furthermore, we analyze the probability that malicious onion routers occupy the entry of the hidden server's circuit when they advertise high bandwidth, which is the foundation of our attack. We conducted real-world experiments to evaluate our attack method. The empirical results demonstrate that the hidden service can be effectively and efficiently located.
Conference Paper
We present a novel, practical, and effective mechanism that exposes the identity of Tor relays participating in a given circuit. Such an attack can be used by malicious or compromised nodes to identify the rest of the circuit, or as the first step in a follow-on trace-back attack. Our intuition is that by modulating the bandwidth of an anonymous connection (e.g. when the destination server, its router, or an entry point is under our control), we create observable fluctuations that propagate through the Tor network and the Internet to the end-user's host. To that end, we employ LinkWidth, a novel bandwidth-estimation technique. LinkWidth enables network edge-attached entities to estimate the available bandwidth in an arbitrary Internet link without a cooperating peer host, router, or ISP. Our approach also does not require compromise of any Tor nodes. In a series of experiments against the Tor network, we show that we can accurately identify the network location of most participating Tor relays.