All content in this area was uploaded by Saad Saleh on Apr 07, 2015
IM Session Identification by Outlier Detection in
Cross-correlation Functions
Saad Saleh∗, Muhammad U. Ilyas∗, Khawar Khurshid∗, Alex X. Liu‡ and Hayder Radha§
∗Dept. of Electrical Engineering, School of Electrical Engineering and Computer Science,
National University of Sciences and Technology, H-12, Islamabad – 44000, Pakistan
‡Dept. of Computer Science and Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA
§Dept. of Electrical and Computer Engg, College of Engineering, Michigan State University, East Lansing, MI – 48824, USA
Email: {saad.saleh, usman.ilyas, khawar.khurshid}@seecs.edu.pk∗,
alexliu@cse.msu.edu‡, radha@egr.msu.edu§
Abstract—The identification of encrypted Instant Messaging
(IM) channels between users is made difficult by the presence of
variable and high levels of uncorrelated background traffic. In
this paper, we propose a novel Cross-correlation Outlier Detector
(CCOD) to identify communicating end-users in a large group of
users. Our technique uses traffic flow traces between individual
users and IM service provider’s data center. We evaluate the
CCOD on a data set of Yahoo! IM traffic traces with an average
SNR of −6.11 dB (the data set includes ground truth). Results show
that our technique achieves an 88% true positive (TP) rate, a 3%
false positive (FP) rate and a 96% ROC area. On the same data set,
previous correlation-based schemes were limited to a 63% TP rate,
a 4% FP rate and an 85% ROC area.
Keywords- Link de-anonymization; instant messaging; security; privacy
I. INTRODUCTION
A. Background and Motivation
Instant Messaging (IM) services are projected to reach
1.4 billion users worldwide by 2016 [1]. IM services provide
mostly free, ubiquitous access, mobility and privacy. However,
user privacy has been under attack for unlawful reasons in
the past few years, e.g. the theft of data of 35 million users
in Korea in 2013 [2]. Government agencies, including the
National Security Agency (NSA), have also breached the
privacy of millions of users [3]. The aim of our research is
to assess the vulnerability of IM sessions to de-anonymization
attacks (identifying who is communicating with whom) using
only transport layer session traces. Such link de-anonymization
is challenging for the following reasons: (1) IM messages
are now often encrypted, so only IP and TCP headers are
visible, and even these become infeasible to log on a large scale;
(2) the IM data center establishes separate TCP connections to
the source and destination users, so at no time does a single
packet contain the IP addresses of both end users.
The complexity of de-anonymization increases further in the
following practical scenarios: (1) multiple simultaneous message
sessions by a user, (2) thousands of users communicating
through the IM data center at any time, (3) duplicate packets due
to retransmissions and (4) out-of-order packet delivery.
B. Limitations of Prior Art
Several prior works have focused on link de-
anonymization. Time Series Correlation (TSC), the baseline
approach, has a TP rate of 63% for a signal-to-noise ratio
(SNR) of −6.11dB [4]. Major factors for performance
deterioration include delay, jitter, buffering, reordering,
duplicate messages and server messages. In the area of
de-anonymization of mix-networks several studies focused
on the computation of mutual information between ingress
and egress traffic flows. High time-complexity and reliance
on data that needs to be collected from multiple points
inside the network are major limitations. In social-network
de-anonymization, data sparsity and membership information
are used to de-anonymize networks. Here, the requirement of
detailed user information becomes a major limiting factor.
Several de-anonymization attempts have been made on the Tor
network, but their major emphasis was the identification of
traffic using various fingerprints. Correlation has seen only
limited use in these breaching attempts.
In our previous works, we showed that the correlation
of wavelet decomposed time series of users’ traffic traces
can successfully breach IM session privacy [4]. In a recent
study, we showed that the cause-effect relationship between
packets appearing in two communicating (“talking”) users’
traffic traces can be leveraged to de-anonymize user sessions
[5]. Time complexity was a major limiting factor for these
approaches.
C. Proposed Approach
In this paper, we propose a novel Cross-correlation Outlier
Detector (CCOD) to de-anonymize users’ IM sessions using
only undirected transport layer traffic traces collected at the
data center (or gateway), in the form of flow logs. Our idea
leverages the limited delay between the appearance of a packet
in a sending user’s traffic trace and its appearance in the
receiving user’s traffic trace. We expect talking users to have
a high cross-correlation statistic, while for non-talking users
the appearance of packets in the same time slot is expected
to be coincidental. To de-anonymize a user we compute the
cross-correlation function of the time series of her traffic trace
derived from her flow log with the respective time series of all
other users. Therefore, the time-complexity of our approach is
Θ(N), where N is the number of IM users. Next, we estimate
the distribution of the cross-correlation function at all non-zero
time-shifts. Finally, we apply a binary classifier to the value of
the cross-correlation function at zero time shift and determine
whether that value is drawn from the preceding distribution or
not.
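As a rough illustration of why the scan is linear in the number of users, the sketch below scores one target user against every other user exactly once. The function names, the dictionary layout and the zero-shift dot product used as a stand-in score are assumptions made for this example and are not taken from the paper's implementation.

```python
def zero_shift_correlation(x, y):
    """Stand-in score (assumption): unnormalized correlation at zero time shift."""
    return sum(a * b for a, b in zip(x, y))


def scan_candidates(target, series_by_user, score=zero_shift_correlation):
    """Score `target` against every other user once: N - 1 score evaluations."""
    target_series = series_by_user[target]
    return {
        user: score(target_series, series)
        for user, series in series_by_user.items()
        if user != target
    }


# Toy per-slot packet counts (hypothetical data, not from the Yahoo! data set).
toy_series = {"a": [1, 0, 2, 0], "b": [1, 0, 1, 0], "c": [0, 3, 0, 0]}
toy_scores = scan_candidates("a", toy_series)
```

The number of score evaluations grows linearly with the number of users. The cost of each evaluation additionally depends on the trace length and on how many time shifts are examined.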
D. Experimental Results and Findings
We evaluated the performance of CCOD on a real world
Yahoo! IM based data set collected from the greater New York
area over a duration of 24 hours. Classification rates were
estimated for Naïve Bayes, SVM and C4.5 classifiers. Results
showed that CCOD provides up to 88% TP rate and 3% FP
rate with 95% precision and 88% recall. Results also showed that
low SNR has the worst effect on classifier performance. We
compared CCOD with TSC and with COLD, a wavelet
decomposition based approach [4][5]. CCOD outperformed
both TSC and COLD.
E. Key Contributions
Our major contributions are as follows:
1) We propose CCOD, a novel link de-anonymization
approach for IM user sessions (see Section III).
2) We validated the approach on a real world Yahoo!
messenger data set (see Section IV).
3) We compared the performance gains of our CCOD
approach with previous state-of-the-art schemes (see
Tab. II).
Paper Organization:
The rest of the paper is organized as follows: Section-
II presents the IM network scenario and de-anonymization
challenges. Section-III presents our proposed approach. Re-
sults of simulations over Yahoo! data set have been discussed
in Section-IV. Section-V presents the related work. Finally,
Section-VI concludes the paper.
II. PROBLEM DESCRIPTION
When two users u1 and u2 communicate with each other
via a relayed IM service, two TCP connections are established:
one between user u1 and the IM data center
and another between user u2 and the IM data center.
Every flow in the source user's time series appears in the
destination user's time series with some delay and possible
reordering. Fig. 1 shows the
setup used for data collection of IM traffic traces. Port filtering
is used to isolate IM traffic from other traffic to/from the data
center. Users are identified by their IP addresses. Flow logs
can be collected at two possible locations: (1) at the gateway
to the IM data center, or (2) at the gateway or proxy server of
a subnet. Logging at the IM data center can de-anonymize all
sessions established through the data center. In contrast,
traffic logged at the gateway of a subnet can only be used to
de-anonymize conversations between users located inside the
subnet¹. Traffic logging tools record a number of packet header
fields, including source and destination
IP addresses, source and destination port numbers, protocol
versions, time stamps and packet sizes [7]. Owing to resource
constraints, the flow logs in our data set, which were
collected at the data center, contained only timestamps and user
IDs / IP addresses [8]. We aim to answer the question, “Is it
possible to link two communicating IM users from among a
large set of users using only user-server flow logs?”
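To make the input concrete, the following minimal sketch turns a flow log of timestamps and user IDs into per-user time series of packet counts. The record layout `(timestamp_in_seconds, user_id)`, the function name `build_time_series` and the one-second slot width are assumptions for illustration. The paper only states that the logs contain timestamps and user IDs / IP addresses.

```python
from collections import defaultdict


def build_time_series(flow_log, num_slots, slot_seconds=1.0):
    """Bin (timestamp, user_id) records into per-user packet counts per slot.

    flow_log: iterable of (timestamp_in_seconds, user_id) pairs (assumed layout).
    Returns {user_id: [count in slot 0, count in slot 1, ...]}.
    """
    series = defaultdict(lambda: [0] * num_slots)
    for timestamp, user_id in flow_log:
        slot = int(timestamp // slot_seconds)
        if 0 <= slot < num_slots:  # ignore records outside the observation window
            series[user_id][slot] += 1
    return dict(series)


# Toy flow log (hypothetical): u1 and u2 are active in the same slots, u3 is not.
toy_log = [(0.2, "u1"), (0.4, "u2"), (2.1, "u1"),
           (2.3, "u2"), (2.9, "u2"), (4.5, "u3")]
toy_series = build_time_series(toy_log, num_slots=5)
```

Because the log is undirected, each slot simply counts packets seen on a user's connection, whether sent or received.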
¹ In a recent case of a bomb hoax at Harvard University, the suspect was
identified based upon the local subnet traffic at Harvard University [6].
III. PROPOSED DE-ANONYMIZATION APPROACH
We propose a cross-correlation outlier detector to map an
IM user to the user(s) she is communicating with. Let $X(t)$
and $Y(t)$ be the respective time series of packets sent to and
received from the IM data center by two users $X$ and $Y$.
Let $R_{XY}(t-k)$ denote the cross-correlation of
flow log $X(t)$ and the flow log $Y(t-k)$ delayed by $k$ time
units (in our case, seconds). The step-by-step procedure of
our approach is as follows:
1) Compute $R_{XY}(0)$ of the flow logs $X(t)$ and $Y(t)$.
2) Compute $R_{XY}(t-k)$ of the flow logs $X(t)$ and $Y(t-k)$,
where $k \in \mathbb{Z} \cap [-Z, +Z] \setminus \{0\}$, i.e. $k$ varies from
$-Z$ to $+Z$, except 0.
3) Estimate the standard deviation $\sigma_{XY}$ of all $2 \times Z$
values $R_{XY}(t-k)$ from the previous step.
4) Compute the deviation of $R_{XY}(0)$ from the mean
$\mu_{XY}$ of the distribution of $R_{XY}(t-k)$ of all other
values, i.e. $R_{XY}(0) - \mu_{XY}$.
5) Normalize the deviation of $R_{XY}(0)$,
$R_{XY}(0) - \mu_{XY}$, by the standard deviation $\sigma_{XY}$,
i.e. $(R_{XY}(0) - \mu_{XY})/\sigma_{XY}$.
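The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code. It assumes equal-length series of per-slot packet counts, an unnormalized cross-correlation with samples outside the trace treated as zero, and the population standard deviation. The paper does not specify these details, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def cross_correlation(x, y, k):
    """Unnormalized sum over t of x[t] * y[t - k]; out-of-range samples count as 0."""
    n = len(x)
    return sum(x[t] * y[t - k] for t in range(n) if 0 <= t - k < n)


def ccod_statistic(x, y, max_shift):
    """(R(0) - mu) / sigma, with mu and sigma over the 2 * max_shift non-zero shifts."""
    r0 = cross_correlation(x, y, 0)                           # step 1
    shifted = [cross_correlation(x, y, k)                     # step 2
               for k in range(-max_shift, max_shift + 1) if k != 0]
    sigma = pstdev(shifted)                                   # step 3 (assumed: population form)
    mu = mean(shifted)                                        # step 4
    if sigma == 0:                                            # degenerate flat background
        return float("inf") if r0 > mu else 0.0
    return (r0 - mu) / sigma                                  # step 5


# Toy traces (hypothetical): y mirrors x, z is active in different slots.
x = [3, 0, 0, 2, 0, 0, 0, 4, 0, 1]
y = list(x)
z = [0, 1, 0, 0, 1, 0, 2, 0, 1, 0]
talking_score = ccod_statistic(x, y, max_shift=3)
non_talking_score = ccod_statistic(x, z, max_shift=3)
```

On this toy input the mirrored pair yields a large positive statistic and the unrelated pair a negative one, which is the separation the classifier is meant to exploit.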
Our breaching strategy is to exploit the “high correlation”
between undirected flow logs of communicating users,
where the threshold for what constitutes high correlation is
determined by the value of $\sigma_{XY}$ for each pair of users
$X$ and $Y$. Without any time shift, talking users have high
cross-correlation. However, the server introduces a processing
delay for packets, which decreases the correlation statistic.
Noise also decreases the correlation statistic, but any increase
in time shift decreases the correlation sharply. To increase the
performance gains, we developed our own measure, which
computes the deviation of the central value (the non-shifted
correlation statistic) from the mean of all other values. This
metric is expressed in units of the standard deviation of all
other values.
IV. EXPERIMENTAL RESULTS AND FINDINGS
Performance of the correlation outlier approach was tested
on a data set collected from Yahoo! IM service [8]. The
entire data was collected from the greater New York area over
a duration of 24 hours. We were provided a flow log file
(containing time and anonymized user IDs) and the ground
truth user communication file for testing the performance of
the proposed CCOD approach. Analysis of the flow log data set
revealed a large amount of noise in the form of about 200 control
and service messages, resulting in a mean SNR of −6.11 dB with
a standard deviation of 5.81 dB. Fig. 2a and Fig. 2b show the
correlation statistics for time shifts $-20 \le k \le +20$. Unlike
non-talking pairs, all talking pairs show a large correlation
statistic at zero time shift. Any increase in time shift
decreases the correlation statistic. Correlation does not drop
to zero at most time shifts because of the high levels of
background traffic / low SNR of the data set. A few pairs
show large correlation statistics despite time shifts due to the
variability of the data set. Using a number of time shifts
large enough to estimate $\sigma_{XY}$ reliably can remove this anomaly in
classifier design. Fig. 3a and Fig. 3b present the probability
density function of the computed $(R_{XY}(0) - \mu_{XY})/\sigma_{XY}$ for
all talking and non-talking pairs. Results show that talking
pairs have a larger spread than non-talking pairs. The data set
used for the results in this section is of 10 minute duration
containing 1962 talking and 3000 non-talking user pairs.
[Fig. 1 diagram labels: IM Data Center, Gateway, Internet, users u; each user's trace is plotted as Packet Size against Time.]
Fig. 1: Collecting traffic around gateway and IM Data Center.
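To give a feel for the reported mean SNR of −6.11 dB, the conversion below turns it into a linear ratio. It assumes the conventional power-ratio definition of the decibel, which is not spelled out in this text, so the symbols $P_{\text{signal}}$ and $P_{\text{noise}}$ are placeholders rather than the authors' notation.

```latex
\[
\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}\frac{P_{\text{signal}}}{P_{\text{noise}}}
\quad\Longrightarrow\quad
\frac{P_{\text{signal}}}{P_{\text{noise}}} = 10^{-6.11/10} \approx 0.245
\]
```

Under that assumption, the background traffic is on average about four times stronger than the conversation traffic.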
Tab. I presents the classifier results. WEKA was used to
design classifiers with 10-fold cross validation [9]. Classifiers
showing promising results include Naïve Bayes, SVM and the
C4.5 decision tree. Results show that the best performance is
obtained for the C4.5 classifier, providing 88% TP rate, 3% FP
rate, 95% precision, 88% recall and 96% ROC area.
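As a quick consistency check, the reported precision can be related to the TP rate, FP rate and class sizes. The sketch below assumes the reported rates refer to the talking class and uses the rounded C4.5 figures from the text. The helper name `implied_precision` is hypothetical, and WEKA's own summary may be computed differently (for example as a weighted average over classes).

```python
def implied_precision(tp_rate, fp_rate, num_positive, num_negative):
    """Precision implied by TP/FP rates and class sizes: TP / (TP + FP)."""
    true_positives = tp_rate * num_positive
    false_positives = fp_rate * num_negative
    return true_positives / (true_positives + false_positives)


# Rounded C4.5 figures and class sizes reported in the text.
precision = implied_precision(0.88, 0.03, num_positive=1962, num_negative=3000)
print(f"implied precision: {precision:.2f}")
```

Under these assumptions the implied precision comes out at roughly 0.95, in line with the reported value.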
Tab. II presents the performance comparison of the previous
TSC and COLD approaches with CCOD. CCOD provides
88% TP rate while TSC and COLD show only 63% and 55%
TP rates, respectively. Similarly, CCOD shows a 3% FP rate
while TSC and COLD show 4% and 3% FP rates, respectively.
CCOD also leads on precision, recall and ROC area. Overall,
TSC and COLD showed poor de-anonymization statistics in
comparison with CCOD.
V. RELATED WORK
A. Mix Network De-anonymization
Several studies have focused on the de-anonymization of
mix networks. In [10], Troncoso and Danezis use a Markov
Chain Monte Carlo engine to obtain the probabilities for link
de-anonymization. Authors limited the network size to only 10
nodes and collected data from multiple points. In [11], authors
estimate the mutual information of all the information entering
and leaving the mixers. Their algorithm is exhaustive because
of large computation time. Time series correlation has been
the most logical de-anonymization technique but it provides
only 63% TP rate for IM networks [4][5]. IM networks
differ from mix networks due to the large number of control and
server messages, delays, jitter, lack of network visibility and the
large number of users.

TABLE I: Performance of CCOD on Yahoo! data set.
Classifier    TP Rate  FP Rate  Precision  Recall  ROC
Naïve Bayes   0.61     0.04     0.92       0.61    0.91
SVM           0.89     0.04     0.94       0.89    0.93
C4.5          0.88     0.03     0.95       0.88    0.96

TABLE II: Performance comparison of CCOD with previous
de-anonymization approaches [4][5].
Technique     TP       FP       Precision  Recall  ROC
TSC           0.63     0.04     0.76       0.63    0.85
COLD          0.55     0.03     0.78       0.55    0.84
CCOD          0.88     0.03     0.95       0.88    0.96
B. Social Network De-anonymization
Using people’s friendship graphs to de-anonymize net-
works has been a major focus of various studies. In [12],
authors calculate the distances in d-dimensional data sets of
users to measure the user separation. In [13], authors develop
an identification algorithm to de-anonymize the users. In [14],
[15], group membership information has been used to trace the
users. All these studies have been ineffective for IM networks
because these approaches require connectivity and topological
information of users.
C. Tor Network Deanonymization
A number of studies have focused on the deanonymization
of the Tor network, which uses a distributed network
architecture. The location of hidden services is identified by using route
selection, timing signatures and delays [16][17][18]. Several
studies focus on distinguishing Tor traffic from other
network traffic using traffic fingerprints, packet sizes and proxy
nodes [19][20][21]. A large number of breaching attempts have
been made using unpopular ports, circuit clogging and
man-in-the-middle strategies [22][23][24]. “Correlation” based attacks
have been used to identify Tor traffic by using delays, times and
stream sizes [25][26][27]. However, no previous study has used,
as our scheme does, the deviation of the zero-time-shift correlation
from the mean of the time-shifted correlations to deanonymize the Tor network.
D. IM Network De-anonymization
In our study [4], we used a wavelet based decomposition
approach (COLD) to de-anonymize IM users. Correlation of
coefficients was performed over the decomposed user time
series at multiple frequency scales. Our experiments showed
hit rates of 90% to 98% when the candidate set sizes varied
from 10 to 20. However, COLD showed poor performance
for a candidate set size of 1. In our recent study [5], we
used causality based detection approaches for link
de-anonymization. That study showed a TP rate of 99% for the best
performing KCI causality based approach. Large computation
time and complexity were the major disadvantages of some
causality based approaches. In this paper, we focus on a
novel correlation outlier approach with better performance and
computation time.
[Fig. 2 plots: Correlation R(t) against Time Shift (k), for k from −20 to 20. Panels: (a) Talking pairs, (b) Non-talking pairs.]
Fig. 2: Correlation with time shift for (a) talking pairs, and (b) non-talking pairs.
[Fig. 3 plots: PDF against [R(0) − μ] / σ, over the range −10 to 100. Panels: (a) Talking pairs, (b) Non-talking pairs.]
Fig. 3: CCOD statistics for (a) talking pairs, and (b) non-talking pairs.
VI. CONCLUSION
Our paper presents a novel Cross-correlation Outlier Detector
(CCOD) approach to breach user sessions in encrypted IM
networks using only flow log data. Our approach relies on the
observation that talking pairs show a large correlation statistic
at zero time shift. Experiments over a real world Yahoo! messenger data
set showed a maximum of 88% TP rate, 3% FP rate, 95%
precision, 88% recall and 96% ROC area. The study suggests that
CCOD provides better performance than the majority of previous
approaches with much less computational complexity. In future,
we aim to deanonymize user links in the Tor network
through real world experiments using our proposed approaches.
ACKNOWLEDGMENT
This research is part of the project “Detecting covert links
in instant messaging networks using flow level log data” sup-
ported by National ICT R&D Fund, Pakistan. We are thankful
to the people at the Yahoo! Webscope program for providing us
the real world data and assisting us in the estimation of users'
vulnerability in IM applications.
REFERENCES
[1] CMO Council. Engage at Every Stage: Using Mobile Relationship
Marketing (MRM) to Put More Interaction in the Hands of the Customer
Report, January 2012.
[2] D. Sancho. Large Data Breach in South Korea, Data of 35M Users
Stolen, July 2013.
[3] E. Chabrow. NSA E-Spying: Bad Governance, October 2013.
[4] M. U. Ilyas, M. Z. Shafiq, A. X. Liu, and H. Radha. Who are you
talking to? Breaching privacy in encrypted IM networks. In Intl. Conf.
on Network Protocols, 2013.
[5] S. Saleh, M. Raja, M. Shahnawaz, M. U. Ilyas, K. Khurshid, M. Z.
Shafiq, A. X. Liu, H. Radha, and S. S. Karande. Breaching IM session
privacy using causality. In Global Telecommunications Conference
(GLOBECOM), 2014.
[6] B. Crowley and S. Almasy. Harvard student charged in bomb hoax out
on 100,000 dollars bail, December 2013.
[7] B. Claise. Cisco systems NetFlow services export version 9. Wikipedia,
the free encyclopedia, October 2004.
[8] Yahoo! Webscope dataset ydata-ymessenger-client-server-protocol-
events-v1-0. [http://research.yahoo.com/Academic Relations].
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
Witten. The WEKA data mining software: an update. SIGKDD
Explorations, 11(1):10–18, 2009.
[10] C. Troncoso and G. Danezis. The Bayesian traffic analysis of mix
networks. In ACM Computer and Communications Security, 2009.
[11] Y. Zhu, X. Fu, R. Bettati, and W. Zhao. Anonymity Analysis of
Mix Networks against Flow-Correlation Attacks. In IEEE Global
Communications Conference (GLOBECOM), 2005.
[12] A. Narayanan and V. Shmatikov. Robust de-anonymization of large
sparse datasets. In IEEE Symposium on Security and Privacy, 2008.
[13] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In
IEEE Symposium on Security and Privacy, 2009.
[14] G. Wondracek, T. Holz, E. Kirda, and C. Kruegel. A practical attack
to de-anonymize social network users. In IEEE Symposium on Security
and Privacy, 2010.
[15] E. Zheleva and L. Getoor. To Join or not to Join: The Illusion of Privacy
in Social Networks with Mixed Public and Private User Profiles. In
World Wide Web (WWW) Conference, 2009.
[16] Lasse Overlier and Paul Syverson. Locating hidden servers. In Security
and Privacy, 2006 IEEE Symposium on, pages 15–pp. IEEE, 2006.
[17] Juan A Elices, Fernando Perez-Gonzalez, and Carmela Troncoso. Fin-
gerprinting tor’s hidden service log files using a timing channel. In
Information Forensics and Security (WIFS), 2011 IEEE International
Workshop on, pages 1–6. IEEE, 2011.
[18] Karsten Loesing, Werner Sandmann, Christian Wilms, and Guido Wirtz.
Performance measurements and statistics of tor hidden services. In Ap-
plications and the Internet, 2008. SAINT 2008. International Symposium
on, pages 1–7. IEEE, 2008.
[19] Xuefeng Bai, Yong Zhang, and Xiamu Niu. Traffic identification of tor
and web-mix. In Intelligent Systems Design and Applications, 2008.
ISDA’08. Eighth International Conference on, volume 1, pages 548–
551. IEEE, 2008.
[20] John Barker, Peter Hannay, and Patryk Szewczyk. Using traffic analysis
to identify the second generation onion router. In Embedded and
Ubiquitous Computing (EUC), 2011 IFIP 9th International Conference
on, pages 72–78. IEEE, 2011.
[21] Sambuddho Chakravarty, Angelos Stavrou, and Angelos D Keromytis.
Identifying proxy nodes in a tor anonymization circuit. In Signal
Image Technology and Internet Based Systems, 2008. SITIS’08. IEEE
International Conference on, pages 633–639. IEEE, 2008.
[22] Muhammad Aliyu Sulaiman and Sami Zhioua. Attacking tor through
unpopular ports. In Distributed Computing Systems Workshops (ICD-
CSW), 2013 IEEE 33rd International Conference on, pages 33–38.
IEEE, 2013.
[23] Eric Chan-Tin, Jiyoung Shin, and Jiangmin Yu. Revisiting circuit
clogging attacks on tor. In Availability, Reliability and Security (ARES),
2013 Eighth International Conference on, pages 131–140. IEEE, 2013.
[24] Xiaogang Wang, Junzhou Luo, Ming Yang, and Zhen Ling. A
novel flow multiplication attack against tor. In Computer Supported
Cooperative Work in Design, 2009. CSCWD 2009. 13th International
Conference on, pages 686–691. IEEE, 2009.
[25] Steven J Murdoch and George Danezis. Low-cost traffic analysis of
tor. In Security and Privacy, 2005 IEEE Symposium on, pages 183–
195. IEEE, 2005.
[26] Lu Zhang, Junzhou Luo, Ming Yang, and Gaofeng He. Application-
level attack against tor’s hidden service. In Pervasive Computing and
Applications (ICPCA), 2011 6th International Conference on, pages
509–516. IEEE, 2011.
[27] Ming Song, Gang Xiong, Zhenzhen Li, Junrui Peng, and Li Guo. A de-
anonymize attack method based on traffic analysis. In Communications
and Networking in China (CHINACOM), 2013 8th International ICST
Conference on, pages 455–460. IEEE, 2013.