INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT
Int. J. Network Mgmt 2014; 24: 235–252
Published online 29 May 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/nem.1862
Identifying peer-to-peer communities in the network by connection
graph analysis
Jan Jusko1,2,*,† and Martin Rehak1,2
1Cisco Systems, San Jose, CA 95134, USA
2Faculty of Electrical Engineering, Czech Technical University in Prague, 166 27 Prague, Czech Republic
SUMMARY
In this paper we present a unified solution to identify peer-to-peer (P2P) communities operating in the
network. We propose an algorithm that is able to progressively discover nodes cooperating in a P2P network
and to identify that P2P network. Starting from a single known node, we can easily identify other nodes in
the P2P network, through the analysis of widely available and standardized IPFIX (NetFlow) data. Instead
of relying on the analysis of content characteristics or packet properties, we monitor connections of known
nodes in the network and then progressively discover other nodes through the analysis of their mutual
contacts. We show that our method is able to discover cooperating nodes in many P2P networks and present
the real computational requirements of the algorithm on a large network. The use of standardized input data
allows for easy deployment onto real networks. Copyright © 2014 John Wiley & Sons, Ltd
Received 31 October 2013; Revised 15 March 2014; Accepted 10 April 2014
1. INTRODUCTION
Although peer-to-peer (P2P) networks are mostly known for file sharing, they are widely adopted for
many other purposes in today’s Internet. They are used for file sharing (e.g. BitTorrent), VoIP
applications (Skype), malware command and control (C&C) channels and streaming media (Spotify).
Interestingly, there is also an online currency that relies on a P2P architecture for its ecosystem:
Bitcoin [1]. Last but not least, P2P networks might be used for military purposes [2]. Since VoIP
applications, streaming media and especially file-sharing applications generate large amounts of
traffic, P2P networks account for a significant share of the traffic in today’s Internet.
For the sake of both network management and network security, it is important to be able to identify
the network traffic generated by P2P networks. Knowing which network traffic is generated by a
specific P2P application, one can manage networks more effectively and provide a better quality of
service. From the security standpoint, blocking P2P-based C&C is very effective for disrupting botnet
activity. In addition to these two points, it was also shown that P2P traffic can degrade the performance
of anomaly detection techniques. The detection rate can decrease by up to 30% and the false positive
rate can increase by up to 45% [3].
In this paper we deal with the issue of finding P2P peers in the network for P2P protocols in general,
without focusing on any particular protocol. This is possible because we exploit intrinsic properties
that are common to all P2P protocols and cannot be effectively avoided by any P2P network. Specifically,
in this paper we are looking for answers to the following questions:
1. Having one peer in an unknown P2P network, are we able to find other peers in the respective
network?
2. Can we determine what particular P2P network it is?
3. Can we enumerate all P2P networks and their peers active in the monitored network?

*Correspondence to: Jan Jusko, Cisco Systems, San Jose, CA 95134, USA.
†E-mail: jajusko@cisco.com

Answers to these questions are presented in Sections 3, 4 and 5. As a result we get a detector
that provides information about all active P2P overlays in the network. The detector utilizes
NetFlow data and uses only information about communication endpoints without using any flow-based
statistics. This is shown to be an advantage since such statistics (e.g. flow duration) could be
distorted, for example, in the case of a distributed denial of service (DDoS) attack on the protected
network [4].
In our evaluation we show that the detector we propose is able to link hosts cooperating in the usual
P2P networks such as KAD, Gnutella, BitTorrent and Skype, as well as hosts infected by the same
malware using P2P as its C&C channel. Besides knowledge about the cooperating hosts, the detector
is capable of identifying the detected networks if they are of a known type.
Knowing which hosts are engaged in the same overlay as an infected host might help in discovering
the botnet or other malware-infested nodes. The method described in this paper can be used as a
pre-processing layer for packet inspection-based detection. One would first find clusters of hosts in the
network, then perform the detection for only a few hosts in each cluster and extend the results to the
remaining hosts in the cluster.
This work is an extension of our earlier paper [5]. The core of the new contribution lies in Section
4. Moreover, this paper offers a more detailed discussion of the reasoning behind the algorithm. In the
evaluation we provide additional information on the detection performance of the detector over time and
add information about computational performance in a real deployment.
2. RELATED WORK
There is a plethora of research in the field of P2P networks, e.g. BitTorrent [6–8], BitTorrent’s DHT
[9], KAD (which is based on Kademlia) [10,11] and Gnutella [12–14]. There are also many studies
proposing various improvements to P2P protocols, but those are not of primary interest here.
P2P architecture is now often used by botnets for their C&C. An overview of P2P botnets and an
analysis of one of them can be found in Grizzard et al. [15]. A P2P-based C&C, using Kademlia as an
example, is analyzed theoretically in Ha et al. [16]. The authors show that P2P-based C&C is harder
to monitor compared to the centralized C&C architecture. Besides that, they also propose several
mitigation techniques.
A deeper analysis of botnets with P2P-based C&C has been provided in Dagon et al. [17]. The
authors take Cooke et al. [18] as their inspiration and classify botnets into several groups based on the
(theoretical) C&C model they use:
• Erdös–Rényi random graph model;
• Watts–Strogatz small-world model;
• Barabási–Albert scale-free model.
All these models assume that bots communicate with each other; thus these are an extension of the
P2P model. The authors themselves note that for the purpose of their theoretical analysis botnets using
unstructured P2P networks as C&C can be roughly approximated by the Barabási–Albert scale-free
model and those using structured P2P networks for their C&C can be roughly approximated by the
Erdös–Rényi random graph model.¹ Appropriate models for a few major P2P protocols based on this
overlay classification can be found in Table 1.
Another study of P2P architecture from the point of view of the C&C channel was presented in
Davis et al. [19]; it studies two real P2P networks and two theoretical models in order to identify the
optimal P2P overlay for botnet C&C. The studied protocols and theoretical models are:
• Kademlia;
• Gnutella;
• Erdös–Rényi random graph model;
• Barabási–Albert scale-free network model.

¹ Note that P2P networks that use supernodes are in fact unstructured (e.g. Gnutella). Only networks
with an explicitly defined routing structure (e.g. Kademlia) are considered structured.

Table 1. Major P2P protocols and the approximate graph model of their overlay
P2P protocol                                   Graph model of overlay
Skype                                          Barabási–Albert scale-free model
Gnutella (before introduction of super-peers)  Erdös–Rényi random graph model
Gnutella (with super-peers)                    Barabási–Albert scale-free model
BitTorrent                                     Erdös–Rényi random graph model
BitTorrent DHT                                 Erdös–Rényi random graph model
Kademlia                                       Erdös–Rényi random graph model

The authors first define three performance measures on a P2P overlay that should be of interest to
the botmaster. They then evaluate these measures on the four models when the overlay is the target
of disinfection.
Detection of P2P networks is another frequently studied topic. There are three main groups of detection
methods: packet payload-based, flow-based and graph-based methods. Within all three groups the
detection can be based on the observation of either specific P2P network behavior or inherent P2P
network properties.
A flow-based method to detect peers using inherent properties of P2P networks is introduced
in Bartlett et al. [20]. The method itself does not use any protocol-specific features and thus, in
theory, might be used for any P2P network. The authors validate the method on BitTorrent and
Gnutella networks.
As an example of graph methods, we can mention the one introduced in Constantinou and Mavrommatis
[21]. The method is agnostic of any specific P2P protocol features. It creates a connection graph
of the peers communicating on a given port. Based on the network diameter and number of hosts that
function as both client and server, the method determines whether they constitute a P2P network.
In the detection of P2P networks, and thus P2P botnets, much attention is given to traffic dispersion
graphs (TDGs). A TDG is a graphical representation of various interactions of a group of nodes
[22]. In IP networks, nodes in a TDG represent hosts in the network identified by their IP address.
However, the definition of an edge in such a graph is more complicated: to let TDGs describe
various forms of interaction, the edges can be defined arbitrarily. Static properties of various
TDGs were first analyzed by Iliofotou et al. [22]. The network classification system Graption is based on
these properties [23]. In further work, Iliofotou et al. also investigate dynamic properties of traffic
dispersion graphs [24]. TDGs were also used to evaluate the limits of local approaches in P2P botnet
detection [25].
An interesting solution for P2P-based botnets was proposed by Coskun et al. [26]. They propose a
method to identify local members of a botnet that uses an unstructured random P2P network for its
overlay if one bot from the botnet is already known. They do so by post-mortem analysis of the network
traffic. Once one bot is discovered in the network, the traffic preceding its discovery (say 24 hours)
is collected and analyzed. The detection of other hosts is centered around the notion of a mutual
contact, and the likelihood of other hosts being part of the same botnet is calculated in an iterative
manner on a graph that is created based on hosts’ mutual connections. Since we use the mutual-contacts
paradigm as well, this work is closest to ours.
Iterative calculation of nodes’ likelihood of being part of botnets is used in François et al. [27]
as well. The goal of their work is to find hidden overlays of P2P-based botnets in the network. A
communication graph is created based on the NetFlow data and two PageRank statistics are calculated
for each node. Nodes are then clustered to find clusters of P2P nodes in the network.
In this work we also use ideas from a paper aimed at detecting botnet C&C [28]. The authors focus
on observing long-term connections that are possibly used for botnet C&C. The method requires a
training phase in which clean (malware-free) traffic is assumed. In the training phase all long-term
connections are recorded and whitelisted. In the detection phase, if any long-lasting connection is
observed and is not contained in the whitelist, it is considered a C&C channel.
An overview of related work with categorization by methods used and goals attained can be found
in Table 2.

Table 2. Overview of methods and goals of the related work
Goals
Methods used P2P properties P2P-based C&C properties P2P detection Botnet detection
Empirical studies [6–14] [15]
Theoretical analysis [16,17,19]
Thresholding — [20]
Graph methods [22,24] [21–27]
Statistical approaches [27] [27]
Persistence — [28]

Our method uses a different graph representation than is usual in the field of graph-based network
analysis and adds dynamicity to the graphs used, enabling us to distinguish multiple P2P networks
on the same host. The graph we propose changes together with the changes in the observed network
traffic, which allows for online analysis of network traffic.
3. FINDING COOPERATING HOSTS
Despite having an intrusion prevention system (IPS) deployed in the network and host machines
protected by antivirus, hosts can become infected, be it because of a zero-day attack or because of an
infection vector that is not covered by the security solution. When such a compromised host is found,
it is good practice to look for other hosts in the network that exhibit similar behavior to find other
potential victims of the infection.
One way to do this is to look at what hosts the infected machine communicates with, because these
would include the perpetrator. If any other machine shares a similar set of peers, it might be infected as
well. If two machines share a remote peer, we say they are cooperating in at least one overlay network.
In some cases, the shared overlay is benign, like Skype, whereas in other cases the shared overlay will
be malicious.
Comparing the hosts’ peers can be viewed as looking for set intersections: for each host we keep
a set of IP addresses of communication peers and look for pairwise intersections. The shortcoming of
this method is that it does not acknowledge that one host might participate in several P2P networks
at the same time. For a host that is infected by P2P-based malware and running Skype at the
same time, looking just at the intersections of sets containing IP addresses we end up marking all other
hosts in the network using Skype as suspicious, and we cannot distinguish between the two overlays.
Another issue with this approach is that the analysis is done offline, after the incident occurs. If we
chose to do it in real time, the definition of sets would have to be extended, e.g. to allow forgetting of
peers. It would also require extending the algorithm, for example by defining a time window in which
the two hosts must have an intersection of the peer sets.
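The intersection method and its first shortcoming can be illustrated with a minimal Python sketch. It is not part of the detector; the function name shared_peers and all addresses are hypothetical and chosen only for the example.

```python
from itertools import combinations


def shared_peers(contacts):
    """Return {(host_a, host_b): common remote IPs} for every pair of
    local hosts whose sets of remote peers overlap."""
    overlaps = {}
    for a, b in combinations(sorted(contacts), 2):
        common = contacts[a] & contacts[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical observations: 10.0.0.1 runs Skype and is also infected.
contacts = {
    "10.0.0.1": {"198.51.100.7", "198.51.100.8", "203.0.113.5"},
    "10.0.0.2": {"198.51.100.7", "198.51.100.8"},  # Skype only
    "10.0.0.3": {"203.0.113.5", "203.0.113.6"},    # infected only
}
overlaps = shared_peers(contacts)
```

Both 10.0.0.2 and 10.0.0.3 are linked to 10.0.0.1, and nothing in the result says which overlap belongs to the benign overlay and which to the malicious one.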
Both issues can be addressed by graph-based models. Graphs are commonly used to represent a P2P
network: vertices represent peers and edges represent connections between the peers. Graph formalism
has been used before in the task of P2P network detection. However, our graph formalism brings two
novelties: a different node representation and graph dynamicity.
All graph methods we have come across until now, with the exception of Kim et al. [29], used nodes
to represent hosts (or IP addresses). We move from this definition towards using nodes to represent
tuples (IP, port), which we call endpoints. This mitigates the first shortcoming of the ‘intersection’
method. If a host is participating in several P2P networks, its IP is the same, but the ports used for
communication in different networks are different. Therefore, a single host can be represented by
several nodes, each associated with a specific network. One host can certainly use several ports for
communication in one overlay, but there always needs to be one port listening for incoming
connections that does not change (often) so that other peers in the P2P network can contact it. Therefore,
one endpoint should be dominant among all endpoints associated with a single host and P2P overlay.
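The effect of the endpoint representation can be sketched as follows. This is an illustrative sketch only: the flow tuple layout (local IP, local port, remote IP, remote port), the function name build_endpoint_contacts and the sample flows are assumptions made for the example, not a NetFlow/IPFIX record format.

```python
from collections import defaultdict


def build_endpoint_contacts(flows):
    """Map each local endpoint (ip, port) to the set of remote endpoints
    it communicated with."""
    contacts = defaultdict(set)
    for local_ip, local_port, remote_ip, remote_port in flows:
        contacts[(local_ip, local_port)].add((remote_ip, remote_port))
    return dict(contacts)


# Hypothetical flows: host 10.0.0.1 takes part in two overlays on two ports.
flows = [
    ("10.0.0.1", 33033, "198.51.100.7", 40001),
    ("10.0.0.1", 33033, "198.51.100.8", 40002),
    ("10.0.0.1", 6881, "203.0.113.5", 51413),
    ("10.0.0.2", 33033, "198.51.100.7", 40001),
]
contacts = build_endpoint_contacts(flows)
```

Host 10.0.0.1 now appears as two separate nodes, so its overlap with 10.0.0.2 is attributed only to the endpoint on port 33033 and not to the endpoint on port 6881.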
To overcome the second shortcoming, we employ a graph algorithm that not only constructs the
graph but also modifies it with time, thus assuring that the graph describes the current state of the
overlay of the P2P network that we can observe. The notion of time which is necessary for such graph
dynamicity is captured by edge weights.
To detect the nodes of a P2P overlay network within our network we use a 3-partite weighted graph:
$G = (V, E, w)$
where
$V = V_c \cup V_s \cup V_r$
in which $V_c$ is a set of nodes from our network we believe are participating in the P2P network, $V_s$ is
a set of nodes from our network that we suspect are participating in the P2P network and $V_r$ is a set
of nodes from outside of our network communicating with nodes from $V_c \cup V_s$. $E$ is a set of edges.
Function $w$ assigns each edge a weight: a value equal to the time when the edge was added to the graph
or most recently updated.
We ignore all intra-network communication² and cannot see communication between the nodes that
are outside of our network. Therefore, the defined graph is indeed a 3-partite weighted graph. This also
implies that $G = ((V_c \cup V_s) \cup V_r, E)$ can be viewed as a bipartite graph.
3.1. The graph algorithm
Since P2P overlay networks are dynamically changing, so should the graph that represents a P2P
overlay network. The detection algorithm monitors network traffic and constructs (modifies) the graph
based on the observed network activity in the following way:
1. the graph starts with only the seed node $n \in V_c$;
2. when a network connection occurs between any node $n \in V_c$ and some node $m$ outside of our
network then there are two options:
(a) $m \in V_r$ already; in this case we just update $w(\{m, n\}) = \mathrm{current\_time}()$;
(b) $m \notin V_r$ yet; in this case we add $m$ to $V_r$ and $\{m, n\}$ to $E$ and set
$w(\{m, n\}) = \mathrm{current\_time}()$;
3. when a network connection occurs between any local node $n$ not yet in the graph and some node
$m \in V_r$, we add $n$ to $V_s$, add $\{m, n\}$ to $E$ and set $w(\{m, n\}) = \mathrm{current\_time}()$;
4. any edge $e \in E$ for which $t_{now} - w(e) > t_L$ is removed from the graph;
5. any node $n \in V$ is removed from the graph when it does not have any incident edge (it has a
zero degree);
6. if $(\exists m \in V_s)(\exists n \in V_c)(|\mathrm{Adj}(m) \cap \mathrm{Adj}(n)| > K)$ then we move $m$ from $V_s$ to $V_c$;
$\mathrm{Adj}(n)$ is a set of vertices adjacent to $n$.
There are two parameters used in this algorithm:
!amemory limit,tL, which specifies how long a recorded connection (an edge in the graph) is kept
in memory;
!amutual contacts overlap threshold,K,whichspecieshowmanymutualadjacentverticesa
node from Vsneeds to have with any node from Vcto be moved to Vc.
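The six steps and the two parameters can be sketched in Python as shown below. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper. The class name OverlayGraph, its method names and the sample endpoints are hypothetical. The sketch further assumes that the caller passes only connections between a local and an external endpoint, that a node already in $V_s$ also gains edges to known nodes from $V_r$ (the steps above do not say this explicitly), and that the seed is not removed before it has acquired its first edge.

```python
class OverlayGraph:
    """Sketch of the dynamic graph grown around one seed endpoint."""

    def __init__(self, seed, t_limit, k):
        self.confirmed = {seed}   # V_c, step 1: only the seed at the start
        self.suspected = set()    # V_s
        self.remote = set()       # V_r
        self.edges = {}           # (local, remote) -> time last seen, i.e. w
        self.t_limit = t_limit    # memory limit t_L
        self.k = k                # mutual contacts overlap threshold K

    def neighbours(self, local):
        """Adj(local): remote endpoints adjacent to a local node."""
        return {r for (l, r) in self.edges if l == local}

    def observe(self, local, remote, now):
        """Process one local-to-external connection seen at time `now`."""
        if local in self.confirmed:
            # Step 2: add the remote node if new, then add or refresh the edge.
            self.remote.add(remote)
            self.edges[(local, remote)] = now
        elif remote in self.remote:
            # Step 3, plus the assumption that suspected nodes keep adding edges.
            self.suspected.add(local)
            self.edges[(local, remote)] = now
        self._expire(now)
        self._promote()

    def _expire(self, now):
        # Step 4: drop edges older than the memory limit.
        expired = [e for e, t in self.edges.items() if now - t > self.t_limit]
        for e in expired:
            del self.edges[e]
        # Step 5: drop nodes that lost their last edge.
        alive = {n for e in self.edges for n in e}
        dead = {n for e in expired for n in e} - alive
        self.confirmed -= dead
        self.suspected -= dead
        self.remote -= dead

    def _promote(self):
        # Step 6: move suspected nodes with more than K mutual contacts to V_c.
        changed = True
        while changed:
            changed = False
            for m in list(self.suspected):
                adj_m = self.neighbours(m)
                if any(len(adj_m & self.neighbours(n)) > self.k
                       for n in self.confirmed):
                    self.suspected.remove(m)
                    self.confirmed.add(m)
                    changed = True


A, B = ("10.0.0.1", 6881), ("10.0.0.2", 6881)
R1, R2, R3 = ("198.51.100.1", 6881), ("198.51.100.2", 51413), ("198.51.100.3", 6881)
g = OverlayGraph(seed=A, t_limit=300, k=1)
for t, r in enumerate((R1, R2, R3)):
    g.observe(A, r, now=t)
g.observe(B, R1, now=10)  # one mutual contact: B is only suspected
g.observe(B, R2, now=20)  # two mutual contacts exceed K = 1: B is promoted
```

The sketch follows step 6 literally and requires strictly more than K mutual contacts. Storing edges in a plain dictionary keeps the example short; it scans all edges on every event and is not meant to reflect the computational performance reported in the evaluation.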
A discussion about the impact of the parameter choice can be found in the Appendix. The first
three steps of the algorithm with an example graph are depicted in Figure 1. The set $V_c$, at any given
moment, contains a list of active P2P nodes in the local network participating in the same overlay.
² Based on the deployment location of the NetFlow probes, the intra-network communication may or may not be available.
We chose not to use this information to avoid susceptibility of the algorithm to the probe deployment location.
Figure 1. Algorithm illustration. First we have a seed node A with three recorded contacts. In the
second time interval, another node, B, is observed, sharing two mutual contacts with A. If we consider
$K = 2$, then in the third step node B is already moved to $V_c$. Moreover, the algorithm detected yet
another node, which has only one mutual contact with a node from $V_c$. Note that the weights of the
edges in the graph are determined by the time step in which they occurred most recently
There is, of course, a possibility that the graph algorithm will not be able to find any cooperating
hosts for a certain seed. This might happen when the seed is the only peer of the respective P2P overlay
in the network, or when the seed node around which we tried to construct a graph was not participating
in any P2P overlay.
4. IDENTIFYING THE REVEALED P2P NETWORK
Assuming we revealed an overlay network of some P2P protocol, can we determine what particular
P2P network it is? Most of the P2P protocol identification methods rely on payload data or flow-
based statistics. Is it possible to infer the P2P protocol directly from the graph, without looking at any
flow-based or packet-based statistics?
There is an elegant way of identifying a P2P network based on port distribution of peers participating
in the overlay. The port distribution is an empirical probability distribution of ports used by peers and
is represented by a normalized vector with 65 535 elements. While it is often argued that port-based
P2P protocol identification is useless due to the port randomization, the port distribution of peers is
surprisingly stable. The method we propose takes advantage of this observation.
The proposed detector does not have the knowledge of the whole overlay, but we argue that even the
partial knowledge it has is sufficient to identify the P2P overlay using the port distribution of known
remote peers. Once the cooperating peers in the network are found, a port distribution over their remote
peers (nodes in set Vrfrom the graph) is generated and matched against the port distributions of known
P2P networks. The graph is considered to represent the P2P overlay that has the most similar port
distribution to the port distribution created over Vr. As of now, we have port distributions for a few
major P2P networks: Skype, BitTorrent, KAD (Kademlia), Gnutella. Their respective distributions can
be found in Figure 2.
This approach can be formalized as multinomial classification with a rejection option [30]: a
classifier decides whether the graph represents one of the several known networks, and if none of the known
P2P networks is similar enough it does not make a decision. The classifier uses a one-versus-all
strategy, where for each class there is a binary classifier $f_i(\cdot)$ that classifies elements in the respective
class. The overall classifier can then be defined as
$f(x) = \arg\max_i f_i(x)$
In our case, each binary classifier has the form of a dot product of the port distribution of remote
peers and a class prototype, which is also a port distribution of remote peers captured on the graph
representing the known network. One can easily see that this dot product will have values in the interval
$[0, 1]$, with the value getting higher the more similar the two vectors are. Since we do not have
prototype vectors for all existing P2P networks and we never will, it is crucial that the classifier has the
rejection option.

Figure 2. Port distributions of some of the major P2P networks: (a) BitTorrent; (b) Skype; (c) Gnutella;
(d) KAD. These distributions are distinctively different. For example, port distribution for Skype has
strong peaks on port 33 033, which is the legacy control port. Ports around 40 000 are the new control
ports that were introduced after Microsoft’s acquisition of Skype. On the other hand, the BitTorrent
port distribution has peaks at 6881 and 51 413, which are default ports for several BitTorrent
client applications

The classification of graph $G = (V_c \cup V_s \cup V_r, E, w)$ proceeds as follows:
1. we create port distribution vector $d$:
(a) we instantiate a vector $d$; $d_i = |\{n \mid n \in V_r \wedge \mathrm{port}(n) = i\}|$, $i \in [1, 65\,535]$, i.e. each
element of the vector contains the value equal to the number of remote peers whose port
is equal to the element index;
(b) we normalize $d$;
2. for each known class $C$ we define $f_C = \langle d, e_C \rangle$, where $e_C$ is the class prototype and $\langle a, b \rangle$
denotes a dot product of the two vectors; then the overall classifier is $\arg\max_C \langle d, e_C \rangle$;
3. we select class $C$ for which $\langle d, e_C \rangle$ is maximized as a possible match;
4. the maximal score $f_{\max} = \max_C f_C$ is compared with the rejection threshold $T$; if $f_{\max} > T$ the
classifier identifies the graph as representing class $C$, otherwise no decision is made.
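The classification steps can be sketched in Python as follows. This is a minimal sketch, not our implementation. It assumes that normalizing $d$ means scaling it to unit Euclidean length (so that identical distributions score 1) and stores the 65 535-element vector sparsely as a dictionary. The function names and the two prototype vectors are hypothetical toy data, not the measured distributions.

```python
import math
from collections import Counter


def port_distribution(remote_endpoints):
    """Steps 1(a) and 1(b): sparse port histogram over V_r, scaled to unit
    Euclidean length (the choice of norm is an assumption of this sketch)."""
    counts = Counter(port for _ip, port in remote_endpoints)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {port: c / norm for port, c in counts.items()}


def dot(a, b):
    """Dot product of two sparse vectors stored as {port: weight}."""
    return sum(w * b.get(port, 0.0) for port, w in a.items())


def classify(remote_endpoints, prototypes, threshold=0.3):
    """Steps 2-4: pick the best-matching class, or None if rejected."""
    d = port_distribution(remote_endpoints)
    scores = {name: dot(d, proto) for name, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


# Hypothetical prototypes built from invented remote peers.
prototypes = {
    "skype": port_distribution([("a", 33033)] * 3 + [("b", 40010)]),
    "bittorrent": port_distribution([("c", 6881), ("d", 6881), ("e", 51413)]),
}
observed = [("p1", 6881), ("p2", 51413), ("p3", 6881), ("p4", 12345)]
label = classify(observed, prototypes)
unknown = classify([("q1", 1111), ("q2", 2222)], prototypes)
```

Returning None for the second graph corresponds to the rejection option: its port distribution shares no ports with any prototype, so the best score does not exceed the threshold.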
The rejection option from step 4 is crucial. One of the dot products will always have the maximal
value. This value might be very low, which signals that the classifier is not very sure what P2P network
is represented by the graph. In such a case it is undesirable to make a decision. In our implementation,
we set the rejection threshold to 0.3. This value is high enough to avoid ‘accidental’ classification
because vectors in high dimensions are typically nearly orthogonal to each other. On the other hand, it
is low enough to allow for differences between graph port distributions and class prototypes caused by
different sampling of peers in the overlay (i.e. both distributions were created based on different sets
of peers).
4.1. Time stability
The classification approach we employ requires that the port distributions are stable in time. To show
that, we reconstructed overlays of BitTorrent and Skype using the graph algorithm described in Section
3.1 on two different networks at different points in time and compared their port distributions. The
capture times were approximately 9 months apart. For BitTorrent the dot product of the vectors
consisting of the relative port frequencies of the two distributions yielded a value of 0.93. This indicates
that BitTorrent peers’ port distribution is stable both in time and space.
For Skype the dot product of the vectors created in the same way yielded a value of only 0.4,
indicating that Skype peers’ port distribution is less stable. This can be explained by a major Skype
overlay architecture change that took place between the two points at which we reconstructed the
Skype overlay. A large number of Skype supernodes were moved to Microsoft’s own data centers,
relying less on nodes on users’ computers. New supernodes usually listen on ports between 40 000
and 40 100. Also, port 33 033, which was associated with supernodes owned by Skype itself, is less
prominent after the acquisition. If we compensate for this change (ignoring ports 33 033 and 40 000–
40 100 in the port distributions) the dot product yields a value of 0.78. The two are not as similar as in
the case of BitTorrent but there is still a strong similarity.
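The compensation step can be illustrated with a short Python sketch that compares two port histograms with and without the ports affected by the architecture change. The port counts are invented solely to show the mechanics and do not reproduce the values 0.4 and 0.78 reported above; the function names are hypothetical, and normalization to unit Euclidean length is an assumption of the sketch.

```python
import math


def normalise(counts, ignored=frozenset()):
    """Drop ignored ports and scale the histogram to unit Euclidean length."""
    kept = {p: c for p, c in counts.items() if p not in ignored}
    norm = math.sqrt(sum(c * c for c in kept.values()))
    return {p: c / norm for p, c in kept.items()} if norm else {}


def similarity(counts_a, counts_b, ignored=frozenset()):
    """Dot product of the two normalised histograms."""
    a, b = normalise(counts_a, ignored), normalise(counts_b, ignored)
    return sum(w * b.get(p, 0.0) for p, w in a.items())


# Invented histograms {port: number of remote peers} for two capture times.
earlier = {33033: 50, 443: 20, 80: 20, 12345: 5}
later = {40010: 30, 40020: 30, 443: 20, 80: 20, 12345: 5}

ignored = {33033} | set(range(40000, 40101))
raw = similarity(earlier, later)
compensated = similarity(earlier, later, ignored)
```

In this toy example the compensated similarity is higher than the raw one, because the remaining ports are distributed identically in both captures.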
The remaining two P2P overlays shown in Figure 2 were not detected in one of the networks;
therefore the comparison cannot be shown.
5. ENUMERATING ALL ACTIVE PEER-TO-PEER NETWORKS
The previous two sections enable us to find a P2P overlay given the knowledge of one peer in the
overlay and possibly identify it if it is of a known type. To find and enumerate (find all participating
local peers) all active P2P overlays in the network using the same technique, one needs a starter node
for each of the active P2P overlays. If we were to find one starter node for each active overlay in the
network we would need to know:
• what P2P overlays are active in the network;
• how to pinpoint a node participating in a given P2P overlay.
This is a rather recursive problem. To find the starter nodes we need the knowledge that is sufficient
to solve the original problem at hand—to find and enumerate all active P2P overlays. To circumvent
this issue, we can select all endpoints likely to be peers in some overlay and grow graphs around
them. If any of the chosen endpoints is an active peer, the graph would represent an overlay of the
P2P network it belongs to. The guidelines for selecting seeds as likely peers are based on two intrinsic
properties of all P2P overlays that must always hold true:
1. all peers need to listen for incoming connections on an arbitrary but stable port;
2. every peer needs to communicate with at least two other peers.
The first property emerges from the observation that each peer both receives and initiates
connections to other peers; otherwise the overlay would degenerate into the client/server paradigm. In other
words, each peer behaves like both the client and the server; thus it has to keep a port open for incoming
connections. Also, this port cannot be changed often because every change produces overhead in the
exchanged messages and decreases the effectiveness of the overlay. In structured P2P overlays, each
change of listening port is equivalent to leaving and rejoining the network with a different address,
which for example in the Chord overlay [31] requires $O(\log^2 N)$ messages [32]. In an unstructured
network, changing the listening port is also functionally equivalent to leaving and rejoining the network.
This does not require any update to the routing table, but former peers of the rejoining peer will not be
able to contact it any more. A new search in the network has to take place to re-establish the connection.
Therefore, peers try to avoid changing the listening port while being part of the overlay.

Figure 3. Schema of the detector. As an input it takes flows from the network which are processed
by the persistence module (denoted by PM). The set of seed endpoints is then transferred to the graph
module, which processes the flows induced by persistent endpoints and merges and deletes graphs
as needed. Sets of cooperating peers are sent to the identification module (denoted by IM). The output
of the detector consists of sets of endpoints that appear to be cooperating in P2P networks with
identification of the P2P network if available

Moreover, it is not sufficient and/or desirable for peers in an overlay to be in touch with only one
other peer. In the worst-case scenario, where each peer knows the address of only one other peer, we
effectively end up with a round-robin structure which is inferior in terms of search within the P2P
network. Besides that, data transfers would have to pass unrelated nodes that are not interested in the
data in transfer. When we consider a P2P overlay where only some peers have degree greater than
1, which is an extreme form of scale-free network, we encounter some reliability issues. Such a P2P
network would have several nodes with very high degree and many nodes with lesser degree. If we
were to remove a high degree node from the scale-free graph with minimal degree >1, we would still
maintain the connectivity within the graph. In contrast, if there were nodes with degree 1 connected to
the removed high degree node, these would be disconnected from the network. This shows that we can
expect active peers to have more than one other peer.
These two properties, combined with the results of the previous two sections, lead to the proposal
of a detector with three modules, which is depicted in Figure 3. The graph module revolves
around the graph algorithm described in Section 3.1, maintaining several graphs simultaneously. The
identification module represents the classifier described in Section 4. Finally, the persistence module
brings the ability to select seeds around which the graphs are constructed based on the intrinsic P2P
overlay properties.
5.1. Persistence module
The graph algorithm used in the graph module needs a seed node around which it constructs the
connection graph. The sole purpose of the persistence module is to find such nodes. To select seeds for the graph
module, we utilize two criteria that are based on basic P2P network properties: the persistence criterion
and the peer count criterion.
The persistence criterion is based on the first intrinsic property of P2P networks. Peers also need to
communicate with each other to keep the overlay functional. They exchange messages for the purpose
of routing table updates, searches, data exchanges, etc. The criterion states that we choose endpoints
that are persistent, i.e. are sending or receiving data for longer periods of time. During normal network
activity, a single host uses many ports to communicate. Most of these ports are used only for a short
period of time; these are called ephemeral ports. However, there are some ports that are kept open—
these are usually used for listening for incoming connections.
To illustrate this, we performed a small experiment on a university network, in which we were
monitoring network traffic in ten 5-minute intervals. In the first time interval we recorded all active
endpoints in our network. In the following nine time intervals we recorded whether the given endpoints
were reused. This way, we were able to create a histogram showing the number of endpoints used in
either one, two or up to ten time intervals. The histogram can be found in Figure 4. We can see that
most endpoints were used only in one time interval during the experiment. Then the trend declines,
with the exception of endpoints that were used during all time intervals. We believe that these are the
endpoints that represent services (such as web servers or IMAP servers) or active peers of peer-to-
peer networks.
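The counting behind such a histogram can be sketched as follows. This is a minimal sketch, assuming each time interval is available as a set of active endpoints and that an endpoint is represented as a hypothetical (IP address, port) pair; it is not the code used in the experiment.

```python
from collections import Counter


def activity_histogram(intervals):
    """Count, for endpoints seen in the first interval, in how many intervals each was active.

    `intervals` is a list of sets of endpoints, one set per time interval.
    Returns a Counter mapping "number of active intervals" to "number of endpoints".
    """
    histogram = Counter()
    for endpoint in intervals[0]:
        active = sum(1 for seen in intervals if endpoint in seen)
        histogram[active] += 1
    return histogram


web = ("10.0.0.5", 80)        # hypothetical long-running service
client = ("10.0.0.7", 51234)  # hypothetical ephemeral client port
intervals = [{web, client}] + [{web} for _ in range(9)]
histogram = activity_histogram(intervals)
print(histogram[1], histogram[10])  # 1 1
```

Only endpoints recorded in the first interval are tracked, which mirrors the experimental setup described above.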
[Figure 4 plot. x-axis: number of time steps endpoint was active (1 to 10); y-axis: number of endpoints (0 to 7000)]
Figure 4. The histogram shows that the majority of endpoints are active only at one or two time
intervals. We can then see only a marginal number of endpoints being active between three and nine
time steps. All services that run steadily and are regularly used are active at all 10 time windows
To define persistence of endpoints formally, we use a simplified method of measuring persistence
introduced in Giroire et al. [28]. The original method was focused on revealing hidden C&C channels.
We are interested only in persistence of endpoints, no matter where they connect to. We are not trying
to detect the exact periodicity of connections, but an ongoing character of a connection. The regularity
of endpoint activity is observed by a sliding window $W$, which is split into $n$ bins $b_1, \ldots, b_n$. This
window is called the observation window and bins are called measurement windows. We can write
$W = [b_1, b_2, b_3, \ldots, b_n]$. We then define persistence of an endpoint as follows:

$$p(e, W) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{e, b_i}$$

where $e$ is the endpoint for which the persistence is calculated, $W$ is the observation window and
function $\mathbf{1}_{e, b_i}$ is equal to 1 if at least one connection to or from the endpoint $e$ occurred during the
measurement window $b_i$; otherwise it is equal to 0.
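The definition translates directly into code. The following is a minimal sketch, assuming that each measurement window is stored as the set of endpoints active in it and that an endpoint is a hypothetical (IP address, port) pair.

```python
def persistence(endpoint, window):
    """Fraction of measurement windows (bins) in which `endpoint` was active.

    `window` is the observation window: a list of n bins, each bin being the
    set of endpoints that sent or received at least one connection in it.
    """
    if not window:
        raise ValueError("observation window must contain at least one bin")
    return sum(1 for bin_ in window if endpoint in bin_) / len(window)


peer = ("192.0.2.10", 6881)  # hypothetical endpoint (IP address, port)
window = [{peer}] * 8 + [set()] * 2
print(persistence(peer, window))  # 0.8
```

An endpoint seen in 8 of 10 bins therefore has persistence 0.8, regardless of which remote hosts it talked to.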
The persistence calculation itself is based on three parameters: measurement window size, which
states how long the connections are recorded into one bin before proceeding to another; observation
window size $n$, which determines how many bins there are in the observation window; and the threshold
persistence $p^{*}$, which determines what persistence an endpoint must reach to be considered for
seed selection.
The peer count criterion is based on the second intrinsic property of P2P overlays: an active peer
keeps in touch with more than one other peer. Therefore, only endpoints that had at least two distinct
peers in the last observation window pass this criterion. This removes long-lasting connections between
two peers on static ports, such as clients downloading large files from the Internet or users connecting
to other computers via Remote Desktop or SSH.
With these two criteria together, each time the module is queried for seeds it calculates persistence
for all recorded endpoints and selects those with persistence exceeding the persistence threshold $p^{*}$.
Those selected are then checked against the second criterion, which is the number of contacted peers
during the last observation window.
Of course, not all selected seeds represent active peers in some P2P network. However, we argue
that all active peers should be selected as seeds.
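The combination of the two criteria can be sketched as follows. This is a sketch under assumptions: the data structures (`window`, `remote_peers`) are hypothetical, and the comparison with the threshold uses ">=" although the text leaves open whether the threshold must be reached or strictly exceeded.

```python
def select_seeds(window, remote_peers, threshold, min_peers=2):
    """Return endpoints passing both the persistence and the peer count criterion.

    window       -- list of bins, each a set of endpoints active in that bin
    remote_peers -- dict: endpoint -> set of distinct peers contacted during
                    the last observation window
    threshold    -- persistence threshold (">=" is an assumption here)
    """
    n = len(window)
    candidates = set().union(*window)
    seeds = set()
    for endpoint in candidates:
        p = sum(1 for bin_ in window if endpoint in bin_) / n
        if p >= threshold and len(remote_peers.get(endpoint, ())) >= min_peers:
            seeds.add(endpoint)
    return seeds


p2p = ("10.0.0.1", 4672)         # persistent, several peers
ssh = ("10.0.0.2", 22)           # persistent, single peer
ephemeral = ("10.0.0.3", 50123)  # many peers, but short-lived
window = [{p2p, ssh}] * 9 + [{ssh, ephemeral}]
remote_peers = {p2p: {"r1", "r2", "r3"}, ssh: {"r4"}, ephemeral: {"r5", "r6"}}
print(select_seeds(window, remote_peers, threshold=0.8))
```

In this toy input only the first endpoint is selected: the SSH endpoint fails the peer count criterion and the ephemeral endpoint fails the persistence criterion.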
5.2. Graph module
The graph module is responsible for:
• constructing graphs around the seed endpoints received from the persistence module;
• merging similar graphs;
• removing graphs that failed to find any cooperating host for the given seed endpoint.
The graph module uses the graph algorithm presented in Section 3.1. Knowing that all active peers
should be represented by an endpoint that is persistent, we can also reduce the number of flows we
process in each graph. This reduction is attained by removing all flows that neither originate from
nor are directed towards a persistent endpoint, since this traffic should not be part of the
overlay. Of course, a significant reduction in the number of flows can still be attained when using a
persistence threshold lower than 0.8 for filtering flows. In our own implementation we set the
persistence threshold for flow selection to 0.3. This has the advantage of detecting active peers even
before they are able to reach the persistence value required by the persistence module to be selected
as seeds.
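The flow reduction described above can be sketched as a simple filter. The flow representation (a pair of source and destination endpoints) and the precomputed `persistence_of` mapping are assumptions made for this illustration only.

```python
def filter_flows(flows, persistence_of, threshold=0.3):
    """Keep only flows that originate from or are directed to a persistent endpoint.

    flows          -- iterable of (source endpoint, destination endpoint) pairs
    persistence_of -- dict: endpoint -> persistence in [0, 1]
    threshold      -- persistence threshold for flow selection (0.3 in the text)
    """
    kept = []
    for src, dst in flows:
        best = max(persistence_of.get(src, 0.0), persistence_of.get(dst, 0.0))
        if best >= threshold:
            kept.append((src, dst))
    return kept


listener = ("10.0.0.1", 4672)
flows = [
    (("198.51.100.7", 40000), listener),              # towards a persistent endpoint
    (("10.0.0.9", 51000), ("203.0.113.5", 52000)),    # both endpoints short-lived
]
persistence_of = {listener: 0.4, ("10.0.0.9", 51000): 0.1}
print(filter_flows(flows, persistence_of))
```

With the lower threshold of 0.3, the listening endpoint with persistence 0.4 already contributes its flows to the graphs, even though it would not yet qualify as a seed at 0.8.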
However, before the module can construct any graph, it first needs to receive seed endpoints from
the persistence module. The persistence module feeds seed endpoints to the graph module periodically.
When the module receives the first set of seed endpoints it creates a graph for each of them. For every
subsequent set of received seed endpoints it checks whether given seed endpoints are already recorded
in any of the graphs. For those that are not, it creates new graphs. This way we prevent the creation of
unnecessary duplicate graphs.
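The handling of incoming seed sets can be sketched as follows. For brevity a graph is reduced to a hypothetical set of its endpoints; the real graphs of Section 3.1 carry more structure.

```python
def admit_seeds(graphs, seeds):
    """Create a new graph for every seed that is not yet recorded in any graph.

    graphs -- list of graphs, each simplified to a set of endpoints (assumption)
    seeds  -- iterable of seed endpoints received from the persistence module
    Returns the list of seeds for which a new graph was created.
    """
    created = []
    for seed in seeds:
        if not any(seed in graph for graph in graphs):
            graphs.append({seed})
            created.append(seed)
    return created


graphs = [{"A", "B"}]
print(admit_seeds(graphs, ["B", "C", "C"]))  # ['C']
print(len(graphs))                           # 2
```

Seed "B" is already part of an existing graph and the repeated seed "C" is admitted only once, so no duplicate graphs appear.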
Since we expect this method to find cooperating endpoints and all active peers to be selected as seeds,
we should after some time construct graphs that are very similar and describe the same P2P network
despite starting from different seed endpoints. There is no point in keeping such graphs separate;
therefore the module merges them. This raises the question, though, of how to define ‘similarity’
of two graphs.
Two graphs that represent the same P2P network should have similar sets $V_c$, but since both graphs
were iteratively constructed from different seed nodes they do not necessarily contain similar sets of
edges or set $V_r$. Therefore we define the similarity of two graphs $G_1$ and $G_2$ as

$$s(G_1, G_2) = \frac{|V_c^{G_1} \cap V_c^{G_2}|}{\min\left(|V_c^{G_1}|, |V_c^{G_2}|\right)}$$

where $V_c^{G_1}$ and $V_c^{G_2}$ represent $V_c$ of graphs $G_1$ and $G_2$, respectively. This definition ensures that
similarity of two graphs $G_1, G_2$ is high (in fact equal to 1) if $V_c^{G_1} \subset V_c^{G_2}$ and $|V_c^{G_1}| \ll |V_c^{G_2}|$. This
is a case of two graphs that represent the same P2P network but one of them is much smaller (either
because it was created later or because the seed was not as ‘active’ as the seed of the bigger graph).
We merge two graphs if their similarity is greater than the merge overlap threshold, which is another
algorithm parameter. Note that the similarity metric is similar to the Jaccard index. The difference is
in similarity of two sets where one is the subset of another. Our metric gives the two a full similarity
value, 1, but the Jaccard index reaches a value of 1 only for identical sets.
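The difference between the two metrics is easy to see on a small example. The sketch below assumes that the sets $V_c$ of both graphs are available as plain Python sets; the function names and the handling of empty sets are choices made for this illustration.

```python
def similarity(vc1, vc2):
    """Overlap of two V_c sets, normalised by the size of the smaller set."""
    if not vc1 or not vc2:
        return 0.0  # assumption: an empty graph is similar to nothing
    return len(vc1 & vc2) / min(len(vc1), len(vc2))


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def should_merge(vc1, vc2, merge_overlap_threshold):
    """Merge two graphs if their similarity is greater than the threshold."""
    return similarity(vc1, vc2) > merge_overlap_threshold


small = {"p1", "p2"}
big = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"}
print(similarity(small, big))         # 1.0
print(jaccard(small, big))            # 0.25
print(should_merge(small, big, 0.5))  # True
```

A young graph that is fully contained in an older, larger one is merged immediately under this metric, whereas a Jaccard-based rule with the same threshold would keep the two apart.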
There is, of course, a possibility that the graph algorithm will not be able to find any cooperating
hosts for a certain seed. This might happen when the seed is the only peer of the respective P2P overlay
in the network, or when the seed node around which we tried to construct a graph was a service, e.g.
an email server. If any graph fails to find at least one cooperating endpoint in the network for a certain
period of time, called the tryout period, it is removed from the module. Even though we remove the
graph, it might be recreated next time the seed nodes are received from the persistence module, because
the endpoint might be active despite the fact it has no cooperating nodes. Therefore we define another
time parameter, the ignore period, which determines how long after removing a graph with a specific
seed node this seed node may not be used to construct another graph. We do not want to ignore the
given seed endpoint forever, because a service using the port may change or a cooperating peer might
appear later.
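The interplay of the tryout period and the ignore period can be sketched as follows. The bookkeeping structures (`graphs`, `ignored_until`) and the use of minutes as the time unit are assumptions made for this illustration.

```python
def remove_idle_graphs(graphs, ignored_until, now, tryout, ignore):
    """Drop graphs that found no cooperating endpoint within the tryout period.

    graphs        -- dict: seed -> {"created": time, "cooperating": set of endpoints}
    ignored_until -- dict: seed -> time until which the seed must not be reused
    now, tryout, ignore -- times and periods in minutes (assumed unit)
    """
    for seed in list(graphs):
        graph = graphs[seed]
        if not graph["cooperating"] and now - graph["created"] >= tryout:
            del graphs[seed]
            ignored_until[seed] = now + ignore


def may_create_graph(seed, ignored_until, now):
    """A seed may be used again once its ignore period has elapsed."""
    return now >= ignored_until.get(seed, 0)


graphs = {
    "mail-server": {"created": 0, "cooperating": set()},
    "kad-peer": {"created": 0, "cooperating": {"other-kad-peer"}},
}
ignored_until = {}
remove_idle_graphs(graphs, ignored_until, now=60, tryout=60, ignore=60)
print(sorted(graphs))                                      # ['kad-peer']
print(may_create_graph("mail-server", ignored_until, 90))   # False
print(may_create_graph("mail-server", ignored_until, 120))  # True
```

The seed is only suppressed temporarily, which matches the requirement that a service on the port may change or a cooperating peer may appear later.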
The identification module simply accepts graphs from the graph module, extracts port distribution
for the remote peers and performs protocol identification as described in Section 4.
6. EVALUATION
We evaluate both the detection performance and the computational performance of the proposed detector.
For each part of the evaluation we use a different dataset. In both cases the traffic was collected in the
form of NetFlow data by a network probe. Flows were always collected for 5 minutes and then sent
in a batch to the detector. Batch processing (rather than stream processing) and the batch size are
settings of the anomaly detection engine in which the detector was deployed. However, the detector
can process flows in a streaming fashion or work with batches of a different size.
The computational performance is evaluated on a relatively large Telco provider network to test the
detector under a heavy load. Since we could not tamper with the network in any way, the detection
performance evaluation was done on a much smaller university network. In this network, we could
deploy our own P2P nodes and thus establish the ground truth.
Several parameters can be set for the detector. In our experiments we fixed the value of the tryout
period to 1 hour. The ignore period was set to 1 hour as well, but it was increased by 1 hour for the given
seed every consecutive time the graph around that seed was removed because it failed to find any
cooperating peers. The measurement window had a size of 5 minutes. Each observation window was
composed of 10 bins, which we believe offered a good granularity of information. Values of other
parameters that we experimented with in the evaluation to find the best combination can be found in
Table 3.
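The growing ignore period used in the experiments can be written as a small schedule. This is a sketch of one reading of the description above, in which the k-th consecutive removal of a graph around the same seed leads to an ignore period of k hours; the function name and parameters are hypothetical.

```python
def ignore_period_hours(consecutive_removals, base_hours=1, step_hours=1):
    """Ignore period after the given number of consecutive removals of a seed's graph."""
    if consecutive_removals < 1:
        raise ValueError("the graph must have been removed at least once")
    return base_hours + step_hours * (consecutive_removals - 1)


print([ignore_period_hours(k) for k in (1, 2, 3)])  # [1, 2, 3]
```

A seed that repeatedly fails to attract cooperating peers, such as a busy mail server, is therefore retried less and less often.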
6.1. Dataset description
The university dataset was collected in the university network consisting of approximately 1000 hosts.
The network traffic was collected for 20 hours during a working day. Since we did not have access to
all the computers and could not establish the ground truth concerning the overall network activity, i.e.
what service every endpoint in the local network belonged to, we chose 155 hosts from two subnets as
a small control set.
The first subnet contained 36 hosts, of which 18 were running either Windows or Linux. We refer to
these hosts as client hosts. The client hosts were engaged in casual Internet activity, such as browsing
the web, working with email, listening to music, watching videos, etc. On these we also installed client
applications for several P2P networks, where one host could participate in several P2P networks. The list of
installed client applications can be found in Table 4.
To examine whether the algorithm was capable of linking hosts participating in a botnet, we infected
three computers with Trojan.Sirefef-6 malware, which uses a P2P overlay for its C&C [33]. We set all
client applications belonging to the same P2P network to use the same port to ease evaluation of
the results. This had no effect on detection capabilities of our algorithm.

Table 3. Parameters and their values used in the experiment. Parameter
values used in the final evaluation are shown in bold

Parameter                            Values
Persistence threshold                0.5, 0.8
Mutual contacts overlap threshold    3, 4, 5, 6
Memory limit                         60, 90, 120 min
Merge overlap threshold              0.3, 0.5, 0.7

Table 4. List of peer-to-peer networks with their
respective clients installed on the client hosts in the
control set. The last column specifies how many hosts
are running a given client application

Network       Client application    Hosts
Skype         Official client       18
BitTorrent    µTorrent              26
KAD           eMule                 15
Gnutella      Phex                  18
The second subnet contained servers; we refer to these hosts as server hosts. None of them
was running any of the aforementioned applications. They ran many services, such as web servers,
IMAP/POP services and others.
The Telco dataset contains traces mainly of homes with a DSL uplink. The network encompasses
tens of thousands of users and has a throughput of 40 Gbps, with the number of flows per 5 minutes
ranging approximately from 2 million to 11 million. The dataset spans 3 days in November 2011. For
the computational performance evaluation we are only interested in the size of the network; therefore
we do not provide any additional information.
6.2. Computation performance evaluation
For the sake of performance evaluation, the detector was deployed on a 24-core Intel Xeon computer
(24 virtual on 12 physical cores) and tested on the Telco dataset. We monitored processing times of the
three modules and retained memory as a function of time (and thus number of flows in the network).
In Figure 5(a) we can see a stacked plot of the processing times of the three modules. It is obvious
that the graph algorithm takes most of the time, whereas seed selection and P2P identification take only
a fraction of the time. There is also a clear relationship between number of flows and the processing
time. One can see trends in the traffic as the users use their computers most in the evening or at night.
The computation time peaks at around 300 seconds, which is also the time span covered by one processed
batch. The detector reaches its limits when dealing with 10M+ flows (approximately 34k+ flows
per second). While it is partially parallelized, it could certainly be optimized to allow for greater
throughput. As for memory consumption, shown in Figure 5(b), this exhibits the same dependence on
number of flows. One more thing can be noted from the plot: the memory footprint is slightly lower in
the third peak, which is caused by blacklisting of seeds. The memory footprint ranges from 3 GB to
15 GB during the peak hours.
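The relation between batch size and flow rate is simple arithmetic, sketched below for the batch sizes quoted for the Telco dataset. The helper names are hypothetical; the only inputs are the 5-minute batch length and the flow counts from the text.

```python
BATCH_SECONDS = 5 * 60  # flows are delivered in 5-minute batches


def flows_per_second(flows_in_batch):
    """Average flow rate implied by the number of flows in one batch."""
    return flows_in_batch / BATCH_SECONDS


def keeps_up(processing_seconds):
    """The detector keeps up only if a batch is processed before the next one arrives."""
    return processing_seconds < BATCH_SECONDS


for flows in (2_000_000, 10_000_000, 11_000_000):
    print(flows, round(flows_per_second(flows)))
```

A batch of 10 million flows corresponds to roughly 33 thousand flows per second, which is in line with the limit quoted above; a processing time close to 300 seconds means that the detector finishes a batch just as the next one arrives.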
6.3. Detection performance evaluation
6.3.1. Detection rate
Since the algorithm runs continually and modifies the graphs according to the changes in the network,
we measure detection rate in time. Thanks to this we can see how much time the detector needs to
detect a P2P network since the start of the client application.
After every batch of flows (every 5 minutes) we query the identification module for the list
of recognized P2P networks and nodes that participate in them. As can be seen in Figure 6(a),
the algorithm was able to find all hosts participating in Skype, BitTorrent, Kademlia and
Trojan.Sirefef-6 peer-to-peer networks. On the other hand, detection rate for Gnutella was considerably
lower: 44%.

[Figure 5 plots: (a) processing times, processing time [s] versus time [h]; (b) memory footprint, memory footprint [MB] and number of flows versus time [h]]
Figure 5. Stacked plots of (a) processing time and (b) memory footprint as a function of time (and
thus number of flows). The graph algorithm takes most of the processing time required by the detector.
Strong trends are visible in the data that are caused by the users usually using computers in the evening
or at night

[Figure 6 plots: (a) detection rate versus minutes since starting of nodes in particular P2P network; (b) detection rate versus memory limit]
Figure 6. Detection rate of the proposed detector: (a) detection rate as a function of time; (b) detection
rate as a function of memory limit and mutual contacts threshold
Figure 6(a) also shows that detection of P2P nodes is not immediate and the algorithm needs some
time before it detects them. All P2P networks except Gnutella were at least partially detected within
an hour after the client applications were started. One can also notice that some Skype nodes were
identified even before the endpoints representing the Skype clients reached the required persistence
threshold. This happened as a result of the other Skype nodes commonly running in the university
network—the graph for Skype was already present when we started the Skype clients in the control set
and endpoints belonging to these clients were simply added to the graph without the need to become a
graph seed. This illustrates an important property of the detector: peers that join the overlay for which
a graph already exists are detected much faster than the first peer(s) of a P2P network that does not yet
have a corresponding graph.
6.3.2. False positive rate
For various P2P networks we use different methodologies for evaluation of false positives. For KAD,
Gnutella, BitTorrent and Trojan.Sirefef-6, we consider every detected endpoint not associated with the
host from the control set and the respective listening port of the client application to be a false positive.
Since these P2P networks are used only rarely at the university, such an approach is viable. Using this
approach we determine the upper bound of the false positives detected by our algorithm. We cannot
do the same with Skype since it is very popular at the university. Instead, we evaluate false and true
positives only on the control set.
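The upper-bound counting of false positives used for KAD, Gnutella, BitTorrent and Trojan.Sirefef-6 can be sketched as follows. The endpoint representation, the addresses and the port number are hypothetical.

```python
def false_positives(detected, control_hosts, listening_port):
    """Detected endpoints that are not (control-set host, listening port) pairs.

    detected       -- set of detected endpoints as (IP address, port) pairs
    control_hosts  -- set of IP addresses of hosts in the control set
    listening_port -- port configured for the client application on those hosts
    """
    return {
        (ip, port)
        for ip, port in detected
        if not (ip in control_hosts and port == listening_port)
    }


control_hosts = {"10.1.0.4", "10.1.0.9"}
detected = {("10.1.0.4", 4672), ("10.1.0.9", 4672), ("10.7.3.2", 4672)}
print(false_positives(detected, control_hosts, listening_port=4672))
```

The third endpoint is counted as a false positive even though it might belong to a genuine peer outside the control set, which is why the result is only an upper bound.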
There were no false positives for four of the P2P networks: Skype, KAD, Gnutella and
Trojan.Sirefef-6. Only one false positive was found when linking cooperating hosts in the BitTorrent
overlay. We refrain from calculating the false positive rate, since it would only have a negligible value
due to the low number of false positives.
7. INTERPRETATION OF RESULTS
To explain the inferior performance with Gnutella detection we need to investigate how nodes in various
P2P networks communicate. Some P2P overlays use UDP-based communication, while others use
TCP-based communication or a combination of the two. If the P2P overlay is UDP-based, both incom-
ing and outgoing connections use the main port on which our detector focuses. On the other hand, if the
P2P overlay is TCP-based, the main port is used only by incoming connections. Outgoing connections
use ephemeral ports assigned by the operating system which change frequently.³ Therefore, for
TCP-based overlays our detector can only take advantage of the node’s incoming connections (because
those target the main port). Of the P2P networks used in our experiments, Kademlia and
Trojan.Sirefef-6 both use UDP for their P2P overlay, Gnutella uses TCP for its P2P overlay, and Skype
and BitTorrent use a combination of the two. Therefore, in order to detect Gnutella there needs to
be a reasonable number of incoming connections from remote peers to increase the chance of having
mutual contacts with other local nodes in the Gnutella overlay. Gnutella has two types of peers: leaf
nodes and ultrapeers. Leaf nodes only connect to ultrapeers and ultrapeers connect to both ultrapeers
and leaf nodes. Also, ultrapeers have a higher frequency of connections with other peers. There-
fore it is much more probable to detect and link together Gnutella ultrapeers than ordinary hosts.
And indeed, most of the cooperating hosts found for the Gnutella network were in fact ultrapeers.
The longer ramp-up period for Gnutella is due to the fact that it took time until the detected nodes
became ultrapeers.
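The asymmetry between UDP-based and TCP-based overlays described above can be made concrete with a few toy flows. The addresses and port numbers below are hypothetical, and a flow is simplified to a pair of (IP address, port) endpoints.

```python
def touches_main_endpoint(flow, host, main_port):
    """True if the flow starts or ends at the host's main (listening) endpoint."""
    src, dst = flow
    return src == (host, main_port) or dst == (host, main_port)


host, remote = "10.0.0.1", "198.51.100.7"

# UDP-based overlay: the same socket, bound to the main port, sends and receives.
udp_flows = [((host, 4000), (remote, 5000)),    # outgoing
             ((remote, 5000), (host, 4000))]    # incoming
# TCP-based overlay: outgoing connections leave from ephemeral ports.
tcp_flows = [((host, 51234), (remote, 7000)),   # outgoing
             ((remote, 40211), (host, 7000))]   # incoming

print(sum(touches_main_endpoint(f, host, 4000) for f in udp_flows))  # 2
print(sum(touches_main_endpoint(f, host, 7000) for f in tcp_flows))  # 1
```

Only half of the TCP flows contribute to the graph around the main endpoint in this example, which is consistent with the slower and less complete detection of the TCP-only Gnutella overlay.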
We mentioned that Skype and BitTorrent use both TCP and UDP. Skype uses both protocols in the
single overlay and switches between them as necessary. On the other hand, BitTorrent keeps a separate overlay
network on TCP and UDP. TCP is used in the original BitTorrent protocol for downloading files in the
swarm where the first set of peers is received from a tracker. UDP is used in the newest implementations
for distributed tracker functionality to avoid using centralized trackers when it is not necessary. This
overlay is BitTorrent’s own DHT implementation. Therefore, there are two possibilities of how to
detect BitTorrent clients: via DHT overlay or via the original BitTorrent overlay. Our experiments show
that the detector is capable of linking BitTorrent clients using either protocol.
Here we need to realize the difference between the original BitTorrent protocol and other peer-to-
peer protocols in this evaluation. While other P2P networks maintain an overlay network at all times,
the original BitTorrent overlay is intermittent. The client participates in the overlay only when it wants to download a
file and joins a swarm (unless it is using DHT). Therefore, when we talk about detecting cooperating
hosts for BitTorrent using only the BitTorrent protocol, we mean hosts that are members of the same
swarm—not all BitTorrent clients in the network.
We mentioned that our detector is able to identify both overlays run by the BitTorrent client. Of
the two, detection of the UDP-based DHT implementation is faster because communication in this
overlay starts as soon as the client application is launched, without any user activity. In fact, all P2P
networks in our experiment, with the exception of BitTorrent’s original protocol, were detected with-
out any user activity. The original BitTorrent overlay can be observed only after the user initiates a
file download.
8. CONCLUSION
In this paper we presented a detector that is able to link hosts cooperating in a P2P overlay and identify
this overlay if it is of a known type. The detector uses only inherent properties of P2P networks. It
reconstructs the P2P overlay based on the connections observed in the network. Since the method does
not use either packet payloads or flow statistics, it is a viable option for deployment on the backbone
network where computationally expensive models are not an option. Identification of the overlay is
based on port distributions which we show are stable in time.
In the process of designing the detector we address the following questions:
1. Having one peer in an unknown P2P network, are we able to find other peers in the respective
network?
2. Can we determine what particular P2P network it is?
3. Can we enumerate all P2P networks and their peers active in the monitored network?
³ Endpoints representing ephemeral ports might occasionally appear in the graph of the P2P overlay. In a rigorous
understanding, these endpoints are true positives because they are used for the communication in the P2P overlay. On
the other hand, they are present in the same graph as the endpoint representing the main listening port on the same host.
We have therefore ignored these ephemeral endpoints in the evaluation.
In Section 3 we showed how, using our graph algorithm, we can find other hosts participating in
the same P2P overlay network as the first peer. The algorithm is based on monitoring mutual peers
of hosts. The next logical step was to identify the observed P2P overlay, and we presented a simple
classification method based on remote peers port distribution in Section 4. Finally, in Section 5 we
showed how to select peers that were likely participating in a P2P overlay and which were used as an
input to the graph algorithm.
The method was able to detect all cooperating peers in most of the P2P networks and attained almost
zero false positive rate in the controlled experiment.
We believe that this method presents a viable approach to detecting peers in overlay networks: both
well-known file-sharing networks and specialized peer-to-peer networks used by botnets as a C&C
channel. It has been used as part of an anomaly detection engine for 2 years now.
APPENDIX: PARAMETERS AND THEIR IMPACT
The persistence module and graph module can be tuned using several parameters that were introduced
in the text.
Parameters of the persistence module impact mainly computational performance, but they can also
affect the detection performance if chosen improperly. Tryout period and ignore period affect only
computation performance and have no impact on the detection or false positive rate. Measurement
window size and observation window size determine which endpoints are selected as seeds. As the
measurement window size increases, endpoints need to be active for a longer period of time to be
selected. Observation window size determines how fine-grained the calculated persistence is. For example,
using only two bins an endpoint can have only one of the three values of persistence: 0, 0.5 or 1.
If the measurement window is too small, even ephemeral ports can have high persistence, which
would increase the number of graphs to be processed by the graph module. The last parameter used
by the persistence module is persistence threshold. The choice of its value impacts the computational
performance: the higher the value, the fewer models are created in the graph module and thus fewer
resources are needed to process the graphs. Another important impact of this parameter is on the
duration of the ramp-up period in detection. The higher the value, the longer it takes for the detector to
find nodes participating in new P2P overlays since the client application is started. Each endpoint needs
to be active for a specified time before a graph is created for the given endpoint. As the persistence
threshold increases, the time required gets longer.
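Both effects, the granularity of persistence values and the lower bound on the ramp-up time, can be worked out numerically. The sketch below assumes that an endpoint is selected once its persistence reaches the threshold and that it is active in every bin from the start; real ramp-up times also include the time the graph algorithm needs afterwards.

```python
import math
from fractions import Fraction


def possible_persistence_values(n_bins):
    """All persistence values an endpoint can take with n_bins measurement windows."""
    return [k / n_bins for k in range(n_bins + 1)]


def min_ramp_up_minutes(threshold, n_bins, bin_minutes):
    """Shortest activity time after which an endpoint can reach the threshold."""
    # Fraction avoids floating-point surprises when multiplying the threshold.
    needed_bins = math.ceil(Fraction(str(threshold)) * n_bins)
    return needed_bins * bin_minutes


print(possible_persistence_values(2))   # [0.0, 0.5, 1.0]
print(min_ramp_up_minutes(0.8, 10, 5))  # 40
print(min_ramp_up_minutes(0.5, 10, 5))  # 25
```

With 10 bins of 5 minutes, a persistence threshold of 0.8 means that a new peer cannot become a seed earlier than 40 minutes after it starts, compared with 25 minutes for a threshold of 0.5.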
The graph module uses three parameters: merge overlap threshold, mutual contacts overlap threshold
and memory limit. The graph module is responsible for merging graphs that represent the same P2P
overlay, and merge overlap threshold is the parameter that states how strict the module is when decid-
ing whether two graphs represent the same P2P overlay. In our experiments we used three values of this
parameter without any impact on the detection results. However, the value of this parameter cannot be
arbitrary, as choosing a very low value could cause even unrelated graphs to be merged. On the other
hand, choosing too high a value could have a performance penalty because the graph module would
keep working with several very similar graphs. Mutual contacts overlap threshold and memory limit
both have a significant impact on the detection rate and some impact on the false positive rate. Having
the memory limit fixed, increasing the mutual contacts overlap threshold decreases the detection rate
and false positive rate. Similarly, having the mutual contacts overlap threshold fixed, increasing the
memory limit increases the detection rate and false positive rate. The impact on the false positive rate
is usually only marginal, but setting mutual contacts overlap threshold to a very low value can rapidly
increase the false positive rate. For example, if we set the mutual contacts overlap threshold to 1 and
the network is the target of a scan then all scanned endpoints are added to the graph, increasing the
false positive rate considerably. As can be seen in Figure 6(b), these two parameters can compensate
for each other. Increase in the mutual contacts overlap threshold decreases the detection rate unless the
memory limit is increased appropriately as well.
ACKNOWLEDGEMENTS
The work was supported by MVCR grant number VG2VS/242.
REFERENCES
1. Bitcoin: open source p2p money. Available: http://bitcoin.org [18 October 2013].
2. Wearden G. US Army aims to take p2p into battle. Available: http://www.zdnet.com/us-army-aims-to-take-p2p-into-battle-3002094181
[20 October 2013].
3. Haq IU, Ali S, Khan H, Khayam SA. What is the impact of p2p traffic on anomaly detection?. In Proceedings of the 13th
International Conference on Recent Advances in Intrusion Detection: RAID’10, Springer: Berlin, 2010; 1–17.
4. Sadre R, Sperotto A, Pras A. The effects of DDoS attacks on flow monitoring applications, In IEEE Symposium on Network
Operations and Management, 2012; 269–277.
5. Jusko J, Rehak M. Revealing cooperating hosts by connection graph analysis. In Security and Privacy in Communication
Networks: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering,
Vol. 106, Springer: Berlin, 2013; 241–255.
6. Móczár Z, Molnár S. Characterization of BitTorrent traffic in a broadband access network. In AccessNets: Lecture Notes
of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, Vol. 63, Springer: Berlin,
2010; 176–183.
7. Kryczka M, Cuevas R, Guerrero C, Azcorra A. Unrevealing the structure of live bittorrent swarms: methodology and
analysis, In IEEE International Conference on Peer-to-Peer Computing, Kyoto, 2011; 230–239.
8. Qi J, Zhang H, Ji Z, Yun L. Analyzing BitTorrent traffic across large network, In 2008 International Conference on
Cyberworlds, 2008; 759–764.
9. Falkner J, Piatek M, John JP, Krishnamurthy A, Anderson T. Profiling a million user DHT. In Proceedings of the 7th ACM
SIGCOMM conference on Internet Measurement: IMC’07, ACM: New York, 2007; 129–134.
10. Steiner M, En-Najjary T, Biersack EW. A global view of KAD. In Proceedings of the 7th ACM SIGCOMM Conference on
Internet Measurement: IMC’07, ACM: New York, 2007; 117–122.
11. Liu X, Li Y, Li Z, Cheng X. Social network analysis on kad and its application. In Proceedings of the 13th Asia–Pacific
Web Conference on Web Technologies and Applications: APWeb’11, Springer: Berlin, 2011; 327–332.
12. Li C, Chen C. Topology analysis of Gnutella by large scale mining, In International Conference on Communication
Technology: ICCT’06, 2006; 1–4.
13. Acosta W, Chandra S. Trace driven analysis of the long term evolution of gnutella peer-to-peer traffic. In Proceedings
of the 8th International Conference on Passive and Active Network Measurement: PAM’07, Louvain-la-Neuve, Belgium,
Springer: Berlin, 2007; 42–51.
14. Markatos EP. Tracing a large-scale peer to peer system: an hour in the life of gnutella. In Proceedings of the 2nd IEEE/ACM
International Symposium on Cluster Computing and the Grid: CCGRID’02, IEEE Computer Society: Washington, DC,
2002; 65–75.
15. Grizzard JB, Sharma V, Nunnery C, Kang BB, Dagon D. Peer-to-peer botnets: overview and case study. In Proceedings of
the First Conference on First Workshop on Hot Topics in Understanding Botnets: HotBots07, Cambridge, MA, USENIX
Association: Berkeley, CA, 2007; 1–1.
16. Ha DT, Yan G, Eidenbenz S, Ngo HQ. On the effectiveness of structural detection and defense against p2p-based botnets.
In IEEE/IFIP International Conference on Dependable Systems and Networks: DSN’09, 2009; 297–306.
17. Dagon D, Gu G, Lee CP, Lee W. A taxonomy of botnet structures. In Proceedings of the 23rd Annual Computer Security
Applications Conference: ACSAC’07, 2007; 325–339.
18. Cooke E, Jahanian F, McPherson D. The zombie roundup: understanding, detecting, and disrupting botnets. In Proceedings
of the USENIX SRUTI Workshop, Cambridge, MA, USENIX Association: Berkeley, CA, 2005; 39–44.
19. Davis CR, Neville S, Fernandez JM, Robert J-M, McHugh J. Structured peer-to-peer overlay networks: ideal botnets
command and control infrastructures? In ESORICS, Lecture Notes in Computer Science, Vol. 5283, Springer: Berlin, 2008;
461–480.
20. Bartlett G, Heidemann J, Papadopoulos C. Inherent behaviors for on-line detection of peer-to-peer file sharing. In IEEE
Global Internet Symposium, Anchorage, AK, 2007; 55–60.
21. Constantinou F, Mavrommatis P. Identifying known and unknown peer-to-peer traffic. In Fifth IEEE International
Symposium on Network Computing and Applications: NCA 2006, Cambridge, Massachusetts, USA, 2006; 93–102.
22. Iliofotou M, Pappu P, Faloutsos M, Mitzenmacher M, Singh S, Varghese G. Network monitoring using traffic dispersion
graphs (TDGs). In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement: IMC’07, San Diego,
California, USA, ACM: New York, 2007; 315–320.
23. Iliofotou M, Kim HC, Faloutsos M, Mitzenmacher M, Pappu P, Varghese G. Graption: a graph-based p2p traffic
classification framework for the internet backbone. Computer Networks 2011; 55(8): 1909–1920.
24. Iliofotou M, Faloutsos M, Mitzenmacher M. Exploiting dynamicity in graph-based traffic analysis: techniques and
applications. In Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies:
CoNEXT’09, ACM: New York, 2009; 241–252.
25. Jelasity M, Bilicki V. Towards automated detection of peer-to-peer botnets: on the limits of local approaches. In Proceedings
of the 2nd USENIX Conference on Large-Scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More:
LEET’09, Boston, MA, USENIX Association: Berkeley, CA, 2009; 3–3.
26. Coskun B, Dietrich S, Memon N. Friends of an enemy: identifying local members of peer-to-peer botnets using mutual
contacts. In Proceedings of the 26th Annual Computer Security Applications Conference: ACSAC’10, Austin, Texas, ACM:
New York, 2010; 131–140.
27. François J, Wang S, State R, Engel T. Bottrack: tracking botnets using NetFlow and PageRank. In Networking 2011, Lecture
Notes in Computer Science, Vol. 6640, Springer: Berlin, 2011; 1–14.
28. Giroire F, Chandrashekar J, Taft N, Schooler E, Papagiannaki D. Exploiting temporal persistence to detect covert botnet
channels. In Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection: RAID’09,
Saint-Malo, France, Springer-Verlag: Berlin, Heidelberg, 2009; 326–345.
29. Kim J, Shah K, Bohacek S. Detecting p2p traffic from the p2p flow graph. In 7th International Wireless Communications
and Mobile Computing Conference: IWCMC’11, 2011; 1795–1800.
30. Duda RO, Hart PE, Stork DG. Pattern Classification and Scene Analysis, Wiley: Chichester, 2001.
31. Stoica I, Morris R, Karger D, Kaashoek MF, Balakrishnan H. Chord: a scalable peer-to-peer lookup service for
internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for
Computer Communications: SIGCOMM’01, San Diego, California, USA, ACM: New York, NY, USA, 2001; 149–160.
32. Kwok YKR. Peer-to-Peer Computing: Applications, Architecture, Protocols, and Challenges, Taylor & Francis: Boca
Raton, FL, 2011.
33. McNamee K. Malware analysis report–botnet: ZeroAccess/Sirefef. 2012. Available:
http://www.kindsight.net/sites/default/files/Kindsight_Malware_Analysis-ZeroAcess-Botnet-final.pdf [6 May 2014].
AUTHORS’ BIOGRAPHIES
Jan Jusko is a researcher at Cisco Systems, focusing on malware C&C channel detection. He holds a master’s
degree in computer science from Czech Technical University in Prague and is now pursuing a PhD in artificial
intelligence. His research interests include network security, graph theory and machine learning.
Martin Rehak is a principal engineer with the Cisco Systems security group. He has been working in the area of
machine learning, anomaly detection and network security. In the past, he was a founder & CEO of Cognitive
Security, acquired by Cisco in 2013. The VC-funded spin-off company was created to develop a commercial
technology based on the research performed by Martin and his team at Czech Technical University. Martin holds
an engineering degree from Ecole Centrale Paris and a PhD in AI from CTU in Prague.
Copyright © 2014 John Wiley & Sons, Ltd
DOI: 10.1002/nem
Int. J. Network Mgmt 2014; 24: 235–252