INTERNATIONAL JOURNAL OF NETWORK MANAGEMENT

Int. J. Network Mgmt 2014; 24(4): 235–252 (July 2014)

Published online 29 May 2014 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/nem.1862

Identifying peer-to-peer communities in the network by connection graph analysis

Jan Jusko1,2,*,† and Martin Rehak1,2

1Cisco Systems, San Jose, CA 95134, USA

2Faculty of Electrical Engineering, Czech Technical University in Prague, 166 27 Prague, Czech Republic

SUMMARY

In this paper we present a unified solution to identify peer-to-peer (P2P) communities operating in the network. We propose an algorithm that is able to progressively discover nodes cooperating in a P2P network and to identify that P2P network. Starting from a single known node, we can easily identify other nodes in the P2P network, through the analysis of widely available and standardized IPFIX (NetFlow) data. Instead of relying on the analysis of content characteristics or packet properties, we monitor connections of known nodes in the network and then progressively discover other nodes through the analysis of their mutual contacts. We show that our method is able to discover cooperating nodes in many P2P networks and present the real computational requirements of the algorithm on a large network. The use of standardized input data allows for easy deployment onto real networks. Copyright © 2014 John Wiley & Sons, Ltd.

Received 31 October 2013; Revised 15 March 2014; Accepted 10 April 2014

1. INTRODUCTION

Although peer-to-peer (P2P) networks are mostly known for file-sharing applications, they are widely adopted in today’s Internet. They are used for file sharing (e.g. BitTorrent), VoIP applications (Skype), malware command and control (C&C) channels and streaming media (Spotify). Interestingly, there is also an online currency that relies on P2P architecture for its ecosystem: Bitcoin [1]. Last but not least, P2P networks might be used for military purposes [2]. Since VoIP applications, streaming media and especially file-sharing applications generate large amounts of traffic, P2P networks account for a significant share of the traffic in today’s Internet.

For the sake of both network management and network security, it is important to be able to identify the network traffic generated by P2P networks. Knowing which network traffic is generated by a specific P2P application, one can manage networks more effectively and provide a better quality of service. From the security standpoint, blocking P2P-based C&C is very effective for disrupting botnet activity. In addition to these two points, it was also shown that P2P traffic can degrade the performance of anomaly detection techniques: the detection rate can decrease by up to 30% and the false positive rate can increase by up to 45% [3].

In this paper we deal with the issue of finding P2P peers in the network for all P2P protocols in general, without focusing on any particular protocol. This is possible because we exploit intrinsic properties that are common to all P2P protocols and cannot be effectively avoided by any P2P network. Specifically, in this paper we are looking for answers to the following questions:

1. Having one peer in an unknown P2P network, are we able to find other peers in the respective network?

2. Can we determine what particular P2P network it is?

3. Can we enumerate all P2P networks and their peers active in the monitored network?

*Correspondence to: Jan Jusko, Cisco Systems, San Jose, CA 95134, USA.

†E-mail: jajusko@cisco.com

Answers to these questions are presented in Sections 3, 4 and 5. As a result we get a detector

that provides information about all active P2P overlays in the network. The detector utilizes Net-

Flow data and uses only information about communication endpoints without using any ﬂow-based

statistics. This is shown to be an advantage since such statistics (e.g. ﬂow duration) could be dis-

torted, for example, in the case of a distributed denial of service (DDoS) attack on the protected

network [4].

In our evaluation we show that the detector we propose is able to link hosts cooperating in the usual

P2P networks such as KAD, Gnutella, BitTorrent and Skype, as well as hosts infected by the same

malware using P2P as its C&C channel. Besides knowledge about the cooperating hosts, the detector

is capable of identifying the detected networks if they are of a known type.

Knowing which hosts are engaged in the same overlay with the infected host might help in discovering the botnet or other malware-infested nodes. The method described in this paper can be used as a pre-processing layer for packet inspection-based detection. One would first find clusters of hosts in the network, then perform the detection only for a few of them and extend the results to the remaining hosts in the cluster.

This work is an extension of our earlier paper [5]. The core of the new contribution lies in Section 4. Moreover, this paper offers a more detailed discussion of the reasoning behind the algorithm. In the evaluation we provide additional information on the detection performance of the detector over time and add information about computational performance in a real deployment.

2. RELATED WORK

There is a plethora of research in the ﬁeld of P2P networks, e.g. BitTorrent [6–8], BitTorrent’s DHT

[9], KAD (which is based on Kademlia) [10,11] and Gnutella [12–14]. There are also many studies

proposing various improvements to P2P protocols, but those are not of primary interest here.

P2P architecture is now often used by botnets for their C&C. An overview of P2P botnets and an

analysis of one of them can be found in Grizzard et al. [15]. A P2P-based C&C, using Kademlia as an

example, is analyzed theoretically in Ha et al. [16]. The authors show that P2P-based C&C is harder

to monitor compared to the centralized C&C architecture. Besides that, they also propose several

mitigation techniques.

A deeper analysis of botnets with P2P-based C&C has been provided in Dagon et al. [17]. The

authors take Cooke et al. [18] as their inspiration and classify botnets into several groups based on the

(theoretical) C&C model they use:

• Erdös–Rényi random graph model;

• Watts–Strogatz small world model;

• Barabási–Albert scale-free model.

All these models assume that bots communicate with each other; thus these are an extension of the P2P model. The authors themselves note that, for the purpose of their theoretical analysis, botnets using unstructured P2P networks as C&C can be roughly approximated by the Barabási–Albert scale-free model and those using structured P2P networks for their C&C can be roughly approximated by the Erdös–Rényi random graph model.¹ Appropriate models for a few major P2P protocols based on this overlay classification can be found in Table 1.

¹ Note that P2P networks that use supernodes are in fact unstructured (e.g. Gnutella). Only networks with an explicitly defined routing structure (e.g. Kademlia) are considered structured.

Table 1. Major P2P protocols and the approximate graph model of their overlay

P2P protocol                                     Graph model of overlay
Skype                                            Barabási–Albert scale-free model
Gnutella (before introduction of super-peers)    Erdös–Rényi random graph model
Gnutella (with super-peers)                      Barabási–Albert scale-free model
BitTorrent                                       Erdös–Rényi random graph model
BitTorrent DHT                                   Erdös–Rényi random graph model
Kademlia                                         Erdös–Rényi random graph model

Another study of P2P architecture from the point of view of the C&C channel was presented in Davis et al. [19]; it studies two real P2P networks and two theoretical models in order to identify the optimal P2P overlay for botnet C&C. The studied protocols and theoretical models are:

• Kademlia;

• Gnutella;

• Erdös–Rényi random graph model;

• Barabási–Albert scale-free network model.

The authors first define three performance measures on a P2P overlay that should be of interest to the botmaster. Once these measures are defined, they evaluate them on the four models while each is the target of a disinfection.

Detection of P2P networks is another topic often dealt with. There are three main groups of detection

methods: packet payload based, ﬂow-based methods and graph methods. Within all three groups the

detection can be based on the observation of either the speciﬁc P2P network behavior or inherent P2P

network properties.

A ﬂow-based method to detect peers using inherent properties of P2P networks is introduced

in Bartlett et al. [20]. The method itself does not use any protocol-speciﬁc features and thus, in

theory, might be used for any P2P network. The authors validate the method on BitTorrent and

Gnutella networks.

As an example of graph methods, we can mention the one introduced in Constantinou and Mavrom-

matis [21]. The method is agnostic of any speciﬁc P2P protocol features. It creates a connection graph

of the peers communicating on a given port. Based on the network diameter and number of hosts that

function as both client and server, the method determines whether they constitute a P2P network.

In the detection of P2P networks, and thus P2P botnets, much attention is given to traffic dispersion graphs (TDGs). A TDG is a graphical representation of various interactions of a group of nodes [22]. In IP networks, nodes in a TDG represent hosts in the network identified by their IP address. The definition of an edge in such a graph is more complicated, however: so that TDGs can describe various forms of interaction, the edges can be defined arbitrarily. Static properties of various TDGs were first analyzed by Iliofotou et al. [22]. The network classification system Graption is based on these properties [23]. In further work, Iliofotou et al. also investigate dynamic properties of traffic dispersion graphs [24]. TDGs were also used to evaluate the limits of local approaches in P2P botnet detection [25].

An interesting solution for P2P-based botnets was proposed by Coskun et al. [26]. They propose a method to identify local members of a botnet that uses an unstructured random P2P network for its overlay if one bot from the botnet is already known. They do so by post-mortem analysis of the network traffic. Once one bot is discovered in the network, the traffic preceding its discovery is collected (say 24 hours) and analyzed. The detection of other hosts is centered around the notion of a mutual contact, and the likelihood of other hosts being part of the same botnet is calculated in an iterative manner on a graph that is created based on hosts’ mutual connections. Since we use the mutual contacts paradigm as well, this work is closest to ours.

Iterative calculation of nodes’ likelihood of being part of botnets is used in François et al. [27]

as well. The goal of their work is to ﬁnd hidden overlays of P2P-based botnets in the network. A

communication graph is created based on the NetFlow data and two PageRank statistics are calculated

for each node. Nodes are then clustered to ﬁnd clusters of P2P nodes in the network.

In this work we also use ideas from a paper aimed at detecting botnet C&C [28]. The authors focus on observing long-term connections that are possibly used for botnet C&C. The method requires a training phase in which clean (malware-free) traffic is assumed. In the training phase all long-term connections are recorded and whitelisted. In the detection phase, if any long-lasting connection is observed and is not contained in the whitelist, it is considered a C&C channel.

An overview of related work with categorization by methods used and goals attained can be found in Table 2.

Table 2. Overview of methods and goals of the related work

                                                   Goals
Methods used             P2P properties   P2P-based C&C properties   P2P detection   Botnet detection
Empirical studies        [6–14]           [15]                       -               -
Theoretical analysis     -                [16,17,19]                 -               -
Thresholding             -                -                          [20]            -
Graph methods            [22,24]          -                          [21–27]         -
Statistical approaches   -                -                          [27]            [27]
Persistence              -                -                          -               [28]

Our method uses a different graph representation than is usual in the ﬁeld of graph-based network

analysis and adds dynamicity to the graphs used, enabling us to distinguish multiple P2P networks

on the same host. The graph we propose changes together with the changes in the observed network

trafﬁc, which allows for online analysis of network trafﬁc.

3. FINDING COOPERATING HOSTS

Despite having an intrusion prevention system (IPS) deployed in the network and host machines protected by antivirus, hosts can become infected, be it because of a zero-day attack or because of an infection vector that is not covered by the security solution. When such a compromised host is found, it is good practice to look for other hosts in the network that exhibit similar behavior to find other potential victims of the infection.

One way to do this is to look at what hosts the infected machine communicates with, because these

would include the perpetrator. If any other machine shares a similar set of peers, it might be infected as

well. If two machines share a remote peer, we say they are cooperating in at least one overlay network.

In some cases, the shared overlay is benign, like Skype, whereas in other cases the shared overlay will

be malicious.

Comparing the hosts’ peers can be viewed as looking for set intersections: for each host we keep a set of IP addresses of communication peers and look for pairwise intersections. The shortcoming of this method is that it does not acknowledge that one host might participate in several P2P networks at the same time. For a host that is infected by P2P-based malware and runs Skype at the same time, looking just at the intersections of sets containing IP addresses we end up marking all other hosts in the network using Skype as suspicious, and we cannot distinguish between the two. Another issue with this approach is that the analysis is done offline, after the incident occurs. If we chose to do it in real time, the definition of sets would have to be extended, e.g. to allow forgetting of peers. It would also require some extension of the algorithm, such as the definition of a time window in which the two hosts must have an intersection of the peer sets.
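The host-level set intersection described above can be illustrated with a minimal Python sketch. The function name `shared_peers` and all addresses are hypothetical and serve only to show why the approach cannot tell two overlays apart; this is not the detector proposed in the paper.

```python
from itertools import combinations


def shared_peers(peer_sets):
    """Return {(host_a, host_b): common remote IPs} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(sorted(peer_sets), 2):
        common = peer_sets[a] & peer_sets[b]
        if common:
            overlaps[(a, b)] = common
    return overlaps


# Hypothetical observations: remote IPs contacted by each local host.
# 198.51.100.7 stands for a malware peer, the other two for Skype peers.
peer_sets = {
    "10.0.0.1": {"198.51.100.7", "203.0.113.9", "192.0.2.44"},  # infected, also runs Skype
    "10.0.0.2": {"203.0.113.9", "192.0.2.44"},                  # Skype only
    "10.0.0.3": {"198.51.100.7"},                               # infected only
}
overlaps = shared_peers(peer_sets)
```

In this toy input the infected host overlaps with both the Skype-only host and the other infected host, and nothing in the result says which overlap stems from which overlay.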

Both issues can be addressed by graph-based models. Graphs are commonly used to represent a P2P network: vertices represent peers and edges represent connections between the peers. Graph formalism has been used before in the task of P2P network detection. However, our graph formalism brings two novelties: a different node representation and graph dynamicity.

All graph methods we have come across until now, with the exception of Kim et al. [29], used nodes to represent hosts (or IP addresses). We move from this definition towards using nodes to represent tuples (IP, port), which we call endpoints. This mitigates the first shortcoming of the ‘intersection’ method. If a host is participating in several P2P networks, its IP is the same, but the ports used for communication in different networks are different. Therefore, a single host can be represented by several nodes, each associated with a specific network. One host can certainly use several ports for communication in one overlay, but there always needs to be one port listening for incoming connections that does not change (often) so that other peers in the P2P network can contact it. Therefore, one endpoint should be dominant among all endpoints associated with a single host and P2P overlay.
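A minimal sketch of the endpoint representation follows, assuming flows are available as (local IP, local port, remote IP, remote port) tuples. The function name `contacts_by_endpoint`, the flow layout and all values are hypothetical; the paper does not prescribe a data structure.

```python
from collections import defaultdict


def contacts_by_endpoint(flows):
    """Group remote endpoints by the local (IP, port) endpoint that contacted them."""
    contacts = defaultdict(set)
    for local_ip, local_port, remote_ip, remote_port in flows:
        contacts[(local_ip, local_port)].add((remote_ip, remote_port))
    return dict(contacts)


# Hypothetical flows: host 10.0.0.1 takes part in two overlays on two ports.
flows = [
    ("10.0.0.1", 41000, "203.0.113.9", 33033),
    ("10.0.0.1", 41000, "192.0.2.44", 40010),
    ("10.0.0.1", 52000, "198.51.100.7", 8443),
    ("10.0.0.2", 43000, "203.0.113.9", 33033),
    ("10.0.0.2", 43000, "192.0.2.44", 40010),
    ("10.0.0.3", 52000, "198.51.100.7", 8443),
]
contacts = contacts_by_endpoint(flows)

# The two endpoints of 10.0.0.1 now overlap with different neighbours.
first_overlay = contacts[("10.0.0.1", 41000)] & contacts[("10.0.0.2", 43000)]
second_overlay = contacts[("10.0.0.1", 52000)] & contacts[("10.0.0.3", 52000)]
```

Because each local endpoint is a separate node, the single host 10.0.0.1 yields two nodes whose mutual contacts point to two different groups of local hosts.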

To overcome the second shortcoming, we employ a graph algorithm that not only constructs the graph but also modifies it with time, thus ensuring that the graph describes the current state of the overlay of the P2P network that we can observe. The notion of time, which is necessary for such graph dynamicity, is captured by edge weights.

To detect the nodes of a P2P overlay network within our network we use a 3-partite weighted graph:

$$G = (V, E, w)$$

where

$$V = V_c \cup V_s \cup V_r$$

in which $V_c$ is a set of nodes from our network we believe are participating in the P2P network, $V_s$ is a set of nodes from our network that we suspect are participating in the P2P network and $V_r$ is a set of nodes from outside of our network communicating with nodes from $V_c \cup V_s$. $E$ is a set of edges. Function $w$ assigns each edge a weight: a value equal to the time when the edge was added to the graph (or most recently refreshed; see Section 3.1).

We ignore all intra-network communication² and cannot see communication between the nodes that are outside of our network. Therefore, the defined graph is indeed a 3-partite weighted graph. This also implies that $G = ((V_c \cup V_s) \cup V_r, E)$ can be viewed as a bipartite graph.

3.1. The graph algorithm

Since P2P overlay networks are dynamically changing, so should the graph that represents a P2P

overlay network. The detection algorithm monitors network trafﬁc and constructs (modiﬁes) the graph

based on the observed network activity in the following way:

1. the graph starts with only the seed node $n \in V_c$;

2. when a network connection occurs between any node $n \in V_c$ and some node $m$ outside of our network then there are two options:

(a) $m \in V_r$ already; in this case we just update $w(\{m, n\}) = \mathrm{current\_time}()$;

(b) $m \notin V_r$ yet; in this case we add $m$ to $V_r$ and $\{m, n\}$ to $E$ and set $w(\{m, n\}) = \mathrm{current\_time}()$;

3. when a network connection occurs between any local node $n$ not yet in the graph and some node $m \in V_r$, we add $n$ to $V_s$, add $\{m, n\}$ to $E$ and set $w(\{m, n\}) = \mathrm{current\_time}()$;

4. any edge $e \in E$ for which $t_{now} - w(e) > t_L$ is removed from the graph;

5. any node $n \in V$ is removed from the graph when it does not have any incident edge (it has a zero degree);

6. if $(\exists m \in V_s)(\exists n \in V_c)(|\mathrm{Adj}(m) \cap \mathrm{Adj}(n)| > K)$ then we move $m$ from $V_s$ to $V_c$; $\mathrm{Adj}(n)$ is a set of vertices adjacent to $n$.
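The six steps above can be sketched in Python as follows. This is an illustrative sketch, not the authors’ implementation: the class name `OverlayGraph`, the method names and the example endpoints are hypothetical. Two behaviours are assumptions that the steps leave open: a node already in $V_s$ may gain further edges to nodes in $V_r$, and a node is pruned only once its last edge has expired, so that the seed survives until its first connection is seen. The arguments `t_limit` and `k` play the roles of $t_L$ and $K$.

```python
class OverlayGraph:
    def __init__(self, seed, t_limit, k):
        self.confirmed = {seed}  # V_c, step 1: the graph starts with the seed
        self.suspected = set()   # V_s
        self.remote = set()      # V_r
        self.weights = {}        # (local, remote) -> time the edge was last seen
        self.t_limit = t_limit
        self.k = k

    def adjacent(self, local):
        """Remote endpoints currently connected to a local endpoint."""
        return {r for (l, r) in self.weights if l == local}

    def observe(self, local, remote, now):
        if local in self.confirmed:      # step 2: add or refresh the edge
            self.remote.add(remote)
            self.weights[(local, remote)] = now
        elif remote in self.remote:      # step 3 (assumption: also for V_s nodes)
            self.suspected.add(local)
            self.weights[(local, remote)] = now
        self._expire(now)
        self._promote()

    def _expire(self, now):
        # Step 4: forget edges older than the memory limit.
        expired = [e for e, t in self.weights.items() if now - t > self.t_limit]
        for edge in expired:
            del self.weights[edge]
        # Step 5: drop nodes that lost their last incident edge.
        for local, remote in expired:
            if not any(l == local for l, _ in self.weights):
                self.confirmed.discard(local)
                self.suspected.discard(local)
            if not any(r == remote for _, r in self.weights):
                self.remote.discard(remote)

    def _promote(self):
        # Step 6: move suspected nodes sharing more than k contacts with V_c.
        changed = True
        while changed:
            changed = False
            for m in sorted(self.suspected):
                if any(len(self.adjacent(m) & self.adjacent(n)) > self.k
                       for n in self.confirmed):
                    self.suspected.remove(m)
                    self.confirmed.add(m)
                    changed = True


# Hypothetical endpoints: A is the seed, B another local endpoint.
A, B = ("10.0.0.1", 6881), ("10.0.0.2", 6881)
r1, r2, r3 = ("198.51.100.1", 6881), ("198.51.100.2", 51413), ("198.51.100.3", 6881)

g = OverlayGraph(seed=A, t_limit=300, k=1)
for t, remote in enumerate([r1, r2, r3]):
    g.observe(A, remote, now=t)
g.observe(B, r1, now=10)  # one mutual contact: B is only suspected
g.observe(B, r2, now=11)  # two mutual contacts exceed k=1: B is moved to V_c
```

With the strict inequality of step 6, two mutual contacts promote a node only when `k` is at most 1, which is why the example uses `k=1`.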

There are two parameters used in this algorithm:

• a memory limit, $t_L$, which specifies how long a recorded connection (an edge in the graph) is kept in memory;

• a mutual contacts overlap threshold, $K$, which specifies how many mutual adjacent vertices a node from $V_s$ needs to have with any node from $V_c$ to be moved to $V_c$.

A discussion about the impact of the parameter choice can be found in the Appendix. The first three steps of the algorithm with an example graph are depicted in Figure 1. The set $V_c$, at any given moment, contains a list of active P2P nodes in the local network participating in the same overlay.

² Based on the deployment location of the NetFlow probes, the intra-network communication may or may not be available. We chose not to use this information to avoid susceptibility of the algorithm to the probe deployment location.

Figure 1. Algorithm illustration. First we have a seed node A with three recorded contacts. In the second time interval, another node, B, is observed, sharing two mutual contacts with A. If we consider $K = 2$, then in the third step node B is already moved to $V_c$. Moreover, the algorithm detected yet another node, which has only one mutual contact with a node from $V_c$. Note that the weights of the edges in the graph are determined by the time step in which they occurred most recently.

There is, of course, a possibility that the graph algorithm will not be able to ﬁnd any cooperating

hosts for a certain seed. This might happen when the seed is the only peer of the respective P2P overlay

in the network, or when the seed node around which we tried to construct a graph was not participating

in any P2P overlay.

4. IDENTIFYING THE REVEALED P2P NETWORK

Assuming we revealed an overlay network of some P2P protocol, can we determine what particular P2P network it is? Most of the P2P protocol identification methods rely on payload data or flow-based statistics. Is it possible to infer the P2P protocol directly from the graph, without looking at any flow-based or packet-based statistics?

There is an elegant way of identifying a P2P network based on port distribution of peers participating

in the overlay. The port distribution is an empirical probability distribution of ports used by peers and

is represented by a normalized vector with 65 535 elements. While it is often argued that port-based

P2P protocol identiﬁcation is useless due to the port randomization, the port distribution of peers is

surprisingly stable. The method we propose takes advantage of this observation.

The proposed detector does not have the knowledge of the whole overlay, but we argue that even the partial knowledge it has is sufficient to identify the P2P overlay using the port distribution of known remote peers. Once the cooperating peers in the network are found, a port distribution over their remote peers (nodes in set $V_r$ from the graph) is generated and matched against the port distributions of known P2P networks. The graph is considered to represent the P2P overlay that has the most similar port distribution to the port distribution created over $V_r$. As of now, we have port distributions for a few major P2P networks: Skype, BitTorrent, KAD (Kademlia) and Gnutella. Their respective distributions can be found in Figure 2.

This approach can be formalized as multinomial classification with a rejection option [30]: a classifier decides whether the graph represents one of the several known networks, and if none of the known P2P networks is similar enough it does not make a decision. The classifier uses a one-versus-all strategy, where for each class there is a binary classifier $f_i(\cdot)$ that classifies elements in the respective class. The overall classifier can then be defined as

$$f(x) = \arg\max_i f_i(x)$$

In our case, each binary classifier has the form of a dot product of the port distribution of remote peers and a class prototype, which is also a port distribution of remote peers captured on the graph representing the known network. One can easily see that this dot product takes values in $[0, 1]$, with the value getting higher the more similar the two vectors are. Since we do not have prototype vectors for all existing P2P networks and we never will, it is crucial that the classifier has the rejection option.

Figure 2. Port distributions of some of the major P2P networks: (a) BitTorrent; (b) Skype; (c) Gnutella; (d) KAD. These distributions are distinctively different. For example, the port distribution for Skype has strong peaks on port 33 033, which is the legacy control port. Ports around 40 000 are the new control ports that were introduced after Microsoft’s acquisition of Skype. On the other hand, the BitTorrent port distribution has peaks at 6881 and 51 413, which are default ports for several BitTorrent client applications.

The classification of graph $G = (V_c \cup V_s \cup V_r, E, w)$ proceeds as follows:

1. we create port distribution vector $d$:

(a) we instantiate a vector $d$ with $d_i = |\{n \mid n \in V_r \wedge \mathrm{port}(n) = i\}|$, $i \in [1, 65\,535]$, i.e. each element of the vector contains the value equal to the number of remote peers whose port is equal to the element index;

(b) we normalize $d$;

2. for each known class $C$ (indexed by $i$) we define $f_i = \langle d, e_C \rangle$, where $e_C$ is the class prototype and $\langle a, b \rangle$ denotes a dot product of the two vectors; then the overall classifier is $\arg\max_C \langle d, e_C \rangle$;

3. we select the class $C$ for which $\langle d, e_C \rangle$ is maximized as a possible match;

4. the maximal score $f_{max} = \max_i f_i$ is compared with the rejection threshold $T$; if $f_{max} > T$ the classifier identifies the graph as representing class $C$, otherwise no decision is made.
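The four classification steps can be sketched in Python as below. The sketch makes two assumptions that the text does not state: vectors are normalized to unit Euclidean length (so that identical distributions score 1), and the 65 535-element vector is stored sparsely as a dictionary from port to value. The function names and the toy prototypes are hypothetical and are not the distributions measured in the paper.

```python
import math
from collections import Counter


def port_distribution(ports):
    """Steps 1(a) and 1(b): count remote peers per port, then normalize."""
    counts = Counter(ports)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {port: c / norm for port, c in counts.items()} if norm else {}


def dot(a, b):
    """Dot product of two sparse vectors; missing ports count as zero."""
    return sum(value * b.get(port, 0.0) for port, value in a.items())


def classify(remote_ports, prototypes, threshold=0.3):
    """Steps 2 to 4: score every class, pick the best, reject weak matches."""
    d = port_distribution(remote_ports)
    scores = {name: dot(d, proto) for name, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] > threshold else None
    return label, scores


# Hypothetical class prototypes built from made-up port samples.
prototypes = {
    "BitTorrent": port_distribution([6881] * 5 + [51413] * 3 + [12345]),
    "Skype": port_distribution([33033] * 4 + [40010] * 2 + [40050]),
}

label, scores = classify([6881, 6881, 51413, 20000], prototypes)
unknown_label, _ = classify([1000, 2000, 3000], prototypes)
```

The second call shares no port with either prototype, so every score is zero and the rejection option leaves the graph unlabelled.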

The rejection option from step 4 is crucial. One of the dot products will always have the maximal value. This value might be very low, which signals that the classifier is not very sure what P2P network is represented by the graph. In such a case it is undesirable to make a decision. In our implementation, we set the rejection threshold to 0.3. This value is high enough to avoid ‘accidental’ classification because vectors in high dimensions are typically orthogonal to each other. On the other hand, it is low enough to allow for differences between graph port distributions and class prototypes caused by different sampling of peers in the overlay (i.e. both distributions were created based on different sets of peers).
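The reasoning behind the threshold can be illustrated with a small experiment, sketched here under strong assumptions: ports of unrelated peers are drawn uniformly at random (real port distributions are not uniform), vectors are normalized to unit Euclidean length, and the sample sizes and the shared port 6881 are arbitrary choices. It is a plausibility check, not a reproduction of the paper’s evaluation.

```python
import math
import random
from collections import Counter


def normalized(ports):
    """Sparse port-count vector scaled to unit Euclidean length."""
    counts = Counter(ports)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {port: c / norm for port, c in counts.items()}


def dot(a, b):
    return sum(value * b.get(port, 0.0) for port, value in a.items())


rng = random.Random(0)  # fixed seed so the run is repeatable


def random_ports(n):
    return [rng.randint(1, 65535) for _ in range(n)]


# Two unrelated peer samples: hardly any port is shared.
unrelated = dot(normalized(random_ports(200)), normalized(random_ports(200)))

# Two different samples of one overlay with a common default port.
sample_a = [6881] * 120 + random_ports(80)
sample_b = [6881] * 90 + random_ports(110)
related = dot(normalized(sample_a), normalized(sample_b))
```

Under these assumptions the unrelated pair scores close to zero, while the two samples that share a dominant port score well above 0.3 even though they were built from different peers.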

4.1. Time stability

The classiﬁcation approach we employ requires that the port distributions are stable in time. To show

that, we reconstructed overlays of BitTorrent and Skype using the graph algorithm described in Section

3.1 on two different networks at different points in time and compared their port distributions. The

capture times were approximately 9 months apart. For BitTorrent the dot product of the vectors con-

sisting of the port relative frequencies of the two distributions yielded a value of 0.93. This indicates

that BitTorrent peers’ port distribution is stable both in time and space.

For Skype the dot product of the vectors created in the same way yielded a value of only 0.4,

indicating that Skype peers’ port distribution is less stable. This can be explained by a major Skype

overlay architecture change that took place between the two points at which we reconstructed the

Skype overlay. A large number of Skype supernodes were moved to Microsoft’s own data centers,

relying less on nodes on users’ computers. New supernodes usually listen on ports between 40 000

and 40 100. Also, port 33 033, which was associated with supernodes owned by Skype itself, is less

prominent after the acquisition. If we compensate for this change (ignoring ports 33 033 and 40 000–

40 100 in the port distributions) the dot product yields a value of 0.78. The two are not as similar as in

the case of BitTorrent but there is still a strong similarity.
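The compensation described above, comparing two port distributions while ignoring selected ports, can be sketched as follows. The function name `similarity`, the Euclidean normalization and the toy port samples are assumptions; the numbers produced here are not the 0.4 and 0.78 reported in the text.

```python
import math
from collections import Counter


def similarity(ports_a, ports_b, ignore=frozenset()):
    """Normalized dot product of two port samples, skipping ignored ports."""
    a = Counter(p for p in ports_a if p not in ignore)
    b = Counter(p for p in ports_b if p not in ignore)
    product = sum(count * b[port] for port, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return product / norm if norm else 0.0


# Ports named in the text: legacy control port and the newer supernode range.
SKYPE_CONTROL_PORTS = frozenset({33033}) | frozenset(range(40000, 40101))

# Hypothetical samples taken before and after the architecture change.
before = [33033] * 6 + [21000, 21000, 35000]
after = [40010] * 5 + [40020] * 2 + [21000, 21000, 35000]

raw = similarity(before, after)
compensated = similarity(before, after, ignore=SKYPE_CONTROL_PORTS)
```

In this toy input the two samples differ only in their control ports, so masking them raises the similarity from a low value to 1.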

The remaining two P2P overlays shown in Figure 2 were not detected in one of the networks;

therefore the comparison cannot be shown.

5. ENUMERATING ALL ACTIVE PEER-TO-PEER NETWORKS

The previous two sections enable us to ﬁnd a P2P overlay given the knowledge of one peer in the

overlay and possibly identify it if it is of a known type. To ﬁnd and enumerate (ﬁnd all participating

local peers) all active P2P overlays in the network using the same technique, one needs a starter node

for each of the active P2P overlays. If we were to ﬁnd one starter node for each active overlay in the

network we would need to know:

• what P2P overlays are active in the network;

• how to pinpoint a node participating in a given P2P overlay.

This is a rather recursive problem. To ﬁnd the starter nodes we need the knowledge that is sufﬁcient

to solve the original problem at hand—to ﬁnd and enumerate all active P2P overlays. To circumvent

this issue, we can select all endpoints likely to be peers in some overlay and grow graphs around

them. If any of the chosen endpoints is an active peer, the graph would represent an overlay of the

P2P network it belongs to. The guidelines for selecting seeds as likely peers are based on two intrinsic

properties of all P2P overlays that must always hold true:

1. all peers need to listen for incoming connections on an arbitrary but stable port;

2. every peer needs to communicate with at least two other peers.

The first property emerges from the observation that each peer both receives and initiates connections to other peers; otherwise the network would fall back to the client/server paradigm. In other words, each peer behaves like both the client and the server; thus it has to keep a port open for incoming connections. Also, this port cannot be changed often because a change produces an overhead in the exchanged messages and decreases the effectiveness of the overlay. In structured P2P overlays, each change of listening port is equivalent to leaving and rejoining the network with a different address, which for example in the Chord overlay [31] requires $O(\log^2 N)$ messages [32]. In an unstructured network, changing the listening port is also functionally equivalent to leaving and rejoining the network. This does not require any update to the routing table, but former peers of the rejoining peer will not be able to contact it any more. A new search in the network has to take place to re-establish the connection. Therefore, peers try to avoid changing the listening port while being part of the overlay.

Figure 3. Schema of the detector. As an input it takes flows from the network which are processed by the persistence module (denoted by PM). The set of seed endpoints is then transferred to the graph module, which processes the flows induced by persistent endpoints and merges and deletes graphs as needed. Sets of cooperating peers are sent to the identification module (denoted by IM). The output of the detector consists of sets of endpoints that appear to be cooperating in P2P networks with identification of the P2P network if available.

Moreover, it is not sufﬁcient and/or desirable for peers in an overlay to be in touch with only one

other peer. In the worst-case scenario, where each peer knows the address of only one other peer, we

effectively end up with a round-robin structure which is inferior in terms of search within the P2P

network. Besides that, data transfers would have to pass unrelated nodes that are not interested in the

data in transfer. When we consider a P2P overlay where only some peers have degree greater than

1, which is an extreme form of scale-free network, we encounter some reliability issues. Such a P2P

network would have several nodes with very high degree and many nodes with lesser degree. If we

were to remove a high degree node from the scale-free graph with minimal degree >1, we would still

maintain the connectivity within the graph. In contrast, if there were nodes with degree 1 connected to

the removed high degree node, these would be disconnected from the network. This shows that we can

expect active peers to have more than one other peer.
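The connectivity argument above can be checked on two toy graphs. The sketch below uses hypothetical node names and helper functions: it removes a high degree node and tests whether the remaining nodes can still reach each other, once when all other nodes have degree 1 and once when every node has degree greater than 1.

```python
from collections import deque


def build(edges):
    """Undirected adjacency sets from a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_connected_without(adj, removed):
    """True if all nodes except `removed` still form one connected component."""
    remaining = [n for n in adj if n != removed]
    if not remaining:
        return True
    seen = {remaining[0]}
    queue = deque([remaining[0]])
    while queue:  # breadth-first search that never enters the removed node
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(remaining)


# Peers a, b, c know only the hub (degree 1).
star = build([("hub", "a"), ("hub", "b"), ("hub", "c")])
# Same hub, but every peer also knows two other peers.
meshed = build([("hub", "a"), ("hub", "b"), ("hub", "c"),
                ("a", "b"), ("b", "c"), ("c", "a")])

star_survives = is_connected_without(star, "hub")
meshed_survives = is_connected_without(meshed, "hub")
```

Removing the hub isolates every degree 1 peer in the first graph, while the second graph stays connected.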

These two properties, combined with the information from the previous two sections, result in the proposal of a detector with three modules, which is depicted in Figure 3. The graph module revolves around the graph algorithm described in Section 3.1, maintaining several graphs simultaneously. The identification module represents the classifier described in Section 4. Finally, the persistence module brings the ability to select seeds around which the graphs are constructed based on the intrinsic P2P overlay properties.

5.1. Persistence module

The graph algorithm used in the graph module needs a seed node around which it constructs the

connection graph. The sole purpose of the persistence module is to find such nodes. To select seeds for the graph module, we utilize two criteria that are based on basic P2P network properties: the persistence criterion and the peer count criterion.

The persistence criterion is based on the ﬁrst intrinsic property of P2P networks. Peers also need to

communicate with each other to keep the overlay functional. They exchange messages for the purpose

of routing table updates, searches, data exchanges, etc. The criterion states that we choose endpoints

that are persistent, i.e. are sending or receiving data for longer periods of time. During normal network

activity, a single host uses many ports to communicate. Most of these ports are used only for a short

period of time; these are called ephemeral ports. However, there are some ports that are kept open—

these are usually used for listening for incoming connections.

To illustrate this, we performed a small experiment on a university network, in which we were

monitoring network trafﬁc in ten 5-minute intervals. In the ﬁrst time interval we recorded all active

endpoints in our network. In the following nine time intervals we recorded whether the given endpoints

were reused. This way, we were able to create a histogram showing the number of endpoints used in

either one, two or up to ten time intervals. The histogram can be found in Figure 4. We can see that

most endpoints were used only in one time interval during the experiment. Then the trend declines,

with the exception of endpoints that were used during all time intervals. We believe that these are the

endpoints that represent services (such as web servers or IMAP servers) or active peers of peer-to-

peer networks.


[Figure 4 plot: histogram of the number of endpoints (y-axis, 1000–7000) against the number of time steps in which an endpoint was active (x-axis, 1–10).]

Figure 4. The histogram shows that the majority of endpoints are active only at one or two time

intervals. We can then see only a marginal number of endpoints being active between three and nine

time steps. All services that run steadily and are regularly used are active at all 10 time windows

To deﬁne persistence of endpoints formally, we use a simpliﬁed method of measuring persistence

introduced in Giroire et al. [28]. The original method was focused on revealing hidden C&C channels.

We are interested only in the persistence of endpoints, no matter where they connect to. We are not trying to detect the exact periodicity of connections, but the ongoing character of a connection. The regularity of endpoint activity is observed by a sliding window $W$, which is split into $n$ bins $b_1, \ldots, b_n$. This window is called the observation window and the bins are called measurement windows. We can write $W = [b_1, b_2, b_3, \ldots, b_n]$. We then define the persistence of an endpoint as follows:

$$p(e, W) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{e, b_i}$$

where $e$ is the endpoint for which the persistence is calculated, $W$ is the observation window and the function $\mathbb{1}_{e, b_i}$ is equal to 1 if at least one connection to or from the endpoint $e$ occurred during the measurement window $b_i$; otherwise it is equal to 0.

The persistence calculation itself is based on three parameters: the measurement window size, which states how long connections are recorded into one bin before proceeding to another; the observation window size $n$, which determines how many bins there are in the observation window; and the threshold persistence $p^*$, which determines what persistence an endpoint must reach to be considered for seed selection.
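The persistence definition above can be illustrated with a minimal Python sketch. The representation of an endpoint as an (IP address, port) tuple and of each measurement window as the set of endpoints seen active in it are assumptions made for illustration, as is the function name.

```python
from typing import Sequence, Set, Tuple

Endpoint = Tuple[str, int]  # (IP address, port); this representation is an assumption


def persistence(endpoint: Endpoint, window: Sequence[Set[Endpoint]]) -> float:
    """Fraction of measurement windows (bins) in which `endpoint` was active.

    `window` is the observation window W = [b_1, ..., b_n]; each bin is the
    set of endpoints with at least one connection during that bin.
    """
    if not window:
        raise ValueError("observation window must contain at least one bin")
    active_bins = sum(1 for active in window if endpoint in active)
    return active_bins / len(window)


# Hypothetical example: a listening port seen in 3 of 4 bins,
# and an ephemeral port seen in only 1 of 4 bins.
example_window = [
    {("10.0.0.1", 6881)},
    set(),
    {("10.0.0.1", 6881), ("10.0.0.2", 50000)},
    {("10.0.0.1", 6881)},
]
print(persistence(("10.0.0.1", 6881), example_window))
print(persistence(("10.0.0.2", 50000), example_window))
```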

The peer count criterion is based on the second intrinsic property of P2P overlays: an active peer is in contact with more than one other peer. Accordingly, all endpoints that had at least two distinct peers in the last observation window pass this criterion. This removes long-lasting connections between two hosts on static ports, such as clients downloading large files from the Internet or users connecting to other computers via Remote Desktop or SSH.

With these two criteria together, each time the module is queried for seeds it calculates persistence

for all recorded endpoints and selects those with persistence exceeding the persistence threshold $p^*$.

Those selected are then checked against the second criterion, which is the number of contacted peers

during the last observation window.

Of course, not all selected seeds represent active peers in some P2P network. However, we argue

that all active peers should be selected as seeds.
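As a rough sketch of how the two criteria might be combined, the following Python function selects seeds from one observation window. The input layout (a list of bins, each holding (local endpoint, remote endpoint) pairs), the function name and the use of "greater than or equal" for the threshold comparison are assumptions, not details taken from our implementation.

```python
from collections import defaultdict


def select_seeds(bins, p_threshold):
    """Return local endpoints that pass both seed-selection criteria.

    bins: list of measurement windows; each is an iterable of
          (local_endpoint, remote_endpoint) contact pairs (assumed layout).
    p_threshold: persistence an endpoint must reach (assumed to mean >=).
    """
    n = len(bins)
    active_bins = defaultdict(int)  # endpoint -> number of bins it was active in
    peers = defaultdict(set)        # endpoint -> distinct remote peers in the window
    for contacts in bins:
        seen = set()
        for local, remote in contacts:
            seen.add(local)
            peers[local].add(remote)
        for endpoint in seen:
            active_bins[endpoint] += 1
    return {
        endpoint
        for endpoint, count in active_bins.items()
        if count / n >= p_threshold      # persistence criterion
        and len(peers[endpoint]) >= 2    # peer count criterion
    }
```

In this sketch an SSH server contacted by a single client in every bin is persistent but fails the peer count criterion, so it is not offered as a seed.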

5.2. Graph module

The graph module is responsible for:


• constructing graphs around the seed endpoints received from the persistence module;
• merging similar graphs;
• removing graphs that failed to find any cooperating host for the given seed endpoint.

The graph module uses the graph algorithm presented in Section 3.1. Knowing that all active peers should be represented by an endpoint that is persistent, we can also reduce the number of flows we process in each graph. This reduction is attained by removing all flows that neither originate from nor are directed towards a persistent endpoint, since this traffic should not be part of the overlay. A significant reduction in the number of flows can be attained even when a persistence threshold lower than 0.8 is used for filtering flows. In our own implementation we set the persistence threshold for flow selection to 0.3. This has the advantage of detecting active peers even before they reach the persistence value required by the persistence module to be selected as seeds.
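The flow reduction can be sketched as a simple filter. The flow representation as a (source endpoint, destination endpoint) pair, the dictionary of precomputed persistence values and the function name are assumptions for illustration; only the threshold value 0.3 comes from the text.

```python
def filter_flows(flows, persistence_of, flow_threshold=0.3):
    """Keep only flows that touch at least one sufficiently persistent endpoint.

    flows: iterable of (src_endpoint, dst_endpoint) pairs (assumed layout).
    persistence_of: dict mapping endpoint -> persistence in [0, 1];
                    endpoints missing from the dict are treated as persistence 0.
    """
    def is_persistent(endpoint):
        return persistence_of.get(endpoint, 0.0) >= flow_threshold

    return [
        (src, dst)
        for src, dst in flows
        if is_persistent(src) or is_persistent(dst)
    ]
```

Using a lower threshold for filtering than for seed selection means that a flow is retained as soon as one of its endpoints has been active in a few measurement windows, well before that endpoint could qualify as a seed.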

However, before the module can construct any graph, it ﬁrst needs to receive seed endpoints from

the persistence module. The persistence module feeds seed endpoints to the graph module periodically.

When the module receives the ﬁrst set of seed endpoints it creates a graph for each of them. For every

subsequent set of received seed endpoints it checks whether given seed endpoints are already recorded

in any of the graphs. For those that are not, it creates new graphs. This way we prevent the creation of

unnecessary duplicate graphs.
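The duplicate check described above can be sketched as follows. Here each graph is reduced to the set of endpoints recorded in it, which is a simplifying assumption; the real graphs also carry edges and the sets introduced in Section 3.1.

```python
def add_graphs_for_new_seeds(graphs, seeds):
    """Create a graph for every seed that is not yet recorded in any graph.

    graphs: list of sets of endpoints, one set per existing graph
            (simplified stand-in for the real graph structure).
    seeds:  iterable of seed endpoints from the persistence module.
    Returns the list of seeds for which a new graph was created.
    """
    created = []
    for seed in seeds:
        if any(seed in graph for graph in graphs):
            continue  # already part of some graph, so no duplicate is created
        graphs.append({seed})
        created.append(seed)
    return created
```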

Since we expect this method to find cooperating endpoints and all active peers to be selected as seeds, we should after some time construct graphs that are very similar and describe the same P2P network despite starting from different seed endpoints. There is no point in keeping such graphs separate; therefore the module merges them. This raises the question, though, of how to define the ‘similarity’ of two graphs.

Two graphs that represent the same P2P network should have similar sets $V_c$, but since both graphs were iteratively constructed from different seed nodes they do not necessarily contain similar sets of edges or similar sets $V_r$. Therefore we define the similarity of two graphs $G_1$ and $G_2$ as

$$s(G_1, G_2) = \frac{\left| V_c^{G_1} \cap V_c^{G_2} \right|}{\min\left( \left| V_c^{G_1} \right|, \left| V_c^{G_2} \right| \right)}$$

where $V_c^{G_1}$ and $V_c^{G_2}$ represent $V_c$ of graphs $G_1$ and $G_2$, respectively. This definition ensures that the similarity of two graphs $G_1, G_2$ is high (in fact equal to 1) if $V_c^{G_1} \subseteq V_c^{G_2}$ and $|V_c^{G_1}| \ll |V_c^{G_2}|$. This is the case of two graphs that represent the same P2P network where one of them is much smaller (either because it was created later or because its seed was not as ‘active’ as the seed of the bigger graph).

We merge two graphs if their similarity is greater than the merge overlap threshold, which is another

algorithm parameter. Note that this similarity metric resembles the Jaccard index. The two differ for sets where one is a subset of the other: our metric assigns them the full similarity value, 1, whereas the Jaccard index reaches a value of 1 only for identical sets.
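The difference between the two metrics can be checked with a short Python sketch. The function names are illustrative, and returning 0 for empty sets is a convention chosen here, not something specified in the text.

```python
def graph_similarity(vc1: set, vc2: set) -> float:
    """Overlap of the two V_c sets relative to the smaller of them."""
    if not vc1 or not vc2:
        return 0.0  # convention assumed here for empty sets
    return len(vc1 & vc2) / min(len(vc1), len(vc2))


def jaccard_index(a: set, b: set) -> float:
    """Standard Jaccard index: intersection size over union size."""
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)


def should_merge(vc1: set, vc2: set, merge_overlap_threshold: float) -> bool:
    """Two graphs are merged if their similarity exceeds the threshold."""
    return graph_similarity(vc1, vc2) > merge_overlap_threshold


# A small, recently created graph that is fully contained in a larger one.
small = {"p1", "p2"}
large = {"p1", "p2", "p3", "p4"}
print(graph_similarity(small, large), jaccard_index(small, large))
```

For the subset case in the example, the similarity used by the graph module is 1 while the Jaccard index is only 0.5, so a Jaccard-based rule with a threshold of 0.5 or higher would keep the two graphs separate.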

There is, of course, a possibility that the graph algorithm will not be able to ﬁnd any cooperating

hosts for a certain seed. This might happen when the seed is the only peer of the respective P2P overlay

in the network, or when the seed node around which we tried to construct a graph was a service, e.g.

an email server. If any graph fails to ﬁnd at least one cooperating endpoint in the network for a certain

period of time, called the tryout period, it is removed from the module. Even though we remove the graph, it might be recreated the next time seed nodes are received from the persistence module, because the endpoint might be active despite the fact it has no cooperating nodes. Therefore we define another

time parameter, the ignore period, which determines how long after removing a graph with a speciﬁc

seed node this seed node may not be used to construct another graph. We do not want to ignore the

given seed endpoint forever, because a service using the port may change or a cooperating peer might

appear later.
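The interplay of the tryout period and the ignore period can be sketched as a small bookkeeping class. The class and method names are hypothetical, the tryout period is measured from graph creation for simplicity, and times are plain numbers in seconds; none of these details are taken from our implementation.

```python
from dataclasses import dataclass, field


@dataclass
class GraphLifecycle:
    tryout_period: float   # seconds a graph may stay without cooperating endpoints
    ignore_period: float   # seconds a removed seed may not be reused
    created_at: dict = field(default_factory=dict)     # seed -> graph creation time
    ignored_until: dict = field(default_factory=dict)  # seed -> end of ignore period

    def offer_seed(self, seed, now):
        """Create a graph for `seed` unless one exists or the seed is ignored."""
        if seed in self.created_at:
            return False
        if now < self.ignored_until.get(seed, float("-inf")):
            return False
        self.created_at[seed] = now
        return True

    def expire(self, cooperating_counts, now):
        """Remove graphs that found no cooperating endpoint within the tryout period.

        cooperating_counts: dict seed -> number of cooperating endpoints found.
        Returns the list of seeds whose graphs were removed.
        """
        removed = []
        for seed, t0 in list(self.created_at.items()):
            if cooperating_counts.get(seed, 0) == 0 and now - t0 >= self.tryout_period:
                del self.created_at[seed]
                self.ignored_until[seed] = now + self.ignore_period
                removed.append(seed)
        return removed
```

Once the ignore period has elapsed the seed is accepted again, which covers the case where the service behind the port changes or a cooperating peer appears later.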

The identiﬁcation module simply accepts graphs from the graph module, extracts port distribution

for the remote peers and performs protocol identiﬁcation as described in Section 4.


6. EVALUATION

We evaluate both the detection performance and the computational performance of the proposed detector, using a different dataset for each. In both cases the traffic was collected in the form of NetFlow data by a network probe. Flows were always collected for 5 minutes and then sent in a batch to the detector. The preference for batch processing over stream processing, as well as the batch size, are settings of the anomaly detection engine in which the detector was deployed; the detector itself can also process flows in a streaming fashion or work with batches of a different size.

The computational performance is evaluated on a relatively large Telco provider network to test the

detector under a heavy load. Since we could not tamper with the network in any way, the detection

performance evaluation was done on a much smaller university network. In this network, we could

deploy our own P2P nodes and thus establish the ground truth.

Several parameters can be set for the detector. In our experiments we ﬁxed the value of the tryout

period to 1 hour. The ignore period was set to 1 hour as well, but it was increased by 1 hour for the given seed each consecutive time the graph around that seed was removed for failing to find any cooperating peers. The measurement window had a size of 5 minutes. Each observation window was

composed of 10 bins, which we believe offered a good granularity of information. Values of other

parameters that we experimented with in the evaluation to ﬁnd the best combination can be found in

Table 3.
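The growing ignore period can be written as a one-line schedule. This sketch assumes that the first removal of a graph results in a 1-hour ignore period and every further consecutive removal adds 1 hour; the function name and parameter names are illustrative.

```python
def ignore_period_hours(consecutive_removals: int,
                        base_hours: int = 1,
                        step_hours: int = 1) -> int:
    """Ignore period applied after the n-th consecutive removal of a seed's graph."""
    if consecutive_removals < 1:
        raise ValueError("the seed's graph must have been removed at least once")
    return base_hours + step_hours * (consecutive_removals - 1)
```

Under this reading, a seed that repeatedly fails to find cooperating peers, such as a busy mail server, is retried less and less often.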

6.1. Dataset description

The university dataset was collected in the university network consisting of approximately 1000 hosts.

The network trafﬁc was collected for 20 hours during a working day. Since we did not have access to

all the computers and could not establish the ground truth concerning the overall network activity, i.e.

what service every endpoint in the local network belonged to, we chose 155 hosts from two subnets as

a small control set.

The ﬁrst subnet contained 36 hosts, of which 18 were running either Windows or Linux. We refer to

these hosts as client hosts. The client hosts were engaged in casual Internet activity, such as browsing

the web, working with email, listening to music, watching videos, etc. On these we also installed client

applications for several P2P networks, where one host could participate in several P2P networks. The list of

installed client applications can be found in Table 4.

To examine whether the algorithm was capable of linking hosts participating in a botnet, we infected

three computers with Trojan.Sirefef-6 malware, which uses a P2P overlay for its C&C [33]. We set all

Table 3. Parameters and their values used in the experiment. Parameter values used in the final evaluation are shown in bold

Parameter                           Values
Persistence threshold               0.5, 0.8
Mutual contacts overlap threshold   3, 4, 5, 6
Memory limit                        60, 90, 120 min
Merge overlap threshold             0.3, 0.5, 0.7

Table 4. List of peer-to-peer networks with their respective clients installed on the client hosts in the control set. The last column specifies how many hosts are running a given client application

Network      Client application   Hosts
Skype        Official client      18
BitTorrent   µTorrent             26
KAD          eMule                15
Gnutella     Phex                 18


client applications belonging to the same P2P network to use the same port to simplify evaluation of

the results. This had no effect on detection capabilities of our algorithm.

The second subnet contained servers; we refer to these hosts as server hosts. None of them was running any of the aforementioned applications. They ran many services, such as web servers, IMAP/POP services and others.

The Telco dataset contains traces mainly of homes with a DSL uplink. The network encompasses

tens of thousands of users and has a throughput of 40 Gbps, with number of ﬂows per 5 minutes

ranging approximately from 2 million to 11 million. The dataset spans 3 days in November 2011. For

the computational performance evaluation we are only interested in the size of the network; therefore

we do not provide any additional information.

6.2. Computation performance evaluation

For the sake of performance evaluation, the detector was deployed on a 24-core Intel Xeon computer

(24 virtual on 12 physical cores) and tested on the Telco dataset. We monitored processing times of the

three modules and retained memory as a function of time (and thus number of ﬂows in the network).

Figure 6(a) shows a stacked plot of the processing times of the three modules. The graph algorithm takes most of the time, whereas seed selection and P2P identification take only a fraction of it. There is also a clear relationship between the number of flows and the processing time. Daily trends are visible in the traffic, as users use their computers mostly in the evening or at night. The processing time peaks at around 300 seconds, which equals the time span of one processed batch. The detector thus reaches its limits when dealing with 10M+ flows per batch (approximately 34k+ flows per second). While it is partially parallelized, it could certainly be optimized to allow for greater throughput. Memory consumption, shown in Figure 6(b), exhibits the same dependence on the number of flows. The plot also shows that the memory footprint is slightly lower in the third peak, which is caused by blacklisting of seeds. The memory footprint ranges from 3 GB to 15 GB during the peak hours.
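The flow rate at which the detector saturates follows from the batch length. Since each batch covers 5 minutes (300 s) of traffic, the detector keeps up only while a batch is processed in at most 300 s. As a rough worked example, using the 10 million flow mark and the 11 million flow maximum reported for the Telco dataset:

```latex
\[
\frac{10^{7}\ \text{flows}}{300\ \text{s}} \approx 3.3 \times 10^{4}\ \text{flows/s},
\qquad
\frac{1.1 \times 10^{7}\ \text{flows}}{300\ \text{s}} \approx 3.7 \times 10^{4}\ \text{flows/s}
\]
```

Both values are of the same order as the sustained rate quoted above.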

6.3. Detection performance evaluation

6.3.1. Detection rate

Since the algorithm runs continually and modiﬁes the graphs according to the changes in the network,

we measure detection rate in time. Thanks to this we can see how much time the detector needs to

detect a P2P network since the start of the client application.

After every batch of ﬂows (every 5 minutes) we query the identiﬁcation module for the list

of recognized P2P networks and nodes that participate in them. As can be seen in Figure 5(a),

[Figure 5 plots: (a) detection rate (y-axis) against minutes since starting of nodes in a particular P2P network (x-axis, 0–200); (b) detection rate (y-axis) against memory limit (x-axis) for different mutual contacts thresholds.]

Figure 5. Detection rate of the proposed detector: (a) detection rate as a function of time; (b) detection

rate as a function of memory limit and mutual contacts threshold


[Figure 6 plots: (a) processing time [s] and number of flows against time [h] (0–72); (b) memory footprint [MB] and number of flows against time [h] (0–72).]

Figure 6. Stacked plots of (a) processing time and (b) memory footprint as a function of time (and

thus number of ﬂows). The graph algorithm takes most of the processing time required by the detector.

Strong trends are visible in the data that are caused by the users usually using computers in the evening

or at night

the algorithm was able to ﬁnd all hosts participating in Skype, BitTorrent, Kademlia and

Trojan.Sirefef-6 peer-to-peer networks. On the other hand, detection rate for Gnutella was considerably

lower: 44%.

Figure 5(a) also shows that detection of P2P nodes is not immediate and the algorithm needs some time before it detects them. All P2P networks except Gnutella were at least partially detected within an hour of the client applications being started. One can also notice that some Skype nodes were identified even before the endpoints representing these Skype clients reached the required persistence threshold. This happened as a result of the other Skype nodes commonly running in the university

network—the graph for Skype was already present when we started the Skype clients in the control set

and endpoints belonging to these clients were simply added to the graph without the need to become a

graph seed. This illustrates an important property of the detector: peers that join the overlay for which

a graph already exists are detected much faster than the ﬁrst peer(s) of a P2P network that does not yet

have a corresponding graph.

6.3.2. False positive rate

For various P2P networks we use different methodologies for evaluation of false positives. For KAD,

Gnutella, BitTorrent and Trojan.Sirefef-6, we consider every detected endpoint not associated with the

host from the control set and the respective listening port of the client application to be a false positive.

Since these P2P networks are used only rarely at the university, such an approach is viable. Using this

approach we determine the upper bound of the false positives detected by our algorithm. We cannot

do the same with Skype since it is very popular at the university. Instead, we evaluate false and true

positives only on the control set.

There were no false positives for four of the P2P networks: Skype, KAD, Gnutella and

Trojan.Sirefef-6. Only one false positive was found when linking cooperating hosts in the BitTorrent

overlay. We refrain from calculating the false positive rate, since it would only have a negligible value

due to the low number of false positives.
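The upper-bound counting of false positives can be sketched as a set difference. The endpoint representation as (host, port) tuples, the function name and the example addresses are assumptions for illustration.

```python
def count_true_and_false_positives(detected, ground_truth):
    """Count detected endpoints inside and outside the known control-set endpoints.

    detected:     set of (host, port) endpoints reported for one P2P network.
    ground_truth: set of (host, port) endpoints of control-set hosts running
                  that network's client on its listening port.
    Every detected endpoint outside the ground truth counts as a false
    positive, so the second number is an upper bound.
    """
    true_positives = detected & ground_truth
    false_positives = detected - ground_truth
    return len(true_positives), len(false_positives)


# Hypothetical example: three control-set clients, one unrelated endpoint detected.
truth = {("10.0.1.1", 6881), ("10.0.1.2", 6881), ("10.0.1.3", 6881)}
found = {("10.0.1.1", 6881), ("10.0.1.2", 6881), ("10.0.9.9", 6881)}
print(count_true_and_false_positives(found, truth))
```

The count is an upper bound because an endpoint outside the control set may in fact belong to a genuine client of the same P2P network that we simply had no knowledge of.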

7. INTERPRETATION OF RESULTS

To explain the inferior performance with Gnutella detection we need to investigate how nodes in various P2P networks communicate. Some P2P overlays use UDP-based communication, while others use

TCP-based communication or a combination of the two. If the P2P overlay is UDP-based, both incom-

ing and outgoing connections use the main port on which our detector focuses. On the other hand, if the

P2P overlay is TCP-based, the main port is used only by incoming connections. Outgoing connections


use ephemeral ports assigned by the operating system, which change frequently.³ Therefore, for TCP-based overlays our detector can only take advantage of the node’s incoming connections (because those target the main port). Of the P2P networks in our experiments, Kademlia and

Trojan.Sirefef-6 both use UDP for their P2P overlay, Gnutella uses TCP for its P2P overlay, and Skype

and BitTorrent use a combination of the two. Therefore, in order to detect Gnutella there needs to

be a reasonable number of incoming connections from remote peers to increase the chance of having

mutual contacts with other local nodes in the Gnutella overlay. Gnutella has two types of peers: leaf

nodes and ultrapeers. Leaf nodes only connect to ultrapeers and ultrapeers connect to both ultrapeers

and leaf nodes. Also, ultrapeers have a higher frequency of connections with other peers. There-

fore it is much more probable to detect and link together Gnutella ultrapeers than ordinary hosts.

And indeed, most of the cooperating hosts found for the Gnutella network were in fact ultrapeers.

The longer ramp-up period for Gnutella is due to the fact that it took time until the detected nodes

became ultrapeers.

We mentioned that Skype and BitTorrent use both TCP and UDP. Skype uses both protocols in a single overlay and switches between them as necessary. On the other hand, BitTorrent keeps a separate overlay

network on TCP and UDP. TCP is used in the original BitTorrent protocol for downloading ﬁles in the

swarm where the ﬁrst set of peers is received from a tracker. UDP is used in the newest implementations

for distributed tracker functionality to avoid using centralized trackers when it is not necessary. This

overlay is BitTorrent’s own DHT implementation. Therefore, there are two possibilities of how to

detect BitTorrent clients: via DHT overlay or via the original BitTorrent overlay. Our experiments show

that the detector is capable of linking BitTorrent clients using either protocol.

Here we need to realize the difference between the original BitTorrent protocol and other peer-to-peer protocols in this evaluation. While other P2P networks maintain an overlay network at all times, the original BitTorrent overlay is intermittent. The client participates in the overlay only when it wants to download a

ﬁle and joins a swarm (unless it is using DHT). Therefore, when we talk about detecting cooperating

hosts for BitTorrent using only the BitTorrent protocol, we mean hosts that are members of the same

swarm—not all BitTorrent clients in the network.

We mentioned that our detector is able to identify both overlays run by the BitTorrent client. Of

the two, detection of the UDP-based DHT implementation is faster because communication in this

overlay starts as soon as the client application is launched, without any user activity. In fact, all P2P

networks in our experiment, with the exception of BitTorrent’s original protocol, were detected with-

out any user activity. The original BitTorrent overlay can be observed only after the user initiates a

ﬁle download.

8. CONCLUSION

In this paper we presented a detector that is able to link hosts cooperating in a P2P overlay and identify this overlay if it is of a known type. The detector uses only inherent properties of P2P networks. It reconstructs the P2P overlay based on the connections observed in the network. Since the method does

not use either packet payloads or ﬂow statistics, it is a viable option for deployment on the backbone

network where computationally expensive models are not an option. Identiﬁcation of the overlay is

based on port distributions which we show are stable in time.

In the process of designing the detector we address the following questions:

1. Having one peer in an unknown P2P network, are we able to ﬁnd other peers in the respective

network?

2. Can we determine what particular P2P network it is?

3. Can we enumerate all P2P networks and their peers active in the monitored network?

³ Endpoints representing ephemeral ports might occasionally appear in the graph of the P2P overlay. Strictly speaking, these endpoints are true positives because they are used for communication in the P2P overlay. On the other hand, they are present in the same graph as the endpoint representing the main listening port on the same host. We have therefore ignored these ephemeral endpoints in the evaluation.


In Section 3 we showed how, using our graph algorithm, we can ﬁnd other hosts participating in

the same P2P overlay network as the ﬁrst peer. The algorithm is based on monitoring mutual peers

of hosts. The next logical step was to identify the observed P2P overlay, and we presented a simple

classiﬁcation method based on remote peers port distribution in Section 4. Finally, in Section 5 we

showed how to select peers that were likely participating in a P2P overlay and which were used as an

input to the graph algorithm.

The method was able to detect all cooperating peers in most of the P2P networks and attained an almost zero false positive rate in the controlled experiment.

We believe that this method presents a viable approach to detecting peers in overlay networks: both well-known file-sharing networks and specialized peer-to-peer networks used by botnets as a C&C channel. It has been used as part of an anomaly detection engine for 2 years now.

APPENDIX: PARAMETERS AND THEIR IMPACT

The persistence module and graph module can be tuned using several parameters that were introduced

in the text.

Parameters of the persistence module impact mainly computational performance, but they can also

affect the detection performance if chosen improperly. Tryout period and ignore period affect only

computation performance and have no impact on the detection or false positive rate. Measurement

window size and observation window size determine which endpoints are selected as seeds. As the

measurement window size increases, endpoints need to be active for a longer period of time to be

selected. Observation window size determines how fine-grained the calculated persistence is. For example, using only two bins an endpoint can have only one of three values of persistence: 0, 0.5 or 1. If the measurement window is made too small, even ephemeral ports can have high persistence, which would increase the number of graphs to be processed by the graph module. The last parameter used by the persistence module is the persistence threshold. The choice of its value impacts the computational performance: the higher the value, the fewer graphs are created in the graph module and thus fewer

resources are needed to process the graphs. Another important impact of this parameter is on the

duration of the ramp-up period in detection. The higher the value, the longer it takes for the detector to

ﬁnd nodes participating in new P2P overlays since the client application is started. Each endpoint needs

to be active for a speciﬁed time before a graph is created for the given endpoint. As the persistence

threshold increases, the time required gets longer.
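The effect of these parameters on granularity and ramp-up time can be made concrete with a short sketch. It assumes that a newly started endpoint is active in every measurement window and that "reaching" the threshold means persistence greater than or equal to it; the function names are illustrative.

```python
import math


def persistence_levels(n_bins: int) -> list:
    """All persistence values an endpoint can take with n_bins measurement windows."""
    return [k / n_bins for k in range(n_bins + 1)]


def min_rampup_minutes(p_threshold: float, n_bins: int,
                       measurement_window_min: float) -> float:
    """Shortest time before a continuously active endpoint can become a seed."""
    # The small epsilon guards against floating-point noise in p_threshold * n_bins.
    bins_needed = math.ceil(p_threshold * n_bins - 1e-9)
    return bins_needed * measurement_window_min


print(persistence_levels(2))
print(min_rampup_minutes(0.8, 10, 5))
print(min_rampup_minutes(0.5, 10, 5))
```

With 10 bins of 5 minutes each, as used in Section 6, a threshold of 0.8 would under these assumptions require at least 8 active bins, i.e. 40 minutes, whereas a threshold of 0.5 would require 25 minutes.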

The graph module uses three parameters: merge overlap threshold, mutual contacts overlap threshold and memory limit. The graph module is responsible for merging graphs that represent the same P2P

overlay, and merge overlap threshold is the parameter that states how strict the module is when decid-

ing whether two graphs represent the same P2P overlay. In our experiments we used three values of this

parameter without any impact on the detection results. However, the value of this parameter cannot be

arbitrary, as choosing a very low value could cause even unrelated graphs to be merged. On the other

hand, choosing too high a value could have a performance penalty because the graph module would

keep working with several very similar graphs. Mutual contacts overlap threshold and memory limit

both have a signiﬁcant impact on the detection rate and some impact on the false positive rate. Having

the memory limit ﬁxed, increasing the mutual contacts overlap threshold decreases the detection rate

and false positive rate. Similarly, having the mutual contacts overlap threshold ﬁxed, increasing the

memory limit increases the detection rate and false positive rate. The impact on the false positive rate

is usually only marginal, but setting mutual contacts overlap threshold to a very low value can rapidly

increase the false positive rate. For example, if we set the mutual contacts overlap threshold to 1 and

the network is the target of a scan then all scanned endpoints are added to the graph, increasing the

false positive rate considerably. As can be seen in Figure 5(b), these two parameters can compensate

for each other. Increase in the mutual contacts overlap threshold decreases the detection rate unless the

memory limit is increased appropriately as well.
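The scan example can be reproduced with a deliberately simplified stand-in for the graph algorithm. This sketch is not the algorithm of Section 3.1: it only assumes that a local endpoint joins the graph once it shares at least k remote contacts with endpoints already in the graph, and it ignores the memory limit. All names and the example data are hypothetical.

```python
def expand_graph(cooperating, contacts, k):
    """Grow the set of cooperating local endpoints by mutual remote contacts.

    cooperating: initial set of local endpoints (e.g. the seed).
    contacts:    dict local endpoint -> set of remote endpoints it talked to.
    k:           mutual contacts overlap threshold (simplified interpretation).
    """
    cooperating = set(cooperating)
    changed = True
    while changed:
        changed = False
        # Remote endpoints contacted by anything already in the graph.
        known_remotes = set().union(*(contacts.get(e, set()) for e in cooperating))
        for endpoint, remotes in contacts.items():
            if endpoint not in cooperating and len(remotes & known_remotes) >= k:
                cooperating.add(endpoint)
                changed = True
    return cooperating


# A scanner touches every local endpoint, including the seed.
scanner = "scanner"
contacts = {
    "seed": {"r1", "r2", "r3", scanner},
    "peer": {"r1", "r2", "r3", "r4", scanner},
    "web":  {"c1", scanner},
    "mail": {"c2", scanner},
}
print(sorted(expand_graph({"seed"}, contacts, k=1)))
print(sorted(expand_graph({"seed"}, contacts, k=3)))
```

With k = 1 the scanner alone acts as a mutual contact, so every scanned endpoint is pulled into the graph; with k = 3 only the genuine peer, which shares three remote contacts with the seed, is added.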

ACKNOWLEDGEMENTS

The work was supported by MVCR grant number VG2VS/242.


REFERENCES

1. Bitcoin: open source p2p money. Available: http://bitcoin.org [18 October 2013].

2. Wearden G. US Army aims to take p2p into battle. Available: http://www.zdnet.com/us-army-aims- to- take-p2p- into-

battle-3002094181 [20 October 2013].

3. Haq IU, Ali S, Khan H, Khayam SA. What is the impact of p2p trafﬁc on anomaly detection?. In Proceedings of the 13th

International Conference on Recent Advances in Intrusion Detection: RAID’10,Springer:Berlin,2010;1–17.

4. Sadre R, Sperotto A, Pras A. The effects of DDoS attacks on ﬂow monitoring applications, In IEEE Symposium on Network

Operations and Management,2012;269–277.

5. Jusko J, Rehak M. Revealing cooperating hosts by connection graph analysis. In Security and Privacy in Communication

Networks: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering,

Vo l . 1 0 6 , S p r i n g e r : B e r l i n , 2 0 1 3 ; 2 4 1 – 2 5 5 .

6. Móczár Z, Molnár S. Characterization of BitTorrent trafﬁc in a broadband access network. In AccessNets: Lecture Notes

of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering,Vol.63,Springer:Berlin,

2010; 176–183.

7. Kryczka M, Cuevas R, Guerrero C, Azcorra A. Unrevealing the structure of live bittorrent swarms: methodology and

analysis, In IEEE International Conference on Peer-to-Peer Computing,Kyoto,2011;230–239.

8. Qi J, Zhang H, Ji Z, Yun L. Analyzing BitTorrent trafﬁc across large network, In 2008 International Conference on

Cyberworlds,2008;759–764.

9. Falkner J, Piatek M, John JP, Krishnamurthy A, Anderson T. Proﬁling a million user DHT. In Proceedings of the 7th ACM

SIGCOMM conference on Internet Measurement: IMC’07,ACM:NewYork,2007;129–134.

10. Steiner M, En-Najjary T, Biersack EW. A global view of KAD. In Proceedings of the 7th ACM SIGCOMM Conference on

Internet Measurement: IMC’07,ACM:NewYork,2007;117–122.

11. Liu X, Li Y, Li Z, Cheng X. Social network analysis on kad and its application. In Proceedings of the 13th Asia–Paciﬁc

Web Conference on Web Technologies and Applications: APWeb’11,Springer:Berlin,2011;327–332.

12. Li C, Chen C. Topology analysis of Gnutella by large scale mining, In International Conference on Communication

Tec h n o l o g y : I C C T ’06,2006;1–4.

13. Acosta W, Chandra S. Trace driven analysis of the long term evolution of gnutella peer-to-peer trafﬁc. In Proceedings

of the 8th International Conference on Passive and Active Network Measurement: PAM’07,Louvain-la-Neuve,Belgium,

Springer: Berlin, 2007; 42–51.

14. Markatos EP. Tracing a large-scale peer to peer system: an hour in the life of gnutella. In Proceedings of the 2nd IEEE/ACM

International Symposium on Cluster Computing and the Grid: CCGRID’02, IEEE Computer Society: Washington, DC,

2002; 65–75.

15. Grizzard JB, Sharma V, Nunnery C, Kang BB, Dagon D. Peer-to-peer botnets: overview and case study. In Proceedings of the First Conference on First Workshop on Hot Topics in Understanding Botnets: HotBots’07, Cambridge, MA, USENIX Association: Berkeley, CA, 2007; 1–1.

16. Ha DT, Yan G, Eidenbenz S, Ngo HQ. On the effectiveness of structural detection and defense against p2p-based botnets. In IEEE/IFIP International Conference on Dependable Systems and Networks: DSN’09, 2009; 297–306.

17. Dagon D, Gu G, Lee CP, Lee W. A taxonomy of botnet structures. In Proceedings of the 23rd Annual Computer Security Applications Conference: ACSAC’07, 2007; 325–339.

18. Cooke E, Jahanian F, McPherson D. The zombie roundup: understanding, detecting, and disrupting botnets. In Proceedings of the USENIX SRUTI Workshop, Cambridge, MA, USENIX Association: Berkeley, CA, 2005; 39–44.

19. Davis CR, Neville S, Fernandez JM, Robert J-M, McHugh J. Structured peer-to-peer overlay networks: ideal botnets command and control infrastructures? In ESORICS, Lecture Notes in Computer Science, Vol. 5283, Springer: Berlin, 2008; 461–480.

20. Bartlett G, Heidemann J, Papadopoulos C. Inherent behaviors for on-line detection of peer-to-peer file sharing. In IEEE Global Internet Symposium, Anchorage, AK, 2007; 55–60.

21. Constantinou F, Mavrommatis P. Identifying known and unknown peer-to-peer traffic. In Fifth IEEE International Symposium on Network Computing and Applications: NCA 2006, Cambridge, Massachusetts, USA, 2006; 93–102.

22. Iliofotou M, Pappu P, Faloutsos M, Mitzenmacher M, Singh S, Varghese G. Network monitoring using traffic dispersion graphs (TDGs). In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement: IMC’07, San Diego, California, USA, ACM: New York, 2007; 315–320.

23. Iliofotou M, Kim HC, Faloutsos M, Mitzenmacher M, Pappu P, Varghese G. Graption: a graph-based p2p traffic classification framework for the internet backbone. Computer Networks 2011; 55(8): 1909–1920.

24. Iliofotou M, Faloutsos M, Mitzenmacher M. Exploiting dynamicity in graph-based traffic analysis: techniques and applications. In Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies: CoNEXT’09, ACM: New York, 2009; 241–252.

25. Jelasity M, Bilicki V. Towards automated detection of peer-to-peer botnets: on the limits of local approaches. In Proceedings of the 2nd USENIX Conference on Large-Scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More: LEET’09, Boston, MA, USENIX Association: Berkeley, CA, 2009; 3–3.

26. Coskun B, Dietrich S, Memon N. Friends of an enemy: identifying local members of peer-to-peer botnets using mutual contacts. In Proceedings of the 26th Annual Computer Security Applications Conference: ACSAC’10, Austin, Texas, ACM: New York, 2010; 131–140.

27. François J, Wang S, State R, Engel T. Bottrack: tracking botnets using NetFlow and PageRank. In Networking 2011, Lecture Notes in Computer Science, Vol. 6640, Springer: Berlin, 2011; 1–14.

28. Giroire F, Chandrashekar J, Taft N, Schooler E, Papagiannaki D. Exploiting temporal persistence to detect covert botnet channels. In Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection: RAID’09, Saint-Malo, France, Springer-Verlag: Berlin, Heidelberg, 2009; 326–345.

29. Kim J, Shah K, Bohacek S. Detecting p2p traffic from the p2p flow graph. In 7th International Wireless Communications and Mobile Computing Conference: IWCMC’11, 2011; 1795–1800.

30. Duda RO, Hart PE, Stork DG. Pattern Classification and Scene Analysis, Wiley: Chichester, 2001.

31. Stoica I, Morris R, Karger D, Kaashoek MF, Balakrishnan H. Chord: a scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications: SIGCOMM’01, San Diego, California, USA, ACM: New York, NY, USA, 2001; 149–160.

32. Kwok YKR. Peer-to-Peer Computing: Applications, Architecture, Protocols, and Challenges, Taylor & Francis: Boca Raton, FL, 2011.

33. McNamee K. Malware analysis report–botnet: ZeroAccess/Sirefef. 2012. Available: http://www.kindsight.net/sites/default/files/Kindsight_Malware_Analysis-ZeroAcess-Botnet-final.pdf [6 May 2014].

AUTHORS’ BIOGRAPHIES

Jan Jusko is a researcher at Cisco Systems, focusing on malware C&C channel detection. He holds a master's degree in computer science from Czech Technical University in Prague and is now pursuing a PhD in artificial intelligence. His research interests include network security, graph theory and machine learning.

Martin Rehak is a principal engineer with Cisco Systems security group. He has been working in the area of machine learning, anomaly detection and network security. In the past, he was a founder & CEO of Cognitive Security, acquired by Cisco in 2013. The VC-funded spin-off company was created to develop a commercial technology based on the research performed by Martin and his team at Czech Technical University. Martin holds an engineering degree from Ecole Centrale Paris and a PhD in AI from CTU in Prague.

DOI: 10.1002/nem

Int. J. Network Mgmt 2014; 24: 235–252
