
ToRank: Identifying the Most Influential Suspicious Domains in the Tor Network

June 2019
Expert Systems with Applications 123

DOI:10.1016/j.eswa.2019.01.029

Authors:

Wesam Al Nabki

Universidad de León

Eduardo Fidalgo

Universidad de León

Enrique Alegre

Universidad de León

Laura Fernández-Robles

Universidad de León

ToRank: Identifying the Most Influential Suspicious Domains in the Tor Network

Mhd Wesam Al-Nabki (a, b), Eduardo Fidalgo (a, b), Enrique Alegre (a, b), Laura Fernández-Robles (a, b, c)

(a) Department of Electrical, Systems and Automation, Universidad de León, Spain

(b) Researcher at INCIBE (Spanish National Cybersecurity Institute), León, Spain

(c) Department of Mechanical, Informatics and Aerospace Engineering, Universidad de León, Spain

Abstract

The Tor network hosts a significant number of hidden services related to suspicious activities. Law Enforcement Agencies (LEAs) need to monitor and investigate crimes hidden behind the anonymity provided by Tor. In this paper, we propose a new algorithm, named ToRank, that ranks hidden services in Tor better than the known algorithms used for the Surface Web. We also thoroughly analyze the content present in Tor, creating a dataset, DUTA-10K, that extends the previous Darknet Usage Text Addresses (DUTA) dataset. We quantitatively compared ToRank with some of the most popular ranking algorithms, namely PageRank, HITS, and Katz. The results showed that our proposal causes more harm to the robustness of the Tor network than all of them, which indicates its superiority for this problem. The analysis of DUTA-10K reveals that only 20% of the hidden services that can be accessed are related to suspicious activities, and 48% are associated with normal ones. We also discovered that domains related to suspicious activities usually present multiple clones under different addresses, which could be used as an additional feature for identifying them. We consider that our new algorithm, the extended dataset, and the findings obtained from the analysis are helpful for LEAs to fight against crimes that take place in the Tor hidden services.

Keywords: Darknet, Dataset, Influence detection, Graph analysis, Ranking algorithm, Hidden Services.

1. Introduction

Nowadays, the most common way to access information on the Internet is through standard search engines, such as Google or Bing. However, despite their efficient and powerful performance, they cannot index all the Web content (Bergman, 2001). Thus, a division is made between the part of the Web whose content is indexable, the Surface Web, and the rest of the Web, which is not, the Deep Web (Noor et al., 2011; Moore and Rid, 2016; Al Nabki et al., 2017a). In the depths of the Deep Web, there is a portion called the Darknet or Dark Web (Al Nabki et al., 2017a), the fragment of the Web that is intentionally hidden and can only be accessed through specific software applications. “The Onion Router”[1] (Tor) is one of the most popular Darknet networks, and its domains, known as hidden services (HS), can be accessed through the Tor Browser[2] or a proxy such as Tor2Web[3]. The Tor metrics website[4] reported that the number of unique addresses increased from 30K to almost 100K between April 2015 and September 2018 (Fig. 1)[5]. It is worth mentioning that in 2016 and 2017 there was a peak in the first quarter of each year followed by a sharp decrease, but there is no clear reason behind this, as Kate Krauss, the director of communications and public policy for the Tor Project, declared (Gallagher and UTC, 2016). However, the spikes might be explained by political events, such as when the Ugandan government blocked social networks before the election in February 2016 (Duggan, 2016). Consequently, new domains were created in the Tor network and more people used it[6].

∗ E-mail addresses: {mnab, efidf, ealeg, l.fernandez}@unileon.es

[1] www.torproject.org

[2] https://www.torproject.org/projects/torbrowser.html.en

[3] https://tor2web.org/

[4] https://metrics.torproject.org/

[5] Source: https://metrics.torproject.org

[6] Tor traffic during the Ugandan elections in 2016: https://metrics.torproject.org/userstats-relay-country.html?start=2015-11-22&end=2016-02-20&country=ug&events=off

Preprint submitted to Expert Systems With Applications October 23, 2020

Figure 1: The number of live Tor HS between 2015-April-01 and 2018-September-01. The horizontal axis represents the years, while the vertical axis corresponds to the number of unique Tor addresses.

The privacy and the high level of anonymity provided by the structure of the Tor network have always attracted traders of suspicious services, who promote their business by creating new HS, posing new critical challenges to world security (Ling et al., 2015; Ciancaglini et al., 2013; Norbutas, 2018; Foley et al., 2018). To address these threats, Law Enforcement Agencies (LEAs) need techniques and automatic tools to monitor the HS activities efficiently. Based on our collaboration with the Spanish National Cybersecurity Institute (INCIBE[7]), we divided this process into a pipeline of three main components (Fig. 2).

Figure 2: The three components of the onion domains monitoring pipeline.

First, we classify the HS content into normal and suspicious activities; the latter refers to the content that LEAs are interested in monitoring. Next, we categorize the suspicious ones into different groups based on the crime or activity they could be related to. Our previous work (Al Nabki et al., 2017a) targeted this component by creating a supervised text classifier to isolate the suspicious domains, and it is currently being used by one Spanish LEA.

[7] In Spanish, it stands for Instituto Nacional de Ciberseguridad de España.

The second component of our pipeline, which is the objective of this paper, is responsible for ranking and detecting the most influential HS within the network. It is fed with a list of onion domains that our classifier determined as suspicious, and it ranks them using an algorithm that we propose to reflect their popularity among other HS. Recognizing the influential HS could provide clues to the LEAs about who the market leaders of each activity are. For example, identifying the most influential drug marketplaces that attract people could be useful to draw insights into the common products, the sellers' nicknames, the main countries involved, and the possible exporting destinations of that market.

Thirdly, the de-anonymization of the HS, i.e. locating its IP address, would take place (Biryukov et al., 2013; Jansen et al., 2014; Kwon et al., 2015; Matic et al., 2015). Although this task is challenging due to the high level of security in the Tor network, thanks to the ranking component, the onion domains could be prioritized. Indeed, even if the LEAs could take an unlawful HS down, the domain could be easily replaced by cloning its content into a new one. An additional advantage of the ranking module (the second component) is that it can detect the domain again, mainly if it has a high relevance, which could help in neutralizing this threat anew. Although the proposed method cannot prevent people from accessing these suspicious HS, it helps LEAs to keep a close eye on the influential domains.

Hence, the presented ranking algorithm is a supplementary and valuable resource that helps the LEAs to make better use of their resources because it recommends where to focus their localization and monitoring efforts.

Since the first and the last components are out of the scope of this paper, in the following we explain only how we carried out the second one. In particular, this work presents ToRank, a novel ranking algorithm to rank and to detect the most influential onion domains in the Tor network that engage in suspicious activities. At present, the result of this work is being used by the Spanish Police Forces to monitor the Tor Darknet. The main contributions of this paper are summarized as follows.

• We introduce ToRank, a link-based ranking algorithm for onion domains that detects which ones are the most influential. It proved to outperform well-known ranking algorithms, such as PageRank, HITS, and Katz, in terms of the reduction in the giant component, the clustering coefficient, and the density, while increasing the average shortest path and the diameter of the Tor network when it is represented by a directed graph (Fig. 3).

• We propose and make publicly available DUTA-10K[8], an extended version of the “Darknet Usage Text Addresses” (DUTA) dataset (Al Nabki et al., 2017a) that contains 10367 manually labeled onion domains. To keep up with the most recent activities on the Tor HS, DUTA-10K introduces CryptoLocker, a new category which has spread widely, especially after the WannaCry virus (Mitchell, 2017).

• Finally, we carried out and present here a statistical analysis of DUTA-10K regarding the distribution of the activities of its onion web pages, the domains that have replicated content, and the distribution of the languages in the analyzed pages.

The rest of the paper is organized as follows. Section 2 reviews the previously published related work. Next, Section 3 introduces DUTA-10K, the updated version of the DUTA dataset. After that, the proposed ranking method, ToRank, is described in Section 4. In Section 5 we describe the conducted experiments and how we evaluated the proposed ranking method. We discuss the obtained results in Section 6. Finally, Section 7 presents the main conclusions that can be drawn from this work and points out some follow-up approaches already in progress.

2. Related work

Few works have addressed the problem of ranking the HS in the Tor network. Biryukov et al. (2014) proposed a solution that exploits the concept of entry guard nodes (Elahi et al., 2012) to de-anonymize clients of a Tor HS. They estimated the popularity of onion domains in the Tor network by examining the incoming traffic to those domains; the weakness of this approach is that the analysis can be blocked if the vulnerability is fixed. The approach followed in this paper is entirely different. Our purpose was to determine which HS are more significant than others as a possible source of suspicious content, but without measuring the incoming traffic. The solution that we present here uses, among other things, some of the concepts that worked on the Surface Web to search for a web page. We represent the Tor Darknet as a directed graph, and we rely upon graph theory for first ranking and later detecting the most influential onion domains.

[8] https://goo.gl/forms/bmJCaKthwxoQAwMm1

Graph data structures have been widely used to represent a set of entities with their connections in fields including, but not limited to, social network analysis (Scott, 2017; Backstrom and Kleinberg, 2014; Ji et al., 2016) and data mining (Al Nabki et al., 2017b). Henni et al. (2018) used a graph-based approach to build an unsupervised feature selection method where nodes correspond to features and the relationships between those features are captured by the graph edges. Next, to assign an importance score to each feature, they addressed several graph centrality measures and, in particular, the PageRank algorithm. Hasan et al. (2013) also used graph theory to build a trust relationship graph between the network users, represented by nodes, where the edges reflect a binary trust relation between them. Al Nabki et al. (2017b) proposed a method to detect the emerging products within the Tor network using the K-shell algorithm (Carmi et al., 2007). They employed an undirected graph where the nodes refer to the marketplace products and the edges express the presence of two products within the same marketplace.

The detection of influential nodes within a given graph is performed either by analyzing the connectivity between the nodes, called link-based, or by evaluating the content of the nodes, known as content-based (Derhami et al., 2013; Bidoki et al., 2010; Bidoki and Yazdani, 2008). Both approaches could be merged into a hybrid one by extracting features from the graph and utilizing the content of the nodes (Anwar and Abulaish, 2015). Link-based ranking algorithms have been widely studied and employed to solve several problems (Borodin et al., 2005). Xu et al. (2009) proposed an algorithm that combines a link-based ranking algorithm with a Support Vector Machine classifier to determine the eligibility of applicants for credit cards or bank loans. Fronzetti Colladon and Remondi (2017) explored the use of social network metrics, such as in-degree, out-degree, closeness, and betweenness centrality, to combat money-laundering offenses. Ferrara et al. (2014) introduced an algorithm called LogAnalysis that is based on several social network analysis and community detection algorithms to detect criminal communities via the logs of their phone call records. Taha and Yoo (2017) presented an algorithm using the Minimum Spanning Tree (MST) to build a network of criminals. Each node was assigned a score proportional to the number of nodes whose existence depends on the existence of the targeted node. Arulselvan et al. (2009) conducted a study to detect the critical nodes in sparse networks by proposing an algorithm to minimize the pair-wise connectivity between the nodes. Zhang et al. (2015) proposed a method to identify the influential nodes using two node centrality techniques, the betweenness (Freeman et al., 1979) and Katz (Katz, 1953) centralities. Hu et al. (2016) studied the GitHub repositories network to detect the influential ones by building a star relation graph between the repositories. Then, they assessed the performance of the weighted HITS (Kleinberg, 1999) and PageRank (Page et al., 1999). Later on, Hu et al. (2018) proposed the UserRank algorithm, which is a personalized version of PageRank dedicated to the GitHub developers network. Another study by Nouh and Nurse (2015) focused on identifying the key player nodes in a Facebook group using social network analysis metrics, namely, the eigenvector centrality and the betweenness centrality (Ruhnau, 2000).

Figure 3: An overview of the procedure followed to detect and to rank the influential HS. First, the hyperlinks of the dataset are extracted. Afterward, for every single category of the suspicious activities, a Suspicious Activities Graph (SAG) is constructed, like SAG_Porno and SAG_Drugs. Additionally, another graph is built for all the activities of interest, named SAG_All. Ultimately, the ranking algorithms are applied to rank and to detect the most influential HS.

Concerning content-based and hybrid approaches, Anwar and Abulaish (2015) developed an algorithm to rank and to detect the influential leaders of radical groups in the Darknet forums. They extracted a set of features that measure the radicalness of the users. Then, they developed an algorithm based on PageRank to build a ranked list of radically influential users. Cossu et al. (2015) proposed an algorithm to detect influence in the Twitter social network. A directed graph was used to represent the network, where the nodes refer to the users and the edges capture the following relation between them. Each Twitter user was described by some features that were extracted from the content, such as the characteristics of the tweets, and from the properties of the constructed graph, like the degree (Seidman, 1983) and the betweenness centrality (Freeman et al., 1979). However, to the best of our knowledge, none of the previously commented methods has been applied to rank and to detect the influential HS in the Tor network. Hence, to fill this gap, we propose the ToRank algorithm, which belongs to the link-based family. Next, we compared its behavior on the onion domains network with three ranking algorithms of the same family, namely, PageRank, HITS, and Katz. We selected those algorithms due to their basic role in the majority of the link-based approaches commented above. Investigating content-based and hybrid approaches is out of the scope of this work and will be considered in future works.
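
As a point of reference for the link-based family discussed above, the following is a minimal sketch of PageRank computed by power iteration on a toy hyperlink graph. It is not ToRank, and it is not the implementation used in the experiments; the function name, the example graph, and the damping factor of 0.85 are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank.

    links maps each node to the list of nodes it links to.
    Returns a dict node -> score; the scores sum to 1.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    out = {v: list(links.get(v, ())) for v in nodes}
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for target in out[v]:
                    new[target] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Toy graph (hypothetical domains): three pages link to "hub", which links to "a".
toy_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = pagerank(toy_links)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

In this toy graph the node that receives the most hyperlinks obtains the highest score, which is the intuition that link-based ranking relies on.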

The interpretation of what “influence” means varies according to the pursued goal. In viral marketing, the opinion leaders who can convince their audience of a point of view regarding a product, a service, or even an idea are considered influential (Gohari and Mohammadi, 2014). In the field of terrorist networks, by contrast, influence could refer to detecting people who are connected with the majority of the network users in a decentralized network, such as the financial managers (Berzinji et al., 2012). Eliacik and Erdogan (2018) analyzed social networks and micro-blogging communities to recognize users who are able to change the decisions of others via a sentiment analysis algorithm. In such a context, the cluster of those social actors represents the influential bloc. Anger and Kittl (2011) evaluated the influence of users by their social networking potential, which is a score that captures the amount of interaction that a user receives from his or her followers with respect to all the published tweets. Taha and Yoo (2017) employed the existence dependency concept on a network of criminals.

In the Tor network, as in the Surface Web (Cockburn and McKenzie, 2001; Levene, 2011), a user surfs the network moving from one domain to another, thanks to the hyperlinks connecting the web pages, until he or she reaches a market-leading domain; such a domain is interpreted as influential in this context. In this paper, we employ social network analysis techniques and algorithms to establish a link-based approach that analyzes the hyperlinks between the onion domains in order to rank them and to recognize the influential ones. However, to the best of the authors' knowledge, there is no ground truth ranking to judge the correctness of a given order. Alternatively, and for consistency with previous studies (Booker, 2012; Fronzetti Colladon and Gloor, 2018; Fronzetti Colladon and Vagaggini, 2017), the influence of a given domain is interpreted as the amount of disruption that removing that domain can cause to the connectivity of the network. This criterion is related to network robustness, which is defined as the ability of a network to retain its system structure intact despite being exposed to perturbations, i.e. the removal of the influential nodes (Holmgren, 2007). This choice is backed up by a considerable amount of literature that has already proposed algorithms and techniques to effectively attack network robustness (Chaurasia and Tiwari, 2013; Duijn et al., 2014; Zhang et al., 2015; Memon and Larsen, 2006). This paper targets the robustness of the Tor network and tests its destabilization cost by measuring the effect of eliminating the top-ranked nodes, for the purpose of detecting the influential domains.
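
The removal-based notion of influence described above can be sketched as follows: delete the top-ranked nodes one at a time and track how the largest connected component shrinks. This is only an illustration under stated assumptions. The edge list and the two rankings are made up, and the giant component is taken here as the largest weakly connected component (edge direction ignored), which may differ from the exact measurement used in the experiments.

```python
from collections import deque


def giant_component_size(edges, removed=()):
    """Size of the largest weakly connected component after removing nodes."""
    removed = set(removed)
    nodes = {v for edge in edges for v in edge} - removed
    adjacency = {v: set() for v in nodes}
    for src, dst in edges:
        if src in nodes and dst in nodes:
            # Ignore direction: weak connectivity only.
            adjacency[src].add(dst)
            adjacency[dst].add(src)

    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best


def removal_curve(edges, ranking, k):
    """Giant component size after removing the top 0, 1, ..., k ranked nodes."""
    return [giant_component_size(edges, ranking[:i]) for i in range(k + 1)]


# Hypothetical hyperlink graph: (source domain, target domain).
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
print(removal_curve(toy_edges, ["hub", "a", "b", "c"], 2))
print(removal_curve(toy_edges, ["c", "b", "a", "hub"], 2))
```

Under this criterion, a ranking is considered better when its curve drops faster, that is, when removing its top-ranked nodes fragments the graph more quickly.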

3. Onion domains dataset

3.1. Why build the DUTA dataset

The first component of the proposed monitoring pipeline is responsible for separating the suspicious domains from the normal ones and then classifying them into categories. Initially, this component was a keyword-based system that filtered the traffic according to a predefined list of keywords set by experts. However, besides the difficulty of keeping such a system up to date, it could produce a high error rate: false positives due to polysemous words in the keyword list, and many false negatives due to the incompleteness of the keyword list. The alternative solution was to build an automatic text classification system based on Supervised Machine Learning algorithms. Such a system needs to be trained on labeled samples for each category of the activities, and for this purpose, INCIBE, in collaboration with the Spanish Police Forces, granted the authors access to the keyword lists and pre-classified samples of each category. This allowed the authors to have an idea about the possible content of each category and, consequently, to introduce the first version of the “Darknet Usage Text Addresses” (DUTA) dataset (Al Nabki et al., 2017a).

3.2. Why extend the DUTA dataset

Although the lifespan of some DUTA domains is short, as they might be taken down by some LEA or closed by their hosts, the value of DUTA is preserved. Apart from giving insights into Tor in a specific period, it could be used to research several problems, including, but not limited to, the detection of emerging products in the onion domains (Al Nabki et al., 2017b), image classification (Fidalgo et al., 2017), text summarization (Joshi et al., 2018), or recognition of onion domains services (Biswas et al., 2017). Therefore, we decided to extend DUTA, and we collected new onion domains between May and July 2017.

3.2.1. DUTA-10K extension procedure

The Tor metrics website indicated that there were more than 100K HS alive at the time of writing this paper. However, for security reasons, the Tor network structure does not have a public DNS server where all the HS addresses are registered. Instead, it uses a Hidden Service Directory (HSDir), which is a Tor relay that functions as a middle point between an HS, which publishes its descriptors there, and the clients, who communicate with it to learn the addresses of the HS's introduction points (Biryukov et al., 2013, 2014). A Tor relay needs a specific flag, assigned by the Tor authorities, to function as an HSDir. Instead of asking for that flag, we built a crawler that searched the Web for new onion addresses.

To extend DUTA, we incorporated more onion addresses by searching in different sources. First, we developed a customized crawler that looks for onion addresses in three resources: 1) the online notepad services in the Surface Web, 2) the search engines of the Tor network, and 3) the hyperlinks of the DUTA dataset. Then, we detected the addresses using a parser that employs a regular expression pattern to match the onion ones.
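
A parser of this kind can be sketched with a regular expression. The exact pattern used to build DUTA-10K is not given in the text, so the one below is an assumption: it matches the 16-character base32 labels of version 2 onion addresses, which were current in 2017, and also the 56-character labels of version 3 addresses. The function name and the sample text are hypothetical.

```python
import re

# Base32 alphabet used by onion addresses: letters a-z and digits 2-7.
# Assumed pattern: 56-character (v3) or 16-character (v2) label plus ".onion".
ONION_PATTERN = re.compile(
    r"\b(?:[a-z2-7]{56}|[a-z2-7]{16})\.onion\b",
    re.IGNORECASE,
)


def extract_onion_addresses(text):
    """Return the sorted, de-duplicated, lower-cased onion addresses in text."""
    return sorted({match.lower() for match in ONION_PATTERN.findall(text)})


sample = (
    "Mirror list: http://abcdefghij234567.onion/index.html and "
    "ABCDEFGHIJ234567.onion (same site), but not short.onion"
)
print(extract_onion_addresses(sample))
```

Lower-casing and de-duplicating the matches keeps one entry per address even when the same address is posted several times with different capitalization.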

The Surface Web has plenty of addresses that are posted by anonymous people on public notepad websites, such as Pastebin[9]. Fortunately, Pastebin is powered by the Google search engine, which allowed us to search for onion addresses by typing keywords like “Onion Addresses”, “hidden services 2018”, “darknet links 2019”, and “.onion links”. We scraped the content of the retrieved pastes and parsed the onion addresses. Additionally, we used Tor network search engines, such as Ahmia.fi and onion.link, searching for random words like “onion address”, “Tor services”, “Tor markets”, and “Tor products”. Then, we scraped the retrieved onion web pages and parsed the onion addresses. Finally, we used the content of the DUTA samples to parse new HS addresses; then we scraped their content and iteratively repeated this process two more times. Those three strategies returned 124589 new unique onion addresses, but only 3536 of them were active at that moment, while the majority were down, i.e. they returned a connection time-out error message. For each active onion domain, we crawled the root page and the first level in depth of its sub-pages. Next, we concatenated the pages of each domain into a single HTML file. Therefore, the collected samples represent a real case of the domains in the Tor network, without any bias towards some specific category. Moreover, we saved each sample of DUTA-10K in a textual format after removing the HTML tags. To this end, we used a simple workaround: we loaded the HTML pages with Lynx[10], a text-based web browser, and then saved the cleaned text into a text file. We extended DUTA by adding the newly collected samples, i.e. the 3536 onion domains, and denoted it as DUTA-10K because it holds 10367 unique onion addresses.

[9] https://pastebin.com/

3.2.2. DUTA-10K labeling procedure

To ease the labeling of the new samples, we split this task into two phases. The first one utilizes the previously proposed text classifier to separate the suspicious activities from the normal ones; then, we validate the assigned labels manually, one by one. The second phase is a manual labeling of the samples that were classified as Others. We followed the same rules as in the previous version of DUTA, which can be summarized in the following points: 1) an author labels a domain based on the user-visible textual content only; 2) a domain must receive a single tag according to its activity, and in case it holds more than one activity, it is tagged as a black or white marketplace depending on the suspiciousness of the activity; and 3) if any author hesitates about the tag, an open discussion is established with the rest of the authors.

The DUTA-10K samples are distributed over 25 classes, with some small changes in their names compared to the original DUTA. In DUTA-10K, we merged the “Leaked-data” category into “Fraud” because both of them had a small number of samples and are related to the same topic. In the same way, we merged the category “Wiki” with the category “Hosting/Directory”. Additionally, DUTA-10K presents a new category, CryptoLocker, related to domains used to pay a ransom to decrypt a machine infected with ransomware, like the WannaCry virus (Mitchell, 2017) (Table 1). Thanks to the collaboration of Spanish LEAs with INCIBE in developing solutions for monitoring the suspicious activities of the onion domains, INCIBE provided us with a list of activities that are considered of interest by one of the main Spanish LEAs. Accordingly, we tagged the activities related to this list as Suspicious Activities, whereas the rest were denoted as Normal Activities.

[10] http://lynx.browser.org/

Table 1 also shows a third activity type named Unknown, which contains the HS whose content could not be accessed by the crawler. It comprises three domain categories: (i) Locked, domains that require human interaction to access, such as solving a CAPTCHA or entering log-in credentials; (ii) Empty, domains without text or with graphical content only; based on our previous work (Al Nabki et al., 2017a), we also assigned to this category the very small HS, with an amount of text less than or equal to 5 words; and (iii) Down, when the crawler returns an error while downloading the textual content. For example, many HS require JavaScript, which was disabled in the crawler for security reasons.
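
The three Unknown categories can be expressed as a small decision rule. The sketch below is only an illustration: the function name and the two boolean flags are hypothetical, and in the dataset the labels were validated manually rather than assigned by such a function. Only the 5-word threshold for Empty comes from the text.

```python
def unknown_category(text, download_failed=False, needs_interaction=False):
    """Return 'Down', 'Locked', 'Empty', or None for a crawled hidden service.

    download_failed and needs_interaction are hypothetical flags that a
    crawler would have to provide; None means the content is usable.
    """
    if download_failed:
        return "Down"
    if needs_interaction:
        return "Locked"
    # Very small pages, with 5 words or fewer, are treated as Empty.
    if len(text.split()) <= 5:
        return "Empty"
    return None


print(unknown_category("", download_failed=True))
print(unknown_category("Please solve the CAPTCHA", needs_interaction=True))
print(unknown_category("coming soon"))
print(unknown_category("a page with clearly more than five words of text"))
```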

Every sample of the dataset contains the HTML code of the web page, the language[11], and the activity assigned by the authors. Some activities were divided into sub-activities, and, therefore, subcategory labels were assigned to them. For example, the document forging activity has a main activity named Counterfeit Personal Identification with three sub-activities: Passport, ID, and Driving License.

3.3. Statistical analysis of DUTA-10K

To have a deeper understanding of the onion domains, we made a statistical analysis of DUTA-10K with respect to: 1) the activities distribution, 2) the domains content replication, and 3) the used languages.

3.3.1. Activities distribution

Out of the 10367 samples of DUTA-10K, we found that the suspicious activities represent 20% of the Tor HS and the normal ones 48%. The third category, i.e. the content which cannot be accessed, forms 32%. Drugs trading is the most popular suspicious activity on the onion domains, representing 23% of the suspicious HS. It is followed by Credit Cards Counterfeit and Hacking, with 20% and 10% of the suspicious HS, respectively. Conversely, the Human-Trafficking activity has the lowest presence, with only 0.2% (Fig. 4).

[11] The LangDetect 1.0.7 library was used for the language detection.

Table 1: DUTA-10K dataset activities distribution. The letter C. denotes a Counterfeit activity.

Activity Type            Activity                    Sub-Activity      #HS
Suspicious Activities    Pornography                 Child             105
(20%; 2013 HS)           Pornography                 Adult             148
                         Drugs                       -                 465
                         Violence                    Hate               19
                         Violence                    Hitman             28
                         Violence                    Weapons            48
                         C. Credit Cards             -                 399
                         C. Money                    -                  83
                         C. Personal Identification  Passport           48
                         C. Personal Identification  ID                 20
                         C. Personal Identification  Driving-License     4
                         Hacking                     -                 205
                         Cryptolocker                -                 185
                         Marketplace                 Suspicious        127
                         Services                    Suspicious         20
                         Forum                       Suspicious         63
                         Fraud                       -                  43
                         Human-Trafficking           -                   3
Normal Activities        Art & Music                 -                  15
(48%; 5016 HS)           Casino & Gambling           -                  29
                         Services                    Normal            341
                         Cryptocurrency              -                 868
                         Forum                       Normal            163
                         Marketplace                 Normal            138
                         Library & Books             -                  45
                         Personal                    -                 616
                         Politics                    -                  12
                         Religion                    -                  21
                         Hosting & Software          File-Sharing      205
                         Hosting & Software          Folders            99
                         Hosting & Software          Search-Engine      92
                         Hosting & Software          Server           1416
                         Hosting & Software          Software          411
                         Hosting & Software          Directory         133
                         Social-Network              Blog              219
                         Social-Network              Chat               79
                         Social-Network              Email              72
                         Social-Network              News               42
Unknown                  Down                        -                 864
(32%; 3338 HS)           Empty                       -                1653
                         Locked                      -                 821
Total sum                                                            10367

For the normal activities (Fig. 5), we found that the HS in the Hosting & Software category represent 47% of the normal onion domains. In the second position, 17% of the domains are related to Cryptocurrency and Bitcoin trading. Next, the category Personal represents 12% of the DUTA-10K normal activities. The political websites have the lowest count, with 0.2%.
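
The shares quoted above can be recomputed from the counts in Table 1. The short script below is a worked check rather than part of the original analysis; the variable and function names are illustrative, and the percentages in the text appear to be rounded.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Counts copied from Table 1.
TYPE_TOTALS = {"Suspicious": 2013, "Normal": 5016, "Unknown": 3338}
HOSTING_AND_SOFTWARE = [205, 99, 92, 1416, 411, 133]

total = sum(TYPE_TOTALS.values())
for name, count in TYPE_TOTALS.items():
    print(f"{name}: {share(count, total):.1f}% of {total} HS")

suspicious = TYPE_TOTALS["Suspicious"]
for name, count in [("Drugs", 465), ("C. Credit Cards", 399),
                    ("Hacking", 205), ("Human-Trafficking", 3)]:
    print(f"{name}: {share(count, suspicious):.1f}% of suspicious HS")

normal = TYPE_TOTALS["Normal"]
for name, count in [("Hosting & Software", sum(HOSTING_AND_SOFTWARE)),
                    ("Cryptocurrency", 868), ("Personal", 616), ("Politics", 12)]:
    print(f"{name}: {share(count, normal):.1f}% of normal HS")
```

Note that the 47% figure corresponds to the whole Hosting & Software group (2356 HS), not to the Server sub-activity alone.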

Along with the insights that could be drawn from this analysis, we used the activities distribution of DUTA-10K, i.e. the network of the suspicious and the normal domains, for the purpose of evaluating ToRank and the benchmark algorithms. Hence, we investigated the detection of the most influential domains with respect to the suspicious and the normal ones.

Figure 4: The distribution of the suspicious activities in DUTA-10K.

Figure 5: The distribution of the normal activities in DUTA-10K.

3.3.2. Domains content replication

During the process of labeling DUTA-10K, we detected that some samples had identical or quasi-identical textual content but were hosted under different addresses. The latter refers to HS whose text becomes exactly identical after preprocessing, for example by removing the PGP signature, the dates, or the prices of the products. We obtained the MD5 hash (Rivest, 1992) of each cleaned onion domain and grouped the domains into the defined 25 categories. After that, we found only 5368 unique texts based on their hashes, i.e. 51% of the DUTA-10K samples, with replicated texts having between 2 and 496 copies each.
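The clone detection described above can be sketched as follows. This is only an illustration under stated assumptions: the cleaning patterns (`PGP_BLOCK`, `DATE`, `PRICE`) and the helper names `clean_text` and `group_clones` are hypothetical, since the text names the kinds of fragments that were removed but not the exact rules.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical cleaning rules: the paper only names PGP signatures, dates and prices.
PGP_BLOCK = re.compile(
    r"-----BEGIN PGP SIGNATURE-----.*?-----END PGP SIGNATURE-----", re.DOTALL)
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PRICE = re.compile(r"[$€£]\s?\d+(?:[.,]\d+)?")


def clean_text(text):
    """Strip volatile fragments so that quasi-identical pages become identical."""
    for pattern in (PGP_BLOCK, DATE, PRICE):
        text = pattern.sub(" ", text)
    return " ".join(text.lower().split())


def group_clones(pages):
    """Map the MD5 digest of each cleaned text to the addresses sharing it."""
    groups = defaultdict(list)
    for address, text in pages.items():
        digest = hashlib.md5(clean_text(text).encode("utf-8")).hexdigest()
        groups[digest].append(address)
    return dict(groups)


# Toy pages (made up): the first two differ only in price and date.
pages = {
    "aaa.onion": "Best shop! Price: $10.50 updated 2018-01-02",
    "bbb.onion": "Best shop! Price: $12 updated 2018-03-09",
    "ccc.onion": "A personal blog about gardening",
}
clones = group_clones(pages)
```

Each digest with more than one address corresponds to a group of clones, so the number of keys plays the role of the unique-text count reported above.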

In Fig. 6(a), it can be noticed that the majority of the suspicious HS are cloned and appear several times under different onion addresses. For example, only 60% of the HS labeled as Drugs have unique content, and only 10% of the Cryptolocker domains are unique. In contrast, the normal activity domains present the reverse behavior, as shown in Fig. 6(b). However, the Hosting category is an exception, because only 40% of its domains are unique. The high number of replications is due to a hosting company called “Freedom Hosting”¹² that accounts for 35% of those clones. The high number of suspicious HS clones possibly reflects the concern of their owners to provide smooth access for their customers in case an LEA takes some of their domains down.

3.3.3. Language analysis

We did not ﬁnd representative diﬀerences between

normal and suspicious activities in terms of DUTA-10K

language analysis. We detected 38 languages in DUTA-

10K domains, but only ﬁve of them, which are illus-

trated in Fig. 7, have a frequency higher than or equal

to 1%. The English language is the most common, and

the one used in 84% of the samples, followed by Rus-

sian with 6%. From the point of view of a researcher

or a LEA, this ﬁnding means that training a language

model only on an English corpus would be suﬃcient to

cover the majority of the Tor HS.

4. Methodology

The ranking procedure starts by constructing a Suspicious Activities Graph (SAG) that holds the HS and their interconnections. Then, we apply ToRank to rank the onion domains and to detect the influential ones.

4.1. Onion domains representation

Due to our collaboration with INCIBE and the interest of Spanish LEAs in certain activities carried out in Tor HS, i.e. the suspicious ones, we focus our effort on ranking only the 13 classes of DUTA-10K labeled as suspicious in Table 1.

12https://en.wikipedia.org/wiki/Freedom_Hosting

Figure 6: Illustration of the replication of Tor HS, with panels (a) Suspicious Activities and (b) Normal Activities. The percentages represent the proportion of unique domains in the corresponding category. The majority of the suspicious services (upper chart) tend to have duplicated copies of their HS, while the majority of the normal ones (bottom chart) have unique content.

Figure 7: The languages used in onion domains of DUTA-10K.

4.1.1. Hyperlinks extraction

For each onion domain in DUTA-10K, we extracted only the incoming and outgoing HTTP and HTTPS hyperlinks. Next, we removed the ones pointing to the Surface Web, in order to focus our analysis only on the onion domains. We also excluded duplicated links, i.e. those with the same source and destination, to avoid having a multi-graph. We found that some of the suspicious activity domains were referenced by, or were pointing to, web pages within Tor that either were not related to the designated suspicious categories or did not exist in DUTA-10K. In both cases, we added those nodes to the graph: when a sample exists in DUTA-10K, we labeled it based on the DUTA-10K classification; otherwise, we assigned a different label called “new node”. In the end, each domain ended up with two lists of hyperlinks: one with the addresses that were referencing it and another with the domains inside Tor that this domain was pointing to.
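As an illustration of the filtering steps above, the following minimal sketch keeps only HTTP/HTTPS links to onion hosts and drops duplicates. The pattern `ONION_LINK` and the function `onion_edges` are hypothetical names; the sketch assumes 16-character onion hosts, as in the addresses listed in this paper, and it also skips self-links, which is an assumption the text does not state.

```python
import re

# Illustrative pattern: optional subdomains, then a 16-character base32 onion host.
ONION_LINK = re.compile(
    r"https?://(?:[\w-]+\.)*([a-z2-7]{16}\.onion)", re.IGNORECASE)


def onion_edges(source, html):
    """Return the set of (source, target) pairs for the onion links in html."""
    edges = set()  # a set drops duplicated links with the same source and target
    for match in ONION_LINK.finditer(html):
        target = match.group(1).lower()
        if target != source:  # assumption: links to the page's own host are ignored
            edges.add((source, target))
    return edges


# Toy page (made-up addresses) with one Surface Web link and one duplicated target.
html = ('<a href="http://abcdefghijklmnop.onion/x">a</a>'
        '<a href="https://example.com">b</a>'
        '<a href="http://abcdefghijklmnop.onion/y">c</a>'
        '<a href="https://qrstuvwxyz234567.onion">d</a>')
edges = onion_edges("srcsrcsrcsrcsrc2.onion", html)
```

The incoming list of a domain can then be obtained by collecting every pair whose target is that domain.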

4.1.2. Interesting activity graph implementation

Once the hyperlinks for the suspicious activities were extracted, we modeled the Tor network as a directed graph and denoted it as “Suspicious Activity Graph” (SAG), as shown in Fig. 8. The model $SAG = (N, E)$ is composed of a set of nodes denoted by $N$, which are the HS, and a set of edges $E$, which correspond to their hyperlinks. A new edge is created from an onion domain $A$ to an onion domain $B$ either if $A$ has referenced $B$ or if $B$ has been referenced by $A$ at least once.
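A directed graph of this kind can be held in two adjacency maps. The sketch below uses only the Python standard library rather than the NetworkX library used in the experiments; `build_sag` and its return layout are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def build_sag(edges, labels):
    """Build a directed graph from (source, target) pairs.

    Returns the successor sets, the predecessor sets and a label per node.
    Nodes that are missing from `labels` receive the "new node" label.
    """
    succ, pred = defaultdict(set), defaultdict(set)
    for a, b in edges:
        succ[a].add(b)  # A references B
        pred[b].add(a)  # B is referenced by A
    nodes = set(succ) | set(pred)
    node_labels = {n: labels.get(n, "new node") for n in nodes}
    return dict(succ), dict(pred), node_labels


# Toy input: the duplicated pair collapses into a single edge.
succ, pred, node_labels = build_sag(
    [("a", "b"), ("a", "b"), ("c", "a")],
    {"a": "Drugs", "b": "Hacking"},
)
```

Storing neighbors in sets keeps at most one edge per ordered pair, which matches the decision to avoid a multi-graph.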

Figure 8: A snapshot of the DUTA-10K SAG_All, where the dots correspond to HS. Each color represents an activity in DUTA-10K. The gray lines reflect the hyperlinks between the domains. Due to the very high connection density in the center of the graph, it appears as a gray spot.

4.2. ToRank algorithm

The objective of the proposed algorithm, ToRank, is to identify the most influential node in a graph by measuring the number of nodes to which traffic can be propagated or from which it can be received. The algorithm consists of two phases: a weights initialization and a weights update. The algorithm starts by assigning an initial weight to each node. Given an SAG with $N$ nodes, the initial weight of the node $n \in N$ is computed with equation (1).

$$W_n = D_n^{i} + D_n^{o} \qquad (1)$$

where $W_n$ is the initial weight of node $n$, and $D_n^{i}$ and $D_n^{o}$ correspond to the in-degree and the out-degree of the node $n$, respectively.

Next, we accumulate the weight of $n$’s followers, which are the nodes that point to it, and the weight of its followings, the nodes that are referenced by $n$. Finally, the weights are calculated again for each node using (2). A rank value, $TR_n$, is assigned according to the weight of the node, such that a higher weight corresponds to a higher rank.

$$TR_n = W_n \log(1 + \alpha W_{fr} + \beta W_{fw}) \qquad (2)$$

where $W_{fr}$ and $W_{fw}$ are the accumulated weights of the followers and the followings of the node $n$, respectively. The parameters $\alpha$ and $\beta$ control the contribution of the follower and the following nodes to the weight of $n$. More specifically, the ToRank formula considers the accumulated weight of the follower and the following nodes. As in the PageRank formula, the neighboring nodes do not have an equal impact; however, both are still valuable factors to identify the influence of the node $n$. The role of $\alpha$ and $\beta$ is to capture this difference in the weight of the following and the follower nodes. When both are set to zero, $TR_n$ would be exactly the degree centrality, which is not a recommended approach (see the description of the degree centrality in Section 4.4). In contrast, when both are equal to one, the formula will not reflect the intended purpose of giving a higher importance to the following nodes. During the labeling process, we observed the existence of some nodes that function like directories or wiki pages: they point to hundreds or even thousands of domains in the network but are referenced by zero or very few domains. The removal of such nodes would fragment the graph strongly; unfortunately, their detection is worthless, as they are not practicing any suspicious action. Hence, in the Tor network, ToRank recommends assigning a high value to $\alpha$ to increase the weight of the follower nodes but a low value to $\beta$ to decrease the impact of the following nodes.

ToRank is intended to detect the most important nodes in the Tor network, but no ground truth is available for this. The straightforward approach is to use the degree centrality, but its main limitation is that it does not consider the graph structure. We use the degree centrality to initialize the weights of the nodes in a graph. Then, to overcome the shortcoming of the degree centrality, we propose the second formula of ToRank to incorporate the neighbors' degree as well. ToRank applies the logarithm function to the neighbors' weight so that they do not overshadow the weight of the studied node. Therefore, in the ToRank algorithm, the rank value does not depend exclusively on the weight of the studied node, because it also considers the weights of its neighbors. Consequently, the weights of those neighbors depend on their neighbors, in cascade. The weight $W_n$ of the node $n$ is calculated according to the weight of its neighbors. The logarithm function is used to counter the skewness towards the nodes that have a very high degree, and the first term of the expression, the number one, is added to avoid an undefined value when the rest of the logarithm argument is zero.
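Equations (1) and (2) can be sketched in a few lines. The code below is a minimal reading of the two formulas and not the authors' implementation: it assumes a single pass over the nodes, the natural logarithm, and the values of α and β reported in Section 5.1 as defaults; the function name `torank` is hypothetical.

```python
import math


def torank(edges, alpha=0.9, beta=0.2):
    """Single-pass ToRank scores for a directed graph of (source, target) pairs."""
    followers, followings = {}, {}
    for a, b in set(edges):  # duplicated links are ignored (no multi-graph)
        followings.setdefault(a, set()).add(b)
        followers.setdefault(b, set()).add(a)
    nodes = set(followers) | set(followings)
    # Equation (1): initial weight = in-degree + out-degree.
    w = {n: len(followers.get(n, ())) + len(followings.get(n, ())) for n in nodes}
    scores = {}
    for n in nodes:
        w_fr = sum(w[m] for m in followers.get(n, ()))   # nodes that point to n
        w_fw = sum(w[m] for m in followings.get(n, ()))  # nodes referenced by n
        # Equation (2); the base of the logarithm is an assumption.
        scores[n] = w[n] * math.log(1 + alpha * w_fr + beta * w_fw)
    return scores


# Toy graph: "d" behaves like a directory, "m" like a heavily referenced market.
scores = torank([("d", "a"), ("d", "b"), ("d", "c"),
                 ("a", "m"), ("b", "m"), ("c", "m")])
```

In this toy graph the directory-like node and the referenced node have the same degree, yet the referenced node obtains the higher score because α outweighs β.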

4.3. Benchmark link analysis algorithms

4.3.1. PageRank

PageRank was developed by Page et al. (1999) and is considered an enhanced version of the in-degree centrality. It assumes that an influential node is likely to receive more links from other influential nodes. The influence of a given node $i$ is calculated iteratively using (3), and the iteration stops when it converges or reaches a maximum number of iterations.

$$PR(i) = (1 - d) + d \sum_{j \in B(i)} \frac{PR(j)}{N_j} \qquad (3)$$

where $d$ is the damping factor, which indicates the probability that a random surfer continues navigating the graph nodes, $i$ and $j$ are nodes of a directed graph $G$, $B(i)$ is the set of nodes that point to $i$, and $PR(i)$ and $PR(j)$ are the rank scores of the nodes $i$ and $j$, respectively. $N_j$ indicates the number of outgoing links of the node $j$. A high rank value reflects a higher influence of a node over the other nodes.
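The iteration in (3) can be sketched as below. This follows the equation literally, so it is not the NetworkX routine used in the experiments; the default `d=0.85` is a commonly used value chosen here as an assumption, and the function name is illustrative.

```python
def pagerank(edges, d=0.85, max_iter=100, tol=1e-8):
    """Iterate Equation (3) until the scores stabilize or max_iter is reached."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    out_degree = {n: 0 for n in nodes}  # N_j
    incoming = {n: [] for n in nodes}   # B(i): the nodes that point to i
    for j, i in edges:
        out_degree[j] += 1
        incoming[i].append(j)
    pr = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new = {i: (1 - d) + d * sum(pr[j] / out_degree[j] for j in incoming[i])
               for i in nodes}
        converged = max(abs(new[n] - pr[n]) for n in nodes) < tol
        pr = new
        if converged:
            break
    return pr


# Toy graph: "c" receives two links, "b" receives none.
pr = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
```

A node without incoming links keeps the base score 1 − d, which is the smallest value the equation can produce.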

4.3.2. HITS

The Hyperlink-Induced Topic Search (HITS) algorithm (Kleinberg, 1999) measures the importance of the nodes recursively by assigning two mutual scores to each node: a hub score $hub_i$, which grows higher if a node references many nodes with high authority scores, and an authority score $auth_i$, which increases if a node is referenced by many nodes with high hub scores. To this end, the recursive behavior is defined as follows: good hubs are those nodes that reference many good authorities, and good authorities are those referenced by many good hubs. Each node $i$ in a graph $G$ holds two non-negative scores, an authority score $auth_i$ and a hub score $hub_i$, which are initialized with arbitrary nonzero values. Next, the scores are updated iteratively until convergence, i.e. until a stationary solution is reached. Equation (4) shows the update of the hub scores and the update of the authority scores, which capture the intuitive notions behind the HITS algorithm.

$$auth_i^{(k)} = \sum_{j \to i \in E} hub_j^{(k-1)}, \qquad hub_i^{(k)} = \sum_{i \to j \in E} auth_j^{(k)} \qquad (4)$$

where $j$ is a node in $G$ and $i \to j$ indicates a hyperlink from the node $i$ to the node $j$ in the graph edge set $E$. $k$ is an iteration index that starts from 1 and increases to $\infty$; in practice, the loop is broken when there is no significant change between consecutive iterates or when a maximum number of iterations is reached.
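The mutual update of hub and authority scores can be sketched as follows. The L2 normalization after each round is an assumption added to keep the scores bounded, since the update rule as written does not include it; the function name `hits` is illustrative.

```python
import math


def hits(edges, max_iter=100, tol=1e-8):
    """Return (hub, auth) score dictionaries for (source, target) pairs."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new_auth = {n: 0.0 for n in nodes}
        for j, i in edges:  # j -> i: i gains authority from the hub score of j
            new_auth[i] += hub[j]
        new_hub = {n: 0.0 for n in nodes}
        for i, j in edges:  # i -> j: i gains hub score from the authority of j
            new_hub[i] += new_auth[j]
        # Assumption: rescale both score vectors to unit L2 norm.
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
        change = max(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
                     for n in nodes)
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return hub, auth


# Toy graph: "h" links to both targets, "g" links only to "x".
hub, auth = hits([("h", "x"), ("h", "y"), ("g", "x")])
```

Because the algorithm yields two scores per node, it produces two separate rankings, one for hubs and one for authorities.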

4.3.3. Katz

Introduced by Katz (1953), it is used to measure the centrality of a node by assigning a score that depends on the first-degree neighbors and the nodes connected with them. In mathematical form, the rank is calculated according to Equation (5).

$$k_i = \alpha \sum_{j} A_{ij} k_j + \beta \qquad (5)$$

where $k_i$ and $k_j$ are the Katz centrality values for the nodes $i$ and $j$ in a given graph $G$, and $A_{ij}$ is the adjacency matrix of $G$, which captures the connectivity of the nodes. $\beta$ corresponds to the initial centrality and $\alpha$ corresponds to the attenuation factor.
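Equation (5) can be evaluated by fixed-point iteration, as in the sketch below. It assumes that $A_{ij} = 1$ when node $j$ links to node $i$, so a node inherits centrality from the nodes that reference it; the default parameter values and the function name `katz` are illustrative only. The iteration converges only when α is small enough relative to the largest eigenvalue of the adjacency matrix.

```python
def katz(edges, alpha=0.1, beta=1.0, max_iter=1000, tol=1e-8):
    """Fixed-point iteration of Equation (5) over (source, target) pairs."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    k = {n: 0.0 for n in nodes}
    for _ in range(max_iter):
        new = {n: beta for n in nodes}
        for j, i in edges:  # assumption: edge j -> i means A_ij = 1
            new[i] += alpha * k[j]
        change = max(abs(new[n] - k[n]) for n in nodes)
        k = new
        if change < tol:
            break
    return k


# Toy chain: "a" and "b" point to "c", which points to "d".
k = katz([("a", "c"), ("b", "c"), ("c", "d")])
```

Here "d" scores above the baseline β even though it has a single incoming link, because that link comes from a node that is itself well referenced.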

4.4. Graph robustness metrics

Fronzetti Colladon and Gloor (2018) carried out a comprehensive study regarding graph robustness and stability metrics. Below, we briefly describe a few of them that we used to evaluate the benchmark algorithms.

4.4.1. Degree centrality

Based on the number of connected nodes that are direct neighbors of a given node, the degree centrality measure uses three different values in directed graphs: in-degree, out-degree, and degree. The in-degree counts only the number of incoming links to a given node, whereas the out-degree counts the number of outgoing ones (Seidman, 1983). The degree of a node is calculated as the sum of the in-degree and the out-degree values. However, this approach ignores the global structure of the graph and focuses only on the direct neighbors of the targeted node (Wei et al., 2018). To overcome this limitation, Srinivas and Velusamy (2015) proposed an enhanced version of the degree centrality that incorporates the clustering coefficient.

4.4.2. Graph density

It is defined as the number of existing edges over the number of possible ones. Hence, the more connected the graph is, the higher the density, and vice versa. When the graph is fully connected, the density is equal to 1, and it is equal to 0 when the graph has no edges. The density of a directed graph $G$ is computed as shown in (6), where $E$ is the number of edges and $N$ is the number of nodes in the directed graph $G$.

$$D = \frac{E}{N(N-1)} \qquad (6)$$
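As a small worked example of (6), the helper below computes the density from the node and edge counts. The function name is illustrative, and returning 0 for graphs with fewer than two nodes is a convention assumed here.

```python
def density(num_nodes, num_edges):
    """Density of a directed graph without self-loops, Equation (6)."""
    if num_nodes < 2:
        return 0.0  # assumed convention: no possible edges, so density 0
    return num_edges / (num_nodes * (num_nodes - 1))


# Three nodes with all six possible directed edges form a fully connected graph.
full = density(3, 6)
# A graph with 2908 nodes and 14511 edges is very sparse (density of about 0.0017).
sparse = density(2908, 14511)
```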

4.4.3. Average shortest path (ASP)

It refers to the average length of the shortest paths between all possible pairs of network nodes (Mao and Zhang, 2017).

4.4.4. Diameter (Dim)

It is defined as the longest path among the shortest paths between any two nodes in a given graph (Ye et al., 2010). The removal of central nodes that occupy a core location in the graph would lengthen the shortest paths and, consequently, increase the diameter of the graph.
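Both the ASP and the diameter can be obtained from breadth-first searches, as sketched below. One assumption is made loudly here: pairs of nodes with no connecting path are simply skipped, because the text does not state how unreachable pairs were treated. The function names are illustrative.

```python
from collections import deque


def shortest_path_lengths(succ, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in succ.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def asp_and_diameter(succ):
    """Average and maximum shortest-path length over reachable ordered pairs."""
    nodes = set(succ) | {v for targets in succ.values() for v in targets}
    lengths = [d for s in nodes
               for t, d in shortest_path_lengths(succ, s).items() if t != s]
    if not lengths:
        return 0.0, 0
    return sum(lengths) / len(lengths), max(lengths)


# Directed 3-cycle: every node reaches the other two in 1 and 2 hops.
asp, dim = asp_and_diameter({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Each search costs time proportional to the graph size, so the full computation grows with the number of nodes times the number of edges.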

4.4.5. Clustering coeﬃcient (CC)

It is based on the number of triangles in a graph: it is calculated by dividing the number of closed triplets of nodes by the total number of connected triplets in the network (Watts and Strogatz, 1998).

4.4.6. Giant component (GC)

It refers to the largest fraction of nodes that are connected, i.e. there exists a path between each pair of nodes in that component. An attack on the graph robustness can be measured by the reduction in the giant component mass (Holme et al., 2002).
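The reduction of the giant component after removing nodes can be sketched as follows. The sketch treats edges as undirected when searching for components, i.e. it measures the largest weakly connected component; this is an assumption, as the text does not specify weak or strong connectivity. The function name is illustrative.

```python
def giant_component_size(edges, removed=()):
    """Size of the largest weakly connected component after removing nodes.

    Nodes left without any edge are not counted, so the result is 0 when
    no edge survives the removal.
    """
    removed = set(removed)
    neighbours = {}
    for a, b in edges:
        if a in removed or b in removed:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)  # direction is ignored
    seen, best = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in neighbours[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


# Toy graph: "h" holds a star together; "x" and "y" form a separate pair.
toy_edges = [("a", "h"), ("b", "h"), ("h", "c"), ("x", "y")]
before = giant_component_size(toy_edges)
after = giant_component_size(toy_edges, removed={"h"})
```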

5. Experimental results

5.1. Experimental setting

The experiments were conducted on a PC with an Intel(R) Core(TM) i7 processor and 32 GB of RAM under Windows 10. The domain addresses were extracted from DUTA-10K using the Regular Expression library¹³. The SAG was constructed using the NetworkX library¹⁴ with Python 3. For the graph visualization, we used the vis.js library¹⁵. Concerning the ranking algorithms, we compared ToRank with the link-based ranking algorithms presented in Section 4.3, namely PageRank, HITS and Katz. We tuned all the methods by evaluating a range of values for each parameter, as shown in Table 2, and we selected the ones that achieved the highest performance in our experiments. ToRank has two configurable parameters, $\alpha$ and $\beta$, that were set empirically (we refer the reader to Section 4.2 for a more in-depth explanation of these parameters). After evaluating several configurations for the $\alpha$ and $\beta$ values, we found that setting them to 0.9 and 0.2, respectively, achieves the best result.

Table 2: Evaluated values for the parameters of the ranking algorithms. Bolded numbers correspond to the selected configuration that achieved the lowest area under the GDC curve.

| Algorithm Name | Parameter | Evaluated values |
|---|---|---|
| PageRank | alpha | 0.5, 0.70, 0.75, 0.80, 0.85, 0.90 |
| PageRank | max iter | 10, 100, 1000, 10000 |
| HITS | max iter | 10, 100, 1000, 10000 |
| Katz | alpha | 0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9 |
| Katz | beta | 0.1, 0.3, 0.5, 0.7, 0.9, 1.0 |
| Katz | max iter | 10, 100, 1000, 10000 |
| ToRank | alpha | 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0 |
| ToRank | beta | 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0 |

5.2. Evaluation measure

Consistent with previous research (Booker, 2012; Fronzetti Colladon and Gloor, 2018; Fronzetti Colladon and Vagaggini, 2017), we employed several standard metrics to judge the structure of the studied graph (see their explanation in Section 4.4). More specifically, concerning the density criterion, the evaluation procedure starts by peeling away the top-ranked nodes one by one iteratively, and at every cycle, the graph density is evaluated. The iteration stops when the graph is completely disconnected and the density is zero. We consider that the ranking algorithm that achieves the lowest area under the Graph Density Curve (GDC) is the one that better measures the influence of a domain inside the Tor network. The GDC is used as a proxy to evaluate the graph robustness (Wang et al., 2014): the iterative removal of the top-ranked nodes with their edges results in a reduction in the graph density. Therefore, if the nodes are correctly ranked, removing the top-ranked nodes leads to a high reduction in the density, because the removal of such a node should cause a harmful fragmentation of the graph structure. However, if a removed node is not influential, its removal will not break the graph, and hence it should not be ranked at the top of the list.

13 https://pypi.python.org/pypi/regex

14 https://networkx.github.io/

15 http://visjs.org/
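The density-based evaluation can be sketched as follows. The scaling of the horizontal axis behind the reported GDC values is not stated, so this sketch assumes unit spacing between consecutive removals and a trapezoidal area; `density_curve` and `area_under_curve` are hypothetical names.

```python
def density_curve(edges, ranking):
    """Density after removing the ranked nodes one by one (index 0: full graph)."""
    edges = set(edges)
    nodes = {n for edge in edges for n in edge}
    curve = []
    for node in [None] + list(ranking):
        if node is not None:
            nodes.discard(node)
            edges = {(a, b) for a, b in edges if a != node and b != node}
        n = len(nodes)
        curve.append(len(edges) / (n * (n - 1)) if n > 1 else 0.0)
        if not edges:  # the graph is completely disconnected
            break
    return curve


def area_under_curve(curve):
    """Trapezoidal area, assuming unit spacing between removals."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


# Toy star graph: "h" points to three leaves.
star = [("h", "a"), ("h", "b"), ("h", "c")]
good = density_curve(star, ["h", "a", "b", "c"])  # hub removed first
bad = density_curve(star, ["a", "b", "c", "h"])   # hub removed last
```

In the second ranking the density rises while leaves are removed, because nodes disappear faster than edges; this enlarges the area and penalizes rankings that place weakly connected nodes at the top.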

Besides looking at the graph density, we consider the reduction in the size of the giant component (GC) and the decrease in the clustering coefficient (CC) as effective indicators of the produced disruption. However, removing the nodes one by one is an expensive process due to the time needed to calculate the GC and the CC of the graph at every iteration. Instead, we analyze only the removal of the top 1st, 5th and 10th percentiles of the ranked nodes, and hence the GC and CC metrics are evaluated three times only. A higher decrease in the giant component size, the graph density, and the clustering coefficient reflects more disruption to the graph robustness (Chang, 2017; Iyer et al., 2013) and, consequently, a better ranking. Similarly, the diameter and the average shortest path measures can be used to test the robustness at multiple levels of top-ranked nodes removal¹⁶ (Cohen and Havlin, 2010). An increase in the graph diameter and in the average shortest path indicates a better ranking; this increase is due to removing the top-ranked nodes from the graph. Hence, the higher the ASP and the Dim and the lower the CC and the GC are, the better the ranking algorithm.

5.3. Analysis of the suspicious activities in Tor

We created two types of SAG: first, an SAG for all the suspicious activities, and second, an SAG for every single suspicious activity in DUTA-10K. We denoted them by SAG_All and SAG_X, respectively, where X corresponds to the activity name; for example, SAG_Drugs refers to the drugs HS. Table 3 shows the specifications of the SAG_All and SAG_X graphs.

16We could not manage a one-by-one node removal due to metrics

complexity, hence we did it over the top-1, 5, and 10 percentile of the

ranked nodes only

Table 3: Details of the created Suspicious Activities Graphs. The #Activity domains column refers to the number of nodes related to the studied activity; the SAG #nodes and the SAG #edges columns represent the number of nodes and the number of edges in the corresponding SAG.

| Activity name | #Activity domains | SAG #nodes | SAG #edges | Average degree |
|---|---|---|---|---|
| All | 2013 | 2908 | 14,511 | 4.99 |
| C. Credit Cards | 399 | 583 | 2622 | 4.49 |
| Forum: Suspicious | 63 | 436 | 1527 | 3.50 |
| Violence | 95 | 240 | 795 | 3.31 |
| C. Money | 83 | 202 | 796 | 3.94 |
| C. Personal Identification | 72 | 180 | 763 | 4.23 |
| Marketplace: Suspicious | 127 | 389 | 1670 | 4.29 |
| Drugs | 465 | 743 | 4130 | 5.55 |
| Hacking | 205 | 402 | 1381 | 3.43 |
| CryptoLocker | 185 | 198 | 611 | 3.08 |
| Services: Suspicious | 20 | 46 | 76 | 1.65 |
| Pornography | 253 | 686 | 2765 | 4.03 |
| Fraud | 43 | 145 | 386 | 2.66 |
| Human-Trafficking | 3 | 15 | 16 | 1.06 |

Fig. 9 shows five different Graph Density Curves (GDC) of SAG_All for the four ranking algorithms¹⁷. Following our previous reasoning, ToRank outperforms the other methods because it achieves the lowest area under the GDC, with a value of 1.31, while also presenting a very gentle and homogeneous decrease in the density curve. In contrast, PageRank suffers from a sudden increase in its curve followed by a sharp decrease, which yields the highest GDC of 2.07 (Table 4); this phenomenon is discussed in Section 6. In Fig. 9 it can be observed that the density reaches zero while the domain count is still 13. Those nodes are normal HS that were referenced by suspicious onion domains, such as wiki pages or Tor directory pages.

Figure 9: Density analysis for the SAG_All of DUTA-10K. ToRank achieves the lowest GDC.

17 The HITS algorithm produces two curves, one for the authorities and another for the hubs.

Figure 10: Comparing the GDC value for four ranking algorithms with respect to the suspicious activities of DUTA-10K. The vertical axis represents the GDC value for the corresponding ranking algorithm, whereas the horizontal axis shows the 12 individual activity graphs SAG_X, plus the graph of all the interesting activities, SAG_All.

Table 4: A GDC comparison for the four ranking algorithms over the SAGs. The bolded numbers correspond to the lowest GDC value.

| Activity name | PageRank | HITS_Auth | HITS_Hub | Katz | ToRank |
|---|---|---|---|---|---|
| All | 2.07 | 1.63 | 1.96 | 1.43 | 1.31 |
| C. Credit Cards | 2.00 | 1.65 | 2.13 | 1.51 | 1.48 |
| C. Money | 0.64 | 0.56 | 0.72 | 0.52 | 0.52 |
| C. Personal Identification | 0.73 | 0.63 | 0.95 | 0.60 | 0.60 |
| CryptoLocker | 2.41 | 2.53 | 3.48 | 2.32 | 2.29 |
| Drugs | 2.24 | 1.74 | 2.19 | 1.58 | 1.49 |
| Forum: Suspicious | 0.48 | 0.36 | 0.35 | 0.35 | 0.26 |
| Fraud | 0.68 | 0.62 | 0.57 | 0.54 | 0.42 |
| Hacking | 0.90 | 0.72 | 0.80 | 0.68 | 0.63 |
| Marketplace: Suspicious | 1.25 | 0.94 | 1.10 | 0.87 | 0.80 |
| Pornography | 0.76 | 0.52 | 0.67 | 0.50 | 0.42 |
| Services: Suspicious | 0.44 | 0.43 | 0.52 | 0.34 | 0.33 |
| Violence | 0.69 | 0.60 | 0.66 | 0.52 | 0.51 |

Fig. 10 shows a comparison of the GDC of the four ranking algorithms with respect to SAG_All and to 12 single-activity graphs only. We could not evaluate the Human-Trafficking activity, as it contains only three onion domains and two of them have identical content, which yielded only two unique domains without any hyperlink between them. This figure shows that ToRank outperforms the other ranking algorithms because it has the lowest GDC area. However, despite the superiority of ToRank over Katz, the latter approaches ToRank in all of the suspicious categories, and in two of them, the counterfeit of personal identification and the counterfeit of money, both have the same GDC value. This is because both ToRank and Katz use the in-degree, whereas the advantage of ToRank comes from its additional use of the out-degree of the domains and its weighting.

In addition to the density analysis, Table 5 explores four graph structure metrics: it compares the full network structure, before applying the top-ranked node removal, with three levels of node reduction (the top 1, 5, and 10 percentiles of the full network). From the table, we can see that ToRank achieved the sharpest reduction in the GC together with a high decrease in the CC. Also, with ToRank, the ASP increased to 8.9, which reflects a higher disruption to the network structure by removing the core nodes first. This observation is also reflected in the graph diameter, which increased from 9 to 22 after removing the top-10% of the ranked nodes. Those observations indicate that ToRank outperforms the other ranking algorithms.

In Table 6, we explore the top-10 onion domains nominated by the ToRank algorithm as influential HS within SAG_All.

5.4. Analysis of the Normal Activities in Tor

We carried out the same analysis for the normal activities of DUTA-10K (Fig. 11). We created a Normal Activities Graph (NAG). Then, we ranked the nodes according to the four ranking algorithms and evaluated the performance using the GDC measure. The NAG has 22965 nodes, of which 5016 are from the normal activities; the remaining ones are HS that are connected to them. It has 85699 edges, with an average node degree of 3.73.

Fig. 11 shows that ToRank has the lowest area under the GDC, with a value of 0.02, whereas HITS_Hub, Katz, HITS_Auth, and PageRank achieved 0.03, 0.17, 0.22 and 0.26, respectively. In this case, the extraordinarily good performance of ToRank and HITS_Hub is explained by their ability to detect first the onion domains that function as directories, which have a high number of connections with other HS in the Tor network. Given their impact on the GDC, we conclude that the onion directories are one of the main ways to redirect Tor users to the activity domains.

Table 5 shows that ToRank outperformed the other

algorithms even for the normal activities graph. This

superiority is observed in the reduction of the CC and

Table 5: Impact of top-ranked nodes removal on four graph metrics with respect to three graph datasets: suspicious activities, normal activities, and the 9/11 Hijackers Network. The metrics are the Clustering Coefficient (CC), the Giant Component (GC), the Average Shortest Path (ASP) and the Diameter (Dim); FN refers to the full network before nodes removal and is shared by all the algorithms. The bolded values refer to the best performance, where the objective is to decrease the CC and the GC and to increase the ASP and the Dim.

| Metric | Algorithm | Suspicious FN | Suspicious Top-1% | Suspicious Top-5% | Suspicious Top-10% | Normal FN | Normal Top-1% | Normal Top-5% | Normal Top-10% | 9/11 FN | 9/11 Top-1% | 9/11 Top-5% | 9/11 Top-10% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CC | ToRank | 0.056 | 0.042 | 0.029 | 0.023 | 0.3577 | 0.009 | 0.004 | 0.003 | 0.476 | 0.465 | 0.334 | 0.303 |
| CC | PR | | 0.048 | 0.041 | 0.042 | | 0.299 | 0.285 | 0.267 | | 0.452 | 0.334 | 0.342 |
| CC | HITS (Hub) | | 0.055 | 0.043 | 0.023 | | 0.013 | 0.004 | 0.004 | | 0.471 | 0.458 | 0.449 |
| CC | HITS (Auth) | | 0.055 | 0.051 | 0.039 | | 0.357 | 0.341 | 0.252 | | 0.465 | 0.446 | 0.441 |
| CC | Katz | | 0.051 | 0.034 | 0.027 | | 0.355 | 0.370 | 0.391 | | 0.465 | 0.446 | 0.339 |
| GC | ToRank | 2748 | 1539 | 998 | 406 | 22572 | 1156 | 42 | 32 | | 59 | 55 | 34 |
| GC | PR | | 2691 | 2498 | 2351 | | 15690 | 14760 | 13620 | | 59 | 55 | 28 |
| GC | HITS (Hub) | | 1880 | 1048 | 452 | | 2964 | 619 | 129 | | 59 | 57 | 54 |
| GC | HITS (Auth) | | 2705 | 2484 | 2272 | | 22281 | 21316 | 14118 | | 59 | 31 | 26 |
| GC | Katz | | 2686 | 2399 | 2166 | | 21249 | 20481 | 19086 | | 59 | 31 | 26 |
| ASP | ToRank | 3.151 | 5.234 | 7.684 | 8.902 | 2.690 | 7.340 | 6.066 | 5.371 | 3.606 | 4.291 | 4.698 | 3.695 |
| ASP | PR | | 3.174 | 3.193 | 3.255 | | 2.665 | 2.698 | 2.725 | | 3.653 | 4.698 | 3.183 |
| ASP | HITS (Hub) | | 4.658 | 5.956 | 6.441 | | 7.014 | 1.997 | 1.985 | | 3.635 | 3.677 | 4.253 |
| ASP | HITS (Auth) | | 3.169 | 3.225 | 3.281 | | 2.755 | 2.800 | 2.934 | | 4.291 | 3.179 | 3.129 |
| ASP | Katz | | 3.180 | 3.229 | 3.281 | | 2.593 | 2.550 | 2.399 | | 4.291 | 3.179 | 3.129 |
| Dim | ToRank | | 15 | 19 | 22 | | 18 | 14 | 12 | | 9 | 11 | 9 |
| Dim | PR | | 9 | 9 | 9 | | 8 | 8 | 8 | | 7 | 11 | 7 |
| Dim | HITS (Hub) | | 12 | 17 | 18 | | 20 | 2 | 2 | | 7 | 7 | 9 |
| Dim | HITS (Auth) | | 9 | 10 | 10 | | 8 | 8 | 8 | | 9 | 7 | 7 |
| Dim | Katz | | 10 | 10 | 10 | | 8 | 7 | 6 | | 9 | 7 | 7 |

Table 6: The top-10 HS ranked using ToRank algorithm

Rank HS Address Activity Category HS Title Short Description

1matangareonmy6bg.onion Drugs - Online market for suspicious drugs

2y3nau3mnibjbpmh4.onion Pornography Tor Links 2.0 A Tor directory for pornography

content

3hansamkt2rr6nfg3.onion Marketplace suspicious HANSA Market A famous marketplace for suspicious

product

4vfvfq64rtrefmdtd.onion Drugs - Russian market for suspicious drugs

5silkkitiehdg5mug.onion Drugs Silkkitie Valhalla Market (known by its

Finnish name, Silkkitie)

6shops3jckh3dexzy.onion Drugs - Online market for suspicious drugs

7abbujjh5vqtq77wg.onion C. personal identiﬁcation Onion Identity Services HS for producing fake passports

8gxmrzk2s56oxzb3e.onion Pornography German Pi-X-Board Multi-languages forum for child-

pornography

9boysopidonajtogl.onion Pornography Central Park Guides and links to porno websites

10 newpdsuslmzqazvr.onion Drugs Peoples Drug Store Online market for suspicious drugs

Figure 11: Graph density analysis for the N AG.

the GC as well as an increase in the ASP. However, in this case, PageRank achieved a better diameter after removing only the top-1% of the nodes.

5.5. Analysis of the 9/11 Hijackers Network

With the intention of testing our proposal in depth and evaluating its generalization capability, we looked for other similar datasets, or at least other problems where the main purpose was to rank collected data. We found a similar problem in the analysis of the 9/11 Hijackers Network. This is a famous dataset containing information about the terrorists involved in the 9/11 attacks on the World Trade Center in 2001 (Krebs, 2002). The goal here is to detect the most influential people who contributed to that attack, where the most influential node, the key player, is the one whose removal would lead to an extreme break in the connectivity between the other network members. Therefore, this objective is similar to ours, because we try to detect which node is the most influential in the Tor network at each step. To apply the link-based ranking algorithms, we represented the network as a directed graph whose nodes refer to the hijackers and whose edges describe a relation between any two people; each undirected edge was replaced with two directed ones.

There are plenty of research papers regarding the analysis of the network structure (Husslage et al., 2015; Memon and Larsen, 2006), but fewer works about ranking its members. Therefore, we considered the rank proposed by Choudhary and Singh (2016) as the ground truth to compare with our work. The Kendall rank correlation coefficient (Abdi, 2007), also known as Kendall’s tau coefficient, was used to assess the correlation between two ranked lists. Its value ranges between +1 and −1, such that the closer the value is to +1 or −1, the stronger the relationship, while the closer the value is to 0, the weaker the relationship. Table 7 shows the top-10¹⁸ hijackers nominated by each ranking algorithm and a comparison with the ground truth rank with respect to Kendall’s tau measure.
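For two orderings of the same items, Kendall's tau can be computed by counting concordant and discordant pairs, as in the sketch below. This is the simple tau-a variant without ties; how the coefficient was computed between top-10 lists that contain different names is not stated in the text, so the sketch should not be read as the exact procedure used here.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two orderings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1  # both lists order x and y the same way
        elif sign < 0:
            discordant += 1  # the lists disagree on the order of x and y
    pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / pairs


same = kendall_tau(list("abcd"), list("abcd"))
reverse = kendall_tau(list("abcd"), list("dcba"))
one_swap = kendall_tau(list("abcd"), list("acbd"))
```

Identical orderings give +1, fully reversed orderings give −1, and a single swap of adjacent items in a list of four lowers the value to 2/3.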

Table 7: The top-10 Hijackers in 9/11 dataset

| Ground truth | PageRank | HITS_Auth | HITS_Hub | Katz | ToRank |
|---|---|---|---|---|---|
| M. Atta | M. Al-Shehhi | M. Atta | R. al-Shibh | M. Atta | M. Atta |
| E. Khemail | M. Atta | M. Al-Shehhi | S. Bahaji | M. Al-Shehhi | E. Khemail |
| Z. Moussaoui | E. Khemail | Z. Jarrah | Z. Essabar | Z. Jarrah | M. Al-Shehhi |
| H. Hanjour | Z. Jarrah | R. al-Shibh | A. Budiman | E. Khemail | D. Benghal |
| N. Alhazmi | A. Al-Omari | S. Bahaji | M. Motassadeq | Z. Moussaoui | Z. Moussaoui |
| M. Al-Shehhi | W. Alshehri | Z. Essabar | L. Raissi | D. Benghal | R. al-Shibh |
| S. Suqami | H. Alghamdi | M. Motassadeq | M. Darkazanli | A. Qatada | T. Maaroufi |
| D. Benghal | D. Benghal | Z. Moussaoui | Z. Moussaoui | T. Maaroufi | Z. Jarrah |
| R. al-Shibh | H. Hanjour | A. Qatada | M. al-Hisawi | H. Hanjour | A. Qatada |
| Z. Jarrah | N. Alhazmi | D. Benghal | A. Al-Omari | R. al-Shibh | N. Alhazmi |
| Kendall’s tau | 0.47 | 0.43 | 0.05 | 0.63 | 0.70 |

In the last row, the Kendall’s tau values correspond, from left to right, to PageRank, HITS_Auth, HITS_Hub, Katz and ToRank.

Additionally, we tested the GDC evaluation proce-

dure used in this paper to compare the ranking algo-

rithms (Fig. 12), and we found that, again, ToRank out-

performed the other algorithms with a GDC of 1.13,

while the other ranking algorithms were as follows:

1.48 Katz, 1.70 PageRank, 2.55 HITSAuth, and 2.98

HITSHub.

Table 5 shows that ToRank is slightly better than the other algorithms. However, the results of the algorithms are very close due to the small number of nodes in this graph.

18We selected the top-10 only because the ground truth enumerates

only the top-10 Hijackers.

Figure 12: Graph Density Analysis for the 9/11 Hijackers dataset.

6. Discussion

We designed the proposed algorithm, ToRank, with the intention of detecting the nodes that contribute the most to both propagating and receiving traffic to and from the other nodes. By setting the initial weight of each node to its degree, ToRank establishes an initial measure of the influence of each node. Later on, this starting value is adjusted using the weights of its adjacent nodes. Therefore, the weight of every node depends on the weight of its direct and its indirect neighbors. Moreover, what distinguishes ToRank from the benchmarked algorithms is that it is not iterative like PageRank or HITS, so it does not have a convergence behavior, and that it considers the degree of both the following and the follower nodes.

It was surprising to find that PageRank, probably the most popular ranking algorithm for the surface web, performs poorly for this problem according to the graph density analysis metric. We found that the iterative removal of the top-ranked nodes using PageRank yielded the largest area under the density curve, especially after removing between 10% and 20% of the top-ranked nodes. The reason is that PageRank assigns high ranks to low-connectivity nodes, and during the iterative evaluation those nodes were dropped first, reducing the number of nodes without a significant impact on the graph connectivity. Hence, the number of edges stays high while the number of nodes decreases, which raises the density curve and produces a high GDC value. Additionally, the GDC evaluation procedure measures how efficiently a ranking algorithm breaks down the connectivity of a given graph, whereas PageRank was not designed for that end. Therefore, we conclude that, despite its popularity, PageRank is not a suitable ranking algorithm for this task.
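The effect described above, where dropping low-connectivity nodes first raises the density, can be reproduced on a toy graph. The sketch below is an illustration under stated assumptions, not the paper's exact procedure: density is taken as E / (N (N - 1)) for a directed graph, nodes are removed in fixed-size batches of the ranking, and the area is a trapezoidal sum with unit spacing between points. The function names, the batch size, and the toy graph are all hypothetical, so the resulting areas are not comparable with the GDC values reported in the paper.

```python
def density(nodes, edges):
    """Directed graph density: E / (N * (N - 1)); 0 for fewer than two nodes."""
    n = len(nodes)
    return len(edges) / (n * (n - 1)) if n > 1 else 0.0


def density_curve(nodes, edges, ranking, step_fraction=0.1):
    """Remove top-ranked nodes batch by batch; record the density after each batch."""
    nodes, edges = set(nodes), set(edges)
    batch = max(1, round(step_fraction * len(ranking)))
    curve = [density(nodes, edges)]
    for start in range(0, len(ranking), batch):
        removed = set(ranking[start:start + batch])
        nodes -= removed
        # Drop every edge that touches a removed node.
        edges = {(u, v) for u, v in edges if u not in removed and v not in removed}
        curve.append(density(nodes, edges))
    return curve


def area_under_curve(curve):
    """Trapezoidal area with unit spacing between consecutive points."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))


# Toy graph: node 0 is a hub linked in both directions with leaves 1..4.
toy_nodes = [0, 1, 2, 3, 4]
toy_edges = [(0, i) for i in range(1, 5)] + [(i, 0) for i in range(1, 5)]

hub_first = density_curve(toy_nodes, toy_edges, [0, 1, 2, 3, 4], step_fraction=0.2)
leaf_first = density_curve(toy_nodes, toy_edges, [4, 3, 2, 1, 0], step_fraction=0.2)
hub_area = area_under_curve(hub_first)
leaf_area = area_under_curve(leaf_first)
```

Ranking the hub first collapses the density to zero after the first removal, whereas ranking the leaves first keeps most edges while the node count shrinks, so the curve rises and its area is larger.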

However, notwithstanding the success of the ToRank algorithm in detecting and ranking the influential nodes in Tor, our method cannot assess onion domains that are isolated, with no incoming or outgoing hyperlinks. Fortunately, this issue barely affects our problem because SAG_All contains only 94 onion domains that do not have any hyperlinks to or from HS within Tor, which account for only 3% of the total.

6.1. ToRank Computational Complexity

ToRank has two phases: an initialization phase, which iterates over all the nodes to assign an initial weight to each of them, and a ranking phase, which calculates a rank value for each node using the ToRank equation. Hence, ToRank's complexity is proportional to the number of nodes N, and the time complexity would be O(2N), that is, linear in N. Table 8 compares the processing time of the ranking algorithms on relatively large and small graphs.

Table 8: Processing time, in seconds, of multiple ranking algorithms. The first column gives the graph name, its source where applicable, and its number of nodes (N) and edges (E). Using the NetworkX library, we were not able to rank the relatively big graphs with Katz, so those entries are marked with a (-) sign.

Graph                                                              PageRank  HITS    Katz   ToRank
Google web graph (Leskovec et al., 2009) (N=875713, E=5105039)     284.99    128.95  -      32.42
Notre Dame web graph (Albert et al., 1999) (N=325729, E=1497134)   90.40     74.87   -      11.14
Stanford web graph (Leskovec et al., 2009) (N=281903, E=2312497)   141.52    22.90   -      11.72
Suspicious Activities (N=2908, E=14511)                            1.47      0.42    0.78   0.14
Normal Activities (N=22965, E=85699)                               5.98      36.33   99.85  0.80
9/11 Hijackers Network (N=60, E=194)                               0.03      0.07    0.62   0.05

6.2. ToRank with Big Graphs

Besides the graph structures explored above, we ran the ranking algorithms over three relatively large graphs (see their specifications in Table 8). In particular, we experimented with the Google web graph¹⁹ (Leskovec et al., 2009), the Stanford web graph²⁰ (Leskovec et al., 2009), and the Notre Dame web graph²¹ (Albert et al., 1999). We explored only the reduction in the giant component metric, as we could not compute the other metrics due to their high time complexity using the NetworkX library. Also, we compared the processing time of ToRank with that of the other ranking algorithms on the aforementioned graphs. Table 9 shows that the ToRank and PageRank algorithms compete for the first position. ToRank outperformed PageRank on the Notre Dame web graph but not on the Google web graph. However, considering the processing time shown in Table 8, ToRank is superior to PageRank for large graphs.

¹⁹ https://snap.stanford.edu/data/web-Google.html

²⁰ https://snap.stanford.edu/data/web-Stanford.html

²¹ https://snap.stanford.edu/data/web-NotreDame.html

Table 9: Comparison of the reduction in the Giant Component (GC) of multiple large graphs. The lower the GC, the better the ranking.

Graph structure       Full-network GC  Removal  PR      HITS(Auth)  HITS(Hub)  Katz  ToRank
Google web graph      855802           Top-1%   771062  818528      845134     -     772966
                                       Top-5%   626437  714767      801869     -     665315
                                       Top-10%  502855  603827      742537     -     525710
Notre Dame web graph  325729           Top-1%   254729  315000      322468     -     228957
                                       Top-5%   207088  285246      280401     -     125635
                                       Top-10%  180889  225480      157022     -     50600
Stanford web graph    255265           Top-1%   204408  232634      252396     -     200723
                                       Top-5%   127013  221680      241120     -     129735
                                       Top-10%  77765   183794      223617     -     104960
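The giant component reduction reported in Table 9 can be sketched as follows: remove the top-ranked fraction of nodes and measure the size of the largest connected component that remains. This is a minimal illustration rather than the authors' implementation. It assumes the giant component is the largest weakly connected component (edge direction ignored), which this section does not state explicitly, and the function names, the rounding of the removed fraction, and the toy graph are hypothetical.

```python
from collections import defaultdict


def giant_component_size(nodes, edges):
    """Size of the largest weakly connected component (edge direction ignored)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        # Iterative depth-first search over one component.
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best


def gc_after_removal(nodes, edges, ranking, fraction):
    """Giant component size after removing the top `fraction` of ranked nodes."""
    removed = set(ranking[:int(fraction * len(ranking))])
    kept_nodes = [n for n in nodes if n not in removed]
    kept_edges = [(u, v) for u, v in edges if u not in removed and v not in removed]
    return giant_component_size(kept_nodes, kept_edges)


# Toy graph: two triangles joined through bridge node 3.
toy_nodes = list(range(7))
toy_edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5), (5, 6), (6, 4)]

full_gc = giant_component_size(toy_nodes, toy_edges)
gc_bridge_first = gc_after_removal(toy_nodes, toy_edges, [3, 2, 4, 0, 1, 5, 6], 0.15)
gc_corner_first = gc_after_removal(toy_nodes, toy_edges, [0, 1, 5, 6, 2, 4, 3], 0.15)
```

A ranking that places the bridge node first splits the toy graph into two triangles, while a ranking that starts with a triangle corner barely shrinks the giant component, which is the sense in which a lower GC indicates a better ranking.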

7. Conclusions and Future Work

In this paper, we presented and made publicly available the DUTA-10K dataset, with 10367 HS manually labeled into 25 categories. Against the widespread belief that most of Tor's content is related to criminal activities, the statistical analysis of DUTA-10K showed that only around 20% of the tested onion domains are associated with suspicious activities, while 48% are related to normal ones. The remaining 32% were not accessible because they were either down, locked, or empty, and we classified them as unknown domains. We also verified that 84% of the HS are in English, which corroborates the idea that a text-based model for classifying Tor content, trained only on this language, will cover the majority of the onion pages. Additionally, we found that the domains related to suspicious activities tend to have multiple clones under different addresses, which can even be used as an additional feature for identifying them.

One of the main contributions of this paper is the new algorithm that we proposed to rank Tor web pages, which we named ToRank. To facilitate the process of monitoring the HS, ToRank is designed to identify the most influential onion domains and rank them at the top. We employed graph theory to model the Tor network, where nodes correspond to the HS and edges refer to the hyperlinks between them. ToRank was evaluated quantitatively by peeling away the top-ranked nodes iteratively and checking whether the density of the graph decreases in every cycle. Its performance was compared against three popular link-based ranking algorithms, finding that the area under the Graph Density Curve (lower is better) was 1.31 for ToRank, 1.41 for Katz, 1.63 for HITSAuth, 1.96 for HITSHub, and 2.07 for PageRank.

These findings have made us reflect on how to determine the influence of a node in Tor beyond the analysis based on hyperlinks. At this moment, we are focused on extracting textual features, such as product names, vendor nicknames, locations, or even date and time formats (Lample et al., 2016; Aguilar et al., 2017). Additionally, we are planning to extract visual information by categorizing the images (Fidalgo et al., 2017) in Tor HS and generating textual descriptions for them using image captioning (You et al., 2016). Our idea is to combine those features with ToRank to improve the ranking. Finally, we are also considering introducing into our analysis the hyperlinks related to the surface Web, which might help to understand and determine which domains are the influential ones.

Acknowledgments

This research is supported by the INCIBE grant “INCIBEI-2015-27359” corresponding to the “Ayudas para la Excelencia de los Equipos de Investigación avanzada en ciberseguridad” and also by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under Addenda 22 and 01.

References

Abdi, H., 2007. The kendall rank correlation coeﬃcient. Encyclopedia

of Measurement and Statistics. Sage, Thousand Oaks, CA, 508–

510.

Aguilar, G., Maharjan, S., Monroy, A. P. L., Solorio, T., 2017. A

multi-task approach for named entity recognition in social me-

dia data. In: Proceedings of the 3rd Workshop on Noisy User-

generated Text. pp. 148–153.

Al Nabki, M. W., Fidalgo, E., Alegre, E., de Paz, I., 2017a. Classify-

ing illegal activities on tor network based on web textual contents.

In: Proceedings of the 15th Conference of the European Chapter

of the Association for Computational Linguistics: Volume 1, Long

Papers. Vol. 1. pp. 35–43.

Al Nabki, M. W., Fidalgo, E., Alegre, E., González-Castro, V., 2017b. Detecting emerging products in tor network based on k-shell graph decomposition. III Jornadas Nacionales de Investigación en Ciberseguridad (JNIC) 1 (1), 24–30.

Albert, R., Jeong, H., Barabási, A.-L., 1999. Internet: Diameter of the world-wide web. nature 401 (6749), 130.

Anger, I., Kittl, C., 2011. Measuring inﬂuence on twitter. In: Proceed-

ings of the 11th International Conference on Knowledge Manage-

ment and Knowledge Technologies. ACM, p. 31.

Anwar, T., Abulaish, M., 2015. Ranking radically inﬂuential web fo-

rum users. IEEE Transactions on Information Forensics and Secu-

rity 10 (6), 1289–1298.

Arulselvan, A., Commander, C. W., Elefteriadou, L., Pardalos, P. M.,

2009. Detecting critical nodes in sparse graphs. Computers & Op-

erations Research 36 (7), 2193–2200.

Backstrom, L., Kleinberg, J., 2014. Romantic partnerships and the

dispersion of social ties: a network analysis of relationship sta-

tus on facebook. In: Proceedings of the 17th ACM conference on

Computer supported cooperative work & social computing. ACM,

pp. 831–841.

Bergman, M. K., 2001. White paper: the deep web: surfacing hidden

value. Journal of electronic publishing 7 (1).

Berzinji, A., Kaati, L., Rezine, A., 2012. Detecting key players in

terrorist networks. In: Intelligence and Security Informatics Con-

ference (EISIC), 2012 European. IEEE, pp. 297–302.

Bidoki, A. M. Z., Ghodsnia, P., Yazdani, N., Oroumchian, F., 2010.

A3crank: An adaptive ranking method based on connectivity, con-

tent and click-through data. Information processing & manage-

ment 46 (2), 159–169.

Bidoki, A. M. Z., Yazdani, N., 2008. Distancerank: An intelligent

ranking algorithm for web pages. Information Processing & Man-

agement 44 (2), 877–892.

Biryukov, A., Pustogarov, I., Thill, F., Weinmann, R.-P., 2014. Con-

tent and popularity analysis of tor hidden services. In: Distributed

Computing Systems Workshops (ICDCSW), 2014 IEEE 34th In-

ternational Conference on. IEEE, pp. 188–193.

Biryukov, A., Pustogarov, I., Weinmann, R.-P., 2013. Trawling for

tor hidden services: Detection, measurement, deanonymization.

In: Security and Privacy (SP), 2013 IEEE Symposium on. IEEE,

pp. 80–94.

Biswas, R., Fidalgo, E., Alegre, E., 2017. Recognition of service do-

mains on tor dark net using perceptual hashing and image classi-

ﬁcation techniques. 8th International Conference on Imaging for

Crime Detection and Prevention, ICDP-2017 14, 15.

Booker, L. B., 2012. The eﬀects of observation errors on the at-

tack vulnerability of complex networks. Tech. rep., MITRE CORP

MCLEAN VA.

Borodin, A., Roberts, G. O., Rosenthal, J. S., Tsaparas, P., 2005.

Link analysis ranking: algorithms, theory, and experiments. ACM

Transactions on Internet Technology (TOIT) 5 (1), 231–297.

Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E., 2007. A

model of internet topology using k-shell decomposition. Proceed-

ings of the National Academy of Sciences 104 (27), 11150–11154.

Chang, V., 2017. A cybernetics social cloud. Journal of Systems and

Software 124, 195–211.

Chaurasia, N., Tiwari, A., 2013. Eﬃcient algorithm for destabiliza-

tion of terrorist networks. IJ Information Technology and Com-

puter Science 12, 21–30.

Choudhary, P., Singh, U., 2016. Ranking terrorist nodes of 9/11 net-

work using analytical hierarchy process with social network anal-

ysis. In: International Symposium on the Analytic Hierarchy Pro-

cess (ISAHP 2016). pp. 1–10.

Ciancaglini, V., Balduzzi, M., Goncharov, M., McArdle, R., 2013.

Deepweb and cybercrime. Trend Micro Report 9.

Cockburn, A., McKenzie, B., 2001. What do web users do? an empirical analysis of web use. International Journal of Human-Computer Studies 54 (6), 903–922.

Cohen, R., Havlin, S., 2010. Complex networks: structure, robustness

and function. Cambridge university press.

Cossu, J.-V., Dugué, N., Labatut, V., 2015. Detecting real-world influence through twitter. In: Network Intelligence Conference (ENIC), 2015 Second European. IEEE, pp. 83–90.

Derhami, V., Khodadadian, E., Ghasemzadeh, M., Bidoki, A. M. Z.,

2013. Applying reinforcement learning for web pages ranking al-

gorithms. Applied Soft Computing 13 (4), 1686–1692.

Duggan, B., Feb. 2016. Uganda elections: Government shuts down social media - cnn.

URL https://edition.cnn.com/2016/02/18/world/uganda-election-social-media-shutdown/

Duijn, P. A., Kashirin, V., Sloot, P. M., 2014. The relative ineﬀective-

ness of criminal network disruption. Scientiﬁc reports 4, 4238.

Elahi, T., Bauer, K., AlSabah, M., Dingledine, R., Goldberg, I., 2012.

Changing of the guards: A framework for understanding and im-

proving entry guard selection in tor. In: Proceedings of the 2012

ACM Workshop on Privacy in the Electronic Society. ACM, pp.

43–54.

Eliacik, A. B., Erdogan, N., 2018. Inﬂuential user weighted sentiment

analysis on topic based microblogging community. Expert Systems

with Applications 92, 403–418.

Ferrara, E., De Meo, P., Catanese, S., Fiumara, G., 2014. Detecting

criminal organizations in mobile phone networks. Expert Systems

with Applications 41 (13), 5733–5750.

Fidalgo, E., Alegre, E., González-Castro, V., Fernández-Robles, L., 2017. Illegal activity categorisation in darknet based on image classification using creic method. In: International Joint Conference SOCO’17-CISIS’17-ICEUTE’17 León, Spain, September 6–8, 2017, Proceeding. Springer, pp. 600–609.

Foley, S., Karlsen, J., Putniņš, T. J., 2018. Sex, drugs, and bitcoin: How much illegal activity is financed through cryptocurrencies? SSRN Electronic Journal.

Freeman, L. C., Roeder, D., Mulholland, R. R., 1979. Centrality in

social networks: Ii. experimental results. Social networks 2 (2),

119–141.

Fronzetti Colladon, A., Gloor, P. A., 2018. Measuring the impact of

spammers on e-mail and twitter networks. International Journal of

Information Management.

Fronzetti Colladon, A., Remondi, E., 2017. Using social network

analysis to prevent money laundering. Expert Systems with Ap-

plications 67, 49–58.

Fronzetti Colladon, A., Vagaggini, F., 2017. Robustness and stability

of enterprise intranet social networks: The impact of moderators.

Information Processing & Management 53 (6), 1287–1298.

Gallagher, S., Mar. 2016. Whole lotta onions: Number of tor hidden sites spikes-along with paranoia.

URL https://bit.ly/2MGTkrU

Gohari, F. S., Mohammadi, S., January 2014. A comprehensive frame-

work for identifying viral marketing’s inﬂuencers in twitter. Inter-

national SAMANM Journal of Marketing and Management 2 (1),

27–43.

Hasan, O., Brunie, L., Bertino, E., Shang, N., 2013. A decentralized

privacy preserving reputation protocol for the malicious adversar-

ial model. IEEE Transactions on Information Forensics and Secu-

rity 8 (6), 949–962.

Henni, K., Mezghani, N., Gouin-Vallerand, C., 2018. Unsupervised

graph-based feature selection via subspace and pagerank centrality.

Expert Systems with Applications 114, 46–53.

Holme, P., Kim, B. J., Yoon, C. N., Han, S. K., 2002. Attack vulnera-

bility of complex networks. Physical review E 65 (5), 056109.

Holmgren, Å. J., 2007. A framework for vulnerability assessment of

electric power systems. In: Critical Infrastructure. Springer, pp.

31–55.

Hu, Y., Wang, S., Ren, Y., Choo, K.-K. R., 2018. User inﬂuence analy-

sis for github developer social networks. Expert Systems with Ap-

plications 108, 108–118.

Hu, Y., Zhang, J., Bai, X., Yu, S., Yang, Z., 2016. Inﬂuence analysis

of github repositories. SpringerPlus 5 (1), 1268.

Husslage, B., Borm, P., Burg, T., Hamers, H., Lindelauf, R., 2015.

Ranking terrorists in networks: A sensitivity analysis of al qaeda’s

9/11 attack. Social Networks 42, 1–7.

Iyer, S., Killingback, T., Sundaram, B., Wang, Z., 2013. Attack

robustness and centrality of complex networks. PloS one 8 (4),

e59613.

Jansen, R., Tschorsch, F., Johnson, A., Scheuermann, B., 2014. The

sniper attack: Anonymously deanonymizing and disabling the tor

network. Tech. rep., OFFICE OF NAVAL RESEARCH ARLING-

TON VA.

Ji, S., Li, W., Gong, N. Z., Mittal, P., Beyah, R., 2016. Seed-based

de-anonymizability quantiﬁcation of social networks. IEEE Trans-

actions on Information Forensics and Security 11 (7), 1398–1411.

Joshi, A., Fidalgo, E., Alegre, E., Al Nabki, M. W., 2018. Extractive

text summarization in dark web: A preliminary study. International

Conference of Applications of Intelligent Systems.

Katz, L., 1953. A new status index derived from sociometric analysis.

Psychometrika 18 (1), 39–43.

Kleinberg, J. M., 1999. Authoritative sources in a hyperlinked envi-

ronment. Journal of the ACM (JACM) 46 (5), 604–632.

Krebs, V. E., 2002. Mapping networks of terrorist cells. Connections

24 (3), 43–52.

Kwon, A., AlSabah, M., Lazar, D., Dacier, M., Devadas, S., 2015. Cir-

cuit ﬁngerprinting attacks: Passive deanonymization of tor hidden

services. In: 24th USENIX Security Symposium (USENIX Secu-

rity 15). USENIX Association, Washington, D.C., pp. 287–302.

Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer,

C., June 2016. Neural architectures for named entity recognition.

In: Proceedings of the 2016 Conference of the North American

Chapter of the Association for Computational Linguistics: Human

Language Technologies. Association for Computational Linguis-

tics, San Diego, California, pp. 260–270.

URL http://www.aclweb.org/anthology/N16-1030

Leskovec, J., Lang, K. J., Dasgupta, A., Mahoney, M. W., 2009. Com-

munity structure in large networks: Natural cluster sizes and the

absence of large well-deﬁned clusters. Internet Mathematics 6 (1),

29–123.

Levene, M., 2011. An introduction to search engines and web naviga-

tion. John Wiley & Sons.

Ling, Z., Luo, J., Wu, K., Yu, W., Fu, X., 2015. Torward: Discovery,

blocking, and traceback of malicious traﬃc over tor. IEEE Trans-

actions on Information Forensics and Security 10 (12), 2515–2530.

Mao, G., Zhang, N., 2017. Fast approximation of average shortest

path length of directed ba networks. Physica A: Statistical Mechan-

ics and its Applications 466, 243–248.

Matic, S., Kotzias, P., Caballero, J., 2015. Caronte: Detecting loca-

tion leaks for deanonymizing tor hidden services. In: Proceedings

of the 22nd ACM SIGSAC Conference on Computer and Commu-

nications Security. ACM, pp. 1455–1466.

Memon, N., Larsen, H. L., 2006. Structural analysis and destabilizing

terrorist networks. In: DMIN. Citeseer, pp. 296–302.

Mitchell, J., 2017. Want to cry? ITNOW 59 (3), 12–13.

Moore, D., Rid, T., 2016. Cryptopolitik and the darknet. Survival

58 (1), 7–38.

Noor, U., Rashid, Z., Rauf, A., 2011. A survey of automatic deep

web classiﬁcation techniques. International Journal of Computer

Applications 19 (6), 43–50.

Norbutas, L., 2018. Oﬄine constraints in online drug marketplaces:

An exploratory analysis of a cryptomarket trade network. Interna-

tional Journal of Drug Policy 56, 92–100.

Nouh, M., Nurse, J. R., 2015. Identifying key-players in online activist

groups on the facebook social network. In: Data Mining Work-

shop (ICDMW), 2015 IEEE International Conference on. IEEE,

pp. 969–978.

Page, L., Brin, S., Motwani, R., Winograd, T., 1999. The pagerank

citation ranking: Bringing order to the web. Tech. rep., Stanford

InfoLab.

Rivest, R., 1992. The md5 message-digest algorithm. Internet Engi-

neering Task Force.

Ruhnau, B., 2000. Eigenvector-centrality—a node-centrality? Social

networks 22 (4), 357–365.

Scott, J., 2017. Social network analysis. Sage.

Seidman, S. B., 1983. Network structure and minimum degree. Social

networks 5 (3), 269–287.

Srinivas, A., Velusamy, R. L., 2015. Identiﬁcation of inﬂuential nodes

from social networks based on enhanced degree centrality mea-

sure. In: Advance Computing Conference (IACC), 2015 IEEE In-

ternational. IEEE, pp. 1179–1184.

Taha, K., Yoo, P. D., 2017. Using the spanning tree of a criminal net-

work for identifying its leaders. IEEE Transactions on Information

Forensics and Security 12 (2), 445–453.

Wang, Y., Nelissen, N., Adamczuk, K., De Weer, A.-S., Vandenbul-

cke, M., Sunaert, S., Vandenberghe, R., Dupont, P., 2014. Re-

producibility and robustness of graph measures of the associative-

semantic network. PloS one 9 (12), e115215.

Watts, D. J., Strogatz, S. H., 1998. Collective dynamics of ‘small-world’ networks. nature 393 (6684), 440.

Wei, H., Pan, Z., Hu, G., Zhang, L., Yang, H., Li, X., Zhou, X.,

2018. Identifying inﬂuential nodes based on network representa-

tion learning in complex networks. PloS one 13 (7), e0200091.

Xu, X., Zhou, C., Wang, Z., 2009. Credit scoring algorithm based on

link analysis ranking with support vector machine. Expert Systems

with Applications 36 (2), 2625–2632.

Ye, Q., Wu, B., Wang, B., 2010. Distance distribution and average

shortest path length estimation in real-world networks. In: Inter-

national Conference on Advanced Data Mining and Applications.

Springer, pp. 322–333.

You, Q., Jin, H., Wang, Z., Fang, C., Luo, J., 2016. Image captioning

with semantic attention. In: Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition. pp. 4651–4659.

Zhang, Y., Bao, Y., Zhao, S., Chen, J., Tang, J., 2015. Identifying node

importance by combining betweenness centrality and katz central-

ity. In: Cloud Computing and Big Data (CCBD), 2015 Interna-

tional Conference on. IEEE, pp. 354–357.

A Big Data architecture for early identification and categorization of dark web sites

Article

Full-text available

Mar 2024
FUTURE GENER COMP SY

Dark Side of the Web: Dark Web Classification Based on TextCNN and Topic Modeling Weight

Article

Full-text available

Jan 2023

The Dark Web is an internet domain that ensures user anonymity and has increasingly become a focal point for illegal activities and a repository for information on cyberattacks owing to the challenges in tracking its users. This study examined the classification of the Dark Web in relation to these cyber threats. We processed Dark Web texts to extract vector types suitable for machine learning classification. Traditional methods utilizing the entirety of Dark Web texts to generate features result in vectors including all words found on the Dark Web. However, this approach incorporates extraneous information in the vectors, diminishing learning effectiveness and extending processing duration. The research aimed to optimize the classification process by selectively focusing on keywords within each class, thereby curtailing word vector dimensions. This optimization was facilitated by leveraging the anonymity characteristic of the Dark Web and employing topic-modeling-based weight generation. These methods enabled the creation of word vectors with a constrained feature set, enhancing the distinction of Dark Web classes. To further improve classification performance, we integrated TextCNN with topic modeling weights. For validation, we employed two datasets and compared the performance of the model with other text classification algorithms, where the proposed model demonstrated superior effectiveness in Dark Web classification.

Multidimensional Forensic Investigation of Onion Sites Based on Fuzzy Encoded LSTM

Article

Full-text available

Mar 2024

Dinesha H.A. Preeti S. Joshi

The only way to access onion services is via the TOR browser providing anonymity and privacy to the client as well as the server. Information about these hidden services and the contents available on them cannot be gathered like websites on the surface web. So, they become a fertile ground for illegal content dissemination and hosting for cybercriminals. There is a persistent need to classify and block such content from onion sites. In this paper, we investigate data requested from onion services to help law enforcement agencies collect traces of cybercrime on these hidden services. We propose a system using fuzzy encoded LSTM to analyze contents retrieved from these sites and raise alerts if found illegal. The accuracy of fuzzy-encoded LSTM is found to be 81.04 % and it outperforms other classifiers.

Darkweb research: Past, present, and future trends and mapping to sustainable development goals

Article

Full-text available

Nov 2023

The Darkweb, part of the deep web, can be accessed only through specialized computer software and used for illegal activities such as cybercrime, drug trafficking, and exploitation. Technological advancements like Tor, bitcoin, and cryptocurrencies allow criminals to carry out these activities anonymously, leading to increased use of the Darkweb. At the same time, computers have become an integral part of our daily lives, shaping our behavior, and influencing how we interact with each other and the world. This work carries out the bibliometric study on the research conducted on Darkweb over the last decade. The findings illustrate that most research on Darkweb can be clustered into four areas based on keyword co-occurrence analysis: (i) network security, malware, and cyber-attacks, (ii) cybercrime, data privacy, and cryptography, (iii) machine learning, social media, and artificial intelligence, and (iv) drug trafficking, cryptomarket. National Science Foundation from the United States is the top funder. Darkweb activities interfere with the Sustainable Development Goals (SDG) laid forth by the United Nations to promote peace and sustainability for current and future generations. SDG 16 (Peace, Justice, and Strong Institutions) has the highest number of publications and citations but has an inverse relationship with Darkweb, as the latter undermines the former. This study highlights the need for further research in bitcoin, blockchain, IoT, NLP, cryptocurrencies, phishing and cybercrime, botnets and malware, digital forensics, and electronic crime countermeasures about the Darkweb. The study further elucidates the multi-dimensional nature of the Darkweb, emphasizing the intricate relationship between technology, psychology, and geopolitics. This comprehensive understanding serves as a cornerstone for evolving effective countermeasures and calls for an interdisciplinary research approach. 
The study also delves into the psychological motivations driving individuals towards illegal activities on the Darkweb, highlighting the urgency for targeted interventions to promote pro-social online behavior.

Unveiling local patterns of child pornography consumption in France using Tor

Article

Full-text available

Jun 2024

Child pornography—better known as child sexual abuse material (CSAM)—represents a severe form of exploitation and victimization of children, leaving the victims with emotional and physical trauma. In this study, we aim to analyze local patterns of CSAM consumption across 1341 French communes in 20 metropolitan regions of France between March 16 to May 31, 2019 using fine-grained mobile traffic data of Tor network-related web services. We estimate that approx. 0.08% of Tor mobile download traffic observed in France is linked to the consumption of CSAM by correlating it with local-level temporal porn consumption patterns. This compares to 0.19% of what we conservatively estimate to be the share of CSAM content in global Tor traffic. In line with existing literature on the link between sexual child abuse and the consumption of image-based content thereof, we observe a positive and statistically significant effect of our CSAM consumption estimates on the reported number of victims of sexual violence and vice versa, which validates our findings, after controlling for a set of geographically disaggregated features including socio-demographic characteristics, voting behavior, nearby points of interest and Google Trends queries. While this is a first, exploratory attempt to look at CSAM from a spatial epidemiological angle, we believe this research provides public health officials with valuable information to prioritize target areas for public awareness campaigns as another step to fulfill the global community’s pledge to target 16.2 of the sustainable development goals: “end abuse, exploitation, trafficking and all forms of violence and torture against children".

Beyond the Onion Routing: Unmasking Illicit Activities on the Dark Web

Article

Jun 2024

This comprehensive study delves into the complexities of the Dark Web, a concealed segment of the internet that remains invisible to standard search engines and is accessible only through specialized tools like The Onion Router (TOR), which ensures user anonymity. While the Dark Web is celebrated for its capacity to safeguard privacy and foster free expression, it concurrently serves as a sanctuary for illegal endeavours, encompassing drug trafficking, unauthorized arms trading, and a spectrum of cybercrime. The primary objective of this research is to scrutinize the efficacy of onion routing, the foundational technology behind the Dark Web, in preserving user anonymity amidst escalating efforts by law enforcement agencies to dismantle illegal activities. This paper adopts a rigorous approach that melds an exhaustive review of pertinent literature with empirical investigations to pinpoint the intrinsic vulnerabilities within the onion routing framework. Furthermore, the study introduces innovative methodologies aimed at bolstering the detection and neutralization of illicit transactions and communications on the Dark Web. These proposed methods seek to establish a delicate balance between upholding the Dark Web's legitimate functions—such as protecting privacy and enabling free speech—and curtailing its misuse for criminal activities. The paper culminates in a discussion of the broader implications of these findings for policymakers, law enforcement officials, and privacy advocates. It provides a set of recommendations for future research and policy formulation in this intricate and ever-evolving domain, to navigate the challenges posed by the Dark Web while preserving its essential values.

Systematic Literature Review and Assessment for Cyber Terrorism Communication and Recruitment Activities

Chapter

Mar 2024

Terrorist Network Analysis has been one of the most commonly discussed ways for safeguarding Online Social Network accessing and transmission through decentralized, trustless, peer-to-peer networks since Gabriel Weimann’s research paper on Cyberterrorism published in 2005. This study analyzes peer-reviewed literature attempting to use Terrorist Network Analysis for Cybercrime and gives a comprehensive analysis of the most often used Terrorist Network Analysis Security applications. The Proposed Finding recommends Malicious activities (Messages and Post relating to Terrorist events on Social Networking) for alerting users, so that they could cut away his/her communication from Malicious Actors and also help Security Agencies to identify possible Attacks and Vulnerabilities on modern Online Social Network, their main reasons and countermeasures. This systematic study also offers light on future directions in Terrorist Network Analysis and Cybercrime research, education, and practices, such as Terrorist Network Analysis security in Online Social Network, Security for Machine Learning and automated techniques.

Keyword-Based Information Retrieval Model for the Dark Web

Conference Paper

Nov 2023

The Dark Side of the Language: Pre-trained Transformers in the DarkNet

Conference Paper

Jan 2023

The Dark Side of the Language: Syntax-Based Neural Networks Rivaling Transformers in Definitely Unseen Sentences

Conference Paper

Oct 2023

A ComprehensiveFramework for Identifying Viral Marketing's Influencers in Twitter

Article

Full-text available

Jan 2014

Viral marketing can lead to extensive knowledge of marketing campaigns across customers with lower costs. The important point of viral marketing is targeting the subset of customers that can influence on others. Such customers enhance the efficiency of a marketing campaign by maximizing propagation of viral message throughout the network. According to increasing the importance of the Twitter network for marketing efforts in recent years, the aim of this work is to identify the best influential individuals for the efficient performance of viral marketing campaigns in this network. Recent works on Twitter reveal the lack of a comprehensive framework for differentiation of influencers in viral marketing. Our qualitative research aims at the synthesis of results and theories from previous studies in a new comprehensive framework. The paper first provides a detailed review on previous works about the influence and diffusion of information on Twitter. Second, according to the important features of viral marketing's influencers, it proposes a comprehensive framework for evaluating these features in terms of Twitter functions.This frameworkconcentrates on all of the important factors for identifying viral marketing's influencers. So, the most worthy twitterers with highest marketing value can be identified effectively based on our proposed framework.

Identifying influential nodes based on network representation learning in complex networks

Article

Jul 2018
PLOS ONE

Identifying influential nodes is an important topic in many diverse applications, such as accelerating information propagation, controlling rumors and diseases. Many methods have been put forward to identify influential nodes in complex networks, ranging from node centrality to diffusion-based processes. However, most of the previous studies do not take into account overlapping communities in networks. In this paper, we propose an effective method based on network representation learning. The method considers not only the overlapping communities in networks, but also the network structure. Experiments on real-world networks show that the proposed method outperforms many benchmark algorithms and can be used in large-scale networks.

Recognition of service domains on TOR dark net using perceptual hashing and image classification techniques

Conference Paper

Jan 2017

Offline constraints in online drug marketplaces: An exploratory analysis of a cryptomarket trade network

Article

Apr 2018

Lukas Norbutas

Background: Cryptomarkets, or illegal anonymizing online platforms that facilitate drug trade, have been analyzed in a rapidly growing body of research. Previous research has found that, despite increased risks, cryptomarket sellers are often willing to ship illegal drugs internationally. There is little to no information, however, about the extent to which uncertainty and risk related to geographic constraints shape buyers' behavior and, in turn, the structure of the global online drug trade network. In this paper, we analyze the structure of a complete cryptomarket trade network with a focus on the role of geographic clustering of buyers and sellers. Methods: We use publicly available crawls of the cryptomarket Abraxas, encompassing market transactions between 463 sellers and 3542 buyers of drugs in 2015. We use descriptive social network analysis and Exponential Random Graph Models (ERGM) to analyze the structure of the trade network. Results: The structure of the online drug trade network is primarily shaped by geographical boundaries. Buyers are more likely to buy from multiple sellers within a single country, and avoid buying from sellers in different countries, which leads to strong geographic clustering. The effect is especially strong between continents and weaker for countries within Europe. A small fraction of buyers (10%) account for more than half of all drug purchases, while most buyers only buy once. Conclusion: Online drug trade networks might still be heavily shaped by offline (geographic) constraints, despite their ability to provide end-users with access to a large international supply. Cryptomarkets might be more "localized" and less international than previously thought. We discuss potential explanations for such geographical clustering and implications of the findings.

A Multi-task Approach for Named Entity Recognition in Social Media Data

Conference Paper

Jan 2017

Sex, Drugs, and Bitcoin: How Much Illegal Activity Is Financed through Cryptocurrencies?

Article

May 2019

Cryptocurrencies are among the largest unregulated markets in the world. We find that approximately one-quarter of bitcoin users are involved in illegal activity. We estimate that around $76 billion of illegal activity per year involve bitcoin (46% of bitcoin transactions), which is close to the scale of the U.S. and European markets for illegal drugs. The illegal share of bitcoin activity declines with mainstream interest in bitcoin and with the emergence of more opaque cryptocurrencies. The techniques developed in this paper have applications in cryptocurrency surveillance. Our findings suggest that cryptocurrencies are transforming the black markets by enabling “black e-commerce.” Received June 1, 2017; editorial decision December 8, 2018 by Editor Andrew Karolyi. Authors have furnished an Internet Appendix, which is available on the Oxford University Press Web site next to the link to the final published paper online.

Measuring the impact of spammers on e-mail and Twitter networks

Article

Oct 2019
INT J INFORM MANAGE

This paper investigates whether senders of large amounts of irrelevant or unsolicited information – commonly called “spammers” – distort the network structure of social networks. Two large social networks are analyzed, the first extracted from the Twitter discourse about a big telecommunication company, and the second obtained from three years of email communication of 200 managers working for a large multinational company. This work compares network robustness and the stability of centrality and interaction metrics, as well as the use of language, after removing spammers and the most and least connected nodes. The results show that, for most of the social indicators, spammers do not significantly alter the structure of the information-carrying network. The authors additionally investigate the correlation between e-mail subject line and content by tracking language sentiment, emotionality, and complexity, addressing the cases where collecting email bodies is not permitted for privacy reasons. The findings extend the research on the robustness and stability of social network metrics after the application of graph simplification strategies. The results have practical implications for network analysts and for those company managers who rely on network analytics (applied to company emails and social media data) to support their decision-making processes.

Unsupervised graph-based feature selection via subspace and pagerank centrality

Article

Dec 2018
EXPERT SYST APPL

Feature selection has become an indispensable part of intelligent systems, especially with the proliferation of high-dimensional data. It identifies the subset of discriminative features leading to better learning performance, i.e., higher learning accuracy, lower computational cost and significant model interpretability. This paper proposes a new efficient unsupervised feature selection method based on graph centrality and subspace learning called UGFS for ‘Unsupervised Graph-based Feature Selection’. The method maps features on an affinity graph where the relationships (edges) between feature nodes are defined by means of data points subspace preference. A feature importance score is then computed on the entire graph using a centrality measure. For this purpose, we investigated Google's PageRank method, originally introduced to rank web pages. The proposed feature selection method has been evaluated using classification and redundancy rates measured on the selected feature subsets. Comparisons with well-known unsupervised feature selection methods, on gene/expression benchmark datasets, demonstrate the validity and the efficiency of the proposed method.

User Influence Analysis for Github Developer Social Networks

Article

May 2018
EXPERT SYST APPL

Github, one of the largest social coding platforms, offers software developers the opportunity to engage in social activities relating to software development and to store or share their code and projects with the wider community using repositories. Analysis of data representing the social interactions of Github users can reveal a number of interesting features. In this paper, we analyze the data to understand user social influence on the platform. Specifically, we propose a Following-Star-Fork-Activity based approach to measure user influence in the Github developer social network. We first preprocess the Github data and construct the social network. Then, we analyze user influence in the social network in terms of popularity, centrality, content value, contribution and activity. Finally, we analyze the correlation of different user influence measures, and use Borda Count to comprehensively quantify user influence and verify the results.

Influential User Weighted Sentiment Analysis on Topic Based Microblogging Community

Article

Oct 2017
EXPERT SYST APPL

Nowadays, social microblogging services have become a popular platform for expressing what people think. People use these platforms to produce content in real time on different topics, from finance, politics and sports to sociological fields. With the proliferation of social microblogging sites, a massive amount of opinion text has become available in digital form, thus enabling research on sentiment analysis to both deepen and broaden in different sociological fields. Previous sentiment analysis research on microblogging services generally focused on text as the unique source of information, and did not consider the social microblogging service network information. Inspired by social network analysis research and sentiment analysis studies, we find that people's trust in a community has an important place in determining the community's sentiment polarity about a topic. When studies in the literature are examined, it is seen that trusted users in a community are actually influential users. Hence, we propose a novel sentiment analysis approach that takes into account the social network information as well. We concentrate on the effect of influential users on the sentiment polarity of a topic-based microblogging community. Our approach extends the classical sentiment analysis methods, which only consider text content, by adding a novel PageRank-based influential user finding algorithm. We have carried out a comprehensive empirical study of two real-world Twitter datasets to analyze the correlation between the mood of the financial social community and the behavior of the stock exchange of Turkey, namely BIST100, using the Pearson correlation coefficient method. Experimental results validate our assumptions and show that the proposed sentiment analysis method is more effective in finding a topic-based microblogging community's sentiment polarity.
