ToRank: Identifying the Most Influential Suspicious Domains in the Tor Network

Mhd Wesam Al-Nabki (a,b), Eduardo Fidalgo (a,b), Enrique Alegre (a,b), Laura Fernández-Robles (a,b,c)
(a) Department of Electrical, Systems and Automation, Universidad de León, Spain
(b) Researcher at INCIBE (Spanish National Cybersecurity Institute), León, Spain
(c) Department of Mechanical, Informatics and Aerospace Engineering, Universidad de León, Spain
Abstract
The Tor network hosts a significant number of hidden services related to suspicious activities. Law Enforcement Agencies (LEAs) need to monitor and investigate crimes hidden behind the anonymity provided by Tor. In this paper, we propose a new algorithm, named ToRank, that ranks hidden services in Tor better than the known algorithms used for the Surface Web. We also thoroughly analyze the content present in Tor, creating a dataset, DUTA-10K, that extends the previous Darknet Usage Text Addresses (DUTA) dataset. We quantitatively compared ToRank with some of the most popular ranking algorithms, namely PageRank, HITS, and Katz. Results showed that our proposal causes greater harm to the robustness of the Tor network than all of them, which indicates its superiority for this problem. The analysis of DUTA-10K reveals that only 20% of the hidden services that can be accessed are related to suspicious activities, and 48% are associated with normal ones. We also discovered that domains related to suspicious activities usually present multiple clones under different addresses, which could be used as an additional feature for identifying them. We consider that our new algorithm, the extended dataset, and the findings obtained from the analysis are helpful for LEAs to fight against crimes that take place in the Tor hidden services.
Keywords: Darknet, Dataset, Influence detection, Graph analysis, Ranking algorithm, Hidden Services.
1. Introduction
Nowadays, the most common way to access information on the Internet is through standard search engines, such as Google or Bing. However, despite their efficient and powerful performance, they cannot index all Web content (Bergman, 2001). Thus, a division is made between the part of the Web whose content is indexable, the Surface Web, and the rest of the Web, which is not, the Deep Web (Noor et al., 2011; Moore and Rid, 2016; Al Nabki et al., 2017a). In the depths of the Deep Web, there is a portion called the Darknet or Dark Web (Al Nabki et al., 2017a), the fragment of the Web that is intentionally hidden and can only be accessed through specific software applications. “The Onion Router” [1] (Tor) is one of the most popular Darknet networks, and its domains, known as hidden services (HS), can be accessed through Tor Browser [2] or a proxy such as Tor2Web [3]. The Tor metrics website [4] reported that the number of unique addresses increased from 30K to almost 100K between April 2015 and September 2018 (Fig. 1 [5]). It is worth mentioning that in 2016 and 2017 there was a peak in the first quarter of each year, followed by a sharp decrease; there is no clear reason behind this, as Kate Krauss, the director of communications and public policy for the Tor Project, declared (Gallagher and UTC, 2016). However, the spikes might be explained by political events, such as when the Ugandan government blocked social networks before the election in February 2016 (Duggan, 2016). Consequently, new domains were created in the Tor network and more people used it [6].
E-mail addresses: {mnab, efidf, ealeg, l.fernandez}@unileon.es
[1] www.torproject.org
[2] https://www.torproject.org/projects/torbrowser.html.en
The privacy and the high level of anonymity provided by the structure of the Tor network have always attracted traders of suspicious services, who promote their business by creating new HS, posing new critical challenges to world security (Ling et al., 2015; Ciancaglini et al., 2013; Norbutas, 2018; Foley et al., 2018). To address these threats, Law Enforcement Agencies (LEAs) need techniques and automatic tools to monitor HS activities efficiently. Based on our collaboration with the Spanish National Cybersecurity Institute (INCIBE [7]), we divided this process into a pipeline of three main components (Fig. 2).
[3] https://tor2web.org/
[4] https://metrics.torproject.org/
[5] Source: https://metrics.torproject.org
[6] Tor traffic during the Ugandan elections in 2016: https://metrics.torproject.org/userstats-relay-country.html?start=2015-11-22&end=2016-02-20&country=ug&events=off
Preprint submitted to Expert Systems With Applications October 23, 2020
Figure 1: The number of live Tor HS between 2015-April-01 and 2018-September-01. The horizontal axis represents the years, while the vertical axis corresponds to the number of unique Tor addresses.
Figure 2: The three components of the onion domains monitoring pipeline.
First, we classify the HS content into normal and suspicious activities; the latter refers to the content that LEAs are interested in monitoring. Next, we categorize the suspicious ones into different groups based on the crime or activity they could be related to. Our previous work (Al Nabki et al., 2017a) targeted this component by creating a supervised text classifier to isolate the suspicious domains, and it is currently being used by one Spanish LEA.
[7] In Spanish, it stands for Instituto Nacional de Ciberseguridad de España.
The second component of our pipeline, which is the objective of this paper, is responsible for ranking the HS and detecting the most influential ones within the network. It is fed with a list of onion domains that our classifier determined to be suspicious, and it ranks them using an algorithm that we propose to reflect their popularity among other HS. Recognizing the influential HS could provide clues to the LEAs about who the market leaders of each activity are. For example, identifying the most influential drug marketplaces that attract people could be useful to draw insights into the common products, sellers’ nicknames, main countries involved, and possible export destinations of that market.
Third, de-anonymization, i.e. locating the IP address of the HS, would take place (Biryukov et al., 2013; Jansen et al., 2014; Kwon et al., 2015; Matic et al., 2015). Although this task is challenging due to the high level of security in the Tor network, the ranking component allows the onion domains to be prioritized. Indeed, even if the LEAs could take an unlawful HS down, the domain could easily be replaced by cloning its content into a new one. An additional advantage of the ranking module, i.e. the second component, is that it can detect the domain again, especially if it has a high relevance, which could help in neutralizing this threat anew. Although the proposed method cannot prevent people from accessing these suspicious HS, it helps LEAs to keep a close eye on the influential domains.
Hence, the presented ranking algorithm is a supplementary and valuable resource for the LEAs in optimizing their use of resources because it recommends where to focus their localization and monitoring efforts.
Since the first and the last components are out of the scope of this paper, in the following we explain only how we carried out the second one. In particular, this work presents ToRank, a novel algorithm to rank and detect the most influential onion domains in the Tor network that engage in suspicious activities. At present, the result of this work is being used by the Spanish Police Forces to monitor the Tor Darknet. The main contributions of this paper are summarized as follows.
- We introduce ToRank, a link-based ranking algorithm for onion domains that detects the most influential ones. It outperformed well-known ranking algorithms, such as PageRank, HITS, and Katz, in terms of the reduction in the giant component, the clustering coefficient, and the density, and of the increase in the average shortest path and the diameter, of the Tor network when it is represented by a directed graph (Fig. 3).
- We propose and make publicly available DUTA-10K [8], an extended version of the “Darknet Usage Text Addresses” (DUTA) dataset (Al Nabki et al., 2017a), which now contains 10367 manually labeled onion domains. To keep up with the most recent activities on the Tor HS, DUTA-10K introduces CryptoLocker, a new category that has spread widely, especially after the WannaCry virus (Mitchell, 2017).
- Finally, we carried out, and present here, a statistical analysis of DUTA-10K regarding the distribution of the activities of its onion web pages, the domains whose content is replicated, and the distribution of the languages in the analyzed pages.
The rest of the paper is organized as follows. Section 2 reviews the related work. Next, Section 3 introduces DUTA-10K, the updated version of the DUTA dataset. After that, the proposed ranking method, ToRank, is described in Section 4. In Section 5, we describe the experiments conducted and how we evaluated the proposed ranking method. We discuss the obtained results in Section 6. Finally, Section 7 presents the main conclusions that can be drawn from this work and points to follow-up approaches already in progress.
2. Related work
Few works have addressed the problem of ranking the HS in the Tor network. Biryukov et al. (2014) proposed a solution that exploits the concept of entry guard nodes (Elahi et al., 2012) to de-anonymize clients of a Tor HS. They estimated the popularity of onion domains in the Tor network by examining the incoming traffic to those domains; the weakness of this approach is that the analysis can be blocked if the vulnerability is fixed. The approach followed in this paper is entirely different. Our purpose was to determine which HS are more significant than others as a possible source of suspicious content, but without measuring the incoming traffic. The solution that we present here uses, among other things, some of the concepts that worked on the Surface Web to search for a web page. We represent the Tor Darknet as a directed graph, and we rely upon graph theory first to rank and later to detect the most influential onion domains.
[8] https://goo.gl/forms/bmJCaKthwxoQAwMm1
Graph data structures have been widely used to represent a set of entities and their connections in fields including, but not limited to, social network analysis (Scott, 2017; Backstrom and Kleinberg, 2014; Ji et al., 2016) and data mining (Al Nabki et al., 2017b). Henni et al. (2018) used a graph-based approach to build an unsupervised feature selection method in which nodes correspond to features and the relationships between those features are captured by the graph edges. Next, to assign an importance score to each feature, they evaluated several graph centrality measures and, in particular, the PageRank algorithm. Hasan et al. (2013) also used graph theory to build a graph of trust relationships between network users, represented by nodes, where the edges reflect a binary trust relation between them. Al Nabki et al. (2017b) proposed a method to detect emerging products within the Tor network using the K-shell algorithm (Carmi et al., 2007). They employed an undirected graph where the nodes refer to the marketplace products and the edges express the presence of two products within the same marketplace.
The detection of influential nodes within a given graph is performed either by analyzing the connectivity between the nodes, called link-based, or by evaluating the content of the nodes, known as content-based (Derhami et al., 2013; Bidoki et al., 2010; Bidoki and Yazdani, 2008). Both approaches can be merged into a hybrid one by extracting features from the graph and utilizing the content of the nodes (Anwar and Abulaish, 2015). Link-based ranking algorithms have been studied widely and employed to solve several problems (Borodin et al., 2005). Xu et al. (2009) proposed an algorithm that combines a link-based ranking algorithm with a Support Vector Machine classifier to determine the eligibility of applicants for credit cards or bank loans. Fronzetti Colladon and Remondi (2017) explored the use of social network metrics, such as in-degree, out-degree, closeness, and betweenness centrality, to combat money-laundering offenses. Ferrara et al. (2014) introduced an algorithm called LogAnalysis, which is based on several social network analysis and community detection algorithms, to detect criminal communities via the logs of their phone call records. Taha and Yoo (2017) presented an algorithm that uses the Minimum Spanning Tree (MST) to build a network of criminals. Each node was assigned a score proportional to the number of nodes whose existence depends on the existence of the targeted node. Arulselvan et al. (2009) conducted a study to detect the critical nodes in sparse networks by proposing an algorithm to minimize the pair-wise connectivity between the nodes. Zhang et al. (2015) proposed a method to identify the influential nodes using two node centrality techniques, the betweenness (Freeman et al., 1979) and Katz (Katz, 1953) centralities. Hu et al. (2016) studied the network of GitHub repositories to detect the influential ones, building a star relation graph between the repositories. Then, they assessed the performance of the weighted HITS (Kleinberg, 1999) and PageRank (Page et al., 1999) algorithms. Later on, Hu et al. (2018) proposed the UserRank algorithm, a personalized version of PageRank dedicated to the network of GitHub developers. Another study by Nouh and Nurse (2015) focused on identifying the key player nodes in a Facebook group using social network analysis metrics, namely, the eigenvector centrality and the betweenness centrality (Ruhnau, 2000).
Figure 3: An overview of the procedure followed to detect and rank the influential HS. First, the hyperlinks of the dataset are extracted. Afterward, for every single category of the suspicious activities, a Suspicious Activities Graph (SAG) is constructed, like SAG_Porno and SAG_Drugs. Additionally, another graph is built for all the interesting activities, named SAG_All. Ultimately, the ranking algorithms are applied to rank and to detect the most influential HS.
Concerning content-based and hybrid approaches, Anwar and Abulaish (2015) developed an algorithm to rank and detect the influential leaders of radical groups in Darknet forums. They extracted a set of features that measure the radicalness of the users. Then, they developed an algorithm based on PageRank to build a ranked list of radically influential users. Cossu et al. (2015) proposed an algorithm to detect influence in the Twitter social network. A directed graph was used to represent the network, where the nodes refer to the users and the edges capture the “following” relation between them. Each Twitter user was described by features extracted from the content, such as the characteristics of the tweets, and from the properties of the constructed graph, like the degree (Seidman, 1983) and the betweenness centrality (Freeman et al., 1979). However, to the best of our knowledge, none of the methods discussed above has been applied to rank and detect the influential HS in the Tor network. To fill this gap, we propose the ToRank algorithm, which belongs to the link-based family. We compared its behavior on the network of onion domains with three ranking algorithms from the same family, namely PageRank, HITS, and Katz. We selected those algorithms because of their fundamental role in the majority of the link-based approaches discussed above. Investigating content-based and hybrid approaches is out of the scope of this work and will be considered in future work.
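To make the link-based family more concrete, the following is a minimal sketch of one of the baselines named above, PageRank, computed by power iteration on a directed graph of hyperlinks. It is not ToRank and not the implementation used in the experiments; the function name `pagerank`, the damping value 0.85, and the toy `.onion` names are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank over (source, target) hyperlink pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}
    out_links = {n: set() for n in nodes}
    for src, dst in edges:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new_rank[w] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: three domains link to "market.onion".
toy_edges = [
    ("a.onion", "market.onion"),
    ("b.onion", "market.onion"),
    ("c.onion", "market.onion"),
    ("market.onion", "a.onion"),
]
scores = pagerank(toy_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the domain that receives the most incoming hyperlinks ends up first in `ranking`, which is the intuition that link-based methods share.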
The interpretation of what “influence” means varies according to the pursued goal. In viral marketing, the opinion leaders who can convince their audience of a point of view regarding a product, a service, or even an idea are considered influential (Gohari and Mohammadi, 2014). In the field of terrorist networks, in contrast, influence could refer to detecting people who are connected with the majority of the network users in a decentralized network, such as the financial managers (Berzinji et al., 2012). Eliacik and Erdogan (2018) analyzed social networks and micro-blogging communities to recognize users who are able to change the decisions of others, by means of a sentiment analysis algorithm. In such a context, the cluster of those social actors represents the influential bloc. Anger and Kittl (2011) evaluated the influence of users by their social networking potential, a score that captures the amount of interaction that a user receives from his or her followers with respect to all the published tweets. Taha and Yoo (2017) employed the existence dependency concept on a network of criminals.
In the Tor network, as in the Surface Web (COCKBURN and MCKENZIE, 2001; Levene, 2011), a user surfs the network by moving from one domain to another, thanks to the hyperlinks connecting the web pages, until he or she reaches a market-leading domain; such a domain is interpreted as influential in this context. In this paper, we employ social network analysis techniques and algorithms to establish a link-based approach that analyzes the hyperlinks between the onion domains in order to rank them and to recognize the influential ones. However, to the best of the authors’ knowledge, there is no ground-truth ranking against which to judge the correctness of a given order. Alternatively, and for consistency with previous studies (Booker, 2012; Fronzetti Colladon and Gloor, 2018; Fronzetti Colladon and Vagaggini, 2017), the influence of a given domain is interpreted as the amount of disruption that removing that domain can cause to the connectivity of the network. This criterion is related to network robustness, which is defined as the ability of a network to retain its system structure intact despite being exposed to perturbations, e.g. the removal of the influential nodes (Holmgren, 2007). This interpretation is backed up by a considerable amount of literature that has already proposed algorithms and techniques to attack network robustness effectively (Chaurasia and Tiwari, 2013; Duijn et al., 2014; Zhang et al., 2015; Memon and Larsen, 2006). This paper targets the robustness of the Tor network and tests its destabilization cost by the effect of eliminating the top-ranked nodes, for the purpose of detecting influential domains.
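The removal-based criterion described above can be sketched as follows. This is only an illustration under stated assumptions: it measures a single robustness indicator, the size of the giant component, it treats the component as weakly connected (edge direction is ignored), and the helper names `giant_component_size`, `remove_nodes`, and `disruption`, as well as the toy graph, are hypothetical. It assumes a non-empty graph.

```python
from collections import defaultdict


def giant_component_size(nodes, edges):
    """Size of the largest weakly connected component (direction ignored)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def remove_nodes(nodes, edges, removed):
    """Return the graph without the removed nodes and their incident edges."""
    removed = set(removed)
    kept_nodes = [n for n in nodes if n not in removed]
    kept_edges = [(s, d) for s, d in edges if s not in removed and d not in removed]
    return kept_nodes, kept_edges


def disruption(nodes, edges, ranking, k):
    """Relative drop in giant-component size after removing the top-k nodes."""
    before = giant_component_size(nodes, edges)
    kept_nodes, kept_edges = remove_nodes(nodes, edges, ranking[:k])
    after = giant_component_size(kept_nodes, kept_edges)
    return (before - after) / before


# Hypothetical toy graph: "hub" connects four otherwise isolated domains.
toy_nodes = ["hub", "a", "b", "c", "d"]
toy_edges = [("a", "hub"), ("b", "hub"), ("hub", "c"), ("hub", "d")]
good_ranking = ["hub", "a", "b", "c", "d"]
poor_ranking = ["a", "hub", "b", "c", "d"]
good_score = disruption(toy_nodes, toy_edges, good_ranking, 1)
poor_score = disruption(toy_nodes, toy_edges, poor_ranking, 1)
```

Under this criterion, a ranking is considered better when removing its top-ranked nodes yields a larger disruption, so `good_ranking` beats `poor_ranking` on the toy graph.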
3. Onion domains dataset
3.1. Why build the DUTA dataset
The first component of the proposed monitoring pipeline is responsible for separating the suspicious domains from the normal ones and then classifying them into categories. Initially, this component was a keyword-based system that filtered the traffic according to a predefined list of keywords set by experts. However, besides the difficulty of keeping such a system up to date, it could produce a high error rate: false positives due to polysemous words in the keyword list, and many false negatives due to the incompleteness of that list. The alternative solution was to build an automatic text classification system based on supervised Machine Learning algorithms. Such a system needs to be trained on labeled samples of each category of activities, and for this purpose INCIBE, in collaboration with the Spanish Police Forces, granted the authors access to the keyword lists and pre-classified samples of each category. This allowed the authors to get an idea of the possible content of each category and, consequently, to introduce the first version of the “Darknet Usage Text Addresses” (DUTA) dataset (Al Nabki et al., 2017a).
3.2. Why extend the DUTA dataset
Although the lifespan of some DUTA domains is short, as they might be taken down by some LEA or closed by their hosts, the value of DUTA is preserved. Apart from giving insights into Tor in a specific period, it can be used to research several problems, including, but not limited to, the detection of emerging products in the onion domains (Al Nabki et al., 2017b), image classification (Fidalgo et al., 2017), text summarization (Joshi et al., 2018), or the recognition of onion domain services (Biswas et al., 2017). Therefore, we decided to extend DUTA, and we collected new onion domains between May and July 2017.
3.2.1. DUTA-10K extension procedure
The Tor metrics website indicated that there were more than 100K HS alive at the time of writing this paper. However, for security reasons, the Tor network does not have a public DNS server where all the HS addresses are registered. Instead, it uses Hidden Service Directories (HSDirs): an HSDir is a Tor relay that acts as a middle point between an HS, which publishes its descriptors there, and the clients, who communicate with it to learn the addresses of the HS’s introduction points (Biryukov et al., 2013, 2014). However, a Tor relay needs a specific flag, assigned by the Tor authorities, to function as an HSDir. Instead of asking for that flag, we built a crawler that searched the Web for new onion addresses.
To extend DUTA, we incorporated more onion addresses by searching in different sources. First, we developed a customized crawler that looks for onion addresses in three resources: 1) online notepad services in the Surface Web, 2) the search engines of the Tor network, and 3) the hyperlinks of the DUTA dataset. Then, we detected the addresses using a parser that employs a regular expression pattern to match the onion ones.
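A parser of this kind could look like the sketch below. The exact pattern used for DUTA-10K is not given in the text, so `ONION_PATTERN` is an assumption: it matches 16-character base32 labels (the version 2 address format) and, for completeness, 56-character labels (the version 3 format), followed by `.onion`. The function name and the sample addresses are hypothetical.

```python
import re

# Assumed pattern: a base32 label of 56 or 16 characters followed by ".onion".
# The lookbehind rejects labels that are only the tail of a longer base32 run.
ONION_PATTERN = re.compile(
    r"(?<![a-z2-7])([a-z2-7]{56}|[a-z2-7]{16})\.onion\b",
    re.IGNORECASE,
)


def extract_onion_addresses(text):
    """Return the sorted, de-duplicated, lower-cased onion addresses in text."""
    return sorted({m.group(1).lower() + ".onion" for m in ONION_PATTERN.finditer(text)})


# Hypothetical scraped paste; the addresses are made up for illustration.
sample_paste = (
    "Market: http://abcdefghijklmnop.onion/index.html\n"
    "Mirror: HTTP://ABCDEFGHIJKLMNOP.onion\n"
    "Surface link: https://example.com and a bad one: short.onion\n"
)
found = extract_onion_addresses(sample_paste)
```

Lower-casing before de-duplication makes the two spellings of the same address collapse into one entry, which matters when the same link is posted on several pastes.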
The Surface Web has plenty of addresses that are posted by anonymous people on public notepad websites, such as Pastebin [9]. Fortunately, Pastebin is powered by the Google search engine, which allowed us to search for onion addresses by typing keywords like “Onion Addresses”, “hidden services 2018”, “darknet links 2019”, and “.onion links”. We scraped the content of the retrieved pastes and parsed the onion addresses. Additionally, we used Tor network search engines, such as Ahmia.fi and onion.link, querying them with generic terms like “onion address”, “Tor services”, “Tor markets”, and “Tor products”. Then, we scraped the retrieved onion web pages and parsed the onion addresses. Finally, we used the content of the DUTA samples to parse new HS addresses; then we scraped their content and iteratively repeated this process two more times. Those three strategies returned 124589 new unique onion addresses, but only 3536 of them were active at that moment, while the majority were down, i.e. they returned a connection time-out error message. For each active onion domain, we crawled the root page and the sub-pages one level deep. Next, we concatenated the pages of each domain into a single HTML file. Therefore, the collected samples represent a real case of the domains in the Tor network, without any bias towards a specific category. Moreover, we saved each sample of DUTA-10K in a textual format after removing the HTML tags. To this end, we used a simple workaround: we loaded the HTML pages with Lynx [10], a text-based web browser, and then saved the cleaned text into a text file. We extended DUTA by adding the newly collected samples, i.e. the 3536 onion domains, and denoted the result DUTA-10K because it holds 10367 unique onion addresses.
[9] https://pastebin.com/
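The concatenation and tag-removal step can be approximated as in the sketch below. It does not reproduce the Lynx workaround described above; instead, as a stand-in, it uses the `html.parser` module of the Python standard library. The class `TextExtractor`, the function `domain_to_text`, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def domain_to_text(pages):
    """Concatenate the HTML pages of one domain and return their plain text."""
    parser = TextExtractor()
    parser.feed("\n".join(pages))
    parser.close()
    return parser.text()


# Hypothetical root page and one first-level sub-page of the same domain.
sample_pages = [
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Shop</h1><p>Price: 5 &amp; up</p></body></html>",
    "<p>Contact <b>us</b></p><script>var x = 1;</script>",
]
plain_text = domain_to_text(sample_pages)
```

A text-based browser such as Lynx also lays out tables and links, so its output will differ from this simplified extraction; the sketch only illustrates the idea of one plain-text sample per domain.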
3.2.2. DUTA-10K labeling procedure
To ease the labeling process of the new samples, we split this task into two phases. The first one uses the previously proposed text classifier to separate the suspicious activities from the normal ones; we then validate the assigned labels manually, one by one. The second phase is a manual labeling of the samples that were classified as Others. We followed the same rules as in the previous version of DUTA, which can be summarized in the following points: 1) an author labels a domain based on user-visible textual content only; 2) a domain must receive one single tag according to its activity, and in case it holds more than one activity, it is tagged as a black or white marketplace depending on the suspiciousness of the activity; and 3) if any author hesitates about a tag, an open discussion is established with the rest of the authors.
The DUTA-10K samples are distributed over 25 classes, with some small changes in their names compared to the original DUTA. In DUTA-10K, we merged the “Leaked-data” category into “Fraud” because both of them had a small number of samples and are related to the same topic. In the same way, we merged the category “Wiki” with the category “Hosting/Directory”. Additionally, DUTA-10K introduces a new category, CryptoLocker, for domains used to pay a ransom to decrypt a machine infected with ransomware, such as the WannaCry virus (Mitchell, 2017) (Table 1). Thanks to the collaboration of Spanish LEAs with INCIBE in developing solutions for monitoring the suspicious activities of the onion domains, INCIBE provided us with a list of activities that are considered of interest by one of the main Spanish LEAs. Accordingly, we tagged the activities related to this list as Suspicious Activities, whereas the rest were denoted as Normal Activities.
[10] http://lynx.browser.org/
Table 1 also shows a third activity type, named Unknown, which contains HS whose content could not be assessed by the crawler. It comprises three domain categories. (i) Locked, domains that require human interaction, such as solving a CAPTCHA or entering log-in credentials, to be accessed. (ii) Empty, domains without text or with graphical content only. Also, based on our previous work (Al Nabki et al., 2017a), we assigned to this category very small HS, whose text amounts to 5 words or fewer. (iii) Down, when the crawler returns an error while downloading the textual content. For example, many HS require JavaScript, which was disabled in the crawler for security reasons.
Every sample of the dataset contains the HTML code of the web page, the language [11], and the activity assigned by the authors. Some activities were divided into sub-activities, and, therefore, subcategory labels were assigned to them. For example, the document forging activity has a main activity named Counterfeit Personal Identification with three sub-activities: Passport, ID, and Driving License.
3.3. Statistical analysis of DUTA-10K
To gain a deeper understanding of the onion domains, we carried out a statistical analysis of DUTA-10K with respect to: 1) the distribution of activities, 2) the replication of domain content, and 3) the languages used.
3.3.1. Activities distribution
Out of the 10367 samples of DUTA-10K, we found that the suspicious activities represent 20% of the Tor HS and the normal ones 48%. The third type, i.e. the content which cannot be accessed, forms 32%. Drugs trading is the most popular suspicious activity on the onion domains, representing 23% of the suspicious HS. It is followed by Credit Cards Counterfeit and Hacking, with 20% and 10% of the suspicious HS, respectively. Conversely, the Human-Trafficking activity has the lowest presence, with only 0.2% (Fig. 4).
For the normal activities (Fig. 5), we found that the HS in the Hosting & Software category, which includes the hosting servers, represent 47% of the normal onion domains. In second position, 17% of the domains are related to Cryptocurrency and Bitcoin trading. Next, the category Personal represents 12% of the DUTA-10K normal activities. Political websites have the lowest count, with 0.2%.
[11] The LangDetect 1.0.7 library was used for the language detection.

Table 1: DUTA-10K dataset activities distribution. The prefix "C." denotes a Counterfeit activity.

  Activity                    Sub-Activity       #HS
Suspicious Activities (20%; 2013 HS)
  Pornography                 Child              105
  Pornography                 Adult              148
  Drugs                       -                  465
  Violence                    Hate                19
  Violence                    Hitman              28
  Violence                    Weapons             48
  C. Credit Cards             -                  399
  C. Money                    -                   83
  C. Personal Identification  Passport            48
  C. Personal Identification  ID                  20
  C. Personal Identification  Driving-License      4
  Hacking                     -                  205
  Cryptolocker                -                  185
  Marketplace                 Suspicious         127
  Services                    Suspicious          20
  Forum                       Suspicious          63
  Fraud                       -                   43
  Human-Trafficking           -                    3
Normal Activities (48%; 5016 HS)
  Art & Music                 -                   15
  Casino & Gambling           -                   29
  Services                    Normal             341
  Cryptocurrency              -                  868
  Forum                       Normal             163
  Marketplace                 Normal             138
  Library & Books             -                   45
  Personal                    -                  616
  Politics                    -                   12
  Religion                    -                   21
  Hosting & Software          File-Sharing       205
  Hosting & Software          Folders             99
  Hosting & Software          Search-Engine       92
  Hosting & Software          Server            1416
  Hosting & Software          Software           411
  Hosting & Software          Directory          133
  Social-Network              Blog               219
  Social-Network              Chat                79
  Social-Network              Email               72
  Social-Network              News                42
Unknown (32%; 3338 HS)
  Down                        -                  864
  Empty                       -                 1653
  Locked                      -                  821
Total sum                                      10367
Along with the insights that can be drawn from this analysis, we used the activities distribution of DUTA-10K, i.e. the networks of the suspicious and the normal domains, to evaluate ToRank and the benchmark algorithms. Hence, we investigated the detection of the most influential domains among both the suspicious and the normal ones.
Figure 4: The distribution of the suspicious activities in DUTA-10K.
Figure 5: The distribution of the normal activities in DUTA-10K.
3.3.2. Domains content replication
During the process of labeling DUTA-10K, we detected that some samples had identical or quasi-identical textual content but were hosted under different addresses. Quasi-identical samples are HS whose text becomes exactly identical after preprocessing, for example by removing the PGP signature, the dates, or the prices of the products. We obtained the MD5 hash (Rivest, 1992) of each cleaned onion domain and grouped the domains into the 25 defined categories. After that, we found only 5368 unique texts based on their hashes, i.e. 51% of the DUTA-10K samples; the replicated texts have between 2 and 496 copies each.
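The clone detection described above can be sketched as follows. The cleaning rules are not specified in detail in the text, so the three regular expressions below (PGP signature blocks, ISO-style dates, and prices preceded by a currency symbol) are assumptions, as are the function names and the sample texts.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical cleaning rules; the real preprocessing may differ.
PGP_BLOCK = re.compile(
    r"-----BEGIN PGP SIGNATURE-----.*?-----END PGP SIGNATURE-----", re.DOTALL
)
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PRICE = re.compile(r"[$€£]\s?\d+(?:[.,]\d+)?")


def clean(text):
    """Drop volatile fragments, then normalize case and whitespace."""
    for pattern in (PGP_BLOCK, DATE, PRICE):
        text = pattern.sub(" ", text)
    return " ".join(text.lower().split())


def group_clones(samples):
    """Map the MD5 digest of each cleaned text to the addresses sharing it."""
    groups = defaultdict(list)
    for address, text in samples.items():
        digest = hashlib.md5(clean(text).encode("utf-8")).hexdigest()
        groups[digest].append(address)
    return groups


# Hypothetical samples: the first two differ only in price and date.
toy_samples = {
    "a.onion": "Best pills $10.50 updated 2017-05-01",
    "b.onion": "Best pills $12 updated 2017-06-15",
    "c.onion": "Library of books",
}
clone_groups = group_clones(toy_samples)
unique_texts = len(clone_groups)
```

Here MD5 is used only as a fingerprint for exact-match grouping, as in the text, not for any security purpose; the ratio of `unique_texts` to the number of samples corresponds to the share of unique content discussed above.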
In Fig. 6(a), it can be noticed that the majority of the suspicious HS are cloned and appear several times under different onion addresses. For example, only 60% of the HS labeled as Drugs have unique content, and only 10% of the Cryptolocker domains are unique. In contrast, the normal activities domains present the reverse behavior, as shown in Fig. 6(b). However, the Hosting category is an exception because only 40% of its domains are unique. The high number of replications in this category is due to a hosting company called “Freedom Hosting” [12], which accounts for 35% of those clones. The high number of suspicious HS clones possibly reflects the concern of their owners to provide smooth access for their customers in case any LEA takes some of their domains down.
3.3.3. Language analysis
We did not find representative differences between normal and suspicious activities in terms of the DUTA-10K language analysis. We detected 38 languages in DUTA-10K domains, but only five of them, which are illustrated in Fig. 7, have a frequency higher than or equal to 1%. English is the most common language, used in 84% of the samples, followed by Russian with 6%. From the point of view of a researcher or a LEA, this finding means that training a language model only on an English corpus would be sufficient to cover the majority of the Tor HS.
4. Methodology
The ranking procedure starts by constructing a Suspicious Activities Graph (SAG) that holds the HS and their interconnections. Then, we apply ToRank to rank the onion domains and to detect the influential ones.
4.1. Onion domains representation
Due to our collaboration with INCIBE and the interest of Spanish LEAs in certain activities carried out in Tor HS, i.e. the suspicious ones, we focus our effort on ranking only the 13 classes of DUTA-10K labeled as suspicious in Table 1.
12 https://en.wikipedia.org/wiki/Freedom_Hosting
Figure 6: Illustration of the replication of Tor HS: (a) Suspicious Activities, (b) Normal Activities. The percentages represent the number of unique domains in the corresponding category. The majority of the suspicious services (upper chart) tend to have duplicated copies of their HS, while the majority of the normal ones (bottom chart) have unique content.
Figure 7: The languages used in onion domains of DUTA-10K.
4.1.1. Hyperlinks extraction
For each onion domain in DUTA-10K, we extracted only the incoming and outgoing HTTP and HTTPS hyperlinks. Next, we removed the ones pointing to the Surface Web, with the purpose of focusing our analysis only on the onion domains. We also excluded duplicated links, i.e. those with the same source and destination, to avoid having a multi-graph. We found that some of the suspicious activity domains were referenced by, or were pointing to, web pages within Tor that either were not related to the designated suspicious categories or did not exist in DUTA-10K. In both cases, we added those nodes to the graph; when the sample exists in DUTA-10K, we labeled it based on the DUTA-10K classification, and otherwise we assigned it a different label called “new node”. In the end, each domain ended up with two lists of hyperlinks: one for the addresses that were referencing it, and another containing the domains inside Tor that this domain was pointing to.
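The extraction step above could look roughly like the following sketch. The regular expression and the function names are assumptions made for illustration (the paper only states that a regular expression library was used); the pattern matches 16-character (v2) or 56-character (v3) onion hosts and ignores everything else, which also discards Surface Web links.

```python
import re

# Assumed pattern: an HTTP or HTTPS link whose host is a bare onion address.
ONION_LINK = re.compile(
    r"https?://([a-z2-7]{16}|[a-z2-7]{56})\.onion", re.IGNORECASE
)


def outgoing_links(html):
    """Return the set of onion domains referenced in a page (no duplicates)."""
    return {m.group(1).lower() + ".onion" for m in ONION_LINK.finditer(html)}


def link_lists(pages):
    """Build the incoming and outgoing link sets for every seen domain."""
    outgoing = {src: outgoing_links(html) for src, html in pages.items()}
    incoming = {src: set() for src in pages}
    for src, targets in outgoing.items():
        for dst in targets:
            # Targets outside the dataset are still recorded as nodes.
            incoming.setdefault(dst, set()).add(src)
    return incoming, outgoing
```

Using sets for both lists means that repeated links between the same source and destination collapse into one, which matches the stated goal of avoiding a multi-graph.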
4.1.2. Interesting activity graph implementation
Once the hyperlinks for the suspicious activities were extracted, we modeled the Tor network as a directed graph and denoted it as “Suspicious Activity Graph” (SAG), as shown in Fig. 8. The $SAG = (N, E)$ model is composed of a set of nodes denoted by $N$, which are the HS, and a set of edges $E$, which correspond to their hyperlinks. A new edge is created from an onion domain $A$ to an onion domain $B$ either if $A$ has referenced $B$ or if $B$ has been referenced by $A$ at least once.
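A minimal sketch of this graph model, using plain Python containers instead of the NetworkX structures mentioned later in the paper, is shown below. The function name `build_sag` and the argument layout are hypothetical; the “new node” label follows the labeling rule described in Section 4.1.1.

```python
def build_sag(outgoing, categories):
    """Build a directed graph as (labeled nodes, set of edges).

    outgoing:   dict mapping a source domain to the set of domains it links to
    categories: dict mapping dataset domains to their activity label
    """
    nodes = {}
    edges = set()  # a set keeps at most one edge per (source, destination)
    for src, targets in outgoing.items():
        nodes.setdefault(src, categories.get(src, "new node"))
        for dst in targets:
            # Domains that are not in the dataset receive the fallback label.
            nodes.setdefault(dst, categories.get(dst, "new node"))
            edges.add((src, dst))
    return nodes, edges
```

For example, `build_sag({"a": {"b", "c"}, "b": {"a"}}, {"a": "Drugs", "b": "Hacking"})` yields three nodes, with `"c"` labeled as a new node, and three directed edges.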
Figure 8: A snapshot of the DUTA-10K $SAG_{All}$, where the dots correspond to HS. Each color represents an activity in DUTA-10K. The gray lines reflect the hyperlinks between the domains. Due to the very high density of connections in the center of the graph, it appears as a gray spot.
4.2. ToRank algorithm
The objective of the proposed algorithm, ToRank, is to identify the most influential node in a graph by measuring the number of nodes to which traffic can be propagated or from which it can be received. The algorithm consists of two phases: a weights initialization and a weights update. The algorithm starts by assigning an initial weight to each node. Given an SAG with $N$ nodes, the initial weight of the node $n \in N$ is computed using Equation (1).
$$W_n = D^{i}_{n} + D^{o}_{n} \tag{1}$$
where $W_n$ is the initial weight of node $n$, and $D^{i}_{n}$ and $D^{o}_{n}$ correspond to the in-degree and the out-degree of the node $n$, respectively.
Next, we accumulate the weight of $n$'s followers, which are the nodes that point to it, and the weight of its followings, the nodes that are referenced by $n$. Finally, the weights are calculated again for each node using Equation (2). A rank value, $TR_n$, is assigned according to the weight of the node, such that the higher weight corresponds to the higher rank.
$$TR_n = W_n \log(1 + \alpha W_{fr} + \beta W_{fw}) \tag{2}$$
where $W_{fr}$ and $W_{fw}$ are the accumulated weights of the followers and the followings of the node $n$, respectively. The parameters α and β control the contribution of the follower and the following nodes to the weight of $n$. Influenced by the PageRank formula, the neighboring nodes do not have an equal impact; however, both kinds are still valuable factors to identify the influence of the node $n$. The role of α and β is to capture this difference in the weight of the following and the follower nodes. When both are set to zero, $TR_n$ would be exactly the degree centrality, which is not a recommended approach (see the description of the degree centrality in Section 4.4). In contrast, when both are equal to one, the formula will not reflect the intended purpose of giving a higher importance to the follower nodes. During the labeling process, we observed the existence of some nodes that function like directories or wiki pages: they point to hundreds or even thousands of domains in the network but are referenced by zero or very few domains. The removal of such nodes would fragment the graph strongly; unfortunately, their detection is of little value, as they are not carrying out any suspicious activity. Hence, in the Tor network, ToRank recommends assigning a high value to α to increase the weight of the follower nodes but a low value to β to decrease the impact of the following nodes.
ToRank is intended to detect the most important nodes in the Tor network, but there is no ground truth that identifies them. The straightforward approach is to use the degree centrality, but its main limitation is that it does not consider the graph structure. We use the degree centrality to initialize the weights of the nodes in a graph. Then, to overcome the shortcoming of the degree centrality, we propose the second formula of ToRank to incorporate the neighbors' degrees as well. ToRank applies the logarithm function to the neighbors' weights so that they do not overshadow the weight of the studied node. Therefore, in the ToRank algorithm, the rank value does not depend exclusively on the weight of the studied node, because it also considers the weights of its neighbors. Consequently, the weights of those neighbors depend on their neighbors, in cascade. The weight $W_n$ of the node $n$ is thus calculated in accordance with the weights of its neighbors. The logarithm function is used to counteract the skewness towards the nodes that have a very high degree, and the first term of its argument, the number one, is added to avoid the indeterminate form when the remaining terms of the log argument are zero.
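The two phases can be sketched in a few lines of Python. Several points are assumptions rather than statements from the paper: the accumulated weights $W_{fr}$ and $W_{fw}$ are taken as plain sums of the neighbors' initial weights, the logarithm is the natural one, and Equation (2) is read literally as the product of $W_n$ and the logarithmic term. If a different operator is intended, only the last line of the loop changes. The function name `torank` is hypothetical; the defaults for α and β are the values reported in Section 5.1.

```python
import math


def torank(edges, alpha=0.9, beta=0.2):
    """Sketch of ToRank over an iterable of (source, destination) edges."""
    followers = {}   # node -> nodes that point to it
    followings = {}  # node -> nodes it points to
    for src, dst in edges:
        followings.setdefault(src, set()).add(dst)
        followers.setdefault(dst, set()).add(src)
        followings.setdefault(dst, set())
        followers.setdefault(src, set())

    # Phase 1, Equation (1): initial weight = in-degree + out-degree.
    weight = {n: len(followers[n]) + len(followings[n]) for n in followers}

    # Phase 2, Equation (2), read as a product (assumption, see lead-in).
    rank = {}
    for n in weight:
        w_fr = sum(weight[m] for m in followers[n])   # assumed: plain sum
        w_fw = sum(weight[m] for m in followings[n])  # assumed: plain sum
        rank[n] = weight[n] * math.log(1 + alpha * w_fr + beta * w_fw)
    return rank
```

Sorting the nodes by decreasing `rank` value gives the ordering used for the removal experiments. Each phase is a single pass with no iteration to convergence, which is the property contrasted with PageRank and HITS in Section 6.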
4.3. Benchmark link analysis algorithms
4.3.1. PageRank
PageRank was developed by Page et al. (1999) and is considered an enhanced version of the in-degree centrality. It assumes that an influential node is likely to receive more links from other influential nodes. The influence of a given node $i$ is calculated iteratively using Equation (3), and the iteration stops when it converges or reaches a maximum number of iterations.
$$PR(i) = (1 - d) + d \sum_{j \in B(i)} \frac{PR(j)}{N_j} \tag{3}$$
where $d$ is the damping factor, which indicates the probability that a random surfer will continue navigating the graph nodes rather than stop, $i$ and $j$ are nodes of a directed graph $G$, $B(i)$ is the set of nodes that point to $i$, and $PR(i)$ and $PR(j)$ are the rank scores of the nodes $i$ and $j$, respectively. $N_j$ indicates the number of outgoing links of the node $j$. A high rank value reflects a higher influence of a node over the other nodes.
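For reference, Equation (3) can be iterated directly as in the sketch below. It is not the NetworkX implementation used in the experiments: the starting value of 1.0, the L1 stopping criterion and the parameter names `max_iter` and `tol` are choices made here for illustration, and the edge list is assumed to contain no duplicates.

```python
def pagerank(edges, d=0.85, max_iter=100, tol=1e-6):
    """Iterate PR(i) = (1 - d) + d * sum(PR(j) / N_j for j in B(i))."""
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    out_degree = {n: 0 for n in nodes}     # N_j
    pointing_to = {n: [] for n in nodes}   # B(i)
    for src, dst in edges:
        out_degree[src] += 1
        pointing_to[dst].append(src)

    pr = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new = {
            i: (1 - d) + d * sum(pr[j] / out_degree[j] for j in pointing_to[i])
            for i in nodes
        }
        change = sum(abs(new[n] - pr[n]) for n in nodes)
        pr = new
        if change < tol:  # stop once consecutive iterates barely differ
            break
    return pr
```

A node without incoming links ends with the minimum score of `1 - d`, since its sum over $B(i)$ is empty.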
4.3.2. HITS
The Hyperlink-Induced Topic Search (HITS) algorithm (Kleinberg, 1999) measures the importance of the nodes recursively by assigning two mutually dependent scores to each node: a hub score $hub_i$, which grows higher if the node references many nodes with high authority scores, and an authority score $auth_i$, which increases if the node is referenced by many nodes with high hub scores. The recursion is thus defined as follows: good hubs are those nodes that reference many good authorities, and good authorities are those referenced by many good hubs. Each node $i$ in a graph $G$ holds two non-negative scores, an authority score $auth_i$ and a hub score $hub_i$, which are initialized with arbitrary nonzero values. Next, the scores are updated iteratively until convergence, i.e. until they reach a stationary solution. Equation (4) shows the update of the authority scores and the update of the hub scores, which capture the intuitive notions behind the HITS algorithm.
$$auth^{(k)}_{i} = \sum_{j \to i \,\in\, E} hub^{(k)}_{j}, \qquad hub^{(k)}_{i} = \sum_{i \to j \,\in\, E} auth^{(k)}_{j} \tag{4}$$
where $j$ is a node in $G$ and $i \to j$ indicates a hyperlink from the node $i$ to the node $j$ in the graph edge set $E$. $k$ is an iteration index that starts from 1 and increases towards infinity, but in practice the loop is stopped when there is no significant change between consecutive iterations or when a maximum number of iterations is reached.
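The updates of Equation (4) can be sketched as below. Two details are assumptions that the text does not spell out: the authority scores are updated first and the hub scores then use the fresh authority values, and both score vectors are rescaled to unit Euclidean length after each round so that they stay bounded. The names `hits` and `_normalize` are illustrative.

```python
import math


def _normalize(scores):
    """Scale a score dictionary to unit Euclidean norm (assumed step)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(edges, max_iter=100, tol=1e-8):
    """Return (authority, hub) score dictionaries for a directed edge list."""
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new_auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:          # authority: sum of hubs pointing in
            new_auth[dst] += hub[src]
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:          # hub: sum of authorities pointed to
            new_hub[src] += new_auth[dst]
        new_auth, new_hub = _normalize(new_auth), _normalize(new_hub)
        change = sum(
            abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes
        )
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return auth, hub
```

Because the algorithm yields two score vectors, it produces two rankings per graph, which is why the results report separate authority and hub variants.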
4.3.3. Katz
Katz centrality, introduced by Katz (1953), measures the centrality of a node by assigning a score that depends on its first-degree neighbors and on the nodes connected with them. In mathematical form, the rank is calculated according to Equation (5).
$$k_i = \alpha \sum_{j} A_{ij} k_j + \beta \tag{5}$$
where $k_i$ and $k_j$ are the Katz centrality values of the nodes $i$ and $j$ in a given graph $G$, and $A_{ij}$ is an entry of the adjacency matrix of $G$, which captures the connectivity of the nodes. β corresponds to the initial centrality and α corresponds to the attenuation factor.
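Equation (5) can be solved by fixed-point iteration, as in the following sketch. The orientation of the adjacency matrix is an assumption: here $A_{ij} = 1$ when node $j$ links to node $i$, so a node inherits centrality from the nodes that point to it. The iteration only converges when α is small enough relative to the graph, and the function name and stopping rule are illustrative.

```python
def katz(edges, alpha=0.1, beta=1.0, max_iter=1000, tol=1e-8):
    """Iterate k_i = alpha * sum_j A_ij * k_j + beta until it stabilizes."""
    edges = list(edges)
    nodes = {n for edge in edges for n in edge}
    k = {n: 0.0 for n in nodes}
    for _ in range(max_iter):
        new = {n: beta for n in nodes}
        for src, dst in edges:
            # Assumed convention: centrality flows along the link direction.
            new[dst] += alpha * k[src]
        change = sum(abs(new[n] - k[n]) for n in nodes)
        k = new
        if change < tol:
            break
    return k
```

With this convention, a node that nobody references keeps the base score β, and every incoming link adds an attenuated share of the referencing node's score.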
4.4. Graph robustness metrics
Fronzetti Colladon and Gloor (2018) carried out a comprehensive study regarding graph robustness and stability metrics. Below, we briefly describe the ones that we used to evaluate the benchmark algorithms.
4.4.1. Degree centrality
Based on the number of nodes that are direct neighbors of a given node, the degree centrality measure takes three different values in directed graphs: in-degree, out-degree, and degree. The in-degree counts the number of incoming links to a given node, whereas the out-degree counts the number of outgoing ones (Seidman, 1983). The degree of a node is calculated as the sum of the in-degree and the out-degree values. However, this approach ignores the global structure of the graph and focuses only on the direct neighbors of the targeted node (Wei et al., 2018). To overcome this limitation, Srinivas and Velusamy (2015) proposed an enhanced version of the degree centrality that incorporates the clustering coefficient.
4.4.2. Graph density
It is defined as the number of existing edges over the number of possible ones. Hence, the more connected the graph is, the higher the density, and vice versa. When the graph is fully connected, the density is equal to 1, and it is equal to 0 when the graph has no edges. The density of a directed graph $G$ is computed as shown in Equation (6), where $E$ is the number of edges and $N$ is the number of nodes in $G$.
$$D = \frac{E}{N(N - 1)} \tag{6}$$
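As a quick worked instance of Equation (6), and assuming that the node and edge counts reported for the full suspicious graph in Table 3 ($N = 2908$, $E = 14511$) are the values plugged into the formula, the density of that graph before any node removal would be:

$$D = \frac{14511}{2908 \times 2907} = \frac{14511}{8453556} \approx 1.72 \times 10^{-3}$$

In other words, under this assumption only about 0.17% of the possible directed edges are present.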
4.4.3. Average shortest path (ASP)
It refers to the average length of the shortest paths
along all possible pairs of network nodes (Mao and
Zhang, 2017).
4.4.4. Diameter (Dim)
It is defined as the longest path among the shortest paths between any two nodes in a given graph (Ye et al., 2010). The removal of central nodes that occupy a core location in the graph would increase the shortest paths and, consequently, the diameter of the graph.
4.4.5. Clustering coefficient (CC)
It measures the extent to which the nodes of a graph form triangles. It is calculated by dividing the number of closed triplets of nodes by the total number of connected triplets in the network (Watts and Strogatz, 1998).
4.4.6. Giant component (GC)
It refers to the largest fraction of nodes that are con-
nected, i.e. there exists a path between each pair of
nodes in that component. An attack to the graph robust-
ness could be measured by the reduction in the giant
component mass (Holme et al., 2002).
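The size of the giant component can be obtained with a breadth-first search, as in the sketch below. One assumption is made explicit: edge directions are ignored, so the function returns the largest weakly connected component. The paper does not state whether weak or strong connectivity is meant, and the function name is illustrative.

```python
from collections import deque


def giant_component_size(nodes, edges):
    """Size of the largest weakly connected component (assumed definition)."""
    neighbors = {n: set() for n in nodes}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # ignore direction

    seen = set()
    best = 0
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best
```

Comparing the value before and after deleting a set of top-ranked nodes gives the reduction in giant component size used as a disruption indicator.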
5. Experimental results
5.1. Experimental setting
The experiments were conducted on a PC with an Intel(R) Core(TM) i7 processor and 32 GB of RAM, running Windows 10. The domain addresses were extracted from DUTA-10K using the Regular Expression library (footnote 13). The SAG was constructed using the NetworkX library (footnote 14) with Python 3. For the graph visualization, we used the vis.js library (footnote 15). Concerning the ranking algorithms, we compared ToRank with the link-based ranking algorithms presented in Section 4.3, namely PageRank, HITS and Katz. We tuned all the methods by evaluating a range of values for each parameter, as shown in Table 2, and we selected the ones that achieved the highest performance in our experiments. ToRank has two configurable parameters, α and β, that were set empirically (we refer the reader to Section 4.2 for a more in-depth explanation of these parameters). After evaluating several configurations of α and β, we found that setting them to 0.9 and 0.2, respectively, achieves the best result.
Table 2: Evaluated values for the parameters of the ranking algorithms. Bolded numbers correspond to the selected configuration that achieved the lowest area under the GDC curve.
Algorithm name   Parameter   Evaluated values
PageRank         alpha       0.5, 0.70, 0.75, 0.80, 0.85, 0.90
                 max iter    10, 100, 1000, 10000
HITS             max iter    10, 100, 1000, 10000
Katz             alpha       0.01, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9
                 beta        0.1, 0.3, 0.5, 0.7, 0.9, 1.0
                 max iter    10, 100, 1000, 10000
ToRank           alpha       0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0
                 beta        0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0
5.2. Evaluation measure
Consistently with previous research (Booker, 2012;
Fronzetti Colladon and Gloor, 2018; Fronzetti Colladon
and Vagaggini, 2017), we employed several standard
metrics to judge the structure of the studied graph (see
their explanation in Section 4.4). More specifically,
concerning the density criterion, the evaluation proce-
dure starts by peeling away the top-ranked nodes one
13 https://pypi.python.org/pypi/regex
14 https://networkx.github.io/
15 http://visjs.org/
by one iteratively, and at every cycle, the graph density is evaluated. The iteration stops when the graph is completely disconnected, i.e. when the density is zero. We consider that the ranking algorithm that achieves the lowest area under the Graph Density Curve (GDC) corresponds to the algorithm that better measures the influence of a domain inside the Tor network. The GDC is thus used as a proxy to evaluate the graph robustness (Wang et al., 2014): the iterative removal of the top-ranked nodes with their edges should result in a reduction of the graph density. If the nodes are correctly ranked, removing the top-ranked nodes leads to a large reduction in the density, because the removal of such a node should cause a harmful fragmentation of the graph structure. Conversely, if a removed node is not influential, its removal will not break the graph, and hence it should not be ranked at the top of the list.
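The evaluation loop can be sketched as follows. The scaling of the area is an assumption: the paper does not state how the horizontal axis of the density curve is normalized, so this sketch uses a trapezoidal sum with one unit per removed node, and its absolute values are therefore not comparable with the GDC figures reported in the tables. All function names are illustrative.

```python
def density(n_nodes, n_edges):
    """Directed graph density, Equation (6); defined as 0 below two nodes."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))


def density_curve(nodes, edges, ranking):
    """Remove nodes in ranking order and record the density after each step."""
    remaining_nodes = set(nodes)
    remaining_edges = set(edges)
    curve = [density(len(remaining_nodes), len(remaining_edges))]
    for node in ranking:
        if not remaining_edges:  # graph fully disconnected: density is zero
            break
        remaining_nodes.discard(node)
        remaining_edges = {
            (s, t) for s, t in remaining_edges if s != node and t != node
        }
        curve.append(density(len(remaining_nodes), len(remaining_edges)))
    return curve


def area_under_curve(curve):
    """Trapezoidal area with unit spacing between removals (assumed scale)."""
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:]))
```

On a small star-like graph with edges `a -> hub`, `b -> hub` and `hub -> c`, removing `hub` first drops the density to zero at once, whereas removing the leaf nodes first makes the density rise before it falls, which gives a larger area. This is the same effect that Section 6 describes for rankings that place low-connectivity nodes at the top.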
Besides looking at the graph density, we consider the reduction in the size of the giant component (GC) and the decrease in the clustering coefficient (CC) as efficient indicators of the produced disruption. However, removing the nodes one by one is an expensive process due to the time needed to calculate the GC and the CC of the graph at every iteration. Alternatively, we analyze only the removal of the top 1st, 5th and 10th percentiles of the ranked nodes, and hence the GC and CC metrics are evaluated three times only. A higher decrease in the giant component size, the graph density, and the clustering coefficient reflects more disruption to the graph robustness (Chang, 2017; Iyer et al., 2013) and, consequently, a better ranking. Similarly, the diameter and the average shortest path measures can be used to test the robustness at multiple levels of top-ranked nodes removal (footnote 16) (Cohen and Havlin, 2010). An increase in the graph diameter and in the average shortest path indicates a better ranking; this increase is due to removing the top-ranked nodes from the graph. Hence, the higher the ASP and the Dim and the lower the CC and the GC are, the better the ranking algorithm.
5.3. Analysis of the suspicious activities in Tor
We created two types of SAG: first, a SAG for all the suspicious activities, and second, a SAG for every single suspicious activity in DUTA-10K. We denoted them by $SAG_{All}$ and $SAG_{X}$, respectively, where $X$ corresponds to the activity name; for example, $SAG_{Drugs}$ refers to the drugs HS. Table 3 shows the specifications of the $SAG_{All}$ and $SAG_{X}$ graphs.
16 We could not perform a one-by-one node removal due to the complexity of the metrics; hence, we did it over the top 1, 5, and 10 percentiles of the ranked nodes only.
Table 3: Details of the created Suspicious Activities Graphs. The #Activity domains column refers to the number of nodes related to the studied activity; the SAG #nodes and the SAG #edges columns represent the number of nodes and the number of edges in the corresponding SAG.
Activity name                #Activity domains   SAG #nodes   SAG #edges   Average degree
All                          2013                2908         14511        4.99
C. Credit Cards              399                 583          2622         4.49
Forum: Suspicious            63                  436          1527         3.50
Violence                     95                  240          795          3.31
C. Money                     83                  202          796          3.94
C. Personal Identification   72                  180          763          4.23
Marketplace: Suspicious      127                 389          1670         4.29
Drugs                        465                 743          4130         5.55
Hacking                      205                 402          1381         3.43
CryptoLocker                 185                 198          611          3.08
Services: Suspicious         20                  46           76           1.65
Pornography                  253                 686          2765         4.03
Fraud                        43                  145          386          2.66
Human-Tracking               3                   15           16           1.06
Fig. 9 shows five different Graph Density Curves (GDC) of $SAG_{All}$ for the four ranking algorithms (footnote 17). Following our previous reasoning, ToRank outperforms the other methods because it achieves the lowest area under the GDC, with a value of 1.31, and also presents a very gentle and homogeneous decrease in the density curve. In contrast, PageRank suffers from a sudden increase in its curve followed by a sharp decrease, which yields the highest GDC of 2.07 (Table 4); this phenomenon is discussed in Section 6. In Fig. 9, it can be observed that the density reaches zero while the domain count is still 13. Those nodes are normal HS that were referenced by suspicious onion domains, such as wiki pages or Tor directory pages.
Figure 9: Density Analysis for the $SAG_{All}$ of DUTA-10K. ToRank achieves the lowest GDC.
17 The HITS algorithm produces two curves, one for the authorities and another for the hubs.
Figure 10: Comparing the GDC value for four ranking algorithms with respect to the suspicious activities of DUTA-10K. The vertical axis represents the GDC value for the corresponding ranking algorithm, whereas the horizontal axis shows the 12 individual activities $SAG_{X}$, plus the graph of all the interesting activities, $SAG_{All}$.
Table 4: A GDC comparison for the four ranking algorithms over the SAGs. The bolded numbers correspond to the lowest GDC value.
Activity name                PageRank   HITS_Auth   HITS_Hub   Katz   ToRank
All                          2.07       1.63        1.96       1.43   1.31
C. Credit Cards              2.00       1.65        2.13       1.51   1.48
C. Money                     0.64       0.56        0.72       0.52   0.52
C. Personal Identification   0.73       0.63        0.95       0.60   0.60
CryptoLocker                 2.41       2.53        3.48       2.32   2.29
Drugs                        2.24       1.74        2.19       1.58   1.49
Forum: Suspicious            0.48       0.36        0.35       0.35   0.26
Fraud                        0.68       0.62        0.57       0.54   0.42
Hacking                      0.90       0.72        0.80       0.68   0.63
Marketplace: Suspicious      1.25       0.94        1.10       0.87   0.80
Pornography                  0.76       0.52        0.67       0.50   0.42
Services: Suspicious         0.44       0.43        0.52       0.34   0.33
Violence                     0.69       0.60        0.66       0.52   0.51
Fig. 10 shows a comparison of the GDC of the four ranking algorithms with respect to $SAG_{All}$ and to 12 single-activity graphs only. We could not evaluate the Human-Tracking activity, as it contains only three onion domains and two of them have identical content, which yielded only two unique domains without any hyperlink between them. This figure shows that ToRank outperforms the other ranking algorithms because it has the lowest GDC area. However, despite the superiority of ToRank over Katz, the latter approaches ToRank in all of the suspicious categories, and in the counterfeit of personal identification and the counterfeit of money both have the same GDC value. This is because both ToRank and Katz use the in-degree, while the advantage of ToRank comes from the use of the out-degree of the domains and its weighting.
In addition to the density analysis, Table 5 explores four graph structure metrics: it compares the full network structure, before applying the top-ranked node removal, with three levels of node reduction (1, 5, and 10 percentile of the full network). From the table, we can see that ToRank achieved the sharpest reduction in the GC together with a high decrease in the CC. Also, with ToRank, the ASP increased to 8.9, which reflects a higher disruption to the network structure caused by removing the core nodes first. This observation is also reflected in the graph diameter, which increased from 9 to 22 after removing the top 10% of the ranked nodes. These observations indicate that ToRank outperforms the other ranking algorithms.
In Table 6, we explore the top-10 onion domains nominated by the ToRank algorithm as influential HS within $SAG_{All}$.
5.4. Analysis of the Normal Activities in Tor
We carried out the same analysis for the normal activities of DUTA-10K (Fig. 11). We created a Normal Activities Graph (NAG). Then, we ranked the nodes according to the four ranking algorithms and evaluated the performance using the GDC measure. The NAG has 22965 nodes, of which 5016 belong to the normal activities; the remaining ones are HS that are connected to them. The graph has 85699 edges, with an average node degree of 3.73.
Fig. 11 shows that ToRank has the lowest area under the GDC, with a value of 0.02, whereas $HITS_{Hub}$, Katz, $HITS_{Auth}$, and PageRank achieved 0.03, 0.17, 0.22 and 0.26, respectively. In this case, the extraordinarily good performance of ToRank and of $HITS_{Hub}$ is explained by their ability to detect first the onion domains that function as directories, since those domains have a high number of connections with other HS in the Tor network. Given their impact on the GDC, we conclude that the onion directories are one of the main ways to redirect Tor users to the activity domains.
Table 5 shows that ToRank outperformed the other
algorithms even for the normal activities graph. This
superiority is observed in the reduction of the CC and
Table 5: Impact of top-ranked nodes removal on four graph metrics with respect to three graph datasets: suspicious activities, normal activities, and the 9/11 Hijackers Network. The metrics are the Clustering Coefficient (CC), the Giant Component (GC), the Average Shortest Path (ASP), and the Diameter (Dim); FN refers to the full network before nodes removal and is the same for all the algorithms. The bolded values refer to the best performance, where the objective is to decrease the CC and the GC and to increase the ASP and the Dim.
                     Suspicious Activities Network      Normal Activities Network          9/11 Hijackers Network
Metric  Algorithm    FN      Top-1%  Top-5%  Top-10%    FN      Top-1%  Top-5%  Top-10%    FN      Top-1%  Top-5%  Top-10%
CC      ToRank       0.056   0.042   0.029   0.023      0.3577  0.009   0.004   0.003      0.476   0.465   0.334   0.303
CC      PR           0.056   0.048   0.041   0.042      0.3577  0.299   0.285   0.267      0.476   0.452   0.334   0.342
CC      HITS (Hub)   0.056   0.055   0.043   0.023      0.3577  0.013   0.004   0.004      0.476   0.471   0.458   0.449
CC      HITS (Auth)  0.056   0.055   0.051   0.039      0.3577  0.357   0.341   0.252      0.476   0.465   0.446   0.441
CC      Katz         0.056   0.051   0.034   0.027      0.3577  0.355   0.370   0.391      0.476   0.465   0.446   0.339
GC      ToRank       2748    1539    998     406        22572   1156    42      32         60      59      55      34
GC      PR           2748    2691    2498    2351       22572   15690   14760   13620      60      59      55      28
GC      HITS (Hub)   2748    1880    1048    452        22572   2964    619     129        60      59      57      54
GC      HITS (Auth)  2748    2705    2484    2272       22572   22281   21316   14118      60      59      31      26
GC      Katz         2748    2686    2399    2166       22572   21249   20481   19086      60      59      31      26
ASP     ToRank       3.151   5.234   7.684   8.902      2.690   7.340   6.066   5.371      3.606   4.291   4.698   3.695
ASP     PR           3.151   3.174   3.193   3.255      2.690   2.665   2.698   2.725      3.606   3.653   4.698   3.183
ASP     HITS (Hub)   3.151   4.658   5.956   6.441      2.690   7.014   1.997   1.985      3.606   3.635   3.677   4.253
ASP     HITS (Auth)  3.151   3.169   3.225   3.281      2.690   2.755   2.800   2.934      3.606   4.291   3.179   3.129
ASP     Katz         3.151   3.180   3.229   3.281      2.690   2.593   2.550   2.399      3.606   4.291   3.179   3.129
Dim     ToRank       9       15      19      22         8       18      14      12         7       9       11      9
Dim     PR           9       9       9       9          8       8       8       8          7       7       11      7
Dim     HITS (Hub)   9       12      17      18         8       20      2       2          7       7       7       9
Dim     HITS (Auth)  9       9       10      10         8       8       8       8          7       9       7       7
Dim     Katz         9       10      10      10         8       8       7       6          7       9       7       7
Table 6: The top-10 HS ranked using ToRank algorithm
Rank HS Address Activity Category HS Title Short Description
1matangareonmy6bg.onion Drugs - Online market for suspicious drugs
2y3nau3mnibjbpmh4.onion Pornography Tor Links 2.0 A Tor directory for pornography
content
3hansamkt2rr6nfg3.onion Marketplace suspicious HANSA Market A famous marketplace for suspicious
product
4vfvfq64rtrefmdtd.onion Drugs - Russian market for suspicious drugs
5silkkitiehdg5mug.onion Drugs Silkkitie Valhalla Market (known by its
Finnish name, Silkkitie)
6shops3jckh3dexzy.onion Drugs - Online market for suspicious drugs
7abbujjh5vqtq77wg.onion C. personal identification Onion Identity Services HS for producing fake passports
8gxmrzk2s56oxzb3e.onion Pornography German Pi-X-Board Multi-languages forum for child-
pornography
9boysopidonajtogl.onion Pornography Central Park Guides and links to porno websites
10 newpdsuslmzqazvr.onion Drugs Peoples Drug Store Online market for suspicious drugs
Figure 11: Graph density analysis for the NAG.
the GC as well as an increase in the ASP. However, in this case, PageRank achieved a better diameter after removing the top 1% of the nodes only.
5.5. Analysis of the 9/11 Hijackers Network
With the intention of testing our proposal in depth and evaluating its generalization capability, we looked for other similar datasets, or at least other problems where the main purpose was to rank collected data. We found a similar problem in the analysis of the 9/11 Hijackers Network. This is a famous dataset containing information about the terrorists involved in the 9/11 attack on the World Trade Center, in 2001 (Krebs, 2002). The goal here is to detect the most influential people who contributed to that attack, where the most influential node, the key player, is the one whose removal would lead to an extreme break in the connectivity between the other network members. This objective is similar to ours, because we try to detect which is the most influential node in the Tor network at each step. To apply the link-based ranking algorithms, we represented the network as a directed graph whose nodes refer to the hijackers and whose edges describe a relation between any two people; each undirected edge was replaced with two directed ones.
There are plenty of research papers regarding the analysis of the structure of this network (Husslage et al., 2015; Memon and Larsen, 2006), but fewer works about ranking its members. Therefore, we considered the rank proposed by Choudhary and Singh (2016) as the ground truth to compare with our work. The Kendall rank correlation coefficient (Abdi, 2007), also known as Kendall's tau coefficient, was used to assess the correlation between two ranked lists. Its value ranges between +1 and −1, such that the closer the value is to +1 or −1, the stronger the relationship, while the closer the value is to 0, the weaker the relationship. Table 7 shows the top-10 (footnote 18) hijackers nominated by each ranking algorithm and a comparison with the ground truth rank with respect to Kendall's tau measure.
Table 7: The top-10 Hijackers in 9/11 dataset
Ground truth    PageRank        HITS_Auth       HITS_Hub        Katz            ToRank
M. Atta         M. Al-Shehhi    M. Atta         R. al-Shibh     M. Atta         M. Atta
E. Khemail      M. Atta         M. Al-Shehhi    S. Bahaji       M. Al-Shehhi    E. Khemail
Z. Moussaoui    E. Khemail      Z. Jarrah       Z. Essabar      Z. Jarrah       M. Al-Shehhi
H. Hanjour      Z. Jarrah       R. al-Shibh     A. Budiman      E. Khemail      D. Benghal
N. Alhazmi      A. Al-Omari     S. Bahaji       M. Motassadeq   Z. Moussaoui    Z. Moussaoui
M. Al-Shehhi    W. Alshehri     Z. Essabar      L. Raissi       D. Benghal      R. al-Shibh
S. Suqami       H. Alghamdi     M. Motassadeq   M. Darkazanli   A. Qatada       T. Maaroufi
D. Benghal      D. Benghal      Z. Moussaoui    Z. Moussaoui    T. Maaroufi     Z. Jarrah
R. al-Shibh     H. Hanjour      A. Qatada       M. al-Hisawi    H. Hanjour      A. Qatada
Z. Jarrah       N. Alhazmi      D. Benghal      A. Al-Omari     R. al-Shibh     N. Alhazmi
Kendall's tau   0.47            0.43            0.05            0.63            0.70
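For readers unfamiliar with the measure, the sketch below computes Kendall's tau for two rankings. It assumes that both lists contain exactly the same items and no ties (the tau-a variant). The paper does not state how lists with different members, such as the columns of Table 7, were handled, so this function is not claimed to reproduce the reported values.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two orderings of the same items, without ties."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order it the same way.
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs
```

Identical rankings give +1, fully reversed rankings give −1, and swapping a single adjacent pair in a list of four items gives 4/6, roughly 0.67.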
Additionally, we applied the GDC evaluation procedure used in this paper to compare the ranking algorithms (Fig. 12), and we found that, again, ToRank outperformed the other algorithms with a GDC of 1.13, while the other ranking algorithms obtained the following values: 1.48 for Katz, 1.70 for PageRank, 2.55 for $HITS_{Auth}$, and 2.98 for $HITS_{Hub}$.
Table 5 shows that ToRank is slightly better than the other algorithms. However, the results of the algorithms are very close due to the small number of nodes in this graph.
18 We selected the top-10 only because the ground truth enumerates only the top-10 Hijackers.
Figure 12: Graph Density Analysis for the 9/11 Hijackers dataset.
6. Discussion
We designed the proposed algorithm, ToRank, with the intention of detecting the nodes that contribute the most to both propagating traffic to and receiving traffic from the other nodes. By setting the initial weight of each node to its degree, ToRank establishes an initial measure of the influence of each node. Later on, this starting value is adjusted using the weights of its adjacent nodes. Therefore, the weight of every node depends on the weight of its direct and its indirect neighbors, since it considers the degree of the following and the follower nodes. Also, what distinguishes ToRank from the benchmarked algorithms is that it is not iterative like PageRank or HITS, so it does not have a convergence behavior.
It was surprising to find that PageRank, probably the most popular ranking algorithm for the Surface Web, performs poorly for this problem according to the graph density analysis metric. We found that the iterative removal of the top-ranked nodes using PageRank yielded the largest area under the density curve, especially after removing between 10% and 20% of the top-ranked nodes. The reason is that PageRank assigns high ranks to low-connectivity nodes, and during the iterative evaluation those nodes were dropped first, resulting in a decrease in the number of nodes without a significant impact on the graph connectivity. Hence, the number of edges is kept high while the count of nodes decreases, which leads to an increment in the density curve and a high GDC value. Additionally, the GDC evaluation procedure measures the efficiency of a ranking algorithm in breaking down the connectivity of a given graph, whereas PageRank was not designed for that end. Therefore, we conclude that, despite the popularity of PageRank, it is not a suitable ranking algorithm for this task.
However, notwithstanding the success of the ToRank algorithm in detecting and ranking the influential nodes in Tor, our method fails in assessing onion domains that are isolated, with no incoming or outgoing hyperlinks. Fortunately, this issue barely has an impact on our problem, because in $SAG_{All}$ there are only 94 onion domains that do not have any hyperlinks to or from HS within Tor, which accounts for only 3% of the total.
6.1. ToRank Computational complexity
ToRank has two phases: the initialization phase, which iterates over all the nodes to assign an initial weight to them, and a ranking phase, which calculates a rank value for each node using the ToRank equation. Hence, ToRank complexity is proportional to the number of nodes N, and the time complexity would be O(2N). Table 8 compares the processing time needed to rank relatively large and small graphs.
Table 8: A comparison for the processing time, in terms of seconds,
of multiple ranking algorithms. The first column to the left refers to
the graph name and its number of nodes (N) and edges (E). Using
NetworkX library we were not able to rank relatively big graphs so
we replaced it with (-) sing.
Graph                                                             PageRank  HITS    Katz   ToRank
Google web graph (Leskovec et al., 2009; N=875713, E=5105039)     284.99    128.95  -      32.42
Notre Dame web graph (Albert et al., 1999; N=325729, E=1497134)   90.40     74.87   -      11.14
Stanford web graph (Leskovec et al., 2009; N=281903, E=2312497)   141.52    22.90   -      11.72
Suspicious Activities (N=2908, E=14511)                           1.47      0.42    0.78   0.14
Normal Activities (N=22965, E=85699)                              5.98      36.33   99.85  0.80
9/11 Hijackers Network (N=60, E=194)                              0.03      0.07    0.62   0.05
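Processing times such as those in Table 8 can be collected with a small timing harness. The sketch below is only one plausible setup: the paper does not state how the times were measured, so the use of `time.perf_counter`, the best-of-several-runs policy, and the helper names are assumptions. The in-degree ranking is a placeholder and is not ToRank; any function that maps a graph to a score per node can be passed instead.

```python
import time


def time_ranking(rank_fn, nodes, edges, repeats=3):
    """Return the best wall-clock time, in seconds, over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        rank_fn(nodes, edges)
        best = min(best, time.perf_counter() - start)
    return best


def in_degree_rank(nodes, edges):
    """Placeholder ranking (not ToRank): score each node by its in-degree."""
    score = dict.fromkeys(nodes, 0)
    for _, target in edges:
        score[target] += 1
    return score


# Synthetic graph with 1000 nodes and one outgoing link per node.
nodes = list(range(1000))
edges = [(i, (i * 7 + 1) % 1000) for i in nodes]

elapsed = time_ranking(in_degree_rank, nodes, edges)
print(f"in-degree ranking: {elapsed:.6f} s")
```

Taking the minimum over a few repetitions reduces the influence of unrelated system load on the reported time.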
6.2. ToRank with Big Graphs
Besides the graph structures explored above, we ran the ranking algorithms over three relatively large graphs (see their specifications in Table 8). In particular, we experimented with the Google web graph19 (Leskovec et al., 2009), the Stanford web graph20 (Leskovec et al., 2009), and the Notre Dame web graph21 (Albert et al., 1999). We explored only the reduction in the giant component metric, as we could not compute the other metrics due to their high time complexity using the NetworkX library. Also, we compared the processing time of ToRank with that of the other ranking algorithms on these graphs. Table 9 shows that ToRank and PageRank compete for the first position: ToRank outperformed PageRank on the Notre Dame web graph but not on the Google web graph. However, considering the processing time shown in Table 8, ToRank is superior to PageRank for large graphs.
19 https://snap.stanford.edu/data/web-Google.html
20 https://snap.stanford.edu/data/web-Stanford.html
21 https://snap.stanford.edu/data/web-NotreDame.html
Table 9: Comparison of the reduction in the Giant Component (GC) of multiple large graphs. The lower the GC, the better the ranking.
Graph structure       Full network GC  Removal  PR      HITS(Auth)  HITS(Hub)  Katz  ToRank
Google web graph      855802           Top-1%   771062  818528      845134     -     772966
                                       Top-5%   626437  714767      801869     -     665315
                                       Top-10%  502855  603827      742537     -     525710
Notre Dame web graph  325729           Top-1%   254729  315000      322468     -     228957
                                       Top-5%   207088  285246      280401     -     125635
                                       Top-10%  180889  225480      157022     -     50600
Stanford web graph    255265           Top-1%   204408  232634      252396     -     200723
                                       Top-5%   127013  221680      241120     -     129735
                                       Top-10%  77765   183794      223617     -     104960
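The giant-component metric reported in Table 9 can be sketched as follows. This is an illustration under assumptions rather than the code used in the experiments: it takes the giant component to be the largest weakly connected component (edge direction ignored), removes the top-ranked fraction of nodes in a single step, and uses a breadth-first search over plain dictionaries instead of NetworkX. The function names and the two-star toy graph are hypothetical.

```python
from collections import defaultdict, deque


def giant_component_size(nodes, edges):
    """Size of the largest weakly connected component (direction ignored)."""
    alive = set(nodes)
    neighbours = defaultdict(set)
    for u, v in edges:
        if u in alive and v in alive:
            neighbours[u].add(v)
            neighbours[v].add(u)
    seen, best = set(), 0
    for root in alive:
        if root in seen:
            continue
        seen.add(root)
        queue, size = deque([root]), 0
        while queue:  # breadth-first search from `root`
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        best = max(best, size)
    return best


def gc_after_removal(nodes, edges, ranking, fraction):
    """Giant-component size after dropping the top `fraction` of ranked nodes."""
    ordered = sorted(nodes, key=lambda v: ranking[v], reverse=True)
    removed = set(ordered[:int(len(ordered) * fraction)])
    kept = [v for v in nodes if v not in removed]
    return giant_component_size(kept, edges)


# Toy graph (hypothetical): two stars whose hubs, 0 and 10, are linked.
nodes = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
edges = ([(0, i) for i in (1, 2, 3, 4)]
         + [(10, i) for i in (11, 12, 13, 14)] + [(0, 10)])
degree = {v: sum(v in e for e in edges) for v in nodes}

for fraction in (0.0, 0.1, 0.2):
    print(fraction, gc_after_removal(nodes, edges, degree, fraction))
```

On this toy graph, removing one hub halves the giant component and removing both hubs leaves only isolated nodes, which is the behaviour a good ranking should produce in Table 9 (lower is better).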
7. Conclusions and Future Work
In this paper, we presented and made publicly available the DUTA-10K dataset, with 10367 HS manually labeled into 25 categories. Contrary to the widespread belief that most of Tor’s content is related to criminal activities, the statistical analysis of DUTA-10K showed that only around 20% of the tested onion domains are associated with suspicious activities, while 48% are related to normal ones. The remaining 32% were not accessible because they were either down, locked, or empty, and we classified them as unknown domains. We also verified that 84% of the HS are in English, which corroborates the idea that a text-based model to classify Tor content, trained only on this language, will cover the majority of the onion pages. Additionally, we found that the domains related to suspicious activities tend to have multiple clones under different addresses, which can even be used as an additional feature for identifying them.
One of the main contributions of this paper is the new algorithm that we proposed to rank Tor web pages, which we named ToRank. To facilitate the process of monitoring the HS, ToRank is designed to identify the most influential onion domains and to rank them at the top. We employed graph theory to model the Tor network, where nodes correspond to the HS and edges to the hyperlinks between them. ToRank was evaluated quantitatively by peeling away the top-ranked nodes iteratively and checking whether the density of the graph decreases in every cycle. Its performance was compared against three popular link-based ranking algorithms, finding that the area under the Graph Density Curve (lower is better) was 1.31 for ToRank, 1.41 for Katz, 1.63 for HITS(Auth), 1.96 for HITS(Hub), and 2.07 for PageRank.
These findings have made us reflect on how to determine the influence of a node in Tor beyond the analysis based on hyperlinks. At present, we are focused on extracting textual features, such as product names, vendor nicknames, locations, or even date and time formats (Lample et al., 2016; Aguilar et al., 2017). Additionally, we plan to extract visual information by categorizing the images (Fidalgo et al., 2017) in Tor HS and generating textual descriptions for them using image captioning (You et al., 2016). Our idea is to combine those features with ToRank to improve the ranking. Finally, we are also considering including in our analysis the hyperlinks to the Surface Web, which might help to understand and to determine which domains are the influential ones.
Acknowledgments
This research is supported by the INCIBE grant “INCIBEI-2015-27359”, corresponding to the “Ayudas para la Excelencia de los Equipos de Investigación avanzada en ciberseguridad”, and also by the framework agreement between the University of León and INCIBE (Spanish National Cybersecurity Institute) under Addenda 22 and 01.
References
Abdi, H., 2007. The kendall rank correlation coefficient. Encyclopedia
of Measurement and Statistics. Sage, Thousand Oaks, CA, 508–
510.
Aguilar, G., Maharjan, S., Monroy, A. P. L., Solorio, T., 2017. A
multi-task approach for named entity recognition in social me-
dia data. In: Proceedings of the 3rd Workshop on Noisy User-
generated Text. pp. 148–153.
Al Nabki, M. W., Fidalgo, E., Alegre, E., de Paz, I., 2017a. Classify-
ing illegal activities on tor network based on web textual contents.
In: Proceedings of the 15th Conference of the European Chapter
of the Association for Computational Linguistics: Volume 1, Long
Papers. Vol. 1. pp. 35–43.
Al Nabki, M. W., Fidalgo, E., Alegre, E., González-Castro, V.,
2017b. Detecting emerging products in tor network based on
k-shell graph decomposition. III Jornadas Nacionales de Investigación
en Ciberseguridad (JNIC) 1 (1), 24–30.
Albert, R., Jeong, H., Barabási, A.-L., 1999. Internet: Diameter of the
world-wide web. nature 401 (6749), 130.
Anger, I., Kittl, C., 2011. Measuring influence on twitter. In: Proceed-
ings of the 11th International Conference on Knowledge Manage-
ment and Knowledge Technologies. ACM, p. 31.
Anwar, T., Abulaish, M., 2015. Ranking radically influential web fo-
rum users. IEEE Transactions on Information Forensics and Secu-
rity 10 (6), 1289–1298.
Arulselvan, A., Commander, C. W., Elefteriadou, L., Pardalos, P. M.,
2009. Detecting critical nodes in sparse graphs. Computers & Op-
erations Research 36 (7), 2193–2200.
Backstrom, L., Kleinberg, J., 2014. Romantic partnerships and the
dispersion of social ties: a network analysis of relationship sta-
tus on facebook. In: Proceedings of the 17th ACM conference on
Computer supported cooperative work & social computing. ACM,
pp. 831–841.
Bergman, M. K., 2001. White paper: the deep web: surfacing hidden
value. Journal of electronic publishing 7 (1).
Berzinji, A., Kaati, L., Rezine, A., 2012. Detecting key players in
terrorist networks. In: Intelligence and Security Informatics Con-
ference (EISIC), 2012 European. IEEE, pp. 297–302.
Bidoki, A. M. Z., Ghodsnia, P., Yazdani, N., Oroumchian, F., 2010.
A3crank: An adaptive ranking method based on connectivity, con-
tent and click-through data. Information processing & manage-
ment 46 (2), 159–169.
Bidoki, A. M. Z., Yazdani, N., 2008. Distancerank: An intelligent
ranking algorithm for web pages. Information Processing & Man-
agement 44 (2), 877–892.
Biryukov, A., Pustogarov, I., Thill, F., Weinmann, R.-P., 2014. Con-
tent and popularity analysis of tor hidden services. In: Distributed
Computing Systems Workshops (ICDCSW), 2014 IEEE 34th In-
ternational Conference on. IEEE, pp. 188–193.
Biryukov, A., Pustogarov, I., Weinmann, R.-P., 2013. Trawling for
tor hidden services: Detection, measurement, deanonymization.
In: Security and Privacy (SP), 2013 IEEE Symposium on. IEEE,
pp. 80–94.
Biswas, R., Fidalgo, E., Alegre, E., 2017. Recognition of service do-
mains on tor dark net using perceptual hashing and image classi-
fication techniques. 8th International Conference on Imaging for
Crime Detection and Prevention, ICDP-2017 14, 15.
Booker, L. B., 2012. The effects of observation errors on the attack
vulnerability of complex networks. Tech. rep., MITRE CORP
MCLEAN VA.
Borodin, A., Roberts, G. O., Rosenthal, J. S., Tsaparas, P., 2005.
Link analysis ranking: algorithms, theory, and experiments. ACM
Transactions on Internet Technology (TOIT) 5 (1), 231–297.
Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E., 2007. A
model of internet topology using k-shell decomposition. Proceed-
ings of the National Academy of Sciences 104 (27), 11150–11154.
Chang, V., 2017. A cybernetics social cloud. Journal of Systems and
Software 124, 195–211.
Chaurasia, N., Tiwari, A., 2013. Efficient algorithm for destabilization
of terrorist networks. IJ Information Technology and Computer
Science 12, 21–30.
Choudhary, P., Singh, U., 2016. Ranking terrorist nodes of 9/11 net-
work using analytical hierarchy process with social network anal-
ysis. In: International Symposium on the Analytic Hierarchy Pro-
cess (ISAHP 2016). pp. 1–10.
Ciancaglini, V., Balduzzi, M., Goncharov, M., McArdle, R., 2013.
Deepweb and cybercrime. Trend Micro Report 9.
Cockburn, A., McKenzie, B., 2001. What do web users do?
an empirical analysis of web use. International Journal of Human-
Computer Studies 54 (6), 903–922.
Cohen, R., Havlin, S., 2010. Complex networks: structure, robustness
and function. Cambridge university press.
Cossu, J.-V., Dugué, N., Labatut, V., 2015. Detecting real-world influence
through twitter. In: Network Intelligence Conference (ENIC),
2015 Second European. IEEE, pp. 83–90.
Derhami, V., Khodadadian, E., Ghasemzadeh, M., Bidoki, A. M. Z.,
2013. Applying reinforcement learning for web pages ranking al-
gorithms. Applied Soft Computing 13 (4), 1686–1692.
Duggan, B., Feb. 2016. Uganda elections: Government shuts down
social media. CNN.
URL https://edition.cnn.com/2016/02/18/world/uganda-election-social-media-shutdown/
Duijn, P. A., Kashirin, V., Sloot, P. M., 2014. The relative ineffectiveness
of criminal network disruption. Scientific reports 4, 4238.
Elahi, T., Bauer, K., AlSabah, M., Dingledine, R., Goldberg, I., 2012.
Changing of the guards: A framework for understanding and im-
proving entry guard selection in tor. In: Proceedings of the 2012
ACM Workshop on Privacy in the Electronic Society. ACM, pp.
43–54.
Eliacik, A. B., Erdogan, N., 2018. Influential user weighted sentiment
analysis on topic based microblogging community. Expert Systems
with Applications 92, 403–418.
Ferrara, E., De Meo, P., Catanese, S., Fiumara, G., 2014. Detecting
criminal organizations in mobile phone networks. Expert Systems
with Applications 41 (13), 5733–5750.
Fidalgo, E., Alegre, E., González-Castro, V., Fernández-Robles, L.,
2017. Illegal activity categorisation in darknet based on image
classification using creic method. In: International Joint Conference
SOCO’17-CISIS’17-ICEUTE’17 León, Spain, September 6–8, 2017,
Proceeding. Springer, pp. 600–609.
Foley, S., Karlsen, J., Putniņš, T. J., 2018. Sex, drugs, and bitcoin:
How much illegal activity is financed through cryptocurrencies?
SSRN Electronic Journal.
Freeman, L. C., Roeder, D., Mulholland, R. R., 1979. Centrality in
social networks: Ii. experimental results. Social networks 2 (2),
119–141.
Fronzetti Colladon, A., Gloor, P. A., 2018. Measuring the impact of
spammers on e-mail and twitter networks. International Journal of
Information Management.
Fronzetti Colladon, A., Remondi, E., 2017. Using social network
analysis to prevent money laundering. Expert Systems with Ap-
plications 67, 49–58.
Fronzetti Colladon, A., Vagaggini, F., 2017. Robustness and stability
of enterprise intranet social networks: The impact of moderators.
Information Processing & Management 53 (6), 1287–1298.
Gallagher, S., Mar. 2016. Whole lotta onions: Number of tor
hidden sites spikes-along with paranoia.
URL https://bit.ly/2MGTkrU
Gohari, F. S., Mohammadi, S., January 2014. A comprehensive frame-
work for identifying viral marketing’s influencers in twitter. Inter-
national SAMANM Journal of Marketing and Management 2 (1),
27–43.
Hasan, O., Brunie, L., Bertino, E., Shang, N., 2013. A decentralized
privacy preserving reputation protocol for the malicious adversar-
ial model. IEEE Transactions on Information Forensics and Secu-
rity 8 (6), 949–962.
Henni, K., Mezghani, N., Gouin-Vallerand, C., 2018. Unsupervised
graph-based feature selection via subspace and pagerank centrality.
Expert Systems with Applications 114, 46–53.
Holme, P., Kim, B. J., Yoon, C. N., Han, S. K., 2002. Attack vulnera-
bility of complex networks. Physical review E 65 (5), 056109.
Holmgren, Å. J., 2007. A framework for vulnerability assessment of
electric power systems. In: Critical Infrastructure. Springer, pp.
31–55.
Hu, Y., Wang, S., Ren, Y., Choo, K.-K. R., 2018. User influence analy-
sis for github developer social networks. Expert Systems with Ap-
plications 108, 108–118.
Hu, Y., Zhang, J., Bai, X., Yu, S., Yang, Z., 2016. Influence analysis
of github repositories. SpringerPlus 5 (1), 1268.
Husslage, B., Borm, P., Burg, T., Hamers, H., Lindelauf, R., 2015.
Ranking terrorists in networks: A sensitivity analysis of al qaeda’s
9/11 attack. Social Networks 42, 1–7.
Iyer, S., Killingback, T., Sundaram, B., Wang, Z., 2013. Attack
robustness and centrality of complex networks. PloS one 8 (4),
e59613.
Jansen, R., Tschorsch, F., Johnson, A., Scheuermann, B., 2014. The
sniper attack: Anonymously deanonymizing and disabling the tor
network. Tech. rep., OFFICE OF NAVAL RESEARCH ARLING-
TON VA.
Ji, S., Li, W., Gong, N. Z., Mittal, P., Beyah, R., 2016. Seed-based
de-anonymizability quantification of social networks. IEEE Trans-
actions on Information Forensics and Security 11 (7), 1398–1411.
Joshi, A., Fidalgo, E., Alegre, E., Al Nabki, M. W., 2018. Extractive
text summarization in dark web: A preliminary study. International
Conference of Applications of Intelligent Systems.
Katz, L., 1953. A new status index derived from sociometric analysis.
Psychometrika 18 (1), 39–43.
Kleinberg, J. M., 1999. Authoritative sources in a hyperlinked envi-
ronment. Journal of the ACM (JACM) 46 (5), 604–632.
Krebs, V. E., 2002. Mapping networks of terrorist cells. Connections
24 (3), 43–52.
Kwon, A., AlSabah, M., Lazar, D., Dacier, M., Devadas, S., 2015. Cir-
cuit fingerprinting attacks: Passive deanonymization of tor hidden
services. In: 24th USENIX Security Symposium (USENIX Secu-
rity 15). USENIX Association, Washington, D.C., pp. 287–302.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer,
C., June 2016. Neural architectures for named entity recognition.
In: Proceedings of the 2016 Conference of the North American
Chapter of the Association for Computational Linguistics: Human
Language Technologies. Association for Computational Linguis-
tics, San Diego, California, pp. 260–270.
URL http://www.aclweb.org/anthology/N16-1030
Leskovec, J., Lang, K. J., Dasgupta, A., Mahoney, M. W., 2009. Com-
munity structure in large networks: Natural cluster sizes and the
absence of large well-defined clusters. Internet Mathematics 6 (1),
29–123.
Levene, M., 2011. An introduction to search engines and web naviga-
tion. John Wiley & Sons.
Ling, Z., Luo, J., Wu, K., Yu, W., Fu, X., 2015. Torward: Discovery,
blocking, and traceback of malicious traffic over tor. IEEE Transactions
on Information Forensics and Security 10 (12), 2515–2530.
Mao, G., Zhang, N., 2017. Fast approximation of average shortest
path length of directed ba networks. Physica A: Statistical Mechan-
ics and its Applications 466, 243–248.
Matic, S., Kotzias, P., Caballero, J., 2015. Caronte: Detecting loca-
tion leaks for deanonymizing tor hidden services. In: Proceedings
of the 22nd ACM SIGSAC Conference on Computer and Commu-
nications Security. ACM, pp. 1455–1466.
Memon, N., Larsen, H. L., 2006. Structural analysis and destabilizing
terrorist networks. In: DMIN. Citeseer, pp. 296–302.
Mitchell, J., 2017. Want to cry? ITNOW 59 (3), 12–13.
Moore, D., Rid, T., 2016. Cryptopolitik and the darknet. Survival
58 (1), 7–38.
Noor, U., Rashid, Z., Rauf, A., 2011. A survey of automatic deep
web classification techniques. International Journal of Computer
Applications 19 (6), 43–50.
Norbutas, L., 2018. Offline constraints in online drug marketplaces:
An exploratory analysis of a cryptomarket trade network. Interna-
tional Journal of Drug Policy 56, 92–100.
Nouh, M., Nurse, J. R., 2015. Identifying key-players in online activist
groups on the facebook social network. In: Data Mining Work-
shop (ICDMW), 2015 IEEE International Conference on. IEEE,
pp. 969–978.
Page, L., Brin, S., Motwani, R., Winograd, T., 1999. The pagerank
citation ranking: Bringing order to the web. Tech. rep., Stanford
InfoLab.
Rivest, R., 1992. The md5 message-digest algorithm. Internet Engi-
neering Task Force.
Ruhnau, B., 2000. Eigenvector-centrality—a node-centrality? Social
networks 22 (4), 357–365.
Scott, J., 2017. Social network analysis. Sage.
Seidman, S. B., 1983. Network structure and minimum degree. Social
networks 5 (3), 269–287.
Srinivas, A., Velusamy, R. L., 2015. Identification of influential nodes
from social networks based on enhanced degree centrality mea-
sure. In: Advance Computing Conference (IACC), 2015 IEEE In-
ternational. IEEE, pp. 1179–1184.
Taha, K., Yoo, P. D., 2017. Using the spanning tree of a criminal net-
work for identifying its leaders. IEEE Transactions on Information
Forensics and Security 12 (2), 445–453.
Wang, Y., Nelissen, N., Adamczuk, K., De Weer, A.-S., Vandenbul-
cke, M., Sunaert, S., Vandenberghe, R., Dupont, P., 2014. Re-
producibility and robustness of graph measures of the associative-
semantic network. PloS one 9 (12), e115215.
Watts, D. J., Strogatz, S. H., 1998. Collective dynamics of ‘small-world’
networks. nature 393 (6684), 440.
Wei, H., Pan, Z., Hu, G., Zhang, L., Yang, H., Li, X., Zhou, X.,
2018. Identifying influential nodes based on network representa-
tion learning in complex networks. PloS one 13 (7), e0200091.
Xu, X., Zhou, C., Wang, Z., 2009. Credit scoring algorithm based on
link analysis ranking with support vector machine. Expert Systems
with Applications 36 (2), 2625–2632.
Ye, Q., Wu, B., Wang, B., 2010. Distance distribution and average
shortest path length estimation in real-world networks. In: Inter-
national Conference on Advanced Data Mining and Applications.
Springer, pp. 322–333.
You, Q., Jin, H., Wang, Z., Fang, C., Luo, J., 2016. Image captioning
with semantic attention. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. pp. 4651–4659.
Zhang, Y., Bao, Y., Zhao, S., Chen, J., Tang, J., 2015. Identifying node
importance by combining betweenness centrality and katz central-
ity. In: Cloud Computing and Big Data (CCBD), 2015 Interna-
tional Conference on. IEEE, pp. 354–357.
19
... Total onions Active onions Portion [25] 7,000 1,450 20.7% [26] 198,050 7,257 3.7% [12] 47,439 14,232 30% [27] 250,000 7,000 2.8% [28] 124,589 3,536 2.8% [29] 12,882 4,509 35% [30] 25,742 6,227 24.2% [31] 15,503 4,089 26.4% [19] 25,261 2,527 10% services measured was around 1,450 out of more than 7,000 identified addresses [25], 7,257 onion links were active out of 198,050 [26], 30% was online at least 90% of the experiment with 47,439 onions identified [12], 7 K Tor pages were alive out of more than 250 K addresses [27], or strategies that returned 124,589 addresses with only 3,536 active [28]. ...
... Total onions Active onions Portion [25] 7,000 1,450 20.7% [26] 198,050 7,257 3.7% [12] 47,439 14,232 30% [27] 250,000 7,000 2.8% [28] 124,589 3,536 2.8% [29] 12,882 4,509 35% [30] 25,742 6,227 24.2% [31] 15,503 4,089 26.4% [19] 25,261 2,527 10% services measured was around 1,450 out of more than 7,000 identified addresses [25], 7,257 onion links were active out of 198,050 [26], 30% was online at least 90% of the experiment with 47,439 onions identified [12], 7 K Tor pages were alive out of more than 250 K addresses [27], or strategies that returned 124,589 addresses with only 3,536 active [28]. ...
... • An even larger manual inspection [30] led to the classification of more than four thousand onion services into 31 categories. • Two studies [27,28] focused on creating and extending the DUTA dataset, manually tagging onions in 25 classes. • Lee et al. [21] classified three thousand dark sites into 15 categories. ...
... A crucial aspect of integrating these methodologies is the construction of data in a format conducive to the model, with feature extraction playing a key role. Studies have involved constructing a Dark Web dataset and assessing suspicious websites through ranking algorithms [14]. One investigation extracted diverse elements such as text, HTML, images, and graphs from the Dark Web and evaluated them using a learning-to-rankbased domain prioritization [15]. ...
... In this study, two Dark Web datasets, previously validated through analyses, were utilized. DUTA-10k [14] includes 25 main classes and 10,367 samples, whereas CoDA [47] includes 10 classes and 8,855 samples. We restricted our data collection to Englishlanguage content from the Dark Web. ...
... The present experiments were performed using the DUTA-10k [14] and CoDA [47] datasets. The DUTA-10k dataset, an expansion of the initial labeled Dark Web dataset proposed in 2016 [1], encompasses a broader array of sources, thereby aggregating a more diverse set of onion sites. ...
Article
Full-text available
The Dark Web is an internet domain that ensures user anonymity and has increasingly become a focal point for illegal activities and a repository for information on cyberattacks owing to the challenges in tracking its users. This study examined the classification of the Dark Web in relation to these cyber threats. We processed Dark Web texts to extract vector types suitable for machine learning classification. Traditional methods utilizing the entirety of Dark Web texts to generate features result in vectors including all words found on the Dark Web. However, this approach incorporates extraneous information in the vectors, diminishing learning effectiveness and extending processing duration. The research aimed to optimize the classification process by selectively focusing on keywords within each class, thereby curtailing word vector dimensions. This optimization was facilitated by leveraging the anonymity characteristic of the Dark Web and employing topic-modeling-based weight generation. These methods enabled the creation of word vectors with a constrained feature set, enhancing the distinction of Dark Web classes. To further improve classification performance, we integrated TextCNN with topic modeling weights. For validation, we employed two datasets and compared the performance of the model with other text classification algorithms, where the proposed model demonstrated superior effectiveness in Dark Web classification.
... In [10] information on hidden services is extracted from descriptors onion addresses and crawled to find text types of services available on them, languages, popularity, up-time, and amount of service protected by descriptor cookie. In continuation to their previous work in [16] authors in [14] have contributed with a new dataset "Darknet Usage Text Addresses" DUTA10K 8 and ranking algorithm for HS and analyzed activities, content distribution, and languages on the web pages. The dataset is proposed since the lifespan of onion domains is very short. ...
Article
Full-text available
The only way to access onion services is via the TOR browser providing anonymity and privacy to the client as well as the server. Information about these hidden services and the contents available on them cannot be gathered like websites on the surface web. So, they become a fertile ground for illegal content dissemination and hosting for cybercriminals. There is a persistent need to classify and block such content from onion sites. In this paper, we investigate data requested from onion services to help law enforcement agencies collect traces of cybercrime on these hidden services. We propose a system using fuzzy encoded LSTM to analyze contents retrieved from these sites and raise alerts if found illegal. The accuracy of fuzzy-encoded LSTM is found to be 81.04 % and it outperforms other classifiers.
... This study examined the characteristics and behavior of a particular type of botnet and attempted to identify the motivations and tactics of the attackers behind it. Other research has focused on identifying the most influential suspicious domains in the Tor network, a network of servers that can be used to access the Darkweb [83]. This research used a machine learning technique called "ToRank" to identify and rank the most influential domains based on their activity and connections to other domains [84]. ...
Article
Full-text available
The Darkweb, part of the deep web, can be accessed only through specialized computer software and used for illegal activities such as cybercrime, drug trafficking, and exploitation. Technological advancements like Tor, bitcoin, and cryptocurrencies allow criminals to carry out these activities anonymously, leading to increased use of the Darkweb. At the same time, computers have become an integral part of our daily lives, shaping our behavior, and influencing how we interact with each other and the world. This work carries out the bibliometric study on the research conducted on Darkweb over the last decade. The findings illustrate that most research on Darkweb can be clustered into four areas based on keyword co-occurrence analysis: (i) network security, malware, and cyber-attacks, (ii) cybercrime, data privacy, and cryptography, (iii) machine learning, social media, and artificial intelligence, and (iv) drug trafficking, cryptomarket. National Science Foundation from the United States is the top funder. Darkweb activities interfere with the Sustainable Development Goals (SDG) laid forth by the United Nations to promote peace and sustainability for current and future generations. SDG 16 (Peace, Justice, and Strong Institutions) has the highest number of publications and citations but has an inverse relationship with Darkweb, as the latter undermines the former. This study highlights the need for further research in bitcoin, blockchain, IoT, NLP, cryptocurrencies, phishing and cybercrime, botnets and malware, digital forensics, and electronic crime countermeasures about the Darkweb. The study further elucidates the multi-dimensional nature of the Darkweb, emphasizing the intricate relationship between technology, psychology, and geopolitics. This comprehensive understanding serves as a cornerstone for evolving effective countermeasures and calls for an interdisciplinary research approach. 
The study also delves into the psychological motivations driving individuals towards illegal activities on the Darkweb, highlighting the urgency for targeted interventions to promote pro-social online behavior.
Article
Full-text available
Child pornography—better known as child sexual abuse material (CSAM)—represents a severe form of exploitation and victimization of children, leaving the victims with emotional and physical trauma. In this study, we aim to analyze local patterns of CSAM consumption across 1341 French communes in 20 metropolitan regions of France between March 16 to May 31, 2019 using fine-grained mobile traffic data of Tor network-related web services. We estimate that approx. 0.08% of Tor mobile download traffic observed in France is linked to the consumption of CSAM by correlating it with local-level temporal porn consumption patterns. This compares to 0.19% of what we conservatively estimate to be the share of CSAM content in global Tor traffic. In line with existing literature on the link between sexual child abuse and the consumption of image-based content thereof, we observe a positive and statistically significant effect of our CSAM consumption estimates on the reported number of victims of sexual violence and vice versa, which validates our findings, after controlling for a set of geographically disaggregated features including socio-demographic characteristics, voting behavior, nearby points of interest and Google Trends queries. While this is a first, exploratory attempt to look at CSAM from a spatial epidemiological angle, we believe this research provides public health officials with valuable information to prioritize target areas for public awareness campaigns as another step to fulfill the global community’s pledge to target 16.2 of the sustainable development goals: “end abuse, exploitation, trafficking and all forms of violence and torture against children".
Article
This comprehensive study delves into the complexities of the Dark Web, a concealed segment of the internet that remains invisible to standard search engines and is accessible only through specialized tools like The Onion Router (TOR), which ensures user anonymity. While the Dark Web is celebrated for its capacity to safeguard privacy and foster free expression, it concurrently serves as a sanctuary for illegal endeavours, encompassing drug trafficking, unauthorized arms trading, and a spectrum of cybercrime. The primary objective of this research is to scrutinize the efficacy of onion routing, the foundational technology behind the Dark Web, in preserving user anonymity amidst escalating efforts by law enforcement agencies to dismantle illegal activities. This paper adopts a rigorous approach that melds an exhaustive review of pertinent literature with empirical investigations to pinpoint the intrinsic vulnerabilities within the onion routing framework. Furthermore, the study introduces innovative methodologies aimed at bolstering the detection and neutralization of illicit transactions and communications on the Dark Web. These proposed methods seek to establish a delicate balance between upholding the Dark Web's legitimate functions—such as protecting privacy and enabling free speech—and curtailing its misuse for criminal activities. The paper culminates in a discussion of the broader implications of these findings for policymakers, law enforcement officials, and privacy advocates. It provides a set of recommendations for future research and policy formulation in this intricate and ever-evolving domain, to navigate the challenges posed by the Dark Web while preserving its essential values.
Chapter
Terrorist Network Analysis has been one of the most commonly discussed ways for safeguarding Online Social Network accessing and transmission through decentralized, trustless, peer-to-peer networks since Gabriel Weimann’s research paper on Cyberterrorism published in 2005. This study analyzes peer-reviewed literature attempting to use Terrorist Network Analysis for Cybercrime and gives a comprehensive analysis of the most often used Terrorist Network Analysis Security applications. The Proposed Finding recommends Malicious activities (Messages and Post relating to Terrorist events on Social Networking) for alerting users, so that they could cut away his/her communication from Malicious Actors and also help Security Agencies to identify possible Attacks and Vulnerabilities on modern Online Social Network, their main reasons and countermeasures. This systematic study also offers light on future directions in Terrorist Network Analysis and Cybercrime research, education, and practices, such as Terrorist Network Analysis security in Online Social Network, Security for Machine Learning and automated techniques.
Article
Full-text available
Viral marketing can lead to extensive knowledge of marketing campaigns across customers with lower costs. The important point of viral marketing is targeting the subset of customers that can influence on others. Such customers enhance the efficiency of a marketing campaign by maximizing propagation of viral message throughout the network. According to increasing the importance of the Twitter network for marketing efforts in recent years, the aim of this work is to identify the best influential individuals for the efficient performance of viral marketing campaigns in this network. Recent works on Twitter reveal the lack of a comprehensive framework for differentiation of influencers in viral marketing. Our qualitative research aims at the synthesis of results and theories from previous studies in a new comprehensive framework. The paper first provides a detailed review on previous works about the influence and diffusion of information on Twitter. Second, according to the important features of viral marketing's influencers, it proposes a comprehensive framework for evaluating these features in terms of Twitter functions.This frameworkconcentrates on all of the important factors for identifying viral marketing's influencers. So, the most worthy twitterers with highest marketing value can be identified effectively based on our proposed framework.
Article
Identifying influential nodes is an important topic in many diverse applications, such as accelerating information propagation, controlling rumors and diseases. Many methods have been put forward to identify influential nodes in complex networks, ranging from node centrality to diffusion-based processes. However, most of the previous studies do not take into account overlapping communities in networks. In this paper, we propose an effective method based on network representation learning. The method considers not only the overlapping communities in networks, but also the network structure. Experiments on real-world networks show that the proposed method outperforms many benchmark algorithms and can be used in large-scale networks.
Article
Background: Cryptomarkets, or illegal anonymizing online platforms that facilitate drug trade, have been analyzed in a rapidly growing body of research. Previous research has found that, despite increased risks, cryptomarket sellers are often willing to ship illegal drugs internationally. There is little to no information, however, about the extent to which uncertainty and risk related to geographic constraints shape buyers' behavior and, in turn, the structure of the global online drug trade network. In this paper, we analyze the structure of a complete cryptomarket trade network with a focus on the role of geographic clustering of buyers and sellers. Methods: We use publicly available crawls of the cryptomarket Abraxas, encompassing market transactions between 463 sellers and 3542 buyers of drugs in 2015. We use descriptive social network analysis and Exponential Random Graph Models (ERGM) to analyze the structure of the trade network. Results: The structure of the online drug trade network is primarily shaped by geographical boundaries. Buyers are more likely to buy from multiple sellers within a single country, and avoid buying from sellers in different countries, which leads to strong geographic clustering. The effect is especially strong between continents and weaker for countries within Europe. A small fraction of buyers (10%) account for more than half of all drug purchases, while most buyers only buy once. Conclusion: Online drug trade networks might still be heavily shaped by offline (geographic) constraints, despite their ability to provide end-users with access to a large international supply. Cryptomarkets might be more "localized" and less international than previously thought. We discuss potential explanations for such geographical clustering and implications of the findings.
Article
Cryptocurrencies are among the largest unregulated markets in the world. We find that approximately one-quarter of bitcoin users are involved in illegal activity. We estimate that around $76 billion of illegal activity per year involves bitcoin (46% of bitcoin transactions), which is close to the scale of the U.S. and European markets for illegal drugs. The illegal share of bitcoin activity declines with mainstream interest in bitcoin and with the emergence of more opaque cryptocurrencies. The techniques developed in this paper have applications in cryptocurrency surveillance. Our findings suggest that cryptocurrencies are transforming the black markets by enabling “black e-commerce.” Received June 1, 2017; editorial decision December 8, 2018 by Editor Andrew Karolyi. Authors have furnished an Internet Appendix, which is available on the Oxford University Press Web site next to the link to the final published paper online.
Article
This paper investigates whether senders of large amounts of irrelevant or unsolicited information, commonly called “spammers”, distort the network structure of social networks. Two large social networks are analyzed, the first extracted from the Twitter discourse about a big telecommunication company, and the second obtained from three years of email communication of 200 managers working for a large multinational company. This work compares network robustness and the stability of centrality and interaction metrics, as well as the use of language, after removing spammers and the most and least connected nodes. The results show that, for most of the social indicators, spammers do not significantly alter the structure of the information-carrying network. The authors additionally investigate the correlation between e-mail subject line and content by tracking language sentiment, emotionality, and complexity, addressing the cases where collecting email bodies is not permitted for privacy reasons. The findings extend the research on the robustness and stability of social network metrics after the application of graph simplification strategies. The results have practical implications for network analysts and for those company managers who rely on network analytics (applied to company emails and social media data) to support their decision-making processes.
Article
Feature selection has become an indispensable part of intelligent systems, especially with the proliferation of high dimensional data. It identifies the subset of discriminative features leading to better learning performance, i.e., higher learning accuracy, lower computational cost and significant model interpretability. This paper proposes a new efficient unsupervised feature selection method based on graph centrality and subspace learning called UGFS for ‘Unsupervised Graph-based Feature Selection’. The method maps features onto an affinity graph where the relationships (edges) between feature nodes are defined by means of data points subspace preference. A feature importance score is then computed on the entire graph using a centrality measure. For this purpose, we investigated Google's PageRank method, originally introduced to rank web pages. The proposed feature selection method has been evaluated using classification and redundancy rates measured on the selected feature subsets. Comparisons with well-known unsupervised feature selection methods, on gene/expression benchmark datasets, demonstrate the validity and the efficiency of the proposed method.
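As an illustration of scoring feature nodes with a centrality measure, the following is a minimal sketch of PageRank by power iteration on a small weighted affinity graph. It is not the UGFS implementation: the feature names, edge weights, damping factor, and iteration count are all assumptions chosen for the example, and the abstract does not say how the affinity weights are actually derived.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted graph.

    `weights[u][v]` is the (hypothetical) affinity of the edge u -> v.
    Every node must appear as a key of `weights`.
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_sum = {v: sum(weights[v].values()) for v in nodes}
    for _ in range(iterations):
        # Teleportation term shared by all nodes.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            if out_sum[u] == 0:
                # Dangling node: spread its rank uniformly.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
            else:
                # Distribute rank in proportion to edge weight.
                for v, w in weights[u].items():
                    new_rank[v] += damping * rank[u] * w / out_sum[u]
        rank = new_rank
    return rank


# Made-up symmetric affinities between three features.
affinity = {
    "f1": {"f2": 0.9, "f3": 0.8},
    "f2": {"f1": 0.9, "f3": 0.1},
    "f3": {"f1": 0.8, "f2": 0.1},
}
scores = pagerank(affinity)
top_feature = max(scores, key=scores.get)
```

In this toy graph the feature with the strongest total affinity to the others receives the highest score; a selection step could then keep the top-scoring features.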
Article
GitHub, one of the largest social coding platforms, offers software developers the opportunity to engage in social activities relating to software development and to store or share their code and projects with the wider community using repositories. Analysis of data representing the social interactions of GitHub users can reveal a number of interesting features. In this paper, we analyze the data to understand user social influence on the platform. Specifically, we propose a Following-Star-Fork-Activity based approach to measure user influence in the GitHub developer social network. We first preprocess the GitHub data and construct the social network. Then, we analyze user influence in the social network in terms of popularity, centrality, content value, contribution and activity. Finally, we analyze the correlation of different user influence measures, and use Borda Count to comprehensively quantify user influence and verify the results.
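The Borda Count aggregation mentioned above can be sketched as follows. This is a generic version of the method under stated assumptions, not the authors' code: the user names and the three per-measure rankings are hypothetical, each ranking is assumed to list the same users from most to least influential, and ties are broken alphabetically.

```python
from collections import defaultdict


def borda_count(rankings):
    """Combine several rankings (best first) into one Borda ordering.

    An item in position p of a ranking with n items earns n - 1 - p points.
    Returns (item, score) pairs sorted by descending score, then by name.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical rankings of three users under three influence measures.
rankings = [
    ["alice", "bob", "carol"],  # e.g. popularity
    ["bob", "alice", "carol"],  # e.g. centrality
    ["alice", "carol", "bob"],  # e.g. activity
]
result = borda_count(rankings)
# result == [("alice", 5), ("bob", 3), ("carol", 1)]
```

Because Borda Count uses only rank positions, it can combine measures that live on very different scales without normalizing them first.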
Article
Nowadays, social microblogging services have become a popular platform for people to express what they think. People use these platforms to produce content in real time on topics ranging from finance, politics and sports to sociological fields. With the proliferation of social microblogging sites, massive amounts of opinion text have become available in digital form, enabling research on sentiment analysis to both deepen and broaden in different sociological fields. Previous sentiment analysis research on microblogging services generally focused on text as the only source of information and did not consider the network information of the social microblogging service. Inspired by social network analysis research and sentiment analysis studies, we find that people's trust in a community plays an important role in determining the community's sentiment polarity about a topic. An examination of the literature shows that trusted users in a community are in fact influential users. Hence, we propose a novel sentiment analysis approach that also takes social network information into account. We concentrate on the effect of influential users on the sentiment polarity of a topic-based microblogging community. Our approach extends the classical sentiment analysis methods, which only consider text content, by adding a novel PageRank-based algorithm for finding influential users. We have carried out a comprehensive empirical study of two real-world Twitter datasets to analyze the correlation between the mood of the financial social community and the behavior of the stock exchange of Turkey, namely BIST100, using the Pearson correlation coefficient. Experimental results validate our assumptions and show that the proposed sentiment analysis method is more effective in finding a topic-based microblogging community's sentiment polarity.