Internet-Wide Scanner Fingerprint Identifier Based
on TCP/IP Header
Akira Tanaka∗† , Chansu Han, Takeshi Takahashi, and Katsuki Fujisawa
National Institute of Information and Communications Technology, Tokyo, Japan.
{tanaka.akira, han, takeshi takahashi}@nict.go.jp
Kyushu University, Fukuoka, Japan. akirat.tanaka@kyudai.jp and fujisawa@imi.kyushu-u.ac.jp
Abstract—Identifying individual scan activities is a crucial and
challenging activity for mitigating emerging cyber threats or
gaining insights into security scans. Sophisticated adversaries
distribute their scans over multiple hosts and operate with
stealth; therefore, low-rate scans hide beneath other benign
traffic. Although previous studies attempted to discover such
stealth scans by observing the distribution of ports and hosts,
well-organized scans are difficult to find. However, a scanner
can embed a fingerprint into the packet fields to distinguish
between the scan and other traffic. In this study, we propose a
new algorithm, inspired by the genetic algorithm, that identifies
flexible fingerprints. To the best of our knowledge, this
is the first such attempt. Through numerical experiments on
darknet traffic, we successfully identified previously unknown
fingerprints in addition to existing ones. We analyzed the packets
and discovered distinctive scan activities. Further, we collated the
results with both cyber threat intelligence and investigation/large-scale
scanner lists to ascertain the reliability of our model.
Index Terms—Internet-wide scan, genetic algorithm, finger-
print, darknet traffic
I. INTRODUCTION
Port scanning serves multiple purposes by probing a server
or host for open ports. Shodan and Censys¹, known as search
engines for Internet-connected devices, build a searchable
database to discover vulnerabilities, their impact, and affected
IoT devices. Network administrators conduct port scans to
determine open ports and services in a penetration test to
verify network security policies. However, an attacker utilizes
a port scan to identify active network services and exploit their
vulnerabilities.
Intrusion detection systems (IDS) or firewalls detect such
adversaries by monitoring the number of packets sent by a
source IP address or a network class. However, sophisticated
adversaries slip through these defense systems by distributing
the scans over multiple hosts, thereby reducing the scan rate.
Durumeric et al. [1] investigated widespread port scanning and
distributed botnet scans. These distributed scans are buried in
background noise such as benign traffic, benign port scans for
investigation purposes, and Internet backscatter [2].
Robertson et al. [3] developed a detection module for stealth
scans, which assumes that cooperating hosts are located in
the same subnet. Further, by inspecting destination ports and
destination IP addresses, Yegneswaran et al. [4] uncovered a
trend in intrusion attempts, wherein a few correlated sources
accounted for a significantly large fraction of intrusions.
Blaise et al. [5] developed a statistical approach that utilizes
a modified Z-score measure for port-level change detection, which
contributes to the early detection of emerging botnets.
¹ https://www.shodan.io/, https://censys.io/
Moreover, existing studies report that scanners embed their
fingerprints into packet fields to distinguish between their
scan and backscatter traffic [6]. The adversary probably reuses
the same fingerprint during a scanning campaign because
creating a fingerprint for each source is inefficient. Snort,
Suricata², and most IDS tools rely on a fingerprint database
to identify attacks that match the given fingerprints. Although
a fingerprint is expressive and easy for network administrators
to understand, two problems remain. First, emerging
unknown threats may be overlooked. Second, it is expensive
to manually generate a fingerprint. Griffioen et al. [6] proposed
a method for finding a fingerprint in TCP and IP header fields
to address this problem. Their numerical experiments verified
its high detection performance for various scenarios, such as
distributing source addresses and limiting packets per host.
However, existing studies assume a fixed-form fingerprint,
thereby overlooking low-profile malware scans with diverse
fingerprints. Therefore, this study proposes an algorithm to au-
tomatically generate fingerprint candidates and select promis-
ing fingerprints. To the best of our knowledge, this is the first
study to automatically generate flexible fingerprints; thus, our
model cannot be compared with other existing models. We
analyzed the packets that have a fingerprint and collated their
source IP addresses with investigation/large-scale scanner lists
to ascertain the reliability of our model.
In summary, this study offers the following contributions:
1) We consider a genetic algorithm (GA) and propose a
new algorithm that automatically identifies the finger-
print embedded in IPv4 and TCP header fields.
2) We demonstrate the feasibility of our approach using
darknet traffic and discover several new fingerprints
previously unknown in the literature.
3) We collate our results with both cyber threat intelligence
and investigation/large-scale scanner lists to ensure the
reliability of our model.
² https://www.snort.org/, https://suricata.io/
TABLE I: Fingerprints of major open-source scanners (Masscan and ZMap) and malware (Mirai, Hajime) [1], [7]. $f_{FB}(\cdot)$ and $f_{LB}(\cdot)$ output the first and last bytes of the input, respectively. Similarly, $f_{L2B}(\cdot)$ outputs the last two bytes of the input. $\oplus$ denotes the bitwise XOR, which performs the logical exclusive OR operation on each pair of corresponding bits.

| Name | Fingerprint |
|---|---|
| Hajime [8] | $x_{p,(\text{tcp.window},14600)} \land \left(x_{p,(f_{FB}(\text{tcp.seq}),0)} \lor x_{p,(f_{LB}(\text{tcp.seq}),0)}\right)$ |
| Masscan [9] | $x_{p,(\text{ip.id} \oplus f_{L2B}(\text{ip.dst}) \oplus \text{tcp.dport} \oplus f_{L2B}(\text{tcp.seq}),0)}$ |
| ZMap [10] | $x_{p,(\text{ip.id},54321)}$ |
| Mirai [11] | $x_{p,(\text{tcp.seq} \oplus \text{ip.dst},0)}$ |
II. PRELIMINARIES
This section provides the preliminaries to our proposed
method and experiments.
A. Header Fields Used for Producing Fingerprints
As shown in Fig. 1, the header fields of the IPv4 and
TCP protocols are categorized into: (a) modifiable fields
(yellow) and (b) unchangeable fields (white). Modifications
of unchangeable fields prevent IPv4 or TCP from working
correctly. For example, a modification of the IPv4 version field
prevents the destination from parsing the packet. Therefore,
fingerprints embedded in a packet are probably made of only
modifiable header fields.
B. Well-known Fingerprints
We define a TCP function $f$ as a function that maps a TCP packet
to a binary number, $\mathcal{F}$ as the set of TCP functions,
and $\mathcal{B}$ as the set of binary numbers. For example, the function
$f$ that accepts a TCP packet $p$ and returns its source address
is a TCP function and is expressed as $f(p) = \text{ip.src}$. A sign
$(f, b)$ is defined as an ordered pair of a TCP function $f$ and a
binary number $b$. A TCP packet $p$ is said to have a sign $(f, b)$
if $f(p) = b$, where $f(p)$ is the output of the TCP function $f$.
The fingerprint of a scanner or malware is composed of the
following indicator variables:
\[
x_{p,(f,b)} =
\begin{cases}
1 & \text{if a packet } p \text{ has a sign } (f, b) \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]
According to a previous study [7], a TCP SYN packet $p$ has a
Hajime [8] signature if it possesses two features: (1) a window
size of 14,600 and (2) the first or last byte of the sequence
number is zero. These two conditions are satisfied if and only
if
\[
x_{p,(\text{tcp.window},14600)} \land \left( x_{p,(f_{FB}(\text{tcp.seq}),0)} \lor x_{p,(f_{LB}(\text{tcp.seq}),0)} \right) = 1
\tag{2}
\]
where $f_{FB}$ and $f_{LB}$ output the first and last bytes of the input,
respectively. Other scanners and malware and their corresponding
fingerprints are summarized in Table I.
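The sign and fingerprint definitions in Eqs. (1) and (2) can be illustrated with a minimal Python sketch. The dictionary-based packet representation and the helper names (`has_sign`, `f_fb`, `f_lb`, `has_hajime_fingerprint`) are hypothetical and are not taken from the authors' implementation. The sketch also assumes that the "first byte" is the most significant byte of the 32-bit sequence number.

```python
from typing import Callable, Dict

# Hypothetical packet representation: header field name -> integer value.
Packet = Dict[str, int]
TcpFunction = Callable[[Packet], int]


def tcp_window(p: Packet) -> int:
    return p["tcp.window"]


def tcp_seq(p: Packet) -> int:
    return p["tcp.seq"]


def f_fb(value: int, width_bytes: int = 4) -> int:
    """First byte, assumed to be the most significant byte of the field."""
    return (value >> (8 * (width_bytes - 1))) & 0xFF


def f_lb(value: int) -> int:
    """Last (least significant) byte of the field."""
    return value & 0xFF


def has_sign(p: Packet, f: TcpFunction, b: int) -> int:
    """Indicator x_{p,(f,b)} of Eq. (1): 1 if f(p) == b, else 0."""
    return 1 if f(p) == b else 0


def has_hajime_fingerprint(p: Packet) -> bool:
    """Eq. (2): window is 14600 AND (first OR last byte of seq is zero)."""
    window_ok = has_sign(p, tcp_window, 14600)
    first_zero = has_sign(p, lambda q: f_fb(tcp_seq(q)), 0)
    last_zero = has_sign(p, lambda q: f_lb(tcp_seq(q)), 0)
    return bool(window_ok and (first_zero or last_zero))
```

Treating each sign as a 0/1 indicator keeps a fingerprint a plain Boolean expression over signs, which matches the form of the entries in Table I.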
C. Internet-wide Scanner
We prepared two Internet-wide scanner lists. In Section IV-C,
these lists are collated with the source IP addresses
that send packets with embedded fingerprints. The details of the
lists are as follows:
Investigation scanner list
    By querying domain names, we obtained 674 IP addresses
    used for investigation purposes. This list covers eighteen
    organizations, such as Shodan, Censys, and BinaryEdge.
Large-scale scanner list
    Following previous research [12], an IP address is called
    a large-scale scanner if it satisfies both of the following
    conditions within a day: (1) the number of unique destination
    ports is at least 30; (2) the number of packets exceeds the
    total IP address space (we used the number estimated from the
    darknet IP addresses). Three hundred and twenty-five IP
    addresses satisfied the criteria on at least one day in
    October 2018, and the list includes all of them.
The two lists have only 13 IP addresses in common,
all of which belong to Shodan.
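The two large-scale scanner conditions can be sketched in Python as follows. This is an illustrative sketch only: the function name, the `(source IP, destination port)` input format, and the `packet_threshold` parameter are assumptions. The paper estimates the packet threshold from the number of darknet IP addresses, and since the text does not give the exact estimate, it is left as an input here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def large_scale_scanners(
    daily_packets: Iterable[Tuple[str, int]],
    packet_threshold: int,
    min_ports: int = 30,
) -> Set[str]:
    """Return source IPs that meet both daily criteria.

    daily_packets: (source IP, destination port) pairs observed in one day.
    packet_threshold: hypothetical input standing in for the paper's
        darknet-based estimate of "the total IP address space".
    """
    packet_count: Dict[str, int] = defaultdict(int)
    ports: Dict[str, Set[int]] = defaultdict(set)
    for src, dport in daily_packets:
        packet_count[src] += 1
        ports[src].add(dport)
    return {
        src
        for src in packet_count
        if len(ports[src]) >= min_ports            # condition (1)
        and packet_count[src] > packet_threshold   # condition (2)
    }
```

Running the function once per day and taking the union of the results would mirror the "at least one day in October 2018" rule described above.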
III. METHODOLOGY
Our approach identifies a fingerprint that is expressed as a
combination of bitwise operations between the header fields
of the IPv4 and TCP protocols. We provide an overview
(Section III-A) and the details of our method in the succeeding
sections.
A. Scanner’s Fingerprint Identifier
Algorithm 1 is the pseudocode for identifying fingerprints,
and the overview is summarized below:
1) Generate TCP functions
   Following the idea of the GA, we generate a new TCP
   function (line 5) from the current TCP functions $n$ times.
   Section III-B explains the initial TCP functions $F_1$.
   Generating a new TCP function is described in
   Section III-C, and we present an analogy between the GA
   and generating TCP functions in Section III-D.
2) Find effective signs
   If many packets have a sign $(f, b)$, then $x_{p,(f,b)}$ is
   probably a component of some fingerprint. Such a sign
   $(f, b)$ is defined as an effective sign, and Section III-E
   provides a method for identifying effective signs.
3) Consolidate effective signs and identify fingerprints
   We calculate co-occurrences between effective signs,
   that is, how often two effective signs hold for a packet
   simultaneously. We unify effective signs with
   high co-occurrences and create a new fingerprint. For
   instance, almost all the packets that have the ZMap
   fingerprint (ip.id = 54321) satisfy tcp.window = 65535 and
   tcp.ack = 0; hence, the ZMap fingerprint is updated to
   ip.id = 54321 ∧ tcp.window = 65535 ∧ tcp.ack = 0.
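The co-occurrence step can be illustrated with a short Python sketch. The paper does not specify the exact co-occurrence measure, so the sketch assumes a conditional ratio: the fraction of packets having sign a that also have sign b. Signs are simplified to (header field, value) pairs, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Packet = Dict[str, int]
Sign = Tuple[str, int]  # simplified sign: (header field name, expected value)


def has_sign(p: Packet, sign: Sign) -> bool:
    field, value = sign
    return p.get(field) == value


def conditional_co_occurrence(
    packets: List[Packet], signs: List[Sign]
) -> Dict[Tuple[Sign, Sign], float]:
    """For each ordered pair (a, b), the share of packets with a that also have b."""
    result: Dict[Tuple[Sign, Sign], float] = {}
    for a in signs:
        with_a = [p for p in packets if has_sign(p, a)]
        if not with_a:
            continue  # ratio undefined when no packet has sign a
        for b in signs:
            if a == b:
                continue
            both = sum(1 for p in with_a if has_sign(p, b))
            result[(a, b)] = both / len(with_a)
    return result
```

With this measure, a pair such as (ip.id = 54321, tcp.window = 65535) scores close to 1 when nearly every ZMap-marked packet also carries the window value, which is the situation that motivates merging the signs into one fingerprint.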
Fig. 1: Header fields of the IPv4 and TCP protocols. Modifiable fields are in yellow, and unchangeable fields are in white.
Algorithm 1: Scanner’s Fingerprint Identifier
Input : n: number of generated TCP functions
        F_1 ⊂ ℱ: initial TCP functions
Output: S: set of fingerprints
 1
 2  Function fgpt_identifier(n, F_1):
        // (1) Generate TCP functions
 3      F ← F_1
 4      for i ← 1 to n do
            // see Algorithm 2
 5          f ← generate_TCPfunction(F)
 6          F.add(f)
        // (2) Find effective signs
 7      E ← ∅
 8      for f ∈ F do
            // see Algorithm 3
 9          E′ ← find_effective_sign(f)
10          E ← E ∪ E′
        // (3) Consolidate effective signs and identify fingerprints
11      S ← consolidate and arrange the effective signs E, and finally identify fingerprints
12      return S
B. Initialization of TCP functions
We define the initial TCP functions $F_1 \subset \mathcal{F}$ as the set of TCP
functions that return modifiable header fields (colored yellow
in Fig. 1). However, some header fields have the same binary
value for almost all packets. We excluded these fields from
the initial TCP functions because a nearly constant field carries
little information for telling packets apart.
C. Generate TCP function
Given a subset of TCP functions $F \subset \mathcal{F}$, our algorithm
generates a new TCP function $f$ by leveraging either (1)
feature extraction or (2) binary operation, with a probability
of $r$ or $1 - r$, respectively. $r \in [0, 1]$ is a hyperparameter
that determines the priority of the two operations. In feature
extraction, we first select a TCP function $f$ from $F$ and
generate a new TCP function $k \circ f$, where $k : \mathcal{B} \to \mathcal{B}$ is
selected from predefined binary-valued functions $K$ with a
discrete uniform distribution. For example, $k \circ f = f_{L2B} \circ \text{ip.dst}$
transforms a packet into the last two bytes of the destination
address. Conversely, the binary operation chooses two TCP
functions $f, g$ and produces a new TCP function $\psi(f, g)$,
where $\psi$ is chosen from predefined binary operations $\Psi$ with
a discrete uniform distribution. If $f = \text{tcp.seq}$, $g = \text{ip.dst}$, and
$\psi$ is the bitwise XOR, the TCP function $\psi(f, g)$ transforms a
packet into the XOR result between the sequence number and
destination address, denoted by $\text{tcp.seq} \oplus \text{ip.dst}$.
A TCP function $f$ is built through some function compositions,
and the minimum number of compositions plus one is denoted by
$\tau_{count}(f)$:
\[
\tau_{count}(f) := \min\{\, i \mid f \in F_i \,\}, \quad \text{where}
\tag{3}
\]
\[
F_i := \{\, k \circ f \mid k \in K,\ f \in F_{i'}\ (i' < i) \,\}
\cup \{\, \psi(f, g) \mid \psi \in \Psi,\ f \in F_{i_1},\ g \in F_{i_2}\ (i_1, i_2 < i) \,\}
\tag{4}
\]
for $i = 2, 3, \cdots$. For instance,
\begin{align}
& \tau_{count}\left( f_{L2B}(\text{tcp.seq}) \oplus f_{L2B}(\text{ip.dst}) \right) \tag{5} \\
&= \tau_{count}\left( f_{L2B}(\psi(\text{tcp.seq}, \text{ip.dst})) \right) \tag{6} \\
&= 3 \tag{7}
\end{align}
where $\psi$ denotes the bitwise XOR.
The pseudocode for generating a new TCP function is
written as the generate_TCPfunction in Algorithm 2.
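As a rough illustration of Eqs. (3)–(7), the following Python sketch represents a TCP function as an expression tree and computes its structural depth. The class names are hypothetical, and the sequence number field is written as `tcp.seq`. Note that Eq. (3) takes the minimum over all ways of constructing the same function, so the depth of one particular tree is only an upper bound on $\tau_{count}$.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Field:
    """An initial TCP function that returns a header field (element of F_1)."""
    name: str


@dataclass(frozen=True)
class Unary:
    """Feature extraction k∘f, e.g. k = f_L2B."""
    k: str
    arg: "Expr"


@dataclass(frozen=True)
class Binary:
    """Binary operation psi(f, g), e.g. psi = bitwise XOR."""
    psi: str
    left: "Expr"
    right: "Expr"


Expr = Union[Field, Unary, Binary]


def tau_count(e: Expr) -> int:
    """Structural depth of one construction: leaves count as 1."""
    if isinstance(e, Field):
        return 1
    if isinstance(e, Unary):
        return tau_count(e.arg) + 1
    return max(tau_count(e.left), tau_count(e.right)) + 1


seq, dst = Field("tcp.seq"), Field("ip.dst")
# Construction in Eq. (5): XOR of the last two bytes of each field.
eq5 = Binary("xor", Unary("f_L2B", seq), Unary("f_L2B", dst))
# Construction in Eq. (6): last two bytes of the XOR of the two fields.
eq6 = Unary("f_L2B", Binary("xor", seq, dst))
```

Both constructions have depth 3, in agreement with Eq. (7).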
D. Analogy between Genetic Algorithm and Generating TCP
functions
The GA pursues high-quality solutions for optimization
problems in broad research fields [13]. In GA, a population
of candidate solutions (individuals) iteratively evolves toward
better solutions via biologically inspired operators, such as
mutation, crossover, and selection. In each iteration, the fitness
of every individual is evaluated, and the fit individuals are
stochastically selected from the current population. Subsequently,
these individuals are modified via recombination
or random mutation.
Although we cannot directly apply the GA to identifying
fingerprints, we borrow its idea and build a new algorithm
to generate TCP functions. In Algorithm 2, a TCP function
represents an individual, and the fitness of each TCP function
is assessed via $\tau_{count}$. Specifically, a TCP function $f$ that
has a smaller $\tau_{count}(f)$ tends to be chosen, which implies that
we seek simpler TCP functions. Feature extraction
and binary operation correspond to mutation and
crossover, respectively: the outputs of these operations inherit
the features of their inputs, as is the case with mutation and
crossover.
Algorithm 2: Generate TCP function
Input : F ⊂ ℱ: TCP functions
        K: binary-valued functions (k ∈ K implies k : ℬ → ℬ)
        Ψ: binary operations on ℱ (ψ ∈ Ψ implies ψ : ℱ × ℱ → ℱ)
        r ∈ [0, 1]: probability of feature extraction
Output: f: a new TCP function

Function select_TCP_function(F):
    f ← select f ∈ F with the probability
        (1/τcount(f))² / Σ_{f′ ∈ F} (1/τcount(f′))²
    return f

Function feature_extraction(F, K):
    f ← select_TCP_function(F)
    k ← select k ∈ K with a discrete uniform distribution
    return k ∘ f

Function binary_operation(F, Ψ):
    f ← select_TCP_function(F)
    g ← select_TCP_function(F)
    ψ ← select ψ ∈ Ψ with a discrete uniform distribution
    return ψ(f, g)

Function generate_TCPfunction(F):
    x ← Uniform(0, 1)   // random number from [0, 1]
    if x ≤ r then
        f ← feature_extraction(F, K)
    else
        f ← binary_operation(F, Ψ)
    return f
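Algorithm 2 can be sketched in Python as follows. This is a minimal sketch rather than the authors' code: TCP functions are represented as nested tuples (a hypothetical encoding), selection weights are assumed to be proportional to $(1/\tau_{count}(f))^2$, and `tau_count` uses the structural depth of the tuple as a stand-in for $\tau_{count}$.

```python
import random
from typing import List, Union

# Hypothetical encoding of a TCP function:
#   "ip.dst"                      -> initial header-field function
#   ("k", "f_L2B", expr)          -> feature extraction k∘f
#   ("psi", "xor", expr, expr)    -> binary operation psi(f, g)
Expr = Union[str, tuple]


def tau_count(e: Expr) -> int:
    """Structural depth, used here as a stand-in for tau_count."""
    if isinstance(e, str):
        return 1
    if e[0] == "k":
        return tau_count(e[2]) + 1
    return max(tau_count(e[2]), tau_count(e[3])) + 1


def select_tcp_function(F: List[Expr], rng: random.Random) -> Expr:
    """Pick f from F with weight (1 / tau_count(f))^2, favoring simple functions."""
    weights = [(1.0 / tau_count(f)) ** 2 for f in F]
    return rng.choices(F, weights=weights, k=1)[0]


def generate_tcp_function(
    F: List[Expr], K: List[str], Psi: List[str], r: float, rng: random.Random
) -> Expr:
    """With probability r apply feature extraction, otherwise a binary operation."""
    if rng.random() <= r:
        f = select_tcp_function(F, rng)
        return ("k", rng.choice(K), f)
    f = select_tcp_function(F, rng)
    g = select_tcp_function(F, rng)
    return ("psi", rng.choice(Psi), f, g)
```

Calling `generate_tcp_function` n times and appending each result to `F` corresponds to lines 3–6 of Algorithm 1. Because newly generated functions can themselves be selected later, deeper compositions remain reachable even though the weights favor shallow ones.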
E. Find Effective Signs
We describe a method for identifying the effective signs
$(f, b)$, given a TCP function $f$. We examine the appearance
ratio of each binary number among the outputs of $f$.
Let $f$ be a TCP function, and let $P = \{p\}_p$ be packets
collected from network traffic. For each binary number $b \in \mathcal{B}$, we
define the appearance ratio of $b$ as
\[
r_a(b) := \frac{\#\{\, p \in P \mid f(p) = b \,\}}{\#P}
\tag{8}
\]
where $\#A$ denotes the number of elements in set $A$. The
image of the packets $P = \{p\}_p$ under $f$ is defined by
\[
B := f(P) = \{\, f(p) \mid p \in P \,\} \subset \mathcal{B}.
\tag{9}
\]
$R := (r_a(f(p)))_{p \in P}$ denotes a multiset (not a set) whose
underlying set is $\{\, r_a(b) \mid b \in B \,\}$. For every real number
$\alpha$, we define $R_{<\alpha} \subset R$ ($R_{\le\alpha} \subset R$) as the multiset composed
of every element $r \in R$ such that $r < \alpha$ ($r \le \alpha$). The
population variance of a multiset $A$ is denoted by $\sigma^2_A$. We
define the effective indicator of $b \in B$ for $f \in \mathcal{F}$ as
\[
e_f(b) :=
\begin{cases}
\dfrac{\sigma^2_{R_{\le r_a(b)}}}{\sigma^2_{R_{< r_a(b)}}} & \left(\text{if } \sigma^2_{R_{< r_a(b)}} > 0\right) \\[2ex]
\text{NULL} & (\text{otherwise})
\end{cases}
\tag{10}
\]
where NULL implies that $e_f(b)$ cannot be defined. The effective
indicator evaluates the extent to which the binary number $b \in B$
influences the variance of the appearance ratio, and a larger value
of $e_f(b)$ implies that $(f, b)$ is an effective sign. Algorithm 3
describes the pseudocode for finding effective signs.

Algorithm 3: Find Effective Sign
Input : f: a TCP function
        P = {p}_p: packets
        max_sign: maximum number of effective signs per TCP function
        sign_thres: threshold of effective signs
Output: E = {(f, b)}: effective signs

Function find_effective_sign(f):
    B ← {f(p) | p ∈ P}
    sorted_B ← arrange B in descending order of the appearance ratio r_a(b)
    max_idx ← NULL
    for i ← 0 to max_sign − 1 do
        b ← sorted_B[i]
        if e_f(b) > sign_thres then
            max_idx ← i
    E ← ∅   // initializes effective signs
    if max_idx ≠ NULL then
        for i ← 0 to max_idx do
            E.add((f, sorted_B[i]))
    return E
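The procedure of Algorithm 3 can be sketched in Python as follows. The sketch assumes that the effective indicator of Eq. (10) is the ratio of the population variance of $R_{\le r_a(b)}$ to that of $R_{< r_a(b)}$, and it takes the precomputed outputs $f(p)$ as input instead of the packets themselves. The function name and the default threshold are hypothetical.

```python
from collections import Counter
from statistics import pvariance
from typing import Hashable, List, Optional, Sequence


def find_effective_sign(
    values: Sequence[Hashable], max_sign: int = 10, sign_thres: float = 10.0
) -> List[Hashable]:
    """Return the values b for which (f, b) is kept as an effective sign.

    values: the outputs f(p) for every packet p in P (one entry per packet).
    sign_thres: hypothetical default; the paper tunes this threshold.
    """
    n = len(values)
    counts = Counter(values)
    ratio = {b: c / n for b, c in counts.items()}   # appearance ratio r_a(b)
    R = [ratio[v] for v in values]                  # multiset, one entry per packet

    def indicator(b: Hashable) -> Optional[float]:
        lower = [r for r in R if r < ratio[b]]
        upto = [r for r in R if r <= ratio[b]]
        if not lower:
            return None                             # variance undefined: NULL
        var_lower = pvariance(lower)
        if var_lower <= 0:
            return None                             # NULL case of Eq. (10)
        return pvariance(upto) / var_lower

    sorted_b = sorted(ratio, key=ratio.get, reverse=True)
    max_idx = None
    for i in range(min(max_sign, len(sorted_b))):
        e = indicator(sorted_b[i])
        if e is not None and e > sign_thres:
            max_idx = i
    if max_idx is None:
        return []
    # As in Algorithm 3, every value ranked at or above max_idx is kept.
    return sorted_b[: max_idx + 1]
```

For example, if one value appears in 60 of 70 packets while the remaining values appear only a handful of times each, including the dominant value raises the variance of the appearance ratios by orders of magnitude, so only that value is returned.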
IV. EXPERIMENTS
We applied our model to darknet traffic and demonstrated
its feasibility. Section IV-A explains the dataset used in our
experiments, and the parameters of our model are summarized
in Section IV-B. We analyze the packets that have a fingerprint
in Section IV-C.
A. Dataset
Our dataset was collected from a darknet operated by
NICTER³. The darknet, also known as a network telescope,
passively monitors network traffic with an unreachable dark IP
address block. The darknet is an effective system for observing
indiscriminate Internet-wide scans because it does not receive
benign and regular network traffic. Because TCP SYN packets
are used to survey active hosts and open ports [6], our
experiment used only TCP SYN packets. Table II summarizes
the dataset and the computation time of our algorithm.
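Restricting the input to TCP SYN packets can be sketched as a check on the TCP flags byte. The sketch assumes that a "TCP SYN packet" means a packet with the SYN flag set and the ACK flag cleared, so that SYN-ACK backscatter is excluded. The paper does not spell out its filter, and the function name is hypothetical.

```python
# Standard TCP flag bit masks within the flags byte.
TCP_FLAG_SYN = 0x02
TCP_FLAG_ACK = 0x10


def is_syn_probe(tcp_flags: int) -> bool:
    """True for SYN without ACK (assumed definition of a TCP SYN packet)."""
    return bool(tcp_flags & TCP_FLAG_SYN) and not (tcp_flags & TCP_FLAG_ACK)
```

Note that this flag test is independent of the acknowledgment number field (tcp.ack) that appears in the fingerprints of Table III.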
B. Parameter Setting
The following parameters were selected manually based on
empirical investigation. In Algorithm 1, we set $n = 2{,}000$ and
\[
F_1 = \{\text{ip.id}, \text{ip.checksum}, \text{ip.src}, \text{ip.dstaddr}\}
\cup \{\text{tcp.sport}, \text{tcp.dport}, \text{tcp.seq}, \text{tcp.window}\}.
\tag{11}
\]
Algorithm 2 adopts $r = 0.1$, $K = \{f_{F2B}, f_{L2B}\}$, and
$\Psi = \{\psi\}$, where $\psi : \mathcal{F} \times \mathcal{F} \to \mathcal{F}$ satisfies
$\psi(f, g) : p \mapsto f(p) \oplus g(p)$ for every packet $p$. max_sign = 10 is used in
Algorithm 3, and sign_thres is determined such that the number
of effective signs ranges from 20 to 50. The computation
time of Algorithm 1 is proportional to both $n$ and the number of
packets in $P = \{p\}_p$. $F_1$, $K$, and $\Psi$ specify the search space of
the TCP functions. max_sign and sign_thres determine the threshold
for effective signs.
³ Network Incident Analysis Center for Tactical Emergency Response: https://www.nicter.jp/en

TABLE II: Summary of our dataset and the computation time of Algorithm 1

| Situation | #IPs§ | Period* | #Packets | Time† |
|---|---|---|---|---|
| Implement Algorithm 1 | 4,096 | 10/22–10/24 | 572,289‡ | 6.5 h |
| Analyze packets that have a fingerprint | 4,096 | 10/22–10/28 | 117.5×10^6 | 18 h |

* The year was 2018, and the period was selected owing to active malicious activities [7], [14].
§ The number of IP addresses of our darknet.
† We implemented Algorithm 1 using Python and one core on a server with an AMD EPYC 7H12 (64 CPUs and 56 GB RAM). Elasticsearch was utilized for the database of packets. The computation time only includes (1) generating TCP functions and (2) finding effective signs (lines 2–10).
‡ We used 1 h worth of packets because of the computation cost (the total number of packets is 41.2×10^6).
C. Analyzing Packets that have a Fingerprint
For the TCP SYN packets, we applied Algorithm 1 five
times while excluding packets that already had an identified
fingerprint. Multiple applications make it possible to identify a
fingerprint carried by only a small number of packets. We identified
three known and six unknown fingerprints, as summarized in
Table III. Some IP addresses sent numerous packets that had a
fingerprint but whose characteristics differed from those of the
other packets with the same fingerprint; therefore, we eliminated
the packets sent by these IP addresses. Although each identified
unknown fingerprint accounted for less than 0.5% of the total
packets, our model could detect them. We identified the well-known
fingerprints summarized in Table I, except for Hajime, which is
regarded as a trivial attack because of its small number of packets
(0.02% of the total packets). We categorized the identified
fingerprints into attack or investigation purposes, and their major
features are summarized below.
Attack purpose (Mirai botnet, Botnet1, and Attack2–6)
  • Destination ports are related to some vulnerability.
  • None of the source IP addresses of the Mirai botnet, Botnet1,
    and Attack2–5 are included in the investigation
    or large-scale scanner lists, which are defined in
    Section II-C.
Investigation purpose (ZMap and Masscan)
  • Destination ports cover a wide range of ports.
  • Source IP addresses overlap with the investigation and
    large-scale scanner lists.
1) Attack Purpose Scan: As summarized in Table III, the
incessant packets with the Mirai fingerprint account for 14.31% of
the total TCP SYN packets. Mirai targets 23/TCP (53.8%) and
2323/TCP (6.0%) as destination ports for the brute-force login
phase, where each percentage indicates the ratio among the packets
with the Mirai fingerprint. Destination port 4444/TCP (6.0%)
was used for an infection campaign against Android devices.
The notable characteristics of Botnet1 are that (1) the
destination port is 5431, (2) the source port is 6, and (3)
the window size is 65535. Moreover, Botnet1 excludes the
packets with ip.id = 54321, which is the ZMap fingerprint.
The destination port 5431/TCP and the 154 K unique source
IP addresses indicate that Botnet1 aims to infect routers
with the Broadcom UPnP feature enabled⁴.
Both Attack2 and Attack3 satisfy ip.id = 256, tcp.window = 16384,
and $f_{L2B}(\text{tcp.seq}) = 0$. Attack2 uses the fixed source port 6000/TCP,
whereas Attack3 uses any port except 6000/TCP. Among the destination
ports of Attack2 and Attack3, 3306/TCP, 1433/TCP, and 60001/TCP
account for more than 14%, 14%, and 15%, respectively. The
adversary probes for SQL vulnerabilities through 3306/TCP
and 1433/TCP [15], [16] and for the vulnerability of the Jaws Web
Server (EDB-ID: 41471⁵) on 60001/TCP.
All the destination ports of Attack4 and Attack5 are dynamic ports
(in the range of 49152 to 65535).
Attack6 targets 80/TCP (31.0%), 8080/TCP (26.9%),
85/TCP (26.8%), and 443/TCP (14.9%). According to the NICTER
observation report 2018⁶, the packets to ports 80, 443, and 8080
originated from attacks on GPON home routers (CVE-2018-10561 and
CVE-2018-10562⁷).
2) Security Scan: Because almost all the packets with the
original ZMap fingerprint (ip.id = 54321) satisfy tcp.ack = 0 and
tcp.window = 65535, the ZMap fingerprint is updated, as summarized
in Table III. Collating the source IP addresses with the
investigation scanner list, we found 64 addresses from Censys,
denoted by Censys(64), as well as BinaryEdge(12),
Security.ipip.net(21), and Shadowserver(166). In short, the
investigation and large-scale scanner lists account for 9.42% and
22.16%, respectively. We observed a constant flow of packets, with
three spikes caused by a few IP addresses. Further, ZMap uses a
wide variety of source ports.
Masscan accounts for 67.14% of the total TCP SYN
packets, and its source IP addresses include BinaryEdge(1),
Censys(2), Onyphe(1), Security.ipip.net(3), Shadowserver(11),
and Shodan(6). Moreover, 77.38% of Masscan’s packets are from
the IP addresses in the large-scale scanner list.
⁴ https://blog.netlab.360.com/bcmpupnp_hunter-a-100k-botnet-turns-home-routers-to-email-spammers-en/
⁵ EDB-ID represents the identification number in the Exploit Database.
⁶ https://www.nicter.jp/en/report
⁷ https://nvd.nist.gov/

TABLE III: Fingerprints identified by Algorithm 1. $\neg x$ denotes the negation of $x$, and K denotes $10^3$.

| Name | Packets (%) | #source IP addresses (%) | Fingerprint |
|---|---|---|---|
| Mirai Botnet | 16,812 K (14.31%) | 366,358 (25.851%) | $x_{p,(\text{tcp.seq} \oplus \text{ip.dst},0)} \land x_{p,(\text{tcp.ack},0)}$ |
| Botnet1 | 439 K (0.37%) | 154,136 (10.876%) | $x_{p,(\text{tcp.dport},5431)} \land x_{p,(\text{tcp.sport},6)} \land x_{p,(\text{tcp.window},65535)} \land x_{p,(\text{tcp.seq},0)} \land x_{p,(\text{tcp.ack},0)} \land \neg x_{p,(\text{ip.id},54321)}$ |
| Attack2 | 302 K (0.26%) | 61 (0.004%) | $x_{p,(\text{ip.id},256)} \land x_{p,(\text{tcp.window},16384)} \land x_{p,(f_{L2B}(\text{tcp.seq}),0)} \land x_{p,(\text{tcp.ack},0)} \land x_{p,(\text{tcp.sport},6000)}$ |
| Attack3 | 477 K (0.41%) | 71 (0.001%) | $x_{p,(\text{ip.id},256)} \land x_{p,(\text{tcp.window},16384)} \land x_{p,(f_{L2B}(\text{tcp.seq}),0)} \land x_{p,(\text{tcp.ack},0)} \land \neg x_{p,(\text{tcp.sport},6000)}$ |
| Attack4 | 423 K (0.36%) | 18 (0.001%) | $x_{p,(f_{L2B}(\text{ip.dst}) \oplus \text{tcp.dport},0)} \land x_{p,(\text{tcp.ack},1)} \land x_{p,(\text{ip.id},0)} \land x_{p,(\text{tcp.window},17520)} \land x_{p,(\text{tcp.sport},80)}$ |
| Attack5 | 268 K (0.23%) | 98 (0.007%) | $x_{p,(f_{L2B}(\text{ip.dst}) \oplus \text{tcp.dport},0)} \land x_{p,(\text{tcp.ack},1)} \land x_{p,(\text{ip.id},38993)}$ |
| Attack6 | 88 K (0.07%) | 56 (0.004%) | $x_{p,(\text{tcp.window},1300)} \land x_{p,(\text{tcp.ack},0)}$ |
| ZMap | 9,441 K (8.04%) | 4,117 (0.291%) | $x_{p,(\text{ip.id},54321)} \land x_{p,(\text{tcp.ack},0)} \land x_{p,(\text{tcp.window},65535)}$ |
| Masscan | 78,915 K (67.17%) | 1,650 (0.116%) | $x_{p,(\text{ip.id} \oplus f_{L2B}(\text{ip.dst}) \oplus \text{tcp.dport} \oplus f_{L2B}(\text{tcp.seq}),0)} \land x_{p,(\text{tcp.ack},0)}$ |

V. CONCLUSION
Drawing on the idea of the genetic algorithm, this study proposes a
new algorithm that automatically identifies the fingerprints
embedded in packet header fields. Numerical experiments
using darknet traffic demonstrated the feasibility of our model
by identifying previously unknown fingerprints in addition to
existing ones. We analyzed the packets carrying these fingerprints
and revealed characteristic scan activities that accounted for less
than 0.5% of the total packets. These results were collated with both
cyber threat intelligence and investigation/large-scale scanner
lists to ascertain the reliability of the fingerprints. In the next
step, we will perform parallel computations for speedup to
integrate our model into a real-time system. Furthermore, we
aim to identify more reliable fingerprints by applying our
method to packets from dynamic malware analysis. Finally,
we will build a system that automatically associates identified
fingerprints with open-source threat intelligence [17].
ACKNOWLEDGMENT
The study was partly conducted under a contract of “MITIGATE”
of the Research and Development for Expansion of
Radio Wave Resources (JPJ000254), which was supported by
the Ministry of Internal Affairs and Communications, Japan.
REFERENCES
[1] Z. Durumeric, M. Bailey, and J. A. Halderman, “An internet-wide view of internet-wide scanning,” Proc. 23rd USENIX Secur. Symp., pp. 65–78, 2014.
[2] N. Blenn, V. Ghiëtte, and C. Doerr, “Quantifying the Spectrum of Denial-of-Service Attacks through Internet Backscatter,” in Proc. 12th Int. Conf. Availability, Reliab. Secur., ser. ARES ’17, 2017.
[3] S. Robertson, E. V. Siegel, M. Miller, and S. J. Stolfo, “Surveillance detection in high bandwidth environments,” Proc. DARPA Inf. Surviv. Conf. Expo. DISCEX, vol. 1, pp. 130–138, 2003.
[4] V. Yegneswaran, P. Barford, and J. Ullrich, “Internet intrusions: Global characteristics and prevalence,” Perform. Eval. Rev., vol. 31, no. 1, pp. 138–147, 2003.
[5] A. Blaise, M. Bouet, V. Conan, and S. Secci, “Detection of zero-day attacks: An unsupervised port-based approach,” Comput. Networks, vol. 180, no. January, 2020.
[6] H. Griffioen and C. Doerr, “Discovering Collaboration: Unveiling Slow, Distributed Scanners based on Common Header Field Patterns,” in IEEE/IFIP Netw. Oper. Manag. Symp., 2020, pp. 1–9.
[7] C. Han, J. Shimamura, T. Takahashi, et al., “Real-Time Detection of Global Cyberthreat Based on Darknet by Estimating Anomalous Synchronization Using Graphical Lasso,” IEICE Trans. Inf. Syst., vol. 103, no. 10, pp. 2113–2124, 2020.
[8] S. Herwig, K. Harvey, G. Hughey, et al., “Measurement and analysis of Hajime, a peer-to-peer IoT botnet,” in Netw. Distrib. Syst. Secur. Symp., 2019.
[9] R. D. Graham, “MASSCAN: Mass IP port scanner.” [Online]. Available: https://github.com/robertdavidgraham/masscan
[10] Z. Durumeric, E. Wustrow, and J. A. Halderman, “ZMap: Fast Internet-wide Scanning and Its Security Applications,” in Proc. USENIX Secur. Symp., 2013, pp. 605–620.
[11] M. Antonakakis, T. April, M. Bailey, et al., “Understanding the Mirai Botnet,” in 26th USENIX Secur. Symp., 2017, pp. 1093–1110.
[12] Y. Endo, Y. Mori, J. Shimamura, and M. Kubo, “Proposing Criteria for Detecting Internet-Wide Scanners for Darknet Monitoring,” in IEICE Tech. Rep., ser. ICSS2019-80, vol. 119, no. 437, Okinawa, Mar. 2020, pp. 73–78, (in Japanese).
[13] Z. Z. Wang and A. Sobey, “A comparative review between Genetic Algorithm use in composite optimisation and the state-of-the-art in evolutionary computation,” Compos. Struct., vol. 233, 2020.
[14] C. Han, J. Takeuchi, T. Takahashi, and D. Inoue, “Automated detection of malware activities using nonnegative matrix factorization,” in 20th IEEE International Conference On Trust, Security And Privacy In Computing And Communications (TrustCom), 2021.
[15] K. Goseva-Popstojanova, B. Miller, R. Pantev, and A. Dimitrijevikj, “Empirical analysis of attackers activity on multi-tier web systems,” Proc. Int. Conf. Adv. Inf. Netw. Appl. AINA, pp. 781–788, 2010.
[16] T. Battle, “GIAC Certified Incident Handler Practical,” Methods, 2003.
[17] T. Takahashi, Y. Umemura, C. Han, T. Ban, K. Furumoto, O. Nakamura, K. Yoshioka, J. Takeuchi, N. Murata, and Y. Shiraishi, “Designing comprehensive cyber threat analysis platform: Can we orchestrate analysis engines?” in 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops). IEEE, Mar. 2021.
... Our preliminary work [46] devised a method for identifying fingerprints that are represented by TCP/IP headers and operations on them. Although the method identified most of the well-known and some unknown fingerprints, it overlooked low-rate and coordinated scanners. ...
... A TCP function is a feature function that calculates a feature of a packet. We can produce TCP functions by combining TCP or IP header fields manually or by applying the method in our preliminary work [46] (see Appendix A). ...
... We utilize two methods to identify effective signs. Our preliminary work [46] (see Section III-E) proposed the first method that regarded a pattern as a fingerprint if many packets had the pattern. The second method prioritizes the number of hosts over the number of packets that possesses a pattern. ...
Article
Full-text available
Adversaries perform port scanning to discover accessible and vulnerable hosts as a prelude to cyber havoc. A darknet is a cyberattack observation network to capture these scanning activities through reachable yet unused IP addresses. However, the enormous amount of packets and superposition of diverse scanning strategies prevent extracting significant insights from the aggregate traffic. Some coordinated scanners disperse probe packets whose TCP/IP header follows a unique pattern to determine whether the received packets are valid responses to their probes or are part of other background traffic. We call such a pattern a fingerprint. For example, a probe packet from a Mirai-infected host satisfies a pattern whereby the destination IP address equals the sequence number. A fingerprint indicates that the source host has been involved in a particular scanning campaign. Although some fingerprints have been discovered and known to the public, there are and will be more undiscovered ones. We intend to unveil these fingerprints. Our preliminary work automatically identified flexible fingerprints but overlooked low-rate and coordinated scanners. In this work, we improved the fingerprint identifier, enabling it to detect these stealth scans. Moreover, we revealed the scans’ objectives by investigating destination port sets. We associated fingerprints with threat intelligence and verified their reliability. Our approach identified all well-known and eight unknown fingerprints on one month’s worth of darknet data collected from about three-hundred thousand unused IP addresses. We disclosed the fingerprints of the Mozi botnet and destination port sets that were previously unreported.
... We briefly describe related works now. Conventional studies [Mazel et al., 2017,Griffioen and Doerr, 2020, Tanaka et al., 2021 that identify scanners in a rule-based manner are unable to analyze unknown groups, perform fine-grained groupings, and perform long-term tracing. The clustering method of scanners [Cohen et al., 2020] also cannot perform long-term tracing of clusters when scanning hosts' IP addresses change. ...
... The scan target hosts were at IP addresses from 111.221.46.0 to 111.221.46.255, 256 ports in the benchmarking tests, as shown in Figure 2. This is because the target port for network scanning is port 80 on each IP address, which is commonly used for web traffic with HTTP services [23]. The benchmarking tests implement three network scanners, Nmap, Zmap, and Masscan. ...
Article
Full-text available
span>The growth of information and communication technology has made the internet network have many users. On the other side, this increases cybercrime and its risks. One of the main attack targets is network weakness. Therefore, cyber security is required, which first does a network scan to stop the attack. Points of vulnerability on the network can be discovered using scanning techniques. Furthermore, mitigation or recovery measures can be implemented. However, it needs a short response time and high accuracy while scanning to reduce the level of damage caused by cyber-attacks. In this paper, the proposed method improves the performance of a vulnerability management system based on network and port scanning by combining the benchmarking and scenario planning models. On a network scanning to discover open ports on a subnet, Masscan can achieve response times of less than 2 seconds, and on scenario planning for detection on a single host by Nmap can reach less than 4 seconds. It was combining both models obtained an adequate optimization response time. The total response time is less than 6 seconds.</span
... 1) Automatic Fingerprint Identification: This research is already in progress and we have published some results [14]. Scanners may embed their own fingerprints (signatures) in the TCP/IP fields of scan packets in order to receive distinct scan responses from the other packets [15]. ...
... It has also been reported in [46] that fingerprints are provided to distinguish scan results from backscatters. In contrast, Tanaka et al. proposed a method based on a genetic algorithm to automatically identify fingerprints embedded in TCP/IP headers from darknet traffic [70]. They succeeded in identifying unknown fingerprints from data corresponding to a short period. ...
Article
As cyberattacks become increasingly prevalent globally, there is a need to identify trends in these cyberattacks and take suitable countermeasures quickly. The darknet, an unused IP address space, is relatively conducive to observing and analyzing indiscriminate cyberattacks because of the absence of legitimate communication. Indiscriminate scanning activities by malware to spread their infections often show similar spatiotemporal patterns, and such trends are also observed on the darknet. To address the problem of early detection of malware activities, we focus on anomalous synchronization of spatiotemporal patterns observed in darknet traffic data. Our previous studies proposed algorithms that automatically estimate and detect anomalous spatiotemporal patterns of darknet traffic in real time by employing three independent machine learning methods. In this study, we integrated the previously proposed methods into a single framework, which we refer to as Dark-TRACER, and conducted quantitative experiments to evaluate its ability to detect these malware activities. We used darknet traffic data from October 2018 to October 2020 observed in our large-scale darknet sensors (up to /17 subnet scales). The results demonstrate that the weaknesses of the methods complement each other, and the proposed framework achieves an overall 100% recall rate. In addition, Dark-TRACER detects malware activities an average of 153.6 days earlier than when those malware activities are revealed to the public by reputable third-party security research organizations. Finally, we evaluated the cost of human analysis to implement the proposed system and demonstrated that two analysts can perform the daily operations necessary to operate the framework in approximately 7.3 h.
Article
Darknets are probes listening to traffic reaching IP addresses that host no services. Traffic reaching a darknet results from the actions of internet scanners, botnets and possibly misconfigured hosts. Such peculiar nature of the darknet traffic makes darknets a valuable instrument to discover malicious online activities, e.g., identifying coordinated actions performed by bots or scanners. However, the massive amount of packets and sources that darknets observe makes it hard to extract meaningful insights, calling for scalable tools to automatically identify and group sources that share similar behaviour. We here present i-DarkVec, a methodology to learn meaningful representations of darknet traffic. i-DarkVec leverages Natural Language Processing techniques (e.g., Word2Vec) to capture the co-occurrence patterns that emerge when scanners or bots launch coordinated actions. As in NLP problems, the embeddings learned with i-DarkVec enable several new machine learning tasks on the darknet traffic, such as identifying clusters of senders engaged in similar activities. We extensively test i-DarkVec and explore its design space in a case study using real darknets. We show that with a proper definition of services, the learned embeddings can be used to (i) solve the classification problem to associate unknown sources’ IP addresses to the correct classes of coordinated actors, and (ii) automatically identify clusters of previously unknown sources performing similar attacks and scans, easing the security analyst’s job. i-DarkVec leverages a novel incremental embedding learning approach that is scalable and robust to traffic changes, making it applicable to dynamic and large-scale scenarios.
Article
The Mirai botnet, composed primarily of embedded and IoT devices, took the Internet by storm in late 2016 when it overwhelmed several high-profile targets with massive distributed denial-of-service (DDoS) attacks. In this paper, we provide a seven-month retrospective analysis of Mirai's growth to a peak of 600k infections and a history of its DDoS victims. By combining a variety of measurement perspectives, we analyze how the botnet emerged, what classes of devices were affected, and how Mirai variants evolved and competed for vulnerable hosts. Our measurements serve as a lens into the fragile ecosystem of IoT devices. We argue that Mirai may represent a sea change in the evolutionary development of botnets: the simplicity through which devices were infected, and its precipitous growth, demonstrate that novice malicious techniques can compromise enough low-end devices to threaten even some of the best-defended targets.
Article
With the rapid evolution and increase of cyberthreats in recent years, it is necessary to detect and understand them promptly and precisely to reduce their impact. A darknet, which is an unused IP address space, has a high signal-to-noise ratio, so it is easier to understand the global tendency of malicious traffic in cyberspace than with other observation networks. In this paper, we aim to capture global cyberthreats in real time. Since multiple hosts infected with similar malware tend to behave similarly, we propose a system that estimates the degree of synchronization from the patterns of packet transmission time among the source hosts observed in unit time of the darknet and detects anomalies in real time. In our evaluation, we use a proof-of-concept implementation of the proposed engine to demonstrate its feasibility and effectiveness, and we detect cyberthreats with an accuracy of 97.14%. This work is the first practical trial that detects cyberthreats from in-the-wild darknet traffic regardless of new types and variants in real time, and it quantitatively evaluates the result.
Article
Recent years have witnessed more and more DDoS attacks against high-profile websites, such as the Mirai botnet attack in September 2016 or, more recently, the memcached attack in March 2018, this time with no botnet required. These two outbreaks were neither detected nor mitigated while they were spreading, but only at the time they happened. Such attacks are generally preceded by several stages, including infection of hosts or device fingerprinting; being able to capture this activity would allow their early detection. In this paper, we propose a technique for the early detection of emerging botnets and newly exploited vulnerabilities, which consists in (i) splitting the detection process over different network segments and retaining only distributed anomalies, and (ii) monitoring at the port level, with a simple yet efficient change-detection algorithm based on a modified Z-score measure. We argue how our technique, named Split-and-Merge, can ensure the detection of large-scale zero-day attacks and drastically reduce false positives. We apply the method to two datasets: the MAWI dataset, which provides daily traffic traces of a transpacific backbone link, and the UCSD Network Telescope dataset, which contains unsolicited traffic mainly coming from botnet scans. The assumption of a normal distribution, for which the Z-score computation makes sense, is verified through empirical measures. We also show how the solution generates very few alerts; an extensive evaluation over the last three years allows identifying major attacks (including Mirai and memcached) that current Intrusion Detection Systems (IDSs) have not seen. Finally, we classify detected known and unknown anomalies to give additional insights about them.
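The port-level change detection described above can be illustrated with a small sketch. The abstract does not define its modified Z-score, so this example assumes the commonly used median/MAD variant with the constant 0.6745 and a cutoff of 3.5; the function names and the sample counts are hypothetical and do not come from the Split-and-Merge paper.

```python
from statistics import median


def modified_z_scores(counts):
    """Median/MAD-based modified Z-scores for a series of per-port counts.

    Assumption: this is the textbook variant, 0.6745 * (x - median) / MAD,
    which may differ from the measure used in the cited paper.
    """
    med = median(counts)
    mad = median([abs(x - med) for x in counts])
    if mad == 0:
        # No spread at all: treat every point as unremarkable.
        return [0.0 for _ in counts]
    return [0.6745 * (x - med) / mad for x in counts]


def flag_changes(counts, threshold=3.5):
    """Return the indices whose modified Z-score exceeds the threshold."""
    scores = modified_z_scores(counts)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical daily packet counts toward one port; the last day spikes.
daily_counts = [10, 12, 11, 9, 10, 11, 200]
print(flag_changes(daily_counts))
```

Using the median and MAD rather than the mean and standard deviation keeps a single large spike from inflating the baseline it is compared against.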
Conference Paper
To compromise a computer, it is first necessary to discover which hosts are active and which services they run. This reconnaissance is typically accomplished through port scanning. Defense systems monitor for these unsolicited packets and raise an alarm if a predefined threshold is exceeded. To remain undetected, adversaries can either slow down the scan, and/or distribute it over multiple hosts. With each source below the threshold, the combination of all may still complete the scan efficiently. It is especially this group that is of concern: with enough resources and knowledge to execute such a coordinated activity, they will pose a more potent threat than the noisy "script kiddie". Correlating which out of 4 billion IPs potentially collaborate is however a challenging task, hence today's systems do not consider coordination beyond basic subnet aggregation. In this paper, we propose a method to identify and fingerprint distributed scanners based on commonalities in header fields, which are an artefact of the way fast port scanning software is built. We demonstrate that this method can effectively locate groups, and based on the monitoring logs we report on a number of new groups and tools, among them the largest coordinated scan campaign reported to date.
Conference Paper
Denial of Service (DoS) attacks are a major threat currently observable in computer networks and especially the Internet. In such an attack, a malicious party tries either to break a service running on a server or to exhaust the capacity or bandwidth of the victim, hindering customers from effectively using the service. Recent reports show that the total number of Distributed Denial of Service (DDoS) attacks is steadily growing, with "mega-attacks" peaking at hundreds of gigabit/s (Gbps). In this paper, we provide a quantification of DDoS attacks in size and duration beyond these outliers reported in the media. We find that these mega-attacks do exist, but the bulk of attacks is in practice only a fraction of these frequently reported values. We further show that it is feasible to collect meaningful backscatter traces using surprisingly small telescopes, thereby enabling a broader audience to perform attack intelligence research.
Conference Paper
Internet-wide network scanning has numerous security applications, including exposing new vulnerabilities and tracking the adoption of defensive mechanisms, but probing the entire public address space with existing tools is both difficult and slow. We introduce ZMap, a modular, open-source network scanner specifically architected to perform Internet-wide scans and capable of surveying the entire IPv4 address space in under 45 minutes from user space on a single machine, approaching the theoretical maximum speed of gigabit Ethernet. We present the scanner architecture, experimentally characterize its performance and accuracy, and explore the security implications of high speed Internet-scale network surveys, both offensive and defensive. We also discuss best practices for good Internet citizenship when performing Internet-wide surveys, informed by our own experiences conducting a long-term research survey over the past year.