Internet-Wide Scanner Fingerprint Identifier Based on TCP/IP Header

December 2021

DOI:10.1109/FMEC54266.2021.9732414

Conference: 2021 Sixth International Conference on Fog and Mobile Edge Computing (FMEC)

Authors:

Akira Tanaka

Kyushu University

Chansu Han

National Institute of Information and Communications Technology

Takeshi Takahashi

National Institute of Information and Communications Technology

Katsuki Fujisawa

Tokyo Institute of Technology

Akira Tanaka∗†, Chansu Han∗, Takeshi Takahashi∗, and Katsuki Fujisawa†

∗National Institute of Information and Communications Technology, Tokyo, Japan.

{tanaka.akira, han, takeshi takahashi}@nict.go.jp

†Kyushu University, Fukuoka, Japan. akirat.tanaka@kyudai.jp and fujisawa@imi.kyushu-u.ac.jp

Abstract—Identifying individual scan activities is crucial yet challenging for mitigating emerging cyber threats and gaining insights into security scans. Sophisticated adversaries distribute their scans over multiple hosts and operate with stealth; therefore, low-rate scans hide beneath other benign traffic. Although previous studies attempted to discover such stealth scans by observing the distribution of ports and hosts, well-organized scans remain difficult to find. However, a scanner can embed a fingerprint into the packet fields to distinguish responses to its scan from other traffic. In this study, we propose a new algorithm, inspired by the genetic algorithm, that identifies flexible fingerprints. To the best of our knowledge, this is the first such attempt. Through numerical experiments on darknet traffic, we identified previously unknown fingerprints in addition to known ones. We analyzed the packets and discovered distinctive scan activities. Further, we collated the results with both cyber threat intelligence and investigation/large-scale scanner lists to ascertain the reliability of our model.

Index Terms—Internet-wide scan, genetic algorithm, fingerprint, darknet traffic

I. INTRODUCTION

Port scanning, which probes a server or host for open ports, serves multiple purposes. Shodan and Censys¹, known as search engines for Internet-connected devices, build searchable databases to discover vulnerabilities, their impact, and affected IoT devices. Network administrators conduct port scans during penetration tests to determine open ports and services and to verify network security policies. Attackers, however, use port scans to identify active network services and exploit their vulnerabilities.

Intrusion detection systems (IDS) and firewalls detect such adversaries by monitoring the number of packets sent by a source IP address or a network class. However, sophisticated adversaries slip through these defense systems by distributing the scans over multiple hosts, thereby reducing the scan rate per host. Durumeric et al. [1] investigated widespread port scanning and distributed botnet scans. These distributed scans disappear into the background noise of other activities, such as benign traffic, benign port scans for investigation purposes, and Internet backscatter [2].

Robertson et al. [3] developed a detection module for stealth scans, which assumes that cooperating hosts are located in the same subnet. Further, by inspecting destination ports and destination IP addresses, Yegneswaran et al. [4] uncovered a trend in intrusion attempts wherein a few correlated sources accounted for a significantly large fraction of intrusions. Blaise et al. [5] developed a statistical approach that utilizes a modified Z-score measure for port-level change detection, which contributes to the early detection of emerging botnets.

¹ https://www.shodan.io/, https://censys.io/

Moreover, existing studies report that scanners embed their fingerprints into packet fields to distinguish responses to their scans from backscatter traffic [6]. An adversary probably reuses the same fingerprint throughout a scanning campaign because creating a fingerprint for each source is inefficient. Snort, Suricata², and most IDS tools rely on a fingerprint database to identify attacks that match the given fingerprints. Although a fingerprint is expressive and easy for network administrators to understand, problems remain. First, emerging unknown threats may be overlooked. Second, generating fingerprints manually is expensive. To address these problems, Griffioen et al. [6] proposed a method for finding a fingerprint in TCP and IP header fields. Their numerical experiments verified its high detection performance in various scenarios, such as distributing source addresses and limiting packets per host.

However, existing studies assume a fixed-form fingerprint, thereby overlooking low-profile malware scans with diverse fingerprints. Therefore, this study proposes an algorithm that automatically generates fingerprint candidates and selects promising ones. To the best of our knowledge, this is the first study to automatically generate flexible fingerprints; thus, our model cannot be compared directly with existing models. Instead, we analyzed the packets that have a fingerprint and collated their source IP addresses with investigation/large-scale scanner lists to ascertain the reliability of our model.

In summary, this study offers the following contributions:

1) We draw on the idea of the genetic algorithm (GA) and propose a new algorithm that automatically identifies fingerprints embedded in IPv4 and TCP header fields.

2) We demonstrate the feasibility of our approach using darknet traffic and discover several new fingerprints previously unknown in the literature.

3) We collate our results with both cyber threat intelligence and investigation/large-scale scanner lists to ensure the reliability of our model.

² https://www.snort.org/, https://suricata.io/

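TABLE I: Fingerprints of major open-source scanners (Masscan and ZMap) and malware (Mirai and Hajime) [1], [7]. $f_{\mathrm{FB}}(\cdot)$ and $f_{\mathrm{LB}}(\cdot)$ output the first and last bytes of the input, respectively. Similarly, $f_{\mathrm{L2B}}(\cdot)$ outputs the last two bytes of the input. $\oplus$ denotes the bitwise XOR, which performs the logical exclusive OR operation on each pair of corresponding bits.

| Name | Fingerprint |
|---|---|
| Hajime [8] | $x_{p,(\text{tcp.window},\,14600)} \land \bigl( x_{p,(f_{\mathrm{FB}}(\text{tcp.seq}),\,0)} \lor x_{p,(f_{\mathrm{LB}}(\text{tcp.seq}),\,0)} \bigr)$ |
| Masscan [9] | $x_{p,(\text{ip.id} \oplus f_{\mathrm{L2B}}(\text{ip.dst}) \oplus \text{tcp.dport} \oplus f_{\mathrm{L2B}}(\text{tcp.seq}),\,0)}$ |
| ZMap [10] | $x_{p,(\text{ip.id},\,54321)}$ |
| Mirai [11] | $x_{p,(\text{tcp.seq} \oplus \text{ip.dst},\,0)}$ |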

II. PRELIMINARIES

This section provides the preliminaries to our proposed

method and experiments.

A. Header Fields Used for Producing Fingerprints

As shown in Fig. 1, the header fields of the IPv4 and TCP protocols are categorized into (a) modifiable fields (yellow) and (b) unchangeable fields (white). Modifying an unchangeable field prevents IPv4 or TCP from working correctly. For example, modifying the IPv4 version field prevents the destination from parsing the packet. Therefore, fingerprints embedded in a packet are probably composed only of modifiable header fields.

B. Well-known Fingerprints

We define a TCP function $f$ as a function that maps a TCP packet to a binary number, $\mathcal{F}$ as the set of TCP functions, and $\mathcal{B}$ as the set of binary numbers. For example, the function $f$ that accepts a TCP packet $p$ and returns its source address is a TCP function and is expressed as $f(p) = \text{ip.src}$. A sign $(f, b)$ is defined as an ordered pair of a TCP function $f$ and a binary number $b$. A TCP packet $p$ is said to have a sign $(f, b)$ if $f(p) = b$, where $f(p)$ is the output of the TCP function $f$.

The fingerprint of a scanner or malware is expressed as a logical formula over the following indicator variables:

$$x_{p,(f,b)} = \begin{cases} 1 & \text{if a packet } p \text{ has a sign } (f, b) \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

According to a previous study [7], a TCP SYN packet $p$ has a Hajime [8] signature if it possesses two features: (1) a window size of 14,600 and (2) the first or last byte of the sequence number is zero. These two conditions are satisfied if and only if

$$x_{p,(\text{tcp.window},\,14600)} \land \bigl( x_{p,(f_{\mathrm{FB}}(\text{tcp.seq}),\,0)} \lor x_{p,(f_{\mathrm{LB}}(\text{tcp.seq}),\,0)} \bigr) = 1 \tag{2}$$

where $f_{\mathrm{FB}}$ and $f_{\mathrm{LB}}$ output the first and last bytes of the input, respectively. The fingerprints of the other scanners and malware are summarized in Table I.
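As an illustration of the sign and indicator definitions, the fingerprints in Table I can be sketched as Python predicates. This is a minimal sketch, not the authors' implementation: the packet is assumed to be a dictionary of integer header fields, the key names (`tcp.window`, `tcp.seq`, `ip.id`, `ip.dst`, `tcp.dport`) and all function names are hypothetical, and the sequence number is read from the TCP header.

```python
import ipaddress


def first_byte(value: int, width_bytes: int = 4) -> int:
    """f_FB: most significant byte of a fixed-width field (assumed 4 bytes)."""
    return (value >> (8 * (width_bytes - 1))) & 0xFF


def last_byte(value: int) -> int:
    """f_LB: least significant byte."""
    return value & 0xFF


def last_two_bytes(value: int) -> int:
    """f_L2B: least significant two bytes."""
    return value & 0xFFFF


def x(packet: dict, f, b: int) -> int:
    """Indicator x_{p,(f,b)}: 1 if the packet has the sign (f, b), else 0."""
    return 1 if f(packet) == b else 0


def is_hajime(p: dict) -> bool:
    # window == 14600 AND (first byte of seq == 0 OR last byte of seq == 0)
    return bool(
        x(p, lambda q: q["tcp.window"], 14600)
        and (
            x(p, lambda q: first_byte(q["tcp.seq"]), 0)
            or x(p, lambda q: last_byte(q["tcp.seq"]), 0)
        )
    )


def is_zmap(p: dict) -> bool:
    return bool(x(p, lambda q: q["ip.id"], 54321))


def is_mirai(p: dict) -> bool:
    # sequence number XOR destination address == 0
    return bool(x(p, lambda q: q["tcp.seq"] ^ q["ip.dst"], 0))


def is_masscan(p: dict) -> bool:
    return bool(
        x(
            p,
            lambda q: q["ip.id"]
            ^ last_two_bytes(q["ip.dst"])
            ^ q["tcp.dport"]
            ^ last_two_bytes(q["tcp.seq"]),
            0,
        )
    )


# Hypothetical probe whose sequence number equals its destination address.
dst = int(ipaddress.IPv4Address("192.0.2.10"))
mirai_probe = {"ip.id": 1, "ip.dst": dst, "tcp.dport": 23,
               "tcp.seq": dst, "tcp.window": 1024}
print(is_mirai(mirai_probe), is_zmap(mirai_probe))
```

Representing every fingerprint as a Boolean combination of the same indicator `x` mirrors Eq. (1) and keeps each predicate a direct transcription of its table row.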

C. Internet-wide Scanner

We prepared two Internet-wide scanner lists. In Section IV-C, these lists are collated with the source IP addresses that send packets with embedded fingerprints. The details of the lists are as follows:

• Investigation scanner list

By querying domain names, we obtained 674 IP addresses used for investigation purposes. This list covers eighteen organizations, such as Shodan, Censys, and BinaryEdge.

• Large-scale scanner list

Following previous research [12], an IP address is called a large-scale scanner if it satisfies both of the following conditions within a day: (1) the number of unique destination ports is greater than or equal to 30; and (2) the number of packets exceeds the total IP address space (we used the number estimated from the darknet IP addresses). In October 2018, 325 IP addresses satisfied the criteria on at least one day, and the list includes all of them.

The two lists have only 13 IP addresses in common, all of which belong to Shodan.
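The large-scale scanner criteria can be sketched as follows. This is an illustrative reading rather than the procedure of [12]: it assumes that the "estimated number" is obtained by scaling the packets observed in the darknet by the ratio of the IPv4 address space to the darknet size, and the function name, record layout, and parameters are hypothetical.

```python
from collections import defaultdict


def large_scale_scanners(records, darknet_size, min_ports=30, ipv4_space=2**32):
    """Return source IPs meeting both criteria on at least one day.

    records: iterable of (day, src_ip, dst_port), one entry per observed packet.
    darknet_size: number of IP addresses monitored by the darknet.
    """
    ports = defaultdict(set)   # (day, src) -> distinct destination ports
    counts = defaultdict(int)  # (day, src) -> packets observed in the darknet
    for day, src, dport in records:
        ports[(day, src)].add(dport)
        counts[(day, src)] += 1

    scale = ipv4_space / darknet_size  # assumed extrapolation factor
    scanners = set()
    for (day, src), observed in counts.items():
        estimated_total = observed * scale
        if len(ports[(day, src)]) >= min_ports and estimated_total > ipv4_space:
            scanners.add(src)
    return scanners
```

Under this assumed extrapolation, the second condition reduces to observing more packets from a source in one day than the darknet has addresses.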

III. METHODOLOGY

Our approach identifies fingerprints that are expressed as combinations of bitwise operations between the header fields of the IPv4 and TCP protocols. We provide an overview in Section III-A and the details of our method in the succeeding sections.

A. Scanner's Fingerprint Identifier

Algorithm 1 is the pseudocode for identifying fingerprints, and its overview is summarized below:

1) Generate TCP functions

Drawing on the GA, we generate a new TCP function (line 5) from the current TCP functions $n$ times. Section III-B explains the initial TCP functions $F_1$. Section III-C describes how a new TCP function is generated, and Section III-D presents the analogy between the GA and generating TCP functions.

2) Find effective signs

If many packets have a sign $(f, b)$, then $x_{p,(f,b)}$ is probably a component of some fingerprint. Such a sign $(f, b)$ is called an effective sign, and Section III-E provides a method for identifying effective signs.

3) Consolidate effective signs and identify fingerprints

We calculate co-occurrences between effective signs, that is, how often two effective signs hold for a packet simultaneously. We unify effective signs with high co-occurrences into a new fingerprint. For instance, almost all the packets that have the ZMap fingerprint ($\text{ip.id} = 54321$) also satisfy $\text{tcp.window} = 65535$ and $\text{tcp.ack} = 0$; hence, the ZMap fingerprint is updated to $\text{ip.id} = 54321 \land \text{tcp.window} = 65535 \land \text{tcp.ack} = 0$.

Fig. 1: Header fields of the IPv4 and TCP protocols. Modifiable fields are in yellow, and unchangeable fields are in white.

Algorithm 1: Scanner's Fingerprint Identifier

Input : n: number of generated TCP functions
        F₁ ⊆ ℱ: initial TCP functions
Output: S: set of fingerprints

 2  Function fgpt_identifier(n, F₁):
        // (1) Generate TCP functions
 3      F ← F₁
 4      for i ← 1 to n do
            // see Algorithm 2
 5          f ← generate_TCPfunction(F)
 6          F.add(f)
        // (2) Find effective signs
 7      E ← ∅
 8      for f ∈ F do
            // see Algorithm 3
 9          E′ ← find_effective_sign(f)
10          E ← E ∪ E′
        // (3) Consolidate effective signs and identify fingerprints
11      S ← consolidate and arrange the effective signs E, and finally identify fingerprints
12      return S
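Step (3) of Algorithm 1 is described only in prose, so the following is a hedged sketch of one possible co-occurrence measure rather than the measure used in the paper. It assumes signs of the simple form (field name, value), defines co-occurrence as the fraction of packets with a base sign that also have a candidate sign, and uses a hypothetical threshold of 0.95.

```python
def cooccurrence(packets, base, candidate):
    """Fraction of packets having `base` that also have `candidate`.

    Each sign is a (field_name, value) pair; this conditional ratio is an
    assumed measure chosen for illustration.
    """
    base_field, base_value = base
    cand_field, cand_value = candidate
    with_base = [p for p in packets if p[base_field] == base_value]
    if not with_base:
        return 0.0
    hits = sum(1 for p in with_base if p[cand_field] == cand_value)
    return hits / len(with_base)


def consolidate(packets, base, candidates, threshold=0.95):
    """Unify `base` with every candidate sign that nearly always accompanies it."""
    return [base] + [
        c for c in candidates if cooccurrence(packets, base, c) >= threshold
    ]


# Hypothetical traffic: 20 ZMap-like probes and 5 unrelated packets.
zmap_like = [{"ip.id": 54321, "tcp.window": 65535, "tcp.ack": 0, "tcp.dport": i}
             for i in range(20)]
others = [{"ip.id": 1, "tcp.window": 1024, "tcp.ack": 0, "tcp.dport": 80}
          for _ in range(5)]
fingerprint = consolidate(
    zmap_like + others,
    ("ip.id", 54321),
    [("tcp.window", 65535), ("tcp.ack", 0), ("tcp.dport", 80)],
)
print(fingerprint)
```

On this toy input the base sign is joined by the window and acknowledgment signs but not by the destination port, which matches the ZMap example given in the overview.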

B. Initialization of TCP functions

We define the initial TCP functions $F_1 \subseteq \mathcal{F}$ as the set of TCP functions that return modifiable header fields (colored yellow in Fig. 1). However, some header fields take the same binary value for almost all packets. We excluded these fields from the initial TCP functions because a nearly constant field carries little information for telling fingerprints apart.
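The exclusion of nearly constant fields can be sketched as a simple filter. The paper does not state how "almost all packets" is quantified, so the cutoff `max_share=0.99`, the function name, and the example key `tcp.urgent` are assumptions made for illustration.

```python
from collections import Counter


def initial_fields(packets, candidate_fields, max_share=0.99):
    """Keep fields whose most common value covers less than `max_share` of packets."""
    kept = []
    for field in candidate_fields:
        counts = Counter(p[field] for p in packets)
        top_share = counts.most_common(1)[0][1] / len(packets)
        if top_share < max_share:
            kept.append(field)
    return kept


# Hypothetical sample: ip.id varies, while tcp.urgent is always zero.
sample = [{"ip.id": i, "tcp.urgent": 0} for i in range(100)]
print(initial_fields(sample, ["ip.id", "tcp.urgent"]))
```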

C. Generate TCP function

Given a subset of TCP functions $F \subseteq \mathcal{F}$, our algorithm generates a new TCP function $f$ by applying either (1) feature extraction or (2) a binary operation, with probabilities $r$ and $1 - r$, respectively. Here, $r \in [0, 1]$ is a hyperparameter that determines the priority of the two operations. In feature extraction, we first select a TCP function $f$ from $F$ and generate a new TCP function $k \circ f$, where $k : \mathcal{B} \to \mathcal{B}$ is selected from the predefined binary-valued functions $K$ with a discrete uniform distribution. For example, $k \circ f = f_{\mathrm{L2B}} \circ \text{ip.dst}$ transforms a packet into the last two bytes of the destination address. Conversely, the binary operation chooses two TCP functions $f, g$ and produces a new TCP function $\psi(f, g)$, where $\psi$ is chosen from the predefined binary operations $\Psi$ with a discrete uniform distribution. If $f = \text{tcp.seq}$, $g = \text{ip.dst}$, and $\psi$ is the bitwise XOR, the TCP function $\psi(f, g)$ transforms a packet into the XOR of the sequence number and the destination address, denoted by $\text{tcp.seq} \oplus \text{ip.dst}$.

A TCP function $f$ is built through a series of function compositions, and the minimum number of compositions plus one is denoted by $\tau_{\mathrm{count}}(f)$:

$$\tau_{\mathrm{count}}(f) := \min\{\, i \mid f \in F_i \,\}, \quad \text{where} \tag{3}$$

$$F_i := \{\, k \circ f \mid k \in K,\ f \in F_{i'}\ (i' < i) \,\} \cup \{\, \psi(f, g) \mid \psi \in \Psi,\ f \in F_{i_1},\ g \in F_{i_2}\ (i_1, i_2 < i) \,\} \tag{4}$$

for $i = 2, 3, \dots$. For instance,

\begin{align}
& \tau_{\mathrm{count}}\bigl( f_{\mathrm{L2B}}(\text{tcp.seq}) \oplus f_{\mathrm{L2B}}(\text{ip.dst}) \bigr) \tag{5} \\
&= \tau_{\mathrm{count}}\bigl( f_{\mathrm{L2B}}(\psi(\text{tcp.seq}, \text{ip.dst})) \bigr) \tag{6} \\
&= 3 \tag{7}
\end{align}

where $\psi$ denotes the bitwise XOR.

The pseudocode for generating a new TCP function is given as generate_TCPfunction in Algorithm 2.
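The worked example in Eqs. (5)–(7) can be reproduced with a small sketch in which each TCP function carries its own composition count. The class and helper names are hypothetical, and the count is computed structurally (one more than the largest count among the inputs), which is an upper bound on the minimum in Eq. (3) and coincides with it in this example.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Packet = Mapping[str, int]


@dataclass(frozen=True)
class TCPFunction:
    name: str
    evaluate: Callable[[Packet], int]
    tau: int  # structural composition count (upper bound on tau_count)


def field(name: str) -> TCPFunction:
    """An initial TCP function in F_1 that returns one header field."""
    return TCPFunction(name, lambda p: p[name], 1)


def extract(label: str, k: Callable[[int], int], f: TCPFunction) -> TCPFunction:
    """Feature extraction k o f."""
    return TCPFunction(f"{label}({f.name})", lambda p: k(f.evaluate(p)), f.tau + 1)


def xor(f: TCPFunction, g: TCPFunction) -> TCPFunction:
    """Binary operation psi(f, g) with psi = bitwise XOR."""
    return TCPFunction(
        f"({f.name} ^ {g.name})",
        lambda p: f.evaluate(p) ^ g.evaluate(p),
        max(f.tau, g.tau) + 1,
    )


def last_two_bytes(value: int) -> int:
    return value & 0xFFFF


seq, dst = field("tcp.seq"), field("ip.dst")
lhs = xor(extract("fL2B", last_two_bytes, seq), extract("fL2B", last_two_bytes, dst))
rhs = extract("fL2B", last_two_bytes, xor(seq, dst))

packet = {"tcp.seq": 0x12345678, "ip.dst": 0xC000020A}
print(lhs.tau, rhs.tau, lhs.evaluate(packet) == rhs.evaluate(packet))
```

Both constructions yield the same value on every packet because taking the last two bytes commutes with XOR, which is why Eq. (5) and Eq. (6) describe the same TCP function.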

D. Analogy between Genetic Algorithm and Generating TCP functions

The GA pursues high-quality solutions to optimization problems in broad research fields [13]. In the GA, a population of candidate solutions (individuals) iteratively evolves toward better solutions via biologically inspired operators, such as mutation, crossover, and selection. In each iteration, the fitness of every individual is evaluated, and fit individuals are stochastically selected from the current population. These individuals are then modified via recombination or random mutation.

Although we cannot directly apply the GA to identifying fingerprints, we draw on its idea to build a new algorithm that generates TCP functions. In Algorithm 2, a TCP function represents an individual, and the fitness of each TCP function is assessed via $\tau_{\mathrm{count}}$. Specifically, a TCP function $f$ with a smaller $\tau_{\mathrm{count}}(f)$ is more likely to be chosen, which implies that we seek simpler TCP functions. (1) Feature extraction and (2) the binary operation correspond to mutation and crossover, respectively: as with mutation and crossover, the output of each operation inherits the features of its inputs.

Algorithm 2: Generate TCP function

Input : F ⊆ ℱ: TCP functions
        K: binary-valued functions (k ∈ K implies k : ℬ → ℬ)
        Ψ: binary operations on ℱ (ψ ∈ Ψ implies ψ : ℱ × ℱ → ℱ)
        r ∈ [0, 1]: probability of feature extraction
Output: f: a new TCP function

 2  Function select_TCP_function(F):
 3      f ← select f ∈ F with the probability (1/τcount(f))² / Σ_{f′∈F} (1/τcount(f′))²
 4      return f

 5  Function feature_extraction(F, K):
 6      f ← select_TCP_function(F)
 7      k ← select k ∈ K with a discrete uniform distribution
 8      return k ∘ f

 9  Function binary_operation(F, Ψ):
10      f ← select_TCP_function(F)
11      g ← select_TCP_function(F)
12      ψ ← select ψ ∈ Ψ with a discrete uniform distribution
13      return ψ(f, g)

14  Function generate_TCPfunction(F):
15      x ∼ Uniform(0, 1)   // random number from [0, 1]
16      if x ≤ r then
17          f ← feature_extraction(F, K)
18      else
19          f ← binary_operation(F, Ψ)
20      return f
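Algorithm 2 can be sketched in Python as follows. For brevity, this sketch represents each TCP function symbolically as a (name, τcount) pair instead of an executable function, and the composition count of a generated function is taken structurally; these representation choices and all identifiers are assumptions for illustration, not the authors' code.

```python
import random


def selection_probabilities(functions):
    """Probability of picking each function: (1/tau)^2, normalized over F."""
    weights = [(1.0 / tau) ** 2 for _, tau in functions]
    total = sum(weights)
    return [w / total for w in weights]


def select_tcp_function(functions, rng):
    """Draw one (name, tau) pair, favoring small tau (simpler functions)."""
    return rng.choices(functions, weights=selection_probabilities(functions), k=1)[0]


def generate_tcp_function(functions, extractors, operators, r, rng):
    """Feature extraction with probability r, binary operation otherwise."""
    if rng.random() <= r:
        name, tau = select_tcp_function(functions, rng)
        k = rng.choice(extractors)  # discrete uniform choice over K
        return (f"{k}({name})", tau + 1)
    f_name, f_tau = select_tcp_function(functions, rng)
    g_name, g_tau = select_tcp_function(functions, rng)
    op = rng.choice(operators)  # discrete uniform choice over Psi
    return (f"({f_name} {op} {g_name})", max(f_tau, g_tau) + 1)


# Grow the pool as in lines 3-6 of Algorithm 1 (here with only five iterations).
rng = random.Random(0)
pool = [("ip.id", 1), ("ip.dst", 1), ("tcp.seq", 1), ("tcp.dport", 1)]
for _ in range(5):
    pool.append(generate_tcp_function(pool, ["fF2B", "fL2B"], ["^"], r=0.1, rng=rng))
print(pool[-1])
```

Squaring the inverse count makes a function with τcount = 1 four times as likely to be drawn as one with τcount = 2, which biases the search toward short expressions.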

E. Find Effective Signs

We describe a method for identifying the effective signs $(f, b)$ for a given TCP function $f$. We examine the appearance ratio of each binary number among the outputs of $f$.

Let $f$ be a TCP function, and let $P = \{p\}_p$ be the packets collected from network traffic. For each binary number $b \in \mathcal{B}$, we define the appearance ratio of $b$ as

$$r_a(b) := \frac{\#\{\, p \in P \mid f(p) = b \,\}}{\#P} \tag{8}$$

where $\#A$ denotes the number of elements in a set $A$. The image of the packets $P = \{p\}_p$ under $f$ is defined by

$$B := f(P) = \{\, f(p) \mid p \in P \,\} \subseteq \mathcal{B}. \tag{9}$$

$R := (r_a(f(p)))_{p \in P}$ denotes a multiset (not a set) whose underlying set is $\{\, r_a(b) \mid b \in B \,\}$. For every real number $\alpha$, we define $R_{<\alpha} \subseteq R$ ($R_{\le\alpha} \subseteq R$) as the multiset composed of every element $r \in R$ such that $r < \alpha$ ($r \le \alpha$). The population variance of a multiset $A$ is denoted by $\sigma^2_A$. We define the effective indicator of $b \in B$ for $f \in \mathcal{F}$ as

$$e_f(b) := \begin{cases} \sigma^2_{R_{\le r_a(b)}} \,/\, \sigma^2_{R_{< r_a(b)}} & (\text{if } \sigma^2_{R_{< r_a(b)}} > 0) \\ \text{NULL} & (\text{otherwise}) \end{cases} \tag{10}$$

where NULL implies that $e_f(b)$ cannot be defined. The effective indicator evaluates the extent to which the binary number $b \in B$ influences the variance of the appearance ratio, and a larger value of $e_f(b)$ implies that $(f, b)$ is an effective sign. Algorithm 3 describes the pseudocode for finding effective signs.

Algorithm 3: Find Effective Sign

Input : f: a TCP function
        P = {p}_p: packets
        max_sign: maximum number of effective signs per TCP function
        sign_thres: threshold of effective signs
Output: E = {(f, b)}: effective signs

 2  Function find_effective_sign(f):
 3      B ← {f(p) | p ∈ P}
 4      sorted_B ← arrange B in descending order of the appearance ratio r_a(b)
 5      max_idx ← NULL
 6      for i ← 0 to max_sign − 1 do
 7          b ← sorted_B[i]
 8          if e_f(b) > sign_thres then
 9              max_idx ← i
10      E ← ∅   // initializes effective signs
11      if max_idx ≠ NULL then
12          for i ← 0 to max_idx do
13              E.add((f, sorted_B[i]))
14      return E
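The effective indicator of Eq. (10) and the procedure of Algorithm 3 can be sketched in Python as follows. This is a minimal, unoptimized sketch: appearance ratios are kept as exact fractions so that equal ratios compare equal, the loop is bounded by the number of distinct values, and the function names and the default `sign_thres=10.0` are assumptions for illustration.

```python
from collections import Counter
from fractions import Fraction
from statistics import pvariance


def effective_indicator(ratios, alpha):
    """e_f(b) for a value with appearance ratio alpha; None stands for NULL."""
    below = [r for r in ratios if r < alpha]       # R_{<alpha}
    if not below:
        return None
    var_below = pvariance(below)
    if var_below <= 0:
        return None
    up_to = [r for r in ratios if r <= alpha]      # R_{<=alpha}
    return pvariance(up_to) / var_below


def find_effective_sign(name, f, packets, max_sign=10, sign_thres=10.0):
    """Return the effective signs (name, b) of the TCP function f."""
    values = [f(p) for p in packets]
    counts = Counter(values)
    ratio = {b: Fraction(c, len(values)) for b, c in counts.items()}  # r_a(b)
    ratios = [ratio[v] for v in values]            # multiset R, one entry per packet
    sorted_b = sorted(counts, key=lambda b: ratio[b], reverse=True)

    max_idx = None
    for i, b in enumerate(sorted_b[:max_sign]):
        e = effective_indicator(ratios, ratio[b])
        if e is not None and e > sign_thres:
            max_idx = i
    if max_idx is None:
        return []
    return [(name, b) for b in sorted_b[: max_idx + 1]]


# Hypothetical traffic: one dominant ip.id value among scattered values.
traffic = ([{"ip.id": 54321}] * 60 + [{"ip.id": 7}] * 3
           + [{"ip.id": v} for v in range(100, 111) for _ in range(2)]
           + [{"ip.id": v} for v in range(200, 215)])
print(find_effective_sign("ip.id", lambda p: p["ip.id"], traffic))
```

In this toy input, adding the dominant value to the multiset raises the variance of the appearance ratios by several orders of magnitude, so only that value passes the threshold.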

IV. EXPERIMENTS

We applied our model to darknet traffic and demonstrated its feasibility. Section IV-A explains the dataset used in our experiments, and Section IV-B summarizes the parameters of our model. We analyze the packets that have a fingerprint in Section IV-C.

A. Dataset

Our dataset was collected from a darknet operated by NICTER³. A darknet, also known as a network telescope, passively monitors network traffic arriving at an unused (dark) IP address block. A darknet is an effective system for observing indiscriminate Internet-wide scans because it does not receive benign, regular network traffic. Because TCP SYN packets are used to survey active hosts and open ports [6], our experiment used only TCP SYN packets. Table II summarizes the dataset and the computation time of our algorithm.

B. Parameter Setting

The following parameters were selected manually based on empirical investigation. In Algorithm 1, we set $n = 2{,}000$ and

$$F_1 = \{\text{ip.id}, \text{ip.checksum}, \text{ip.src}, \text{ip.dst}\} \cup \{\text{tcp.sport}, \text{tcp.dport}, \text{tcp.seq}, \text{tcp.window}\}. \tag{11}$$

Algorithm 2 adopts $r = 0.1$, $K = \{f_{\mathrm{F2B}}, f_{\mathrm{L2B}}\}$, and $\Psi = \{\psi\}$, where $\psi : \mathcal{F} \times \mathcal{F} \to \mathcal{F}$ satisfies $\psi(f, g)(p) = f(p) \oplus g(p)$ for every packet $p$, and $f_{\mathrm{F2B}}$ outputs the first two bytes of the input, by analogy with $f_{\mathrm{L2B}}$. Algorithm 3 uses max_sign = 10, and sign_thres is determined such that the number of effective signs ranges from 20 to 50. The computation time of Algorithm 1 is proportional to both $n$ and the number of packets $\#P$. $F_1$, $K$, and $\Psi$ specify the search space of the TCP functions, whereas max_sign and sign_thres determine the threshold for effective signs.

³ Network Incident Analysis Center for Tactical Emergency Response: https://www.nicter.jp/en

TABLE II: Summary of our dataset and the computation time of Algorithm 1

| Situation | #IPs§ | Period* | #Packets | Time |
|---|---|---|---|---|
| Implement Algorithm 1 | 4,096 | 10/22 – 10/24 | 572,289† | 6.5 h‡ |
| Analyze packets that have a fingerprint | 4,096 | 10/22 – 10/28 | $117.5 \times 10^6$ | 18 h |

* The year was 2018; the period was selected owing to active malicious activities [7], [14].
§ The number of IP addresses in our darknet.
† We used one hour's worth of packets because of the computation cost (the total number of packets is $41.2 \times 10^6$).
‡ We implemented Algorithm 1 in Python and ran it on one core of a server with an AMD EPYC 7H12 (64 CPUs and 56 GB RAM). Elasticsearch was used as the packet database. The computation time includes only (1) generating TCP functions and (2) finding effective signs (lines 2–10).

C. Analyzing Packets that have a Fingerprint

We applied Algorithm 1 to the TCP SYN packets five times, each time excluding the packets that had an already identified fingerprint. Repeated application makes it possible to identify fingerprints carried by only a small number of packets. We identified three known and six unknown fingerprints, as summarized in Table III. Some IP addresses sent numerous packets that had a fingerprint but whose characteristics differed from those of the other packets with the same fingerprint; therefore, we eliminated the packets sent by these IP addresses. Although each unknown fingerprint accounted for less than 0.5% of the total packets, our model could detect them. We identified the well-known fingerprints summarized in Table I, except for Hajime, which is regarded as a trivial attack because of its small number of packets (0.02% of the total packets). We categorized the identified fingerprints into attack and investigation purposes, and their major features are summarized below.

• Attack purpose (Mirai botnet, Botnet1, and Attack2–6)

– Destination ports are related to some vulnerability.

– None of the source IP addresses of the Mirai botnet, Botnet1, and Attack2–5 are included in the investigation or large-scale scanner lists defined in Section II-C.

• Investigation purpose (ZMap and Masscan)

– Destination ports cover a wide range of ports.

– Source IP addresses overlap with the investigation and large-scale scanner lists.

1) Attack Purpose Scan: As summarized in Table III, packets with the Mirai fingerprint arrived incessantly and account for 14.31% of the total TCP SYN packets. Mirai targets 23/TCP (53.8%) and 2323/TCP (6.0%) as destination ports for the brute-force login phase, where each percentage indicates the share among the packets with the Mirai fingerprint. Destination port 4444/TCP (6.0%) was used for an infection campaign against Android devices.

Botnet1 is characterized by (1) a destination port of 5431, (2) a source port of 6, and (3) a window size of 65535. Moreover, Botnet1 excludes packets with $\text{ip.id} = 54321$, which is the ZMap fingerprint. The destination port 5431/TCP and the 154 K unique source IP addresses indicate that Botnet1 aims to infect routers with the Broadcom UPnP feature enabled⁴.

Both Attack2 and Attack3 satisfy $\text{ip.id} = 256$, $\text{tcp.window} = 16384$, and $f_{\mathrm{L2B}}(\text{tcp.seq}) = 0$. Attack2 uses the fixed source port 6000/TCP, whereas Attack3 uses any port except 6000/TCP. Among the destination ports of Attack2 and Attack3, 3306/TCP, 1433/TCP, and 60001/TCP account for more than 14%, 14%, and 15%, respectively. The adversary probes for SQL vulnerabilities through 3306/TCP and 1433/TCP [15], [16] and for the vulnerability of the Jaws Web Server (EDB-ID:41471⁵) on 60001/TCP.

All the destination ports of Attack4 and 5 are dynamic ports

(in the range of 49152 to 65535).

Attack6 targets 80/TCP (31.0%), 8080/TCP (26.9%), 85/TCP (26.8%), and 443/TCP (14.9%). According to the NICTER observation report 2018⁶, the scans of ports 80, 443, and 8080 originated from attacks on GPON home routers (CVE-2018-10561 and CVE-2018-10562⁷).

2) Security Scan: Because almost all the packets with the original ZMap fingerprint ($\text{ip.id} = 54321$) satisfy $\text{tcp.ack} = 0$ and $\text{tcp.window} = 65535$, the ZMap fingerprint is updated as summarized in Table III. Collation with the investigation scanner list shows that the source IP addresses include 64 addresses from Censys, denoted by Censys(64), as well as Binaryedge(12), Security.ipip.net(21), and Shadowserver(166). Overall, the investigation and large-scale scanner lists account for 9.42% and 22.16%, respectively. We observed a constant stream of packets, and a few IP addresses caused three spikes. Further, ZMap uses a wide variety of source ports.

Masscan accounts for 67.14% of the total TCP SYN packets, and its source IP addresses include Binaryedge(1), Censys(2), Onyphe(1), Security.ipip.net(3), Shadowserver(11), and Shodan(6). Moreover, 77.38% of Masscan's packets are from IP addresses in the large-scale scanner list.

⁴ https://blog.netlab.360.com/bcmpupnp_hunter-a-100k-botnet-turns-home-routers-to-email-spammers-en/
⁵ EDB-ID represents the identification number in the Exploit Database.
⁶ https://www.nicter.jp/en/report
⁷ https://nvd.nist.gov/

TABLE III: Fingerprints identified by Algorithm 1. $\lnot x$ denotes the negation of $x$, and K denotes $10^3$.

| Name | Packets (%) | #source IP addresses (%) | Fingerprint |
|---|---|---|---|
| Mirai Botnet | 16,812 K (14.31%) | 366,358 (25.851%) | $x_{p,(\text{tcp.seq} \oplus \text{ip.dst},\,0)} \land x_{p,(\text{tcp.ack},\,0)}$ |
| Botnet1 | 439 K (0.37%) | 154,136 (10.876%) | $x_{p,(\text{tcp.dport},\,5431)} \land x_{p,(\text{tcp.sport},\,6)} \land x_{p,(\text{tcp.window},\,65535)} \land x_{p,(\text{tcp.seq},\,0)} \land x_{p,(\text{tcp.ack},\,0)} \land \lnot x_{p,(\text{ip.id},\,54321)}$ |
| Attack2 | 302 K (0.26%) | (0.004%) | $x_{p,(\text{ip.id},\,256)} \land x_{p,(\text{tcp.window},\,16384)} \land x_{p,(f_{\mathrm{L2B}}(\text{tcp.seq}),\,0)} \land x_{p,(\text{tcp.ack},\,0)} \land x_{p,(\text{tcp.sport},\,6000)}$ |
| Attack3 | 477 K (0.41%) | (0.001%) | $x_{p,(\text{ip.id},\,256)} \land x_{p,(\text{tcp.window},\,16384)} \land x_{p,(f_{\mathrm{L2B}}(\text{tcp.seq}),\,0)} \land x_{p,(\text{tcp.ack},\,0)} \land \lnot x_{p,(\text{tcp.sport},\,6000)}$ |
| Attack4 | 423 K (0.36%) | (0.001%) | $x_{p,(f_{\mathrm{L2B}}(\text{ip.dst}) \oplus \text{tcp.dport},\,0)} \land x_{p,(\text{tcp.ack},\,1)} \land x_{p,(\text{ip.id},\,0)} \land x_{p,(\text{tcp.window},\,17520)} \land x_{p,(\text{tcp.sport},\,80)}$ |
| Attack5 | 268 K (0.23%) | (0.007%) | $x_{p,(f_{\mathrm{L2B}}(\text{ip.dst}) \oplus \text{tcp.dport},\,0)} \land x_{p,(\text{tcp.ack},\,1)} \land x_{p,(\text{ip.id},\,38993)}$ |
| Attack6 | 88 K (0.07%) | (0.004%) | $x_{p,(\text{tcp.window},\,1300)} \land x_{p,(\text{tcp.ack},\,0)}$ |
| ZMap | 9,441 K (8.04%) | 4,117 (0.291%) | $x_{p,(\text{ip.id},\,54321)} \land x_{p,(\text{tcp.ack},\,0)} \land x_{p,(\text{tcp.window},\,65535)}$ |
| Masscan | 78,915 K (67.17%) | 1,650 (0.116%) | $x_{p,(\text{ip.id} \oplus f_{\mathrm{L2B}}(\text{ip.dst}) \oplus \text{tcp.dport} \oplus f_{\mathrm{L2B}}(\text{tcp.seq}),\,0)} \land x_{p,(\text{tcp.ack},\,0)}$ |

V. CONCLUSION

This study draws on the idea of the genetic algorithm and proposes a new algorithm that automatically identifies the fingerprints embedded in packet header fields. Numerical experiments using darknet traffic demonstrated the feasibility of our model by identifying previously unknown fingerprints in addition to known ones. We analyzed the packets with these fingerprints and revealed characteristic scan activities, each of which accounted for less than 0.5% of the total packets. These results were collated with both cyber threat intelligence and investigation/large-scale scanner lists to ascertain the reliability of the fingerprints. As a next step, we will parallelize the computation to speed it up and integrate our model into a real-time system. Furthermore, we aim to identify more reliable fingerprints by applying our method to packets from dynamic malware analysis. Finally, we will build a system that automatically associates identified fingerprints with open-source threat intelligence [17].

ACKNOWLEDGMENT

The study was partly conducted under a contract of “MITIGATE” of the Research and Development for Expansion of Radio Wave Resources (JPJ000254), which was supported by the Ministry of Internal Affairs and Communications, Japan.

REFERENCES

[1] Z. Durumeric, M. Bailey, and J. A. Halderman, “An internet-wide view of internet-wide scanning,” Proc. 23rd USENIX Secur. Symp., pp. 65–78, 2014.

[2] N. Blenn, V. Ghiëtte, and C. Doerr, “Quantifying the Spectrum of Denial-of-Service Attacks through Internet Backscatter,” in Proc. 12th Int. Conf. Availability, Reliab. Secur., ser. ARES ’17, 2017.

[3] S. Robertson, E. V. Siegel, M. Miller, and S. J. Stolfo, “Surveillance detection in high bandwidth environments,” Proc. DARPA Inf. Surviv. Conf. Expo. DISCEX, vol. 1, pp. 130–138, 2003.

[4] V. Yegneswaran, P. Barford, and J. Ullrich, “Internet intrusions: Global characteristics and prevalence,” Perform. Eval. Rev., vol. 31, no. 1, pp. 138–147, 2003.

[5] A. Blaise, M. Bouet, V. Conan, and S. Secci, “Detection of zero-day attacks: An unsupervised port-based approach,” Comput. Networks, vol. 180, 2020.

[6] H. Griffioen and C. Doerr, “Discovering Collaboration: Unveiling Slow, Distributed Scanners based on Common Header Field Patterns,” in IEEE/IFIP Netw. Oper. Manag. Symp., 2020, pp. 1–9.

[7] C. Han, J. Shimamura, T. Takahashi, et al., “Real-Time Detection of Global Cyberthreat Based on Darknet by Estimating Anomalous Synchronization Using Graphical Lasso,” IEICE Trans. Inf. Syst., vol. 103, no. 10, pp. 2113–2124, 2020.

[8] S. Herwig, K. Harvey, G. Hughey, et al., “Measurement and analysis of Hajime, a peer-to-peer IoT botnet,” in Netw. Distrib. Syst. Secur. Symp., 2019.

[9] R. D. Graham, “MASSCAN: Mass IP port scanner.” [Online]. Available: https://github.com/robertdavidgraham/masscan

[10] Z. Durumeric, E. Wustrow, and J. A. Halderman, “ZMap: Fast Internet-wide Scanning and Its Security Applications,” in Proc. USENIX Secur. Symp., 2013, pp. 605–620.

[11] M. Antonakakis, T. April, M. Bailey, et al., “Understanding the Mirai Botnet,” in 26th USENIX Secur. Symp., 2017, pp. 1093–1110.

[12] Y. Endo, Y. Mori, J. Shimamura, and M. Kubo, “Proposing Criteria for Detecting Internet-Wide Scanners for Darknet Monitoring,” in IEICE Tech. Rep., ser. ICSS2019-80, vol. 119, no. 437, Okinawa, Mar. 2020, pp. 73–78 (in Japanese).

[13] Z. Z. Wang and A. Sobey, “A comparative review between Genetic Algorithm use in composite optimisation and the state-of-the-art in evolutionary computation,” Compos. Struct., vol. 233, 2020.

[14] C. Han, J. Takeuchi, T. Takahashi, and D. Inoue, “Automated detection of malware activities using nonnegative matrix factorization,” in 20th IEEE International Conference On Trust, Security And Privacy In Computing And Communications (TrustCom), 2021.

[15] K. Goseva-Popstojanova, B. Miller, R. Pantev, and A. Dimitrijevikj, “Empirical analysis of attackers activity on multi-tier web systems,” Proc. Int. Conf. Adv. Inf. Netw. Appl. AINA, pp. 781–788, 2010.

[16] T. Battle, “GIAC Certified Incident Handler Practical,” Methods, 2003.

[17] T. Takahashi, Y. Umemura, C. Han, T. Ban, K. Furumoto, O. Nakamura, K. Yoshioka, J. Takeuchi, N. Murata, and Y. Shiraishi, “Designing comprehensive cyber threat analysis platform: Can we orchestrate analysis engines?” in 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops). IEEE, Mar. 2021.

Detecting Coordinated Internet-Wide Scanning by TCP/IP Header Fingerprint

Article

Full-text available

Jan 2023

Adversaries perform port scanning to discover accessible and vulnerable hosts as a prelude to cyber havoc. A darknet is a cyberattack observation network to capture these scanning activities through reachable yet unused IP addresses. However, the enormous amount of packets and superposition of diverse scanning strategies prevent extracting significant insights from the aggregate traffic. Some coordinated scanners disperse probe packets whose TCP/IP header follows a unique pattern to determine whether the received packets are valid responses to their probes or are part of other background traffic. We call such a pattern a fingerprint. For example, a probe packet from a Mirai-infected host satisfies a pattern whereby the destination IP address equals the sequence number. A fingerprint indicates that the source host has been involved in a particular scanning campaign. Although some fingerprints have been discovered and known to the public, there are and will be more undiscovered ones. We intend to unveil these fingerprints. Our preliminary work automatically identified flexible fingerprints but overlooked low-rate and coordinated scanners. In this work, we improved the fingerprint identifier, enabling it to detect these stealth scans. Moreover, we revealed the scans’ objectives by investigating destination port sets. We associated fingerprints with threat intelligence and verified their reliability. Our approach identified all well-known and eight unknown fingerprints on one month’s worth of darknet data collected from about three-hundred thousand unused IP addresses. We disclosed the fingerprints of the Mozi botnet and destination port sets that were previously unreported.

Towards Long-Term Continuous Tracing of Internet-Wide Scanning Campaigns Based on Darknet Analysis

Conference Paper

Full-text available

Jan 2023

Response time optimization for vulnerability management system by combining the benchmarking and scenario planning models

Article

Full-text available

Feb 2023
IJECE

span>The growth of information and communication technology has made the internet network have many users. On the other side, this increases cybercrime and its risks. One of the main attack targets is network weakness. Therefore, cyber security is required, which first does a network scan to stop the attack. Points of vulnerability on the network can be discovered using scanning techniques. Furthermore, mitigation or recovery measures can be implemented. However, it needs a short response time and high accuracy while scanning to reduce the level of damage caused by cyber-attacks. In this paper, the proposed method improves the performance of a vulnerability management system based on network and port scanning by combining the benchmarking and scenario planning models. On a network scanning to discover open ports on a subnet, Masscan can achieve response times of less than 2 seconds, and on scenario planning for detection on a single host by Nmap can reach less than 4 seconds. It was combining both models obtained an adequate optimization response time. The total response time is less than 6 seconds.</span

Darknet Analysis-Based Early Detection Framework for Malware Activity: Issue and Potential Extension

Conference Paper

Dec 2022

Dark-TRACER: Early Detection Framework for Malware Activity Based on Anomalous Spatiotemporal Patterns

Article

Jan 2022

As cyberattacks become increasingly prevalent globally, there is a need to identify trends in these cyberattacks and take suitable countermeasures quickly. The darknet, an unused IP address space, is relatively conducive to observing and analyzing indiscriminate cyberattacks because of the absence of legitimate communication. Indiscriminate scanning activities by malware to spread their infections often show similar spatiotemporal patterns, and such trends are also observed on the darknet. To address the problem of early detection of malware activities, we focus on anomalous synchronization of spatiotemporal patterns observed in darknet traffic data. Our previous studies proposed algorithms that automatically estimate and detect anomalous spatiotemporal patterns of darknet traffic in real time by employing three independent machine learning methods. In this study, we integrated the previously proposed methods into a single framework, which we refer to as Dark-TRACER, and conducted quantitative experiments to evaluate its ability to detect these malware activities. We used darknet traffic data from October 2018 to October 2020 observed in our large-scale darknet sensors (up to /17 subnet scales). The results demonstrate that the weaknesses of the methods complement each other, and the proposed framework achieves an overall 100% recall rate. In addition, Dark-TRACER detects malware activities an average of 153.6 days earlier than those activities are revealed to the public by reputable third-party security research organizations. Finally, we evaluated the cost of human analysis to implement the proposed system and demonstrated that two analysts can perform the daily operations necessary to operate the framework in approximately 7.3 h.

i-DarkVec: Incremental Embeddings for Darknet Traffic Analysis

Article

May 2023

Darknets are probes listening to traffic reaching IP addresses that host no services. Traffic reaching a darknet results from the actions of internet scanners, botnets and possibly misconfigured hosts. This peculiar nature of darknet traffic makes darknets a valuable instrument to discover malicious online activities, e.g., identifying coordinated actions performed by bots or scanners. However, the massive amount of packets and sources that darknets observe makes it hard to extract meaningful insights, calling for scalable tools to automatically identify and group sources that share similar behaviour. Here we present i-DarkVec, a methodology to learn meaningful representations of darknet traffic. i-DarkVec leverages Natural Language Processing techniques (e.g., Word2Vec) to capture the co-occurrence patterns that emerge when scanners or bots launch coordinated actions. As in NLP problems, the embeddings learned with i-DarkVec enable several new machine learning tasks on the darknet traffic, such as identifying clusters of senders engaged in similar activities. We extensively test i-DarkVec and explore its design space in a case study using real darknets. We show that with a proper definition of services, the learned embeddings can be used to (i) solve the classification problem to associate unknown sources’ IP addresses to the correct classes of coordinated actors, and (ii) automatically identify clusters of previously unknown sources performing similar attacks and scans, easing the security analyst’s job. i-DarkVec leverages a novel incremental embedding learning approach that is scalable and robust to traffic changes, making it applicable to dynamic and large-scale scenarios.

Understanding the Mirai Botnet

Article

Sep 2022

The Mirai botnet, composed primarily of embedded and IoT devices, took the Internet by storm in late 2016 when it overwhelmed several high-profile targets with massive distributed denial-of-service (DDoS) attacks. In this paper, we provide a seven-month retrospective analysis of Mirai's growth to a peak of 600k infections and a history of its DDoS victims. By combining a variety of measurement perspectives, we analyze how the botnet emerged, what classes of devices were affected, and how Mirai variants evolved and competed for vulnerable hosts. Our measurements serve as a lens into the fragile ecosystem of IoT devices. We argue that Mirai may represent a sea change in the evolutionary development of botnets: the simplicity through which devices were infected and its precipitous growth demonstrate that novice malicious techniques can compromise enough low-end devices to threaten even some of the best-defended targets.

Automated Detection of Malware Activities Using Nonnegative Matrix Factorization

Conference Paper

Oct 2021

Real-Time Detection of Global Cyberthreat Based on Darknet by Estimating Anomalous Synchronization Using Graphical Lasso

Article

Oct 2020
IEICE T INF SYST

With the rapid evolution and increase of cyberthreats in recent years, it is necessary to detect and understand them promptly and precisely to reduce their impact. A darknet, which is an unused IP address space, has a high signal-to-noise ratio, so it is easier to understand the global tendency of malicious traffic in cyberspace than with other observation networks. In this paper, we aim to capture global cyberthreats in real time. Since multiple hosts infected with similar malware tend to behave similarly, we propose a system that estimates the degree of synchronization from the patterns of packet transmission time among the source hosts observed in unit time of the darknet and detects anomalies in real time. In our evaluation, we run our proof-of-concept implementation of the proposed engine to demonstrate its feasibility and effectiveness, and we detect cyberthreats with an accuracy of 97.14%. This work is the first practical trial that detects cyberthreats from in-the-wild darknet traffic regardless of new types and variants in real time, and it quantitatively evaluates the result.

Detection of zero-day attacks: An unsupervised port-based approach

Article

Oct 2020
COMPUT NETW

Recent years have witnessed more and more DDoS attacks against high-profile websites, such as the Mirai botnet attack in September 2016 or, more recently, the memcached attack in March 2018, this time with no botnet required. These two outbreaks were neither detected nor mitigated while they were spreading, but only once the attacks happened. Such attacks are generally preceded by several stages, including infection of hosts or device fingerprinting; being able to capture this activity would allow their early detection. In this paper, we propose a technique for the early detection of emerging botnets and newly exploited vulnerabilities, which consists of (i) splitting the detection process over different network segments and retaining only distributed anomalies, and (ii) monitoring at the port level, with a simple yet efficient change-detection algorithm based on a modified Z-score measure. We argue that our technique, named Split-and-Merge, can ensure the detection of large-scale zero-day attacks and drastically reduce false positives. We apply the method to two datasets: the MAWI dataset, which provides daily traffic traces of a transpacific backbone link, and the UCSD Network Telescope dataset, which contains unsolicited traffic mainly coming from botnet scans. The assumption of a normal distribution, for which the Z-score computation makes sense, is verified through empirical measures. We also show that the solution generates very few alerts; an extensive evaluation on the last three years allows identifying major attacks (including Mirai and memcached) that current Intrusion Detection Systems (IDSs) have not seen. Finally, we classify detected known and unknown anomalies to give additional insights about them.
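The split-then-merge idea in this abstract can be illustrated with a small sketch. The abstract does not state how the Z-score is modified, so the code below substitutes the common median/MAD form (with the usual 0.6745 scaling and a 3.5 threshold) as a stand-in. The function names, the sample counts, and the rule "an anomaly is distributed if every segment flags it" are all assumptions made for illustration, not the paper's algorithm.

```python
from statistics import median


def modified_z_scores(counts):
    """Median/MAD-based modified Z-score for each value (stand-in formula)."""
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        # No spread around the median: treat every point as unremarkable.
        return [0.0 for _ in counts]
    return [0.6745 * (c - med) / mad for c in counts]


def flag_changes(counts, threshold=3.5):
    """Indices whose modified Z-score exceeds the threshold in magnitude."""
    scores = modified_z_scores(counts)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


def distributed_anomalies(per_segment_counts, threshold=3.5):
    """Split: flag each segment separately. Merge: keep only the indices
    flagged in every segment (assumed meaning of 'distributed')."""
    flagged = [set(flag_changes(c, threshold)) for c in per_segment_counts]
    return sorted(set.intersection(*flagged)) if flagged else []


# Hypothetical per-day packet counts for one port, seen in two segments.
segment_a = [10, 12, 11, 9, 10, 200, 250]  # local spike on day 5, global on day 6
segment_b = [5, 6, 5, 7, 6, 5, 90]         # spike on day 6 only

print(flag_changes(segment_a))
print(flag_changes(segment_b))
print(distributed_anomalies([segment_a, segment_b]))
```

Requiring agreement across segments is what suppresses the purely local spike on day 5, which is the false-positive reduction the abstract attributes to the merge step.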

Discovering Collaboration: Unveiling Slow, Distributed Scanners based on Common Header Field Patterns

Conference Paper

May 2020

To compromise a computer, it is first necessary to discover which hosts are active and which services they run. This reconnaissance is typically accomplished through port scanning. Defense systems monitor for these unsolicited packets and raise an alarm if a predefined threshold is exceeded. To remain undetected, adversaries can either slow down the scan, and/or distribute it over multiple hosts. With each source below the threshold, the combination of all may still complete the scan efficiently. It is especially this group that is of concern: with enough resources and knowledge to execute such a coordinated activity, they will pose a more potent threat than the noisy "script kiddie". Correlating which out of 4 billion IPs potentially collaborate is however a challenging task, hence today's systems do not consider coordination beyond basic subnet aggregation. In this paper, we propose a method to identify and fingerprint distributed scanners based on commonalities in header fields, which are an artefact of the way fast port scanning software is built. We demonstrate that this method can effectively locate groups, and based on the monitoring logs we report on a number of new groups and tools, among them the largest coordinated scan campaign reported to date.
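Grouping sources by shared header-field values, as this abstract describes, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not list which fields are compared, so the choice of `ip_id` and `tcp_window`, the dictionary packet format, the function name, and the sample values are all hypothetical.

```python
from collections import defaultdict


def group_by_header_pattern(packets, fields=("ip_id", "tcp_window")):
    """Map each combination of header-field values to the set of source
    IPs that sent it, keeping only combinations shared by 2+ sources."""
    groups = defaultdict(set)
    for pkt in packets:
        key = tuple(pkt[f] for f in fields)
        groups[key].add(pkt["src_ip"])
    return {key: srcs for key, srcs in groups.items() if len(srcs) >= 2}


# Hypothetical packets; addresses come from documentation ranges.
packets = [
    {"src_ip": "198.51.100.7", "ip_id": 54321, "tcp_window": 65535},
    {"src_ip": "203.0.113.9", "ip_id": 54321, "tcp_window": 65535},
    {"src_ip": "192.0.2.44", "ip_id": 1207, "tcp_window": 29200},
]

for pattern, sources in group_by_header_pattern(packets).items():
    print(pattern, sorted(sources))
```

Each source on its own may stay below a rate threshold; the grouping step is what links them, since the shared constants are an artefact of the scanning tool rather than of any single host.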

Quantifying the Spectrum of Denial-of-Service Attacks through Internet Backscatter

Conference Paper

Aug 2017

Denial of Service (DoS) attacks are a major threat currently observable in computer networks and especially the Internet. In such an attack, a malicious party tries either to break a service running on a server or to exhaust the capacity or bandwidth of the victim, preventing customers from effectively using the service. Recent reports show that the total number of Distributed Denial of Service (DDoS) attacks is steadily growing, with "mega-attacks" peaking at hundreds of gigabit/s (Gbps). In this paper, we provide a quantification of DDoS attacks in size and duration beyond these outliers reported in the media. We find that these mega-attacks do exist, but the bulk of attacks is in practice only a fraction of these frequently reported values. We further show that it is feasible to collect meaningful backscatter traces using surprisingly small telescopes, thereby enabling a broader audience to perform attack intelligence research.

Designing Comprehensive Cyber Threat Analysis Platform: Can We Orchestrate Analysis Engines?

Conference Paper

Mar 2021

Measurement and Analysis of Hajime, a Peer-to-peer IoT Botnet

Conference Paper

Jan 2019

A comparative review between Genetic Algorithm use in composite optimisation and the state-of-the-art in evolutionary computation

Article

Nov 2019
COMPOS STRUCT

ZMap: fast internet-wide scanning and its security applications

Conference Paper

Aug 2013

Internet-wide network scanning has numerous security applications, including exposing new vulnerabilities and tracking the adoption of defensive mechanisms, but probing the entire public address space with existing tools is both difficult and slow. We introduce ZMap, a modular, open-source network scanner specifically architected to perform Internet-wide scans and capable of surveying the entire IPv4 address space in under 45 minutes from user space on a single machine, approaching the theoretical maximum speed of gigabit Ethernet. We present the scanner architecture, experimentally characterize its performance and accuracy, and explore the security implications of high speed Internet-scale network surveys, both offensive and defensive. We also discuss best practices for good Internet citizenship when performing Internet-wide surveys, informed by our own experiences conducting a long-term research survey over the past year.
