All content in this area was uploaded by Anastasija Nikiforova on Feb 13, 2022
This paper has been accepted for publishing in Proceedings of 8th International Conference on Internet of Things: Systems, Management
and Security (IOTSMS2021). The final authenticated version is available online at https://doi.org/10.1109/IOTSMS53705.2021.9704952
Please, cite this paper as:
Daskevics A. and Nikiforova A. "IoTSE-based open database vulnerability inspection in three Baltic countries:
ShoBEVODSDT sees you," 2021 8th International Conference on Internet of Things: Systems, Management and
Security (IOTSMS), 2021, pp. 1-8, doi: 10.1109/IOTSMS53705.2021.9704952.
IoTSE-based open database vulnerability inspection
in three Baltic countries: ShoBEVODSDT sees you
Artjoms Daskevics
Faculty of Computing
University of Latvia
Riga, Latvia
artjoms.daskevics@gmail.com
Anastasija Nikiforova
Faculty of Computing, Innovation laboratory
University of Latvia
Riga, Latvia
anastasija.nikiforova@lu.lv, ORCID: 0000-0002-0532-3488
Abstract—This study aims to analyze the state of the
security of open databases, i.e. databases accessible from outside
the organization, covering both relational and NoSQL databases in
three Baltic countries: Latvia, Lithuania and Estonia. This is done
using a previously proposed tool for non-intrusive detection of
vulnerable data sources called ShoBEVODSDT (Shodan- and Binary
Edge- based vulnerable open data sources detection tool).
ShoBEVODSDT is based on the use of Internet of Things Search
Engines (IoTSE). It is suitable for this study since it conducts a
passive assessment, which means that its use does not harm the
databases but rather checks for potential bottlenecks or weaknesses
that could be exposed if an attack took place. It supports both a
comprehensive analysis of all unprotected data sources from a
predefined list (MySQL, PostgreSQL, MongoDB, Redis, Elasticsearch,
CouchDB, Cassandra and Memcached) and the examination of a
user-defined IP range to see what is visible about a data source
from outside the organization. Although some data sources can be
described as following the security-by-design principle, some of
them face serious challenges in this respect. The study carries out
a cross-country comparison of 8 data sources. We inspect both
(1) the most vulnerable data sources and (2) the countries
characterized by the highest number of open data sources and the
highest degree of “value” of data available to external actors.
Keywords— Internet of Things Search Engine (IoTSE),
Shodan, BinaryEdge, Internet of Things (IoT), database, NoSQL,
vulnerability
I. INTRODUCTION
Nowadays, billions of interconnected devices form the Internet of
Things (IoT) ecosystem. With an increasing number of devices and
systems in use, the risk of security breaches increases as well
[1-2]. One of these risks is posed by open data sources, i.e. open
databases, by which we mean not databases deliberately opened to
others but databases that are not properly protected and are
therefore available and accessible to external actors outside the
organization. Surprising as it may sound, the number of such
databases is enormous. In many cases this is caused by
misconfiguration, where the responsibility falls to the database
holders; in other cases there are vulnerabilities in the products
and services, where, apart from proper configuration, additional
security mechanisms are needed. But how can one find out whether a
database is visible and even accessible from outside the
organization? What information (if any) may be gathered from it?
Are stronger security mechanisms needed? Is the vulnerability
related to the internal configuration or to the database in use?
Although some questions may be partly answered by referring to
Common Vulnerability and Exposure (CVE) Details and other sources
summarizing vulnerabilities and patches for different services,
this information may be too general. Therefore, testing, and more
precisely penetration testing, could provide insight into the
current state of a specific artifact, i.e. a specific system or set
of systems, or a region. In our previous study [3] we presented a
tool for non-intrusive detection of vulnerable data sources called
ShoBEVODSDT (Shodan- and Binary Edge- based vulnerable open data
sources detection tool). This time, we apply ShoBEVODSDT to three
countries of the Baltic region, namely Latvia, Lithuania and
Estonia, to carry out an extensive investigation of the current
state of data sources and their security in a country context.
The aim of this study is threefold: (1) to validate the tool in
real-life circumstances, thus continuing the previous study, (2) to
draw conclusions on similarities or differences in the patterns of
the three Baltic countries (Latvia, Lithuania and Estonia), i.e.
whether the technological development of Estonia is also visible in
this matter, (3) to draw more objective conclusions on which data
sources are more often open and vulnerable, i.e. allowing the
detection of less “protected by design” data sources.
Thus, the following research questions (RQ) are posed:
(RQ1.1) Which data source is the most likely to be an open
database among the eight analyzed?
(RQ1.2) Which data source is the most likely to be vulnerable?
(RQ2.1) Which country has the most open data sources?
(RQ2.2) Which country has the most vulnerable open data sources?
The paper is structured as follows: a background and
related studies (Section 2), methodology (Section 3), results
of analysis (Section 4), discussion and limitations (Section 5)
and conclusions (Section 6).
II. BACKGROUND
Today, security, and database security in particular, is topical
for at least a few reasons. First, databases are part of every
system, and they have only become more widespread with the rise of
the Internet of Things and the integration of this concept into our
daily lives. Secondly, the popularity of NoSQL databases and their
relatively weak security have significantly increased interest in
this topic. The main security concern is that most NoSQL databases,
while offering users a list of benefits and advantages, are less
likely to provide security measures, sometimes including very
primitive and simple ones such as authentication and authorization
[1, 4]. This also applies to data encryption. Perhaps the most
notorious database in this respect is MongoDB: in 2018 there were
54 000 MongoDB databases accessible on the Internet, which resulted
in the leakage of data of 2.4 million patients of a telemedicine
vendor [5]. While there have been improvements in this respect in
recent years, it remains a problem. However, while the
vulnerability of NoSQL databases is widely debated, this does not
mean that SQL databases are secure and that their holders run no
risk of data leaks.
According to a list drawn up by Bekker [5] and Identity Force on
major security breaches in 2020, a large number of data leaks occur
due to unsecured databases. Examples include:
• Estee Lauder – 440 million customer records;
• Whisper – 900 million user records;
• Key Ring digital wallet – 14 million user records;
• Prestige Software hotel reservation platform – over 10 million
hotel guests, including services such as Expedia, Hotels.com,
Booking.com, Agoda etc.;
• Paay card payments database – 2.5 million card transactions;
• Slickwraps – 850 000 customer records;
• an unnamed U.K.-based security firm – data belonging to Adobe,
Twitter, Tumblr, LinkedIn etc. and their users, with a total of
over 5 billion records;
• Marijuana Dispensaries – 85 000 medical marijuana patient and
recreational user records, etc.
This section briefly covers existing work on this topic, which is
typically divided into (1) registries that allow identifying the
level of security or vulnerability of the service, more precisely
the database, in use and (2) approaches to testing the current
state of the service used in a particular system.
Among the registries that can be used to identify the weakest areas
of a service, Common Vulnerability and Exposure (CVE) Details
(https://www.cvedetails.com/) is probably the most popular index
used for a variety of services. CVE Details collects and provides
to every stakeholder an index of registered vulnerabilities of
various services, including databases, dividing vulnerabilities
into 13 categories: Denial of Service, Code Execution, Overflow,
Memory Corruption, SQL injection, XSS, Directory Traversal, HTTP
response splitting, Bypass something, Gain information, Gain
Privileges, CSRF and File Inclusion, where the “Gain Information”
category is close to the aspect we inspect.
Another registry is VulDB, i.e. the vulnerability database,
documenting and explaining security vulnerabilities, threats, and
exploits for more than 50 years. Like CVE Details, it is not
limited to databases. However, the number of databases it covers is
limited: databases such as Memcached and ElasticSearch, which have
been characterized by a high number of vulnerabilities and leaks in
recent years (also in line with [5]), and some other databases
covered by our study are not represented. This registry can
therefore be used as a complementary source, but in many cases it
will not be applicable.
Also popular is the National Vulnerability Database (NVD), the U.S.
government repository that includes databases of security checklist
references, security-related software flaws, misconfigurations,
product names, and impact metrics.
Although these sources are indisputably valuable, they are rather
static and general, i.e. they provide general information on the
vulnerabilities of databases used by specific organizations or
systems. This alone is not enough, since it gives no insight into
what can be seen from outside the organization. Approaches that
test the current state of the service in use can help here.
According to Bada et al. [1], testing tools such as security or
vulnerability scanners, which present the threats and risks found,
are an essential part of the vulnerability assessment process. They
typically make it possible to define, identify, and classify
security holes. According to CERN [7], vulnerability scanners are
divided by the type of tests executed into intrusive and
non-intrusive. An intrusive test tries to exploit the
vulnerability, which can crash the remote target. A non-intrusive
test attempts not to cause any harm to the target system; it
usually checks the remote service version or whether the service is
configured insecurely. This concept is close to the central object
of the study, namely the identification of open / unprotected
databases. Intrusive tests are indisputably more accurate, but they
cannot be carried out legally in a production environment. Although
a non-intrusive test cannot determine for sure whether a service is
vulnerable, it points to the possibility that it is vulnerable,
which is definitely important and valuable.
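As a rough illustration of the non-intrusive idea, the sketch below
only checks whether a TCP port accepts connections and sends no
payload to the service. The function name `is_port_reachable`, the
timeout value and the choice of a plain reachability check are
assumptions made for illustration; this is not how ShoBEVODSDT or
any particular scanner is implemented, and it presumes a host that
one is entitled to assess.

```python
import socket


def is_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it opens and immediately closes the
    connection without sending any data to the service.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: treated as "not reachable".
        return False
```

A reachable port only indicates that the service is visible from the
caller's network position; whether data can actually be read is a
separate question.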
This is where the concepts of Open Source Intelligence (OSINT) and
Internet of Things Search Engines (IoTSE) come in. IoTSE search for
and index publicly available and accessible IoT devices, thereby
showing how publicly available and accessible specific devices are
[8]. OSINT is defined as a concept describing the search,
collection, analysis and use of information from open sources, and
the methods and tools used [9]. In more general terms, Williams et
al. describe the activities of OSINT in four stages: (1)
collection, (2) processing, (3) exploitation, and (4) production
[10], where processing and exploitation can take place not entirely
sequentially but rather in parallel. They describe these stages as
acquiring or obtaining information, validating this information,
determining its value and providing corresponding results to
customers. ShoBEVODSDT, presented in [3] and used in this study,
follows this paradigm.
The popularity of both concepts is increasing in a variety of
areas, including the detection of open databases and the assessment
of their vulnerabilities and leaks [1, 11]. This can be carried out
at different levels, i.e. (1) at system level for only one
organization or (2) more comprehensively, when an overall insight
into the state of the art is gained that is not limited to a
particular organization.
While there are some IoTSE-based tools such as LeakLooker,
LeakLooker X and Lampyre that automate information gathering, our
previous study [3] found that they have a number of limitations,
for instance a limited list of databases, the inability to perform
a country-based analysis, and a result format that is difficult to
process. We have therefore proposed our own IoTSE-based tool called
ShoBEVODSDT, or Shodan- and Binary Edge- based vulnerable open data
sources detection tool. It is covered in the next section, the
primary aim of which is to present the methodology of the study.
III. METHODOLOGY
The section sets out the main concepts related to the
study methodology, covering the data sources to be analyzed,
the main features of ShoBEVODSDT and introducing a
classification of the results obtained.
A. Data Sources
This study is closely related to our previous study [3], in which
we presented ShoBEVODSDT. As a result, the number and nature of the
data sources we cover are the same. Since ShoBEVODSDT is designed
to search for eight predefined open data sources (MySQL,
PostgreSQL, MongoDB, Redis, Elasticsearch, CouchDB, Cassandra and
Memcached), let us briefly cover them here.
First, these data sources represent both (1) relational SQL
databases and (2) NoSQL databases. To be more precise, three types
of sources, i.e. (1) relational databases, (2) NoSQL databases and
(3) data stores, are covered to ensure a broader view of the state
of play (for a more detailed classification of these data sources
by their type, see Table I). This list is based on three factors:
(1) the most popular databases, according to the results of a
survey of developers conducted in mid-2020 [12], (2) the different
types of data storage, where, apart from relational databases,
NoSQL databases were selected to represent the document-oriented,
column-oriented and key-value types, (3) our own experience of
working with these data stores, which is important because the
specificities of many of them affect the entire testing process, so
at least basic knowledge and skills in working with them are
beneficial.
Although the list of the most popular databases [12] contradicts
other statistics, such as [13], where Oracle and MS SQL are
dominant, the list we have used overlaps with them significantly
and, moreover, comes from developers. However, given this
limitation, i.e. that not all of the most popular databases have
been covered, we leave this for future work. ShoBEVODSDT is
scalable and its source code is available
(https://github.com/zhmyh/Open-Databases), i.e. the list of data
sources to be analyzed may be extended. This allows anyone, if
necessary, to extend the scope of the tool to their needs.
B. ShoBEVODSDT or Shodan- and Binary Edge- based
Vulnerable Open Data Sources Detection Tool
ShoBEVODSDT has already been presented in [3]; thus, we will not
cover it in great detail. Instead, we briefly cover its main
actions and its output, which is further analyzed for the purpose
of this study.
ShoBEVODSDT supports the detection of vulnerabilities at the early,
non-intrusive phases of security assessment, which makes it
possible to apply it both to one's own system and to the whole
ecosystem of a specific country or region. In this paper we refer
to the second case and apply it to the Baltic region represented by
three countries: Latvia, Estonia and Lithuania.
While many studies use only one IoT search engine, such as Shodan,
which is considered a de facto OSINT tool [14-15], ShoBEVODSDT is
based on two of them: Binary Edge and Shodan. This should
contribute to the correctness and completeness of the results, help
to determine the potential attack surface of the data sources
effectively, and contribute to a targeted assessment of
vulnerability.
TABLE I. DATA SOURCES AND THEIR MODELS [BASED ON [13]]
| Database      | Primary database model | Secondary database model                                                       |
|---------------|------------------------|--------------------------------------------------------------------------------|
| MySQL         | Relational DBMS        | document store, spatial DBMS                                                   |
| PostgreSQL    | Relational DBMS        | document store, spatial DBMS                                                   |
| MongoDB       | Document store         | spatial DBMS, search engine                                                    |
| Redis         | Key-value store        | document store, graph DBMS, spatial DBMS, search engine, time series DBMS      |
| Elasticsearch | Search engine          | document store, spatial DBMS                                                   |
| CouchDB       | Document store         | spatial DBMS                                                                   |
| Cassandra     | Wide-column store      | -                                                                              |
| Memcached     | Key-value store        | -                                                                              |
Compared to an individual IoTSE, ShoBEVODSDT extends the list of
features provided by Binary Edge and Shodan and allows for a more
categorized analysis of the data obtained. The latter aspect is
essential to our study, since we intend to cover three countries
and carry out their comparative analysis.
ShoBEVODSDT operation can be characterized as follows:
• ShoBEVODSDT searches for IP addresses of open data sources that
belong to a user-defined country using the available filters of
Shodan and BinaryEdge. The results are combined, duplicates (if
any) are eliminated, and the result is saved in
“parsed/<service_name>_<country>.txt”;
• when an open data source is found, ShoBEVODSDT gathers available
data from the relevant IP addresses and verifies whether data can
be retrieved from the system, which may be possible due to a weak
level of security. ShoBEVODSDT checks the IP addresses found,
classifying them by:
(a) service, i.e. MySQL, PostgreSQL, MongoDB, Redis,
ElasticSearch, CouchDB, Cassandra and Memcached, and
(b) country, i.e. Latvia, Lithuania and Estonia.
By classification we mean sorting the results into matching
folders, which are created when an item matching that
classification is found. This is done by the “check” class method
associated with the service. If the connection to the database has
been successful, the IP address is stored in
“good/<service_name>_<country>.txt”; otherwise, the IP address and
error information are stored in “bad/<service_name>_<country>.txt”.
In total, up to 48 folders can be created if IP addresses
associated with all eight services have been found in all three
countries;
• ShoBEVODSDT retrieves data from the data sources to which it has
managed to connect. This is done by searching for files in
“parsed/good” that correspond to the service and country to be
checked. The process of downloading database content differs from
service to service and is predefined in ShoBEVODSDT for the
abovementioned eight services. If another service is to be added,
the source code of ShoBEVODSDT has to be modified. All information
is stored in the “parsed” folder in a file called
“<IP_ADRESE>.txt”.
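The first two steps above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the tool's actual
code: the function names `merge_results` and `sort_by_connection`,
the `check` callable and the `out_dir` parameter are hypothetical,
and the search-engine queries themselves are left out, so the IP
lists are assumed to have been fetched already.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def merge_results(shodan_ips: Iterable[str],
                  binaryedge_ips: Iterable[str]) -> List[str]:
    """Combine both result sets, dropping duplicates, keeping order."""
    return list(dict.fromkeys([*shodan_ips, *binaryedge_ips]))


def sort_by_connection(ips: Iterable[str], service: str, country: str,
                       check: Callable[[str], None],
                       out_dir: Path) -> List[str]:
    """Write reachable IPs to good/ and failures (with error) to bad/.

    `check` is a hypothetical per-service callable that raises an
    exception when the connection attempt fails.
    """
    good: List[str] = []
    bad: List[str] = []
    for ip in ips:
        try:
            check(ip)
        except Exception as exc:  # any failure counts as "bad"
            bad.append(f"{ip}\t{exc}")
        else:
            good.append(ip)
    for folder, lines in (("good", good), ("bad", bad)):
        target = out_dir / folder / f"{service}_{country}.txt"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("".join(line + "\n" for line in lines))
    return good
```

Keeping the connection test behind a callable mirrors the
description of one “check” method per service: supporting a further
service would then mean supplying another callable.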
When the data have been retrieved and classified, the next stage,
the assessment of the data obtained, takes place to evaluate the
“value” of these data.
C. Classification of Data Obtained by ShoBEVODSDT
Although the data obtained by ShoBEVODSDT are automatically
categorized by the service and country to which the specific IP
address belongs, the data gathered from open data sources should
also be analyzed and classified to allow analysis and comparison of
the sensitivity of the data and the risk they may pose to the
organization. That is why we have designed a very simple
classification, where IP addresses are divided into (1) IP
addresses to which ShoBEVODSDT has managed to connect and (2) IP
addresses to which ShoBEVODSDT has failed to connect. We then refer
only to the first category and classify IP addresses according to
the “value” of the information that can be obtained from these data
sources. The classification is given in Table II. As in [3], each
category is assigned from 0 to 5 points: the higher the risk, the
higher the number of points assigned to the category.
TABLE II. IP ADDRESS CLASSIFICATION [3]
| Category | Description                                                                 |
|----------|-----------------------------------------------------------------------------|
| 0        | failed to connect                                                           |
| 1        | has managed to connect but failed to gather data                            |
| 2        | has managed to connect, but the database is empty                           |
| 3        | has managed to connect by gathering system data or non-sensitive information |
| 4        | has managed to connect and gather sensitive data                            |
| 5        | compromised database                                                        |
The nature of the categories is explained by the nature of the data
that we gathered during the approbation of our tool. Thus, “the
database is empty” is a separate category that is widespread; it is
more valuable than “has managed to connect but failed to gather
data or information” but less valuable than “has managed to connect
by gathering system data or non-sensitive information”. The data
obtained, however, may contain sensitive information or database
information. In addition, the database can be compromised, by which
we mean databases where all records have been deleted and a note
has been left stating that all data have a backup copy and that the
database holders have to pay a ransom (in Bitcoin) to recover the
data; if this is not done, the fraudsters threaten to report a
breach of the General Data Protection Regulation (GDPR), so that
the database holder is penalized because the data were not
protected and a data leak took place.
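The categories of Table II can be written down as a small
enumeration. The sketch below is only an illustration: the member
names, the boolean inputs of `categorize` and the order in which
they are tested are assumptions, and deciding whether gathered data
are sensitive is taken here as an input rather than something the
code determines.

```python
from enum import IntEnum


class Category(IntEnum):
    """Points per category, following Table II."""
    FAILED_TO_CONNECT = 0
    CONNECTED_NO_DATA = 1
    EMPTY_DATABASE = 2
    SYSTEM_OR_NON_SENSITIVE = 3
    SENSITIVE_DATA = 4
    COMPROMISED = 5


def categorize(connected: bool, ransom_note: bool = False,
               empty: bool = False, data_gathered: bool = False,
               sensitive: bool = False) -> Category:
    """Map hypothetical observation flags to a Table II category."""
    if not connected:
        return Category.FAILED_TO_CONNECT
    if ransom_note:
        # Records deleted and a ransom note left behind.
        return Category.COMPROMISED
    if empty:
        return Category.EMPTY_DATABASE
    if not data_gathered:
        return Category.CONNECTED_NO_DATA
    return (Category.SENSITIVE_DATA if sensitive
            else Category.SYSTEM_OR_NON_SENSITIVE)
```

The ransom note is tested before emptiness because a compromised
database, as described above, has had its records deleted and would
otherwise be counted as empty.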
IV. ANALYSIS AND RESULTS: SHOBEVODSDT IN USE
All the data provided in this section have been collected by
ShoBEVODSDT. Although the source code of ShoBEVODSDT is publicly
available, thereby supporting the principles of open science, the
data gathered are not published, since they could provide
information usable in the attack phase of penetration testing. This
is particularly risky when very sensitive information is detected;
we consider our solution a “white hacking” tool.
In total, ShoBEVODSDT processed 15 180 IP addresses, with the
majority belonging to Lithuania (7 453), followed by Estonia
(5 352) and Latvia (2 375). For 98.43% of the addresses the
connection failed. Therefore, further actions took place with only
1.57%, or 238 IP addresses.
In terms of data source / database, the most popular service (at
least for the Baltic States) is MySQL, followed by PostgreSQL. The
third most popular data source varies from country to country and
is covered in the following sections. The least popular service is
Cassandra. This trend holds for all three countries analyzed. This
may be because MySQL is intended as a website database and various
website deployment services offer a MySQL database for free when
renting a server, while Cassandra is meant to store Big data across
multiple servers. Figure 1 provides statistics on services, showing
the percentage of each connection status per analyzed data source.
Let us now turn to an overview of the results by country.
A. Latvia
ShoBEVODSDT found 2 375 IP addresses, of which 2 325 were
protected, so ShoBEVODSDT failed to connect to them. However,
2.11% were open, which is significantly higher than the average.
[Chart: Distribution of IP addresses by successful connection to
them (by service); one bar per service on a 0% to 100% scale;
series: connection failed (bad), successful connection (good).]
Fig. 1. Distribution of network hosts with IP addresses attempted to
connect (by service).
Although the total number of IP addresses found is the lowest of
the three countries, this cannot be extrapolated to the subsequent
results, i.e. the number of unprotected and vulnerable data sources
(50 IPs). Among the services in use, MySQL and PostgreSQL are the
most popular, followed by Memcached, MongoDB and Redis.
Fig. 2 shows the general distribution of the successful connections
that ShoBEVODSDT managed to make. Memcached was the most popular
database among those to which a connection was successful, followed
by ElasticSearch and MySQL. However, relative to the total number
of databases found per service, Cassandra and ElasticSearch have
the highest share of databases to which ShoBEVODSDT managed to
connect (all databases found were open), followed by Memcached
(82.5%) and Redis (6.25%). There are also two services to which
ShoBEVODSDT did not manage to connect at all, namely PostgreSQL and
CouchDB.
As for the “value” of the data gathered by ShoBEVODSDT, Fig. 3
shows that Memcached has the highest number of cases in which the
data gathered can be classified as system data (3 points) and even
sensitive data (4 points). ShoBEVODSDT identified 4 databases that
had already been compromised. Although Memcached has the highest
number of vulnerable databases, no Memcached database was found to
be compromised. The highest number of compromised databases was
found for ElasticSearch (2 databases), with one each for MySQL and
MongoDB. The most common case for the Latvian databases found is
that ShoBEVODSDT is able to connect and gather system data or
non-sensitive information (3 points).
[Pie chart: Latvia, distribution of successful connections by
service.]
Fig. 2. Distribution of successful connections by service for Latvia.
[Chart: Latvia, classification of IP addresses by service and
gathered data "value" (from 1 to 5 points); x-axis: data source;
y-axis: number of data sources; series: categories 1 to 5 of
Table II.]
Fig. 3. Classification of IPs by service and gathered data “value” (Latvia).
B. Estonia
For Estonia, ShoBEVODSDT found 5 352 IP addresses, of which 5 307
were protected, so ShoBEVODSDT failed to connect to them. 45 IP
addresses (0.84%) were open, and further action took place with
them. Although the number of IP addresses found by ShoBEVODSDT is
more than twice as high as for Latvia, the ratio of IP addresses to
which it is possible to connect from outside to the total number of
detected IP addresses is about 2.5 times lower (0.84% versus
2.11%).
For services in use, MySQL and PostgreSQL are the most
popular, followed by ElasticSearch, MongoDB and Redis.
Although MongoDB and Redis were also among the most popular
services for Latvia, ElasticSearch is new here, while Memcached,
which holds third place in Latvia, is significantly less popular.
Fig. 4 shows the general distribution of the successful connections
that ShoBEVODSDT managed to make. ElasticSearch was the most
popular database among those to which a connection was successful,
followed by MySQL and Memcached. However, relative to the total
number of databases found per service, ElasticSearch has the
highest share of databases to which ShoBEVODSDT managed to connect
(all databases found were open), followed by Memcached (87.5%) and
MongoDB (15.8%).
As regards the “value” of the data gathered by ShoBEVODSDT, Fig. 5
shows that in the case of Estonia MySQL, followed by ElasticSearch,
has the highest number of both compromised databases and cases in
which the data gathered can be classified as system data (3
points). As regards sensitive data (4 points), ElasticSearch,
followed by Memcached and Redis, leads this negative trend.
Moreover, 8 databases have been classified as compromised: 4
ElasticSearch databases, 2 MongoDB databases, and 1 each for
PostgreSQL and Memcached.
[Pie chart: Estonia, distribution of successful connections by
service.]
Fig. 4. Distribution of successful connections by service for Estonia.
[Chart: Estonia, classification of IP addresses by service and
gathered data "value" (from 1 to 5 points); one group of bars per
service; series: categories 1 to 5 of Table II.]
Fig. 5. Classification of IPs by service and gathered data “value”
(Estonia).
Here we can see that although the total number of compromised
databases is not very high, ElasticSearch has the highest number of
compromised databases, as was the case for Latvia.
The most common case for the Estonian databases found is the same
as for Latvia: the ability to connect and gather system data or
non-sensitive information (3 points).
C. Lithuania
In the case of Lithuania, ShoBEVODSDT found 7 453 IP addresses, of
which 7 310 were protected. Thus, further actions took place with
the 143 IP addresses (1.92%) that were open. This is more than half
of all the IP addresses we process further. Although the number of
IP addresses to which we managed to connect is the highest, the
ratio is lower than in the Latvian case but still more than twice
as high as in the case of Estonia.
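The ratios compared above can be recomputed from the reported
counts. The sketch below is a simple arithmetic check; the variable
and function names are illustrative, and the numbers are the found
and open IP counts given in the country subsections.

```python
# (IP addresses found, IP addresses open) per country, as reported above
counts = {
    "Latvia": (2375, 50),
    "Estonia": (5352, 45),
    "Lithuania": (7453, 143),
}


def open_rate(found: int, opened: int) -> float:
    """Share of found IP addresses that accepted a connection, in %."""
    return 100 * opened / found


rates = {c: open_rate(found, opened) for c, (found, opened) in counts.items()}
total_found = sum(found for found, _ in counts.values())
total_open = sum(opened for _, opened in counts.values())

for country, rate in rates.items():
    print(f"{country}: {rate:.2f}% open")
print(f"Overall: {open_rate(total_found, total_open):.2f}% of {total_found}")
print(f"Latvia vs Estonia: {rates['Latvia'] / rates['Estonia']:.2f}x")
print(f"Lithuania vs Estonia: {rates['Lithuania'] / rates['Estonia']:.2f}x")
print(f"Lithuanian share of open IPs: {100 * 143 / total_open:.1f}%")
```

Comparing rates rather than raw counts is what allows the three
countries to be set side by side despite the different numbers of
IP addresses found.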
For services in use, MySQL and PostgreSQL are the most
popular, followed by ElasticSearch, MongoDB and Redis.
An interesting point here is that Lithuania combines the
results of both abovementioned countries and their most
popular services.
Fig. 6 shows the general distribution of the successful connections
that ShoBEVODSDT managed to make. ElasticSearch was the most
popular database among those to which a connection was successful,
followed by Memcached and MongoDB. However, relative to the total
number of databases found per service, ElasticSearch has the
highest share of databases to which ShoBEVODSDT managed to connect
(all databases were open), followed by Memcached (77.6%) and
MongoDB (14.4%).
As regards the “value” of the data gathered, Fig. 7 shows that in
the case of Lithuania ElasticSearch, followed by Memcached, has the
highest number of cases in which data were gathered.
[Pie chart: Lithuania, distribution of successful connections by
service.]
Fig. 6. Distribution of successful connections by service for Lithuania.
[Chart: Lithuania, classification of IP addresses by service and
gathered data "value" (from 1 to 5 points); one group of bars per
service; series: categories 1 to 5 of Table II.]
Fig. 7. Classification of IPs by service and gathered data “value”
(Lithuania).
Here we also observe the highest number of compromised databases (5
points), which mainly belong to ElasticSearch and MongoDB (17
databases per database type), plus one Memcached database. This
means that about ¼ of all open databases have been compromised.
The most common case is the same as for the countries mentioned
above, i.e. the ability to connect and gather system data or
non-sensitive information (3 points), but here it is a less
pronounced leader. It is followed by compromised databases (5
points) and empty databases (2 points), with slightly fewer
databases from which sensitive data have been gathered (4 points).
D. Summary of Results in the Country-by-country Context
For the databases to which ShoBEVODSDT was able to connect, the
highest ratio of compromised databases belongs to Lithuania, where
a total of 24.5% of all databases were compromised. Surprisingly,
it is followed by Estonia with 17.8% compromised databases, while
for Latvia only 8% of all databases to which we connected were
compromised.
However, this trend does not apply to the ratio of cases
where sensitive data have been gathered, as the most
negative trend is shown by Latvia (20%), followed by
Lithuania (18.9%), with the best results for Estonia (13.3%).
As regards the gathering of system and non-sensitive data,
Estonia demonstrates the most negative trend, with 46.7%
of all databases falling into this category (3 points), followed by
Latvia (44%) and Lithuania (35%).
Overall, the average “value” of the gathered data for the three
countries is 3.22, i.e. closer to the critical level. The
worst results are demonstrated by Lithuania with 3.45 of 5
points, followed by Estonia with 3.18 and Latvia with 3.02
points. A summary of the analysis is provided in Table III,
which gives the total number of IP addresses found, the
number of successful connections and the distribution of gathered data
“value” by country (the most negative trends are
highlighted in red).
TABLE III. SUMMARY OF RESULTS BY COUNTRY
| | Latvia | Estonia | Lithuania |
|---|---|---|---|
| Total found | 2375 | 5352 | 7453 |
| Connection successful | 50 (2.1%) | 43 (0.8%) | 143 (1.9%) |
| Compromised DB (5 points) | 4 (8%) | 8 (18.6%) | 35 (24.5%) |
| Sensitive data (4 points) | 10 (20%) | 21 (48.8%) | 27 (18.9%) |
| System or non-sensitive data (3 points) | 22 (44%) | 21 (48.8%) | 50 (35%) |
| DB is empty (2 points) | 11 (22%) | 7 (16.3%) | 29 (20.3%) |
| Failed to gather data (1 point) | 3 (6%) | 3 (7%) | 3 (2.1%) |
| AVG data “value” | 3.02 | 3.18 | 3.45 |
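The averages in Table III can be reproduced from the per-level counts. The sketch below assumes that the average “value” is the sum of points over all classified IP addresses divided by the number of successful connections, and that the overall figure is the unweighted mean of the three country averages. Both formulas are assumptions rather than definitions taken from the text, and the function name `average_value` is hypothetical.

```python
from statistics import mean


def average_value(counts: dict[int, int], connections: int) -> float:
    """Point-weighted average of the data "value" (assumed formula)."""
    weighted_sum = sum(points * n for points, n in counts.items())
    return weighted_sum / connections


# Lithuania, Table III: number of IP addresses per "value" level.
lithuania = {5: 35, 4: 27, 3: 50, 2: 29, 1: 3}
lt_connections = 143

lt_avg = average_value(lithuania, lt_connections)
print(round(lt_avg, 2))  # 3.45

# Share of compromised databases among successful connections.
print(round(100 * lithuania[5] / lt_connections, 1))  # 24.5

# Overall figure as the unweighted mean of the country averages (assumption).
print(round(mean([3.02, 3.18, 3.45]), 2))  # 3.22
```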
[Chart: “Sensitivity of gathered data by service (1 to 5 points)”. Services: MySQL, PostgreSQL, MongoDB, Redis, Memcached, ElasticSearch, CouchDB, Cassandra; percentage axis from 0% to 100%. Legend: 1 - has managed to connect but failed to gather data or information; 2 - has managed to connect, but the DB is empty; 3 - has managed to connect by gathering system data or non-sensitive information; 4 - has managed to connect and gather sensitive data; 5 - compromised database.]
Fig. 8. Sensitivity of gathered data by data source.
B. Results in the Context of Data Source
We have already found that Memcached and
ElasticSearch were the leading data sources to which
ShoBEVODSDT has managed to connect.
Let us now turn to a brief summary of the sensitivity of
the data, because knowing whether it is possible to
connect to a database from outside the organization is not
enough to draw conclusions about its security.
Fig. 8 provides statistics on the “value” of the data we have
gathered. Taken across all services and classified
according to Table II, the most common category is
“has managed to connect by gathering system data or non-
sensitive information” (45%), followed by “has managed to
connect, but the database is empty” (21%). This could be
considered a positive result, i.e. while these data sources
are visible to external actors, they are not of very high value
to attackers, although they can facilitate attacks. However,
18% of these data sources contain data that could be used by
attackers, and 12% of them have already been compromised
[3].
For compromised databases, it is known that in 2020
fraudsters attacked more than 22 000 MongoDB databases
[16]. However, our experiment shows that MongoDB is not
the only database type that has been compromised. When the
number of compromised databases is related to the total
number of databases of a particular type to which
ShoBEVODSDT has managed to connect, the most
compromised databases belong to Elasticsearch (27% of all
Elasticsearch), followed by MongoDB (11% of all
MongoDB), PostgreSQL (0.08% of all PostgreSQL) and
Memcached (2% of all Memcached databases are
compromised).
For databases from which sensitive data have been
gathered, the leader of this negative trend is Redis: for
83.3% of the open databases to which ShoBEVODSDT has
managed to connect, it was possible to gain sensitive data
that can be used to attack them. MySQL and
Memcached are also among the leaders in this respect.
To sum up:
• PostgreSQL is mainly characterized by compromised databases and databases from which non-sensitive data or system data can be gathered;
• MongoDB is characterized by a high number of cases where databases are compromised (83.3%), followed by data sources from which sensitive data can be gathered (4.2%) and some data sources from which system and non-sensitive data can be gathered. This finding is also in line with [1, 5];
• Cassandra in this case can be characterized as a data source to which we have managed to connect, but the database was empty;
• Redis can be characterized as a data source from which sensitive data may be gathered. In some cases, the relevant databases are empty;
• Memcached can be characterized as a data source where system data and non-sensitive data were gathered most frequently (61.3%), followed by sensitive data gatherings (22.6%), empty databases (12.9%) and 2.1% compromised databases;
• MySQL is characterized by a prevailing number of databases from which non-sensitive or system data can be gathered (52.6%), with 21.1% of databases from which sensitive data can be gathered and 5.3% compromised databases. However, MySQL has also proved to be a database type where, although ShoBEVODSDT has managed to connect to the database, data gathering has failed;
• ElasticSearch databases are represented in all categories: the largest number of databases is empty, followed by a large number of databases that are already compromised (26.7%), which is also in line with [6]. In addition, 24.4% of ElasticSearch databases contain non-sensitive or system data and 8.1% contain sensitive data. However, it remains one of two database types where, although ShoBEVODSDT was able to connect, data gathering was unsuccessful.
A summary by service (excluding CouchDB, where
ShoBEVODSDT has found no vulnerabilities) is provided in
Table IV, which gives the total number of IP addresses
found, the share of successful connections and the distribution of
gathered data “value” for categories “5”, “4” and “1” (most
and least vulnerable), highlighting the most negative trends
in red.
Overall, the average “value” of the gathered data for the eight
services under consideration is 2.83. The worst results are
demonstrated by MongoDB with 4.5 of 5 points, followed by
PostgreSQL with 3.7, and ElasticSearch and Memcached with
3.17 and 3.16 points, respectively.
TABLE IV. SUMMARY OF RESULTS BY SERVICE
| | MySQL | PostgreSQL | MongoDB | Redis | Memcached | ElasticSearch | Cassandra |
|---|---|---|---|---|---|---|---|
| Total found | 13471 | 1187 | 177 | 122 | 116 | 86 | 7 |
| Connection successful | 0.14% | 0.3% | 7.9% | 9.8% | 80% | 100% | 14% |
| Compromised DB (5 points) | 5.3% | 33% | 71% | 0 | 2.2% | 27% | 0 |
| Sensitive data (4 points) | 0 | 0 | 7.1% | 83% | 24% | 8% | 0 |
| Failed to gather data (1 point) | 21% | 0 | 0 | 17% | 0 | 3.5% | 0 |
| AVG data “value” | 2.7 | 3.67 | 4.5 | 3.5 | 3.15 | 3.17 | 2 |
These results can be explained not only by the database
holders’ awareness of their data security, but also by the
default security mechanisms of the relevant data sources. In other
words, data sources with weaker mechanisms are more likely
to be vulnerable. Our examination of the data sources in
question leads us to the conclusion that Redis and Memcached have
no authentication mechanisms, while MongoDB and
ElasticSearch allow them to be enabled but do not have them
enabled by default. In contrast, MySQL, CouchDB and
Cassandra require authentication data and show better results
when ShoBEVODSDT is used.
This observation makes it possible to state that, in many
cases, even such a primitive and simple measure as a proper
authentication mechanism leads to a significant reduction in
the risk of data leakage and intrusion.
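The relationship between default authentication and the results can be illustrated by grouping the average “value” from Table IV by the default authentication posture described above. This is only a sketch: the labels `none`, `optional` and `required` and the function name `mean_value_by_posture` are hypothetical, PostgreSQL is left out because the text does not classify it, and CouchDB is skipped because it has no row in Table IV.

```python
from statistics import mean

# Default authentication posture per service, as described in the text.
DEFAULT_AUTH = {
    "Redis": "none",
    "Memcached": "none",
    "MongoDB": "optional",        # can be enabled, but is off by default
    "ElasticSearch": "optional",  # can be enabled, but is off by default
    "MySQL": "required",
    "CouchDB": "required",
    "Cassandra": "required",
}

# Average data "value" per service, taken from Table IV.
AVG_VALUE = {
    "MySQL": 2.7,
    "PostgreSQL": 3.67,
    "MongoDB": 4.5,
    "Redis": 3.5,
    "Memcached": 3.15,
    "ElasticSearch": 3.17,
    "Cassandra": 2.0,
}


def mean_value_by_posture(auth: dict[str, str], values: dict[str, float]) -> dict[str, float]:
    """Mean of the per-service averages within each authentication group."""
    groups: dict[str, list[float]] = {}
    for service, posture in auth.items():
        if service in values:  # skip services without a row in Table IV
            groups.setdefault(posture, []).append(values[service])
    return {posture: mean(vals) for posture, vals in groups.items()}


result = mean_value_by_posture(DEFAULT_AUTH, AVG_VALUE)
# Groups ordered from the lowest to the highest mean "value".
print(sorted(result, key=result.get))  # ['required', 'none', 'optional']
```

Under these assumptions, the services that require authentication have the lowest mean, which is consistent with the observation above. With so few services per group, this is an illustration rather than a statistical result.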
V. DISCUSSION AND LIMITATIONS
First, in this study we use our self-developed tool
ShoBEVODSDT [3], which relies on passive assessment
characterized by a low level of intrusiveness [17]. The
respective data sources are therefore not thoroughly tested to
verify whether the vulnerabilities identified in the systems
actually exist; rather, the tool points to such a possibility.
Secondly, the number of services inspected is limited and
the number of databases per service is not balanced, which
does not allow us to state with a high degree of
confidence that a particular service is highly vulnerable
while another one is totally secure. Thus, although the figures
indicate that there are no vulnerabilities among open CouchDB
databases, this cannot be generalized, because ShoBEVODSDT
has found only 14 CouchDB IP addresses, whereas for other
databases this number exceeds 1 000. In order to draw more
generalizable conclusions on services, the sample should be
balanced. However, this was not the main aim of this study,
which was to examine the state of the art in three
countries.
In addition, in the future we plan to compare the
results obtained with CVE Details in order to verify
whether there is a relationship between the
registered “Gain Information” vulnerabilities and the data
that we have managed to collect. A similar approach was
applied by Genge et al. [15], and we expect that it will allow
us to obtain more generalizable results on the services in
question.
VI. CONCLUSIONS
More and more studies highlight the risks posed by IoT
devices and stress the need for actions to ensure the security
of the IoT ecosystem at a wide range of levels [1-2, 18]. In this
paper, we have applied the IoTSE-based tool
ShoBEVODSDT, presented in our previous study [3],
to inspect the state of play in three countries of the Baltic
region, namely Latvia, Estonia and Lithuania, with regard to
unprotected open databases accessible outside the
organization and the “value” of the data that can be gathered
from them in the case of a successful connection. We have
inspected eight data sources for vulnerabilities and their
extent. We conclude that although the total number of open
databases accessible outside the organization is less than 2%
of the data sources scanned, there are data sources that may
pose risks to organizations. Moreover, 12% of open data
sources have already been compromised.
We conclude that the weakest results are demonstrated by
Lithuania with 3.45 of 5 points, followed by Estonia with
3.18 and Latvia with 3.02 points. For the services under
question, the worst results are demonstrated by MongoDB,
followed by PostgreSQL, ElasticSearch and Memcached.
We argue that ShoBEVODSDT can be useful for (1)
individual organizations, to determine whether the data in their
data sources are visible and even accessible outside the
organization, (2) testers, to effectively map the potential
attack surface and advance targeted vulnerability
assessments, with their further inspection and development
of preventive activities and security mechanisms, (3)
scientists and developers, to carry out a comprehensive
multidimensional and longitudinal analysis of unprotected
data sources, and (4) countries and their governments, when
defining guidelines and laws according to the state of the art
at the country level that would promote technological
development and better protection.
REFERENCES
[1] M. Bada, I. Pete, “An exploration of the cybercrime ecosystem
around Shodan,” In 2020 7th International Conference on Internet of
Things: Systems, Management and Security (IOTSMS) (pp. 1-8).
IEEE, December, 2020.
[2] M. Al-Ruithe, S. Mthunzi, E. Benkhelifa, “Data governance for
security in IoT & cloud converged environments,” In 2016
IEEE/ACS 13th International Conference of Computer Systems and
Applications (AICCSA) (pp. 1-8). IEEE, November, 2016.
[3] A. Daskevics, A. Nikiforova, “ShoBeVODSDT: Shodan and Binary
Edge based vulnerable open data sources detection tool or what
Internet of Things Search Engines know about you,” 2021 (in print).
[4] E. Sahafizadeh, M. A. Nematbakhsh, “A survey on security issues in
Big Data and NoSQL,” Advances in Computer Science: an
International Journal, 4(4), 68-72, 2015.
[5] J. Davis, “Telemedicine vendor breaches the data of 2.4 million
patients in Mexico,” 2018. [Online]. Available:
https://www.healthcareitnews.com/news/telemedicine-vendor-
breaches-data-24-million-patients-mexico
[6] E. Bekker, “2020 data breaches. The most significant breaches of
the year,” IdentityForce, a Sontiq brand, 2020.
[7] B. Burns, D. Killion, N. Beauchesne, E. Moret, J. Sobrier, M. Lynn,...
P. Guersch, “Security power tools,” O'Reilly Media, Inc., 2007.
[8] S. Samtani, M. Kantarcioglu, H. Chen, “Trailblazing the Artificial
Intelligence for Cybersecurity Discipline: A Multi-Disciplinary
Research Roadmap,” ACM Trans. Manage. Inf. Syst., 11(4), Article 17,
December 2020, doi: 10.1145/3430360.
[9] J. R. G. Evangelista, R. J. Sassi, M. Romero, D. Napolitano,
“Systematic literature review to investigate the application of open
source intelligence (osint) with artificial intelligence,” Journal of
Applied Security Research, 1-25, 2020.
[10] H. J. Williams, I. Blum, “Defining second generation open source
intelligence (OSINT) for the defense enterprise,” RAND Corporation
Santa Monica United States, 2018.
[11] A. Oganesyan, DeviceLock Inc., How Researchers Discover
MongoDB and Elasticsearch Open Databases (2019), online:
https://m.devicelock.com/blog/how-researchers-discover-mongodb-
and-elasticsearch-open-databases.html
[12] O. Valin (2020) Most popular databases in 2020 and new trends,
[online] https://www.eversql.com/most-popular-databases-in-2020/,
last accessed 22.06.2021
[13] DB-engines (2021), [online] https://db-engines.com/en/ranking, last
accessed 22.06.2021
[14] P. D. C. de Sousa Rodrigues, “An OSINT Approach to Automated
Asset Discovery and Monitoring,” 2019.
[15] B. Genge, C. Enăchescu, “ShoVAT: Shodan-based vulnerability
assessment tool for Internet-facing services,” Security and
communication networks, 9(15), 2696-2714, 2016.
[16] A. Bizga, “Bad Actors Target MongoDB Databases, Threatening to
Contact GDPR Legislators Unless Ransom is Paid,” Online:
https://securityboulevard.com/2020/07/bad-actors-target-mongodb-
databases-threatening-to-contact-gdpr-legislators-unless-ransom-is-
paid/, last accessed: 11.07.2021
[17] S. Samtani, S. Yu, H. Zhu, M. Patton, H. Chen, “Identifying SCADA
vulnerabilities using passive and active vulnerability assessment
techniques,” In 2016 IEEE Conference on Intelligence and Security
Informatics (ISI) (pp. 25-30). IEEE, September, 2016.
[18] Y. Jararweh, M. Al-Ayyoub, E. Benkhelifa, M. Vouk, A. Rindos,
“SDIoT: a software defined based internet of things framework,”
Journal of Ambient Intelligence and Humanized Computing, 6(4),
453-461, 2015.