Analysis of transformation tools applicable on
NewSQL databases
Sarah Myriam Lydia Hahn, Ionela Chereja, and Oliviu Matei
Computers and Information Technology, Technical University Cluj Napoca,
Cluj-Napoca, Romania
sarah-hahn@gmx.net,ionela.chereja@yahoo.com,oliviu.matei@holisun.com
Abstract. With the introduction of NoSQL databases, new opportunities in the field of cloud technology, such as scalability and performance, were utilized. However, NoSQL databases could not satisfy all requirements of some applications, such as consistency and ACID transactions. For this reason NewSQL databases were developed - a mixture of relational databases and NoSQL databases with the aim of getting the advantages of both. Besides presenting the most known NewSQL databases and the requirements of transformation tools, the most popular transformation tools in the field of NewSQL databases are introduced. An analysis shows how they comply with the requirements.
Keywords: NewSQL · Transformation · ETL · ELT · Cloud technology · ACID model · CAP theorem · Scalability · Consistency
1 Introduction
Not only the trend of big data but also new possibilities in the range of cloud computing, like resource pooling and rapid elasticity [39], led to the development of various new database management systems. So-called NoSQL databases ("not only SQL") try to eliminate the disadvantages of RDBMS by leveraging cloud technology. They come along with fast I/O, support for big data, scalability and low cost [29]. Furthermore they are schema-free to handle a variety of data. In the beginning they did not have a structured query language. That is one of the reasons NewSQL databases came up. These databases aim to bring the advantages of both systems together - relational database management systems (RDBMS) and NoSQL [14]. Just as the mode of operation of the databases changed, the way of data load processing changed as well. Typical for RDBMS were Extract-Transform-Load (ETL) processes with the following steps [38]:
- Extract: First the necessary data is extracted from different data sources into a working area. The data itself is not adapted in any way.
- Transform: In the working area the data is transformed to make it usable in the target database.
- Load: In the last step the transformed data is loaded into the target database.
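The three steps above can be sketched in a few lines of Python (the function names and the record fields are illustrative assumptions, not part of any particular tool):

```python
def extract(rows):
    """Extract: copy raw records into a working area, unchanged."""
    return list(rows)

def transform(staged):
    """Transform: adapt the staged data for the target schema."""
    return [{"name": r["name"].strip().title(), "age": int(r["age"])}
            for r in staged]

def load(transformed, target):
    """Load: write the transformed records into the target store."""
    target.extend(transformed)

target_db = []
source = [{"name": "  ada lovelace ", "age": "36"}]
load(transform(extract(source)), target_db)
# target_db now holds the cleaned record
```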
With NoSQL and NewSQL databases, ETL processes changed to Extract-Load-Transform (ELT) processes. Data is first loaded as a whole into the DWH without any transformations. Whenever a new transformation is needed, the loading stage does not have to be changed; only the new transformation has to be set up [12]. Transformations can be precomputed or calculated in real time. The new database management systems (DBMS) and processes bring new requirements for transformation tools. This paper analyses the most popular transformation tools in the field of NewSQL databases. Until now there have been several analyses regarding NewSQL databases [19, 20, 31], but there is no analysis concerning transformation tools in the field of NewSQL databases. By analysing the most important properties, the advantages and disadvantages of each tool are shown, and a comparison in table form is given. All in all this should support the decision-making to find the best database system and transformation tool for each process. It also represents a market overview of the most popular tools in the context of NewSQL databases, which did not exist until now. The structure is the following: the next section "related work and explanations" provides concepts and further definitions as the basis for what follows. In the third chapter "methodology" the approach of the analysis itself is described. The "analysis" of the several transformation tools can be found in chapter four. Last but not least the "conclusion" of the analysis is presented.
2 Related work and explanations
This section explains related work and provides explanations for the following analysis. First the foundation of NewSQL databases - cloud computing - is introduced. Afterwards NewSQL databases and the concepts of ACID and CAP are illustrated.
2.1 Cloud computing
There are many different definitions for the term "cloud computing" [16]. In the following, the term is defined as a parallel and distributed system based on various virtualized computers provided by a cloud vendor [9]. The cloud vendors ensure the operation by providing the hardware, simple access and professional skills. It is based on a pay-per-use model [16]. According to the National Institute of Standards and Technology, cloud computing has the following characteristics [39]:
- On-demand self-service - a customer can access the services without human interaction with the service vendor
- Broad network access - thick and thin client applications have access to the services through a standard mechanism
- Resource pooling - the capability of the vendor can be served to multiple consumers
- Rapid elasticity - the provision of resources is automated, rapid and elastic
- Measured service - resource usage is monitored, controlled, and reported automatically for transparency
These characteristics have several advantages for the user, such as a lower up-front investment, lower operating costs, higher scalability and elasticity, easy access through the internet, and reduced business risks and maintenance expenses [53].
2.2 NewSQL
NewSQL databases were first mentioned in 2011 by Matthew Aslett [1]. This kind of database brings the advantages of RDBMS and NoSQL databases together. With the new kind of databases, online transaction processing (OLTP) workloads should be migrated to NewSQL as an alternative to NoSQL because of the latter's disadvantages. In contrast to NoSQL databases, it is scalable and has a high performance and availability. On the other side, like an RDBMS, it guarantees ACID transactions and supports (partial) SQL queries. SQL is also the main language for applications using NewSQL. A non-locking concurrency control mechanism supports several read operations and writes without any locks [31, 2]. Depending on the NewSQL database, it uses known features from NoSQL databases such as a column-oriented data model, distributed nodes or in-memory processing, with the objective of processing big data in the field of OLTP applications [40]. Roughly, NewSQL databases can be divided into those using a data sharding layer and those using distributed storage. A data sharding layer is used when there are multiple independent instances of an RDBMS [11, 4]. For each database it has to be checked in detail how it works between the different instances, especially concerning ACID transactions and availability. In contrast, distributed storage has one single logical RDBMS. There is only one API endpoint which is called. The data itself is distributed on several nodes [33].
2.3 ACID model
ACID defines the four characteristics atomicity, consistency, isolation and durability. If a database has these properties, requests are also called transactions. Atomicity means that each transaction is either completely successful or fails as a whole. If there is a power failure or any other error during the execution of a transaction, the whole transaction will fail and the data will be restored to its state before the transaction. The second property, consistency, says that the database must always be in a valid state. Data sets are always consistent concerning constraints, cascades and triggers. Isolation means that each transaction is viewed separately: even if transactions are executed in parallel, the result is the same as if they were executed sequentially. To guarantee isolation, tables can be locked if there are concurrent transactions. Durability ensures that once a transaction was successful, it remains so even after a failure such as a power outage [21]. This can be implemented with a transaction log on a hard disk.
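Atomicity can be illustrated with a small Python sketch using SQLite, whose transactions follow the ACID model (the table and account names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        # simulated crash before the matching credit to bob:
        raise RuntimeError("power failure mid-transaction")
except RuntimeError:
    pass

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
# atomicity: the debit was rolled back, so alice's balance is unchanged
```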
2.4 CAP theorem
The CAP theorem is a theorem about distributed systems and their characteristics. It was published in 2002 and is explained next [17]. It focuses on the three characteristics consistency, availability, and partition tolerance. Consistency means that all nodes in a distributed system see the same data at the same time. When data is inserted, changed or deleted, all nodes must have the same data up to the last write operation. A read operation has the same result independently of the node which is queried. This definition of consistency is not the same as consistency in the ACID model, which describes consistency within a data set. Availability describes the uptime of the system, which must be 100%; consequently the system is always accessible. Every request at any time gets a response within an acceptable response time, and the response must not be an error. Availability does not require that the response contains the correct or latest value. The third characteristic is partition tolerance. It is characterized by a high fault tolerance independent of loss of connection between nodes, dropped messages or other events. The theorem states that a database cannot satisfy all three of these characteristics; at most a combination of two is possible.
3 Methodology
3.1 Databases and transformation tools
There are many surveys which compare the different existing NewSQL databases and their properties [19, 20, 31], but they do not focus on how data can be transformed. In this survey selected transformation tools are evaluated. Due to the fact that several databases come along with several transformation tools, the most popular databases are the basis. In the following passages it is described how the most popular databases were determined. There is no ranking for NewSQL databases, thus the most common NewSQL databases were identified by literature research - ClustrixDB, CockroachDB, Drizzle, GridGain IMDB, MemSQL (now SingleStore), NuoDB, Spanner, SQLFire, TransLattice Elastic Database, VoltDB [31, 20, 33, 22]. Because some of these databases cannot be found in the DB-Engines Ranking, the databases Google Scholar and Web of Science were used to identify their popularity. Web of Science is one of the "most trusted publisher-independent global citation databases" [10]. Google Scholar provides scholarly literature containing articles, theses, books, abstracts and court opinions [36]. The databases with the most search results were CockroachDB, MemSQL/SingleStore, VoltDB, NuoDB and Spanner - even though the numbers were not that high. The objective of this article is to give an overview of the most important transformation tools in the context of NewSQL databases. Because there is no such ranking yet, three different approaches were used to identify them. A search was made for scientific articles about big data or transformation tools. Furthermore, Gartner's Magic Quadrant for data integration was used. For each type of technology there is a separate quadrant. The quadrant contains the most popular players divided into four groups - leaders, visionaries, niche players and challengers. Leaders have a leading position in the market and also a good position for the future. Visionaries have a vision for the upcoming future but do not have a good position yet. Niche players are not that popular in the market because of a small target group or missing innovation and focus. Challengers are doing well at the moment but they are not prepared for the future [15]. As a third approach the recommendations of the database developer websites are considered. Generally, third-party components or internal developments are not mentioned. Coding interfaces like SQL are also not considered.
Databases. In the following passages the chosen databases are shortly introduced; their transformation tools are part of the subsequent analysis.
CockroachDB offers distributed SQL and is strongly oriented towards Google's database Spanner. In its basic version it is free of charge. Hosting in AWS or Google Cloud is possible [34]. Following the research methodology, no transformation tool was found which has a standard connector for CockroachDB. Therefore the database is not evaluated in the following parts.
NuoDB is a distributed SQL database which can be used in several environments such as Google Cloud, OpenShift, AWS, Azure, Docker, Kubernetes, Linux or VMware [41]. Pentaho offers a connector to NuoDB [3].
SingleStore is a distributed SQL system which can be hosted at different cloud vendors like AWS, Google Cloud, Microsoft Azure and Red Hat OpenShift or in a private cloud. It supports different types of data like relational, JSON, geospatial, key-value and time series [45]. The researched transformation tools are HVR, Informatica and Talend [50].
Spanner is classified as a relational database, but in the context of NewSQL databases, and is provided by Google [24]. Spanner is supported by Informatica and Safe Software [50, 24].
VoltDB is an in-memory database written in Java and C++. It aims to bring data storage in the form of a database and streaming together. Furthermore it can be used for OLTP and OLAP applications [49]. Within the scope of this research, several analysis and self-service reporting tools were found but no transformation tool [49]. It is not considered further.
Transformation tools. In the next sections the chosen transformation tools are shortly presented.
HVR is a data integration software. It runs in a separate distributed environment and uses log-based change data capture (CDC) technology. There are over 1500 deployments and clients in over 30 countries [23].
Informatica provides several data integration products for cloud as well as for on-premise solutions. These products include solutions for advanced data transformation, B2B data exchange, ultra messaging and an enterprise data catalog [26]. The focus of the following evaluation will be on Informatica Power Center.
Pentaho is a data integration tool by Hitachi Vantara. There are two versions of it - Pentaho Community Project and Pentaho Enterprise Edition. With the free version a codeless pipeline development is possible. All other features like load balancing, scheduling, streaming, Spark or the data science tool kit require the enterprise version [37].
Safe Software offers the tool Feature Manipulation Engine (FME), a data integration tool for all industrial sectors like local government, energy, transportation, utilities, education, natural resources, AEC and commercial [25].
Talend offers not only a tool for ETL processes but also for data integrity, governance and application and API integration. In the scope of data integration it offers a pipeline designer, data inventory, data preparation and Stitch [47].
Table 1 gives a short overview of which transformation tool can be used with which database. Even if a transformation tool was not mentioned in the context of a specific database, it may still support that database; this can be seen in the overview.
Table 1. Combination possibilities of NewSQL databases and transformation tools

              HVR   Informatica   Pentaho   Safe Software   Talend
NuoDB         no    no            yes       no              no
SingleStore   yes   yes           no        yes             yes
Spanner       no    yes           no        yes             no
3.2 Evaluation criteria
The presented transformation tools will be reviewed against various criteria. These criteria are explained in the following subsections.
Consistency. NewSQL databases are oriented towards RDBMS and offer ACID transactions. For higher availability and fault tolerance the standard properties can be changed, which may result in stale data [45, 41, 24].
Scalability. The scalability of a system is relevant if the amount of data in a data flow grows or shrinks significantly. Whether a system is scalable or not depends on the underlying database and not on the transformation tool. The evaluation shows what kind of scalability the databases support. The following kinds of scaling are of note:
Partitioning There are two ways of partitioning - horizontal and vertical. Horizontal partitioning means that data sets of one entity are stored in different partitions on different nodes. Vertical partitioning describes the storage of one or more columns in one partition; this way of partitioning is very common for column-oriented data stores. Horizontal partitioning splits up into the two most common techniques: range partitioning and consistent hashing. Range partitioning creates the partitions on a partition key. Each partition on each server has a defined range of the partition key. When querying data using the partition key, the query is very fast because only a few partitions are needed. On the other side there can be data hot spots and load-balancing issues when loading data of only one partition. Furthermore, a routing server is needed. Consistent hashing builds the partitions based on the hash of each data set. The hash values are approximately uniformly distributed. The advantage is that the partitions have nearly the same size and there is no need for restructuring the partition borders. Even if a new partition is added, the movement of data is easy because only the partition next to the new one has to be split. Queries, however, can be slow when an access to each partition is needed [19].
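The consistent-hashing scheme described above can be sketched as follows (the node names and the use of MD5 as the hash function are arbitrary choices for illustration):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Place nodes on a hash ring; a key maps to the first node at or
    after the key's hash position, wrapping around at the end."""

    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # find the next node clockwise on the ring
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("customer:42")  # always the same node for the same key
```

Adding a new node only remaps the keys that fall into the ring segment in front of it; all other keys keep their owner, which is exactly the restructuring advantage mentioned above.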
Replication Replication of data improves not only read and write operation performance but also system reliability, fault tolerance, and durability. In the following, a distinction is made between master-slave and multi-master replication. In the master-slave model, one node always represents the master and the other nodes are so-called slaves. All writes are committed at the master node; afterwards they are replicated to the slaves. The master node is consistent. All nodes can be used for read requests to increase performance, but by reading from a slave node the data can be stale. In contrast, with multi-master replication all nodes are masters and can receive write operations. The replication happens between all nodes [19]. There is no single leading node which is consistent.
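The master-slave model and the possibility of stale reads can be illustrated with a minimal Python sketch (the class names are made up, and asynchronous replication is modelled as a single explicit step):

```python
class MasterNode:
    """Master: receives all writes and pushes them to the slaves."""

    def __init__(self):
        self.data = {}
        self.slaves = []

    def write(self, key, value):
        self.data[key] = value          # committed at the master

    def replicate(self):
        for slave in self.slaves:       # asynchronous push, one step here
            slave.data = dict(self.data)

class SlaveNode:
    """Slave: serves reads, which may be stale until replication has run."""

    def __init__(self, master):
        self.data = {}
        master.slaves.append(self)

    def read(self, key):
        return self.data.get(key)

master = MasterNode()
slave = SlaveNode(master)
master.write("x", 1)
stale = slave.read("x")    # None: the write has not been replicated yet
master.replicate()
fresh = slave.read("x")    # now the replicated value
```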
Licence and costs. Many of the new databases are open source [35], but there is also proprietary software which goes along with licence costs, such as Google Spanner. The analysis mentions not only the kind of licence and the concomitant costs of the transformation tool but also those of the underlying database. This information helps to calculate the total resources needed for the development of a use case.
Transformation types. A data flow contains one or more data sources with data, one or more transformation steps and at least one data target for the data output [5]. In the ELT approach, data source and data target are the same database instance. The focus is on the transformation tool, thus the data source and target will not be considered. There are many common transformation types. To implement known transformation types in the environment of NewSQL databases, the transformation tool has to provide at least these common functionalities:
- Additional columns/Expression [52, 7, 6]
- Aggregation [46, 6, 13, 52]
- Conversion [46, 6, 13]
- Data cleansing [46, 7]
- Filter [46, 6, 13, 52]
- Join/Lookup [46, 6, 13, 52]
- Merge/Union [46, 6, 13]
- Sequence/Surrogate key [13, 52]
- Update [52]
It is measured how many of these transformation types are supported by each tool. Special transformation types, if any, are highlighted as well.
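Several of the listed transformation types - filter, join/lookup, additional column/expression and aggregation - can be illustrated with a small plain-Python sketch (the order data and the 19% tax rate are invented for the example):

```python
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 5.0},
    {"order_id": 2, "customer_id": 10, "amount": 15.0},
    {"order_id": 3, "customer_id": 20, "amount": 7.5},
    {"order_id": 4, "customer_id": 30, "amount": 3.0},
]
regions = {10: "EU", 20: "US", 30: "EU"}

# Filter: keep orders above a threshold
filtered = [o for o in orders if o["amount"] > 4.0]

# Join/Lookup: enrich each order with the customer's region
joined = [{**o, "region": regions[o["customer_id"]]} for o in filtered]

# Additional column/Expression: derive a gross amount (hypothetical 19% tax)
for o in joined:
    o["gross"] = round(o["amount"] * 1.19, 2)

# Aggregation: total amount per region
totals = {}
for o in joined:
    totals[o["region"]] = totals.get(o["region"], 0.0) + o["amount"]
```

In a transformation tool these steps would be configured as standard components rather than written by hand; the sketch only shows what each type does to the data.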
Standard components. The transformation types listed before can often be implemented either in user-defined code like a query language or with a user interface. With such an interface data flows can be built, and in a transform editor the result can be seen in real time [5]. There are several benefits of using low-code development platforms: no deep technological knowledge is needed and the development is faster, both of which decrease the price of a new development. Also complexity is reduced and maintenance is easier [43, 42]. It is measured how many of the transformation types mentioned before can be implemented via the interface of the transformation tool as a standard component. The advantages of standard components in the case of transformation tools are that the development of new transformations and the data preparation itself is much easier and can be done without deep technological and development knowledge. The time needed for new development decreases as well as the implementation costs. Furthermore, the maintenance of transformation processes is not only easier but can also be done by users without a technical or development background. This is especially important in case of limited (developer) resources.
Usability, popularity and stability. ISO 9241 was published in 1999, containing seventeen parts of requirements for office work with visual display terminals. The requirements range from menu dialogues and colours to usability [28]. Part eleven defines the term usability: the extent to which a software can be used by specified users to fulfill given tasks with effectiveness, efficiency and satisfaction in a specified context of use [27]. Based on other use cases, the conclusion can be made that there is a correlation between usability and popularity [44]. To measure the popularity of each tool, the following metrics, based on the DB-Engines Ranking, are used [18]:
- Number of search results in Google and Bing
- Frequency of searches by means of Google Analytics
- Amount of questions on Stack Overflow
- Number of job advertisements on Indeed and Simply Hired
- Sum of LinkedIn accounts with a reference in the profile
- Total amount of tweets
- Number of search results in Web of Science and Google Scholar
When choosing a transformation tool, the popularity is an indication of how common the usage of the tool is and how big the community is. Depending on the project this can be very relevant. It is not only important how popular a tool is but also how long it has been on the market. To put the popularity into perspective, the year of launch is mentioned. This is also an indicator of how stable the software is: the stability of a software which has been available for years can be ranked higher than that of a new one, whose stability is still vague.
Performance. Performance concerning time, hardware resources, capacity and modes is also an important factor for data integration processes. These factors can be measured as [48]:
- Time efficiency: degree of low response and processing times
- Resource utilization: resources utilized by the process, differentiated between the type of resource and the quantity
- Capacity: demand that can be executed without missing the other requirements
- Modes: modes of data processing (batch, real time etc.)
Within the scope of this article the performance of each tool is not measured. Where available, references to other studies or benchmarks are given.
Competitive advantages. Any competitive advantages will also be shown in the evaluation. These can be features outside the context of transformation and the underlying database itself. Depending on the use case these features can be helpful for the implementation. It has to be decided whether such a feature is useful in a project and whether the additional value is high enough to choose one transformation tool over another.
4 Analysis
This section gives an overview of the databases in the context of the presented criteria that are important for transformation tools, followed by the criteria which are specific to transformation tools. A summary of the criteria on database level can be found in table 2. The consistency is also specified there; it is not analysed in detail because all of the NewSQL databases offer ACID transactions. The properties of the transformation tools are shortly summarized in table 3.
Scalability. SingleStore creates replicas if requested at table creation or if the cluster is in high-availability or redundancy-2 mode. If replication is necessary, master-slave replication is used. By default the synchronous mode is chosen, which means that all replicas of a table are locked until a write transaction is finished; this is important to guarantee consistency. Alternatively an asynchronous mode can be chosen, in which the replicas are not locked [45]. Google Spanner does not have a partition model, but it divides data depending on the primary key into different separated ranges. The ranges are set automatically for balancing. Google Spanner's replication also has a master (Google calls it a leader) node and several slave nodes. The implementation is based on the Paxos protocol, which is another way of defining the agreed value [24]. A different way of partitioning is used by NuoDB. The storage is maintained by storage managers. Each storage manager has access to different storage groups; the connection between storage managers and storage groups is a many-to-many relationship. A storage group is a physical storage and contains a number of partitions. A partition is always stored in one storage group. Replication is made by NuoDB automatically in an asynchronous multi-master replication [41].
Table 2. Database properties

              Consistency   Scalability                       Licence and costs
NuoDB         ACID          Storage manager, multi-master     Open source (Proprietary)
SingleStore   ACID          Master-slave replication          Proprietary
Spanner       ACID          Automated range partitioning,     Proprietary
                            Paxos protocol replication
Licence and costs. Google is a cloud vendor and offers its database Spanner based on cloud technology. There are no fixed costs; instead they depend on the usage of the database. The units of measurement for the costs are reading and writing operations, storage, back-up, replication, in-memory cache, streams and data transfer. This price model correlates with the usage [24]. In this analysis the use cases are not further specified, hence a fixed price cannot be assumed. SingleStore has a similar price model: the price depends on the cloud, region and resources chosen [45]. NuoDB's community version is free but limited and only usable for low storage volumes. The enterprise version goes along with a licence to pay for [41]. HVR is proprietary software, but the pricing is not revealed. In advance a three-hour trial can be done [23]. Informatica also does not have a public price listing; it is proprietary software [26]. Pentaho likewise has two versions - the open source Pentaho Community Project and the proprietary Pentaho Enterprise Edition [37]. Talend has a basic open source software called Talend Open Source. For the extended version there is a licence per user, e.g. for using cloud integration [47]. Safe Software has a more special pricing concept. There are free licenses for students, graduates, research, non-profit organizations, teaching and trials. The license fee has not been raised since 2002. The license depends on the environment, like on premise or cloud, and on the kind of organization [25].
Transformation types. In HVR there are so-called actions for data transformation. In the Restrict action data can be filtered and joined, and duplicates can be removed. Within the ColumnProperties action the data type can be changed and a surrogate key can be used. The Integrate action can also merge data in the target. A derived column can be created via code in the ColumnProperties action. An aggregation cannot be made out of the box [23]. With Power Center and Data Quality - part of Informatica's product range - all the relevant transformation types are supported [26]. In Pentaho the transformation types are called Data Integration Steps; they provide all necessary transformations [37]. Safe Software's operators also support all defined transformation types [25]. In Talend there are many components which can be used in the transformation process. Code can also be used, e.g. for querying the database; the derivation of columns or expressions has to be made in code. Finally, all transformations can be implemented [47].
Standard components. Actions in HVR are based on labels and text boxes. There are predefined parameters in the Restrict action for filtering, lookup and removing duplicates, called "CaptureCondition", "HorizLookupTable" and "SelectDistinct". With the parameters "Datatype" and "SurrogateKey" in the "ColumnProperties" action, data conversions can be made and a surrogate key defined. The property "Resilient" in the "Integrate" action allows a merge in the target database; it can also be used for updates [23]. In Informatica Power Center there are several transformation types which can be added to the graphical data flow, like "aggregator", "filter", "sorter", "expression", "joiner", "lookup", "union", "sequence generator" and "update". With the "sorter" transformation duplicates can also be removed [26]. In Pentaho there are different Data Integration Steps which can be used for the implementation of the different transformation types, such as "group by" for aggregations, "filter rows", "unique rows", "select values", "concat fields", "merge join", "append streams", "add sequence" and "combination lookup/update" [37]. In Safe Software's FME the transformation types are called transformers. The most important ones are "Aggregator", "AttributeFilter", "DuplicateFilter", "BaseConverter", "AttributeCreator", "StringConcatenator", "DatabaseJoiner", "FeatureMerger", "Counter" and "DatabaseUpdater" [25]. Talend has several components in its Talend Open Studio and expanded modules for transformations. The most necessary are "tAggregateRow", "tFilterRow", "tUniqRow", "tConvertType", "tMap" and "tUnite". There are database-specific components for update and merge in the target [47]. In many cases there is more than one standard component available for a certain transformation type; however, only the most popular one is shown in this article.
Usability, popularity and stability. Analysis of the transformation tools Pentaho and Talend shows that they are easy to use [8, 32]. However, there is no comparison covering all of the transformation tools in focus. To classify the popularity of a tool the measures were related to each other. Informatica has the greatest popularity; the popularity of Talend is almost 40% lower, followed by Pentaho. The tools with the lowest popularity are HVR and Safe Software. After first explorations, HVR was deployed in 2012 [23]. Informatica, as a software company, has been on the market since 1993 [26]. Pentaho was developed in 2004; in 2015 the software was bought by Hitachi Vantara [37]. Safe Software began its company history in 1993, helping forestry companies [25]. Talend was founded in 2005 and has enlarged its software solution since then [47].
Performance. HVR does not support complex ETL processes very well [51]. Talend improves its performance by supporting parallel processing [47, 30]. Because no study compares the performance of all tools, the tools cannot be ranked by performance; moreover, performance information was not found for every transformation tool.
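The parallel processing mentioned for Talend typically follows a partition-and-apply pattern: the row stream is split into chunks and each chunk is transformed concurrently. A minimal generic sketch of that pattern follows; the function names and chunk size are illustrative, not Talend's actual API.

```python
# Partition-and-parallelize sketch: split the input into chunks and
# transform the chunks concurrently. All names here are illustrative.
from concurrent.futures import ThreadPoolExecutor

def transform_chunk(chunk):
    # Example transformation: type conversion plus a derived field.
    return [{"temp_c": float(t), "temp_f": float(t) * 9 / 5 + 32} for t in chunk]

def parallel_transform(values, workers=4, chunk_size=2):
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_chunk, chunks)  # map preserves chunk order
    return [row for chunk in results for row in chunk]

rows = parallel_transform(["20.0", "21.5", "25.0"])
```

Because `map` preserves order, the output rows come back in the same order as the input even though the chunks are processed concurrently.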
Competitive advantages. HVR supports many different topologies such as uni-directional, bi-directional, multi-directional, broadcast, integration and cascading [23]. Informatica provides several products which support the data integration process, for example through ultra messaging, building a data hub, or data exchange between companies [26]. With the Lumada data services, Hitachi Vantara offers a wide product range for analytics, governance and ops agility, containing multi-cloud solutions such as a data catalog, a data optimizer for Hadoop, and a data lake [37]. Talend has a strong focus on data integration processes, which results in over 900 free components and connectors that can be used [47].
Table 3. Transformation tool properties

Tool          | Licence and costs       | Transformation types | Standard components | Popularity   | Founding year | Competitive advantages
HVR           | Proprietary             | 8/9                  | 7/9                 | Not popular  | 2012          | Supporting many topologies
Informatica   | Proprietary             | 9/9                  | 8/9                 | Popular      | 1993          | Continuative data integration products
Pentaho       | Open source/Proprietary | 9/9                  | 9/9                 | Less popular | 2004          | Lumada data services
Safe Software | Proprietary             | 9/9                  | 9/9                 | Not popular  | 1993          | -
Talend        | Open source/Proprietary | 9/9                  | 8/9                 | Popular      | 2005          | Over 900 components & connectors
5 Conclusion and future research
Because NoSQL databases cannot satisfy all requirements an application has towards a database, NewSQL databases are becoming more popular. This analysis gives an overview of the most popular transformation tools on the market for NewSQL databases, including their advantages and disadvantages. In fact, there
are not that many tools that support NewSQL databases, and some popular NewSQL databases are not supported by any tool at all, which shows how restricted the selection is. In those cases custom code or a user-specific connection has to be used. The tools that do provide connectors for NewSQL databases are not new on the market; rather, they are existing ETL tools that can now also be used for transformations in NewSQL databases. In this study, fundamental concepts regarding the emergence of NewSQL databases and their properties were presented. Using a self-developed selection method, the most popular NewSQL databases and transformation tools were chosen. The most important properties of a transformation tool were introduced and discussed for each database and tool. The overview tables show the advantages and disadvantages quickly and clearly. The analysis is thus not only a market overview but also a decision guide for the implementation of specific use cases. It shows that the choice of a transformation tool always includes the choice of a database system. The research gives an overview of different transformation tools in the context of NewSQL databases. There are more NewSQL databases on the market which are not mentioned here because of their lower popularity; popularity changes over time, and depending on the use case other databases could fit better. As a continuation, specific use cases can be implemented with the several tools and then be measured and analyzed. Within the scope of this study, not every property could be analyzed in detail, so further studies are necessary. The performance of standard components has to be analyzed in relation to the performance of custom code, and the usability of each tool has to be measured with a representative group of people. Besides that, it is partly possible to connect a database with a transformation tool without a standard connector.
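As an illustration of the last point: several NewSQL systems expose an existing wire protocol, so a generic driver can stand in for a missing dedicated connector. CockroachDB, for example, speaks the PostgreSQL wire protocol (default port 26257), so a standard PostgreSQL driver such as psycopg2 can be used. The sketch below assumes psycopg2 is installed; the host, credentials, table and column names are placeholders, not real endpoints.

```python
# Sketch: loading transformed rows into a NewSQL database without a
# dedicated connector, via a generic PostgreSQL driver. All connection
# details below are placeholders (assumptions, not real endpoints).

def build_dsn(host: str, port: int, database: str, user: str) -> str:
    """Assemble a libpq-style connection string for a PostgreSQL-compatible target."""
    return f"host={host} port={port} dbname={database} user={user} sslmode=require"

# UPSERT is CockroachDB's merge-style insert-or-update statement.
MERGE_SQL = "UPSERT INTO weather (station_id, day, temp_c) VALUES (%s, %s, %s)"

def load_rows(rows):
    import psycopg2  # imported lazily; assumed installed when actually loading
    with psycopg2.connect(build_dsn("localhost", 26257, "demo", "etl")) as conn:
        with conn.cursor() as cur:
            cur.executemany(MERGE_SQL, rows)
```

The same pattern applies to any NewSQL database that is compatible with an established protocol; the trade-off is that database-specific features may not be reachable through the generic driver.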
Acknowledgments. This research was made possible by funding from the ICT-
AGRI-FOOD 2020 Joint Call. This work was supported by a grant of the Roma-
nian National Authority for Scientific Research and Innovation, CCCDI - UE-
FISCDI, project number COFUND-ICT-AGRI-FOOD-GOHYDRO 200/2020,
within PNCDI III.
References
1. Aslett, M.: NoSQL, NewSQL and beyond: The drivers and use cases for database alternatives (2011)
2. Aslett, M.: How will the database incumbents respond to nosql and newsql (2011)
3. Atriwal, L., Nagar, P., Tayal, S., Gupta, V.: Business intelligence tools for big data.
Journal of Basic and Applied Engineering Research 3(6), 505–509 (2016)
4. Authors, V.: The vitess documentation, https://vitess.io/docs/
5. Belyy, A., Xu, F., Herdan, T., He, M., Syed, A., Cao, W., Yee, M.: Dataset previews
for etl transforms (May 28 2013), US Patent 8,452,723
6. Bergamaschi, S., Guerra, F., Orsini, M., Sartori, C., Vincini, M.: A semantic ap-
proach to etl technologies. Data & Knowledge Engineering 70(8), 717–731 (2011)
7. Bhide, M.A., Bonagiri, K.K., Mittapalli, S.K.: Column based data transfer in ex-
tract transform and load (etl) systems (Aug 20 2013), US Patent 8,515,898
8. Biplob, M.B., Sheraji, G.A., Khan, S.I.: Comparison of different extraction trans-
formation and loading tools for data warehousing. In: 2018 International Confer-
ence on Innovations in Science, Engineering and Technology (ICISET). pp. 262–
267. IEEE (2018)
9. Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I.: Cloud computing
and emerging it platforms: Vision, hype, and reality for delivering computing as
the 5th utility. Future Generation computer systems 25(6), 599–616 (2009)
10. Clarivate: Web of science, https://clarivate.com/webofsciencegroup/
solutions/web-of-science/
11. Data, C.: Citus documentation, https://docs.citusdata.com/
12. Davenport, R.J.: Etl vs elt: A subjective view. Insource Commercial aspects of BI
whitepaper (2008)
13. El Akkaoui, Z., Zimányi, E., Mazón, J.N., Trujillo, J.: A model-driven framework
for etl process development. In: DOLAP ’11: Proceedings of the ACM 14th inter-
national workshop on Data Warehousing and OLAP. pp. 45–52 (10 2011)
14. Fatima, H., Wasnik, K.: Comparison of sql, nosql and newsql databases for internet
of things. In: 2016 IEEE Bombay Section Symposium (IBSS). pp. 1–6. IEEE (2016)
15. Gartner, I.: Magic quadrant research methodology (2020), https://www.gartner.
com/en/research/methodologies/magic-quadrants-research
16. Geelan, J., et al.: Twenty-one experts define cloud computing. Cloud Computing
Journal 4, 1–5 (2009)
17. Gilbert, S., Lynch, N.: Brewer’s conjecture and the feasibility of consistent, avail-
able, partition-tolerant web services. Acm Sigact News 33(2), 51–59 (2002)
18. solidIT consulting & software development GmbH: Db-engines (2020), https://
db-engines.com/
19. Grolinger, K., Higashino, W.A., Tiwari, A., Capretz, M.A.: Data management in
cloud environments: Nosql and newsql data stores. Journal of Cloud Computing:
advances, systems and applications 2(1), 22 (2013)
20. Gurevich, Y.: Comparative survey of nosql/newsql db systems. Department of
Mathematics and Computer Science, The Open University of Israel (2015)
21. Haerder, T., Reuter, A.: Principles of transaction-oriented database recovery. ACM
computing surveys (CSUR) 15(4), 287–317 (1983)
22. Hajoui, O., Dehbi, R., Talea, M., Batouta, Z.I.: An advanced comparative study
of the most promising nosql and newsql databases with a multi-criteria analysis
method. Journal of Theoretical and Applied Information Technology 81(3), 579
(2015)
23. HVR Software, I.: Enterprise data integration software — hvr (2020), https://
www.hvr-software.com/
24. Inc., G.: Cloud spanner — google cloud (2020), https://cloud.google.com/
spanner
25. Inc., S.S.: Safe software — fme — data integration platform (2020), https://www.
safe.com/
26. Informatica: Enterprise cloud data management — informatica deutschland (2020),
https://www.informatica.com/de/
27. ISO: Ergonomische Anforderungen für Bürotätigkeiten mit Bildschirmgeräten – Teil 11: Anforderungen an die Gebrauchstauglichkeit – Leitsätze. Beuth Verlag, Berlin (1999)
28. ISO: Ergonomische Anforderungen für Bürotätigkeiten mit Bildschirmgeräten – Teil 1: Allgemeine Einführung (ISO 9241-1:1997) (enthält Änderung AMD 1:2001); Deutsche Fassung EN ISO 9241-1:1997 + A1:2001. Beuth Verlag, Berlin (2002)
29. Jing Han, Haihong E, Guan Le, Jian Du: Survey on nosql database. In: 2011 6th
International Conference on Pervasive Computing and Applications. pp. 363–366
(2011)
30. Katragadda, R., Tirumala, S.S., Nandigam, D.: Etl tools for data warehousing:
an empirical study of open source talend studio versus microsoft ssis. Computing
Conference Papers [147] (2015)
31. Kaur, K., Sachdeva, M.: Performance evaluation of newsql databases. In: 2017
International Conference on Inventive Systems and Control (ICISC). pp. 1–5. IEEE
(2017)
32. Kherdekar, V.A., Metkewar, P.S.: A technical comprehensive survey of etl tools.
Advanced Engineering Research and Applications p. 20 (2016)
33. Kumar, R., Gupta, N., Maharwal, H., Charu, S., Yadav, K.: Critical analysis of
database management using newsql. International Journal of Computer Science
and Mobile Computing 3, 434–438 (2014)
34. Labs, C.: Cockroach labs, the company building cockroachdb (2020), https://
cockroachlabs.com/
35. Leavitt, N.: Will nosql databases live up to their promise? Computer 43(2), 12–14
(2010)
36. LLC, G.: About google scholar, https://scholar.google.com/intl/de/scholar/
about.html
37. LLC, H.V.: Pentaho enterprise edition — hitachi vantara (2020), https://www.
hitachivantara.com/
38. Mehra, K.K., Kumar, P., Choudhury, P., Lakkur, A.L., Samanta, S.: Extract, trans-
form and load (etl) system and method (Apr 25 2017), US Patent 9,633,095
39. Mell, P., Grance, T., et al.: The nist definition of cloud computing. NIST special
publication 800-145 (2011)
40. Moniruzzaman, A.B.M.: Newsql: Towards next-generation scalable rdbms for on-
line transaction processing (oltp) for big data management. International Journal
of Database Theory and Application 7 (11 2014)
41. NuoDB: Nuodb home — nuodb (2020), https://nuodb.com/
42. OutSystems: The state of application development: Is it ready for disruption? Tech.
rep., Department of Computer Science, Michigan State University, OutSystems:
Boston, MA, USA (September 2019)
43. Richardson, C., Rymer, J.R.: New development platforms emerge for customer-
facing applications. Forrester: Cambridge, MA, USA (2014)
44. Scowen, G., Regenbrecht, H.: Increased popularity through compliance with usabil-
ity guidelines in e-learning web sites. International Journal of Information Tech-
nology and Web Engineering (IJITWE) 4(3), 38–57 (2009)
45. SingleStore, I.: Singlestore is the database of now, powering modern applications and analytical systems (2020), https://www.singlestore.com/
46. Song, X., Yan, X., Yang, L.: Design etl metamodel based on uml profile. In: 2009
Second International Symposium on Knowledge Acquisition and Modeling. vol. 3,
pp. 69–72. IEEE (2009)
47. Talend: Talend - a cloud data integration leader (modern etl) (2020), https://
www.talend.com/
48. Theodorou, V., Abelló, A., Lehner, W.: Quality measures for etl processes. In:
International Conference on Data Warehousing and Knowledge Discovery. pp. 9–
22. Springer (2014)
49. VoltDB, I.: Home - voltdb (2020), https://www.voltdb.com/
50. Zaidi, E., Thoo, E., Heudecker, N., Menon, S., Thanaraj, R.: Gartner magic quad-
rant for data integration tools. Gartner Group (2020)
51. Zaidi, E., Thoo, E., Heudecker, N., Menon, S., Thanaraj, R.: Magic quadrant for
data integration tools, 2020 (2020)
52. Zamanian, K., Nesamoney, D.: Apparatus and method for performing data trans-
formations in data warehousing (Jan 15 2002), US Patent 6,339,775
53. Zhang, Q., Cheng, L., Boutaba, R.: Cloud computing: state-of-the-art and research
challenges. Journal of internet services and applications 1(1), 7–18 (2010)