Analysis of transformation tools applicable on
NewSQL databases
Sarah Myriam Lydia Hahn, Ionela Chereja, and Oliviu Matei
Computers and Information Technology, Technical University Cluj Napoca,
Cluj-Napoca, Romania
sarah-hahn@gmx.net,ionela.chereja@yahoo.com,oliviu.matei@holisun.com
Abstract. With the introduction of NoSQL databases, new opportunities in the field of cloud technology, such as scalability and performance, were utilized. However, NoSQL databases could not satisfy all requirements of some applications, such as consistency and ACID transactions. For this reason NewSQL databases were developed - a mixture of relational databases and NoSQL databases with the aim of getting the advantages of both. Besides presenting the most known NewSQL databases and the requirements of transformation tools, the most popular transformation tools in the field of NewSQL databases are introduced. An analysis shows how they comply with the requirements.
Keywords: NewSQL · Transformation · ETL · ELT · Cloud technology · ACID model · CAP theorem · Scalability · Consistency
1 Introduction
Not only the trend of big data but also new possibilities in the range of cloud computing, like resource pooling and rapid elasticity [39], led to the development of various new database management systems. So-called NoSQL databases ("not only SQL") try to eliminate the disadvantages of RDBMS by leveraging cloud technology. They come along with fast I/O, support for big data, scalability and low cost [29]. Furthermore they are schema-free to handle a variety of data. In the beginning they did not have a structured query language. That is one of the reasons NewSQL databases came up. These databases aim to bring the advantages of both systems together - relational database management systems (RDBMS) and NoSQL [14]. Just as the mode of operation of the databases changed, the way of data load processing changed as well. Typical for RDBMS were Extract-Transform-Load (ETL) processes with the following steps [38]:
- Extract: First the necessary data is extracted from different data sources into a working area. The data itself is not adapted in any way.
- Transform: In the working area the data is transformed to make it usable in the target database.
- Load: In the last step the transformed data is loaded into the target database.
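The three steps above can be sketched in a few lines of Python (the function names and the record fields are illustrative assumptions, not part of any particular tool):

```python
def extract(rows):
    """Extract: copy raw records into a working area, unchanged."""
    return list(rows)

def transform(staged):
    """Transform: adapt the staged data for the target schema."""
    return [{"name": r["name"].strip().title(), "age": int(r["age"])}
            for r in staged]

def load(transformed, target):
    """Load: write the transformed records into the target store."""
    target.extend(transformed)

target_db = []
source = [{"name": "  ada lovelace ", "age": "36"}]
load(transform(extract(source)), target_db)
# target_db now holds the cleaned record
```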
With NoSQL and NewSQL databases, ETL processes changed to Extract-Load-Transform (ELT) processes. Data is first loaded as a whole into the DWH without any transformations. Whenever a new transformation is needed, the loading stage does not have to be changed; only the new transformation has to be set up [12]. Transformations can be precomputed or calculated in real time. The new database management systems (DBMS) and processes bring new requirements for transformation tools. This paper analyses the most popular transformation tools in the field of NewSQL databases. Until now there have been several analyses regarding NewSQL databases [19, 20, 31], but there is no analysis concerning transformation tools in the field of NewSQL databases. By analysing the most important properties, the advantages and disadvantages of each tool are shown, and a comparison in table form is given. All in all this should support the decision-making to find the best database system and transformation tool for each process. It also represents a market overview of the most popular tools in the context of NewSQL databases, which did not exist until now. The structure is the following: the next section "related work and explanations" provides concepts and further definitions as the basis for what follows. In the third chapter "methodology" the approach of the analysis itself is described. The "analysis" of the several transformation tools can be found in chapter four. Last but not least the "conclusion" of the analysis is presented.
2 Related work and explanations
This section explains related work and provides explanations for the following analysis. First the foundation of NewSQL databases - cloud computing - is introduced. Afterwards NewSQL databases and the concepts of ACID and CAP are illustrated.
2.1 Cloud computing
There are many different definitions for the term "cloud computing" [16]. In the following, the term is defined as a parallel and distributed system based on various virtualized computers provided by a cloud vendor [9]. The cloud vendors ensure the operation by providing the hardware, simple access and professional skills. It is based on a pay-per-use model [16]. According to the National Institute of Standards and Technology, cloud computing has the following characteristics [39]:
- On-demand self-service - a customer can access the services without human interaction with the service vendor
- Broad network access - thick and thin client applications have access to the services through a standard mechanism
- Resource pooling - the capability of the vendor can be served to multiple consumers
- Rapid elasticity - the provision of resources is automated, rapid and elastic
- Measured service - resource usage is monitored, controlled, and reported automatically for transparency
These characteristics have several advantages for the user, such as a lower up-front investment, lower operating costs, higher scalability and elasticity, easy access through the internet, and reduced business risks and maintenance expenses [53].
2.2 NewSQL
NewSQL databases were first mentioned in 2011 by Matthew Aslett [1]. This kind of database brings the advantages of RDBMS and NoSQL databases together. With the new kind of databases, online transaction processing (OLTP) workloads should be migrated to NewSQL as an alternative to NoSQL because of the latter's disadvantages. In contrast to NoSQL databases, it is scalable and has a high performance and availability. On the other side, like an RDBMS, it guarantees ACID transactions and supports (partial) SQL queries. SQL is also the main language for applications using NewSQL. A non-locking concurrency control mechanism supports several read operations and writes without any locks [31, 2]. Depending on the NewSQL database, it uses known features from NoSQL databases such as a column-oriented data model, distributed nodes or in-memory processing, with the objective of processing big data in the field of OLTP applications [40]. Roughly, NewSQL databases can be divided into those using a data sharding layer and those using distributed storage. A data sharding layer is used when there are multiple independent instances of an RDBMS [11, 4]. For each database it has to be checked in detail how it works between the different instances, especially concerning ACID transactions and availability. In contrast, distributed storage has one single logical RDBMS. There is only one API endpoint which is called. The data itself is distributed on several nodes [33].
2.3 ACID model
ACID defines the four characteristics atomicity, consistency, isolation and durability. If a database has these properties, requests are also called transactions. Atomicity means that each transaction is either completely successful or fails as a whole. If there is a power failure or any other error during the execution of a transaction, the whole transaction will fail and the data will be restored to its state before the transaction. The second property, consistency, says that the database must always be in a valid state. Data sets are always consistent concerning constraints, cascades and triggers. Isolation means that each transaction is viewed separately: even if transactions are executed in parallel, the result is the same as if they were executed sequentially. To guarantee isolation, tables can be locked if there are concurrent transactions. Durability ensures that once a transaction was successful, it remains so even after a failure such as a power outage [21]. This can be implemented with a transaction log on a hard disk.
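Atomicity can be illustrated with a small Python sketch using SQLite, whose transactions follow the ACID model (the table and account names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        # simulated crash before the matching credit to bob:
        raise RuntimeError("power failure mid-transaction")
except RuntimeError:
    pass

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
# atomicity: the debit was rolled back, so alice's balance is unchanged
```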
2.4 CAP theorem
The CAP theorem is a theorem about distributed systems and their characteristics. It was published in 2002 and is explained next [17]. It focuses on the three characteristics consistency, availability, and partition tolerance. Consistency means that all nodes in a distributed system see the same data at the same time. When data is inserted, changed or deleted, all nodes must have the same data up to the last write operation. A read operation has the same result independently of the node which is queried. This definition of consistency is not the same as consistency in the ACID model, which describes consistency within a data set. Availability describes the uptime of the system, which must be 100%; consequently the system is always accessible. Every request at any time gets a response within an acceptable response time, and the response must not be an error. Availability does not require that the response contains the correct or latest value. The third characteristic is partition tolerance. It is characterized by a high fault tolerance independent of loss of connection between nodes, dropped messages or other events. The theorem states that a database cannot satisfy all three of these characteristics; at most a combination of two is possible.
3 Methodology
3.1 Databases and transformation tools
There are many surveys which compare the different existing NewSQL databases and their properties [19, 20, 31], but they do not focus on how data can be transformed. In this survey selected transformation tools are evaluated. Due to the fact that several databases come along with several transformation tools, the most popular databases are the basis. In the following passages it is described how the most popular databases were determined. There is no ranking for NewSQL databases, thus the most common NewSQL databases were identified by literature research - ClustrixDB, CockroachDB, Drizzle, GridGain IMDB, MemSQL (now SingleStore), NuoDB, Spanner, SQLFire, TransLattice Elastic Database, VoltDB [31, 20, 33, 22]. Because some of these databases cannot be found in the DB-Engines Ranking, the databases Google Scholar and Web of Science were used to identify their popularity. Web of Science is one of the "most trusted publisher-independent global citation databases" [10]. Google Scholar provides scholarly literature containing articles, theses, books, abstracts and court opinions [36]. The databases with the most search results were CockroachDB, MemSQL/SingleStore, VoltDB, NuoDB and Spanner - even though the numbers were not that high. The objective of this article is to give an overview of the most important transformation tools in the context of NewSQL databases. Because there is no such ranking yet, three different approaches were used to identify them. A search was made for scientific articles about big data or transformation tools. Furthermore, Gartner's Magic Quadrant for data integration was used. For each type of technology there is a separate quadrant. The quadrant contains the most popular players divided into four groups - leaders, visionaries, niche players and challengers. Leaders have a leading position in the market and also a good position for the future. Visionaries have a vision for the upcoming future but do not have a good position yet. Niche players are not that popular in the market because of a small target group or missing innovation and focus. Challengers are doing well at the moment but they are not prepared for the future [15]. As a third approach the recommendations of the database developer websites are considered. Generally, third-party components or internal developments are not mentioned. Coding interfaces like SQL are also not considered.
Databases. In the following passages the chosen databases are shortly introduced; their transformation tools are part of the subsequent analysis.
CockroachDB offers distributed SQL and is strongly oriented towards Google's database Spanner. In its basic version it is free of charge. Hosting in AWS or Google Cloud is possible [34]. Following the research methodology, no transformation tool was found which has a standard connector for CockroachDB. Therefore the database is not evaluated in the following parts.
NuoDB is a distributed SQL database which can be used in several environments such as Google Cloud, OpenShift, AWS, Azure, Docker, Kubernetes, Linux or VMware [41]. Pentaho offers a connector to NuoDB [3].
SingleStore is a distributed SQL system which can be hosted at different cloud vendors like AWS, Google Cloud, Microsoft Azure and Red Hat OpenShift or in a private cloud. It supports different types of data like relational, JSON, geospatial, key-value and time series [45]. The researched transformation tools are HVR, Informatica and Talend [50].
Spanner is classified as a relational database, but in the context of NewSQL databases, and is provided by Google [24]. Spanner is supported by Informatica and Safe Software [50, 24].
VoltDB is an in-memory database written in Java and C++. It aims to bring data storage in the form of a database and streaming together. Furthermore it can be used for OLTP and OLAP applications [49]. Within the scope of this research, several analysis and self-service reporting tools were found but no transformation tool [49]. It is not considered further.
Transformation tools. In the next sections the chosen transformation tools are shortly presented.
HVR is a data integration software. It runs in a separate distributed environment and uses log-based change data capture (CDC) technology. There are over 1500 deployments and clients in over 30 countries [23].
Informatica provides several data integration products for cloud as well as for on-premise solutions. These products include solutions for advanced data transformation, B2B data exchange, ultra messaging and an enterprise data catalog [26]. The focus of the following evaluation will be on Informatica Power Center.
Pentaho is a data integration tool by Hitachi Vantara. There are two versions of it - Pentaho Community Project and Pentaho Enterprise Edition. With the free version a codeless pipeline development is possible. All other features like load balancing, scheduling, streaming, Spark or the data science tool kit require the enterprise version [37].
Safe Software offers the tool Feature Manipulation Engine (FME), a data integration tool for all industrial sectors like local government, energy, transportation, utilities, education, natural resources, AEC and commercial [25].
Talend offers not only a tool for ETL processes but also for data integrity, governance and application and API integration. In the scope of data integration it offers a pipeline designer, data inventory, data preparation and Stitch [47].
Table 1 gives a short overview of which transformation tool can be used with which database. Even if a transformation tool was not mentioned in the context of a specific database, it may still support that database; this can be seen in the overview.
Table 1. Combination possibilities of NewSQL databases and transformation tools

              HVR   Informatica   Pentaho   Safe Software   Talend
NuoDB         no    no            yes       no              no
SingleStore   yes   yes           no        yes             yes
Spanner       no    yes           no        yes             no
3.2 Evaluation criteria
The presented transformation tools will be reviewed against various criteria. These criteria are explained in the following subsections.
Consistency. NewSQL databases are oriented towards RDBMS and offer ACID transactions. For higher availability and fault tolerance the standard properties can be changed, which may result in stale data [45, 41, 24].
Scalability. The scalability of a system is relevant if the amount of data in a data flow grows or shrinks significantly. Whether a system is scalable or not depends on the underlying database and not on the transformation tool. The evaluation shows what kind of scalability the databases support. The following kinds of scaling are of note:
Partitioning There are two ways of partitioning - horizontal and vertical. Horizontal partitioning means that data sets of one entity are stored in different partitions on different nodes. Vertical partitioning describes the storage of one or more columns in one partition; this way of partitioning is very common for column-oriented data stores. Horizontal partitioning splits up into the two most common techniques: range partitioning and consistent hashing. Range partitioning creates the partitions on a partition key. Each partition on each server has a defined range of the partition key. When querying data using the partition key, the query is very fast because only a few partitions are needed. On the other side there can be data hot spots and load-balancing issues when loading data of only one partition. Furthermore, a routing server is needed. Consistent hashing builds the partitions based on the hash of each data set. The hash values are approximately uniformly distributed. The advantage is that the partitions have nearly the same size and there is no need for restructuring the partition borders. Even if a new partition is added, the movement of data is easy because only the partition next to the new one has to be split. Queries, however, can be slow when an access to each partition is needed [19].
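The consistent-hashing scheme described above can be sketched as follows (the node names and the use of MD5 as the hash function are arbitrary choices for illustration):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Place nodes on a hash ring; a key maps to the first node at or
    after the key's hash position, wrapping around at the end."""

    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # find the next node clockwise on the ring
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("customer:42")  # always the same node for the same key
```

Adding a new node only remaps the keys that fall into the ring segment in front of it; all other keys keep their owner, which is exactly the restructuring advantage mentioned above.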
Replication Replication of data improves not only read and write operation performance but also system reliability, fault tolerance, and durability. In the following, a distinction is made between master-slave and multi-master replication. In the master-slave model, one node always represents the master and the other nodes are so-called slaves. All writes are committed at the master node; afterwards they are replicated to the slaves. The master node is consistent. All nodes can be used for read requests to increase performance, but by reading from a slave node the data can be stale. In contrast, with multi-master replication all nodes are masters and can receive write operations. The replication happens between all nodes [19]. There is no single leading node which is consistent.
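The master-slave model and the possibility of stale reads can be illustrated with a minimal Python sketch (the class names are made up, and asynchronous replication is modelled as a single explicit step):

```python
class MasterNode:
    """Master: receives all writes and pushes them to the slaves."""

    def __init__(self):
        self.data = {}
        self.slaves = []

    def write(self, key, value):
        self.data[key] = value          # committed at the master

    def replicate(self):
        for slave in self.slaves:       # asynchronous push, one step here
            slave.data = dict(self.data)

class SlaveNode:
    """Slave: serves reads, which may be stale until replication has run."""

    def __init__(self, master):
        self.data = {}
        master.slaves.append(self)

    def read(self, key):
        return self.data.get(key)

master = MasterNode()
slave = SlaveNode(master)
master.write("x", 1)
stale = slave.read("x")    # None: the write has not been replicated yet
master.replicate()
fresh = slave.read("x")    # now the replicated value
```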
Licence and costs. Many of the new databases are open source [35], but there is also proprietary software which goes along with licence costs, such as Google Spanner. The analysis mentions not only the kind of licence and the concomitant costs of the transformation tool but also those of the underlying database. This information helps to calculate the total resources needed for the development of a use case.
Transformation types. A data flow contains one or more data sources with data, one or more transformation steps and at least one data target for the data output [5]. In the ELT approach, data source and data target are the same database instance. The focus is on the transformation tool, thus the data source and target will not be considered. There are many common transformation types. To implement known transformation types in the environment of NewSQL databases, the transformation tool has to provide at least these common functionalities:
- Additional columns/Expression [52, 7, 6]
- Aggregation [46, 6, 13, 52]
- Conversion [46, 6, 13]
- Data cleansing [46, 7]
- Filter [46, 6, 13, 52]
- Join/Lookup [46, 6, 13, 52]
- Merge/Union [46, 6, 13]
- Sequence/Surrogate key [13, 52]
- Update [52]
It is measured how many of these transformation types are supported by each tool. Special transformation types, if any, are highlighted as well.
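Several of the listed transformation types - filter, join/lookup, additional column/expression and aggregation - can be illustrated with a small plain-Python sketch (the order data and the 19% tax rate are invented for the example):

```python
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 5.0},
    {"order_id": 2, "customer_id": 10, "amount": 15.0},
    {"order_id": 3, "customer_id": 20, "amount": 7.5},
    {"order_id": 4, "customer_id": 30, "amount": 3.0},
]
regions = {10: "EU", 20: "US", 30: "EU"}

# Filter: keep orders above a threshold
filtered = [o for o in orders if o["amount"] > 4.0]

# Join/Lookup: enrich each order with the customer's region
joined = [{**o, "region": regions[o["customer_id"]]} for o in filtered]

# Additional column/Expression: derive a gross amount (hypothetical 19% tax)
for o in joined:
    o["gross"] = round(o["amount"] * 1.19, 2)

# Aggregation: total amount per region
totals = {}
for o in joined:
    totals[o["region"]] = totals.get(o["region"], 0.0) + o["amount"]
```

In a transformation tool these steps would be configured as standard components rather than written by hand; the sketch only shows what each type does to the data.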
Standard components. The transformation types listed before can often be implemented either in user-defined code like a query language or with a user interface. With such an interface data flows can be built, and in a transform editor the result can be seen in real time [5]. There are several benefits of using low-code development platforms: no deep technological knowledge is needed and the development is faster, both of which decrease the price of a new development. Also complexity is reduced and maintenance is easier [43, 42]. It is measured how many of the transformation types mentioned before can be implemented via the interface of the transformation tool as a standard component. The advantages of standard components in the case of transformation tools are that the development of new transformations and the data preparation itself is much easier and can be done without deep technological and development knowledge. The time needed for new development decreases as well as the implementation costs. Furthermore, the maintenance of transformation processes is not only easier but can also be done by users without a technical or development background. This is especially important in case of limited (developer) resources.
Usability, popularity and stability. ISO 9241 was published in 1999, containing seventeen parts of requirements for office work with visual display terminals. The requirements range from menu dialogues and colours to usability [28]. Part eleven defines the term usability: the extent to which a software can be used by specified users to fulfill given tasks with effectiveness, efficiency and satisfaction in a specified context of use [27]. Based on other use cases, the conclusion can be made that there is a correlation between usability and popularity [44]. To measure the popularity of each tool, the following metrics, based on the DB-Engines Ranking, are used [18]:
- Number of search results in Google and Bing
- Frequency of searches by means of Google Analytics
- Amount of questions on Stack Overflow
- Number of job advertisements on Indeed and Simply Hired
- Sum of LinkedIn accounts with a reference in the profile
- Total amount of tweets
- Number of search results in Web of Science and Google Scholar
When choosing a transformation tool, the popularity is an indication of how common the usage of the tool is and how big the community is. Depending on the project this can be very relevant. It is not only important how popular a tool is but also how long it has been on the market. To put the popularity into perspective, the year of launch is mentioned. This is also an indicator of how stable the software is: the stability of a software which has been available for years can be ranked higher than that of a new one, whose stability is still vague.
Performance. Performance concerning time, hardware resources, capacity and modes is also an important factor for data integration processes. These factors can be measured as [48]:
- Time efficiency: degree of low response and processing times
- Resource utilization: resources utilized by the process, differentiated between the type of resource and the quantity
- Capacity: demand that can be executed without missing the other requirements
- Modes: modes of data processing (batch, real time etc.)
Within the scope of this article the performance of each tool is not measured. Where available, references to other studies or benchmarks are given.
Competitive advantages. Any competitive advantages will also be shown in the evaluation. These can be features outside the context of transformation and the underlying database itself. Depending on the use case these features can be helpful for the implementation. It has to be decided whether such a feature is useful in a project and whether the additional value is high enough to choose one transformation tool over another.
4 Analysis
This section gives an overview of the databases in the context of the presented criteria that are important for transformation tools, followed by the criteria which are specific to transformation tools. A summary of the criteria on database level can be found in table 2. The consistency is also specified there; it is not analysed in detail because all of the NewSQL databases offer ACID transactions. The properties of the transformation tools are shortly summarized in table 3.
Scalability. SingleStore creates replicas if requested at table creation or if the cluster is in high-availability or redundancy-2 mode. If replication is necessary, master-slave replication is used. By default the synchronous mode is chosen, which means that all replicas of a table are locked until a write transaction is finished; this is important to guarantee consistency. Alternatively an asynchronous mode can be chosen, in which the replicas are not locked [45]. Google Spanner does not have a partition model, but it divides data depending on the primary key into different separated ranges. The ranges are set automatically for balancing. Google Spanner's replication also has a master (Google calls it a leader) node and several slave nodes. The implementation is based on the Paxos protocol, which is another way of defining the agreed value [24]. A different way of partitioning is used by NuoDB. The storage is maintained by storage managers. Each storage manager has access to different storage groups; the connection between storage managers and storage groups is a many-to-many relationship. A storage group is a physical storage and contains a number of partitions. A partition is always stored in one storage group. Replication is made by NuoDB automatically in an asynchronous multi-master replication [41].
Table 2. Database properties

              Consistency   Scalability                       Licence and costs
NuoDB         ACID          Storage manager, multi-master     Open source (Proprietary)
SingleStore   ACID          Master-slave replication          Proprietary
Spanner       ACID          Automated range partitioning,     Proprietary
                            Paxos protocol replication
Licence and costs. Google is a cloud vendor and offers its database Spanner based on cloud technology. There are no fixed costs; instead they depend on the usage of the database. The units of measurement for the costs are reading and writing operations, storage, back-up, replication, in-memory cache, streams and data transfer. This price model correlates with the usage [24]. In this analysis the use cases are not further specified, hence a fixed price cannot be assumed. SingleStore has a similar price model: the price depends on the cloud, region and resources chosen [45]. NuoDB's community version is free but limited and only usable for low storage volumes. The enterprise version goes along with a licence to pay for [41]. HVR is proprietary software, but the pricing is not revealed. In advance a three-hour trial can be done [23]. Informatica also does not have a public price listing; it is proprietary software [26]. Pentaho likewise has two versions - the open source Pentaho Community Project and the proprietary Pentaho Enterprise Edition [37]. Talend has a basic open source software called Talend Open Source. For the extended version there is a licence per user, e.g. for using cloud integration [47]. Safe Software has a more special pricing concept. There are free licenses for students, graduates, research, non-profit organizations, teaching and trials. The license fee has not been raised since 2002. The license depends on the environment, like on premise or cloud, and on the kind of organization [25].
Transformation types. In HVR there are so-called actions for data transformation. In the Restrict action data can be filtered and joined, and duplicates can be removed. Within the ColumnProperties action the data type can be changed and a surrogate key can be used. The Integrate action can also merge data in the target. A derived column can be created via code in the ColumnProperties action. An aggregation cannot be made out of the box [23]. With Power Center and Data Quality - part of Informatica's product range - all the relevant transformation types are supported [26]. In Pentaho the transformation types are called Data Integration Steps; they provide all necessary transformations [37]. Safe Software's operators also support all defined transformation types [25]. In Talend there are many components which can be used in the transformation process. Code can also be used, e.g. for querying the database; the derivation of columns or expressions has to be made in code. Finally, all transformations can be implemented [47].
Standard components. Actions in HVR are based on labels and text boxes. There are predefined parameters in the Restrict action for filtering, lookup and removing duplicates, called "CaptureCondition", "HorizLookupTable" and "SelectDistinct". With the parameters "Datatype" and "SurrogateKey" in the "ColumnProperties" action, data conversions can be made and a surrogate key defined. The property "Resilient" in the "Integrate" action allows a merge in the target database; it can also be used for updates [23]. In Informatica Power Center there are several transformation types which can be added to the graphical data flow, like "aggregator", "filter", "sorter", "expression", "joiner", "lookup", "union", "sequence generator" and "update". With the "sorter" transformation duplicates can also be removed [26]. In Pentaho there are different Data Integration Steps which can be used for the implementation of the different transformation types, such as "group by" for aggregations, "filter rows", "unique rows", "select values", "concat fields", "merge join", "append streams", "add sequence" and "combination lookup/update" [37]. In Safe Software's FME the transformation types are called transformers. The most important ones are "Aggregator", "AttributeFilter", "DuplicateFilter", "BaseConverter", "AttributeCreator", "StringConcatenator", "DatabaseJoiner", "FeatureMerger", "Counter" and "DatabaseUpdater" [25]. Talend has several components in its Talend Open Studio and expanded modules for transformations. The most necessary are "tAggregateRow", "tFilterRow", "tUniqRow", "tConvertType", "tMap" and "tUnite". There are database-specific components for update and merge in the target [47]. In many cases there is more than one standard component available for a certain transformation type; however, only the most popular one is shown in this article.
Usability, popularity and stability. Analysis of the transformation tools Pentaho and Talend shows that they are easy to use [8, 32]. However, there is no comparison covering all of the transformation tools in focus. To classify the popularity of a tool the measures were related to each other. Informatica has the greatest popularity; the popularity of Talend is almost 40% lower, followed by Pentaho. The tools with the lowest popularity are HVR and Safe Software. After first explorations, HVR was deployed in 2012 [23]. Informatica, as a software company, has been on the market since 1993 [26]. Pentaho was developed in 2004; in 2015 the software was bought by Hitachi Vantara [37]. Safe Software began its company history in 1993, helping forestry companies [25]. Talend was founded in 2005 and has enlarged its software solution since then [47].
Performance. HVR does not support complex ETL processes very well [51]. Talend improves its performance by supporting parallel processing [47, 30]. Because no study compares the performance of all tools, the tools cannot be ranked by performance; moreover, performance information was not found for every transformation tool.
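The parallel processing mentioned for Talend typically follows a partition-and-apply pattern: the row stream is split into chunks and each chunk is transformed concurrently. A minimal generic sketch of that pattern follows; the function names and chunk size are illustrative, not Talend's actual API.

```python
# Partition-and-parallelize sketch: split the input into chunks and
# transform the chunks concurrently. All names here are illustrative.
from concurrent.futures import ThreadPoolExecutor

def transform_chunk(chunk):
    # Example transformation: type conversion plus a derived field.
    return [{"temp_c": float(t), "temp_f": float(t) * 9 / 5 + 32} for t in chunk]

def parallel_transform(values, workers=4, chunk_size=2):
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform_chunk, chunks)  # map preserves chunk order
    return [row for chunk in results for row in chunk]

rows = parallel_transform(["20.0", "21.5", "25.0"])
```

Because `map` preserves order, the output rows come back in the same order as the input even though the chunks are processed concurrently.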
Competitive advantages. HVR supports many different topologies such as uni-directional, bi-directional, multi-directional, broadcast, integration and cascading [23]. Informatica provides several products which support the data integration process, for example through ultra messaging, building a data hub, or data exchange between companies [26]. With the Lumada data services, Hitachi Vantara offers a wide product range for analytics, governance and ops agility, containing multi-cloud solutions such as a data catalog, a data optimizer for Hadoop, and a data lake [37]. Talend has a strong focus on data integration processes, which results in over 900 free components and connectors that can be used [47].
Table 3. Transformation tool properties

Tool          | Licence and costs       | Transformation types | Standard components | Popularity   | Founding year | Competitive advantages
HVR           | Proprietary             | 8/9                  | 7/9                 | Not popular  | 2012          | Supporting many topologies
Informatica   | Proprietary             | 9/9                  | 8/9                 | Popular      | 1993          | Continuative data integration products
Pentaho       | Open source/Proprietary | 9/9                  | 9/9                 | Less popular | 2004          | Lumada data services
Safe Software | Proprietary             | 9/9                  | 9/9                 | Not popular  | 1993          | -
Talend        | Open source/Proprietary | 9/9                  | 8/9                 | Popular      | 2005          | Over 900 components & connectors
5 Conclusion and future research
Because NoSQL databases cannot satisfy all requirements an application has towards a database, NewSQL databases are becoming more popular. This analysis gives an overview of the most popular transformation tools on the market for NewSQL databases, including their advantages and disadvantages. In fact, there
are not that many tools that support NewSQL databases, and some popular NewSQL databases are not supported by any tool at all, which shows how restricted the selection is. In those cases custom code or a user-specific connection has to be used. The tools that do provide connectors for NewSQL databases are not new on the market; rather, they are existing ETL tools that can now also be used for transformations in NewSQL databases. In this study, fundamental concepts regarding the emergence of NewSQL databases and their properties were presented. Using a self-developed selection method, the most popular NewSQL databases and transformation tools were chosen. The most important properties of a transformation tool were introduced and discussed for each database and tool. The overview tables show the advantages and disadvantages quickly and clearly. The analysis is thus not only a market overview but also a decision guide for the implementation of specific use cases. It shows that the choice of a transformation tool always includes the choice of a database system. The research gives an overview of different transformation tools in the context of NewSQL databases. There are more NewSQL databases on the market which are not mentioned here because of their lower popularity; popularity changes over time, and depending on the use case other databases could fit better. As a continuation, specific use cases can be implemented with the several tools and then be measured and analyzed. Within the scope of this study, not every property could be analyzed in detail, so further studies are necessary. The performance of standard components has to be analyzed in relation to the performance of custom code, and the usability of each tool has to be measured with a representative group of people. Besides that, it is partly possible to connect a database with a transformation tool without a standard connector.
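As an illustration of the last point: several NewSQL systems expose an existing wire protocol, so a generic driver can stand in for a missing dedicated connector. CockroachDB, for example, speaks the PostgreSQL wire protocol (default port 26257), so a standard PostgreSQL driver such as psycopg2 can be used. The sketch below assumes psycopg2 is installed; the host, credentials, table and column names are placeholders, not real endpoints.

```python
# Sketch: loading transformed rows into a NewSQL database without a
# dedicated connector, via a generic PostgreSQL driver. All connection
# details below are placeholders (assumptions, not real endpoints).

def build_dsn(host: str, port: int, database: str, user: str) -> str:
    """Assemble a libpq-style connection string for a PostgreSQL-compatible target."""
    return f"host={host} port={port} dbname={database} user={user} sslmode=require"

# UPSERT is CockroachDB's merge-style insert-or-update statement.
MERGE_SQL = "UPSERT INTO weather (station_id, day, temp_c) VALUES (%s, %s, %s)"

def load_rows(rows):
    import psycopg2  # imported lazily; assumed installed when actually loading
    with psycopg2.connect(build_dsn("localhost", 26257, "demo", "etl")) as conn:
        with conn.cursor() as cur:
            cur.executemany(MERGE_SQL, rows)
```

The same pattern applies to any NewSQL database that is compatible with an established protocol; the trade-off is that database-specific features may not be reachable through the generic driver.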
Acknowledgments. This research was made possible by funding from the ICT-
AGRI-FOOD 2020 Joint Call. This work was supported by a grant of the Roma-
nian National Authority for Scientific Research and Innovation, CCCDI - UE-
FISCDI, project number COFUND-ICT-AGRI-FOOD-GOHYDRO 200/2020,
within PNCDI III.
References
1. Aslett, M.: NoSQL, NewSQL and beyond: The drivers and use cases for database alternatives (2011)
2. Aslett, M.: How will the database incumbents respond to nosql and newsql (2011)
3. Atriwal, L., Nagar, P., Tayal, S., Gupta, V.: Business intelligence tools for big data.
Journal of Basic and Applied Engineering Research 3(6), 505–509 (2016)
4. Authors, V.: The vitess documentation, https://vitess.io/docs/
5. Belyy, A., Xu, F., Herdan, T., He, M., Syed, A., Cao, W., Yee, M.: Dataset previews
for etl transforms (May 28 2013), US Patent 8,452,723
6. Bergamaschi, S., Guerra, F., Orsini, M., Sartori, C., Vincini, M.: A semantic ap-
proach to etl technologies. Data & Knowledge Engineering 70(8), 717–731 (2011)
7. Bhide, M.A., Bonagiri, K.K., Mittapalli, S.K.: Column based data transfer in ex-
tract transform and load (etl) systems (Aug 20 2013), US Patent 8,515,898
8. Biplob, M.B., Sheraji, G.A., Khan, S.I.: Comparison of different extraction trans-
formation and loading tools for data warehousing. In: 2018 International Confer-
ence on Innovations in Science, Engineering and Technology (ICISET). pp. 262–
267. IEEE (2018)
9. Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I.: Cloud computing
and emerging it platforms: Vision, hype, and reality for delivering computing as
the 5th utility. Future Generation computer systems 25(6), 599–616 (2009)
10. Clarivate: Web of science, https://clarivate.com/webofsciencegroup/
solutions/web-of-science/
11. Data, C.: Citus documentation, https://docs.citusdata.com/
12. Davenport, R.J.: Etl vs elt: A subjective view. Insource Commercial aspects of BI
whitepaper (2008)
13. El Akkaoui, Z., Zimányi, E., Mazón, J.N., Trujillo, J.: A model-driven framework
for etl process development. In: DOLAP ’11: Proceedings of the ACM 14th inter-
national workshop on Data Warehousing and OLAP. pp. 45–52 (10 2011)
14. Fatima, H., Wasnik, K.: Comparison of sql, nosql and newsql databases for internet
of things. In: 2016 IEEE Bombay Section Symposium (IBSS). pp. 1–6. IEEE (2016)
15. Gartner, I.: Magic quadrant research methodology (2020), https://www.gartner.
com/en/research/methodologies/magic-quadrants-research
16. Geelan, J., et al.: Twenty-one experts define cloud computing. Cloud Computing
Journal 4, 1–5 (2009)
17. Gilbert, S., Lynch, N.: Brewer’s conjecture and the feasibility of consistent, avail-
able, partition-tolerant web services. Acm Sigact News 33(2), 51–59 (2002)
18. solidIT consulting & software development GmbH: Db-engines (2020), https://
db-engines.com/
19. Grolinger, K., Higashino, W.A., Tiwari, A., Capretz, M.A.: Data management in
cloud environments: Nosql and newsql data stores. Journal of Cloud Computing:
advances, systems and applications 2(1), 22 (2013)
20. Gurevich, Y.: Comparative survey of nosql/newsql db systems. Department of
Mathematics and Computer Science, The Open University of Israel (2015)
21. Haerder, T., Reuter, A.: Principles of transaction-oriented database recovery. ACM
computing surveys (CSUR) 15(4), 287–317 (1983)
22. Hajoui, O., Dehbi, R., Talea, M., Batouta, Z.I.: An advanced comparative study
of the most promising nosql and newsql databases with a multi-criteria analysis
method. Journal of Theoretical and Applied Information Technology 81(3), 579
(2015)
23. HVR Software, I.: Enterprise data integration software — hvr (2020), https://
www.hvr-software.com/
24. Inc., G.: Cloud spanner — google cloud (2020), https://cloud.google.com/
spanner
25. Inc., S.S.: Safe software — fme — data integration platform (2020), https://www.
safe.com/
26. Informatica: Enterprise cloud data management — informatica deutschland (2020),
https://www.informatica.com/de/
27. ISO: Ergonomische Anforderungen für Bürotätigkeiten mit Bildschirmgeräten – Teil 11: Anforderungen an die Gebrauchstauglichkeit – Leitsätze. Beuth Verlag, Berlin (1999)
28. ISO: Ergonomische Anforderungen für Bürotätigkeiten mit Bildschirmgeräten – Teil 1: Allgemeine Einführung (ISO 9241-1:1997) (enthält Änderung AMD 1:2001); Deutsche Fassung EN ISO 9241-1:1997 + A1:2001. Beuth Verlag, Berlin (2002)
29. Jing Han, Haihong E, Guan Le, Jian Du: Survey on nosql database. In: 2011 6th
International Conference on Pervasive Computing and Applications. pp. 363–366
(2011)
30. Katragadda, R., Tirumala, S.S., Nandigam, D.: Etl tools for data warehousing:
an empirical study of open source talend studio versus microsoft ssis. Computing
Conference Papers [147] (2015)
31. Kaur, K., Sachdeva, M.: Performance evaluation of newsql databases. In: 2017
International Conference on Inventive Systems and Control (ICISC). pp. 1–5. IEEE
(2017)
32. Kherdekar, V.A., Metkewar, P.S.: A technical comprehensive survey of etl tools.
Advanced Engineering Research and Applications p. 20 (2016)
33. Kumar, R., Gupta, N., Maharwal, H., Charu, S., Yadav, K.: Critical analysis of
database management using newsql. International Journal of Computer Science
and Mobile Computing 3, 434–438 (2014)
34. Labs, C.: Cockroach labs, the company building cockroachdb (2020), https://
cockroachlabs.com/
35. Leavitt, N.: Will nosql databases live up to their promise? Computer 43(2), 12–14
(2010)
36. LLC, G.: About google scholar, https://scholar.google.com/intl/de/scholar/
about.html
37. LLC, H.V.: Pentaho enterprise edition — hitachi vantara (2020), https://www.
hitachivantara.com/
38. Mehra, K.K., Kumar, P., Choudhury, P., Lakkur, A.L., Samanta, S.: Extract, trans-
form and load (etl) system and method (Apr 25 2017), US Patent 9,633,095
39. Mell, P., Grance, T., et al.: The nist definition of cloud computing. NIST special
publication 800-145 (2011)
40. Moniruzzaman, A.B.M.: Newsql: Towards next-generation scalable rdbms for on-
line transaction processing (oltp) for big data management. International Journal
of Database Theory and Application 7 (11 2014)
41. NuoDB: Nuodb home — nuodb (2020), https://nuodb.com/
42. OutSystems: The state of application development: Is it ready for disruption? Tech.
rep., Department of Computer Science, Michigan State University, OutSystems:
Boston, MA, USA (September 2019)
43. Richardson, C., Rymer, J.R.: New development platforms emerge for customer-
facing applications. Forrester: Cambridge, MA, USA (2014)
44. Scowen, G., Regenbrecht, H.: Increased popularity through compliance with usabil-
ity guidelines in e-learning web sites. International Journal of Information Tech-
nology and Web Engineering (IJITWE) 4(3), 38–57 (2009)
45. SingleStore, I.: Singlestore is the database of now, powering modern applications and analytical systems (2020), https://www.singlestore.com/
46. Song, X., Yan, X., Yang, L.: Design etl metamodel based on uml profile. In: 2009
Second International Symposium on Knowledge Acquisition and Modeling. vol. 3,
pp. 69–72. IEEE (2009)
47. Talend: Talend - a cloud data integration leader (modern etl) (2020), https://
www.talend.com/
48. Theodorou, V., Abelló, A., Lehner, W.: Quality measures for etl processes. In:
International Conference on Data Warehousing and Knowledge Discovery. pp. 9–
22. Springer (2014)
49. VoltDB, I.: Home - voltdb (2020), https://www.voltdb.com/
50. Zaidi, E., Thoo, E., Heudecker, N., Menon, S., Thanaraj, R.: Gartner magic quad-
rant for data integration tools. Gartner Group (2020)
51. Zaidi, E., Thoo, E., Heudecker, N., Menon, S., Thanaraj, R.: Magic quadrant for
data integration tools, 2020 (2020)
52. Zamanian, K., Nesamoney, D.: Apparatus and method for performing data trans-
formations in data warehousing (Jan 15 2002), US Patent 6,339,775
53. Zhang, Q., Cheng, L., Boutaba, R.: Cloud computing: state-of-the-art and research
challenges. Journal of internet services and applications 1(1), 7–18 (2010)