
Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency

International Journal of Genomics

October 2015
2015(2):502795

DOI:10.1155/2015/502795

License
CC BY

Authors:

Rene Xavier

University of Brasília

Fernanda Hondo

University of Brasília


Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them refers to management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. To find an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB.


Research Article


Rodrigo Aniceto,1 Rene Xavier,1 Valeria Guimarães,1 Fernanda Hondo,1 Maristela Holanda,1 Maria Emilia Walter,1 and Sérgio Lifschitz2

1Computer Science Department, University of Brasilia (UNB), 70910-900 Brasilia, DF, Brazil

2Informatics Department, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), 22451-900 Rio de Janeiro, RJ, Brazil

Correspondence should be addressed to Maristela Holanda; mholanda@cic.unb.br

Received  March ; Accepted  May 

Academic Editor: Che-Lun Hung

License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly

cited.


1. Introduction

Advanced hardware and software technologies increase the speed and efficiency with which scientific workflows may be performed. Scientists may execute a given workflow many times, comparing results from these executions and providing greater accuracy in data analysis. However, handling large volumes of data produced by distinct program executions under varied conditions becomes increasingly difficult. These massive amounts of data must be stored and treated in order to support current genomic research [–]. Therefore, one of the main problems when working with genomic data refers to the storage and search of these data, requiring many computational resources.

In computational environments with large amounts of possibly unconventional data, NoSQL [] database systems have emerged as an alternative to traditional Relational Database Management Systems (RDBMS). NoSQL systems are distributed databases built to meet the demands of high scalability and fault tolerance in the management and analysis of massive amounts of data. NoSQL databases are coded in many distinct programming languages and are generally available as open-source software.

The objective of this paper is to study the persistency of genomic data on a particular and widely used NoSQL database system, namely, Cassandra []. The tests performed for this study use real genomic data to evaluate insertion and extraction operations into and from the Cassandra database. Considering the large amounts of data in current genome projects, we are particularly concerned with high performance. We discuss and compare our results with a relational system (PostgreSQL) and another NoSQL database system, MongoDB [].

This paper is organized as follows. Section 2 presents a brief introduction to NoSQL databases and the main features of the Cassandra database system. We discuss some related work in Section 3 and present, in Section 4, the architecture of the database system. Section 5 discusses the practical results obtained and Section 6 concludes and suggests future work.

Hindawi Publishing Corporation

International Journal of Genomics

Volume 2015, Article ID 502795, 7 pages

http://dx.doi.org/10.1155/2015/502795


2. NoSQL Databases: An Overview

Many relevant innovations in data management came from Web 2.0 applications. However, the techniques and tools available in relational systems may, sometimes, limit their deployment. Therefore, some researchers have decided to develop their own web-scale database solutions [].

NoSQL (not-only SQL) databases have emerged as a solution to storage scalability issues, parallelism, and management of large volumes of unstructured data. In general, NoSQL systems have the following characteristics [–]: (i) they are based on a nonrelational data model; (ii) they rely on distributed processing; (iii) high availability and scalability are main concerns; and (iv) some are schemaless and have the ability to handle both structured and unstructured data.

There are four main categories of NoSQL databases [,–]:

(i) Key-value stores: data is stored as key-value pairs. These systems are similar to dictionaries, where data is addressed by a single key. Values are isolated and independent from one another, and relationships are handled by the application logic.

(ii) Column family database: it defines the data structure as a predefined set of columns. The super columns and column family structures can be considered the database schema.

(iii) Document-based storage: a document store uses the concept of key-value store. The documents are collections of attributes and values, where an attribute can be multivalued. Each document contains an ID key, which is unique within a collection and identifies the document.

(iv) Graph databases: graphs are used to represent schemas. A graph database works with three abstractions: nodes, relationships between nodes, and key-value pairs that can attach to nodes and relationships.

2.1. Cassandra Database System. Cassandra is a cloud-oriented database system, massively scalable, designed to store a large amount of data across multiple servers, while providing high availability and consistent data []. It is based on the architecture of Amazon’s Dynamo [] and also on Google’s BigTable data model []. Cassandra enables queries as in a key-value model, where each row has a unique row key, a feature adopted from Dynamo [,,,]. Cassandra is considered a hybrid NoSQL database, using characteristics of both key-value and column oriented databases.

Cassandra’s architecture is made of nodes, clusters, data centers, and a partitioner. A node is a physical instance of Cassandra. Cassandra does not use a master-slave architecture; rather, it uses a peer-to-peer architecture in which all nodes are equal. A cluster is a group of nodes or even a single node. A group of clusters is a data center. A partitioner is a hash function for computing the token of each row key.

When a row is inserted, a token is calculated from its unique row key. This token determines the node on which that particular row will be stored. Each node of a cluster is responsible for a range of data based on a token. When the row is inserted and its token is calculated, the row is stored on the node responsible for this token. The advantage here is that multiple rows can be written in parallel into the database, as each node is responsible for its own write requests. However, this may be seen as a drawback regarding data extraction, becoming a bottleneck. The Murmur3Partitioner [] is a partitioner that uses tokens to assign equal portions of data to each node. This technique was selected because it provides fast hashing, and its hash function helps to evenly distribute data to all the nodes of a cluster.
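The token-to-node mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cassandra's implementation: each node owns a single, evenly spaced token, and MD5 from the standard library stands in for the Murmur3 hash. The names token_for, build_ring, and node_for are hypothetical.

```python
import hashlib
from bisect import bisect_left


def token_for(row_key: str) -> int:
    """Map a row key to a 64-bit token (MD5 stand-in, NOT Murmur3)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes):
    """Assign each node one evenly spaced token; a node owns all tokens
    up to and including its own (assumption: one token per node)."""
    step = 2**64 // len(nodes)
    return [((i + 1) * step - 1, node) for i, node in enumerate(nodes)]


def node_for(row_key: str, ring) -> str:
    """Return the node responsible for the token of this row key."""
    tokens = [token for token, _ in ring]
    index = bisect_left(tokens, token_for(row_key))
    # Tokens past the last boundary wrap around to the first node.
    return ring[index % len(ring)][1]
```

With four nodes and a few thousand distinct row keys, each node should receive roughly a quarter of the rows, which is the load-balancing effect the text attributes to the partitioner.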

The main elements of Cassandra are keyspaces, column families, columns, and rows []. A keyspace contains the processing steps of the data replication and is similar to a schema in a relational database. Typically, a cluster has one keyspace per application. A column family is a set of key-value pairs containing a column with its unique row keys. A column is the smallest increment of data, which contains a name, a value, and a timestamp. Rows are columns with the same primary key.

When a write operation occurs, Cassandra immediately stores the instruction in the Commit log, which goes to the hard disk (HD). Data from this write operation is stored in the memtable, which stays in RAM. Only when a predefined memory limit is reached is this data written to SSTables, which stay on the HD. Then, the Commit log and the memtable are cleaned up [,]. In case of failure regarding the memtables, Cassandra reexecutes the written instructions available in the Commit log [,].

When an extract instruction is executed, Cassandra first searches for the information in memtables. A large RAM allows large amounts of data in memtables and less data on the HD, resulting in quick access to information [].
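The write and read paths just described can be illustrated with a toy model. The sketch below is an assumption-laden simplification: the class name ToyNode is hypothetical, the memory limit is counted in entries rather than bytes, and plain Python lists and dictionaries stand in for the on-disk Commit log and SSTables.

```python
class ToyNode:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # assumption: limit in entries
        self.commit_log = []  # stands in for the on-disk commit log
        self.memtable = {}    # in-memory table
        self.sstables = []    # flushed tables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Persist the memtable as a new SSTable, then clean up."""
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
        self.commit_log.clear()

    def recover(self):
        """Rebuild a lost memtable by replaying the commit log."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value

    def read(self, key):
        """Look in the memtable first, then in SSTables, newest first."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None
```

Reads that are served from the memtable never touch the flushed tables, which mirrors the observation that more RAM leads to quicker access.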

3. Storing Genomic Data

Persistency of genomic data is not a recent problem. In 2004, Bloom and Sharpe [] described the difficulties of managing these data. One of the main difficulties was the growing number of data generated by the queries. The works of Röhm and Blakeley [] and Huacarpuma [] consider relational databases (SQL Server  and PostgreSQL, resp.) to store genomic data in FASTQ format.

Bateman and Wood [] have suggested using NoSQL databases as a good alternative to persisting genetic data. However, no practical results are given. Ye and Li [] proposed the use of Cassandra as a storage system. They consider multiple nodes so that there were no gaps in the consistency of the data. Wang and Tang [] indicated some instructions for creating an application to perform data operations in Cassandra.

Tudorica and Bucur [] compared some NoSQL databases to a MySQL relational database using the YCSB (Yahoo! Cloud Serving Benchmark). They conclude that in an environment where write operations prevail MySQL has a significantly higher latency when compared to Cassandra. Similar results about performance improvements for writing operations in Cassandra, when compared to MS SQL Express, were also reported by Li and Manoharan [].


Many research works [–] present results involving the performance of a Cassandra database system for massive data volumes. In this paper, we have decided to evaluate the performance of the Cassandra NoSQL database system specifically for genomic data.

4. Case Study

To validate our case study we have used real data. The sequences (also called reads) were obtained from liver and kidney tissue samples of one human male from the SRA-NCBI (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?), sequenced by the Illumina Genome Analyzer. It produced ,, sequences for the kidney samples and ,, sequences for the liver samples, each sequence containing  bases. Marioni et al. [] generated these sequences.

A FASTQ file stores sequences of nucleotides and their corresponding quality values. Three files were obtained from filtered sequences sampled from kidney cells, and another three files consisted of filtered genomic sequences sampled from liver cells. It should be noted that these data were selected because they were in FASTQ [] format, which is commonly used in bioinformatics workflows.
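To make the FASTQ layout concrete, the following sketch parses records under the assumption of the common four-line form: an identifier line starting with "@", the nucleotide sequence, a separator line starting with "+", and a quality string of the same length as the sequence. The names Read and parse_fastq are hypothetical, and multi-line (wrapped) records are not handled.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield one Read per four-line FASTQ record."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if quality is None:
            raise ValueError(f"truncated record: {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield Read(header[1:], sequence, quality)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so a multi-gigabyte file is streamed rather than loaded into memory.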

In this case study, we carried out three analyses. In the first one, we investigated how Cassandra behaves when the computational environment is composed of a cluster with two and four computers. In the second one, we analyzed the behavior of Cassandra compared to PostgreSQL, a relational database. In the last one, we used the MongoDB document-oriented NoSQL database to compare to Cassandra’s results.

4.1. Cloud Environment Architecture. In order to investigate the expected advantages of Cassandra’s scalability, we have created two cloud environments: one with two nodes and the other with four nodes. Cassandra was installed on every node of the cluster. We have also used OpsCenter 4.0 [], a DSE tool that implements a browser-based interface to remotely manage the cluster configuration and architecture. The architecture contains a single data center, named DC. A single cluster, named BIOCluster, containing the nodes, was created, working with DC.

4.2. Java Client. At the software level, we have defined the following functional requirements: (i) create a keyspace; (ii) create a table to store a FASTQ file; (iii) create a table with the names of inserted FASTQ files and their corresponding metadata; (iv) receive an input file containing data from a FASTQ file and insert it into a previously created table, followed by the file name and metadata; (v) extract all data from a table containing the contents of a FASTQ file; and (vi) remove the table and the keyspace.
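The functional requirements above map closely onto a handful of CQL statements. The sketch below only builds the statement strings; it does not connect to a cluster. The column names (row_id, c0 to c9, file_name, row_count) and the helper names are assumptions made for illustration, since the text does not give the actual schema. SimpleStrategy with a replication factor of one matches the keyspace settings reported for this case study, and all names are assumed to be valid unquoted CQL identifiers.

```python
def create_keyspace_cql(keyspace: str, replication_factor: int = 1) -> str:
    """Requirement (i): create a keyspace."""
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


def create_reads_table_cql(keyspace: str, table: str, data_columns: int = 10) -> str:
    """Requirement (ii): one table per FASTQ file (hypothetical column names)."""
    cols = ", ".join(f"c{i} text" for i in range(data_columns))
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} "
        f"(row_id int PRIMARY KEY, {cols});"
    )


def create_metadata_table_cql(keyspace: str) -> str:
    """Requirement (iii): file names and their metadata."""
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.metadata "
        f"(file_name text PRIMARY KEY, row_count bigint);"
    )


def insert_row_cql(keyspace: str, table: str, data_columns: int = 10) -> str:
    """Requirement (iv): parameterized insert, one '?' per column."""
    names = ", ".join(["row_id"] + [f"c{i}" for i in range(data_columns)])
    marks = ", ".join("?" for _ in range(data_columns + 1))
    return f"INSERT INTO {keyspace}.{table} ({names}) VALUES ({marks});"


def select_all_cql(keyspace: str, table: str) -> str:
    """Requirement (v): extract all data from a table."""
    return f"SELECT * FROM {keyspace}.{table};"


def drop_cql(keyspace: str, table: str) -> list:
    """Requirement (vi): remove the table, then the keyspace."""
    return [
        f"DROP TABLE IF EXISTS {keyspace}.{table};",
        f"DROP KEYSPACE IF EXISTS {keyspace};",
    ]
```

A client would pass these strings to whatever driver it uses; keeping them as plain functions makes the mapping from requirement to statement easy to inspect.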

Nonfunctional requirements were also defined: (i) the use of the Java API provided by DataStax, in order to have a better integration between the Cassandra distribution and the developed client application; (ii) the use of Cassandra Query Language (CQL) [] for database interactions, which is the current query language of Cassandra and resembles SQL; (iii) conversion to JSON files to be used by the client application, since it is simpler to work with JSON files in Java; and (iv) a good performance in operations.

With respect to this last requirement, three applications

were developed, two for data conversion and one client

application for Cassandra.

(1) The FastqToJson application converts the FASTQ input file into smaller JSON files, each JSON file with five hundred thousand reads. The objective is to load these small JSON files because, usually, a FASTQ file occupies a few gigabytes. Furthermore, as it presents a proper format for the Java client, it does not consume many computational resources. Each JSON file occupies ten thousand rows in the database: each row is an array of ten columns; each field value of the column contains five reads.

(2) The Cassandra client was also developed in Java, using the Java API provided by DataStax, and is the one in which the data persists. This client creates a keyspace, inserts all JSON files from the first application in a single table, and extracts the data from a table.

The database schema consists of a single keyspace, called biodata, a single cluster, called biocluster, one table of metadata, and one table for each file persisted, as shown in Figure 1.

The allocation strategy for replicas and the replication factor are properties of the keyspace. The allocation strategy determines whether or not data is distributed through a network of different clusters. The Simple Strategy [] was selected since this case study was performed in a single cluster. Likewise, since we did not consider failures and our goal was to study performance rather than fault recovery, we have chosen a replication factor of one. It should be noted that the replication factor determines the number of replicas distributed along the cluster. Focusing on performance, a higher number of replicas would also interfere with the insertion time.

As previously mentioned, the client application creates a table for each inserted FASTQ file, which has the same name as the file. Each of these tables has eleven columns, and each cell stores a small part of a JSON file, ten reads per cell, which is about  MB in size. This small set of columns and cells is due to the efficiency of Cassandra when a small number of columns and a large number of rows are used. This is also a consequence of the ability of Murmur3Partitioner to distribute each row to one node. Therefore, the cluster has a better load balance during insertions and extractions.

Once a table is created, the client inserts all data from the JSON files of the first stage into the database, as shown in Figure 2. Then, a single row is inserted into the metadata table containing as a row key the name of the FASTQ file and a column with the number of rows. The latter is inserted into the metadata table to work around the memory limit of the Java Virtual Machine, which may be reached when querying large tables.

Figure 1: Database schema (cluster BIOCluster; keyspace bioData; a metadata table keyed by file name; one table per file, each row holding a key and its values).

Figure 2: Stages of insertion (FASTQ file → FastqToJson → arq1.json, arq2.json, ..., arqN.json → Cassandra client).

When extracting data, the client queries the metadata table to get the number of rows in the table with the FASTQ data and then proceeds to the table extraction, which is done row by row and written into an “.out” file.

(3) The OutToFastq application. After data extraction, there is a single file with the extension “.out.” This application converts this file into FASTQ format, making it identical to the original input file, leaving only the FASTQ file without the temporary “.out” file. This process is shown in Figure 3.
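The packing performed before insertion, and its reversal after extraction, can be sketched as follows. This is an illustration rather than the authors' code: the function names are hypothetical, reads are treated as opaque JSON-serializable values, and each cell is stored as a JSON array. The defaults follow the description of the first application (ten columns per row, five reads per cell); since a later passage mentions ten reads per cell, reads_per_cell is left as a parameter.

```python
import json
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def to_rows(reads, columns=10, reads_per_cell=5):
    """Pack reads into rows of `columns` cells; each cell is a JSON array."""
    rows = []
    for row_reads in chunked(reads, columns * reads_per_cell):
        cells = [json.dumps(cell) for cell in chunked(row_reads, reads_per_cell)]
        rows.append(cells)
    return rows


def from_rows(rows):
    """Reverse to_rows: flatten cells back into the original read order."""
    return [read for cells in rows for cell in cells for read in json.loads(cell)]
```

With the default parameters, a chunk of five hundred thousand reads yields ten thousand rows, which agrees with the figures given for each JSON file, and from_rows(to_rows(reads)) returns the reads in their original order.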

Figure 3: Stages of extraction (Cassandra client → Data.out → OutToFastq → Data.fastq).

5. Results

In this work, we have considered three experimental case studies to evaluate data consistency and performance for storing and extracting genomic data. For the first one, we verified Cassandra’s scalability and variation in performance. For the second case study, we compared the Cassandra results to a PostgreSQL relational system and, finally, we used the MongoDB NoSQL database and compared its results to those of the Cassandra NoSQL system. The case studies used the same data to insert and read sequences.

During the Cassandra evaluation, we created two clusters: the first with two computers and the second with four. The first cluster consisted of two computers with Intel Xeon E-/. GHz processor, one with  GB RAM and the other with  GB RAM. For the second cluster, besides the same two computers, two other computers with Intel Core i processor and  GB RAM were included. Each one of them used Ubuntu ..

5.1. Insertions and Extractions Cassandra NoSQL. The input files are six FASTQ files with filtered data from kidney and liver cells. Table 1 shows the sizes of the files and the number of rows that their respective JSON files had when inserted into Cassandra.

Table 1: Cell files.

File File number Size Number of lines

Liver cells files

 , GB .

 , GB .

 , GB . 

Kidney cells files

 , GB .

 , GB .

 , GB .

We have based the performance analyses on the elapsed time to store (insert) data into and to retrieve (extract) data from the database. These elapsed times are important because if one wants to use the Cassandra system in bioinformatics workflows, it is necessary to know how long it takes for the data to become available to each program.
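Elapsed times of this kind can be collected with a small helper such as the one below. It is a generic sketch, not the instrumentation used in the experiments: the names stopwatch and format_elapsed are hypothetical, and the output format simply follows the minutes, seconds, and milliseconds layout of the result tables.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def format_elapsed(seconds):
    """Format a duration in seconds as 'M m S s MS ms'."""
    total_ms = round(seconds * 1000)
    minutes, remainder = divmod(total_ms, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{minutes} m {secs} s {millis} ms"
```

For instance, wrapping an insertion loop in `with stopwatch("insert", results):` records its duration, and format_elapsed(125.25) returns "2 m 5 s 250 ms".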

Table 2 shows the elapsed times to insert and extract sequences in the database, with both implementations. Columns  and  show the insertions using two nodes. Similarly, columns  and  show the extractions using four nodes. As expected, we could confirm the hypothesis that the database performance increases when we add more nodes.

Figures and show comparative charts of insertion and

extraction elapsed times according to the number of comput-

ers that Cassandra considers. Insertion into two computers

is longer than using four computers. Here the performance

also improves when the number of computers increases in the

cluster.

5.2. Comparison of Relational and Cassandra NoSQL Systems. We compared the Cassandra results with those of Huacarpuma [], who used the same data to insert and read sequences in PostgreSQL, a relational database. In the latter experiment, the author used only one server with an Intel Xeon processor, eight cores of . GHz and  GB RAM, executing Linux Server Ubuntu/Linaro ..-.

The server’s RAM for the relational database is larger than the sum of the memories of the four computers used in this experiment. Nonetheless, we use the results of the relational database to demonstrate that it is possible to achieve high performance even with modest hardware due to scalability and parallelism.

Table 3 shows the sum of the insertion and extraction times in the relational database and the two computational environments using Cassandra: Cassandra (2), a cluster with two computers, and Cassandra (4), a cluster with four computers.

The writing time in Cassandra is lower due to parallelism, as seen in Table 3. Write actions in Cassandra are more effective than in a relational database. However, its performance was lower for query answering, as shown in Figure 6. This is due to two factors: first, Cassandra had to ensure that the returned content was in its latest version, verifying the data divided between machines; second, the data size is larger than the available RAM; therefore, part of the data had to be stored in SSTables, reducing the speed of the search.

Figure 4: Comparison between inserts (time × file number), in minutes per file, for Cassandra (2) and Cassandra (4).

Figure 5: Comparison between extractions (time × file number), in minutes per file, for Cassandra (2) and Cassandra (4).

The reader should note that the results obtained with Cassandra just indicate a trend. They are not conclusive because the hardware characteristics of all experiments are different.

Nevertheless, the improved performance with the increase of nodes is an indication that Cassandra may sometimes surpass relational database systems on a larger number of computers, making its use viable in data searches in bioinformatics.

5.3. Comparison of MongoDB and Cassandra NoSQL Databases. We compared the Cassandra results with those obtained by using the same data to insert and read sequences in a MongoDB NoSQL database. This is an open-source document-oriented NoSQL database designed to store large amounts of data.

The server where we have installed MongoDB has an i processor with  GB RAM. This server had  GB RAM more than the cluster with two computers, Cassandra (2), and  GB RAM less than the sum of the RAM of the four computers, Cassandra (4).

Table 2: Times to insert and extract sequences from the database.

File Size Insertion Extraction

Cassandra (2) Cassandra (4) Cassandra (2) Cassandra (4)

 , GB  m  s  ms  m  s  ms  m  s  ms  m  s  ms

 ,GB msms msms msms msms

 , GB  m  s  ms  m  s  ms  m  s  ms  m  s  ms

 ,GB msms msms msms msms

 , GB  m  s  ms  m  s  ms  m  s  ms  m  s  ms

 , GB  m  s  ms  m  s  ms  m  s  ms  m  s  ms

T : PostgreSQL and Cassandra results.

Database Insertion Extraction

PostgreSQL hms ms

Cassandra ()  m  s  h  m   s

Cassandra() ms ms

100

120

PostgreSQL Cassandra (2) Cassandra (4)

(min)

Database

Insertion

Extraction

F : Comparison between Cassandra and PostgreSQL.

Table 4 shows the sum of the insertion and extraction times in the MongoDB database and in Cassandra with two and four computers in a cluster. The performance of insertion operations was similar using either MongoDB or Cassandra. However, MongoDB showed better behavior than Cassandra in the extraction of genomic data in FASTQ format.

In Figure 7 our results suggest that there is a similar behavior of the insertions in both MongoDB and Cassandra. There was a performance gain of more than % in the extraction, when comparing the results of Cassandra in a cluster with two computers and another cluster with four computers.

Table 4: MongoDB and Cassandra final results.

Database Insertion Extraction

MongoDB  m  s  m  s

Cassandra (2)  m  s  h  m   s

Cassandra (4) ms ms

Figure 7: Comparison between Cassandra and MongoDB (insertion and extraction times, in minutes, for MongoDB, Cassandra (2), and Cassandra (4)).

6. Conclusions

In this work we studied genomic data persistence, with the implementation of a NoSQL database using Cassandra. We have observed that it presented a high performance for writing operations due to the larger number of massive insertions compared to data extractions. We used the DSE tool together with Cassandra, which allowed us to create a cluster and a client application suitable for the expected data manipulation.

Our results suggest that there is a reduction of the insertion and query times when more nodes are added in Cassandra. There was a performance gain of about % in the insertions and a gain of % in reading, when comparing the results of a cluster with two computers and another cluster with four computers.
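A gain of this kind is commonly computed as the relative reduction in elapsed time. The text does not state which formula was used, so the sketch below shows one plausible definition; the function name and the sample values in the usage note are hypothetical.

```python
def relative_gain(baseline_seconds: float, improved_seconds: float) -> float:
    """Percentage reduction in elapsed time relative to the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline_seconds - improved_seconds) / baseline_seconds
```

For example, if a two-node run took 200 seconds and a four-node run took 150 seconds, relative_gain(200.0, 150.0) would give 25.0, that is, a 25% gain.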

Comparing the performance of Cassandra to the MongoDB database, the results indicate that MongoDB performs extraction better than Cassandra. For data insertions the behaviors of Cassandra and MongoDB were similar.

From the results presented here, it is possible to outline new approaches in studies of persistency regarding genomic data. Positive results could boost new research, for example, the creation of a similar application using other NoSQL databases or new tests using Cassandra with different hardware configurations seeking improvements in performance. It is also possible to create a relational database with hardware settings identical to Cassandra, in order to make more detailed comparisons.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

References

[] S. A. Simon, J. Zhai, R. S. Nandety et al., “Short-read sequencing technologies for transcriptional analyses,” Annual Review of Plant Biology, vol. , no. , pp. –, .

[] M. L. Metzker, “Sequencing technologies - the next generation,” Nature Reviews Genetics, vol. , no. , pp. –, .

[] C.-L. Hung and G.-J. Hua, “Local alignment tool based on Hadoop framework and GPU architecture,” BioMed Research International, vol. , Article ID ,  pages, .

[] Y.-C. Lin, C.-S. Yu, and Y.-J. Lin, “Enabling large-scale biomedical analysis in the cloud,” BioMed Research International, vol. , Article ID ,  pages, .

[] K. Kaur and R. Rani, “Modeling and querying data in NoSQL databases,” in Proceedings of the IEEE International Conference on Big Data, pp. –, October .

[] A. Lakshman and P. Malik, “Cassandra: a decentralized structured storage system,” Operating Systems Review, vol. , no. , pp. –, .

[] K. Chodorow, MongoDB: The Definitive Guide, O’Reilly, nd edition, .

[] R. Hecht and S. Jablonski, “NoSQL evaluation: a use case oriented survey,” in Proceedings of the International Conference on Cloud and Service Computing (CSC ’11), pp. –, December .

[] Y. Muhammad, Evaluation and implementation of distributed NoSQL database for MMO gaming environment [M.S. thesis], Uppsala University, .

[] C. J. M. Tauro, S. Aravindh, and A. B. Shreeharsha, “Comparative study of the new generation, agile, scalable, high performance NOSQL databases,” International Journal of Computer Applications, vol. , no. , pp. –, .

[] R. P. Padhy, M. Patra, and S. C. Satapathy, “RDBMS to NoSQL: reviewing some next-generation non-relational databases,” International Journal of Advanced Engineering Science and Technologies, vol. , no. , pp. –, .

[] M. Bach and A. Werner, “Standardization of NoSQL database languages,” in Beyond Databases, Architectures, and Structures: 10th International Conference, BDAS 2014, Ustron, Poland, May 27–30, 2014. Proceedings, vol.  of Communications in Computer and Information Science, pp. –, Springer, Berlin, Germany, .

[] M. Indrawan-Santiago, “Database research: are we at a crossroad? Reflection on NoSQL,” in Proceedings of the 15th International Conference on Network-Based Information Systems (NBIS ’12), pp. –, IEEE, Melbourne, Australia, September .

[] G. DeCandia, D. Hastorun, M. Jampani et al., “Dynamo: amazon’s highly available key-value store,” in Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), pp. –, ACM, October .

[] F. Chang, J. Dean, S. Ghemawat et al., “Bigtable: a distributed storage system for structured data,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06), pp. –, .

[] E. Hewitt, Cassandra: The Definitive Guide, O’Reilly, st edition, .

[] M. Klems, D. Bermbach, and R. Weinert, “A runtime quality measurement framework for cloud database service systems,” in Proceedings of the 8th International Conference on the Quality of Information and Communications Technology (QUATIC ’12), pp. –, September .

[] V. Parthasarathy, Learning Cassandra for Administrators, Packt Publishing, Birmingham, UK, .

[] DataStax, Apache Cassandra . Documentation, , http://www.datastax.com/documentation/cassandra/./pdf/cassandra.pdf.

[] M. Fowler and P. J. Sadalage, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pearson Education, Essex, UK, .

[] T. Bloom and T. Sharpe, “Managing data from high-throughput genomic processing: a case study,” in Proceedings of the 13th International Conference on Very Large Data Bases (VLDB ’04), pp. –, .

[] U. Röhm and J. A. Blakeley, “Data management for high throughput genomics,” in Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR ’09), Asilomar, Calif, USA, January , http://www-db.cs.wisc.edu/cidr/cidr/Paper .pdf.

[] R. C. Huacarpuma, A data model for a pipeline of transcriptome high performance sequencing [M.S. thesis], University of Brasília, .

[] A. Bateman and M. Wood, “Cloud computing,” Bioinformatics, vol. , no. , p. , .

[] Z. Ye and S. Li, “A request skew aware heterogeneous distributed storage system based on Cassandra,” in Proceedings of the International Conference on Computer and Management (CAMAN ’11), pp. –, May .

[] G. Wang and J. Tang, “The NoSQL principles and basic application of cassandra model,” in Proceedings of the International Conference on Computer Science and Service System (CSSS ’12), pp. –, August .

[] B. G. Tudorica and C. Bucur, “A comparison between several NoSQL databases with comments and notes,” in Proceedings of the 10th RoEduNet International Conference on Networking in Education and Research (RoEduNet ’11), pp. –, June .

[] Y. Li and S. Manoharan, “A performance comparison of SQL and NoSQL databases,” in Proceedings of the 14th IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing (PACRIM ’13), pp. –, August .

[] J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad, “RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays,” Genome Research, vol. , no. , pp. –, .

[] OpsCenter 4.0 User Guide Documentation, DataStax, , http://www.datastax.com/documentation/opscenter/./pdf/opscuserguide.pdf.

[] DataStax, DataStax Enterprise . Documentation, , http://www.datastax.com/doc-source/pdf/dse.pdf.
