
Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency

Wiley
International Journal of Genomics

Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them refers to management of massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. To find an alternative to the frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB.
Research Article
Evaluating the Cassandra NoSQL Database Approach for
Genomic Data Persistency
Rodrigo Aniceto,1 Rene Xavier,1 Valeria Guimarães,1 Fernanda Hondo,1
Maristela Holanda,1 Maria Emilia Walter,1 and Sérgio Lifschitz2
1 Computer Science Department, University of Brasilia (UNB), 70910-900 Brasilia, DF, Brazil
2 Informatics Department, Pontifical Catholic University of Rio de Janeiro (PUC-Rio),
22451-900 Rio de Janeiro, RJ, Brazil
Correspondence should be addressed to Maristela Holanda; mholanda@cic.unb.br
Received  March ; Accepted  May 
Academic Editor: Che-Lun Hung
Copyright ©  Rodrigo Aniceto et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Advanced hardware and software technologies increase the
speed and efficiency with which scientific workflows may be
performed. Scientists may execute a given workflow many
times, comparing results from these executions and providing
greater accuracy in data analysis. However, handling large
volumes of data produced by distinct program executions
under varied conditions becomes increasingly difficult. These
massive amounts of data must be stored and treated in order
to support current genomic research []. Therefore, one
of the main problems when working with genomic data
refers to the storage and search of these data, requiring many
computational resources.
In computational environments with large amounts of
possibly unconventional data, NoSQL [] database systems
have emerged as an alternative to traditional Relational
Database Management Systems (RDBMS). NoSQL systems
are distributed databases built to meet the demands of high
scalability and fault tolerance in the management and analysis
of massive amounts of data. NoSQL databases are coded
in many distinct programming languages and are generally
available as open-source software.
The objective of this paper is to study the persistency
of genomic data on a particular and widely used NoSQL
database system, namely, Cassandra []. The tests performed
for this study use real genomic data to evaluate insertion
and extraction operations into and from the Cassandra
database. Considering the large amounts of data in current
genome projects, we are particularly concerned with high
performance. We discuss and compare our results with a
relational system (PostgreSQL) and another NoSQL database
system, MongoDB [].
This paper is organized as follows. Section 2 presents a
brief introduction to NoSQL databases and the main features
of the Cassandra database system. We discuss some related work
in Section 3 and we present, in Section 4, the architecture
of the database system. Section 5 discusses the practical
results obtained and Section 6 concludes and suggests future
work.
Hindawi Publishing Corporation
International Journal of Genomics
Volume 2015, Article ID 502795, 7 pages
http://dx.doi.org/10.1155/2015/502795
2. NoSQL Databases: An Overview
Many relevant innovations in data management came from
Web 2.0 applications. However, the techniques and tools
available in relational systems may, sometimes, limit their
deployment. Therefore, some researchers have decided to
develop their own web-scale database solutions [].
NoSQL (not-only SQL) databases have emerged as a
solution to storage scalability issues, parallelism, and
management of large volumes of unstructured data. In general,
NoSQL systems have the following characteristics []: (i)
they are based on a nonrelational data model; (ii) they rely
on distributed processing; (iii) high availability and scalability
are main concerns; and (iv) some are schemaless and have the
ability to handle both structured and unstructured data.
There are four main categories of NoSQL databases [,
]:
(i) Key-value stores: data is stored as key-value pairs.
These systems are similar to dictionaries, where data
is addressed by a single key. Values are isolated
and independent of one another, and relationships are
handled by the application logic.
(ii) Column family database: it defines the data structure
as a predefined set of columns. The super columns
and column family structures can be considered the
database schema.
(iii) Document-based storage: a document store uses the
concept of key-value store. The documents are
collections of attributes and values, where an attribute
can be multivalued. Each document contains an ID
key, which is unique within a collection and identifies
the document.
(iv) Graph databases: graphs are used to represent
schemas. A graph database works with three
abstractions: nodes, relationships between nodes, and
key-value pairs that can attach to nodes and relationships.
2.1. Cassandra Database System. Cassandra is a
cloud-oriented database system, massively scalable, designed to
store a large amount of data from multiple servers, while
providing high availability and consistent data []. It is based
on the architecture of Amazon’s Dynamo [] and also on
Google’s BigTable data model []. Cassandra enables queries
as in a key-value model, where each row has a unique row
key, a feature adopted from Dynamo [,,,]. Cassandra
is considered a hybrid NoSQL database, using characteristics
of both key-value and column oriented databases.
Cassandra’s architecture is made of nodes, clusters, data
centers, and a partitioner. A node is a physical instance of
Cassandra. Cassandra does not use a master-slave
architecture; rather, it uses a peer-to-peer architecture, in which
all nodes are equal. A cluster is a group of nodes or even a
single node. A group of clusters is a data center. A partitioner
is a hash function for computing the token of each row key.
When one row is inserted, a token is calculated, based
on its unique row key. This token determines in what node
that particular row will be stored. Each node of a cluster is
responsible for a range of data based on a token. When the
row is inserted and its token is calculated, this row is stored on
a node responsible for this token. The advantage here is that
multiple rows can be written in parallel into the database, as
each node is responsible for its own write requests. However,
this may be seen as a drawback regarding data extraction,
becoming a bottleneck. The MurMur3Partitioner [] is a
partitioner that uses tokens to assign equal portions of data
to each node. This technique was selected because it provides
fast hashing, and its hash function helps to evenly distribute
data to all the nodes of a cluster.
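The token-based placement described above can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not Cassandra's implementation: MD5 from the standard library stands in for the Murmur3 hash, the token space size is assumed, each node owns one contiguous token range, and the names `token_for`, `build_ring`, and `node_for` are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 64  # assumed token space; a stand-in, not Cassandra's exact range


def token_for(row_key: str) -> int:
    """Hash a row key to a token (MD5 here as a stand-in for Murmur3)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def build_ring(nodes):
    """Give each node an equal slice of the token space.

    Returns a sorted list of (upper_bound_token, node) pairs.
    """
    step = RING_SIZE // len(nodes)
    return [((i + 1) * step - 1, node) for i, node in enumerate(nodes)]


def node_for(ring, row_key: str) -> str:
    """Find the node whose token range contains the key's token."""
    bounds = [upper for upper, _ in ring]
    index = bisect.bisect_left(bounds, token_for(row_key))
    # Tokens above the last bound wrap around to the first node.
    return ring[index % len(ring)][1]


if __name__ == "__main__":
    ring = build_ring(["node-a", "node-b", "node-c", "node-d"])
    counts = {}
    for i in range(10_000):
        owner = node_for(ring, f"row-{i}")
        counts[owner] = counts.get(owner, 0) + 1
    print(counts)  # each node should receive a roughly equal share
```

Because the owner of a row depends only on its key, independent rows can be routed to different nodes and written in parallel, which is the advantage the text points to.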
The main elements of Cassandra are keyspaces, column
families, columns, and rows []. A keyspace contains the
processing steps of the data replication and is similar to a
schema in a relational database. Typically, a cluster has one
keyspace per application. A column family is a set of
key-value pairs containing a column with its unique row keys. A
column is the smallest increment of data, which contains a
name, a value, and a timestamp. Rows are columns with the
same primary key.
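The nesting of these elements can be pictured with plain Python containers. This is only a sketch of the terminology, assuming a keyspace is modeled as a dictionary of column families, each mapping row keys to named columns; the `Column` class and the helpers `put` and `get_row` are hypothetical names, not part of any Cassandra API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Column:
    """Smallest unit of data: a name, a value, and a write timestamp."""
    name: str
    value: str
    timestamp: float = field(default_factory=time.time)


def put(keyspace: dict, family: str, row_key: str, name: str, value: str) -> Column:
    """Store a column under keyspace -> column family -> row key -> column name."""
    row = keyspace.setdefault(family, {}).setdefault(row_key, {})
    row[name] = Column(name, value)
    return row[name]


def get_row(keyspace: dict, family: str, row_key: str) -> dict:
    """Return all columns that share the given row key (empty if absent)."""
    return keyspace.get(family, {}).get(row_key, {})


if __name__ == "__main__":
    biodata = {}  # hypothetical keyspace
    put(biodata, "file1", "1", "c1", "ACGT")
    put(biodata, "file1", "1", "c2", "TTGA")
    print(sorted(get_row(biodata, "file1", "1")))
```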
When a write operation occurs, Cassandra immediately
stores the instruction on the Commit log, which goes into the
hard disk (HD). Data from this write operation is stored at
the memtable, which stays in RAM. Only when a predefined
memory limit is reached, this data is written on SSTables that
stay in the HD. Then, the Commit log and the memtable are
cleaned up [,]. In case of failure regarding the memtables,
Cassandra reexecutes the written instructions available at the
Commit log [,].
When an extract instruction is executed, Cassandra first
searches information in memtables. A large RAM allows large
amounts of data in memtables and less data in HD, resulting
in quick access to information [].
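The write and read paths just described can be mimicked with a toy model. The sketch below is a simplification under stated assumptions: Python lists and dictionaries stand in for the on-disk Commit log and SSTables, the flush threshold is counted in rows rather than bytes, SSTables are searched from newest to oldest, and the class name `ToyNode` is hypothetical.

```python
class ToyNode:
    """Toy model of the write path described above (not Cassandra's code)."""

    def __init__(self, memtable_limit: int = 3):
        self.memtable_limit = memtable_limit  # assumed flush threshold, in rows
        self.commit_log = []   # stands in for the on-disk Commit log
        self.memtable = {}     # stands in for the in-RAM memtable
        self.sstables = []     # each flush produces one immutable dict

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the instruction first
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Persist the memtable as an SSTable, then clear log and memtable."""
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
        self.commit_log.clear()

    def recover(self):
        """After losing the memtable, replay the Commit log to rebuild it."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value

    def read(self, key):
        """Look in the memtable first, then in SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None
```

In this model a read served from the memtable never touches the SSTable list, which mirrors the remark that more RAM leaves less data to be fetched from disk.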
3. Storing Genomic Data
Persistency of genomic data is not a recent problem. In ,
Bloom and Sharpe [] described the difficulties of managing
these data. One of the main difficulties was the growing
amount of data generated by the queries. The works of Röhm
and Blakeley [] and Huacarpuma [] consider relational
databases (SQL Server  and PostgreSQL, resp.) to store
genomic data in FASTQ format.
Bateman and Wood [] have suggested using NoSQL
databases as a good alternative to persisting genetic data.
However, no practical results are given. Ye and Li []
proposed the use of Cassandra as a storage system. They
consider multiple nodes so that there were no gaps in the
consistency of the data. Wang and Tang [] indicated some
instructions for creating an application to perform data
operations in Cassandra.
Tudorica and Bucur [] compared some NoSQL
databases to a MySQL relational database using the YCSB
(Yahoo! Cloud Serving Benchmark). They conclude that in
an environment where write operations prevail MySQL has
a significantly higher latency when compared to Cassandra.
Similar results about performance improvements for writing
operations in Cassandra, when compared to MS SQL
Express, were also reported by Li and Manoharan [].
Many research works [] present results involving
the performance of a Cassandra database system for massive
data volumes. In this paper, we have decided to evaluate the
performance of Cassandra NoSQL database system
specifically for genomic data.
4. Case Study
To validate our case study we have used real data. The
sequences (also called reads) were obtained from liver and
kidney tissue samples of one human male from the SRA-NCBI
(http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?), sequenced
by the Illumina Genome Analyzer. It produced ,,
sequences for the kidney samples and ,, sequences
for the liver samples, each sequence containing  bases.
Marioni et al. [] generated these sequences.
A FASTQ file stores sequences of nucleotides and their
corresponding quality values. Three files were obtained from
filtered sequences sampled from kidney cells, and another
three files consisted of filtered genomic sequences sampled
from liver cells. It should be noted that these data were
selected because they were in FASTQ [] format, which is
commonly used in bioinformatics workflows.
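For readers unfamiliar with the format, a FASTQ record conventionally spans four lines: an identifier line starting with "@", the nucleotide sequence, a separator line starting with "+", and a quality string with one character per base. The following Python sketch parses such records; it assumes unwrapped records (exactly four lines each), and the names `Read` and `parse_fastq` are hypothetical, not taken from the client described in this paper.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield one Read per four-line FASTQ record."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        header = next(stripped, None)
        # Stop at the end of the input or at a blank line.
        if header is None or header == "":
            return
        sequence = next(stripped, "")
        separator = next(stripped, "")
        quality = next(stripped, "")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield Read(header[1:], sequence, quality)


if __name__ == "__main__":
    sample = ["@read1\n", "GATTACA\n", "+\n", "IIIIIII\n"]
    for read in parse_fastq(sample):
        print(read.identifier, read.sequence, read.quality)
```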
In this case study, we carried out three analyses. In the first
one, we investigated how Cassandra behaves when the
computational environment is composed of a cluster with two and
four computers. In the second one, we analyze the behavior of
Cassandra compared to PostgreSQL, a relational database. In
the last case study, we used the MongoDB document-oriented
NoSQL database to compare to Cassandra’s results.
4.1. Cloud Environment Architecture. In order to investigate
the expected advantages of Cassandra’s scalability, we have
created two cloud environments: one with two nodes and
the other with four nodes. Cassandra was installed on every
node of the cluster. We have also used OpsCenter 4.0 [],
a DSE tool that implements a browser-based interface to
remotely manage the cluster configuration and architecture.
The architecture contains a single data center, named DC. A
single cluster, named BIOCluster, containing the nodes, was
created, working with DC.
4.2. Java Client. At the software level, we have defined the
following functional requirements: (i) create a keyspace; (ii)
create a table to store a FASTQ file; (iii) create a table with
the names of inserted FASTQ files and their corresponding
metadata; (iv) receive an input file containing data from
a FASTQ file and insert it into a previously created table,
followed by the file name and metadata; (v) extract all data
from a table containing the contents of a FASTQ file; and (vi)
remove the table and the keyspace.
Nonfunctional requirements were also defined: (i) the
use of Java API, provided by DataStax, in order to have a
better integration between the Cassandra distribution and the
developed client application; (ii) the use of Cassandra Query
Language (CQL) [], for database interactions, which is the
current query language of Cassandra and resembles SQL; (iii)
conversion to JSON files to be used by the client application,
since it is simpler to work with JSON files in Java; and (iv) a
good performance in operations.
With respect to this last requirement, three applications
were developed, two for data conversion and one client
application for Cassandra.
() FastqTojson Application converts the FASTQ input
le into smaller JSON les, each JSON le with ve
hundredthousandreads.eobjectiveistoloadthese
smallJSONlesbecause,usually,FASTQleoccupies
a few gigabytes. Furthermore, as it presents a proper
format for the Java client, it does not consume many
computational resources. Each JSON le occupies ten
thousand rows in the database: each row is an array of
ten columns; each eld value of the column contains
ve reads.
() Cassandra client was also developed in Java, using the
JavaAPIprovidedbyDataStaxandistheoneinwhich
the data persists. is client creates a keyspace, inserts
all JSON les from the rst application in a single
table, and extracts the data from a table.
For the database schema, it consists of a single
keyspace, called biodata,asinglecluster,calledbio-
cluster, one table of metadata and one table for each
lepersisting,asshowninFigure .
The allocation strategy for replicas and the
replication factor are properties of the keyspace. The
allocation strategy determines whether or not data is
distributed through a network of different clusters.
The Simple Strategy [] was selected since this case
study was performed in a single cluster. Likewise,
since we did not consider failures and our goal was
to study performance rather than fault recovery, we
have chosen a replication factor of one. It should be noted
that the replication factor determines the number
of replicas distributed along the cluster. Focusing on
performance, a higher number of replicas would also
interfere with the insertion time.
As previously mentioned, the client application
creates a table for each inserted FASTQ file, which has
the same name as the file. Each of these tables has
eleven columns, and each cell stores a small part
of a JSON file, ten reads per cell, which is about
 MB in size. This small set of columns and cells
is due to the efficiency of Cassandra when a small
number of columns and a large number of
rows are used. This is also a consequence of the ability of
MurMur3Partitioner to distribute each row in one
node. Therefore, the cluster has a better load balance
during insertions and extractions.
Once a table is created, the client inserts into the database
all data from the JSON files produced in the first stage, as shown in
Figure . In what follows, a single row is inserted into
the metadata table containing as a row key the name
of the FASTQ file and a column with the number of rows.
This latter is inserted into the metadata table to solve
the memory limit of the Java Virtual Machine, which
may happen when querying large tables.
Figure : Database schema (cluster BIOCluster; keyspace bioData; a metadata table keyed by file name and holding the number of rows; one table per FASTQ file, with rows keyed 1, 2, 3, and so on, each holding values A through J).
Figure : Stages of insertion (FASTQ file, FastqToJson, arq1.json through arqN.json, Cassandra client).
When extracting data, the client queries the metadata
table to get the number of rows on the table with the
FASTQ data and then proceeds to the table extraction,
which is done row by row and written into an “.out”
file.
() OutToJso Application. Aer data extraction, there is
a single le with the extension “.out.” is application
converts this le into a FASTQ format, making it
identical to the original input le, resulting only in the
FAS TQ le wit h out te mpora r y  l e “.ou t.” is pro c e ss
is shown in Figure .
Figure : Stages of extraction (Cassandra client, Data.out, OutToFastq, Data.fastq).
5. Results
In this work, we have considered three experimental case
studies to evaluate data consistency and performance for
storing and extracting genomic data. For the first one, we
verified Cassandra’s scalability and variation in performance.
For the second case study, we compared the Cassandra results
to a PostgreSQL relational system and, finally, we used the
MongoDB NoSQL database and compared its results to those of the
Cassandra NoSQL system. The case studies used the same
data to insert and read sequences.
During the Cassandra evaluation, we created two
clusters: the first with two computers and the second
with four computers. The first cluster consisted
of two computers with Intel Xeon E-/. GHz processor,
one with  GB RAM and the other with  GB RAM. For the
second cluster, besides the same two computers, two other
computers with Intel Core i processor and  GB RAM were
included. Each one of them used Ubuntu ..
5.1. Insertions and Extractions Cassandra NoSQL. The input
files are six FASTQ files with filtered data from kidney and
liver cells. Table  shows the sizes of the files and the number
of rows that their respective JSON files had when inserted into
Cassandra.
Table : Cells files.
File                  File number   Size    Number of lines
Liver cells files                   , GB    .
                                    , GB    .
                                    , GB    .
Kidney cells files                  , GB    .
                                    , GB    .
                                    , GB    .
We have based the performance analyses on the elapsed
time to store (insert) data into and to retrieve (extract) data
from the database. These elapsed times are important because
if one wants to use the Cassandra system in bioinformatics
workflows, it is necessary to know how long it takes for the data
to become available to execute each program.
Table  shows the elapsed times to insert and extract
sequences in the database, with both implementations.
Columns  and  show the insertions using two nodes.
Similarly, columns  and  show the extractions using four
nodes. As expected, we could confirm the hypothesis that the
database performance increases when we add more nodes.
Figures  and  show comparative charts of insertion and
extraction elapsed times according to the number of
computers that Cassandra considers. Insertion into two computers
takes longer than into four computers. Here the performance
also improves when the number of computers increases in the
cluster.
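A comparison of this kind reduces to a relative reduction in elapsed time between two runs. The helper below is a minimal sketch of that arithmetic, assuming times written in the "h m s ms" notation used in the tables; the function names are hypothetical, and the values in the usage example are invented for illustration and are not measurements from this study.

```python
import re

_UNITS_IN_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def to_seconds(elapsed: str) -> float:
    """Convert a string such as '1 h 2 m 3 s 40 ms' to seconds."""
    parts = re.findall(r"(\d+)\s*(ms|h|m|s)\b", elapsed)
    if not parts:
        raise ValueError(f"no time components found in {elapsed!r}")
    return sum(int(value) * _UNITS_IN_SECONDS[unit] for value, unit in parts)


def relative_gain(baseline: str, improved: str) -> float:
    """Percentage reduction in elapsed time relative to the baseline."""
    before = to_seconds(baseline)
    after = to_seconds(improved)
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Invented example values, not results from the paper.
    print(relative_gain("10 m 0 s", "7 m 30 s"))
```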
5.2. Comparison of Relational and Cassandra NoSQL Systems.
We compared the Cassandra results with Huacarpuma [],
who used the same data to insert and read sequences in
PostgreSQL, a relational database. In the latter experiment,
the author used only one server with an Intel Xeon processor,
eight cores of . GHz and  GB RAM, executing Linux
Server Ubuntu/Linaro ..-.
The server’s RAM for the relational database is larger than
the sum of the memories of the four computers used in this
experiment. Nonetheless, we use the results of the relational
database to demonstrate that it is possible to achieve high
performance even with modest hardware due to scalability
and parallelism.
Table  shows the sum of the insertion and extraction
times in the relational database and the two computational
environments using Cassandra: Cassandra (2), a cluster
with two computers, and Cassandra (4), a cluster with four
computers.
The writing time in Cassandra is lower due to parallelism,
as seen in Table . Write actions in Cassandra are more
effective than in a relational database. However, its performance
was lower for query answering, as shown in Figure . This
is due to two factors: first, Cassandra had to ensure that the
returned content was in its latest version, verifying the data
divided between machines; second, the data size is larger than
the available RAM; therefore, part of the data had to be stored
in SSTable, reducing the speed of the search.
Figure : Comparison between inserts (time × file number); insertion time (min) for files 1 to 6; series: Cassandra (2), Cassandra (4).
Figure : Comparison between extractions (time × file number); extraction time (min) for files 1 to 6; series: Cassandra (2), Cassandra (4).
The reader should note that the results obtained with
Cassandra just indicate a trend. They are not conclusive
because the hardware characteristics of all experiments are
different.
Nevertheless, the improved performance with the
increase of nodes is an indication that Cassandra may some-
times surpass relational database systems in a larger number
of computers, making its use viable in data searches in
bioinformatics.
5.3. Comparison of MongoDB and Cassandra NoSQL Databases.
We compared the Cassandra results with those obtained by using
the same data to insert and read sequences in a MongoDB NoSQL
database. This is an open-source document-oriented NoSQL database
designed to store large amounts of data.
The server where we have installed MongoDB has an i
processor with  GB RAM. This server had  GB RAM
more than the cluster with two computers, Cassandra (2), and
 GB RAM less than the sum of the RAM of the four
computers, Cassandra (4).
Table : Times to insert and extract sequences from the database.
File size   Insertion                        Extraction
            Cassandra ()    Cassandra ()     Cassandra ()    Cassandra ()
, GB         m  s  ms        m  s  ms         m  s  ms        m  s  ms
, GB         m  s  ms        m  s  ms         m  s  ms        m  s  ms
, GB         m  s  ms        m  s  ms         m  s  ms        m  s  ms
, GB         m  s  ms        m  s  ms         m  s  ms        m  s  ms
, GB         m  s  ms        m  s  ms         m  s  ms        m  s  ms
, GB         m  s  ms        m  s  ms         m  s  ms        m  s  ms
Table : PostgreSQL and Cassandra results.
Database        Insertion      Extraction
PostgreSQL       h  m  s        m  s
Cassandra ()     m  s           h  m  s
Cassandra ()     m  s           m  s
Figure : Comparison between Cassandra and PostgreSQL; insertion and extraction times (min) for PostgreSQL, Cassandra (2), and Cassandra (4).
Table  shows the sum of the insertion and extraction
times in the MongoDB database and the Cassandra with
two and four computers in a cluster. The performances of
insertion operations were similar using either MongoDB or
Cassandra databases. However, the MongoDB showed better
behavior than Cassandra NoSQL in the extraction of genomic
data in FASTQ format.
In Figure  our results suggest that there is a similar
behavior of the insertions in both MongoDB and Cassandra.
There was a performance gain of more than % in the
extraction, when comparing the results of a Cassandra in
a cluster with two computers and another cluster with four
computers.
Table : MongoDB and Cassandra final results.
Database        Insertion      Extraction
MongoDB          m  s           m  s
Cassandra ()     m  s           h  m  s
Cassandra ()     m  s           m  s
Figure : Comparison between Cassandra and MongoDB database; insertion and extraction times (min) for MongoDB, Cassandra (2), and Cassandra (4).
6. Conclusions
In this work we studied genomic data persistence, with
the implementation of a NoSQL database using Cassandra.
We have observed that it presented a high performance
for writing operations due to the larger number of massive
insertions compared to data extractions. We used the DSE
tool together with Cassandra, which allowed us to create a
cluster and a client application suitable for the expected data
manipulation.
Our results suggest that there is a reduction of the
insertion and query times when more nodes are added in
Cassandra. There was a performance gain of about % in the
insertions and a gain of % in reading, when comparing the
results of a cluster with two computers and another cluster
with four computers.
Comparing the performance of Cassandra to that of the
MongoDB database, the results indicate that MongoDB
performs better than Cassandra for data extraction. For data
insertions the behaviors of Cassandra and MongoDB were
similar.
From the results presented here, it is possible to outline
new approaches in studies of persistency regarding genomic
data. Positive results could boost new research, for example,
the creation of a similar application using other NoSQL
databases or new tests using Cassandra with different
hardware configurations seeking improvements in performance.
It is also possible to create a relational database with hardware
settings identical to Cassandra, in order to make more
detailed comparisons.
Conflict of Interests
The authors declare that there is no conflict of interests
regarding the publication of this paper.
References
[] S. A. Simon, J. Zhai, R. S. Nandety et al., “Short-read sequencing technologies for transcriptional analyses,” Annual Review of Plant Biology, vol. , no. , pp. –, .
[] M. L. Metzker, “Sequencing technologies—the next generation,” Nature Reviews Genetics, vol. , no. , pp. –, .
[] C.-L. Hung and G.-J. Hua, “Local alignment tool based on Hadoop framework and GPU architecture,” BioMed Research International, vol. , Article ID ,  pages, .
[] Y.-C. Lin, C.-S. Yu, and Y.-J. Lin, “Enabling large-scale biomedical analysis in the cloud,” BioMed Research International, vol. , Article ID ,  pages, .
[] K. Kaur and R. Rani, “Modeling and querying data in NoSQL databases,” in Proceedings of the IEEE International Conference on Big Data, pp. –, October .
[] A. Lakshman and P. Malik, “Cassandra: a decentralized structured storage system,” Operating Systems Review, vol. , no. , pp. –, .
[] K. Chodorow, MongoDB—The Definitive Guide, O’Reilly, nd edition, .
[] R. Hecht and S. Jablonski, “NoSQL evaluation: a use case oriented survey,” in Proceedings of the International Conference on Cloud and Service Computing (CSC ’11), pp. –, December .
[] Y. Muhammad, Evaluation and implementation of distributed NoSQL database for MMO gaming environment [M.S. thesis], Uppsala University, .
[] C. J. M. Tauro, S. Aravindh, and A. B. Shreeharsha, “Comparative study of the new generation, agile, scalable, high performance NOSQL databases,” International Journal of Computer Applications, vol. , no. , pp. –, .
[] R. P. Padhy, M. Patra, and S. C. Satapathy, “RDBMS to NoSQL: reviewing some next-generation non-relational databases,” International Journal of Advanced Engineering Science and Technologies, vol. , no. , pp. –, .
[] M. Bach and A. Werner, “Standardization of NoSQL database languages,” in Beyond Databases, Architectures, and Structures: 10th International Conference, BDAS 2014, Ustron, Poland, May 27–30, 2014. Proceedings, vol.  of Communications in Computer and Information Science, pp. –, Springer, Berlin, Germany, .
[] M. Indrawan-Santiago, “Database research: are we at a crossroad? Reflection on NoSQL,” in Proceedings of the 15th International Conference on Network-Based Information Systems (NBIS ’12), pp. –, IEEE, Melbourne, Australia, September .
[] G. DeCandia, D. Hastorun, M. Jampani et al., “Dynamo: Amazon’s highly available key-value store,” in Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), pp. –, ACM, October .
[] F. Chang, J. Dean, S. Ghemawat et al., “Bigtable: a distributed storage system for structured data,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06), pp. –, .
[] E. Hewitt, Cassandra—The Definitive Guide, O’Reilly, st edition, .
[] M. Klems, D. Bermbach, and R. Weinert, “A runtime quality measurement framework for cloud database service systems,” in Proceedings of the 8th International Conference on the Quality of Information and Communications Technology (QUATIC ’12), pp. –, September .
[] V. Parthasarathy, Learning Cassandra for Administrators, Packt Publishing, Birmingham, UK, .
[] DataStax, Apache Cassandra . Documentation, , http://www.datastax.com/documentation/cassandra/./pdf/cassandra.pdf.
[] M. Fowler and P. J. Sadalage, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pearson Education, Essex, UK, .
[] T. Bloom and T. Sharpe, “Managing data from high-throughput genomic processing: a case study,” in Proceedings of the 13th International Conference on Very Large Data Bases (VLDB ’04), pp. –, .
[] U. Röhm and J. A. Blakeley, “Data management for high throughput genomics,” in Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR ’09), Asilomar, Calif, USA, January , http://www-db.cs.wisc.edu/cidr/cidr/Paper .pdf.
[] R. C. Huacarpuma, A data model for a pipeline of transcriptome high performance sequencing [M.S. thesis], University of Brasília, .
[] A. Bateman and M. Wood, “Cloud computing,” Bioinformatics, vol. , no. , p. , .
[] Z. Ye and S. Li, “A request skew aware heterogeneous distributed storage system based on Cassandra,” in Proceedings of the International Conference on Computer and Management (CAMAN ’11), pp. –, May .
[] G. Wang and J. Tang, “The NoSQL principles and basic application of cassandra model,” in Proceedings of the International Conference on Computer Science and Service System (CSSS ’12), pp. –, August .
[] B. G. Tudorica and C. Bucur, “A comparison between several NoSQL databases with comments and notes,” in Proceedings of the 10th RoEduNet International Conference on Networking in Education and Research (RoEduNet ’11), pp. –, June .
[] Y. Li and S. Manoharan, “A performance comparison of SQL and NoSQL databases,” in Proceedings of the 14th IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing (PACRIM ’13), pp. –, August .
[] J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad, “RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays,” Genome Research, vol. , no. , pp. –, .
[] OpsCenter 4.0 User Guide Documentation, DataStax, , http://www.datastax.com/documentation/opscenter/./pdf/opscuserguide.pdf.
[] DataStax, DataStax Enterprise . Documentation, , http://www.datastax.com/doc-source/pdf/dse.pdf.