Content available from International Journal of Genomics
Research Article
Evaluating the Cassandra NoSQL Database Approach for
Genomic Data Persistency
Rodrigo Aniceto,1 Rene Xavier,1 Valeria Guimarães,1 Fernanda Hondo,1
Maristela Holanda,1 Maria Emilia Walter,1 and Sérgio Lifschitz2
1Computer Science Department, University of Brasilia (UNB), 70910-900 Brasilia, DF, Brazil
2Informatics Department, Pontifical Catholic University of Rio de Janeiro (PUC-Rio),
22451-900 Rio de Janeiro, RJ, Brazil
Correspondence should be addressed to Maristela Holanda; mholanda@cic.unb.br
Received March ; Accepted May
Academic Editor: Che-Lun Hung
Copyright © Rodrigo Aniceto et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics.
One of them refers to management of massive amounts of data generated by automatic sequencers. We need to deal with the
persistency of genomic data, particularly storing and analyzing these large-scale processed data. To find an alternative to the
frequently considered relational database model becomes a compelling task. Other data models may be more effective when dealing
with a very large amount of nonconventional data, especially for writing and retrieving operations. In this paper, we discuss the
Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real
data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and
another NoSQL database approach, MongoDB.
1. Introduction
Advanced hardware and software technologies increase the
speed and efficiency with which scientific workflows may be
performed. Scientists may execute a given workflow many
times, comparing results from these executions and providing
greater accuracy in data analysis. However, handling large
volumes of data produced by distinct program executions
under varied conditions becomes increasingly difficult. These
massive amounts of data must be stored and treated in order
to support current genomic research [–]. Therefore, one
of the main problems when working with genomic data
refers to the storage and search of these data, requiring many
computational resources.
In computational environments with large amounts of
possibly unconventional data, NoSQL [] database systems
have emerged as an alternative to traditional Relational
Database Management Systems (RDBMS). NoSQL systems
are distributed databases built to meet the demands of high
scalability and fault tolerance in the management and analysis
of massive amounts of data. NoSQL databases are coded
in many distinct programming languages and are generally
available as open-source software.
The objective of this paper is to study the persistency
of genomic data on a particular and widely used NoSQL
database system, namely, Cassandra []. The tests performed
for this study use real genomic data to evaluate insertion
and extraction operations into and from the Cassandra
database. Considering the large amounts of data in current
genome projects, we are particularly concerned with high
performance. We discuss and compare our results with a
relational system (PostgreSQL) and another NoSQL database
system, MongoDB [].
This paper is organized as follows. Section 2 presents a
brief introduction to NoSQL databases and the main features
of the Cassandra database system. We discuss some related work
in Section 3 and we present, in Section 4, the architecture
of the database system. Section 5 discusses the practical
results obtained and Section 6 concludes and suggests future
work.
Hindawi Publishing Corporation
International Journal of Genomics
Volume 2015, Article ID 502795, 7 pages
http://dx.doi.org/10.1155/2015/502795
2. NoSQL Databases: An Overview
Many relevant innovations in data management came from
Web 2.0 applications. However, the techniques and tools
available in relational systems may, sometimes, limit their
deployment. Therefore, some researchers have decided to
develop their own web-scale database solutions [].
NoSQL (not-only SQL) databases have emerged as a
solution to storage scalability issues, parallelism, and management
of large volumes of unstructured data. In general,
NoSQL systems have the following characteristics [–]: (i)
they are based on a nonrelational data model; (ii) they rely
on distributed processing; (iii) high availability and scalability
are main concerns; and (iv) some are schemaless and have the
ability to handle both structured and unstructured data.
There are four main categories of NoSQL databases [,–]:
(i) Key-value stores: data is stored as key-value pairs.
These systems are similar to dictionaries, where data
is addressed by a single key. Values are isolated
and independent from one another, and relationships are
handled by the application logic.
(ii) Column family database: it defines the data structure
as a predefined set of columns. The super columns
and column family structures can be considered the
database schema.
(iii) Document-based storage: a document store uses the
concept of key-value store. The documents are collections
of attributes and values, where an attribute
can be multivalued. Each document contains an ID
key, which is unique within a collection and identifies
the document.
(iv) Graph databases: graphs are used to represent
schemas. A graph database works with three abstractions:
node, relationships between nodes, and key-value
pairs that can attach to nodes and relationships.
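To make the four categories concrete, the following minimal Python sketch expresses one sequencing read in each data model using plain dictionaries and lists. The identifiers, field names, and values are invented for illustration and do not come from any particular database product.

```python
# One sequencing read expressed in each of the four NoSQL data models.
# All names and values are hypothetical and serve only as illustration.

read = {"id": "read_1", "sequence": "GATTACA", "quality": "IIIIIII"}

# (i) Key-value store: an opaque value addressed by a single key.
key_value_store = {read["id"]: f'{read["sequence"]}|{read["quality"]}'}

# (ii) Column family: a row key maps to a set of named columns.
column_family = {
    read["id"]: {"sequence": read["sequence"], "quality": read["quality"]}
}

# (iii) Document store: a document with a unique ID whose attributes
# may be multivalued (here, "tags").
document = {
    "_id": read["id"],
    "sequence": read["sequence"],
    "quality": read["quality"],
    "tags": ["liver", "filtered"],
}

# (iv) Graph: nodes, relationships between nodes, and key-value
# properties attached to both.
nodes = {
    "read_1": {"sequence": read["sequence"]},
    "sample_liver": {"tissue": "liver"},
}
relationships = [("read_1", "SAMPLED_FROM", "sample_liver", {"filtered": True})]
```

In the key-value form the application must parse the value itself, whereas the column family and document forms expose individual fields to the database.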
2.1. Cassandra Database System. Cassandra is a cloud-oriented
database system, massively scalable, designed to
store a large amount of data from multiple servers, while
providing high availability and consistent data []. It is based
on the architecture of Amazon’s Dynamo [] and also on
Google’s BigTable data model []. Cassandra enables queries
as in a key-value model, where each row has a unique row
key, a feature adopted from Dynamo [,,,]. Cassandra
is considered a hybrid NoSQL database, using characteristics
of both key-value and column oriented databases.
Cassandra’s architecture is made of nodes, clusters, data
centers, and a partitioner. A node is a physical instance of
Cassandra. Cassandra does not use a master-slave architecture;
rather, it uses a peer-to-peer architecture, in which
all nodes are equal. A cluster is a group of nodes or even a
single node. A group of clusters is a data center. A partitioner
is a hash function for computing the token of each row key.
When one row is inserted, a token is calculated, based
on its unique row key. This token determines in what node
that particular row will be stored. Each node of a cluster is
responsible for a range of data based on a token. When the
row is inserted and its token is calculated, this row is stored on
the node responsible for this token. The advantage here is that
multiple rows can be written in parallel into the database, as
each node is responsible for its own write requests. However,
this may be seen as a drawback regarding data extraction,
becoming a bottleneck. The Murmur3Partitioner [] is a
partitioner that uses tokens to assign equal portions of data
to each node. This technique was selected because it provides
fast hashing, and its hash function helps to evenly distribute
data to all the nodes of a cluster.
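The token mechanism described above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: MD5 from the standard library stands in for the Murmur3 hash, the token space is taken as the unsigned range 0 to 2^64 - 1, and the node names and the helper functions `token_for`, `build_ring`, and `node_for` are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 64  # size of the token space assumed in this sketch


def token_for(row_key: str) -> int:
    """Hash a row key to a token (MD5 stands in for Murmur3 here)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes):
    """Assign each node the upper bound of an equal slice of the token space."""
    step = RING_SIZE // len(nodes)
    return [((i + 1) * step - 1, node) for i, node in enumerate(nodes)]


def node_for(row_key, ring):
    """Return the node whose token range contains the token of the row key."""
    bounds = [bound for bound, _ in ring]
    index = bisect.bisect_left(bounds, token_for(row_key))
    return ring[index % len(ring)][1]  # wrap around past the last bound


ring = build_ring(["node1", "node2", "node3", "node4"])
counts = {}
for i in range(10_000):
    owner = node_for(f"row-{i}", ring)
    counts[owner] = counts.get(owner, 0) + 1
print(counts)  # expect a roughly even split across the four nodes
```

Because the hash spreads keys uniformly over the token space, equal token ranges translate into a near-equal share of rows per node, which is the load-balancing property the text attributes to the partitioner.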
The main elements of Cassandra are keyspaces, column
families, columns, and rows []. A keyspace contains the
processing steps of the data replication and is similar to a
schema in a relational database. Typically, a cluster has one
keyspace per application. A column family is a set of key-value
pairs containing a column with its unique row keys. A
column is the smallest increment of data, which contains a
name, a value, and a timestamp. Rows are columns with the
same primary key.
When a write operation occurs, Cassandra immediately
stores the instruction in the commit log, which goes to the
hard disk (HD). Data from this write operation is stored in
the memtable, which stays in RAM. Only when a predefined
memory limit is reached is this data written to SSTables, which
stay on the HD. Then, the commit log and the memtable are
cleaned up [,]. In case of failure regarding the memtables,
Cassandra reexecutes the written instructions available in the
commit log [,].
When an extract instruction is executed, Cassandra first
searches for information in memtables. A large RAM allows large
amounts of data in memtables and less data on the HD, resulting
in quick access to information [].
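The write and read paths just described can be sketched as a toy in-memory model. This is not Cassandra code: the class `ToyNode`, its method names, and the row-count flush threshold are assumptions made only to show the order of the steps (commit log, memtable, flush to SSTable, replay after failure, memory-first reads).

```python
class ToyNode:
    """Toy model of one node's write and read path (not Cassandra code)."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # flush threshold, in rows
        self.commit_log = []  # stands in for the on-disk commit log
        self.memtable = {}    # stands in for the in-RAM memtable
        self.sstables = []    # stands in for on-disk SSTables, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the instruction
        self.memtable[key] = value            # 2. store the data in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush once the limit is hit

    def flush(self):
        """Write the memtable to a new SSTable, then clean up."""
        self.sstables.append(dict(self.memtable))
        self.memtable.clear()
        self.commit_log.clear()

    def recover(self):
        """Rebuild a lost memtable by replaying the commit log."""
        self.memtable = {}
        for key, value in self.commit_log:
            self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:               # memory is searched first
            return self.memtable[key]
        for table in reversed(self.sstables):  # then disk, newest first
            if key in table:
                return table[key]
        return None


node = ToyNode(memtable_limit=3)
for i in range(4):
    node.write(f"row-{i}", f"reads-{i}")
```

After the four writes above, the first three rows sit in one flushed table and the fourth is still in memory, so a read of `row-3` is served from the memtable while `row-0` requires a lookup in the flushed data.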
3. Storing Genomic Data
Persistency of genomic data is not a recent problem. In ,
Bloom and Sharpe [] described the difficulties of managing
these data. One of the main difficulties was the growing
amount of data generated by the queries. The works of Röhm
and Blakeley [] and Huacarpuma [] consider relational
databases (SQL Server and PostgreSQL, resp.) to store
genomic data in FASTQ format.
Bateman and Wood [] have suggested using NoSQL
databases as a good alternative for persisting genetic data.
However, no practical results are given. Ye and Li []
proposed the use of Cassandra as a storage system. They
consider multiple nodes so that there are no gaps in the
consistency of the data. Wang and Tang [] indicated some
instructions for creating an application to perform data
operations in Cassandra.
Tudorica and Bucur [] compared some NoSQL
databases to a MySQL relational database using the YCSB
(Yahoo! Cloud Serving Benchmark). They conclude that in
an environment where write operations prevail MySQL has
a significantly higher latency when compared to Cassandra.
Similar results about performance improvements for writing
operations in Cassandra, when compared to MS SQL
Express, were also reported by Li and Manoharan [].
Many research works [–] present results involving
the performance of a Cassandra database system for massive
data volumes. In this paper, we have decided to evaluate the
performance of the Cassandra NoSQL database system specifically
for genomic data.
4. Case Study
To validate our case study we have used real data. The
sequences (also called reads) were obtained from liver and
kidney tissue samples of one human male from the SRA-NCBI
(http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?), sequenced
by the Illumina Genome Analyzer. It produced ,,
sequences for the kidney samples and ,, sequences
for the liver samples, each sequence containing bases.
Marioni et al. [] generated these sequences.
A FASTQ file stores sequences of nucleotides and their
corresponding quality values. Three files were obtained from
filtered sequences sampled from kidney cells, and another
three files consisted of filtered genomic sequences sampled
from liver cells. It should be noted that these data were
selected because they were in FASTQ [] format, which is
commonly used in bioinformatics workflows.
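For readers unfamiliar with the format, the sketch below parses FASTQ text in Python, assuming the common four-line layout per read (an `@` header, the nucleotide sequence, a `+` separator, and a quality string of the same length as the sequence). The names `FastqRecord` and `parse_fastq` and the sample reads are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if quality is None:
            raise ValueError(f"truncated FASTQ record near {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


sample = [
    "@read_1\n", "GATTACA\n", "+\n", "IIIIIII\n",
    "@read_2\n", "ACGT\n", "+\n", "!!!!\n",
]
records = list(parse_fastq(sample))
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, so a file of several gigabytes is read one record at a time rather than loaded whole.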
In this case study, we carried out three analyses. In the first
one, we investigated how Cassandra behaves when the computational
environment is composed of a cluster with two and
four computers. In the second one, we analyzed the behavior of
Cassandra compared to PostgreSQL, a relational database. In
the last one, we used the MongoDB document-oriented
NoSQL database to compare to Cassandra’s results.
4.1. Cloud Environment Architecture. In order to investigate
the expected advantages of Cassandra’s scalability, we have
created two cloud environments: one with two nodes and
the other with four nodes. Cassandra was installed on every
node of the cluster. We have also used OpsCenter . [],
a DSE tool that implements a browser-based interface to
remotely manage the cluster configuration and architecture.
The architecture contains a single data center, named DC. A
single cluster, named BIOCluster, containing the nodes, was
created, working with DC.
4.2. Java Client. At the software level, we have defined the
following functional requirements: (i) create a keyspace; (ii)
create a table to store a FASTQ file; (iii) create a table with
the names of inserted FASTQ files and their corresponding
metadata; (iv) receive an input file containing data from
a FASTQ file and insert it into a previously created table,
followed by the file name and metadata; (v) extract all data
from a table containing the contents of a FASTQ file; and (vi)
remove the table and the keyspace.
Nonfunctional requirements were also defined: (i) the
use of the Java API, provided by DataStax, in order to have a
better integration between the Cassandra distribution and the
developed client application; (ii) the use of Cassandra Query
Language (CQL) [], for database interactions, which is the
current query language of Cassandra and resembles SQL; (iii)
conversion to JSON files to be used by the client application,
since it is simpler to work with JSON files in Java; and (iv)
good performance in operations.
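As an illustration of how the functional requirements map onto CQL, the following Python sketch builds one statement per requirement as a plain string, without contacting a database. It is not the authors' client, which was written in Java. The table layout (an integer `row_id` key plus text cells `c0` to `c9`), the `metadata` column names, and all function names are assumptions; the `SimpleStrategy` class with a replication factor of one matches the configuration the case study reports.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def _ident(name: str) -> str:
    """Accept only simple unquoted identifiers to keep the CQL well formed."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def create_keyspace(keyspace: str, replication_factor: int = 1) -> str:
    # (i) Create a keyspace.
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {_ident(keyspace)} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {int(replication_factor)}}}"
    )


def create_fastq_table(keyspace: str, table: str, cells: int = 10) -> str:
    # (ii) Hypothetical layout: an integer row key plus text cells.
    columns = ", ".join(f"c{i} text" for i in range(cells))
    return (
        f"CREATE TABLE IF NOT EXISTS {_ident(keyspace)}.{_ident(table)} "
        f"(row_id int PRIMARY KEY, {columns})"
    )


def create_metadata_table(keyspace: str) -> str:
    # (iii) One row per inserted file: its name and its number of rows.
    return (
        f"CREATE TABLE IF NOT EXISTS {_ident(keyspace)}.metadata "
        "(file_name text PRIMARY KEY, row_count int)"
    )


def insert_row(keyspace: str, table: str, cells: int = 10) -> str:
    # (iv) The "?" markers are bound by the driver at execution time.
    names = ", ".join(["row_id"] + [f"c{i}" for i in range(cells)])
    marks = ", ".join(["?"] * (cells + 1))
    return f"INSERT INTO {_ident(keyspace)}.{_ident(table)} ({names}) VALUES ({marks})"


def select_row(keyspace: str, table: str) -> str:
    # (v) Extract one row by key.
    return f"SELECT * FROM {_ident(keyspace)}.{_ident(table)} WHERE row_id = ?"


def drop_all(keyspace: str, table: str) -> list:
    # (vi) Remove the table and then the keyspace.
    return [
        f"DROP TABLE IF EXISTS {_ident(keyspace)}.{_ident(table)}",
        f"DROP KEYSPACE IF EXISTS {_ident(keyspace)}",
    ]


print(create_keyspace("biodata"))
```

Keeping the statements as parameterized strings lets a driver prepare them once and reuse them for every row, which matters when millions of reads are inserted.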
With respect to this last requirement, three applications
were developed, two for data conversion and one client
application for Cassandra.
(1) The FastqToJson application converts the FASTQ input
file into smaller JSON files, each JSON file with five
hundred thousand reads. The objective is to load these
small JSON files because, usually, a FASTQ file occupies
a few gigabytes. Furthermore, as it presents a proper
format for the Java client, it does not consume many
computational resources. Each JSON file occupies ten
thousand rows in the database: each row is an array of
ten columns; each field value of the column contains
five reads.
(2) The Cassandra client was also developed in Java, using the
Java API provided by DataStax, and is the one in which
the data persists. This client creates a keyspace, inserts
all JSON files from the first application in a single
table, and extracts the data from a table.
The database schema consists of a single
keyspace, called biodata, a single cluster, called biocluster,
one table of metadata, and one table for each
persisted file, as shown in Figure .
The allocation strategy for replicas and the replication
factor are properties of the keyspace. The
allocation strategy determines whether or not data is
distributed through a network of different clusters.
The Simple Strategy [] was selected since this case
study was performed in a single cluster. Likewise,
since we did not consider failures and our goal was
to study performance rather than fault recovery, we
have chosen a replication factor of one. It should be noted
that the replication factor determines the number
of replicas distributed along the cluster. Focusing on
performance, a higher number of replicas would also
interfere with the insertion time.
As previously mentioned, the client application creates
a table for each inserted FASTQ file, which has
the same name as the file. Each of these tables has
eleven columns, and each cell stores a small part
of a JSON file, ten reads per cell, which is about
MB in size. This small number of columns and cells
is chosen because Cassandra is efficient when a table has
a small number of columns and a large number of
rows. This is also a consequence of the ability of the
Murmur3Partitioner to distribute each row to one
node. Therefore, the cluster has a better load balance
during insertions and extractions.
Once a table is created, the client inserts all data from
the JSON files of the first stage into the database, as shown in
Figure . Then, a single row is inserted into
the metadata table containing as a row key the name
of the FASTQ file and a column with the number of rows.
This number is stored in the metadata table to work around
the memory limit of the Java Virtual Machine, which
may be reached when querying large tables.
Figure : Database schema. The cluster BIOCluster holds the keyspace bioData, which contains a metadata table (one row per file, keyed by file name, with its number of rows) and one table per file (rows identified by keys, each holding values).
Figure : Stages of insertion. The FASTQ file is converted by FastqToJson into the files arq1.json, arq2.json, ..., arqN.json, which are loaded by the Cassandra client.
When extracting data, the client queries the metadata
table to get the number of rows in the table with the
FASTQ data and then proceeds to the table extraction,
which is done row by row and written into an “.out”
file.
(3) OutToFastq application. After data extraction, there is
a single file with the extension “.out.” This application
converts this file into FASTQ format, making it
identical to the original input file, resulting only in the
FASTQ file without the temporary file “.out.” This process
is shown in Figure .
Figure : Stages of extraction. The Cassandra client writes Data.out, which OutToFastq converts into Data.fastq.
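The row layout and the metadata-driven extraction described in this section can be sketched in Python with an in-memory dictionary standing in for the database. The constants follow the layout stated for the first application (ten columns per row, five reads per cell); the function names, the dictionary-based `db`, and the sample reads are hypothetical, and the real client was written in Java against Cassandra.

```python
import json

READS_PER_CELL = 5    # reads stored in one cell, per the first application
COLUMNS_PER_ROW = 10  # data cells per row


def reads_to_rows(reads):
    """Group reads into rows of cells; each cell is a JSON array of reads."""
    per_row = READS_PER_CELL * COLUMNS_PER_ROW
    rows = []
    for start in range(0, len(reads), per_row):
        chunk = reads[start:start + per_row]
        cells = [
            json.dumps(chunk[i:i + READS_PER_CELL])
            for i in range(0, len(chunk), READS_PER_CELL)
        ]
        rows.append(cells)
    return rows


def insert_file(db, name, reads):
    """Store the rows of one file, then record its row count as metadata."""
    rows = reads_to_rows(reads)
    db[name] = {row_id: cells for row_id, cells in enumerate(rows)}
    db["metadata"][name] = len(rows)


def extract_file(db, name):
    """Yield reads row by row, driven by the stored row count."""
    for row_id in range(db["metadata"][name]):
        for cell in db[name][row_id]:
            yield from json.loads(cell)


db = {"metadata": {}}
reads = [f"read-{i}" for i in range(120)]
insert_file(db, "sample_fastq", reads)
restored = list(extract_file(db, "sample_fastq"))
```

Knowing the row count in advance lets the extraction request one row at a time by key, so the client never has to hold a full table result in memory.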
5. Results
In this work, we have considered three experimental case
studies to evaluate data consistency and performance for
storing and extracting genomic data. For the first one, we
verified Cassandra’s scalability and variation in performance.
For the second case study, we compared the Cassandra results
to a PostgreSQL relational system and, finally, we used the
MongoDB NoSQL database and compared its results to those of the
Cassandra NoSQL system. The case studies used the same
data to insert and read sequences.
During the Cassandra evaluation, we created two
Cassandra clusters, the first with two computers and the
second with four. The first cluster consisted
of two computers with Intel Xeon E-/. GHz processor,
one with GB RAM and the other with GB RAM. For the
second cluster, besides the same two computers, two other
computers with Intel Core i processor and GB RAM were
included. Each one of them used Ubuntu ..
5.1. Insertions and Extractions Cassandra NoSQL. The input
files are six FASTQ files with filtered data from kidney and
liver cells. Table  shows the sizes of the files and the number
of rows that their respective JSON files had when inserted into
Cassandra.

Table : Cells files.
File                  File number   Size    Number of lines
Liver cells files                   , GB    .
                                    , GB    .
                                    , GB    .
Kidney cells files                  , GB    .
                                    , GB    .
                                    , GB    .
We have based the performance analyses on the elapsed
time to store (insert) data into and to retrieve (extract) data
from the database. These elapsed times are important because
if one wants to use the Cassandra system in bioinformatics
workflows, it is necessary to know how long it takes for the data
to become available to execute each program.
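Elapsed-time measurements of this kind can be taken with a small helper such as the Python sketch below. It is only an illustration: the names `timed` and `format_elapsed` are hypothetical, the timed operation is a placeholder, and the minutes, seconds, and milliseconds format is assumed from the units used in the result tables.

```python
import time


def timed(operation, *args, **kwargs):
    """Run an operation and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    return result, time.perf_counter() - start


def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as minutes, seconds, and milliseconds."""
    total_ms = round(seconds * 1000)
    minutes, rest = divmod(total_ms, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{minutes}m {secs}s {ms}ms"


# Placeholder workload; a real run would time an insertion or extraction.
result, elapsed = timed(sum, range(1_000_000))
print(format_elapsed(elapsed))
```

A monotonic clock such as `time.perf_counter` is preferable to wall-clock time here because it is unaffected by system clock adjustments during long runs.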
Table  shows the elapsed times to insert and extract
sequences in the database, with both implementations.
Columns and show the insertions using two nodes.
Similarly, columns and show the extractions using four
nodes. As expected, we could confirm the hypothesis that the
database performance increases when we add more nodes.
Figures and show comparative charts of insertion and
extraction elapsed times according to the number of computers
that Cassandra considers. Insertion into two computers
takes longer than into four computers. Here the performance
also improves when the number of computers in the
cluster increases.
5.2. Comparison of Relational and Cassandra NoSQL Systems.
We compared the Cassandra results with Huacarpuma [],
who used the same data to insert and read sequences in
PostgreSQL, a relational database. In the latter experiment,
the author used only one server with an Intel Xeon processor,
eight cores of . GHz and GB RAM, executing Linux
Server Ubuntu/Linaro ..-.
The server’s RAM for the relational database is larger than
the sum of the memories of the four computers used in this
experiment. Nonetheless, we use the results of the relational
database to demonstrate that it is possible to achieve high
performance even with modest hardware due to scalability
and parallelism.
Table  shows the sum of the insertion and extraction
times in the relational database and the two computational
environments using Cassandra: Cassandra (2), a cluster
with two computers, and Cassandra (4), a cluster with four
computers.
The writing time in Cassandra is lower due to parallelism,
as seen in Table . Write actions in Cassandra are more effective
than in a relational database. However, its performance
was lower for query answering, as shown in Figure . This
is due to two factors: first, Cassandra had to ensure that the
returned content was in its latest version, verifying the data
divided between machines; second, the data size is larger than
the available RAM; therefore, part of the data had to be stored
in SSTables, reducing the speed of the search.
Figure : Comparison between inserts (time in minutes × file number) for Cassandra (2) and Cassandra (4).
Figure : Comparison between extractions (time in minutes × file number) for Cassandra (2) and Cassandra (4).
The reader should note that the results obtained with
Cassandra just indicate a trend. They are not conclusive
because the hardware characteristics of all experiments are
different.
Nevertheless, the improved performance with the
increase of nodes is an indication that Cassandra may sometimes
surpass relational database systems in a larger number
of computers, making its use viable in data searches in
bioinformatics.
5.3. Comparison of MongoDB and Cassandra NoSQL Databases.
We compared the Cassandra results with those obtained by
inserting and reading the same sequences in a MongoDB NoSQL
database. MongoDB is an open-source document-oriented NoSQL
database designed to store large amounts of data.
The server where we have installed MongoDB has an i
processor with GB RAM. This server had GB RAM
more than the cluster with two computers, Cassandra (2), and
GB RAM less than the sum of the RAM of the four
computers, Cassandra (4).
Table : Times to insert and extract sequences from the database.
File   Size    Insertion                       Extraction
               Cassandra (2)   Cassandra (4)   Cassandra (2)   Cassandra (4)
       , GB    m s ms          m s ms          m s ms          m s ms
       , GB    m s ms          m s ms          m s ms          m s ms
       , GB    m s ms          m s ms          m s ms          m s ms
       , GB    m s ms          m s ms          m s ms          m s ms
       , GB    m s ms          m s ms          m s ms          m s ms
       , GB    m s ms          m s ms          m s ms          m s ms

Table : PostgreSQL and Cassandra results.
Database        Insertion   Extraction
PostgreSQL      h m s       m s
Cassandra (2)   m s         h m s
Cassandra (4)   m s         m s

Figure : Comparison between Cassandra and PostgreSQL (insertion and extraction times, in minutes, per database).
Table  shows the sum of the insertion and extraction
times in the MongoDB database and in Cassandra with
two and four computers in a cluster. The performance of
insertion operations was similar using either MongoDB or
Cassandra databases. However, MongoDB showed better
behavior than Cassandra NoSQL in the extraction of genomic
data in FASTQ format.
In Figure  our results suggest that there is a similar
behavior of the insertions in both MongoDB and Cassandra.
There was a performance gain of more than % in the
extraction, when comparing the results of Cassandra in
a cluster with two computers and in a cluster with four
computers.
6. Conclusions
In this work we studied genomic data persistence, with
the implementation of a NoSQL database using Cassandra.
Table : MongoDB and Cassandra final results.
Database        Insertion   Extraction
MongoDB         m s         m s
Cassandra (2)   m s         h m s
Cassandra (4)   m s         m s

Figure : Comparison between Cassandra and MongoDB database (insertion and extraction times, in minutes, per database).
We have observed that it presented a high performance
for writing operations due to the larger number of massive
insertions compared to data extractions. We used the DSE
tool together with Cassandra, which allowed us to create a
cluster and a client application suitable for the expected data
manipulation.
Our results suggest that there is a reduction of the
insertion and query times when more nodes are added in
Cassandra. There was a performance gain of about % in the
insertions and a gain of % in reading, when comparing the
results of a cluster with two computers and another cluster
with four computers.
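A performance gain between two cluster sizes can be expressed as the relative reduction in elapsed time. The Python sketch below assumes that definition, which the text does not state explicitly; the function name `percent_gain` is hypothetical and the example durations are invented rather than taken from the measurements.

```python
def percent_gain(baseline_seconds: float, improved_seconds: float) -> float:
    """Relative reduction in elapsed time, as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_seconds - improved_seconds) / baseline_seconds * 100.0


# Invented example: 600 s on two nodes versus 450 s on four nodes.
print(percent_gain(600, 450))  # 25.0
```

Under this definition a negative value would indicate that the larger cluster was slower than the baseline.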
Comparing the performance of Cassandra to that of the MongoDB
database, the results indicate that MongoDB performs
better than Cassandra for data extraction. For data
insertions the behaviors of Cassandra and MongoDB were
similar.
From the results presented here, it is possible to outline
new approaches in studies of persistency regarding genomic
data. Positive results could boost new research, for example,
the creation of a similar application using other NoSQL
databases or new tests using Cassandra with different hardware
configurations seeking improvements in performance.
It is also possible to create a relational database with hardware
settings identical to Cassandra, in order to make more
detailed comparisons.
Conflict of Interests
The authors declare that there is no conflict of interests
regarding the publication of this paper.
References
[] S. A. Simon, J. Zhai, R. S. Nandety et al., “Short-read sequencing
technologies for transcriptional analyses,” Annual Review of
Plant Biology, vol. , no. , pp. –, .
[] M. L. Metzker,“Sequencing technologies—the next generation,”
Nature Reviews Genetics, vol. , no. , pp. –, .
[] C.-L. Hung and G.-J. Hua, “Local alignment tool based on
Hadoop framework and GPU architecture,” BioMed Research
International, vol. , Article ID , pages, .
[] Y.-C. Lin, C.-S. Yu, and Y.-J. Lin, “Enabling large-scale biomedical
analysis in the cloud,” BioMed Research International, vol. ,
Article ID , pages, .
[] K. Kaur and R. Rani, “Modeling and querying data in NoSQL
databases,” in Proceedings of the IEEE International Conference
on Big Data, pp. –, October .
[] A. Lakshman and P. Malik, “Cassandra: a decentralized structured
storage system,” Operating Systems Review, vol. , no. ,
pp. –, .
[] K. Chodorow, MongoDB: The Definitive Guide, O’Reilly, 2nd
edition, .
[] R. Hecht and S. Jablonski, “NoSQL evaluation: a use case oriented
survey,” in Proceedings of the International Conference on
Cloud and Service Computing (CSC ’11), pp. –, December
.
[] Y. Muhammad, Evaluation and implementation of distributed
NoSQL database for MMO gaming environment [M.S. thesis],
Uppsala University, .
[] C. J. M. Tauro, S. Aravindh, and A. B. Shreeharsha, “Comparative
study of the new generation, agile, scalable, high performance
NOSQL databases,” International Journal of Computer
Applications, vol. , no. , pp. –, .
[] R. P. Padhy, M. Patra, and S. C. Satapathy, “RDBMS to NoSQL:
reviewing some next-generation non-relational databases,”
International Journal of Advanced Engineering Science and
Technologies, vol. , no. , pp. –, .
[] M. Bach and A. Werner, “Standardization of NoSQL database
languages,” in Beyond Databases, Architectures, and Structures:
10th International Conference, BDAS 2014, Ustron, Poland,
May 27–30, 2014. Proceedings, vol. of Communications in
Computer and Information Science, pp. –, Springer, Berlin,
Germany, .
[] M. Indrawan-Santiago, “Database research: are we at a crossroad?
Reflection on NoSQL,” in Proceedings of the 15th International
Conference on Network-Based Information Systems (NBIS
’12), pp. –, IEEE, Melbourne, Australia, September .
[] G. DeCandia, D. Hastorun, M. Jampani et al., “Dynamo:
Amazon’s highly available key-value store,” in Proceedings of the
21st ACM Symposium on Operating Systems Principles (SOSP
’07), pp. –, ACM, October .
[] F. Chang, J. Dean, S. Ghemawat et al., “Bigtable: a distributed
storage system for structured data,” in Proceedings of the
USENIX Symposium on Operating Systems Design and Imple-
mentation (OSDI '06), pp. –, .
[] E. Hewitt, Cassandra: The Definitive Guide, O’Reilly, 1st edition,
.
[] M. Klems, D. Bermbach, and R. Weinert, “A runtime quality
measurement framework for cloud database service systems,”
in Proceedings of the 8th International Conference on the Quality
of Information and Communications Technology (QUATIC ’12),
pp. –, September .
[] V. Parthasarathy, Learning Cassandra for Administrators, Packt
Publishing, Birmingham, UK, .
[] DataStax, Apache Cassandra . Documentation, ,
http://www.datastax.com/documentation/cassandra/./pdf/cassandra.pdf.
[] M. Fowler and P. J. Sadalage, NoSQL Distilled: A Brief Guide to
the Emerging World of Polyglot Persistence, Pearson Education,
Essex, UK, .
[] T. Bloom and T. Sharpe, “Managing data from high-throughput
genomic processing: a case study,” in Proceedings of the 13th
International Conference on Very Large Data Bases (VLDB ’04),
pp. –, .
[] U. Röhm and J. A. Blakeley, “Data management for high
throughput genomics,” in Proceedings of the Biennial Conference
on Innovative Data Systems Research (CIDR ’09), Asilomar, Calif,
USA, January , http://www-db.cs.wisc.edu/cidr/cidr/
Paper .pdf.
[] R. C. Huacarpuma, A data model for a pipeline of transcriptome
high performance sequencing [M.S. thesis], University of Brasília,
.
[] A. Bateman and M. Wood, “Cloud computing,” Bioinformatics,
vol. , no. , p. , .
[] Z. Ye and S. Li, “A request skew aware heterogeneous distributed
storage system based on Cassandra,” in Proceedings of the Inter-
national Conference on Computer and Management (CAMAN
’11), pp. –, May .
[] G. Wang and J. Tang, “The NoSQL principles and basic application
of cassandra model,” in Proceedings of the International
Conference on Computer Science and Service System (CSSS ’12),
pp. –, August .
[] B. G. Tudorica and C. Bucur, “A comparison between several
NoSQL databases with comments and notes,” in Proceedings of
the 10th RoEduNet International Conference on Networking in
Education and Research (RoEduNet ’11), pp. –, June .
[] Y. Li and S. Manoharan, “A performance comparison of SQL
and NoSQL databases,” in Proceedings of the 14th IEEE Pacific
Rim Conference on Communications, Computers, and Signal
Processing (PACRIM ’13), pp. –, August .
[] J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y.
Gilad, “RNA-seq: an assessment of technical reproducibility and
comparison with gene expression arrays,” Genome Research, vol.
, no. , pp. –, .
[] OpsCenter 4.0 User Guide Documentation, DataStax, ,
http://www.datastax.com/documentation/opscenter/./pdf/opscuserguide.pdf.
[] DataStax, DataStax Enterprise . Documentation, , http://
www.datastax.com/doc-source/pdf/dse.pdf.