Conference Paper

The ARES Project: Network Architecture for Delivering and Processing Genomics Data


Abstract

This paper presents the network solutions proposed and implemented within the framework of the ARES project. The strategic objective of ARES is to create an advanced CDN, accessible through a cloud interface, supporting medical and research systems that make extensive use of genomic data. The expected achievements consist of identifying suitable management policies for genomic contents in a cloud environment, in terms of efficiency, resiliency, scalability, and QoS, in a distributed fashion using a multi-user CDN approach. The experimental architecture envisages the use of the following key elements: a wideband networking infrastructure; a resource virtualization software platform; a distributed service architecture, including suitable computing and storage infrastructure; a distributed software platform able to ingest, validate, and analyze high volumes of data; and a software engine executing a sophisticated intelligence for workload distribution and CDN provisioning.
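The workload-distribution engine is described only at a high level in this abstract. As a rough illustration of the kind of policy such an engine might apply, the following Python sketch picks, for each request, the replica site that minimizes a weighted combination of network distance, current load, and a cache-miss penalty. The site names, weights, and cost model are illustrative assumptions, not part of the ARES specification.

```python
# Hypothetical sketch of a replica-selection policy for a genomic CDN.
# Not the ARES engine: site names, weights and the cost model are assumptions.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    rtt_ms: float        # network distance from the requesting user
    load: float          # fraction of compute/storage capacity in use (0..1)
    has_content: bool    # whether the requested genomic dataset is cached here

def choose_replica(sites, w_rtt=1.0, w_load=100.0, miss_penalty=500.0):
    """Return the site with the lowest composite cost for this request."""
    def cost(s):
        return w_rtt * s.rtt_ms + w_load * s.load + (0 if s.has_content else miss_penalty)
    return min(sites, key=cost)

if __name__ == "__main__":
    sites = [
        Site("Milan",  rtt_ms=12, load=0.70, has_content=True),
        Site("Geneva", rtt_ms=25, load=0.20, has_content=True),
        Site("London", rtt_ms=40, load=0.10, has_content=False),
    ]
    print("Serve request from:", choose_replica(sites).name)
```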
... Amazon Web Services (AWS) provides a dataset including the genomes of 1,700 people via the Amazon storage service (S3), taken from the well-known 1000 Genomes Project [8] [9]. Other initiatives include the Elixir project, which provided the pan-European network and storage infrastructure for biological data [12], the Open Science Data Cloud, provided by the Bionimbus project [14], and the ARES project, consisting of an overlay architecture for genomic computing implemented over the Géant network [15]. ...
... These initiatives either do not consider networking aspects [12] [14], or include just a high-level description, without specific networking solutions or a rigorous evaluation of the expected performance [13] [15]. ...
... Their arrival process is Poisson with a mean inter-arrival time of 20 minutes. The associated content size is uniformly distributed in the range [1, 15] GB. Fig. 2. Testbed logical topology, inspired by the Géant network. ...
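As a quick illustration of the workload described in this excerpt (Poisson request arrivals with a 20-minute mean inter-arrival time and content sizes uniform in [1, 15] GB), the following Python sketch generates such a trace. It is a generic reconstruction of the stated parameters, not the testbed's actual traffic generator.

```python
# Generate a synthetic request trace matching the workload quoted above:
# Poisson arrivals (mean inter-arrival 20 min) and sizes uniform in [1, 15] GB.
import random

def generate_trace(n_requests=10, mean_interarrival_min=20.0, seed=42):
    random.seed(seed)
    t = 0.0
    trace = []
    for _ in range(n_requests):
        t += random.expovariate(1.0 / mean_interarrival_min)  # exponential gaps => Poisson process
        size_gb = random.uniform(1.0, 15.0)
        trace.append((round(t, 1), round(size_gb, 2)))
    return trace

for arrival_min, size_gb in generate_trace():
    print(f"t={arrival_min:7.1f} min  size={size_gb:5.2f} GB")
```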
... Other initiatives include the pan-European network and storage infrastructure for biological data of the Elixir project [32], the Open Science Data Cloud developed by the project Bionimbus [34], the overlay architecture framed in the Géant network (ARES [35]), and ICTBioMed [33], a bottom-up, OpenStack-based initiative for supporting bio-medical applications, including genomics. ...
... However, the common feature of these initiatives is that they either neglect networking aspects ([32] [34]), or provide just a high-level description, without detailed system specifications and performance evaluation ([33] [35]). ...
Article
Full-text available
This paper provides a global picture of the deployment of networked processing services for genomic data sets. Much current research makes extensive use of genomic data, which are massive and rapidly increasing over time. They are typically stored in remote databases, accessible through the Internet. For this reason, a significant issue for effectively handling genomic data through data networks consists of the available network services. A first contribution of this paper consists of identifying the still unexploited features of genomic data that could allow optimizing their networked management. The second and main contribution of this survey consists of a methodological classification of computing and networking alternatives which can be used to offer what we call the Genomic-as-a-Service (GaaS) paradigm. In more detail, we analyze the main genomic processing applications, and classify not only the main computing alternatives to run genomics workflows in either a local machine or a distributed cloud environment, but also the main software technologies available to develop genomic processing services. Since an analysis encompassing only the computing aspects would provide only a partial view of the issues for deploying a GaaS system, we also present the main networking technologies that are available to efficiently support a GaaS solution. We first focus on existing service platforms, and analyze them in terms of service features, such as scalability, flexibility, and efficiency. Then, we present a taxonomy for both wide area and datacenter network technologies that may fit the GaaS requirements. It emerges that virtualization, both in computing and networking, is the key for a successful large-scale exploitation of genomic data, by pushing ahead the adoption of the GaaS paradigm. Finally, the paper illustrates a short- and long-term vision of future research challenges in the field.
... Other initiatives include the pan-European network and storage infrastructure for biological data of the Elixir project [32], the Open Science Data Cloud developed by the project Bionimbus [34], the overlay architecture framed in the Géant network (ARES [35]), and ICTBioMed [33], a bottom-up, OpenStack-based initiative for supporting bio-medical applications, including genomics. ...
... However, the common feature of these initiatives is that they either neglect networking aspects ([32] [34]), or provide just a high-level description, without detailed system specifications and performance evaluation ([33] [35]). ...
Conference Paper
Experts have warned that processing of genetic data will soon exceed the computing needs of Twitter and YouTube. This is due to the drop in the cost of sequencing the DNA of any living creature and its huge impact on many application areas. Designing suitable network architectures for distributing such data is therefore of paramount importance. Management of genomic data sets is a typical big data problem, characterized not only by a huge volume, but also by the large size of each genomic file. Since it is unthinkable that any professional who needs to process genomes can own the infrastructure for massive genome analysis, a cloud-based access to genomic services is envisaged. This will have a significant impact on the underlying networks, which could become the system bottleneck. In this paper, we propose Genome Centric Networking (GCN), a novel network function virtualization framework for cloud-based genomic data management, designed with the aim of limiting the exchanged traffic by using distributed caching. The key element of GCN is a novel signaling protocol, which allows both discovering network resources and managing caches. We evaluated GCN on a real testbed. GCN halves the exchanged traffic and significantly reduces the transfer time of genomic datasets.
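GCN itself is defined by its signaling protocol, which is not reproduced here. The sketch below only illustrates the general idea of consulting distributed caches before fetching a genomic dataset from its origin; the data structures, names (nearest_cache_with, fetch), and dataset identifiers are hypothetical and should not be read as the GCN implementation.

```python
# Schematic of cache-aware retrieval: serve from the nearest cache that holds
# the dataset, otherwise fall back to the origin and populate the local cache.
# Hypothetical structures; the real GCN discovers caches via a signaling protocol.

caches = {                      # cache node -> set of dataset identifiers it holds
    "cache-A": {"NA12878.bam"},
    "cache-B": {"HG00096.bam", "NA12878.bam"},
}
rtt_ms = {"cache-A": 8, "cache-B": 15, "origin": 60}

def nearest_cache_with(dataset):
    holders = [node for node, items in caches.items() if dataset in items]
    return min(holders, key=rtt_ms.get) if holders else None

def fetch(dataset, local_cache="cache-A"):
    source = nearest_cache_with(dataset) or "origin"
    caches.setdefault(local_cache, set()).add(dataset)   # populate the local cache
    return source

print(fetch("HG00096.bam"))   # served by cache-B instead of the origin
print(fetch("HG02024.bam"))   # cache miss everywhere -> origin
```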
... More importantly, it specifies a communication framework between users and developers with a continuous roadmap generated in conjunction with public authorities such as the US National Institutes of Health and academic, medical hospital, and industry advisors. Additionally, we will work to share best practices and identify areas of collaboration with related peer projects such as the ARES [21,22] and Bionimbus Projects [23]. For many years, scientific innovation has been a driving force for real progress in the medical sciences. ...
Conference Paper
Full-text available
Increasingly complex biomedical data from diverse sources demands large storage, efficient software and high-performance computing for the data’s computationally intensive analysis. Cloud technology provides flexible storage and data processing capacity to aggregate and analyze complex data, facilitating knowledge sharing and integration from different disciplines in a collaborative research environment. The ICTBioMed collaborative is a team of internationally renowned academic and medical research institutions committed to advancing discovery in biomedicine. In this work we describe the cloud framework design, development, and associated software platform and tools we are working to develop, federate and deploy in a coordinated and evolving manner to accelerate research developments in the biomedical field. Further, we highlight some of the essential considerations and challenges of deploying a complex open-architecture cloud-based research infrastructure with numerous software components, internationally distributed infrastructure and a diverse user base.
... A detailed description of the ARES network architecture is reported in a companion paper submitted to the same conference [29]. In order to clarify the working context of this paper, we provide a short, basic description of the network architecture for the sake of consistency. ...
Conference Paper
Full-text available
This paper presents the cloud services provided by the ARES project. The network solutions have been illustrated in a companion paper at the same conference. The ARES project aims to deploy CDN services over a broadband network for accessing and exchanging genomic datasets, accessible by medical and research personnel through a cloud interface. This paper illustrates the procedure defined to access such services, and provides a case-study simulation showing the implementation of the included bioinformatics pipeline. The experimental activity in ARES aims to gain a detailed understanding of the network problems relating to the sustainability of such services, given the increasing use of genomics for diagnostic purposes. The main aim is to enable extensive use of genomic data in the medical and diagnostic field, through the collection of relevant information available over the network.
Article
Full-text available
This paper presents the interdisciplinary area of networked genomic research and medicine. We first illustrate the most significant issues involved in these activities. Then, we illustrate an ongoing research activity in this area, and the relevant results obtained by using a network architecture designed for optimising the distribution of large genomic data-sets. This architecture, based on an evolved Next Steps in Signalling (NSIS) framework, addresses the major challenges for managing genomic data-sets over a network with limited resources. Research results have been obtained in the framework of the Advanced networking for EU genomic RESearch (ARES) project, which is part of the GÉANT/GN3plus project, co-funded by the European Commission.
Conference Paper
Full-text available
Many applications (routers, traffic monitors, firewalls, etc.) need to send and receive packets at line rate even on very fast links. In this paper we present netmap, a novel framework that enables commodity operating systems to handle the millions of packets per second traversing 1–10 Gbit/s links, without requiring custom hardware or changes to applications. In building netmap, we identified and successfully reduced or removed three main packet processing costs: per-packet dynamic memory allocations, removed by preallocating resources; system call overheads, amortized over large batches; and memory copies, eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas. Separately, some of these techniques have been used in the past. The novelty in our proposal is not only that we exceed the performance of most previous work, but also that we provide an architecture that is tightly integrated with existing operating system primitives, not tied to specific hardware, and easy to use and maintain. Netmap has been implemented in FreeBSD and Linux for several 1 and 10 Gbit/s network adapters. In our prototype, a single core running at 900 MHz can send or receive 14.88 Mpps (the peak packet rate on 10 Gbit/s links). This is more than 20 times faster than conventional APIs. Large speedups (5× and more) are also achieved on user-space Click and other packet forwarding applications using a libpcap emulation library running on top of netmap.
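To put the quoted figure in perspective (14.88 Mpps on one 900 MHz core), the following back-of-the-envelope calculation shows the per-packet cycle budget and why amortizing a fixed system-call cost over large batches matters. The syscall-cost figure used is an illustrative assumption, not a measurement from the paper.

```python
# Cycle budget implied by the netmap figures quoted above.
core_hz = 900e6          # 900 MHz core (from the abstract)
pps = 14.88e6            # 14.88 Mpps, the 10 Gbit/s peak rate (from the abstract)
cycles_per_pkt = core_hz / pps
print(f"budget: {cycles_per_pkt:.0f} cycles per packet")   # ~60 cycles

# Why batching helps: amortize a fixed per-call overhead over B packets.
syscall_cycles = 1000    # illustrative assumption for one syscall's cost
for batch in (1, 64, 1024):
    per_pkt_overhead = syscall_cycles / batch
    print(f"batch={batch:5d}: syscall overhead ~{per_pkt_overhead:7.2f} cycles/packet")
```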
Article
Full-text available
Network deduplication (dedup) is an attractive approach to improve network performance for mobile devices. With traditional deduplication, the dedup source uses only the portion of the cache at the dedup destination that it is aware of. We argue in this work that in a mobile environment, the dedup destination (say the mobile) could have accumulated a much larger cache than what the current dedup source is aware of. This can occur for several reasons, ranging from the mobile consuming content through heterogeneous wireless technologies, to the mobile moving across different wireless networks. In this context, we propose asymmetric caching, a solution that is overlaid on baseline network deduplication, but which allows the dedup destination to selectively feed back appropriate portions of its cache to the dedup source with the intent of improving the redundancy elimination efficiency. We show, using traffic traces collected from 30 mobile users, that with asymmetric caching, over 89% of the achievable redundancy can be identified and eliminated even when the dedup source has less than one hundredth of the cache size of the dedup destination. Further, we show that the number of bytes saved from transmission at the dedup source because of asymmetric caching is over 6× the number of bytes sent as feedback. Finally, with a prototype implementation of asymmetric caching on both a Linux laptop and an Android smartphone, we demonstrate that the solution is deployable with reasonable CPU and memory overheads.
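As a rough illustration of chunk-level redundancy elimination with destination feedback (the mechanism asymmetric caching builds on), the sketch below has the destination feed back the fingerprints of chunks it already caches, so the source transmits only the missing chunks. Fixed-size chunking and the function names are simplifying assumptions, not the paper's actual design.

```python
# Chunk-level dedup with feedback from the destination's (larger) cache.
# Fixed-size chunking and these names are illustrative simplifications.
import hashlib

CHUNK = 4096

def fingerprints(data):
    return {hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

def send(data, fed_back_fingerprints):
    """Return only the chunks the destination does not already cache."""
    missing = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        if hashlib.sha1(chunk).hexdigest() not in fed_back_fingerprints:
            missing.append(chunk)
    return missing

old = b"A" * 8 * CHUNK                 # content already seen by the destination
new = old[:6 * CHUNK] + b"B" * 2 * CHUNK
feedback = fingerprints(old)           # destination feeds back its cache summary
to_send = send(new, feedback)
print(f"transmit {len(to_send)} of {len(new) // CHUNK} chunks")   # 2 of 8
```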
Article
Full-text available
In this paper, we propose a new gossip-based signaling dissemination method for the Next Steps in Signaling protocol family. In more detail, we propose to extend the General Internet Signaling Transport (GIST) protocol, so that all NSIS Signaling Layer Protocol applications using its transport capabilities can leverage these new dissemination capabilities. The extension consists of two main procedures: a bootstrap procedure, during which new GIST-enabled nodes discover each other, and a dissemination procedure, which is used to effectively disseminate signaling messages within an Autonomous System. To this aim, we defined three dissemination models (bubble, balloon, and hose), so as to fulfill the requirements of different network and service management scenarios. An experimental campaign carried out on GENI shows the effectiveness of the proposed solution.
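The abstract names three dissemination models (bubble, balloon, and hose) without detailing them. The sketch below only shows a generic push-gossip round structure within a set of nodes, to make the dissemination idea concrete; the fanout, node set, and round logic are illustrative and do not reproduce those specific models or the GIST extension itself.

```python
# Generic push-gossip dissemination among the nodes of one Autonomous System.
# Fanout, round structure and node names are illustrative assumptions; this is
# not the bubble/balloon/hose models or the GIST extension from the paper.
import random

def gossip(nodes, origin, fanout=2, seed=1):
    random.seed(seed)
    informed = {origin}
    rounds = 0
    while len(informed) < len(nodes):
        rounds += 1
        for node in list(informed):
            for peer in random.sample(nodes, k=fanout):    # push to random peers
                informed.add(peer)
    return rounds

nodes = [f"gist-node-{i}" for i in range(16)]
print("rounds to inform all nodes:", gossip(nodes, nodes[0]))
```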
Article
Full-text available
Autonomic management capabilities of the Future Internet can be provided through a recently proposed service architecture called NetServ. It consists of the interconnection of programmable nodes which enable dynamic deployment and execution of network and application services. This paper shows how this architecture can be further improved by introducing the OpenFlow architecture and implementing the OpenFlow controller as a NetServ service, thus improving both the NetServ management performance and its flexibility. These achievements are demonstrated experimentally in the GENI environment, showing the platform's self-protecting capabilities in the case of a SIP DoS attack.
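The paper demonstrates self-protection against a SIP DoS attack by running an OpenFlow controller as a NetServ service. The sketch below only shows the generic threshold logic such a service might apply (count SIP requests per source and install a drop rule above a threshold), with a plain dictionary standing in for the real flow table; no actual NetServ or OpenFlow API is used, and the threshold value is an assumption.

```python
# Generic rate-threshold mitigation logic, with a set standing in for the
# switch flow table; this is not the NetServ/OpenFlow API used in the paper.
from collections import Counter

THRESHOLD = 100            # SIP INVITEs per monitoring window (illustrative)
sip_invites = Counter()
drop_rules = set()

def on_sip_invite(src_ip):
    if src_ip in drop_rules:
        return "dropped"
    sip_invites[src_ip] += 1
    if sip_invites[src_ip] > THRESHOLD:
        drop_rules.add(src_ip)              # would translate to an OpenFlow drop rule
        return "rule installed, dropped"
    return "forwarded"

for _ in range(THRESHOLD + 2):
    status = on_sip_invite("203.0.113.7")
print(status)                               # attacker traffic now dropped
```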
Article
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources that operate on the data in GenBank and a variety of other biological data made available through NCBI’s Web site. NCBI data retrieval resources include Entrez, PubMed, LocusLink and the Taxonomy Browser. Data analysis resources include BLAST, Electronic PCR, OrfFinder, RefSeq, UniGene, Database of Single Nucleotide Polymorphisms (dbSNP), Human Genome Sequencing pages, GeneMap’99, Davis Human–Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP) pages, Entrez Genomes, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, Cancer Genome Anatomy Project (CGAP) pages, SAGEmap, Online Mendelian Inheritance in Man (OMIM) and the Molecular Modeling Database (MMDB). Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov
Conference Paper
We consider a content delivery architecture based on geographically dispersed groups of "last-mile" CDN servers, e.g., set-top boxes located within users' homes. These servers may belong to administratively separate domains, such as multiple ISPs. We propose a set of scalable, adaptive mechanisms to jointly manage content replication and request routing within this architecture. Relying on primal-dual methods and fluid-limit techniques, we formally prove the optimality of our design. We further evaluate its performance on both synthetic and trace-driven simulations, based on real BitTorrent traces, and observe a reduction of network costs by more than 50% over traditional mechanisms such as LRU/LFU with closest request routing.
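The paper's own mechanisms rely on primal-dual methods and are not reproduced here. The following sketch only shows the baseline it is compared against, LRU caching at last-mile servers with closest request routing, using hypothetical server names, capacities, and distances.

```python
# Baseline from the comparison above: LRU caches at last-mile servers,
# requests routed to the closest server (hypothetical topology and sizes).
from collections import OrderedDict

class LRUServer:
    def __init__(self, name, capacity):
        self.name, self.capacity, self.cache = name, capacity, OrderedDict()

    def request(self, item):
        if item in self.cache:                 # hit: refresh recency
            self.cache.move_to_end(item)
            return "hit"
        if len(self.cache) >= self.capacity:   # miss: evict least recently used
            self.cache.popitem(last=False)
        self.cache[item] = True
        return "miss"

servers = {"home-A": LRUServer("home-A", 2), "home-B": LRUServer("home-B", 2)}
distance = {("u1", "home-A"): 1, ("u1", "home-B"): 3}

def route(user, item):
    closest = min(servers, key=lambda s: distance[(user, s)])
    return closest, servers[closest].request(item)

print(route("u1", "video-1"))   # ('home-A', 'miss')
print(route("u1", "video-1"))   # ('home-A', 'hit')
```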
Conference Paper
Internet delivery infrastructures are traditionally optimized for low-latency traffic, such as Web traffic. However, in recent years we are witnessing a massive growth of throughput-oriented applications, such as video streaming. These applications introduce new tradeoffs and design choices for content delivery networks (CDNs). In this paper, we focus on understanding two key design choices: (1) What is the impact of the number of a CDN's peering points and server locations on its aggregate throughput and operating costs? (2) How much can ISP-CDNs benefit from using path selection to maximize their aggregate throughput, compared to other CDNs which only have control at the edge? Answering these questions is challenging because content distribution involves a complex ecosystem consisting of many parties (clients, CDNs, ISPs) and depends on various settings which differ across places and over time. We introduce a simple model to illustrate and quantify the essential tradeoffs in CDN designs. Using extensive analysis over a variety of network topologies (with varying numbers of CDN peering points and server locations), operating cost models, and client video streaming traces, we observe that: (1) Doubling the number of peering points roughly doubles the aggregate throughput over a wide range of values and network topologies. In contrast, optimal path selection improves the CDN aggregate throughput by less than 70%, and in many cases by as little as a few percent. (2) Keeping the number of peering points constant, but reducing the number of locations (data centers) at which the CDN is deployed, can significantly reduce operating costs.
Article
Cloud computing, rapidly emerging as a new computation paradigm, provides agile and scalable resource access in a utility-like fashion, especially for the processing of big data. An important open issue here is to efficiently move the data, from different geographical locations over time, into a cloud for effective processing. The de facto approach of hard drive shipping is not flexible or secure. This work studies timely, cost-minimizing upload of massive, dynamically-generated, geo-dispersed data into the cloud, for processing using a MapReduce-like framework. Targeting a cloud encompassing disparate data centers, we model a cost-minimizing data migration problem, and propose two online algorithms: an online lazy migration (OLM) algorithm and a randomized fixed horizon control (RFHC) algorithm, for optimizing at any given time the choice of the data center for data aggregation and processing, as well as the routes for transmitting data there. Careful comparisons among these online and offline algorithms in realistic settings are conducted through extensive experiments, which demonstrate close-to-offline-optimum performance of the online algorithms.
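The OLM and RFHC algorithms themselves are not reproduced here. The sketch below only conveys the "lazy" flavour of the online decision, migrating the aggregation point only once the accumulated routing-cost penalty of staying exceeds the one-off migration cost; all numbers and data-center names are illustrative assumptions, not the paper's algorithm or results.

```python
# Schematic "lazy migration" decision: keep aggregating at the current data
# center until the accumulated extra routing cost justifies paying the
# migration cost. Illustrative numbers only; not the OLM algorithm itself.
def lazy_migration(extra_cost_per_step, migration_cost):
    accumulated, current_dc, log = 0.0, "dc-east", []
    for t, extra in enumerate(extra_cost_per_step):
        accumulated += extra
        if accumulated >= migration_cost:       # staying now costs as much as moving
            current_dc, accumulated = "dc-west", 0.0
            log.append((t, "migrate"))
        else:
            log.append((t, "stay"))
    return log

# Per-step extra cost of routing new data to the current (now suboptimal) site.
print(lazy_migration(extra_cost_per_step=[2, 3, 4, 6, 7], migration_cost=15))
```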