Conference Paper

The ARES Project: Network Architecture for Delivering and Processing Genomics Data


Abstract

This paper presents the network solutions proposed and implemented within the framework of the ARES project. The strategic objective of ARES is to create an advanced CDN, accessible through a cloud interface, supporting medical and research systems that make extensive use of genomic data. The expected achievements consist of identifying suitable management policies for genomic contents in a cloud environment, in terms of efficiency, resiliency, scalability, and QoS, in a distributed fashion using a multi-user CDN approach. The experimental architecture envisages the use of the following key elements: a wideband networking infrastructure; a resource virtualization software platform; a distributed service architecture, including suitable computing and storage infrastructure; a distributed software platform able to ingest, validate, and analyze high volumes of data; and a software engine executing a sophisticated intelligence for workload distribution and CDN provisioning.
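The workload-distribution engine is described only at a high level in this abstract. As a rough illustration of the kind of policy such an engine might apply, the following Python sketch picks, for each request, the replica site that minimizes a weighted combination of network distance, current load, and a cache-miss penalty. The site names, weights, and cost model are illustrative assumptions, not part of the ARES specification.

```python
# Hypothetical sketch of a replica-selection policy for a genomic CDN.
# Not the ARES engine: site names, weights and the cost model are assumptions.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    rtt_ms: float        # network distance from the requesting user
    load: float          # fraction of compute/storage capacity in use (0..1)
    has_content: bool    # whether the requested genomic dataset is cached here

def choose_replica(sites, w_rtt=1.0, w_load=100.0, miss_penalty=500.0):
    """Return the site with the lowest composite cost for this request."""
    def cost(s):
        return w_rtt * s.rtt_ms + w_load * s.load + (0 if s.has_content else miss_penalty)
    return min(sites, key=cost)

if __name__ == "__main__":
    sites = [
        Site("Milan",  rtt_ms=12, load=0.70, has_content=True),
        Site("Geneva", rtt_ms=25, load=0.20, has_content=True),
        Site("London", rtt_ms=40, load=0.10, has_content=False),
    ]
    print("Serve request from:", choose_replica(sites).name)
```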
... Amazon Web Services (AWS) provides a dataset including the genomes of 1,700 people via the Amazon storage service (S3), taken from the well-known 1000 Genomes Project [8] [9]. Other initiatives include the Elixir project, which provided the pan-European network and storage infrastructure for biological data [12], the Open Science Data Cloud, provided by the Bionimbus project [14], and the ARES project, consisting of an overlay architecture for genomic computing implemented over the Géant network [15]. ...
... These initiatives either do not consider networking aspects [12] [14], or include just a high-level description, without specific networking solutions or a rigorous evaluation of the expected performance [13] [15]. ...
... Their arrival process is Poisson with a mean inter-arrival time of 20 minutes. The associated content size is uniformly distributed in the range [1, 15] GB. Fig. 2. Testbed logical topology, inspired by the Géant network. ...
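As a quick illustration of the workload described in this excerpt (Poisson request arrivals with a 20-minute mean inter-arrival time and content sizes uniform in [1, 15] GB), the following Python sketch generates such a trace. It is a generic reconstruction of the stated parameters, not the testbed's actual traffic generator.

```python
# Generate a synthetic request trace matching the workload quoted above:
# Poisson arrivals (mean inter-arrival 20 min) and sizes uniform in [1, 15] GB.
import random

def generate_trace(n_requests=10, mean_interarrival_min=20.0, seed=42):
    random.seed(seed)
    t = 0.0
    trace = []
    for _ in range(n_requests):
        t += random.expovariate(1.0 / mean_interarrival_min)  # exponential gaps => Poisson process
        size_gb = random.uniform(1.0, 15.0)
        trace.append((round(t, 1), round(size_gb, 2)))
    return trace

for arrival_min, size_gb in generate_trace():
    print(f"t={arrival_min:7.1f} min  size={size_gb:5.2f} GB")
```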
... Other initiatives include the pan-European network and storage infrastructure for biological data of the Elixir project [32], the Open Science Data Cloud developed by the project Bionimbus [34], the overlay architecture framed in the Géant network (ARES [35]), and ICTBioMed [33], a bottom-up, OpenStack-based initiative for supporting bio-medical applications, including genomics. ...
... However, the common feature of these initiatives is that they either neglect networking aspects ([32] [34]), or provide just a high-level description, without detailed system specifications and performance evaluation ([33] [35]). ...
Article
Full-text available
This paper provides a global picture of the deployment of networked processing services for genomic data sets. Much current research makes extensive use of genomic data, which are massive and rapidly increasing over time. They are typically stored in remote databases, accessible through the Internet. For this reason, a significant issue for effectively handling genomic data through data networks consists of the available network services. A first contribution of this paper consists of identifying the still unexploited features of genomic data that could allow optimizing their networked management. The second and main contribution of this survey consists of a methodological classification of computing and networking alternatives which can be used to offer what we call the Genomic-as-a-Service (GaaS) paradigm. In more detail, we analyze the main genomic processing applications, and classify not only the main computing alternatives to run genomics workflows in either a local machine or a distributed cloud environment, but also the main software technologies available to develop genomic processing services. Since an analysis encompassing only the computing aspects would provide only a partial view of the issues for deploying a GaaS system, we also present the main networking technologies that are available to efficiently support a GaaS solution. We first focus on existing service platforms, and analyze them in terms of service features, such as scalability, flexibility, and efficiency. Then, we present a taxonomy for both wide area and datacenter network technologies that may fit the GaaS requirements. It emerges that virtualization, both in computing and networking, is the key for a successful large-scale exploitation of genomic data, by pushing ahead the adoption of the GaaS paradigm. Finally, the paper illustrates a short- and long-term vision of future research challenges in the field.
... Other initiatives include the pan-European network and storage infrastructure for biological data of the Elixir project [32], the Open Science Data Cloud developed by the project Bionimbus [34], the overlay architecture framed in the Géant network (ARES [35]), and ICTBioMed [33], a bottom-up, OpenStack-based initiative for supporting bio-medical applications, including genomics. ...
... However, the common feature of these initiatives is that they either neglect networking aspects ([32] [34]), or provide just a high-level description, without detailed system specifications and performance evaluation ([33] [35]). ...
Conference Paper
Experts have warned that processing of genetic data will soon exceed the computing needs of Twitter and YouTube. This is due to the drop in the cost of sequencing the DNA of any living creature and its huge impact on many application areas. Designing suitable network architectures for distributing such data is therefore of paramount importance. Management of genomic data sets is a typical big data problem, characterized not only by a huge volume, but also by the large size of each genomic file. Since it is unthinkable that any professional who needs to process genomes can own the infrastructure for massive genome analysis, a cloud-based access to genomic services is envisaged. This will have a significant impact on the underlying networks, which could become the system bottleneck. In this paper, we propose Genome Centric Networking (GCN), a novel network function virtualization framework for cloud-based genomic data management, designed with the aim of limiting the exchanged traffic by using distributed caching. The key element of GCN is a novel signaling protocol, which allows both discovering network resources and managing caches. We evaluated GCN on a real testbed. GCN halves the exchanged traffic and significantly reduces the transfer time of genomic datasets.
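GCN itself is defined by its signaling protocol, which is not reproduced here. The sketch below only illustrates the general idea of consulting distributed caches before fetching a genomic dataset from its origin; the data structures, names (nearest_cache_with, fetch), and dataset identifiers are hypothetical and should not be read as the GCN implementation.

```python
# Schematic of cache-aware retrieval: serve from the nearest cache that holds
# the dataset, otherwise fall back to the origin and populate the local cache.
# Hypothetical structures; the real GCN discovers caches via a signaling protocol.

caches = {                      # cache node -> set of dataset identifiers it holds
    "cache-A": {"NA12878.bam"},
    "cache-B": {"HG00096.bam", "NA12878.bam"},
}
rtt_ms = {"cache-A": 8, "cache-B": 15, "origin": 60}

def nearest_cache_with(dataset):
    holders = [node for node, items in caches.items() if dataset in items]
    return min(holders, key=rtt_ms.get) if holders else None

def fetch(dataset, local_cache="cache-A"):
    source = nearest_cache_with(dataset) or "origin"
    caches.setdefault(local_cache, set()).add(dataset)   # populate the local cache
    return source

print(fetch("HG00096.bam"))   # served by cache-B instead of the origin
print(fetch("HG02024.bam"))   # cache miss everywhere -> origin
```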
... More importantly, it specifies a communication framework between users and developers with a continuous roadmap generated in conjunction with public authorities such as the US National Institutes of Health and academic, medical hospital, and industry advisors. Additionally, we will work to share best practices and identify areas of collaboration with related peer projects such as the ARES [21,22] and Bionimbus Projects [23]. For many years, scientific innovation has been a driving force for real progress in the medical sciences. ...
Conference Paper
Full-text available
Increasingly complex biomedical data from diverse sources demands large storage, efficient software and high-performance computing for the data’s computationally intensive analysis. Cloud technology provides flexible storage and data processing capacity to aggregate and analyze complex data, facilitating knowledge sharing and integration from different disciplines in a collaborative research environment. The ICTBioMed collaborative is a team of internationally renowned academic and medical research institutions committed to advancing discovery in biomedicine. In this work we describe the cloud framework design, development, and associated software platform and tools we are working to develop, federate and deploy in a coordinated and evolving manner to accelerate research developments in the biomedical field. Further, we highlight some of the essential considerations and challenges of deploying a complex open-architecture cloud-based research infrastructure with numerous software components, internationally distributed infrastructure and a diverse user base.
... A detailed description of the ARES network architecture is reported in a companion paper submitted to the same conference [29]. In order to clarify the working context of this paper, we provide a short, basic description of the network architecture for the sake of consistency. ...
Conference Paper
Full-text available
This paper presents the cloud services provided by the ARES project. The network solutions have been illustrated in a companion paper at the same conference. The ARES project aims to deploy CDN services over a broadband network for accessing and exchanging genomic datasets, accessible by medical and research personnel through a cloud interface. This paper illustrates the procedure defined to access such services, and provides a case-study simulation showing the implementation of the included bioinformatics pipeline. The experimental activity in ARES aims to gain a detailed understanding of the network problems relating to the sustainability of such services, given the increasing use of genomics for diagnostic purposes. The main aim is to enable extensive use of genomic data in the medical and diagnostic field, through the collection of relevant information available over the network.
Article
Full-text available
This paper presents the interdisciplinary area of networked genomic research and medicine. We first illustrate the most significant issues involved in these activities. Then, we illustrate an ongoing research activity in this area, and the relevant results obtained by using a network architecture designed for optimising the distribution of large genomic data-sets. This architecture, based on an evolved Next Steps in Signalling (NSIS) framework, addresses the major challenges for managing genomic data-sets over a network with limited resources. Research results have been obtained in the framework of the Advanced networking for EU genomic RESearch (ARES) project, which is part of the GÉANT/GN3plus project, co-funded by the European Commission.
Conference Paper
Full-text available
Many applications (routers, traffic monitors, firewalls, etc.) need to send and receive packets at line rate even on very fast links. In this paper we present netmap, a novel framework that enables commodity operating systems to handle the millions of packets per second traversing 1–10 Gbit/s links, without requiring custom hardware or changes to applications. In building netmap, we identified and successfully reduced or removed three main packet processing costs: per-packet dynamic memory allocations, removed by preallocating resources; system call overheads, amortized over large batches; and memory copies, eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas. Separately, some of these techniques have been used in the past. The novelty in our proposal is not only that we exceed the performance of most previous work, but also that we provide an architecture that is tightly integrated with existing operating system primitives, not tied to specific hardware, and easy to use and maintain. Netmap has been implemented in FreeBSD and Linux for several 1 and 10 Gbit/s network adapters. In our prototype, a single core running at 900 MHz can send or receive 14.88 Mpps (the peak packet rate on 10 Gbit/s links). This is more than 20 times faster than conventional APIs. Large speedups (5× and more) are also achieved on user-space Click and other packet forwarding applications using a libpcap emulation library running on top of netmap.
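To put the quoted figure in perspective (14.88 Mpps on one 900 MHz core), the following back-of-the-envelope calculation shows the per-packet cycle budget and why amortizing a fixed system-call cost over large batches matters. The syscall-cost figure used is an illustrative assumption, not a measurement from the paper.

```python
# Cycle budget implied by the netmap figures quoted above.
core_hz = 900e6          # 900 MHz core (from the abstract)
pps = 14.88e6            # 14.88 Mpps, the 10 Gbit/s peak rate (from the abstract)
cycles_per_pkt = core_hz / pps
print(f"budget: {cycles_per_pkt:.0f} cycles per packet")   # ~60 cycles

# Why batching helps: amortize a fixed per-call overhead over B packets.
syscall_cycles = 1000    # illustrative assumption for one syscall's cost
for batch in (1, 64, 1024):
    per_pkt_overhead = syscall_cycles / batch
    print(f"batch={batch:5d}: syscall overhead ~{per_pkt_overhead:7.2f} cycles/packet")
```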
Article
Full-text available
Network deduplication (dedup) is an attractive approach to improve network performance for mobile devices. With traditional deduplication, the dedup source uses only the portion of the cache at the dedup destination that it is aware of. We argue in this work that in a mobile environment, the dedup destination (say the mobile) could have accumulated a much larger cache than what the current dedup source is aware of. This can occur for several reasons, ranging from the mobile consuming content through heterogeneous wireless technologies, to the mobile moving across different wireless networks. In this context, we propose asymmetric caching, a solution that is overlaid on baseline network deduplication, but which allows the dedup destination to selectively feed back appropriate portions of its cache to the dedup source with the intent of improving the redundancy elimination efficiency. We show, using traffic traces collected from 30 mobile users, that with asymmetric caching, over 89% of the achievable redundancy can be identified and eliminated even when the dedup source has less than one hundredth of the cache size of the dedup destination. Further, we show that the number of bytes saved from transmission at the dedup source because of asymmetric caching is over 6× the number of bytes sent as feedback. Finally, with a prototype implementation of asymmetric caching on both a Linux laptop and an Android smartphone, we demonstrate that the solution is deployable with reasonable CPU and memory overheads.
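As a rough illustration of chunk-level redundancy elimination with destination feedback (the mechanism asymmetric caching builds on), the sketch below has the destination feed back the fingerprints of chunks it already caches, so the source transmits only the missing chunks. Fixed-size chunking and the function names are simplifying assumptions, not the paper's actual design.

```python
# Chunk-level dedup with feedback from the destination's (larger) cache.
# Fixed-size chunking and these names are illustrative simplifications.
import hashlib

CHUNK = 4096

def fingerprints(data):
    return {hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}

def send(data, fed_back_fingerprints):
    """Return only the chunks the destination does not already cache."""
    missing = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        if hashlib.sha1(chunk).hexdigest() not in fed_back_fingerprints:
            missing.append(chunk)
    return missing

old = b"A" * 8 * CHUNK                 # content already seen by the destination
new = old[:6 * CHUNK] + b"B" * 2 * CHUNK
feedback = fingerprints(old)           # destination feeds back its cache summary
to_send = send(new, feedback)
print(f"transmit {len(to_send)} of {len(new) // CHUNK} chunks")   # 2 of 8
```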
Article
Full-text available
In this paper, we propose a new gossip-based signaling dissemination method for the Next Steps in Signaling protocol family. In more detail, we propose to extend the General Internet Signaling Transport (GIST) protocol, so that all NSIS Signaling Layer Protocol applications using its transport capabilities can leverage these new dissemination capabilities. The extension consists of two main procedures: a bootstrap procedure, during which new GIST-enabled nodes discover each other, and a dissemination procedure, which is used to effectively disseminate signaling messages within an Autonomous System. To this aim, we defined three dissemination models (bubble, balloon, and hose), so as to fulfill the requirements of different network and service management scenarios. An experimental campaign carried out on GENI shows the effectiveness of the proposed solution.
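The abstract names three dissemination models (bubble, balloon, and hose) without detailing them. The sketch below only shows a generic push-gossip round structure within a set of nodes, to make the dissemination idea concrete; the fanout, node set, and round logic are illustrative and do not reproduce those specific models or the GIST extension itself.

```python
# Generic push-gossip dissemination among the nodes of one Autonomous System.
# Fanout, round structure and node names are illustrative assumptions; this is
# not the bubble/balloon/hose models or the GIST extension from the paper.
import random

def gossip(nodes, origin, fanout=2, seed=1):
    random.seed(seed)
    informed = {origin}
    rounds = 0
    while len(informed) < len(nodes):
        rounds += 1
        for node in list(informed):
            for peer in random.sample(nodes, k=fanout):    # push to random peers
                informed.add(peer)
    return rounds

nodes = [f"gist-node-{i}" for i in range(16)]
print("rounds to inform all nodes:", gossip(nodes, nodes[0]))
```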
Article
Full-text available
Autonomic management capabilities of the Future Internet can be provided through a recently proposed service architecture called NetServ. It consists of the interconnection of programmable nodes which enable dynamic deployment and execution of network and application services. This paper shows how this architecture can be further improved by introducing the OpenFlow architecture and implementing the OpenFlow controller as a NetServ service, thus improving both the NetServ management performance and its flexibility. These achievements are demonstrated experimentally in the GENI environment, showing the platform's self-protecting capabilities in the case of a SIP DoS attack.
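The paper demonstrates self-protection against a SIP DoS attack by running an OpenFlow controller as a NetServ service. The sketch below only shows the generic threshold logic such a service might apply (count SIP requests per source and install a drop rule above a threshold), with a plain dictionary standing in for the real flow table; no actual NetServ or OpenFlow API is used, and the threshold value is an assumption.

```python
# Generic rate-threshold mitigation logic, with a set standing in for the
# switch flow table; this is not the NetServ/OpenFlow API used in the paper.
from collections import Counter

THRESHOLD = 100            # SIP INVITEs per monitoring window (illustrative)
sip_invites = Counter()
drop_rules = set()

def on_sip_invite(src_ip):
    if src_ip in drop_rules:
        return "dropped"
    sip_invites[src_ip] += 1
    if sip_invites[src_ip] > THRESHOLD:
        drop_rules.add(src_ip)              # would translate to an OpenFlow drop rule
        return "rule installed, dropped"
    return "forwarded"

for _ in range(THRESHOLD + 2):
    status = on_sip_invite("203.0.113.7")
print(status)                               # attacker traffic now dropped
```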
Article
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources that operate on the data in GenBank and a variety of other biological data made available through NCBI’s Web site. NCBI data retrieval resources include Entrez, PubMed, LocusLink and the Taxonomy Browser. Data analysis resources include BLAST, Electronic PCR, OrfFinder, RefSeq, UniGene, Database of Single Nucleotide Polymorphisms (dbSNP), Human Genome Sequencing pages, GeneMap’99, Davis Human–Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP) pages, Entrez Genomes, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, Cancer Genome Anatomy Project (CGAP) pages, SAGEmap, Online Mendelian Inheritance in Man (OMIM) and the Molecular Modeling Database (MMDB). Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov
Conference Paper
We consider a content delivery architecture based on geographically dispersed groups of "last-mile" CDN servers, e.g., set-top boxes located within users' homes. These servers may belong to administratively separate domains, such as multiple ISPs. We propose a set of scalable, adaptive mechanisms to jointly manage content replication and request routing within this architecture. Relying on primal-dual methods and fluid-limit techniques, we formally prove the optimality of our design. We further evaluate its performance on both synthetic and trace-driven simulations, based on real BitTorrent traces, and observe a reduction of network costs by more than 50% over traditional mechanisms such as LRU/LFU with closest request routing.
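The paper's own mechanisms rely on primal-dual methods and are not reproduced here. The following sketch only shows the baseline it is compared against, LRU caching at last-mile servers with closest request routing, using hypothetical server names, capacities, and distances.

```python
# Baseline from the comparison above: LRU caches at last-mile servers,
# requests routed to the closest server (hypothetical topology and sizes).
from collections import OrderedDict

class LRUServer:
    def __init__(self, name, capacity):
        self.name, self.capacity, self.cache = name, capacity, OrderedDict()

    def request(self, item):
        if item in self.cache:                 # hit: refresh recency
            self.cache.move_to_end(item)
            return "hit"
        if len(self.cache) >= self.capacity:   # miss: evict least recently used
            self.cache.popitem(last=False)
        self.cache[item] = True
        return "miss"

servers = {"home-A": LRUServer("home-A", 2), "home-B": LRUServer("home-B", 2)}
distance = {("u1", "home-A"): 1, ("u1", "home-B"): 3}

def route(user, item):
    closest = min(servers, key=lambda s: distance[(user, s)])
    return closest, servers[closest].request(item)

print(route("u1", "video-1"))   # ('home-A', 'miss')
print(route("u1", "video-1"))   # ('home-A', 'hit')
```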
Conference Paper
Internet delivery infrastructures are traditionally optimized for low-latency traffic, such as Web traffic. However, in recent years we are witnessing a massive growth of throughput-oriented applications, such as video streaming. These applications introduce new tradeoffs and design choices for content delivery networks (CDNs). In this paper, we focus on understanding two key design choices: (1) What is the impact of the number of a CDN's peering points and server locations on its aggregate throughput and operating costs? (2) How much can ISP-CDNs benefit from using path selection to maximize their aggregate throughput, compared to other CDNs which only have control at the edge? Answering these questions is challenging because content distribution involves a complex ecosystem consisting of many parties (clients, CDNs, ISPs) and depends on various settings which differ across places and over time. We introduce a simple model to illustrate and quantify the essential tradeoffs in CDN designs. Using extensive analysis over a variety of network topologies (with varying numbers of CDN peering points and server locations), operating cost models, and client video streaming traces, we observe that: (1) Doubling the number of peering points roughly doubles the aggregate throughput over a wide range of values and network topologies. In contrast, optimal path selection improves the CDN aggregate throughput by less than 70%, and in many cases by as little as a few percent. (2) Keeping the number of peering points constant, but reducing the number of locations (data centers) at which the CDN is deployed, can significantly reduce operating costs.
Article
Cloud computing, rapidly emerging as a new computation paradigm, provides agile and scalable resource access in a utility-like fashion, especially for the processing of big data. An important open issue here is to efficiently move the data, from different geographical locations over time, into a cloud for effective processing. The de facto approach of hard drive shipping is not flexible or secure. This work studies timely, cost-minimizing upload of massive, dynamically-generated, geo-dispersed data into the cloud, for processing using a MapReduce-like framework. Targeting a cloud encompassing disparate data centers, we model a cost-minimizing data migration problem, and propose two online algorithms: an online lazy migration (OLM) algorithm and a randomized fixed horizon control (RFHC) algorithm, for optimizing at any given time the choice of the data center for data aggregation and processing, as well as the routes for transmitting data there. Careful comparisons among these online and offline algorithms in realistic settings are conducted through extensive experiments, which demonstrate close-to-offline-optimum performance of the online algorithms.
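The OLM and RFHC algorithms themselves are not reproduced here. The sketch below only conveys the "lazy" flavour of the online decision, migrating the aggregation point only once the accumulated routing-cost penalty of staying exceeds the one-off migration cost; all numbers and data-center names are illustrative assumptions, not the paper's algorithm or results.

```python
# Schematic "lazy migration" decision: keep aggregating at the current data
# center until the accumulated extra routing cost justifies paying the
# migration cost. Illustrative numbers only; not the OLM algorithm itself.
def lazy_migration(extra_cost_per_step, migration_cost):
    accumulated, current_dc, log = 0.0, "dc-east", []
    for t, extra in enumerate(extra_cost_per_step):
        accumulated += extra
        if accumulated >= migration_cost:       # staying now costs as much as moving
            current_dc, accumulated = "dc-west", 0.0
            log.append((t, "migrate"))
        else:
            log.append((t, "stay"))
    return log

# Per-step extra cost of routing new data to the current (now suboptimal) site.
print(lazy_migration(extra_cost_per_step=[2, 3, 4, 6, 7], migration_cost=15))
```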