Figure 7 - uploaded by Ioan Raicu
Content may be subject to copyright.
Logical Distributor/Collector Design

Source publication
Conference Paper
Full-text available
Loosely coupled programming is a powerful paradigm for rapidly creating higher-level applications from scientific programs on petascale systems, typically using scripting languages. This paradigm is a form of many-task computing (MTC) which focuses on the passing of data between programs as ordinary files rather than messages. While it has the sign...

Contexts in source publication

Context 1
... and possibly termination time (once all individual tasks have completed). Until recently, such uncoordinated jobs were primarily run on moderate-scale clusters or on distributed (“grid”) resources. Most clusters were not large enough to encounter IO contention problems such as those described here. Furthermore, cluster nodes generally have considerable local disks suitable for storing large input and output data. The primary problem on such systems has thus been mainly to efficiently stage data and schedule jobs so that they can best benefit from the staged data [Khanna+2006; Khanna+2007]. File IO is a more significant problem with distributed resources. Condor provides a remote IO library that forwards system calls to a shadow process running on the “home” machine where the files actually reside. Global Access to Secondary Storage (GASS [Bester+1999]), available in Globus, takes a different approach, transparently providing a temporary replica cache for input and output files. Our collective IO goes beyond these approaches to intelligently utilize local filesystems, and to provide intermediate file systems, broadcasting of input files, and batching of output files. Unlike Condor remote IO, our approach does not require relinking. Our approach makes it practical for tens to hundreds of thousands of processor cores now (and in a few years, a million cores) to perform concurrent, asynchronous IO operations. These numbers are easily an order of magnitude greater than what has been addressed in any previous implementation.

The requirements described to this point translate into a straightforward design for handling collective IO, which consists of three main components: 1) one or more intermediate file systems (IFSs), enabling data to be placed and cached closer to the computation (from an access-latency and bandwidth perspective) while overcoming the size limitation of the typical RAM-based local file systems that are prevalent in petascale-precursor systems; 2) a data distributor, which replicates sufficiently large common input datasets to intermediate file systems; and 3) a data collector mechanism, which collects output datasets on IFSs and efficiently writes the collected data to large archive files on the GFS. Our implementation of this design, which we have prototyped for performance evaluation, uses simple scripts to coordinate “off the shelf” data management components. All of our prototypes and measurements to date have been done on the Argonne BG/P systems (Surveyor, 4,096 processors, and Intrepid, 163,840 processors). Not all of the design aspects described below exist yet in the prototype; these are indicated in the description. We executed all of our compute tasks under the Falkon lightweight task scheduler [Raicu+2007; Raicu+2008] running under ZeptoOS [ZeptoOS2008]. The structure of the system is shown in overview in Figure 7, and in more detail in Figure 9, which depicts the flow of input and output data in our BG/P-based prototype. Within the BG/P testbed, the RAM-based file system of the local node, which contains about 1 GB of free space, is used as the LFS. For input staging, the LFS of one or more compute nodes is set aside as a “file server” and is dedicated as an IFS for a set of compute nodes. We create large IFSs from fast LFSs by striping IFS contents over several LFS file systems, using the MosaStore file IO service [Al-Kiswany+2007]. Compute nodes access the IFS over the BG/P torus network [BGP].

The creation of the IFS and the partitioning of compute nodes between IFS functions and computing can be done on a per-workload basis, and can vary from workload to workload. In the same manner that compute node and IO node operating systems are booted when a BG/P job is started, the creation of the IFSs and the CN-to-IFS mapping can be performed as a per-workload setup task when compute nodes are provisioned by Falkon [Raicu+2007; Raicu+2008]. This enables the CN-to-IFS ratio to be tailored to the disk space and bandwidth needs of the workflow (Figure 8).

The input distributor stages common input data efficiently to LFS or IFS. This mechanism is used to cache files that will be frequently re-read, or that will be read in inefficient buffer lengths, closer to the compute nodes. The key to this operation is to use broadcast or multicast methods, where available, to move common data from global to local or intermediate file systems. For accessing input data, we stage input datasets as follows:
• Small input datasets are staged from GFS to the LFS of the compute nodes which read them.
• Datasets read by only one task but too large to be staged to an LFS are staged to an IFS of sufficient size.
• All large datasets that are read by multiple tasks are replicated to all IFSs that serve the set of compute nodes involved in a computation.
In our prototype implementation, data is replicated from GFS to multiple IFSs by the Chirp replicate command [Thain+2008] (Steps 1 and 2 in Figure 7). We employ two functions: the first identifies whether a given compute node is a data-serving or an application-executing node; the second maps each executor compute node to its IFS data server. The decision of whether to place an input file on LFS or IFS is made explicitly (i.e., hard-coded in our prototype). Each IFS is mounted on all associated compute nodes and accessed via FUSE.

The output collector gathers (small) output data files from multiple processors and aggregates them into efficient units for transfer to GFS. In this way, we greatly reduce the number of files created on the GFS (which reduces the number of costly file creation operations) and also increase the size of those files (which permits data to be written to GFS in larger, more efficient block sizes and write buffer lengths). The use of the output collector also enables data to be cached on LFS or IFS for later analysis or reprocessing. Our goal is that files which can fit on the LFS are written there by the application program, while larger output files are written directly to IFS, and output files too large to fit on the LFS or IFS are written directly to GFS. (This differentiation is not implemented in the prototype.) In this way, we can optimize the performance of output operations such as file and directory creation and small write operations. The collector operates as follows. When application programs complete, any output data on the LFS is copied to an IFS (Figure 7, Step 3). When the copy is complete, the data is atomically moved to a staging directory, where the following algorithm (Step 4) is used. One consequence of this design is that short tasks can complete more quickly, without each task remaining on a compute node waiting for its data to be written to GFS, as the staging of data from IFS to GFS is handled asynchronously by the collector. The fact that data managed by the output collector on LFSs or IFSs can be retained for subsequent processing makes it possible to re-process the output data of one stage of a workflow far more efficiently than if the data had to be retrieved from GFS. When previously written output does need to be retrieved from GFS, the ability to access files in parallel from a randomly accessible archive (as described below) further improves performance. And intermediate output data that doesn’t need to be retained persistently can be left on LFS or IFS storage without moving it to GFS at all.

To facilitate multi-stage workflows, in which the output of one stage of a parallel computation is consumed by the next, we incorporate two capabilities in our design: 1) the use of an archive format for collective output that can be efficiently re-processed in parallel, and 2) the ability to cache intermediate results on LFS and/or IFS file systems. We base our output collector design on the use of a relatively new archive utility, xar [XAR], which unlike traditional tar (and similar) archive formats includes an updateable XML directory containing the byte offset of each archive member. This directory enables files to be extracted via random access, and hence xar (unlike tar) archives can be processed efficiently in parallel in later stages of a workflow. In the future, it is likely that we can implement parallel IO to an xar archive from multiple compute nodes, enhancing write performance potential even further. To enable testing of such re-processing of derived data from LFS, we employ a prototype of a new primitive collective execution operation, “run task x on all compute nodes”, which enables all previous outputs on LFS to be processed. Our prototype does not yet use xar, but rather tar, which has a similar interface.

We present measurements from the Argonne ALCF BG/P, running under ZeptoOS and Falkon. We have evaluated various features on up to 98,304 (out of 163,840) processors. Dedicated test time on the entire facility is rare, so all tests below were done with the background noise of activity from other jobs running on other processors. Nonetheless, the trends indicated are fairly clear, and we expect that they will be verifiable in future tests in a controlled, dedicated environment. We have made measurements in both areas of the proposed collective IO primitives (denoted as CIO throughout this section): input data distribution and output data collection. We also applied the collective IO primitives to a molecular dynamics docking application at up to 96K processors. Our first set of results investigated how effectively compute nodes can read data from the IFSs (over the torus network), examining various data volumes and various IFS/LFS ratios. We used the lightweight Chirp file system [Thain+2008] and the FUSE interface to read files from IFS to LFS. Figure 11 shows higher aggregate performance with larger files, and with higher ratios, with the best IFS performance reaching 162 MB/s for 100 MB files and a 256:1 ratio. However, as the bandwidth is split between 256 clients, the per-node ...
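A minimal sketch, not the authors' code, of the staging policy and the two helper functions described in the excerpt above. The LFS budget, the 64:1 CN-to-IFS ratio, and all names are illustrative assumptions, not values from the text.

```python
# Sketch of the input-staging policy and the two node-mapping helpers.
# LFS_LIMIT and CN_PER_IFS are assumed values for illustration only.

LFS_LIMIT = 512 * 1024 * 1024          # assumed per-node RAM-disk budget, in bytes
CN_PER_IFS = 64                        # assumed CN-to-IFS ratio

def node_role(rank: int) -> str:
    """First helper: is this compute node a data server or an executor?"""
    return "ifs-server" if rank % CN_PER_IFS == 0 else "executor"

def ifs_for(rank: int) -> int:
    """Second helper: map an executor compute node to its IFS data server."""
    return (rank // CN_PER_IFS) * CN_PER_IFS

def staging_plan(path: str, size: int, readers: list) -> str:
    """Where an input file would be staged under the policy described above."""
    if size <= LFS_LIMIT:
        return f"copy {path} from GFS to the LFS of {len(readers)} reader node(s)"
    if len(readers) == 1:
        return f"stage {path} from GFS to the IFS serving node {readers[0]}"
    ifs_servers = sorted({ifs_for(r) for r in readers})
    # The prototype performs this replication with the Chirp 'replicate' command.
    return f"replicate {path} from GFS to IFS servers {ifs_servers}"

if __name__ == "__main__":
    print(node_role(0), node_role(5))
    print(staging_plan("db/params.dat", 4 * 1024**3, readers=list(range(1, 257))))
```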
Context 2
... as the bandwidth is split between 256 clients, the per-node throughput is only 0.6 MB/s. Computing the per-node throughput for the 64:1 ratio yields 2.3 MB/s, a significant increase. Thus, we conclude that a 64:1 ratio is good when trying to maximize the bandwidth per node. Larger ratios reduce the number of IFSs that need to be managed; however, there are practical limits that prohibit these ratios from being extremely large. In the case of a 512:1 ratio and 100 MB files, our benchmarks failed due to memory exhaustion when 512 compute nodes simultaneously connected to one compute node to transfer the 100 MB file. This needs further analysis. Our next set of experiments used the lightweight MosaStore file system [Al-Kiswany+2007] to explore how effectively we can stripe LFSs to form a larger IFS. Our preliminary results in Figure 12 show that as we increase the degree of striping we get significant increases in aggregate throughput, up from 158 MB/s to 831 MB/s. The best performing configuration was 32 compute nodes aggregating their 2 GB-per-node LFSs into a 64 GB IFS. This aggregation not only increases performance, but also allows compute nodes to keep their IO relatively local when working with large files that do not fit in a single compute node's 2 GB RAM-based LFS. Our final experiment for the input data distribution section focused on how quickly we can distribute data from GFS to a set of IFSs, or potentially to LFSs. As in our previous experiment, we use Chirp (see Figure 13). Chirp has a native ...
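The per-node figures quoted above follow directly from dividing the aggregate IFS bandwidth by the number of clients sharing it; a small worked check (numbers taken from the text):

```python
# Worked check of the per-node throughput figures quoted above.
def per_node(aggregate_mb_s: float, clients: int) -> float:
    return aggregate_mb_s / clients

print(round(per_node(162.0, 256), 2))   # ~0.63 MB/s per node at a 256:1 ratio
# The text reports ~2.3 MB/s per node at 64:1, which implies an aggregate of
# roughly 2.3 * 64 ≈ 147 MB/s for that configuration.
```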
Context 3
... kernel module depends on it. There has been much research on collective operations in the context of the message passing programming paradigm. These operations allow a group of processes to perform a common, pre-defined operation “collectively” on a set of data. For example, the MPI standard [MPI] offers a large number of such operations, from a basic broadcast (delivering an identical copy of data from one source to many destinations), through scatter (delivering a different part of input data from one source to each destination) and its opposite, gather (assembling the result at one destination from its parts available on multiple sources), to reduction operations (like gather, but instead of assembling, the parts of the result are combined). These operations are considered so crucial for the performance of message passing programs that the BG/P provides a separate collective tree network to perform them efficiently in hardware [BGP]. Similarly, collective IO is not a new concept in parallel computing. It is employed, e.g., by ROMIO [Thakur+1999], the most popular MPI-IO implementation, in its generalized two-phase IO implementation. When compute tasks want to perform IO, they first exchange information about their intentions, in an attempt to coalesce many small requests into fewer larger ones (an assumption being that the processes access the same file). When reading, in the first phase the processes issue large read requests, and in the second phase they exchange parts of their read buffers with one another, using efficient MPI communication primitives, so that each process ends up with the data it was interested in. For writing, the two phases are reversed. MPI collective communication and IO operations require applications to be at least loosely synchronous, in that progress must be made in globally synchronized phases and that all processes participate in a collective operation. These conditions restrict the use of standard collective operations in loosely coupled, uncoordinated scenarios, limiting them to initialization time (before any individual tasks start running) and possibly termination time (once all individual tasks have completed). ...
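A minimal sketch of the two-phase read idea described in the excerpt above, not ROMIO's actual implementation. Assumptions: mpi4py is installed, a file named "shared_input.bin" (a hypothetical name) of at least size*size*PIECE bytes is visible to every rank, and records are distributed round-robin across ranks.

```python
# Two-phase read sketch: large contiguous reads, then an all-to-all exchange.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
PIECE = 4096                              # bytes per record

# Phase 1: each rank issues one large, contiguous read of its own block,
# which holds exactly one record for every rank.
with open("shared_input.bin", "rb") as f:
    f.seek(rank * PIECE * size)
    block = f.read(PIECE * size)

# Phase 2: redistribute the pieces with an all-to-all exchange so that each
# rank ends up with the (round-robin strided) records it actually wanted.
pieces = [block[i * PIECE:(i + 1) * PIECE] for i in range(size)]
my_records = b"".join(comm.alltoall(pieces))
```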

Similar publications

Conference Paper
Full-text available
The ever-growing gap between computation and I/O is one of the fundamental challenges for future computing systems. This computation-I/O gap is even larger for modern large-scale high-performance systems due to their state-of-the-art yet decades-long architecture: the compute and storage resources form two cliques that are interconnected with s...

Citations

... The MTC model [70] consists of many, usually sequential, individual tasks that are executed on processor cores, without inter-task communication. The tasks communicate only through the standard file system interfaces, and optimizations are possible [71]. We do not assume in our work the availability of a global file system. ...
Thesis
Full-text available
In this thesis, we present our research work in the field of high performance computing in fluid mechanics (CFD) for cluster and cloud architectures. In general, we propose to develop an efficient solver, called ADAPT, for CFD problem solving, both in a classic setting corresponding to developments in MPI and in a setting that leads us to represent ADAPT as a graph of tasks intended to be scheduled on a cloud computing platform. As a first contribution, we propose a parallelization of the diffusion-convection equation coupled to a linear system in 2D and 3D using MPI. A two-level parallelization is used in our implementation to take advantage of current distributed multicore machines. A balanced distribution of the computational load is obtained by decomposing the domain using METIS, as well as by solving our very large linear system using the parallel solver MUMPS (MUltifrontal Massively Parallel Solver). Our second contribution illustrates how to imagine the ADAPT framework, as depicted in the first contribution, as a service. We transform the framework (in fact, a part of the framework) into a DAG (Directed Acyclic Graph) in order to see it as a scientific workflow. Then we introduce new policies inside the RedisDG workflow engine in order to schedule tasks of the DAG opportunistically. We introduce into RedisDG the possibility to work with dynamic workers (they can leave or enter the computing system as they want) and a multi-criteria approach to decide on the “best” worker to choose to execute a task. Experiments are conducted on
... Socket.IO is a communication mechanism for building real-time apps on various browsers and mobile devices. It blurs the differences between the different transport mechanisms [14], providing care-free, 100% JavaScript real-time communication. ...
Chapter
Full-text available
The application of the Internet of Things (IoT) to smart-home monitoring is discussed, and the design of a smart-home monitoring system is presented. The system is built on a ZigBee scheme and incorporates WiFi, Node.js, and MongoDB, so it supports both local and remote access. On the application side, many functions are implemented, such as electrical control, environmental monitoring, indoor monitoring, abnormality warnings, historical-information queries, and user settings. Compared with traditional smart-home systems, it centers on the monitoring of communities, where the smart homes are the main branches. The system accommodates different scenarios of multiple users and multiple applications with high flexibility and strong extensibility. With the aid of these emerging technologies, the local access path has energy-saving features while the remote access path has good real-time response performance.
... While this architectural change is being deemed necessary to provide the much needed scalability advantage of concurrency and throughput, it cannot be achieved without providing an efficient storage layer for conducting metadata operations. The centralized metadata repository in parallel file systems has been shown to be inefficient at large scale for conducting metadata operations, growing for instance from tens of milliseconds on a single node (four cores) to tens of seconds at 16K-core scales [120,166]. ...
... While this architectural change is being deemed necessary to provide the much needed scalability advantage of concurrency and throughput, it cannot be achieved without providing an efficient storage layer for conducting metadata operations [12]. The centralized metadata repository in parallel file systems has been shown to be inefficient at large scale for conducting metadata operations, growing for instance from tens of milliseconds on a single node (four cores) to tens of seconds at 16K-core scales [12][13]. Similarly, auditing and querying of provenance metadata in a centralized fashion has shown poor performance over distributed architectures [14]. ...
Conference Paper
Full-text available
It has become increasingly important to capture and understand the origins and derivation of data (its provenance). A key issue in evaluating the feasibility of data provenance is its performance, overheads, and scalability. In this paper, we explore the feasibility of a general metadata storage and management layer for parallel file systems, in which metadata includes both file operations and provenance metadata. We experimentally investigate the design question of whether provenance metadata should be loosely coupled or tightly integrated with a file metadata storage system. We consider two systems that have applied similar distributed concepts to metadata management, but each focusing on a single kind of metadata: (i) FusionFS, which implements distributed file metadata management based on distributed hash tables, and (ii) SPADE, which uses a graph database to store audited provenance data and provides a distributed module for querying provenance. Our results on a 32-node cluster show that FusionFS+SPADE is a promising prototype with negligible provenance overhead and promise to scale to petascale and beyond. Furthermore, FusionFS with its own storage layer for provenance capture is able to scale up to 1K nodes on the BlueGene/P supercomputer.
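An illustrative sketch, not FusionFS code, of the distributed-hash-table idea behind decentralized metadata management mentioned in the abstract above: the node that owns a file's metadata is derived by hashing its path, so no central server needs to be consulted. The node names and cluster size are assumptions.

```python
# Hash a file path to the node responsible for its metadata (DHT-style placement).
import hashlib

NODES = [f"node{i:03d}" for i in range(32)]     # assumed 32-node cluster

def metadata_home(path: str, nodes=NODES) -> str:
    digest = hashlib.sha1(path.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

print(metadata_home("/fusionfs/run42/output.tar"))
```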
... The MTC model consists of many, usually sequential, individual applications (called tasks) that are executed on individually addressable system components, such as processor cores, without intertask communication. These tasks communicate only through the standard filesystem interfaces, although optimizations are possible [54]. MTC allows use of large-scale parallel systems with little or no explicit parallel programming. ...
Article
Full-text available
Many-task computing is a well-established paradigm for implementing loosely coupled applications (tasks) on large-scale computing systems. However, few of the model's existing implementations provide efficient, low-latency support for executing tasks that are tightly coupled multiprocessing applications. Thus, a vast array of parallel applications cannot readily be used effectively within many-task workloads. In this work, we present JETS, a middleware component that provides high performance support for many-parallel-task computing (MPTC). JETS is based on a highly concurrent approach to parallel task dispatch and on new capabilities now available in the MPICH2 MPI implementation and the ZeptoOS Linux operating system. JETS represents an advance over the few known examples of multilevel many-parallel-task scheduling systems: it more efficiently schedules and launches many short-duration parallel application invocations; it overcomes the challenges of coupling the user processes of each multiprocessing application invocation via the messaging fabric; and it concurrently manages many application executions in various stages. We report here on the JETS architecture and its performance on both synthetic benchmarks and an MPTC application in molecular dynamics.
... Here, the goal is to help a parallel program written in a particular programming model to realize efficient and portable file I/O beyond UNIX file I/O. A myriad of research efforts design and evaluate efficient and scalable I/O strategies to support middleware concepts such as the description of file data and collective I/O [35], [36], [37], [38]. Again, FGFS is complementary to this approach by being able to give a hint to the I/O middleware as to whether a file would be better off with collective or serial I/O. ...
Conference Paper
Full-text available
Large-scale systems typically mount many different file systems with distinct performance characteristics and capacity. Applications must efficiently use this storage in order to realize their full performance potential. Users must take into account potential file replication throughout the storage hierarchy as well as contention in lower levels of the I/O system, and must consider communicating the results of file I/O between application processes to reduce file system accesses. Addressing these issues and optimizing file accesses requires detailed runtime knowledge of file system performance characteristics and the location(s) of files on them. In this paper, we propose Fast Global File Status (FGFS), a scalable mechanism to retrieve file information, such as its degree of distribution or replication and consistency. We use a novel node-local technique that turns expensive, non-scalable file system calls into simple string comparison operations. FGFS raises the namespace of a locally-defined file path to a global namespace with few or no file system calls to obtain global file properties efficiently. Our evaluation on a large multi-physics application shows that most FGFS file status queries on its executable and 848 shared library files complete in 272 milliseconds or faster at 32,768 MPI processes. Even the most expensive operation, which checks global file consistency, completes in under 7 seconds at this scale, an improvement of several orders of magnitude over the traditional checksum technique.
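A hedged sketch of the node-local idea FGFS describes in the abstract above: classify a path by longest-prefix string comparison against a table of known mount points instead of issuing file system calls to shared storage. The mount-table entries below are assumptions for illustration, not FGFS's actual data structures.

```python
# Classify a path as node-local or shared by pure string comparison.
import os

MOUNT_TABLE = {
    "/tmp":     ("node-local", "not globally consistent"),
    "/scratch": ("node-local", "not globally consistent"),
    "/gpfs":    ("shared",     "globally consistent"),
    "/home":    ("shared",     "globally consistent"),
}

def classify(path: str):
    p = os.path.normpath(path)              # pure string normalization, no I/O
    best = max((m for m in MOUNT_TABLE if p == m or p.startswith(m + "/")),
               key=len, default=None)       # longest-prefix match
    return MOUNT_TABLE.get(best, ("unknown", "unknown"))

print(classify("/gpfs/projects/app/libfoo.so"))   # ('shared', 'globally consistent')
print(classify("/tmp/rank-0/out.dat"))            # ('node-local', 'not globally consistent')
```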
... (a) File/dir create; 16K-cores [2] (b) Read throughput [3] (c) Write efficiency [3] Fig. 2. GPFS performance on IBM BlueGene/P. Metadata operations on parallel file systems can be inefficient at large scale. Early experiments on the BlueGene/P system at 16K-core scales (see Figure 2(a)) show the various costs (wall-clock time measured at a remote processor) for file/directory create on GPFS. ...
... create directory) that took milliseconds on a single core, taking over 1000 seconds at 16K-core scales. [2, 3] Read/Write: Reading performance of common datasets (e.g. application binaries, read-only databases) is challenging. ...
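A rough sketch of the kind of metadata microbenchmark these numbers come from: time N file creates in a single directory. The directory, N, and the single-process loop are arbitrary choices for illustration, not the cited experimental setup.

```python
# Time repeated file-create operations (a simple metadata microbenchmark).
import os, tempfile, time

def time_creates(n: int = 1000) -> float:
    d = tempfile.mkdtemp(prefix="md_bench_")
    t0 = time.perf_counter()
    for i in range(n):
        fd = os.open(os.path.join(d, f"f{i}"), os.O_CREAT | os.O_WRONLY, 0o644)
        os.close(fd)
    return time.perf_counter() - t0

if __name__ == "__main__":
    print(f"{time_creates():.3f} s for 1000 creates")
```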
Article
Full-text available
State-of-the-art yet decades-old architecture of HPC storage systems has segregated compute and storage resources, bringing unprecedented inefficiencies and bottlenecks at petascale levels and beyond. This paper presents FusionFS, a new distributed file system designed from the ground up for high scalability (16K nodes) while achieving significantly higher I/O performance (2.5TB/sec). FusionFS achieves these levels of scalability and performance through complete decentralization, and the co-location of storage and compute resources. It supports POSIX-like interfaces important for ease of adoption and backwards compatibility with legacy applications. It is made reliable through data replication, and it supports both strong and weak consistency semantics. Furthermore, it supports scalable data provenance capture and querying, a much needed feature in large scale scientific computing systems towards achieving reproducible and verifiable experiments.
... This work is limited by its centralized metadata server design, and it does not provide a data management facility for various dataflow patterns of the tasks. A preliminary version of a collective data management system [33] was previously implemented within Swift [29]. That system used RAM disks on the BG/P I/O nodes to cache data before transferring it to GPFS, in order to improve write speeds to GPFS: once intermediate data stored in cache exceeded a size limit, the system flushed the cache to GPFS. ...
Conference Paper
Full-text available
We seek to enable efficient large-scale parallel execution of applications in which a shared filesystem abstraction is used to couple many tasks. Such parallel scripting (many-task computing, MTC) applications suffer poor performance and utilization on large parallel computers because of the volume of filesystem I/O and a lack of appropriate optimizations in the shared filesystem. Thus, we design and implement a scalable MTC data management system that uses aggregated compute node local storage for more efficient data movement strategies. We co-design the data management system with the data-aware scheduler to enable dataflow pattern identification and automatic optimization. The framework reduces the time to solution of parallel stages of an astronomy data analysis application, Montage, by 83.2% on 512 cores; decreases the time to solution of a seismology application, CyberShake, by 7.9% on 2,048 cores; and delivers BLAST performance better than mpiBLAST at various scales up to 32,768 cores, while preserving the flexibility of the original BLAST application.
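A hedged sketch of data-aware placement in the spirit of the system described in the abstract above: dispatch a task to the compute node whose aggregated local storage already holds the largest share of the task's input bytes. All names and the greedy policy are illustrative assumptions, not the cited system's scheduler.

```python
# Greedy data-aware placement: prefer the node holding the most input bytes.
def pick_node(task_inputs: dict, node_contents: dict) -> str:
    """task_inputs: file -> size in bytes; node_contents: node -> set of files."""
    def local_bytes(node):
        return sum(sz for f, sz in task_inputs.items() if f in node_contents[node])
    return max(node_contents, key=local_bytes)

nodes = {"cn0": {"a.in", "db.ref"}, "cn1": {"db.ref"}, "cn2": set()}
print(pick_node({"a.in": 10**6, "db.ref": 10**8}, nodes))   # -> cn0
```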
... This task scheduling mechanism bypasses the normal system scheduler for individual user jobs, reducing a job execution to the time it takes for a short interprocess communications (IPC) message exchange. Second, appropriate data management and movement mechanisms must be used to transfer data among HPC file services (e.g., GPFS [7]) and between intermediate system layers and user processes [8]. While MTC applications typically access parallel file systems over POSIX interfaces, they cannot directly benefit from parallel I/O optimizations [9] as made available in MPI-IO [10]. ...
... Their use of the file system typically appears as many small, uncoordinated accesses to the file system, resulting in poor performance. However, application patterns may be observed and categorized [11] and then exploited by appropriate software [8]. More generally, aggressive caching may be used by distributing data items across caches on the compute sites [12] or by employing a distributed hash table [13]. ...
Article
This report discusses many-task computing (MTC) generically and in the context of the proposed Blue Waters systems, which is planned to be the largest NSF-funded supercomputer when it begins production use in 2012. The aim of this report is to inform the BW project about MTC, including understanding aspects of MTC applications that can be used to characterize the domain and understanding the implications of these aspects to middleware and policies. Many MTC applications do not neatly fit the stereotypes of high-performance computing (HPC) or high-throughput computing (HTC) applications. Like HTC applications, by definition MTC applications are structured as graphs of discrete tasks, with explicit input and output dependencies forming the graph edges. However, MTC applications have significant features that distinguish them from typical HTC applications. In particular, different engineering constraints for hardware and software must be met in order to support these applications. HTC applications have traditionally run on platforms such as grids and clusters, through either workflow systems or parallel programming systems. MTC applications, in contrast, will often demand a short time to solution, may be communication intensive or data intensive, and may comprise very short tasks. Therefore, hardware and software for MTC must be engineered to support the additional communication and I/O and must minimize task dispatch overheads. The hardware of large-scale HPC systems, with its high degree of parallelism and support for intensive communication, is well suited for MTC applications. However, HPC systems often lack a dynamic resource-provisioning feature, are not ideal for task communication via the file system, and have an I/O system that is not optimized for MTC-style applications. Hence, additional software support is likely to be required to gain full benefit from the HPC hardware.
... Thus it is natural that the existing hardware/software architecture is inadequate for MTC applications. Our experience [3] confirms that naïvely running MTC applications on the existing hardware/software stack will often result in a series of problems, including low machine utilization, low scalability, and file system bottlenecks. To address these issues, rather than aiming for a complete reengineering of the resource management stack, this paper explores optimization avenues within the context of existing, heavily-adopted resource management systems: we propose and evaluate scheduling and data-storage mechanisms that address or alleviate the scalability, performance, and load-balancing issues mentioned. ...
... This is an unacceptable overhead for MTC applications that have task durations of a few seconds or less. (3) Task dependency resolution: at the scale of today's largest machines, which are approaching 10^6 cores, task dependency resolution must be done in parallel, yet no such scheme has previously existed for MTC applications. (4) Load balancing: to obtain high machine utilization, MTC applications require workload-specific load balancing techniques. ...
... To address the data management gap (5), we classify data passing according to the usage patterns described in Zhang et al. [3] as common input, unique input, output, and intermediate data. ...
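A sketch, with hypothetical task and file names, of classifying files into the four usage patterns named above (common input, unique input, output, intermediate) by counting how many tasks read and write each file; the counting heuristic is an illustrative assumption.

```python
# Classify workflow files by how many tasks read and write them.
from collections import Counter

def classify_files(tasks: dict) -> dict:
    """tasks: task_id -> {'reads': [files], 'writes': [files]}"""
    readers, writers = Counter(), Counter()
    for t in tasks.values():
        readers.update(t["reads"])
        writers.update(t["writes"])
    labels = {}
    for f in set(readers) | set(writers):
        if writers[f] and readers[f]:
            labels[f] = "intermediate"     # produced and consumed within the workflow
        elif writers[f]:
            labels[f] = "output"           # produced, never read back
        elif readers[f] > 1:
            labels[f] = "common input"     # read by more than one task
        else:
            labels[f] = "unique input"     # read by exactly one task
    return labels

demo = {
    "t1": {"reads": ["db.ref", "a.in"], "writes": ["a.tmp"]},
    "t2": {"reads": ["db.ref", "a.tmp"], "writes": ["a.out"]},
}
print(classify_files(demo))
```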
Conference Paper
Full-text available
Many-Task Computing (MTC) is a new application category that encompasses increasingly popular applications in biology, economics, and statistics. The high inter-task parallelism and data-intensive processing capabilities of these applications pose new challenges to existing supercomputer hardware-software stacks. These challenges include resource provisioning; task dispatching, dependency resolution, and load balancing; data management; and resilience. This paper examines the characteristics of MTC applications which create these challenges, and identifies related gaps in the middleware that supports these applications on extreme-scale systems. Based on this analysis, we propose AME, an Anyscale MTC Engine, which addresses the scalability aspects of these gaps. We describe the AME framework and present performance results for both synthetic benchmarks and real applications. Our results show that AME's dispatching performance linearly scales up to 14,120 tasks/second on 16,384 cores with high efficiency. The overhead of the intermediate data management scheme does not increase significantly up to 16,384 cores. AME eliminates 73% of the file transfer between compute nodes and the global filesystem for the Montage astronomy application running on 2,048 cores. Our results indicate that AME scales well on today's petascale machines, and is a strong candidate for exascale machines.