Figure 13
The compression ratio for a 2X APAX compression run is shown. The three curves represent the maximum, minimum, and mean compression rate among the 1024 processes.
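The max/min/mean curves summarize per-rank compression ratios. The sketch below shows one way such statistics could be gathered with mpi4py; it is only an illustration under stated assumptions: zlib stands in for the proprietary APAX codec, and the field size and rank count are made up.

```python
# Minimal sketch: per-rank compression ratio, reduced to max/min/mean.
# zlib stands in for the proprietary APAX codec; sizes are hypothetical.
import zlib
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each rank owns a local block of simulation state (hypothetical field).
local_field = np.random.default_rng(comm.rank).normal(size=100_000)

raw = local_field.tobytes()
compressed = zlib.compress(raw, level=6)
ratio = len(raw) / len(compressed)          # local compression ratio

# Reduce across all ranks to obtain the three curves plotted per time step.
max_ratio = comm.allreduce(ratio, op=MPI.MAX)
min_ratio = comm.allreduce(ratio, op=MPI.MIN)
mean_ratio = comm.allreduce(ratio, op=MPI.SUM) / comm.size

if comm.rank == 0:
    print(f"max={max_ratio:.2f}  min={min_ratio:.2f}  mean={mean_ratio:.2f}")
```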

Source publication
Conference Paper
This paper examines whether lossy compression can be used effectively in physics simulations as a possible strategy to combat the expected data-movement bottleneck in future high performance computing architectures. We show that, for the codes and simulations we tested, compression levels of 3-5X can be applied without causing significant changes t...
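As a rough illustration of the kind of inline lossy compression the paper studies, the sketch below compresses a smooth 3-D field at a fixed error tolerance and reports the achieved ratio. It assumes the zfp Python bindings (zfpy) are installed; zfp is used here only as a readily available lossy compressor, not necessarily the codec evaluated in the paper, and the field and tolerance are invented.

```python
# Hedged sketch: lossy compression of a smooth 3-D field with zfpy,
# used here as an illustrative codec (not necessarily the one studied).
import numpy as np
import zfpy  # assumes the standard zfp Python bindings are installed

# Hypothetical smooth field, e.g. a pressure-like quantity on a 64^3 grid.
x, y, z = np.meshgrid(*([np.linspace(0.0, 1.0, 64)] * 3), indexing="ij")
field = np.sin(4 * np.pi * x) * np.cos(2 * np.pi * y) * np.exp(-z)

tolerance = 1e-4  # absolute error bound (made-up value)
compressed = zfpy.compress_numpy(field, tolerance=tolerance)
restored = zfpy.decompress_numpy(compressed)

ratio = field.nbytes / len(compressed)
max_err = np.max(np.abs(field - restored))
print(f"compression ratio {ratio:.1f}x, max abs error {max_err:.2e}")
```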

Similar publications

Article
This paper examines whether lossy compression can be used effectively in physics simulations as a possible strategy to combat the expected data-movement bottleneck in future high performance computing architectures. We show that, for the codes and simulations we tested, compression levels of 3–5X can be applied without causing significant changes t...

Citations

... Due to its popularity, Lulesh is widely used for testing new hardware and for applying optimizations, parallelization schemes, APIs, and more [26,44]. For instance, Laney et al. [45] used Lulesh and several other benchmarks to assess the effects of data compression on performance, as they are representative of real scientific applications. Bercea et al. [26] evaluated the effectiveness of OpenMP offloading capabilities on the NVIDIA Kepler K40m GPU (2015) using OpenMP version 4.0, which is the latest known evaluation of its kind [26]. ...
Chapter
Over the last decade, most of the increase in computing power has been gained through advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance in various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, has provided offloading capabilities between hosts (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs, the Intel Ponte Vecchio Max 1100 and the NVIDIA A100, were released to the market, with the oneAPI and NVHPC compilers for offloading, respectively. In this work, we present early performance results of OpenMP offloading to these devices, specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware in a representative scientific mini-app (the LULESH benchmark). Our results show that coverage of version 4.5 is nearly complete in both the latest NVHPC and oneAPI tools. However, we observed a lack of support for versions 5.0, 5.1, and 5.2, which is particularly noticeable with NVHPC. From the performance perspective, we found that the PVC1100 and A100 are relatively comparable on the LULESH benchmark. While the A100 is slightly better due to faster memory bandwidth, the PVC1100 scales to the next problem size (400³) thanks to its larger memory. The results are available at: https://github.com/Scientific-Computing-Lab-NRCN/Accel-OpenMP-Portability-Scalability.
... Due to its popularity, Lulesh is widely used for testing new hardware and for applying optimizations, parallelization schemes, APIs, and more [26,44]. For instance, Laney et al. [45] used Lulesh and several other benchmarks to assess the effects of data compression on performance, as they are representative of real scientific applications. Bercea et al. [26] evaluated the effectiveness of OpenMP offloading capabilities on the NVIDIA Kepler K40m GPU (2015) using OpenMP version 4.0, which is the latest known evaluation of its kind [26]. ...
Chapter
In situ visualization and analysis is a valuable yet underutilized commodity for the simulation community. There is hesitance or even resistance to adopting new methodologies due to the uncertainties that in situ holds for new users. There is a perceived implementation cost, maintenance cost, risk to simulation fault tolerance, potential lack of scalability, a new resource cost for running in situ processes, and more. The list of reasons why in situ is overlooked is long. We are attempting to break down this barrier by introducing Inshimtu. Inshimtu is an in situ “shim” library that enables users to try in situ before they buy into a full implementation. It does this by working with existing simulation output files, requiring no changes to simulation code. The core visualization component of Inshimtu is ParaView Catalyst, allowing it to take advantage of both interactive and non-interactive visualization pipelines that scale. We envision Inshimtu as a stepping stone to show users the value of in situ and motivate them to move to one of the many existing fully-featured in situ libraries available in the community. We demonstrate the functionality of Inshimtu with a scientific workflow on the Shaheen II supercomputer. Inshimtu is available for download at: https://github.com/kaust-vislab/Inshimtu-basic.
... Due to its popularity, Lulesh is widely used for testing new hardware and for applying optimizations, parallelization schemes, APIs, and more [43,25]. For instance, Laney et al. [44] used Lulesh and several other benchmarks to assess the effects of data compression on performance, as they are representative of real scientific applications. Bercea et al. [25] evaluated the effectiveness of OpenMP offloading capabilities on the NVIDIA Kepler K40m GPU (2015) using OpenMP version 4.0, which is the latest known evaluation of its kind [25]. ...
Preprint
Over the last decade, most of the increase in computing power has been gained through advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance in various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, has provided offloading capabilities between hosts (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs, the Intel Ponte Vecchio Max 1100 and the NVIDIA A100, were released to the market, with the oneAPI and GNU LLVM-backed compilation for offloading, respectively. In this work, we present early performance results of OpenMP offloading to these devices, specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware in a representative scientific mini-app (the LULESH benchmark). Our results show that the vast majority of the offloading directives in v4.5 and v5.0 are supported in the latest oneAPI and GNU compilers; however, support for v5.1 and v5.2 is still lacking. From the performance perspective, we found that PVC is up to 37% better than the A100 on the LULESH benchmark, presenting better performance in both computation and data movement.
... Due to its popularity, LULESH is commonly used for testing new hardware, applying optimizations, parallelization schemes, APIs, and more [32], [43]. For example, Laney et al. [46] assessed the effects of data compression on the performance of LULESH and several other benchmarks, as they are the most representative of real scientific applications. Furthermore, Bercea et al. [43] evaluated the effectiveness of the OpenMP version on a GPU. ...
Preprint
Supercomputers worldwide provide the necessary infrastructure for groundbreaking research. However, most supercomputers are not designed equally, owing to different desired figures of merit, which are derived from the computational bounds of the targeted scientific applications' portfolio. In turn, the design of such computers becomes an optimization process that strives to achieve the best possible performance in a multi-parameter search space. Therefore, verifying and evaluating whether a supercomputer can achieve its desired goal becomes a tedious and complex task. For this purpose, many full, mini, proxy, and benchmark applications have been introduced in an attempt to represent scientific applications. Nevertheless, as these benchmarks are hard to expand and, most importantly, are over-simplified compared to scientific applications that tend to couple multiple scientific domains, they fail to represent the true scaling capabilities. We suggest a new physical, scalable benchmark framework, namely ScalSALE, based on the well-known SALE scheme. ScalSALE's main goal is to provide a simple, flexible, scalable infrastructure that can be easily expanded to include multi-physical schemes while maintaining scalable and efficient execution times. By expanding ScalSALE, the gap between the over-simplified benchmarks and scientific applications can be bridged. To achieve this goal, ScalSALE is implemented in Modern Fortran with simple OOP design patterns and supported by transparent MPI-3 blocking and non-blocking communication that allows such a scalable framework. ScalSALE is compared to LULESH by simulating the Sedov-Taylor blast wave problem using strong and weak scaling tests. ScalSALE is executed and evaluated with both rezoning options, Lagrangian and Eulerian.
... As scientific datasets continue to grow in size and complexity, adaptive representations have become key to enabling interactive analysis and visualization [1]. Such representations can reduce the memory footprint and processing costs of large-scale data by orders of magnitude, often without perceptible degradation of visualization quality or analysis results [2]. However, existing approaches are limited to either compressed representations of regular grids [3], [4] or multiresolution structures, such as octrees and k-d trees [5], [6]. ...
Article
Adaptive representations are increasingly indispensable for reducing the in-memory and on-disk footprints of large-scale data. Usual solutions are designed broadly along two themes: reducing data precision, e.g., through compression, or adapting data resolution, e.g., using spatial hierarchies. Recent research suggests that combining the two approaches, i.e., adapting both resolution and precision simultaneously, can offer significant gains over using them individually. However, there currently exist no practical solutions to creating and evaluating such representations at scale. In this work, we present a new resolution-precision-adaptive representation to support hybrid data reduction schemes and offer an interface to existing tools and algorithms. Through novelties in spatial hierarchy, our representation, Adaptive Multilinear Meshes (AMM), provides considerable reduction in the mesh size. AMM creates a piecewise multilinear representation of uniformly sampled scalar data and can selectively relax or enforce constraints on conformity, continuity, and coverage, delivering a flexible adaptive representation. AMM also supports representing the function using mixed-precision values to further the achievable gains in data reduction. We describe a practical approach to creating AMM incrementally using arbitrary orderings of data and demonstrate AMM on six types of resolution and precision data streams. By interfacing with state-of-the-art rendering tools through VTK, we demonstrate the practical and computational advantages of our representation for visualization techniques. With an open-source release of our tool to create AMM, we make such evaluation of data reduction accessible to the community, which we hope will foster new opportunities and future data reduction schemes.
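The resolution-versus-precision tradeoff discussed in this abstract can be illustrated with a far simpler block-wise scheme than AMM itself: for each block of a sampled field, keep the cheapest of a coarsened, a reduced-precision, or a full-precision representation that meets a user tolerance. The sketch below is only a toy version of that idea under assumed parameters (block size, tolerance, helper names are invented) and is not the AMM data structure.

```python
# Toy resolution/precision adaptation (NOT AMM): per block, pick the
# cheapest representation that stays within an absolute error tolerance.
import numpy as np

def coarsen(block):
    """2x downsample by averaging 2x2 cells (resolution reduction)."""
    return block.reshape(block.shape[0] // 2, 2, block.shape[1] // 2, 2).mean(axis=(1, 3))

def reconstruct_coarse(coarse):
    """Nearest-neighbour upsample back to the original block shape."""
    return np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)

def choose_representation(block, tol):
    """Return (tag, payload) for the cheapest admissible representation."""
    coarse = coarsen(block)
    if np.max(np.abs(block - reconstruct_coarse(coarse))) <= tol:
        return "coarse", coarse                     # 4x fewer samples
    half = block.astype(np.float16)
    if np.max(np.abs(block - half.astype(np.float64))) <= tol:
        return "float16", half                      # 4x fewer bytes
    return "float64", block                         # full resolution/precision

# Hypothetical smooth 2-D field with a localized feature.
x, y = np.meshgrid(np.linspace(0, 2 * np.pi, 64), np.linspace(0, 2 * np.pi, 64), indexing="ij")
data = np.sin(x) * np.cos(y) + 0.5 * np.exp(-20.0 * ((x - np.pi) ** 2 + (y - np.pi) ** 2))

tol, B = 0.02, 8
tags = [choose_representation(data[i:i + B, j:j + B], tol)[0]
        for i in range(0, 64, B) for j in range(0, 64, B)]
print({t: tags.count(t) for t in set(tags)})  # mix of coarse / float16 / float64 blocks
```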
... Indeed, there exist models that are insensitive to a considerable reduction in the width of the floating-point significand (Hatfield et al. 2018). The use of lossy data compression techniques has also been advocated by showing that compression effects are often unimportant or disappear in post-processing analyses (Baker et al. 2016), and substantial gains in compression ratio can be achieved while keeping the error at an acceptable level in terms of physically motivated metrics (Laney et al. 2013). Consequently, it appears reasonable to adjust the restart data storage to the precision justified by the level of model error. ...
Article
Full-text available
A wavelet-based method for compression of three-dimensional simulation data is presented and its software framework is described. It uses wavelet decomposition and subsequent range coding with quantization suitable for floating-point data. The effectiveness of this method is demonstrated by applying it to example numerical tests, ranging from idealized configurations to realistic global-scale simulations. The novelty of this study is in its focus on assessing the impact of compression on post-processing and restart of numerical simulations.
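The pipeline this abstract describes (wavelet transform, quantization, entropy coding) can be mocked up with PyWavelets. The sketch below uses uniform quantization of the wavelet coefficients and zlib as a stand-in for the range coder, so the ratio and error it reports are only indicative; the field, wavelet, level, and quantization step are all assumptions, not the paper's settings.

```python
# Sketch of a wavelet + quantization + entropy-coding pipeline.
# PyWavelets supplies the transform; zlib stands in for the range coder.
import zlib
import numpy as np
import pywt

rng = np.random.default_rng(1)
data = rng.normal(size=(64, 64, 64)).cumsum(axis=2)  # hypothetical 3-D field

# Multi-level wavelet decomposition, flattened to a single coefficient array.
coeffs = pywt.wavedecn(data, wavelet="db2", level=3)
flat, slices = pywt.coeffs_to_array(coeffs)

# Uniform quantization with a fixed step (made-up value), then "entropy code".
step = 0.05
quantized = np.round(flat / step).astype(np.int32)
payload = zlib.compress(quantized.tobytes(), level=9)

# Decode and invert the transform to measure the reconstruction error.
dequantized = quantized.astype(np.float64) * step
restored = pywt.waverecn(
    pywt.array_to_coeffs(dequantized, slices, output_format="wavedecn"),
    wavelet="db2")[:64, :64, :64]

print(f"ratio {data.nbytes / len(payload):.1f}x, "
      f"max abs error {np.max(np.abs(data - restored)):.3f}")
```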
... The use of lossy compression in numerical simulation has been suggested before: it has been considered for checkpointing numerical simulations [5] and for inline compression of the solution state in [14]. In both cases, it was demonstrated that lossy compression can be used without causing significant changes to important physical quantities. ...
Article
Currently, the dominating constraint in many high performance computing applications is data capacity and bandwidth, in both internode communications and even more so in intranode data motion. A new approach to address this limitation is to make use of data compression in the form of a compressed data array. Storing data in a compressed data array and converting to standard IEEE-754 types as needed during a computation can reduce the pressure on bandwidth and storage. However, repeated conversions (lossy compression and decompression) introduce additional approximation errors, which need to be shown to not significantly affect the simulation results. We extend recent work [J. Diffenderfer et al., SIAM J. Sci. Comput., 41 (2019), pp. A1867-A1898] that analyzed the error of a single use of compression and decompression of the ZFP compressed data array representation [P. Lindstrom, IEEE Trans. Vis. Comput. Graph., 20 (2014), pp. 2674-2683; P. Lindstrom, ZFP version 0.5.5, May 2019] to the case of time-stepping and iterative schemes, where an advancement operator is repeatedly applied in addition to the conversions. We show that the accumulated error for iterative methods involving fixed-point and time-evolving iterations is bounded under standard constraints. An upper bound is established on the number of additional iterations required for the convergence of stationary fixed-point iterations. An additional analysis of traditional forward and backward error of stationary iterative methods using ZFP compressed arrays is also presented. The results of several 1D, 2D, and 3D test problems are provided to demonstrate the correctness of the theoretical bounds.
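A minimal way to see the effect analyzed here is to run the same fixed-point iteration twice, once on plain arrays and once with the state round-tripped through fixed-rate zfp compression after every step, then compare the iterates. The sketch below does that with the zfpy bindings and a simple Jacobi-style smoother; the operator, rate, grid size, and step count are hypothetical and this is not the paper's experimental setup.

```python
# Sketch: fixed-point iteration with the state stored compressed (zfpy)
# and decompressed each step, compared against an uncompressed run.
import numpy as np
import zfpy  # assumes the standard zfp Python bindings are installed

def smooth(u):
    """One Jacobi-like averaging step with fixed boundary values (non-expansive)."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    return v

rng = np.random.default_rng(2)
u_exact = rng.normal(size=(128, 128))
u_zfp = u_exact.copy()

rate = 16.0  # bits per value, fixed-rate mode (made-up choice)
for _ in range(200):
    u_exact = smooth(u_exact)
    # Advance, then round-trip the new state through compression, as a
    # compressed-array backing store would.
    u_zfp = zfpy.decompress_numpy(zfpy.compress_numpy(smooth(u_zfp), rate=rate))

drift = np.max(np.abs(u_exact - u_zfp))
print(f"max deviation after 200 steps at {rate} bits/value: {drift:.2e}")
```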
... As scientific datasets continue to grow in size and complexity, adaptive representations are key to enabling interactive analysis and visualization [11]. Adaptive meshes can reduce both the memory footprint and processing costs of large-scale data by orders of magnitude, often without any perceptible impact on visualization quality or analysis results [43]. However, existing approaches are limited to either compressed representations [37,44,45] of regular grids or multiresolution structures such as octrees and k-d trees [5,60,70]. ...
Preprint
We present Adaptive Multilinear Meshes (AMM), a new framework that significantly reduces the memory footprint compared to existing data structures. AMM uses a hierarchy of cuboidal cells to create a continuous, piecewise multilinear representation of uniformly sampled data. Furthermore, AMM can selectively relax or enforce constraints on conformity, continuity, and coverage, creating a highly adaptive and flexible representation to support a wide range of use cases. AMM supports incremental updates in both spatial resolution and numerical precision, establishing the first practical data structure that can seamlessly explore the tradeoff between resolution and precision. We use tensor products of linear B-spline wavelets to create an adaptive representation and illustrate the advantages of our framework. AMM provides a simple interface for evaluating the function defined on the adaptive mesh, efficiently traversing the mesh, and manipulating the mesh, including incremental, partial updates. Our framework is easy to adopt for standard visualization and analysis tasks. As an example, we provide a VTK interface, through efficient on-demand conversion, which can be used directly by corresponding tools, such as VisIt, disseminating the advantages of faster processing and a smaller memory footprint to a wider audience. We demonstrate the advantages of our approach for simplifying scalar-valued data for commonly used visualization and analysis tasks using incremental construction driven by mixed-resolution and mixed-precision data streams.
... To optimize these interactions, applications often use data compression [38] to reduce the amount of data transmitted between these components or to external storage. For instance, applications such as AstroPortal [39], Community Earth System Model [40], and Particle Physics simulations [41] utilize compression to reduce the cost of data movement within the application. Similarly, many HPC applications [38], [42] perform data compression to reduce the amount of intermediate data produced in the staging servers. ...
... The use of lossy compression in numerical simulation has been suggested before: it has been considered for checkpointing numerical simulations [5] and for inline compression of the solution state in [14]. In both cases, it was demonstrated that lossy compression can be used without causing significant changes to important physical quantities. ...
Preprint
Currently, the dominating constraint in many high performance computing applications is data capacity and bandwidth, in both inter-node communications and even more so in on-node data motion. A new approach to address this limitation is to make use of data compression in the form of a compressed data array. Storing data in a compressed data array and converting to standard IEEE-754 types as needed during a computation can reduce the pressure on bandwidth and storage. However, repeated conversions (lossy compression and decompression) introduce additional approximation errors, which need to be shown to not significantly affect the simulation results. We extend recent work [J. Diffenderfer, et al., Error Analysis of ZFP Compression for Floating-Point Data, SIAM Journal on Scientific Computing, 2019] that analyzed the error of a single use of compression and decompression of the ZFP compressed data array representation [P. Lindstrom, Fixed-rate compressed floating-point arrays, IEEE Transactions on Visualization and Computer Graphics, 2014] to the case of time-stepping and iterative schemes, where an advancement operator is repeatedly applied in addition to the conversions. We show that the accumulated error for iterative methods involving fixed-point and time-evolving iterations is bounded under standard constraints. An upper bound is established on the number of additional iterations required for the convergence of stationary fixed-point iterations. An additional analysis of traditional forward and backward error of stationary iterative methods using ZFP compressed arrays is also presented. The results of several 1D, 2D, and 3D test problems are provided to demonstrate the correctness of the theoretical bounds.