Graphical view of HWLOC topology discovery on our quad-socket quad-core...

Improving the Accessibility of Heterogeneous System Resources for Application Developers using Programming Abstractions

Thesis

Full-text available

Aug 2022

Max Frederik Plauth

The heterogeneity of today's state-of-the-art computer architectures is confronting application developers with an immense degree of complexity which results from two major challenges. First, developers need to acquire profound knowledge about the programming models or the interaction models associated with each type of heterogeneous system resource to make efficient use thereof. Second, developers must take into account that heterogeneous system resources always need to exchange data with each other in order to work on a problem together. However, this data exchange is always associated with a certain amount of overhead, which is why the amounts of data exchanged should be kept as low as possible. This thesis proposes three programming abstractions to lessen the burdens imposed by these major challenges with the goal of making heterogeneous system resources accessible to a wider range of application developers. The lib842 compression library provides the first method for accessing the compression and decompression facilities of the NX-842 on-chip compression accelerator available in IBM Power CPUs from user space applications running on Linux. Addressing application development of scale-out GPU workloads, the CloudCL framework makes the resources of GPU clusters more accessible by hiding many aspects of distributed computing while enabling application developers to focus on the aspects of the data parallel programming model associated with GPUs. Furthermore, CloudCL is augmented with transparent data compression facilities based on the lib842 library in order to improve the efficiency of data transfers among cluster nodes. The improved data transfer efficiency provided by the integration of transparent data compression yields performance improvements ranging between 1.11x and 2.07x across four data-intensive scale-out GPU workloads. To investigate the impact of programming abstractions for data placement in NUMA systems, a comprehensive evaluation of the PGASUS framework for NUMA-aware C++ application development is conducted. On a wide range of test systems, the evaluation demonstrates that PGASUS does not only improve the developer experience across all workloads, but that it is also capable of outperforming NUMA-agnostic implementations with average performance improvements of 1.56x. Based on these programming abstractions, this thesis demonstrates that by providing a sufficient degree of abstraction, the accessibility of heterogeneous system resources can be improved for application developers without occluding performance-critical properties of the underlying hardware.

Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization

Preprint

Full-text available

Mar 2022

There is a large space of NUMA and hardware prefetcher configurations that can significantly impact the performance of an application. Previous studies have demonstrated how a model can automatically select configurations based on the dynamic properties of the code to achieve speedups. This paper demonstrates how the static Intermediate Representation (IR) of the code can guide NUMA/prefetcher optimizations without the prohibitive cost of performance profiling. We propose a method to create a comprehensive dataset that includes a diverse set of intermediate representations along with optimum configurations. We then apply a graph neural network model in order to validate this dataset. We show that our static intermediate representation based model achieves 80% of the performance gains provided by expensive dynamic performance profiling based strategies. We further develop a hybrid model that uses both static and dynamic information. Our hybrid model achieves the same gains as the dynamic models but at a reduced cost by only profiling 30% of the programs.

Nanvix : A Distributed Operating System for Lightweight Manycore Processors

Thesis

Sep 2021

Pedro Henrique Penna

Lightweight Manycore (LW Manycore) processors were introduced to deliver performancescalability with low-power consumption. To address the former aspect, they rely on specificarchitectural characteristics, such as a distributed memory architecture and a rich Network-on-Chip (NoC). To achieve low-power consumption, they are built with simple low-power MultipleInstruction Multiple Data (MIMD) cores, they have a memory system based on ScratchpadMemories (SPMs) and they exploit heterogeneity by featuring cores with different capabilities.Some industry-successful examples of these processors are the Kalray MPPA-256, the PULP andthe Sunway SW26010. While this unique set of architectural features grant to LW Manycoresperformance scalability and energy efficiency, they also introduce multiple challenges in softwareprogrammability and portability. First, the high density circuit integration turns dark silicon intoreality. Second, the distributed memory architecture requires data to be explicitly fetched/of-floaded from remote memories to local ones. Third, the small amount of on-chip memory forcesthe software to partition its working data set into chunks and decide which of them should be keptlocal and which should be offloaded to remote memory. Fourth, the on-chip interconnect invitessoftware engineers to embrace a message-passing programming model. Finally, the on-chipheterogeneity makes the deployment of applications complex. One approach for addressing thesechallenges is by means of an Operating System (OS). This type of solution craves to bridgeintricacies of an architecture, by exposing rich abstractions and programming interfaces, as wellas handling resource allocation, sharing and multiplexing. Unfortunately, existing OSes struggleto fully address programmability and portability challenges in LW Manycores, because they werenot designed to cope with architectural features of these processors. In this context, the maingoal of this work boils down to propose a novel OS for LW Manycores that specifically copeswith these uncovered challenges. The main contribution of this work lies with the advancementsof resource management in LW Manycore processors. On the one hand, from the scientificperspective this main contribution may be unfolded in three specific contributions. First, acomprehensive Hardware Abstraction Layer (HAL) that makes the development and deploymentof a fully-featured OS for LW Manycores easier, as well as it enables the portability of an OSacross multiple of these processors. Second, a rich memory management approach that is basedon Distributed Paging System (DPS). This is a novel system-level solution that we devisedfor managing memory of a LW Manycore. Third, a lightweight communication facility thatmanages the on-chip interconnect and exposes primitives with hardware channel multiplexing.On the other hand, as a technical contribution, this work introduces Nanvix. This is a concreteimplementation of an OS for LW Manycore processor that features the aforementioned scientificadvancements. Nanvix supports multiple architectures (Bostan, x86, OpenRISC, ARMv8 andRISC-V), runs on baremetal processors, exposes rich abstractions and high-level programminginterfaces.

Sharing-Aware Data Mapping in Software Transactional Memory

Chapter

Full-text available

Jul 2021

Software transactional memory (STM) is an abstraction used for thread synchronization that borrows the concept of transactions from databases. It is often easier to use than locks, proving a high-level abstraction for software developers. In current multicore architectures, data locality is an important aspect of STM performance. Sharing-aware mapping is a technique that aims to improve the performance of applications by mapping threads and data (in the form of memory pages) according to their memory access behavior. In prior work, we successfully used information gained from tracking STM variables and STM operations to perform an effective sharing-aware thread mapping. In this paper, we attempt to extend such a mechanism to perform data mapping. Although initial results using a synthetic application were encouraging, data mapping did not improve performance when using realistic workloads. Contrary to thread mapping, where only keeping track of STM operations is sufficient to perform an effective thread mapping, data mapping requires a global vision of memory page accesses of the application to be able to improve the performance, which STM runtimes can not provide.

PsmArena: Partitioned shared memory for NUMA-awareness in multithreaded scientific applications

Article

Jun 2021

The Distributed Shared Memory (DSM) architecture is widely used in today's computer design to mitigate the ever-widening processing-memory gap, and it inevitably exhibits Non-Uniform Memory Access (NUMA) to shared-memory parallel applications. Failure to adapt to the NUMA effect can significantly downgrade application performance, especially on today's manycore platforms with tens to hundreds of cores. However, traditional approaches such as first-touch and memory policy fall short in false page-sharing, fragmentation, or ease of use. In this paper, we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes and develop an accompanying NUMA-aware heap manager which eliminates false page-sharing and minimizes fragmentation. Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications to adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 folds with the increased use of cores.

Dynamic concurrency throttling on NUMA systems and data migration impacts

Article

Full-text available

Jun 2021
DES AUTOM EMBED SYST

Many parallel applications do not scale as the number of threads increases, which means that using the maximum number of threads will not always deliver the best outcome in performance or energy consumption. Therefore, many works have already proposed strategies for tuning the number of threads to optimize for performance or energy. Since parallel applications may have more than one parallel region, these tuning strategies can determine a specific number of threads for each application’s parallel region, or determine a fixed number of threads for the whole application execution. In the former case, strategies apply Dynamic Concurrency Throttling (DCT), which enables adapting the number of threads at runtime. However, the use of DCT implies on overheads, such as creating/destroying threads and cache warm-up. DCT’s overhead can be further aggravated in Non-uniform Memory Access systems, where changing the number of threads may incur in remote memory accesses or, more importantly, data migration between nodes. In this way, tuning strategies should not only determine the best number of threads locally, for each parallel region, but also be aware of the impacts when applying DCT. This work investigates how parallel regions may influence each other during DCT employment, showing that data migration may represent a considerable overhead. Effectively, those overheads affect the strategy’s solution, impacting the overall application performance and energy consumption. We demonstrate why many approaches will very likely fail when applied to simulated environments or will hardly reach a near-optimum solution when executed in real hardware.

OpenMP and StarPU Abreast: the Impact of Runtime in Task-Based Block QR Factorization Performance

Conference Paper

Nov 2019
Acoust Speech Signal Process Newslett IEEE

Directed Acyclic Graph (DAG) is a high-level abstraction to describe the activities of parallel applications. A DAG contains tasks (nodes) and dependencies (edges) in the task-based programming paradigm. Application performance depends on the choices of the runtime system. Our work intends to evaluate and compare the performance of three different runtime systems, GCC/libgomp, LLVM/libomp, and StarPU for a task-based dense block QR factorization. The obtained results show that while GCC/libgomp achieves up to 5.4% better performance in the best case, it has scalability problems for finegrain problems with large DAGs. LLVM/libomp and StarPU are more scalable, and StarPU is much faster in task creation and submission than the other runtimes.

An OpenMP-Based Algorithm for Multi-Nodes Computational of Super-Resolution

Article

Full-text available

Nov 2019
J Phys Conf

A new common OpenMP based parallel programming method MPMC (multi-node paralleling model base on multiprocessor devices) is proposed and implemented for data separation based to accelerate Super-Resolution (SR) task. PanguOS, a common parallel programming system designed with MPMC, is deployed with Secure Shell (SSH) to control devices and Secure Copy (SCP) to transmit the data stream on Ubuntu 16.04, and it has a good performance for SR task with remote sensing images. Experiments with images from geostationary-orbit earth observing satellite GaoFen(GF)-4, the method proposed can achieve almost 2.95 times acceleration at PanguOS, deployed with 3 Jetson TX2s, than a single Jetson TX2.

Improving the scalabiliy of neutron cross-section lookup codes on multicore NUMA system

Preprint

Full-text available

Sep 2019

We use the XSBench proxy application, a memory-intensive OpenMP program, to explore the source of on-node scalability degradation of a popular Monte Carlo (MC) reactor physics benchmark on non-uniform memory access (NUMA) systems. As background, we present the details of XSBench, a performance abstraction "proxy app" for the full MC simulation, as well as the internal design of the Linux kernel. We explain how the physical memory allocation inside the kernel affects the multicore scalability of XSBench. On a sixteen-core, two-socket NUMA testbed, the scaling efficiency is improved from a nonoptimized 70% to an optimized 95%, and the optimized version consumes 25% less energy than does the nonoptimized version. In addition to the NUMA optimization we evaluate a page-size optimization to XSBench and observe a 1.5x performance improvement, compared with a nonoptimized one.

Efficient thread/page/parallelism autotuning for NUMA systems

Conference Paper

Full-text available

Jun 2019

Current multi-socket systems have complex memory hierarchies with significant Non-Uniform Memory Access (NUMA) effects: memory performance depends on the location of the data and the thread. This complexity means that thread- and data-mappings have a significant impact on performance. However, it is hard to find efficient data mappings and thread configurations due to the complex interactions between applications and systems. In this paper we explore the combined search space of thread mappings, data mappings, number of NUMA nodes, and degree-of-parallelism, per application phase, and across multiple systems. We show that there are significant performance benefits from optimizing this wide range of parameters together. However, such an optimization presents two challenges: accurately modeling the performance impact of configurations across applications and systems, and exploring the vast space of configurations. To overcome the modeling challenge, we use native execution of small, representative codelets, which reproduce the system and application interactions. To make the search practical, we build a search space by combining a range of state of the art thread- and data-mapping policies. Combining these two approaches results in a tractable search space that can be quickly and accurately evaluated without sacrificing significant performance. This search finds non-intuitive configurations that perform significantly better than previous works. With this approach we are able to achieve an average speedup of 1.97× on a four node NUMA system.

Graphical view of HWLOC topology discovery on our quad-socket quad-core OPTERON machine.

Citations