Fig 1 - uploaded by Nathalie Furmento
Graphical view of hwloc topology discovery on our quad-socket quad-core Opteron machine.

Source publication
Article
Full-text available
Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providi...

Citations

... As OpenMP has no concept of memory locality, developers have to consider data placement themselves. Even though several approaches have been presented that extend OpenMP with task-to-data associations and locality-aware scheduling policies [21,135], none of these approaches have gained traction. As a result, the canonical methods for making OpenMP-based applications NUMA-aware are to rely on the first-touch memory allocation policy of the operating system or to use the libnuma API [100] to manually manage the placement of memory resources. ...
Thesis
Full-text available
The heterogeneity of today's state-of-the-art computer architectures is confronting application developers with an immense degree of complexity which results from two major challenges. First, developers need to acquire profound knowledge about the programming models or the interaction models associated with each type of heterogeneous system resource to make efficient use thereof. Second, developers must take into account that heterogeneous system resources always need to exchange data with each other in order to work on a problem together. However, this data exchange is always associated with a certain amount of overhead, which is why the amounts of data exchanged should be kept as low as possible. This thesis proposes three programming abstractions to lessen the burdens imposed by these major challenges with the goal of making heterogeneous system resources accessible to a wider range of application developers. The lib842 compression library provides the first method for accessing the compression and decompression facilities of the NX-842 on-chip compression accelerator available in IBM Power CPUs from user space applications running on Linux. Addressing application development of scale-out GPU workloads, the CloudCL framework makes the resources of GPU clusters more accessible by hiding many aspects of distributed computing while enabling application developers to focus on the aspects of the data parallel programming model associated with GPUs. Furthermore, CloudCL is augmented with transparent data compression facilities based on the lib842 library in order to improve the efficiency of data transfers among cluster nodes. The improved data transfer efficiency provided by the integration of transparent data compression yields performance improvements ranging between 1.11x and 2.07x across four data-intensive scale-out GPU workloads. 
To investigate the impact of programming abstractions for data placement in NUMA systems, a comprehensive evaluation of the PGASUS framework for NUMA-aware C++ application development is conducted. On a wide range of test systems, the evaluation demonstrates that PGASUS does not only improve the developer experience across all workloads, but that it is also capable of outperforming NUMA-agnostic implementations with average performance improvements of 1.56x. Based on these programming abstractions, this thesis demonstrates that by providing a sufficient degree of abstraction, the accessibility of heterogeneous system resources can be improved for application developers without occluding performance-critical properties of the underlying hardware.
... A common method is to build models on performance counters [10]-[12]. While executing the programs, performance metrics that describe the execution behavior are collected to guide the choice of optimization online [6], [13] or offline [14], [15]. For instance, if we observe many remote accesses from a single node, it can be efficient to scatter the pages to the rest of the system. ...
Preprint
Full-text available
There is a large space of NUMA and hardware prefetcher configurations that can significantly impact the performance of an application. Previous studies have demonstrated how a model can automatically select configurations based on the dynamic properties of the code to achieve speedups. This paper demonstrates how the static Intermediate Representation (IR) of the code can guide NUMA/prefetcher optimizations without the prohibitive cost of performance profiling. We propose a method to create a comprehensive dataset that includes a diverse set of intermediate representations along with optimum configurations. We then apply a graph neural network model in order to validate this dataset. We show that our static intermediate representation based model achieves 80% of the performance gains provided by expensive dynamic performance profiling based strategies. We further develop a hybrid model that uses both static and dynamic information. Our hybrid model achieves the same gains as the dynamic models but at a reduced cost by only profiling 30% of the programs.
... The scientific and industry communities are constantly seeking and developing solutions to address the ever-increasing performance demands of software applications. These solutions are often tailored to the particular intricacies of a target application and platform, and they range from software-level and compiler support (KAMIL et al., 2010) to the runtime system (BROQUEDIS et al., 2010) and the underlying hardware architecture. In this way, specialized techniques may be applied and cutting-edge performance achieved. ...
Thesis
Lightweight Manycore (LW Manycore) processors were introduced to deliver performance scalability with low power consumption. To address the former aspect, they rely on specific architectural characteristics, such as a distributed memory architecture and a rich Network-on-Chip (NoC). To achieve low power consumption, they are built with simple low-power Multiple Instruction Multiple Data (MIMD) cores, they have a memory system based on Scratchpad Memories (SPMs), and they exploit heterogeneity by featuring cores with different capabilities. Some industry-successful examples of these processors are the Kalray MPPA-256, the PULP and the Sunway SW26010. While this unique set of architectural features grants LW Manycores performance scalability and energy efficiency, it also introduces multiple challenges in software programmability and portability. First, the high-density circuit integration turns dark silicon into reality. Second, the distributed memory architecture requires data to be explicitly fetched/offloaded from remote memories to local ones. Third, the small amount of on-chip memory forces the software to partition its working data set into chunks and decide which of them should be kept local and which should be offloaded to remote memory. Fourth, the on-chip interconnect invites software engineers to embrace a message-passing programming model. Finally, the on-chip heterogeneity makes the deployment of applications complex. One approach for addressing these challenges is by means of an Operating System (OS). This type of solution strives to bridge the intricacies of an architecture by exposing rich abstractions and programming interfaces, as well as handling resource allocation, sharing and multiplexing. Unfortunately, existing OSes struggle to fully address programmability and portability challenges in LW Manycores, because they were not designed to cope with the architectural features of these processors.
In this context, the main goal of this work is to propose a novel OS for LW Manycores that specifically copes with these uncovered challenges. The main contribution of this work lies in the advancement of resource management in LW Manycore processors. On the one hand, from the scientific perspective, this main contribution may be unfolded into three specific contributions. First, a comprehensive Hardware Abstraction Layer (HAL) that makes the development and deployment of a fully-featured OS for LW Manycores easier, and enables the portability of an OS across multiple of these processors. Second, a rich memory management approach based on a Distributed Paging System (DPS), a novel system-level solution that we devised for managing the memory of an LW Manycore. Third, a lightweight communication facility that manages the on-chip interconnect and exposes primitives with hardware channel multiplexing. On the other hand, as a technical contribution, this work introduces Nanvix, a concrete implementation of an OS for LW Manycore processors that features the aforementioned scientific advancements. Nanvix supports multiple architectures (Bostan, x86, OpenRISC, ARMv8 and RISC-V), runs on bare-metal processors, and exposes rich abstractions and high-level programming interfaces.
... ForestGOMP [3] is an extension of OpenMP that uses hints provided by application programmers, compiler techniques and hardware counters to perform thread and data placement dynamically. Threads that share data or synchronize often are organized in bubbles. ...
Chapter
Full-text available
Software transactional memory (STM) is an abstraction used for thread synchronization that borrows the concept of transactions from databases. It is often easier to use than locks, providing a high-level abstraction for software developers. In current multicore architectures, data locality is an important aspect of STM performance. Sharing-aware mapping is a technique that aims to improve the performance of applications by mapping threads and data (in the form of memory pages) according to their memory access behavior. In prior work, we successfully used information gained from tracking STM variables and STM operations to perform an effective sharing-aware thread mapping. In this paper, we attempt to extend such a mechanism to perform data mapping. Although initial results using a synthetic application were encouraging, data mapping did not improve performance when using realistic workloads. Contrary to thread mapping, where keeping track of STM operations is sufficient to perform an effective thread mapping, data mapping requires a global view of the application's memory page accesses to be able to improve performance, which STM runtimes cannot provide.
... To hide the complexity of NUMA while optimizing performance, runtime-based approaches have been widely exploited, such as NUMA-aware OpenMP [19] and TBB [20]. Task-based runtimes and frameworks also make extensive NUMA-aware optimizations, such as those in PaRSEC [21], Charm++ [22], and Boxlib [23]. ...
Article
The Distributed Shared Memory (DSM) architecture is widely used in today's computer design to mitigate the ever-widening processing-memory gap, and it inevitably exhibits Non-Uniform Memory Access (NUMA) to shared-memory parallel applications. Failure to adapt to the NUMA effect can significantly downgrade application performance, especially on today's manycore platforms with tens to hundreds of cores. However, traditional approaches such as first-touch and memory policies fall short due to false page-sharing, fragmentation, or poor ease of use. In this paper, we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes, and we develop an accompanying NUMA-aware heap manager which eliminates false page-sharing and minimizes fragmentation. Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3-fold as the number of cores used increases.
... In this work, we have used the Linux NUMA-balancing tool to show the impact of data migration on optimization techniques that tune the number of threads dynamically. Other approaches also perform data mapping and data migration, like the ones proposed in [6, 11, 14-16, 21]. We describe these works below, categorizing them as online or offline mapping. ...
... ForestGOMP [5,6], Carrefour [11], kMAF [15], and AsymSched [21] are online approaches that optimize parallel application through data mapping. Online techniques are those that decide the data placement during application execution. ...
... ForestGOMP [5,6] and kMAF [15] aim to improve data locality. ForestGOMP is an extension of the OpenMP runtime environment [5,6]. ...
Article
Full-text available
Many parallel applications do not scale as the number of threads increases, which means that using the maximum number of threads will not always deliver the best outcome in performance or energy consumption. Therefore, many works have already proposed strategies for tuning the number of threads to optimize for performance or energy. Since parallel applications may have more than one parallel region, these tuning strategies can determine a specific number of threads for each of the application's parallel regions, or determine a fixed number of threads for the whole application execution. In the former case, strategies apply Dynamic Concurrency Throttling (DCT), which enables adapting the number of threads at runtime. However, the use of DCT implies overheads, such as creating/destroying threads and cache warm-up. DCT's overhead can be further aggravated in Non-Uniform Memory Access systems, where changing the number of threads may incur remote memory accesses or, more importantly, data migration between nodes. In this way, tuning strategies should not only determine the best number of threads locally, for each parallel region, but also be aware of the impacts of applying DCT. This work investigates how parallel regions may influence each other during DCT employment, showing that data migration may represent a considerable overhead. Effectively, those overheads affect the strategy's solution, impacting the overall application performance and energy consumption. We demonstrate why many approaches will very likely fail when applied to simulated environments or will hardly reach a near-optimum solution when executed on real hardware.
... Runtime systems play critical roles in this scenario. Some studies promote improvements on NUMA machines by leveraging memory-aware scheduling [Broquedis et al. 2010]. Others assess the behavior of task implementations of OpenMP in these environments [Terboven et al. 2012], discussing essential aspects related to task creation overhead, such as the different strategies for performing it, like the single-producer and the parallel-producer patterns. ...
Conference Paper
A Directed Acyclic Graph (DAG) is a high-level abstraction for describing the activities of parallel applications. In the task-based programming paradigm, a DAG contains tasks (nodes) and dependencies (edges). Application performance depends on the choices of the runtime system. Our work evaluates and compares the performance of three runtime systems, GCC/libgomp, LLVM/libomp, and StarPU, for a task-based dense block QR factorization. The obtained results show that while GCC/libgomp achieves up to 5.4% better performance in the best case, it has scalability problems for fine-grained problems with large DAGs. LLVM/libomp and StarPU are more scalable, and StarPU is much faster in task creation and submission than the other runtimes.
... It is well known that the more information is exchanged, the more resources and time are consumed [6]-[9], because network adapter cards and switches always take some time to interpret packets and resend them to other devices. Therefore, MPI may not be the most efficient method for distributed parallel programming, because it always needs more message interaction. ...
Article
Full-text available
A new common OpenMP-based parallel programming method, MPMC (multi-node paralleling model based on multiprocessor devices), is proposed and implemented, based on data separation, to accelerate Super-Resolution (SR) tasks. PanguOS, a common parallel programming system designed with MPMC, is deployed with Secure Shell (SSH) to control devices and Secure Copy (SCP) to transmit the data stream on Ubuntu 16.04, and it performs well on SR tasks with remote sensing images. In experiments with images from the geostationary-orbit earth observation satellite GaoFen (GF)-4, the proposed method achieves almost 2.95 times acceleration on PanguOS deployed with 3 Jetson TX2s compared with a single Jetson TX2.
... Many studies have been conducted to extend shared memory programming languages to NUMA. Broquedis et al. [28] combined a NUMA-aware memory manager with their runtime system to enable dynamic load distribution, utilizing information from the application structure and hardware topology. Olivier et al. [29] proposed a hierarchical scheduling strategy to improve performance. ...
Preprint
Full-text available
We use the XSBench proxy application, a memory-intensive OpenMP program, to explore the source of on-node scalability degradation of a popular Monte Carlo (MC) reactor physics benchmark on non-uniform memory access (NUMA) systems. As background, we present the details of XSBench, a performance abstraction "proxy app" for the full MC simulation, as well as the internal design of the Linux kernel. We explain how the physical memory allocation inside the kernel affects the multicore scalability of XSBench. On a sixteen-core, two-socket NUMA testbed, the scaling efficiency is improved from a nonoptimized 70% to an optimized 95%, and the optimized version consumes 25% less energy than does the nonoptimized version. In addition to the NUMA optimization we evaluate a page-size optimization to XSBench and observe a 1.5x performance improvement, compared with a nonoptimized one.
... Online optimization [3,6] requires very low overhead, which typically limits the scope of the search. Offline optimizations [2,11,32], such as our work, require additional profiling steps and can be sensitive to the input data used when profiling. ...
... There have been many proposals that address Thread Mapping [12,17,27,29,30,34]. ForestGOMP [3] groups threads sharing data close to each other in the memory hierarchy. Wang et al. [34] further optimize thread placement and degree of parallelism via an integer programming model that quantifies bandwidth. ...
Conference Paper
Full-text available
Current multi-socket systems have complex memory hierarchies with significant Non-Uniform Memory Access (NUMA) effects: memory performance depends on the location of the data and the thread. This complexity means that thread- and data-mappings have a significant impact on performance. However, it is hard to find efficient data mappings and thread configurations due to the complex interactions between applications and systems. In this paper we explore the combined search space of thread mappings, data mappings, number of NUMA nodes, and degree-of-parallelism, per application phase, and across multiple systems. We show that there are significant performance benefits from optimizing this wide range of parameters together. However, such an optimization presents two challenges: accurately modeling the performance impact of configurations across applications and systems, and exploring the vast space of configurations. To overcome the modeling challenge, we use native execution of small, representative codelets, which reproduce the system and application interactions. To make the search practical, we build a search space by combining a range of state of the art thread- and data-mapping policies. Combining these two approaches results in a tractable search space that can be quickly and accurately evaluated without sacrificing significant performance. This search finds non-intuitive configurations that perform significantly better than previous works. With this approach we are able to achieve an average speedup of 1.97× on a four node NUMA system.