Fig. 2. Memory bandwidth on the SNB and the Intel Xeon Phi system, measured with the STREAM benchmark.

Source publication
Conference Paper
The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for using accelerator-specific programming paradigms. Because of its nat...

Context in source publication

Context 1
... STREAM benchmark [12] is a standard package to measure the available memory bandwidth on a system. Fig. 2 shows the results for the Triad vector operation (ā = b̄ + x · c̄) for a memory footprint of roughly 2 GB. A good thread-to-core mapping was ensured with the affinity support of the Intel compiler. We set KMP_AFFINITY to ... (Fig. 3 caption: Memory latency on the SNB system and the Xeon Phi system for different memory footprints.) A random ...
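For reference, here is a minimal sketch of an OpenMP Triad kernel in the spirit of the measurement described above; the array size, timing harness, and affinity handling are illustrative assumptions, not the authors' exact setup.

    // Sketch of the STREAM Triad kernel a[i] = b[i] + x * c[i], parallelized with OpenMP.
    // N is chosen so three double arrays give a footprint of roughly 2 GB (assumption).
    // Thread placement would be controlled externally, e.g. via the KMP_AFFINITY
    // environment variable of the Intel compiler (the exact value is not shown here).
    #include <omp.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t N = 90'000'000;            // 3 * N * 8 bytes ~= 2.16 GB
        const double x = 3.0;
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (std::size_t i = 0; i < N; ++i)
            a[i] = b[i] + x * c[i];
        double t1 = omp_get_wtime();

        // Triad touches three arrays per iteration: two loads and one store.
        double gbytes = 3.0 * N * sizeof(double) / 1e9;
        std::printf("Triad bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));
        return 0;
    }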

Similar publications

Preprint
In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc 3. This system, called Dibona, has been integrated by ATOS/Bull, and it is powered by Marvell's latest CPU, the ThunderX2. This CPU is the same one that powers the Astra supercompu...

Citations

... OpenMP differs from MPI [8], which is designed for distributed-memory parallelism, by targeting shared-memory parallelism within a single program. Compared to CUDA [9] and OpenCL [10], which are geared towards throughput-optimized accelerators (and in the case of CUDA, a specific vendor), OpenMP provides a portable solution that works across different platforms, including general-purpose CPUs, accelerators [11], GPUs [12]-[14], and FPGAs [15]. Additionally, OpenMP's compatibility with multiple programming languages (specifically C, C++, Fortran, and partially Python [16], [17]) sets it apart from other GPU programming languages such as SYCL [18] (C++ only) and OpenCL (C and C++). ...
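As a rough illustration of this portability argument (example code, not taken from the cited works), the same saxpy loop can be written once for host-side OpenMP threading and once for OpenMP target offload to whatever accelerator the runtime exposes; the function names are placeholders.

    #include <omp.h>

    // Shared-memory threading on a general-purpose CPU.
    void saxpy_host(float* y, const float* x, float a, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Offload of the same loop to a device (GPU, coprocessor, ...) via OpenMP 4.x target.
    void saxpy_device(float* y, const float* x, float a, int n) {
        #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }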
Preprint
In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is a lack of comprehensive data on the actual usage of OpenMP constructs, hindering unbiased insights into its popularity and evolution. This paper presents a statistical analysis of OpenMP usage and adoption trends based on a novel and extensive database, HPCORPUS, compiled from GitHub repositories containing C, C++, and Fortran code. The results reveal that OpenMP is the dominant parallel programming model, accounting for 45% of all analyzed parallel APIs. Furthermore, it has demonstrated steady and continuous growth in popularity over the past decade. Analyzing specific OpenMP constructs, the study provides in-depth insights into their usage patterns and preferences across the three languages. Notably, we found that while OpenMP has a strong "common core" of constructs in common usage (while the rest of the API is less used), there are new adoption trends as well, such as simd and target directives for accelerated computing and task for irregular parallelism. Overall, this study sheds light on OpenMP's significance in HPC applications and provides valuable data for researchers and practitioners. It showcases OpenMP's versatility, evolving adoption, and relevance in contemporary parallel programming, underlining its continued role in HPC applications and beyond. These statistical insights are essential for making informed decisions about parallelization strategies and provide a foundation for further advancements in parallel programming models and techniques.
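As a small illustrative sketch (not code from HPCORPUS) of two of the constructs the study singles out, the simd directive asks for vectorization of a regular loop, while task expresses irregular parallelism; the recursive Fibonacci example is a standard textbook use of tasking, assumed here only for demonstration.

    #include <omp.h>

    // Vectorization hint for a regular loop.
    void scale(double* v, double s, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            v[i] *= s;
    }

    // Irregular, recursive parallelism with tasks; call from inside a
    // "#pragma omp parallel" region followed by "#pragma omp single".
    long fib(int n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }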
... Iwainsky et al. used empirical performance models to study the scalability of OpenMP constructs on several machines, including IBM Blue Gene/Q and Intel Xeon Phi [31]. In [32], the authors studied the performance and scalability of OpenMP programs on Xeon Phi in stand-alone mode, and they compared it with a two-socket Xeon-based system. ...
... Several studies concentrated on one specific application [33][34][35], while others focused on smaller compute kernels as well as NAS parallel benchmarks [32,36,37]. Recent performance studies of HPC architectures, more specifically Blue Genes and Xeon Phi, concentrated on specific compute-intensive applications, such as drug discovery [26], lattice QCD [27], molecular dynamics [28], a DNA sequence aligner [38], pseudospectral ultrasound simulations [39], and the Alpha Magnetic Spectrometer [40]. ...
Article
Performance analysis plays an essential role in achieving a scalable performance of applications on massively parallel supercomputers equipped with thousands of processors. This paper is an empirical investigation to study, in depth, the performance of two of the most common High-Performance Computing architectures in the world. IBM has developed three generations of Blue Gene supercomputers—Blue Gene/L, P, and Q—that use, at a large scale, low-power processors to achieve high performance. Better CPU core efficiency has been empowered by a higher level of integration to gain more parallelism per processing element. On the other hand, the Intel Xeon Phi coprocessor armed with 61 on-chip x86 cores, provides high theoretical peak performance, as well as software development flexibility with existing high-level programming tools. We present an extensive evaluation study of the performance peaks and scalability of these two modern architectures using SPEC OMP benchmarks.
... Hence, compared to other platforms such as GPU, FPGA and ASIC, Phi incurs much lower overhead of programming and development. This is especially beneficial for large-scale legacy codes [33]. ...
... Table 2 shows the architectural parameters of selected models of CPU, GPU and Phi. (The rest of this snippet is a flattened table listing, per Xeon Phi model, the works that use it: x100-series (KNC) models 3120, 3120A, 3120P, 31SP, 31S1P, 5100, 5110P, 5120D, 5120P, 5510P, SE10P, SE10X, 7110P, 7110X, 7120A, 7120D, 7120P, 7120X, and KNL x200 models 7210P, 7210F, 7210, 7230, 7230F, 7250, 7250F, 7290, 7290F, 7290B, each followed by its citations.) ...
... Schmidl et al. [33] compare the performance of a Phi (whose configuration is similar to the 5110P) with that of a Sandy Bridge CPU. They note that the bandwidth of the Phi is highest for 60 threads and degrades as the number of threads increases. ...
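A tiny sketch of how such a thread sweep is typically set up (an assumption for illustration, not the cited survey's harness), varying the team size around a bandwidth-bound kernel:

    #include <omp.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t N = 50'000'000;
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
        // Example thread counts roughly corresponding to 0.5x/1x/2x/4x of 60 cores.
        for (int threads : {30, 60, 120, 240}) {
            omp_set_num_threads(threads);
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (std::size_t i = 0; i < N; ++i)
                a[i] = b[i] + 3.0 * c[i];
            double gbs = 3.0 * N * sizeof(double) / 1e9 / (omp_get_wtime() - t0);
            std::printf("%3d threads: %.1f GB/s\n", threads, gbs);
        }
        return 0;
    }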
Article
Intel's Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer-architecture and high-performance computing.
... Programming language and compiler designers for parallel computing have always aimed for a unified programming abstraction that can run on multiple devices. ANSI C/C++ extensions like CUDA (for NVIDIA GPUs) [1], OpenCL (CPUs, GPUs, DSPs, FPGAs), OpenMP (CPU, Xeon Phi, GPUs) [21,28], and OpenACC [31] are the popular choices for parallel programming on a heterogeneous computing platform. While the above-mentioned mechanisms provide some level of abstraction over the hardware, these approaches are still quite low level, in that the programmer is responsible for writing the code necessary for the parallel features within the application. ...
Conference Paper
Heterogeneous compute architectures like multi-core CPUs, CUDA GPUs, and Intel Xeon Phis have become prevalent over the years. While heterogeneity makes architecture-specific features available to the programmer, it also makes application development difficult, as one needs to plan for optimal usage of architectural features, suitable partitioning of the workload, and communication and data transfer among the participating devices. A suitable design abstraction that hides such variabilities of the underlying devices and at the same time exploits their computing capabilities can improve developer productivity. In this work, we present "ThrustHetero", a lightweight framework based on NVIDIA's Thrust that provides an abstraction over several devices such as GPUs, Xeon Phis and multicore CPUs, yet allows developers to easily leverage the full compute capability of these devices. We also demonstrate a novel method for workload distribution in two stages: micro-benchmarking during framework installation to find good proportions, and then using this information during application execution. We consider four classes of applications, grouped by the amount of branching present, according to how they would perform on various computing architectures. We show that the framework produces good workload distribution proportions for each class of application, and we also show that the framework is scalable and portable. Further, we compare the performance and ease of development when using the framework with the native versions of various benchmarks and obtain favorable results.
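The two-stage workload distribution described above can be illustrated with a small sketch (assumed code, not ThrustHetero's actual API): a micro-benchmark run once at installation time yields a relative device throughput, and that proportion is later used to split each workload between devices.

    #include <cstddef>

    struct Split { std::size_t gpu_items, cpu_items; };

    // Stage 1 ("installation time"): a micro-benchmark would measure relative device
    // throughput, e.g. gpu_rate / (gpu_rate + cpu_rate). The 0.7 below is an assumption.
    double load_gpu_fraction() { return 0.7; }

    // Stage 2 (application execution): partition the workload according to that fraction.
    Split partition(std::size_t n) {
        double f = load_gpu_fraction();
        std::size_t gpu = static_cast<std::size_t>(n * f);
        return { gpu, n - gpu };
    }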
... Technically, OpenMP can be as good a choice as CUDA or OpenCL for writing a parallel program, as the directives are completely device agnostic. The OpenMP initiative is strongly supported by Intel, and it is mainly used in task-parallel, multi-/many-core applications as well as for vector-processing applications that run on Intel's Xeon Phi coprocessor [34]. ...
Article
Every other personal computer today is provided with a coprocessor, making it a heterogeneous computing environment. As the heterogeneous and high-performance computing (HPC) infrastructure becomes a commodity, the need to improve software development productivity to build efficient parallel programs for this infrastructure becomes all the more crucial. While the mainstream software development methodology focuses on modular design, reusability, ease of understanding and so on, parallel program development emphasizes performance, optimal use of a hardware resource, scalability, execution correctness, and portability across multiple hardware platforms. In this paper, we identify a few unique software development productivity requirements for heterogeneous systems. These requirements are concerned with design abstraction, reusability, and design verification. While these requirements are applicable to conventional software as well, their implications are far reaching in the context of parallel programs. Here we discuss significant efforts in building tools and frameworks to i) provide powerful abstraction over the hardware, ii) build software libraries for parallel hardware access and iii) implement verification mechanisms to check the correctness of a program's behavior in a heterogeneous runtime environment. We also identify several important gaps in the existing work that need to be addressed in order to make the current body of work useful in practice.
... This provides a better abstraction than commonly known threading libraries such as the POSIX thread library [12]. The Intel Xeon Phi coprocessor supports the OpenMP programming model, removing the necessity to modify codes to use coprocessor-specific programming [16]. ...
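To illustrate the abstraction gap this refers to, here is a hedged sketch comparing an OpenMP reduction with an equivalent hand-written POSIX-threads version; the function names and the fixed thread count of four are assumptions made for the example only.

    #include <omp.h>
    #include <pthread.h>

    // OpenMP: one directive parallelizes the loop and performs the reduction.
    void sum_openmp(const double* v, int n, double* out) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < n; ++i) s += v[i];
        *out = s;
    }

    // POSIX threads: explicit argument structs, thread creation, and a manual reduction.
    struct Chunk { const double* v; int lo, hi; double partial; };

    static void* worker(void* p) {
        Chunk* c = static_cast<Chunk*>(p);
        c->partial = 0.0;
        for (int i = c->lo; i < c->hi; ++i) c->partial += c->v[i];
        return nullptr;
    }

    void sum_pthreads(const double* v, int n, double* out) {
        const int T = 4;                         // fixed thread count for brevity
        pthread_t tid[T];
        Chunk chunks[T];
        for (int t = 0; t < T; ++t) {
            chunks[t] = { v, t * n / T, (t + 1) * n / T, 0.0 };
            pthread_create(&tid[t], nullptr, worker, &chunks[t]);
        }
        double s = 0.0;
        for (int t = 0; t < T; ++t) {
            pthread_join(tid[t], nullptr);
            s += chunks[t].partial;
        }
        *out = s;
    }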
... Schmidl et al. [27] assess the performance of OpenMP programs on the new Intel Xeon Phi coprocessor and on the Intel Sandy Bridge (SNB). The first one, offering more than 60 cores, can act as an accelerator, as in a GPU model, or as a standalone SMP. The work [27] studies the performance of the Xeon Phi and SNB using kernels, benchmark codes and four real-world applications, achieving a speedup of more than 100 on the Xeon Phi compared to single-core execution. It shows that OpenMP programs can be ported to this state-of-the-art CPU with almost no modification, in contrast to GPU-based approaches [27]. Despite the described outstanding performance of the Xeon Phi, this processor is still not widely present in the systems of most ClamAV users. ...
Article
Increasingly sophisticated antivirus (AV) software and the growing amount and complexity of malware demand more processing power from personal computers, specifically from the central processing unit (CPU). This paper presents performance tests with Clam AntiVirus (ClamAV) and improves its performance through parallel processing on multiple cores using the Open Multi-Processing (OpenMP) library. All the tests used the same dataset, consisting of 1.33 GB of data distributed among 2766 files of different sizes. The new parallel version of ClamAV implemented in our work achieved an execution time around 62% lower than the original software version, reaching a speedup of 2.6. The main contribution of this work is to propose and implement a new version of the ClamAV antivirus using parallel processing with OpenMP, easily portable to a variety of hardware platforms and operating systems.
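A minimal sketch of the kind of file-level OpenMP parallelization the abstract describes, assuming a hypothetical scan_file() stand-in for ClamAV's scanning routine (the paper's actual code and scheduling choices are not reproduced here):

    #include <omp.h>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for a ClamAV scan call; returns the number of detections.
    int scan_file(const std::string& path) { (void)path; return 0; }

    int scan_all(const std::vector<std::string>& files) {
        int detections = 0;
        // Dynamic scheduling helps balance the load when file sizes differ widely.
        #pragma omp parallel for schedule(dynamic) reduction(+:detections)
        for (std::size_t i = 0; i < files.size(); ++i)
            detections += scan_file(files[i]);
        return detections;
    }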
... This provides a better abstraction than commonly known threading libraries such as the POSIX thread library [10]. The Intel Xeon Phi coprocessor supports the OpenMP programming model, which allows an OpenMP program to be ported to the coprocessor seamlessly [17]. Intel has provided a proprietary directive-based framework called Cilk Plus that allows performing both task- and data-parallel activities on Intel multi- and manycore hardware. ...
Conference Paper
High-performance computing applications are far more difficult to write; therefore, practitioners expect well-tuned software to last long and provide optimized performance even when the hardware is upgraded. It may also be necessary to write software using sufficient abstraction over the hardware so that it is capable of running on heterogeneous architectures. A good design abstraction paradigm strikes a balance between abstraction and visibility over the hardware. This allows the programmer to write applications without having to understand the hardware nuances while exploiting the computing power optimally. In this paper we have analyzed the power of design abstraction of a popular framework called Thrust, from both ease-of-programming and performance perspectives. We have shown that, while the Thrust framework is good at describing an algorithm compared to the native CUDA or OpenMP version, it has quite a few design limitations. With respect to CUDA, it does not provide the programmer any abstraction over shared, texture or constant memory usage. We have compared the performance of a Thrust application code on the CUDA, OpenMP and CPP backends with respect to the native versions (implementing exactly the same algorithm) written for these backends, and found that the current Thrust version performs poorly in most cases. While we conclude that the framework is not ready for writing applications that can exploit optimal performance from the hardware, we also highlight the improvements necessary for the framework to make the performance comparable.
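For context, a short Thrust sketch (not the paper's code) of the style of algorithm description being evaluated; the saxpy functor is an assumed example, and which backend it targets (CUDA, OpenMP or sequential C++) is selected at build time via THRUST_DEVICE_SYSTEM rather than in the source.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // y = a * x + y expressed as a Thrust transform. Note that nothing here exposes
    // shared, texture or constant memory, which is one of the limitations the paper notes.
    struct Saxpy {
        float a;
        __host__ __device__ float operator()(float xi, float yi) const { return a * xi + yi; }
    };

    void saxpy(thrust::device_vector<float>& y,
               const thrust::device_vector<float>& x, float a) {
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), Saxpy{a});
    }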
... They affirmed that common OpenMP codes could be easily migrated to Intel Xeon Phi, gaining more parallel performance without adding overheads. This study was enhanced in [39]. [16] also tested the Xeon Phi through the development of some microbenchmarks. ...
Article
The Intel Xeon Phi accelerator is one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning its performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library used as a benchmark. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with conventional multicore systems for the execution of speculatively parallelized code.
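As a small illustration of the vectorization point (an assumed example, not code from the cited article), the loop below combines OpenMP threading with an explicit simd hint, which is the kind of SIMD exploitation the Xeon Phi's wide vector units need in order to be competitive.

    #include <omp.h>

    // Combined threading and vectorization; aligned(...) assumes 64-byte aligned buffers.
    void axpy(double* y, const double* x, double a, int n) {
        #pragma omp parallel for simd aligned(x, y : 64)
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }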