Fig. 2. Memory bandwidth on the SNB and the Intel Xeon Phi system, measured with the STREAM benchmark.

Source publication
Conference Paper
The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for using accelerator-specific programming paradigms. Because of its nat...

Context in source publication

Context 1
... STREAM benchmark [12] is a standard package to measure the available memory bandwidth on a system. Fig. 2 shows the results for the Triad vector operation (ā = b̄ + x · c̄) for a memory footprint of roughly 2 GB. A good thread-to-core mapping was ensured with the affinity support of the Intel compiler. We set KMP_AFFINITY to ... (Fig. 3 caption: Memory latency on the SNB system and the Xeon Phi system for different memory footprints.) A random ...
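For reference, here is a minimal sketch of an OpenMP Triad kernel in the spirit of the measurement described above; the array size, timing harness, and affinity handling are illustrative assumptions, not the authors' exact setup.

    // Sketch of the STREAM Triad kernel a[i] = b[i] + x * c[i], parallelized with OpenMP.
    // N is chosen so three double arrays give a footprint of roughly 2 GB (assumption).
    // Thread placement would be controlled externally, e.g. via the KMP_AFFINITY
    // environment variable of the Intel compiler (the exact value is not shown here).
    #include <omp.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t N = 90'000'000;            // 3 * N * 8 bytes ~= 2.16 GB
        const double x = 3.0;
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (std::size_t i = 0; i < N; ++i)
            a[i] = b[i] + x * c[i];
        double t1 = omp_get_wtime();

        // Triad touches three arrays per iteration: two loads and one store.
        double gbytes = 3.0 * N * sizeof(double) / 1e9;
        std::printf("Triad bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));
        return 0;
    }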

Similar publications

Preprint
In this paper, we analyze the performance and energy consumption of an Arm-based high-performance computing (HPC) system developed within the European project Mont-Blanc 3. This system, called Dibona, has been integrated by ATOS/Bull, and it is powered by Marvell's latest CPU, the ThunderX2. This CPU is the same one that powers the Astra supercompu...

Citations

... OpenMP differs from MPI [8], which is designed for distributed-memory parallelism, by targeting shared-memory parallelism within a single program. Compared to CUDA [9] and OpenCL [10], which are geared towards throughput-optimized accelerators (and in the case of CUDA, a specific vendor), OpenMP provides a portable solution that works across different platforms, including general-purpose CPUs, accelerators [11], GPUs [12]-[14], and FPGAs [15]. Additionally, OpenMP's compatibility with multiple programming languages (specifically C, C++, Fortran, and partially Python [16], [17]) sets it apart from other GPU programming languages such as SYCL [18] (C++ only) and OpenCL (C and C++). ...
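As a rough illustration of this portability argument (example code, not taken from the cited works), the same saxpy loop can be written once for host-side OpenMP threading and once for OpenMP target offload to whatever accelerator the runtime exposes; the function names are placeholders.

    #include <omp.h>

    // Shared-memory threading on a general-purpose CPU.
    void saxpy_host(float* y, const float* x, float a, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Offload of the same loop to a device (GPU, coprocessor, ...) via OpenMP 4.x target.
    void saxpy_device(float* y, const float* x, float a, int n) {
        #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }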
Preprint
In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is a lack of comprehensive data on the actual usage of OpenMP constructs, hindering unbiased insights into its popularity and evolution. This paper presents a statistical analysis of OpenMP usage and adoption trends based on a novel and extensive database, HPCORPUS, compiled from GitHub repositories containing C, C++, and Fortran code. The results reveal that OpenMP is the dominant parallel programming model, accounting for 45% of all analyzed parallel APIs. Furthermore, it has demonstrated steady and continuous growth in popularity over the past decade. Analyzing specific OpenMP constructs, the study provides in-depth insights into their usage patterns and preferences across the three languages. Notably, we found that while OpenMP has a strong "common core" of constructs in common usage (while the rest of the API is less used), there are new adoption trends as well, such as simd and target directives for accelerated computing and task for irregular parallelism. Overall, this study sheds light on OpenMP's significance in HPC applications and provides valuable data for researchers and practitioners. It showcases OpenMP's versatility, evolving adoption, and relevance in contemporary parallel programming, underlining its continued role in HPC applications and beyond. These statistical insights are essential for making informed decisions about parallelization strategies and provide a foundation for further advancements in parallel programming models and techniques.
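As a small illustrative sketch (not code from HPCORPUS) of two of the constructs the study singles out, the simd directive asks for vectorization of a regular loop, while task expresses irregular parallelism; the recursive Fibonacci example is a standard textbook use of tasking, assumed here only for demonstration.

    #include <omp.h>

    // Vectorization hint for a regular loop.
    void scale(double* v, double s, int n) {
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            v[i] *= s;
    }

    // Irregular, recursive parallelism with tasks; call from inside a
    // "#pragma omp parallel" region followed by "#pragma omp single".
    long fib(int n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }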
... Iwainsky et al. used empirical performance models to study the scalability of OpenMP constructs on several machines, including IBM Blue Gene/Q and Intel Xeon Phi [31]. In [32], the authors studied the performance and scalability of OpenMP programs on Xeon Phi in stand-alone mode, and they compared it with a two-socket Xeon-based system. ...
... Several studies concentrated on one specific application [33][34][35], while others focused on smaller compute kernels as well as NAS parallel benchmarks [32,36,37]. Recent performance studies of HPC architectures, more specifically Blue Genes and Xeon Phi, concentrated on specific compute-intensive applications, such as drug discovery [26], lattice QCD [27], molecular dynamics [28], a DNA sequence aligner [38], pseudospectral ultrasound simulations [39], and the Alpha Magnetic Spectrometer [40]. ...
Article
Performance analysis plays an essential role in achieving a scalable performance of applications on massively parallel supercomputers equipped with thousands of processors. This paper is an empirical investigation to study, in depth, the performance of two of the most common High-Performance Computing architectures in the world. IBM has developed three generations of Blue Gene supercomputers—Blue Gene/L, P, and Q—that use, at a large scale, low-power processors to achieve high performance. Better CPU core efficiency has been empowered by a higher level of integration to gain more parallelism per processing element. On the other hand, the Intel Xeon Phi coprocessor armed with 61 on-chip x86 cores, provides high theoretical peak performance, as well as software development flexibility with existing high-level programming tools. We present an extensive evaluation study of the performance peaks and scalability of these two modern architectures using SPEC OMP benchmarks.
... Hence, compared to other platforms such as GPU, FPGA and ASIC, Phi incurs much lower overhead of programming and development. This is especially beneficial for large-scale legacy codes [33]. ...
... Table 2 shows the architectural parameters of selected models of CPU, GPU and Phi. (The rest of this snippet is a flattened table listing, per Xeon Phi model, the works that use it: x100-series (KNC) models 3120, 3120A, 3120P, 31SP, 31S1P, 5100, 5110P, 5120D, 5120P, 5510P, SE10P, SE10X, 7110P, 7110X, 7120A, 7120D, 7120P, 7120X, and KNL x200 models 7210P, 7210F, 7210, 7230, 7230F, 7250, 7250F, 7290, 7290F, 7290B, each followed by its citations.) ...
... Schmidl et al. [33] compare the performance of a Phi (whose configuration is similar to the 5110P) with that of a Sandy Bridge CPU. They note that the bandwidth of the Phi is highest for 60 threads and degrades as the number of threads increases. ...
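A tiny sketch of how such a thread sweep is typically set up (an assumption for illustration, not the cited survey's harness), varying the team size around a bandwidth-bound kernel:

    #include <omp.h>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t N = 50'000'000;
        std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
        // Example thread counts roughly corresponding to 0.5x/1x/2x/4x of 60 cores.
        for (int threads : {30, 60, 120, 240}) {
            omp_set_num_threads(threads);
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (std::size_t i = 0; i < N; ++i)
                a[i] = b[i] + 3.0 * c[i];
            double gbs = 3.0 * N * sizeof(double) / 1e9 / (omp_get_wtime() - t0);
            std::printf("%3d threads: %.1f GB/s\n", threads, gbs);
        }
        return 0;
    }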
Article
Intel's Xeon Phi combines the parallel processing power of a many-core accelerator with the programming ease of CPUs. In this paper, we present a survey of works that study the architecture of Phi and use it as an accelerator for a broad range of applications. We review performance optimization strategies as well as the factors that bottleneck the performance of Phi. We also review works that perform comparison or collaborative execution of Phi with CPUs and GPUs. This paper will be useful for researchers and developers in the area of computer-architecture and high-performance computing.
... Programming language and compiler designers for parallel computing have always aimed for a unified programming abstraction that can run on multiple devices. ANSI C/C++ extensions like CUDA (for NVIDIA GPUs) [1], OpenCL (CPUs, GPUs, DSPs, FPGAs), OpenMP (CPU, Xeon Phi, GPUs) [21,28], and OpenACC [31] are the popular choices for parallel programming on a heterogeneous computing platform. While the above-mentioned mechanisms provide some level of abstraction over the hardware, these approaches are still quite low level, in that the programmer is responsible for writing the code necessary for the parallel features within the application. ...
Conference Paper
Heterogeneous compute architectures like multi-core CPUs, CUDA GPUs, and Intel Xeon Phis have become prevalent over the years. While heterogeneity makes architecture-specific features available to the programmer, it also makes application development difficult, as one needs to plan for optimal usage of architectural features, suitable partitioning of the workload, and communication and data transfer among the participating devices. A suitable design abstraction that hides such variabilities of the underlying devices and at the same time exploits their computing capabilities can improve developer productivity. In this work, we present "ThrustHetero", a lightweight framework based on NVIDIA's Thrust that provides an abstraction over several devices such as GPUs, Xeon Phis and multicore CPUs, yet allows developers to easily leverage the full compute capability of these devices. We also demonstrate a novel method for workload distribution in two stages: micro-benchmarking during framework installation to find good proportions, and then using this information during application execution. We consider four classes of applications, grouped by the amount of branching present, according to how they would perform on various computing architectures. We show that the framework produces good workload distribution proportions for each class of application, and we also show that the framework is scalable and portable. Further, we compare the performance and ease of development when using the framework with the native versions of various benchmarks and obtain favorable results.
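The two-stage workload distribution described above can be illustrated with a small sketch (assumed code, not ThrustHetero's actual API): a micro-benchmark run once at installation time yields a relative device throughput, and that proportion is later used to split each workload between devices.

    #include <cstddef>

    struct Split { std::size_t gpu_items, cpu_items; };

    // Stage 1 ("installation time"): a micro-benchmark would measure relative device
    // throughput, e.g. gpu_rate / (gpu_rate + cpu_rate). The 0.7 below is an assumption.
    double load_gpu_fraction() { return 0.7; }

    // Stage 2 (application execution): partition the workload according to that fraction.
    Split partition(std::size_t n) {
        double f = load_gpu_fraction();
        std::size_t gpu = static_cast<std::size_t>(n * f);
        return { gpu, n - gpu };
    }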
... Technically, OpenMP can be as good a choice as CUDA or OpenCL for writing a parallel program, as the directives are completely device agnostic. The OpenMP initiative is strongly supported by Intel, and it is mainly used in task-parallel, multi-/many-core applications as well as for vector-processing applications that run on Intel's Xeon Phi coprocessor [34]. ...
Article
Every other personal computer today is provided with a coprocessor, making it a heterogeneous computing environment. As the heterogeneous and high-performance computing (HPC) infrastructure becomes a commodity, the need to improve software development productivity to build efficient parallel programs for this infrastructure becomes all the more crucial. While the mainstream software development methodology focuses on modular design, reusability, ease of understanding and so on, parallel program development emphasizes performance, optimal use of a hardware resource, scalability, execution correctness, and portability across multiple hardware platforms. In this paper, we identify a few unique software development productivity requirements for heterogeneous systems. These requirements are concerned with design abstraction, reusability, and design verification. While these requirements are applicable to conventional software as well, their implications are far reaching in the context of parallel programs. Here we discuss significant efforts in building tools and frameworks to i) provide powerful abstraction over the hardware, ii) build software libraries for parallel hardware access and iii) implement verification mechanisms to check the correctness of a program's behavior in a heterogeneous runtime environment. We also identify several important gaps in the existing work that need to be addressed in order to make the current body of work useful in practice.
... This provides a better abstraction than commonly known threading libraries such as the POSIX thread library [12]. The Intel Xeon Phi coprocessor supports the OpenMP programming model, removing the necessity to modify codes to use coprocessor-specific programming [16]. ...
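To illustrate the abstraction gap this refers to, here is a hedged sketch comparing an OpenMP reduction with an equivalent hand-written POSIX-threads version; the function names and the fixed thread count of four are assumptions made for the example only.

    #include <omp.h>
    #include <pthread.h>

    // OpenMP: one directive parallelizes the loop and performs the reduction.
    void sum_openmp(const double* v, int n, double* out) {
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < n; ++i) s += v[i];
        *out = s;
    }

    // POSIX threads: explicit argument structs, thread creation, and a manual reduction.
    struct Chunk { const double* v; int lo, hi; double partial; };

    static void* worker(void* p) {
        Chunk* c = static_cast<Chunk*>(p);
        c->partial = 0.0;
        for (int i = c->lo; i < c->hi; ++i) c->partial += c->v[i];
        return nullptr;
    }

    void sum_pthreads(const double* v, int n, double* out) {
        const int T = 4;                         // fixed thread count for brevity
        pthread_t tid[T];
        Chunk chunks[T];
        for (int t = 0; t < T; ++t) {
            chunks[t] = { v, t * n / T, (t + 1) * n / T, 0.0 };
            pthread_create(&tid[t], nullptr, worker, &chunks[t]);
        }
        double s = 0.0;
        for (int t = 0; t < T; ++t) {
            pthread_join(tid[t], nullptr);
            s += chunks[t].partial;
        }
        *out = s;
    }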
... Schmidl et al. [27] assess the performance of OpenMP programs on the new Intel Xeon Phi coprocessor and on the Intel Sandy Bridge (SNB). The first one, offering more than 60 cores, can act as an accelerator, as in a GPU model, or as a standalone SMP. The work [27] studies the performance of the Xeon Phi and SNB using kernels, benchmark codes and four real-world applications, achieving a speedup of more than 100 on the Xeon Phi compared to single-core execution. It shows that OpenMP programs can be ported to this state-of-the-art CPU with almost no modification, in contrast to GPU-based approaches [27]. Despite the described outstanding performance of the Xeon Phi, this processor is still not widely present in the systems of most ClamAV users. ...
Article
Increasingly sophisticated antivirus (AV) software and the growing amount and complexity of malware demand more processing power from personal computers, specifically from the central processing unit (CPU). This paper presents performance tests with Clam AntiVirus (ClamAV) and improves its performance through parallel processing on multiple cores using the Open Multi-Processing (OpenMP) library. All the tests used the same dataset, consisting of 1.33 GB of data distributed among 2766 files of different sizes. The new parallel version of ClamAV implemented in our work achieved an execution time around 62% lower than the original software version, reaching a speedup of 2.6. The main contribution of this work is to propose and implement a new version of the ClamAV antivirus using parallel processing with OpenMP, easily portable to a variety of hardware platforms and operating systems.
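A minimal sketch of the kind of file-level OpenMP parallelization the abstract describes, assuming a hypothetical scan_file() stand-in for ClamAV's scanning routine (the paper's actual code and scheduling choices are not reproduced here):

    #include <omp.h>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for a ClamAV scan call; returns the number of detections.
    int scan_file(const std::string& path) { (void)path; return 0; }

    int scan_all(const std::vector<std::string>& files) {
        int detections = 0;
        // Dynamic scheduling helps balance the load when file sizes differ widely.
        #pragma omp parallel for schedule(dynamic) reduction(+:detections)
        for (std::size_t i = 0; i < files.size(); ++i)
            detections += scan_file(files[i]);
        return detections;
    }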
... This provides a better abstraction than commonly known threading libraries such as the POSIX thread library [10]. The Intel Xeon Phi coprocessor supports the OpenMP programming model, which allows an OpenMP program to be ported to the coprocessor seamlessly [17]. Intel has provided a proprietary directive-based framework called Cilk Plus that allows performing both task- and data-parallel activities on Intel multi- and manycore hardware. ...
Conference Paper
High-performance computing applications are far more difficult to write; therefore, practitioners expect well-tuned software to last long and provide optimized performance even when the hardware is upgraded. It may also be necessary to write software using sufficient abstraction over the hardware so that it is capable of running on heterogeneous architectures. A good design abstraction paradigm strikes a balance between abstraction and visibility over the hardware. This allows the programmer to write applications without having to understand the hardware nuances while exploiting the computing power optimally. In this paper we have analyzed the power of design abstraction of a popular framework called Thrust, from both ease-of-programming and performance perspectives. We have shown that, while the Thrust framework is good at describing an algorithm compared to the native CUDA or OpenMP version, it has quite a few design limitations. With respect to CUDA, it does not provide the programmer any abstraction over shared, texture or constant memory usage. We have compared the performance of a Thrust application code on the CUDA, OpenMP and CPP backends with respect to the native versions (implementing exactly the same algorithm) written for these backends, and found that the current Thrust version performs poorly in most cases. While we conclude that the framework is not ready for writing applications that can exploit optimal performance from the hardware, we also highlight the improvements necessary for the framework to make the performance comparable.
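For context, a short Thrust sketch (not the paper's code) of the style of algorithm description being evaluated; the saxpy functor is an assumed example, and which backend it targets (CUDA, OpenMP or sequential C++) is selected at build time via THRUST_DEVICE_SYSTEM rather than in the source.

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // y = a * x + y expressed as a Thrust transform. Note that nothing here exposes
    // shared, texture or constant memory, which is one of the limitations the paper notes.
    struct Saxpy {
        float a;
        __host__ __device__ float operator()(float xi, float yi) const { return a * xi + yi; }
    };

    void saxpy(thrust::device_vector<float>& y,
               const thrust::device_vector<float>& x, float a) {
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), Saxpy{a});
    }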
... They affirmed that common OpenMP codes could be easily migrated to Intel Xeon Phi, gaining more parallel performance without adding overheads. This study was enhanced in [39]. [16] also tested the Xeon Phi through the development of some microbenchmarks. ...
Article
The Intel Xeon Phi accelerator is one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning its performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library used as a benchmark. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with conventional multicore systems for the execution of speculatively parallelized code.
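As a small illustration of the vectorization point (an assumed example, not code from the cited article), the loop below combines OpenMP threading with an explicit simd hint, which is the kind of SIMD exploitation the Xeon Phi's wide vector units need in order to be competitive.

    #include <omp.h>

    // Combined threading and vectorization; aligned(...) assumes 64-byte aligned buffers.
    void axpy(double* y, const double* x, double a, int n) {
        #pragma omp parallel for simd aligned(x, y : 64)
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }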