Table 2 - uploaded by Jongsoo Park
Comparison of Xeon and Xeon Phi 

Source publication
Conference Paper
Full-text available
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nod...

Contexts in source publication

Context 1
... Xeon Phi coprocessors are the first commercial product of the Intel® MIC architecture family, whose specification is compared with a dual-socket Xeon E5-2680 in Table 2. It is equipped with many cores, each with wide vector units (512-bit SIMD), backed by large caches and high memory bandwidth. ...
Context 2
... communication-to-computation ratio measured in bytes per op (bops) is about 0.7. The machine bops of a dual-socket Xeon E5-2680 running at 2.7 GHz is 0.23 as shown in Table 2, which is considerably smaller than the algorithmic bops, thus making the performance of FFT bound by memory bandwidth. The gap between the machine and algorithmic bops is even wider on Xeon Phi, whose machine bops is 0.14 as shown in Table 2. Assuming that compute is completely overlapped with memory transfers, the maximum achievable compute efficiency is only 0.14 / 0.7 = 20%. ...
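The efficiency bound quoted in the context above follows from a simple roofline-style calculation, sketched below (a minimal illustration using only the figures given in the excerpt; the function name is ours):

```python
# Roofline-style bound: if an algorithm needs more bytes per op (bops) than
# the machine can supply, compute efficiency is capped by the ratio
# machine_bops / algorithmic_bops.

def max_compute_efficiency(machine_bops: float, algorithmic_bops: float) -> float:
    """Upper bound on compute efficiency, assuming memory transfers are
    perfectly overlapped with computation."""
    return min(1.0, machine_bops / algorithmic_bops)

# Figures quoted in the excerpt (from Table 2 of the paper):
xeon_bops = 0.23      # dual-socket Xeon E5-2680 @ 2.7 GHz
xeon_phi_bops = 0.14  # Xeon Phi coprocessor
fft_bops = 0.7        # algorithmic bytes-per-op of the 1D FFT

print(f"Xeon:     {max_compute_efficiency(xeon_bops, fft_bops):.0%}")      # 33%
print(f"Xeon Phi: {max_compute_efficiency(xeon_phi_bops, fft_bops):.0%}")  # 20%
```

The 20% figure for Xeon Phi matches the excerpt's 0.14 / 0.7 estimate; the same model gives about 33% for the dual-socket Xeon.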

Similar publications

Article
Full-text available
Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that...
Conference Paper
Full-text available
Performance analysis tools are essential in the process of understanding application behavior, identifying critical performance issues and adapting applications to new architectures and increasingly scaling HPC systems. State-of-the-art tools provide extensive functionality and a plenitude of specialized analysis capabilities. At the same time, the...
Article
Full-text available
Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace DRAM as main memory. However, because of relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, data objects of the application must be carefully placed to NVM and DRAM for best pe...

Citations

... Many efforts have been made on both the algorithm and hardware fronts. Many optimized FFT implementations have been proposed for the CPU platform [11,12], the GPU platform [5,22], and other accelerator platforms [18,25,28]. ...
Preprint
Fast Fourier Transform (FFT) is an essential tool in scientific and engineering computation. The increasing demand for mixed-precision FFT has made it possible to utilize half-precision floating-point (FP16) arithmetic for faster speed and energy saving. Specializing in lower precision, NVIDIA Tensor Cores can deliver extremely high computation performance. However, their fixed computation pattern makes it hard to utilize the computing power of Tensor Cores in FFT. Therefore, we developed tcFFT to accelerate FFT with Tensor Cores. Our tcFFT supports batched 1D and 2D FFT of various sizes and exploits a set of optimizations to achieve high performance: 1) single-element manipulation on Tensor Core fragments to support special operations needed by FFT; 2) fine-grained data arrangement design to coordinate with the GPU memory access pattern. We evaluated our tcFFT and NVIDIA's cuFFT in various sizes and dimensions on NVIDIA V100 and A100 GPUs. The results show that our tcFFT can outperform cuFFT by 1.29x-3.24x and 1.10x-3.03x on the two GPUs, respectively. Our tcFFT has great potential for mixed-precision scientific applications.
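The key observation enabling Tensor Core FFTs is that a DFT is itself a matrix product, so it can be fed to matrix-multiply hardware. A minimal NumPy sketch of that equivalence (our illustration, not tcFFT's actual kernel):

```python
import numpy as np

# A DFT of size N is exactly a matrix-vector product with the DFT matrix
# F[j, k] = exp(-2*pi*i*j*k / N), which is why GEMM units such as Tensor
# Cores can in principle execute FFT stages.

def dft_matrix(n: int) -> np.ndarray:
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

rng = np.random.default_rng(0)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)

via_gemm = dft_matrix(16) @ x   # DFT phrased as a matrix product
via_fft = np.fft.fft(x)         # library FFT for reference
print(np.allclose(via_gemm, via_fft))  # True
```

In practice an FFT factors this dense product into O(N log N) sparse stages; the challenge the abstract describes is mapping those stages onto the fixed fragment shapes of Tensor Cores.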
... Parallel one-dimensional FFT algorithms in distributed-memory parallel computers have been investigated extensively [12,17,19,20,27,29,36,39,48,51,52]. ...
... Parallel three-dimensional FFT implementations on GPU clusters have previously been presented [18,35]. Also, parallel one-dimensional FFT algorithm on Xeon Phi cluster has been proposed [36]. ...
... In contrast, our goal is to provide a more general graph processing framework on an emerging throughput-oriented architecture. There also exist some research efforts using Xeon Phi for algorithms other than graph processing, such as FFT [35], an important scientific computing kernel. However, scaling FFT and graph computation on KNL are different. ...
Conference Paper
Full-text available
Modern parallel architecture design has increasingly turned to throughput-oriented devices to address concerns about energy efficiency and power consumption. However, graph applications cannot tap into the full potential of such architectures because of highly unstructured computations and irregular memory accesses. In this paper, we present GraphPhi, a new approach to graph processing on emerging Intel Xeon Phi-like architectures, which addresses the restrictions of migrating existing graph processing frameworks from shared-memory multi-core CPUs to this new architecture. Specifically, GraphPhi consists of 1) an optimized hierarchically blocked graph representation to enhance the data locality for both edges and vertices within and among threads, 2) a hybrid vertex-centric and edge-centric execution to efficiently find and process active edges, and 3) a uniform MIMD-SIMD scheduler integrated with lock-free update support to achieve both good thread-level load balance and SIMD-level utilization. Moreover, our efficient MIMD-SIMD execution is capable of hiding memory latency by increasing the number of concurrent memory access requests, thus benefiting more from the latest High Bandwidth Memory technology. We evaluate GraphPhi on six graph processing applications. Compared to two state-of-the-art shared-memory graph processing frameworks, it achieves speedups of up to 4X and 35X, respectively.
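The vertex-centric/edge-centric distinction the abstract mentions can be illustrated with a toy BFS step (a hypothetical sketch, not GraphPhi's implementation; graph and function names are ours):

```python
# One BFS frontier expansion, written two ways.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}              # adjacency list
edges = [(u, v) for u, nbrs in graph.items() for v in nbrs]

def bfs_step_vertex_centric(frontier, visited):
    # Iterate over active vertices only; efficient when the frontier
    # is small, but the per-vertex work is irregular.
    nxt = set()
    for u in frontier:
        for v in graph[u]:
            if v not in visited:
                nxt.add(v)
    return nxt

def bfs_step_edge_centric(frontier, visited):
    # Stream every edge; a regular access pattern that maps well onto
    # wide SIMD, but touches inactive edges too.
    return {v for (u, v) in edges if u in frontier and v not in visited}

frontier, visited = {0}, {0}
print(bfs_step_vertex_centric(frontier, visited))  # {1, 2}
print(bfs_step_edge_centric(frontier, visited))    # {1, 2}
```

A hybrid execution, as described above, would pick whichever style suits the current frontier density.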
... In this paper, we aim to achieve this by adopting Intel Xeon Phi coprocessors. This idea was inspired by the fact that Intel Xeon Phi [14] has been widely used to accelerate many applications, such as sparse matrix-vector multiplication [20], 1D FFT computations [21], Linpack benchmark calculation [22], molecular dynamics [23], and computational biology [24], [32], among others. ...
... Due to its weakness in I/O and control, the Xeon Phi coprocessor is widely used in this mode. As mentioned in Section 1, more and more applications are accelerated by Xeon Phi, from basic scientific computation to biology applications [20], [21], [22], [23], [24], [32]. Xeon Phi is showing its great potential in parallel computing. ...
Article
Full-text available
Single Nucleotide Polymorphism (SNP) detection is a fundamental procedure of whole-genome analysis. SOAPsnp, a classic detection tool, would take more than one week to analyze one typical human genome, which limits the efficiency of downstream analyses. In this paper, we present mSNP, an optimized version of SOAPsnp, which leverages Intel Xeon Phi coprocessors for large-scale SNP detection. First, we redesigned the essential data structures of SOAPsnp, which significantly reduces the memory footprint and improves computing efficiency. Then we developed a coordinated parallel framework for higher hardware utilization of both the CPU and the Xeon Phi. We also tailored the data structures and operations to utilize the wide VPU of Xeon Phi to improve data throughput. Last but not least, we proposed a read-based window division strategy to improve throughput and obtain better load balance. mSNP is the first SNP detection tool empowered by Xeon Phi. We achieved a 38x single-thread speedup on the CPU, without any loss in precision. Moreover, mSNP successfully scaled to 4,096 nodes on Tianhe-2. Our experiments demonstrate that mSNP is efficient and scalable for large-scale human genome SNP detection.
... An Intel® Xeon Phi™ coprocessor has at most 61 cores, each of which supports 4-way Hyper-Threading and includes a 512-bit-wide vector processing unit. The peak floating-point speed of a single MIC card can reach 1 TFLOPS [17]. One of the most important features of MIC is that it uses an x86-compatible instruction set, which makes porting software to MIC lighter and faster compared with GPU porting [18]. ...
Article
Molecular dynamics (MD) is a computer simulation method for studying the physical movements of atoms and molecules that provides detailed microscopic sampling at the molecular scale. With continuous efforts and improvements, MD simulation has gained popularity in materials science, biochemistry, and biophysics, with various application areas and expanding data scale. Assisted Model Building with Energy Refinement (AMBER) is one of the most widely used software packages for conducting MD simulations. However, the speed of AMBER MD simulations for systems with millions of atoms at the microsecond scale still needs to be improved. In this paper, we propose a parallel acceleration strategy for AMBER on the Tianhe-2 supercomputer. The parallel optimization of AMBER is carried out on three different levels: fine-grained OpenMP parallelism on a single CPU, single-node CPU/MIC parallel optimization, and multi-node multi-MIC collaborative parallel acceleration. With the three-level parallel acceleration strategy above, we achieved a highest speedup of 25-33x compared with the original program.
... Tang et al. [24] developed an alternative algorithm based on convolution and oversampling which reduces the communication requirements to a single all-to-all transposition and a negligible halo exchange of the oversampled values. Park et al. [20] further analyze and present results for this method with Intel Xeon Phi coprocessors. ...
Conference Paper
Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.
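The traffic saving described in the abstract can be sketched with a back-of-the-envelope model (our illustration with assumed sizes, not the paper's measurements):

```python
# Why reducing three all-to-all transposes to ~one matters for a
# distributed 1D FFT: per-rank traffic scales linearly with the number
# of transposes.

def all_to_all_bytes_per_rank(n_total: int, ranks: int, bytes_per_elem: int = 16) -> float:
    """Bytes each rank sends in one all-to-all transpose of n_total
    complex-double (16-byte) elements spread evenly over `ranks` ranks."""
    local = n_total / ranks
    return local * bytes_per_elem * (ranks - 1) / ranks  # all but the local slice moves

n, p = 2**30, 64                                      # assumed problem and rank count
standard = 3 * all_to_all_bytes_per_rank(n, p)        # industry-standard: 3 transposes
comm_avoiding = 1 * all_to_all_bytes_per_rank(n, p)   # ~1 transpose (+ small halo, ignored)
print(f"traffic ratio: {standard / comm_avoiding:.1f}x")  # 3.0x
```

The halo exchange of oversampled values adds a small constant on top of the single transpose, which is why the observed speedups (1.3x-2.2x) fall short of the ideal 3x.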
... This trend indicates that determining the optimal mapping to the Xeon Phi is critical for application developers. In a total of 19 works (Aprà et al., 2014; Brown et al., 2015; Heinecke et al., 2013; Höhnerbach et al., 2016; Jundt et al., 2015; Krishnaiyer et al., 2013; Lai et al., 2014; Liu et al., 2015; Lopez et al., 2015; Mathew et al., 2015; Misra et al., 2013; Newburn et al., 2013; Park et al., 2013; Saini et al., 2015; Sainz et al., 2015; Saule et al., 2014; Teodoro et al., 2014), it was found that the Xeon Phi outperformed the CPU, and only two works found the CPU to be better (Li et al., 2014; Luo et al., 2013). Hence, the Xeon Phi was found to be a promising accelerator. ...
Article
Power draw is a complex physical response to the workload of a given application on the hardware, which is difficult to model, in part, due to its variability. The empirical mode decomposition and Hilbert–Huang transform (EMD/HHT) is a method commonly applied to time-varying physical systems to analyze their complex behavior. In the authors' work, the EMD/HHT is considered for the first time to study the power usage of high-performance applications. Here, this method is applied to power measurement sequences (called here power traces) collected on three different computing platforms featuring two generations of Intel Xeon Phi, which are an attractive solution under power budget constraints. The high-performance applications explored in this work are codesign molecular dynamics and the general atomic and molecular electronic structure system, which exhibit different power draw characteristics, to showcase the strengths and limitations of the EMD/HHT analysis. Specifically, EMD/HHT measures the intensity of an execution, which shows the concentration of power draw with respect to execution time and provides insights into performance bottlenecks. This article compares intensity among executions, noting a relationship between intensity and execution characteristics such as computation amount and data movement. In general, this article concludes that the EMD/HHT method is a viable tool to compare application power usage and performance over the entire execution and that it has much potential in selecting the most appropriate execution configurations.
... Some works focus on the implementation of existing algorithms on coprocessors. For example, [33] developed a multi-node 1D FFT implementation on coprocessors; [27] implemented a sparse matrix-vector multiplication; and [32] developed a SQL engine that benefited from the inherent parallelism of Xeon Phi coprocessors. ...
Article
Full-text available
Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used. Our results show that, although the Xeon Phi delivers relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with conventional multicore systems for the execution of speculatively parallelized code.
... An Intel® Xeon Phi™ coprocessor has at most 61 cores, each of which supports 4-way Hyper-Threading and includes a 512-bit-wide vector processing unit. The peak floating-point speed of a single MIC card can reach 1 TFLOPS [17]. One of the most important features of MIC is that it uses an x86-compatible instruction set, which makes porting software to MIC lighter and faster compared with GPU porting [18]. ...
Conference Paper
Molecular dynamics (MD) is a computer simulation method for studying the physical movements of atoms and molecules that provides detailed microscopic sampling at the molecular scale. With continuous efforts and improvements, MD simulation has gained popularity in materials science, biochemistry, and biophysics, with various application areas and expanding data scale. Assisted Model Building with Energy Refinement (AMBER) is one of the most widely used software packages for conducting MD simulations. However, the speed of AMBER MD simulations for systems with millions of atoms at the microsecond scale still needs to be improved. In this paper, we propose a parallel acceleration strategy for AMBER on the Tianhe-2 supercomputer. The parallel optimization of AMBER is carried out on three different levels: fine-grained OpenMP parallelism on a single MIC, single-node CPU/MIC collaborative parallel optimization, and multi-node multi-MIC collaborative parallel acceleration. With the three-level parallel acceleration strategy above, we achieved a highest speedup of 25-33x compared with the original program. Source Code: https://github.com/tianhe2/mAMBER.
... The studies mainly focus on providing a high level of multithreading parallelism [5]-[7]. Others have studied how to utilize many-core architectures to solve the scaling issue [8]-[10]. However, much previous work relies on the Intel compiler's auto-vectorization. ...
Conference Paper
Full-text available
Intel Xeon Phi is a processor based on the MIC architecture that contains a large number of compute cores with high local memory bandwidth and 512-bit vector processing units. To achieve high performance on Xeon Phi, it is important for programmers to explore all the software features provided by the Intel compiler and libraries to fully utilize the new hardware resources. In this paper, we use the K-Means algorithm to study the performance of various Intel software settings available for Xeon Phi and their impact on the performance of K-Means. First, we examine different memory layouts for storing data points using Intel compiler-intrinsic functions. In distance calculation, the computational kernel of K-Means, when the size of individual input data points is not vector-friendly, we pad the data points to align with the VPU width. Finally, we implement a parallel reduction to increase memory access parallelism and cache hits. These techniques enable us to take advantage of thread-level parallelism and data-level parallelism on Xeon Phi. Experimental results demonstrate large performance gains over the default auto-vectorization approach. The K-Means implementation with the proposed techniques achieves up to 68.65% and 56.14% performance improvements for aligned and unaligned datasets, respectively. For high-dimensional aligned datasets, we achieved up to 53.49% performance improvement on a large-scale parallel computer.
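The padding idea in the abstract above can be sketched as follows (a hedged NumPy illustration under our own names and assumptions, not the paper's code): pad each point's dimension up to a multiple of the VPU width so distance loops stay vector-aligned, noting that a 512-bit VPU holds 16 single-precision floats.

```python
import numpy as np

VPU_FLOATS = 16  # 512-bit vector width / 32-bit float

def pad_points(points: np.ndarray) -> np.ndarray:
    """Zero-pad the feature dimension up to a multiple of the VPU width."""
    n, d = points.shape
    padded_d = -(-d // VPU_FLOATS) * VPU_FLOATS  # ceiling to a multiple of 16
    out = np.zeros((n, padded_d), dtype=np.float32)
    out[:, :d] = points
    return out

pts = np.ones((4, 10), dtype=np.float32)  # 10-dim points: not vector-friendly
padded = pad_points(pts)
print(padded.shape)  # (4, 16)
# Zero padding leaves Euclidean distances, and hence K-Means assignments,
# unchanged:
print(float(np.sqrt(((padded[0] - padded[1]) ** 2).sum())))  # 0.0
```

The trade-off is extra memory and a few wasted lanes per point in exchange for aligned, full-width SIMD loads in the distance kernel.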