Table 2 - uploaded by Jongsoo Park
Comparison of Xeon and Xeon Phi 

Source publication
Conference Paper
Full-text available
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nod...

Contexts in source publication

Context 1
... Xeon Phi coprocessors are the first commercial product of the Intel® MIC architecture family, whose specification is compared with a dual-socket Xeon E5-2680 in Table 2. It is equipped with many cores, each with wide vector units (512-bit SIMD), backed by large caches and high memory bandwidth. ...
Context 2
... communication-to-computation ratio measured in bytes per op (bops) is about 0.7. The machine bops of a dual-socket Xeon E5-2680 running at 2.7 GHz is 0.23 as shown in Table 2, which is considerably smaller than the algorithmic bops, thus making the performance of FFT bound by memory bandwidth. The gap between the machine and algorithmic bops is even wider on Xeon Phi, whose machine bops is 0.14 as shown in Table 2. Assuming that compute is completely overlapped with memory transfers, the maximum achievable compute efficiency is only 0.14 / 0.7 = 20%. ...
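The efficiency bound quoted in the context above follows from a simple roofline-style calculation, sketched below (a minimal illustration using only the figures given in the excerpt; the function name is ours):

```python
# Roofline-style bound: if an algorithm needs more bytes per op (bops) than
# the machine can supply, compute efficiency is capped by the ratio
# machine_bops / algorithmic_bops.

def max_compute_efficiency(machine_bops: float, algorithmic_bops: float) -> float:
    """Upper bound on compute efficiency, assuming memory transfers are
    perfectly overlapped with computation."""
    return min(1.0, machine_bops / algorithmic_bops)

# Figures quoted in the excerpt (from Table 2 of the paper):
xeon_bops = 0.23      # dual-socket Xeon E5-2680 @ 2.7 GHz
xeon_phi_bops = 0.14  # Xeon Phi coprocessor
fft_bops = 0.7        # algorithmic bytes-per-op of the 1D FFT

print(f"Xeon:     {max_compute_efficiency(xeon_bops, fft_bops):.0%}")      # 33%
print(f"Xeon Phi: {max_compute_efficiency(xeon_phi_bops, fft_bops):.0%}")  # 20%
```

The 20% figure for Xeon Phi matches the excerpt's 0.14 / 0.7 estimate; the same model gives about 33% for the dual-socket Xeon.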

Similar publications

Article
Full-text available
Coordinated checkpointing is a widely-used checkpoint/restart protocol for fault-tolerance in large-scale HPC systems. However, this protocol will involve massive amounts of I/O concentration, resulting in considerably high checkpoint overhead and high energy consumption. This paper focuses on speculative checkpointing, a CPR mechanism that...
Conference Paper
Full-text available
Performance analysis tools are essential in the process of understanding application behavior, identifying critical performance issues and adapting applications to new architectures and increasingly scaling HPC systems. State-of-the-art tools provide extensive functionality and a plenitude of specialized analysis capabilities. At the same time, the...
Article
Full-text available
Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace DRAM as main memory. However, because of relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, data objects of the application must be carefully placed to NVM and DRAM for best pe...

Citations

... Many efforts have been made on both the algorithm and hardware fronts. Many optimized FFT implementations have been proposed for the CPU platform [11,12], the GPU platform [5,22], and other accelerator platforms [18,25,28]. ...
Preprint
Fast Fourier Transform (FFT) is an essential tool in scientific and engineering computation. The increasing demand for mixed-precision FFT has made it possible to utilize half-precision floating-point (FP16) arithmetic for faster speed and energy saving. Specializing in lower precision, NVIDIA Tensor Cores can deliver extremely high computation performance. However, their fixed computation pattern makes it hard to utilize the computing power of Tensor Cores in FFT. Therefore, we developed tcFFT to accelerate FFT with Tensor Cores. Our tcFFT supports batched 1D and 2D FFT of various sizes and exploits a set of optimizations to achieve high performance: 1) single-element manipulation on Tensor Core fragments to support special operations needed by FFT; 2) fine-grained data arrangement design to coordinate with the GPU memory access pattern. We evaluated our tcFFT and NVIDIA's cuFFT in various sizes and dimensions on NVIDIA V100 and A100 GPUs. The results show that our tcFFT can outperform cuFFT by 1.29x-3.24x and 1.10x-3.03x on the two GPUs, respectively. Our tcFFT has great potential for mixed-precision scientific applications.
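The key observation enabling Tensor Core FFTs is that a DFT is itself a matrix product, so it can be fed to matrix-multiply hardware. A minimal NumPy sketch of that equivalence (our illustration, not tcFFT's actual kernel):

```python
import numpy as np

# A DFT of size N is exactly a matrix-vector product with the DFT matrix
# F[j, k] = exp(-2*pi*i*j*k / N), which is why GEMM units such as Tensor
# Cores can in principle execute FFT stages.

def dft_matrix(n: int) -> np.ndarray:
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

rng = np.random.default_rng(0)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)

via_gemm = dft_matrix(16) @ x   # DFT phrased as a matrix product
via_fft = np.fft.fft(x)         # library FFT for reference
print(np.allclose(via_gemm, via_fft))  # True
```

In practice an FFT factors this dense product into O(N log N) sparse stages; the challenge the abstract describes is mapping those stages onto the fixed fragment shapes of Tensor Cores.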
... Parallel one-dimensional FFT algorithms in distributed-memory parallel computers have been investigated extensively [12,17,19,20,27,29,36,39,48,51,52]. ...
... Parallel three-dimensional FFT implementations on GPU clusters have previously been presented [18,35]. Also, parallel one-dimensional FFT algorithm on Xeon Phi cluster has been proposed [36]. ...
... In contrast, our goal is to provide a more general graph processing framework on an emerging throughput-oriented architecture. There also exist some research efforts using Xeon Phi for algorithms other than graph processing, such as FFT [35], an important scientific computing kernel. However, scaling FFT and graph computation on KNL are different. ...
Conference Paper
Full-text available
Modern parallel architecture design has increasingly turned to throughput-oriented devices to address concerns about energy efficiency and power consumption. However, graph applications cannot tap into the full potential of such architectures because of highly unstructured computations and irregular memory accesses. In this paper, we present GraphPhi, a new approach to graph processing on emerging Intel Xeon Phi-like architectures, which addresses the restrictions of migrating existing graph processing frameworks from shared-memory multi-core CPUs to this new architecture. Specifically, GraphPhi consists of 1) an optimized hierarchically blocked graph representation to enhance the data locality for both edges and vertices within and among threads, 2) a hybrid vertex-centric and edge-centric execution to efficiently find and process active edges, and 3) a uniform MIMD-SIMD scheduler integrated with lock-free update support to achieve both good thread-level load balance and SIMD-level utilization. Moreover, our efficient MIMD-SIMD execution is capable of hiding memory latency by increasing the number of concurrent memory access requests, thus benefiting more from the latest High Bandwidth Memory technology. We evaluate GraphPhi on six graph processing applications. Compared to two state-of-the-art shared-memory graph processing frameworks, it achieves speedups of up to 4X and 35X, respectively.
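The vertex-centric/edge-centric distinction the abstract mentions can be illustrated with a toy BFS step (a hypothetical sketch, not GraphPhi's implementation; graph and function names are ours):

```python
# One BFS frontier expansion, written two ways.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}              # adjacency list
edges = [(u, v) for u, nbrs in graph.items() for v in nbrs]

def bfs_step_vertex_centric(frontier, visited):
    # Iterate over active vertices only; efficient when the frontier
    # is small, but the per-vertex work is irregular.
    nxt = set()
    for u in frontier:
        for v in graph[u]:
            if v not in visited:
                nxt.add(v)
    return nxt

def bfs_step_edge_centric(frontier, visited):
    # Stream every edge; a regular access pattern that maps well onto
    # wide SIMD, but touches inactive edges too.
    return {v for (u, v) in edges if u in frontier and v not in visited}

frontier, visited = {0}, {0}
print(bfs_step_vertex_centric(frontier, visited))  # {1, 2}
print(bfs_step_edge_centric(frontier, visited))    # {1, 2}
```

A hybrid execution, as described above, would pick whichever style suits the current frontier density.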
... In this paper, we aim to achieve this by adopting Intel Xeon Phi coprocessors. This idea was inspired by the fact that Intel Xeon Phi [14] has been widely used to accelerate many applications, such as sparse matrix-vector multiplication [20], 1D FFT computations [21], Linpack benchmark calculation [22], molecular dynamics [23], and computational biology [24], [32], among others. ...
... Due to its weakness in I/O and control, the Xeon Phi coprocessor is widely used in this mode. As mentioned in Section 1, more and more applications are accelerated by Xeon Phi, from basic scientific computation to biology applications [20], [21], [22], [23], [24], [32]. Xeon Phi is showing its great potential in parallel computing. ...
Article
Full-text available
Single Nucleotide Polymorphism (SNP) detection is a fundamental procedure of whole-genome analysis. SOAPsnp, a classic detection tool, would take more than one week to analyze one typical human genome, which limits the efficiency of downstream analyses. In this paper, we present mSNP, an optimized version of SOAPsnp, which leverages Intel Xeon Phi coprocessors for large-scale SNP detection. First, we redesigned the essential data structures of SOAPsnp, which significantly reduces the memory footprint and improves computing efficiency. Then we developed a coordinated parallel framework for higher hardware utilization of both the CPU and the Xeon Phi. We also tailored the data structures and operations to utilize the wide VPU of Xeon Phi to improve data throughput. Last but not least, we proposed a read-based window division strategy to improve throughput and obtain better load balance. mSNP is the first SNP detection tool empowered by Xeon Phi. We achieved a 38x single-thread speedup on the CPU, without any loss in precision. Moreover, mSNP successfully scaled to 4,096 nodes on Tianhe-2. Our experiments demonstrate that mSNP is efficient and scalable for large-scale human genome SNP detection.
... An Intel® Xeon Phi™ coprocessor has at most 61 cores, each of which supports 4-way Hyper-Threading and includes a 512-bit-wide vector processing unit. The peak floating-point speed of a single MIC card can reach 1 TFLOPS [17]. One of the most important features of MIC is that it uses an x86-compatible instruction set, which makes porting software to MIC lighter and faster compared with GPU porting [18]. ...
Article
Molecular dynamics (MD) is a computer simulation method for studying the physical movements of atoms and molecules that provides detailed microscopic sampling at the molecular scale. With continuous efforts and improvements, MD simulation has gained popularity in materials science, biochemistry, and biophysics, with various application areas and expanding data scale. Assisted Model Building with Energy Refinement (AMBER) is one of the most widely used software packages for conducting MD simulations. However, the speed of AMBER MD simulations for systems with millions of atoms at the microsecond scale still needs to be improved. In this paper, we propose a parallel acceleration strategy for AMBER on the Tianhe-2 supercomputer. The parallel optimization of AMBER is carried out on three different levels: fine-grained OpenMP parallelism on a single CPU, single-node CPU/MIC parallel optimization, and multi-node multi-MIC collaborative parallel acceleration. With the three-level parallel acceleration strategy above, we achieved a highest speedup of 25-33x compared with the original program.
... Tang et al. [24] developed an alternative algorithm based on convolution and oversampling which reduces the communication requirements to a single all-to-all transposition and a negligible halo exchange of the oversampled values. Park et al. [20] further analyze and present results for this method with Intel Xeon Phi coprocessors. ...
Conference Paper
Communication-avoiding algorithms have been a subject of growing interest in the last decade due to the growth of distributed memory systems and the disproportionate increase of computational throughput to communication bandwidth. For distributed 1D FFTs, communication costs quickly dominate execution time as all industry-standard implementations perform three all-to-all transpositions of the data. In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies heavily on existing library primitives, demonstrate that our strategy achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on up to eight NVIDIA Tesla P100 GPUs, and develop an accurate compute model to analyze the performance and dependencies of the algorithm.
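The traffic saving described in the abstract can be sketched with a back-of-the-envelope model (our illustration with assumed sizes, not the paper's measurements):

```python
# Why reducing three all-to-all transposes to ~one matters for a
# distributed 1D FFT: per-rank traffic scales linearly with the number
# of transposes.

def all_to_all_bytes_per_rank(n_total: int, ranks: int, bytes_per_elem: int = 16) -> float:
    """Bytes each rank sends in one all-to-all transpose of n_total
    complex-double (16-byte) elements spread evenly over `ranks` ranks."""
    local = n_total / ranks
    return local * bytes_per_elem * (ranks - 1) / ranks  # all but the local slice moves

n, p = 2**30, 64                                      # assumed problem and rank count
standard = 3 * all_to_all_bytes_per_rank(n, p)        # industry-standard: 3 transposes
comm_avoiding = 1 * all_to_all_bytes_per_rank(n, p)   # ~1 transpose (+ small halo, ignored)
print(f"traffic ratio: {standard / comm_avoiding:.1f}x")  # 3.0x
```

The halo exchange of oversampled values adds a small constant on top of the single transpose, which is why the observed speedups (1.3x-2.2x) fall short of the ideal 3x.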
... This trend indicates that determining the optimal mapping to the Xeon Phi is critical for application developers. In a total of 19 works (Aprà et al., 2014; Brown et al., 2015; Heinecke et al., 2013; Höhnerbach et al., 2016; Jundt et al., 2015; Krishnaiyer et al., 2013; Lai et al., 2014; Liu et al., 2015; Lopez et al., 2015; Mathew et al., 2015; Misra et al., 2013; Newburn et al., 2013; Park et al., 2013; Saini et al., 2015; Sainz et al., 2015; Saule et al., 2014; Teodoro et al., 2014), it was found that the Xeon Phi outperformed the CPU, and only two works found the CPU to be better (Li et al., 2014; Luo et al., 2013). Hence, the Xeon Phi was found to be a promising accelerator. ...
Article
Power draw is a complex physical response to the workload of a given application on the hardware, which is difficult to model, in part, due to its variability. The empirical mode decomposition and Hilbert–Huang transform (EMD/HHT) is a method commonly applied to time-varying physical systems to analyze their complex behavior. In the authors' work, the EMD/HHT is considered for the first time to study the power usage of high-performance applications. Here, this method is applied to power measurement sequences (called here power traces) collected on three different computing platforms featuring two generations of Intel Xeon Phi, which are an attractive solution under power budget constraints. The high-performance applications explored in this work are codesign molecular dynamics and the general atomic and molecular electronic structure system, which exhibit different power draw characteristics, to showcase the strengths and limitations of the EMD/HHT analysis. Specifically, EMD/HHT measures the intensity of an execution, which shows the concentration of power draw with respect to execution time and provides insights into performance bottlenecks. This article compares intensity among executions, noting a relationship between intensity and execution characteristics such as computation amount and data movement. In general, this article concludes that the EMD/HHT method is a viable tool to compare application power usage and performance over the entire execution and that it has much potential in selecting the most appropriate execution configurations.
... Some works focus on the implementation of existing algorithms on coprocessors. For example, [33] developed a multi-node 1D FFT implementation on coprocessors; [27] implemented a sparse matrix-vector multiplication; and [32] developed a SQL engine that benefited from the inherent parallelism of Xeon Phi coprocessors. ...
Article
Full-text available
Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used. Our results show that, although the Xeon Phi delivers relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with conventional multicore systems for the execution of speculatively parallelized code.
... An Intel® Xeon Phi™ coprocessor has at most 61 cores, each of which supports 4-way Hyper-Threading and includes a 512-bit-wide vector processing unit. The peak floating-point speed of a single MIC card can reach 1 TFLOPS [17]. One of the most important features of MIC is that it uses an x86-compatible instruction set, which makes porting software to MIC lighter and faster compared with GPU porting [18]. ...
Conference Paper
Molecular dynamics (MD) is a computer simulation method for studying the physical movements of atoms and molecules that provides detailed microscopic sampling at the molecular scale. With continuous efforts and improvements, MD simulation has gained popularity in materials science, biochemistry, and biophysics, with various application areas and expanding data scale. Assisted Model Building with Energy Refinement (AMBER) is one of the most widely used software packages for conducting MD simulations. However, the speed of AMBER MD simulations for systems with millions of atoms at the microsecond scale still needs to be improved. In this paper, we propose a parallel acceleration strategy for AMBER on the Tianhe-2 supercomputer. The parallel optimization of AMBER is carried out on three different levels: fine-grained OpenMP parallelism on a single MIC, single-node CPU/MIC collaborative parallel optimization, and multi-node multi-MIC collaborative parallel acceleration. With the three-level parallel acceleration strategy above, we achieved a highest speedup of 25-33x compared with the original program. Source Code: https://github.com/tianhe2/mAMBER.
... The studies mainly focus on providing a high level of multithreading parallelism [5]-[7]. Others have studied how to utilize many-core architectures to solve the scaling issue [8]-[10]. However, much previous work relies on the Intel compiler's auto-vectorization. ...
Conference Paper
Full-text available
Intel Xeon Phi is a processor based on the MIC architecture that contains a large number of compute cores with high local memory bandwidth and 512-bit vector processing units. To achieve high performance on Xeon Phi, it is important for programmers to explore all the software features provided by the Intel compiler and libraries to fully utilize the new hardware resources. In this paper, we use the K-Means algorithm to study the performance of various Intel software settings available for Xeon Phi and their impact on the performance of K-Means. First, we examine different memory layouts for storing data points using Intel compiler-intrinsic functions. In distance calculation, the computational kernel of K-Means, when the size of individual input data points is not vector-friendly, we pad the data points to align with the VPU width. Finally, we implement a parallel reduction to increase memory access parallelism and cache hits. These techniques enable us to take advantage of thread-level parallelism and data-level parallelism on Xeon Phi. Experimental results demonstrate large performance gains over the default auto-vectorization approach. The K-Means implementation with the proposed techniques achieves up to 68.65% and 56.14% performance improvements for aligned and unaligned datasets, respectively. For high-dimensional aligned datasets, we achieved up to 53.49% performance improvement on a large-scale parallel computer.
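The padding idea in the abstract above can be sketched as follows (a hedged NumPy illustration under our own names and assumptions, not the paper's code): pad each point's dimension up to a multiple of the VPU width so distance loops stay vector-aligned, noting that a 512-bit VPU holds 16 single-precision floats.

```python
import numpy as np

VPU_FLOATS = 16  # 512-bit vector width / 32-bit float

def pad_points(points: np.ndarray) -> np.ndarray:
    """Zero-pad the feature dimension up to a multiple of the VPU width."""
    n, d = points.shape
    padded_d = -(-d // VPU_FLOATS) * VPU_FLOATS  # ceiling to a multiple of 16
    out = np.zeros((n, padded_d), dtype=np.float32)
    out[:, :d] = points
    return out

pts = np.ones((4, 10), dtype=np.float32)  # 10-dim points: not vector-friendly
padded = pad_points(pts)
print(padded.shape)  # (4, 16)
# Zero padding leaves Euclidean distances, and hence K-Means assignments,
# unchanged:
print(float(np.sqrt(((padded[0] - padded[1]) ** 2).sum())))  # 0.0
```

The trade-off is extra memory and a few wasted lanes per point in exchange for aligned, full-width SIMD loads in the distance kernel.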