Fig 1: The relative speed of the FPGA implementation vs. the CPU implementation.


Source publication
Conference Paper
Full-text available
Heterogeneous computing offers a promising solution for high-performance and energy-efficient computing. Until recently, the high-performance heterogeneous computing arena was dominated by discrete GPUs, but in recent years new solutions based on devices such as APUs and FPGAs have emerged. These new solutions show promise for further improvements i...

Context in source publication

Context 1
... in terms of total execution time, NN and LavaMD run 1.6X and 4.2X faster on the FPGA than on the CPU. The relative speedup can be seen in Fig 1. The throughput of the document classification application is 312 MB/s on the CPU and 324 MB/s on the FPGA. ...
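For the document classification case, the modest gain follows directly from the two throughput figures quoted above; as a worked ratio:

    \[ \text{speedup} = \frac{324\,\text{MB/s}}{312\,\text{MB/s}} \approx 1.04\times \]

i.e. the FPGA is roughly 4% faster on that workload, with the bigger difference showing up in power draw (see the performance-per-Watt discussion in the citations below).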

Similar publications

Article
Full-text available
Cloud environments today increasingly feature hybrid nodes containing multicore CPU processors and a diverse mix of accelerators, such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs), to ease the migration of HPC workloads to them. While virtualization of accelerators in clo...
Thesis
Full-text available
The latest advances in reconfigurable systems have resulted in FPGA devices with enough resources to implement multiprocessing systems with more than 100 soft-cores on a single FPGA. In this thesis I present how to implement solutions based on such systems, and I describe two scenarios where such systems would be valuable. FPGAs are increasingly b...
Article
Full-text available
In a technologically advanced world, the automation of various processes is essential in many applications, such as security, production, medicine, and remote sensing. In the past, computing power and algorithmic complexity were barriers to using image processing as a solution in many applications. Recently, many researchers have proposed co...
Article
Full-text available
The ever-increasing computational requirements of HPC and service-provider applications are becoming a great challenge for hardware and software designers. These requirements are reaching levels where isolated development in either computational field is no longer enough to meet the challenge. A holistic view of computational thinking is th...
Technical Report
Full-text available
The Zynq platform combines a general-purpose processor and a Field Programmable Gate Array (FPGA) within one single chip. Zynq-like systems are likely to become commonplace in various future computing systems, from HPC compute nodes to embedded systems. In an HPC setting, the reconfigurable FPGA will be used to implement application-specific ac...

Citations

... An adaptive compute acceleration platform (ACAP) architecture was demonstrated that provides a hybrid of programmable logic and vector processing, improving the computation speed of data centers as well as the performance of wireless network communications. This kind of architecture is 20x faster than an FPGA and 100x faster than a CPU [39], as discussed in Table 6. ...
... Two HPC applications, Nearest Neighbors and LavaMD (molecular dynamics), as well as a Document Classification application, were implemented on a Nallatech FPGA with an OpenCL compiler. This hardware architecture produces results 4.3x, 5.3x, and 1.3x faster than a Xeon-class processor [39]. Power dissipation is also reduced in the FPGA implementation. ...
... Implementation time of CPU vs. FPGA [39] ...
... Leading high performance computing (HPC) systems are steadily embracing heterogeneity of compute and memory resources to preserve performance scaling and reduce system power [1], [2], [3]. This trend is already apparent with the integration of GPUs [4], [5], [6] and is expected to continue with fixed-function or reconfigurable accelerators such as field programmable gate arrays (FPGAs) [7], [8], [9], [10], [11], [12], [13], and heterogeneous memory [14]. Also, key HPC workloads show considerable diversity in computational and memory access patterns [15], [16]. ...
... Leading high performance computing (HPC) systems are steadily embracing heterogeneity of compute and memory resources to preserve performance scaling and reduce system power Liu et al. [2012], Top [2018], Ujaldón [2016]. This trend is already apparent with the integration of GPUs Mittal and Vetter [2015], Tiwari et al. [2015], Gao and Zhang [2016] and is expected to continue with fixed-function or reconfigurable accelerators such as field programmable gate arrays (FPGAs) Milojicic [2020], Asaadi and Chapman [2017], Segal et al. [2014], Hogervorst et al. [2021], Lant et al. [2020], Dimond et al. [2011], Ramirez-Gargallo et al. [2019], emerging customized accelerators, and heterogeneous memory Venkata et al. [2017]. In addition, key HPC workloads show considerable diversity in computational and memory access patterns Michelogiannakis et al. [2022], Rodrigo et al. [2016]. ...
Preprint
Full-text available
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Disaggregation separates servers into their constituent compute and memory resources so that they can be allocated as required to each workload. Previous work has shown the potential of intra-rack resource disaggregation, but it is not clear how to realize these gains and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonic components can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the high escape bandwidth of contemporary multi-chip module (MCM) packages and all chip types in modern HPC racks with negligible power overhead. We show how to use distributed indirect routing to meet these demands without the significant reconfiguration complexity that spatial optical switches require. We then show that intra-rack resource disaggregation implemented using emerging photonics and parallel optical wavelength-selective switches satisfies bit error rate (BER) and bandwidth constraints and provides an average application speedup of 23.9% for 31 out-of-order CPU and 61.5% for 27 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation, due to their higher latency. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.
... On the one hand, Graphics Processing Units (GPUs) are one of the most popular architectures for accelerating CNN computations, thanks to their parallel arrays of streaming multiprocessors, which allow a straightforward elaboration of high-level parallel software algorithms [1]. On the other hand, the high performance of recent FPGAs, as well as their capability to be reprogrammed easily, makes them an appealing solution for high-performance-demanding algorithms with limited power consumption and high efficiency [2]. Furthermore, with the advancement of high-level synthesis tools, which speed up designers' productivity [3], the implementation of FPGA-based CNN core accelerators and of GPU-like architectures on FPGAs becomes feasible [4] [5]. ...
Article
Full-text available
Convolutional Neural Networks (CNNs) are quickly becoming one of the most common applications running on hardware accelerators. Field Programmable Gate Arrays (FPGAs), due to their high flexibility and computational performance, are suitable for fast classification tasks and therefore pave the way for new machine learning inference approaches. In this work, we first designed a fully interconnected CNN architecture implementable on a single FPGA. Second, we developed a new Neural Node-oriented placement algorithm to enable resilient CNN accelerators on space-grade FPGAs. The proposed solution reduces the single-event transient error sensitivity of CNN single neuron cores while achieving high performance and effective overall convolutional architecture fault tolerance. The developed approach has been applied to and integrated into a state-of-the-art Radiation Tolerant FPGA (RTG4) implementation flow. The experimental evaluation has been performed on a Microchip test board through benchmark application performance evaluation and transient error analysis. Experimental results demonstrate a 27.2% improvement in maximum working frequency and a roughly threefold reduction in transient error sensitivity with respect to previous mitigation approaches.
... For supercomputer clusters, the dominant factor is energy consumption, and therefore we need to consider performance per Watt. From that perspective, the picture is quite different: the measured power consumption of the FPGA board is 25 W (Segal et al., 2014), while the host CPU consumes 160 W (not including RAM power consumption). So the FPGA simulation already has more than 3× better performance per Watt than the CPU. ...
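A quick sanity check on that figure, treating performance per Watt as work done per unit time divided by power (the runtimes T here are placeholders, not numbers from the papers):

    \[ \frac{\mathrm{perf/W}_{\mathrm{FPGA}}}{\mathrm{perf/W}_{\mathrm{CPU}}}
       = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{FPGA}}} \cdot \frac{P_{\mathrm{CPU}}}{P_{\mathrm{FPGA}}}
       = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{FPGA}}} \cdot \frac{160\,\mathrm{W}}{25\,\mathrm{W}}
       = 6.4\,\frac{T_{\mathrm{CPU}}}{T_{\mathrm{FPGA}}} \]

so the quoted >3× advantage holds as long as the FPGA run takes no more than roughly twice as long as the CPU run.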
Article
Full-text available
Among the many computational models for quantum computing, the Quantum Circuit Model is the most well-known and used model for interacting with current quantum hardware. The practical implementation of quantum computers is a very active research field. Despite this progress, access to physical quantum computers remains relatively limited. Furthermore, the existing machines are susceptible to random errors due to quantum decoherence, as well as being limited in their number of qubits, connectivity, and built-in error correction. Simulation on classical hardware is therefore essential to allow quantum algorithm researchers to test and validate new algorithms in a simulated-error environment. Computing systems are becoming increasingly heterogeneous, using a variety of hardware accelerators to speed up computational tasks. One such type of accelerator, the Field Programmable Gate Array (FPGA), is a reconfigurable circuit that can be programmed using standardized high-level programming models such as OpenCL and SYCL. FPGAs make it possible to create specialized, highly parallel circuits capable of mimicking the quantum parallelism properties of quantum gates, in particular for the class of quantum algorithms where many different computations can be performed concurrently or as part of a deep pipeline. They also benefit from very high internal memory bandwidth. This paper focuses on the analysis of quantum algorithms for applications in computational fluid dynamics. In this work we introduce novel quantum-circuit implementations of model lattice-based formulations for fluid dynamics, specifically the D1Q3 model using quantum computational basis encoding, as well as efficient simulation of the circuits using FPGAs. This work is a step toward a quantum-circuit formulation of the Lattice Boltzmann Method (LBM). For the quantum circuits implementing the nonlinear equilibrium distribution function in the D1Q3 lattice model, it is shown how circuit transformations can be introduced that facilitate the efficient simulation of the circuits on FPGAs, exploiting their fine-grained parallelism. We show that these transformations allow us to exploit more parallelism on the FPGA and improve memory locality. Preliminary results show that for this class of circuits the introduced transformations improve circuit execution time. We show that FPGA simulation of the reduced circuits results in more than 3× improvement in performance per Watt compared to the CPU simulation. We also present results from evaluating the same kernels on a GPU.
... Therefore, it is advised to use an XML-based programming language for all the needs of 6G. According to Segal et al. [16], heterogeneous computing is a potential approach for high-performance and energy-efficient computing. Until recently, the high-performance heterogeneous computing industry was dominated by discrete GPUs, but new options based on APUs and FPGAs have emerged. ...
Article
Full-text available
The exchange of information from one person to another is called communication, and telecommunication makes it possible with electronic devices and their tools. Alexander Graham Bell invented the basic telephone in 1876 in the USA. Telephones now take the form of mobile phones, which are the primary media for communicating and transmitting data. We currently use 5th-generation mobile network standards, yet some user requirements remain that are expected to be solved by the 6th-generation standards; by 2030, everyone is expected to be using 6G. The cloud computing model depends neither on location nor on any specific device to provide its service: it is an on-demand, service-oriented computational mechanism. Combining these two technologies as mobile cloud computing provides customized options with more flexible implementations. Artificial intelligence is used in devices in many fields. AI can be used in mobile network services (MNS) to provide more reliable and customized services to users, such as network operation monitoring and management, fraud detection and reduction in mobile transactions, and security for cyber devices. Combining cloud with AI in mobile network services in the 6th generation would improve human lives, for example through zero road accidents, advanced specialized health care, and zero crime rates in society. However, the most vital needs for sixth-generation standards are the capability to manage large volumes of records and high-data-rate connectivity per device. The sixth-generation mobile network is under development and has many exciting features. Security is the central issue, which needs to be sorted out using appropriate forensic mechanisms, and there is a need for high-performance computing to deliver improved services to the end user. Considering three-dimensional research methodologies (the technical dimension, the organizational dimension, and applications hosted on the cloud) in a high-performance computing environment leads to two different cases: real-time stream processing, and remote desktop connection and performance testing. By 'narrowing the targeted worldwide audience with a wide range of experiential opportunities,' this paper aims to deliver dynamic and varied resource allocation for reliable and justified on-demand services.
... The availability of OpenCL-HLS for FPGAs poses many interesting research questions to OpenCL-HLS designers and system architects. Many recent publications have studied and explored OpenCL execution efficiency on FPGA devices [9,10,11,24,25,8,26,27,28]. One such work [24] introduces a generic taxonomy to classify and maximize parallelism potential on Intel FPGAs. ...
Article
The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient, high-performance execution of massively parallel applications. At the same time, new design challenges emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path customization. This article presents a novel LLVM-based tool for decoupling memory access from computation when synthesizing massively parallel OpenCL kernels on FPGAs. To enable systematic decoupling, we use the idea of kernel parallelism and implement a new parallelism granularity that breaks kernels down into separate data-path and memory-path (memory read/write) sub-kernels, which work concurrently to overlap the computation of current threads [1] with the memory access of future threads (memory pre-fetching at large scale). At the same time, this paper proposes an LLVM-based static analysis to detect decouplable data, resolve data dependencies, and maximize concurrency across the kernels. The experimental results on eight Rodinia benchmarks on an Intel Stratix V FPGA demonstrate significant performance and energy improvements over the baseline implementation using the Intel OpenCL SDK. The proposed sub-kernel parallelism achieves more than 2x speedup, with only a 3% increase in resource utilization and a 7% increase in power consumption, which reduces the overall energy consumption by more than 40%.
... OpenCL-HLS has provided many benefits in the field of high-performance computing, opening up interesting areas of research for a wide variety of applications. Many recent articles have explored the possibility of improving the performance of applications on FPGAs [5], [8], [9], [14], [24]. One such method is to explore spatial parallelism on FPGAs [21]. ...
Conference Paper
Full-text available
OpenCL programmability combined with OpenCL High-Level Synthesis (OpenCL-HLS) tools has brought tremendous improvements to the reconfigurable computing field. FPGAs' inherent pipelined parallelism provides not only faster execution times but also power-efficient solutions when executing massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to the pipelined data-path, which hinders the benefits of data-path customization. This paper explores the efficiency of the "OpenCL Pipe" to hide memory access latency on cloud FPGAs by decoupling memory access from computation. The Pipe semantic is leveraged to split OpenCL kernels into "read", "compute" and "write-back" sub-kernels that work concurrently to overlap the computation of current threads with the memory access of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite v3.1. All our tests are conducted on the Xilinx VU9FP FPGA platform of the Amazon cloud-based AWS EC2 F1 instance. On average, we observe a 5.2x speedup with a 2.2x increase in memory bandwidth utilization and about a 2.5x increase in FPGA resource utilization over the baseline synthesis (Xilinx OpenCL-HLS).
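As a concrete illustration of the read/compute/write-back split described in this and the preceding article, here is a minimal OpenCL sketch. It assumes Xilinx's program-scope pipes and blocking pipe built-ins (xcl_reqd_pipe_depth, read_pipe_block/write_pipe_block); the kernel names, pipe depths, and the placeholder computation are illustrative, not the paper's actual code.

    // Program-scope pipes connect the sub-kernels (OpenCL 2.0 pipes;
    // the depth attribute is a Xilinx extension).
    pipe float in_pipe  __attribute__((xcl_reqd_pipe_depth(32)));
    pipe float out_pipe __attribute__((xcl_reqd_pipe_depth(32)));

    // "Read" sub-kernel: streams input from global memory into the pipe.
    __kernel void reader(__global const float *restrict in, int n) {
        for (int i = 0; i < n; i++) {
            float v = in[i];
            write_pipe_block(in_pipe, &v);
        }
    }

    // "Compute" sub-kernel: pure data-path, no global-memory accesses,
    // so its pipeline is not stalled by DRAM latency.
    __kernel void compute(int n) {
        for (int i = 0; i < n; i++) {
            float v;
            read_pipe_block(in_pipe, &v);
            v = v * v + 1.0f;            // placeholder computation
            write_pipe_block(out_pipe, &v);
        }
    }

    // "Write-back" sub-kernel: drains results to global memory.
    __kernel void writer(__global float *restrict out, int n) {
        for (int i = 0; i < n; i++) {
            float v;
            read_pipe_block(out_pipe, &v);
            out[i] = v;
        }
    }

All three kernels are launched together; because each pipe decouples producer from consumer, the memory reads for later iterations overlap with computation on earlier ones, which is the latency-hiding effect the paper measures.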
... The energy efficiency of FPGA devices stems from their customized, deeply pipelined datapaths, making them very attractive devices for high-performance, energy-efficient computing [7], [8]. Many studies [11], [13], [16] have investigated OpenCL programming capabilities to improve FPGA efficiency. Further, [3], [17], [18] have worked on exploiting parallelism on FPGAs with OpenCL attributes, as well as by suggesting new architectural modifications to improve performance. ...
Conference Paper
Full-text available
OpenCL for FPGAs has emerged as an attractive solution for realizing massively parallel compute-intensive applications. It offers a customizable application-specific datapath while abstracting away hardware development complexity. Research on OpenCL for FPGAs is at an early stage, and many aspects, such as matching spatial parallelism to the OpenCL execution semantics, have not been explored in detail. An in-depth understanding and formalization are required to enhance the efficiency of OpenCL codes on FPGAs and to fully exploit their parallelism potential. This paper presents a comprehensive study to identify, analyze and categorize spatial parallelism when mapping OpenCL kernels to FPGAs. The paper studies and explores the impact of Data-Path (DP) replication and Compute Unit (CU) replication on the performance and power efficiency of OpenCL execution on FPGAs. To this end, it proposes a generic taxonomy for classifying spatial parallelism when mapping OpenCL to FPGAs. This results in FPGA-aware OpenCL codes that can achieve much higher efficiency than a baseline implementation. Our experimental results on an Altera Stratix-V FPGA device for eight applications of the Rodinia benchmarks demonstrate that FPGA-aware OpenCL codes achieve 3.4X, 2.2X and 2.6X performance improvement on average for the SCU-MDP, MCU-SDP, and MCU-MDP versions over SCU-SDP as the baseline implementation. Furthermore, we compare performance and power efficiency against an AMD FirePro W7100 GPU. Our results demonstrate that benchmarks with regular execution patterns can outperform GPUs, achieving much higher performance per watt. Furthermore, OpenCL source-code decisions that exploit spatial parallelism can hide memory access latency and thus result in a higher speedup.
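For reference, CU and DP replication in this taxonomy map onto two kernel attributes in the Intel (Altera) FPGA SDK for OpenCL; a minimal sketch (the kernel and its body are illustrative, not from the paper):

    // MCU-MDP-style configuration: num_compute_units replicates the whole
    // compute unit (CU replication); num_simd_work_items vectorizes the
    // data-path inside each unit (DP replication). The SIMD width must
    // evenly divide the required work-group size (64 / 4 = 16 here).
    __attribute__((num_compute_units(2)))
    __attribute__((num_simd_work_items(4)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void scale(__global const float *restrict in,
                        __global float *restrict out,
                        const float alpha) {
        size_t gid = get_global_id(0);
        out[gid] = alpha * in[gid];      // placeholder data-path
    }

Dropping one or the other attribute yields the SCU-MDP and MCU-SDP variants that the paper compares against the SCU-SDP baseline.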
... While it has been shown theoretically and in practice [5] [6] [7] that heterogeneous systems offer potential performance and performance-per-watt improvements over homogeneous systems, it is, in our experience [8], not trivial to exploit the advantages of heterogeneity, because of existing non-unified memory architectures and the way accelerators integrate into the compute fabric. In order to achieve improved performance we need to maintain a high compute-to-data-transfer ratio [9]; otherwise we spend the majority of our time and energy moving data around and do not benefit from the inclusion of accelerators in a system's design. ...
... In our opinion, next to the problem of high-level programmability of heterogeneous systems/clusters, the biggest hurdle to large-scale general accelerator use is the memory hierarchy used with traditional accelerators today. Once we break that barrier by reducing the cost of memory transfer and/or increasing the compute intensity of an algorithm (as we did in this paper and in others [8] [7]), we discover that there is significant justification for using accelerators and heterogeneous systems instead of homogeneous systems. New shared-memory architectures and single-die heterogeneous devices [10] [11] should open up the way to more application acceleration opportunities. ...
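The pairwise additive n-body case from the paper below makes the compute-to-data-transfer argument concrete: with c an illustrative cost per pairwise interaction and b the bytes transferred per body (both placeholders, not figures from the paper), the ratio grows linearly with the number of bodies n:

    \[ \frac{\text{compute}}{\text{transfer}} \sim \frac{c \cdot n^2}{b \cdot n} = \frac{c}{b}\, n \]

so for large enough n the data-transfer cost is amortized and the accelerators stay busy.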
Conference Paper
Full-text available
In this paper we evaluate the potential of running a compute-intensive simulation on a heterogeneous cluster built from CPU, GPU and FPGA devices. We do so by augmenting a commercially available cluster of CPUs and GPUs with an FPGA device and running a distributed n-body simulation on top of Spark for unconventional cores (SparkCL) on the three different types of computing architectures. We show that, given an algorithm with a sufficiently high compute intensity, such as pairwise additive n-body, we can significantly increase performance and performance per watt in comparison to running the same algorithm on a homogeneous CPU-based cluster. In addition, we show the potential of using FPGAs in future commodity heterogeneous clusters alongside CPUs and GPUs.