Fig 1: The relative speed of the FPGA implementation vs. the CPU implementation.


Source publication
Conference Paper
Full-text available
Heterogeneous computing offers a promising solution for high-performance and energy-efficient computing. Until recently, the high-performance heterogeneous computing arena was dominated by discrete GPUs, but in recent years new solutions based on devices such as APUs and FPGAs have emerged. These new solutions show promise for further improvements i...

Context in source publication

Context 1
... in terms of total execution time, NN and LavaMD run 1.6X and 4.2X faster on the FPGA than on the CPU. The relative speedup can be seen in Fig 1. The throughput of the document classification application is 312 MB/s on the CPU and 324 MB/s on the FPGA. ...
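For the document classification case, the modest gain follows directly from the two throughput figures quoted above; as a worked ratio:

    \[ \text{speedup} = \frac{324\,\text{MB/s}}{312\,\text{MB/s}} \approx 1.04\times \]

i.e. the FPGA is roughly 4% faster on that workload, with the bigger difference showing up in power draw (see the performance-per-Watt discussion in the citations below).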

Similar publications

Article
Full-text available
Cloud environments today increasingly feature hybrid nodes containing multicore CPU processors and a diverse mix of accelerators, such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs), to ease the migration of HPC workloads to them. While virtualization of accelerators in clo...
Thesis
Full-text available
The latest advances in reconfigurable systems have resulted in FPGA devices with enough resources to implement multiprocessing systems with more than 100 soft-cores on a single FPGA. In this thesis I present how to implement solutions based on such systems, and I describe two scenarios where such systems would be valuable. FPGAs are increasingly b...
Article
Full-text available
In a technologically advanced world, the automation of various processes is essential in many applications, such as security, production, medicine, and remote sensing. In the past, computing power and algorithmic complexity were barriers to using image processing as a solution in many applications. Recently, many researchers have proposed co...
Article
Full-text available
The ever-increasing computational requirements of HPC and service-provider applications are becoming a great challenge for hardware and software designers. These requirements are reaching levels where isolated development in either computational field is no longer enough to meet the challenge. A holistic view of computational thinking is th...
Technical Report
Full-text available
The Zynq platform combines a general-purpose processor and a Field Programmable Gate Array (FPGA) within one single chip. Zynq-like systems are likely to become commonplace in various future computing systems, from HPC compute nodes to embedded systems. In an HPC setting, the reconfigurable FPGA will be used to implement application-specific ac...

Citations

... An adaptive compute acceleration platform (ACAP) architecture was demonstrated that provides a hybrid of programmable logic and vector processing, improving the computation speed of data centers as well as the performance of wireless network communications. This kind of architecture is 20x faster than an FPGA and 100x faster than a CPU [39], as discussed in Table 6. ...
... Two HPC applications, Nearest Neighbors and LavaMD (molecular dynamics), as well as a Document Classification application, were implemented on a Nallatech FPGA with an OpenCL compiler. This hardware architecture produces results 4.3x, 5.3x, and 1.3x faster than a Xeon-class processor [39]. Power dissipation is also reduced in the FPGA implementation. ...
... Implementation time of CPU vs. FPGA [39] ...
... Leading high performance computing (HPC) systems are steadily embracing heterogeneity of compute and memory resources to preserve performance scaling and reduce system power [1], [2], [3]. This trend is already apparent with the integration of GPUs [4], [5], [6] and is expected to continue with fixed-function or reconfigurable accelerators such as field programmable gate arrays (FPGAs) [7], [8], [9], [10], [11], [12], [13], and heterogeneous memory [14]. Also, key HPC workloads show considerable diversity in computational and memory access patterns [15], [16]. ...
... Leading high performance computing (HPC) systems are steadily embracing heterogeneity of compute and memory resources to preserve performance scaling and reduce system power Liu et al. [2012], Top [2018], Ujaldón [2016]. This trend is already apparent with the integration of GPUs Mittal and Vetter [2015], Tiwari et al. [2015], Gao and Zhang [2016] and is expected to continue with fixed-function or reconfigurable accelerators such as field programmable gate arrays (FPGAs) Milojicic [2020], Asaadi and Chapman [2017], Segal et al. [2014], Hogervorst et al. [2021], Lant et al. [2020], Dimond et al. [2011], Ramirez-Gargallo et al. [2019], emerging customized accelerators, and heterogeneous memory Venkata et al. [2017]. In addition, key HPC workloads show considerable diversity in computational and memory access patterns Michelogiannakis et al. [2022], Rodrigo et al. [2016]. ...
Preprint
Full-text available
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Disaggregation separates servers into their constituent compute and memory resources so that they can be allocated as required to each workload. Previous work has shown the potential of intra-rack resource disaggregation, but it is not clear how to realize these gains and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonic components can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the high escape bandwidth of contemporary multi-chip module (MCM) packages and all chip types in modern HPC racks with negligible power overhead. We show how to use distributed indirect routing to meet these demands without the significant reconfiguration complexity that spatial optical switches require. We then show that intra-rack resource disaggregation implemented using emerging photonics and parallel optical wavelength-selective switches satisfies bit error rate (BER) and bandwidth constraints and provides an average application speedup of 23.9% for 31 out-of-order CPU and 61.5% for 27 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation, due to their higher latency. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.
... On the one hand, Graphics Processing Units (GPUs) are one of the most popular architectures for accelerating CNN computations, thanks to their parallel arrays of streaming multiprocessors, which allow a straightforward elaboration of high-level parallel software algorithms [1]. On the other hand, the high performance of recent FPGAs, as well as their capability to be reprogrammed easily, makes them an appealing solution for high-performance-demanding algorithms with limited power consumption and high efficiency [2]. Furthermore, with the advancement of high-level synthesis tools, which speed up designers' productivity [3], the implementation of FPGA-based CNN core accelerators and of GPU-like architectures on FPGAs becomes feasible [4] [5]. ...
Article
Full-text available
Convolutional Neural Networks (CNNs) are quickly becoming one of the most common applications running on hardware accelerators. Field Programmable Gate Arrays (FPGAs), due to their high flexibility and computational performance, are suitable for fast classification tasks and therefore pave the way for new machine learning inference approaches. In this work, we first designed a fully interconnected CNN architecture implementable on a single FPGA. Second, we developed a new Neural Node-oriented placement algorithm to enable resilient CNN accelerators on space-grade FPGAs. The proposed solution reduces the single-event transient error sensitivity of CNN single neuron cores while achieving high performance and effective overall convolutional architecture fault tolerance. The developed approach has been applied to and integrated into a state-of-the-art Radiation Tolerant FPGA (RTG4) implementation flow. The experimental evaluation has been performed on a Microchip test board through benchmark application performance evaluation and transient error analysis. Experimental results demonstrate a 27.2% improvement in maximum working frequency and a roughly threefold reduction in transient error sensitivity with respect to previous mitigation approaches.
... For supercomputer clusters, the dominant factor is energy consumption, and therefore we need to consider performance per Watt. From that perspective, the picture is quite different: the measured power consumption of the FPGA board is 25 W (Segal et al., 2014), while the host CPU consumes 160 W (not including RAM power consumption). So the FPGA simulation already has more than 3× better performance per Watt than the CPU. ...
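A quick sanity check on that figure, treating performance per Watt as work done per unit time divided by power (the runtimes T here are placeholders, not numbers from the papers):

    \[ \frac{\mathrm{perf/W}_{\mathrm{FPGA}}}{\mathrm{perf/W}_{\mathrm{CPU}}}
       = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{FPGA}}} \cdot \frac{P_{\mathrm{CPU}}}{P_{\mathrm{FPGA}}}
       = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{FPGA}}} \cdot \frac{160\,\mathrm{W}}{25\,\mathrm{W}}
       = 6.4\,\frac{T_{\mathrm{CPU}}}{T_{\mathrm{FPGA}}} \]

so the quoted >3× advantage holds as long as the FPGA run takes no more than roughly twice as long as the CPU run.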
Article
Full-text available
Among the many computational models for quantum computing, the Quantum Circuit Model is the most well-known and used model for interacting with current quantum hardware. The practical implementation of quantum computers is a very active research field. Despite this progress, access to physical quantum computers remains relatively limited. Furthermore, the existing machines are susceptible to random errors due to quantum decoherence, as well as being limited in their number of qubits, connectivity, and built-in error correction. Simulation on classical hardware is therefore essential to allow quantum algorithm researchers to test and validate new algorithms in a simulated-error environment. Computing systems are becoming increasingly heterogeneous, using a variety of hardware accelerators to speed up computational tasks. One such type of accelerator, the Field Programmable Gate Array (FPGA), is a reconfigurable circuit that can be programmed using standardized high-level programming models such as OpenCL and SYCL. FPGAs make it possible to create specialized, highly parallel circuits capable of mimicking the quantum parallelism properties of quantum gates, in particular for the class of quantum algorithms where many different computations can be performed concurrently or as part of a deep pipeline. They also benefit from very high internal memory bandwidth. This paper focuses on the analysis of quantum algorithms for applications in computational fluid dynamics. In this work we introduce novel quantum-circuit implementations of model lattice-based formulations for fluid dynamics, specifically the D1Q3 model using quantum computational basis encoding, as well as efficient simulation of the circuits using FPGAs. This work is a step toward a quantum-circuit formulation of the Lattice Boltzmann Method (LBM). For the quantum circuits implementing the nonlinear equilibrium distribution function in the D1Q3 lattice model, it is shown how circuit transformations can be introduced that facilitate the efficient simulation of the circuits on FPGAs, exploiting their fine-grained parallelism. We show that these transformations allow us to exploit more parallelism on the FPGA and improve memory locality. Preliminary results show that for this class of circuits the introduced transformations improve circuit execution time. We show that FPGA simulation of the reduced circuits results in more than 3× improvement in performance per Watt compared to the CPU simulation. We also present results from evaluating the same kernels on a GPU.
... Therefore, it is advised to use an XML-based programming language for all the needs of 6G. According to Segal et al. [16], heterogeneous computing is a potential approach for high-performance and energy-efficient computing. Until recently, the high-performance heterogeneous computing industry was dominated by discrete GPUs, but new options based on APUs and FPGAs have emerged. ...
Article
Full-text available
The exchange of information from one person to another is called communication, and telecommunication makes it possible with electronic devices and their tools. Alexander Graham Bell invented the basic telephone in 1876 in the USA. Telephones now take the form of mobile phones, which are the primary media for communicating and transmitting data. We currently use 5th-generation mobile network standards, yet some user requirements remain that are expected to be solved by the 6th-generation standards; by 2030, everyone is expected to be using 6G. The cloud computing model depends neither on location nor on any specific device to provide its service: it is an on-demand, service-oriented computational mechanism. Combining these two technologies as mobile cloud computing provides customized options with more flexible implementations. Artificial intelligence is used in devices in many fields. AI can be used in mobile network services (MNS) to provide more reliable and customized services to users, such as network operation monitoring and management, fraud detection and reduction in mobile transactions, and security for cyber devices. Combining cloud with AI in mobile network services in the 6th generation would improve human lives, for example through zero road accidents, advanced specialized health care, and zero crime rates in society. However, the most vital needs for sixth-generation standards are the capability to manage large volumes of records and high-data-rate connectivity per device. The sixth-generation mobile network is under development and has many exciting features. Security is the central issue, which needs to be sorted out using appropriate forensic mechanisms, and there is a need for high-performance computing to deliver improved services to the end user. Considering three-dimensional research methodologies (the technical dimension, the organizational dimension, and applications hosted on the cloud) in a high-performance computing environment leads to two different cases: real-time stream processing, and remote desktop connection and performance testing. By 'narrowing the targeted worldwide audience with a wide range of experiential opportunities,' this paper aims to deliver dynamic and varied resource allocation for reliable and justified on-demand services.
... The availability of OpenCL-HLS for FPGAs poses many interesting research questions to OpenCL-HLS designers and system architects. Many recent publications have studied and explored OpenCL execution efficiency on FPGA devices [9,10,11,24,25,8,26,27,28]. One such work [24] introduces a generic taxonomy to classify and maximize parallelism potential on Intel FPGAs. ...
Article
The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient, high-performance execution of massively parallel applications. At the same time, new design challenges emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path customization. This article presents a novel LLVM-based tool for decoupling memory access from computation when synthesizing massively parallel OpenCL kernels on FPGAs. To enable systematic decoupling, we use the idea of kernel parallelism and implement a new parallelism granularity that breaks kernels down into separate data-path and memory-path (memory read/write) sub-kernels, which work concurrently to overlap the computation of current threads [1] with the memory access of future threads (memory pre-fetching at large scale). At the same time, this paper proposes an LLVM-based static analysis to detect decouplable data, resolve data dependencies, and maximize concurrency across the kernels. The experimental results on eight Rodinia benchmarks on an Intel Stratix V FPGA demonstrate significant performance and energy improvements over the baseline implementation using the Intel OpenCL SDK. The proposed sub-kernel parallelism achieves more than 2x speedup, with only a 3% increase in resource utilization and a 7% increase in power consumption, which reduces the overall energy consumption by more than 40%.
... OpenCL-HLS has provided many benefits in the field of high-performance computing, opening up interesting areas of research for a wide variety of applications. Many recent articles have explored the possibility of improving the performance of applications on FPGAs [5], [8], [9], [14], [24]. One such method is to explore spatial parallelism on FPGAs [21]. ...
Conference Paper
Full-text available
OpenCL programmability combined with OpenCL High-Level Synthesis (OpenCL-HLS) tools has brought tremendous improvements to the reconfigurable computing field. FPGAs' inherent pipelined parallelism provides not only faster execution times but also power-efficient solutions when executing massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to the pipelined data-path, which hinders the benefits of data-path customization. This paper explores the efficiency of the "OpenCL Pipe" to hide memory access latency on cloud FPGAs by decoupling memory access from computation. The Pipe semantic is leveraged to split OpenCL kernels into "read", "compute" and "write-back" sub-kernels that work concurrently to overlap the computation of current threads with the memory access of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite v3.1. All our tests are conducted on the Xilinx VU9FP FPGA platform of the Amazon cloud-based AWS EC2 F1 instance. On average, we observe a 5.2x speedup with a 2.2x increase in memory bandwidth utilization and about a 2.5x increase in FPGA resource utilization over the baseline synthesis (Xilinx OpenCL-HLS).
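As a concrete illustration of the read/compute/write-back split described in this and the preceding article, here is a minimal OpenCL sketch. It assumes Xilinx's program-scope pipes and blocking pipe built-ins (xcl_reqd_pipe_depth, read_pipe_block/write_pipe_block); the kernel names, pipe depths, and the placeholder computation are illustrative, not the paper's actual code.

    // Program-scope pipes connect the sub-kernels (OpenCL 2.0 pipes;
    // the depth attribute is a Xilinx extension).
    pipe float in_pipe  __attribute__((xcl_reqd_pipe_depth(32)));
    pipe float out_pipe __attribute__((xcl_reqd_pipe_depth(32)));

    // "Read" sub-kernel: streams input from global memory into the pipe.
    __kernel void reader(__global const float *restrict in, int n) {
        for (int i = 0; i < n; i++) {
            float v = in[i];
            write_pipe_block(in_pipe, &v);
        }
    }

    // "Compute" sub-kernel: pure data-path, no global-memory accesses,
    // so its pipeline is not stalled by DRAM latency.
    __kernel void compute(int n) {
        for (int i = 0; i < n; i++) {
            float v;
            read_pipe_block(in_pipe, &v);
            v = v * v + 1.0f;            // placeholder computation
            write_pipe_block(out_pipe, &v);
        }
    }

    // "Write-back" sub-kernel: drains results to global memory.
    __kernel void writer(__global float *restrict out, int n) {
        for (int i = 0; i < n; i++) {
            float v;
            read_pipe_block(out_pipe, &v);
            out[i] = v;
        }
    }

All three kernels are launched together; because each pipe decouples producer from consumer, the memory reads for later iterations overlap with computation on earlier ones, which is the latency-hiding effect the paper measures.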
... The energy efficiency of FPGA devices stems from their customized, deeply pipelined datapaths, making them very attractive devices for high-performance, energy-efficient computing [7], [8]. Many studies [11], [13], [16] have investigated OpenCL programming capabilities to improve FPGA efficiency. Further, [3], [17], [18] have worked on exploiting parallelism on FPGAs with OpenCL attributes, as well as by suggesting new architectural modifications to improve performance. ...
Conference Paper
Full-text available
OpenCL for FPGAs has emerged as an attractive solution for realizing massively parallel compute-intensive applications. It offers a customizable application-specific datapath while abstracting away hardware development complexity. Research on OpenCL for FPGAs is at an early stage, and many aspects, such as matching spatial parallelism to the OpenCL execution semantics, have not been explored in detail. An in-depth understanding and formalization are required to enhance the efficiency of OpenCL codes on FPGAs and to fully exploit their parallelism potential. This paper presents a comprehensive study to identify, analyze and categorize spatial parallelism when mapping OpenCL kernels to FPGAs. The paper studies and explores the impact of Data-Path (DP) replication and Compute Unit (CU) replication on the performance and power efficiency of OpenCL execution on FPGAs. To this end, it proposes a generic taxonomy for classifying spatial parallelism when mapping OpenCL to FPGAs. This results in FPGA-aware OpenCL codes that can achieve much higher efficiency than a baseline implementation. Our experimental results on an Altera Stratix-V FPGA device for eight applications of the Rodinia benchmarks demonstrate that FPGA-aware OpenCL codes achieve 3.4X, 2.2X and 2.6X performance improvement on average for the SCU-MDP, MCU-SDP, and MCU-MDP versions over SCU-SDP as the baseline implementation. Furthermore, we compare performance and power efficiency against an AMD FirePro W7100 GPU. Our results demonstrate that benchmarks with regular execution patterns can outperform GPUs, achieving much higher performance per watt. Furthermore, OpenCL source-code decisions that exploit spatial parallelism can hide memory access latency and thus result in a higher speedup.
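For reference, CU and DP replication in this taxonomy map onto two kernel attributes in the Intel (Altera) FPGA SDK for OpenCL; a minimal sketch (the kernel and its body are illustrative, not from the paper):

    // MCU-MDP-style configuration: num_compute_units replicates the whole
    // compute unit (CU replication); num_simd_work_items vectorizes the
    // data-path inside each unit (DP replication). The SIMD width must
    // evenly divide the required work-group size (64 / 4 = 16 here).
    __attribute__((num_compute_units(2)))
    __attribute__((num_simd_work_items(4)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void scale(__global const float *restrict in,
                        __global float *restrict out,
                        const float alpha) {
        size_t gid = get_global_id(0);
        out[gid] = alpha * in[gid];      // placeholder data-path
    }

Dropping one or the other attribute yields the SCU-MDP and MCU-SDP variants that the paper compares against the SCU-SDP baseline.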
... While it has been shown theoretically and in practice [5] [6] [7] that heterogeneous systems offer potential performance and performance-per-watt improvements over homogeneous systems, it is, in our experience [8], not trivial to exploit the advantages of heterogeneity, because of existing non-unified memory architectures and the way accelerators integrate into the compute fabric. In order to achieve improved performance we need to maintain a high compute-to-data-transfer ratio [9]; otherwise we spend the majority of our time and energy moving data around and do not benefit from the inclusion of accelerators in a system's design. ...
... In our opinion, next to the problem of high-level programmability of heterogeneous systems/clusters, the biggest hurdle to large-scale general accelerator use is the memory hierarchy used with traditional accelerators today. Once we break that barrier by reducing the cost of memory transfer and/or increasing the compute intensity of an algorithm (as we did in this paper and in others [8] [7]), we discover that there is significant justification for using accelerators and heterogeneous systems instead of homogeneous systems. New shared-memory architectures and single-die heterogeneous devices [10] [11] should open up the way to more application acceleration opportunities. ...
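The pairwise additive n-body case from the paper below makes the compute-to-data-transfer argument concrete: with c an illustrative cost per pairwise interaction and b the bytes transferred per body (both placeholders, not figures from the paper), the ratio grows linearly with the number of bodies n:

    \[ \frac{\text{compute}}{\text{transfer}} \sim \frac{c \cdot n^2}{b \cdot n} = \frac{c}{b}\, n \]

so for large enough n the data-transfer cost is amortized and the accelerators stay busy.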
Conference Paper
Full-text available
In this paper we evaluate the potential of running a compute-intensive simulation on a heterogeneous cluster built from CPU, GPU and FPGA devices. We do so by augmenting a commercially available cluster of CPUs and GPUs with an FPGA device and running a distributed n-body simulation on top of Spark for unconventional cores (SparkCL) on the three different types of computing architectures. We show that, given an algorithm with a sufficiently high compute intensity, such as pairwise additive n-body, we can significantly increase performance and performance per watt in comparison to running the same algorithm on a homogeneous CPU-based cluster. In addition, we show the potential of using FPGAs in future commodity heterogeneous clusters alongside CPUs and GPUs.