Multi-bit accelerator architecture [22].

Source publication
Article
Full-text available
Convolutional Neural Networks (CNNs) are widely employed in contemporary artificial intelligence systems. However, these models have millions of connections between layers, which are both memory-prohibitive and computationally expensive. Employing these models in an embedded mobile application is resource-limited, with high power consumption an...

Contexts in source publication

Context 1
... top-level architecture of the accelerator with its memory hierarchy is shown in Fig. 1. It has two clock domains: a processing clock domain for the processing stage, operating at 180 MHz (for the FPGA implementation) or 800 MHz (for the ASIC implementation), and a memory clock domain for communication with the MIG-7 (Memory Interface Generator) interface to access the DDR3 DRAM, operating at 200 MHz. The two domains communicate with ...
Context 2
... networks on a Virtex UltraScale FPGA. In Analysis_1, we further tried to reduce the bit width of the last layer of each of the four DNN models. Based on that, we reported the resource utilization, power consumption, and peak performance shown in Table 12. Further, the top-1 and top-5 accuracy computation for the different DNN models is shown in Fig. 9 and Fig. 10. During the computation of the top-1 and top-5 accuracies, different bit widths were exploited on the last layer of the DNNs, where the softmax function is applied and the class of the image is obtained. The accuracies obtained are relative to the performance of the original network. The overall FPGA resource utilization ...
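The top-1/top-5 accuracy computation mentioned in this context can be sketched in plain Python. This is an illustrative sketch, not the paper's code; the scores and labels below are made-up toy data, and k=2 stands in for top-5 since the toy example has only four classes.

```python
# Illustrative sketch of top-k accuracy: the fraction of samples whose
# true class appears among the k highest-scoring classes.
def top_k_accuracy(scores, labels, k):
    """scores: list of per-class score rows; labels: true class indices."""
    correct = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            correct += 1
    return correct / len(labels)

# Toy example: 3 samples, 4 classes (invented values, not from the paper)
scores = [
    [0.10, 0.60, 0.20, 0.10],   # true class 1 -> top-1 hit
    [0.30, 0.40, 0.20, 0.10],   # true class 0 -> top-1 miss, top-2 hit
    [0.25, 0.25, 0.40, 0.10],   # true class 3 -> miss either way
]
labels = [1, 0, 3]

top1 = top_k_accuracy(scores, labels, 1)   # 1/3
top2 = top_k_accuracy(scores, labels, 2)   # 2/3
```

Note that ties between equal scores are resolved here by index order; a real evaluation pipeline would typically operate on the full softmax output of the last layer, which is exactly where the context above varies the bit width.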

Similar publications

Article
Full-text available
Approximate computing is one of the emerging concepts in multimedia applications such as image processing, and it is receiving growing attention from researchers. By sacrificing a small amount of accuracy in the design, it reduces circuit parameters such as area complexity, delay, and power. The purpose of this...
Article
Full-text available
Keyword spotting (KWS) systems are used for human–machine communications in various applications. In many cases, KWS involves a combination of wake-up-word (WUW) recognition for device activation and voice command classification tasks. These tasks present a challenge for embedded systems due to the complexity of deep learning algorithms and the nee...
Article
Full-text available
This work presents a 32-bit Reduced Instruction Set Computer fifth-generation (RISC-V) microprocessor with a COordinate Rotation DIgital Computer (CORDIC) accelerator. The accelerator is implemented inside the core and is used by the software via a custom instruction. The microprocessor used is the VexRiscv with the Instruction Set Architecture (I...
Article
Full-text available
Abstract In Ultra‐Low‐Power (ULP) applications, power consumption is a key parameter for process independent architectural level design decisions. Traditionally, time‐consuming Spice simulations are used to measure the static power consumption. Herein, a technology‐independent static power estimation model is presented, which can estimate static po...
Article
Full-text available
The Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) are highly efficient algorithms with a wide range of Digital Signal Processing (DSP) and telecommunication applications. An FFT/IFFT structure with any number of complex-valued inputs/outputs can be categorized as Decimation In Frequency (DIF) or Decimation In Time (DIT) decompos...

Citations

... The images are downloaded from satellites to processing equipment at ground stations, where algorithms are used to analyze and interpret them. However, in recent years, the resolution and data volume of remote sensing images have increased rapidly, placing a considerable burden on data downlinks [10]. An intuitive and effective solution to this problem is to perform the image ...
• An algorithm–hardware co-optimization and deployment method for FPGA-based CNN remote sensing image processing is proposed, including a series of hardware-centric model optimization techniques and a versatile FPGA-based CNN accelerator architecture. ...
... The weights can be quantized into integers for inference using Equations (9) and (10). However, directly obtaining the scaling factor of input fmaps with Equation (10) is not feasible. ...
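The distinction drawn in this context — weight scales can be computed offline from the stored weights, while input-fmap scales cannot be obtained the same way — can be illustrated with a minimal symmetric-quantization sketch. This is not the cited paper's Equations (9) and (10); all function names and values below are invented for illustration.

```python
# Illustrative symmetric integer quantization (not the cited paper's method).
def symmetric_scale(values, n_bits=8):
    """Scale factor mapping floats onto signed n-bit integers."""
    max_abs = max(abs(v) for v in values)
    return max_abs / (2 ** (n_bits - 1) - 1)

def quantize(values, scale):
    """Round to integers and clamp to the signed 8-bit range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

# Weights are known at compile time, so their scale is directly computable.
weights = [0.5, -1.27, 0.02, 0.9]
w_scale = symmetric_scale(weights)        # 1.27 / 127 = 0.01
q_weights = quantize(weights, w_scale)    # [50, -127, 2, 90]

# Input fmaps, by contrast, have a dynamic range unknown until runtime,
# so their scale is typically estimated offline from representative
# calibration batches rather than computed directly per inference.
calibration_batch = [[0.1, 2.0, -1.5], [0.7, -0.3, 1.9]]
a_scale = symmetric_scale([v for fmap in calibration_batch for v in fmap])
```

This is exactly why, as the context notes, a weight-quantization formula cannot simply be reused for input fmaps: the weight statistics are fixed, the activation statistics are not.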
Article
Full-text available
In recent years, convolutional neural networks (CNNs) have gained widespread adoption in remote sensing image processing. Deploying CNN-based algorithms on satellite edge devices can alleviate the strain on data downlinks. However, CNN algorithms present challenges due to their large parameter count and high computational requirements, which conflict with the satellite platforms’ low power consumption and high real-time requirements. Moreover, remote sensing image processing tasks are diverse, requiring the platform to accommodate various network structures. To address these issues, this paper proposes an algorithm–hardware co-optimization and deployment method for FPGA-based CNN remote sensing image processing. Firstly, a series of hardware-centric model optimization techniques are proposed, including operator fusion and depth-first mapping technology, to minimize the resource overhead of CNN models. Furthermore, a versatile hardware accelerator is proposed to accelerate a wide range of commonly used CNN models after optimization. The accelerator architecture mainly consists of a parallel configurable network processing unit and a multi-level storage structure, enabling the processing of optimized networks with high throughput and low power consumption. To verify the superiority of our method, the introduced accelerator was deployed on an AMD-Xilinx VC709 evaluation board, on which the improved YOLOv2, VGG-16, and ResNet-34 networks were deployed. Experiments show that the power consumption of the accelerator is 14.97 W, and the throughput of the three networks reaches 386.74 giga operations per second (GOPS), 344.44 GOPS, and 182.34 GOPS, respectively. Comparison with related work demonstrates that the co-optimization and deployment method can accelerate remote sensing image processing CNN models and is suitable for applications in satellite edge devices.
... In [21,22], the authors accelerate EfficientNet-lite, a simplified version of the EfficientNet series that also lacks implementations of the SE module and complex NAF. In [23], the authors can only accelerate EfficientNet-B0, without implementations of DWC and NAF, and cannot support other versions of the EfficientNet series because of their customized pipelined architecture. In [24], a hardware/software co-optimizer is proposed that provides adaptive data reuse to minimize off-chip memory access, improves MAC efficiency under the on-chip buffer constraints, and accelerates EfficientNet-B1 with 2240 DSP slices. ...
... Our hardware accelerator can achieve the performance of 69.50 FPS (frames per second) with a throughput of 255.22 GOPS (giga operations per second) in EfficientNet-B3. In [21,[23][24][25], four typical EfficientNet hardware accelerators based on FPGA are presented with abundant experimental results. The hardware accelerator in [23] can only accelerate EfficientNet-B0 because of its customized pipelined architecture for each layer in EfficientNet. ...
... Compared with the latest EfficientNet-B3 accelerator [25] based on the same FPGA platform, our hardware accelerator can achieve a 1.28-times improvement in throughput. ...
Article
Full-text available
Since the lightweight convolutional neural network EfficientNet was proposed by Google in 2019, the series of models have quickly become very popular due to their superior performance with a small number of parameters. However, the existing convolutional neural network hardware accelerators for EfficientNet still have much room to improve the performance of the depthwise convolution, squeeze-and-excitation module and nonlinear activation functions. In this paper, we first design a reconfigurable register array and computational kernel to accelerate the depthwise convolution. Next, we propose a vector unit to implement the nonlinear activation functions and the scale operation. An exchangeable-sequence dual-computational kernel architecture is proposed to improve the performance and the utilization. In addition, the memory architectures are designed to complete the hardware accelerator for the above computing architecture. Finally, in order to evaluate the performance of the hardware accelerator, the accelerator is implemented based on Xilinx XCVU37P. The results show that the proposed accelerator can work at the main system clock frequency of 300 MHz with the DSP kernel at 600 MHz. The performance of EfficientNet-B3 in our architecture can reach 69.50 FPS and 255.22 GOPS. Compared with the latest EfficientNet-B3 accelerator, which uses the same FPGA development board, the accelerator proposed in this paper can achieve a 1.28-fold improvement of single-core performance and 1.38-fold improvement of performance of each DSP.
... These include, for example, the burst length of the on-chip bus, the number of ports, and the bit width of the bus [1], [10]. In addition, the clock frequencies of the on-chip bus and the DRAM subsystem may in general differ from those of the accelerator [15]- [19]. In this study, it is assumed that the clock frequencies are divided into the computation clock domain (fp) and communication clock domain (fm) of the accelerator, the microprocessor clock domain (fup), and the DRAM clock domain (fd), as shown in Fig. 1. ...
Article
Full-text available
Multicore accelerators have emerged to efficiently execute recent applications with complex computational dimensions. Compared to a single-core accelerator, a multicore accelerator handles a larger amount of communication and computation simultaneously. Since the conventional performance estimation algorithm tailored to single-core accelerators cannot estimate the performance of multicore accelerators accurately, we propose a novel performance estimation algorithm for a multicore accelerator. The proposed algorithm predicts the dynamic communication bandwidth of each direct memory access controller (DMAC) based on the runtime state of the DMACs, making it possible to estimate the communication amounts handled by DMACs accurately by taking temporal intervals into account. The proposed algorithm is evaluated for convolutional neural networks and wireless communications. The experimental results using a pre-register-transfer-level (RTL) simulator show that the proposed algorithm can estimate the performance of a multicore accelerator with an estimation error of up to 2.8%, regardless of the system communication bandwidth. These results were also verified by hardware implementations on Xilinx ZYNQ. In addition, the proposed algorithm is used to explore a design space of accelerator core dimensions, and the resulting optimal core dimension provides performance gains of 10.8% and 31.2% compared to the conventional multicore accelerator and single-core accelerator, respectively. The source code is available on the GitHub repository: https://github.com/SDL-KU/OptAccTile .
... Here the hardware accelerator frequency is normalized to the DRAM subsystem frequency and is assumed to be at most twice the DRAM subsystem frequency [29]. In Figure 7, the hardware accelerator frequency is assumed to range from a minimum of 125 MHz to a maximum of 1000 MHz [41][42][43]. In addition, the DRAM frequency is assumed to be fixed at 500 MHz [44]. ...
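The frequency assumptions quoted in this context are easy to restate numerically. The sketch below only reproduces the figures given in the text (DRAM fixed at 500 MHz, accelerator swept from 125 MHz to 1000 MHz); the variable names are invented for illustration.

```python
# Frequency assumptions from the quoted context: the accelerator clock is
# normalized to a fixed 500 MHz DRAM subsystem clock.
DRAM_MHZ = 500
accel_mhz = [125, 250, 500, 1000]          # sweep range from the text
normalized = [f / DRAM_MHZ for f in accel_mhz]   # [0.25, 0.5, 1.0, 2.0]
# The maximum normalized value is 2.0, matching the assumption that the
# accelerator runs at up to twice the DRAM subsystem frequency.
```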
Article
Full-text available
The hardware accelerator controlled by direct memory access (DMA) is greatly influenced by the communication bandwidth from/to DRAM through on-chip buses. This paper proposes a novel performance estimation algorithm to optimize the communication schemes (CSs), which are defined by the number of direct memory access controllers (DMACs) and the bank allocation of DRAM. In order to facilitate the optimization of CSs, a communication primitive (CP) is defined by the bank allocation and the set of activated DMACs. By using the communication bandwidths of CPs obtained from prior full-system simulations, the proposed performance estimation algorithm can predict the communication performance of CSs more accurately than the conventional performance estimation algorithms. When it is applied to convolutional neural networks (CNNs) and wireless communications (LDPC-coded MIMO-OFDM), the estimation error is measured to be no more than 6.4% and 5%, respectively. In addition, compared with conventional simulation-based approaches, the proposed estimation algorithm provides a speedup of two orders of magnitude. The proposed performance estimation algorithm is used to optimize the CS of the CNNs and explore a design space characterized by bank interleaving, outstanding transactions, layer shape, tile size, and hardware frequency. It is shown that the optimized CS improves communication performance by up to 68% for the third convolutional layers of AlexNet and 60% for the MIMO of LDPC-coded MIMO-OFDM. In addition, the DRAM latency is minimized by setting the bank interleaving to the number of outstanding transactions. Moreover, the simulation results show that the optimum CS depends on the application. It is also shown that the use of an extra DMAC does not necessarily improve the communication performance.
Chapter
With the development of lightweight convolutional neural networks (CNNs), these newly proposed networks are more powerful than previous conventional models [4, 5] and can be well applied in Internet-of-Things (IoT) and edge computing. However, they perform inefficiently on conventional hardware accelerators because of the irregular connectivity in their structure. Though there are accelerators based on a unified engine (UE) architecture or a separated engine (SE) architecture that perform well for both standard convolution and depthwise convolution, these versatile structures are still not efficient for lightweight CNNs such as EfficientNet-lite. In this paper, we propose a reconfigurable engine (RE) architecture to improve efficiency, intended for communication scenarios such as IoT and edge computing. In addition, we adopt an integer quantization method to reduce computational complexity and memory access. A block-based calculation scheme is also used to further reduce off-chip memory access, and a unique computational mode is used to improve the utilization of the processing elements. The proposed architecture can be implemented on a Xilinx ZC706 with a 100 MHz system clock for EfficientNet-lite0. Our accelerator achieved 196 FPS and 72.9% top-1 accuracy on ImageNet classification, which is a 27% and 18% speedup compared to the CPU and GPU of the Pixel 4, respectively. Keywords: convolutional neural network, depthwise convolution, quantization, hardware accelerator, EfficientNet
Chapter
The rapid development of artificial intelligence has brought much convenience to our lives and has also affected the development of industry. Betel nut is a very popular food in China, especially among men, and the market demand is expanding. Combining betel nut processing with machine vision is a promising attempt to improve plant efficiency. At present, the processing of betel nut is complex, and there is an urgent need to improve its degree of automation; the development of machine vision can provide a good approach. In this paper, a lightweight network model is used to solve the classification problem of the betel nut core, a key step in betel nut processing. Through comparative experiments, this paper selects the MobileNetV3 network, which achieves the best performance indicators and recognition, as an accurate identification method. Keywords: Artificial intelligence, Machine vision, Betel nut, MobileNetV3