Multi-bit accelerator architecture [22].

Source publication
Article
Full-text available
Convolutional Neural Networks (CNNs) are widely employed in contemporary artificial intelligence systems. However, these models have millions of connections between layers, which are both memory-prohibitive and computationally expensive. Employing these models in an embedded mobile application is resource-limited, with high power consumption an...

Contexts in source publication

Context 1
... top-level architecture of the accelerator with its memory hierarchy is shown in Fig. 1. It has two clock domains: a processing clock domain for the processing stage, operating at 180 MHz (for the FPGA implementation) or 800 MHz (for the ASIC implementation), and a memory clock domain for communication with the MIG-7 (Memory Interface Generator) interface to access the DDR3 DRAM, operating at 200 MHz. The two domains communicate with ...
Context 2
... networks on a Virtex UltraScale FPGA. In Analysis_1, we further tried to reduce the bit width of the last layer of each of the four DNN models. Based on that, we reported the resource utilization, power consumption, and peak performance shown in Table 12. Further, the top-1 and top-5 accuracy computation for the different DNN models is shown in Fig. 9 and Fig. 10. During the computation of the top-1 and top-5 accuracies, different bit widths were exploited on the last layer of the DNNs, where the softmax function is applied and the class of the image is obtained. The accuracies obtained are relative to the performance of the original network. The overall FPGA resource utilization ...
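The top-1/top-5 accuracy computation mentioned in this context can be sketched in plain Python. This is an illustrative sketch, not the paper's code; the scores and labels below are made-up toy data, and k=2 stands in for top-5 since the toy example has only four classes.

```python
# Illustrative sketch of top-k accuracy: the fraction of samples whose
# true class appears among the k highest-scoring classes.
def top_k_accuracy(scores, labels, k):
    """scores: list of per-class score rows; labels: true class indices."""
    correct = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            correct += 1
    return correct / len(labels)

# Toy example: 3 samples, 4 classes (invented values, not from the paper)
scores = [
    [0.10, 0.60, 0.20, 0.10],   # true class 1 -> top-1 hit
    [0.30, 0.40, 0.20, 0.10],   # true class 0 -> top-1 miss, top-2 hit
    [0.25, 0.25, 0.40, 0.10],   # true class 3 -> miss either way
]
labels = [1, 0, 3]

top1 = top_k_accuracy(scores, labels, 1)   # 1/3
top2 = top_k_accuracy(scores, labels, 2)   # 2/3
```

Note that ties between equal scores are resolved here by index order; a real evaluation pipeline would typically operate on the full softmax output of the last layer, which is exactly where the context above varies the bit width.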

Similar publications

Article
Full-text available
Approximate computing is one of the emerging concepts in multimedia applications such as image processing, and it is receiving growing attention from researchers. By sacrificing a small amount of accuracy in the design, it reduces circuit parameters such as area complexity, delay, and power. The purpose of this...
Article
Full-text available
Keyword spotting (KWS) systems are used for human–machine communications in various applications. In many cases, KWS involves a combination of wake-up-word (WUW) recognition for device activation and voice command classification tasks. These tasks present a challenge for embedded systems due to the complexity of deep learning algorithms and the nee...
Article
Full-text available
This work presents a 32-bit Reduced Instruction Set Computer fifth-generation (RISC-V) microprocessor with a COordinate Rotation DIgital Computer (CORDIC) accelerator. The accelerator is implemented inside the core and is used by the software via a custom instruction. The microprocessor used is the VexRiscv with the Instruction Set Architecture (I...
Article
Full-text available
Abstract In Ultra‐Low‐Power (ULP) applications, power consumption is a key parameter for process independent architectural level design decisions. Traditionally, time‐consuming Spice simulations are used to measure the static power consumption. Herein, a technology‐independent static power estimation model is presented, which can estimate static po...
Article
Full-text available
The Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (IFFT) are highly efficient algorithms with a wide range of Digital Signal Processing (DSP) and telecommunication applications. An FFT/IFFT structure with any number of complex-valued inputs/outputs can be categorized as Decimation In Frequency (DIF) or Decimation In Time (DIT) decompos...

Citations

... The images are downloaded from satellites to processing equipment at ground stations, where algorithms are used to analyze and interpret them. However, in recent years, the resolution and data volume of remote sensing images have increased rapidly, placing a considerable burden on data downlinks [10]. An intuitive and effective solution to this problem is to perform the image ...
• An algorithm–hardware co-optimization and deployment method for FPGA-based CNN remote sensing image processing is proposed, including a series of hardware-centric model optimization techniques and a versatile FPGA-based CNN accelerator architecture. ...
... The weights can be quantized into integers for inference using Equations (9) and (10). However, directly obtaining the scaling factor of input fmaps with Equation (10) is not feasible. ...
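The distinction drawn in this context — weight scales can be computed offline from the stored weights, while input-fmap scales cannot be obtained the same way — can be illustrated with a minimal symmetric-quantization sketch. This is not the cited paper's Equations (9) and (10); all function names and values below are invented for illustration.

```python
# Illustrative symmetric integer quantization (not the cited paper's method).
def symmetric_scale(values, n_bits=8):
    """Scale factor mapping floats onto signed n-bit integers."""
    max_abs = max(abs(v) for v in values)
    return max_abs / (2 ** (n_bits - 1) - 1)

def quantize(values, scale):
    """Round to integers and clamp to the signed 8-bit range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

# Weights are known at compile time, so their scale is directly computable.
weights = [0.5, -1.27, 0.02, 0.9]
w_scale = symmetric_scale(weights)        # 1.27 / 127 = 0.01
q_weights = quantize(weights, w_scale)    # [50, -127, 2, 90]

# Input fmaps, by contrast, have a dynamic range unknown until runtime,
# so their scale is typically estimated offline from representative
# calibration batches rather than computed directly per inference.
calibration_batch = [[0.1, 2.0, -1.5], [0.7, -0.3, 1.9]]
a_scale = symmetric_scale([v for fmap in calibration_batch for v in fmap])
```

This is exactly why, as the context notes, a weight-quantization formula cannot simply be reused for input fmaps: the weight statistics are fixed, the activation statistics are not.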
Article
Full-text available
In recent years, convolutional neural networks (CNNs) have gained widespread adoption in remote sensing image processing. Deploying CNN-based algorithms on satellite edge devices can alleviate the strain on data downlinks. However, CNN algorithms present challenges due to their large parameter count and high computational requirements, which conflict with the satellite platforms’ low power consumption and high real-time requirements. Moreover, remote sensing image processing tasks are diverse, requiring the platform to accommodate various network structures. To address these issues, this paper proposes an algorithm–hardware co-optimization and deployment method for FPGA-based CNN remote sensing image processing. Firstly, a series of hardware-centric model optimization techniques are proposed, including operator fusion and depth-first mapping technology, to minimize the resource overhead of CNN models. Furthermore, a versatile hardware accelerator is proposed to accelerate a wide range of commonly used CNN models after optimization. The accelerator architecture mainly consists of a parallel configurable network processing unit and a multi-level storage structure, enabling the processing of optimized networks with high throughput and low power consumption. To verify the superiority of our method, the introduced accelerator was deployed on an AMD-Xilinx VC709 evaluation board, on which the improved YOLOv2, VGG-16, and ResNet-34 networks were deployed. Experiments show that the power consumption of the accelerator is 14.97 W, and the throughput of the three networks reaches 386.74 giga operations per second (GOPS), 344.44 GOPS, and 182.34 GOPS, respectively. Comparison with related work demonstrates that the co-optimization and deployment method can accelerate remote sensing image processing CNN models and is suitable for applications in satellite edge devices.
... In [21,22], the authors accelerate EfficientNet-lite, a simplified version of the EfficientNet series that also lacks implementations of the SE module and complex NAF. In [23], the authors can only accelerate EfficientNet-B0, without implementations of DWC and NAF, and cannot support other versions of the EfficientNet series because of their customized pipelined architecture. In [24], a hardware/software co-optimizer is proposed that provides adaptive data reuse to minimize off-chip memory access, improves MAC efficiency under the on-chip buffer constraints, and accelerates EfficientNet-B1 with 2240 DSP slices. ...
... Our hardware accelerator can achieve the performance of 69.50 FPS (frames per second) with a throughput of 255.22 GOPS (giga operations per second) in EfficientNet-B3. In [21,[23][24][25], four typical EfficientNet hardware accelerators based on FPGA are presented with abundant experimental results. The hardware accelerator in [23] can only accelerate EfficientNet-B0 because of its customized pipelined architecture for each layer in EfficientNet. ...
... Compared with the latest EfficientNet-B3 accelerator [25] based on the same FPGA platform, our hardware accelerator can achieve a 1.28-times improvement in throughput. ...
Article
Full-text available
Since the lightweight convolutional neural network EfficientNet was proposed by Google in 2019, the series of models have quickly become very popular due to their superior performance with a small number of parameters. However, the existing convolutional neural network hardware accelerators for EfficientNet still have much room to improve the performance of the depthwise convolution, squeeze-and-excitation module and nonlinear activation functions. In this paper, we first design a reconfigurable register array and computational kernel to accelerate the depthwise convolution. Next, we propose a vector unit to implement the nonlinear activation functions and the scale operation. An exchangeable-sequence dual-computational kernel architecture is proposed to improve the performance and the utilization. In addition, the memory architectures are designed to complete the hardware accelerator for the above computing architecture. Finally, in order to evaluate the performance of the hardware accelerator, the accelerator is implemented based on Xilinx XCVU37P. The results show that the proposed accelerator can work at the main system clock frequency of 300 MHz with the DSP kernel at 600 MHz. The performance of EfficientNet-B3 in our architecture can reach 69.50 FPS and 255.22 GOPS. Compared with the latest EfficientNet-B3 accelerator, which uses the same FPGA development board, the accelerator proposed in this paper can achieve a 1.28-fold improvement of single-core performance and 1.38-fold improvement of performance of each DSP.
... These include, for example, the burst length of the on-chip bus, the number of ports, and the bit width of the bus [1], [10]. In addition, the clock frequencies of the on-chip bus and the DRAM subsystem may in general differ from those of the accelerator [15]- [19]. In this study, it is assumed that the clock frequencies are divided into the computation clock domain (fp) and communication clock domain (fm) of the accelerator, the microprocessor clock domain (fup), and the DRAM clock domain (fd), as shown in Fig. 1. ...
Article
Full-text available
Multicore accelerators have emerged to efficiently execute recent applications with complex computational dimensions. Compared to a single-core accelerator, a multicore accelerator handles a larger amount of communication and computation simultaneously. Since the conventional performance estimation algorithm tailored to single-core accelerators cannot estimate the performance of multicore accelerators accurately, we propose a novel performance estimation algorithm for a multicore accelerator. The proposed algorithm predicts the dynamic communication bandwidth of each direct memory access controller (DMAC) based on the runtime state of the DMACs, making it possible to estimate the communication amounts handled by DMACs accurately by taking temporal intervals into account. The proposed algorithm is evaluated for convolutional neural networks and wireless communications. The experimental results using a pre-register-transfer-level (RTL) simulator show that the proposed algorithm can estimate the performance of a multicore accelerator with an estimation error of up to 2.8%, regardless of the system communication bandwidth. These results were also verified by hardware implementations on Xilinx ZYNQ. In addition, the proposed algorithm is used to explore a design space of accelerator core dimensions, and the resulting optimal core dimension provides performance gains of 10.8% and 31.2% compared to the conventional multicore accelerator and single-core accelerator, respectively. The source code is available on the GitHub repository: https://github.com/SDL-KU/OptAccTile .
... Here the hardware accelerator frequency is normalized to the DRAM subsystem frequency and is assumed to be at most twice the DRAM subsystem frequency [29]. In Figure 7, the hardware accelerator frequency is assumed to range from a minimum of 125 MHz to a maximum of 1000 MHz [41][42][43]. In addition, the DRAM frequency is assumed to be fixed at 500 MHz [44]. ...
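The frequency assumptions quoted in this context are easy to restate numerically. The sketch below only reproduces the figures given in the text (DRAM fixed at 500 MHz, accelerator swept from 125 MHz to 1000 MHz); the variable names are invented for illustration.

```python
# Frequency assumptions from the quoted context: the accelerator clock is
# normalized to a fixed 500 MHz DRAM subsystem clock.
DRAM_MHZ = 500
accel_mhz = [125, 250, 500, 1000]          # sweep range from the text
normalized = [f / DRAM_MHZ for f in accel_mhz]   # [0.25, 0.5, 1.0, 2.0]
# The maximum normalized value is 2.0, matching the assumption that the
# accelerator runs at up to twice the DRAM subsystem frequency.
```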
Article
Full-text available
The hardware accelerator controlled by direct memory access (DMA) is greatly influenced by the communication bandwidth from/to DRAM through on-chip buses. This paper proposes a novel performance estimation algorithm to optimize the communication schemes (CSs), which are defined by the number of direct memory access controllers (DMACs) and the bank allocation of DRAM. In order to facilitate the optimization of CSs, a communication primitive (CP) is defined by the bank allocation and the set of activated DMACs. By using the communication bandwidths of CPs obtained from prior full-system simulations, the proposed performance estimation algorithm can predict the communication performance of CSs more accurately than the conventional performance estimation algorithms. When it is applied to convolutional neural networks (CNNs) and wireless communications (LDPC-coded MIMO-OFDM), the estimation error is measured to be no more than 6.4% and 5%, respectively. In addition, compared with conventional simulation-based approaches, the proposed estimation algorithm provides a speedup of two orders of magnitude. The proposed performance estimation algorithm is used to optimize the CS of the CNNs and explore a design space characterized by bank interleaving, outstanding transactions, layer shape, tile size, and hardware frequency. It is shown that the optimized CS improves communication performance by up to 68% for the third convolutional layers of AlexNet and 60% for the MIMO of LDPC-coded MIMO-OFDM. In addition, the DRAM latency is minimized by setting the bank interleaving to the number of outstanding transactions. Moreover, the simulation results show that the optimum CS depends on the application. It is also shown that the use of an extra DMAC does not necessarily improve the communication performance.
Chapter
With the development of lightweight convolutional neural networks (CNNs), these newly proposed networks are more powerful than previous conventional models [4, 5] and can be well applied in Internet-of-Things (IoT) and edge computing. However, they perform inefficiently on conventional hardware accelerators because of the irregular connectivity in their structure. Though there are accelerators based on a unified engine (UE) architecture or a separated engine (SE) architecture that perform well for both standard convolution and depthwise convolution, these versatile structures are still not efficient for lightweight CNNs such as EfficientNet-lite. In this paper, we propose a reconfigurable engine (RE) architecture to improve efficiency, intended for communication scenarios such as IoT and edge computing. In addition, we adopt an integer quantization method to reduce computational complexity and memory access. A block-based calculation scheme is also used to further reduce off-chip memory access, and a unique computational mode is used to improve the utilization of the processing elements. The proposed architecture can be implemented on a Xilinx ZC706 with a 100 MHz system clock for EfficientNet-lite0. Our accelerator achieved 196 FPS and 72.9% top-1 accuracy on ImageNet classification, which is a 27% and 18% speedup compared to the CPU and GPU of the Pixel 4, respectively. Keywords: convolutional neural network, depthwise convolution, quantization, hardware accelerator, EfficientNet
Chapter
The rapid development of artificial intelligence has brought much convenience to our lives and has also affected the development of industry. Betel nut is a very popular food in China, especially among men, and the market demand is expanding. Combining betel nut processing with machine vision is a promising attempt to improve plant efficiency. At present, the processing of betel nut is complex, and there is an urgent need to improve its degree of automation; the development of machine vision can provide a good approach. In this paper, a lightweight network model is used to solve the classification problem of the betel nut core, a key step in betel nut processing. Through comparative experiments, this paper selects the MobileNetV3 network, which achieves the best performance indicators and recognition, as an accurate identification method. Keywords: Artificial intelligence, Machine vision, Betel nut, MobileNetV3