Fig 3 - uploaded by Magnus Själander
A multiply-accumulate operation using inputs X and Y, assuming the three-cycle MAC architecture of Fig. 1. The operation starts with the generation (assuming the Baugh-Wooley algorithm) and reduction of partial products. The final adder performs carry propagation of the sums and carries produced by the PP unit. Finally, the accumulate adder adds the pipelined product (M) to the accumulated result (F), producing the new result (G).
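The data flow in this caption can be sketched as a small bit-level software model. This is only an illustration under stated assumptions, not the authors' circuit: the function names (`baugh_wooley_terms`, `carry_save_reduce`, `mac`) are hypothetical, the reduction uses a simple chain of 3:2 compressors rather than any particular tree, pipeline registers are not modelled, and the accumulator is an unbounded Python integer.

```python
def baugh_wooley_terms(x, y, n):
    """Weighted partial-product bits for the n-bit two's complement product x * y."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> j) & 1 for j in range(n)]
    terms = []
    for i in range(n):
        for j in range(n):
            bit = xb[i] & yb[j]
            # Terms that involve exactly one sign bit are inverted.
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            terms.append(bit << (i + j))
    # Constant correction bits of the Baugh-Wooley scheme.
    terms += [1 << n, 1 << (2 * n - 1)]
    return terms


def carry_save_reduce(terms, width):
    """Reduce many addends to a (sum, carry) pair with 3:2 compressors (hypothetical PP unit)."""
    mask = (1 << width) - 1
    terms = [t & mask for t in terms]
    while len(terms) > 2:
        a, b, c = terms[:3]
        s = a ^ b ^ c                                # bitwise sum, no carry propagation
        cy = ((a & b) | (a & c) | (b & c)) << 1      # carries, shifted to the next column
        terms = terms[3:] + [s & mask, cy & mask]
    terms += [0] * (2 - len(terms))
    return terms[0], terms[1]


def to_signed(v, width):
    """Interpret the low `width` bits of v as a two's complement number."""
    v &= (1 << width) - 1
    return v - (1 << width) if v >> (width - 1) else v


def mac(x, y, f, n):
    """One multiply-accumulate step: G = F + X * Y for n-bit signed X and Y."""
    width = 2 * n
    s, c = carry_save_reduce(baugh_wooley_terms(x, y, n), width)  # PP generation and reduction
    m = to_signed(s + c, width)                                   # final adder produces M
    return f + m                                                  # accumulate adder produces G
```

For example, `mac(-3, 5, 10, n=4)` returns -5, that is, 10 + (-3)(5).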

Source publication
Article
Full-text available
We propose a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports two's complement numbers, and includes accumulation guard bits and saturation circuitry. The first MAC pipeline stage contains only partial-product generation circuitry and a reduction tree, while the second stage, thanks to a special sign-ex...
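The role of accumulation guard bits and saturation can be illustrated with a short behavioural sketch. The abstract is truncated and gives no widths, so everything specific here is an assumption: the names `wrap`, `saturate`, and `accumulate`, the parameter `guard`, and the choice to saturate to a 2n-bit result after accumulation.

```python
def wrap(v, width):
    """Two's complement wrap-around of v to `width` bits (models a fixed-width register)."""
    v &= (1 << width) - 1
    return v - (1 << width) if v >> (width - 1) else v


def saturate(v, width):
    """Clamp v to the representable range of a `width`-bit two's complement number."""
    hi = (1 << (width - 1)) - 1
    lo = -(1 << (width - 1))
    return max(lo, min(hi, v))


def accumulate(products, n, guard):
    """Accumulate 2n-bit products in a (2n + guard)-bit register, then saturate to 2n bits."""
    acc_width = 2 * n + guard
    acc = 0
    for p in products:
        acc = wrap(acc + p, acc_width)   # the register itself can only wrap
    return saturate(acc, 2 * n)          # saturation logic on the way out
```

With `n=4`, accumulating `[100, 100]` using four guard bits yields the saturated value 127, whereas with no guard bits the register wraps to -56 before saturation can see the overflow. This is the kind of error that guard bits are meant to prevent.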

Similar publications

Article
Full-text available
In this paper, we propose a new hardware architecture of a very high-speed finite impulse response (FIR) filter using fine-grained seamless pipelining. The proposed full-parallel pipeline FIR filter can produce an output sample in a few gate delays by placing the pipeline registers not only in between components, but also across the components. A p...

Citations

... These multipliers are used in MAC unit to red t. In existing design [15], they have implemented MAC architecture for the reduction of critical path delay. So, in this design they have integrated a part of additions into the PPR process. ...
Article
Unit for Digital Signal Processing Applications
Kausar Jahan1, Pala Kalyani2, V Satya Sai3, GRK Prasad4, Syed Inthiyaz5, Sk Hasane Ahammad6
1Department of ECE, Dadi Institute of Engineering and Technology, Anakapalle, Andhra Pradesh, India
2Department of ECE, Vardhaman College of Engineering, Kacharam, Shamshabad, India
3Department of ECE, Koneru Lakshmaiah Education Foundation, Guntur, India-522502
4Department of ECE, Koneru Lakshmaiah Education Foundation, Guntur, India-522502
5Department of ECE, Koneru Lakshmaiah Education Foundation, Guntur, India-522502
6Department of ECE, Koneru Lakshmaiah Education Foundation, Guntur, India-522502
Abstract—The fundamental component used in many Digital Signal Processing (DSP) applications is the Multiply and Accumulation (MAC) unit. In the literature, a multiplier uses a large number of full adders and half adders in the partial product reduction stage, which increases the hardware complexity and critical path delay of the MAC unit. To overcome this problem, two novel multipliers are proposed in this article. The proposed multipliers are designed and implemented in hardware; they reduce the circuit complexity and improve the overall performance of the MAC unit with less delay. Compared with the existing 4-bit designs, the number of Slice Look-Up Tables (LUTs) is reduced from 113 to 43, Slices from 46 to 14, and Full Adders (FAs) from 28 to 23, while the bonded Input Output Blocks (IOBs) and Half Adders (HAs) are unchanged. The time delay is reduced from 14.251 ns to 7.876 ns. Compared with the 8-bit multiplier in the literature, the number of Slice LUTs is reduced from 510 to 231, Slices from 218 to 113, FAs from 120 to 110, and HAs from 56 to 39, and the time delay is reduced from 26.228 ns to 12.748 ns, while the bonded IOB count remains the same. The synthesis and simulation results are verified using the Xilinx ISE 14.7 tool.
... The Multiply-Accumulate (MAC) unit is a basic and essential digital component in most microprocessors and Digital Signal Processors (DSPs). The MAC unit is used to efficiently accelerate the computations of FIR or FFT/IFFT, which are required by data-intensive applications such as filters, orthogonal frequency-division multiplexing algorithms, and channel estimators [Hoang et al., 2010]. As shown by Figure 4, ing Spartan-3E FPGA. ...
Article
Full-text available
Multiplication is considered one of the most time-consuming and a key operation in wide variety of embedded applications. Speeding up this operation has a significant impact on the overall performance of these applications. A vast number of multiplication approaches are found in the literature where the goal is always to achieve a higher performance. One of these approaches relies on using smaller multiplier blocks which are built based on direct Boolean algebra equations to build large multipliers. In this work, we present a methodology for designing binary multipliers where different sizes customized partial products generation (CPPG) cells are designed and used as smaller building blocks. The sizes of the designed CPPG cells are 2×2, 3×3, 4×4, 5×5, and 6×6. We use these cells to build 8×8, 16×16, 32×32, 64×64, and 128×128 binary multipliers. All of the CPPG cells and the binary multipliers are described using the VHDL language, tested, and implemented using XILINX ISE 14.6 tools targeting different FPGA families. The implementation results show that the best performance is achieved when cell 3×3 is used and Virtex-7 FPGA is targeted. The binary multipliers that are designed using the proposed CPPG cells achieve better performance when compared with the binary multipliers presented in the literature. As an application that utilizes the proposed multiplier, a Multiply-Accumulate (MAC) unit is designed and implemented in Spartan-3E. The implementation results of the MAC unit demonstrate the effectiveness of the proposed multiplier.
... Tab. 2 reveals that the architectures in [12,33,34] consume considerably higher static and average power (in mW) than the proposed SFMAC architecture. The architectures in [35,36] are examined for 16-bit operations at 1 V and 8-bit operations at 1.8 V in 90 and 180 nm technologies. ...
Article
Full-text available
In the era of digital signal processing, like graphics and computation systems, multiplication-accumulation is one of the prime operations. A MAC unit is a vital component of a digital system, like different Fast Fourier Transform (FFT) algorithms, convolution, image processing algorithms, etcetera. In the domain of digital signal processing, the use of normalization architecture is very vast. The main objective of using normalization is to perform comparison and shift operations. In this research paper, an evolutionary approach for designing an optimized normalization algorithm is proposed using basic logical blocks such as Multiplexer, Adder etc. The proposed normalization algorithm is further used in designing an 8×8 bit Signed Floating-Point Multiply-Accumulate (SFMAC) architecture. Since the SFMAC can accept an 8-bit significand and a 3-bit exponent, the input to the said architecture can be somewhere between −7.96872 to +7.96872. The proposed architecture is designed and implemented using the Cadence Virtuoso using 90 and 130 nm technologies (in Generic Process Design Kit (GPDK) and Taiwan Semiconductor Manufacturing Company (TSMC), respectively). To reduce the power consumption of the proposed normalization architecture, techniques such as “block enabling” and “clock gating” are used rigorously. According to the analysis done on Cadence, the proposed architecture uses the least amount of power compared to its current predecessors.
... Note that the core of convolution operation is multiplication and accumulation. Therefore, in the SIMD architecture, multiply-accumulate (MAC) engines [28][29][30] are used to support convolution operations between input activations and kernel weights. No matter if a CNN is sparse or not, the compression format cannot be directly applied to the SIMD architecture; otherwise, irregularly distributed nonzero values will break the alignment of input activations and kernel weights. ...
Article
Full-text available
In a CNN (convolutional neural network) accelerator, to reduce memory traffic and power consumption, there is a need to exploit the sparsity of activation values. Therefore, some research efforts have been paid to skip ineffectual computations (i.e., multiplications by zero). Different from previous works, in this paper, we point out the similarity of activation values: (1) in the same layer of a CNN model, most feature maps are either highly dense or highly sparse; (2) in the same layer of a CNN model, feature maps in different channels are often similar. Based on the two observations, we propose a block-based compression approach, which utilizes both the sparsity and the similarity of activation values to further reduce the data volume. Moreover, we also design an encoder, a decoder and an indexing module to support the proposed approach. The encoder is used to translate output activations into the proposed block-based compression format, while both the decoder and the indexing module are used to align nonzero values for effectual computations. Compared with previous works, benchmark data consistently show that the proposed approach can greatly reduce both memory traffic and power consumption.
... However, in this state-of-the-art approach [24], the multipliers and the adder tree are still two separate computation components. On the other hand, some previous multiply-accumulate (MAC) designs [25][26][27][28] have tried to reduce the overheads caused by final additions of multiplications. However, since these MAC designs [25][26][27][28] assume that only one multiplier is used, their approaches cannot be directly applied to the design of 2-D convolver hardware circuit. ...
Article
Full-text available
Convolution operations have a significant influence on the overall performance of a convolutional neural network, especially in edge-computing hardware design. In this paper, we propose a low-power signed convolver hardware architecture that is well suited for low-power edge computing. The basic idea of the proposed convolver design is to combine all multipliers’ final additions and their corresponding adder tree to form a partial product matrix (PPM) and then to use the reduction tree algorithm to reduce this PPM. As a result, compared with the state-of-the-art approach, our convolver design not only saves a lot of carry propagation adders but also saves one clock cycle per convolution operation. Moreover, the proposed convolver design can be adapted for different dataflows (including input stationary dataflow, weight stationary dataflow, and output stationary dataflow). According to dataflows, two types of convolve-accumulate units are proposed to perform the accumulation of convolution results. The results show that, compared with the state-of-the-art approach, the proposed convolver design can save 15.6% power consumption. Furthermore, compared with the state-of-the-art approach, on average, the proposed convolve-accumulate units can reduce 15.7% power consumption.
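The idea of merging the multipliers' final additions and the adder tree into one partial product matrix can be sketched in software. This is a loose illustration under assumptions, not the paper's design: the helper names are hypothetical, partial products are modelled as sign-extended rows rather than any specific encoding, and the reduction is a simple chain of 3:2 compressors. Each multiplier stops at a (sum, carry) pair, all pairs are gathered into one matrix, and only one carry-propagate addition is performed at the end.

```python
def signed_rows(x, y, n, width):
    """Sign-extended partial-product rows of x * y for n-bit two's complement operands."""
    mask = (1 << width) - 1
    rows = []
    for j in range(n):
        if (y >> j) & 1:
            row = x << j
            if j == n - 1:          # the sign bit of y carries negative weight
                row = -row
            rows.append(row & mask)
    return rows


def reduce_to_two(rows, width):
    """Carry-save reduction of a row matrix to a (sum, carry) pair using 3:2 compressors."""
    mask = (1 << width) - 1
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows[:3]
        s = (a ^ b ^ c) & mask
        cy = (((a & b) | (a & c) | (b & c)) << 1) & mask
        rows = rows[3:] + [s, cy]
    rows += [0] * (2 - len(rows))
    return rows[0], rows[1]


def to_signed(v, width):
    """Interpret the low `width` bits of v as a two's complement number."""
    v &= (1 << width) - 1
    return v - (1 << width) if v >> (width - 1) else v


def convolve(acts, weights, n):
    """Sum of products with a single carry-propagate addition at the very end."""
    width = 2 * n + (len(acts) - 1).bit_length()   # extra bits so the sum cannot overflow
    ppm = []
    for a, w in zip(acts, weights):
        # Each multiplier stops at carry-save form; its final adder is omitted.
        ppm.extend(reduce_to_two(signed_rows(a, w, n, width), width))
    s, c = reduce_to_two(ppm, width)               # shared reduction replaces the adder tree
    return to_signed(s + c, width)                 # the only carry-propagate addition
```

In this model a 3×3 window needs one carry-propagate addition instead of nine multiplier final adders followed by an adder tree, which is the saving the abstract describes.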
... The comparison in terms of power consumption of the proposed MAC architectures with the existing MAC architecture is shown in The differences are visible from Table 12 that the performance of [15,70,71] has a significantly higher static as well as average power (in mW) than proposed SFMAC architecture. The performance of [72,73] is evaluated in 90 nm and 180 nm technologies for 16-bit operations at 1 V and 8-bit operations at 1.8 V, respectively. ...
Article
Full-text available
The recently established Harris Hawks optimization (HHO) has a natural behaviour for finding an optimum solution in the global search space without getting trapped in previous convergence. However, the exploitation phase of the current Harris Hawks optimizer algorithm is poor. In the present research, an improved version of the Harris Hawks optimization algorithm, which combines HHO with Particle Swarm Optimization and is named the ameliorated Harris Hawks optimizer algorithm, is proposed to solve various optimization problems such as nonlinear, non-convex and highly constrained engineering design problems. In the proposed research, the exploitation phase of the existing HHO algorithm is improved using a particle swarm optimization algorithm, and its performance is tested on the CEC2005, CEC2017 and CEC2018 benchmark problems. Also, discrete algorithms such as FFT algorithms, convolution and image processing algorithms use the multiply and accumulate (MAC) unit as a critical component. The efficiency of a MAC depends mainly on the speed of operation, power dissipation and chip area, along with the complexity level of the circuit. In this research paper, a power-efficient signed floating-point MAC (SFMAC) is proposed using a universal compressor-based multiplier (UCM). Instead of a complex design architecture, a simple multiplexer-based circuit is used to achieve signed floating output. The 8 × 8 SFMAC can take an 8-bit mantissa and a 3-bit exponent, so the input to the SFMAC can be in the range of −(7.96875)₁₀ to +(7.96875)₁₀. The design and implementation of the proposed architecture are done with the Cadence Spectre tool in GPDK 90 nm and TSMC 130 nm technologies. The analysis shows that the proposed SFMAC architecture consumes less power than the recent MAC architectures available in the literature.
... In the other hand, only the final consequence of accumulation is true, and the proposed architecture and timing diagram as seen in Fig. 4. Fig. 5 demonstrates examples of how the standard PA and the proposed approach run. The proposed scheme from the lower half of the 32-bit adder [11], [12]. Moreover, the number of cycles from the exclusive input to the final output is same as the two architectures. ...
Article
Full-text available
In modern technology-based applications, digital signal processing (DSP) is a major priority. In these applications, the Multiply Accumulate (MAC) unit accounts for much of the memory usage, power consumption and critical path delay. Because of the number of arithmetic operations it performs, the MAC unit plays a major role in such products. A pipelined architecture is therefore used to reduce the critical path delay and improve the performance of the MAC architecture. However, pipelining increases the number of flip-flops in the MAC unit, which in turn increases the area and the power consumption. This paper therefore proposes a novel feed-forward cut-set based MAC architecture with high-level synchronization of an XOR-MUX full adder with a compressor technique. It reduces the number of logic gates in the MAC architecture and hence improves performance in an FPGA implementation in terms of LUT-based area, critical path delay and average power consumption.
... In the systolic-array architecture, a MAC unit, which enables MAC functions, is used [10]. The MAC unit performs multiplication and accumulation processes repeatedly to perform continuous and complex operations in digital signal processing. ...
Article
Full-text available
Neural-network computing has revolutionized the field of machine learning. The systolic-array architecture is a widely used architecture for neural-network computing acceleration that was adopted by Google in its Tensor Processing Unit (TPU). To ensure the correct operation of the neural network, the reliability of the systolic-array architecture should be guaranteed. This paper proposes an efficient systolic-array redundancy architecture that is based on systolic-array partitioning and rearranging connections of the systolic-array elements. The proposed architecture allows both offline and online repair with an extended redundancy architecture and programmable fuses and can ensure reliability even in an online situation, for which the previous fault-tolerant schemes have not been considered.
... It achieves the 100% hardware utilization. Tung et al. [23] proposed a two's complement supportive multiply accumulate unit. The critical path is reduced using a pipeline unit in partial generation circuit and the speed is improved by 31%. ...
... Therefore, power consumption, performance, and hardware area are among the principal criteria to consider in the development of these systems. In this area, the multiplier accumulator (MAC) operation is one of the fundamental components [1][2][3] and is a key factor, especially for portable multimedia devices. In this context, it is mandatory to improve the performance of this module. ...
Article
Full-text available
QCA is a promising technology in the field of nanotechnology that seems well suited to signal processing needs. It has attracted great interest because of benefits such as ultra-low power consumption, small size, and the ability to operate at one terahertz. The Multiply-Accumulator (MAC) unit performs one of the essential operations in Digital Signal Processing (DSP). In real-time DSP systems, several applications such as speech processing, video coding and digital filtering require MAC operations. However, power dissipation and area are the most significant aspects in these systems. In this paper, we design a low-power MAC unit based on QCA technology. QCADesigner version 2.0.3 is used to validate the accuracy of the proposed circuit. The reliability of this unit is evaluated at different temperatures. The power dissipation is estimated using the QCAPro tool. The total power consumed by this unit is 2.183 μW. The proposed circuit has a 90% improvement in terms of power over complementary metal–oxide–semiconductor (CMOS) circuits. Since work in the field of QCA logic signal processing has started to progress, the suggested contribution will give rise to a new thread of research in the field of real-time signal and image treatment.