A multiply-accumulate operation using inputs X and Y, assuming the three-cycle MAC architecture of Fig. 1. The operation starts with the generation (assuming the Baugh-Wooley algorithm) and reduction of partial products. The final adder performs carry propagation of the sums and carries produced by the partial-product (PP) unit. Finally, the accumulate adder adds the pipelined product (M) to the accumulated result (F), producing the new result (G).
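
The three steps named in the caption can be sketched in software as follows. This is a minimal behavioural model, not the circuit from the source publication: the function names (`baugh_wooley_rows`, `reduce_carry_save`, `mac_step`) are hypothetical, the modified Baugh-Wooley formulation with two constant correction bits is assumed, the reduction uses plain 3:2 carry-save steps rather than any particular tree, and guard bits and saturation are left out (the accumulator is an unbounded Python integer).

```python
def baugh_wooley_rows(x, y, n):
    """Partial-product rows for n-bit two's complement x * y (n >= 2).

    Assumes the modified Baugh-Wooley scheme: product bits that involve
    exactly one sign bit are complemented, and constant ones are added
    at bit positions n and 2n-1.
    """
    xb = [(x >> i) & 1 for i in range(n)]  # two's complement bits of x
    yb = [(y >> j) & 1 for j in range(n)]  # two's complement bits of y
    rows = []
    for j in range(n):
        row = 0
        for i in range(n):
            bit = xb[i] & yb[j]
            if (i == n - 1) != (j == n - 1):  # exactly one sign bit involved
                bit ^= 1
            row |= bit << (i + j)
        rows.append(row)
    rows.append((1 << n) | (1 << (2 * n - 1)))  # correction constants
    return rows


def reduce_carry_save(rows, width):
    """Reduce the rows to one sum word and one carry word (3:2 steps)."""
    mask = (1 << width) - 1
    rows = list(rows)
    while len(rows) > 2:
        a, b, c = rows.pop(), rows.pop(), rows.pop()
        s = a ^ b ^ c                                # bitwise sum
        cy = ((a & b) | (a & c) | (b & c)) << 1      # carries, shifted left
        rows.append(s & mask)
        rows.append(cy & mask)
    return rows[0], rows[1]


def to_signed(value, width):
    """Interpret a width-bit pattern as a two's complement integer."""
    return value - (1 << width) if (value >> (width - 1)) & 1 else value


def mac_step(x, y, f, n):
    """One multiply-accumulate: returns G = F + X * Y."""
    width = 2 * n
    s, c = reduce_carry_save(baugh_wooley_rows(x, y, n), width)  # PP unit
    m = to_signed((s + c) & ((1 << width) - 1), width)           # final adder
    return f + m                                                 # accumulate adder
```

For example, `mac_step(-3, 5, 10, 4)` models a 4-bit multiplication of X = -3 and Y = 5 added to F = 10, giving G = -5. In hardware the three calls would sit in separate pipeline stages; here they simply run in sequence.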

Source publication

A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit

Article

Full-text available

Jan 2011

We propose a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports two's complement numbers, and includes accumulation guard bits and saturation circuitry. The first MAC pipeline stage contains only partial-product generation circuitry and a reduction tree, while the second stage, thanks to a special sign-ex...

FIGURE 1. Structure of K-tap direct-form full-parallel FIR filter...

FIGURE 6. Dot diagram of the third stage WRT with 4 partial products.

FIGURE 11. Structure of a single MAC FIR filter (SMFF) with m = 16.

FIGURE 15. Structure of a folded FIR filter (FDFF) with m = 16.

Comparison of synthesis results of full-parallel FIR filters when m = 8...

Design of Very High-Speed Pipeline FIR Filter Through Precise Critical Path Analysis

Article

Full-text available

Feb 2021

In this paper, we propose a new hardware architecture of a very high-speed finite impulse response (FIR) filter using fine-grained seamless pipelining. The proposed full-parallel pipeline FIR filter can produce an output sample in a few gate delays by placing the pipeline registers not only in between components, but also across the components. A p...

Design and Analysis of High Speed Multiply and Accumulation Unit for Digital Signal Processing Applications

Article

Feb 2023

Authors: Kausar Jahan (1), Pala Kalyani (2), V Satya Sai (3), GRK Prasad (4), Syed Inthiyaz (5), Sk Hasane Ahammad (6)
(1) Department of ECE, Dadi Institute of Engineering and Technology, Anakapalle, Andhra Pradesh, India
(2) Department of ECE, Vardhaman College of Engineering, Kacharam, Shamshabad, India
(3, 4, 5, 6) Department of ECE, Koneru Lakshmaiah Education Foundation, Guntur, India-522502

Abstract: The Multiply and Accumulation (MAC) unit is a fundamental component of many digital signal processing (DSP) applications. In the literature, a multiplier uses a large number of full adders and half adders in the partial-product reduction stage, which increases the hardware complexity and critical-path delay of the MAC unit. To overcome this problem, two novel multipliers are proposed in this article. The proposed multipliers are designed and implemented in hardware, reducing the circuit complexity and improving the overall performance of the MAC unit with less delay. Compared with existing 4-bit designs, the number of slice look-up tables (LUTs) is reduced from 113 to 43, slices from 46 to 14, and full adders (FAs) from 28 to 23, while the bonded input/output blocks (IOBs) and half adders (HAs) are unchanged. The delay is reduced from 14.251 ns to 7.876 ns. Compared with the 8-bit multiplier in the literature, slice LUTs are reduced from 510 to 231, slices from 218 to 113, FAs from 120 to 110, and HAs from 56 to 39, and the delay is reduced from 26.228 ns to 12.748 ns, while the bonded IOB count remains the same. The synthesis and simulation results are verified using the Xilinx ISE 14.7 tool.

FPGA Implementation of Fast Binary Multiplication Based on Customized Basic Cells

Article

Full-text available

Oct 2022
J UNIVERS COMPUT SCI

Multiplication is a key operation in a wide variety of embedded applications and is considered one of the most time-consuming. Speeding up this operation has a significant impact on the overall performance of these applications. A vast number of multiplication approaches are found in the literature, where the goal is always to achieve higher performance. One of these approaches builds large multipliers from smaller multiplier blocks that are derived directly from Boolean algebra equations. In this work, we present a methodology for designing binary multipliers in which customized partial products generation (CPPG) cells of different sizes are designed and used as the smaller building blocks. The sizes of the designed CPPG cells are 2×2, 3×3, 4×4, 5×5, and 6×6. We use these cells to build 8×8, 16×16, 32×32, 64×64, and 128×128 binary multipliers. All of the CPPG cells and the binary multipliers are described in VHDL, tested, and implemented using Xilinx ISE 14.6 tools targeting different FPGA families. The implementation results show that the best performance is achieved when the 3×3 cell is used and a Virtex-7 FPGA is targeted. The binary multipliers designed using the proposed CPPG cells achieve better performance than the binary multipliers presented in the literature. As an application of the proposed multiplier, a Multiply-Accumulate (MAC) unit is designed and implemented on a Spartan-3E. The implementation results of the MAC unit demonstrate the effectiveness of the proposed multiplier.
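
The general idea of building a large multiplier from small k×k blocks can be sketched as follows. This is only an illustration under stated assumptions, not the CPPG design from the abstract: `cell_multiply`, `split`, and `block_multiply` are hypothetical names, operands are unsigned, and the small cell is modelled with an ordinary integer multiplication where the paper uses direct Boolean equations.

```python
def cell_multiply(a, b, k):
    """Hypothetical k x k cell: product of two k-bit unsigned values.

    The integer multiplication stands in for the cell's Boolean equations.
    """
    assert 0 <= a < (1 << k) and 0 <= b < (1 << k)
    return a * b


def split(value, k, count):
    """Split an unsigned value into `count` k-bit chunks, least significant first."""
    mask = (1 << k) - 1
    return [(value >> (k * i)) & mask for i in range(count)]


def block_multiply(x, y, n, k):
    """Multiply two n-bit unsigned operands using only k x k cells."""
    count = -(-n // k)  # ceil(n / k); the top chunk is zero-padded if needed
    xs = split(x, k, count)
    ys = split(y, k, count)
    total = 0
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            # Each cell result is weighted by the positions of its chunks.
            total += cell_multiply(xi, yj, k) << (k * (i + j))
    return total
```

For instance, `block_multiply(200, 123, 8, 3)` covers an 8×8 multiplication with nine 3×3 cells. In hardware the shifted cell outputs would be summed by an adder or compressor structure; the running `total` here only models that sum.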

An Evolutionary Normalization Algorithm for Signed Floating-Point Multiply-Accumulate Operation

Article

Full-text available

Jan 2022
CMC-COMPUT MATER CON

In the era of digital signal processing, in areas such as graphics and computation systems, multiply-accumulate is one of the prime operations. A MAC unit is a vital component of digital systems that implement, for example, Fast Fourier Transform (FFT) algorithms, convolution, and image processing algorithms. In the domain of digital signal processing, normalization architectures are very widely used. The main objective of using normalization is to perform comparison and shift operations. In this research paper, an evolutionary approach for designing an optimized normalization algorithm is proposed using basic logical blocks such as multiplexers and adders. The proposed normalization algorithm is further used in designing an 8×8 bit Signed Floating-Point Multiply-Accumulate (SFMAC) architecture. Since the SFMAC can accept an 8-bit significand and a 3-bit exponent, the input to the architecture can range from −7.96872 to +7.96872. The proposed architecture is designed and implemented in Cadence Virtuoso using 90 and 130 nm technologies (in Generic Process Design Kit (GPDK) and Taiwan Semiconductor Manufacturing Company (TSMC), respectively). To reduce the power consumption of the proposed normalization architecture, techniques such as “block enabling” and “clock gating” are used rigorously. According to the analysis done in Cadence, the proposed architecture uses the least power compared with its predecessors.

Block-Based Compression and Corresponding Hardware Circuits for Sparse Activations

Article

Full-text available

Nov 2021
SENSORS-BASEL

In a CNN (convolutional neural network) accelerator, there is a need to exploit the sparsity of activation values in order to reduce memory traffic and power consumption. Therefore, some research efforts have been devoted to skipping ineffectual computations (i.e., multiplications by zero). Unlike previous works, this paper points out the similarity of activation values: (1) in the same layer of a CNN model, most feature maps are either highly dense or highly sparse; (2) in the same layer of a CNN model, feature maps in different channels are often similar. Based on these two observations, we propose a block-based compression approach that utilizes both the sparsity and the similarity of activation values to further reduce the data volume. We also design an encoder, a decoder, and an indexing module to support the proposed approach. The encoder translates output activations into the proposed block-based compression format, while the decoder and the indexing module align nonzero values for effectual computations. Compared with previous works, benchmark data consistently show that the proposed approach can greatly reduce both memory traffic and power consumption.

Convolver Design and Convolve-Accumulate Unit Design for Low-Power Edge Computing

Article

Full-text available

Jul 2021
SENSORS-BASEL

Convolution operations have a significant influence on the overall performance of a convolutional neural network, especially in edge-computing hardware design. In this paper, we propose a low-power signed convolver hardware architecture that is well suited for low-power edge computing. The basic idea of the proposed convolver design is to combine all multipliers’ final additions and their corresponding adder tree to form a partial product matrix (PPM) and then to use the reduction tree algorithm to reduce this PPM. As a result, compared with the state-of-the-art approach, our convolver design not only saves a lot of carry propagation adders but also saves one clock cycle per convolution operation. Moreover, the proposed convolver design can be adapted for different dataflows (including input stationary dataflow, weight stationary dataflow, and output stationary dataflow). According to dataflows, two types of convolve-accumulate units are proposed to perform the accumulation of convolution results. The results show that, compared with the state-of-the-art approach, the proposed convolver design can save 15.6% power consumption. Furthermore, compared with the state-of-the-art approach, on average, the proposed convolve-accumulate units can reduce 15.7% power consumption.

Application of ameliorated Harris Hawks optimizer for designing of low-power signed floating-point MAC architecture

Article

Full-text available

Jul 2021
NEURAL COMPUT APPL

The recently established Harris Hawks optimization (HHO) has a natural ability to find an optimum solution in a global search space without getting trapped in premature convergence. However, the exploitation phase of the current Harris Hawks optimizer algorithm is poor. In the present research, an improved version of the Harris Hawks optimization algorithm, which combines HHO with Particle Swarm Optimization and is named the ameliorated Harris Hawks optimizer algorithm, is proposed to solve various optimization problems such as nonlinear, non-convex, and highly constrained engineering design problems. In the proposed research, the exploitation phase of the existing HHO algorithm is improved using a particle swarm optimization algorithm, and its performance is tested on the CEC2005, CEC2017, and CEC2018 benchmark problems. In addition, discrete algorithms such as FFT algorithms, convolution, and image processing algorithms use the multiply and accumulate (MAC) unit as a critical component. The efficiency of a MAC depends mainly on the speed of operation, power dissipation, and chip area, along with the complexity of the circuit. In this research paper, a power-efficient signed floating-point MAC (SFMAC) is proposed using a universal compressor-based multiplier (UCM). Instead of a complex design architecture, a simple multiplexer-based circuit is used to achieve signed floating-point output. The 8 × 8 SFMAC can take an 8-bit mantissa and a 3-bit exponent. Therefore, the input to the SFMAC can be in the range of −7.96875 to +7.96875 (decimal). The proposed architecture is designed and implemented with the Cadence Spectre tool in GPDK 90 nm and TSMC 130 nm technologies. The analysis shows that the proposed SFMAC architecture consumes less power than the recent MAC architectures available in the literature.

High Level Synchronization and Computations of Feed Forward Cut-Set based Multiply Accumulate Unit

Article

Full-text available

Feb 2021
J Phys Conf

In modern technology-based applications, digital signal processing (DSP) is a major priority. In such devices, the Multiply Accumulate (MAC) unit accounts for much of the memory usage, power consumption, and critical-path delay. Because of the number of arithmetic operations it performs, the MAC unit plays a major role in these products. Pipelined architectures are therefore used to reduce the critical-path delay and to improve the performance of the MAC architecture. However, pipelining increases the number of flip-flops in the MAC unit, which in turn increases the area and the power consumption. This paper therefore proposes a novel feed-forward cut-set based MAC architecture with high-level synchronization of an XOR-MUX full adder and a compressor technique. It reduces the number of logic gates in the MAC architecture and thereby improves the performance of the FPGA implementation in terms of LUT-based area, critical-path delay, and average power consumption.

Efficient Systolic-Array Redundancy Architecture for Offline/Online Repair

Article

Full-text available

Feb 2020

Neural-network computing has revolutionized the field of machine learning. The systolic-array architecture is a widely used architecture for neural-network computing acceleration that was adopted by Google in its Tensor Processing Unit (TPU). To ensure the correct operation of the neural network, the reliability of the systolic-array architecture should be guaranteed. This paper proposes an efficient systolic-array redundancy architecture that is based on systolic-array partitioning and rearranging connections of the systolic-array elements. The proposed architecture allows both offline and online repair with an extended redundancy architecture and programmable fuses and can ensure reliability even in an online situation, for which the previous fault-tolerant schemes have not been considered.

A vedic mathematics based processor core for discrete wavelet transform using FinFET and CNTFET technology for biomedical signal processing

Article

Aug 2019

Design of Efficient Quantum-Dot Cellular Automata (QCA) Multiply Accumulate (MAC) Unit with power dissipation analysis

Article

Full-text available

Jun 2019
IET CIRC DEVICE SYST

Quantum-dot cellular automata (QCA) is a promising nanotechnology that appears well suited to signal processing needs. It has attracted great interest because of benefits such as ultra-low power consumption, small size, and the ability to operate at one terahertz. The multiply-accumulate (MAC) operation is considered one of the essential operations in digital signal processing (DSP). In real-time DSP systems, several applications such as speech processing, video coding, and digital filtering require MAC operations. However, power dissipation and area are the most significant concerns in these systems. In this paper, we design a low-power MAC unit based on QCA technology. QCADesigner version 2.0.3 is used to validate the accuracy of the proposed circuit. The reliability of the unit is evaluated at different temperatures. The power dissipation is estimated using the QCAPro tool. The total power consumed by the unit is 2.183 μW. The proposed circuit achieves a 90% improvement in power over complementary metal–oxide–semiconductor (CMOS) circuits. Since work in the field of QCA logic signal processing has started to progress, the suggested contribution will give rise to a new thread of research in real-time signal and image processing.
