Fig 4 - uploaded by Amir Momeni
Optimized 4-2 compressor of [8]. 

Source publication
Article
Inexact (or approximate) computing is an attractive paradigm for digital processing at nanometric scales. Inexact computing is particularly interesting for computer arithmetic designs. This paper deals with the analysis and design of two new approximate 4-2 compressors for utilization in a multiplier. These designs rely on different features of com...

Context in source publication

Context 1
... next step. The correction constant (n + k bits) is selected to be as close as possible to the estimated value of the sum of these errors to reduce the error distance. A truncated multiplier with constant correction has the maximum error if the partial products in the n − k least significant columns are all ones or all zeros. A variable correction truncated multiplier has been proposed in [6]. This method changes the correction term based on column n − k − 1. If all partial products in column n − k − 1 are one, then the correction term is increased. Similarly, if all partial products in this column are zero, the correction term is decreased. In [7], a simplified (and thus inaccurate) 2 × 2 multiplier block is proposed for building larger multiplier arrays. In the design of a fast multiplier, compressors have been widely used [8], [9], [10] to speed up the partial product reduction tree and decrease power dissipation. Optimized designs of 4-2 exact compressors have been proposed in [8], [11], [12], [13], [14], [15], [16]. Kelly et al. [17] and Ma et al. [18] have also considered compression for approximate multiplication. In [17], an approximate signed multiplier has been proposed for use in arithmetic data value speculation (AVDS); multiplication is performed using the Baugh-Wooley algorithm. However, no new design is proposed for the compressors for the inexact computation. Designs of approximate compressors have been proposed in [18]; however, these designs do not target multiplication. It should be noted that the approach of [7] improves over [17], [18] by utilizing a simplified multiplier block that is amenable to approximate multiplication. Initially in this paper, two novel approximate 4-2 compressors are proposed and analyzed. It is shown that these simplified compressors have better delay and power consumption than the optimized (exact) 4-2 compressor designs found in the technical literature [8].
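The constant-correction truncated multiplier described earlier in this excerpt can be illustrated with a small sketch. It is not the design of any cited reference: the function names, the `dropped_columns` parameter (playing the role of n − k) and the choice of the correction constant as the rounded expected value of the dropped bits under uniformly random operands are all assumptions made for illustration.

```python
def correction_constant(n: int, dropped_columns: int) -> int:
    """Rounded expected value of the dropped partial-product bits.

    Assumption: operands are uniformly random, so each partial-product
    bit a_i & b_j is 1 with probability 1/4.
    """
    dropped_weight = sum(
        1 << (i + j)
        for i in range(n)
        for j in range(n)
        if i + j < dropped_columns
    )
    return round(dropped_weight / 4)


def truncated_multiply(a: int, b: int, n: int, dropped_columns: int) -> int:
    """Unsigned n-bit product that skips the lowest columns, then adds a constant."""
    kept = 0
    for i in range(n):
        for j in range(n):
            # Keep only partial products that fall outside the dropped columns.
            if i + j >= dropped_columns and (a >> i) & 1 and (b >> j) & 1:
                kept += 1 << (i + j)
    return kept + correction_constant(n, dropped_columns)


# Exhaustive error sweep for a hypothetical 6-bit multiplier with 4 dropped columns.
n, dropped = 6, 4
errors = [
    truncated_multiply(a, b, n, dropped) - a * b
    for a in range(1 << n)
    for b in range(1 << n)
]
print(max(errors), min(errors))  # prints "12 -37"
```

In this sketch the largest positive error occurs when every dropped partial product is zero (only the constant is added), and the largest negative error occurs when every dropped partial product is one, which is the behaviour the text attributes to constant correction.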
These approximate compressors are then used in the restoration module of a Dadda multiplier; four different schemes are proposed for inexact multiplication. Extensive simulation results are provided at circuit-level for figures of merit, such as delay, transistor count, power dissipation, error rate and normalized error distance under CMOS feature sizes of 32, 22 and 16 nm. The application of these multipliers to image processing is then presented. The results of two examples of multiplication of two images are reported; these results show that the third and fourth approximate multipliers yield an output product image that has a very high quality and resemblance to the image generated by an exact multiplier, i.e., excellent values for the average NED and the peak signal-to-noise ratio (PSNR) are found (for the PSNR more than 50 dB). The analysis and simulation results show that the proposed approximate designs for both the compressor and the multiplier are viable candidates for inexact computing. This paper is organized as follows. Section 2 is a review of existing schemes for (exact) compressors. The two new designs of an approximate 4-2 compressor are presented in Section 3. Multiplication and four different approximate multipliers are proposed in Section 4. Simulation results for the approximate compressors and multipliers are provided in Section 5. The application of the proposed approximate multipliers to image processing is presented in Section 6. Section 7 concludes the manuscript. The main goal of either multi-operand carry-save addition or parallel multiplication is to reduce n numbers to two numbers; therefore, n-2 compressors (or n-2 counters) have been widely used in computer arithmetic. An n-2 compressor (Fig. 1) is usually a slice of a circuit that reduces n numbers to two numbers when properly replicated.
In slice i of the circuit, the n-2 compressor receives n bits in position i and one or more carry bits from the positions to the right, such as i − 1 or i − 2. It produces two output bits in positions i and i + 1 and one or more carry bits into the higher positions, such as i + 1 or i + 2. For the correct operation of the circuit shown in Fig. 1, the following inequality must be satisfied, where c_j denotes the number of carry bits from slice i to slice i + j. A widely used structure for compression is the 4-2 compressor; a 4-2 compressor (Fig. 2) can be implemented with a carry bit between adjacent slices (c_1 = 1). The carry bit from the position to the right is denoted as c_in while the carry bit into the higher position is denoted as c_out. The two output bits in positions i and i + 1 are also referred to as the sum and carry respectively. The following equations give the outputs of the 4-2 compressor, while Table 1 shows its truth table. The common implementation of a 4-2 compressor is accomplished by utilizing two full-adder (FA) cells (Fig. 3) [8]. Different designs have been proposed in the literature for the 4-2 compressor [8], [11], [12], [13], [14], [15], [16]. Fig. 4 shows the optimized design of an exact 4-2 compressor based on the so-called XOR-XNOR gates [8]; an XOR-XNOR gate simultaneously generates the XOR and XNOR output signals. The design of [8] consists of three XOR-XNOR (denoted by XOR*) gates, one XOR and two 2-1 MUXes. The critical path of this design has a delay of 3Δ, where Δ is the unitary delay through any gate in the design. In this section, two designs of an approximate compressor are proposed. Intuitively, to design an approximate 4-2 compressor, it is possible to substitute the exact full-adder cells in Fig. 3 by an approximate full-adder cell (such as the first design proposed in [2]).
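The inequality mentioned for the circuit of Fig. 1 does not appear in this excerpt. A plausible reconstruction from the stated definitions, offered as an assumption rather than a quotation of the source, compares the number of bits entering a slice with the largest value its outputs can represent:

```latex
% Left: n input bits plus the carries arriving from the slices to the right.
% Right: two output bits (weights 1 and 2) plus c_j outgoing carries of weight 2^j.
n + c_1 + c_2 + c_3 + \cdots \;\le\; 3 + 2c_1 + 4c_2 + 8c_3 + \cdots
```

For the 4-2 compressor with a single carry between adjacent slices (n = 4, c_1 = 1), this reads 4 + 1 ≤ 3 + 2, so the bound would be met with equality.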
However, this is not very efficient, because it produces at least 17 incorrect results out of 32 possible outputs, i.e., the error rate of this inexact compressor is more than 53 percent (where the error rate is given by the ratio of the number of erroneous outputs over the total number of outputs). Two different designs are proposed next to reduce the error rate; these designs offer significant performance improvement compared to an exact compressor with respect to delay, number of transistors and power consumption. As shown in Table 1, the carry output in an exact compressor has the same value as the input c_in in 24 out of 32 states. Therefore, an approximate design must consider this feature. In Design 1, the carry is simplified to c_in by changing the value of the other eight outputs. Since the carry output has the higher weight of a binary bit, an erroneous value of this signal will produce a difference value of two in the output. For example, if the input pattern is “01001” (row 10 of Table 2), the correct output is “010”, which is equal to 2. By simplifying the carry output to c_in, the approximate compressor will generate the “000” pattern at the output (i.e., a value of 0). This substantial difference may not be acceptable; however, it can be compensated or reduced by simplifying the c_out and sum signals. In particular, the simplification of sum to a value of 0 (second half of Table 2) reduces the difference between the approximate and the exact outputs as well as the complexity of its design. Also, the presence of some errors in the sum signal will result in a reduction of the delay of producing the approximate sum and of the overall delay of the design (because it is on the critical path). In the last step, changing the value of c_out in some states may reduce the error distance introduced by the approximate carry and sum and also allow further simplification of the proposed design.
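The counting argument above can be checked with a short sketch. It models the exact 4-2 compressor as two chained full adders, as the text describes for Fig. 3; the specific wiring (x1, x2, x3 into the first adder, x4 and c_in into the second) and the helper `carry_only_approx`, which applies only the first simplification step (carry replaced by c_in), are assumptions for illustration and are not the paper's Design 1.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    total = a + b + c
    return total & 1, total >> 1


def exact_compressor(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor from two full adders; returns (cout, carry, sum)."""
    s1, cout = full_adder(x1, x2, x3)    # assumed wiring of the first FA
    s, carry = full_adder(s1, x4, cin)   # assumed wiring of the second FA
    return cout, carry, s


def carry_only_approx(x1, x2, x3, x4, cin):
    """Hypothetical intermediate step: only carry is replaced by cin."""
    cout, _, s = exact_compressor(x1, x2, x3, x4, cin)
    return cout, cin, s


def value(cout, carry, s):
    """Numeric value of the outputs: sum has weight 1, carry and cout weight 2."""
    return s + 2 * (carry + cout)


# Rows in truth-table order, each tuple being (cin, x4, x3, x2, x1).
rows = list(product((0, 1), repeat=5))

# Number of input states in which the exact carry already equals cin.
carry_equals_cin = sum(
    exact_compressor(x1, x2, x3, x4, cin)[1] == cin
    for cin, x4, x3, x2, x1 in rows
)

# Signed difference (approximate minus exact) for every input state.
differences = [
    value(*carry_only_approx(x1, x2, x3, x4, cin))
    - value(*exact_compressor(x1, x2, x3, x4, cin))
    for cin, x4, x3, x2, x1 in rows
]
error_rate = sum(d != 0 for d in differences) / len(rows)

print(carry_equals_cin, error_rate, sorted(set(differences)))
# prints "24 0.25 [-2, 0, 2]"
```

Under these assumptions the sketch reproduces the figures quoted in the text: the carry equals c_in in 24 of 32 states, input "01001" gives "010" exactly and "000" once the carry is forced to c_in, and each of the eight remaining states is off by two.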
Although the above-mentioned simplifications of carry and sum increase the error rate in the proposed approximate compressor, its design complexity and therefore the power consumption are considerably decreased. This can be realized by comparing (2)-(4) and (5)-(7). Table 2 shows the truth table of the first proposed approximate compressor. It also shows the difference between the inexact output of the proposed approximate compressor and the output of the exact compressor. As shown in Table 2, the proposed design has 12 incorrect outputs out of 32 outputs (thus yielding an error rate of 37.5 percent). This is less than the error rate using the best approximate full-adder cell of [2]. Equations (5)-(7) are the logic expressions for the outputs of the first design of the approximate 4-2 compressor proposed in this manuscript. The gate level structure of the first proposed design (Fig. 6) shows that the critical path of this compressor still has a delay of 3Δ, so it is the same as for the exact compressor of Fig. 5. However, the propagation delay through the gates of this design is lower than the one for the exact compressor. For example, the propagation delay in the XOR* gate that generates both the XOR and XNOR signals in [8] is higher than the delay through an XNOR gate of the proposed design. Therefore, the critical path delay in the proposed design is lower than in the exact design and, moreover, the total number of gates in the proposed design is significantly less than that in the optimized exact compressor of [8]. A second design of an approximate compressor is proposed to further increase performance as well as reduce the error rate. Since the carry and c_out outputs have the same weight, the proposed equations for the approximate carry and c_out in the previous part can be interchanged.
In this new design, carry uses the right-hand side of (7) and c_out is always equal to c_in; since c_in is zero in the first stage, c_out and c_in will be zero in all stages. So, c_in and c_out can be ignored in the hardware design. Fig. 7 shows the block diagram of this approximate 4-2 compressor and the expressions below describe its outputs. Note that (9) is the same as (7) and (8) is the same as (6) for c_in = 0. Fig. 8 shows the gate level implementation of the second proposed design. The delay of the critical path of this approximate design is 2Δ, so it is 1Δ ...

Citations

... The approximation and error correction combining structure is detailed in [5], where approximation is accomplished by interchangeably using the preprocessed input data bits. For the purpose of applying error-resistant systems, truncation and approximation are used in the multiplier design [6] for lower order input bits. The disadvantage of [7]'s multiplier is that it is approximated by a 4:2 compressor and has non-zero output even when all inputs are zero. ...
Chapter
For energy-efficient multiply-accumulate (MAC) processing, we introduce a unique approximate computing approach in this study. In order to generate errors in opposite directions while minimizing computing costs, we first develop approximate 4-2 compressors. Positive and negative multipliers are then carefully built based on probabilistic analysis to produce a comparable error distance. According to simulation results on various real-world applications, the proposed MAC processing provides an energy-efficient computing scenario by expanding the range of approximate portions. This work targets low-energy, MAC-intensive algorithms and newly introduces an advanced interleaving technique for balanced error accumulation. We created two different types of approximate multipliers with opposing error directions based on the previous compressor-based approximation. To be more specific, when designing compressors, we simply take into account the direction of the errors, and each approximate multiplier is built to have the lowest hardware cost possible. The design is implemented in Verilog HDL and simulated with ModelSim 6.4c. Performance is measured with the Xilinx synthesis tool.
... The multiplication block affects most of the power consumption of the whole circuits and systems in which it is present [4,5]. Numerous applications use the multiplication block for computation [6][7][8][9][10]. ...
Article
The multiplier is one of the most essential arithmetic blocks in computer architecture, as it has an impact on the system’s overall performance. Approximate computing helps in improving multiplier performance with low power consumption at the expense of computing precision. In this paper, novel approximate compressors are proposed and further used for the implementation of the proposed approximate multiplier. In the multiplication process, compressors are used for the reduction of partial products with low power consumption. In comparison to the exact multiplier, the proposed multiplier shows efficient results in terms of look-up tables, area, memory utilization, and power consumption. The approximate multiplier is validated in an error-tolerant application, namely image blending in an image processing setting, which results in PSNR values of 23.87 dB and 22.7 dB for set 1 and set 2 respectively.
... In this section, a brief discussion about exact 4:2 compressor and some of the prominent existing reported inexact 4:2 compressor designs [3,8,18,25,29,30] are discussed. ...
... Both Cout and Carry have the same weight, whereas Sum has the next lower weight than that of Cout and Carry. The Boolean expressions of a 4:2 compressor [18] are given by ...
... In the recent past, several inexact 4:2 compressors have been proposed in the literature. Momeni et al. [18] have proposed two inexact 4:2 compressors by introducing errors in the truth table. In this design [18], one error is introduced in the most probable input pattern, i.e., when all the inputs are zero. ...
Article
Approximate multipliers are widely used in image processing and multimedia signal processing applications for the reduction in area, computation time, and power consumption. 4:2 compressors play a key role in multipliers for efficient addition of partial product bits. In this paper, an inexact 4:2 compressor is proposed, where a major portion of the logic associated with the generation of Carry and Sum are shared to reduce the logic/circuit complexity. During the design time overestimates are balanced with the underestimates so as to reduce the total error distance. The proposed compressor is used for the addition of partial products of the Baugh-Wooley multiplier. A novel algorithm is proposed for the placement of inexact 4:2 compressors in the partial product bit array, based on the minimum peak signal-to-noise ratio requirement of the application. The approximate multipliers thus obtained are used for edge detection using Sobel operator. It is observed from the Pareto analysis that the proposed design offers significant saving in area, computation time, and power consumption over the design with nearly the same error performance. Besides, the proposed design is more error tolerant compared with the design with nearly the same area-delay trade-off and power-delay trade-off. Moreover, the proposed inexact 4:2 compressor-based Baugh-Wooley multiplier is found to offer better accuracy-area trade-off compared to the state-of-the-art 4:2 compressors in convolutional neural network-based classification.
... Arithmetic-based multipliers adopt inaccurate arithmetic operations to approximate multiplication outputs with affordable accuracy degradation [2, 3, 11-19]. Notably, probabilistic multipliers using inaccurate partial product reduction steps can produce their erroneous outputs, assuring that applications will not be significantly degraded [20-30]. ...
... On the other hand, a positive error happens when the inaccurate compressor output is larger than the exact value. Since the appearance of Momeni's compressor [20], different inaccurate compressors have been proposed by redefining their truth tables. Fig. 2 illustrates the truth tables and schematics of several inaccurate 4:2 compressors. ...
... Fig. 2 illustrates the truth tables and schematics of several inaccurate 4:2 compressors. Whereas the exact 4:2 compressor contains three-bit output ( , , and ) in Fig. 1 [20] or Sabetz [28] that produce a positive error when 4 3 2 1 = 0000 2 . In other words, even though all multiplier inputs are zero, the multiplication output is greater than zero. ...
Article
This paper proposes low-biased probabilistic multipliers and applies the proposed designs to the inference stage of convolutional neural networks (CNNs). Highly inaccurate compressors in the probabilistic multiplier make the error distribution unbalanced and produce significant relative errors. We describe design rules for applying different compressors to the probabilistic multiplier. Besides, the proposed design rules enhance the error characteristics by determining the input values of each compressor, suppressing the significant relative error. Notably, motivated by the design rules, we propose a novel inaccurate 4:2 compressor suitable for our new design. The design rules and proposed 4:2 compressor are adopted in 8-bit probabilistic multipliers. The error analysis shows that the new designs can enhance error characteristics without increasing hardware costs. Furthermore, the proposed approximate multipliers are applied to the convolutional layers of CNNs. We describe the fine-tuning scheme to reduce the retraining time of approximation-aware training. The experiments show that proposed designs can produce acceptable classification results compared with those using the floating-point and 8-bit exact multiplications on CIFAR and ImageNet datasets.
... Partial product generation, partial product reduction, and final carry propagating addition are the three standard components of multipliers [35]. The partial product reduction component has garnered the most interest from designers because it consumes the most power and has the largest occupied space of the three [26,53]. For a n × n multiplier, the partial product reduction decreases the significant digits generated in the carry generation bits into two digits. ...
Article
Recent developments in multiplication circuits with fewer transistors, higher speed, and reduced energy consumption are lowering hardware costs. Approximate multiplication is used for fault-tolerant applications, such as image and signal processing. These applications offer considerable improvement at the cost of reduced accuracy. Compressors are the main component of any multiplier structure. In this work, the design of an approximate multiplier based on current mode, using 4:2 and 5:2 approximate compressors and current over scaling technique is suggested to reduce power consumption, delay, and hardware components compared to existing architectures. In the proposed approach for the design of compressor circuits, current mode binary converters with 32 nm CNTFET technology are used. This work is done in three steps: (1) converting input currents into voltage and creating multi-level voltages. (2) Implementing voltage level detector. (3) Generation of output current based on threshold voltage. The proposed technique uses less number of transistors. The simulation was done under HSPICE software and the efficiency of the compressors was compared in terms of delay, power, and power delay product (PDP). Based on the simulation results, the PDP value for 4:2 and 5:2 compressor circuits is calculated as 0.0097 and 0.0131 fJ, respectively. Then, an approximate multiplication of 8 × 8 using these compressors is designed for multiplying images under MATLAB software. This multiplier has achieved a good improvement in terms of PSNR and MSSIM (respectively 9.14 and 2.85%) compared to advanced approximate multipliers. Also, the proposed approach performs better in terms of accuracy, power consumption, and delay.
... Momeni et al. [14] have suggested a new method for designing approximate 4:2 compressors by manipulating the truth table of the exact compressor in order to implement four multipliers. This structure is based on XNOR/NOR logic and is implemented with 26 transistors. ...
... This structure can be implemented with only 12 transistors. The structure of Momeni [14], Akbari [15] and Taheri [6] multipliers with the aim of decreasing power consumption only contains one region, in which they used their approximate compressors and did not disregard any of the PPs. ...
Article
Multipliers are one of the most commonly used parts in a system, responsible for performing computations, while significantly contributing to power consumption. In this article, by removing the least significant bits, a new architecture is presented to implement 3 multipliers (Mul-1, Mul-2 and Mul-3), in order to reduce complexity and power consumption. Compared to the previous works, Mul-1 has the most accuracy in addition to its low energy consumption, therefore, it has been able to deliver a good trade-off between accuracy and energy consumption. All proposed designs and existing multipliers have been simulated and compared in 7 nm FinFET technology using Hspice tool. Moreover, the accuracy and quality of the proposed approximate multipliers are also evaluated using MATLAB. The results show that Mul-1 and Mul-3 are very efficient in image processing applications. According to the results, Mul-1 outperforms its counterpart by 10%, 50% and 50% in PDP, NMED and MRED, respectively. Furthermore, Mul-3 has satisfactory MSSIM in DSP applications and is better than its counterpart by 23% and 16% in PDP and MRED. Meanwhile, Mul-2 improves PDP by nearly 53% compared to Mul-1 and has the lowest power consumption.
... In the first computational step, the approximated circuits are used to reduce the partial product with high speed and decrease the complexity of the circuit. In the literature, different approximated circuits are designed for the implementation of the energy-efficient digital circuit [15,16]. Researchers are implementing novel algorithms to leverage specific features in error-tolerant applications for the reduction of power. ...
... In Table 8, the proposed 4:2 compressor error rate (%) is compared with state-of-the-art work, in which the proposed compressor shows 50%, 75%, 83.3%, 75%, 90%, 87.5%, 80% and 75% less error rate (%) in comparison with [3,18] (Design 1, Design 2, Design 3), [16] (Design 1), [16] (Design 2), [2] (Structure 1, Structure 2), [2] (Structure 3), [2] (Structure 4) and [7]. Table 9 shows the comparison in terms of area (LUT) of proposed approximate multiplier designs (AOM, POM, PAOM) with state-of-the-art work. ...
Article
In this paper, three approximate multiplier architectures are proposed: area-optimized approximate multiplier (AOM), power-optimized approximate multiplier (POM), and power- and area-optimized approximate multiplier (PAOM). These designs are implemented using speculative Han–Carlson adder and compressor-based multiplier blocks. Han–Carlson adder is used as the basic adder block in the final addition stage of all the three approximate multiplier designs. Different types of compressors (3:2, 4:2, 5:2, 6:2, 7:2, 8:2) are used for the implementation of the energy-efficient approximate multiplier blocks. All the simulations are performed on VIVADO design tool. Also, the designed multipliers are validated for image blending (an error-tolerant) application. The proposed power optimization approximate multiplier shows 0.86%, 10.54% PSNR improvement in comparison with area optimization approximate multiplier and power and area optimization approximate multiplier, respectively.
... This optimization involves omitting a number of AND gates in the partial product reduction circuit of the Dadda multiplier. Several compressors are introduced in [14][15][16][17][18] with the goal of enhancing energy efficiency and reducing complexity in Dadda [19] and Wallace multiplier [20] structures. Error-correcting modules are developed in [14,17,21,22] to address errors introduced by approximations at the logic level. ...
... where the maximum value of ED in the approximate multiplier is calculated as (2^n − 1)^2 [18]. Therefore, Eq. (6) can be replaced with: ...
... Assuming Sum = 0 and, simultaneously, with respect to the ED indicated for the AC2 in Table 3 and the probability column, some cases, namely "0011", "0101", "1010", and "1100", are preferred to have a Carry equal to 1; we also need to have both negative and positive ED for 3 group. Hence, (18) can be considered to meet the previously mentioned objectives and improve the compressor in terms of error rate. ...
Article
Approximate computing is one of the promising techniques in error-resilient applications to overcome high-density integration challenges, such as energy consumption and performance. Multipliers constitute a significant portion of computer arithmetic units, leading to considerable energy and time consumption. In this paper, we propose low-power and compact approximate compressors for composing approximate Dadda multiplier structures, including compressors, half adders, and full adders, which utilize three-phase partial product compression: truncated, approximation, and exact columns. In approximate columns, we consider approximate compressors derived from the truth table of the exact 4:2 compressor and simplified K-map entries based on the probability of each combination of inputs. An error-correcting module (ECM) is designed to distinguish specific cases and reduce the error metrics. All circuits were simulated using ModelSim and then synthesized using Design Compiler with the 15 nm FinFET technology. When compared to state-of-the-art works, our multipliers exhibit approximately 30%, 43%, and 10% reductions in power, area, and delay, respectively. To evaluate their functionality, we conducted image multiplication and implemented a simple multi-layer perceptron (MLP) neural network using the modified National Institute of Standards and Technology (MNIST) dataset in MATLAB with 0.998 mean structural similarity index metric (MSSIM), 51 dB peak-signal noise ratio (PSNR), and 95% classification accuracy.
... Firstly, a radix-2 array multiplier that constructs a pyramid-shaped array of partial product bits and compresses (adds) them in a tree. We approximate this multiplier by truncating some least significant rows or columns of partial product bits (denoted by RTrunc and CTrunc), by collapsing columns of partial product bits into one (denoted by ORCmp) [27], or by employing inexact compressors (denoted by Miscnt) [28]. Secondly, a recursive multiplier implementing the algorithm of [29], which we approximate by using inexact 2 × 2-bit multiplier primitives in some least significant sub-products [30]. ...
... Secondly, our observation about adders from above also applies to multipliers: truncating too many LSBs or sub-products increases errors. Performing instead some, albeit inexact, compression on these bits can potentially reduce these errors; in line with observations in [28]. ...
Conference Paper
Many popular applications show resilience to computational errors. Approximate Computing (AxC) exploits this to reduce their execution time and energy consumption by introducing approximations in software and hardware. Using AxC raises new challenges to ensure that hardware designs satisfy their demands before deployment, which hardware designers address by spending significant efforts on verification flows for their designs. However, there exist no tools for verifying approximate hardware designs, meaning that designers must replicate code to keep track of circuit outputs and subsequently compute relevant error metrics. We aim to solve this issue with a library that abstracts away port sampling and error computations behind a simple interface. With the library, designs can retrieve error metric values and constraint satisfaction results with only a few extra lines of code. We demonstrate these features with code examples and by characterizing a collection of inexact adders and multipliers and an approximate matrix-vector multiplier.
... The general structure of the approximate unsigned 8 × 8 Dadda multiplier based on the 4:2 compressor is fully explained in [39], as shown in Fig. 6. It is not possible to apply the MLAPC directly in the Dadda approximate multiplier due to the structure being specified for the 4:2 compressor. ...
Article
In this paper, we propose an approximate multiplier that is Approximate computing (AC) offers benefits by reducing the requirement for accuracy, thereby reducing delay. The majority logic (ML) gate functions as the fundamental logic block of many emerging nanotechnologies. These adders are designed to prevent the propagation of inexact carry-out signals to higher-order computing parts to enhance accuracy. We implemented the proposed multiplier by using a unique partial product reduction (PPR) circuitry, which was based on the parallel approximate 6:3 compressor. The designs implemented in quantum-dot cellular automata (QCA) are analyzed to evaluate the adder designs. A significant improvement is observed over previous designs based on the experimental results. The proposed design is further designed using a Kogge-Stone adder. Finally, it has the added advantage of reducing logic size, with less power and delay. Verilog HDL and Xilinx ISE14.8 software tools are used for simulation and synthesis.